Development Audit Prompt
Use this prompt when you want another agent to audit whether the
reporting-risk-cascade code and generated artifacts support the current paper
plan. The audit should stay close to accounting evidence, filing timing, and ML
evaluation. Do not turn it into a generic code review or a data-shopping list.
Prompt
You are auditing the reporting-risk-cascade paper repo. The paper is about a
public-data-first workflow for estimating a pre-disclosure reporting-risk state,
not a generic corporate misstatement prediction task.
Role:
Review the repo as both an accounting researcher and an ML engineer. Focus on
filing timing, public observability, label construction, bridge validity, model
evaluation under rare events, and whether the code supports the paper's actual
claims.
Scope boundary:
This is a development audit, not a manuscript review. Do not audit prose polish,
title style, journal fit, or submission strategy unless a wording choice directly
contradicts a claim boundary in docs/paper_plan.md.
Primary contract:
Treat docs/paper_plan.md as the binding research and implementation contract.
Do not treat it as automatically correct. If docs/paper_plan.md, README.md,
docs/results_snapshot.md, code, tests, and generated artifacts disagree, say
which side is stale and why. Your job is to judge support for the current paper
claims, not to make the claims sound stronger.
For any disagreement, label the stale side explicitly as stale_docs,
stale_snapshot, stale_tests, or stale_code, and give the evidence chain.
Data availability stance:
This audit is public-data-first. Assume no WRDS, Audit Analytics, CRSP,
Compustat, FactSet, Refinitiv, RavenPack, or similar institutional database
access unless the user explicitly says otherwise. The core reproducible spine is
public SEC/PCAOB/EDGAR data plus the local
data/raw_dataset_misstatement.parquet benchmark layer, converted from the legacy
CSV when needed.
Do not treat absence of WRDS or Audit Analytics as a code bug. A farr
gvkey-CIK bridge can support candidate validation, but it is not WRDS-verified.
Paid or professional data can be mentioned only as an optional validation or
enrichment path after a concrete blocker is identified.
Fallback hierarchy:
1. First evaluate native public sources already in or near the repo: SEC bulk
submissions, FSDS, Notes summaries, UPLOAD/CORRESP, NT 10-K/10-Q,
10-K/A and 10-Q/A amendments, 8-K Item 4.02, PCAOB Form AP, PCAOB
inspections, AAER pages, farr support exports, SEC ticker files, and any
public issuer metadata already wired into the public lake.
2. Then evaluate affordable external APIs only when they solve a named blocker
without replacing the local EDGAR/PCAOB lake as source of record.
3. Treat institutional paid sources as out-of-scope future work unless the user
explicitly opens that path.
External-source feasibility priors:
- OpenFIGI requires a seed identifier such as ticker, CUSIP, ISIN, or SEDOL;
gvkey is not supported. It cannot solve a raw gvkey-only bridge blocker unless
another source first supplies a security identifier.
- sec-api.io can be considered as an optional SEC search/parser accelerator for
8-K Item 4.02, filing extraction, and correspondence discovery. It should not
replace the local EDGAR lake.
- Financial Modeling Prep and EODHD can be considered for market/security
enrichment. They are not core reproducibility sources and do not solve
restatement timing, detector identity, or gvkey-CIK validation by themselves.
- stockdata.dev / FactStream can be considered only as SEC-derived parser or
enrichment support.
- Intrinio / Calcbench are professional paid options. Evaluate later only if the
user explicitly asks for a budgeted data-acquisition review.
- Audit Analytics / WRDS / Ideagen Audit Analytics are institutional-only under
current assumptions. If access appears later, they may help with restatement
dates, severity, and detector/notifier fields, but they are not required for
the current v1 public-data paper.
Do not search for pricing by default. If the user wants pricing, coverage, or
license review, treat it as a separate data-acquisition review.
Availability fallback:
If local benchmark data, public-lake panels, or generated artifacts are absent
in the working checkout, classify the relevant item as not_auditable_from_checkout
or artifact_unavailable. Do not treat missing local data/artifacts as a code
bug. Distinguish code-path support, documented contract, and live local evidence.
Repository context:
- Main raw benchmark data:
- data/raw_dataset_misstatement.parquet
- Current paper/docs:
- docs/paper_plan.md
- docs/results_snapshot.md
- docs/future_work.md
- README.md
- Main command surface:
- justfile
- .env / UV_PROJECT_ENVIRONMENT
- Main configs:
- config/benchmark.yaml
- config/public_cascade.yaml
- config/public_data.yaml
- config/study.yaml
- Main implementation modules:
- src/benchmark.py
- src/public_lake.py
- src/public_cascade.py
- src/bridge.py
- src/construct_overlap.py
- src/peer_comparison.py
- src/public_peer_comparison.py
- src/data_prep.py
- src/ranking_metrics.py
- src/table_io.py
- Main execution wrappers:
- scripts/run_benchmark.py
- scripts/run_public_cascade.py
- scripts/run_bridge_probe.py
- scripts/run_construct_overlap.py
- scripts/run_study.py
- scripts/fetch_public_data.py
- scripts/run_public_lake_full.sh
- scripts/prepare_gvkey_cik_crosswalk.py
- scripts/prepare_farr_gvkey_cik_bridge.sh
- scripts/prepare_farr_support_data.sh
- Main tests:
- tests/test_benchmark.py
- tests/test_public_lake.py
- tests/test_public_cascade_interfaces.py
- tests/test_bridge.py
- tests/test_construct_overlap.py
- tests/test_peer_comparison.py
- tests/test_public_peer_comparison.py
- tests/test_docs.py
Required first pass:
1. Read docs/paper_plan.md end to end.
2. Read README.md, docs/results_snapshot.md, and justfile to understand the
current workflow and artifact claims.
3. Inspect config/study.yaml, config/benchmark.yaml, and config/public_cascade.yaml.
4. Inspect the modules and wrappers listed above. Do not skip
src/construct_overlap.py or src/public_peer_comparison.py.
5. Inspect tests to see which invariants are actually locked. Do not treat
keyword-presence tests as proof of semantic correctness; they only confirm
that expected strings exist.
6. Inspect the raw benchmark schema and basic row counts without loading more
data than needed (see the inspection sketch after this list). Report rows,
columns, identifier columns, target column, res_an* columns, missing_* flags,
and whether raw-side CIK/ticker/company name/CUSIP/PERMNO fields exist.
7. If data/public_lake exists, report whether issuer_origin_panel.parquet and
filing_origin_panel.parquet exist, their row counts, fiscal-year span, and
whether comment_thread, amendment, 8-K Item 4.02, and AAER proxy labels have
nonzero positives.
8. If artifacts/full or artifacts/full_with_peer exists, inspect
study_run_manifest.json and the key summaries. Say when a docs number is a
static snapshot rather than a live result.
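A minimal inspection sketch for first-pass items 6 and 7, using pyarrow and
DuckDB so nothing beyond metadata and aggregates is materialized. Paths come
from the repository context above; any column-name heuristics beyond the
documented res_an*/missing_* prefixes are assumptions, not confirmed schema.

```python
# Schema and row-count inspection without loading full data.
import duckdb
import pyarrow.parquet as pq

pf = pq.ParquetFile("data/raw_dataset_misstatement.parquet")
names = pf.schema_arrow.names
print("rows:", pf.metadata.num_rows, "columns:", len(names))
print("res_an columns:", [c for c in names if c.startswith("res_an")])
print("missing_* flags:", sum(c.startswith("missing_") for c in names))
for hint in ("cik", "ticker", "cusip", "permno", "name"):  # heuristic scan
    print(hint, "->", [c for c in names if hint in c.lower()])

for panel in ("data/public_lake/issuer_origin_panel.parquet",
              "data/public_lake/filing_origin_panel.parquet"):
    try:  # an absent panel is not_auditable_from_checkout, not a bug
        print(panel, duckdb.sql(f"SELECT count(*) FROM '{panel}'").fetchone()[0])
    except duckdb.Error:
        print(panel, "artifact_unavailable")
```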
Audit dimensions:
0. Data availability stance
- Apply the public-data-first assumption before judging missing features.
- Do not invent data availability.
- Do not recommend paid data as required for the current v1 paper.
- Separate code bugs from data constraints.
- Treat raw_identifier_blocker as a bridge/input condition, not as proof that
the public lake design is wrong.
- Treat farr gvkey-CIK output as candidate_farr validation, not as WRDS-quality
ground truth.
1. Workflow and command contract
- Does just check remain the data-free quality gate?
- Does just full full raw artifacts/full still represent the core paper-facing
run for setup, tests, lint, public-lake build/resume, benchmark, public
cascade, bridge probe, and construct-overlap validation when inputs exist?
- Does scripts/run_study.py orchestrate benchmark, public cascade, bridge probe,
legacy peer comparison, public peer comparison, and construct overlap
consistently with config/study.yaml?
- Check the relationship between --peer-comparison-mode none/light/full and
--peer-target legacy/public/both: the first controls peer-suite intensity, the
second controls whether legacy peer, public peer, or both targets run.
- Treat --peer-target public as meaningful only with --peer-comparison-mode full;
in light mode, public peer transfer should be skipped or clearly marked as not
run (see the compatibility sketch after this list).
- Does --peer-target legacy/public/both do what the docs say, including keeping
existing peer artifacts when a skipped target already exists?
- Does CI use a light, bounded peer target rather than accidentally triggering
the full peer suite?
- Does just status avoid mutating environment state, and does just setup own
dependency sync through UV_PROJECT_ENVIRONMENT?
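A hypothetical compatibility sketch of the mode/target rule above; the
function name, status codes, and per-target plan shape are illustrative, and
the actual handling in scripts/run_study.py may differ.

```python
# Expected flag semantics: mode controls peer-suite intensity, target
# controls which peer suites run. Names and status codes are illustrative.
def resolve_peer_plan(mode: str, target: str) -> dict[str, str]:
    assert mode in {"none", "light", "full"}
    assert target in {"legacy", "public", "both"}
    wanted = {"legacy", "public"} if target == "both" else {target}
    plan = {}
    for suite in ("legacy", "public"):
        if mode == "none" or suite not in wanted:
            plan[suite] = "skipped"
        elif suite == "public" and mode == "light":
            # Public peer transfer is only meaningful in full mode; mark
            # it as not run instead of silently running a partial suite.
            plan[suite] = "not_run_light_mode"
        else:
            plan[suite] = f"run_{mode}"
    return plan
```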
2. Paper-plan compliance
- Are the execution gates in docs/paper_plan.md implemented or honestly marked
as pending?
- Are res_an0, res_an1, res_an2, and res_an3 excluded from benchmark predictors?
- Are res_an* outputs treated as label-observability sensitivity rather than
paper-grade label maturation?
- Are unknown-timing positives counted, and are drop-observed versus imputed-lag
sensitivity scenarios clearly separated?
- Are public cascade labels based on first public event dates?
- Are source_available_*, public_date_*, vintage_*, and as_of_date excluded from
default public-cascade predictors?
- Does the bridge path avoid silent many-to-many gvkey-CIK joins (see the
join-multiplicity sketch after this list)?
- Do construct-overlap outputs carry validation_tier = candidate_farr until a
verified WRDS bridge is supplied?
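A minimal join-multiplicity sketch for the bridge check above, assuming a
crosswalk frame with gvkey and cik columns; the real column names and
reporting path in src/bridge.py may differ.

```python
# Split a gvkey-CIK crosswalk into unambiguous rows and rows that fan out
# on either side; fan-out rows belong in an explicit multiplicity report.
import pandas as pd

def split_by_multiplicity(crosswalk: pd.DataFrame):
    gvkey_fanout = crosswalk.groupby("gvkey")["cik"].transform("nunique")
    cik_fanout = crosswalk.groupby("cik")["gvkey"].transform("nunique")
    ambiguous = crosswalk[(gvkey_fanout > 1) | (cik_fanout > 1)]
    clean = crosswalk[(gvkey_fanout == 1) & (cik_fanout == 1)]
    return clean, ambiguous
```

At join time, pandas merge(..., validate="one_to_one") gives the same guard
and raises instead of silently fanning out.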
3. Data and label integrity
- Check whether each public label is observable only after the correct filing
origin.
- Check whether any event after origin_date can enter predictors.
- Check whether origin_date is the focal public filing date, not
fiscal_period_end. If a panel uses fiscal_period_end as the prediction anchor,
treat that as a timing-risk finding unless the code proves otherwise.
- Check whether acceptance_datetime or an equivalent filing acceptance timestamp
is retained when same-day ordering matters. If the modeling panel uses
date-only fields, record same-day predictor/label ambiguity as a
timing-resolution limitation unless the code proves acceptance-time-safe
ordering.
- Check whether all rolling history features use event_date < origin_date,
including prior comment threads, prior NT filings, prior amendments, prior
8-K instability items, and prior auditor/oversight events.
- Require code or artifact evidence for temporal ordering (see the ordering
probe after this list). Do not accept a narrative statement that there is no
post-origin leakage without checking the joins or generated feature dates.
- Check whether label_8k_402_365 is derived from SEC submissions items metadata.
If items are missing for a filing, verify that item_metadata_missing is
recorded rather than silently falling back to HTML/TXT parsing or
primary_doc_description as a label source.
- Check whether XBRL features are as-first-reported for the origin-time panel:
facts must be tied to filings available before or at the origin, and later
amendments or restated values must not overwrite what was known at origin_date.
- Check whether censoring is horizon-specific and task-specific.
- Check whether comment letters are described as public comment-letter scrutiny,
not full SEC review.
- Check whether AAER is treated only as a severity-tail proxy or external
validation anchor, not as a full enforcement universe or a stable headline
prediction target.
- Check whether label_comment_thread_365, label_amendment_365,
label_8k_402_365, and label_aaer_proxy_730 remain separate rather than being
collapsed into a single fraud or restatement label.
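A minimal ordering probe for the temporal checks above, assuming long-format
event rows carrying event_date and a joined origin_date; actual column names
may differ. Same-day rows are flagged separately because date-only fields
cannot prove acceptance-time-safe ordering.

```python
# Temporal-ordering probe: every feature-contributing event must satisfy
# event_date < origin_date, strictly.
import pandas as pd

def ordering_violations(events: pd.DataFrame) -> pd.DataFrame:
    post_origin = events[events["event_date"] > events["origin_date"]]
    same_day = events[events["event_date"] == events["origin_date"]]
    return pd.concat([
        post_origin.assign(issue="post_origin_leakage"),     # P0 if nonempty
        same_day.assign(issue="same_day_timing_ambiguity"),  # limitation
    ])
```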
4. Benchmark layer
- Are timing_coverage.csv, timing_summary.json, rolling_metrics.csv,
rolling_predictions.parquet, structural_breaks.csv,
missing_profile_clusters.csv, dml_result.json, and benchmark_summary.md
emitted when the benchmark runs?
- Does timing_claim_status distinguish sensitivity evidence from paper-grade
maturation evidence?
- Do benchmark models use annual out-of-time windows rather than random
cross-validation for headline prediction tables?
- Are DML-style benchmark rows framed as adjusted associations, not causal
effects?
- Do reported metrics include prevalence, PR-AUC, ROC-AUC, Brier, Brier Skill
Score, ECE where applicable, top-k precision, and Bao-style top-fraction
metrics where expected? (A prevalence-aware metric sketch follows this list.)
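A prevalence-aware metric sketch for the reporting expectations above, using
scikit-learn; the top-1% cut is an illustrative choice, not the repo's
configured top fraction.

```python
# Headline metrics read against the base rate.
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss

def headline_metrics(y_true: np.ndarray, p_hat: np.ndarray) -> dict:
    prevalence = y_true.mean()                 # PR-AUC floor for a random ranker
    brier = brier_score_loss(y_true, p_hat)
    brier_ref = prevalence * (1 - prevalence)  # constant-base-rate forecast
    k = max(1, int(0.01 * len(y_true)))        # illustrative top-1% cut
    return {
        "prevalence": prevalence,
        "pr_auc": average_precision_score(y_true, p_hat),
        "brier": brier,
        "bss": 1.0 - brier / brier_ref,        # Brier Skill Score
        "precision_at_1pct": y_true[np.argsort(-p_hat)[:k]].mean(),
    }
```

BSS > 0 means the model beats a constant forecast at the base rate, which is
the honest reference under rare-event prevalence.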
5. Public cascade and public-label DML
- Are xbrl_ratio_* and xbrl_coverage_* features present in the issuer-year panel
and joined into public-cascade modeling?
- Check whether xbrl_coverage_* by fiscal year or origin year is reported or can
be computed from the artifacts (see the coverage-by-year sketch after this
list). Pay special attention to 2011-2013 because phased XBRL adoption can
create sample-composition differences.
- If early years have materially lower XBRL feature density than later years,
report this as a sample-composition limitation rather than as a model
failure.
- Check whether source_available_form_ap or auditor/oversight feature density is
reported by fiscal year or origin year. Form AP is available from 2017-01-31,
so 2011-2016 auditor partner features should be treated as structurally
coverage-limited unless another public source is documented.
- Does public_cascade_summary.json report readiness level, zero-positive tasks,
task status counts, feature-family summaries, and DML status counts?
- Are one-class train/test task fits skipped and reported rather than forced into
metrics?
- Are public_opacity_dml.csv and public_opacity_dml_meta.json based on
label_comment_thread_365, label_amendment_365, and label_8k_402_365 as primary
outcomes?
- Are public-label DML results described as adjusted associations, not causal
evidence of strategic silence?
- Is AAER proxy status-only or robustness-only when positives are sparse?
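A coverage-by-year sketch for the density checks above. The fiscal_year,
xbrl_coverage_total, and source_available_form_ap column names are
assumptions about the issuer-origin panel schema; substitute the actual
xbrl_coverage_* and availability-mask columns.

```python
# Feature density by year, computed out-of-core with DuckDB.
import duckdb

print(duckdb.sql("""
    SELECT fiscal_year,
           count(*)                            AS n_rows,
           avg((xbrl_coverage_total > 0)::INT) AS xbrl_feature_share,
           avg(source_available_form_ap::INT)  AS form_ap_share
    FROM 'data/public_lake/issuer_origin_panel.parquet'
    GROUP BY fiscal_year
    ORDER BY fiscal_year
""").df())  # expect thin XBRL shares in 2011-2013 and no Form AP before 2017
```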
6. Benchmark and public peer model-family transfer
- Legacy peer suite: check peer_comparison_summary.md,
legacy_model_family_metrics.csv, legacy_model_family_predictions.parquet,
peer_task_status.csv, feature_mapping_attrition.csv,
imbalance_strategy_report.csv, and legacy_feature_importance.csv.
- Public peer suite: check public_model_family_summary.md,
public_model_family_metrics.csv, public_model_family_predictions.parquet,
public_model_family_task_status.csv, public_model_family_mapping_attrition.csv,
public_model_family_imbalance_strategy_report.csv, and
public_model_family_feature_importance.csv.
- Check whether public_model_family_task_status.csv carries imbalance_strategy
and reason_code fields, not just an uninformative status flag.
- Check whether the peer suites are described as model-family transfer and
metric-compatible ranking evidence, not same-estimand replication of prior
fraud-prediction papers.
- Check whether public_model_family_task_status.csv skipped rows include
specific reason_code values that a developer can act on. Do not accept a
generic "skipped" status without a reason.
- Check whether public_model_family_mapping_attrition.csv records missing,
exact, and proxy mappings for public Dechow/Bao-style variables.
- Check whether Bao/Dechow public transfer states plainly when raw
accounting-number model replication is not supported by public issuer-origin
inputs.
- Check whether mapping_attrition_rate is interpreted as variable-mapping
attrition, not sample attrition.
- Check whether weak proxy mappings for Dechow/Bao-style variables are reported
plainly.
- Check whether undersampling, class weights, calibration warnings, and Brier/ECE
interpretation are documented under rare-event imbalance.
- Check whether public peer transfer covers comment_thread, amendment, and
8k_402 while keeping aaer_proxy as severity-tail status.
- Check whether aaer_proxy is skipped or status-only with
severity_tail_sparse_not_headline, blocked_sparse, or an equally explicit
reason code.
7. Bridge and construct-overlap validation
- Bridge probe: check bridge_probe_summary.json, coverage_report.csv,
multiplicity_report.csv, and unmatched_raw_characteristics.csv.
- Construct overlap: check construct_overlap_manifest.json,
construct_overlap_summary.md, construct_overlap/label_contingency_lift.csv,
construct_overlap/public_score_legacy_ranking.csv,
construct_overlap/reciprocal_alignment.csv,
construct_overlap/event_time_concentration.csv,
construct_overlap/farr_aaer_public_overlap.csv,
construct_overlap/overlap_sample_flow.csv,
construct_overlap/bridge_confidence_tiers.csv,
construct_overlap/aggregation_sensitivity.csv, and
opacity_validation_refresh outputs.
- Check whether high-confidence, ambiguous, and dropped bridge tiers are reported
separately.
- Audit the grain explicitly: raw panel grain, prediction grain, and overlap
aggregation grain. State whether the primary overlap result is annual-primary,
origin-level, or max-score/max-label aggregated, and require
aggregation_sensitivity.csv when multiple public-origin rows can map to one
annual legacy row (see the aggregation sketch after this list).
- Check whether overlap claims are limited to related-but-non-identical
constructs.
- Check whether amendment and 8-K Item 4.02 evidence are separated from the
broader comment-letter signal.
- Check whether aaer_proxy_730 sparsity is interpreted correctly: lack of public
AAER proxy positives is not evidence that legacy positives have no AAER
relation; farr AAER support is a separate severity-tail check.
- Check whether candidate_farr is clearly labeled as not WRDS-verified.
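A grain-sensitivity sketch for the aggregation audit above, assuming
origin-level rows keyed by gvkey and fiscal_year with origin_date and
public_score columns; the names are illustrative, not the repo's schema.

```python
# Collapse origin-level scores to the annual legacy grain two ways and
# report the gap, mirroring what aggregation_sensitivity.csv should show.
import pandas as pd

def aggregation_sensitivity(origin_rows: pd.DataFrame) -> pd.DataFrame:
    grouped = (origin_rows.sort_values("origin_date")
               .groupby(["gvkey", "fiscal_year"])["public_score"])
    out = pd.DataFrame({
        "score_max": grouped.max(),        # max-score aggregation
        "score_primary": grouped.first(),  # annual-primary (earliest origin)
    }).reset_index()
    out["agg_gap"] = out["score_max"] - out["score_primary"]
    return out
```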
8. Public data utilization
- Before recommending external data, check whether public SEC/PCAOB sources are
already ingested, normalized, joined, and documented.
- Check whether amendment annotations use the conservative mixed-content
priority: financial/non-admin triggers override Part III/proxy admin triggers,
with an explicit annotation note.
- Check whether amendment_annotation uses a bounded explanatory-note scan, not only
filing timing or form type. Confirm that
tests or artifacts expose admin_part_iii/proxy handling, financial_override,
and mixed-content priority.
- Do not let Part III/proxy administrative amendments and financial corrections
collapse into the same reporting-risk label without an explicit annotation.
- Check whether rolling public-history features are anchored on origin_date, not
fiscal-year end.
- Check whether filing-friction level features are kept separate from
public-history rolling counts.
- Check whether note tag entropy has a formal definition and is interpreted as
disclosure dispersion/breadth rather than mechanically as opacity (one
candidate definition is sketched after this list).
- Check whether source availability masks, first-public dates, hashes, parser
versions, and as-of dates are preserved through bronze, silver, and gold.
- Check whether non-CIK-native public sources retain original identifiers and
provenance before any CIK bridge.
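One candidate formal definition to audit against: Shannon entropy over a
filing's note-tag frequency distribution. This is a proposed reference point,
not necessarily the repo's implementation.

```python
# Shannon entropy of note-tag frequencies: H = -sum_i p_i * log2(p_i).
# High entropy means note content is dispersed across many tags (breadth);
# it is not, by itself, evidence of opacity.
import math
from collections import Counter

def note_tag_entropy(tags: list[str]) -> float:
    if not tags:
        return 0.0
    counts = Counter(tags)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```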
9. ML evaluation and reporting
- For every headline prediction claim, report the evaluation unit, label, test
years, train window, feature set, prevalence, PR-AUC, ROC-AUC, Brier/BSS, ECE
when available, and top-k or top-fraction metrics.
- Read PR-AUC against prevalence: a random ranker's expected PR-AUC equals the
base rate, so judge lift relative to that floor rather than calling a low
PR-AUC bad in isolation.
- Do not treat ROC-AUC alone as enough for rare-event reporting-risk tasks.
- For models trained with undersample_equal or other artificial balancing, treat
PR-AUC, top-k precision, and Bao-style top-fraction metrics as the ranking
evidence. Treat Brier and ECE as calibration diagnostics that can be distorted
by train-test prevalence mismatch.
- Do not use default 0.5-threshold F1, recall, or accuracy as headline evidence
for rare-event reporting-risk tasks unless the threshold policy is explicitly
justified.
- Check whether summaries that highlight the best feature set, train window, or
model family acknowledge model-selection optimism. If no correction is applied,
claims should stay at model-family or diagnostic ranking level, not
configuration-level superiority.
- Check whether predictions are annual out-of-time. If any random CV is used,
identify whether it is only for DML nuisance fitting or another secondary use.
- Inspect the actual feature-selection code path or emitted feature schema to
confirm that availability masks, public dates, vintage fields, labels, censor
columns, and provenance identifiers are excluded from default predictor
matrices.
- Check for distribution leakage in feature engineering. Imputation, scaling,
and binning parameters must be fit within the training fold or a trailing
window, not with global panel statistics (see the fold-safe pipeline sketch
after this list).
- Check whether duplicate issuer-year or gvkey-year prediction rows are blocked.
- Check whether model seeds, task seeds, parallel jobs, and model threads are
recorded in manifests.
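A fold-safe preprocessing sketch for the leakage check above, using a
scikit-learn pipeline; the estimator and imputation strategy are
illustrative, not the repo's configuration.

```python
# Preprocessing parameters are learned inside the training window only.
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    SimpleImputer(strategy="median"),  # medians fit on the train fold only
    StandardScaler(),                  # scaling stats fit on the train fold only
    LogisticRegression(max_iter=1000),
)
# pipe.fit(X_train, y_train); pipe.predict_proba(X_test)
# The test window never contributes to imputation or scaling statistics,
# which is the property to verify in the actual feature-engineering path.
```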
10. Documentation and results snapshot
- Does README.md describe the current command surface without duplicating stale
paper-plan text?
- Does docs/results_snapshot.md clearly say it is a static snapshot?
- Do results-snapshot tables match actual artifacts under artifacts/full or
artifacts/full_with_peer?
- Are wide-table pages configured for readable docs output where needed?
- Do docs avoid claiming true fraud, causal identification, full SEC review, or
full enforcement coverage?
- Do docs keep future work separate from current v1 evidence?
11. Engineering quality and efficiency
- Is reusable logic kept in src/ and thin execution code kept in scripts/?
- Are tests checking behavior and artifact contracts rather than only imports?
- Are public-lake downloads restartable and hash-checked?
- Are SEC requests rate-limited?
- Are full-scale FSDS/Notes paths using Parquet/DuckDB where pandas-only
materialization would be risky?
- Does DuckDB memory configuration, a temp/spill directory, or an equivalent
out-of-core strategy protect large FSDS/Notes/public-lake aggregations from
avoidable OOM failures (see the engineering sketches after this list)?
- Do imputation, feature selection, and prediction schemas avoid fold-dependent
shape drift, especially with all-missing features?
- Does the peer runtime fail fast when parallel_jobs * model_threads exceeds the
available worker budget?
- Does the full peer run loop over tasks, model families, feature sets, and
windows without materializing every fitted model at once?
- Are prediction artifacts written with compact dtypes where practical, such as
predicted probabilities as float32 and observed labels as int8?
- Does uv.lock pin exact resolved versions for core ML/data dependencies, while
pyproject.toml keeps reasonable range constraints?
- Do manifests record seeds, task seeds or seed policy, parallel jobs, model
threads, and enough package/runtime context to interpret results?
- If interruption could leave a corrupt JSON/CSV artifact, suggest atomic writes
as remediation (see the write_json_atomic sketch after this list). Do not
require atomic writes as a blanket rule for every output.
- Are generated artifacts ignored appropriately while source docs and tests are
tracked?
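Two minimal engineering sketches for the memory and atomic-write items above.
The memory limit and spill path are illustrative values, and
write_json_atomic is a hypothetical helper name.

```python
import duckdb
import json
import os
import tempfile

# 1. Bound DuckDB memory and give it a spill directory so large
#    FSDS/Notes/public-lake aggregations go out-of-core instead of OOM.
con = duckdb.connect()
con.execute("SET memory_limit = '8GB'")                       # illustrative
con.execute("SET temp_directory = 'artifacts/duckdb_spill'")  # illustrative

# 2. Atomic JSON write: stage in a temp file on the same filesystem, then
#    os.replace, so an interrupted run never leaves a half-written manifest.
def write_json_atomic(path: str, payload: dict) -> None:
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```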
Output format:
Start with a short verdict:
- "Paper-plan support level: strong / partial / weak"
- "Main blocker:"
- "Next gate:"
Then provide findings ordered by severity:
- P0: critical violations of leakage, timing, identity, or claim validity.
- P1: missing evidence needed for the next paper gate.
- P2: engineering debt, performance risk, or documentation drift.
For every finding, include:
- title
- severity
- evidence with file paths and line numbers where possible
- why it matters for the paper
- concrete remediation
- suggested test or artifact that would prove the fix
Then provide:
- a claim-support matrix keyed to docs/paper_plan.md gates: data integrity,
empirical sufficiency, and paper-readiness; mark each as supported, partial,
unsupported, or not_auditable_locally, and cite the exact artifact, test,
config, or code path
- one component status table covering benchmark, public cascade, legacy peer,
public peer, bridge probe, construct overlap, opacity refresh, docs, and tests
- a short public-source utilization note, focused only on sources that affect the
findings
- a Blocker Resolution Matrix only if there is a P0/P1 blocker or a native
public source/optional accelerator would directly resolve a blocker
- a command-verification section listing commands you ran or would run
- a concise "do next" list with no more than 10 items
Constraints:
- Do not edit files unless explicitly asked.
- Do not invent data availability.
- Do not treat absence of WRDS or Audit Analytics as a code bug.
- Do not recommend paid data as required for the current v1 paper.
- Do not recommend LLM/GNN/frontier multimodal work until the benchmark, public
cascade, XBRL ratios, peer-transfer evidence, and bridge/construct-overlap
gates are stable.
- Do not present farr candidate validation as WRDS-quality validation.
- Do not use vague phrases such as "robust evidence" unless you say which
artifact supports it.
- Keep the tone direct and technical. Avoid sales language and generic praise.