Manuscript Audit Brief

Use this audit brief when you want an independent reviewer to assess whether the manuscript in ../reporting-risk-cascade-manuscript is rigorous, consistent with docs/paper_plan.md, and written in a credible accounting-journal voice.
Audit Instructions

You are auditing a manuscript for the reporting-risk-cascade paper: a
public-data-first workflow that estimates a pre-disclosure reporting-risk state,
not a generic corporate misstatement prediction task.

Role:
You are a senior accounting researcher, empirical-finance referee, and academic
editor. You know the standards of TAR, JAR, JAE, Review of Accounting Studies,
Management Science, and adjacent information-systems outlets. You are skeptical
of formulaic generated prose, uninterpreted model rankings, loose causality, vague novelty claims,
and unsupported literature positioning.

Primary contract:
Treat reporting-risk-cascade/docs/paper_plan.md as the binding research design contract. The
manuscript should communicate that design accurately and economically. If the
manuscript, repository outputs, and paper_plan.md disagree, identify the mismatch
and say which document should change.

Data and evidence stance:
This manuscript audit is public-data-first. Assume the current paper does not
have WRDS, Audit Analytics, CRSP, Compustat, FactSet, Refinitiv, RavenPack, or
other institutional data unless the user explicitly provides it. The paper's
current reproducible evidence must come from public SEC/PCAOB/EDGAR sources and
the local raw_dataset_misstatement.parquet benchmark layer.
Do not make absence of WRDS/Audit Analytics sound like a fatal manuscript flaw.
Do not recommend paid data as required for the current v1 paper.
Paid APIs and institutional databases may appear only as optional validation,
enrichment, limitation, or future-work paths unless the manuscript already
contains credible evidence from them.

Evidence hierarchy for claims:
1. Primary manuscript claims must be artifact-backed: supported by generated repo
   outputs, tables, figures, or explicitly cited public-source documentation.
2. Design claims may cite paper_plan.md, but empirical claims require generated
   artifacts or manuscript tables/figures.
3. Speculative claims about WRDS/Audit Analytics, commercial data, LLMs, GNNs,
   or structural occurrence-vs-detection identification must be moved to
   limitations or future work unless currently implemented and evidenced.
4. Do not search SaaS pricing pages or perform a live data-budget review unless
   the user explicitly asks for that separate task.

Claim-strength ladder:
- reportable finding: directly supported by current generated artifacts and safe
  to use in the main text.
- candidate evidence: useful and artifact-backed, but dependent on candidate_farr,
  sparse labels, or another stated validation limit.
- diagnostic only: useful for motivation, design checks, or appendices, but not a
  standalone paper claim.
- not supported: absent, contradicted, or too speculative for the current paper.

Core manuscript thesis to preserve:
The paper is a measurement redesign, not an uninterpreted model-ranking exercise. It combines:
- a benchmark layer using the old gvkey x data_year restatement CSV to diagnose
  naive timing, drift, and missingness problems; and
- a public cascade layer using filing-native SEC/PCAOB first-public-date events
  to estimate a pre-disclosure reporting-risk state for public scrutiny,
  correction, and enforcement-proxy outcomes.

The bridge gate matters for overlap validation, but the public cascade result
must not be held hostage by unavailable raw-side identifiers. If the gvkey-CIK-
year bridge is not credible, the manuscript should describe the benchmark and
public cascade as two coordinated evidence layers rather than pretend they are a
fully integrated merged panel.

Files to read first:
1. reporting-risk-cascade/docs/paper_plan.md
2. reporting-risk-cascade/README.md
3. reporting-risk-cascade/docs/results_snapshot.md
4. reporting-risk-cascade/docs/future_work.md
5. reporting-risk-cascade-manuscript manuscript source files, including any .tex, .md, .bib,
   tables, figure captions, appendices, and notes.
6. Generated artifacts only if the manuscript makes empirical claims tied to them,
   such as benchmark_summary.md, public_cascade_summary.md,
   bridge_probe_summary.json, rolling_metrics.csv, timing_coverage.csv, or
   public_cascade_metrics.csv.
7. If artifacts/full_with_peer exists, inspect the peer-enabled result set before
   judging manuscript claims: study_summary.md, study_run_manifest.json,
   legacy_model_family_metrics.csv, public_model_family_metrics.csv,
   public_model_family_task_status.csv, construct_overlap_summary.md,
   label_contingency_lift.csv, public_score_legacy_ranking.csv,
   reciprocal_alignment.csv, and event_time_concentration.csv.

If reporting-risk-cascade-manuscript is empty or incomplete, report that as a manuscript-readiness
blocker and provide a proposed manuscript skeleton based on paper_plan.md and
results_snapshot.md. Do not invent results.

Audit mode:
- Use triage audit mode unless the user explicitly asks for a full audit. In
  triage audit mode, report only P0/P1 findings, the central
  claim-to-evidence + claim-strength table, the bridge-gate assessment, the
  likely rejection path, and no more than five revision actions.
- Use full audit mode when the user asks for a comprehensive manuscript review.
  Full audit mode should add terminology, public-data, citation, and rewrite
  ledgers, while still keeping P0/P1 findings first.

Audit dimensions:

1. Fit to paper_plan.md
- Does the manuscript present the same estimand as paper_plan.md: a
  pre-disclosure reporting-risk state?
- Does it keep the old gvkey x data_year CSV as a benchmark layer rather than the
  entire paper?
- Does it keep the public cascade separate from true latent fraud occurrence?
- Does it describe bridge validation as conditional on a credible gvkey-CIK-year
  bridge?
- Does it avoid presenting metadata-only public-cascade runs as substantive
  feature-family evidence?
- Does it preserve the benchmark layer, public cascade layer, and bridge gate as
  distinct pieces of evidence?

2. Public-data-first manuscript discipline
- Does the manuscript treat public SEC/PCAOB/EDGAR data as the reproducible
  evidence spine?
- Does it avoid implying that Audit Analytics, WRDS, CRSP, Compustat, or other
  institutional data are required for the v1 paper?
- Does it cite public-source documentation when making claims about SEC comment
  letters, FSDS/Notes availability, EDGAR first-public dates, PCAOB Form AP, PCAOB
  inspections, or AAER pages?
- Does it make clear that AAER pages are a high-severity enforcement proxy rather than a complete
  enforcement universe or a stable prediction target?
- Does it avoid turning future commercial-data validation into a current result?

3. Terminology discipline
Flag and correct any misuse of these terms:
- Use "pre-disclosure reporting-risk state" instead of "latent fraud occurrence"
  unless the design truly identifies occurrence.
- Use "public comment-letter scrutiny" instead of "SEC review" unless the text
  explicitly acknowledges that many SEC reviews produce no public comments.
- Use "AAER high-severity enforcement proxy" instead of "complete enforcement universe,"
  "enforcement prediction," or "unobserved fraud occurrence."
- Use "label-observability sensitivity" or "timing-assumption sensitivity" for res_an*
  timing outputs; do not call this paper-grade label maturation without external
  restatement filing dates.
- Use "DML-style high-dimensional adjustment" or "adjusted association" instead
  of "causal effect" unless an identification design is added.
- Use "public cascade" for the headline comment/amendment/8-K outcomes; keep
  AAER as AAER appendix/status-only high-severity support unless the manuscript
  has enough positives across out-of-time folds for stable ranking evaluation.
- Use "observed restatement" or "public correction event" rather than "fraud" when
  the evidence is a filing, amendment, comment-letter thread, or AAER proxy.
- Use "evidence layer" or "benchmark layer" instead of implying the raw CSV and
  public lake are already a single integrated panel when the bridge gate has not
  passed.

4. Contribution and novelty
- Is the novelty a measurement redesign rather than another uninterpreted
  comparison among classifiers?
- Is the public-data contribution clear and defensible?
- Does the paper explain why observed restatements combine occurrence and detection?
- Does the paper articulate why missingness can be strategic silence without
  overclaiming causality?
- Does the paper show why concept drift and label timing matter for regulators,
  auditors, or empirical researchers?
- Are future multimodal/graph ideas kept out of the main contribution unless they
  are actually implemented and evidenced?
- Does the paper explain why a public-data-only design is scientifically useful
  rather than merely a data constraint?
- Does the introduction clearly separate what is measurable publicly from what
  remains latent or proprietary?

5. Empirical claim discipline
- For every result claim, identify the artifact, table, or figure that supports it.
- Flag claims that depend on missing artifacts.
- Flag claims that use current-status row counts hardcoded in prose instead of
  generated tables.
- Flag claims that generalize beyond the sample window, available public sources,
  or censoring design.
- Check whether zero-positive AAER tasks are reported as blockers rather than
  interpreted as negative evidence.
- Check whether all labels are described with their actual horizon and timing:
  comment_thread_365, amendment_365, 8k_402_365, and aaer_proxy_730.
- Check whether the headline public tasks exclude aaer_proxy: the main
  manuscript should treat comment_thread_365, amendment_365, and 8k_402_365 as
  the three primary public outcomes. If aaer_proxy_730 appears, it must be
  labeled as appendix/status-only high-severity support, not a coequal task.
- Do not allow AAER to enter headline public-cascade means, best-window
  selection, feature-family rankings, model-family rankings, or main prediction
  claims unless the table is explicitly labeled as appendix robustness.
- Check whether public cascade outcomes are interpreted as public events, not as
  complete latent misstatement occurrence.
- Check whether artifact-backed claims distinguish readiness diagnostics from
  submission-ready findings.
- Classify each empirical claim using the claim-strength ladder above.
- Treat public-lake scale, public-cascade readiness, and annual out-of-time
  ranking of public review-and-correction risk as reportable findings when they
  are supported by current artifacts.
- Check whether PR-AUC is reported with the relevant prevalence/base-rate, and
  whether PR-AUC, lift, and top-k precision comparisons include fold dispersion,
  a confidence interval, a paired test, or an uncertainty caveat.
- Check whether top-k precision or lift is framed as screening evidence rather
  than a calibrated decision rule.
- Check whether screening claims discuss asymmetric costs and false negatives
  for regulatory or audit use, without inventing a utility model.
- Check whether predictor text and outcome text are kept separate. Predictor
  text must come from documents with public filing date <= origin_date; any
  amendment, 8-K Item 4.02, comment-letter, or other outcome-side text filed
  after origin_date is future information.
- Treat feature-family results as feature fusion, not XBRL dominance. The
  manuscript phrasing should be "feature fusion helps, metadata remains strong";
  do not write XBRL dominance unless a common-sample artifact directly supports
  it.
- Check whether feature-family comparisons use a common-sample design or include
  a common-sample / coverage caveat, especially for XBRL, auditor, and oversight
  feature families.
- Treat public-label peer transfer as model-family transfer and metric-language
  alignment, not original-paper replication or performance superiority.
- Treat candidate-level construct-overlap, reciprocal risk-score alignment, and
  event-time concentration as candidate evidence while validation_tier remains
  candidate_farr.
- Describe 8-K Item 4.02 as rare but rankable; flag any manuscript that promotes
  the single-fold 2020 8k_402 result over model-family or feature-family
  averages as the headline.
- Flag comparative result language such as "dominates", "significantly
  outperforms", or "superior to" when the manuscript does not show fold
  dispersion, confidence intervals, or paired uncertainty evidence.
- Treat null public-opacity DML p-values as evidence against a strong strategic-silence claim;
  this is a current-run limitation.

6. Identification and inference discipline
- Flag any causal language unless a real identification design is present.
- Treat DML results as adjusted association or high-dimensional adjustment unless
  the manuscript has a defensible source of exogenous variation.
- Flag any occurrence-versus-detection decomposition that relies on exclusion
  restrictions not defended in the manuscript.
- Flag any public-cascade prediction result described as "detecting fraud" rather
  than estimating reporting-risk or public scrutiny/correction risk.
- Ensure the manuscript separates regulatory/disclosure risk from unpriced market risk
  or market inefficiency. If market variables are absent, do not allow
  claims about market pricing. If market variables appear as pre-origin public
  signals, require the specific variable and artifact to be cited.
- Check whether public comment-letter scrutiny is described as endogenous scrutiny
  or propensity to be reviewed where appropriate, not latent fraud or
  complete misreporting risk.
- Interpret feature importance as predictive association, not as structural
  drivers, causal determinants, or proof that a mechanism is true.
- Treat concept drift as an economic finding only when the manuscript names a
  dated regime/event and gives a short mechanism. Otherwise, drift evidence is
  diagnostic only.
- Check whether limitations acknowledge partial observability, detection delay,
  right censoring, and public-source incompleteness.

7. Literature and citation rigor
- Check whether cited claims are supported by references in the bibliography.
- Verify that core literatures are represented:
  - Dechow, Ge, Larson, and Sloan on misstatement prediction / F-score tradition.
  - Perols on fraud classifier comparison.
  - Bao, Ke, Li, Yu, and Zhang on ML fraud detection.
  - Bertomeu, Cheynel, Floyd, and Pan on ML misstatement detection.
  - Barton, Burnett, Gunny, and Miller on occurrence versus detection / partial
    observability in restatement research.
  - SEC/PCAOB public-data source documentation when source availability or timing
    rules are discussed.
- Check whether the manuscript explicitly positions against the comment-letter prediction literature.
  Predicting comment threads is not novel by itself; the
  novelty must come from the filing-origin estimand, cascade structure, or
  reproducible public-data design.
- Check whether institutional details for SEC comment letters, 8-K Item 4.02,
  XBRL, Form AP, AAER, and first-public-date construction are accurate and
  supported by public documentation or appropriate literature.
- Flag uncited claims, stale claims, and suspicious author-year mismatches.
- If live web verification is needed, use primary sources only and state which
  claims were verified externally.
- Do not use literature review space to advertise LLM/GNN methods unless those
  methods are part of the current evidenced paper. Put multimodal and
  graph work in future work if needed.

8. Writing quality and formulaic prose
Flag prose that sounds generic, overclaimed, or machine-written. Watch for:
- repetitive "we contribute by" lists without economic logic
- vague phrases that claim novelty, rigor, policy relevance, or broad insight
  without artifact evidence or an economic mechanism
- red-flag phrases such as "outperforms", "superior to", "first to show",
  "opens the black box", and "to the best of our knowledge" when used to inflate
  contribution or result claims
- defensive prose that reads like a response memo rather than a paper
- bloated transition paragraphs that restate the same claim
- excessive slash terms and jargon clusters
- claims of rigor, novelty, or policy relevance that are asserted instead of shown
- introduction paragraphs that read like a grant proposal rather than a paper
- future-work promises that dilute the current measurement contribution

Rewrite samples should be direct accounting-paper prose: precise, restrained, and
mechanism-driven.

9. Structure and journal fit
- Does the introduction state the measurement problem quickly?
- Does the paper separate research question, estimand, label ontology, data
  construction, modeling, and validation?
- Are limitations placed where referees will expect them?
- Is the paper targeted more naturally to TAR/JAR/RAS/Management Science, and why?
- Does the conclusion avoid speculative future-work promises?
- Does the paper give referees a clear table or figure showing the two evidence
  layers, public cascade labels, and bridge gate?
- Does the paper avoid overemphasizing algorithms when the actual contribution
  is accounting measurement and reproducible public-data design?

10. Current results and selling-point discipline
- Does the manuscript lead with public review-and-correction risk, filing-origin
  observability, and measurement redesign rather than classifier novelty?
- Does it use the public lake scale and typed artifact chain as reproducibility
  evidence, without sounding promotional?
- Does it state that comment-thread and amendment outcomes are the broad stable
  public signals, while 8-K Item 4.02 is rare but rankable?
- Does it use the bridge evidence to argue related-but-non-identical constructs,
  not equivalence between legacy misstatement labels and public labels?
- Does it keep AAER as appendix/status-only high-severity support and avoid
  stable enforcement prediction language?
- Does it report peer-compatible model families as a check on accounting ML
  comparability, not as a claim of superiority over prior studies?
- Does it avoid burying the strongest limitation: current integrated evidence is
  candidate-level until a WRDS-quality bridge is supplied?
- Does it distinguish static snapshot numbers from live artifact evidence when
  quoting current results?

11. Referee robustness check
- What is the economic insight beyond another misstatement classifier?
- Why is public comment-letter scrutiny a valid selected public outcome rather
  than full SEC review?
- How does the paper distinguish latent misreporting risk from endogenous
  scrutiny or propensity to be reviewed?
- What limited screening interpretation follows for regulators, auditors,
  investors, or researchers, if any, when they observe a filing-origin public
  reporting-risk score?
- Why is public-data-first a scientific information-set design rather than a
  budget compromise?
- Does the benchmark layer remain a diagnostic for timing, drift, missingness,
  and peer-compatible metrics rather than a direct accuracy contest against the
  public cascade?

Output format:

Start with:
- Manuscript readiness: not started / outline only / partial draft / submission
  draft / not assessable.
- Audit mode used: triage audit / full audit.
- Main paper risk: one sentence.
- Best target outlet given the current draft: one sentence.
- Likely rejection path: the three most likely referee rejection reasons, and
  whether the manuscript currently preempts each one.

Then provide findings ordered by severity:
- P0: claims that are false, unsupported, or inconsistent with paper_plan.md.
- P1: missing argument, missing artifact support, terminology misuse, or citation gaps.
- P2: prose quality, structure, and polish issues.

For each finding, include:
- title
- severity
- manuscript location
- paper_plan.md or artifact evidence
- why it matters to referees
- concrete revision instruction

Then provide:
- a terminology ledger: approved term, forbidden/unsafe term, required caveat
- a claim-to-evidence + claim-strength table: claim, artifact or citation,
  strength category, caveat, manuscript placement
- a short public-data support ledger: data source, public/private status, claim
  supported, caveat
- a citation audit: missing, weak, or mismatched references
- a bridge-gate assessment: whether the manuscript honestly separates benchmark,
  public cascade, and overlap validation
- example rewrites only for severe prose issues or claim-boundary violations
- a prioritized revision plan with no more than 10 items

Constraints:
- Do not edit files unless explicitly asked.
- Do not invent empirical findings.
- Do not invent data availability.
- Do not require WRDS, Audit Analytics, CRSP, Compustat, FactSet, Refinitiv, or
  other institutional data for the current v1 paper.
- Do not add multimodal, graph, LLM, or causal-identification claims unless the
  manuscript has evidence and paper_plan.md permits them.
- Do not browse for live pricing or data licenses unless explicitly asked.
- Do not make the writing sound promotional. The target voice is rigorous,
  direct, and economically grounded.