# Results Snapshot

## Discussion
- **Bottom line.** The current evidence supports a reproducible measurement-and-ranking paper on public review-and-correction risk; it does not support causal claims, claims about true fraud, or leaderboard superiority over prior fraud-prediction papers.
- **Does the task have empirical value?** Yes. The useful object is not another static `misstatement = 1` classifier; it is a filing-origin estimand for whether an issuer later enters an observable public review or correction channel. That makes the task closer to the information environment faced by investors, auditors, researchers, and regulators at the filing date.
- **What is the main research decision behind the pipeline?** The legacy `gvkey x data_year` benchmark is retained as a diagnostic layer, not treated as the only ground truth. The public cascade is the main empirical object: `comment_thread` for public scrutiny, `amendment` for correction/friction, `8k_402` for severe material correction, and `aaer_proxy` only for sparse enforcement-tail support.
- **What data are used?** The workflow combines four data layers. First, the legacy detected-misstatement benchmark provides 82,908 `gvkey x data_year` observations from 2001-2019 for timing, drift, missingness, and peer-model diagnostics. Second, the public SEC/PCAOB lake normalizes EDGAR filing metadata, FSDS/XBRL numeric facts, Notes summaries, Form AP, PCAOB inspection records, comment-letter correspondence, amended filings, 8-K Item 4.02 events, and AAER support data. Third, the gold public panels create a 205,831-row issuer-year modeling table and a 21.7 million-row filing provenance table; the main domestic public-cascade sample has 90,445 issuer-year rows from 2011-2023. Fourth, farr support files supply the candidate `gvkey-CIK-year` bridge plus AAER/date support used for construct-overlap and severity-tail checks.
- **What data matter most?** The most useful evidence comes from the public SEC/PCAOB issuer-year panel, especially comment threads, amendments, 8-K Item 4.02 events, XBRL ratios, filing metadata, and auditor/oversight features. The farr `gvkey-CIK-year` bridge is valuable for candidate construct-overlap evidence; AAER support data are useful only as a sparse severity-tail descriptor.
- **Why use Parquet and the public lake?** The storage choice is part of the research design. Large public filing, XBRL, Notes summary, and gold-panel tables use Parquet so the workflow can rerun at realistic scale with typed columns, projection pushdown, and lower repeated I/O cost. Small diagnostics remain CSV/JSON/Markdown for inspection.
- **What setup choices are being compared?** The public cascade fixes the task definitions and varies two modeling dimensions. The task dimension covers `comment_thread`, `amendment`, and `8k_402` as headline labels; `aaer_proxy` is retained as sparse severity-tail status. The feature-family dimension compares `metadata` (filing and issuer-origin descriptors), `xbrl` (financial ratios and XBRL coverage), `auditor` (Form AP and engagement features), `oversight` (PCAOB-style oversight exposure), and `all` (their union). The temporal dimension compares annual `rolling_5y`, `rolling_7y`, `rolling_10y`, and `expanding` train windows.
- **Are these model hyperparameters?** Mostly no. The main comparisons are experimental design factors: outcome definition, feature-family scope, and temporal training window. They determine the empirical estimand and evaluation design. Algorithmic hyperparameters, such as XGBoost `n_estimators=250`, `max_depth=4`, and `learning_rate=0.05`, are held fixed in the core public-cascade model so the results are interpretable as design comparisons rather than a tuning exercise.
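Pinning the held-fixed hyperparameters in one place makes the design-comparison logic explicit. A minimal sketch: the three values come from the text above, while the dictionary name, the helper function, and all remaining `XGBClassifier` defaults are illustrative assumptions, not the repo's actual code.

```python
# Fixed XGBoost configuration for the core public-cascade model.
# Freezing these values keeps every feature-family and train-window
# comparison a design comparison rather than a tuning exercise.
XGB_FIXED_PARAMS = {
    "n_estimators": 250,
    "max_depth": 4,
    "learning_rate": 0.05,
}

def make_model(random_state: int):
    """Hypothetical factory: design factors (labels, features, windows)
    vary elsewhere; only the seed changes here. Assumes xgboost is
    installed; the import is deferred so the config stays inspectable."""
    from xgboost import XGBClassifier
    return XGBClassifier(random_state=random_state, **XGB_FIXED_PARAMS)
```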
- **Which design factors are studied?** The legacy benchmark varies label timing assumptions (`naive`, `proxy_drop_observed`, and `proxy_imputed_lag` with 1-, 2-, 3-, and 5-year lags) and train windows (`rolling_5y`, `rolling_7y`, `rolling_10y`, `expanding`). The public cascade varies public-label tasks, feature families, and the same train-window logic. The peer suites vary model family and imbalance handling, while preserving the same out-of-time split design.
- **Why this setup design?** The split design follows the filing-origin question. The model trains on past fiscal years and tests on later fiscal years, so the evaluation reflects out-of-time prediction rather than random within-panel interpolation. The feature-family design separates the value of filing metadata, XBRL, auditor, and oversight information, while the train-window design asks whether recent history or longer accumulated history is more useful. This prevents the public-data claim from collapsing into a single black-box `all`-features result.
- **What setup works best?** The strongest core public-cascade setup is `all + rolling_5y`, with an equal-task mean PR-AUC of `0.2475`. In the public peer transfer, the `all` feature family is also strongest on average (mean PR-AUC `0.2510`), with metadata remaining a strong baseline. The right conclusion is feature-fusion gain, not XBRL dominance.
- **What models are included, and why these models?** The model set is designed for peer-compatible evidence rather than novelty. It includes Dechow-family logit/F-score language, Perols-style logit/SVM/tree/bagging/stacking/MLP families, Bao-inspired tree ensembles, and Bertomeu-style XGBoost. These cover the main model families accounting reviewers expect to see without claiming an original-paper replication.
- **Which models perform best?** In the legacy benchmark peer suite, `bertomeu_style_xgb` is the strongest mean PR-AUC model (`0.0427`). In the public-label peer suite, `bao_inspired_tree_ensemble` and `bertomeu_style_xgb` lead on mean public-label PR-AUC (`0.2244` and `0.2243`). High single-fold `8k_402` rows are useful diagnostics, but they are not the stable headline result.
- **How should peer comparisons be read?** They are model-family transfer and metric-language alignment, not original-paper numeric replication. The Dechow fixed F-score is skipped unless mapping quality is sufficient; Bao is reported as `bao_inspired_tree_ensemble` because the repo panel is not the same raw accounting-number input used in the original setting.
- **What metrics are reported?** The snapshot reports PR-AUC, ROC-AUC, Brier, Brier skill, equal-width and quantile ECE, fixed top-k precision, top-decile lift, and Bao-style top-fraction precision, sensitivity, specificity, balanced accuracy, and NDCG. The goal is to show discrimination, ranking, screening, and calibration diagnostics without pretending they answer the same question.
- **Which metric is most reasonable for the headline?** PR-AUC is the primary headline ranking metric because the tasks are imbalanced and prevalence differs sharply by label. It is also more aligned with the practical question: whether high-scored issuer-years concentrate later public review or correction events.
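To make the headline metric concrete, here is a pure-Python sketch of average precision, one standard estimator of the area under the precision-recall curve. Whether the repo uses this exact estimator is an assumption; the point is only that the score rewards rankings that concentrate positives at the top.

```python
def average_precision(labels_by_score_desc):
    """Average precision for binary labels already sorted by model score
    (highest first): the mean of precision@rank taken at each positive."""
    hits, total, n_pos = 0, 0.0, sum(labels_by_score_desc)
    for rank, y in enumerate(labels_by_score_desc, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos if n_pos else 0.0

# Positives at ranks 1 and 3 give AP = (1/1 + 2/3) / 2 = 0.8333...
# A random ranker's expected AP is roughly the positive prevalence,
# which is why PR-AUC must be read against each task's base rate.
print(average_precision([1, 0, 1, 0, 0]))  # 0.8333...
```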
- **Which metrics are most informative beyond PR-AUC?** Top-decile lift and Bao-style top-fraction metrics translate ranking into screening language. ROC-AUC is useful as a secondary discrimination metric. Brier and ECE are calibration diagnostics, especially fragile for undersampled Perols-style models.
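Top-decile lift has a simple conventional definition that a short sketch makes explicit: the positive rate among the top-scored 10% divided by the overall positive rate. The repo's exact cutoff rounding and tie handling are assumptions here.

```python
def top_decile_lift(scores, labels):
    """Lift = positive rate in the top-scored 10% divided by the overall
    positive rate; values above 1.0 mean the ranking concentrates events."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, len(scores) // 10)
    top_rate = sum(labels[i] for i in order[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# 20 issuer-years, 2 positives, both carrying the top 2 scores:
# top-decile precision 1.0 against a 0.1 base rate gives a 10x lift.
scores = [0.9, 0.8] + [0.1] * 18
labels = [1, 1] + [0] * 18
print(top_decile_lift(scores, labels))  # 10.0
```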
- **Why not random cross-validation?** The benchmark and public-cascade prediction results use annual out-of-time rolling or expanding splits. That choice is deliberate: random folds would be a weaker design for a filing-origin prediction question with changing disclosure, review, and reporting regimes.
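The rolling and expanding out-of-time split logic can be sketched in a few lines. This is a minimal illustration under the window names used in the snapshot; the repo's actual fold bookkeeping may differ in detail.

```python
def out_of_time_folds(years, window="rolling_5y"):
    """Annual out-of-time folds: train on past fiscal years, test on the
    next year. `rolling_Ny` keeps only the last N training years;
    `expanding` keeps all accumulated history."""
    years = sorted(set(years))
    width = None if window == "expanding" else int(window.split("_")[1][:-1])
    for i, test_year in enumerate(years):
        train = years[:i] if width is None else years[max(0, i - width):i]
        if train:  # skip years with no usable training history
            yield train, test_year

folds = list(out_of_time_folds(range(2011, 2016), window="rolling_5y"))
# First usable fold trains on [2011] and tests on 2012; no fold ever
# tests on a year that appears in its own training window.
```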
- **What is the economic interpretation?** Public reporting-risk states are observable and rankable at the filing origin. Comment-letter scrutiny captures broad public review, amendments capture correction/friction, and 8-K Item 4.02 captures a rarer material-correction channel. Legacy misstatement positives overlap most strongly with serious public correction outcomes, especially 8-K Item 4.02, but the constructs are related rather than identical.
- **What should accounting readers take away?** The paper's contribution is a measurement redesign: move from treating detected misstatement as the only target toward a filing-origin public cascade that separates public scrutiny, correction, severe correction, and sparse enforcement-tail evidence. The current bridge evidence is candidate-level under farr; validation against a WRDS-quality bridge remains preferred before final, manuscript-level integrated claims.
- **What will reviewers likely scrutinize?** The strongest open gate is bridge quality. farr provides high-coverage candidate evidence, but a WRDS or equivalent institutional `gvkey-CIK-year` bridge is still preferred for final integrated claims. AAER also remains sparse and selective, so it should stay a severity-tail descriptor rather than a headline prediction target.
## Run Metadata
| Field | Value |
|---|---|
| Study manifest timestamp | 2026-04-27T02:27:29+00:00 |
| Construct-overlap timestamp | 2026-04-27T02:56:13+00:00 |
| Runtime | parallel_jobs=4, model_threads=2, seed_policy=task-isolated |
| Benchmark input | data/raw_dataset_misstatement.parquet |
| Public issuer panel | data/public_lake/gold/issuer_origin_panel.parquet |
| Public filing panel | data/public_lake/gold/filing_origin_panel.parquet |
| Bridge crosswalk | data/external/gvkey_cik_year.csv |
| Construct overlap | complete, validation_tier=candidate_farr |
| Peer comparison | full, legacy PR1 suite plus public-label PR2 suite |
Key readings:
- The snapshot is based on the peer-enabled study directory, `artifacts/full_with_peer`.
- The run includes the legacy benchmark, public cascade, bridge probe, legacy-peer suite, public-label peer suite, and construct-overlap validation.
- Construct overlap is complete under the farr candidate bridge; this is not yet WRDS-verified manuscript-grade bridge evidence.
## Evidence Map
```mermaid
flowchart LR
    A["Legacy benchmark<br/>gvkey x data_year"] --> B["Timing diagnostics<br/>naive, proxy drop, imputed lag"]
    B --> C["Benchmark evidence<br/>label observability, drift, missingness"]
    B --> P["Peer-compatible legacy suite<br/>Dechow, Perols, Bao, Bertomeu model families"]
    D["Public SEC/PCAOB lake<br/>filings, XBRL, Notes, Form AP, AAER"] --> E["Gold panels<br/>issuer-year modeling panel + filing provenance panel"]
    E --> F["Public cascade labels<br/>comment thread, amendment, 8-K 4.02, AAER proxy"]
    E --> G["Pre-origin features<br/>metadata, XBRL ratios, text summary, auditor, oversight"]
    F --> H["Public cascade models<br/>ranking and feature-family ablation"]
    F --> Q["Peer-compatible public-label suite<br/>same model families, public estimand"]
    G --> H
    G --> Q
    G --> I["Public opacity DML<br/>adjusted association, not causal effect"]
    C --> J["Bridge gate<br/>gvkey-CIK-year crosswalk"]
    P --> J
    H --> J
    Q --> J
    I --> J
    J --> K["Overlap validation<br/>related but non-identical labels"]
```
Key readings:
- The left branch diagnoses legacy detected-misstatement labels; the right branch builds the public filing-origin cascade.
- Peer-compatible model families are evaluated on both the legacy and public estimands, but those are not same-label leaderboards.
- The bridge gate is what lets the paper test whether old and public labels are related without treating them as identical.
### Table 1. Public Lake and Gold Panel Scale
| Layer | Artifact | Rows | Notes |
|---|---|---|---|
| Bronze | public source cache | 206 files | SEC, PCAOB, FSDS, Notes, AAER |
| Silver | `filing_dim.parquet` | 21,743,433 | normalized public filing index |
| Silver | `issuer_dim.parquet` | 966,095 | normalized issuer dimension |
| Silver | `xbrl_core_fact/` | 18,010,256 | controlled XBRL core tags only |
| Silver | `xbrl_fact_summary.parquet` | 362,013 | accession-level fact coverage |
| Silver | `note_summary.parquet` | 345,490 | Notes summary mode, no raw text blobs |
| Silver | `comment_thread.csv.gz` | 125,266 | public SEC comment-thread signal |
| Silver | `correction_event.csv.gz` | 89,926 | amended-filing/correction signal |
| Gold | `issuer_origin_panel.parquet` | 205,831 | annual modeling panel with labels and features |
| Gold | `filing_origin_panel.parquet` | 21,743,433 | lightweight filing-origin provenance panel |
Key readings:
- The public lake is at realistic scale: more than 21.7 million normalized filing-origin rows support a compact annual issuer-year panel.
- The annual `issuer_origin_panel` is the modeling table; the full `filing_origin_panel` is a provenance and auditability layer.
- Notes are in summary mode, so the run avoids raw text blobs while retaining filing-level text-count signals.
### Table 2. Public Cascade Readiness
| Field | Value |
|---|---|
| Main sample rows | 90,445 |
| Fiscal-year span | 2011-2023 |
| Domestic US GAAP only | True |
| Zero-positive tasks | none |
| Task status counts | fit=520, skipped_one_class_train=120 |
| Readiness level | xbrl_ratio_baseline |
| Best equal-task configuration | all + rolling_5y |
| Best equal-task mean PR-AUC | 0.2475 |
| Feature family | Features | XBRL ratio features | XBRL coverage features | Best window | Mean PR-AUC |
|---|---|---|---|---|---|
| all | 78 | 11 | 15 | rolling_5y | 0.2475 |
| metadata | 27 | 0 | 0 | rolling_5y | 0.2297 |
| xbrl | 42 | 11 | 15 | expanding | 0.1732 |
| auditor | 6 | 0 | 0 | rolling_5y | 0.1385 |
| oversight | 1 | 0 | 0 | expanding | 0.1225 |
Key readings:
- The full public-cascade panel is ready for modeling: there are no zero-positive public tasks in the headline run.
- The best equal-task configuration is `all + rolling_5y`, with mean PR-AUC `0.2475`.
- Feature fusion improves over metadata alone, but the margin is moderate; XBRL ratios clear the implementation gate without dominating the run.
### Figure 1. Public Cascade Signal Gradient
```mermaid
flowchart TB
    A["Comment thread<br/>24,880 positives<br/>PR-AUC 0.4484"] --> B["Amendment<br/>17,255 positives<br/>PR-AUC 0.3340"]
    B --> C["8-K Item 4.02<br/>2,009 positives<br/>PR-AUC 0.0767"]
    C --> D["AAER proxy<br/>20 positives<br/>PR-AUC 0.1308<br/>late-year feasibility only"]
```
| Task | Positives | Mean fitted test prevalence | Mean PR-AUC | Mean ROC-AUC | Fitted years | Interpretation |
|---|---|---|---|---|---|---|
| `comment_thread` | 24,880 | 0.2615 | 0.4484 | 0.7105 | 8 | strongest public scrutiny signal |
| `amendment` | 17,255 | 0.1552 | 0.3340 | 0.7176 | 8 | clear correction/friction signal |
| `8k_402` | 2,009 | 0.0221 | 0.0767 | 0.7768 | 8 | rare but rankable severe correction signal |
| `aaer_proxy` | 20 | 0.0013 | 0.1308 | 0.7584 | 2 | feasibility signal only, not a stable claim |
Key readings:
- The public cascade supports a measurable public reporting-risk state; it does not recover latent true fraud.
- `comment_thread` and `amendment` provide the strongest and most stable public review-and-correction signals.
- `8k_402` is rare but rankable; `aaer_proxy` is too sparse for headline performance claims.
- `Prevalence` is the fitted test-set positive rate and the PR-AUC random-ranking baseline, so PR-AUC must be read relative to each task's base rate.
- The reported `0.2475` is an equal-task mean across public labels, not a single fraud-model headline score.
### Table 3. Benchmark Timing Diagnostics
| Label mode | Best window | PR-AUC | Top-100 precision | Bao NDCG@1% | Mean retained positive share |
|---|---|---|---|---|---|
| `naive` | rolling_5y | 0.0729 | 0.0879 | 0.1606 | 1.000 |
| `proxy_imputed_lag_1y` | rolling_5y | 0.0451 | 0.0621 | 0.0952 | 0.897 |
| `proxy_imputed_lag_2y` | expanding | 0.0394 | 0.0543 | 0.0762 | 0.805 |
| `proxy_imputed_lag_3y` | expanding | 0.0340 | 0.0471 | 0.0543 | 0.696 |
| `proxy_imputed_lag_5y` | expanding | 0.0322 | 0.0379 | 0.0505 | 0.425 |
| `proxy_drop_observed` | rolling_7y | 0.0229 | 0.0243 | 0.0265 | 0.052 |
Benchmark panel:
| Field | Value |
|---|---|
| Rows | 82,908 |
| Firms | 9,156 |
| Years | 2001-2019 |
| Positive rows | 2,460 |
| Positive rate | 0.0297 |
| Same-row positives with any `res_an*` | 151 |
| Same-row positives without any `res_an*` | 2,309 |
Key readings:
- The timing grid is a sensitivity design, not a recovery of true detection dates.
- The naive detected-misstatement label ranks best, but ranking weakens as the label is constrained by visibility assumptions.
- `proxy_drop_observed` is a severe attrition stress test; it should not be read as standalone proof of look-ahead bias.
- Benchmark and public-cascade prediction rows use annual out-of-time rolling/expanding splits, not random cross-validation.
- Double / Debiased Machine Learning (DML) opacity rows use cross-fitting for nuisance models and are adjusted associations, not prediction leaderboard rows.
### Figure 2. Timing-Sensitivity Pattern
```mermaid
flowchart LR
    A["Naive final label<br/>PR-AUC 0.0729"] --> B["1-year imputed visibility<br/>PR-AUC 0.0451"]
    B --> C["2-year imputed visibility<br/>PR-AUC 0.0394"]
    C --> D["3-year imputed visibility<br/>PR-AUC 0.0340"]
    D --> E["5-year imputed visibility<br/>PR-AUC 0.0322"]
    E --> F["Drop observed-only proxy<br/>PR-AUC 0.0229"]
```
Key readings:
- The ordering is monotone in the expected direction: stricter timing visibility assumptions reduce apparent benchmark performance.
- The figure motivates the paper's timing concern, but it does not establish the true date of misstatement discovery.
- The strongest claim is label-timing fragility, not a definitive correction of the legacy benchmark.
## Peer-Compatible Literature Benchmarks
The legacy peer suite transfers model families and metric language from the prior literature to the repo-native legacy benchmark folds. It is not an original-paper numeric replication and not a same-estimand leaderboard. The metrics below summarize all fitted task-fold rows in `legacy_model_family_metrics.csv`. The public-label peer transfer appears in a later section because it uses the same model-family language under the filing-origin public-cascade estimand.
Metric coverage is complete for the implemented PR1 contract:
- Sample and prevalence fields: `n_train`, `n_test`, `n_pos_test`, `prevalence`.
- Discrimination: `roc_auc`, `pr_auc`.
- Calibration: `brier`, `brier_skill_score`, equal-width `ece`, equal-mass `ece_quantile`, and `ece_method`.
- Fixed top-k ranking: `top_50_precision`, `top_100_precision`, `top_200_precision`.
- Bao-style ranking at each of top 1%, 2%, 3%, 4%, and 5%: `k`, `precision`, `sensitivity`, `specificity`, `bac`, and `ndcg`.
- Design fields: `input_kind`, `imbalance_strategy`, `calibration_method`, `calibration_warning`, and `mapping_attrition_rate`.
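The difference between the equal-width `ece` field and the equal-mass `ece_quantile` field can be illustrated with a small pure-Python sketch. Bin counts and edge handling are assumptions; the repo's implementation may differ in those details.

```python
def ece(probs, labels, n_bins=10, quantile=False):
    """Expected calibration error: the bin-size-weighted mean of
    |accuracy - mean confidence| per bin. quantile=False bins on
    equal-width probability intervals (the `ece` field); quantile=True
    uses equal-mass bins over sorted predictions (`ece_quantile`)."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    if quantile:
        # Equal-mass bins: each bin holds roughly n / n_bins predictions.
        edges = [round(i * n / n_bins) for i in range(n_bins + 1)]
        bins = [pairs[edges[i]:edges[i + 1]] for i in range(n_bins)]
    else:
        # Equal-width bins on [0, 1].
        bins = [[] for _ in range(n_bins)]
        for p, y in pairs:
            bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            total += len(b) / n * abs(acc - conf)
    return total

# Perfectly calibrated degenerate predictions score zero either way;
# confidently wrong predictions score near their confidence gap.
print(ece([0.0, 1.0], [0, 1]))  # 0.0
```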
### Table 4. Peer Suite Status and Mapping
| Model | Literature anchor | Total tasks | Fit tasks | Skipped tasks | Mapping quality | Imbalance strategy |
|---|---|---|---|---|---|---|
| `bertomeu_style_xgb` | Bertomeu-style interpretable ML | 336 | 336 | 0 | full | none |
| `perols_logit` | Perols logistic regression | 336 | 336 | 0 | full | undersample_equal |
| `perols_bagged` | Perols bagging family | 336 | 336 | 0 | full | undersample_equal |
| `perols_linear_svm` | Perols SVM family | 336 | 336 | 0 | full | undersample_equal |
| `perols_stacking` | Perols stacking family | 336 | 336 | 0 | full | undersample_equal |
| `perols_mlp` | Perols neural-network family | 336 | 336 | 0 | full | undersample_equal |
| `bao_inspired_tree_ensemble` | Bao-style ensemble language | 336 | 336 | 0 | insufficient | none |
| `dechow_variable_logit` | Dechow-family logit variables | 336 | 336 | 0 | insufficient | class_weight_balanced |
| `perols_entropy_tree` | Perols decision-tree family | 336 | 336 | 0 | full | undersample_equal |
| `dechow_fixed_fscore_model1` | Dechow fixed F-score model 1 | 336 | 0 | 336 | skipped | none |
Key readings:
- The implemented peer suite covers Dechow-, Perols-, Bao-, and Bertomeu-style model families on the repo's legacy benchmark folds.
- `dechow_fixed_fscore_model1` is deliberately skipped because fixed published coefficients require full mapping quality.
- `dechow_variable_logit` is a fold-local Dechow-family logit, not a faithful fixed-coefficient F-score replication.
- The Bao adapter is reported as `bao_inspired_tree_ensemble` because the legacy benchmark panel is mixed and engineered, not raw accounting-number input.
### Figure 3. Peer Model Mean PR-AUC Ranking
```mermaid
flowchart LR
    A["Bertomeu-style XGB<br/>mean PR-AUC 0.0427"] --> B["Perols logit<br/>0.0315"]
    B --> C["Perols bagged<br/>0.0311"]
    C --> D["Perols SVM<br/>0.0306"]
    D --> E["Perols stacking<br/>0.0302"]
    E --> F["Perols MLP<br/>0.0297"]
    F --> G["Bao-inspired ensemble<br/>0.0283"]
    G --> H["Dechow-variable logit<br/>0.0235"]
    H --> I["Perols entropy tree<br/>0.0227"]
```
Key readings:
- `bertomeu_style_xgb` has the strongest mean legacy benchmark PR-AUC in this run.
- The Perols-style models cluster closely together; their ranking is informative but not a calibrated-probability claim.
- The Dechow and Bao rows should be read through their mapping labels, not as original-paper numeric replications.
### Table 5. Peer Model Metrics

All entries are model-level means across fitted rows, except `max_pr_auc` and `max_roc_auc`, which report the best single task-fold row for that model.
| Model | Rows | Mean train N | Mean test N | Mean test positives | Mean prevalence | ROC-AUC | PR-AUC | Brier | BSS | ECE | ECE quantile | Top-50 precision | Top-100 precision | Top-200 precision | Max PR-AUC | Max ROC-AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `bertomeu_style_xgb` | 336 | 36,136.8 | 3,900.6 | 64.9 | 0.0164 | 0.6601 | 0.0427 | 0.0162 | -0.0053 | 0.0097 | 0.0110 | 0.0658 | 0.0548 | 0.0469 | 0.1710 | 0.8371 |
| `perols_logit` | 336 | 36,136.8 | 3,900.6 | 64.9 | 0.0164 | 0.6156 | 0.0315 | 0.1775 | -10.8599 | 0.3058 | 0.3073 | 0.0452 | 0.0419 | 0.0378 | 0.0759 | 0.7619 |
| `perols_bagged` | 336 | 36,136.8 | 3,900.6 | 64.9 | 0.0164 | 0.6271 | 0.0311 | 0.1809 | -11.2817 | 0.3665 | 0.3667 | 0.0429 | 0.0406 | 0.0365 | 0.0868 | 0.8468 |
| `perols_linear_svm` | 336 | 36,136.8 | 3,900.6 | 64.9 | 0.0164 | 0.6130 | 0.0306 | 0.1862 | -11.5721 | 0.3935 | 0.3934 | 0.0453 | 0.0426 | 0.0366 | 0.0745 | 0.7553 |
| `perols_stacking` | 336 | 36,136.8 | 3,900.6 | 64.9 | 0.0164 | 0.6075 | 0.0302 | 0.1967 | -12.2079 | 0.4112 | 0.4112 | 0.0432 | 0.0394 | 0.0358 | 0.0708 | 0.7743 |
| `perols_mlp` | 336 | 36,136.8 | 3,900.6 | 64.9 | 0.0164 | 0.5888 | 0.0297 | 0.2022 | -12.8796 | 0.3676 | 0.3677 | 0.0405 | 0.0371 | 0.0334 | 0.0716 | 0.7367 |
| `bao_inspired_tree_ensemble` | 336 | 36,136.8 | 3,900.6 | 64.9 | 0.0164 | 0.6251 | 0.0283 | 0.0165 | -0.0235 | 0.0118 | 0.0136 | 0.0332 | 0.0334 | 0.0327 | 0.0628 | 0.7887 |
| `dechow_variable_logit` | 336 | 36,136.8 | 3,900.6 | 64.9 | 0.0164 | 0.5225 | 0.0235 | 0.2466 | -15.5804 | 0.4732 | 0.4732 | 0.0332 | 0.0282 | 0.0231 | 0.0672 | 0.6058 |
| `perols_entropy_tree` | 336 | 36,136.8 | 3,900.6 | 64.9 | 0.0164 | 0.5810 | 0.0227 | 0.2245 | -14.1770 | 0.3565 | 0.3565 | 0.0298 | 0.0291 | 0.0287 | 0.0444 | 0.7981 |
Key readings:
- Legacy benchmark prevalence is very low, so absolute PR-AUC values are expected to be modest and should be compared within the same task/split design.
- `bertomeu_style_xgb` leads on mean PR-AUC, Brier, and calibration diagnostics among the fitted peer families.
- Perols-style full-mode models use equal positive/negative undersampling in training and untouched test folds in evaluation.
- Their Brier and ECE values are diagnostics, not evidence of calibrated probabilities.
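The train-only balancing design can be sketched as follows. This is a minimal illustration of an `undersample_equal`-style step; the repo's sampling details (seeding, row representation) are assumptions.

```python
import random

def undersample_equal(rows, seed=0):
    """Perols-style training-fold balancing: keep all positives and an
    equal-sized random draw of negatives. Applied to the TRAIN fold only;
    the test fold keeps its original low prevalence, which is why Brier
    and ECE for these models are diagnostics, not calibrated probabilities."""
    pos = [r for r in rows if r["label"] == 1]
    neg = [r for r in rows if r["label"] == 0]
    rng = random.Random(seed)
    return pos + rng.sample(neg, min(len(pos), len(neg)))

# 5 positives among 100 training rows -> a balanced 10-row training fold.
train = [{"label": 1}] * 5 + [{"label": 0}] * 95
balanced = undersample_equal(train)
```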
### Table 6. Best Peer Task-Fold Rows
| Model | Label mode | Train window | Test year | Input kind | N train | N test | N positive test | Prevalence | ROC-AUC | PR-AUC | Brier | BSS | ECE | ECE quantile | ECE method | Top-50 | Top-100 | Top-200 | Imbalance | Calibration | Warning | Mapping attrition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `bertomeu_style_xgb` | naive | rolling_5y | 2014 | mixed | 18,702 | 3,782 | 61 | 0.0161 | 0.7537 | 0.1710 | 0.0148 | 0.0672 | 0.0057 | 0.0053 | uniform_width_and_quantile | 0.2800 | 0.1800 | 0.1050 | none | native_or_class_weighted | false | 0.0000 |
| `perols_logit` | naive | rolling_7y | 2017 | mixed | 26,477 | 3,876 | 42 | 0.0108 | 0.7400 | 0.0759 | 0.2290 | -20.3685 | 0.4059 | 0.4059 | uniform_width_and_quantile | 0.1200 | 0.0700 | 0.0600 | undersample_equal | none_after_undersampling | true | 0.0000 |
| `perols_bagged` | naive | expanding | 2019 | mixed | 79,206 | 3,702 | 33 | 0.0089 | 0.8293 | 0.0868 | 0.1615 | -17.2748 | 0.3692 | 0.3692 | uniform_width_and_quantile | 0.1600 | 0.1000 | 0.0600 | undersample_equal | none_after_undersampling | true | 0.0000 |
| `perols_linear_svm` | naive | rolling_5y | 2010 | mixed | 22,021 | 3,742 | 79 | 0.0211 | 0.6587 | 0.0745 | 0.2308 | -10.1658 | 0.4534 | 0.4533 | uniform_width_and_quantile | 0.1600 | 0.1200 | 0.0800 | undersample_equal | none_after_undersampling | true | 0.0000 |
| `perols_stacking` | naive | rolling_10y | 2009 | mixed | 41,140 | 3,831 | 74 | 0.0193 | 0.6517 | 0.0708 | 0.1988 | -9.4962 | 0.4189 | 0.4189 | uniform_width_and_quantile | 0.1000 | 0.0700 | 0.0550 | undersample_equal | none_after_undersampling | true | 0.0000 |
| `perols_mlp` | naive | rolling_7y | 2009 | mixed | 35,311 | 3,831 | 74 | 0.0193 | 0.6644 | 0.0716 | 0.1561 | -7.2427 | 0.3266 | 0.3266 | uniform_width_and_quantile | 0.1200 | 0.0700 | 0.0600 | undersample_equal | none_after_undersampling | true | 0.0000 |
| `bao_inspired_tree_ensemble` | naive | rolling_10y | 2006 | mixed | 28,299 | 5,093 | 143 | 0.0281 | 0.6856 | 0.0628 | 0.0279 | -0.0232 | 0.0250 | 0.0250 | uniform_width_and_quantile | 0.0600 | 0.0800 | 0.1000 | none | native_or_class_weighted | false | 0.0000 |
| `dechow_variable_logit` | proxy_imputed_lag_1y | rolling_5y | 2009 | ratios | 23,636 | 3,831 | 74 | 0.0193 | 0.5688 | 0.0672 | 0.2320 | -11.2476 | 0.4579 | 0.4579 | uniform_width_and_quantile | 0.1200 | 0.0800 | 0.0500 | class_weight_balanced | native_or_class_weighted | false | 0.0000 |
| `perols_entropy_tree` | proxy_imputed_lag_2y | expanding | 2015 | mixed | 63,624 | 3,912 | 58 | 0.0148 | 0.6852 | 0.0444 | 0.1600 | -9.9542 | 0.3237 | 0.3237 | uniform_width_and_quantile | 0.1200 | 0.0700 | 0.0600 | undersample_equal | none_after_undersampling | true | 0.0000 |
Key readings:
- The best individual rows can be materially stronger than model-family means, so they are useful diagnostics but not the headline comparison.
- The top row remains `bertomeu_style_xgb` on the naive legacy label in 2014.
- Several Perols-style rows rank well in specific folds, but their calibration warnings remain binding.
### Table 7. Bao-Style Top-Fraction Ranking Metrics
Rows are model-level means across all fitted task-fold rows. Each fraction
reports all Bao-style metrics implemented by the repo: cutoff k, precision,
sensitivity, specificity, balanced accuracy (bac), and NDCG.
| Model | Fraction | Mean k | Precision | Sensitivity | Specificity | BAC | NDCG |
|---|---|---|---|---|---|---|---|
| `bertomeu_style_xgb` | top_1pct | 39.0 | 0.0702 | 0.0433 | 0.9906 | 0.5169 | 0.0800 |
| `bertomeu_style_xgb` | top_2pct | 78.1 | 0.0583 | 0.0725 | 0.9808 | 0.5267 | 0.0793 |
| `bertomeu_style_xgb` | top_3pct | 116.9 | 0.0525 | 0.0991 | 0.9711 | 0.5351 | 0.0937 |
| `bertomeu_style_xgb` | top_4pct | 155.9 | 0.0495 | 0.1249 | 0.9614 | 0.5431 | 0.1085 |
| `bertomeu_style_xgb` | top_5pct | 195.1 | 0.0472 | 0.1492 | 0.9515 | 0.5504 | 0.1218 |
| `perols_logit` | top_1pct | 39.0 | 0.0463 | 0.0287 | 0.9903 | 0.5095 | 0.0489 |
| `perols_logit` | top_2pct | 78.1 | 0.0430 | 0.0535 | 0.9805 | 0.5170 | 0.0541 |
| `perols_logit` | top_3pct | 116.9 | 0.0415 | 0.0777 | 0.9708 | 0.5242 | 0.0679 |
| `perols_logit` | top_4pct | 155.9 | 0.0394 | 0.0988 | 0.9610 | 0.5299 | 0.0800 |
| `perols_logit` | top_5pct | 195.1 | 0.0380 | 0.1196 | 0.9511 | 0.5354 | 0.0914 |
| `perols_bagged` | top_1pct | 39.0 | 0.0426 | 0.0271 | 0.9903 | 0.5087 | 0.0433 |
| `perols_bagged` | top_2pct | 78.1 | 0.0406 | 0.0508 | 0.9805 | 0.5156 | 0.0491 |
| `perols_bagged` | top_3pct | 116.9 | 0.0397 | 0.0744 | 0.9708 | 0.5226 | 0.0624 |
| `perols_bagged` | top_4pct | 155.9 | 0.0381 | 0.0954 | 0.9609 | 0.5282 | 0.0746 |
| `perols_bagged` | top_5pct | 195.1 | 0.0367 | 0.1152 | 0.9510 | 0.5331 | 0.0854 |
| `perols_linear_svm` | top_1pct | 39.0 | 0.0463 | 0.0279 | 0.9903 | 0.5091 | 0.0476 |
| `perols_linear_svm` | top_2pct | 78.1 | 0.0427 | 0.0523 | 0.9805 | 0.5164 | 0.0517 |
| `perols_linear_svm` | top_3pct | 116.9 | 0.0411 | 0.0767 | 0.9708 | 0.5237 | 0.0654 |
| `perols_linear_svm` | top_4pct | 155.9 | 0.0386 | 0.0962 | 0.9609 | 0.5286 | 0.0767 |
| `perols_linear_svm` | top_5pct | 195.1 | 0.0366 | 0.1147 | 0.9510 | 0.5328 | 0.0867 |
| `perols_stacking` | top_1pct | 39.0 | 0.0424 | 0.0259 | 0.9903 | 0.5081 | 0.0454 |
| `perols_stacking` | top_2pct | 78.1 | 0.0413 | 0.0513 | 0.9805 | 0.5159 | 0.0508 |
| `perols_stacking` | top_3pct | 116.9 | 0.0395 | 0.0735 | 0.9707 | 0.5221 | 0.0635 |
| `perols_stacking` | top_4pct | 155.9 | 0.0374 | 0.0934 | 0.9609 | 0.5272 | 0.0749 |
| `perols_stacking` | top_5pct | 195.1 | 0.0360 | 0.1128 | 0.9510 | 0.5319 | 0.0855 |
| `perols_mlp` | top_1pct | 39.0 | 0.0426 | 0.0259 | 0.9903 | 0.5081 | 0.0491 |
| `perols_mlp` | top_2pct | 78.1 | 0.0389 | 0.0477 | 0.9804 | 0.5141 | 0.0516 |
| `perols_mlp` | top_3pct | 116.9 | 0.0363 | 0.0671 | 0.9707 | 0.5189 | 0.0623 |
| `perols_mlp` | top_4pct | 155.9 | 0.0348 | 0.0857 | 0.9608 | 0.5232 | 0.0731 |
| `perols_mlp` | top_5pct | 195.1 | 0.0333 | 0.1025 | 0.9508 | 0.5266 | 0.0824 |
| `bao_inspired_tree_ensemble` | top_1pct | 39.0 | 0.0330 | 0.0206 | 0.9902 | 0.5054 | 0.0317 |
| `bao_inspired_tree_ensemble` | top_2pct | 78.1 | 0.0339 | 0.0425 | 0.9803 | 0.5114 | 0.0388 |
| `bao_inspired_tree_ensemble` | top_3pct | 116.9 | 0.0333 | 0.0627 | 0.9706 | 0.5166 | 0.0503 |
| `bao_inspired_tree_ensemble` | top_4pct | 155.9 | 0.0328 | 0.0828 | 0.9607 | 0.5218 | 0.0618 |
| `bao_inspired_tree_ensemble` | top_5pct | 195.1 | 0.0325 | 0.1028 | 0.9508 | 0.5268 | 0.0728 |
| `dechow_variable_logit` | top_1pct | 39.0 | 0.0341 | 0.0214 | 0.9902 | 0.5058 | 0.0394 |
| `dechow_variable_logit` | top_2pct | 78.1 | 0.0313 | 0.0387 | 0.9803 | 0.5095 | 0.0406 |
| `dechow_variable_logit` | top_3pct | 116.9 | 0.0270 | 0.0495 | 0.9704 | 0.5099 | 0.0465 |
| `dechow_variable_logit` | top_4pct | 155.9 | 0.0243 | 0.0592 | 0.9604 | 0.5098 | 0.0522 |
| `dechow_variable_logit` | top_5pct | 195.1 | 0.0230 | 0.0700 | 0.9503 | 0.5102 | 0.0582 |
| `perols_entropy_tree` | top_1pct | 39.0 | 0.0313 | 0.0194 | 0.9902 | 0.5048 | 0.0369 |
| `perols_entropy_tree` | top_2pct | 78.1 | 0.0298 | 0.0370 | 0.9803 | 0.5087 | 0.0392 |
| `perols_entropy_tree` | top_3pct | 116.9 | 0.0292 | 0.0539 | 0.9704 | 0.5122 | 0.0488 |
| `perols_entropy_tree` | top_4pct | 155.9 | 0.0291 | 0.0718 | 0.9605 | 0.5162 | 0.0592 |
| `perols_entropy_tree` | top_5pct | 195.1 | 0.0283 | 0.0877 | 0.9506 | 0.5191 | 0.0679 |
Key readings:
- Bao-style top-fraction metrics translate ranking into screening language: precision, sensitivity, specificity, BAC, and NDCG at top 1%-5%.
- `bertomeu_style_xgb` dominates this table across the top-fraction cutoffs.
- Precision remains low in absolute terms because the legacy detected-misstatement base rate is low.
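Under the usual definitions, the Bao-style top-fraction metrics in this table can be sketched in pure Python. The rounding of `k` and the tie handling are assumptions; the repo may differ in those details.

```python
import math

def top_fraction_metrics(scores, labels, fraction=0.01):
    """Screening metrics at a top-score fraction: cutoff k, precision,
    sensitivity, specificity, balanced accuracy (bac), and NDCG with
    binary relevance over the top-k ranked rows."""
    n = len(scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    flagged = set(order[:k])
    tp = sum(labels[i] for i in flagged)
    n_pos = sum(labels)
    n_neg = n - n_pos
    fp = k - tp
    sensitivity = tp / n_pos if n_pos else 0.0
    specificity = (n_neg - fp) / n_neg if n_neg else 0.0
    dcg = sum(labels[order[r]] / math.log2(r + 2) for r in range(k))
    ideal = sum(1 / math.log2(r + 2) for r in range(min(k, n_pos)))
    return {"k": k, "precision": tp / k, "sensitivity": sensitivity,
            "specificity": specificity,
            "bac": (sensitivity + specificity) / 2,
            "ndcg": dcg / ideal if ideal else 0.0}

# 200 issuer-years, 4 positives, 2 of them carrying the top 2 scores:
scores = [0.99, 0.98] + [0.5 - i / 1000 for i in range(198)]
labels = [1, 1] + [0] * 100 + [1, 1] + [0] * 96
m = top_fraction_metrics(scores, labels, fraction=0.01)
# k = 2; both flagged rows are positive, so precision is 1.0 while
# sensitivity is only 2/4 = 0.5: high precision at a tiny cutoff says
# little about recall, which is why both are reported.
```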
### Table 8. Peer Imbalance and Feature-Importance Diagnostics
| Model | Imbalance strategy | Rows | Mean train N before | Mean positives before | Mean train N after | Mean positives after | Mean test prevalence |
|---|---|---|---|---|---|---|---|
| `bertomeu_style_xgb` | none | 336 | 36,136.8 | 839.7 | 36,136.8 | 839.7 | 0.0164 |
| `perols_logit` | undersample_equal | 336 | 36,136.8 | 839.7 | 1,679.5 | 839.7 | 0.0164 |
| `perols_bagged` | undersample_equal | 336 | 36,136.8 | 839.7 | 1,679.5 | 839.7 | 0.0164 |
| `perols_linear_svm` | undersample_equal | 336 | 36,136.8 | 839.7 | 1,679.5 | 839.7 | 0.0164 |
| `perols_stacking` | undersample_equal | 336 | 36,136.8 | 839.7 | 1,679.5 | 839.7 | 0.0164 |
| `perols_mlp` | undersample_equal | 336 | 36,136.8 | 839.7 | 1,679.5 | 839.7 | 0.0164 |
| `bao_inspired_tree_ensemble` | none | 336 | 36,136.8 | 839.7 | 36,136.8 | 839.7 | 0.0164 |
| `dechow_variable_logit` | class_weight_balanced | 336 | 36,136.8 | 839.7 | 36,136.8 | 839.7 | 0.0164 |
| `perols_entropy_tree` | undersample_equal | 336 | 36,136.8 | 839.7 | 1,679.5 | 839.7 | 0.0164 |
| Model | Importance type | Importance rows |
|---|---|---|
| `bertomeu_style_xgb` | feature_importance | 34,560 |
| `perols_logit` | absolute_coefficient | 34,560 |
| `perols_bagged` | mean_bagged_feature_importance | 34,560 |
| `perols_stacking` | mean_bagged_feature_importance | 34,560 |
| `bao_inspired_tree_ensemble` | feature_importance | 6,384 |
| `dechow_variable_logit` | absolute_coefficient | 3,024 |
| `perols_entropy_tree` | feature_importance | 34,560 |
Key readings:
- The imbalance table makes the Perols design explicit: undersampled training folds are balanced, while test folds keep the original low prevalence.
- Feature importance is emitted only where the model family exposes a stable importance or coefficient surface.
- SVM and MLP adapters remain in the performance comparison but do not emit comparable feature-importance rows.
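The `undersample_equal` strategy in the table can be sketched as follows: keep every positive and draw an equal-sized random set of negatives, applied to training folds only. The helper name and NumPy-array interface are illustrative assumptions, not the repo's code.

```python
import numpy as np

def undersample_equal(X, y, seed=0):
    """Balance a training fold: all positives plus an equal random draw of negatives."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep_neg])
    rng.shuffle(idx)                      # avoid class-ordered training data
    return X[idx], y[idx]
```

Applied to folds averaging 839.7 positives, this yields the ~1,679.5 mean post-undersampling training rows in the table, while test folds retain the original 0.0164 prevalence.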
Public-Label Peer Transfer
The public peer suite applies the same peer-compatible model-family language to the filing-origin public-cascade tasks. It is not a comparison to prior fraud-prediction papers on their original labels. It asks how transferred Dechow/Perols/Bao/Bertomeu-style families perform on the repo's public review-and-correction labels.
Status:
| Field | Value |
|---|---|
| Fitted task-fold rows | 4,320 |
| Skipped status rows | 2,080 |
| Missing Dechow fixed-score mapping rows | 480 |
| AAER severity-tail status rows | 1,600 |
| Headline public tasks | `comment_thread`, `amendment`, `8k_402` |
| Severity-tail task | `aaer_proxy`, status only |
Key readings:
- Public peer transfer is complete for the three headline public labels.
- `aaer_proxy` is intentionally status-only because sparse positives make stable public-peer training inappropriate.
- Skipped rows are mostly design-imposed status rows, not silent failures.
Model-family means over fitted public-label task-fold rows:
| Model | Rows | Mean PR-AUC | Mean ROC-AUC | Max PR-AUC | Mean Brier | Mean ECE |
|---|---|---|---|---|---|---|
| `bao_inspired_tree_ensemble` | 480 | 0.2244 | 0.6440 | 0.5108 | 0.1124 | 0.0315 |
| `bertomeu_style_xgb` | 480 | 0.2243 | 0.6436 | 0.5115 | 0.1124 | 0.0314 |
| `perols_bagged` | 480 | 0.2131 | 0.6312 | 0.4600 | 0.2237 | 0.3134 |
| `perols_stacking` | 480 | 0.2094 | 0.6206 | 0.6385 | 0.2228 | 0.3120 |
| `perols_logit` | 480 | 0.2077 | 0.6215 | 0.6542 | 0.2251 | 0.3025 |
| `perols_linear_svm` | 480 | 0.2056 | 0.6156 | 0.6621 | 0.2244 | 0.3119 |
| `perols_mlp` | 480 | 0.2020 | 0.6103 | 0.4407 | 0.2315 | 0.3093 |
| `perols_entropy_tree` | 480 | 0.2005 | 0.6139 | 0.4398 | 0.2276 | 0.3092 |
| `dechow_variable_logit` | 480 | 0.1586 | 0.4883 | 0.3058 | 0.2500 | 0.3537 |
Key readings:
- Bao-inspired and Bertomeu-style tree ensembles have the highest mean public-label PR-AUC.
- Perols-style models are competitive in some folds but show weaker calibration diagnostics under undersampling.
- Dechow-variable logit is the weakest transferred public-label family in this run, consistent with conservative public proxy mapping.
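The calibration diagnostics in the table (Brier score and ECE) can be computed as in this minimal sketch. Equal-width probability bins are assumed here; the repo's binning choices may differ.

```python
import numpy as np

def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and binary outcomes."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(probs, dtype=float)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(y_true, probs, n_bins=10):
    """Bin-weighted gap between mean predicted probability and observed rate."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(probs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # last bin closes on the right so probs == 1.0 are not dropped
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece
```

Under this reading, the Perols-style families' large mean ECE (~0.31) reflects probabilities inflated by balanced undersampled training folds scored against low-prevalence test folds.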
Task-level public peer performance:
| Task | Rows | Mean prevalence | Mean PR-AUC | Mean ROC-AUC | Max PR-AUC |
|---|---|---|---|---|---|
| `comment_thread` | 1,440 | 0.2615 | 0.3292 | 0.5925 | 0.5115 |
| `amendment` | 1,440 | 0.1552 | 0.2331 | 0.6102 | 0.3870 |
| `8k_402` | 1,440 | 0.0221 | 0.0529 | 0.6270 | 0.6621 |
Key readings:
- `comment_thread` is the easiest public-label task on mean PR-AUC because it is more prevalent and better supported.
- `amendment` provides a clear correction/friction ranking task.
- `8k_402` has low mean PR-AUC because it is rare, but some folds show strong ranking performance.
Feature-family means:
| Feature set | Rows | Mean PR-AUC | Mean ROC-AUC | Max PR-AUC |
|---|---|---|---|---|
| `all` | 864 | 0.2510 | 0.6845 | 0.6621 |
| `metadata` | 864 | 0.2389 | 0.6631 | 0.4695 |
| `xbrl` | 864 | 0.1937 | 0.6032 | 0.4525 |
| `auditor` | 864 | 0.1796 | 0.5647 | 0.6533 |
| `oversight` | 864 | 0.1621 | 0.5340 | 0.3403 |
Key readings:
- The `all` feature family is strongest on average, supporting feature fusion.
- Metadata remains a strong baseline; added public feature families help but do not erase the metadata signal.
- XBRL, auditor, and oversight families are useful ablations rather than standalone dominant feature groups in this run.
Best public-peer task-fold rows:
| Model | Task | Feature set | Window | Test year | Prevalence | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| `perols_linear_svm` | `8k_402` | `all` | `expanding` | 2020 | 0.0631 | 0.9305 | 0.6621 |
| `perols_logit` | `8k_402` | `all` | `expanding` | 2020 | 0.0631 | 0.9345 | 0.6542 |
| `perols_logit` | `8k_402` | `auditor` | `rolling_7y` | 2020 | 0.0631 | 0.9184 | 0.6533 |
| `perols_stacking` | `8k_402` | `auditor` | `rolling_10y` | 2020 | 0.0631 | 0.8817 | 0.6385 |
| `perols_stacking` | `8k_402` | `auditor` | `rolling_5y` | 2020 | 0.0631 | 0.9314 | 0.6180 |
| `perols_linear_svm` | `8k_402` | `auditor` | `rolling_5y` | 2020 | 0.0631 | 0.9132 | 0.5996 |
| `perols_logit` | `8k_402` | `all` | `rolling_5y` | 2020 | 0.0631 | 0.9288 | 0.5509 |
| `perols_linear_svm` | `8k_402` | `auditor` | `expanding` | 2020 | 0.0631 | 0.7953 | 0.5492 |
Key readings:
- The highest single-row PR-AUC values are concentrated in the 2020 `8k_402` fold, where fitted prevalence is higher than the cross-year mean.
- These rows demonstrate local ranking strength but should not replace the model-family and feature-family averages as the stable summary.
- Public peer transfer confirms that prior model families can rank public review-and-correction outcomes under the repo's estimand.
- The paper's contribution remains the measurement design, not a new classifier.
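The window schemes named in these rows (`expanding`, `rolling_5y`, `rolling_7y`, `rolling_10y`) can be enumerated per test year as in this sketch; the helper name and label parsing are illustrative assumptions.

```python
def window_train_years(test_year, first_year, scheme="expanding"):
    """List the training years available for one out-of-sample test year.

    scheme: "expanding" uses all years before test_year;
            "rolling_5y"/"rolling_7y"/"rolling_10y" use a fixed trailing span.
    """
    if scheme == "expanding":
        start = first_year
    else:
        # parse the span out of labels shaped like "rolling_5y"
        span = int(scheme.split("_")[1].rstrip("y"))
        start = max(first_year, test_year - span)
    return list(range(start, test_year))
```

For example, `window_train_years(2020, 2011, "rolling_5y")` returns `[2015, 2016, 2017, 2018, 2019]`, so every scheme trains strictly before the test year.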
Table 9. Opacity and Missingness Diagnostics
| Layer | Outcome | N | Prevalence | Treatment | Coef. | SE | p-value | Interpretation |
|---|---|---|---|---|---|---|---|---|
| public cascade | `comment_thread` | 90,429 | 0.2751 | `missingness_density_score` | -0.0242 | 0.0955 | 0.8002 | no adjusted association in this run |
| public cascade | `amendment` | 90,429 | 0.1908 | `missingness_density_score` | -0.0467 | 0.0819 | 0.5688 | no adjusted association in this run |
| public cascade | `8k_402` | 90,429 | 0.0222 | `missingness_density_score` | -0.0114 | 0.0312 | 0.7141 | no adjusted association in this run |
| legacy benchmark | misstatement firm-year | 82,908 | 0.0297 | `missingness_density_score` | 0.0028 | 0.0053 | 0.5925 | legacy diagnostic only |
Key readings:
- These Double / Debiased Machine Learning (DML) estimates are high-dimensional adjusted associations, not causal effects.
- None of the public-label opacity estimates supports a strong strategic-silence claim in this run.
- The legacy opacity row remains a diagnostic benchmark comparison, not the main public-cascade estimand.
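A generic partialling-out DML estimator of the kind behind these adjusted associations can be sketched as follows. This is an illustrative implementation using scikit-learn cross-fitting; the learners, fold counts, and standard-error conventions in the actual artifacts may differ, and nothing here upgrades the estimates to causal effects.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def dml_partial_out(y, d, X, seed=0):
    """Partialling-out DML: residualize outcome y and treatment d on controls X
    with cross-fitted learners, then regress residual on residual."""
    learner = lambda: GradientBoostingRegressor(random_state=seed)
    y_hat = cross_val_predict(learner(), X, y, cv=5)   # out-of-fold predictions
    d_hat = cross_val_predict(learner(), X, d, cv=5)
    ry, rd = y - y_hat, d - d_hat
    theta = (rd @ ry) / (rd @ rd)
    # heteroskedasticity-robust standard error for the no-intercept slope
    eps = ry - theta * rd
    se = np.sqrt(np.sum((rd * eps) ** 2)) / np.sum(rd ** 2)
    return theta, se
```

Cross-fitting (out-of-fold predictions) is what lets flexible learners be used for the nuisance functions without biasing the residual-on-residual slope.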
Table 10. Bridge and Construct-Overlap Validation
| Field | Value |
|---|---|
| Bridge probe status | external_crosswalk_available |
| Construct-overlap status | complete |
| Validation tier | candidate_farr |
| Raw benchmark rows | 82,908 |
| Raw benchmark firms | 9,156 |
| Candidate crosswalk rows | 81,825 |
| Matched raw rows | 81,218 |
| Row coverage rate | 0.9796 |
| Matched raw firms | 9,075 |
| Firm coverage rate | 0.9912 |
| Matched positive rows | 2,433 of 2,460 |
| Public issuer-year rows | 205,831 |
| Raw rows overlapping public issuer-years | 47,824 |
| Public-overlap rate among crosswalk candidates | 0.5888 |
| Ambiguous raw rows | 583 |
Key readings:
- The farr-derived `gvkey-CIK-year` file provides high raw-row and firm coverage for candidate construct-overlap validation.
- Public-panel overlap is naturally smaller because the public cascade starts in 2011 while the legacy benchmark begins in 2001.
- The bridge is not WRDS-verified, so integrated old/public claims remain candidate-level rather than final manuscript-grade validation.
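The row and firm coverage rates in Table 10 can in principle be reproduced with a keyed left merge. This is a hedged pandas sketch: the key columns `gvkey` and `data_year` follow the text, while the crosswalk layout and the helper name are assumptions.

```python
import pandas as pd

def bridge_coverage(raw, crosswalk):
    """Row and firm coverage of the legacy benchmark under a candidate bridge."""
    merged = raw.merge(crosswalk, on=["gvkey", "data_year"], how="left",
                       indicator=True)
    matched = merged["_merge"] == "both"
    return {
        "row_coverage": matched.mean(),
        "firm_coverage": merged.loc[matched, "gvkey"].nunique()
                         / raw["gvkey"].nunique(),
    }
```

One caveat this sketch makes visible: ambiguous bridge rows with multiple CIK candidates would duplicate under such a merge, which is why the 583 ambiguous raw rows are tracked separately.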
Table 11. Candidate Construct-Overlap Evidence
| Evidence item | Result | Interpretation |
|---|---|---|
| High-confidence overlap rows | 47,418 | annual gvkey x data_year rows with one public issuer-year match |
| High-confidence legacy positives | 1,425 | matched legacy detected-misstatement positives |
| Ambiguous matched rows | 406 | retained only for sensitivity |
| Dropped rows | 35,084 | mostly outside public-panel coverage or bridge match |
| Balanced event-time rows | 22,628 | rows with full [-3,+3] public coverage |
| farr AAER firm-years in raw benchmark | 422 | external severity-tail support rows |
| farr AAER and legacy-positive overlap | 243 | descriptive AAER support, not a headline target |
| Public label | High-confidence public positives | Both legacy and public positive | Lift of public label given legacy positive |
|---|---|---|---|
| `comment_thread_365` | 14,518 | 443 | 1.02 |
| `amendment_365` | 10,184 | 582 | 1.90 |
| `8k_402_365` | 1,109 | 286 | 8.58 |
| `aaer_proxy_730` | 0 | 0 | n/a |
Key readings:
- This is the clearest current evidence for related but non-identical constructs.
- Legacy positives are much more likely to coincide with serious public correction signals, especially 8-K Item 4.02.
- Comment-letter scrutiny is broader and only weakly concentrated in legacy positives, consistent with public scrutiny rather than fraud truth.
- The AAER proxy is absent in the high-confidence overlap table and remains a severity-tail support item, not a headline target.
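The lift column above is the conditional public-label rate among legacy positives divided by the label's base rate; a minimal sketch (function name assumed):

```python
def label_lift(public, legacy):
    """Lift of a public label given a legacy positive: P(public|legacy) / P(public)."""
    public = [bool(p) for p in public]
    legacy = [bool(v) for v in legacy]
    base = sum(public) / len(public)
    among_legacy = [p for p, v in zip(public, legacy) if v]
    if not among_legacy or base == 0:
        return float("nan")   # undefined when either rate is unsupported
    return (sum(among_legacy) / len(among_legacy)) / base
```

Read this way, the 8.58 lift for `8k_402_365` means the severe-correction label is roughly 8.6 times more common among matched legacy positives than in the overlap sample overall.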
Figure 4. Construct-Overlap Signal
```mermaid
flowchart LR
A["High-confidence overlap<br/>47,418 rows"] --> B["Legacy positives<br/>1,425 rows"]
B --> C["Any public label<br/>863 rows"]
B --> D["Amendment<br/>582 rows"]
B --> E["8-K Item 4.02<br/>286 rows"]
B --> F["No public cascade label<br/>562 rows"]
```
Key readings:
- A meaningful share of legacy positives also carries public review or correction labels in the matched sample.
- The split between amendment, 8-K Item 4.02, and no public cascade label is why the constructs should be described as related but non-identical.
- The figure summarizes overlap among legacy positives, not full population prevalence.
Table 12. Risk-Score Alignment
| Direction | Best row | Target positives | ROC-AUC | PR-AUC | Top-decile lift | 95% CI |
|---|---|---|---|---|---|---|
| Public cascade score -> legacy positives | `public_cascade`, `8k_402`, `all`, `rolling_10y` | 146 | 0.6784 | 0.0316 | 3.01 | [2.32, 3.78] |
| Legacy/peer score -> public labels | `bertomeu_style_xgb`, `label_8k_402_365`, `expanding`, `naive` | 549 | 0.7038 | 0.0436 | 3.06 | [2.69, 3.45] |
Key readings:
- Public-cascade scores can rank legacy positives on the matched candidate-bridge sample.
- Legacy benchmark-style scores can also rank the severe public correction label.
- The reciprocal alignment supports construct overlap without collapsing the two labels into the same estimand.
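The top-decile lift and its 95% CI in Table 12 can be sketched with a percentile bootstrap. This is an illustrative resampling scheme; the artifacts' actual bootstrap design is not specified here.

```python
import numpy as np

def top_decile_lift_ci(y, scores, n_boot=1000, seed=0):
    """Top-decile lift with a 95% percentile-bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)

    def lift(yb, sb):
        k = max(1, len(sb) // 10)
        top = np.argsort(-sb)[:k]          # highest-scored decile
        base = yb.mean()
        return yb[top].mean() / base if base > 0 else np.nan

    point = lift(y, scores)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # resample rows with replacement
        draws.append(lift(y[idx], scores[idx]))
    lo, hi = np.nanpercentile(draws, [2.5, 97.5])
    return point, (lo, hi)
```

A lift of 3.01 with CI [2.32, 3.78], as in the first row, would mean the top decile of public-cascade scores captures legacy positives at about three times the base rate.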
Table 13. Event-Time Concentration
Balanced-window rows use full [-3,+3] public coverage around the legacy `data_year`. These are descriptive rates only; no significance tests are reported.
| Relative year | Public label | Legacy-positive rate | Legacy-negative rate | Raw difference |
|---|---|---|---|---|
| -1 | `amendment_365` | 0.3024 | 0.1889 | 0.1135 |
| -1 | `8k_402_365` | 0.1003 | 0.0179 | 0.0824 |
| 0 | `amendment_365` | 0.4087 | 0.1821 | 0.2266 |
| 0 | `8k_402_365` | 0.2111 | 0.0166 | 0.1945 |
| +1 | `amendment_365` | 0.3728 | 0.1741 | 0.1987 |
| +1 | `8k_402_365` | 0.1916 | 0.0174 | 0.1743 |
Key readings:
- Public amendment and 8-K Item 4.02 labels are concentrated around legacy positive firm-years in the balanced event-time window.
- The strongest differences appear at relative year 0 and remain visible at +1.
- These are descriptive event-time rates only; the table intentionally reports no p-values or causal timing tests.
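The descriptive event-time rates can be assembled with a simple group-by. This is a pandas sketch with assumed column names (`rel_year`, `legacy_positive`); the balanced-window filtering step is omitted.

```python
import pandas as pd

def event_time_rates(df, label_col):
    """Mean public-label rate by relative year and legacy status (descriptive only)."""
    rates = (df.groupby(["rel_year", "legacy_positive"])[label_col]
               .mean()
               .unstack("legacy_positive"))
    # raw difference between legacy-positive and legacy-negative rates
    rates["raw_difference"] = rates[True] - rates[False]
    return rates
```

No inference is attached to `raw_difference`, matching the table's deliberate omission of p-values and causal timing tests.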
Table 14. AAER and Opacity Refresh
| Component | Result | Interpretation |
|---|---|---|
| farr AAER raw benchmark firm-years | 422 | external severity-tail support |
| farr AAER and legacy-positive overlap | 243 | old benchmark captures many farr AAER firm-years |
| farr AAER high-confidence public rows | 220 | bridgeable but sparse in public-score prediction years |
| farr AAER ranking status | `blocked_sparse` | do not report stable AAER ranking metrics |
| Public opacity DML rows | 3 fitted outcomes | copied from existing DML artifacts, not refit |
| Public opacity DML p-values | 0.8002, 0.5688, 0.7141 | no adjusted association in this run |
Key readings:
- farr AAER support data reinforce the severity-tail interpretation of legacy positives but do not supply complete enforcement truth.
- AAER public-score ranking remains blocked for sparsity, so no stable AAER ranking metric is reported.
- The opacity refresh confirms that DML artifacts exist and are summarized, but it does not change the non-causal interpretation of the opacity analysis.
Artifact Index
- `artifacts/full_with_peer/study_summary.md`
- `artifacts/full_with_peer/study_run_manifest.json`
- `artifacts/full_with_peer/benchmark/benchmark_summary.md`
- `artifacts/full_with_peer/benchmark/rolling_metrics.csv`
- `artifacts/full_with_peer/peer_comparison/peer_comparison_summary.md`
- `artifacts/full_with_peer/peer_comparison/legacy_model_family_metrics.csv`
- `artifacts/full_with_peer/peer_comparison/legacy_model_family_predictions.parquet`
- `artifacts/full_with_peer/peer_comparison/peer_task_status.csv`
- `artifacts/full_with_peer/peer_comparison/feature_mapping_attrition.csv`
- `artifacts/full_with_peer/peer_comparison/imbalance_strategy_report.csv`
- `artifacts/full_with_peer/peer_comparison/legacy_feature_importance.csv`
- `artifacts/full_with_peer/public_peer_comparison/public_model_family_summary.md`
- `artifacts/full_with_peer/public_peer_comparison/public_model_family_metrics.csv`
- `artifacts/full_with_peer/public_peer_comparison/public_model_family_predictions.parquet`
- `artifacts/full_with_peer/public_peer_comparison/public_model_family_task_status.csv`
- `artifacts/full_with_peer/public_peer_comparison/public_model_family_feature_importance.csv`
- `artifacts/full_with_peer/public_cascade/public_cascade_summary.md`
- `artifacts/full_with_peer/public_cascade/public_cascade_metrics.csv`
- `artifacts/full_with_peer/public_cascade/public_opacity_dml.csv`
- `artifacts/full_with_peer/bridge_probe/bridge_probe_summary.json`
- `artifacts/full_with_peer/bridge_probe/coverage_report.csv`
- `artifacts/full_with_peer/construct_overlap/construct_overlap_summary.md`
- `artifacts/full_with_peer/construct_overlap/label_contingency_lift.csv`
- `artifacts/full_with_peer/construct_overlap/public_score_legacy_ranking.csv`
- `artifacts/full_with_peer/construct_overlap/reciprocal_alignment.csv`
- `artifacts/full_with_peer/construct_overlap/event_time_concentration.csv`
- `artifacts/full_with_peer/construct_overlap/farr_aaer_public_overlap.csv`
- `artifacts/full_with_peer/opacity_validation_refresh/opacity_diagnostics_summary.csv`