Deferred Extensions

This page records extensions that follow the current benchmark and public cascade paper. They are deferred so the main study remains focused on the filing-origin measurement design before additional data layers are introduced.

Scope guardrail

The current paper should first keep label-observability diagnostics, concept drift, public-label opacity analysis, full public-lake construction, public cascade prediction, and gvkey-CIK-year overlap validation reproducible. Additional model classes should be added only after those foundations are stable.

Back to the current paper plan Return to docs home

Multimodal cascade

Add filing text only after the structured public cascade and lead-time baselines are stable.
Attention and market layers

Add 13F, EDGAR-log, FTD, and market-structure inputs only after temporal security bridges and source-availability masks are defensible. SEC Insider Transactions are a narrower issuer-CIK extension in the current paper plan, not part of this security-level expansion.
Auditor and oversight network

Expand Form AP and PCAOB structure into network-style monitoring exposure only after issuer-auditor joins are reliable.
Richer detector labels

Move toward occurrence-detection-disclosure decomposition only when stronger external restatement timing and detector data exist.

Near-Term PreconditionsDeferred Until

Finish the current benchmark plus public-cascade paper.
Validate gold-panel readiness and overlap diagnostics.
Promote only one extension family at a time.

Text and graph models have a clearly defined incremental estimand.
Security-level data can be linked through documented temporal bridges.
Additional channels strengthen, rather than dilute, the current measurement claim.

Extension Portfolio

Extension	Research contribution	Scope condition
Multimodal cascade	Test whether narrative filings add lead time and stage-specific information.	After public cascade labels are stable.
Public security and attention layers	Add institutional, attention, FTD, and market microstructure channels.	After temporal security-to-CIK bridges are available.
Auditor and oversight network	Model monitoring exposure through Form AP, partners, firms, and PCAOB inspections.	After Form AP and inspection joins are clean.
Severity and detector labels	Move toward occurrence-detection-disclosure decomposition.	After higher-quality restatement or detector data are acquired.
Reproducibility package	Make the empirical workflow submission-ready.	Before manuscript circulation and review.

Extension 1: Multimodal Cascade Model

Working title:

Narrative and Monitoring Signals in the Public Reporting-Risk Cascade

Alternative title:

Occurrence, Detection, Disclosure: A Multimodal Graph of Corporate Reporting Risk

Research Question

Do structured filings, narrative disclosures, auditor/partner monitoring networks, and public enforcement events load on different stages of the public reporting-risk cascade?

The key hypothesis is stage separation:

10-K narrative changes should help earlier, pre-disclosure risk states.
auditor and oversight variables should matter more for public scrutiny and correction.
8-K Item 4.02 framing should help explain downstream severity proxy outcomes.

Incremental Data

Start from the full public-cascade lake and add raw filing text:

raw 10-K and 10-K/A filing HTML
raw 8-K and 8-K/A filing HTML
parsed Item 1A Risk Factors
parsed Item 7 MD&A
parsed 8-K Item 4.02 disclosure text
amendment exhibit text when available

Then add graph/document nodes:

issuer nodes
filing nodes
Form AP auditor and partner nodes
PCAOB inspection nodes
public comment-thread nodes
correction-event nodes

Optional later sources:

13F institutional-holder features
insider transactions
EDGAR log attention measures
market-structure data
supplier-customer links

Model Design

Use a staged architecture rather than a single undifferentiated text-plus-tabular classifier:

occurrence-risk proxy head: XBRL and 10-K narrative signals
public scrutiny head: comment-letter and monitoring features
correction head: amendment and 8-K Item 4.02 labels
severity proxy head: disclosure-framing labels

Candidate model families:

multi-task tabular plus text model
discrete-time multi-state hazard model
temporal heterogeneous document graph

Text Strategy

CPU-first phase:

parse sections deterministically
compute length, readability, dictionary, tone, and revision features
compute compact embeddings on a filtered subset
prove incremental lead time before scaling

GPU phase:

use long-context finance embeddings for Item 1A and Item 7
embed 8-K Item 4.02 disclosure text for framing and severity labels
run full-corpus embedding only after the CPU-first phase shows value

Acceptance Criteria

This extension should not be judged by aggregate PR-AUC alone. It needs stage-specific evidence:

text adds lead time for distant-horizon risk
auditor and oversight variables matter more for scrutiny/correction stages
8-K framing predicts downstream severity proxy
graph features represent monitoring exposure, not untested contagion

Implementation Status

No runtime code is retained for this extension. The old multimodal prototype was removed so the active codebase stays focused on benchmark, public cascade, and the combined study workflow.

When this extension becomes active again, implement it as a new module rather than reintroducing it into the current study path.

Extension 2: Public Security and Attention Layers

Research Question

Do public capital-market attention and ownership structures predict which reporting risk states become publicly visible?

This extension should not be framed as "market variables improve prediction" in a generic way. The research mechanism is public visibility: institutional holdings, insider trading, attention, and liquidity may change the probability that reporting problems are scrutinized, corrected, or escalated.

Candidate Sources

SEC 13F datasets
SEC insider transactions datasets
SEC EDGAR log datasets
SEC market-structure datasets

Identity Challenge

These sources are not naturally CIK native. They require a temporal bridge across:

CIK
ticker
CUSIP
accession/adsh
reporting manager identifiers
security-level market structure identifiers

Every mapping must carry provenance and validity windows. A current ticker-to-CIK lookup is not enough for a historical panel.

Potential Features

institutional ownership concentration
institutional turnover
transient versus dedicated holder exposure
insider sale pressure
filing-attention shocks
liquidity and spread changes
market microstructure stress before public correction events

Acceptance Criteria

every security-level source must retain its original security key
no ticker or CUSIP should be coerced into CIK without a documented mapping table
feature availability masks must distinguish source non-existence from issuer silence
attention and liquidity variables must be timestamped strictly before origin_date

Extension 3: Auditor and Oversight Network

Research Question

Does public monitoring exposure through auditors, partners, and PCAOB inspections help explain public scrutiny and correction outcomes?

The mechanism is monitoring and lagged public exposure, not a causal contagion claim. The design should ask whether public oversight networks reveal where reporting-risk states are more likely to become visible.

Candidate Features

partner-level public workload
partner prior correction exposure
audit-firm inspection deficiency history
issuer exposure to recently inspected audit firms
peer correction exposure within the same audit firm or partner network
auditor turnover and monitoring discontinuity

Guardrails

do not frame this as contagion unless the design separates common shocks, monitoring intensity, and network exposure
do not treat partner or firm exposure as causal without a credible design
keep network variables in the monitoring channel first, not in a general-purpose graph black box

Extension 4: Restatement Severity And Detector Labels

Research Question

Can the public cascade be upgraded into a richer occurrence-detection-disclosure decomposition once stronger external labels are available?

Data Requirements

If paid or higher-quality data become available, add:

restatement filing dates
affected period start and end dates
severity categories
fraud indicators
detector or notifier identity
SEC comment-letter/enforcement linkage
restatement magnitude and account categories

Potential Contribution

This extension would move from public cascade prediction toward occurrence-detection-disclosure decomposition. The conceptual payoff is large, but the identification burden is also much higher.

Guardrails

do not put bivariate-probit or partial-observability identification in the main model unless there are defensible exclusion restrictions or detector-side variables
do not call public correction labels true latent occurrence
keep detector labels separate from severity labels

Extension 5: Reproducibility Package

Before submission, build a reproducibility package that is separate from exploratory development.

Required Contents

pinned data as-of date
source manifests and SHA256 hashes
parser versions
row-count reports for bronze, silver, and gold layers
model configuration files
table and figure reproduction commands
smoke-data test path for external replication without public downloads
documentation build command

Reproducibility Command Shape

just status
just task study raw artifacts/study
just docs

The full public-data download should remain a separate long-running operational job because it is network dependent.

Submission Criteria

a fresh clone with .env configured can build docs and run tests
smoke data can reproduce the workflow structure without the full public lake
full results can be regenerated from raw public sources and local manifests
every table in the manuscript maps to a single artifact path

Deferred Extensions

Extension Portfolio

Extension 1: Multimodal Cascade Model

Research Question

Incremental Data

Model Design

Text Strategy

Acceptance Criteria

Implementation Status

Extension 2: Public Security and Attention Layers

Research Question

Candidate Sources

Identity Challenge

Potential Features

Acceptance Criteria

Extension 3: Auditor and Oversight Network

Research Question

Network Nodes

Candidate Features

Guardrails

Extension 4: Restatement Severity And Detector Labels

Research Question

Data Requirements

Potential Contribution

Guardrails

Extension 5: Reproducibility Package

Required Contents

Reproducibility Command Shape

Submission Criteria