
Deferred Extensions

This page records the research program that follows the current benchmark plus public-cascade paper. These extensions are deliberately deferred so that the main study does not turn into an overextended data-engineering exercise before the measurement spine is stable.

Scope guardrail

The current paper must first keep label-observability diagnostics, concept drift, public-label opacity analysis, full public-lake construction, public cascade prediction, and gvkey-CIK-year overlap validation reproducible. Future models should be added only after those foundations are stable.


  • Multimodal cascade: Add filing text only after the structured public cascade and lead-time baselines are stable.
  • Attention and market layers: Add 13F, EDGAR-log, FTD, and market-structure inputs only after temporal security bridges and source-availability masks are defensible. SEC Insider Transactions are a narrower P1 issuer-CIK extension in the current paper plan, not part of this security-level expansion.
  • Auditor and oversight network: Expand Form AP and PCAOB structure into network-style monitoring exposure only after issuer-auditor joins are reliable.
  • Richer detector labels: Move toward occurrence-detection-disclosure decomposition only when stronger external restatement timing and detector data exist.

  • Finish the current benchmark plus public-cascade paper.
  • Validate gold-panel readiness and overlap diagnostics.
  • Promote only one extension family at a time.
  • Text and graph models have a clearly defined incremental estimand.
  • Security-level data can be linked through documented temporal bridges.
  • Additional channels strengthen, rather than dilute, the current measurement claim.

Extension Portfolio

| Extension | Research contribution | When to activate |
| --- | --- | --- |
| Multimodal cascade | Test whether narrative filings add lead time and stage-specific information. | After public cascade labels are stable. |
| Public security and attention layers | Add institutional, attention, FTD, and market microstructure channels. | After temporal security-to-CIK bridges are available. |
| Auditor and oversight network | Model monitoring exposure through Form AP, partners, firms, and PCAOB inspections. | After Form AP and inspection joins are clean. |
| Severity and detector labels | Move toward occurrence-detection-disclosure decomposition. | After higher-quality restatement or detector data are acquired. |
| Reproducibility package | Make the empirical pipeline submission-ready. | Before manuscript circulation and review. |

Extension 1: Multimodal Cascade Model

Working title:

Narrative and Monitoring Signals in the Public Reporting-Risk Cascade

Alternative title:

Occurrence, Detection, Disclosure: A Multimodal Graph of Corporate Reporting Risk

Research Question

Do structured filings, narrative disclosures, auditor/partner monitoring networks, and public enforcement events load on different stages of the public reporting-risk cascade?

The key hypothesis is stage separation:

  • 10-K narrative changes should help earlier, pre-disclosure risk states.
  • auditor and oversight variables should matter more for public scrutiny and correction.
  • 8-K Item 4.02 framing should help explain downstream severity proxy outcomes.

Incremental Data

Start from the full public-cascade lake and add raw filing text:

  • raw 10-K and 10-K/A filing HTML
  • raw 8-K and 8-K/A filing HTML
  • parsed Item 1A Risk Factors
  • parsed Item 7 MD&A
  • parsed 8-K Item 4.02 disclosure text
  • amendment exhibit text when available

Then add graph/document nodes:

  • issuer nodes
  • filing nodes
  • Form AP auditor and partner nodes
  • PCAOB inspection nodes
  • public comment-thread nodes
  • correction-event nodes
  • AAER proxy event nodes

Optional later sources:

  • 13F institutional-holder features
  • insider transactions
  • EDGAR log attention measures
  • market-structure data
  • supplier-customer links

Model Design

Use a staged architecture rather than a single undifferentiated text-plus-tabular classifier:

  • occurrence-risk proxy head: XBRL and 10-K narrative signals
  • public scrutiny head: comment-letter and monitoring features
  • correction head: amendment and 8-K Item 4.02 labels
  • severity proxy head: AAER proxy and disclosure-framing labels

Candidate model families:

  • multi-task tabular plus text model
  • discrete-time multi-state hazard model
  • temporal heterogeneous document graph
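
As a rough illustration of the discrete-time hazard family listed above, the sketch below expands one issuer's history into person-period rows, the usual input shape for a discrete-time hazard fit. It covers a single transition only (for example, entry into a correction stage), not the full multi-state design, and the function and field names are hypothetical rather than taken from the study codebase.

```python
from typing import Optional


def person_period_rows(issuer_id: str, entry_year: int, last_year: int,
                       event_year: Optional[int] = None) -> list[dict]:
    """Expand one issuer into issuer-year rows for a single transition.

    The issuer contributes one row per year at risk. The row for the
    event year carries event=1 and ends the risk set; an issuer with no
    event is censored at last_year with event=0 on every row.
    """
    rows = []
    for year in range(entry_year, last_year + 1):
        event = int(event_year == year)
        rows.append({"issuer_id": issuer_id, "year": year, "event": event})
        if event:
            break  # the issuer leaves the risk set after the transition
    return rows
```

A binary classifier fit on the `event` column of such rows, with year indicators as covariates, estimates the per-period hazard; a multi-state version would need one such table per transition.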

Text Strategy

CPU-first phase:

  • parse sections deterministically
  • compute length, readability, dictionary, tone, and revision features
  • compute compact embeddings on a filtered subset
  • prove incremental lead time before scaling
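
A minimal sketch of the length and revision features in the CPU-first phase, using only the Python standard library. The tokenizer, the feature names, and the choice of `difflib.SequenceMatcher` as the revision measure are assumptions for illustration; readability, dictionary, and tone features are omitted.

```python
import difflib
import re

TOKEN_RE = re.compile(r"[A-Za-z']+")


def section_features(current: str, prior: str) -> dict:
    """Length and revision features for one parsed section, e.g. Item 1A.

    `current` is this year's section text, `prior` is last year's.
    """
    cur_tokens = TOKEN_RE.findall(current.lower())
    prior_tokens = TOKEN_RE.findall(prior.lower())
    sentences = [s for s in re.split(r"[.!?]+", current) if s.strip()]
    # Similarity lies in [0, 1]; 1.0 means the token sequence is unchanged.
    similarity = difflib.SequenceMatcher(
        None, prior_tokens, cur_tokens, autojunk=False
    ).ratio()
    return {
        "n_tokens": len(cur_tokens),
        "avg_sentence_len": len(cur_tokens) / max(len(sentences), 1),
        "length_change": len(cur_tokens) - len(prior_tokens),
        "revision_intensity": 1.0 - similarity,
    }
```

Token-level sequence matching is quadratic in the worst case, so on full-length sections a cheaper measure such as shingled Jaccard distance may be preferable; the point here is only that the features are deterministic given the parsed text.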

GPU phase:

  • use long-context finance embeddings for Item 1A and Item 7
  • embed 8-K Item 4.02 disclosure text for framing and severity labels
  • run full-corpus embedding only after the CPU-first phase shows value

Acceptance Criteria

This extension should not be judged by aggregate PR-AUC alone. It needs stage-specific evidence:

  • text adds lead time for distant-horizon risk
  • auditor and oversight variables matter more for scrutiny/correction stages
  • 8-K framing predicts downstream severity proxy
  • graph features represent monitoring exposure, not untested contagion

Implementation Status

No runtime code is retained for this extension. The old multimodal prototype was removed so the active codebase stays focused on benchmark, public cascade, and the combined study workflow.

When this extension becomes active again, implement it as a new module rather than reintroducing it into the current study path.

Extension 2: Public Security and Attention Layers

Research Question

Do public capital-market attention and ownership structures predict which reporting risk states become publicly visible?

This extension should not be framed as "market variables improve prediction" in a generic way. The research mechanism is public visibility: institutional holdings, insider trading, attention, and liquidity may change the probability that reporting problems are scrutinized, corrected, or escalated.

Candidate Sources

  • SEC 13F datasets
  • SEC insider transactions datasets
  • SEC EDGAR log datasets
  • SEC market-structure datasets

Identity Challenge

These sources are not natively keyed by CIK. They require a temporal bridge across:

  • CIK
  • ticker
  • CUSIP
  • accession/adsh
  • reporting manager identifiers
  • security-level market structure identifiers

Every mapping must carry provenance and validity windows. A current ticker-to-CIK lookup is not enough for a historical panel.
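
The requirement can be sketched as a point-in-time lookup over a bridge table whose rows carry validity windows and provenance. The `BridgeRow` layout, the half-open window convention, and the `resolve_cik` helper are hypothetical; the actual bridge schema is not defined on this page.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class BridgeRow:
    cusip: str
    cik: str
    valid_from: date           # inclusive start of the mapping
    valid_to: Optional[date]   # exclusive end; None means still open
    source: str                # provenance of this mapping row


def resolve_cik(bridge: list[BridgeRow], cusip: str,
                as_of: date) -> Optional[BridgeRow]:
    """Return the bridge row valid for `cusip` on `as_of`, or None.

    Overlapping windows are treated as a data error rather than being
    resolved silently.
    """
    matches = [
        row for row in bridge
        if row.cusip == cusip
        and row.valid_from <= as_of
        and (row.valid_to is None or as_of < row.valid_to)
    ]
    if len(matches) > 1:
        raise ValueError(f"ambiguous mapping for {cusip} on {as_of}")
    return matches[0] if matches else None
```

Returning the whole row rather than only the CIK keeps the provenance attached to every resolved link, so a downstream feature can be traced back to the mapping that produced it.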

Potential Features

  • institutional ownership concentration
  • institutional turnover
  • transient versus dedicated holder exposure
  • insider sale pressure
  • filing-attention shocks
  • liquidity and spread changes
  • market microstructure stress before public correction events

Acceptance Criteria

  • every security-level source must retain its original security key
  • no ticker or CUSIP should be coerced into CIK without a documented mapping table
  • feature availability masks must distinguish source non-existence from issuer silence
  • attention and liquidity variables must be timestamped strictly before origin_date
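
The last two criteria can be sketched together: a feature is built only from observations timestamped strictly before origin_date, and a missing value is labeled by why it is missing. The `Availability` states, the `mask_feature` helper, and the idea of a single `source_start` date per source are illustrative assumptions, not the pipeline's actual mask encoding.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Availability(Enum):
    SOURCE_NOT_AVAILABLE = "source_not_available"  # source had no coverage yet
    ISSUER_SILENT = "issuer_silent"                # source covered, issuer absent
    OBSERVED = "observed"


def mask_feature(observations: list[tuple[date, float]], origin_date: date,
                 source_start: date) -> tuple[Availability, Optional[float]]:
    """Return (availability state, most recent usable value).

    `observations` holds (timestamp, value) pairs for one issuer and one
    source; `source_start` is the first date the source has coverage.
    """
    if origin_date <= source_start:
        # Nothing from this source could exist strictly before the origin.
        return Availability.SOURCE_NOT_AVAILABLE, None
    usable = [(ts, v) for ts, v in observations if ts < origin_date]
    if not usable:
        return Availability.ISSUER_SILENT, None
    latest_ts, latest_value = max(usable, key=lambda pair: pair[0])
    return Availability.OBSERVED, latest_value
```

An observation dated exactly on origin_date is excluded, which is the strict inequality the criterion asks for.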

Extension 3: Auditor and Oversight Network

Research Question

Does public monitoring exposure through auditors, partners, and PCAOB inspections help explain public scrutiny and correction outcomes?

The mechanism is monitoring and lagged public exposure, not a causal contagion claim. The design should ask whether public oversight networks reveal where reporting-risk states are more likely to become visible.

Network Nodes

  • issuer
  • audit firm
  • audit office, if available
  • engagement partner
  • other audit participants
  • PCAOB inspection report
  • inspection deficiency type
  • correction event
  • comment-thread event

Candidate Features

  • partner-level public workload
  • partner prior correction exposure
  • audit-firm inspection deficiency history
  • issuer exposure to recently inspected audit firms
  • peer correction exposure within the same audit firm or partner network
  • auditor turnover and monitoring discontinuity
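
As one example of how these features might be computed, the sketch below counts a partner's prior correction exposure: correction events at the partner's other issuers that are dated strictly before the focal origin date. The input shapes and the function name are hypothetical, and the sketch ignores engagement dates, which a real implementation would need so that only corrections tied to the partner's own engagement periods are counted.

```python
from datetime import date


def partner_prior_correction_exposure(
    engagements: list[tuple[str, str]],   # (partner_id, issuer_id) pairs
    corrections: list[tuple[str, date]],  # (issuer_id, event_date) pairs
    partner_id: str,
    focal_issuer: str,
    origin_date: date,
) -> int:
    """Count corrections at the partner's other issuers before origin_date."""
    peers = {
        issuer for partner, issuer in engagements
        if partner == partner_id and issuer != focal_issuer
    }
    return sum(
        1 for issuer, event_date in corrections
        if issuer in peers and event_date < origin_date
    )
```

Excluding the focal issuer keeps the feature a measure of lagged exposure through the partner rather than a restatement of the issuer's own history.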

Guardrails

  • do not frame this as contagion unless the design separates common shocks, monitoring intensity, and network exposure
  • do not treat partner or firm exposure as causal without a credible design
  • keep network variables in the monitoring channel first, not in a general-purpose graph black box

Extension 4: Restatement Severity and Detector Labels

Research Question

Can the public cascade be upgraded into a richer occurrence-detection-disclosure decomposition once stronger external labels are available?

Data Requirements

If paid or higher-quality data become available, add:

  • restatement filing dates
  • affected period start and end dates
  • severity categories
  • fraud indicators
  • detector or notifier identity
  • SEC comment-letter/enforcement linkage
  • restatement magnitude and account categories

Potential Contribution

This extension would move from public cascade prediction toward occurrence-detection-disclosure decomposition. The conceptual payoff is large, but the identification burden is also much higher.

Guardrails

  • do not put bivariate-probit or partial-observability identification in the main model unless there are defensible exclusion restrictions or detector-side variables
  • do not call public correction labels true latent occurrence
  • keep detector labels separate from severity labels

Extension 5: Reproducibility Package

Before submission, build a reproducibility package that is separate from exploratory development.

Required Contents

  • pinned data as-of date
  • source manifests and SHA256 hashes
  • parser versions
  • row-count reports for bronze, silver, and gold layers
  • model configuration files
  • table and figure reproduction commands
  • smoke-data test path for reviewers without public downloads
  • documentation build command
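
A minimal sketch of the source-manifest item, assuming the raw sources sit under one local directory. The manifest layout (relative path mapped to hash and byte size) and the function names are assumptions; the package's real manifest format is not specified here.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large raw downloads do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, dict]:
    """Map each file under `root` to its SHA256 hash and size in bytes."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): {
            "sha256": sha256_file(path),
            "bytes": path.stat().st_size,
        }
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Keying entries by POSIX-style relative paths and sorting them keeps the manifest stable across machines, so two builds from the same pinned sources can be compared directly.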

Reproducibility Command Shape

just status
just task study raw artifacts/study
just docs

The full public-data download should remain a separate long-running operational job because it is network dependent.

Submission Criteria

  • a fresh clone with .env configured can build docs and run tests
  • smoke data can reproduce the pipeline shape without the full public lake
  • full results can be regenerated from raw public sources and local manifests
  • every table in the manuscript maps to a single artifact path