Results Snapshot

This page is the current paper-facing ledger for the local canonical proxy run. Generated outputs remain under ignored artifacts/, reports/, and external DATA_DIR locations; selected figures are copied into docs/assets/images/modeling/.

Current Verdict

The repo has a complete no-NBBO proxy research package: data panel, model implementations, feature-schema artifacts, validation-only tuning artifacts, forecast/ranking/strategy tables, figures, and analysis notes.

It is good enough for internal research review and a conservative working-paper draft. It is not good enough for paper-grade execution claims. All current option prices are trade-aggregate proxies, not bid/ask, OPRA, or NBBO. The execution grade remains no_nbbo_trade_proxy; paper_grade=false.

The 2026-05-12 same-code feature ablation is negative for FE V2: the default fe_v2_sec_xbrl schema underperforms fe_v1_legacy on locked-test ranking and headline C2C proxy economics. Per SPEC.md, this is reported as a diagnostic finding rather than cherry-picked away.

Research Question

The paper-facing question is:

Can models improve trading decisions around option-implied earnings event variance mispricing?

This is not generic implied-volatility forecasting. Models forecast RVAR_event; ex post mispricing is RVAR_event - IVAR_event; the trade layer uses premium-space expected edge rather than raw variance edge alone.

Target	Role	Definition
`jump_c2o`	Primary scientific ranking target	close-to-open earnings jump variance
`day_c2c`	V1 proxy-PnL headline	close-to-close full reaction-day variance
`reaction_o2c`	Diagnostic target	open-to-close post-open digestion variance

Sample and Protocol

Item	Current state
Verified local run	2026-05-12 canonical tuned proxy package
Command	`just research args="--stage all --sequence-suite all --allow-high-sequence-risk --bootstrap-iter 1000 --tuning-profile tuned_phase1 --feature-schema-version fe_v2_sec_xbrl"`
Study window	2022-12-01 to 2025-12-31
Target paper window	2013-2025, pending historical quote/NBBO or equivalent data
Universe	Monthly top 50 liquid U.S. single-name option underlyings
Main timing sample	BMO and AMC earnings announcements
Split	Chronological event-level 70/15/15
Tuning protocol	`tuned_phase1`, train/validation-only selection, locked test once
Feature schema	`fe_v2_sec_xbrl` default; `fe_v1_legacy` retained for ablation
Bootstrap iterations	1,000
Research manifest	`artifacts/modeling/research_manifest.json`, `ok=true`
Ablation snapshots	`artifacts/modeling_ablations/fe_v2_sec_xbrl/`, `artifacts/modeling_ablations/fe_v1_legacy/`

Data Coverage

Measure	Value
Dynamic-calendar rows	1,054
BMO/AMC main-sample candidates	810
Trade-proxy event-panel rows	810
Events with C2C `rvar_event` alias	801
Events with trade-proxy `IVAR_event`	693
Proxy contract candidates	12,038
Contracts with usable pre-cutoff proxy price	10,165
Contracts with no trade in cutoff window	1,873
Contracts with local IV proxy	10,138
Main DTE 5-14 contracts	5,098
Robustness DTE 3-21 contracts	12,038
Proxy straddle diagnostic rows	779

Sequence coverage remains diagnostic: 678 of 810 events are daily-sequence eligible, the drop rate is 16.3%, and high_sequence_selection_risk=true. The active hybrid tensor is 31 x 21: 19 prior daily proxy-surface states plus 12 entry-day five-minute trade-aggregate proxy bins.

Model Matrix

Family	Models	Status
Market benchmark	Market-implied `IVAR_event`	Active neutral edge baseline
Historical baselines	Last-four RVAR, last-four IVAR	Active deterministic baselines
Classical spread benchmark	Goyal-Saretto-style RV-IV spread	Active comparator
Linear tabular	Elastic Net	sklearn `ElasticNetCV`
Nonlinear tabular	LightGBM, XGBoost	Validation-only tuning and train+validation refit
Ensemble	LightGBM/XGBoost rank-average	Equal-weight tuned-base ensemble
Neural tabular	FT-Transformer	Validation-tuned; not a current headline model
Sequence diagnostics	Ridge-flat aggregates, BiGRU 5-seed, official `mamba-ssm` 5-seed, attention pooling, dilated CNN, mask-only, time-shuffle	Diagnostic only

FE V2 Canonical Result

Forecast/ranking columns use jump_c2o; strategy uses the day_c2c headline proxy PnL. This is the active default schema.

Model	MAE	RMSE	OOS R2 vs IVAR	Top-decile precision	AUC	Day-C2C net proxy PnL
Market IVAR	0.0097	0.0145	0.000	0.000	0.500	n/a
Last-four RVAR	0.0123	0.0293	-1.787	0.200	0.505	-3,482
Last-four IVAR	0.0181	0.0540	-0.295	0.200	0.468	-15,904
Goyal-Saretto spread	0.0076	0.0134	0.141	0.300	0.602	-461
Elastic Net	0.0099	0.0212	0.125	0.300	0.493	-24,967
LightGBM	0.0097	0.0209	0.203	0.300	0.510	-2,862
XGBoost	0.0099	0.0211	0.142	0.100	0.515	-19,952
LightGBM/XGBoost ensemble	0.0098	0.0210	0.175	0.300	0.512	-19,694
FT-Transformer	0.0357	0.0384	-4.766	0.200	0.524	-4,793
Ridge-flat sequence	0.0110	0.0221	-0.047	0.200	0.434	19,918
Attention pooling sequence	0.0088	0.0236	0.054	0.100	0.480	-2,238
Dilated CNN sequence	0.0088	0.0233	0.116	0.100	0.476	-4,172
BiGRU 5-seed	0.0087	0.0236	0.059	0.200	0.472	-8,022
Official `mamba-ssm` 5-seed	0.0088	0.0239	0.024	0.200	0.501	-1,793
Mask-only sequence	0.0088	0.0242	-0.039	0.100	0.500	101
Time-shuffle sequence	0.0086	0.0237	0.056	0.200	0.475	-8,022

Interpretation. FE V2 is not the current sell. The strongest FE V2 jump_c2o AUC is the Goyal-Saretto spread at 0.602. Tuned LightGBM/XGBoost rows have weak jump_c2o AUC and negative day_c2c headline proxy PnL. The positive ridge-flat sequence C2C row is diagnostic and does not pass the sequence gate.

AUC and top-decile precision Strategy PnL by edge decile

FE V1 Versus FE V2 Ablation

Feature schema	Target	Best AUC model	Best AUC	Best OOS R2 model	Best OOS R2 vs IVAR	Best headline/diagnostic PnL model	Best net PnL
`fe_v1_legacy`	`jump_c2o`	LightGBM	0.677	XGBoost	0.375	Official `mamba-ssm` 5-seed, C2O intrinsic diagnostic	28,898
`fe_v1_legacy`	`day_c2c`	LightGBM	0.925	XGBoost	0.574	LightGBM, C2C headline	53,664
`fe_v1_legacy`	`reaction_o2c`	Ridge-flat sequence	0.799	XGBoost	0.949	FT-Transformer, O2C diagnostic	643
`fe_v2_sec_xbrl`	`jump_c2o`	Goyal-Saretto spread	0.602	LightGBM	0.203	Official `mamba-ssm` 5-seed, C2O intrinsic diagnostic	28,898
`fe_v2_sec_xbrl`	`day_c2c`	Ridge-flat sequence	0.636	Ridge-flat sequence	0.264	Ridge-flat sequence, C2C headline	19,918
`fe_v2_sec_xbrl`	`reaction_o2c`	Ridge-flat sequence	0.799	LightGBM/XGBoost ensemble	0.945	FT-Transformer, O2C diagnostic	753

The same-code ablation says the parsimonious legacy feature set currently dominates FE V2 for the paper-facing tabular story. FE V2 should remain in the artifact ledger because it is the default schema, but the manuscript should not claim that FE V2 improved the result without further feature diagnostics.

Cost Sensitivity

FE V2 day_c2c headline rows:

Model	1x cost	3x cost	5x cost
Goyal-Saretto spread	-461	-2,155	-3,849
LightGBM	-2,862	-4,556	-6,250
XGBoost	-19,952	-21,646	-23,340
LightGBM/XGBoost ensemble	-19,694	-21,388	-23,082
Ridge-flat sequence	19,918	18,224	16,530
Official `mamba-ssm` 5-seed	-1,793	-3,383	-4,972

Cost sensitivity

Sequence Diagnostics

No sequence model passes the diagnostic gate. Official mamba-ssm is now implemented through the mamba-ssm backend and appears in the artifacts, but the 5-seed row does not beat tabular baselines or controls robustly enough to upgrade the claim. Attention pooling and dilated CNN are also diagnostic rows.

What We Can Sell

The defensible near-term claim is now narrower:

In a no-NBBO proxy sample, a parsimonious event-level tabular feature set shows preliminary cross-sectional signal for earnings event-variance mispricing beyond market IVAR and simple historical baselines. The richer FE V2 schema is currently a negative diagnostic result and needs follow-up.

The current paper angle should emphasize:

The trading question is event-variance mispricing, not generic IV forecasting.
Ranking and top-decile selection matter more than unconditional RMSE.
FE V1 tabular LightGBM/XGBoost is the stronger same-code signal screen.
FE V2, sequence models, and Mamba are diagnostic rather than headline evidence.

Claim Boundaries

Do not claim paper-grade executable performance, bid/ask/NBBO execution, Mamba superiority, FE V2 improvement, or that lower RMSE alone implies economic value. Paper-grade claims require historical quote/NBBO or equivalent data, quote-based IVAR, leg-level execution with realistic bid/ask crossing, DTE and liquidity robustness, and clustered or bootstrap inference over a longer history.