Paper Plan¶
Working Title¶
U.S. Close Information and Pre-Open Tail Risk in OSE Nikkei 225 Futures
This page is the paper-facing manuscript blueprint. It follows the order of a finance paper: introduction, literature and gap, contribution, materials and methods, registered experiments, expected results/discussion, claim boundaries, and appendix/source notes.
1. Introduction¶
1.1 Overview And Why This Work¶
- The paper asks whether information observed by the U.S. cash-market close helps forecast the tail risk of the next Osaka Exchange (OSE) Nikkei 225 Futures day-session open.
- The primary empirical object is the settlement-to-open gap of the Nikkei 225 Futures large contract:
gap_t = log(day_session_open_t) - log(previous_settlement_{t-1}).
- The same gap is evaluated as two loss surfaces:
- left_tail: downside opening-gap risk, realized_loss_t = -gap_t;
- right_tail: upside opening-gap risk, realized_loss_t = gap_t.
- The registered primary tail level is 95% Value-at-Risk (VaR), with a nominal 5% exception rate.
- The empirical question is predictive and out-of-sample. It is not a structural causal design.
- Why this setting is useful:
- the target is economically concrete: a futures opening gap relative to a settlement reference;
- the forecast origin is observable before the next OSE open;
- the information experiment is naturally nested: Japan-only history, then U.S. close information, then Japan and Asia proxy blocks;
- tail-risk claims can be disciplined by VaR coverage tests before reading average loss improvements.
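The target and the two loss surfaces defined in this subsection can be written out in a few lines. This is a minimal sketch, not the project's implementation: the function names and the example prices are hypothetical, and only the gap formula, the sign convention, and the exception rule (realized loss strictly above the VaR forecast) come from this plan.

```python
import math


def settlement_to_open_gap(day_session_open: float, previous_settlement: float) -> float:
    """Log settlement-to-open gap: log(open_t) - log(settlement_{t-1})."""
    return math.log(day_session_open) - math.log(previous_settlement)


def tail_losses(gap: float) -> dict:
    """Map one gap to both loss surfaces in positive loss units."""
    return {"left_tail": -gap, "right_tail": gap}


def is_var_exception(realized_loss: float, var_forecast: float) -> bool:
    """A VaR exception occurs when the realized loss exceeds the forecast."""
    return realized_loss > var_forecast


# Hypothetical prices: the open is 2.5% below the previous settlement.
gap = settlement_to_open_gap(day_session_open=39000.0, previous_settlement=40000.0)
losses = tail_losses(gap)
```

A down-gap is a positive left-tail loss and a negative right-tail loss, so one realized gap can trigger an exception on at most one side when both VaR forecasts are positive.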
1.2 Market Context¶
- OSE Nikkei 225 Futures trade in both day and night sessions.
- The U.S. cash close occurs before the next OSE day-session open, but the Japanese night session means that some U.S. information may already be reflected before the opening auction.
- The paper therefore studies pre-open tail risk, not a generic close-to-close or overnight-return problem.
- The forecast origin is the matched U.S. cash-market close plus the registered vendor-data availability lag.
- The point-in-time condition is:
feature_available_ts_utc <= model_cutoff_ts_utc < target_open_ts_utc.
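The point-in-time condition can be checked row by row. The sketch below assumes timezone-aware UTC timestamps; the function name, the 15-minute vendor lag, and the example date are hypothetical illustrations, not registered values.

```python
from datetime import datetime, timedelta, timezone


def is_point_in_time_valid(
    feature_available_ts_utc: datetime,
    model_cutoff_ts_utc: datetime,
    target_open_ts_utc: datetime,
) -> bool:
    """True when the feature is known by the cutoff and the cutoff precedes the open."""
    return feature_available_ts_utc <= model_cutoff_ts_utc < target_open_ts_utc


# Hypothetical winter-time example: U.S. cash close at 16:00 ET = 21:00 UTC.
us_close = datetime(2024, 1, 9, 21, 0, tzinfo=timezone.utc)
vendor_lag = timedelta(minutes=15)  # assumed lag, for illustration only
model_cutoff = us_close + vendor_lag

# OSE day-session open at 08:45 JST on 2024-01-10 = 23:45 UTC on 2024-01-09.
target_open = datetime(2024, 1, 9, 23, 45, tzinfo=timezone.utc)

on_time_feature = us_close + timedelta(minutes=5)
late_feature = model_cutoff + timedelta(minutes=1)

ok = is_point_in_time_valid(on_time_feature, model_cutoff, target_open)
leak = is_point_in_time_valid(late_feature, model_cutoff, target_open)
```

A feature that becomes available even one minute after the cutoff is rejected, although it would still be observable before the OSE open.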
1.3 Literature Review And Existing Results¶
- International information transmission:
- The empirical setting is cross-market and timing-sensitive: U.S. equity, rates, volatility, FX, credit, and proxy-ETF information is observed before the Japanese futures open.
- The paper does not claim price discovery or structural spillover identification.
- VaR and ES forecasting:
- The study evaluates one-day-ahead opening-gap VaR and ES in positive loss units.
- VaR calibration is assessed through exception rates and coverage tests.
- ES enters through valid VaR-ES forecast pairs and Fissler-Ziegel (FZ) joint scoring.
- Dynamic quantile and tail models:
- Econometric comparators include historical quantiles, volatility-scaled quantiles, GARCH/GJR-GARCH, CAViaR, CARE/expectile models, and GAS models.
- Paper-facing evaluation terminology uses Fissler-Ziegel loss for the joint VaR-ES score.
- Machine-learning models use LightGBM as a flexible tabular forecaster, not as a new algorithmic contribution.
- Filtered extreme value theory (EVT):
- The EVT component follows the filtered-tail logic: use a conditional model to remove body/scale variation, then fit a peaks-over-threshold generalized Pareto distribution (POT-GPD) tail model to exceedances.
- Plain fixed-location POT-GPD is the registered EVT estimator.
- Forecast comparison:
- Average loss comparisons use paired out-of-sample losses.
- The Diebold-Mariano (DM) test is interpreted as unconditional average-sample inference.
- Model-validation robustness:
- The paper separates scalar forecast ranking from pass/fail risk-model adequacy.
- Quantile loss and Fissler-Ziegel loss rank average predictive performance; Kupiec and Christoffersen tests assess VaR calibration and exception dynamics.
- A diagnostic-admissibility profile summarizes whether a model remains acceptable across tail sides and nested information sets. This is an information-set robustness or robust-satisficing idea.
- Existing evidence in the current research run:
- the clean forecast sample is large enough for a full-sample OOS comparison from 2018-06-20 to 2026-05-22, but still thin in realized 5% tail events;
- direct LightGBM quantile rows are useful information-set comparators but fail the current all-scenario calibration story;
- GJR-GARCH-EVT and two LightGBM+EVT families form the post-24-check comparison set for the current FZ DM heatmap;
- the current manuscript story should therefore sell calibration robustness first, then loss and information-set gains among admissible models.
1.4 Research Gap¶
- Standard international-transmission work is usually about returns, volatility, or price discovery, not the VaR/ES risk of the next OSE futures day-session open under a strict point-in-time U.S. close cutoff.
- Standard VaR/ES forecast comparisons often rank models by average scores without first asking whether a model remains usable across sparse and rich information sets.
- Flexible ML quantile methods can improve average loss while producing exception rates that are too high for risk-model claims; this paper makes that tension visible rather than hiding it behind a single ranking.
- Filtered EVT models are natural for heavy-tailed standardized losses, but the empirical question is whether the filtered-tail route remains stable under actual market timing, nested predictors, and finite tail-event counts.
1.5 Contributions¶
- A point-in-time OSE pre-open tail-risk dataset and timing design linking J-Quants Nikkei 225 Futures data to U.S. close market information.
- A nested information-set experiment that separates Japan-only history, U.S. close core variables, Japan proxy variables, and Asia proxy variables.
- A benchmark-versus-ML tail-risk comparison that evaluates VaR calibration, quantile loss, and Fissler-Ziegel joint VaR-ES loss in one consistent positive-loss convention.
- A post-coverage-screen comparison design: headline comparisons are made only among models that pass the current calibration/admissibility screen.
- A generated evidence map connecting every table and figure to source artifacts and claim scope, so manuscript statements remain traceable.
1.6 Research Questions¶
- Does U.S. close information add predictive content beyond Japan-only history?
- Is most of the marginal content captured by core U.S. close variables, or do Japan and Asia proxy blocks add further information?
- Do the left and right tails display different patterns in calibration, loss, and timing diagnostics?
- Are LightGBM direct quantile forecasts well calibrated at the 95% VaR level?
- Do LightGBM body filters combined with POT-GPD tail extrapolation improve VaR/ES behavior relative to direct 95% quantile forecasts?
- Are loss differentials related to ex-ante observables such as VIX or calendar conditions?
2. Materials And Data Description¶
2.1 Sample, Market, And Evaluation Window¶
- Current clean evaluation window: 2018-06-20 to 2026-05-22.
- Current forecast-sample size: 1722 trading-day observations.
- The current clean run is a research-candidate evidence set, not a final manuscript freeze.
- The current primary level is 95% VaR/ES.
2.2 Market Description And Target Contract¶
- The Osaka Exchange day session opens at 08:45 JST and follows a prior settlement reference for the Nikkei 225 Futures large contract.
- The OSE night session overlaps the U.S. trading day, so U.S. close information is not simply "overnight" relative to the Japanese futures market.
- The empirical design therefore locks a forecast origin after the matched U.S. cash close and evaluates the next OSE day-session opening gap.
- This market design makes timing alignment part of the empirical question, not only a data-cleaning detail.
- Primary target:
- Settlement-to-open gap: log day-session open minus log previous settlement.
- This is the main target because settlement is the economically standard daily futures reference.
- Secondary target:
- Close-to-open gap: log day-session open minus log previous day-session close.
- This provides an alternative opening-gap reference.
- Absorption robustness target:
- Night-close-to-open gap: log day-session open minus log night-session close.
- This is available only when the night close is observed and point-in-time valid.
- Deferred target:
- U.S.-close-mark-to-open gap: log day-session open minus a timestamped Nikkei futures mark at the U.S. cash close.
- This requires licensed intraday OSE, CME, SGX, or equivalent Nikkei futures marks.
2.3 Japanese Data¶
- J-Quants Premium provides the domestic futures data used for the current target and Japan-only predictors:
- Nikkei 225 Futures large-contract OHLC fields;
- settlement price;
- day-session and night-session prices where available;
- volume;
- open interest;
- roll and Special Quotation (SQ) related calendar variables.
- Lagged Japanese futures history supplies:
- prior settlement and prior day-session close;
- lagged gap and loss variables;
- rolling volatility;
- rolling 95% loss quantile;
- volume and open-interest state;
- contract-roll and days-to-SQ variables.
- J-Quants Nikkei 225 large options (NK225E) are treated as domestic option-state predictors when enabled and audited:
- lagged option-chain aggregates;
- prior available implied-volatility proxies;
- night-session option OHLC summaries;
- option volume, open interest, and days-to-SQ features.
- Same target-date option rows are not used as predictors for that target date.
2.4 U.S. And Cross-Market Data¶
- Massive daily data supply U.S. and regional market predictors:
- broad U.S. ETFs: SPY, QQQ, DIA, IWM;
- sector ETFs: XLK, XLF, XLE, XLV, XLI, XLY, XLP, XLB, XLU, XLC;
- cross-asset ETFs: TLT, GLD, USO, SMH, HYG, LQD;
- Japan proxies: EWJ, DXJ;
- Asia and regional proxies: EEM, FXI, EWY, EWT, EWH.
- Massive minute data supply late-session U.S. predictors:
- last-30-minute and last-60-minute returns;
- realized variance;
- upside and downside semivariance;
- late-session range;
- final-window momentum;
- volume pressure and volume-surge variables.
- Massive OPRA day aggregates are used only for opt-in historical option-feature reconstruction and are excluded from the canonical full-history run by default:
- core U.S. options enter the U.S. core block;
- sector and semiconductor options enter as aggregate U.S. market-state variables;
- Japan ETF and Japanese ADR option aggregates enter the Japan proxy block;
- Asia proxy option aggregates enter the Asia proxy block.
- Massive live option snapshots are not used for historical backfill.
2.5 FRED, Cboe, FX, Rates, Volatility, And Credit Controls¶
- FRED supplies macro-financial controls:
- Treasury yields: DGS2, DGS10;
- term spread: T10Y2Y;
- H.10 USD/JPY: DEXJPUS;
- VIX close where available through VIXCLS;
- credit-spread controls, including high-yield and investment-grade spread series when enabled in the clean run.
- Cboe supplies volatility-index predictors:
- VIX close;
- VIX range and related volatility-state variables where available.
- FRED variables use conservative publication-lag controls.
- FRED predictors do not use unrevised real-time ALFRED vintages. This is a data-vintage limitation, not a look-ahead-bias failure.
- The canonical USD/JPY control is FRED DEXJPUS; U.S.-listed dollar ETFs such as UUP are risk proxies, not a replacement for USD/JPY.
2.6 Pretreatment And Data Discipline¶
- Every row carries separate event, source, availability, cutoff, and target timestamps.
- A predictor can enter only if its availability timestamp is no later than the model cutoff.
- Data are staged through cache-first bronze/silver/gold artifacts:
- bronze: source-shaped cached data;
- silver: cleaned and source-specific intermediate data;
- gold: modeling panel and evaluation artifacts.
- Contract rolls and calendar joins are audited before model evaluation.
- Missingness, duplicate rows, source coverage, and calendar alignment are recorded in run artifacts.
- The current clean run includes a narrow timestamp-safe event-calendar layer: BOJ same-OSE-session information in the Japan-only set, and FOMC, CPI, NFP/payroll, plus simple major-event intensity controls from the U.S. close core set onward. Broader Japan macro-event expansion remains candidate work.
2.7 Feature Engineering And Nested Information Sets¶
- The information sets are nested by design:
- japan_only;
- japan_only_plus_us_close_core;
- japan_only_plus_us_close_core_plus_japan_proxy;
- japan_only_plus_us_close_core_plus_japan_proxy_plus_asia_proxy.
- japan_only includes:
- target history;
- lagged Japanese futures variables;
- rolling volatility and tail-loss history;
- volume and open-interest state;
- Japanese calendar, contract-roll, and SQ variables;
- lagged domestic option state when enabled and audited.
- japan_only_plus_us_close_core adds:
- broad U.S. ETF daily and late-session information;
- sector ETF state;
- U.S. rates, volatility, FX, credit, dollar-risk, and cross-asset controls;
- core U.S. option aggregates when enabled and audited.
- japan_only_plus_us_close_core_plus_japan_proxy adds:
- EWJ and DXJ daily and minute features;
- Japan ETF option aggregates;
- Japanese ADR spot and option aggregate state.
- japan_only_plus_us_close_core_plus_japan_proxy_plus_asia_proxy adds:
- Asia and regional ETF features;
- Asia proxy option aggregates when enabled and audited.
- These blocks test marginal predictive content. They are not an exhaustive variable search.
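The nesting requirement can be verified mechanically before any model is fit. The sketch below uses the four registered information-set names from this section; the helper name and the feature column names (lagged_gap, spy_ret, and so on) are hypothetical placeholders rather than actual panel columns.

```python
from typing import Dict, List, Set

INFORMATION_SETS: List[str] = [
    "japan_only",
    "japan_only_plus_us_close_core",
    "japan_only_plus_us_close_core_plus_japan_proxy",
    "japan_only_plus_us_close_core_plus_japan_proxy_plus_asia_proxy",
]


def added_features(blocks: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return the features each set adds over its predecessor.

    Raises ValueError if a richer set drops a feature from the sparser one,
    which would break the nested design.
    """
    added: Dict[str, Set[str]] = {}
    previous: Set[str] = set()
    for name in INFORMATION_SETS:
        current = blocks[name]
        missing = previous - current
        if missing:
            raise ValueError(f"{name} drops features: {sorted(missing)}")
        added[name] = current - previous
        previous = current
    return added


# Hypothetical feature names, for illustration only.
japan = {"lagged_gap", "rolling_vol"}
us_core = japan | {"spy_ret", "vix_close"}
japan_proxy = us_core | {"ewj_ret"}
asia_proxy = japan_proxy | {"eem_ret"}
blocks = dict(zip(INFORMATION_SETS, [japan, us_core, japan_proxy, asia_proxy]))
increments = added_features(blocks)
```

Reporting the increment per set also makes explicit which variables a marginal-content comparison between adjacent sets actually attributes a gain to.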
3. Methods¶
This section defines the empirical procedure after data construction: forecast origin, benchmark and ML-tail model families, EVT calibration, performance metrics, inference, and the criteria used to decide which comparisons are paper-facing.
3.1 Pipeline Structure¶
| Step | Layer | Purpose |
|---|---|---|
| 1 | Vendor and calendar sources | Pull or read J-Quants, Massive, FRED, Cboe, and exchange-calendar inputs. |
| 2 | Bronze and silver cache | Preserve typed vendor/cache rows, then normalize point-in-time research features. |
| 3 | Gold modeling panel | Join targets, calendar map, feature coverage, and leakage-bound signatures. |
| 4 | Leakage and coverage gates | Enforce timestamp ordering and sample eligibility before evaluation. |
| 5 | Baseline benchmarks and ML-tail registry | Run target-history/econometric baseline benchmarks and LightGBM tail-model families. |
| 6 | Metrics, inference, diagnostics | Build loss matrices, DM/Murphy diagnostics, stress windows, and result matrix artifacts. |
| 7 | Results snapshot | Summarize run-specific evidence and claim boundaries for reader review. |
- Data-access and cache artifacts live under data/bronze and data/silver.
- Durable modeling evidence lives under data/gold.
- Forecasts, metrics, diagnostics, and LaTeX exports live under reports/runs/<run_id>.
- Reporting rebuilds read from gold and reports; they must not trigger vendor data calls.
3.2 Model And Evaluation Protocol¶
- The registered risk level is tail_level = 0.95; the nominal VaR exception rate is 5%.
- A VaR exception is counted when realized_loss > var_forecast.
- Forecast evaluation uses coverage diagnostics, Kupiec/Christoffersen tests where available, quantile loss, Fissler-Ziegel joint VaR-ES loss, and DM inference.
- Benchmarks use target-history information only.
- ML-tail models add predictors through fixed nested information sets.
- DM inference is read as unconditional average-sample forecast-comparison evidence.
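For the coverage diagnostics, the Kupiec unconditional coverage test compares the observed exception count with the nominal 5% rate through a binomial likelihood ratio that is asymptotically chi-squared with one degree of freedom. The following is a minimal standard-library sketch; the function name is hypothetical, and the exception counts in the example are invented for illustration (only the sample size of 1722 is taken from this plan).

```python
import math


def kupiec_pof(n_obs: int, n_exceptions: int, nominal_rate: float = 0.05):
    """Kupiec proportion-of-failures test: returns (LR statistic, p-value)."""
    if n_obs <= 0 or not 0 <= n_exceptions <= n_obs:
        raise ValueError("need 0 <= n_exceptions <= n_obs and n_obs > 0")

    def binomial_loglik(rate: float) -> float:
        # Convention 0 * log(0) = 0 for the boundary cases.
        ll = 0.0
        if n_exceptions > 0:
            ll += n_exceptions * math.log(rate)
        if n_obs - n_exceptions > 0:
            ll += (n_obs - n_exceptions) * math.log(1.0 - rate)
        return ll

    observed_rate = n_exceptions / n_obs
    lr = -2.0 * (binomial_loglik(nominal_rate) - binomial_loglik(observed_rate))
    lr = max(lr, 0.0)  # guard against tiny negative rounding error
    # Survival function of chi-squared(1): P(Z^2 > lr) = erfc(sqrt(lr / 2)).
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Hypothetical exception counts on a 1722-observation sample.
lr_ok, p_ok = kupiec_pof(n_obs=1722, n_exceptions=86)     # about 5.0%
lr_bad, p_bad = kupiec_pof(n_obs=1722, n_exceptions=130)  # about 7.5%
```

With roughly 86 expected exceptions at the 5% level, the test has limited power against modest miscalibration, which is one reason the plan treats the sample as thin in realized tail events.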
3.3 Forecasting Protocol¶
- All models use the same point-in-time forecasting protocol.
- The minimum training-history requirement is common across model families.
- Most specifications use expanding pre-forecast training histories.
- The rolling empirical quantile benchmark is the exception: it uses the most recent 1,000 clean observations by design.
- ML tail models are refit monthly using expanding training windows.
- LightGBM hyperparameters are held fixed across information sets and refit dates to avoid data-dependent tuning-search evidence.
- Forecasts are stored in positive loss units.
- A VaR exception is always:
realized_loss_t > var_forecast_t.
3.4 Baseline Benchmarks¶
- The baseline benchmarks are target-history and econometric:
- historical empirical quantile;
- rolling empirical quantile;
- EWMA or volatility-scaled quantile;
- GARCH with Student-t innovations;
- GJR-GARCH with Student-t innovations;
- GJR-GARCH-EVT in the McNeil-Frey filtered-EVT tradition.
- These models establish the external VaR/ES reference before adding high-dimensional cross-market predictors.
3.5 Advanced Econometric Benchmarks¶
- Advanced econometric benchmarks are implemented to widen the peer comparison:
- CAViaR;
- CARE and expectile-based tail models;
- Generalized Autoregressive Score (GAS) models.
- These rows are claim-gated.
- Numerical convergence and common-sample availability determine how they are used in the paper.
3.6 LightGBM Direct Quantile¶
- lightgbm_direct_quantile estimates the conditional 95% loss quantile directly:
VaR_t = q_0.95(realized_loss_t | X_t).
- It uses LightGBM with a quantile objective.
- It is the cleanest specification for evaluating nested information sets.
- Its ES companion is empirical rather than a separate ES model.
- Current evidence shows that direct quantile rows must be read together with coverage diagnostics because lower average loss can coincide with higher exception rates.
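The quantile (pinball) loss that a quantile objective targets, and that the evaluation uses to score VaR alone, can be sketched in positive loss units as follows. The function name and the example numbers are hypothetical.

```python
import numpy as np


def quantile_loss(realized_loss, var_forecast, tail_level: float = 0.95):
    """Per-observation pinball loss for a VaR forecast at the given tail level."""
    y = np.asarray(realized_loss, dtype=float)
    q = np.asarray(var_forecast, dtype=float)
    # Exceptions (y > q) are penalized at weight tail_level,
    # non-exceptions at weight 1 - tail_level.
    return np.where(y > q, tail_level * (y - q), (1.0 - tail_level) * (q - y))


# Hypothetical values: one exception and one quiet day against the same forecast.
realized = np.array([0.03, 0.00])
forecast = np.array([0.02, 0.02])
per_obs = quantile_loss(realized, forecast)
breach_rate = float(np.mean(realized > forecast))
```

Because non-exceptions are charged in proportion to how far the forecast sits above the realized loss, a forecast that is systematically too low can post a small average loss on calm samples while breaching too often, so the average loss and the breach rate are reported side by side.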
3.7 LightGBM Location-Scale Empirical Tail¶
- lightgbm_location_scale_empirical separates conditional body learning from tail calibration:
- first-stage LightGBM estimates a conditional mean-like location with an L2 objective;
- second-stage LightGBM estimates log absolute residual scale;
- Duan-style smearing maps the scale estimate back to original units;
- out-of-fold standardized losses are used for empirical VaR/ES calibration.
- This is the main non-EVT filtered-tail comparator inside the LightGBM family.
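Once location and scale forecasts exist, the empirical tail step reduces to a quantile and a tail mean of the standardized losses. The sketch below takes the fitted location and scale as given inputs (the LightGBM stages are not reproduced here); the function names are hypothetical, and the smearing helper shows one common form of Duan's estimator, which may differ in detail from the registered implementation.

```python
import numpy as np


def duan_smearing_factor(log_abs_resid, fitted_log_scale) -> float:
    """Mean of exp(log-model residuals); multiplies exp(fitted_log_scale)."""
    resid = np.asarray(log_abs_resid, dtype=float) - np.asarray(fitted_log_scale, dtype=float)
    return float(np.mean(np.exp(resid)))


def empirical_tail_forecast(location, scale, standardized_losses, tail_level=0.95):
    """VaR/ES from location, scale, and out-of-fold standardized losses."""
    z = np.asarray(standardized_losses, dtype=float)
    z_var = np.quantile(z, tail_level)
    tail = z[z > z_var]
    # Fall back to the quantile itself if no standardized loss exceeds it.
    z_es = tail.mean() if tail.size else z_var
    return location + scale * z_var, location + scale * z_es


# Hypothetical standardized losses 1, 2, ..., 100 for a transparent check.
z = np.arange(1.0, 101.0)
var_unit, es_unit = empirical_tail_forecast(0.0, 1.0, z)
var_scaled, es_scaled = empirical_tail_forecast(0.001, 0.01, z)
```

Using out-of-fold standardized losses matters here: in-sample residuals from a flexible learner tend to be too small, which would shrink the calibrated tail.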
3.8 LightGBM Standardized-Loss POT-GPD¶
- The standardized-loss POT-GPD family uses the same location-scale body filter, then fits a GPD to standardized-loss exceedances.
- Current registered variants:
- lightgbm_standardized_loss_pot_gpd_plain_mle;
- lightgbm_standardized_loss_pot_gpd_unibm.
- Plain MLE is the registered fixed-location POT-GPD estimator and remains the standard comparator.
- The UniBM route keeps the same LightGBM mean/log-scale body filter and POT threshold, but replaces the MLE shape estimate with a UniBM block-maxima-derived estimate of xi; the GPD scale is then refit with xi fixed.
- This is a shape-estimator diagnostic variant, not a new primary ML specification.
3.9 LightGBM Robust Body Filters¶
- New research-candidate LightGBM+EVT models are implemented at the 95% level only and remain outside the primary ML table until post-rerun review:
- lightgbm_median_mad_pot_gpd_plain_mle;
- lightgbm_median_iqr_pot_gpd_plain_mle.
- Median/MAD route:
- LightGBM q50 estimates conditional median location;
- LightGBM L1 regression estimates conditional median absolute residual scale;
- the MAD normalization factor is recorded in artifacts.
- Median/IQR route:
- LightGBM q25, q50, and q75 estimate conditional quantiles;
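- scale is (q75 - q25) / 1.349, where 1.349 is the interquartile range of a standard normal, so the scale is on a standard-deviation footing under normality;
- quantile crossing is handled and recorded.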
- These routes test whether a more robust body filter improves the filtered tail supplied to POT-GPD.
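The median/IQR body filter can be sketched from three quantile forecasts. The plan states that quantile crossing is handled and recorded but not how; the sketch below assumes row-wise sorting (monotone rearrangement) as the repair and adds a small scale floor, both of which are assumptions, as are the function and argument names.

```python
import numpy as np

IQR_TO_SIGMA = 1.349  # interquartile range of a standard normal


def median_iqr_filter(q25, q50, q75, min_scale: float = 1e-8):
    """Return (location, scale, crossed_flag) from q25/q50/q75 forecasts."""
    raw = np.column_stack([q25, q50, q75]).astype(float)
    # Record rows where the forecast quantiles are out of order.
    crossed = (raw[:, 0] > raw[:, 1]) | (raw[:, 1] > raw[:, 2])
    # Assumed repair: sort each row so that q25 <= q50 <= q75.
    fixed = np.sort(raw, axis=1)
    location = fixed[:, 1]
    scale = np.maximum((fixed[:, 2] - fixed[:, 0]) / IQR_TO_SIGMA, min_scale)
    return location, scale, crossed


# Hypothetical forecasts: the second row has crossed quantiles.
location, scale, crossed = median_iqr_filter(
    q25=[-0.01, 0.02], q50=[0.00, 0.00], q75=[0.01, 0.01]
)
```

Keeping the crossed flag alongside the repaired values lets the artifact report how often the repair was needed rather than hiding it.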
3.10 EVT Details¶
- POT-GPD is applied only to strictly positive exceedances.
- The GPD location is fixed at zero for exceedances.
- The base shape estimate is fixed-location maximum likelihood:
stats.genpareto.fit(excesses, floc=0.0).
- The registered EVT estimator uses the fixed-location MLE shape directly.
- The UniBM comparison estimates the GPD shape xi as an extreme value index from the selected-plateau slope of a sliding block-maxima summary scaling regression. The shape xi is not the Pareto tail index alpha itself but its reciprocal: when a Pareto tail index is reported under the convention P(X > x) ~ x^{-alpha}, the relationship is xi = 1 / alpha.
- UniBM failures are fail-closed and reported as unavailable; they are not silently replaced by plain MLE.
- ES is available only when the fitted shape implies a finite ES.
- If the shape is negative, finite-endpoint support is checked before accepting the extrapolated quantile.
- EVT diagnostics include:
- log survival plots;
- QQ plots;
- mean excess plots;
- Hill/EVI paths;
- threshold stability;
- extremal-index diagnostics;
- raw versus filtered tail summaries.
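The POT-GPD step can be sketched with SciPy's fixed-location fit and the standard peaks-over-threshold quantile and expected-shortfall formulas. This is an illustration under assumptions: the function names are hypothetical, the threshold is taken as an empirical quantile of the standardized losses, the exceedance probability is estimated by the empirical exceedance share, and the simulated data are not project data.

```python
import math

import numpy as np
from scipy import stats


def fit_pot_gpd(standardized_losses, threshold_level: float = 0.90):
    """Fit a zero-location GPD to strictly positive excesses over the threshold."""
    z = np.asarray(standardized_losses, dtype=float)
    threshold = float(np.quantile(z, threshold_level))
    excesses = z[z > threshold] - threshold
    shape, _loc, scale = stats.genpareto.fit(excesses, floc=0.0)
    exceed_prob = excesses.size / z.size
    return threshold, float(shape), float(scale), exceed_prob


def gpd_var_es(threshold, shape, scale, exceed_prob, tail_level: float = 0.95):
    """Standardized VaR and ES; ES is None when the shape implies an infinite mean."""
    tail_prob = 1.0 - tail_level
    if not 0.0 < tail_prob < exceed_prob:
        raise ValueError("threshold level must lie strictly below the tail level")
    ratio = exceed_prob / tail_prob
    if abs(shape) < 1e-12:
        var = threshold + scale * math.log(ratio)  # exponential-tail limit
    else:
        var = threshold + (scale / shape) * (ratio ** shape - 1.0)
    if shape < 0.0:
        # Finite right endpoint of the fitted tail; the quantile must not exceed it.
        assert var <= threshold - scale / shape
    es = (var + scale - shape * threshold) / (1.0 - shape) if shape < 1.0 else None
    return var, es


# Simulated exponential standardized losses (illustrative only).
rng = np.random.default_rng(0)
u, xi, beta, p_u = fit_pot_gpd(rng.exponential(size=5000))
z_var, z_es = gpd_var_es(u, xi, beta, p_u)
```

The standardized quantile and ES are then mapped back through the body filter as location + scale * z. A threshold at or above the tail level leaves no room to extrapolate, so the sketch raises in that case, which is consistent with treating threshold 0.95 as a boundary diagnostic at the 95% level.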
3.11 Performance Metrics, Selection Criteria, And Inference¶
- VaR calibration:
- empirical breach rate;
- exception count;
- deviation from the nominal 5% exception rate;
- Kupiec unconditional coverage test;
- Christoffersen independence or conditional coverage test where sample size permits.
- Why calibration comes first:
- VaR is a risk-limit object, so an apparently low loss is not enough if realized exceptions are too frequent or clustered;
- the current paper sells robustness across tail sides and information sets, so pass/fail calibration evidence must precede any model-win language.
- VaR loss:
- quantile loss on paired out-of-sample forecasts.
- Why quantile loss is retained:
- it is the proper score for VaR alone;
- it keeps direct quantile rows interpretable even when ES is empirical or auxiliary.
- Joint VaR-ES evaluation:
- Fissler-Ziegel joint loss for valid VaR-ES pairs;
- ES exceedance severity, interpreted conditional on a VaR exception.
- Why FZ loss is the main joint score:
- it evaluates VaR and ES as a pair;
- it is used only as evaluation language in the paper;
- legacy likelihood-style implementation language is treated as benchmark objective interpretation, not as a second paper-facing loss.
- Terminology is fixed as follows:
- FZ loss means the Fissler-Ziegel joint VaR-ES evaluation score;
- no separate likelihood-style VaR-ES loss label is used in the paper.
- Scoring-function diagnostics:
- Murphy diagrams for target-history benchmarks and 24-check robust LGBM families across the four registered information sets.
- Model comparison:
- block-bootstrap Diebold-Mariano tests on paired loss differentials.
- Why DM is supporting inference:
- it tests average paired loss differences on common forecast dates;
- it is not a conditional state-by-state mechanism test;
- the post-24-check 3-by-3 heatmap uses strict common dates across GJR-GARCH-EVT, LGBM POT-GPD plain MLE (C), and LGBM POT-GPD UniBM (C).
- Supporting diagnostics:
- stress-window performance.
- Cross-scenario admissibility:
- The headline robustness question is whether a model remains acceptable under sparse and rich information sets, and on both tail sides.
- A 24-check version combines eight tail-by-information-set scenarios with three calibration diagnostics: breach-neighborhood, Kupiec unconditional coverage, and Christoffersen independence or conditional coverage where available.
- This diagnostic battery is a validation profile, not a single formal hypothesis test and not proof of universal optimality.
- The current breach-audit artifact reports the narrower breach-neighborhood and row-count gates; it should not be described as the full 24-check table unless the Kupiec and Christoffersen pass/fail grid is included.
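The Fissler-Ziegel class contains many scoring functions, and this plan does not pin down which member is used. As an illustration only, the sketch below implements the zero-degree homogeneous member (often written FZ0) translated into positive loss units, and treats a pair as valid when ES is positive and no smaller than VaR. The function name and the validity rule are assumptions.

```python
import numpy as np


def fz0_loss(realized_loss, var_forecast, es_forecast, tail_level: float = 0.95):
    """Per-observation FZ0 joint VaR-ES loss in positive loss units.

    Invalid pairs (ES <= 0 or ES < VaR) are returned as NaN.
    """
    x = np.asarray(realized_loss, dtype=float)
    v = np.asarray(var_forecast, dtype=float)
    e = np.asarray(es_forecast, dtype=float)
    alpha = 1.0 - tail_level
    valid = (e > 0.0) & (e >= v)
    safe_e = np.where(valid, e, 1.0)  # placeholder to avoid log/divide warnings
    loss = (
        np.maximum(x - v, 0.0) / (alpha * safe_e)
        + v / safe_e
        + np.log(safe_e)
        - 1.0
    )
    return np.where(valid, loss, np.nan)


# Hypothetical forecasts: a quiet day and an exception against the same pair.
losses = fz0_loss([0.01, 0.04], [0.02, 0.02], [0.03, 0.03])
invalid = fz0_loss([0.01], [0.03], [0.02])
```

Lower values are better, and the exception term is scaled by 1 / (alpha * ES), so a single breach against a small ES forecast can dominate the sample average when tail events are few.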
4. Workflow Chart¶
4.1 Timing And Data Flow¶
flowchart TD
A["OSE target date t"]
B["Previous settlement<br/>Nikkei 225 Futures large contract"]
C["Matched U.S. cash session s(t)<br/>regular close or early close"]
D["Model cutoff<br/>U.S. close plus vendor lag"]
E["Eligible predictors<br/>availability timestamp no later than cutoff"]
F["OSE day-session open<br/>08:45 JST"]
G["Settlement-to-open gap"]
H["Left-tail loss<br/>minus gap"]
I["Right-tail loss<br/>gap"]
A --> B
A --> C
C --> D
D --> E
B --> G
F --> G
G --> H
G --> I
E --> H
E --> I
4.2 Empirical Pipeline¶
flowchart LR
subgraph Data["Materials"]
J["Japan futures and options<br/>J-Quants"]
U["U.S. and regional market data<br/>Massive"]
M["Rates, FX, volatility, credit<br/>FRED and Cboe"]
end
subgraph Features["Nested information sets"]
A["A: Japan only"]
B["B: A plus U.S. close core"]
C["C: B plus Japan proxies"]
D["D: C plus Asia proxies"]
end
subgraph Models["Forecast models"]
BF["Baseline benchmarks"]
AB["Advanced econometric benchmarks"]
LGBM["LightGBM (+EVT) families<br/>direct quantile; location-scale empirical;<br/>standardized/robust body filters; POT-GPD"]
end
subgraph Evaluation["Evaluation"]
CAL["VaR calibration<br/>breach rate, exceptions,<br/>Kupiec/Christoffersen"]
LOSS["Forecast scores<br/>quantile loss;<br/>FZ joint VaR-ES loss"]
INF["Loss comparison<br/>DM"]
DIAG["Forecast diagnostics<br/>Murphy,<br/>ES severity,<br/>stress windows"]
end
J --> A
J --> B
U --> B
M --> B
B --> C
C --> D
A --> BF
A --> AB
A --> LGBM
B --> LGBM
C --> LGBM
D --> LGBM
BF --> CAL
BF --> LOSS
BF --> DIAG
AB --> CAL
AB --> LOSS
AB --> DIAG
LGBM --> CAL
LGBM --> LOSS
LGBM --> DIAG
CAL --> INF
LOSS --> INF
- The LightGBM block represents the implemented ML-tail registry: direct quantile, location-scale empirical, standardized-loss POT-GPD, and robust median/MAD or median/IQR POT-GPD variants. All use the same registered nested information sets where the model family is eligible.
- Forecast diagnostics are computed from forecasts, realized losses, timing regimes, and scoring outputs. They are not downstream products of DM or other loss-comparison inference.
5. Expected Experiments¶
5.1 Primary Data And Timing Experiments¶
- Build the OSE settlement-to-open gap target and verify the final forecast sample.
- Audit U.S./Japan session matching, early closes, holiday desynchronization, DST regimes, roll/SQ exclusions, and vendor availability timestamps.
- Report target-tail motivation diagnostics: density versus Gaussian, log survival, mean excess, and Hill/GPD tail-index paths.
- Output expected evidence:
- market_timing_design;
- target_tail_motivation;
- run metadata, panel construction, target-audit, calendar, feature coverage, and leakage sections in the Results Snapshot.
5.2 Benchmark Experiments¶
- Run target-history and econometric baselines on left/right 95% loss surfaces.
- Include historical/rolling quantile, EWMA or volatility-scaled quantile, GARCH-t, GJR-GARCH-t, and GJR-GARCH-EVT.
- Include advanced econometric benchmarks where they converge and pass artifact gates; these rows remain claim-gated.
- Output expected evidence:
- benchmark metrics tables;
- benchmark Murphy diagnostics;
- selected benchmark-versus-LGBM performance figures;
- all-model diagnostic scan.
5.3 Nested Information-Set ML Experiments¶
- Run LightGBM direct quantile and LightGBM filtered-tail families over the nested A/B/C/D information sets.
- Evaluate left and right tails separately.
- Treat direct quantile rows as information-set comparators, but do not promote them when they fail calibration/admissibility gates.
- Output expected evidence:
- primary ML nested-information-set table;
- per-model ML-tail appendix table;
- 24-check coverage/admissibility discussion;
- LGBM Murphy diagnostics for the pass-all LGBM+EVT families.
5.4 Post-24-Check Cross-Suite Comparison¶
- Restrict the headline comparison set to models that pass the current calibration/admissibility screen:
- GJR-GARCH-EVT;
- LGBM POT-GPD plain MLE (C);
- LGBM POT-GPD UniBM (C).
- Compute the 3-by-3 pairwise FZ DM heatmap separately for left and right tails.
- Use a strict global common sample within each tail so the benchmark and LGBM rows are paired on identical forecast dates.
- Output expected evidence:
- dm_heatmap_left_tail;
- dm_heatmap_right_tail;
- common-sample N in the figure subtitle/caption.
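One cell of the pairwise heatmap can be sketched as a block-bootstrap test on the paired loss differential. The version below uses a moving-block bootstrap of the centered differential to approximate the null distribution of the mean; the function name, block length, number of resamples, and simulated losses are all hypothetical, and the registered procedure may differ.

```python
import numpy as np


def block_bootstrap_dm(loss_candidate, loss_anchor, block_length=10, n_boot=2000, seed=0):
    """Return (mean loss differential, two-sided bootstrap p-value).

    Differential is candidate minus anchor on common dates, so a negative
    mean favors the candidate (row) model.
    """
    d = np.asarray(loss_candidate, dtype=float) - np.asarray(loss_anchor, dtype=float)
    n = d.size
    if n < block_length:
        raise ValueError("sample shorter than one block")
    mean_diff = float(d.mean())
    centered = d - mean_diff  # impose the null of equal average loss

    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block_length))
    starts = rng.integers(0, n - block_length + 1, size=(n_boot, n_blocks))
    offsets = np.arange(block_length)
    # Concatenate resampled blocks and trim each replicate to length n.
    idx = (starts[:, :, None] + offsets[None, None, :]).reshape(n_boot, -1)[:, :n]
    boot_means = centered[idx].mean(axis=1)

    exceed = int(np.sum(np.abs(boot_means) >= abs(mean_diff)))
    p_value = (1 + exceed) / (1 + n_boot)
    return mean_diff, p_value


# Simulated losses: the candidate is better by a small, steady margin.
rng = np.random.default_rng(1)
anchor = rng.normal(1.0, 0.1, size=500)
candidate = anchor - 0.05 + rng.normal(0.0, 0.01, size=500)
mean_diff, p_value = block_bootstrap_dm(candidate, anchor, n_boot=500)
```

Both loss series must already be restricted to the strict common sample before they are differenced; otherwise the pairing on identical forecast dates is lost.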
5.5 Information-Increment And Stress Diagnostics¶
- Plot cumulative FZ gains relative to the same-family A-only LGBM+EVT anchor.
- Compare GJR-GARCH-EVT and B/C/D information expansions against that A-only anchor to show information increments after the 24-check screen.
- Use stress-window overlays to illustrate VaR/ES behavior in broad stress episodes; do not interpret them as PnL, trading alpha, or validation by themselves.
- Output expected evidence:
- cumulative_lgbm_a_anchor_fz_gain;
- var_es_stress_overlay_2024_stress_episode;
- var_es_stress_overlay_2025_stress_episode;
- full-sample VaR overlay diagnostics.
5.6 Appendix Robustness Experiments¶
- Run just sensitivity as post-24-check appendix evidence only.
- Perturb nearby LightGBM capacity for the two pass-all C-information LGBM+EVT families.
- Perturb POT thresholds at 0.875 and 0.925, while recording 0.95 as a boundary diagnostic at the 95% VaR level.
- Do not feed sensitivity rows into model selection, promoted rows, DM gates, selected figures, or the cross-suite FZ DM heatmap.
6. Expected Results And Discussion Outputs¶
6.1 Main Tables¶
- Predictor block and coverage table: data/methods table showing information blocks, source families, feature counts, representative variables, missingness, and model role. Coverage is not admissibility; timestamp and feature-matrix gates still apply.
- Model inventory table: compact methods table explaining Historical, GARCH/GJR, GARCH-EVT, advanced econometric, direct LightGBM, location-scale, and POT-GPD constructions. Performance belongs in result tables, not here.
- Benchmark floor table: common-sample benchmark breach rates and loss metrics, with left/right tail detail available when page space allows.
- ML information-ladder table: the main nested information-set table for direct LightGBM, reported separately for left and right tails.
- Selected model performance table: deterministic selected-row summary after sample-size, coverage, FZ-loss, and quantile-loss gates.
- Compact DM summary table: headline paired inference only. Full matrices stay in the appendix.
6.2 Main Figures¶
- Figure 1, market timing design: institutional timing diagram for OSE settlement, night session, U.S. close, model cutoff, and next OSE open. This is a forecast-origin diagram, not a causal price-discovery diagram.
- Figure 2, opening-gap tail motivation: density versus Gaussian, left/right log survival, mean-excess diagnostics, and Hill tail-index paths. This single composite motivates the target and EVT route; it is not forecast validation.
- Coverage-screen evidence is summarized in tables rather than a compact main-text coverage figure; this avoids duplicating the 24-check pass/fail story.
- Direct LightGBM information-ladder graphics are not main-text figures under the 24-check robustness story because the direct quantile rows fail the calibration screen.
- Figure 3, cumulative FZ-gain diagnostics: one 2-by-2 figure after the 24-check coverage screen. Each panel fixes a tail side and one of the two 24-check-passing LGBM+EVT families. The anchor is the corresponding A-only LGBM+EVT forecast; plotted candidates are GJR-GARCH-EVT and the same-family B/C/D information expansions. Upward movement means the candidate has lower accumulated FZ loss than A-only under the fixed anchor-loss-minus-candidate-loss convention.
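The plotted quantity in the cumulative FZ-gain figure is a running sum of anchor loss minus candidate loss on common forecast dates. A minimal sketch with hypothetical loss values:

```python
import numpy as np


def cumulative_fz_gain(anchor_loss, candidate_loss):
    """Running sum of (anchor loss - candidate loss); upward moves favor the candidate."""
    anchor = np.asarray(anchor_loss, dtype=float)
    candidate = np.asarray(candidate_loss, dtype=float)
    return np.cumsum(anchor - candidate)


# Hypothetical per-date FZ losses for the A-only anchor and one candidate.
gain_path = cumulative_fz_gain([1.0, 1.0, 1.0], [0.5, 1.5, 0.5])
```

Reading the path rather than only its endpoint shows whether an accumulated gain is spread over the sample or concentrated in a few dates, such as a single stress episode.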
6.3 Appendix Figures And Tables¶
- Raw target diagnostics: histogram/density, left/right QQ plots, log survival, mean excess, and Hill plot.
- Full coverage diagnostics and selected performance figures: appendix checks backing the main coverage and selected-performance summaries.
- Stress-window overlays: broad OOS stress episodes with left/right tails sharing the same x-axis; LGBM lines use information set C, the best-FZ row within the two 24-check LGBM+EVT families. Illustration only, not validation, PnL, cost, or trading-performance evidence.
- DM heatmaps: appendix pairwise FZ detail for the post-24-check cross-suite set; rows are candidates, columns are anchors, and negative differences favor the row model. Each tail uses a strict global common sample across GJR-GARCH-EVT, LGBM plain MLE C, and LGBM UniBM C.
- Murphy diagrams: scoring-family diagnostics, not pairwise dominance claims.
- ES severity diagnostics: conditional exceedance diagnostics, not model-selection or alpha claims.
- EVT standardized-residual diagnostics: QQ, log survival, mean excess, Hill, and threshold stability for the POT-GPD route.
- Appendix tables: full benchmark scan, full LGBM scan, tail-side risk tables, promoted tail rows, restricted result matrix, ES severity diagnostics, claim-scope reference, and configuration robustness.
- The complete generated figure and table map is maintained in Results Snapshot, which covers both result interpretation and artifact placement.
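The sign convention of the DM heatmaps listed above can be sketched as follows. This is a minimal illustration that assumes the two loss series are already restricted to the strict common sample; the function name, the Bartlett-kernel long-run variance, and the default lag of zero are assumptions, not the registered inference settings.

```python
import numpy as np

def dm_statistic(row_loss, col_loss, lags=0):
    """Diebold-Mariano statistic for row-minus-column loss differences.

    Negative values favor the row model. `lags` sets the Bartlett-kernel
    truncation for the long-run variance (0 gives the plain sample variance).
    """
    d = np.asarray(row_loss, dtype=float) - np.asarray(col_loss, dtype=float)
    n = d.size
    centered = d - d.mean()
    long_run_var = centered @ centered / n
    for k in range(1, lags + 1):
        autocov = centered[k:] @ centered[:-k] / n
        long_run_var += 2.0 * (1.0 - k / (lags + 1)) * autocov
    return d.mean() / np.sqrt(long_run_var / n)

# Toy losses (hypothetical): the row model has lower loss on every date.
toy_stat = dm_statistic([1.0, 1.0, 2.0, 1.0], [2.0, 3.0, 3.0, 3.0])
```

A heatmap cell would then hold the mean difference or this statistic for one row-candidate and column-anchor pair.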
6.4 Appendix Configuration Robustness¶
- The primary design compares pre-specified point-in-time forecast specifications.
- `just sensitivity` is fixed to the post-24-check paper set: `GJR-GARCH-EVT`, `LGBM POT-GPD plain MLE (C)`, and `LGBM POT-GPD UniBM (C)`.
- The sensitivity run varies nearby LightGBM capacity only for the two pass-all C-information LGBM+EVT families.
- POT threshold sensitivity reports forecastable thresholds `0.875` and `0.925` for the same post-24-check set, bracketing the registered primary threshold `0.90`.
- At 95% VaR, threshold `0.95` is recorded only as a boundary diagnostic with status `not_applicable_threshold_not_below_tail_level`.
- Sensitivity artifacts live under `reports/runs/<run_id>/sensitivity/` and carry `primary_claim_allowed=false`.
- Robustness labels describe conclusion stability versus the registered primary specification. They do not feed the cross-suite FZ DM heatmap, promoted rows, result-matrix selection, or selected-model figures.
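The boundary status for threshold 0.95 follows from the standard POT-GPD quantile formula, which only extrapolates when the threshold quantile lies strictly below the VaR level. The sketch below assumes that textbook formula; the function names and the `"applicable"` label are hypothetical, and only the not-applicable status string comes from this plan.

```python
import math

# Status string taken from the plan; the "applicable" label below is hypothetical.
NOT_APPLICABLE = "not_applicable_threshold_not_below_tail_level"

def threshold_status(threshold_q, tail_level):
    """Classify a POT threshold quantile against the VaR tail level."""
    return "applicable" if threshold_q < tail_level else NOT_APPLICABLE

def pot_var(u, beta, xi, threshold_q, tail_level):
    """Textbook POT-GPD quantile above threshold value u.

    beta is the GPD scale and xi the GPD shape; for a filtered-EVT route
    these would live on the standardized-residual scale.
    """
    ratio = (1.0 - tail_level) / (1.0 - threshold_q)
    if abs(xi) < 1e-12:  # exponential-tail limit of the GPD
        return u - beta * math.log(ratio)
    return u + beta / xi * (ratio ** (-xi) - 1.0)

statuses = {q: threshold_status(q, 0.95) for q in (0.875, 0.90, 0.925, 0.95)}
```

With the threshold quantile equal to the tail level the ratio is one and the formula returns `u` itself, so no tail extrapolation takes place.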
7. Manuscript Structure¶
- Introduction:
- state the pre-open tail-risk problem;
- explain why the OSE night session makes the U.S. close question nontrivial;
- state the nested information-set design;
- preview the calibration-versus-loss tension in ML tail forecasts.
- Institutional setting:
- describe OSE day/night trading;
- define the U.S. close cutoff;
- state the point-in-time rule.
- Materials:
- describe Japanese futures and options data;
- describe U.S. ETF, minute, option, rates, FX, volatility, and credit data;
- describe preprocessing, contract rolls, calendar joins, and feature blocks.
- Methods:
- define target, left/right losses, VaR, and ES;
- describe benchmark and advanced econometric models;
- describe LightGBM direct quantile and filtered-tail models;
- describe POT-GPD shape, scale, and ES gates;
- describe evaluation metrics and inference.
- Results:
- begin with sample, timing, and target-tail diagnostics;
- report baseline benchmark calibration;
- report ML-tail nested information sets separately for left and right tails;
- report restricted model-family comparisons;
- report EVT, ES severity, and stress-window diagnostics as supporting evidence.
- Discussion:
- interpret U.S. close information content;
- distinguish downside and upside risk;
- discuss VaR coverage before loss-based claims;
- explain where LightGBM+EVT is useful and where sample gates are still limiting.
- Conclusion:
- summarize the predictive evidence;
- state limitations from coverage drift, FRED vintages, EVT sample size, and missing U.S.-close Nikkei futures marks;
- define the next empirical extension only where it sharpens interpretation of the current OSE pre-open tail-risk design.
8. Claim Boundaries¶
- No structural causal spillover claim.
- No price-discovery claim.
- No claim that left-tail and right-tail mechanisms are identical.
- No deployment claim from historical OHLC data.
- No `residual_usclosemark_to_open` claim without licensed timestamped intraday Nikkei futures marks.
- No claim that LightGBM-EVT is a new ML algorithm.
- No options-risk primary claim unless historical options entitlement, timestamp safety, and liquidity gates pass.
- No model-family ranking claim from restricted short samples.
9. Appendix And Source Notes¶
9.1 Source Notes¶
- JPX Nikkei 225 Futures contract specifications: Nikkei 225 Futures | Japan Exchange Group
- JPX derivatives trading hours: Trading Hours | Derivatives | Japan Exchange Group
- J-Quants plan coverage: Available APIs and Data Periods per Plan | J-Quants API
- J-Quants data timing: Update Timing of Provided Data | J-Quants API
- Massive.com stock-market timestamp semantics: Stocks Overview | Massive.com
- NYSE trading hours and early closes: Holidays and Trading Hours | NYSE
- FRED observations API: fred/series/observations | FRED
- Cboe VIX historical data: VIX Index Historical Data | Cboe
- CME Nikkei products: Nikkei 225 futures | CME Group
9.2 Literature Notes¶
- McNeil-Frey filtered EVT: conditional volatility/body filtering followed by EVT tail estimation.
- Basel VaR backtesting: exception-counting intuition for VaR validation; this paper reports coverage and exception diagnostics but does not apply regulatory traffic-light capital zones.
- CAViaR: dynamic quantile modeling for VaR.
- CARE and expectile-based models: expectile links to tail-risk measures.
- GAS models: score-driven updating for dynamic conditional distributions.
- Fissler-Ziegel scoring: joint VaR-ES evaluation.
- Murphy diagrams: sensitivity of forecast comparison to scoring-function choice.
- Diebold-Mariano: test of equal average out-of-sample loss between two forecasts (unconditional predictive ability).
- Diagnostic admissibility across nested information sets:
- VaR backtesting and interval-forecast evaluation: Kupiec 1995, Christoffersen 1998.
- Risk-model validation and governance: BIS MAR99, Federal Reserve SR 11-7.
- Proper scoring and joint VaR-ES scoring: Gneiting and Raftery 2007, Fissler and Ziegel 2016.
- Forecast comparison: Diebold and Mariano 1995.
- Specification robustness and robust satisficing: Simonsohn, Simmons, and Nelson 2020, Schwartz, Ben-Haim, and Dacso 2011, Ben-Haim 2014.
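As a worked instance of the exception-counting logic in the Kupiec 1995 entry, the sketch below computes the unconditional-coverage likelihood ratio for a 5% nominal exception rate. The function name and the toy counts are hypothetical; the paper's own coverage screen may combine this with other checks.

```python
import math

def kupiec_pof(exceptions, n_obs, nominal_rate=0.05):
    """Kupiec proportion-of-failures test.

    Returns the likelihood-ratio statistic and its chi-square(1) p-value.
    """
    x, n = exceptions, n_obs

    def log_lik(rate):
        # Binomial log-likelihood, skipping terms with a zero count.
        out = 0.0
        if x > 0:
            out += x * math.log(rate)
        if n - x > 0:
            out += (n - x) * math.log(1.0 - rate)
        return out

    lr = max(-2.0 * (log_lik(nominal_rate) - log_lik(x / n)), 0.0)
    # For one degree of freedom, P(chi2 > lr) = erfc(sqrt(lr / 2)).
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value

# Toy counts (hypothetical): 15 exceptions in 100 forecasts at 95% VaR.
toy_lr, toy_p = kupiec_pof(15, 100)
```

A small p-value signals that the observed exception frequency is inconsistent with the nominal rate. It says nothing about clustering, which is what the Christoffersen 1998 interval-forecast tests address.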
9.3 Reproducibility Notes¶
- The generated results snapshot is the evidence map; this paper plan is the manuscript design.
- Data source details are maintained in `docs/data.md`.
- Current result tables and figure provenance are maintained in `docs/results_snapshot.md`.
- The canonical run artifacts live under `REPORTS_DIR/runs/<run_id>/`.
- Paper-facing tables and figures are emitted under `REPORTS_DIR/runs/<run_id>/latex/`.
- The local data root should be external storage, either through absolute `.env` paths or a repo-local `data/` symlink that resolves outside the cloud-synced repo. `REPORTS_DIR` can remain local because generated reporting artifacts are comparatively small.
- `table_manifest.json` and `figure_manifest.json` provide source-artifact and claim-scope traceability for the generated paper-facing outputs.
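The directory conventions above can be sketched with a small helper. This is an illustration only: the function names, the fallback to a local `reports` directory when `REPORTS_DIR` is unset, and the check for a data root outside the repository are assumptions, not the project's actual tooling.

```python
import os
from pathlib import Path

def run_layout(run_id, reports_dir=None):
    """Return the run, latex, and sensitivity directories for one run id."""
    # Assumed fallback: a local "reports" directory when REPORTS_DIR is unset.
    base = Path(reports_dir or os.environ.get("REPORTS_DIR", "reports"))
    run_dir = base / "runs" / run_id
    return {
        "run": run_dir,
        "latex": run_dir / "latex",
        "sensitivity": run_dir / "sensitivity",
    }

def resolves_outside(path, repo_root):
    """True when `path`, after following symlinks, is not inside `repo_root`."""
    target = Path(path).resolve()
    root = Path(repo_root).resolve()
    return target != root and root not in target.parents

layout = run_layout("demo_run", reports_dir="/tmp/reports")
```

Calling `resolves_outside("data", ".")` from the repository root would confirm that a repo-local `data/` symlink points at external storage.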