Paper Plan

Working Title

U.S. Close Information and Pre-Open Tail Risk in OSE Nikkei 225 Futures

This page is the paper-facing manuscript blueprint. It follows the order of a finance paper: introduction, literature and gap, contribution, materials and methods, registered experiments, expected results/discussion, claim boundaries, and appendix/source notes.

1. Introduction

1.1 Overview And Why This Work

  • The paper asks whether information observed by the U.S. cash-market close helps forecast the tail risk of the next Osaka Exchange (OSE) Nikkei 225 Futures day-session open.
  • The primary empirical object is the settlement-to-open gap of the Nikkei 225 Futures large contract:

gap_t = log(day_session_open_t) - log(previous_settlement_{t-1}).

  • The same gap is evaluated as two loss surfaces:
    • left_tail: downside opening-gap risk, realized_loss_t = -gap_t;
    • right_tail: upside opening-gap risk, realized_loss_t = gap_t.
  • The registered primary tail level is 95% Value-at-Risk (VaR), with a nominal 5% exception rate.
  • The empirical question is predictive and out-of-sample. It is not a structural causal design.
  • Why this setting is useful:
    • the target is economically concrete: a futures opening gap relative to a settlement reference;
    • the forecast origin is observable before the next OSE open;
    • the information experiment is naturally nested: Japan-only history, then U.S. close information, then Japan and Asia proxy blocks;
    • tail-risk claims can be disciplined by VaR coverage tests before reading average loss improvements.
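The gap definition and the two loss surfaces above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names and the example prices are hypothetical.

```python
import math


def opening_gap(day_session_open: float, previous_settlement: float) -> float:
    """Settlement-to-open log gap: log(open_t) - log(settlement_{t-1})."""
    return math.log(day_session_open) - math.log(previous_settlement)


def tail_losses(gap: float) -> dict:
    """Map one gap onto both loss surfaces, each in positive loss units."""
    return {"left_tail": -gap, "right_tail": gap}


# Hypothetical prices, for illustration only: the open is below the prior settlement.
gap = opening_gap(day_session_open=38_000.0, previous_settlement=38_400.0)
losses = tail_losses(gap)
```

A downward gap is a positive left-tail loss and a negative right-tail loss, so the same observation contributes to both surfaces with opposite signs.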

1.2 Market Context

  • OSE Nikkei 225 Futures trade in both day and night sessions.
  • The U.S. cash close occurs before the next OSE day-session open, but the Japanese night session means that some U.S. information may already be reflected before the opening auction.
  • The paper therefore studies pre-open tail risk, not a generic close-to-close or overnight-return problem.
  • The forecast origin is the matched U.S. cash-market close plus the registered vendor-data availability lag.
  • The point-in-time condition is:

feature_available_ts_utc <= model_cutoff_ts_utc < target_open_ts_utc.
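The point-in-time condition is a chained comparison on three timestamps. The sketch below assumes timezone-aware UTC datetimes; the function name, the 15-minute vendor lag, and the example dates are hypothetical and only illustrate the ordering.

```python
from datetime import datetime, timezone


def is_point_in_time_valid(
    feature_available_ts_utc: datetime,
    model_cutoff_ts_utc: datetime,
    target_open_ts_utc: datetime,
) -> bool:
    """True when the feature is available by the cutoff and the cutoff precedes the open."""
    return feature_available_ts_utc <= model_cutoff_ts_utc < target_open_ts_utc


# Hypothetical timestamps: a 20:00 UTC U.S. close plus an assumed 15-minute vendor lag,
# and the 08:45 JST open of the next OSE day session, which is 23:45 UTC.
cutoff = datetime(2024, 3, 14, 20, 15, tzinfo=timezone.utc)
target_open = datetime(2024, 3, 14, 23, 45, tzinfo=timezone.utc)
on_time = datetime(2024, 3, 14, 20, 5, tzinfo=timezone.utc)
too_late = datetime(2024, 3, 14, 20, 30, tzinfo=timezone.utc)

on_time_ok = is_point_in_time_valid(on_time, cutoff, target_open)
too_late_ok = is_point_in_time_valid(too_late, cutoff, target_open)
```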

1.3 Literature Review And Existing Results

  • International information transmission:
    • The empirical setting is cross-market and timing-sensitive: U.S. equity, rates, volatility, FX, credit, and proxy-ETF information is observed before the Japanese futures open.
    • The paper does not claim price discovery or structural spillover identification.
  • VaR and ES forecasting:
    • The study evaluates one-day-ahead opening-gap VaR and expected shortfall (ES) in positive loss units.
    • VaR calibration is assessed through exception rates and coverage tests.
    • ES enters through valid VaR-ES forecast pairs and Fissler-Ziegel (FZ) joint scoring.
  • Dynamic quantile and tail models:
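    • Econometric comparators include historical quantiles, volatility-scaled quantiles, GARCH/GJR-GARCH, CAViaR (conditional autoregressive VaR), CARE (conditional autoregressive expectile) and related expectile models, and Generalized Autoregressive Score (GAS) models.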
    • Paper-facing evaluation terminology uses Fissler-Ziegel loss for the joint VaR-ES score.
    • Machine-learning models use LightGBM as a flexible tabular forecaster, not as a new algorithmic contribution.
  • Filtered extreme value theory (EVT):
    • The EVT component follows the filtered-tail logic: use a conditional model to remove body/scale variation, then fit a peaks-over-threshold generalized Pareto (POT-GPD) tail model to exceedances.
    • Plain fixed-location POT-GPD is the registered EVT estimator.
  • Forecast comparison:
    • Average loss comparisons use paired out-of-sample losses.
    • Diebold-Mariano (DM) inference is interpreted as unconditional, average-sample evidence.
  • Model-validation robustness:
    • The paper separates scalar forecast ranking from pass/fail risk-model adequacy.
    • Quantile loss and Fissler-Ziegel loss rank average predictive performance; Kupiec and Christoffersen tests assess VaR calibration and exception dynamics.
    • A diagnostic-admissibility profile summarizes whether a model remains acceptable across tail sides and nested information sets. This is an information-set robustness or robust-satisficing idea.
  • Existing evidence in the current research run:
    • the clean forecast sample is large enough for a full-sample out-of-sample (OOS) comparison from 2018-06-20 to 2026-05-22, but still thin in realized 5% tail events;
    • direct LightGBM quantile rows are useful information-set comparators but fail the current all-scenario calibration screen;
    • GJR-GARCH-EVT and two LightGBM+EVT families form the comparison set that survives the 24-check calibration screen (Section 3.11) and enters the current FZ DM heatmap;
    • the current manuscript story should therefore lead with calibration robustness, then report loss and information-set gains among admissible models.

1.4 Research Gap

  • Standard international-transmission work is usually about returns, volatility, or price discovery, not the VaR/ES risk of the next OSE futures day-session open under a strict point-in-time U.S. close cutoff.
  • Standard VaR/ES forecast comparisons often rank models by average scores without first asking whether a model remains usable across sparse and rich information sets.
  • Flexible ML quantile methods can improve average loss while producing exception rates that are too high for risk-model claims; this paper makes that tension visible rather than hiding it behind a single ranking.
  • Filtered EVT models are natural for heavy-tailed standardized losses, but the empirical question is whether the filtered-tail route remains stable under actual market timing, nested predictors, and finite tail-event counts.

1.5 Contributions

  • A point-in-time OSE pre-open tail-risk dataset and timing design linking J-Quants Nikkei 225 Futures data to U.S. close market information.
  • A nested information-set experiment that separates Japan-only history, U.S. close core variables, Japan proxy variables, and Asia proxy variables.
  • A benchmark-versus-ML tail-risk comparison that evaluates VaR calibration, quantile loss, and Fissler-Ziegel joint VaR-ES loss in one consistent positive-loss convention.
  • A post-coverage-screen comparison design: headline comparisons are made only among models that pass the current calibration/admissibility screen.
  • A generated evidence map connecting every table and figure to source artifacts and claim scope, so manuscript statements remain traceable.

1.6 Research Questions

  • Does U.S. close information add predictive content beyond Japan-only history?
  • Is most of the marginal content captured by core U.S. close variables, or do Japan and Asia proxy blocks add further information?
  • Do the left and right tails display different patterns in calibration, loss, and timing diagnostics?
  • Are LightGBM direct quantile forecasts well calibrated at the 95% VaR level?
  • Do LightGBM body filters combined with POT-GPD tail extrapolation improve VaR/ES behavior relative to direct 95% quantile forecasts?
  • Are loss differentials related to ex-ante observables such as VIX or calendar conditions?

2. Materials And Data Description

2.1 Sample, Market, And Evaluation Window

  • Current clean evaluation window: 2018-06-20 to 2026-05-22.
  • Current forecast-sample size: 1722 trading-day observations.
  • The current clean run is a research-candidate evidence set, not a final manuscript freeze.
  • The current primary level is 95% VaR/ES.

2.2 Market Description And Target Contract

  • The Osaka Exchange day session opens at 08:45 JST and follows a prior settlement reference for the Nikkei 225 Futures large contract.
  • The OSE night session overlaps the U.S. trading day, so U.S. close information is not simply "overnight" relative to the Japanese futures market.
  • The empirical design therefore locks a forecast origin after the matched U.S. cash close and evaluates the next OSE day-session opening gap.
  • This market design makes timing alignment part of the empirical question, not only a data-cleaning detail.

  • Primary target:

    • Settlement-to-open gap: log day-session open minus log previous settlement.
    • This is the main target because settlement is the economically standard daily futures reference.
  • Secondary target:
    • Close-to-open gap: log day-session open minus log previous day-session close.
    • This provides an alternative opening-gap reference.
  • Absorption robustness target:
    • Night-close-to-open gap: log day-session open minus log night-session close.
    • This is available only when the night close is observed and point-in-time valid.
  • Deferred target:
    • U.S.-close-mark-to-open gap: log day-session open minus a timestamped Nikkei futures mark at the U.S. cash close.
    • This requires licensed intraday OSE, CME, SGX, or equivalent Nikkei futures marks.

2.3 Japanese Data

  • J-Quants Premium provides the domestic futures data used for the current target and Japan-only predictors:
    • Nikkei 225 Futures large-contract OHLC fields;
    • settlement price;
    • day-session and night-session prices where available;
    • volume;
    • open interest;
    • roll and special-quotation (SQ) calendar variables.
  • Lagged Japanese futures history supplies:
    • prior settlement and prior day-session close;
    • lagged gap and loss variables;
    • rolling volatility;
    • rolling 95% loss quantile;
    • volume and open-interest state;
    • contract-roll and days-to-SQ variables.
  • J-Quants Nikkei 225 large options (NK225E) are treated as domestic option-state predictors when enabled and audited:
    • lagged option-chain aggregates;
    • prior available implied-volatility proxies;
    • night-session option OHLC summaries;
    • option volume, open interest, and days-to-SQ features.
  • Same target-date option rows are not used as predictors for that target date.

2.4 U.S. And Cross-Market Data

  • Massive daily data supply U.S. and regional market predictors:
    • broad U.S. ETFs: SPY, QQQ, DIA, IWM;
    • sector ETFs: XLK, XLF, XLE, XLV, XLI, XLY, XLP, XLB, XLU, XLC;
    • cross-asset ETFs: TLT, GLD, USO, SMH, HYG, LQD;
    • Japan proxies: EWJ, DXJ;
    • Asia and regional proxies: EEM, FXI, EWY, EWT, EWH.
  • Massive minute data supply late-session U.S. predictors:
    • last-30-minute and last-60-minute returns;
    • realized variance;
    • upside and downside semivariance;
    • late-session range;
    • final-window momentum;
    • volume pressure and volume-surge variables.
  • Massive OPRA day aggregates are used only for opt-in historical option-feature reconstruction and are excluded from the canonical full-history run by default:
    • core U.S. options enter the U.S. core block;
    • sector and semiconductor options enter as aggregate U.S. market-state variables;
    • Japan ETF and Japanese ADR option aggregates enter the Japan proxy block;
    • Asia proxy option aggregates enter the Asia proxy block.
  • Massive live option snapshots are not used for historical backfill.

2.5 FRED, Cboe, FX, Rates, Volatility, And Credit Controls

  • FRED supplies macro-financial controls:
    • Treasury yields: DGS2, DGS10;
    • term spread: T10Y2Y;
    • H.10 USD/JPY: DEXJPUS;
    • VIX close where available through VIXCLS;
    • credit-spread controls, including high-yield and investment-grade spread series when enabled in the clean run.
  • Cboe supplies volatility-index predictors:
    • VIX close;
    • VIX range and related volatility-state variables where available.
  • FRED variables use conservative publication-lag controls.
  • FRED predictors do not use unrevised real-time ALFRED vintages. This is a data-vintage limitation, not a look-ahead-bias failure.
  • The canonical USD/JPY control is FRED DEXJPUS; U.S.-listed dollar ETFs such as UUP are risk proxies, not a replacement for USD/JPY.

2.6 Pretreatment And Data Discipline

  • Every row carries separate event, source, availability, cutoff, and target timestamps.
  • A predictor can enter only if its availability timestamp is no later than the model cutoff.
  • Data are staged through cache-first bronze/silver/gold artifacts:
    • bronze: source-shaped cached data;
    • silver: cleaned and source-specific intermediate data;
    • gold: modeling panel and evaluation artifacts.
  • Contract rolls and calendar joins are audited before model evaluation.
  • Missingness, duplicate rows, source coverage, and calendar alignment are recorded in run artifacts.
  • The current clean run includes a narrow timestamp-safe event-calendar layer: BOJ same-OSE-session information in the Japan-only set, and FOMC, CPI, NFP/payroll, plus simple major-event intensity controls from the U.S. close core set onward. Broader Japan macro-event expansion remains candidate work.

2.7 Feature Engineering And Nested Information Sets

  • The information sets are nested by design:
    • japan_only;
    • japan_only_plus_us_close_core;
    • japan_only_plus_us_close_core_plus_japan_proxy;
    • japan_only_plus_us_close_core_plus_japan_proxy_plus_asia_proxy.
  • japan_only includes:
    • target history;
    • lagged Japanese futures variables;
    • rolling volatility and tail-loss history;
    • volume and open-interest state;
    • Japanese calendar, contract-roll, and SQ variables;
    • lagged domestic option state when enabled and audited.
  • japan_only_plus_us_close_core adds:
    • broad U.S. ETF daily and late-session information;
    • sector ETF state;
    • U.S. rates, volatility, FX, credit, dollar-risk, and cross-asset controls;
    • core U.S. option aggregates when enabled and audited.
  • japan_only_plus_us_close_core_plus_japan_proxy adds:
    • EWJ and DXJ daily and minute features;
    • Japan ETF option aggregates;
    • Japanese ADR spot and option aggregate state.
  • japan_only_plus_us_close_core_plus_japan_proxy_plus_asia_proxy adds:
    • Asia and regional ETF features;
    • Asia proxy option aggregates when enabled and audited.
  • These blocks test marginal predictive content. They are not an exhaustive variable search.
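The nesting can be made explicit by building each information set as the cumulative union of predictor blocks. The sketch below is illustrative: the information-set names are the registered ones listed above, while the block labels and the helper function are hypothetical.

```python
# Each rung adds one predictor block on top of everything below it.
# Block labels such as "japan_history" are hypothetical placeholders.
LADDER = [
    ("japan_only", {"japan_history"}),
    ("japan_only_plus_us_close_core", {"us_close_core"}),
    ("japan_only_plus_us_close_core_plus_japan_proxy", {"japan_proxy"}),
    (
        "japan_only_plus_us_close_core_plus_japan_proxy_plus_asia_proxy",
        {"asia_proxy"},
    ),
]


def build_information_sets(ladder):
    """Return {name: frozenset of blocks}, accumulating blocks rung by rung."""
    information_sets = {}
    cumulative = set()
    for name, added_blocks in ladder:
        cumulative = cumulative | added_blocks
        information_sets[name] = frozenset(cumulative)
    return information_sets


info_sets = build_information_sets(LADDER)
names = [name for name, _ in LADDER]

# Each set must be a strict subset of the next one for the design to be nested.
is_nested = all(info_sets[a] < info_sets[b] for a, b in zip(names, names[1:]))
```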

3. Methods

This section defines the empirical procedure after data construction: forecast origin, benchmark and ML-tail model families, EVT calibration, performance metrics, inference, and the criteria used to decide which comparisons are paper-facing.

3.1 Pipeline Structure

| Step | Layer | Purpose |
| --- | --- | --- |
| 1 | Vendor and calendar sources | Pull or read J-Quants, Massive, FRED, Cboe, and exchange-calendar inputs. |
| 2 | Bronze and silver cache | Preserve typed vendor/cache rows, then normalize point-in-time research features. |
| 3 | Gold modeling panel | Join targets, calendar map, feature coverage, and leakage-bound signatures. |
| 4 | Leakage and coverage gates | Enforce timestamp ordering and sample eligibility before evaluation. |
| 5 | Baseline benchmarks and ML-tail registry | Run target-history/econometric baseline benchmarks and LightGBM tail-model families. |
| 6 | Metrics, inference, diagnostics | Build loss matrices, DM/Murphy diagnostics, stress windows, and result matrix artifacts. |
| 7 | Results snapshot | Summarize run-specific evidence and claim boundaries for reader review. |

  • Data-access and cache artifacts live under data/bronze and data/silver.
  • Durable modeling evidence lives under data/gold.
  • Forecasts, metrics, diagnostics, and LaTeX exports live under reports/runs/<run_id>.
  • Reporting rebuilds read from gold and reports; they must not trigger vendor data calls.

3.2 Model And Evaluation Protocol

  • The registered risk level is tail_level = 0.95; the nominal VaR exception rate is 5%.
  • A VaR exception is counted when realized_loss > var_forecast.
  • Forecast evaluation uses coverage diagnostics, Kupiec/Christoffersen tests where available, quantile loss, Fissler-Ziegel joint VaR-ES loss, and DM inference.
  • Benchmarks use target-history information only.
  • ML-tail models add predictors through fixed nested information sets.
  • DM inference is read as unconditional average-sample forecast-comparison evidence.
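For reference, the Kupiec unconditional coverage test compares the observed exception rate with the nominal 5% rate through a likelihood ratio with one degree of freedom. The sketch below is a standard textbook form, not the project's code: the function name and the exception counts are hypothetical, and only the sample size of 1722 comes from this plan.

```python
import math


def kupiec_pof(exceptions: int, n_obs: int, nominal_rate: float = 0.05):
    """Kupiec proportion-of-failures test: returns (LR statistic, p-value)."""

    def log_likelihood(rate: float) -> float:
        # Binomial log-likelihood, skipping 0 * log(0) terms.
        value = 0.0
        if exceptions > 0:
            value += exceptions * math.log(rate)
        if n_obs - exceptions > 0:
            value += (n_obs - exceptions) * math.log(1.0 - rate)
        return value

    observed_rate = exceptions / n_obs
    lr_stat = -2.0 * (log_likelihood(nominal_rate) - log_likelihood(observed_rate))
    lr_stat = max(lr_stat, 0.0)  # guard against tiny negative rounding error
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr_stat / 2.0))
    return lr_stat, p_value


# Hypothetical exception counts on a 1722-day sample.
lr_close, p_close = kupiec_pof(exceptions=86, n_obs=1722)
lr_far, p_far = kupiec_pof(exceptions=130, n_obs=1722)
```

A count near 5% of the sample leaves the null of correct coverage unrejected, while a clearly higher count rejects it.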

3.3 Forecasting Protocol

  • All models use the same point-in-time forecasting protocol.
  • The minimum training-history requirement is common across model families.
  • Most specifications use expanding pre-forecast training histories.
  • The rolling empirical quantile benchmark is the exception: it uses the most recent 1,000 clean observations by design.
  • ML tail models are refit monthly using expanding training windows.
  • LightGBM hyperparameters are held fixed across information sets and refit dates, so that differences between rows cannot come from a data-dependent tuning search.
  • Forecasts are stored in positive loss units.
  • A VaR exception is always:

realized_loss_t > var_forecast_t.
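The difference between an expanding history and the 1,000-observation rolling window can be illustrated with an empirical quantile forecast and the exception rule above. This is a sketch under assumptions: the function names are hypothetical and the loss history is a synthetic ramp chosen so the quantiles are easy to check by hand.

```python
import numpy as np


def empirical_var(loss_history, tail_level=0.95, window=None):
    """Empirical loss quantile from pre-forecast history.

    window=None uses the full (expanding) history; an integer uses only
    the most recent `window` observations (rolling).
    """
    history = np.asarray(loss_history, dtype=float)
    if window is not None:
        history = history[-window:]
    return float(np.quantile(history, tail_level))


def is_exception(realized_loss: float, var_forecast: float) -> bool:
    """A VaR exception in positive loss units: realized_loss > var_forecast."""
    return realized_loss > var_forecast


# Synthetic history 0, 1, ..., 1999, for illustration only.
history = np.arange(2000)
expanding_var = empirical_var(history)             # uses all 2000 points
rolling_var = empirical_var(history, window=1000)  # uses the last 1000 points
```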

3.4 Baseline Benchmarks

  • The baseline benchmarks are target-history and econometric:
    • historical empirical quantile;
    • rolling empirical quantile;
    • EWMA or volatility-scaled quantile;
    • GARCH with Student-t innovations;
    • GJR-GARCH with Student-t innovations;
    • GJR-GARCH-EVT in the McNeil-Frey filtered-EVT tradition.
  • These models establish the external VaR/ES reference before adding high-dimensional cross-market predictors.

3.5 Advanced Econometric Benchmarks

  • Advanced econometric benchmarks are implemented to widen the peer comparison:
    • CAViaR;
    • CARE and expectile-based tail models;
    • Generalized Autoregressive Score (GAS) models.
  • These rows are claim-gated.
  • Numerical convergence and common-sample availability determine how they are used in the paper.

3.6 LightGBM Direct Quantile

  • lightgbm_direct_quantile estimates the conditional 95% loss quantile directly:

VaR_t = q_0.95(realized_loss_t | X_t).

  • It uses LightGBM with a quantile objective.
  • It is the cleanest specification for evaluating nested information sets.
  • Its ES companion is empirical rather than a separate ES model.
  • Current evidence shows that direct quantile rows must be read together with coverage diagnostics because lower average loss can coincide with higher exception rates.

3.7 LightGBM Location-Scale Empirical Tail

  • lightgbm_location_scale_empirical separates conditional body learning from tail calibration:
    • first-stage LightGBM estimates a conditional mean-like location with an L2 objective;
    • second-stage LightGBM estimates log absolute residual scale;
    • Duan-style smearing maps the scale estimate back to original units;
    • out-of-fold standardized losses are used for empirical VaR/ES calibration.
  • This is the main non-EVT filtered-tail comparator inside the LightGBM family.

3.8 LightGBM Standardized-Loss POT-GPD

  • The standardized-loss POT-GPD family uses the same location-scale body filter, then fits a GPD to standardized-loss exceedances.
  • Current registered variants:
    • lightgbm_standardized_loss_pot_gpd_plain_mle;
    • lightgbm_standardized_loss_pot_gpd_unibm.
  • Plain MLE is the registered fixed-location POT-GPD estimator and remains the standard comparator.
  • The UniBM route keeps the same LightGBM mean/log-scale body filter and POT threshold, but replaces the MLE shape estimate with a UniBM block-maxima-derived estimate of xi; the GPD scale is then refit with xi fixed.
  • This is a shape-estimator diagnostic variant, not a new primary ML specification.

3.9 LightGBM Robust Body Filters

  • New research-candidate LightGBM+EVT models are implemented at the 95% level only and remain outside the primary ML table until post-rerun review:
    • lightgbm_median_mad_pot_gpd_plain_mle;
    • lightgbm_median_iqr_pot_gpd_plain_mle.
  • Median/MAD route:
    • LightGBM q50 estimates conditional median location;
    • LightGBM L1 regression estimates conditional median absolute residual scale;
    • the MAD normalization factor is recorded in artifacts.
  • Median/IQR route:
    • LightGBM q25, q50, and q75 estimate conditional quantiles;
    • scale is (q75 - q25) / 1.349;
    • quantile crossing is handled and recorded.
  • These routes test whether a more robust body filter improves the filtered tail supplied to POT-GPD.
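The median/IQR standardization can be sketched as follows. The divisor 1.349 is the one stated above; how quantile crossing is handled is not specified in this plan, so the row-wise sort used here, the scale floor, and the function name are assumptions made for illustration.

```python
import numpy as np


def median_iqr_standardize(loss, q25, q50, q75, min_scale=1e-8):
    """Standardize losses with a median location and an IQR-based scale.

    Crossing is repaired by sorting the three quantiles per row (an
    assumption), and a flag records which rows needed the repair.
    """
    loss = np.asarray(loss, dtype=float)
    raw = np.column_stack([q25, q50, q75]).astype(float)
    crossed = (raw[:, 0] > raw[:, 1]) | (raw[:, 1] > raw[:, 2])
    ordered = np.sort(raw, axis=1)
    location = ordered[:, 1]
    scale = np.maximum((ordered[:, 2] - ordered[:, 0]) / 1.349, min_scale)
    standardized = (loss - location) / scale
    return standardized, scale, crossed


# Hypothetical quantile forecasts; the second row has q25 > q50 (crossing).
z, scale, crossed = median_iqr_standardize(
    loss=[0.02, 0.0],
    q25=[-0.01, 0.005],
    q50=[0.0, 0.0],
    q75=[0.01, 0.01],
)
```

The standardized losses `z` are what a POT-GPD tail fit would then consume.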

3.10 EVT Details

  • POT-GPD is applied only to strictly positive exceedances.
  • The GPD location is fixed at zero for exceedances.
  • The base shape estimate is fixed-location maximum likelihood:

stats.genpareto.fit(excesses, floc=0.0).

  • The registered EVT estimator uses the fixed-location MLE shape directly.
  • The UniBM comparison estimates the GPD shape xi as an extreme value index, taken from the selected-plateau slope of a scaling regression on sliding block-maxima summaries. Note that xi is not the Pareto tail index alpha: under the convention P(X > x) ~ x^{-alpha}, the two are reciprocals, xi = 1 / alpha.
  • UniBM failures are fail-closed and reported as unavailable; they are not silently replaced by plain MLE.
  • ES is available only when the fitted shape implies a finite ES.
  • If the shape is negative, finite-endpoint support is checked before accepting the extrapolated quantile.
  • EVT diagnostics include:
    • log survival plots;
    • QQ plots;
    • mean excess plots;
    • Hill/EVI paths;
    • threshold stability;
    • extremal-index diagnostics;
    • raw versus filtered tail summaries.
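The fixed-location fit and the resulting tail extrapolation can be sketched with the standard POT formulas for VaR and ES. This is an illustration under assumptions, not the registered implementation: the function name is hypothetical, the 0.90 threshold level is taken as the registered primary threshold mentioned in the appendix plan, and the input is simulated Student-t data rather than filtered losses.

```python
import numpy as np
from scipy import stats


def pot_gpd_var_es(standardized_loss, tail_level=0.95, threshold_level=0.90):
    """Fit a fixed-location GPD to exceedances and extrapolate VaR and ES."""
    z = np.asarray(standardized_loss, dtype=float)
    u = float(np.quantile(z, threshold_level))
    excesses = z[z > u] - u                      # strictly positive exceedances
    xi, _, beta = stats.genpareto.fit(excesses, floc=0.0)
    exceed_prob = excesses.size / z.size         # empirical P(Z > u)
    ratio = (1.0 - tail_level) / exceed_prob
    if abs(xi) < 1e-8:                           # exponential-tail limit
        var = u - beta * np.log(ratio)
    else:
        var = u + beta / xi * (ratio ** (-xi) - 1.0)
    # ES is finite only when xi < 1; otherwise report it as unavailable.
    es = (var + beta - xi * u) / (1.0 - xi) if xi < 1.0 else float("nan")
    return {"xi": xi, "beta": beta, "threshold": u, "var": float(var), "es": float(es)}


# Simulated heavy-tailed input, for illustration only.
rng = np.random.default_rng(0)
out = pot_gpd_var_es(rng.standard_t(4, size=5000))
```

The returned VaR and ES are in standardized units; mapping them back to loss units requires the location and scale forecasts from the body filter.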

3.11 Performance Metrics, Selection Criteria, And Inference

  • VaR calibration:
    • empirical breach rate;
    • exception count;
    • deviation from the nominal 5% exception rate;
    • Kupiec unconditional coverage test;
    • Christoffersen independence or conditional coverage test where sample size permits.
  • Why calibration comes first:
    • VaR is a risk-limit object, so an apparently low loss is not enough if realized exceptions are too frequent or clustered;
    • the paper's headline claim is robustness across tail sides and information sets, so pass/fail calibration evidence must precede any model-win language.
  • VaR loss:
    • quantile loss on paired out-of-sample forecasts.
  • Why quantile loss is retained:
    • it is a consistent scoring function for a quantile, so it scores VaR on its own without requiring an ES forecast;
    • it keeps direct quantile rows interpretable even when ES is empirical or auxiliary.
  • Joint VaR-ES evaluation:
    • Fissler-Ziegel joint loss for valid VaR-ES pairs;
    • ES exceedance severity, interpreted conditional on a VaR exception.
  • Why FZ loss is the main joint score:
    • it evaluates VaR and ES as a pair;
    • it is used only as evaluation language in the paper;
    • legacy likelihood-style implementation language is treated as benchmark objective interpretation, not as a second paper-facing loss.
  • Terminology is fixed as follows:
    • FZ loss means the Fissler-Ziegel joint VaR-ES evaluation score;
    • no separate likelihood-style VaR-ES loss label is used in the paper.
  • Scoring-function diagnostics:
    • Murphy diagrams for target-history benchmarks and 24-check robust LGBM families across the four registered information sets.
  • Model comparison:
    • block-bootstrap Diebold-Mariano tests on paired loss differentials.
  • Why DM is supporting inference:
    • it tests average paired loss differences on common forecast dates;
    • it is not a conditional state-by-state mechanism test;
    • the post-24-check 3-by-3 heatmap uses strict common dates across GJR-GARCH-EVT, LGBM POT-GPD plain MLE (C), and LGBM POT-GPD UniBM (C).
  • Supporting diagnostics:
    • stress-window performance.
  • Cross-scenario admissibility:
    • The headline robustness question is whether a model remains acceptable under sparse and rich information sets, and on both tail sides.
    • A 24-check version combines eight tail-by-information-set scenarios with three calibration diagnostics: breach-neighborhood, Kupiec unconditional coverage, and Christoffersen independence or conditional coverage where available.
    • This diagnostic battery is a validation profile, not a single formal hypothesis test and not proof of universal optimality.
    • The current breach-audit artifact reports the narrower breach-neighborhood and row-count gates; it should not be described as the full 24-check table unless the Kupiec and Christoffersen pass/fail grid is included.
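The two scores named above can be written compactly in positive loss units. The quantile loss is the usual pinball loss at the 0.95 level. For the joint score, this plan does not say which member of the Fissler-Ziegel family is used, so the sketch assumes the common zero-homogeneous form (often labelled FZ0) rewritten for positive losses, which requires ES > 0 and ES >= VaR. Function names and example values are hypothetical.

```python
import numpy as np


def quantile_loss(loss, var, tail_level=0.95):
    """Pinball loss for a VaR forecast in positive loss units."""
    loss = np.asarray(loss, dtype=float)
    var = np.asarray(var, dtype=float)
    hit = (loss > var).astype(float)
    # Weight tail_level on exceptions, (1 - tail_level) otherwise.
    return (loss - var) * (tail_level - (1.0 - hit))


def fz0_loss(loss, var, es, tail_level=0.95):
    """Assumed FZ0 joint VaR-ES loss, expressed for positive losses."""
    loss = np.asarray(loss, dtype=float)
    var = np.asarray(var, dtype=float)
    es = np.asarray(es, dtype=float)
    if np.any(es <= 0.0) or np.any(es < var):
        raise ValueError("a valid pair needs ES > 0 and ES >= VaR")
    alpha = 1.0 - tail_level
    hit = (loss > var).astype(float)
    return hit * (loss - var) / (alpha * es) + var / es + np.log(es) - 1.0


# Hypothetical forecasts: the first day has no exception, the second does.
ql = quantile_loss(loss=[0.01, 0.05], var=[0.02, 0.02])
fz = fz0_loss(loss=[0.01, 0.05], var=[0.02, 0.02], es=[0.03, 0.03])
```

Lower values are better for both scores; FZ values can be negative, so only differences between models on common dates are meaningful.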

4. Workflow Chart

4.1 Timing And Data Flow

flowchart TD
    A["OSE target date t"]
    B["Previous settlement<br/>Nikkei 225 Futures large contract"]
    C["Matched U.S. cash session s(t)<br/>regular close or early close"]
    D["Model cutoff<br/>U.S. close plus vendor lag"]
    E["Eligible predictors<br/>availability timestamp no later than cutoff"]
    F["OSE day-session open<br/>08:45 JST"]
    G["Settlement-to-open gap"]
    H["Left-tail loss<br/>minus gap"]
    I["Right-tail loss<br/>gap"]

    A --> B
    A --> C
    C --> D
    D --> E
    B --> G
    F --> G
    G --> H
    G --> I
    E --> H
    E --> I

4.2 Empirical Pipeline

flowchart LR
    subgraph Data["Materials"]
        J["Japan futures and options<br/>J-Quants"]
        U["U.S. and regional market data<br/>Massive"]
        M["Rates, FX, volatility, credit<br/>FRED and Cboe"]
    end

    subgraph Features["Nested information sets"]
        A["A: Japan only"]
        B["B: A plus U.S. close core"]
        C["C: B plus Japan proxies"]
        D["D: C plus Asia proxies"]
    end

    subgraph Models["Forecast models"]
        BF["Baseline benchmarks"]
        AB["Advanced econometric benchmarks"]
        LGBM["LightGBM (+EVT) families<br/>direct quantile; location-scale empirical;<br/>standardized/robust body filters; POT-GPD"]
    end

    subgraph Evaluation["Evaluation"]
        CAL["VaR calibration<br/>breach rate, exceptions,<br/>Kupiec/Christoffersen"]
        LOSS["Forecast scores<br/>quantile loss;<br/>FZ joint VaR-ES loss"]
        INF["Loss comparison<br/>DM"]
        DIAG["Forecast diagnostics<br/>Murphy,<br/>ES severity,<br/>stress windows"]
    end

    J --> A
    J --> B
    U --> B
    M --> B
    B --> C
    C --> D
    A --> BF
    A --> AB
    A --> LGBM
    B --> LGBM
    C --> LGBM
    D --> LGBM
    BF --> CAL
    BF --> LOSS
    BF --> DIAG
    AB --> CAL
    AB --> LOSS
    AB --> DIAG
    LGBM --> CAL
    LGBM --> LOSS
    LGBM --> DIAG
    CAL --> INF
    LOSS --> INF
  • The LightGBM block represents the implemented ML-tail registry: direct quantile, location-scale empirical, standardized-loss POT-GPD, and robust median/MAD or median/IQR POT-GPD variants. All use the same registered nested information sets where the model family is eligible.
  • Forecast diagnostics are computed from forecasts, realized losses, timing regimes, and scoring outputs. They are not downstream products of DM or other loss-comparison inference.

5. Expected Experiments

5.1 Primary Data And Timing Experiments

  • Build the OSE settlement-to-open gap target and verify the final forecast sample.
  • Audit U.S./Japan session matching, early closes, holiday desynchronization, DST regimes, roll/SQ exclusions, and vendor availability timestamps.
  • Report target-tail motivation diagnostics: density versus Gaussian, log survival, mean excess, and Hill/GPD tail-index paths.
  • Output expected evidence:
    • market_timing_design;
    • target_tail_motivation;
    • run metadata, panel construction, target-audit, calendar, feature coverage, and leakage sections in the Results Snapshot.

5.2 Benchmark Experiments

  • Run target-history and econometric baselines on left/right 95% loss surfaces.
  • Include historical/rolling quantile, EWMA or volatility-scaled quantile, GARCH-t, GJR-GARCH-t, and GJR-GARCH-EVT.
  • Include advanced econometric benchmarks where they converge and pass artifact gates; these rows remain claim-gated.
  • Output expected evidence:
    • benchmark metrics tables;
    • benchmark Murphy diagnostics;
    • selected benchmark-versus-LGBM performance figures;
    • all-model diagnostic scan.

5.3 Nested Information-Set ML Experiments

  • Run LightGBM direct quantile and LightGBM filtered-tail families over the nested A/B/C/D information sets.
  • Evaluate left and right tails separately.
  • Treat direct quantile rows as information-set comparators, but do not promote them when they fail calibration/admissibility gates.
  • Output expected evidence:
    • primary ML nested-information-set table;
    • per-model ML-tail appendix table;
    • 24-check coverage/admissibility discussion;
    • LGBM Murphy diagnostics for the pass-all LGBM+EVT families.
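The 24-check grid is the product of two tail sides, four information sets, and three calibration diagnostics. The sketch below enumerates it and applies a pass-all rule. The diagnostic keys and the function name are hypothetical, and treating a missing result as a failure is an assumption made here for illustration.

```python
from itertools import product

TAIL_SIDES = ("left_tail", "right_tail")
INFORMATION_SETS = (
    "japan_only",
    "japan_only_plus_us_close_core",
    "japan_only_plus_us_close_core_plus_japan_proxy",
    "japan_only_plus_us_close_core_plus_japan_proxy_plus_asia_proxy",
)
# Hypothetical keys for the three calibration diagnostics.
DIAGNOSTICS = ("breach_neighborhood", "kupiec", "christoffersen")

# 2 tail sides x 4 information sets x 3 diagnostics = 24 checks.
CHECKS = list(product(TAIL_SIDES, INFORMATION_SETS, DIAGNOSTICS))


def passes_all_checks(results: dict) -> bool:
    """results maps (tail, information_set, diagnostic) to True/False.

    A missing entry counts as a failure (an assumption, not a registered rule).
    """
    return all(results.get(check, False) for check in CHECKS)


all_pass = {check: True for check in CHECKS}
one_missing = {check: True for check in CHECKS[:-1]}
```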

5.4 Post-24-Check Cross-Suite Comparison

  • Restrict the headline comparison set to models that pass the current calibration/admissibility screen:
    • GJR-GARCH-EVT;
    • LGBM POT-GPD plain MLE (C);
    • LGBM POT-GPD UniBM (C).
  • Compute the 3-by-3 pairwise FZ DM heatmap separately for left and right tails.
  • Use a strict global common sample within each tail so the benchmark and LGBM rows are paired on identical forecast dates.
  • Output expected evidence:
    • dm_heatmap_left_tail;
    • dm_heatmap_right_tail;
    • common-sample N in the figure subtitle/caption.
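One cell of the heatmap can be sketched as a paired loss differential on the strict common sample, with a bootstrap p-value. The plan specifies block-bootstrap DM tests but not the scheme, so the moving-block resampling of the centered differential, the block length, the number of replications, and all names below are assumptions.

```python
import numpy as np


def common_dates(*loss_by_date):
    """Dates present in every model's {date: loss} mapping, sorted."""
    shared = set(loss_by_date[0])
    for mapping in loss_by_date[1:]:
        shared &= set(mapping)
    return sorted(shared)


def block_bootstrap_dm(row_loss, col_loss, block_len=10, n_boot=2000, seed=0):
    """Mean loss differential (row minus column) and a two-sided bootstrap p-value."""
    diff = np.asarray(row_loss, dtype=float) - np.asarray(col_loss, dtype=float)
    n = diff.size
    rng = np.random.default_rng(seed)
    centered = diff - diff.mean()                 # impose the null of equal mean loss
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=(n_boot, n_blocks))
    index = (starts[:, :, None] + np.arange(block_len)).reshape(n_boot, -1)[:, :n]
    boot_means = centered[index].mean(axis=1)
    p_value = float(np.mean(np.abs(boot_means) >= abs(diff.mean())))
    return float(diff.mean()), p_value


# Synthetic example: the row model has lower loss on every shared date.
dates_a = {d: 1.0 for d in range(0, 500)}
dates_b = {d: 1.0 for d in range(10, 510)}
shared = common_dates(dates_a, dates_b)
rng = np.random.default_rng(1)
col = rng.normal(1.0, 1.0, size=len(shared))
row = col - 0.5 + rng.normal(0.0, 0.1, size=len(shared))
mean_diff, p_value = block_bootstrap_dm(row, col)
```

A negative mean differential favors the row model, matching the sign convention used for the heatmaps.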

5.5 Information-Increment And Stress Diagnostics

  • Plot cumulative FZ gains relative to the same-family A-only LGBM+EVT anchor.
  • Compare GJR-GARCH-EVT and B/C/D information expansions against that A-only anchor to show information increments after the 24-check screen.
  • Use stress-window overlays to illustrate VaR/ES behavior in broad stress episodes; do not interpret them as PnL, trading alpha, or validation by themselves.
  • Output expected evidence:
    • cumulative_lgbm_a_anchor_fz_gain;
    • var_es_stress_overlay_2024_stress_episode;
    • var_es_stress_overlay_2025_stress_episode;
    • full-sample VaR overlay diagnostics.
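The cumulative gain series is a running sum of anchor loss minus candidate loss on common forecast dates. A minimal sketch, with a hypothetical function name and made-up loss values:

```python
import numpy as np


def cumulative_fz_gain(anchor_fz_loss, candidate_fz_loss):
    """Running sum of (anchor loss - candidate loss); upward moves favor the candidate."""
    anchor = np.asarray(anchor_fz_loss, dtype=float)
    candidate = np.asarray(candidate_fz_loss, dtype=float)
    if anchor.shape != candidate.shape:
        raise ValueError("losses must be paired on identical forecast dates")
    return np.cumsum(anchor - candidate)


# Made-up FZ losses for three dates.
gain_path = cumulative_fz_gain([1.0, 1.0, 1.0], [0.5, 1.5, 0.5])
```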

5.6 Appendix Robustness Experiments

  • Run the just sensitivity task as appendix evidence only, after the 24-check screen.
  • Perturb nearby LightGBM capacity for the two pass-all C-information LGBM+EVT families.
  • Perturb POT thresholds at 0.875 and 0.925, while recording 0.95 as a boundary diagnostic at the 95% VaR level.
  • Do not feed sensitivity rows into model selection, promoted rows, DM gates, selected figures, or the cross-suite FZ DM heatmap.

6. Expected Results And Discussion Outputs

6.1 Main Tables

  • Predictor block and coverage table: data/methods table showing information blocks, source families, feature counts, representative variables, missingness, and model role. Coverage is not admissibility; timestamp and feature-matrix gates still apply.
  • Model inventory table: compact methods table explaining Historical, GARCH/GJR, GARCH-EVT, advanced econometric, direct LightGBM, location-scale, and POT-GPD constructions. Performance belongs in result tables, not here.
  • Benchmark floor table: common-sample benchmark breach rates and loss metrics, with left/right tail detail available when page space allows.
  • ML information-ladder table: the main nested information-set table for direct LightGBM, reported separately for left and right tails.
  • Selected model performance table: deterministic selected-row summary after sample-size, coverage, FZ-loss, and quantile-loss gates.
  • Compact DM summary table: headline paired inference only. Full matrices stay in the appendix.

6.2 Main Figures

  • Figure 1, market timing design: institutional timing diagram for OSE settlement, night session, U.S. close, model cutoff, and next OSE open. This is a forecast-origin diagram, not a causal price-discovery diagram.
  • Figure 2, opening-gap tail motivation: density versus Gaussian, left/right log survival, mean-excess diagnostics, and Hill tail-index paths. This single composite motivates the target and EVT route; it is not forecast validation.
  • Coverage-screen evidence is summarized in tables rather than a compact main-text coverage figure; this avoids duplicating the 24-check pass/fail story.
  • Direct LightGBM information-ladder graphics are not main-text figures under the 24-check robustness story because the direct quantile rows fail the calibration screen.
  • Figure 3, cumulative FZ-gain diagnostics: one 2-by-2 figure after the 24-check coverage screen. Each panel fixes a tail side and one of the two 24-check-passing LGBM+EVT families. The anchor is the corresponding A-only LGBM+EVT forecast; plotted candidates are GJR-GARCH-EVT and the same-family B/C/D information expansions. Upward movement means the candidate has lower accumulated FZ loss than A-only under the fixed anchor-loss-minus-candidate-loss convention.

6.3 Appendix Figures And Tables

  • Raw target diagnostics: histogram/density, left/right QQ plots, log survival, mean excess, and Hill plot.
  • Full coverage diagnostics and selected performance figures: appendix checks backing the main coverage and selected-performance summaries.
  • Stress-window overlays: broad OOS stress episodes with left/right tails sharing the same x-axis; LGBM lines use information set C, the best-FZ row within the two 24-check-passing LGBM+EVT families. Illustration only, not validation, PnL, cost, or trading-performance evidence.
  • DM heatmaps: appendix pairwise FZ detail for the post-24-check cross-suite set; rows are candidates, columns are anchors, and negative differences favor the row model. Each tail uses a strict global common sample across GJR-GARCH-EVT, LGBM plain MLE C, and LGBM UniBM C.
  • Murphy diagrams: scoring-family diagnostics, not pairwise dominance claims.
  • ES severity diagnostics: conditional exceedance diagnostics, not model-selection or alpha claims.
  • EVT standardized-residual diagnostics: QQ, log survival, mean excess, Hill, and threshold stability for the POT-GPD route.
  • Appendix tables: full benchmark scan, full LGBM scan, tail-side risk tables, promoted tail rows, restricted result matrix, ES severity diagnostics, claim-scope reference, and configuration robustness.
  • The complete generated figure and table map is maintained in Results Snapshot, which covers both result interpretation and artifact placement.
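The row-minus-column reading of the DM heatmaps can be illustrated with a small sketch. It assumes one-step-ahead losses on a common sample and uses a plain sample variance (no HAC correction) with a normal approximation; whether the paper's DM implementation uses a long-run variance estimator is not stated here, so treat this as a simplified stand-in.

```python
import math

import numpy as np


def dm_statistic(row_loss, col_loss):
    """Return (mean difference, DM statistic, two-sided p-value).

    The differential is row loss minus column loss, so a negative mean
    difference favors the row model.
    """
    d = np.asarray(row_loss, dtype=float) - np.asarray(col_loss, dtype=float)
    n = d.size
    # Assumption: no serial correlation in d, so the lag-0 variance suffices.
    variance = d.var(ddof=1)
    stat = d.mean() / math.sqrt(variance / n)
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return float(d.mean()), float(stat), p_value


# Toy losses for illustration only.
row_model = [0.8, 1.3, 0.9, 1.0]
col_model = [1.0, 1.2, 0.9, 1.5]
mean_diff, stat, p_value = dm_statistic(row_model, col_model)
```

With real forecast losses, a Newey-West style variance would be the usual replacement for the lag-0 variance when the loss differential is autocorrelated.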

6.4 Appendix Configuration Robustness

  • The primary design compares pre-specified point-in-time forecast specifications.
  • The just sensitivity run is fixed to the post-24-check paper set: GJR-GARCH-EVT, LGBM POT-GPD plain MLE (C), and LGBM POT-GPD UniBM (C).
  • The sensitivity run varies nearby LightGBM capacity only for the two pass-all C-information LGBM+EVT families.
  • POT threshold sensitivity reports forecastable thresholds 0.875 and 0.925 for the same post-24-check set, bracketing the registered primary threshold 0.90.
  • At 95% VaR, threshold 0.95 is recorded only as a boundary diagnostic with status not_applicable_threshold_not_below_tail_level.
  • Sensitivity artifacts live under reports/runs/<run_id>/sensitivity/ and carry primary_claim_allowed=false.
  • Robustness labels describe conclusion stability versus the registered primary specification. They do not feed the cross-suite FZ DM heatmap, promoted rows, result-matrix selection, or selected-model figures.
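The boundary status for threshold 0.95 follows from the textbook POT-GPD quantile formula, which is only defined when the threshold level sits strictly below the tail level. The sketch below uses that textbook formula; the function names and the "applicable" label are hypothetical, and nothing here is confirmed as the project's implementation.

```python
import math


def pot_threshold_status(threshold_level: float, tail_level: float) -> str:
    """Label a POT threshold as usable or as a boundary case."""
    if threshold_level < tail_level:
        return "applicable"  # hypothetical label for the usable case
    return "not_applicable_threshold_not_below_tail_level"


def pot_var(u: float, beta: float, xi: float,
            threshold_level: float, tail_level: float) -> float:
    """Textbook POT-GPD quantile above threshold value u.

    u is the threshold value, beta the GPD scale, xi the GPD shape.
    """
    if not threshold_level < tail_level < 1.0:
        raise ValueError("threshold level must be below the tail level")
    ratio = (1.0 - tail_level) / (1.0 - threshold_level)
    if abs(xi) < 1e-12:
        # Limit of the formula as the shape parameter goes to zero.
        return u - beta * math.log(ratio)
    return u + (beta / xi) * (ratio ** (-xi) - 1.0)


# The two sensitivity thresholds, the primary threshold, and the boundary case.
statuses = {q: pot_threshold_status(q, 0.95) for q in (0.875, 0.90, 0.925, 0.95)}
```

At threshold 0.95 and tail level 0.95 the ratio equals one, so the formula would return the threshold itself and carry no tail extrapolation, which is consistent with recording that case as a boundary diagnostic.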

7. Manuscript Structure

  • Introduction:
    • state the pre-open tail-risk problem;
    • explain why the OSE night session makes the U.S. close question nontrivial;
    • state the nested information-set design;
    • preview the calibration-versus-loss tension in ML tail forecasts.
  • Institutional setting:
    • describe OSE day/night trading;
    • define the U.S. close cutoff;
    • state the point-in-time rule.
  • Materials:
    • describe Japanese futures and options data;
    • describe U.S. ETF, minute, option, rates, FX, volatility, and credit data;
    • describe preprocessing, contract rolls, calendar joins, and feature blocks.
  • Methods:
    • define target, left/right losses, VaR, and ES;
    • describe benchmark and advanced econometric models;
    • describe LightGBM direct quantile and filtered-tail models;
    • describe POT-GPD shape, scale, and ES gates;
    • describe evaluation metrics and inference.
  • Results:
    • begin with sample, timing, and target-tail diagnostics;
    • report baseline benchmark calibration;
    • report ML-tail nested information sets separately for left and right tails;
    • report restricted model-family comparisons;
    • report EVT, ES severity, and stress-window diagnostics as supporting evidence.
  • Discussion:
    • interpret U.S. close information content;
    • distinguish downside and upside risk;
    • discuss VaR coverage before loss-based claims;
    • explain where LightGBM+EVT is useful and where sample gates are still limiting.
  • Conclusion:
    • summarize the predictive evidence;
    • state limitations from coverage drift, FRED vintages, EVT sample size, and missing U.S.-close Nikkei futures marks;
    • define the next empirical extension only where it sharpens interpretation of the current OSE pre-open tail-risk design.

8. Claim Boundaries

  • No structural causal spillover claim.
  • No price-discovery claim.
  • No claim that left-tail and right-tail mechanisms are identical.
  • No deployment claim from historical OHLC data.
  • No residual_usclosemark_to_open claim without licensed timestamped intraday Nikkei futures marks.
  • No claim that LightGBM-EVT is a new ML algorithm.
  • No options-risk primary claim unless historical options entitlement, timestamp safety, and liquidity gates pass.
  • No model-family ranking claim from restricted short samples.

9. Appendix And Source Notes

9.1 Source Notes

9.2 Literature Notes

  • McNeil-Frey filtered EVT: conditional volatility/body filtering followed by EVT tail estimation.
  • Basel VaR backtesting: exception-counting intuition for VaR validation; this paper reports coverage and exception diagnostics but does not apply regulatory traffic-light capital zones.
  • CAViaR: dynamic quantile modeling for VaR.
  • CARE and expectile-based models: expectile links to tail-risk measures.
  • GAS models: score-driven updating for dynamic conditional distributions.
  • Fissler-Ziegel scoring: joint VaR-ES evaluation.
  • Murphy diagrams: sensitivity of forecast comparison to scoring-function choice.
  • Diebold-Mariano: unconditional comparison of sample-average loss between two forecasts.
  • Diagnostic admissibility across nested information sets:

9.3 Reproducibility Notes

  • The generated results snapshot is the evidence map; this paper plan is the manuscript design.
  • Data source details are maintained in docs/data.md.
  • Current result tables and figure provenance are maintained in docs/results_snapshot.md.
  • The canonical run artifacts live under REPORTS_DIR/runs/<run_id>/.
  • Paper-facing tables and figures are emitted under REPORTS_DIR/runs/<run_id>/latex/.
  • The local data root should be external storage, either through absolute .env paths or a repo-local data/ symlink that resolves outside the cloud-synced repo. REPORTS_DIR can remain local because generated reporting artifacts are comparatively small.
  • table_manifest.json and figure_manifest.json provide source-artifact and claim-scope traceability for the generated paper-facing outputs.
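The path conventions above can be sketched as a small helper. The fallback value "reports" for REPORTS_DIR, the function names, and the outside-the-repo check are assumptions for illustration; the manifest locations and schemas are not specified here, so the sketch does not read them.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def run_paths(run_id: str, reports_dir: Optional[str] = None) -> Dict[str, Path]:
    """Build the run, latex, and sensitivity directories for one run id."""
    # Assumption: fall back to a local "reports" directory if REPORTS_DIR is unset.
    root = Path(reports_dir or os.environ.get("REPORTS_DIR", "reports"))
    run_dir = root / "runs" / run_id
    return {
        "run": run_dir,
        "latex": run_dir / "latex",
        "sensitivity": run_dir / "sensitivity",
    }


def resolves_outside(path, repo_root) -> bool:
    """True if path (after following symlinks) lies outside repo_root."""
    resolved = Path(path).resolve()
    repo = Path(repo_root).resolve()
    return resolved != repo and repo not in resolved.parents


paths = run_paths("demo_run", reports_dir="reports")
```

A check such as `resolves_outside("data", ".")` run from the repository root would confirm that a repo-local data/ symlink points to external storage.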