Discussion Q&A

This page gives the plain-language framing behind the generated results snapshot. It is a guide to the empirical design and the current evidence, not a substitute for the tables, figures, and registered diagnostics.

What is the paper asking?

The paper asks whether information known by the U.S. cash-market close helps forecast the next OSE Nikkei 225 Futures day-session opening tail.

  • The contract is the OSE large Nikkei 225 Futures contract.
  • The target is opening-gap risk at the next OSE day-session open.
  • Left-tail and right-tail losses are reported separately. Both matter for futures positions, but they need not have the same economic pattern.
  • The main comparison is across nested information sets: Japan-only history, U.S. close core variables, Japan proxy ETFs, and Asia proxy ETFs.
  • The results are research-candidate evidence. They are not a model-selection statement by themselves.

What is the target?

The headline target is the settlement-to-open gap:

gap_t = log(OSE day-session open_t) - log(previous settlement_{t-1})

  • For left-tail models, loss is -gap_t.
  • For right-tail models, loss is gap_t.
  • A VaR exception occurs when realized_loss > VaR forecast.
  • The headline risk level is 95% VaR/ES, so the nominal exception rate is 5%.
  • Rows around roll and SQ windows are excluded from the clean headline sample.
  • full_gap_close_to_open and residual_nightclose_to_day_open are kept for audit and diagnostic use, but they are not the headline target.
  • A U.S.-close-mark-to-OSE-open residual target would need a licensed timestamped Nikkei futures mark at the U.S. close cutoff. That target is not active in this run.
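The target and loss conventions above can be sketched in a few lines. This is a hedged illustration only: the function and column names are illustrative, not the pipeline's actual identifiers.

```python
import numpy as np
import pandas as pd

def settle_to_open_gap(open_px: pd.Series, prev_settle: pd.Series) -> pd.Series:
    """gap_t = log(OSE day-session open_t) - log(previous settlement_{t-1})."""
    return np.log(open_px) - np.log(prev_settle)

def tail_loss(gap: pd.Series, tail: str) -> pd.Series:
    """Left-tail loss is -gap_t; right-tail loss is gap_t."""
    return -gap if tail == "left" else gap

def var_exceptions(realized_loss: pd.Series, var_forecast: pd.Series) -> pd.Series:
    """A VaR exception occurs when realized_loss > VaR forecast."""
    return realized_loss > var_forecast
```

At the headline 95% level, the exception indicator above should fire on roughly 5% of clean evaluation days for a well-calibrated model.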

Why is the OSE open worth studying?

The open matters because it is the first OSE day-session mark after the U.S. close information set and after the Japanese night-session interval.

  • In the current clean headline sample (n=1661), the settle-to-open gap ranges from -0.087513 log (-8.38%) on 2020-03-13 to 0.096937 log (+10.18%) on 2025-04-10.
  • That +10.18% move is also the largest absolute clean settle-to-open gap, large enough to make opening-gap tail risk a substantive risk-management forecasting problem rather than a cosmetic return-prediction exercise.
  • The clean 1% to 99% settle-to-open range is -0.028446 log (-2.80%) to 0.027351 log (+2.77%), so the extremes are far outside the usual daily opening-gap range.
  • Even after the night-session close, the clean night-close-to-open residual ranges from -0.038278 log (-3.76%) to 0.042071 log (+4.30%), which is also the maximum absolute residual.
  • These magnitudes make the empirical object an opening-tail risk problem, not only an average next-open return-forecasting problem.
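The percentage figures quoted above are simple-return conversions of the log gaps, exp(g) - 1; a minimal check:

```python
import math

def log_to_pct(g: float) -> float:
    """Convert a log gap g into a simple-return percentage: (exp(g) - 1) * 100."""
    return (math.exp(g) - 1.0) * 100.0

# The clean-sample extremes quoted above:
# log_to_pct(0.096937)  -> about +10.18
# log_to_pct(-0.087513) -> about -8.38
```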

  • JPX defines the OSE large contract as the home-market Nikkei 225 Futures contract with JPX/OSE trading hours, SQ rules, and JSCC clearing and margin rules.

  • JPX and JSCC documentation make the basic risk channel clear: adverse futures moves change mark-to-market PnL, account equity, collateral pressure, and risk limits. Formal margin calls follow exchange and clearing procedures; the paper does not assume a mechanical margin call exactly at 08:45 JST.
  • OSE night trading is not ignored. J-Quants night-session fields, night-close residuals, and timing indicators are in the audit layer.
  • CME and SGX Nikkei contracts are important offshore venues, but they are not the target in this run. A cross-venue residual study would be a separate design.
  • SQ and open-based settlement are related market-structure motivation, but the clean headline sample is not an SQ event study.

What data enter the forecasts?

All predictors must be available before the OSE target open under the point-in-time timing rule:

feature_available_ts_utc <= model_cutoff_ts_utc < target_open_ts_utc

  • J-Quants supplies OSE Nikkei 225 Futures target data, contract metadata, and lagged domestic option-state fields where available.
  • Massive supplies U.S. close-side ETF, sector, Japan proxy, Asia proxy, minute-bar, and optional U.S.-listed options-derived inputs.
  • FRED supplies rates, H.10 USD/JPY, VIX, and credit-spread controls with conservative release lags. These are not ALFRED real-time vintages.
  • CBOE supplies volatility-index data.
  • Benchmarks use target history only.
  • The ML information sets add predictors in a fixed order: Japan-only, then U.S. close core, then Japan proxies, then Asia proxies.
  • U.S.-listed options features are audit-gated. They are not headline evidence unless source, coverage, liquidity, and timing checks pass.
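The point-in-time rule can be expressed as a simple mask. This is a sketch: the field name feature_available_ts_utc comes from the rule above; everything else is illustrative.

```python
import pandas as pd

def point_in_time_mask(features: pd.DataFrame,
                       cutoff_ts: pd.Timestamp,
                       target_open_ts: pd.Timestamp) -> pd.Series:
    """Keep only features satisfying
    feature_available_ts_utc <= model_cutoff_ts_utc < target_open_ts_utc."""
    assert cutoff_ts < target_open_ts, "cutoff must precede the target open"
    return features["feature_available_ts_utc"] <= cutoff_ts
```

Rows failing the mask are excluded from the forecast's information set, so no predictor can leak information from at or after the OSE target open.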

What models are compared?

The benchmark floor, advanced benchmark suite, and ML-tail suite are implemented and have completed artifacts in this run.

  • Benchmark floor models include historical quantiles, rolling quantiles, EWMA, GARCH-t, GJR-GARCH-t, and GJR-GARCH-EVT.
  • Advanced benchmark families such as CAViaR, CARE/expectile, Taylor ALD, direct FZ-loss, and GAS produce nonblocking empirical forecast rows; their interpretation still follows the benchmark and restricted-sample gates.
  • The ML suite includes direct LightGBM quantile forecasts, location-scale empirical calibration, standardized-loss POT-GPD variants, and the new research-candidate LightGBM+EVT routes.
  • LightGBM is used as a fixed tabular learner. The paper does not claim a new machine-learning algorithm.
  • Hyperparameters are held fixed across information sets and refit dates.
  • Most models use expanding pre-forecast training histories. The rolling-quantile benchmark is the designed exception: it uses the most recent 1,000 clean observations.
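The rolling-quantile benchmark can be sketched as follows, assuming losses are already in the clean-sample convention; the 1,000-observation window and the one-step shift enforce the pre-forecast-history rule:

```python
import numpy as np
import pandas as pd

def rolling_quantile_var(loss: pd.Series,
                         window: int = 1000,
                         level: float = 0.95) -> pd.Series:
    """Empirical VaR forecast: the level-quantile of the most recent `window`
    clean losses, shifted one step so only pre-forecast history is used."""
    return loss.rolling(window, min_periods=window).quantile(level).shift(1)
```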

How do the LightGBM+EVT variants work?

The final VaR/ES level is still 95%. The q90 models use 90% only as a POT threshold, not as the final risk level.

  • Direct LightGBM estimates the 95% VaR level directly.
  • Location-scale models estimate a conditional center and scale, then calibrate the upper tail of standardized losses.
  • Standardized-loss POT-GPD models fit a Generalized Pareto tail above the registered 0.90 threshold of out-of-fold standardized losses.
  • Conditional-q90 POT-GPD estimates a dynamic 90% threshold with LightGBM, fits a GPD to losses above that threshold, and extrapolates to 95% VaR/ES.
  • Median/MAD and median/IQR routes use more robust body filters before the POT-GPD step.
  • Plain MLE is the standard EVT comparator. Stabilized variants are finite-sample regularized diagnostics until the evidence supports promotion.
  • New LightGBM+EVT routes are included in per-model and restricted model-family artifacts, but they are not automatically headline rows.
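The extrapolation from a 90% POT threshold to the final 95% VaR/ES can be sketched with the standard peaks-over-threshold formulas. This is a hedged illustration: the pipeline's threshold estimation (fixed, conditional-q90, or robust body filters) and its stabilized variants differ in detail.

```python
import numpy as np
from scipy.stats import genpareto

def pot_gpd_var_es(losses, threshold_q: float = 0.90, target_q: float = 0.95):
    """Fit a GPD to losses above the threshold_q empirical quantile,
    then extrapolate VaR and ES at target_q."""
    losses = np.asarray(losses, dtype=float)
    u = np.quantile(losses, threshold_q)
    excesses = losses[losses > u] - u
    xi, _, beta = genpareto.fit(excesses, floc=0.0)   # shape xi, scale beta
    p_u = (losses > u).mean()                         # exceedance prob, ~0.10
    if abs(xi) < 1e-9:
        # Exponential-tail limit as xi -> 0
        var_q = u + beta * np.log(p_u / (1.0 - target_q))
        es_q = var_q + beta
    else:
        # VaR_q = u + (beta/xi) * (((1-q)/p_u)**(-xi) - 1)
        var_q = u + (beta / xi) * (((1.0 - target_q) / p_u) ** (-xi) - 1.0)
        # ES under the GPD (valid for xi < 1)
        es_q = (var_q + beta - xi * u) / (1.0 - xi)
    return var_q, es_q
```

The same machinery applies whether the threshold is a fixed 0.90 quantile of standardized losses or a dynamic LightGBM-estimated q90; only the construction of `u` changes.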

How are forecasts judged?

The evaluation is built around tail-risk performance, not a single ranking.

  • Coverage: VaR breach rate should be close to the nominal 5% level.
  • Exception count: coverage evidence is weak when the number of tail events is too small.
  • Kupiec: tests unconditional VaR coverage.
  • Christoffersen: tests exception clustering.
  • Quantile loss: evaluates VaR forecasts.
  • Fissler-Ziegel loss: evaluates joint VaR/ES forecasts where ES is valid.
  • Mean exceedance severity: reports how large exceptions are once they happen.
  • DM (Diebold-Mariano) tests and the MCS (model confidence set) procedure provide average-sample inference across the unconditional evaluation sample.
  • CPA (conditional predictive ability) is a conditional loss-difference diagnostic based on regressions of loss differentials on ex-ante observables. It does not produce forecasts.
  • Murphy diagrams, DST, stress-window, ES severity, and trigger diagnostics are supporting evidence.
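Two of the headline diagnostics, Kupiec unconditional coverage and the VaR quantile (pinball) loss, can be sketched as follows. These are illustrative implementations of the standard formulas, not the registered evaluation code.

```python
import numpy as np
from scipy.stats import chi2

def kupiec_pvalue(exceptions: np.ndarray, alpha: float = 0.05) -> float:
    """Kupiec LR test of unconditional coverage: H0 is breach rate == alpha."""
    n = len(exceptions)
    x = int(np.sum(exceptions))
    pi_hat = min(max(x / n, 1e-8), 1 - 1e-8)   # clamp degenerate 0/n cases
    lr = -2.0 * (x * np.log(alpha / pi_hat)
                 + (n - x) * np.log((1.0 - alpha) / (1.0 - pi_hat)))
    return float(chi2.sf(max(lr, 0.0), df=1))

def quantile_loss(loss: np.ndarray, var: np.ndarray, alpha: float = 0.95) -> float:
    """Mean pinball loss of VaR forecasts at quantile level alpha of the loss."""
    u = loss - var
    return float(np.mean(np.maximum(alpha * u, (alpha - 1.0) * u)))
```

At the headline 95% level, alpha = 0.05 is the nominal exception rate in the Kupiec test, while alpha = 0.95 is the quantile level in the pinball loss.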

What do the current results say?

The current evidence is a calibration-versus-loss tradeoff.

  • Benchmark floor models generally sit closer to the 5% VaR exception target.
  • Direct LightGBM quantile rows often show lower average loss on this registered sample, but their breach rates are above the nominal level.
  • That means lower loss cannot be read alone as better tail calibration.
  • Filtered EVT and location-scale models improve coverage discipline in several comparisons, but the evidence is not one model-family ranking.
  • The new conditional-q90 POT-GPD route helps separate threshold estimation from tail extrapolation, but in the latest run most q90 calibration gates fail; its usable sample is too small for a headline claim.
  • Among the new EVT candidates, median/IQR POT-GPD has the clearest left-tail calibration diagnostics in the current run. Right-tail evidence is less clean and should be reported separately.
  • The paper should state the tension plainly: flexible ML information sets can change forecast loss, while VaR coverage gates determine whether that change is usable for risk claims.

What can the paper claim?

The evidence layers differ in whether they can support a headline claim:

  • Benchmark common-sample table: yes, after review. External target-history/econometric floor on a shared sample.
  • ML-tail nested information sets: yes, after review. Strict nested-information-set comparison; currently the direct quantile route survived the gate.
  • ML-tail per-model rows: no. Model-specific OOS diagnostics; samples need not match across model families.
  • Restricted result matrix: no headline claim. Matched-date comparison for model families and within-model increments.
  • DST, stress, Murphy, hedge-trigger diagnostics: diagnostic only. Useful for interpretation and risk monitoring, not automatic model-selection evidence.
  • The paper can claim a point-in-time forecast evaluation of OSE Nikkei 225 Futures opening-gap tail risk.
  • It can report that U.S. close information and proxy blocks change average loss and coverage patterns under registered information sets.
  • It can report that direct LightGBM quantile forecasts are too liberal in the current headline rows.
  • It can report that filtered EVT and robust body-filter routes improve some coverage diagnostics, especially on the left tail, while remaining restricted model-family evidence.
  • It should not claim that one model is universally strongest.
  • It should not average left-tail and right-tail evidence into one mechanism.
  • It should not present DST, trigger, or feature-block diagnostics as causal proof or realized trading performance.
  • The current bottom line: the pipeline now produces a clean evidence set from the durable gold layer; benchmark floor, advanced benchmark, and ML-tail suites completed with zero recorded forecast failures; advanced rows are implemented evidence but remain nonblocking until author-reviewed against the same sample and inference gates.