
Data Appendix: Sources, Pretreatment, Features, And Source Notes

This page is written as the data appendix for the paper. It is the canonical source for source roles, source status, target formulas, timestamp policies, pretreatment rules, contract-roll treatment, feature blocks, source limitations, and public data lake tiers. The paper plan should reference this page instead of duplicating vendor fields or processing rules.

Source Status

| Source | Role | Project status | Timing interpretation |
| --- | --- | --- | --- |
| J-Quants Futures OHLC | OSE Nikkei 225 Futures target source and contract metadata source | Premium futures access is configured locally for historical audits | Ex-post historical research target source, not an operational pre-open feed. |
| Massive.com | U.S. close-side ETF, equity, sector, dollar ETF proxy, Japan proxy, Asia proxy, and index predictors | Configured licensed API | Predictor source with UTC timestamps converted explicitly to ET. Massive is not the canonical USD/JPY source in the current run. |
| NYSE calendar | U.S. regular close, holiday, and early-close cutoff logic | Core public timing source | Determines the U.S. cash close used for the forecast origin. |
| JPX calendar and trading-hour rules | OSE business-day, day/night session, holiday-trading, roll/SQ context | Core official timing source | Determines OSE target eligibility and holiday/session flags. |
| FRED | Treasury yields, selected rates proxies, and public macro series | Configured public API | Historical control source; vintage-safe work requires ALFRED-style handling where needed. |
| Cboe VIX historical data | U.S. volatility and risk proxy | Core public fallback/check | Daily VIX close is core; high, low, and range are included only when source support is audited. |
| CME/SGX/OSE intraday Nikkei marks | U.S.-close residual reference and cross-venue robustness | Optional licensed extension | Required before using residual_usclosemark_to_open or making intraday cross-venue claims. |

Core variable definitions should be transparent enough for reader evaluation. Public fallback checks should be provided where feasible, but the documentation should not claim that all core results are reproducible without the licensed target and predictor sources. Target and residual-reference prices are futures-contract prices. Cash-index OHLC is not part of the target-data plan for this paper.

Current Environment Contract

The current local .env contract uses J-Quants API V2. The futures target pipeline for the run requires Premium futures access for the historical derivatives endpoint; intraday derivatives remain disabled unless a licensed intraday mark is added:

JQUANTS_API_KEY_FILE="/path/to/jquants.keyfile"
JQUANTS_API_BASE_URL="https://api.jquants.com/v2"
JQUANTS_EQUITY_MASTER_ENABLED="true"
JQUANTS_EQUITY_DAILY_ENABLED="true"
JQUANTS_DERIVATIVES_DAILY_ENABLED="true"
JQUANTS_DERIVATIVES_INTRADAY_ENABLED="false"

Massive and FRED predictor settings separate endpoint credentials from the research-config feature-set blocks. The .env file supplies key-file paths and base URLs, not raw API-key values; MASSIVE_API_KEY and JQUANTS_API_KEY are not runtime configuration paths. The run manifest and config/research_config.json record the exact feature-set universe. For the current clean run, the fetched Massive daily universe is the union of the core, optional, Japan proxy, Japanese ADR aggregate, and Asia proxy blocks:

MASSIVE_API_KEY_FILE="/path/to/massive.keyfile"
MASSIVE_FLAT_FILE_KEY_FILE="/path/to/massive-flat-file.keyfile"
MASSIVE_BASE_URL="https://api.massive.com"
MASSIVE_MINUTE_TICKERS="SPY,QQQ,DIA,IWM,EWJ,DXJ,EEM,FXI,EWY,EWT,EWH,TLT,HYG,GLD"
MASSIVE_MINUTE_TICKER="SPY"
MASSIVE_PROBE_TICKERS="I:VIX"
MASSIVE_OPTIONS_HISTORICAL_ENABLED="false"
MASSIVE_OPTIONS_FLAT_FILES_ENABLED="false"
MASSIVE_OPTIONS_CONTRACT_REST_ENABLED="false"
MASSIVE_OPTIONS_UNDERLYINGS="SPY,QQQ,DIA,IWM,XLK,XLF,XLE,XLV,XLI,XLY,XLP,XLB,XLU,XLC,SMH,EWJ,DXJ,EEM,FXI,EWY,EWT,EWH,TM,SONY,MUFG,SMFG,MFG"

massive_core = SPY,QQQ,DIA,IWM,XLK,XLF,XLE,XLV,XLI,XLY,XLP,XLB,XLU,XLC,TLT,GLD,USO,SMH,HYG,LQD
massive_optional = UUP
massive_japan_proxy = EWJ,DXJ
massive_japan_adr_primary = TM,SONY,MUFG,SMFG,MFG
massive_asia_proxy = EEM,FXI,EWY,EWT,EWH

FRED_BASE_URL="https://fred.stlouisfed.org"
fred_core = VIXCLS,DGS2,DGS10,T10Y2Y
fred_fallback = DEXJPUS
fred_credit_enriched = BAMLH0A0HYM2,BAMLC0A0CM

These snippets describe the full clean-run fetch universe. Smoke commands can override them with smaller ticker or series lists. The primary ML nested information sets do not use every fetched field: UUP and the credit-spread series are now B-layer U.S. close candidates, while short-history funding series, robustness stress indexes, and unaudited event/skew variables remain outside the registered primary ML table. The options flags remain disabled in raw settings as a fail-safe for direct CLI calls, and the standard just full recipe now keeps options=false by default. This makes the canonical full-history run independent of the shorter Massive OPRA entitlement window. U.S.-listed options features are still available through an explicit options=true run for appendix or recent-window diagnostics, where they are routed by underlying exposure into the nested information sets.

Short-history and robustness-only candidates stay out of the current registered full-history primary feature set:

POST_2018_FRED_SERIES="SOFR,EFFR"
FRED_ROBUSTNESS_SERIES="NFCI,ANFCI,STLFSI4"

FRED uses current historical values with conservative availability semantics and is not ALFRED/vintage-safe unless a future run explicitly records realtime or vintage parameters. DEXJPUS is handled as a Federal Reserve H.10 weekly-batch as-of FX control: the previous business week's observations are unavailable until the following H.10 release timestamp. Massive FX is not part of the default tail-risk pipeline. UUP, when fetched, is a U.S.-traded dollar ETF proxy and should not be described as a USD/JPY exchange-rate source.

Current Clean-Run Data Inventory

The current paper-facing evidence map is generated from:

run_id = tailrisk_20160719_20260508_20260512T131041Z_commit_f420c4fa
requested window = 2016-07-19 to 2026-05-08
clean forecast sample = 2018-06-20 to 2026-05-08
clean forecast observations = 1,712

The clean sample begins after all required target fields, Massive core fields, FRED core fields, and the canonical FRED H.10 USD/JPY control satisfy the registered coverage and timing requirements. The gold modeling panel contains 2,393 target-date rows before the clean-sample filter.

Run Metadata From The Current Results Snapshot

| Field | Value |
| --- | --- |
| Run ID | tailrisk_20160719_20260508_20260512T131041Z_commit_f420c4fa |
| Claim level | research_candidate |
| Requested window | 2016-07-19 to 2026-05-08 |
| Combined clean start | 2018-06-20 |
| Gold panel dates | 2016-07-19 to 2026-05-08 |
| Forecast sample dates | 2018-06-20 to 2026-05-08 |
| Forecast sample rows | 1,712 |
| FRED vintage safe | False |

The clean start is a modeling lower bound. Dates before it remain audit history rather than forecast evidence. FRED values use conservative release timing but are current historical observations rather than ALFRED real-time vintages.

Target Distribution Summary

These rows are copied from the current results snapshot so the data appendix can stand alone when describing the empirical target.

| Measure | Value |
| --- | --- |
| Clean forecast observations | 1712 |
| Date range | 2018-06-20 to 2026-05-08 |
| Mean gap | 0.000562 log, about +0.06% |
| Standard deviation | 0.011038 log, about 1.11% |
| Skewness | -0.0660673 |
| Excess kurtosis | 11.2256 |
| 1% quantile | -0.031102 log, about -3.06% |
| 5% quantile | -0.015645 log, about -1.55% |
| Median | 0.001012 log, about +0.10% |
| 95% quantile | 0.015305 log, about +1.54% |
| 99% quantile | 0.027493 log, about +2.79% |
| Max drawdown gap | -0.087513 log, about -8.38%, on 2020-03-13 |
| Max upside gap | 0.096937 log, about +10.18%, on 2025-04-10 |
| Jarque-Bera statistic | 9016.83 |
| Jarque-Bera p-value | 0 (below floating-point reporting precision) |

The target summary is a raw-target diagnostic. It motivates tail-risk modeling but does not validate any VaR/ES forecast.
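The "about" percentages in the table are consistent with converting a log gap x to a simple return via exp(x) - 1. A minimal sketch of that conversion follows; the function name log_gap_to_pct is illustrative and is not taken from the project code.

```python
import math


def log_gap_to_pct(log_gap: float) -> float:
    """Convert a log gap to a simple percentage change: 100 * (exp(x) - 1)."""
    # expm1 keeps precision for the small values typical of overnight gaps.
    return 100.0 * math.expm1(log_gap)


# Largest upside and downside gaps reported in the summary table.
max_up_pct = round(log_gap_to_pct(0.096937), 2)
max_down_pct = round(log_gap_to_pct(-0.087513), 2)
print(max_up_pct, max_down_pct)
```

For gaps of a few basis points the log and simple values are nearly identical; the difference only becomes visible in the tails, which is why the extreme rows show the largest divergence between the two columns.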

Raw-Tail EVT Data Diagnostics

| Tail | Threshold probability | Threshold | Exceedances | Mean excess | GPD xi | GPD scale | Hill xi |
| --- | --- | --- | --- | --- | --- | --- | --- |
| left_tail_loss | 0.900 | 0.0160618 | 78 | 0.0103847 | 0.152593 | 0.00878757 | 0.432871 |
| left_tail_loss | 0.925 | 0.0195607 | 58 | 0.00995799 | 0.293971 | 0.00712569 | 0.346247 |
| left_tail_loss | 0.950 | 0.0223549 | 39 | 0.0113879 | 0.237098 | 0.00874323 | 0.354783 |
| left_tail_loss | 0.975 | 0.029331 | 20 | 0.0127713 | 0.261132 | 0.0095416 | 0.31884 |
| left_tail_loss | 0.990 | 0.0373472 | 8 | 0.0175314 | 0.211214 | 0.0140966 | 0.342351 |
| right_tail_loss | 0.900 | 0.0149284 | 91 | 0.00910348 | 0.400284 | 0.00566715 | 0.385744 |
| right_tail_loss | 0.925 | 0.0169257 | 69 | 0.00974974 | 0.522434 | 0.00526951 | 0.369121 |
| right_tail_loss | 0.950 | 0.0189177 | 46 | 0.0121576 | 0.322832 | 0.00846784 | 0.414297 |
| right_tail_loss | 0.975 | 0.0260456 | 23 | 0.0146089 | 0.236968 | 0.0113013 | 0.383413 |
| right_tail_loss | 0.990 | 0.0370088 | 10 | 0.0171211 | 0.231959 | 0.013441 | 0.352692 |
| absolute_gap | 0.900 | 0.0155233 | 169 | 0.00965334 | 0.293977 | 0.00689961 | 0.401503 |
| absolute_gap | 0.925 | 0.0175078 | 127 | 0.0105772 | 0.261802 | 0.00786988 | 0.397782 |
| absolute_gap | 0.950 | 0.020701 | 85 | 0.0118328 | 0.25315 | 0.00892439 | 0.381347 |
| absolute_gap | 0.975 | 0.0270259 | 43 | 0.0143999 | 0.167275 | 0.0120367 | 0.372398 |
| absolute_gap | 0.990 | 0.0372773 | 17 | 0.0182071 | 0.0771593 | 0.0168379 | 0.353301 |

These diagnostics are computed on raw left loss, raw right loss, and absolute gap. They are data diagnostics, not forecast-model diagnostics.
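To make the exceedance columns concrete, the sketch below computes an exceedance count, a mean excess, and a Hill-type tail index above a fixed threshold. It is an illustration only: the function name is hypothetical, and the threshold-based Hill form used here (the threshold replaces the usual order statistic) is an assumption, since this page does not document the exact estimator or the GPD fitting routine used by the project.

```python
import math


def exceedance_diagnostics(losses, threshold):
    """Return (count, mean excess, Hill-type xi) for losses above a positive threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive for the log-ratio in the Hill form")
    exceed = [x for x in losses if x > threshold]
    if not exceed:
        raise ValueError("no observations exceed the threshold")
    k = len(exceed)
    # Mean excess: average distance above the threshold.
    mean_excess = sum(x - threshold for x in exceed) / k
    # Hill-type estimate: average log-ratio of exceedances to the threshold.
    hill_xi = sum(math.log(x / threshold) for x in exceed) / k
    return k, mean_excess, hill_xi


# Toy losses, not project data.
k, mean_excess, hill_xi = exceedance_diagnostics([0.01, 0.02, 0.04], threshold=0.01)
print(k, mean_excess, hill_xi)
```

With few exceedances, as in the 0.990 rows of the table, both quantities are noisy, which is one reason the table reports several threshold probabilities side by side.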

Gold Panel, Target Audit, And Calendar Map

| Measure | Value |
| --- | --- |
| Gold modeling rows | 2393 |
| Gold columns | 1428 |
| Target-audit rows | 2393 |
| Clean target rows | 2196 |
| Forecast-sample rows | 1712 |
| Rows before combined clean start | 420 |
| Target-not-clean rows | 197 |
| Mapping excluded rows | 64 |

| Target audit reason | Rows |
| --- | --- |
| None | 2196 |
| roll_sq_excluded | 195 |
| missing_previous_jpx_session | 1 |
| missing_reference_price | 1 |

| Timing-map measure | Value |
| --- | --- |
| Normal trading mappings | 2323 |
| U.S./Japan desync mappings | 1 |
| NYSE early-close mappings | 32 |
| EDT rows | 1553 |
| EST rows | 840 |

Roll/SQ exclusions, missing reference prices, early closes, U.S./Japan desynchronization, and DST regimes are stored as auditable row-level state rather than applied as hidden filters.

Feature Coverage From The Current Gold Panel

| Source family | Block | Features | Mean missing | Max missing |
| --- | --- | --- | --- | --- |
| Asia proxy | Asia proxy | 10 | 0.000% | 0.000% |
| cboe_volatility | fred_core | 2 | 0.000% | 0.000% |
| cross_market_derived | Asia proxy | 1 | 0.000% | 0.000% |
| cross_market_derived | fred_core | 2 | 0.000% | 0.000% |
| cross_market_derived | JP proxy | 2 | 0.000% | 0.000% |
| cross_market_derived | US core | 2 | 0.000% | 0.000% |
| event_calendar | calendar_controls | 7 | 0.000% | 0.000% |
| fred_core | fred_core | 9 | 0.000% | 0.000% |
| FRED credit enriched | FRED credit enriched | 4 | 62.179% | 62.208% |
| fx_core | fx_core | 4 | 0.000% | 0.000% |
| JP history | JP only | 37 | 0.005% | 0.058% |
| JP proxy | JP proxy | 8 | 0.000% | 0.000% |
| J-Quants N225 options | JP only | 30 | 1.552% | 14.486% |
| massive_daily | US core | 40 | 0.001% | 0.058% |
| massive_minute | Asia proxy | 60 | 0.000% | 0.000% |
| massive_minute | JP proxy | 24 | 0.346% | 4.147% |
| massive_minute | US late session | 84 | 0.000% | 0.000% |
| massive_optional | massive_optional | 2 | 0.000% | 0.000% |

Feature coverage is an information-transparency diagnostic. A feature is admissible only when the timestamp availability and feature-matrix gates also pass.

Leakage Audit Summary

| Field | Value |
| --- | --- |
| Status | pass_with_warnings |
| Rows audited | 780118 |
| Failures | 0 |
| Warnings | 609237 |
| Panel row count | 2393 |
| Panel signature seed | 42 |
| Panel signature | f1ca88ded1c0cf25817205318cce38b3c2bfe6e84c220cfb9b1d16d9dfa4d5cc |

Zero hard failures means no audited row violated the timestamp invariant. The warnings are retained because conservative-lag and missing-feature situations can still matter for interpretation.

Active Target and Calendar Inputs

  • Target source: J-Quants Premium historical OSE Nikkei 225 Futures daily/session OHLC for the large Nikkei 225 Futures contract.
  • Primary target family: full_gap_settle_to_open, defined as log(OSE day-session open_t) - log(previous settlement_{t-1}).
  • Additional audited target fields: full_gap_close_to_open and residual_nightclose_to_day_open.
  • Disabled target extension: residual_usclosemark_to_open, because the current run does not include a licensed timestamped intraday OSE, CME, SGX, or equivalent Nikkei futures mark at the U.S. cash close.
  • Calendar sources: JPX/OSE trading-day and session rules, NYSE holidays and early closes, U.S. DST rules, roll-window flags, SQ-window flags, and contract-expiry metadata.
  • Timing fields recorded in the panel include model_cutoff_ts_utc, target_open_ts_utc, dst_regime, absorption_regime, us_close_to_ose_night_close_minutes, mapping_status, and join_miss_reason.
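The primary target definition above can be written as a short function. This is a sketch under stated assumptions: the function and argument names are illustrative, and raising on non-positive prices is a choice made here for the example rather than documented pipeline behavior.

```python
import math


def full_gap_settle_to_open(day_open: float, prev_settlement: float) -> float:
    """log(OSE day-session open on t) minus log(settlement on t-1)."""
    if day_open <= 0 or prev_settlement <= 0:
        # Assumption for this sketch: refuse to compute rather than impute.
        raise ValueError("prices must be strictly positive")
    return math.log(day_open) - math.log(prev_settlement)


# Hypothetical prices: a 1% higher open relative to the prior settlement.
example_gap = full_gap_settle_to_open(day_open=38380.0, prev_settlement=38000.0)
print(example_gap)
```

The audited alternative target fields differ only in the reference price substituted for the previous settlement.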

Active Massive Daily Inputs

The clean run fetches these Massive daily symbols:

| Block | Symbols | Features used or audited |
| --- | --- | --- |
| Broad U.S. beta | SPY, QQQ, DIA, IWM | Close-to-close log returns and high-low log ranges. |
| U.S. sectors | XLK, XLF, XLE, XLV, XLI, XLY, XLP, XLB, XLU, XLC | Sector returns and ranges. |
| Duration, safe-haven, commodity, semiconductor, and credit-risk proxies | TLT, GLD, USO, SMH, HYG, LQD | Returns and ranges. |
| Dollar ETF proxy | UUP | Cached and audited as massive_optional; enters the U.S. close core information set as a dollar-risk proxy, not as USD/JPY. |
| Japan proxy block | EWJ, DXJ | Returns and ranges; enters the third ML information set. |
| Japanese ADR aggregate block | TM, SONY, MUFG, SMFG, MFG | Aggregate-only ADR spot returns/ranges; enters the third ML information set without single-name ADR spot features. |
| Asia proxy block | EEM, FXI, EWY, EWT, EWH | Returns and ranges; enters the fourth ML information set. |

All Massive daily fields are frozen at the US_CASH_CLOSE forecast origin after the configured vendor-availability lag. Massive timestamps are stored in UTC and converted to U.S. Eastern Time before session alignment.
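The explicit UTC-to-Eastern conversion and the EDT/EST regime label can be sketched with the standard library. The helper names to_eastern and dst_regime are hypothetical; only the zoneinfo calls are real APIs, and the project's own implementation may differ.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")


def to_eastern(ts_utc: datetime) -> datetime:
    """Convert a timezone-aware UTC timestamp to U.S. Eastern Time."""
    if ts_utc.tzinfo is None:
        # Naive timestamps are rejected so a local-time value is never mistaken for UTC.
        raise ValueError("expected a timezone-aware UTC timestamp")
    return ts_utc.astimezone(ET)


def dst_regime(ts_utc: datetime) -> str:
    """Label the row EDT when daylight saving time is in effect in New York, else EST."""
    return "EDT" if to_eastern(ts_utc).dst() else "EST"


summer_close = datetime(2024, 7, 1, 20, 0, tzinfo=timezone.utc)
winter_close = datetime(2024, 1, 2, 21, 0, tzinfo=timezone.utc)
print(to_eastern(summer_close), dst_regime(summer_close))
print(to_eastern(winter_close), dst_regime(winter_close))
```

Both example timestamps land on 16:00 Eastern, which shows why the UTC offset of the U.S. close, and therefore its distance to the OSE sessions, depends on the DST regime.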

Active Massive Intraday Input

The current feature set uses a curated set of U.S.-listed minute-bar ETF proxies rather than adding more daily ETF controls. SPY is retained through a small compatibility adapter that projects generic SPY minute records into the canonical spy_late_* / spy_final_* feature names; the minute pipeline itself remains multi-ticker. Additional tickers use lower-case ticker prefixes. The derived minute features include:

  • late 30- and 60-minute log returns;
  • late 60-minute realized variance and up/down semivariance;
  • late 60-minute skewness and excess kurtosis, recorded as noisy small-sample estimators rather than asymptotic realized moments;
  • late-session range;
  • within-ticker late-volume surge, z-score, and percentile using prior rolling history only;
  • final-window momentum.

These variables proxy late-session U.S. trading pressure and are frozen at the same U.S. close cutoff as the daily Massive predictors. The deterministic block map keeps U.S. core minute proxies in us_late_session, EWJ/DXJ minute features in japan_proxy, and EEM/FXI/EWY/EWT/EWH minute features in asia_proxy.
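A few of the late-window features listed above can be sketched from a sequence of minute-bar closes. This is a simplified illustration: the function name and dictionary keys are hypothetical and do not reproduce the canonical spy_late_* names, and the skewness, kurtosis, and volume features are omitted.

```python
import math


def late_window_features(closes):
    """closes: minute-bar close prices for the late window, oldest first."""
    if len(closes) < 2 or min(closes) <= 0:
        raise ValueError("need at least two strictly positive closes")
    # One-minute log returns inside the window.
    rets = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    return {
        "log_return": math.log(closes[-1] / closes[0]),
        "realized_variance": sum(r * r for r in rets),
        "up_semivariance": sum(r * r for r in rets if r > 0),
        "down_semivariance": sum(r * r for r in rets if r < 0),
    }


# Toy three-bar window, not project data.
feats = late_window_features([100.0, 101.0, 100.5])
print(feats)
```

By construction the up and down semivariances sum to the realized variance, so the pair splits late-session variation into buying-side and selling-side components.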

Registered Options Source Audit

The project can build bounded U.S. options features from Massive OPRA day_aggs_v1 flat files only in explicit opt-in runs when MASSIVE_OPTIONS_HISTORICAL_ENABLED=true and MASSIVE_OPTIONS_FLAT_FILES_ENABLED=true. Massive live option snapshots are not used for historical backfill. The active U.S. options implementation computes ATM-IV proxies from option daily aggregate close prices, underlying Massive daily closes, FRED DGS2, and a zero-dividend Black-Scholes approximation. It does not use vendor historical IV, Greeks, quotes, or open interest.
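The zero-dividend Black-Scholes step can be illustrated with a bisection solve for implied volatility. This is a sketch, not the project's solver: the function names, the volatility bracket, and the tolerance are assumptions, and the rate argument is a decimal (a FRED DGS2 value quoted in percent would need dividing by 100 first).

```python
import math


def _norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_price(spot, strike, t_years, rate, vol, is_call):
    """Black-Scholes price with no dividend yield."""
    sqrt_t = math.sqrt(t_years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * t_years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    disc = math.exp(-rate * t_years)
    if is_call:
        return spot * _norm_cdf(d1) - strike * disc * _norm_cdf(d2)
    return strike * disc * _norm_cdf(-d2) - spot * _norm_cdf(-d1)


def implied_vol(price, spot, strike, t_years, rate, is_call, lo=1e-4, hi=5.0, tol=1e-8):
    """Bisection solve; returns None when no volatility in [lo, hi] matches the price."""
    if price <= 0 or t_years <= 0:
        return None
    if bs_price(spot, strike, t_years, rate, lo, is_call) > price:
        return None  # price below the low-volatility bound (e.g. stale or sub-intrinsic close)
    if bs_price(spot, strike, t_years, rate, hi, is_call) < price:
        return None  # price above what the bracket can explain
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bs_price(spot, strike, t_years, rate, mid, is_call) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Round trip on a hypothetical 30-day at-the-money call.
quote = bs_price(100.0, 100.0, 30 / 365, 0.04, 0.20, True)
solved = implied_vol(quote, 100.0, 100.0, 30 / 365, 0.04, True)
print(solved)
```

Returning None on a failed solve, rather than a clipped value, keeps a success flag available for a liquidity-style audit of the kind described in this section.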

just source-probe now performs a nonblocking Massive flat-file check when MASSIVE_FLAT_FILE_KEY_FILE is configured. The live local probe can list and range-read us_options_opra/day_aggs_v1, minute_aggs_v1, and trades_v1 headers from the S3-compatible flat-file endpoint; those headers contain option price/volume/timestamp fields but no direct IV, Greeks, or open interest. quotes_v1 is listed by the bucket but currently returns 403 Forbidden on the sample header read, so quote-based spread/liquidity filters remain disabled until entitlement is confirmed. The active v1 liquidity audit is therefore based on day-agg volume, transaction count, valid-contract count, DTE bucket, and whether an ATM-IV solve succeeds.

The options audit artifacts are:

  • options_source_audit.parquet;
  • options_feature_coverage.parquet;
  • options_liquidity_audit.parquet.

J-Quants Nikkei 225 large-option data are handled separately from U.S.-listed options. The pipeline now fetches Nikkei 225 Options (NK225E) daily option-chain rows from J-Quants, consistent with the J-Quants index-option field specification and option product code list, normalizes the compact V2 fields (Strike, IV, OI, BaseVol, UnderPx, etc.), promotes the night-session option OHLC fields (EO, EH, EL, EC) into silver as night_session_open/high/low/close, and converts volatility percent values to fractions. These features enter the japan_only block only as lagged domestic option-implied state. Same target-date option rows are not used. The default aggregate scope is the registered 7-30 and 31-90 DTE window, so the main predictors are prior available ATM IV, ATM put-call IV skew, base volatility, OI-weighted IV, put/call OI and volume ratios, total OI/volume, valid contract count, days to SQ, and lagged night-session ATM option close/return/range summaries. These night-session option features are deliberately lagged; without timestamped intraday or quote-chain evidence they are not interpreted as same-night U.S.-close-cutoff N225 option information.

The candidate options universe is intentionally capped before any data-driven selection:

| Block | Candidate underlyings | Status |
| --- | --- | --- |
| J-Quants N225 large options | NK225E | Active as lagged japan_only option-implied and night-session option-state controls after source/schema smoke; not same-date target information. |
| Core U.S. options | SPY, QQQ, DIA, IWM | Opt-in computed ATM-IV proxies in japan_only_plus_us_close_core only when Massive options flat files are enabled; appendix/recent-window only unless coverage gates later support promotion. |
| Sector/semiconductor option aggregate | XLK, XLF, XLE, XLV, XLI, XLY, XLP, XLB, XLU, XLC, SMH | Opt-in aggregate median, dispersion, max, and valid-count ATM-IV state; raw sector option fields stay audit/appendix. |
| Japan ETF options | EWJ, DXJ | Opt-in computed ATM-IV proxies in japan_only_plus_us_close_core_plus_japan_proxy only when Massive options flat files are enabled; appendix/recent-window only unless coverage gates later support promotion. |
| ADR aggregate options | TM, SONY, MUFG, SMFG, MFG | Opt-in median/20% trimmed-mean aggregate; individual ADRs stay audit/appendix unless separately promoted. |
| Asia proxy option aggregate | EEM, FXI, EWY, EWT, EWH | Opt-in aggregate ATM-IV state; individual Asia option fields stay audit/appendix. |

The registered DTE buckets are short 7-30 calendar days and medium 31-90 calendar days. ATM selection is delta-neutral when delta is available or computed; otherwise the method falls back to closest-to-spot or closest-to-forward and records the method. Primary options features are capped at 45 curated aggregate features. Raw per-contract, per-sector, per-Asia-ETF, and per-ADR fields remain audit or appendix outputs.
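The registered DTE buckets translate into a simple classification rule. The sketch below assumes inclusive bounds on both ends and uses the labels "short" and "medium" from the text; the function name is hypothetical.

```python
from datetime import date


def dte_bucket(trade_date: date, expiry: date):
    """Return 'short' (7-30 calendar days), 'medium' (31-90), or None when out of scope."""
    dte = (expiry - trade_date).days
    if 7 <= dte <= 30:
        return "short"
    if 31 <= dte <= 90:
        return "medium"
    return None  # outside the registered window; such contracts are not aggregated here


print(dte_bucket(date(2024, 1, 2), date(2024, 1, 9)))   # 7 calendar days
print(dte_bucket(date(2024, 1, 2), date(2024, 2, 2)))   # 31 calendar days
print(dte_bucket(date(2024, 1, 2), date(2024, 1, 5)))   # 3 calendar days
```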

Active FRED and Cboe Inputs

The clean run fetches these FRED series:

| Block | Series | Panel variables |
| --- | --- | --- |
| Core rates and volatility | VIXCLS, DGS2, DGS10, T10Y2Y | Levels and first differences, plus fred_rates_staleness_days. |
| Canonical USD/JPY FX control | DEXJPUS | fx_usdjpy_level, fx_usdjpy_return, fx_observation_age_days, fx_release_age_days. |
| Credit-spread enriched block | BAMLH0A0HYM2, BAMLC0A0CM | Levels and first differences; enters the U.S. close core information set as credit-stress/risk-appetite proxies subject to coverage gates. |

The clean run also uses Cboe VIX historical data:

  • cboe_vix_close;
  • cboe_vix_range.

FRED and Cboe volatility fields are both retained because they serve different audit roles: FRED VIXCLS is handled through the same conservative release-lag machinery as other FRED series, while Cboe VIX supplies the volatility-index predictor used in the point-in-time U.S. close information set.

Active ML Nested Information Sets

The registered ML comparison uses four nested information sets. Options are routed by economic exposure instead of being grouped into a separate primary layer: domestic N225 options enter A, U.S. core and sector-aggregate options enter B, Japan-linked ETF/ADR options enter C, and Asia proxy aggregate options enter D.

| Information set | Active blocks |
| --- | --- |
| japan_only | Lagged loss and gap history, rolling loss moments, rolling 95% loss quantile, lagged N225 futures session/volume/OI features, lagged J-Quants N225 large-option implied-state and night-session option aggregates, calendar month terms, DST regime, absorption-regime timing, and timestamp-safe BOJ same-OSE-session flags. |
| japan_only_plus_us_close_core | japan_only plus Massive U.S. core daily features, U.S. core minute features with canonical SPY fields, FRED core rates/VIX features, FRED credit-spread proxies, Cboe VIX features, FRED H.10 USD/JPY, UUP as a dollar-risk ETF proxy, timestamp-safe FOMC/CPI/NFP and event-intensity calendar controls, computed ATM-IV proxies for SPY, QQQ, DIA, and IWM options, and aggregate sector/semiconductor ATM-IV state when enabled and audit-gated. |
| japan_only_plus_us_close_core_plus_japan_proxy | Previous set plus EWJ and DXJ daily/minute features, computed ATM-IV proxies for EWJ and DXJ options, Japanese ADR spot aggregate features, and Japanese ADR aggregate options features when enabled and audit-gated. |
| japan_only_plus_us_close_core_plus_japan_proxy_plus_asia_proxy | Previous set plus Asia/regional proxy features for EEM, FXI, EWY, EWT, and EWH, including daily, minute, and aggregate options features routed by underlying exposure. |

The primary ML nested sets do not include SOFR/EFFR, NFCI/ANFCI/STLFSI4, SKEW, or VIX term-structure proxies in the current clean run. They do include a narrow timestamp-safe event-calendar layer: BOJ same-OSE-session information in japan_only, and FOMC/CPI/NFP plus major-event intensity controls from japan_only_plus_us_close_core onward.

Planned or Candidate Inputs Not Active in the Clean Run

  • Broader Japan macro event flags beyond the current BOJ policy-session marker remain planned candidate controls. The active event layer is limited to timestamp-safe FOMC, CPI, NFP/payroll, BOJ, and simple major-event intensity controls.
  • SOFR and EFFR are post-2018 enriched FRED candidates and are not part of the current full-history primary feature set.
  • NFCI, ANFCI, and STLFSI4 are FRED robustness candidates and are not active in the current clean run.
  • Cboe SKEW, VIX9D, VIX3M, VIX6M, option-implied skew, volatility-surface measures, and variance-risk-premium proxies require a separate timestamp and coverage audit before they can enter core claims.

The utility smoke/build commands are engineering checks only:

PYTHONPATH=src uv run python -m n225_open_gap_tail.cli massive-smoke
PYTHONPATH=src uv run python -m n225_open_gap_tail.cli fred-smoke
PYTHONPATH=src uv run python -m n225_open_gap_tail.cli calendar-build
PYTHONPATH=src uv run python -m n225_open_gap_tail.cli contracts-build

They use the same cache vocabulary as the run workflow: vendor payloads under data/bronze/ and typed normalized outputs under data/silver/. Smoke artifacts do not constitute empirical validation of the forecasting paper.

data/bronze, data/silver, and data/gold are logical data-lake locations. Local machines should map DATA_DIR to external storage in .env, or use a repo-local data/ symlink that resolves outside the cloud-synced repo. reports/runs can remain local because generated run summaries, tables, and figures are small relative to the vendor cache and gold data lake.

Forecast Origins

Every modeling row must state a forecast origin and model cutoff.

| Forecast origin | Nominal timestamp | Known information | Target open | Main use |
| --- | --- | --- | --- | --- |
| US_CASH_CLOSE | Official U.S. cash close, normally 16:00 ET and adjusted for early closes | U.S. ETF, index, sector, FX, VIX, rates, and other predictor fields available by the U.S. close cutoff | Next eligible OSE day open at 08:45 JST | Main pre-open risk forecast origin. |
| OSE_NIGHT_CLOSE | 06:00 JST | OSE night close if available as an audited historical field or licensed intraday reference | Same OSE day open at 08:45 JST | Night-session absorption robustness and residual decomposition. |
| PREV_OSE_DAY_CLOSE | 15:45 JST | Previous OSE day-session close and settlement context | Next eligible OSE day open | Full opening-level risk target. |

J-Quants futures OHLC can support historical reconstruction of targets and residual decompositions after the subscription is available. It should not be described as an operational source for the US_CASH_CLOSE or OSE_NIGHT_CLOSE information set.

The phrase "U.S. close information" is a cutoff definition, not a claim that all U.S. after-close or overnight events are observed. A predictor can enter the US_CASH_CLOSE information set only when its source timestamp and configured availability lag place it at or before the model cutoff.
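The admissibility rule in the preceding paragraph reduces to a single timestamp comparison. The sketch below is illustrative: the function name is hypothetical, and the 15-minute lag and the cutoff value are made-up example numbers, not the configured vendor lag.

```python
from datetime import datetime, timedelta, timezone


def is_admissible(source_ts_utc: datetime, availability_lag: timedelta,
                  model_cutoff_ts_utc: datetime) -> bool:
    """True when the predictor is available at or before the model cutoff."""
    for ts in (source_ts_utc, model_cutoff_ts_utc):
        if ts.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
    return source_ts_utc + availability_lag <= model_cutoff_ts_utc


# Hypothetical example: a 16:00 EDT closing bar and a cutoff 15 minutes later.
close_bar = datetime(2024, 7, 1, 20, 0, tzinfo=timezone.utc)
cutoff = datetime(2024, 7, 1, 20, 15, tzinfo=timezone.utc)
print(is_admissible(close_bar, timedelta(minutes=15), cutoff))  # lag lands exactly on the cutoff
print(is_admissible(close_bar, timedelta(minutes=16), cutoff))  # lag pushes availability past it
```

Applying the comparison to the source timestamp plus its lag, rather than to the source timestamp alone, is what keeps a field that was observed before the close but published after the cutoff out of the information set.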

Predictor Universe

The first-paper predictor universe is pre-registered by economic role rather than by feature search. Candidate variables must pass timestamp, availability, and sample-coverage checks before they enter the modeling table.

| Block | Candidate variables | Source | Timing status | Economic justification |
| --- | --- | --- | --- | --- |
| Broad U.S. beta | SPY, DIA, QQQ, IWM returns and ranges | Massive.com | US_CASH_CLOSE after official close plus vendor lag | U.S. equity-market direction and risk appetite. |
| U.S. late-session dynamics | SPY/QQQ/DIA/IWM/TLT/HYG/GLD last-30-minute return, last-hour return, late-session range, late-60-minute volume surge, final-window reversal or momentum | Massive.com minute bars | US_CASH_CLOSE after official close plus vendor lag | Late U.S. trading pressure and closing imbalance proxies that may be more informative than daily close-to-close moves. |
| U.S. sectors | XLK, XLF, XLE, XLV, XLI, XLY, XLP, XLB, XLU, XLC returns and dispersion | Massive.com | US_CASH_CLOSE | Sector composition, growth/cyclical rotation, defensives, utilities, and communications exposure. |
| U.S. global-risk proxies | SMH, HYG, LQD | Massive.com core candidates | US_CASH_CLOSE after source audit | Semiconductor and credit-risk channels relevant to Japan but treated as U.S. core risk proxies. |
| Japan proxy block | EWJ, DXJ plus aggregate TM/SONY/MUFG/SMFG/MFG ADR spot summaries | Massive.com ML tail proxy block | US_CASH_CLOSE after source audit | U.S.-traded Japan equity proxies used to test whether Japan-exposure trading absorbs incremental signal beyond broad U.S. core without exposing primary ML specifications to individual ADR names. |
| Asia proxy block | EEM, FXI, EWY, EWT, EWH | Massive.com ML tail proxy block | US_CASH_CLOSE after source audit | Emerging-market, China, Korea, Taiwan, and Hong Kong proxies used to test regional and supply-chain information beyond Japan proxies. |
| FX | Canonical USD/JPY from FRED DEXJPUS only | FRED H.10 | H.10 weekly-batch as-of release | Conservative lagged currency control without letting optional Massive FX entitlement determine the main sample. |
| Safe-haven and commodity proxies | TLT, GLD, USO | Massive.com planned candidates | US_CASH_CLOSE after source audit | Flight-to-quality, dollar-rate duration, and commodity-risk channels. |
| U.S. volatility | VIX close; VIX high/low/range when available | Cboe, FRED, Massive index probe | Historical daily close or audited index timestamp | U.S. implied volatility and volatility-of-risk regime. |
| U.S. tail/skew proxies | Cboe SKEW, VIX9D, VIX3M, VIX6M | Cboe or licensed source | Tier 1.5/Tier 2 depending on access and coverage | Option-implied left-tail and volatility-term-structure information. |
| Treasury rates | DGS2, DGS10, T10Y2Y | FRED | Current historical values with +1 U.S. business-day availability lag; not ALFRED/vintage-safe by default | Rate level and curve slope. SOFR/EFFR are post-2018 enriched candidates only. |
| Credit spreads | BAMLH0A0HYM2, BAMLC0A0CM | FRED/ICE BofA | B-layer U.S. close candidate with conservative FRED lag; not ALFRED/vintage-safe | Credit-stress proxy for global downside tail risk without shortening the required core sample by default. |
| Event flags | FOMC, CPI, payrolls/NFP, BOJ, simple major-event intensity controls; broader Japan macro releases remain planned | Official calendars | Active timestamp-safe calendar controls for the narrow registered event layer; broader macro-event expansion remains candidate work | Scheduled risk-event controls without macro feature fishing. |
| Lagged Japanese futures state | Prior gap, lagged day return, lagged night return, volume/OI changes, roll/SQ flags | J-Quants futures after Premium access | Historical target-side variables, lagged before cutoff | Domestic state, lagged turnover/activity proxies, and contract-state controls; not direct market-depth measures. |

The Massive ticker selection covers U.S. market beta, technology and growth exposure, small-cap risk appetite, sector dispersion, a U.S. dollar ETF proxy, duration, safe-haven demand, commodities, Asia/EM risk, and semiconductors. Canonical USD/JPY comes from FRED DEXJPUS, not Massive. Japan and Asia proxy tickers are cached with the panel but enter the ML tail nested information sets separately from the broad U.S. core block.

Data Availability Timeline

Before modeling, each predictor block must produce an availability table with:

  • source name and vendor;
  • candidate variables;
  • raw coverage start and end;
  • usable coverage after timestamp alignment;
  • missingness rate;
  • frequency and release/update timing;
  • effective sample impact after joining to OSE target dates.

The target audit and predictor timeline jointly determine the final sample period. Variables with short or unstable histories can enter robustness tables, but not the main predictor set if they materially shorten the main sample.

Cache-First Data Lake Contract

just full builds a local cache-first data lake before model evaluation. The command defaults to force=false; use force=true only for an intentional, documented schema/cache invalidation. The recipe also defaults to options=false, so the canonical full-history run skips Massive OPRA day_aggs_v1 option-feature ingestion. This avoids making primary claims depend on a shorter OPRA entitlement window. Use just full 2016-07-19 "" 6 false true only for an opt-in appendix or recent-window run with U.S. options features. When the end argument is blank, the default data cutoff is the most recent completed Friday rather than the calendar run date; use an explicit YYYY-MM-DD end date to override the paper-freeze default. The default start is 2016-07-19, treated as a clean-sample candidate rather than a hard empirical claim. The final modeling start is written to the run manifest as:

combined_clean_start = max(
  jquants_required_field_coverage_start,
  required_massive_core_coverage_start,
  required_fred_core_coverage_start,
  canonical_fx_coverage_start
)

jquants_required_field_coverage_start defaults to 2016-07-19 only when fields_coverage_audit.parquet supports required coverage for settlement, last-trading-day, SQ-day, and central-contract fields. 2008-05-07 remains available for target-history audit or robustness runs, not as the default clean predictor sample. Because XLC remains a required U.S. sector control, the final combined_clean_start is expected to move to XLC's post-inception coverage period rather than remain at the 2016 cache lower bound.
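The blank-end default described above ("most recent completed Friday") can be sketched as a date rule. The function name is hypothetical, and one detail is an assumption made here: a run that starts on a Friday falls back to the previous Friday, because that day's session is not yet complete. The actual recipe may treat that case differently.

```python
from datetime import date, timedelta


def most_recent_completed_friday(run_date: date) -> date:
    """Latest Friday strictly before run_date (Friday has weekday() == 4)."""
    days_back = (run_date.weekday() - 4) % 7
    if days_back == 0:
        days_back = 7  # assumption: a Friday run date is not yet a completed Friday
    return run_date - timedelta(days=days_back)


# The current run ID carries a 2026-05-12 run timestamp and a 2026-05-08 window end.
print(most_recent_completed_friday(date(2026, 5, 12)))
```

Under this rule the 2026-05-12 run date maps to 2026-05-08, which matches the requested-window end recorded for the current run.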

Physical layout uses Hive-style Parquet partitions with schema version in the path:

data/bronze/jquants_futures_daily/schema_version=1/year=2016/month=07/data.parquet
data/silver/jquants_nk225f_daily/schema_version=2/year=2016/month=07/data.parquet
data/silver/massive_minute_features/schema_version=1/ticker=spy/year=2016/month=07/data.parquet
data/bronze/calendar_sessions/schema_version=1/start=2016-07-19/end=2026-05-02/metadata.json
data/silver/calendar_sessions/schema_version=1/start=2016-07-19/end=2026-05-02/data.parquet
data/bronze/nikkei_contracts_rule_based/schema_version=1/start=2016-07-19/end=2026-05-02/metadata.json
data/silver/nikkei_contracts_rule_based/schema_version=1/start=2016-07-19/end=2026-05-02/contracts.parquet
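The year/month layout shown in the first paths can be produced by a small path builder. This is a sketch with hypothetical names; it covers only the year/month partition pattern, not the start/end pattern used by the calendar and contract datasets or the extra ticker level used by the minute features.

```python
from pathlib import PurePosixPath


def partition_path(layer: str, dataset: str, schema_version: int, year: int, month: int,
                   filename: str = "data.parquet", root: str = "data") -> PurePosixPath:
    """Build a Hive-style partition path with the schema version in the path."""
    return PurePosixPath(
        root,
        layer,
        dataset,
        f"schema_version={schema_version}",
        f"year={year:04d}",
        f"month={month:02d}",  # zero-padded so lexical and chronological order agree
        filename,
    )


example_path = partition_path("bronze", "jquants_futures_daily", 1, 2016, 7)
print(example_path)
```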

All Parquet writes are atomic: write to .tmp.<pid>.<uuid>, validate row count and schema, compute separate xxhash64 chunk and schema hashes, then os.replace() into place. At the start of just full, orphan .tmp files older than two hours are removed. Readers use explicit Hive partition schemas so year and month are numeric, not inferred strings.
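The write-validate-replace pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's writer: it handles raw bytes rather than Parquet, uses hashlib.blake2b as a stand-in for xxhash64, replaces row-count and schema validation with a byte comparison, and assumes the temporary name is a sibling of the target. Both function names are hypothetical.

```python
import hashlib
import os
import tempfile
import time
import uuid
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> str:
    """Write to a temp sibling, validate, then os.replace() into place."""
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(f"{target.name}.tmp.{os.getpid()}.{uuid.uuid4().hex}")
    try:
        tmp.write_bytes(payload)
        written = tmp.read_bytes()
        if written != payload:  # stand-in for row-count and schema validation
            raise ValueError(f"validation failed for {tmp}")
        # blake2b is a stdlib stand-in for the xxhash64 content hash.
        digest = hashlib.blake2b(written, digest_size=8).hexdigest()
        os.replace(tmp, target)  # atomic when tmp and target share a filesystem
    except BaseException:
        tmp.unlink(missing_ok=True)  # never leave a half-written temp file
        raise
    return digest


def remove_orphan_tmp(root: Path, max_age_seconds: float = 2 * 3600) -> int:
    """Delete leftover *.tmp.* files older than max_age_seconds; return count."""
    cutoff = time.time() - max_age_seconds
    removed = 0
    for path in root.rglob("*.tmp.*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "year=2016" / "month=07" / "data.parquet"
    digest = atomic_write_bytes(out, b"example-bytes")
    final_exists = out.exists()
    leftovers = list(Path(d).rglob("*.tmp.*"))
    stale = out.with_name("data.parquet.tmp.1.deadbeef")
    stale.write_bytes(b"x")
    os.utime(stale, (0, 0))  # pretend the orphan is very old
    removed = remove_orphan_tmp(Path(d))
```

Readers never observe a partially written file because the final path only appears through os.replace().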

Layer boundaries:

  • Bronze stores typed vendor-cache rows and provenance: endpoint, requested range, pull timestamps, row counts, schema version, schema hash, and content hash.
  • Silver stores canonical research rows. J-Quants silver keeps only NK225F rows, stores UTC-aware timestamps, flags non-positive prices and OHLC violations, and does not impute.
  • Gold joins targets, calendar map, Massive predictors, minute late-session features, FRED predictors, roll/SQ flags, and audit columns by ose_trading_date.

Rebuild semantics are layer-aware. Rebuilding silver or gold uses existing local cache and does not call vendor APIs unless bronze is missing or a vendor refresh is explicitly requested.

Calendar Map and Join Diagnostics

calendar_map.parquet is built before the gold panel. It maps the relevant U.S. close date to each OSE target date and records:

  • U.S. official close UTC and early-close flag;
  • EST/EDT regime;
  • OSE day open and night close UTC timestamps;
  • us_close_to_ose_night_close_minutes;
  • model cutoff and target open;
  • enum-valued mapping_status: normal_trading, us_holiday, jp_holiday, us_jp_desync, ose_holiday_trading, or unmapped.

Gold joins preserve target rows when predictors are missing. Missing predictors are reported with enum-valued join_miss_reason, including entitlement gaps, missing cache partitions, FRED release lag, fred_vintage_not_realtime_safe, market-calendar desync, and predictor nulls. This keeps structural missingness separate from random data gaps.
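A target-preserving join can be sketched as a left join keyed on ose_trading_date. The mapping_status values below are the ones listed above; the function name, the row layout, and the reason label "missing_cache_partition" are hypothetical, since the full join_miss_reason enum is not spelled out here.

```python
from enum import Enum
from typing import Dict, List


class MappingStatus(str, Enum):
    NORMAL_TRADING = "normal_trading"
    US_HOLIDAY = "us_holiday"
    JP_HOLIDAY = "jp_holiday"
    US_JP_DESYNC = "us_jp_desync"
    OSE_HOLIDAY_TRADING = "ose_holiday_trading"
    UNMAPPED = "unmapped"


def left_join_predictors(targets: List[dict], predictors: Dict[str, dict],
                         miss_reasons: Dict[str, str]) -> List[dict]:
    """Keep every target row; record why predictors are absent when they are."""
    joined = []
    for row in targets:
        key = row["ose_trading_date"]
        out = dict(row)
        features = predictors.get(key)
        if features is None:
            # Hypothetical fallback label; real labels come from the enum.
            out["join_miss_reason"] = miss_reasons.get(key, "missing_cache_partition")
        else:
            out.update(features)
            out["join_miss_reason"] = None
        joined.append(out)
    return joined


targets = [
    {"ose_trading_date": "2024-01-16", "mapping_status": MappingStatus.NORMAL_TRADING.value},
    {"ose_trading_date": "2024-01-17", "mapping_status": MappingStatus.US_HOLIDAY.value},
]
predictors = {"2024-01-16": {"spy_late_return": 0.001}}  # illustrative value
gold = left_join_predictors(targets, predictors, {"2024-01-17": "fred_vintage_not_realtime_safe"})
```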

FRED TTL and Vintage Label

Default FRED ingestion uses current historical values with a conservative availability lag and is labeled vintage_safe=false. The cache has a default 30-day TTL, evaluated exactly once at run start. Chunks that are stale at run start refresh before use; chunks that are fresh at run start remain valid for that run even if the TTL would expire mid-run. Each FRED cache metadata file records the pull timestamp, run-start TTL decision timestamp, vintage label, revision-risk label, and refresh status.
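The run-start TTL rule amounts to one staleness decision per chunk, taken against a single frozen timestamp. A minimal sketch, assuming a chunk is stale when its age strictly exceeds the TTL (the boundary convention is an assumption) and using a hypothetical function name and illustrative pull times:

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def chunks_to_refresh(pull_ts_by_chunk: Dict[str, datetime],
                      run_start_utc: datetime,
                      ttl: timedelta = timedelta(days=30)) -> List[str]:
    """Decide staleness once, against run start, never against 'now'."""
    return sorted(chunk for chunk, pulled in pull_ts_by_chunk.items()
                  if run_start_utc - pulled > ttl)


run_start = datetime(2024, 3, 1, tzinfo=timezone.utc)
pulls = {
    "DGS10": datetime(2024, 1, 15, tzinfo=timezone.utc),  # 46 days old: stale
    "DGS2": datetime(2024, 2, 20, tzinfo=timezone.utc),   # 10 days old: fresh
}
to_refresh = chunks_to_refresh(pulls, run_start)
```

Because run_start is fixed before any chunk is read, a chunk that is fresh at run start cannot become stale partway through the run.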

For ordinary non-FX FRED predictors, the gold panel selects each feature independently using the latest non-null value whose feature_available_ts_utc is no later than the model cutoff. Forward-filled levels keep their source observation date and availability timestamp; synthetic filled diffs are set to 0.0 and marked with fill metadata rather than treated as raw observations. The expanded ML tail block also carries fred_rates_staleness_days, computed from the DGS2/DGS10/T10Y2Y rate block, so release lag can be learned by the model instead of remaining only a diagnostic.
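The per-feature selection rule is an as-of lookup on availability time. A minimal sketch, assuming observations arrive as (feature_available_ts_utc, value) pairs; the function name and sample values are hypothetical:

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple

Obs = Tuple[datetime, Optional[float]]


def latest_available(observations: Iterable[Obs], cutoff: datetime) -> Optional[Obs]:
    """Latest non-null observation whose availability is no later than cutoff."""
    eligible = [(ts, v) for ts, v in observations if v is not None and ts <= cutoff]
    return max(eligible, key=lambda pair: pair[0]) if eligible else None


def utc(day: int, hour: int = 0) -> datetime:
    return datetime(2024, 1, day, hour, tzinfo=timezone.utc)


series = [(utc(10), 4.10), (utc(11), None), (utc(12), 4.20)]  # illustrative
picked = latest_available(series, cutoff=utc(11, 12))
```

Here the null on the 11th is skipped and the value from the 10th is carried, while the observation on the 12th is excluded because it becomes available after the cutoff.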

ML tail writes feature-unavailability diagnostics under metrics/ for each tail-risk run: ml_tail_feature_unavailability.parquet aggregates missing active features by information set, and ml_tail_feature_unavailability_dates.parquet keeps the date-level trace needed to separate structural gaps such as late-session minute volume from FRED release-lag handling.

ML tail also writes a result-matrix layer under metrics/ for model-family audit: ml_tail_result_matrix.parquet, ml_tail_result_matrix_sample_audit.parquet, ml_tail_result_matrix_dm.parquet, and the run-specific ml_tail_result_matrix_notes.md artifact. This layer separates VaR-only comparisons (var_quantile_loss, coverage, exception diagnostics) from VaR-ES joint scoring (var_es_fz_loss) and uses restricted common samples. It does not replace the primary ML tail ladder in ml_tail_metrics.parquet.

Target Hierarchy

All target formulas use log gaps.

Main target:

  • full_gap_settle_to_open = log(day_open_t) - log(prev_settlement_{t-1}).

Secondary target:

  • full_gap_close_to_open = log(day_open_t) - log(prev_day_close_{t-1}).

Absorption robustness target:

  • residual_nightclose_to_day_open = log(day_open_t) - log(night_close_t), when night_close_t is available and its timestamp semantics are audited.

Licensed-data extension:

  • residual_usclosemark_to_open = log(day_open_t) - log(nikkei_futures_mark_at_us_cash_close_t).

residual_usclosemark_to_open is disabled until a licensed intraday OSE, CME, SGX, or equivalent Nikkei futures reference mark exists at the U.S. cash close.
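All four targets share one form, a difference of log prices. The sketch below assumes strictly positive futures prices; the helper name and the sample prices are hypothetical:

```python
import math


def log_gap(open_price: float, reference_price: float) -> float:
    """log(open) - log(reference); both prices must be strictly positive."""
    if open_price <= 0 or reference_price <= 0:
        raise ValueError("prices must be strictly positive")
    return math.log(open_price) - math.log(reference_price)


# Illustrative prices only.
day_open_t = 38500.0
prev_settlement = 38000.0
night_close_t = 38450.0

full_gap_settle_to_open = log_gap(day_open_t, prev_settlement)
residual_nightclose_to_day_open = log_gap(day_open_t, night_close_t)
```

The targets differ only in the reference price, so the reference timestamp semantics carry the whole distinction between a full gap and a residual.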

J-Quants Field-to-Use Contract

| Field | Use | Audit note |
| --- | --- | --- |
| DaySessionOpen | Target open for all full-gap and residual targets | Must be present and traceable to raw contract rows. |
| DaySessionClose | Reference for full_gap_close_to_open and lagged Japanese controls | Must refer to the same contract convention used in target construction. |
| NightSessionClose | Reference for residual_nightclose_to_day_open and night-session absorption controls | Historical residual source only unless a licensed operational feed exists. |
| SettlementPrice | Main reference for full_gap_settle_to_open | Must be matched to the prior eligible contract/session. |
| Volume | Lagged turnover/activity proxy and data-sanity field | Session-specific volume is used only if available and verified; it is not interpreted as order-book depth. |
| OpenInterest | Contract-state, participation, and roll diagnostics | Used with roll/SQ controls; it is not a direct liquidity-depth measure. |
| LastTradingDay | Roll-window flag and expiry exclusion logic | Must be reconciled with JPX contract rules. |
| SpecialQuotationDay | SQ-window flag and robustness/exclusion rule | Used to prevent SQ-driven artifacts from dominating the tail. |
| CentralContractMonthFlag | Main contract selection and roll diagnostics | Must be reconciled against observed liquidity and metadata. |

The main contract is the OSE Nikkei 225 Futures large contract. Mini and micro contracts are robustness or liquidity checks only unless the research design changes.

Contract Roll Mechanics

Target gaps are calculated intra-contract wherever possible. A target observation should not mechanically join the settlement or close of one contract to the day open of another contract and treat the resulting artificial spread as a market opening gap.

Default policy:

  • select the active contract using audited CentralContractMonthFlag, contract month, liquidity, and roll metadata;
  • exclude target gaps that cross a contract roll, last-trading-day boundary, or SQ exclusion window from the main specification;
  • keep roll, SQ, and near-expiry flags for audit and robustness tables;
  • use flag-and-include only as a robustness exercise;
  • use ratio-adjusted or Panama-style continuous series only for robustness or long-memory volatility features, not for the main opening-gap target.

The target audit must report how many observations are excluded by the roll/SQ policy and whether extreme gaps are traceable to raw same-contract rows.
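The exclusion policy can be sketched as a guard around the gap calculation. Every field name and exclusion label below is hypothetical; the sketch only illustrates that a gap is computed when both rows belong to the same contract and no exclusion flag applies.

```python
import math
from typing import Optional, Tuple


def same_contract_gap(prev_row: dict, curr_row: dict) -> Tuple[Optional[float], Optional[str]]:
    """Return (gap, None) for an eligible pair, else (None, exclusion_reason)."""
    if prev_row["contract_month"] != curr_row["contract_month"]:
        return None, "crosses_contract_roll"
    if prev_row.get("is_last_trading_day", False):
        return None, "last_trading_day_boundary"
    if curr_row.get("in_sq_exclusion_window", False):
        return None, "sq_exclusion_window"
    gap = math.log(curr_row["day_open"]) - math.log(prev_row["settlement"])
    return gap, None


prev = {"contract_month": "2024-03", "settlement": 38000.0}  # illustrative
same = {"contract_month": "2024-03", "day_open": 38500.0}
rolled = {"contract_month": "2024-06", "day_open": 38300.0}

gap_ok, reason_ok = same_contract_gap(prev, same)
gap_roll, reason_roll = same_contract_gap(prev, rolled)
```

Counting the returned reasons gives the exclusion tally that the target audit has to report.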

Massive Timestamp Policy

Massive raw timestamps are stored in UTC. Preprocessing must convert them explicitly to U.S. Eastern Time before applying U.S. session cutoffs.

The US_CASH_CLOSE cutoff is based on the official U.S. cash-market close:

  • regular sessions normally use 16:00 ET;
  • NYSE early-close sessions use the official early-close time;
  • a configurable vendor-availability lag is applied after the close, with a default of 15 minutes for research feature freezing;
  • the lag is a conservative research convention, not a live data guarantee;
  • is_us_early_close and DST regime flags must be stored.

No feature may enter a US_CASH_CLOSE forecast row unless its vendor_available_ts_utc is no later than model_cutoff_ts_utc.
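The cutoff rule can be sketched with the standard zoneinfo module. The function names are hypothetical, and the early-close time must be supplied from the NYSE calendar rather than inferred; the sketch only shows the ET-to-UTC conversion and the default 15-minute lag.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")


def model_cutoff_utc(session_date: date,
                     close_time_et: time = time(16, 0),
                     lag: timedelta = timedelta(minutes=15)) -> datetime:
    """Official close in ET, converted to UTC, plus the availability lag."""
    close_et = datetime.combine(session_date, close_time_et, tzinfo=ET)
    return close_et.astimezone(timezone.utc) + lag


def is_edt(session_date: date, close_time_et: time = time(16, 0)) -> bool:
    """DST regime flag at the close (True for EDT, False for EST)."""
    close_et = datetime.combine(session_date, close_time_et, tzinfo=ET)
    return close_et.dst() != timedelta(0)


winter_cutoff = model_cutoff_utc(date(2024, 1, 16))   # EST regular session
summer_cutoff = model_cutoff_utc(date(2024, 7, 16))   # EDT regular session
early_cutoff = model_cutoff_utc(date(2024, 11, 29), close_time_et=time(13, 0))
```

The same 16:00 ET close maps to different UTC times across DST regimes, which is why the conversion has to be explicit and the regime flag stored.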

Public Data Lake Tiers

The data lake is intentionally tiered to prevent feature fishing.

Tier 0: Calendars and Timing

  • JPX/OSE trading hours, holidays, holiday trading, and target-session eligibility.
  • NYSE holidays and official early closes.
  • U.S. DST transition dates and UTC/ET/JST conversion tables.
  • Roll windows, SQ windows, and contract-expiry metadata.

Tier 1: Core Controls and Predictors

  • Massive U.S. ETF, sector, equity-index, dollar ETF proxy, Japan proxy, and Asia proxy predictors. Canonical USD/JPY is the FRED H.10 DEXJPUS series.
  • U.S.-listed minute-bar late-session features: generic ticker-prefixed fields for the curated minute universe, with canonical spy_late_* and spy_final_* fields kept by the SPY compatibility adapter. Features include late returns, realized variance, up/down semivariance, noisy small-sample skewness/kurtosis, range, volume surge, volume z-score, volume percentile, and final-window momentum, all frozen at U.S. close plus the configured vendor-availability lag. Volume normalization is within ticker and uses prior rolling history only.
  • Massive core block additions: XLY, XLP, XLB, XLU, XLC, TLT, GLD, USO, SMH, HYG, and LQD after source and coverage audit.
  • Massive ML tail proxy blocks: Japan proxy (EWJ, DXJ) and Asia proxy (EEM, FXI, EWY, EWT, EWH) are cached now but interpreted separately from the core U.S. close block.
  • Cboe or FRED VIX close; VIX high, low, and range only when the source supports them.
  • FRED 2-year and 10-year Treasury yields, T10Y2Y yield-curve slope, and ICE BofA credit-spread proxies. Credit spreads enter the B-layer U.S. close core as current-historical FRED series with conservative lag controls, not ALFRED/vintage-safe series. SOFR/EFFR funding proxies are post_2018_enriched, not part of the current full-history primary feature set.
  • Event calendar controls: timestamp-safe FOMC, CPI, NFP/payroll, BOJ policy events, and simple major-event intensity controls are active; broader Japan macro releases remain planned candidates.
  • Lagged Japanese futures variables: prior gap, lagged OSE day return, lagged OSE night return when available, volume/open-interest changes, roll/SQ flags, and holiday-adjacent flags.

Tier 1.5: Tail-Risk Proxy Candidates

  • Cboe SKEW or a licensed SKEW proxy.
  • VIX term-structure proxies such as VIX9D, VIX3M, and VIX6M.
  • Massive index probes such as I:VIX and I:SKEW only if the plan supports them.
  • U.S.-listed options-risk features can be computed from Massive OPRA day aggregates as ATM-IV proxies, but only in explicit options=true runs. The underlyings are SPY, QQQ, DIA, IWM, sector ETFs plus SMH, EWJ, DXJ, Asia proxy ETFs, and primary Japanese ADR aggregates. They are routed by economic exposure: U.S. core and sector aggregate options into B, Japan ETF and ADR aggregate options into C, and Asia proxy aggregate options into D. Under the current data entitlement they are appendix/recent-window diagnostics, not canonical full-history primary predictors. J-Quants NK225E daily options are active, but only as lagged domestic japan_only state, including lagged night-session option OHLC summaries from EO/EH/EL/EC.

Tier 1.5 variables are natural for a tail-risk paper but must not shorten the main sample or introduce unclear availability timestamps. If they do, they move to Tier 2 robustness.

Tier 2: Leakage-Safe Extensions

  • ALFRED-style real-time vintage macro series.
  • CME, SGX, or intraday OSE Nikkei futures marks for residual_usclosemark_to_open.
  • Options-implied skew, volatility-surface, or tail-risk proxies where licensed.
  • Public-source replication checks where a lower-fidelity but academically accessible substitute exists.

Tier 2 variables do not enter first-paper core claims unless their timestamps, availability, and source definitions are audited.

Timestamp Fields

Data rows should separate the following timestamps:

  • observation_ts_utc
  • bar_start_ts_utc
  • bar_end_ts_utc
  • vendor_available_ts_utc
  • research_download_ts_utc
  • model_cutoff_ts_utc
  • target_open_ts_utc
  • reference_price_ts_utc
  • release_ts_utc, when using scheduled macro or event data
  • vintage_date, when using revised macro series

Core invariants:

  • target_open_ts_utc > model_cutoff_ts_utc.
  • Feature availability timestamps must be no later than model_cutoff_ts_utc.
  • Residual target reference prices must satisfy reference_price_ts_utc <= model_cutoff_ts_utc.
  • Full-gap ex-post reference prices must be labeled as previous-session references, not live U.S.-close residual marks.
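The first three invariants are mechanical and can be checked row by row. A minimal sketch, assuming a row carries a mapping of per-feature availability timestamps and a hypothetical is_residual_target flag; the row layout, function name, and timestamps are illustrative only.

```python
from datetime import datetime, timezone
from typing import List


def invariant_violations(row: dict) -> List[str]:
    """Return human-readable violations of the timestamp invariants."""
    cutoff = row["model_cutoff_ts_utc"]
    problems = []
    if not row["target_open_ts_utc"] > cutoff:
        problems.append("target_open_ts_utc must be after model_cutoff_ts_utc")
    for name, ts in row.get("feature_available_ts_utc", {}).items():
        if ts > cutoff:
            problems.append(f"feature {name} is available after model_cutoff_ts_utc")
    if row.get("is_residual_target", False) and row["reference_price_ts_utc"] > cutoff:
        problems.append("residual reference price is later than model_cutoff_ts_utc")
    return problems


def ts(hour: int, minute: int) -> datetime:
    return datetime(2024, 1, 16, hour, minute, tzinfo=timezone.utc)


good_row = {
    "model_cutoff_ts_utc": ts(21, 15),
    "target_open_ts_utc": ts(23, 45),
    "feature_available_ts_utc": {"spy_late_return": ts(21, 15)},
}
bad_row = dict(good_row, feature_available_ts_utc={"late_feature": ts(21, 30)})
good_problems = invariant_violations(good_row)
bad_problems = invariant_violations(bad_row)
```

The fourth invariant is a labeling rule for full-gap references and is not something a timestamp comparison can enforce.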

Tail-Risk Labels and EVT Data Requirements

The main paper evaluates both downside and upside opening-gap risk under a positive-loss convention:

  • define left_tail losses as L_t = -gap_t;
  • define right_tail losses as L_t = gap_t;
  • define exceedances using training-window thresholds only;
  • store threshold, exceedance indicator, exceedance severity, VaR forecast, and ES forecast;
  • report training-window standardized-loss counts and exceedance counts before reporting POT-GPD VaR/ES forecasts.
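The labeling steps above can be sketched as follows. The nearest-rank quantile is an assumption, since the appendix does not state the quantile convention, and the function names are hypothetical. The threshold is computed from training-window losses only and then applied to evaluation losses.

```python
import math
from typing import List, Sequence, Tuple


def tail_losses(gaps: Sequence[float], tail: str) -> List[float]:
    """Positive-loss convention: left_tail uses -gap, right_tail uses gap."""
    if tail == "left_tail":
        return [-g for g in gaps]
    if tail == "right_tail":
        return list(gaps)
    raise ValueError(f"unknown tail: {tail}")


def nearest_rank_quantile(values: Sequence[float], q: float) -> float:
    """Nearest-rank empirical quantile (an assumed convention)."""
    if not values or not 0 < q <= 1:
        raise ValueError("need non-empty values and 0 < q <= 1")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def label_exceedances(train_losses: Sequence[float],
                      eval_losses: Sequence[float],
                      q: float = 0.90) -> Tuple[float, List[Tuple[bool, float]]]:
    """Threshold from training losses only; label evaluation losses against it."""
    u = nearest_rank_quantile(train_losses, q)
    labels = [(loss > u, max(loss - u, 0.0)) for loss in eval_losses]
    return u, labels


# Toy numbers with q=0.5 so the arithmetic is easy to follow by hand.
u, labels = label_exceedances([1.0, 2.0, 3.0, 4.0], [1.5, 3.5], q=0.5)
```

Each label pairs the exceedance indicator with the exceedance severity, which is zero when the loss does not exceed the threshold.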

The primary tail level is 0.95. Plain standardized-loss POT-GPD is the registered filtered-EVT estimator. The UniBM route is a restricted shape-estimator comparison: it uses the same LightGBM mean/log-scale body filter, the same POT threshold, and a UniBM block-maxima-derived estimate of the GPD shape xi, with scale refit conditional on that fixed xi. In field names and diagnostics, EVI (extreme-value index) means the GPD shape xi, which equals the reciprocal of the Pareto tail index alpha when P(X > x) ~ x^{-alpha}. UniBM failures are reported as unavailable rather than replaced by plain MLE. Diagnostics record shape method, UniBM block-grid diagnostics, threshold sensitivity, shape/scale stability, shape bins, and whether ES is finite.
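For orientation, the textbook POT-GPD tail formulas show how a threshold u, shape xi, scale beta, and exceedance fraction turn into VaR and ES on the standardized-loss scale, and why ES is finite only when xi < 1. This is a sketch of the standard formulas, not a statement of the project's estimator code; the function and parameter names are hypothetical.

```python
import math
from typing import Optional, Tuple


def pot_gpd_var_es(u: float, xi: float, beta: float,
                   exceed_frac: float, p: float = 0.95) -> Tuple[float, Optional[float]]:
    """Textbook POT-GPD VaR and ES at level p; ES is None when xi >= 1."""
    if beta <= 0 or not 0 < exceed_frac <= 1:
        raise ValueError("need beta > 0 and 0 < exceed_frac <= 1")
    tail_prob = 1.0 - p
    if not 0 < tail_prob <= exceed_frac:
        raise ValueError("p must place VaR at or above the threshold")
    ratio = exceed_frac / tail_prob
    if abs(xi) < 1e-12:
        var = u + beta * math.log(ratio)  # exponential limit as xi -> 0
    else:
        var = u + (beta / xi) * (ratio ** xi - 1.0)
    es = (var + beta - xi * u) / (1.0 - xi) if xi < 1.0 else None
    return var, es


# Illustrative parameters: 10% of losses exceed u, tail level 0.95.
var_exp, es_exp = pot_gpd_var_es(u=0.0, xi=0.0, beta=1.0, exceed_frac=0.10)
var_heavy, es_heavy = pot_gpd_var_es(u=0.0, xi=0.5, beta=1.0, exceed_frac=0.10)
_, es_infinite = pot_gpd_var_es(u=0.0, xi=1.2, beta=1.0, exceed_frac=0.10)
```

With xi = 0.5 the Pareto tail index is alpha = 1/xi = 2, so the mean excess exists but the variance does not; with xi >= 1 the ES itself is undefined.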

Upper-tail labels are part of the two-sided futures risk surface. They use the same positive-loss convention as lower-tail labels, with right_tail defined as the original opening gap in loss units and evaluated under the same sample, coverage, and inference gates.

Model-Ready Loss Fields

Processed model tables should carry the fields needed to audit the LightGBM-standardized-loss POT-GPD path:

| Field | Meaning |
| --- | --- |
| gap_t | Target log gap for the selected target family. |
| loss_t | Downside loss, defined as -gap_t. |
| baseline_residual_loss_t | Residual loss from the selected baseline location or location-scale model. |
| lgbm_predicted_location_t | LightGBM conditional location prediction where used. |
| lgbm_predicted_scale_t | LightGBM conditional scale prediction used to standardize losses. |
| scale_smearing_factor | Pooled Duan retransformation factor computed from out-of-fold scale residuals. |
| oof_standardized_loss_count | Number of fully out-of-fold standardized losses available for empirical or EVT tail calibration. |
| standardized_loss_t | Loss divided by predicted scale, after any documented location adjustment. |
| evt_threshold_u | Training-window POT threshold used for the row's forecast. |
| exceedance_indicator_t | Indicator that standardized_loss_t exceeds the threshold. |
| exceedance_severity_t | Excess over threshold for EVT severity calibration. |
| evt_variant | POT-GPD variant label. The registered estimator is plain MLE; the restricted UniBM comparison is unibm. |
| evt_shape_method | Shape-estimation method recorded for the row's EVT calibration. |
| evt_evi_status | Extreme-value-index status for xi, including unavailable or diagnostic-disagreement cases. |
| evt_ei_status | Extremal-index status, including unavailable or no-discount fallbacks. |
| evt_cap_policy | Shape-cap policy used for the variant. |
| evt_cap_hit | Indicator that the fitted shape hit a cap where a cap policy is used. |
| evt_scale_refit_status | Status for GPD scale handling. |
| evt_es_finite | Whether the row supplies a finite ES under the fitted shape. |
| tail_probability_alpha | VaR/ES tail probability for the forecast row. |
| var_forecast | VaR forecast transformed back to target scale. |
| es_forecast | ES forecast transformed back to target scale. |

The registered primary POT threshold remains fixed at 0.90. Threshold sensitivity is written as a diagnostic artifact before any dynamic-threshold rule is promoted to the registered primary design. The location-scale empirical and POT-GPD variants use a common final LightGBM location-scale backbone by construction; diagnostic EVT variants differ in tail calibration rather than in a variant-specific final location/scale seed.
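The link between standardized_loss_t and the target-scale var_forecast and es_forecast fields can be sketched as an affine map and its inverse. This assumes a simple location-scale form and deliberately leaves out scale_smearing_factor, whose exact placement is not specified here; both function names are hypothetical.

```python
def standardize_loss(loss: float, location: float, scale: float) -> float:
    """(loss - location) / scale, with a strictly positive predicted scale."""
    if scale <= 0:
        raise ValueError("scale must be strictly positive")
    return (loss - location) / scale


def to_target_scale(z_forecast: float, location: float, scale: float) -> float:
    """Map a standardized VaR or ES forecast back to target units."""
    if scale <= 0:
        raise ValueError("scale must be strictly positive")
    return location + scale * z_forecast


# Illustrative numbers only.
z = standardize_loss(loss=0.02, location=0.001, scale=0.01)
round_trip = to_target_scale(z, location=0.001, scale=0.01)
var_target = to_target_scale(2.5, location=0.001, scale=0.01)
```

Because the map is increasing and affine for positive scale, a standardized VaR or ES carries over to target scale without changing the tail probability.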

Source Notes