Development Audit Prompt

Use this as the handoff contract for implementation review. It should stay durable: keep the research boundaries, audit questions, and claim gates here; keep concrete commands, file paths, and artifact names in the repository documentation and current workflow outputs.

You are working in the n225-open-gap-tail repository. Your task is to build and audit
the reproducible research pipeline for "U.S. Close Information and Pre-Open Tail Risk
in OSE Nikkei 225 Futures".

The research is about OSE Nikkei 225 Futures pre-open tail risk, not generic Japanese
equity overnight returns. Both left-tail downside risk and right-tail upside risk are
modeled as separate futures risk surfaces. OSE futures have a night session, so every
model, table, figure, and claim must state its forecast origin, reference price, target
family, tail side, and information cutoff.

Start by auditing the current repository state against this contract. Only proceed to
new implementation after documenting blockers, non-blocking risks, missing tests, and
documentation drift.

Working principles:

- Use the repository's documented workflow entrypoints for status checks, validation,
  full runs, and documentation builds. Use lower-level entrypoints only when debugging
  a specific layer.
- Treat the data path as cache-first. Rebuilding derived layers must not call vendor
  APIs unless raw caches are missing or refresh was explicitly requested (a cache-first
  loader sketch follows this list).
- Vendor credentials, raw vendor data, local environments, caches, generated reports,
  and local build artifacts are machine-local state and must not be committed.
- Any external sidecar or worker output remains unmerged evidence until Codex or a
  human reviews it. Do not treat worker patches or sidecar artifacts as repository
  truth before review.
- Keep tests honest: unit tests, schema tests, smoke tests, and real-data validation
  tests must be named and documented separately.
- Maintain the repository's coverage standard and strict documentation-build standard.
  Every new functional module needs focused tests with small synthetic fixtures unless
  the feature genuinely requires real vendor data.
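
For illustration of the cache-first principle, a minimal sketch of a loader that only calls the vendor on a cache miss or an explicit refresh. The function, file layout, and parameter names are assumptions for this sketch, not the repository's actual API.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_raw_bars(symbol: str,
                  cache_dir: Path,
                  fetch_from_vendor: Callable[[str], pd.DataFrame],
                  refresh: bool = False) -> pd.DataFrame:
    """Return raw vendor bars from the local cache; call the vendor API only
    when the cache file is missing or a refresh was explicitly requested."""
    cache_file = cache_dir / f"{symbol}.parquet"
    if cache_file.exists() and not refresh:
        return pd.read_parquet(cache_file)      # cache hit: no vendor call
    raw = fetch_from_vendor(symbol)             # vendor call only on miss or refresh
    cache_dir.mkdir(parents=True, exist_ok=True)
    raw.to_parquet(cache_file)                  # persist as machine-local cache state
    return raw
```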

Audit checklist before adding features:

- Does every data row distinguish observation time, bar end time, research download
  time, vendor availability time where known, model cutoff time, and target-open time
  where relevant? (A schema-check sketch follows this checklist.)
- Do vendor-source, calendar, and contract-metadata outputs remain smoke or schema
  artifacts rather than empirical validation claims?
- Is the OSE futures target clearly labeled as historical licensed research data
  rather than live pre-open production data?
- Are local state, credentials, raw data, caches, generated reports, and build outputs
  excluded from version control?
- Does the documented verification workflow pass before claims are promoted?
- Are tests labeled honestly as unit, schema, smoke, or real-data checks?
- Are rule-based contract metadata and exchange-calendar outputs clearly labeled as
  scaffolding that requires vendor reconciliation?
- Is any claim about model performance, VaR/ES calibration, or hedge usefulness
  unsupported by current artifacts?
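
As a concrete illustration of the timestamp-separation item above, a row-level schema check could look like the following. The column names are assumptions for this sketch; the repository's actual schema may differ.

```python
import pandas as pd

# Hypothetical timestamp columns, one per role named in the checklist item.
TIMESTAMP_COLUMNS = [
    "observed_at",      # observation time
    "bar_end_at",       # bar end time
    "downloaded_at",    # research download time
    "available_at",     # vendor availability time, where known
    "cutoff_at",        # model cutoff time
    "target_open_at",   # target-open time, where relevant
]


def check_timestamp_schema(panel: pd.DataFrame) -> list[str]:
    """Return the names of required timestamp columns that are missing
    or not stored as timezone-aware datetimes."""
    problems = []
    for col in TIMESTAMP_COLUMNS:
        if col not in panel.columns:
            problems.append(f"missing column: {col}")
        elif not isinstance(panel[col].dtype, pd.DatetimeTZDtype):
            problems.append(f"not timezone-aware: {col}")
    return problems
```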

Current implementation status:

- The main data-engineering path is implemented: source probes, cache-first reads,
  durable modeling-panel artifacts, calendar mapping, target audit, feature coverage,
  leakage binding, and run-specific reports.
- The benchmark floor and advanced econometric benchmark layer are implemented behind
  gates. Advanced benchmarks remain nonblocking diagnostics unless sample, stability,
  and author-review gates support stronger use.
- The ML tail path is implemented for direct quantile, location-scale, and
  standardized-loss POT-GPD variants over the registered nested information ladder.
- Result governance is implemented for headline metrics, per-model diagnostics, result
  matrix artifacts, feature-unavailability diagnostics, paired-loss inference,
  confidence-set inference, Murphy diagnostics, stress windows, DST attenuation, and
  conditional predictive ability diagnostics.
- Reporting utilities generate manuscript-facing discussion, evidence maps, table
  manifests, and figure galleries from artifacts. These outputs summarize evidence;
  they do not create new empirical evidence.

Research design gates:

- Define forecast origins before ingestion or modeling.
- Define target families before feature engineering or evaluation.
- Treat the U.S.-close-mark residual target as unavailable unless a licensed,
  timestamped intraday Nikkei futures reference mark exists and is available at the
  U.S. cash close.
- Every empirical claim must specify forecast origin, reference price, target family,
  tail side, and information cutoff (see the claim-label sketch after this list).
- Treat upper-tail modeling as the right-tail futures risk surface, evaluated under
  the same gates as left-tail downside risk.
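
The claim-labeling gate can be made mechanical with a small record type. This is a sketch under assumed field names, not an existing repository structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimLabel:
    """Hypothetical label attached to every empirical claim; field names are
    illustrative assumptions."""
    forecast_origin: str    # e.g. which pre-open origin the forecast is issued at
    reference_price: str    # e.g. which settlement or close the return is marked against
    target_family: str      # e.g. which open-gap return family is being predicted
    tail_side: str          # "left" or "right"
    info_cutoff: str        # timestamp or rule defining the information cutoff

    def __post_init__(self):
        if self.tail_side not in {"left", "right"}:
            raise ValueError("tail_side must be 'left' or 'right'")
```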

Data and feature gates:

- Preserve source, symbol, observation timestamp, bar timestamps, availability
  timestamp, and research download timestamp where relevant.
- Build timestamp-safe U.S. close features only after validating time-zone conversion,
  daylight-saving handling, exchange holidays, early closes, missing sessions, and
  OSE night-session edge cases.
- Include feature blocks only when timestamp validity and sample-coverage gates pass.
- Preserve core lagged Japanese variables, market-structure flags, holiday flags, and
  absorption-timing fields where they are relevant to a forecast origin.
- Maintain a feature-leakage audit proving that every feature is available before the
  model cutoff and that the model cutoff precedes the target open (a minimal audit
  sketch follows this list).
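
A minimal sketch of the feature-leakage audit, assuming per-feature availability timestamps and per-row cutoff and target-open timestamps. The column and function names are illustrative assumptions, not the repository's schema.

```python
import pandas as pd


def audit_feature_leakage(panel: pd.DataFrame,
                          availability_cols: dict[str, str],
                          cutoff_col: str = "cutoff_at",
                          target_open_col: str = "target_open_at") -> pd.DataFrame:
    """Flag, row by row, whether each feature's availability timestamp is
    strictly before the model cutoff and whether the cutoff precedes the
    target open. All column names here are assumed for illustration."""
    report = pd.DataFrame(index=panel.index)
    report["cutoff_before_open"] = panel[cutoff_col] < panel[target_open_col]
    for feature, avail_col in availability_cols.items():
        report[f"{feature}_available_pre_cutoff"] = panel[avail_col] < panel[cutoff_col]
    return report
```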

Modeling gates:

- Baseline and benchmark metrics should be saved before fitting more flexible models.
- Econometric advanced benchmarks should emit explicit unavailable statuses when
  optimization, filtering, sample size, or ES validity gates fail.
- Score-driven and optimizer-heavy advanced benchmarks should remain appendix or
  diagnostic evidence unless sample gates and author review support stronger use.
- ML tail models must use chronological validation, month-level refits, a fixed
  hyperparameter policy, recorded feature hashes, recorded feature drops, and
  training-window diagnostics (see the refit-schedule sketch after this list).
- Direct quantile models remain VaR-only unless a valid ES companion is explicitly
  supplied by the model family.
- Location-scale and POT-GPD variants must use fully out-of-fold standardized losses
  before fitting empirical or EVT tail components.
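
The chronological, month-level refit requirement can be sketched as a simple schedule generator; the exact refit cadence and training-window policy in the repository may differ from the assumptions below.

```python
import pandas as pd


def month_refit_splits(index: pd.DatetimeIndex, min_train_months: int = 36):
    """Yield (train_index, test_index) pairs where each test block is one
    calendar month and training uses only strictly earlier observations,
    so no refit ever sees data at or after its test month."""
    months = index.to_period("M").unique().sort_values()
    for i in range(min_train_months, len(months)):
        train_mask = index.to_period("M") < months[i]
        test_mask = index.to_period("M") == months[i]
        yield index[train_mask], index[test_mask]
```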

EVT gates:

- EVT is a tail-calibration layer, not a standalone contribution.
- Fit POT-GPD on timestamp-safe standardized losses as the first reported hybrid
  specification; other EVT interfaces are robustness extensions.
- Never calibrate EVT tails on in-sample fitted standardized residuals.
- Threshold diagnostics should report exceedance counts, mean excess, GPD shape and
  scale, stability across nearby thresholds, selected-threshold flags, and sensitivity
  checks (a diagnostics sketch follows this list).
- Additional automated EVT threshold-selection procedures are not current-paper
  requirements. Revisit them only if EVT threshold selection becomes a primary
  contribution or if author review promotes extreme-tail extrapolation beyond the
  current sample gates.
- Enforce a minimum exceedance count before reporting an alpha level.
- Report empirical levels separately from extrapolated levels.
- Do not claim very-extreme-tail performance unless the sample size supports meaningful
  evaluation.
- Evaluate VaR and ES separately and jointly.
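
A minimal sketch of the threshold diagnostics, assuming the standardized losses are already fully out-of-fold. It uses scipy's generalized Pareto fit and reports exceedance count, mean excess, and GPD shape/scale per candidate threshold, emitting an explicit unavailable status when the exceedance-count gate fails; the function name and return layout are assumptions.

```python
import numpy as np
from scipy import stats


def pot_threshold_diagnostics(std_losses: np.ndarray,
                              thresholds: np.ndarray,
                              min_exceedances: int = 30) -> list[dict]:
    """For each candidate threshold, report the exceedance count, mean excess,
    and GPD shape/scale fitted to the excesses (location fixed at zero).
    Thresholds with too few exceedances are flagged rather than fitted."""
    rows = []
    for u in thresholds:
        excess = std_losses[std_losses > u] - u
        row = {"threshold": float(u), "n_exceed": int(excess.size),
               "mean_excess": float(excess.mean()) if excess.size else np.nan}
        if excess.size >= min_exceedances:
            shape, _, scale = stats.genpareto.fit(excess, floc=0.0)
            row.update({"gpd_shape": shape, "gpd_scale": scale, "status": "fitted"})
        else:
            row.update({"gpd_shape": np.nan, "gpd_scale": np.nan,
                        "status": "unavailable: too few exceedances"})
        rows.append(row)
    return rows
```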

Evaluation and manuscript gates:

- Report VaR coverage, exception diagnostics, quantile loss, joint VaR-ES loss for
  valid VaR-ES pairs, ES exceedance severity, tail-ranking diagnostics, and paired-loss
  inference where the sample supports it (loss-function sketches follow this list).
- Murphy diagnostics are diagnostic plots, not standalone significance tests.
- Conditional predictive ability diagnostics are conditional loss-difference evidence,
  not automatic model-selection claims.
- Use the headline ML tail ladder for the main information-set story. Treat restricted
  cross-family rows as diagnostic or restricted evidence unless common-sample and
  inference gates justify promotion.
- Additional forecast-distribution scoring extensions are not current evidence. Do not
  add them to the current paper; if a later review requires them, they first need stable
  definitions, artifact schemas, tests, and manuscript wording.
- DST attenuation is descriptive forecast evidence, not a structural causal mechanism.
- ES severity is conditional on VaR exceptions and must be reviewed before being
  converted into manuscript prose.
- VaR-trigger diagnostics are pre-open risk-monitoring diagnostics. They do not
  estimate hedge PnL, costs, turnover, or loss avoided.
- Do not make trading-alpha, live-deployment, price-discovery, structural-causality,
  or hedge-PnL claims without a separate registered design and evidence layer.
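
As a sketch of the losses named above: the quantile (pinball) loss for VaR and one member of the Fissler-Ziegel class of jointly consistent VaR-ES scoring functions (sometimes labeled FZ0 in the literature). The sign convention assumed here is returns with negative left-tail VaR and ES; this is an illustration, not the repository's evaluation code.

```python
import numpy as np


def quantile_loss(returns: np.ndarray, var: np.ndarray, alpha: float) -> np.ndarray:
    """Pinball loss for a left-tail VaR forecast at exceedance level alpha."""
    err = returns - var
    return np.where(err < 0, (alpha - 1.0) * err, alpha * err)


def fz0_loss(returns: np.ndarray, var: np.ndarray, es: np.ndarray,
             alpha: float) -> np.ndarray:
    """One Fissler-Ziegel joint VaR-ES scoring function (the 'FZ0' form);
    it is only valid for es < 0, i.e. left-tail return forecasts."""
    hit = (returns <= var).astype(float)
    return (-hit * (var - returns) / (alpha * es)
            + var / es + np.log(-es) - 1.0)
```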

Acceptance criteria:

- The documented verification workflow passes before a change is treated as complete.
- New outputs are either small tracked synthetic fixtures or ignored local artifacts.
- No vendor credentials, raw market data, caches, generated reports, or local build
  artifacts are committed.
- Documentation is updated when behavior, schemas, workflow entrypoints, or claim
  boundaries change.
- Claims are labeled honestly: schema checks, smoke checks, and real-data validation
  are not interchangeable.
- Any unavailable target or benchmark is explicitly marked as unavailable or deferred
  with a reason.

When reporting progress, separate the audit findings into:

- Blocking issues.
- Non-blocking risks.
- Missing tests.
- Documentation drift.
- Recommended next implementation step.

and label each work item as one of:

- Implemented and tested.
- Implemented but only smoke-tested.
- Requires real vendor data.
- Requires licensed intraday data.
- Still planned or deferred.

Use current repository evidence when giving file or line references. Do not propose
new model work until the target-data audit gate is satisfied.