Manuscript Audit Prompt

Use this prompt when the LaTeX manuscript is ready for a referee-style audit against the locked evidence package. The audit targets the Journal of Futures Markets and should read as an empirical futures-market review, not as a generic machine-learning checklist.

You are auditing an internal manuscript draft for the Journal of Futures
Markets.

Manuscript title:
"U.S. Close Information and Pre-Open Tail Risk in Nikkei 225 Futures"

Audit role:
- Act as a careful empirical futures-market referee and evidence-lock auditor.
- The paper is about session-aligned VaR/ES forecast evaluation for OSE Nikkei
  225 Futures opening-gap risk.
- Write for finance readers who know futures, risk management, and econometric
  backtesting, and for machine-learning readers who know predictive modeling but
  may not know exchange-session timing or futures clearing conventions.
- Prioritize evidence gaps, timing or leakage errors, overclaims, terminology
  drift, weak economic interpretation, figure/table routing, and build failures.
- Do not ask for new model searches, attribution plots, trading simulations,
  margin-system backtests, or extra data sources unless an existing claim
  requires them.
- Keep the tone exact, dry, and manuscript-facing. Do not write like an AI
  assistant.

Audit priorities:
1. Evidence-lock, timing/leakage, build, or unsupported-claim failures.
2. JFM fit: whether the paper reads as futures-market risk forecasting rather
   than a machine-learning leaderboard, equity spillover paper, or software
   evidence package.
3. Main-text evidence hierarchy: whether tables and figures support the
   claim-critical narrative without leaderboard sprawl.
4. Methods reproducibility: whether model specification, refit protocol, tail
   threshold, ES construction, and inference rules are adequately documented.
5. Wording, citation, formatting, and style issues.

For every blocking or major issue, give exact file and line or PDF page,
table, or figure; the issue; why it matters; the minimal corrective action; and
an optional one-sentence replacement. Do not rewrite the manuscript.

No-new-analysis rule:
- Do not recommend SHAP, feature-importance plots, precision-recall curves,
  classification metrics, new model searches, attribution plots, trading
  simulations, margin-system backtests, or extra data sources unless the
  manuscript currently makes a claim that cannot be supported without them.
- Prefer claim narrowing, table/figure rerouting, and locked-artifact
  clarification over new empirical work.

Read these files first, in this order:

1. Manuscript package:
   - ../n225-open-gap-tail-manuscript/main.tex
   - ../n225-open-gap-tail-manuscript/main_wiley.tex
   - ../n225-open-gap-tail-manuscript/evidence_map.yaml
   - ../n225-open-gap-tail-manuscript/sections/
   - ../n225-open-gap-tail-manuscript/tables/
   - ../n225-open-gap-tail-manuscript/figures/
   - ../n225-open-gap-tail-manuscript/provenance/
   - ../n225-open-gap-tail-manuscript/scripts/audit_evidence.py
2. Research-repo design and generated evidence:
   - docs/paper_plan.md
   - docs/results_snapshot.md
   - docs/data.md
   - docs/faq.md
3. Run manifests, leakage summaries, table manifests, figure manifests, and
   build logs.

If a path has moved, locate it with repository search. Do not infer results from
memory, file names, or earlier drafts.

Required tooling:
- From the manuscript root, run scripts/audit_evidence.py and report its exit
  code.
- If feasible, run make audit, make draft, and make wiley. If any command is not
  run, state why.
- Do not replace the repository's evidence audit with a manual checklist. Use
  the existing audit script, then add referee judgment on top of it.
- Check the current Wiley/JFM author instructions and record the URL and access
  date. Do not rely on remembered word-count, figure, or file-format rules.
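
The tooling steps above can be scripted so that every exit code is recorded and a command that could not start is reported with its reason. This is a minimal sketch, not part of the repository: the helper names `run_and_report` and `run_all` are hypothetical, and the command list assumes that `scripts/audit_evidence.py` and the `make` targets are invoked from the manuscript root with no extra arguments.

```python
import subprocess
import sys

# Commands named in the tooling list; run them from the manuscript root.
# Assumption: none of them needs extra arguments.
COMMANDS = [
    [sys.executable, "scripts/audit_evidence.py"],
    ["make", "audit"],
    ["make", "draft"],
    ["make", "wiley"],
]


def run_and_report(cmd, cwd=None):
    """Run one command and return (command, exit code or None, note)."""
    label = " ".join(cmd)
    try:
        proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    except OSError as exc:
        # The executable or working directory could not be used: record why.
        return label, None, f"not run: {exc}"
    return label, proc.returncode, "ran"


def run_all(cwd):
    """Run every command in order and collect one status row per command."""
    return [run_and_report(cmd, cwd=cwd) for cmd in COMMANDS]
```

Calling `run_all("../n225-open-gap-tail-manuscript")` would return one row per command. A row with exit code `None` is a command that was not run, and its note supplies the reason the report must state.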

Evidence lock:
- Primary run ID:
  tailrisk_20160719_20260522_20260527T083659Z_commit_7f628ff4
- Primary commit:
  7f628ff4f66258a36314f492b652cdf7ef594b7e
- Source worktree status at run time:
  git_dirty=false
- Config hash:
  874b7125bfae77a6fb261d40af0f987d89e5bae1e5ebc54a3958be66f9c17b4c
- Cache key:
  89b2ec75b920b60b607e035aa1e96c8a8b18f9915375ac94c1579abe2b6ce970
- Panel signature:
  8094755ffc96b01af6fb904876e0abdd3920370fa1b07e44c2c95681cd3e5431
- Claim level:
  research_candidate
- The copied manuscript provenance must match evidence_map.yaml,
  provenance/locked_manifest.json, provenance/leakage_summary.json,
  provenance/table_manifest.json, provenance/figure_manifest.json, and
  provenance/results_snapshot.md.
- Appendix sensitivity evidence must use the same run and must remain
  diagnostic. It cannot promote a new headline model, change the information
  ladder, or alter the paired Diebold-Mariano heatmap.
- Run-consistency rule: any reference to an older run ID, older commit, old
  date label, stale sample date, or "May 12" evidence package is a blocking
  issue unless it appears only in a historical migration note. The May 27
  locked run above is the only allowed empirical source for manuscript claims.
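
The run-consistency rule can be pre-screened with a text scan before manual review. The sketch below assumes that every run ID follows the same shape as the locked ID above; that pattern is inferred from the single locked ID and is not confirmed by the repository. The function name `stale_run_references` is hypothetical. Hits are candidates only: a hit inside a historical migration note is allowed.

```python
import re

LOCKED_RUN_ID = "tailrisk_20160719_20260522_20260527T083659Z_commit_7f628ff4"
LOCKED_COMMIT = "7f628ff4f66258a36314f492b652cdf7ef594b7e"

# Assumed pattern, inferred only from the shape of the locked run ID.
RUN_ID_RE = re.compile(r"tailrisk_\d{8}_\d{8}_\d{8}T\d{6}Z_commit_[0-9a-f]{8}")


def lock_is_self_consistent():
    """Check that the short hash in the run ID prefixes the primary commit."""
    short_hash = LOCKED_RUN_ID.rsplit("_", 1)[1]
    return LOCKED_COMMIT.startswith(short_hash)


def stale_run_references(text):
    """Return (line number, token) pairs that point away from the locked run."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for run_id in RUN_ID_RE.findall(line):
            if run_id != LOCKED_RUN_ID:
                hits.append((lineno, run_id))
        if "May 12" in line:
            hits.append((lineno, "May 12"))
    return hits
```

Run the scan over each `.tex` file and each copied provenance file, then classify every hit as blocking or as a historical migration note.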

Current empirical design to protect:
- Target: the OSE Nikkei 225 Futures settlement-to-open opening gap, evaluated
  as left-tail and right-tail positive losses.
- Forecast origin: after the matched U.S. cash-market close and before the OSE
  day-session open.
- Hard timing invariant:
  feature_available_ts_utc <= model_cutoff_ts_utc < target_open_ts_utc
- Forecast sample: 2018-06-20 to 2026-05-22, 1,722 clean forecast dates.
- Tail level: 95% VaR and ES only.
- Main question: whether U.S.-close and proxy information changes loss and
  coverage for point-in-time opening-gap VaR/ES forecasts after own-market
  Japanese information is held fixed, and whether usable risk forecasts require
  filtered-tail calibration and exception discipline.
- Interpretation boundary: forecast evaluation for futures risk monitoring,
  margin adequacy discussion, and overnight exposure budgeting. The manuscript
  is not a trading, price-discovery, structural transmission, or production
  risk-engine paper.
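
The hard timing invariant above translates directly into a row-level check. The sketch below uses the three column names from the invariant; the row layout (one dict per feature observation) and the example timestamps are illustrative assumptions, not rows from the locked panel.

```python
from datetime import datetime, timezone


def utc(year, month, day, hour, minute):
    """Build a timezone-aware UTC timestamp."""
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc)


def timing_violations(rows):
    """Return indices of rows that break feature <= cutoff < target open."""
    bad = []
    for i, row in enumerate(rows):
        feature = row["feature_available_ts_utc"]
        cutoff = row["model_cutoff_ts_utc"]
        target = row["target_open_ts_utc"]
        if not (feature <= cutoff < target):
            bad.append(i)
    return bad


# Illustrative rows only.
EXAMPLE_ROWS = [
    # Clean: feature available at the cutoff, open strictly later.
    {"feature_available_ts_utc": utc(2024, 7, 1, 20, 0),
     "model_cutoff_ts_utc": utc(2024, 7, 1, 20, 0),
     "target_open_ts_utc": utc(2024, 7, 1, 23, 45)},
    # Leakage: feature stamped after the cutoff.
    {"feature_available_ts_utc": utc(2024, 7, 1, 21, 30),
     "model_cutoff_ts_utc": utc(2024, 7, 1, 20, 0),
     "target_open_ts_utc": utc(2024, 7, 1, 23, 45)},
    # Degenerate: cutoff equal to the target open violates the strict bound.
    {"feature_available_ts_utc": utc(2024, 7, 1, 20, 0),
     "model_cutoff_ts_utc": utc(2024, 7, 1, 23, 45),
     "target_open_ts_utc": utc(2024, 7, 1, 23, 45)},
]
```

Here `timing_violations(EXAMPLE_ROWS)` flags the second and third rows. The inequality is weak on the feature side and strict on the target side, so a feature stamped exactly at the cutoff passes while a cutoff equal to the open does not.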

Canonical vocabulary:
- Use "settlement-to-open opening gap" for the target on first use. After the
  first definition, "opening gap" is the allowed short form.
- Use "U.S.-close forecast origin" for the information cutoff.
- Use "OSE day-session open" for the target opening mark.
- Use "Nikkei 225 Futures" for the contract family.
- Use "left-tail loss" and "right-tail loss" under the positive-loss
  convention.
- Use "information set" or "predictor block"; avoid switching among "feature
  group", "signal bucket", and "factor set" for the same object.
- Use "Fissler-Ziegel joint VaR-ES loss" on first use. "Fissler-Ziegel loss" is
  acceptable after that. Do not introduce duplicate objective labels for the
  same loss.
- Use "LightGBM+EVT" only for filtered-tail families that actually combine the
  learner with empirical or POT-GPD tail calibration.
- Use "post-24-check comparison set" for the restricted comparison among
  models that pass the breach-rate, Kupiec, and Christoffersen checks across
  both tails and the four information sets.
- If "post-24-check" appears before definition, flag it. In main text, prefer
  "coverage-admissible comparison set" unless "post-24-check" has already been
  defined for the reader. The internal term is more natural in appendix,
  provenance, and locked-run audit material.

Terms that must stay distinct:
- "Primary evidence" means evidence eligible for the main claim after the
  locked sample, timing, and coverage checks.
- "Restricted evidence" means matched-date comparisons or gated diagnostics; it
  does not create a universal model ranking.
- "Diagnostic evidence" supports interpretation only.
- "Promoted row" means a side-specific gated candidate row in the locked run.
  It is not a global winner.
- "Coverage-admissible model family" means a model family that passes the
  24-check coverage screen. The informal shorthand "pass-all" may identify the
  code path, but manuscript prose should use the formal term.

JFM fit checks:
- The paper must read as futures-market risk forecasting, not as a Japanese
  equity-return spillover paper and not as a machine-learning leaderboard.
- The object must be the OSE-cleared Nikkei 225 Futures opening-gap risk
  problem, with settlement, day session, night session, multiplier, and clearing
  relevance explained only as far as needed.
- The contribution must be the session-aligned information design, the nested
  U.S.-close information ladder, and the coverage-loss discipline for VaR/ES
  forecasts.
- LightGBM is a flexible conditional estimator used inside the forecast design.
  It is not the methodological contribution by itself.
- The manuscript should connect results to risk monitoring, margin adequacy,
  and overnight exposure scale without claiming a margin model, trading rule, or
  implementation system.
- The introduction should quantify the economic magnitude of the opening-gap
  risk early, using quantities such as tail gap size, index points, contract
  notional, or comparison with ordinary return variation. Do not assume the
  risk object is self-evidently important.

JFM economics gate:
- Decide whether the manuscript explains why the object is a futures
  opening-risk problem rather than a generic equity forecast problem.
- Check whether contract-scale exposure or opening-gap magnitude is quantified
  early enough for JFM readers.
- Check whether exchange-session timing is used to define the forecast origin
  without drifting into a price-discovery or structural transmission claim.
- Check whether clearing, margin adequacy, risk monitoring, and exposure
  budgeting are discussed qualitatively without claiming a margin model,
  trading strategy, or implementation system.

Data and timing checks:
- The target definition must be stable across abstract, introduction, data,
  methods, results, captions, and appendix.
- Same-night OSE path variables must not be treated as ordinary predictors for
  the settlement-to-open target.
- U.S. close, OSE night close, and OSE day-session open timing must be explained
  for both EDT and EST where the distinction matters.
- FRED variables must be described as lag-controlled current historical values,
  not ALFRED vintage-clean macro data.
- U.S.-listed options and any source with incomplete full-history entitlement
  must stay outside primary claims.
- Cash-index spot data must not be described as the target source.
- Forecast rows with unequal valid samples must be labeled clearly. Any paired
  loss comparison must state the matched-sample N.

Model checks:
- The benchmark floor should include historical or rolling quantiles, EWMA,
  GARCH, GJR-GARCH, Student-t variants where implemented, and GJR-GARCH-EVT as
  the compact futures-risk benchmark.
- Advanced own-history benchmarks may appear in appendix material as CAViaR,
  CARE or expectile, and GAS-t rows where generated. Do not present them as
  separate headline model innovations.
- Direct LightGBM quantile rows are the clean information-ladder experiment.
  They may show lower loss but weak exception discipline; do not sell them as
  final risk forecasts when they fail coverage checks.
- Filtered-tail rows separate conditional body estimation from empirical or
  POT-GPD tail calibration.
- The post-24-check comparison set is:
  GJR-GARCH-EVT;
  LGBM POT-GPD plain MLE with information set C;
  LGBM POT-GPD UniBM with information set C.
- Do not confuse the post-24-check comparison set with the side-specific
  promoted rows in the compact ML table.
- Verify every stated LightGBM hyperparameter and downstream EVT threshold
  against the locked research_config artifact recorded in the run manifest. Do
  not trust prose from an earlier draft.

Evaluation checks:
- The primary validation language must be coverage-first: breach rate,
  exception count, Kupiec unconditional coverage, Christoffersen conditional
  coverage, quantile loss, and Fissler-Ziegel joint VaR-ES loss.
- Fissler-Ziegel loss is an evaluation score. Do not relabel it as a separate
  benchmark family.
- Lower quantile loss or lower Fissler-Ziegel loss alone is not enough for a
  risk-forecasting claim if exception behavior is poor.
- The 24-check robustness story should be stated as coverage reliability across
  left and right tails and across the four nested information sets. It is a
  screening discipline, not a theorem of model optimality.
- Paired Diebold-Mariano evidence must use common forecast dates. Heatmap cells
  must state or inherit the same common-sample N within a tail panel.
- Every N, breach rate, exception count, loss value, p-value, and claim about
  significance must trace to a locked table, figure, manifest, or results
  snapshot. Check text/table/caption consistency, including the promoted-row
  p-values reported in the Evidence section.
- Murphy diagrams, ES severity tables, VaR/ES overlays, stress-window overlays,
  and sensitivity tables are supporting diagnostics unless the main text makes
  them claim-critical.
- Do not introduce unrestricted all-model rankings from appendix scans.
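
When a reported Kupiec p-value looks inconsistent with its breach rate and exception count, the standard proportion-of-failures likelihood ratio can be recomputed by hand. The sketch below is the textbook statistic with a chi-square reference distribution on one degree of freedom. It is not taken from the repository, the function name `kupiec_pof` is hypothetical, and the locked tables remain the only source for reported values.

```python
import math


def _xlogy(x, y):
    """Return x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_pof(n_obs, n_exceptions, tail_prob=0.05):
    """Kupiec unconditional coverage test.

    Returns (breach rate, LR statistic, p-value). tail_prob is the nominal
    exception probability, 0.05 for a 95% VaR.
    """
    if n_obs <= 0 or not 0 <= n_exceptions <= n_obs:
        raise ValueError("need n_obs > 0 and 0 <= n_exceptions <= n_obs")
    x, n, p = n_exceptions, n_obs, tail_prob
    rate = x / n
    # Log-likelihood under the nominal rate and under the observed rate.
    log_l0 = _xlogy(x, p) + _xlogy(n - x, 1.0 - p)
    log_l1 = _xlogy(x, rate) + _xlogy(n - x, 1.0 - rate)
    lr = max(0.0, -2.0 * (log_l0 - log_l1))
    # Survival function of chi-square(1): P(X > lr) = erfc(sqrt(lr / 2)).
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return rate, lr, p_value
```

With 1,722 forecast dates, a 95% VaR implies about 86 expected exceptions. A count near that value gives a likelihood ratio close to zero, and a count far above it gives a small p-value. Any exception counts passed to the function for checking must come from the locked tables.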

Current result interpretation to protect:
- The raw settlement-to-open distribution is heavy-tailed on both sides and
  motivates VaR/ES and EVT-style tail calibration; it does not validate any
  forecast model.
- The benchmark floor is not a straw man. It provides calibrated own-history and
  econometric risk references.
- Direct LightGBM quantile forecasts show that U.S.-close and proxy information
  changes loss scores and exception behavior.
- The main lesson is the tension between average loss improvement and VaR
  exception discipline.
- In the locked run, the compact ML table reports side-specific promoted rows:
  one for downside opening-gap risk and one for upside opening-gap risk. These
  rows are gated candidates, not a universal model ranking.
- The post-24-check Diebold-Mariano heatmaps compare GJR-GARCH-EVT with the two
  LightGBM+EVT information-set-C families on strict common dates. The
  family-level claim is more defensible than declaring one LightGBM tail
  estimator superior.
- Sensitivity evidence is appendix evidence. It asks whether the selected
  comparison set is fragile to nearby LightGBM capacity or POT-threshold
  changes; it does not feed model selection.

Manuscript structure checks:
- Abstract: states the futures contract, target, forecast origin, VaR/ES level,
  and claim boundary without overclaiming.
- Introduction: motivates the opening-gap risk object, positions the paper
  within futures and derivatives risk forecasting, states the research gap, and
  gives the contributions without method hype.
- Market and data sections: make the exchange-session timing and target
  construction reproducible.
- Methods: describe benchmarks, LightGBM forecasts, filtered-tail calibration,
  and evaluation metrics with enough detail for finance and ML readers.
- Results: lead with target-tail motivation, benchmark floor, information-set
  evidence, coverage-loss tension, and gated filtered-tail results.
- Discussion: interprets economic scale, limitations, and claim boundaries.
- Appendix: contains full scans, diagnostics, sensitivity, evidence lock, and
  build/provenance material without changing the main claim.

Table and figure routing:
- Main tables and figures should be claim-critical. Avoid leaderboard sprawl.
- The design table should anchor market timing, sample, data sources, and
  forecast cutoff.
- Benchmark and ML tables should keep the information-ladder experiment
  separate from side-specific promoted rows.
- The target-tail figure is descriptive motivation, not validation.
- Coverage figures are central because they make exception discipline visible.
- The cumulative Fissler-Ziegel gain figure should make candidate, anchor, sign
  convention, tail side, and information set explicit.
- The 3-by-3 Diebold-Mariano heatmaps, Murphy diagrams, ES severity tables, and
  overlays belong in the appendix unless the main text relies on them directly.
- Table notes and figure captions must be self-contained. A reader should be
  able to identify the target, model, information set, sample, metric, and main
  comparison without rereading the main text.
- Every table and figure in the manuscript must be present in evidence_map.yaml
  or clearly identified as a manual design table.
- The table and figure route must agree with claim_scope fields in
  evidence_map.yaml, provenance/figure_manifest.json, and
  provenance/table_manifest.json.

Rendered-PDF readability gate:
- Inspect compiled PDF pages, not only LaTeX source.
- Flag any table whose columns are clipped or unreadably small, or that
  overflows the page or hides content through excessive density.
- Flag any figure whose labels are clipped, illegible, or too dense for main
  text, or whose caption fails to identify the target, sample, information set,
  model, metric, and claim scope.
- For each problem item, recommend one route: keep in main text, move to
  appendix, convert to a compact table, or move to online supplement.

Citation and bibliography checks:
- Check references.bib bidirectionally: every in-text citation appears in the
  bibliography, and every bibliography entry is cited.
- Assess whether the cited literature fits JFM: futures markets, derivatives
  risk management, VaR/ES backtesting, EVT, volatility forecasting, and
  U.S.-Japan or opening-market timing should carry the finance motivation.
- Pure machine-learning citations should support the estimator, not dominate the
  paper's framing.
- Check for recent and directly relevant futures or derivatives citations where
  the text makes journal-fit, market-design, or risk-management claims.
- Verify citation style, author-year spelling, duplicated entries, missing DOI
  fields where available, and stale working-paper citations that now have
  published versions.
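
The bidirectional check in the first item can be approximated with two regular expressions. This is a rough sketch, not a BibTeX parser: it assumes natbib or biblatex style commands whose names contain `cite`, it does not skip commented-out lines, and the function names are hypothetical. Treat its output as a list of candidates to confirm against the build log.

```python
import re

# Matches \cite, \citep, \citet, \textcite, \parencite, \nocite and similar,
# with optional bracketed arguments before the key list.
CITE_RE = re.compile(r"\\[a-zA-Z]*cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the entry type and key at the start of a BibTeX entry.
BIB_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
NON_ENTRY = {"comment", "string", "preamble"}


def cited_keys(tex):
    """Collect citation keys used in LaTeX source text."""
    keys = set()
    for group in CITE_RE.findall(tex):
        for key in group.split(","):
            key = key.strip()
            if key and key != "*":  # ignore the \nocite{*} wildcard
                keys.add(key)
    return keys


def bib_keys(bib):
    """Collect entry keys defined in BibTeX source text."""
    return {key for kind, key in BIB_RE.findall(bib)
            if kind.lower() not in NON_ENTRY}


def citation_gaps(tex, bib):
    """Return (cited but missing from bib, listed in bib but never cited)."""
    cited, listed = cited_keys(tex), bib_keys(bib)
    return sorted(cited - listed), sorted(listed - cited)
```

Concatenate `main.tex` with every file under `sections/` before calling `citation_gaps`, so that a key cited in only one section is not reported as uncited.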

Wiley and build checks:
- main.tex is the editing draft entry point.
- main_wiley.tex is a Wiley PDF Design validation wrapper using the current
  local template snapshot; template-package failures must be separated from
  manuscript-substance failures.
- Check current JFM/Wiley submission requirements before final readiness:
  abstract length, manuscript length or word-count guidance if stated, title
  page, anonymized files if required, double spacing or free-format status,
  figure resolution and accepted formats, supporting-information rules,
  keywords, JEL codes, data availability, funding, conflicts, and ORCID or
  corresponding-author metadata.
- Check for unresolved citations, undefined references, missing figures, missing
  tables, broken inputs, oversized tables that hide content, and stale generated
  artifacts.
- Verify that a compiled PDF exists when reporting submission readiness.
- Check submission-adjacent items without overfitting the draft: title page,
  anonymized main file if required, data availability statement, funding,
  conflicts, JEL codes, keywords, author metadata, and supporting-information
  boundaries.

Readiness levels:
- Internal circulation requires the evidence audit to pass, compiled PDFs to
  exist, no unsupported claims, coherent JFM narrative, no stale evidence-lock
  references, and no rendered-PDF readability failure that blocks review.
- External JFM submission also requires clean committed reproduction, data
  availability, funding, conflict-of-interest, title-page or anonymity handling,
  author metadata, and current Wiley/JFM compliance.

Allowed claims:
- point-in-time forecast evaluation for OSE Nikkei 225 Futures opening-gap
  VaR/ES;
- U.S.-close and proxy information changes loss and coverage patterns;
- direct LightGBM quantile rows reveal an information signal and a calibration
  problem;
- filtered-tail calibration can produce side-specific gated candidates under
  the locked evaluation;
- the post-24-check comparison set supports a restricted family-level comparison
  between GJR-GARCH-EVT and two LightGBM+EVT information-set-C specifications;
- sensitivity rows support appendix robustness discussion only.

Forbidden claims:
- structural causality;
- price discovery;
- trading alpha;
- hedge PnL;
- profitable strategy;
- production readiness;
- real-time vintage safety for FRED;
- universal best model;
- dominance across all model families, samples, or tail levels;
- VaR/ES performance at tail levels beyond the locked 95% design unless a
  locked artifact explicitly supports that claim.

Language and style checks:
- Prefer short, declarative sentences.
- Define technical terms on first use when they matter to the argument, then
  reuse the same term.
- Avoid synonym drift for the target, forecast origin, information sets, model
  families, and loss functions.
- Use consistent mathematical notation for the target, loss-side convention,
  VaR, ES, information sets, and model indices. Do not alternate symbols or hat
  conventions without explanation.
- Use present tense for stable definitions, paper structure, and interpretation;
  use past tense for sample construction, realized results, and completed
  empirical procedures; reserve future tense for genuine future work.
- Avoid inflated transitions and generic praise.
- Avoid "important to note", "notably", "overall", "in summary", "moreover",
  "furthermore", "delve", "showcase", "shed light on", "pivotal", "crucial",
  "synergy", "unveil", "paradigm", and similar filler.
- Use "robust" only for an explicit statistical or design robustness claim.
- Do not hide caveats in long parentheticals. State the limitation plainly.
- Do not use bullets in the manuscript where a paragraph would read better.

Automated terminology search:
- Search the manuscript for non-canonical variants before reporting the
  terminology audit.
- Replace "feature group", "signal bucket", "factor set", or similar variants
  with "information set" or "predictor block" unless a different meaning is
  intended.
- Replace "pre-open gap", "day-session pre-open gap", or ambiguous "gap risk"
  with "settlement-to-open opening gap" on first use, then "opening gap".
- Replace "FZ score", "FZ loss", or "joint score" with "Fissler-Ziegel joint
  VaR-ES loss" on first use; short forms are allowed only after definition.
- Check for inconsistent uses of "promoted", "restricted", "diagnostic",
  "primary", "headline", "coverage-admissible", and "post-24-check".

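The terminology search above can be run as a line-level scan before the terminology audit is written. The sketch below covers only the variants named in this prompt, and the mapping and function name are hypothetical. Each hit still needs judgment, because a short form is allowed after its first definition and a variant may carry a different intended meaning.

```python
import re

# Non-canonical pattern -> canonical replacement, taken from the list above.
NON_CANONICAL = {
    r"feature groups?": "information set or predictor block",
    r"signal buckets?": "information set or predictor block",
    r"factor sets?": "information set or predictor block",
    r"(?:day-session )?pre-open gap": "settlement-to-open opening gap",
    r"FZ (?:score|loss)": "Fissler-Ziegel joint VaR-ES loss",
    r"joint score": "Fissler-Ziegel joint VaR-ES loss",
}

_COMPILED = [(re.compile(pat, re.IGNORECASE), canon)
             for pat, canon in NON_CANONICAL.items()]


def terminology_hits(text):
    """Return (line number, matched text, canonical term) for each variant."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, canonical in _COMPILED:
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(0), canonical))
    return hits
```

Run it over each file under `sections/` and report the hits in the terminology audit with the file name added to each row.
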
Report format:
1. Overall verdict: ready for internal circulation, revise before circulation,
   or do not circulate. Give one sentence of justification.
2. Blocking issues: numbered. Each item gives exact file/line or PDF
   page/table/figure, issue, why it matters, evidence, minimal fix, and optional
   one-sentence replacement.
3. Major issues: numbered. Focus on claim strength, evidence linkage, methods,
   timing, interpretation, tables, and figures. Each item uses the same
   exact-location and minimal-fix format as blocking issues.
4. Minor issues: one line each. Include wording, notation, citations, and
   formatting.
5. Section-by-section notes: Abstract, Introduction, Market/Target, Data/Timing,
   Methods, Results, Discussion, Conclusion, Appendix.
6. Table and figure routing: keep in main text, move to appendix, remove, or
   fix. Include rendered-PDF readability problems. One line per item.
7. Evidence and build status: audit_evidence.py exit code, make audit status,
   make draft status, make wiley status, unresolved citation/reference status,
   evidence lock status, forbidden-claim scan, and readiness level.
8. Numbers audit: list any text/table/caption mismatch in N, breach rate,
   exception count, loss, p-value, sample date, or claim scope.
9. Terminology audit: list inconsistent terms and the canonical replacement.
10. Final action list: ordered, directly implementable, under 200 words.

Do not praise the manuscript unless the praise explains why a suspected issue is
not an issue. Do not speculate beyond the locked evidence.