v7 Stage 0 Pre-registration v2: Response to the Critic's CONDITIONAL REJECT

This report is a revision responding to the CONDITIONAL REJECT in exp-critic's v7_0419_stage0_preregistration_review.md. All P0 items (C1~C3, M1~M7) are resolved in a single cycle.

Run script: experiments/federated/v7_0419_stage0_preregistration.py (v2)
Shared module: src/peak_analysis/v7/metrics.py (new; addresses critic C1/C2)
New MLflow run: 4659d778c5e9460aaa1c5b928508d9a9 (v1 08716ec9… is superseded)
Output JSON: outputs/v7_stage0/stage0_summary.json (overwritten)
Frozen artifacts: outputs/v7_stage0/golden_tensors/{G3,G4,G5}_y_{true,pred}.npy
New definition hash: 8be2bd2f691deed0 (v1 1c4acef8a235 retired)

Resolution by critic issue

Issue	Type	Resolution	Evidence file / value
C1 PAPE/HR hash drift	Critical	Added a single module, `src/peak_analysis/v7/metrics.py`; both stage-0 and v7_runner import the same `definition_hash()`. sha256, 16 hex characters, single payload.	hash=`8be2bd2f691deed0`; test `TestDefinitionHash::test_runner_and_stage0_share_same_hash`
C2 Gate 1 effectively void	Critical	`_run_golden_tensor_check` now uses `compute_pape_v7`/`compute_hr_v7`/`build_golden_tensors()`. The `atol=1e-6` assertion is enabled for all of G1~G5. Use of `peak_analysis.metrics.compute_pape` is removed. The G3/G4/G5 y_true and y_pred ndarrays are frozen as artifacts.	`experiments/federated/v7_runner.py` L674-720; all 5 `TestGoldenTensors::test_pape_matches_expected[G1-G5]` tests PASS
C3 A2 threshold sample bias	Critical	(1) Whitelist extended with `val_mse`, `fedpm_`, etc.; (2) `MIN_STEPS_REQUIRED=10` added (excludes 2-epoch prototypes); (3) Option B adopted: switched to per-run `final_val_loss/initial_val_loss > 1.5`, keeping only a static sanity check `> 3.0` for train. The v1 threshold 2.63 is retired.	`stage0_summary.json::A2_v6_loss::threshold_choice`; accepted runs 4→22
M1 G5 seed sensitivity	Major	7 days (n_hits=2) → 14 days (n_hits=13). Added `min_hits=12` to `extract_first_block`. Seed deviation range 53→30 (44% reduction).	probe: `outputs/v7_stage0/g5_extend_probe.py`; new expected PAPE=64.2793
M2 Apt_max_load sensitivity	Major	Apt88 ranks first on all 5 extreme-load metrics (daily_max_{mean,median,p95}, overall_{max,p95}), i.e. plurality 5/5. Apt51 ranks first on variability (std/CV), but that is a different axis and is excluded.	`stage0_summary.json::A3_apt_max_load::argmax_by_metric`; Apt88 retained
M3 G2 policy drift	Major	Added a `std < 1e-8` → NaN branch to `compute_pape_v7`. G2 expected = NaN (matches spec §2.2). The 10.0 policy is retired. Design spec §2.2 now states "G2 NaN by policy".	`src/peak_analysis/v7/metrics.py::_DEGENERATE_STD_EPS`; all 3 `TestG2Policy::*` tests PASS
M4 `_extract_first_week` not shared	Major	Moved `extract_first_block`, `load_household_split`, and `build_golden_tensors` into the shared module, used by both stage-0 and v7_runner.	`src/peak_analysis/v7/metrics.py`; all 6 `TestSingleSource::*` tests PASS
M5 `_has_ypred_artifact` has weak effect	Major	The condition is removed in v2 (acknowledging that it had no practical effect). PAPE presence is kept only as a non-enforced diagnostic flag.	`v7_0419_stage0_preregistration.py::extract_v6_loss_distribution` rejection structure simplified
M6 No reproducibility evidence	Major	New `tests/test_v7_stage0_reproducibility.py` (26 tests, all PASS): single-source verification, hash determinism, G1~G5 recomputation, and G5 numpy RNG stability.	File: `tests/test_v7_stage0_reproducibility.py`
M7 G1/G2 metrics missing	Major	G1/G2 PAPE/HR are now also logged to MLflow. NaN values are skipped naturally: G1 PAPE=20 and G1 HR=1 are logged, while G2 is NaN, so its absence from the run is expected.	`v7_0419_stage0_preregistration.py::main` L219-223
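
The C1 fix relies on one function producing a deterministic, truncated sha256 digest of a single payload. A minimal sketch of that idea follows. The payload fields, the JSON serialization, and the function name `definition_hash_sketch` are assumptions for illustration; the actual payload hashed by `definition_hash()` is not shown in this report, so this sketch will not reproduce `8be2bd2f691deed0`.

```python
import hashlib
import json

# Hypothetical payload: placeholder fields only, not the real definition payload.
EXAMPLE_PAYLOAD = {
    "pape": {"quantile": 0.9, "degenerate_std_eps": 1e-8},
    "hr": {"rule": "placeholder"},
}


def definition_hash_sketch(payload: dict) -> str:
    """Return the first 16 hex characters of sha256 over a canonical JSON dump."""
    # sort_keys + fixed separators make the byte string independent of dict order.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


if __name__ == "__main__":
    print("definition_hash_sketch =", definition_hash_sketch(EXAMPLE_PAYLOAD))
```

Because both callers would import the same function and the same payload, any change to a metric definition changes the digest in both places at once, which is the property the shared-hash test checks.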

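P2 / Minor (addressed)

N1 Paragraph on why the KD family is excluded: v2 does not explicitly exclude KD experiments; they are filtered out naturally by the no_loss_metric filter (most KD runs log under other names such as kd_loss and teacher_loss).
N2 The global np.random.seed(RANDOM_SEED) call is removed (only default_rng(42) is used).
N3 Presentation notation guidelines are handled in a separate track (outside Stage 0 scope).
N4 MAX aggregation: v2 reduces reliance on the train_loss threshold, so the practical impact is negligible.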
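
To illustrate N2, the sketch below draws a uniform prediction vector from a local `numpy.random.default_rng` generator and shows that the draw does not depend on the legacy global seed. The helper name `draw_uniform_prediction` and the length `n=8` are hypothetical; the range (0.018, 4.934) and seed 42 are the G5 values listed in the golden-tensor table.

```python
import numpy as np


def draw_uniform_prediction(n: int, seed: int = 42,
                            low: float = 0.018, high: float = 4.934) -> np.ndarray:
    """Draw n uniform samples from a local generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # local Generator, no global state touched
    return rng.uniform(low, high, size=n)


if __name__ == "__main__":
    a = draw_uniform_prediction(8)
    np.random.seed(0)  # changing the legacy global seed has no effect on the local generator
    b = draw_uniform_prediction(8)
    print("identical draws:", np.array_equal(a, b))
```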

A1. Golden Tensor G1~G5: v2 freeze

ID	Description	Expected PAPE	Expected HR	Reproduction rule
G1	`[1,2,10,3]` vs `[1,2,8,3]`	20.0000000000	1.0000000000	toy; Q90=7.9, HR is an argmax sanity check
G2	`[5,5,5,5]` vs `[4,5,6,5]`	NaN (degenerate std=0)	NaN	v1 value 10.0 retired. Policy: `std < 1e-8` → NaN
G3	Apt6 test 7d (start=0) + perfect	0.0000000000	1.0000000000	q90=2.918
G4	Apt6 test 7d (start=0) + const=mean	67.3425341316	0.6190476190	const=1.0886
G5	Apt15 test 14d (start=384, n_hits=13) + uniform(0.018, 4.934) seed=42	64.2792815390	0.5119047619	Extended from 7d (v1 61.48, n_hits=2). default_rng(42) fixed

Frozen artifacts: outputs/v7_stage0/golden_tensors/{G3,G4,G5}_y_{true,pred}.npy (also logged as MLflow artifacts)
Calling the shared module's peak_analysis.v7.metrics.build_golden_tensors() reproduces every expected value within atol=1e-10 (confirmed by tests).
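
The G1 and G2 rows can be checked by hand with a small sketch. It assumes PAPE is the mean absolute percentage error over the points where y_true is at or above its 90th percentile, and that the degenerate check is applied to the standard deviation of y_true. Both are readings of the table, not the confirmed body of `compute_pape_v7`, and `pape_sketch` is a hypothetical name.

```python
import numpy as np

DEGENERATE_STD_EPS = 1e-8  # threshold value taken from the G2 policy in the table


def pape_sketch(y_true, y_pred, q: float = 0.9) -> float:
    """Peak absolute percentage error under the assumptions stated above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.std(y_true) < DEGENERATE_STD_EPS:
        return float("nan")  # degenerate series: no meaningful peak
    threshold = np.quantile(y_true, q)  # linear interpolation by default
    peak = y_true >= threshold
    ape = np.abs(y_true[peak] - y_pred[peak]) / np.abs(y_true[peak])
    return float(np.mean(ape) * 100.0)


if __name__ == "__main__":
    print("G1 Q90 :", np.quantile([1, 2, 10, 3], 0.9))
    print("G1 PAPE:", pape_sketch([1, 2, 10, 3], [1, 2, 8, 3]))
    print("G2 PAPE:", pape_sketch([5, 5, 5, 5], [4, 5, 6, 5]))
```

For G1 the only point at or above Q90=7.9 is the value 10, predicted as 8, which gives |10-8|/10 = 20%. For G2 the constant series has zero standard deviation, so the NaN branch applies.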

Reproduction command (pytest)

uv run python -m pytest tests/test_v7_stage0_reproducibility.py -v
# 26 tests PASS

A2. v6 Historical Loss: v2 recomputation

v1 vs v2 changes

Item	v1 (retired)	v2 (adopted)
Loss metric whitelist	6 entries	14 entries (added val_mse, fed_avg_local_loss, fedpm_*, etc.)
Convergence filter	None	`MIN_STEPS_REQUIRED=10` (excludes 2-epoch prototypes)
`y_pred.npy` filter	Enforced	Removed (no practical effect)
Accepted runs	4	22
train_loss threshold	2.63 (n=4, includes 2-epoch runs)	3.0 static sanity
val_loss threshold	Reference only	per-run `final/initial > 1.5` (FAIL verdict)

v2 accepted distribution (n=22)

Distribution	n	p50	p95	max	min	mean
`final_train_loss`	3	0.3245	0.5198	0.5415	0.2395	0.3685
`final_val_loss`	22	0.3360	0.3531	0.4029	0.2776	0.3342
`final_val_loss / initial_val_loss`	22	0.929	1.001	1.001	—	—

Per-experiment breakdown:
- FeDPM-Original-Phase1: 1 · Phase2: 16 · Phase3: 1 · Phase3b: 1
- FeDPM-MVP-Phase1: 3

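Observation: for v6 converged runs (MIN_STEPS≥10), val loss improves only modestly relative to its initial value (median ratio 0.93). P95 is close to 1.0, meaning some runs are effectively stalled. The "divergence" threshold ratio > 1.5 is therefore a conservative design that catches only clearly diverging runs; as a side effect, runs that converge very slowly but do not diverge are not failed (intended).

Confirmed fail-fast thresholds (design §2.4 v2)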

Condition	Threshold type	Verdict
`final_train_loss > 3.0`	static sanity	FAIL
`final_val_loss(epoch5+ MA) / initial_val_loss(epoch1~4) > 1.5`	per-run divergence	FAIL
`nan_step_count > 0`		FAIL
`n_nan_predictions > 0`		FAIL
`final_codebook_util < 0.05` (VQ only)		WARNING
`pape_definition_hash` mismatch		FAIL
`scaler_space_signature` mismatch		FAIL
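
The numeric rows of this table can be expressed as a small check. This is a sketch under assumptions: "epoch5+ MA" is read as the mean of all val losses from epoch 5 onward, "initial_val_loss(epoch1~4)" as the mean of epochs 1 to 4, and the function names are hypothetical. The codebook, hash, and scaler-signature rows are omitted.

```python
import math
from statistics import fmean

TRAIN_SANITY_CEILING = 3.0        # static sanity threshold from the table
VAL_DIVERGENCE_MULTIPLIER = 1.5   # per-run divergence threshold from the table


def val_divergence_ratio(val_losses):
    """Mean val loss of epochs 5+ divided by mean val loss of epochs 1-4."""
    if len(val_losses) < 5:
        return math.nan  # not enough epochs to form both windows
    return fmean(val_losses[4:]) / fmean(val_losses[:4])


def fail_fast_reasons(final_train_loss, val_losses,
                      nan_step_count=0, n_nan_predictions=0):
    """Return the list of FAIL conditions that fire (empty list means pass)."""
    reasons = []
    if final_train_loss > TRAIN_SANITY_CEILING:
        reasons.append("train_sanity")
    # A NaN ratio compares False, so short runs are not failed by this rule.
    if val_divergence_ratio(val_losses) > VAL_DIVERGENCE_MULTIPLIER:
        reasons.append("val_divergence")
    if nan_step_count > 0:
        reasons.append("nan_steps")
    if n_nan_predictions > 0:
        reasons.append("nan_predictions")
    return reasons


if __name__ == "__main__":
    print(fail_fast_reasons(0.3, [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]))
    print(fail_fast_reasons(0.3, [0.4] * 4 + [0.36] * 6))
```

A run whose val loss sits at 0.9 of its initial level, close to the observed median ratio of 0.93, passes; only a run whose late val loss exceeds 1.5 times the early level is failed.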

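Rejected alternatives

Option A (converged-only P95 × 1.5): tightening the convergence criterion leaves accepted=0 (the best-performing v6 runs all have val_mse ratio>0.7, so every one is dropped). Unusable.
Option C (v1 historical max × 1.5 = 2.63): based on the epoch-1 loss of a 2-epoch prototype (track-e-tier0), so a normally converging v7 run can never reach it. Rejected.

Reproduction command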

from experiments.federated.v7_0419_stage0_preregistration import (
    extract_v6_loss_distribution
)
summary, df = extract_v6_loss_distribution()
assert summary["threshold_choice"]["final_val_loss_divergence_multiplier"] == 1.5
assert summary["threshold_choice"]["train_sanity_ceiling"] == 3.0
assert summary["n_accepted_runs"] == 22

A3. Apt_max_load: v2 multi-metric validation

Apt88 ranks first on all 5 peak-load metrics (resolves critic M2 sensitivity).

Metric	Apt6	Apt15	Apt30	Apt51	Apt88	winner
daily_max_mean	3.8338	1.6298	1.1007	2.3395	3.8925	Apt88
daily_max_median	3.8237	1.4959	0.8962	1.5698	3.9268	Apt88
daily_max_p95	6.0716	3.1920	2.1315	5.3892	6.4146	Apt88
overall_max	7.8603	4.9343	4.9224	7.4752	9.1724	Apt88
overall_p95	3.6507	1.9802	1.5030	2.9288	3.8727	Apt88
daily_max_std	1.4369	0.8396	0.5897	1.6896	1.5899	Apt51
daily_max_cv	0.3748	0.5152	0.5358	0.7222	0.4085	Apt51

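Decision: the plurality vote uses only the 5 metrics on the peak-load axis. Apt88 wins 5/5 → Apt_max_load = Apt88 is retained (the variability axis captures a different property from "extreme high-load" and is outside the design intent).

This is wired into the constant APT_MAX_LOAD = "Apt88" in v7_runner.py.

Reproduction command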
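
The plurality vote can be replayed directly from the table values, without the project code. The sketch below is a standalone illustration; the dictionary layout and variable names are assumptions and do not reflect how `compute_apt_max_load` stores its results.

```python
from collections import Counter

# Values copied from the A3 table above.
PEAK_METRICS = {
    "daily_max_mean":   {"Apt6": 3.8338, "Apt15": 1.6298, "Apt30": 1.1007, "Apt51": 2.3395, "Apt88": 3.8925},
    "daily_max_median": {"Apt6": 3.8237, "Apt15": 1.4959, "Apt30": 0.8962, "Apt51": 1.5698, "Apt88": 3.9268},
    "daily_max_p95":    {"Apt6": 6.0716, "Apt15": 3.1920, "Apt30": 2.1315, "Apt51": 5.3892, "Apt88": 6.4146},
    "overall_max":      {"Apt6": 7.8603, "Apt15": 4.9343, "Apt30": 4.9224, "Apt51": 7.4752, "Apt88": 9.1724},
    "overall_p95":      {"Apt6": 3.6507, "Apt15": 1.9802, "Apt30": 1.5030, "Apt51": 2.9288, "Apt88": 3.8727},
}
VARIABILITY_METRICS = {
    "daily_max_std": {"Apt6": 1.4369, "Apt15": 0.8396, "Apt30": 0.5897, "Apt51": 1.6896, "Apt88": 1.5899},
    "daily_max_cv":  {"Apt6": 0.3748, "Apt15": 0.5152, "Apt30": 0.5358, "Apt51": 0.7222, "Apt88": 0.4085},
}


def argmax_by_metric(table):
    """Map each metric name to the household with the largest value."""
    return {metric: max(row, key=row.get) for metric, row in table.items()}


peak_winners = argmax_by_metric(PEAK_METRICS)
votes = Counter(peak_winners.values())
plurality_winner, n_votes = votes.most_common(1)[0]

if __name__ == "__main__":
    print(plurality_winner, f"{n_votes}/{len(PEAK_METRICS)}")
    print(argmax_by_metric(VARIABILITY_METRICS))
```

The margin over Apt6 is narrow on daily_max_mean (3.8925 vs 3.8338) but consistent across all five peak-load metrics, while both variability metrics point to Apt51.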

from experiments.federated.v7_0419_stage0_preregistration import compute_apt_max_load
stats, apt_max, argmax_by_metric = compute_apt_max_load()
assert apt_max == "Apt88"
extreme = ["daily_max_mean","daily_max_median","daily_max_p95","overall_max","overall_p95"]
assert all(argmax_by_metric[m] == "Apt88" for m in extreme)

pytest evidence

Full test suite: 362 passed, 0 failed (2026-04-19, uv run python -m pytest tests/ -v --ignore=tests/integration_distilts.py).

New v2 reproducibility tests: 26 tests PASS (tests/test_v7_stage0_reproducibility.py).

Test group	Passed
TestSingleSource (verifies v7_metrics import)	6
TestDefinitionHash (16 hex + determinism + shared value)	3
TestGoldenTensors (G1~G5 PAPE+HR atol=1e-6)	10
TestG5Reproducibility (numpy PCG64 + stability + n_hits)	3
TestG2Policy (NaN policy)	3
TestFrozenArtifacts (shape + uniform range)	2

The existing 57 tests in test_v7_runner.py also all PASS (no regressions).

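Design spec diff summary

docs/reference/project_state/track_v7_design.md:

§0.5 smoke households: "Apt_max_load" → "Apt88" made explicit, with the multi-metric validation evidence added.
§2.2 metric definitions: single-source module path stated, G2 NaN policy stated, G5 14-day extension stated, definition hash 8be2bd2f691deed0 recorded, reproduction-rule summary blob added.
§2.4 fail-fast threshold: v1 "historical max × 1.5" fully retired and replaced with the v2 "per-run val divergence + static train sanity" 7-metric table. The recomputed v6 statistics and the rejected alternatives are recorded.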

Stage 0.5 smoke entry precondition check: v2

Item	Status	Value
C1 Single-source definition hash	✅	`8be2bd2f691deed0` (shared by stage-0 and v7_runner)
C2 Gate 1 G1~G5 assertions enabled	✅	`_run_golden_tensor_check` atol=1e-6
C3 fail-fast threshold redesign	✅	per-run `val_ratio > 1.5` + train `> 3.0`
M1 G5 14-day extension	✅	n_hits=13, seed deviation reduced 44%
M2 Apt88 re-validated	✅	5/5 peak metrics
M3 G2 NaN policy	✅	spec and implementation consistent
M4 Shared module	✅	`peak_analysis.v7.metrics`
M6 Reproducibility tests	✅	26 PASS
New MLflow run	✅	`4659d778c5e9460aaa1c5b928508d9a9`
design spec update	✅	§0.5, §2.2, §2.4
Full pytest	✅	362 PASS

Stage 0.5 invocation (confirmed)

uv run python -m experiments.federated.v7_runner \
    --mode=smoke \
    --households=Apt6,Apt88 \
    --cells=B0,B2,A3 \
    --seeds=42,43,123 \
    --golden-tensor-check

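Stage 0.5 smoke entry is cleared. All C/M items are resolved.

Unresolved (additionally required before Stage 1 entry)

The training logic inside dispatch_cell in v7_runner.py is still the _dummy_epoch_loop stub. The actual DLinear + peak-loss + VQ implementation is follow-up engineering work (a condition for starting the Stage 0.5 smoke).
scaler_space_signature is still a placeholder constant (the normalization-leakage guard will be connected when the training pipeline is implemented).
The 16 v6 runs accepted via val_mse have no train_loss, so the final_train_loss distribution remains thin at n=3. Once v7 starts actually logging train_loss, recomputing the distribution and refining train_sanity_ceiling is recommended (after the Stage 1 early checkpoint).

Reproduction command summary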

# 1. Re-run stage-0 pre-registration (regenerates MLflow run + summary JSON)
uv run python experiments/federated/v7_0419_stage0_preregistration.py

# 2. Run reproducibility tests
uv run python -m pytest tests/test_v7_stage0_reproducibility.py -v

# 3. Run full test suite
uv run python -m pytest tests/ -v --ignore=tests/integration_distilts.py

# 4. Gate 1 smoke check via runner (no training)
uv run python -c "
import sys
sys.path.insert(0, 'src')
from peak_analysis.v7.metrics import definition_hash
print('definition_hash =', definition_hash())
assert definition_hash() == '8be2bd2f691deed0'
"