Information-Geometric Context Window Governance and the Probabilistic Theory of Long-Context Collapse in Large Language Models
A Unified Framework: Observer Entropy, Extreme Value Theory, and the CPL 4.0 Phase-Aware Governor

DPID: 1073

Abstract

We present a unified theoretical framework connecting two complementary approaches to long-context degradation in Transformer-based large language models (LLMs): the information-geometric theory of observer entropy and the probabilistic theory of attention collapse via Extreme Value Theory (EVT). The central object is the observer entropy S_obs(p_θ, ε) = KL(p_θ ‖ Π̃_ε p_θ). Its quadratic scaling law, the Bridge Theorem, states that S_obs = ½ε² v(θ)⊤ I(θ) v(θ) + O(ε³), where I(θ) is the Fisher information matrix. We connect this quantity, at an interpretative level, to the signal-noise competition in self-attention. Defining the signal strength μ_s = (1/√d) Tr(W_Q⊤ W_K Σ_qr) and the effective margin μ_L = μ_s - σ√(2 log L_eff), we prove a two-sided bound c₁μ_L e^{μ_L}/L ≤ S_obs(L) ≤ c₂ e^{μ_L}/L, valid in the pre-collapse regime μ_L > log 2 + 1, and a one-sided upper bound S_obs(L) ≤ c₂ e^{μ_L}/L for all sufficiently large L. The equivalence between the partition-based and attention-uniform definitions of S_obs is an interpretative identification; its rigorous formalisation remains an open problem. These bounds yield a Fundamental Impossibility Theorem: for any finite μ_s, observer entropy decays to zero as L → ∞, establishing long-context collapse as an information-theoretic inevitability under softmax attention. Using Gumbel convergence for weakly dependent logit maxima with Gaussian marginals (Leadbetter conditions), we derive a closed-form Probabilistic Risk Law, P(ℱ_L) ≈ 1 - exp(-exp(-(μ_s - σ√(2 log L))/a_L)), and conjecture that the critical length L_crit is a heavy-tailed random variable (formal proof incomplete; see Remark [rem:lcrit_gap]). The CPL 4.0 phase-aware governor emerges as the principled control-theoretic response, with a Master Theorem guaranteeing a hard context cap, entropy contraction, and a sub-linear fragmentation bound N_F(T) = O(√(T log(1/δ₀))). Numerical verification confirms the ε² scaling law in both worked examples.
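
As an illustration of how the Probabilistic Risk Law behaves, the following minimal Python sketch evaluates the effective margin and the Gumbel-type failure probability for given μ_s, σ, and L. The function names are hypothetical. The abstract does not specify the scale a_L, so the default used here, a_L = σ/√(2 log L) (the classical normalisation for Gaussian maxima), is an assumption and can be overridden.

```python
import math


def effective_margin(mu_s: float, sigma: float, length: int) -> float:
    """Effective margin mu_L = mu_s - sigma * sqrt(2 log L), as in the abstract."""
    if length < 2:
        raise ValueError("length must be at least 2 so that log L > 0")
    return mu_s - sigma * math.sqrt(2.0 * math.log(length))


def assumed_gumbel_scale(sigma: float, length: int) -> float:
    """ASSUMPTION: a_L = sigma / sqrt(2 log L); the abstract leaves a_L unspecified."""
    if length < 2:
        raise ValueError("length must be at least 2 so that log L > 0")
    return sigma / math.sqrt(2.0 * math.log(length))


def failure_probability(mu_s: float, sigma: float, length: int, a_L=None) -> float:
    """Risk law P(F_L) ~ 1 - exp(-exp(-(mu_s - sigma*sqrt(2 log L)) / a_L))."""
    if a_L is None:
        a_L = assumed_gumbel_scale(sigma, length)  # assumed default, see above
    x = -effective_margin(mu_s, sigma, length) / a_L
    if x > 700.0:
        # exp(x) would overflow; the probability is 1 to machine precision.
        return 1.0
    # -expm1(-y) equals 1 - exp(-y) and stays accurate when y is tiny.
    return -math.expm1(-math.exp(x))


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    for length in (10, 100, 10_000, 1_000_000):
        p = failure_probability(mu_s=4.0, sigma=1.0, length=length)
        print(f"L = {length:>9d}  P(failure) ~ {p:.4f}")
```

Under these assumptions the failure probability rises monotonically with L for fixed μ_s and σ, and equals 1 - e⁻¹ exactly where the effective margin crosses zero.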