Scaling Laws Explorer

Interactive companion to the canonical scaling-laws literature: explore how model size, training data, and compute determine large-language-model performance — and how the modern prescription has evolved from 2020 (Kaplan) through 2022 (Chinchilla) to 2024+ (inference-aware optimization).

Executive summary. Frontier-model performance is predictable. Loss scales smoothly with parameters and training data along a fitted curve, and the compute-optimal allocation scales the two in roughly equal proportion, at about 20 training tokens per parameter (Chinchilla 2022). Add expected inference volume and the optimum shifts to smaller, longer-trained models, which is why recent releases (LLaMA-3, Qwen-2.5, Nemotron-3, OLMo 3) are "over-trained" by the Chinchilla standard.

Variable legend

Every quantity used in this dashboard, with its symbol and what it represents.

🧠 N · Model size: Number of trainable parameters (e.g., 7B, 70B).

📚 D · Training tokens: Volume of pretraining data the model sees during training.

⚡ C · Compute budget: Total training FLOPs (≈ 6 × N × D for transformers).

📉 L · Loss: Cross-entropy in nats per token; lower is better.

🎯 E · Bayes floor: Irreducible entropy of natural language; the loss you can never beat.

🔍 PPL · Perplexity: exp(L). How many tokens the model is "choosing between" per step.

🔄 Q · Inference volume: Expected queries served after training (Sardana 2024).

⏱️ h · Capability horizon: Duration of human-expert tasks AI can complete reliably (METR 2025).

🧮 A · Capacity prefactor: Strength of the 🧠 model-size term in Chinchilla's loss law. Fit value ≈ 406.4.

📐 α · Capacity exponent: How fast loss falls as 🧠 model size grows. Fit value ≈ 0.336.

🧮 B · Data prefactor: Strength of the 📚 training-token term in Chinchilla's loss law. Fit value ≈ 410.7.

📐 β · Data exponent: How fast loss falls as 📚 training tokens grow. Fit value ≈ 0.283.

🏗 N_total · Total parameters (MoE): All experts plus shared layers (e.g., DeepSeek-V3: 671B).

⚙ N_active · Active parameters per token (MoE): Params actually used per forward pass (e.g., DeepSeek-V3: 37B).

✨ N_eff · Effective dense-equivalent: √(N_total · N_active), the dense size whose loss matches the MoE (Clark 2022).

◐ s · Sparsity: N_active / N_total. Lower = more sparse. Abnar 2025 optimum: ~25% small scale → ~6% at frontier.

💎 q · Information per token: Corpus property: 1 token of quality-q data carries q× the usable signal of a baseline token. 0.3 = noisy / unfiltered; 1 = Common Crawl baseline; 2 = DCLM-filtered; 5 = synthetic curated. Shifts D/N optimum as q^−0.91. Repetition is a separate, training-procedure mechanism (Muennighoff 2023; modeled by the data-wall curve in Panel 2).

1 · Loss law and the compute-optimal recipe

Set your model size, training tokens, and expected inference volume. The dashboard reports the predicted loss at your chosen (N, D) and overlays the optimal tokens-per-parameter recipe across three eras — Kaplan (2020), Chinchilla (2022), and Sardana (2024).

L(N, D) = 🎯 E + 🧠 A · N⁻ᵅ + 📚 B · D⁻ᵝ [Chinchilla 2022]
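The readouts this panel reports (predicted loss, perplexity, compute, D/N, and the three-way loss decomposition) all follow from the formula above. A minimal sketch, assuming the fit values quoted in this page's legend and references; the function name and dictionary keys are illustrative, not the dashboard's actual code:

```python
import math

# Fit values quoted on this page (Hoffmann et al. 2022, Table A3).
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.336, 410.7, 0.283


def chinchilla_report(n_params: float, n_tokens: float) -> dict:
    """Loss decomposition for a dense model of size N trained on D tokens."""
    capacity_gap = A * n_params ** -ALPHA   # shrinks as the model grows
    data_gap = B * n_tokens ** -BETA        # shrinks as training data grows
    loss = E + capacity_gap + data_gap      # nats per token
    return {
        "bayes_floor": E,
        "capacity_gap": capacity_gap,
        "data_gap": data_gap,
        "loss": loss,
        "perplexity": math.exp(loss),
        "compute_flops": 6.0 * n_params * n_tokens,
        "tokens_per_param": n_tokens / n_params,
    }


# Example input: a 70B model on 1.4T tokens, i.e. D/N = 20.
report = chinchilla_report(70e9, 1.4e12)
for key, value in report.items():
    print(f"{key:>18}: {value:.4g}")
```

The decomposition makes the trade-off visible: whichever of the two gaps is larger is the cheaper one to attack next.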

🧠Model size (N)

100M to 10T parameters (log scale) — type values like 7B, 175B, 1.5T, or 3e10

📚Training tokens (D)

1B to 100T tokens (log scale) — type values like 300B, 15T, or 2.5e12

🔄Expected inference volume (Q)

Lifetime queries served — shifts the Sardana 2024 recipe toward over-training

💎Information per token (q) 1.00× (web baseline)

A corpus property: how much usable signal each token carries. 0.3 = noisy / unfiltered raw CC; 1.0 = standard Common Crawl (Chinchilla baseline); 2 = DCLM-filtered; 5 = synthetic curated (Phi-style). Repetition effects are separate — see Panel 2's data-wall curve.

Predicted loss

—

— ppl

Compute (≈ 6 N D)

—

FLOPs

Your D / N

—

tokens / param

Loss decomposition (your N, D)

🎯 Bayes floor

—

🧠 Capacity gap

—

📚 Data gap

—

Optimal D / N at your compute (across eras)

Kaplan 2020 —

Chinchilla 2022 —

Sardana 2024 (at your Q) —

Your D / N —

Reading the plot. Each line is the prescribed tokens-per-parameter (D/N) at a given training compute budget. Kaplan grows model size faster than data (N ∝ C^0.73, D ∝ C^0.27), so its D/N falls as compute rises and models end up under-trained; Chinchilla says ~20 regardless of compute; Sardana over-trains (line shifts upward as your Q grows).

Real models appear as two diamonds connected by a dotted line when the model is MoE: a filled diamond at D/N_active (per-token compute view — what governs training and inference FLOPs) and an open diamond at D/N_total (total-capacity view — what governs the loss surface and capacity-floor in Chinchilla's law via N_eff). For dense models the two coincide so only the filled diamond appears. * = MoE. † = Kimi K2.5 includes ~15T multimodal continual tokens atop K2's text base — text-equivalent comparison is approximate. Hover any diamond for the full N_total, N_active, D, both D/N ratios, and the source citation.

The MoE entries (Kimi, DeepSeek-V3/V4, Qwen3.5-397B-A17B, GLM-5) span an order of magnitude or more between active and total — that vertical line height is the visual signature of MoE's "decouple training/inference compute from capacity" trick.
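Why inference volume pushes the recipe toward over-training can be sketched numerically. The example below assumes a Sardana-style objective: pick the (N, D) that reaches a target loss at the lowest total cost, counting 6·N·D training FLOPs plus 2·N FLOPs per inference token. Expressing inference demand in tokens (rather than queries), the grid range, and all function names are assumptions made for illustration; the dashboard's own solver may differ.

```python
# Fit values quoted on this page (Hoffmann et al. 2022, Table A3).
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.336, 410.7, 0.283


def tokens_needed(n_params, target_loss):
    """Training tokens D that reach target_loss at size N, or None if N is too small."""
    headroom = target_loss - E - A * n_params ** -ALPHA
    if headroom <= 0:
        return None  # this model size cannot reach the target at any D
    return (B / headroom) ** (1.0 / BETA)


def cheapest_recipe(target_loss, inference_tokens, grid_points=2000):
    """Grid-search N in [1e8, 1e13] for the lowest training + inference cost."""
    best = None
    for i in range(grid_points):
        n = 10 ** (8 + 5 * i / (grid_points - 1))
        d = tokens_needed(n, target_loss)
        if d is None:
            continue
        cost = 6 * n * d + 2 * n * inference_tokens
        if best is None or cost < best[0]:
            best = (cost, n, d)
    cost, n, d = best
    return {"N": n, "D": d, "D_over_N": d / n, "total_flops": cost}


no_serving = cheapest_recipe(target_loss=2.0, inference_tokens=0)
heavy_serving = cheapest_recipe(target_loss=2.0, inference_tokens=1e13)
print(no_serving["D_over_N"], heavy_serving["D_over_N"])
```

With no serving load the search recovers the training-only optimum; adding inference tokens penalizes every extra parameter, so the cheapest recipe moves to a smaller model trained on more tokens.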

2 · How the data distribution shifts the recipe

Two distinct mechanisms move the Chinchilla optimum off its famous D/N = 20: (1) corpus quality — the information-per-token of the underlying dataset — and (2) the data wall — what happens when raw token count exceeds the world's supply of fresh natural text, forcing repetition with diminishing returns. The two are mathematically similar but mechanistically distinct: quality is a property of the corpus, repetition is a property of the training procedure.

L(N, D | q) = 🎯 E + A · N⁻ᵅ + B · (q · D)⁻ᵝ ⇒ (D / N)_opt = 20 · q^−0.91 [derived from ∂L/∂N = 0 on C = 6 N D]
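The q^−0.91 exponent can be checked directly. Minimizing A·N⁻ᵅ + B·(q·D)⁻ᵝ on the constraint C = 6·N·D gives N_opt = G·(C/6)^(β/(α+β)) with G = (αA / (βB·q⁻ᵝ))^(1/(α+β)), so D/N scales as q^(−2β/(α+β)). The sketch below, with illustrative function names, evaluates that closed form using the fit values quoted on this page. It checks only the q exponent: because α ≠ β in these fit values, the closed-form baseline ratio drifts with compute, and the prefactor 20 is the rule-of-thumb baseline.

```python
import math

# Fit values quoted on this page (Hoffmann et al. 2022, Table A3).
A, ALPHA, B, BETA = 406.4, 0.336, 410.7, 0.283


def optimal_ratio(compute_flops: float, q: float = 1.0) -> float:
    """Closed-form compute-optimal D/N for L = E + A*N^-a + B*(q*D)^-b on C = 6*N*D."""
    b_eff = B * q ** -BETA                      # quality rescales the data prefactor
    g = (ALPHA * A / (BETA * b_eff)) ** (1.0 / (ALPHA + BETA))
    k = compute_flops / 6.0                     # k = N * D
    n_opt = g * k ** (BETA / (ALPHA + BETA))
    d_opt = k / n_opt
    return d_opt / n_opt


# Analytic exponent of q in (D/N)_opt.
q_exponent = -2.0 * BETA / (ALPHA + BETA)

# Empirical exponent from doubling q at fixed compute.
ratio_shift = optimal_ratio(1e24, q=2.0) / optimal_ratio(1e24, q=1.0)
measured = math.log(ratio_shift) / math.log(2.0)
print(f"analytic {q_exponent:.3f}, measured {measured:.3f}")
```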

Compute-optimal D / N by corpus information density (q)

Noisy / unfiltered (q ≈ 0.3)
Raw Common Crawl with minimal cleaning —

Web baseline (q = 1.0)
Chinchilla 2022 reference corpus —

DataDecide top recipe (q ≈ 1.5)
Magnusson 2025: best-ranked of 25 corpora —

DCLM-filtered (q ≈ 2.0)
Li 2024: model-based quality filtering —

Synthetic curated (q ≈ 5.0)
Phi-3 / TinyStories style —

Muennighoff 2023 data wall (training-procedure effect)

A separate mechanism from q: once raw D exceeds the world's ~20T fresh natural-text tokens, additional D is repeated. A repeated token shows the model patterns it has already seen, so it contributes less to loss reduction than a fresh one, but it is still the same data, not "lower quality".

Break point at C ≈ —

Forced D/N at 10× past break —

Notable: the q ≈ 2 (DCLM) line is where you'd land on quality alone — but real models (LLaMA-3-8B at D/N ≈ 1875, Qwen3-8B at 4500) sit far above any of these lines. That additional gap is the Sardana 2024 inference-cost effect (Panel 1's Q slider), not a quality or repetition effect.

Reading the plot. Solid lines are compute-optimal D/N for a fixed corpus information density q (the Panel 1 slider). The dash-dot orange line is the Muennighoff data-wall curve — flat at the web baseline until C ≈ 10²⁶, then bending upward as you exhaust ~20T fresh tokens and are forced to repeat (each repetition contributes diminishing loss reduction, so the math behaves like a steepening cost on additional D, manifesting as a higher required D/N to hit the same effective-tokens target). Real-model anchors show where actual training has landed; the residual gap above even the high-q lines is Sardana inference economics, modeled in Panel 1.
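The break point and the diminishing value of repeated tokens can be sketched as follows. The break-point arithmetic uses only numbers from this panel (~20T fresh tokens, D/N = 20, C = 6·N·D). The effective-token formula follows the saturating-exponential form of Muennighoff 2023; the decay constant `r_star = 15.0` is an assumed illustrative value, not one quoted on this page, and the function names are hypothetical.

```python
import math

FRESH_TOKENS = 20e12    # ~20T fresh natural-text tokens (from the text above)
BASELINE_RATIO = 20.0   # Chinchilla D/N baseline (from the text above)


def break_point_flops(fresh_tokens=FRESH_TOKENS, ratio=BASELINE_RATIO):
    """Compute budget at which the baseline recipe first needs all fresh tokens."""
    n_params = fresh_tokens / ratio
    return 6.0 * n_params * fresh_tokens


def effective_tokens(raw_tokens, fresh_tokens=FRESH_TOKENS, r_star=15.0):
    """Fresh-token equivalent of raw_tokens when the excess is repeated data.

    r_star is an ASSUMED decay constant controlling how fast repeats lose value.
    """
    if raw_tokens <= fresh_tokens:
        return raw_tokens                      # no repetition needed yet
    repeats = raw_tokens / fresh_tokens - 1.0  # number of extra epochs
    return fresh_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


print(f"break point at C ≈ {break_point_flops():.2e} FLOPs")
for raw in (10e12, 40e12, 200e12):
    print(f"raw {raw:.0e} tokens -> effective {effective_tokens(raw):.3e}")
```

Under these assumptions the first few repeats are worth almost as much as fresh data, and the shortfall grows as the number of epochs approaches r_star.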

3 · Capability horizon forecast (METR)

Frontier-model agentic capability has its own scaling law: the duration of human-expert tasks AI can complete reliably has doubled roughly every 7 months from 2019 through 2025. Project forward to see when AI crosses task-duration thresholds you care about.

h(t) = 50 min × 2^{(t − 2025.25) / 0.583} [t in years, doubling every 7 months]
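Inverting the trend gives the crossing dates listed in this panel: t = 2025.25 + 0.583 · log₂(threshold / 50 min). A minimal sketch with illustrative names; the accelerated scenario reuses the same 2025.25 anchor, which is an assumption about how the dashed line is drawn.

```python
import math

H0_MINUTES = 50.0        # horizon at the anchor date (from the formula above)
T0_YEAR = 2025.25        # anchor date (from the formula above)
DOUBLING_YEARS = 0.583   # 7 months


def horizon_minutes(year, doubling_years=DOUBLING_YEARS):
    """Projected reliable task horizon, in minutes, at a fractional year."""
    return H0_MINUTES * 2 ** ((year - T0_YEAR) / doubling_years)


def crossing_year(threshold_minutes, doubling_years=DOUBLING_YEARS):
    """Fractional year at which the trend reaches the given task duration."""
    return T0_YEAR + doubling_years * math.log2(threshold_minutes / H0_MINUTES)


thresholds = {
    "10 min": 10,
    "1 hour": 60,
    "1 work day (8 h)": 8 * 60,
    "1 work week (40 h)": 40 * 60,
    "1 month (160 h)": 160 * 60,
}
for label, minutes in thresholds.items():
    historical = crossing_year(minutes)
    fast = crossing_year(minutes, doubling_years=4 / 12)  # assumed 4-month doubling
    print(f"{label:>20}: {historical:.2f} (7-month) / {fast:.2f} (4-month)")
```

Thresholds below 50 minutes land before the anchor date, since the trend had already crossed them.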

⏱️Project to year 2026.4

Slide to project frontier capability horizon

Reliable horizon

—

at the chosen date

Months from today

—

today = May 2026

Crossing dates (historical trend)

10 min—

1 hour—

1 work day (8 h)—

1 work week (40 h)—

1 month (160 h)—

Caveat. Unlike Panels 1 and 2 (which fit loss curves with known functional form and tight error bars), this is a Moore's-law-style historical extrapolation. METR notes the trend may have accelerated to a 3-4 month doubling in 2024-2025; the dashed line shows that scenario. Use this for high-level capability forecasting, not for predicting specific model releases.

4 · Mixture-of-Experts adjustment

MoE separates the parameters that influence loss from the parameters that influence cost. Set total and active independently to see how this changes the Chinchilla prediction — and compare against Abnar 2025's compute-optimal sparsity prescription.

✨ N_eff = √( 🏗 N_total · ⚙ N_active ) [Clark 2022 / geometric-mean approximation]
📉 L(N_total, N_active, D) = 🎯 E + A · ✨ N_eff^−α + B · 📚 D^−β
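The two formulas above can be sketched in a few lines. The parameter counts are the DeepSeek-V3 figures quoted on this page; the 14.8T token count is an example input supplied here, not a value from the dashboard. The "inference savings" definition (1 − N_active / N_eff, comparing per-token cost against a dense model of the same predicted loss) is an assumption about what the readout means.

```python
import math

# Fit values quoted on this page (Hoffmann et al. 2022, Table A3).
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.336, 410.7, 0.283


def moe_report(n_total: float, n_active: float, n_tokens: float) -> dict:
    """Chinchilla prediction with the geometric-mean effective size for MoE."""
    n_active = min(n_active, n_total)          # active is clamped to <= total
    n_eff = math.sqrt(n_total * n_active)      # dense-equivalent size
    loss = E + A * n_eff ** -ALPHA + B * n_tokens ** -BETA
    return {
        "sparsity": n_active / n_total,
        "n_eff": n_eff,
        "loss": loss,
        "perplexity": math.exp(loss),
        "train_flops": 6.0 * n_active * n_tokens,
        # ASSUMED definition: per-token cost saved vs a dense model of size n_eff.
        "inference_savings": 1.0 - n_active / n_eff,
    }


report = moe_report(n_total=671e9, n_active=37e9, n_tokens=14.8e12)
print(f"N_eff ≈ {report['n_eff'] / 1e9:.1f}B, sparsity ≈ {report['sparsity']:.1%}")
```

For a dense model (N_total = N_active) the sketch reduces to the Panel 1 formula, with N_eff equal to N and zero inference savings.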

🏗Total parameters (N_total)

1B to 10T total parameters (log scale). Try 671B (DeepSeek-V3), 120B (Nemotron-3-Super), 141B (Mixtral 8×22B).

⚙Active per token (N_active)

Auto-clamped to ≤ N_total. Try 37B (DeepSeek-V3), 12.7B (Nemotron-3-Super), 39B (Mixtral 8×22B).

📚Training tokens (D)

1B to 100T tokens (log scale).

Sparsity (s)

—

✨ N_eff

—

dense-equivalent

Predicted loss

—

— ppl

Inference savings

—

vs matching dense

Abnar 2025 — optimal sparsity at this compute

⚡ Compute (≈ 6 · N_active · D) —

◐ Your sparsity —

◐ Abnar-optimal s —

Verdict —

Reading the plot. The curve shows predicted loss as you vary N_total while holding N_active (your inference cost) fixed. Moving right adds experts at zero additional inference cost, but N_eff grows only as √N_total (Clark 2022), so the capacity term falls only as N_total^(−α/2) and the gain shrinks. The dashed line marks the dense baseline (N_total = N_active); the green tick marks Abnar 2025's compute-optimal sparsity for this training budget.

5 · Data-type × benchmark-cluster sensitivity matrix (rough)

A rank-4 sketch of how each data category responds across the standard benchmark clusters. Cell values are 0–3 estimates (with − for negative transfer), interpolated from published per-data-type ablations. Hatched cells (with diagonal stripes and a ? suffix on the value) are inferred from indirect evidence; solid cells are anchored to a named paper (hover for citation). This is meant as a brainstorming target for the M matrix your team would actually fit — not a measured result.

Source codes: DCLM = Li 2024 · DataDecide = Magnusson 2025 · OT = OpenThoughts (Guha 2025) · DS = DeepSeek-Math (Shao 2024) · P = Phi-3 (Abdin 2024) · W = WildChat-1M (Zhao 2024) · W50 = WildChat-50M (Feuer 2025).

Column structure validation — two independent sources:
Evalchemy (2,300+ models · 10,000+ evals): within-category Pearson 0.78 / across-category 0.49 / gap +0.29. Top intra-category pairs at 0.94-0.98 (CodeElo ↔ CodeForces, AIME24 ↔ AIME25).
EEE_datastore (20K models · 60+ benchmark configs · 229K eval results, accessed via evaleval/EEE_datastore HF dataset): within-cluster 0.72 / across-cluster 0.58 / gap +0.15. The smaller EEE gap is exactly what the BAT paper predicts: EEE's population is much more heterogeneous (20K models including weak ones), so the "strong models good at everything" halo effect is diluted relative to Evalchemy's curated frontier set.
Both replicate the rank-4 cluster signal. Per-column EEE gaps reveal cluster quality varies sharply: F +0.30 (cleanest, IFEval-style is genuinely orthogonal), S +0.22 (SWE-bench-like clusters tightly), K +0.16, H +0.14, T +0.09, R +0.08 (diffuse — reasoning benchmarks cross-load with knowledge), M +0.06 (diffuse — different math difficulties don't track each other), D −0.38. Top cross-cluster pairs: BBH ↔ MMLU-Pro at r=0.96 (BBH is half-knowledge); IFEval ↔ MMLU-Pro at r=0.96 (halo effect); MathVista ↔ MMLU-Pro at r=0.95 (math-vision loads on knowledge).
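The within-versus-across gap used for column validation can be sketched as a mean of pairwise Pearson correlations, split by whether two benchmarks share a cluster label. The simple averaging scheme is an assumption (the Evalchemy and EEE analyses may weight pairs differently), and the data below is synthetic, generated only to exercise the function.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def cluster_gap(scores, clusters):
    """scores: benchmark -> per-model scores; clusters: benchmark -> cluster label."""
    within, across = [], []
    for a, b in combinations(sorted(scores), 2):
        r = pearson(scores[a], scores[b])
        (within if clusters[a] == clusters[b] else across).append(r)
    w = sum(within) / len(within)
    x = sum(across) / len(across)
    return w, x, w - x


# Synthetic models: one shared "general ability" factor plus one skill per cluster.
rng = random.Random(0)
scores = {name: [] for name in ("code_a", "code_b", "math_a", "math_b")}
for _ in range(200):
    general, code_skill, math_skill = (rng.gauss(0, 1) for _ in range(3))
    scores["code_a"].append(general + code_skill + rng.gauss(0, 0.5))
    scores["code_b"].append(general + code_skill + rng.gauss(0, 0.5))
    scores["math_a"].append(general + math_skill + rng.gauss(0, 0.5))
    scores["math_b"].append(general + math_skill + rng.gauss(0, 0.5))
clusters = {"code_a": "H", "code_b": "H", "math_a": "M", "math_b": "M"}

within, across, gap = cluster_gap(scores, clusters)
print(f"within {within:.2f} / across {across:.2f} / gap {gap:+.2f}")
```

The shared general-ability factor is what keeps the across-cluster correlation above zero, which is the halo effect the text describes.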

* LLM-judge-mediated columns (D-q / D-j / SAFETY): EEE_datastore's within-D Pearson of 0.18 sits well below the across-D value of 0.56 (the −0.38 gap reported above), which shows the old single "D · Dialogue" column was not a real cluster: it conflated arena-style preference, MT-Bench-style multi-turn quality, reward-model evaluations, and LLM-as-judge benchmarks. Splitting D into D-q (chat quality) and D-j (LLM-as-judge skill) is principled but doesn't fix the underlying problem: all of D-q, D-j, and SAFETY are LLM-judge- or rule-based-refusal-judge-mediated, so they carry biased + high-variance signal. WildChat-50M (Feuer 2025) documents this for chat benchmarks. SOS-Bench (Feuer 2024) measures the SAFETY column directly via BeaverTails / JailbreakHub / SaladBench / SGXSTest / StrongREJECT / DTToxicity / CDNA. This is useful signal, but treat it with the same lower-grade caution as D-q/D-j.

A-BB calibration anchor (bias-bounded-evaluation, abb/data 2026): Post-debiasing rank-correlation with the naive leaderboard varies 7× across LLM-judge benchmarks. FLASK (factored 12-skill rubric) preserves ~88% of the model ranking after A-BB removes formatting/length/style/position bias; MT-Bench (holistic helpfulness) preserves only ~13%. The two should not be lumped together — rubric structure, not judge identity, drives signal preservation. Judge intrinsic jitter (test-retest noise) is 14% of range for gpt-5-mini, 33-44% for gpt-3.5-turbo / gpt-4o-mini / deepseek-r1-32b. On MT-Bench-class benchmarks, the within-Llama-3-8B-SFT-variant factor-score spread (helpfulness, accuracy, relevance, depth, creativity, detail) is at-or-below the jitter floor on every individual factor — so D-j cell magnitudes ≤ 1 on integer scale are within the noise band for this benchmark class. Treat D-j cells with |value| ≤ 1 as direction-only on MT-Bench-class benchmarks; FLASK-class benchmarks support ~2× finer cell discrimination. See abb_findings.md for the full tables.

SOS-Bench Detailed empirical anchors (rows I and L): 232 models × 65 benchmarks under a single controlled lm-eval-harness invocation, all hosted at nyu-dice-lab/lm-eval-results-*.
Princeton SFT-preference-method ladder (same Llama-3-Base-8B-SFT base, 5 preference methods) anchors row L: K/M/R essentially invariant under all preference methods; F drops 15-25% under IPO/KTO/ORPO/RDPO; SAFETY drops ~60% under IPO/KTO/ORPO/RDPO (0.681 → 0.23-0.30). DPO appears neutral on every cluster — suspicious near-identity with SFT in Princeton's released checkpoints, worth flagging.
Magpie SFT-data variation (same Llama-3-8B base, 7 SFT corpora) anchors row I: K/R essentially invariant; F varies 40% (WildChat best at 0.33, Magpie/ShareGPT lowest at ~0.24); M nearly flat ~0.04 (Magpie-Reasoning slight bump to 0.058); SAFETY varies modestly (0.32-0.38, OpenHermes best). Corroborates WildChat-50M's claim that chat-style data dominates IF benchmarks.
Math-SFT models (NuminaCoT, MetaMath, dart-math; all Llama-3-8B base) anchor row M: NuminaCoT achieves M=0.123 vs Magpie baseline 0.04 (a roughly 3× lift on the math cluster, a clear positive). But math-SFT also degrades F by ~10pp (formal math style displaces chat IF) and leaves BBH (R) flat or slightly lower. K is unchanged. MetaMath-on-Llama-3 shows weak M-cluster transfer (0.035) despite the math-only training, which suggests the corpus matters more than just "math" as a label.
Code-only SFT (ajibawa-2023/Code-Llama-3-8B) shows the specialization tax: every general cluster degrades vs Magpie baselines (K −20%, M −60%, F −36%, R −16%) — but SAFETY actually rises to 0.47 (code-trained models often refuse non-code queries, inflating the safety score). This documents the high-end of the code-proportion axis; SOS-Bench has no H/SWE benchmarks, so positive effects of code data on code-benchmarks can't be anchored here.

Scientific-LM empirical anchor (S row): Three converging lines of evidence.
S × M is real and large. Llemma (Azerbayev 2024, ICLR): Llemma-7B trained on Proof-Pile-II (55B tokens of arxiv math + math-web + code, initialized from Code-Llama-7B) reaches GSM8K 36.4% / MATH 18% — Llemma-34B reaches GSM8K 51.1% / MATH 25%, a +20pp GSM8K and +13pp MATH lift over the Code-Llama base. DeepSeekMath-Base 7B (Shao 2024) does even better with math-web continual pretrain alone (no SFT/RL): GSM8K 64.2% / MATH 36.2%; with SFT+RL → 88.2% / 51.7%. Math-corpus scientific pretrain is one of the cleanest column-targeted data effects in the literature.
S × K lift is real but narrower than generalist data. Galactica-120B (Taylor 2022, Meta) was the original scientific LM: trained on a 106B-token corpus of papers + textbooks + knowledge bases, it beat OPT-175B on MMLU-STEM, outperformed Chinchilla on mathematical MMLU, and outperformed the much larger PaLM 540B on MATH. Withdrawn from public release after generating plausible-but-wrong scientific text. SciGLM-6B (NeurIPS 2024) and SciDFM-MoE-18.2B continue the line. Modern Phi-3 emphasizes synthetic-curated technical content (Φ row × K).
The biomedical specialization tax (Jeong et al. 2024, arxiv 2408.13833) is the bombshell finding. Biomedical specialist LMs dramatically underperform same-base generalist Instruct models on clinical case challenges: OpenBioLLM-8B JAMA 17.9% vs Llama-3-8B-Instruct 57.1% (−39pp); Meditron-7B JAMA 12.9% vs Llama-2-7B-chat 44.3% (−31pp); BioMistral-7B JAMA 27.9% vs Mistral-7B-Instruct 52.1%; PMC-Llama-7B NEJM 2.3% vs Llama-2-7B-chat 26.8%. At 70B scale: OpenBioLLM-70B LongHealth hallucination resistance 62.9% vs Llama-3-70B-Instruct 91.7% (−29pp). The same "formal-academic-style displaces chat instruction-following" mechanism we saw with NuminaCoT math-SFT and ajibawa Code-LLaMA-only-SFT — but the magnitude here is much larger. Concluding the paper: "fine-tuning LLMs to biomedical data may not provide the expected benefits"; recommend retrieval-augmented generation over continued domain pretraining.
OpenScholar-Llama-3.1-8B is the abb / MT-Bench anchor for academic-domain SFT: 4.96 on MT-Bench (within-noise of Magpie 5.03, OpenScholar slightly worse on depth/creativity/detail factors). Academic-style SFT does NOT lift D-q.
Updated cell anchors: S × K (Galactica + SciGLM, kept at 3), S × M (raised 1 → 2 — Llemma/DeepSeekMath), S × F (added −1 — biomedical specialization tax on adjacent IF), S × R (Galactica long-form reasoning), and a row-level caveat: the row values assume moderate scientific-data proportion. At high-proportion specialization (Meditron-style 100% bio-medical), the specialization tax overwhelms positive transfers — same pattern as Code-only SFT row C commentary.

OpenThoughts-Agent (NeurIPS 2026 submission) — the strongest controlled agentic-SFT ablation in the open literature: 95 task-generation strategies, each run as 10K-trajectory SFT on Qwen3-8B, evaluated on SWE-Bench-Verified-100, OT-TBLite, and Terminal-Bench-2. Z-score spread 4σ wide (best +1.92, worst −2.31) — biggest cleanly-attributable cross-source variation in any data-mix ablation we've cited.
Top-10 sources for agentic SFT (Table 13): swesmith (synthetic-issues, +1.92, SWE-V 32.3%) · stackexchange-superuser (human-infra, +1.51, TB2 10.9%) · stackexchange-tezos (+1.45) · issue-tasks (+1.37, SWE-V 24.0%) · repo-scaffold (+1.31) · r2egym (+1.17, SWE-V 28.3%) · stackexchange-tor (+1.16) · swegym (+1.14, SWE-V 27.0%) · code-feedback (+1.04) · stackexchange-unix (+1.02). Pattern: synthetic issue-resolution traces lift SWE; human-written infrastructure Q&A lifts Terminal-Bench — they're different sub-skills inside the T cluster.
Bottom of Table 13 (HURTS agentic): agenttuning-os (−2.31), agenttuning-mind2web (−2.26), agenttuning-db (−1.70), tulu3-sft-personas-math (−1.35, rank 92/95), agenttuning-webshop (−1.31), code-contests (−1.27), all-puzzles (−1.27). Two findings here: (1) math-SFT data is among the WORST sources for agentic SFT — anchors M × S = −1 and M × T = −1 in the matrix; (2) the old AgentTuning corpus is now broadly net-negative vs modern issue-resolution sources.
The bombshell from Table 23: CoT-only SFT contributes ~zero to agentic benchmarks. Pure-reasoning models at 7-8B scale: DeepSeek-R1-Distill-Qwen-7B SWE-V 0.0% / TB2 0.7%, OpenThinker3-7B SWE-V 0.4% / TB2 0.0%, Llama3.1-Nemotron-Nano-8B-v1 SWE-V 0.5% / TB2 0.0%. Same base + agent-SFT (Nemotron-Terminal-8B): SWE-V 22.1% / TB2 13.1%. That is a gap of roughly 22pp on SWE-V and 13pp on TB2. This directly contradicts a naive reading of BFCL/τ²-Bench thinking-model dominance: the "Thinking" branding on frontier models reflects bundled agent+reasoning SFT, not pure CoT. R × S lowered 1 → 0 and R × T lowered 2 → 0 to reflect this controlled finding.
Headline takeaway: Agentic capability is a tail-skill / mid-training phenomenon (X-row) plus an agentic-trajectory phenomenon (A-row), not a side-effect of reasoning, math, or general SFT. The matrix now reflects this with X × S raised 1 → 2 and A × S raised 2 → 3.

BFCL / τ²-Bench empirical anchor (T column): ToolACE (Liu 2024, ICLR 2025) is the cleanest data-mix ablation for tool-use: synthetic-curated function-calling SFT on LLaMA-3.1-8B-Instruct lifts BFCL-AST 75% → 91% (+16pt); on LLaMA-3-8B-Instruct 53% → 88% (+35pt); on Qwen1.5-7B 49% → 86% (+37pt). ToolACE-8B reaches BFCL-v3 rank 3 (59.22 overall, Sep 2024), beating xLAM-8x22b-r and Mistral-Large despite being more than 10× smaller. Crucially, ToolACE Fig 8 shows MMLU / HumanEval / GSM8K / CSQA are unchanged after FC-SFT, so synthetic FC data adds T without a specialization tax: this is the textbook X-row tail-skill / Φ-row synthetic-curated story.
BFCL-v3 leaderboard (May 2026): Top 7 are all thinking-variants — GLM-4.5 (0.778) · GLM-4.5-Air (0.764) · LongCat-Flash-Thinking (0.744) · Qwen3-Next-80B-A3B-Thinking (0.720) · Qwen3-235B-A22B-Thinking-2507 (0.719) · Qwen3-VL-32B-Thinking (0.717). τ²-Bench (Sierra, Apr 2026): Step-3.5-Flash 0.882 · GLM-4.7 0.874 · MiMo-V2-Flash 0.803 · GLM-4.7-Flash 0.795 · MiniMax M2 0.772. τ²-Telecom: Claude Opus 4.6 0.993 · LongCat-Flash-Thinking-2601 0.993 · GPT-5.4 0.989. Long-CoT reasoning data (R row) and multi-turn agentic trajectories (A row) are both clearly load-bearing for the multi-turn tool-use side of T.
Updated cell anchors: X × T (ToolACE/xLAM lift), Φ × T (raised 0 → 2 — synthetic-curated FC data is the canonical winning path), R × T (BFCL/τ²-Bench thinking-model dominance), A × T (τ²-Bench leaderboard), I × T (ToolACE Fig 7: generic SFT raw baseline 53-75% with +16-35pt FC-SFT headroom — generic instruction pairs are NOT a substitute for tool-specific data). C × T and M × T remain weakly inferred — ToolACE Fig 8 actually shows HumanEval unchanged after FC-SFT, suggesting code knowledge isn't the BFCL bottleneck.

abb / Bias-Bounded Evaluation empirical anchor (D-j column): 11 Llama-3.1-8B / Llama-3-8B SFT/DPO/distill/merge variants evaluated on MT-Bench under the gpt-5-mini judge with full A-BB factored decomposition (abb/data/mt_bench/gpt5-mini_debiased.csv).
SUT cluster overall scores (jitter floor = 1.29 on 1-10): Ref-GPT4 6.94 · Ref-GPT3.5 6.41 · Meta-Instruct 5.41 · FuseChat-SFT 5.31 · Magpie-SFT 5.03 · Tulu-3-DPO 5.01 · OpenScholar 4.96 · Hermes-3-distill 4.64 · Tulu-2-DPO 4.52 · WildChat 4.30 · Princeton-zephyr-SFT 4.01. The 8 SFT/DPO/merge Llama variants (FuseChat-SFT at 5.31 down to Princeton-zephyr-SFT at 4.01) span 1.30 points ≈ 1× jitter, with only helpfulness/depth/detail factors marginally above the noise floor. Most discriminating factor is detail (range 1.51), least is creativity (0.54).
Surprises: WildChat is near-bottom on MT-Bench (4.30) despite winning IFEval in the Magpie/SOS-Bench ladder — chat data tunes IF, not multi-turn coherence judges. Princeton's Llama-3-Base-8B-SFT lands dead-last (4.01) despite being mid-pack on SOS-Bench rule-based clusters. Pref-Tulu3 (5.01) ≈ SFT-Magpie (5.03), so DPO is again indistinguishable from SFT-only — same DPO≈SFT identity flagged from Princeton's release pipeline.
Why no new cell values: the within-Llama-3-8B-SFT spread is at-or-below the judge noise floor on every individual factor. We can't anchor specific D-j cell magnitudes with new numbers; we can only calibrate the existing cells' confidence band. FLASK does support ~2× finer discrimination (88% signal-survival), but the corresponding ajudge job on FLASK's Llama-3-8B SUT set has not been run (empty results.jsonl).

DataDecide ablation coverage (Magnusson 2025, 25 data recipes × 14 sizes × 8 OLMES tasks): Row C cells (especially C × K, C × M, C × R) and row M cells (M × K, M × R) are anchored by Dolma 1.7's ± code and ± math/code subset ablations. I × F anchored by ± Flan. L × D-q anchored by ± Reddit. Multiple row W cells anchored by DCLM-Baseline + FineWeb-Edu / DCLM-classifier filter ratio ablations.

Row X — Tail-skill / mid-training patches: a fundamentally different bucket. These are focused bursts of targeted training data that correct high-perplexity tail behaviors: protocol learning (MCP, function-calling schemas), format compliance (JSON/XML structured outputs), long-context handling and summarization triggers, anti-spam / context-garbage management, refusal-pattern learning. The literature mostly treats these as engineering recipes rather than scaling-law subjects, hence the row's heavy hatching — almost every cell is currently inferred. Yet operationally these often dominate model usability (a Sonnet-class base that can't follow MCP is useless for agentic deployment regardless of its MMLU). Row X loads primarily on the interactional/format axis (F, T) with secondary contributions to D and procedural columns.

Latent rank-4 fit, refined by EEE gap analysis:
Cleanest single-cluster columns (anchor a latent axis): F · IF (gap +0.30) anchors the format/alignment axis; S · SWE (gap +0.22) anchors the procedure axis. These are the highest-confidence dimensions of M.
Moderate-quality clusters (axis members): K · Knowledge (+0.16) and H · Code-gen (+0.14) — K loads on knowledge axis, H loads with S on procedure axis.
Diffuse clusters (cross-load between axes): R · Reasoning (+0.08), M · Math (+0.06), T · Tool-use (+0.09). These three are NOT cleanly separable from each other in EEE. The high BBH ↔ MMLU-Pro correlation (r=0.96) suggests R is part-knowledge; the MATH ↔ Reward-Bench Reasoning correlation (r=0.95) suggests M and R share a reasoning axis. Most plausibly, M+R+T collapse into a unified "reasoning + procedure" axis under any reasonable factorization, not three separate axes.
Lowest-quality columns (load weakly on any axis): D-q and D-j — both are LLM-judge-mediated, both are noisy, and the rank-4 model places them on the same interactional axis as F (alignment), but with much higher within-axis variance.
So the honest rank-4 axes are: Knowledge (K) · Reasoning + Procedure (M + R + T + H + S, dominated by S as the cleanest member) · Format / Alignment (F + D-q + D-j) · Tail-skill (X, partly bridging procedure ↔ format). Row attribution to axes: pretraining-dominated for axes 1-2, mid/post-training-dominated for axes 3-4.

Footnote — unimodal scope. This matrix considers text-only pretraining data and text-only benchmarks. Effects from mixed modalities (vision-language, audio-text, tabular grounding) are not modeled here. Adding modality as an additional axis would likely introduce at least one more latent factor — a "perception" or "multimodal grounding" axis — with strong cross-loadings on grounded-reasoning benchmarks (MMMU, ChartQA, scientific-diagram tasks) and weak loadings on the existing four axes. Tabular reasoning is similarly missing and would likely cluster with procedure / agentic. The benchmark columns also currently omit dedicated long-context evaluations (RULER, NIAH-at-N, BABILong); these would deserve their own column with strong loading on row X.

6 · Capacity floors: when is a model "too small" for a task?

Loss-side scaling laws don't capture an obvious property of benchmarks: some capabilities require minimum capacity to fit at all. A 1B model can memorize MMLU; it cannot be a general-purpose agent. This panel visualizes the capacity-vs-horizon frontier — the boundary between "model can complete this task with ≥50% reliability" and "model fails." Slide task difficulty to see how the frontier moves.

H_s(p) = ln(s) / ln(p) [Sinha 2025 Proposition 1 — theoretical horizon under IID constant step accuracy p, no self-correction; s = success threshold (default 0.5)]
Composite extension below: p(N) = 1 − A · N^−α [Chinchilla-style power-law for per-step capacity-limited error — NOT in Sinha; combined here for visualization]
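The three "max steps @ 50% reliability" readouts and the heatmap cells follow from the two formulas above. A minimal sketch, assuming the exponent α takes the Chinchilla fit value 0.336 quoted elsewhere on this page (the dashboard may use a different value for this composite); function names are illustrative.

```python
import math

ALPHA = 0.336  # ASSUMED: reuses the Chinchilla capacity exponent for p(N)


def step_accuracy(n_params: float, difficulty: float) -> float:
    """Per-step accuracy p(N) = 1 - A * N^-alpha (composite, not from Sinha)."""
    return 1.0 - difficulty * n_params ** -ALPHA


def horizon_steps(n_params: float, difficulty: float, s: float = 0.5) -> float:
    """Sinha Prop. 1: steps completed with success probability >= s under IID errors."""
    p = step_accuracy(n_params, difficulty)
    if p <= 0.0:
        return 0.0  # model is below the capacity floor for this task class
    return math.log(s) / math.log(p)


def success_probability(n_params: float, difficulty: float, steps: int) -> float:
    """Heatmap cell: probability of completing `steps` sequential steps, p(N)^K."""
    p = max(step_accuracy(n_params, difficulty), 0.0)
    return p ** steps


for label, n in (("8 B", 8e9), ("70 B", 70e9), ("1 T", 1e12)):
    print(f"N = {label}: ~{horizon_steps(n, difficulty=3.0):.0f} steps at 50% reliability")
```

Because the horizon is roughly ln(2) divided by the per-step error, halving the error rate doubles the horizon, which is the compounding effect the panel visualizes.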

🎚️Task difficulty (A) 3.0 (multi-step reasoning)

Per-step error scale. Higher A = harder task class. 0.3 ≈ MMLU-style recall; 3 ≈ MATH multi-step; 30 ≈ long-horizon agentic.

At N = 8 B

—

max steps @ 50% reliability

At N = 70 B

—

max steps @ 50% reliability

At N = 1 T

—

max steps @ 50% reliability

What the literature says about capacity floors

Allen-Zhu & Li 2024 (Physics 3.3): ~2 bits of knowledge per parameter at saturation, even quantized to int8. A 7B model can store ~14B bits — exceeding all of English Wikipedia + textbooks combined. Below capacity, accuracy is bounded by storage.

Elhage et al. 2022 (Superposition): models pack > N features into N dimensions via superposition, but with interference cost. For complex tasks needing many simultaneously-active features (agentic working memory), interference dominates below threshold N.

Merrill & Sabharwal 2023 (Expressive power of CoT): transformers with depth d solve problems of bounded complexity per forward pass. CoT extends this — but CoT length must scale at least linearly with problem size for tasks like Parity, Reachability, Multiplication.

Sinha et al. 2025 (Illusion of Diminishing Returns, arXiv:2509.09677): this is the rollout-depth-vs-capacity paper. Marginal gains in single-step accuracy compound into exponential gains in task length. They identify a "self-conditioning effect" — errors in context make subsequent errors more likely. Thinking mitigates self-conditioning.

Kandpal et al. 2022 (Long-Tail Knowledge): rare facts need ~10× more parameters than common ones for equivalent accuracy. Memorization is capacity-bounded.

Scaling Agent Systems 2025 (arXiv:2512.08296): empirical quantitative scaling for agent systems across Finance-Agent / BrowseComp-Plus / PlanCraft / Workbench. Shows the capacity floors are real and measurable, but doesn't unify them with loss-side laws.

Reading the heatmap. Each cell is the theoretical success probability under Sinha's Proposition 1 (IID step accuracy, no self-correction) for a task with the chosen difficulty A at model size N (x-axis) running for K sequential steps (y-axis). Green = success likely; red = task is beyond capacity. Black contour = 50% reliability frontier.

Theory vs observation gap (Sinha 2025's actual contribution). The heatmap shows the theoretical baseline. Sinha's empirical measurements deviate sharply:

Red circles — non-thinking frontier models fall BELOW theory. DeepSeek-V3 without CoT manages only 4 steps despite frontier-scale N. The naive `p^K` compounding overpredicts capability because of the self-conditioning effect: each error in context increases the probability of subsequent errors. Self-conditioning does not improve with N alone — bigger non-thinking models hit the same wall.
Green stars — thinking models often EXCEED theory. DeepSeek-R1 (thinking version of V3) reaches ≥100 steps; Claude-4-Sonnet hits 432; GPT-5-thinking (codenamed "Horizon") executes 2100. Test-time CoT compute partially breaks the self-conditioning loop, recovering the theoretical compounding (and sometimes exceeding it because thinking is a form of distributed search).

What the equation IS vs ISN'T. The relation H_s(p) = ln(s)/ln(p) is Sinha 2025 Proposition 1. The parameterization p(N) = 1 − A·N^−α is a Chinchilla-style composite I added for visualization — Sinha treats p as the empirical observable, not as a derived function of N.

What's still not solved. The literature has formalized capacity floors for (i) factual storage (Allen-Zhu — ~2 bits/param), (ii) feature representation (superposition), (iii) CoT computational depth (Merrill), and (iv) horizon execution under self-conditioning (Sinha). Nobody has unified these into a single per-benchmark (γ_b, α_b) capacity-threshold fit across the Harbor-Mix / Evalchemy taxonomy — that's the empirical agenda that closes the gap with the data-side scaling laws of Panels 1-5.

References

Hoffmann et al. 2022 — "Training Compute-Optimal Large Language Models" (Chinchilla). arXiv:2203.15556. Provides the L(N, D) = E + A·N⁻ᵅ + B·D⁻ᵝ functional form used in Panel 1; default parameters from their Table A3 (E ≈ 1.69, A ≈ 406.4, α ≈ 0.336, B ≈ 410.7, β ≈ 0.283).

Kaplan et al. 2020 — "Scaling Laws for Neural Language Models". arXiv:2001.08361. Source of the N ∝ C^0.73 / D ∝ C^0.27 compute-optimal scaling shown for the "Kaplan era" in Panel 2.

Sardana et al. 2024 — "Beyond Chinchilla-Optimal: Accounting for Inference". arXiv:2401.00448. Provides the inference-aware compute-optimal recipe shown in Panel 2.

Clark et al. 2022 — "Unified Scaling Laws for Routed Language Models". arXiv:2202.01169. Source of the effective-parameter approximation N_eff ≈ N_dense · (E/E_base)^c with c ≈ 0.5 used in Panel 4; the geometric-mean form √(N_total · N_active) is the simplest closed-form consistent with their fit.

Abnar et al. 2025 — "Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models". arXiv:2501.12370. Source of the compute-dependent optimal-sparsity prescription in Panel 4 (~25% active at small compute → ~6% at frontier).

Kwa et al. (METR) 2025 — "Measuring AI Ability to Complete Long Tasks". arXiv:2503.14499. Source of the 7-month doubling time in Panel 3.

Li et al. 2024 — "DataComp-LM" (DCLM). arXiv:2406.11794. Source of the q ≈ 2 data-quality estimate for model-based-filtered Common Crawl in Panel 2.

Magnusson et al. 2025 — "DataDecide: How to Predict Best Pretraining Data with Small Experiments". arXiv:2504.11393. Source of the DataDecide top-recipe q ≈ 1.5 estimate in Panel 2.

Muennighoff et al. 2023 — "Scaling Data-Constrained Language Models". arXiv:2305.16264. Source of the repetition diminishing-returns model and the data-wall curve in Panel 2.

DeepSeek-AI 2024 — "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism". arXiv:2401.02954. Provides the explicit data-quality variable that motivates the q parameter in the modified loss law L(N, D | q).

Abdin et al. 2024 — "Phi-3 Technical Report". arXiv:2404.14219. Existence-proof for synthetic + heavily-filtered data delivering benchmark performance far above Chinchilla loss-law predictions; source of multiple cells in row Φ of Panel 5.

Ye et al. 2024 — "Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance". arXiv:2403.16952 (ICLR 2025). Fits parametric Loss = g(mixture_weights, N, D) — the closest existing formalism to the M matrix in Panel 5.

Shao et al. 2024 — "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". arXiv:2402.03300. Clean math-data → math-benchmark ablation; source of row M and several cross-cells in Panel 5.

Zhao et al. 2024 — "WildChat: 1M ChatGPT Interaction Logs in the Wild". arXiv:2405.01470. The canonical real-user conversation corpus; anchors row L of Panel 5.

Feuer et al. 2025 — "WildChat-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training". arXiv:2501.18511. Provides the post-training scaling-law structure used for the alignment-axis cells in row L of Panel 5; the Re-Wild SFT mix beats Tulu-3 with 40% as many samples. Also documents the LLM-judge bias and high run-to-run variance that motivate Panel 5's chat-column caveat.

Perlitz et al. 2024 (BAT paper) — "Benchmark Agreement Testing Done Right". Local PDF: 245_Benchmark_Agreement_Testin.pdf. NeurIPS 2025 submission. Methodological caveats for interpreting benchmark correlations: insufficient model counts → ~0.25 std-deviation in correlation; "strong models good everywhere" halo inflates within-cluster correlations independent of actual benchmark agreement. Releases BenchBench package. Applied throughout Panel 5's correlation-based column validation.

EvalEval Coalition 2026 — "Every Eval Ever" datastore. HuggingFace: evaleval/EEE_datastore. ~229K (model, benchmark) results across 60+ benchmark configs and 20K unique models. Used in Panel 5 for independent replication of Evalchemy's column-clustering signal (within-cluster Pearson 0.72 vs across-cluster 0.58, gap +0.15) and for the empirical finding that the legacy "D · Dialogue" column was not a real cluster (within-D = 0.18). Local processing scripts: eee_correlate.py, eee_cluster_analysis.py.

Allen-Zhu & Li 2024 — "Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws". arXiv:2404.05405 (ICLR 2025). The canonical ~2 bits/parameter knowledge-storage bound used in Panel 6 to anchor the capacity-floor argument. A 7B model can store ~14B bits, which exceeds Wikipedia + textbooks combined. When the knowledge a task requires exceeds this capacity, accuracy is hard-bounded by storage.

Elhage et al. 2022 (Anthropic) — "Toy Models of Superposition". arXiv:2209.10652. Formalizes how neural nets pack >N features into N dimensions via superposition, with interference cost. Mechanism behind Panel 6's "complex tasks need many simultaneously-active features → interference dominates below threshold N" claim.

Kandpal et al. 2022 — "Large Language Models Struggle to Learn Long-Tail Knowledge". arXiv:2211.08411. Quantifies that rare facts require ~10× more parameters than common ones for equivalent accuracy. Directly supports Panel 6's "memorization is capacity-bounded" argument.

Merrill & Sabharwal 2023 — "The Expressive Power of Transformers with Chain of Thought". arXiv:2310.07923. Formalizes that transformers with depth d solve problems of bounded complexity per forward pass, and that CoT length must scale at least linearly in problem size for tasks like Parity, Reachability, Multiplication. The computational-depth basis for Panel 6.

Sinha et al. 2025 — "The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs". arXiv:2509.09677. The capacity-vs-rollout-depth paper. Marginal gains in single-step accuracy compound exponentially into task-length capability; identifies a "self-conditioning effect" where errors in context cause more errors; shows thinking mitigates this. Directly motivates Panel 6's frontier visualization.

Various 2025 — "Towards a Science of Scaling Agent Systems". arXiv:2512.08296. Empirical quantitative scaling for agent systems across 180 (architecture × LLM × benchmark) configurations on Finance-Agent / BrowseComp-Plus / PlanCraft / Workbench. Shows agentic capacity floors are real and measurable, but doesn't unify with loss-side laws — the gap Panel 6 highlights.

SOS-Bench Detailed 2024-2026 — 232 open SFT / preference / merge models on Mistral-7B and Llama-3-8B bases, each evaluated under a single controlled lm-eval-harness invocation across 65 benchmarks (including BBH subtasks, GPQA, IFEval, MMLU-Pro, plus safety: BeaverTails / JailbreakHub / SaladBench / SGXSTest / StrongREJECT / DTToxicity / CDNA). Hosted at HF collection nyu-dice-lab/sos-bench-detailed. Anchors Panel 5 rows I and L from Princeton's SFT-preference-method ladder and Magpie's SFT-data variation. Empirical pipeline: sosbench_extract.py, sosbench_attribution.py; raw findings: sosbench_findings.md.