0.5B vs 1.5B: What Changes at 3× Scale?

TL;DR — Both atlases are complete. Scaling from 0.49B to 1.54B parameters (3.1×) fundamentally redistributes the causal architecture: the universal hub migrates from L2 to L26 (8% → 93% depth), attention heads gain 22× impact (specialists emerge), MLP effects weaken 3×, steering collapses 70×, and skill knockout selectivity drops ~48,000×. Cross-model transfer maintains its proportional depth invariant (final ~10%). Layer skipping remains fatal — but one mid-skip config shows the first hint of redundancy (10% top-5 overlap). The message is clear: interpretability insights do not directly transfer across scales; the causal structure itself transforms.

Models compared: Qwen2.5-0.5B (24 layers, 14 heads GQA, d_model=896, ~0.49B params, 22 experiments, 12 task families) vs Qwen2.5-1.5B (28 layers, 12 heads GQA, d_model=1536, ~1.54B params, 9 experiments, 8 task families). Same causal-intervention methodology, same hardware (RTX 2070 Super, 8GB, bf16). All values below are from verified, completed experiments.

Comparison Table

Every dimension, side by side. Where a cell ends in HIGH or MEDIUM, that tag is the confidence rating for the finding; "(negative)" marks a negative result.

Dimension	Qwen2.5-0.5B	Qwen2.5-1.5B	Δ
Architecture	24 layers · 14 heads (GQA, 2 KV) · d_model=896 · d_mlp=4864 · ~0.49B params	28 layers · 12 heads (GQA) · d_model=1536 · ~1.54B params	3.1×
Universal Hub Layer	L2 — mean ablation KL 19.11 across all 12 families. First+last position router. Singular hub. HIGH	L26 — mean ablation KL 13.70. Secondary: L6 (13.28), L14 (13.01). Distributed across 3 layers. HIGH	Migrates
MLP Dominance	L2 MLP top, max KL 8.12. L0 MLP second. MLPs carry 42% of top-layer effect. MEDIUM	L0 MLP top, max KL 2.58. L1 (1.83), L27 (1.17). MLPs carry 19% of top-layer effect. MEDIUM	3× weaker
Head Specialization	Distributed: max single-head KL 0.046 (about 415× smaller than the L2 hub effect, 19.11 / 0.046). No specialist heads. MEDIUM	Specialists emerge: max KL 1.02 (arithmetic L0 H3). L0 H3 cross-task. Still 13× smaller than the L26 hub effect. MEDIUM	22× stronger
Steering Leverage	L2 steering boosts target 3.3× (0.064 → 0.213). Factual direction. Practically useful. MEDIUM	L2 steering best boost 0.003. Near-inert. Target prob 0.0009 → 0.010 at best. MEDIUM	70× weaker
Skill Knockout Selectivity	L19 selectively suppresses factual recall — 11,654× selectivity at s=−2.0. JSON/copying preserved. MEDIUM	L21 best: 0.24× selectivity (anti-selective). Most layers negative. Skills entangled. MEDIUM	48,000× weaker
Cross-Model Transfer	L23 (100%), L22 (97%), L21 (95%). Final 3 layers (88–96% depth). MEDIUM	L27 (99.9%), L26 (99.2%), L25 (98.7%). Final 3 layers (89–96% depth). More uniform recovery. MEDIUM	Shifts 4L
Adapter Norm-Effect Corr	0.85 — effect IS at late layers (L19–L23), matching norm distribution. Norms predict impact. MEDIUM	0.54 — norms flat (3.33–3.45), effects still peak late (L23–L27). Norms decouple from impact. MEDIUM	0.85→0.54
Layer Skipping Viability	0% top-5 overlap & KL 7.9–13.8 in ALL 10 configs. Even weakest layer (L15) breaks output. HIGH (negative)	0% in 3/4 configs. Exception: skip_mid_5 = 10% top-5 overlap (KL 6.58). First hint of redundancy. HIGH (negative)	+10%
Early Exit	L22: 0% argmax, KL 9.14. Only L23 (full) works. Every layer transforms the residual stream. HIGH (negative)	All exits fail: 0–7.14% argmax. KL ∞ at L25–L26. No viable early exit. HIGH (negative)	Still fatal
LoRA Training	r=8, batch_size=2, 100 steps. Loss 0.062. Full adapter fits in 8GB.	r=8, batch_size=1, gradient checkpointing. Loss 0.0098. More VRAM-constrained.	Tighter
Core Circuit Lock-in	L2/L7/L9 locks in by step 10 (first 10%). Two-phase: fast architecture → slow refinement. MEDIUM	Not tested (compute constraints). Unknown whether two-phase architecture persists at scale.	Unknown
JSON Training Shift	Largest base→trained KL. LoRA most affects JSON at 0.5B.	Largest base→trained KL (6.47). LoRA most affects JSON at 1.5B too. Invariant.	Same

Cross-Scale Analysis

What the numbers mean — dimension by dimension.

The Hub Migration: L2 → L26

The most dramatic structural change is the migration of the universal routing hub from L2 (8% depth) to L26 (93% depth). At 0.5B, one layer (L2) dominated with mean KL 19.11 — a singular hub early in the network. At 1.5B, the hub role is shared across three layers: L26 (13.70), L6 (13.28), and L14 (13.01), spanning early, mid, and late positions.

This migration suggests a fundamental shift in computational strategy. At 0.5B, the model does critical routing early — Layer 2 decides where information goes before most computation. At 1.5B, the critical routing happens late — Layer 26 processes the fully-built representation just before output. The secondary hubs at L6 and L14 may be intermediate integration points feeding into L26's final routing.

The absolute hub KL is lower at 1.5B (13.70 vs 19.11) despite the larger model. A plausible explanation is compensation: ablating one layer leaves 27 others to carry the load, versus 23 at 0.5B, so the larger model degrades more gracefully.
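The hub scores in this section are KL divergences between the intact model's next-token distribution and the distribution after ablating one layer, averaged over task families. A minimal sketch of that metric follows. The direction KL(clean || ablated), the use of a single logit vector per family, and all function names are assumptions for illustration, not taken from the atlas code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def mean_ablation_kl(clean_logits_by_family, ablated_logits_by_family):
    """Average KL(clean || ablated) across task families."""
    kls = []
    for family, clean in clean_logits_by_family.items():
        ablated = ablated_logits_by_family[family]
        kls.append(kl_divergence(softmax(clean), softmax(ablated)))
    return sum(kls) / len(kls)

# Toy logits (hypothetical, three-token vocabulary) for two families.
toy_clean = {"arithmetic": [4.0, 1.0, 0.0], "json": [0.5, 3.0, 1.0]}
toy_ablated = {"arithmetic": [0.0, 1.0, 4.0], "json": [0.5, 3.0, 1.0]}
print(mean_ablation_kl(toy_clean, toy_ablated))
```

Because the score is an average over families, a layer can rank as a hub either by damaging every family moderately or by destroying a few outright. Per-family values are needed to tell the two apart.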

The MLP-Attention Inversion

Scale inverts the computational balance between MLPs and attention heads:

MLP max KL: 8.12 (0.5B) → 2.58 (1.5B) — 3× weaker despite 3× more parameters
Head max KL: 0.046 (0.5B) → 1.02 (1.5B) — 22× stronger
MLP share of top-layer effect: 42% → 19% — MLPs carry less of the load
Head share of top-layer effect: 0.24% → 7.4% (0.046/19.11 vs 1.02/13.70), so heads carry about 31× more relative impact
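The ratios in the list above can be re-derived from the reported KL values. The sketch below assumes each "share of top-layer effect" is the component's max KL divided by the hub layer's mean ablation KL. That definition is an inference, chosen because it reproduces the reported percentages.

```python
# Reported values from the comparison table.
hub_kl = {"0.5B": 19.11, "1.5B": 13.70}       # hub-layer mean ablation KL
mlp_max_kl = {"0.5B": 8.12, "1.5B": 2.58}     # strongest single MLP
head_max_kl = {"0.5B": 0.046, "1.5B": 1.02}   # strongest single head

# Fold changes across scale.
mlp_weakening = mlp_max_kl["0.5B"] / mlp_max_kl["1.5B"]
head_gain = head_max_kl["1.5B"] / head_max_kl["0.5B"]

# Share of the hub-layer effect (assumed definition: component KL / hub KL).
mlp_share = {m: mlp_max_kl[m] / hub_kl[m] for m in hub_kl}
head_share = {m: head_max_kl[m] / hub_kl[m] for m in hub_kl}

print(f"MLP weakening: {mlp_weakening:.2f}x, head gain: {head_gain:.1f}x")
for m in hub_kl:
    print(f"{m}: MLP share {mlp_share[m]:.1%}, head share {head_share[m]:.2%}")
```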

At 0.5B, MLPs dominated and heads were negligible. At 1.5B, MLPs recede and specialist heads emerge — particularly L0 H3, which handles arithmetic, code syntax, and copying. The model trades MLP magnitude for head precision. This aligns with the intuition that larger models benefit more from relational reasoning (attention) than per-token feature transformation (MLP).

The Steering Collapse: 70× Weaker

At 0.5B, steering L2 with a factual recall direction boosted "Rome" probability from 0.064 to 0.213 — a 3.3× increase that was practically useful for skill manipulation. At 1.5B, the same intervention produces a best boost of 0.003 — essentially no effect.

Three factors likely contribute: (1) representational entanglement, with factual knowledge distributed across more neurons at 1.5B so that a single linear direction is insufficient; (2) hub migration, since L2 is no longer the hub and steering there has less leverage; (3) a wider residual stream, since d_model grows 71% (896 → 1536) and dilutes a single steering vector's effect. The practical implication is that a linear steering recipe tuned at 0.5B does not carry over unchanged. Because 1.5B steering was tested only at L2, other layers or non-linear interventions (SAE-based, distributed, activation editing) may be needed at 1.5B and beyond.
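Linear steering of the kind discussed here adds a scaled direction to a layer's residual-stream activation. The sketch below shows that operation on plain Python lists, plus a toy version of the dilution argument in factor (3). The unit-magnitude residual model and all function names are illustrative assumptions, not the atlas implementation.

```python
import math

def add_steering(hidden, direction, scale):
    """Return hidden + scale * direction / ||direction||."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [h + scale * d / norm for h, d in zip(hidden, direction)]

def relative_perturbation(d_model, scale):
    """Steering step size relative to a residual vector whose entries all
    have unit magnitude (toy assumption): scale / sqrt(d_model)."""
    return scale / math.sqrt(d_model)

# Reported 0.5B result: target probability 0.064 -> 0.213 under L2 steering.
boost_0_5b = 0.213 / 0.064

# Under the toy residual model, the same step is relatively smaller at d_model=1536.
dilution = relative_perturbation(896, 2.0) / relative_perturbation(1536, 2.0)

print(add_steering([1.0, 0.0], [0.0, 2.0], 3.0))
print(round(boost_0_5b, 1), round(dilution, 2))
```

Under this toy model the wider stream accounts for only about a 1.3× reduction in relative step size, so dilution alone would not explain a collapse of the reported magnitude.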

Skill Knockout: From 11,654× to 0.24×

The ~48,000× collapse in skill knockout selectivity is the most consequential scaling finding for safety. At 0.5B, L19 knockout suppressed factual recall tokens 11,654× more than non-skill tokens — highly selective, removing the skill while preserving other behavior. At 1.5B, the best layer (L21) achieves 0.24× selectivity — the knockout suppresses non-skill tokens 4× more than skill tokens (anti-selective).

This strongly suggests skills become entangled at scale. At 0.5B, factual recall occupies a relatively isolated subspace. At 1.5B, it is woven into the same representational space as other behaviors — removing it requires removing entangled components too. If undesirable capabilities are similarly entangled at deployment scale, skill removal via steering may be impossible without retraining or circuit surgery.
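The selectivity figures can be read as a ratio of how strongly a knockout suppresses skill tokens versus non-skill tokens. The sketch below assumes exactly that ratio, since the page does not spell out how suppression is measured, and checks the headline arithmetic from the reported values.

```python
def selectivity(skill_suppression, other_suppression):
    """Ratio of suppression on skill tokens to suppression on other tokens.
    Above 1 the knockout is selective; below 1 it is anti-selective."""
    if other_suppression <= 0:
        raise ValueError("other_suppression must be positive")
    return skill_suppression / other_suppression

# Reported best-layer selectivities.
sel_0_5b = 11_654.0   # L19 at s = -2.0
sel_1_5b = 0.24       # L21

collapse = sel_0_5b / sel_1_5b    # how far selectivity fell across scale
anti_selective = 1.0 / sel_1_5b   # non-skill tokens hit this many times harder

print(f"collapse: {collapse:,.0f}x, anti-selectivity: {anti_selective:.2f}x")
```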

Cross-Model Transfer: The Proportional Invariant

Cross-model patching maintains a robust structural invariant: trained behavior is encoded in the final ~10% of layers at both scales. At 0.5B (24 layers), the top transfer layers are L21–L23 (88–96% depth). At 1.5B (28 layers), they are L25–L27 (89–96% depth). The zone shifts 4 layers deeper but maintains the same proportional position.

Recovery is more uniform at 1.5B: L24 achieves 97.9% versus 0.5B's L20 at 87%, suggesting a wider transfer zone. This invariant means cross-model patching methods can be calibrated by depth fraction rather than absolute layer index.
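Calibrating by depth fraction can be sketched as follows. Depth is taken as layer index divided by layer count, which matches the percentages quoted on this page. The 0.875 threshold is an assumption chosen to match the lower edge of the reported zones, not a value from the experiments.

```python
def depth_percent(layer, n_layers):
    """Depth of a layer as a percentage of total layers."""
    return 100.0 * layer / n_layers

def transfer_zone(n_layers, start_fraction=0.875):
    """Layers at or beyond start_fraction of total depth."""
    return [layer for layer in range(n_layers)
            if layer / n_layers >= start_fraction]

print(transfer_zone(24))   # 0.5B: 24 layers
print(transfer_zone(28))   # 1.5B: 28 layers
print(round(depth_percent(2, 24)), round(depth_percent(26, 28)))
```

With these settings the helper returns the same three layers reported at each scale (L21–L23 and L25–L27). Whether the same fraction holds for other layer counts is untested.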

Adapter Norm-Effect Decoupling

At 0.5B, adapter norms and ablation effects co-locate at late layers (correlation 0.85) — training writes where it matters. At 1.5B, adapter norms are remarkably uniform (3.33–3.45 across all 28 layers), but effects still peak at L23–L27. The correlation drops to 0.54.

This decoupling means norm-based pruning is unreliable at scale. An adapter with small norms at L27 has large functional effects; the same norm at L0 has minimal effects. The residual stream at late layers is more sensitive to perturbation — the same weight change has more leverage on a more refined representation. Causal ablation, not norm inspection, is necessary to identify which adapter components matter.
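The norm-effect correlation is presumably a Pearson correlation between per-layer adapter norms and per-layer ablation effects. That is an assumption, as the page does not name the statistic. A minimal stdlib sketch, run on hypothetical toy values rather than the atlas data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-layer values, for illustration only.
toy_norms = [1.0, 2.0, 3.0, 4.0]
toy_effects = [0.5, 1.0, 1.5, 2.0]
print(pearson(toy_norms, toy_effects))
```

When norms span a narrow range, as with the 3.33–3.45 band reported at 1.5B, small fluctuations dominate the coefficient. That is one more reason to rank adapter components by measured ablation effect rather than by norm.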

Efficiency: The First Crack in the Wall

At 0.5B, ALL 10 layer-skip configurations produced 0% top-5 overlap. At 1.5B, skip_mid_5 (skipping L4–L8) preserves 10% top-5 overlap — the first evidence of partial mid-layer redundancy at scale. While 10% is far from practical, it suggests that structured pruning with retraining (not zero-ablation) might eventually recover usable redundancy at larger scales. Early exit remains fatal at both scales — no intermediate layer produces usable output.
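Top-5 overlap compares the five highest-scoring next tokens of the intact model with those of the layer-skipped model. A minimal sketch, assuming overlap is the shared fraction of the two top-k sets averaged over prompts (the function names and the averaging step are assumptions):

```python
def top_k_indices(logits, k=5):
    """Indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])

def top_k_overlap(baseline_logits, modified_logits, k=5):
    """Fraction of the baseline top-k tokens that survive in the modified top-k."""
    shared = top_k_indices(baseline_logits, k) & top_k_indices(modified_logits, k)
    return len(shared) / k

def mean_overlap(pairs, k=5):
    """Average top-k overlap over (baseline, modified) logit pairs."""
    return sum(top_k_overlap(b, m, k) for b, m in pairs) / len(pairs)

# Toy ten-token vocabulary (hypothetical values).
base = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flipped = list(reversed(base))
print(top_k_overlap(base, base), top_k_overlap(base, flipped))
```

Under this definition a 10% score means that, on average, half a token out of five survives per prompt, for example one shared token on every second prompt.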

What Changes at 3× Scale

A summary of what transforms, what weakens, and what stays invariant.

Transforms

The universal hub migrates from L2 (8% depth) to L26 (93% depth). Attention heads gain specialist roles (22× stronger). The computational burden shifts from MLPs to attention. The hub goes from singular to distributed across 3 layers.

Weakens

Steering leverage collapses about 70× (steered target probability 0.213 at 0.5B vs a best boost of 0.003 at 1.5B). Skill knockout selectivity drops ~48,000× (11,654× → 0.24×). MLP effects weaken 3× (8.12 → 2.58). Adapter norm-effect correlation drops (0.85 → 0.54). Training becomes more VRAM-constrained (batch_size 2 → 1, gradient checkpointing needed).

Stays Invariant

Cross-model transfer zone: final ~10% of layers at both scales. Layer skipping is fatal (0% overlap in most configs). Early exit fails at all layers. JSON produces the largest training shift. LoRA is required (full SFT OOMs). Zero ablation is disruptive — every layer matters.

New at Scale

First hint of mid-layer redundancy (10% top-5 overlap in skip_mid_5). Specialist attention heads (L0 H3 — arithmetic, code, copying). Late MLPs gain relative importance (L27 enters top 3). More uniform cross-model recovery (L24 = 97.9% vs 0.5B's L20 = 87%).

Limitations

What this comparison cannot yet claim.

Only two scale points — 0.5B and 1.5B. A third point (e.g., 3B) would establish scaling laws rather than pairwise comparisons. The hub migration (L2 → L26) could be linear, logarithmic, or step-wise — two points cannot distinguish.
Different task suite sizes — 0.5B tested 12 families; 1.5B tested 8. Four families (refusal/compliance, verbosity, variable renaming, uncertainty) are missing from 1.5B, limiting direct comparison on those dimensions.
Steering tested at L2 only for 1.5B — the 70× collapse may be layer-specific. Testing at L26 (the new hub) might yield different results, but was not attempted due to compute constraints.
No checkpoint timeline for 1.5B — the core circuit lock-in finding (step 10 at 0.5B) is not verified at 1.5B. Whether the two-phase training architecture persists at scale is unknown.
Single seed at both scales — all results from one random seed per model. Multi-seed replication needed for confidence in cross-scale claims.
Same model family — both are Qwen2.5. Findings may be family-specific. Cross-family comparison (Qwen vs Llama vs Pythia) would test generality.
Zero ablation at both scales — creates out-of-distribution activations. The magnitude of ablation effects may be inflated at both scales, though relative comparisons should hold.

Qualitative Differences

Beyond the causal atlas: how 0.5B and 1.5B differ in actual output quality. Full 30-prompt × 6-config breakdown on the Qualitative Analysis page.

The atlas above maps which components matter. A complementary vibe check — 30 prompts × 6 configs (0.5B/1.5B at bf16, 8-bit, 4-bit NF4) — asks the blunter question of whether the output actually reads well. Three qualitative gaps mirror the structural ones:

Prose quality & repetition

0.5B loops. 13–16 of 30 outputs degenerate into self-repeating text (43–53% repetition rate) across every precision: the model seizes on a phrase and rewrites it with one token swapped. 1.5B loops 3–4× less often (3–8 of 30, 10–27%). On the same palindrome prompt, 0.5B-bf16 never writes the function, instead emitting a chain of near-identical "maximum length of N characters" sentences until the token budget runs out (repetition ratio 0.73); 1.5B-4bit produces a clean working function with a docstring (repetition ratio 0.17). A plausible reason is that the smaller model's thinner residual stream and fewer layers give it less slack to escape degeneration loops.
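The page does not define its repetition ratio. One common proxy is the fraction of n-grams that repeat an n-gram already seen in the output. The sketch below uses that definition with whitespace tokens, both of which are assumptions, so its values will not necessarily match the 0.73 and 0.17 quoted above.

```python
def repetition_ratio(text, n=3):
    """1 - (unique n-grams / total n-grams) over whitespace tokens.
    0.0 means no repeated n-gram; values near 1.0 mean heavy looping."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

print(repetition_ratio("a b c d", n=2))
print(repetition_ratio("x y x y x y", n=2))
```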

Constraint adherence

1.5B follows explicit prompt constraints roughly 2× better than 0.5B (13–17% vs 0–7%). 0.5B-bf16 meets 0% of stated constraints — it tends to describe or hallucinate requirements rather than satisfy them. Scale roughly doubles instruction-following, while quantization changes adherence only within noise at either size. (Absolute rates are low because these are base, non-instruct-tuned checkpoints; the reliable signal is the relative gap.)

Quantization interaction

Quantization affects the two scales differently. 0.5B loses 42–55% of speed going to 4-/8-bit and gains little quality; 1.5B-4bit NF4 loses only 9% of speed (17.1 vs 18.8 tok/s) and keeps quality within noise of bf16. Notably, 8-bit is the slowest quantization at both scales, most likely because bitsandbytes' LLM.int8() path quantizes activations and handles outlier features in a separate higher-precision matmul on every forward pass. That makes 4-bit NF4 the practical sweet spot, especially at 1.5B, where it fits in 8GB and nearly matches bf16 throughput.
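For readers who want to reproduce the 4-bit NF4 configuration, the sketch below shows one way to load it with Hugging Face transformers and bitsandbytes, together with the slowdown arithmetic from the reported throughput. The model id, the `device_map` setting, and the choice of loader are assumptions, as the page does not show how the checkpoints were loaded.

```python
def relative_slowdown(baseline_tok_s, quantized_tok_s):
    """Fraction of throughput lost relative to the baseline."""
    return 1.0 - quantized_tok_s / baseline_tok_s

def load_nf4(model_id="Qwen/Qwen2.5-1.5B"):
    """Load a causal LM in 4-bit NF4 with bf16 compute (assumed setup).
    Imports are local so the arithmetic helper runs without a GPU stack."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=config, device_map="auto"
    )

# Reported 1.5B throughput: 18.8 tok/s at bf16, 17.1 tok/s at 4-bit NF4.
slowdown_1_5b = relative_slowdown(18.8, 17.1)
print(f"{slowdown_1_5b:.1%}")
```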

Takeaway: The qualitative gap tracks the structural one. The same 3× scale-up that migrates the hub and specializes heads also cuts repetition 3–4× and roughly doubles constraint adherence, and 4-bit NF4 preserves most of the larger model's speed and coherence.

"Insights from small models are necessary but not sufficient for understanding large models. The causal architecture itself is scale-dependent — the hub migrates, heads specialize, steering collapses, and skills entangle. Methods must be re-validated at each scale."
