| Dimension | Qwen2.5-0.5B | Qwen2.5-1.5B | Δ |
|---|---|---|---|
| Architecture | 24 layers · 14 heads (GQA, 2 KV) · d_model=896 · d_mlp=4864 · ~0.49B params | 28 layers · 12 heads (GQA) · d_model=1536 · ~1.54B params | 3.1× |
| Universal Hub Layer | L2 — mean ablation KL 19.11 across all 12 families. First+last position router. Singular hub. HIGH | L26 — mean ablation KL 13.70. Secondary: L6 (13.28), L14 (13.01). Distributed across 3 layers. HIGH | Migrates |
| MLP Dominance | L2 MLP top, max KL 8.12. L0 MLP second. MLPs carry 42% of top-layer effect. MEDIUM | L0 MLP top, max KL 2.58. L1 (1.83), L27 (1.17). MLPs carry 19% of top-layer effect. MEDIUM | 3× weaker |
| Head Specialization | Distributed — max single-head KL 0.046 (200× smaller than layer effects). No specialist heads. MEDIUM | Specialists emerge — max KL 1.02 (arithmetic L0 H3). L0 H3 cross-task. Still 13× smaller than layer effects. MEDIUM | 22× stronger |
| Steering Leverage | L2 steering boosts target 3.3× (0.064 → 0.213). Factual direction. Practically useful. MEDIUM | L2 steering best boost 0.003. Near-inert. Target prob 0.0009 → 0.010 at best. MEDIUM | 70× weaker |
| Skill Knockout Selectivity | L19 selectively suppresses factual recall — 11,654× selectivity at s=−2.0. JSON/copying preserved. MEDIUM | L21 best: 0.24× selectivity (anti-selective). Most layers negative. Skills entangled. MEDIUM | 48,000× weaker |
| Cross-Model Transfer | L23 (100%), L22 (97%), L21 (95%). Final 3 layers (88–96% depth). MEDIUM | L27 (99.9%), L26 (99.2%), L25 (98.7%). Final 3 layers (89–96% depth). More uniform recovery. MEDIUM | Shifts 4L |
| Adapter Norm-Effect Corr | 0.85 — effect IS at late layers (L19–L23), matching norm distribution. Norms predict impact. MEDIUM | 0.54 — norms flat (3.33–3.45), effects still peak late (L23–L27). Norms decouple from impact. MEDIUM | 0.85→0.54 |
| Layer Skipping Viability | 0% top-5 overlap & KL 7.9–13.8 in ALL 10 configs. Even weakest layer (L15) breaks output. HIGH (negative) | 0% in 3/4 configs. Exception: skip_mid_5 = 10% top-5 overlap (KL 6.58). First hint of redundancy. HIGH (negative) | +10% |
| Early Exit | L22: 0% argmax, KL 9.14. Only L23 (full) works. Every layer transforms the residual stream. HIGH (negative) | All exits fail: 0–7.14% argmax. KL ∞ at L25–L26. No viable early exit. HIGH (negative) | Still fatal |
| LoRA Training | r=8, batch_size=2, 100 steps. Loss 0.062. Full adapter fits in 8GB. | r=8, batch_size=1, gradient checkpointing. Loss 0.0098. More VRAM-constrained. | Tighter |
| Core Circuit Lock-in | L2/L7/L9 locks in by step 10 (first 10%). Two-phase: fast architecture → slow refinement. MEDIUM | Not tested (compute constraints). Unknown whether two-phase architecture persists at scale. | Unknown |
| JSON Training Shift | Largest base→trained KL. LoRA most affects JSON at 0.5B. | Largest base→trained KL (6.47). LoRA most affects JSON at 1.5B too. Invariant. | Same |
The most dramatic structural change is the migration of the universal routing hub from L2 (8% depth) to L26 (93% depth). At 0.5B, one layer (L2) dominated with mean KL 19.11 — a singular hub early in the network. At 1.5B, the hub role is shared across three layers: L26 (13.70), L6 (13.28), and L14 (13.01), spanning early, mid, and late positions.
This migration suggests a fundamental shift in computational strategy. At 0.5B, the model does critical routing early — Layer 2 decides where information goes before most computation. At 1.5B, the critical routing happens late — Layer 26 processes the fully-built representation just before output. The secondary hubs at L6 and L14 may be intermediate integration points feeding into L26's final routing.
The absolute hub KL is lower at 1.5B (13.70 vs 19.11) despite the larger model. This is because the 1.5B model has more layers to compensate — ablating one layer leaves 27 others to carry the load, versus 23 at 0.5B. The larger model is more gracefully degradable.
Scale inverts the computational balance between MLPs and attention heads:
At 0.5B, MLPs dominated and heads were negligible. At 1.5B, MLPs recede and specialist heads emerge — particularly L0 H3, which handles arithmetic, code syntax, and copying. The model trades MLP magnitude for head precision. This aligns with the intuition that larger models benefit more from relational reasoning (attention) than per-token feature transformation (MLP).
At 0.5B, steering L2 with a factual recall direction boosted "Rome" probability from 0.064 to 0.213 — a 3.3× increase that was practically useful for skill manipulation. At 1.5B, the same intervention produces a best boost of 0.003 — essentially no effect.
Three factors likely contribute: (1) representational entanglement — factual knowledge is distributed across more neurons at 1.5B, making a single linear direction insufficient; (2) hub migration — L2 is no longer the hub, so steering there has less leverage; (3) wider residual stream — d_model grows 71% (896 → 1536), diluting a single steering vector's effect. The practical implication: linear steering does not scale. Non-linear interventions (SAE-based, distributed, activation editing) may be required at 1.5B and beyond.
The ~48,000× collapse in skill knockout selectivity is the most consequential scaling finding for safety. At 0.5B, L19 knockout suppressed factual recall tokens 11,654× more than non-skill tokens — highly selective, removing the skill while preserving other behavior. At 1.5B, the best layer (L21) achieves 0.24× selectivity — the knockout suppresses non-skill tokens 4× more than skill tokens (anti-selective).
This strongly suggests skills become entangled at scale. At 0.5B, factual recall occupies a relatively isolated subspace. At 1.5B, it is woven into the same representational space as other behaviors — removing it requires removing entangled components too. If undesirable capabilities are similarly entangled at deployment scale, skill removal via steering may be impossible without retraining or circuit surgery.
Cross-model patching maintains a robust structural invariant: trained behavior is encoded in the final ~10% of layers at both scales. At 0.5B (24 layers), the top transfer layers are L21–L23 (88–96% depth). At 1.5B (28 layers), they are L25–L27 (89–96% depth). The zone shifts 4 layers deeper but maintains the same proportional position.
Recovery is more uniform at 1.5B: L24 achieves 97.9% versus 0.5B's L20 at 87%, suggesting a wider transfer zone. This invariant means cross-model patching methods can be calibrated by depth fraction rather than absolute layer index.
At 0.5B, adapter norms and ablation effects co-locate at late layers (correlation 0.85) — training writes where it matters. At 1.5B, adapter norms are remarkably uniform (3.33–3.45 across all 28 layers), but effects still peak at L23–L27. The correlation drops to 0.54.
This decoupling means norm-based pruning is unreliable at scale. An adapter with small norms at L27 has large functional effects; the same norm at L0 has minimal effects. The residual stream at late layers is more sensitive to perturbation — the same weight change has more leverage on a more refined representation. Causal ablation, not norm inspection, is necessary to identify which adapter components matter.
At 0.5B, ALL 10 layer-skip configurations produced 0% top-5 overlap. At 1.5B, skip_mid_5 (skipping L4–L8) preserves 10% top-5 overlap — the first evidence of partial mid-layer redundancy at scale. While 10% is far from practical, it suggests that structured pruning with retraining (not zero-ablation) might eventually recover usable redundancy at larger scales. Early exit remains fatal at both scales — no intermediate layer produces usable output.
The atlas above maps which components matter. A complementary vibe check — 30 prompts × 6 configs (0.5B/1.5B at bf16, 8-bit, 4-bit NF4) — asks the blunter question of whether the output actually reads well. Three qualitative gaps mirror the structural ones:
0.5B loops. 13–16 of 30 outputs degenerate into self-repeating text (43–53% repetition rate) across every precision — the model seizes on a phrase and rewrites it with one token swapped. 1.5B loops 3–4× less (3–8 of 30, 10–27%). On the same palindrome prompt, 0.5B-bf16 never writes the function, instead emitting a chain of near-identical "maximum length of N characters" sentences until the token budget runs out (repetition ratio 0.73); 1.5B-4bit produces a clean working function with a docstring (repetition ratio 0.17). The smaller model's thinner residual stream and fewer layers give it less slack to escape degeneration loops.
1.5B follows explicit prompt constraints roughly 2× better than 0.5B (13–17% vs 0–7%). 0.5B-bf16 meets 0% of stated constraints — it tends to describe or hallucinate requirements rather than satisfy them. Scale roughly doubles instruction-following, while quantization changes adherence only within noise at either size. (Absolute rates are low because these are base, non-instruct-tuned checkpoints; the reliable signal is the relative gap.)
Quantization affects the two scales differently. 0.5B loses 42–55% of speed going to 4-/8-bit and gains little quality; 1.5B-4bit NF4 loses only 9% of speed (17.1 vs 18.8 tok/s) and keeps quality within noise of bf16. Notably, 8-bit is the slowest quantization at both scales (bitsandbytes dequantizes 8-bit weights back to bf16 on every matmul), making 4-bit NF4 the practical sweet spot — especially at 1.5B, where it fits in 8GB and nearly matches bf16 throughput.
"Insights from small models are necessary but not sufficient for understanding large models. The causal architecture itself is scale-dependent — the hub migrates, heads specialize, steering collapses, and skills entangle. Methods must be re-validated at each scale."