d_model=1536 · vocab 151,936 · ~1.54B params · bf16 on RTX 2070 Super (8GB). 9 experiments · 8 task families · LoRA training with gradient checkpointing (batch_size=1, loss 0.0098). Same causal-intervention methodology as the 0.5B study for direct comparability.
We present a mechanistic interpretability atlas of Qwen2.5-1.5B, a 1.5B-parameter transformer and a 3× scale-up from our prior 0.5B study. Using the same causal intervention suite (layer ablation, MLP ablation, head ablation, steering vectors, LoRA training perturbation, cross-model activation transfer, skill knockout, and adapter ablation), we map how this larger model processes information across 8 task families. We find that the primary routing hub shifts from L2 (at 0.5B) to L26 (mean ablation KL 13.70), with L6 and L14 close behind; that individual attention heads become 22× more impactful (max head KL 1.02 vs 0.046); that steering becomes 70× weaker (best boost 0.003 vs 0.213); and that skill knockout selectivity collapses from 11,654× to 0.24×. The best cross-model patching layers remain the final three at each scale, moving from L21–L23 to L25–L27. MLP ablation is 3× weaker (max KL 2.58 vs 8.12) despite 3× more parameters. All layer-skipping configurations remain fatal (top-5 overlap below 3%, except one mid-layer skip at 10%), and early exit fails at every layer tested.
At 0.5B (24 layers), L2 was the singular universal hub at 8% depth. At 1.5B (28 layers), the hub role is distributed across three layers spanning the network:
| Layer | Mean KL | Depth | Role |
|---|---|---|---|
| L26 | 13.70 | 93% | Primary hub — late integration & routing |
| L6 | 13.28 | 21% | Secondary hub — early integration |
| L14 | 13.01 | 50% | Secondary hub — mid integration |
| L5 | 10.44 | 18% | Supporting early layer |
| L9 | 10.34 | 32% | Supporting mid layer |
The top 3 layers are within 0.69 nats of each other — no single layer dominates as overwhelmingly as L2 did at 0.5B. This suggests the model distributes routing across depth rather than concentrating it in one layer.
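The depth percentages in the table can be reproduced with a small calculation. This is a minimal sketch that assumes depth is defined as layer index divided by total layer count; the helper name `relative_depth_pct` is ours, not from the study.

```python
# Relative depth of a layer, as a rounded percentage of total layers.
# Assumption: depth = layer_index / n_layers, which matches the table values.

def relative_depth_pct(layer_index: int, n_layers: int) -> int:
    """Return layer depth as a rounded percentage of network depth."""
    return round(100 * layer_index / n_layers)

# 1.5B model: 28 layers; 0.5B model: 24 layers (both from the text).
hubs_1p5b = {layer: relative_depth_pct(layer, 28) for layer in (26, 6, 14, 5, 9)}
hub_0p5b = relative_depth_pct(2, 24)

print(hubs_1p5b)  # {26: 93, 6: 21, 14: 50, 5: 18, 9: 32}
print(hub_0p5b)   # 8
```

Expressing hubs as relative depth makes the 24-layer and 28-layer models directly comparable.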
| Component | Role | Evidence |
|---|---|---|
| L26 (residual) | Universal routing hub — late integration | Ablation KL 13.70 mean across all families |
| L0 MLP | Strongest MLP, but 3× weaker than 0.5B's L2 MLP | MLP ablation KL 2.58 |
| L0 H3 (attention head) | Cross-task specialist — arithmetic, code syntax, copying | Head ablation KL 1.02 (arithmetic), 0.35 (code), 0.26 (copying) |
| L27 | Final-layer transfer point — 99.9% cross-model recovery | Cross-model patching |
| L21 | Best (but weak) skill knockout point | Skill knockout selectivity 0.24× |
| Heads (12, GQA) | Partially specialized — max KL 1.02, 22× stronger than 0.5B | Head ablation |
Zero-ablating each of the 28 layers and measuring mean KL across 8 task families reveals a multi-peaked importance distribution. Unlike 0.5B's sharp L2 peak, 1.5B shows three near-equal peaks (L26, L6, L14) with a long tail of supporting layers. The mean per-layer KL ranges from 5.40 (L23) to 13.70 (L26).
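The per-layer score described above can be sketched as follows. This is an illustrative outline, not the study's code: it assumes the metric is KL(base || ablated) in nats over next-token distributions, averaged over task families, and all function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two probability vectors of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def mean_ablation_kl(base_logits_by_family, ablated_logits_by_family):
    """Average KL between base and ablated outputs across task families."""
    kls = []
    for family, base_logits in base_logits_by_family.items():
        p = softmax(base_logits)
        q = softmax(ablated_logits_by_family[family])
        kls.append(kl_divergence(p, q))
    return sum(kls) / len(kls)

# Toy logits over a 3-token vocabulary (made-up values).
base = {"arithmetic": [2.0, 0.5, -1.0], "copying": [0.1, 3.0, 0.2]}
ablated = {"arithmetic": [0.0, 0.0, 0.0], "copying": [0.1, 3.0, 0.2]}
score = mean_ablation_kl(base, ablated)
print(f"mean ablation KL: {score:.3f} nats")
```

In a real run, the ablated logits would come from a forward pass with one layer's output zeroed, repeated for each of the 28 layers.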
MLP-specific ablation shows L0 MLP dominates (KL 2.58), with L1 (1.83) and L27 (1.17) following. The MLP contribution is 3× weaker than at 0.5B (max 2.58 vs 8.12). Late MLPs (L27) gain relative importance — third place at 1.5B, absent from the top 5 at 0.5B. This suggests MLPs play a smaller role in the larger model's computation, with attention picking up the slack.
The emergence of specialist heads is the most notable component-level change. At 0.5B, the maximum head effect was 0.046: all heads were near-equal and negligible. At 1.5B, the maximum reaches 1.02, and identifiable specialists appear. The clearest is L0 H3, with head ablation KL of 1.02 on arithmetic, 0.35 on code, and 0.26 on copying.
All top heads are in L0, suggesting that the first attention layer develops specialized routing at 1.5B — a role that was distributed across all heads at 0.5B.
Steering L2 with a factual recall direction (sv_norm 9.625) produces a dramatically weaker effect than at 0.5B. The best probability boost is 0.003, compared to 0.5B's 0.213 — a 70× reduction. For "The capital of Italy is ", the target probability moves from 0.000919 to at most 0.010315 (at strength −2.0). For "The capital of Spain is ", it moves from 0.000038 to 0.000353.
The KL divergence at moderate steering strengths (±0.5 to ±1.0) is minimal (0.004–0.036), with meaningful KL only at extreme strengths (0.76 at −4.0). This suggests the model's representations are too entangled for a single linear steering direction to produce meaningful behavioral change.
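A strength sweep of this kind can be sketched on a toy model. The snippet assumes steering means adding `strength * direction` to a hidden state without renormalization, which the text does not confirm; the 2-dimensional vectors and the identity unembedding are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer(hidden, direction, strength):
    """Return hidden + strength * direction (no renormalization; an assumption)."""
    return [h + strength * d for h, d in zip(hidden, direction)]

def target_prob(hidden, unembed, target_id):
    """Probability of target_id under a toy linear unembedding."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)[target_id]

# Toy setup: two tokens, two hidden dimensions.
hidden = [1.0, 0.0]
direction = [0.0, 1.0]               # hypothetical steering vector
unembed = [[1.0, 0.0], [0.0, 1.0]]   # rows correspond to token 0 and token 1
base_p = target_prob(hidden, unembed, target_id=1)

# Probability boost for the target token at each steering strength.
boosts = {s: target_prob(steer(hidden, direction, s), unembed, 1) - base_p
          for s in (-2.0, -1.0, 1.0, 2.0)}
best_strength = max(boosts, key=boosts.get)
print(best_strength, round(boosts[best_strength], 3))
```

The "best boost" figure reported above corresponds to the maximum of such a sweep, measured on the real model's target-token probability.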
| Layer | Mean Recovery | 0.5B Equivalent |
|---|---|---|
| L27 | 99.9% | L23 (100%) |
| L26 | 99.2% | L22 (97%) |
| L25 | 98.7% | L21 (95%) |
| L24 | 97.9% | L20 (87%) |
| L23 | 97.4% | — |
| L22 | 95.5% | — |
The top transfer layers are the final 3 layers at both scales (L21–L23 of 24 and L25–L27 of 28), consistent with a proportional-depth invariant (roughly the final 10%). Recovery is also more uniform at 1.5B: L24 achieves 97.9% versus 87% for 0.5B's L20, suggesting a wider transfer zone at scale.
| Layer | Selectivity | Skill Drop | Non-Skill Drop | SV Norm |
|---|---|---|---|---|
| L2 | −24.55 | −0.0095 | −0.0004 | 8.88 |
| L3 | −77.41 | −0.0099 | −0.0001 | 10.63 |
| L16 | −9.31 | −0.0033 | +0.0004 | 26.13 |
| L19 | −1.62 | −0.0861 | +0.0532 | 36.50 |
| L21 | +0.24 | +0.0011 | +0.0044 | 49.50 |
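The best selectivity (L21: 0.24×) means the skill drop is only 24% of the non-skill drop, so the knockout is anti-selective: it degrades unrelated behavior about four times more than the targeted skill. At every other layer listed the skill drop is negative, meaning the intervention did not suppress the skill at all. At 0.5B, L19 achieved 11,654× selectivity (skill suppressed 11,654× more than non-skill). The ~48,000× collapse suggests skills are deeply entangled in the 1.5B model's representations and cannot be isolated with linear steering.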
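The selectivity column appears to be the skill drop divided by the magnitude of the non-skill drop. That definition is inferred from the table rather than stated in the text, so treat the sketch below as an assumption. The helper name `selectivity` is ours.

```python
def selectivity(skill_drop: float, non_skill_drop: float) -> float:
    """Skill drop relative to the magnitude of the non-skill drop.

    Assumed definition: skill_drop / |non_skill_drop|. Values above 1 mean the
    knockout hits the skill harder than everything else; values between 0 and 1
    mean it is anti-selective; negative values mean the skill was not suppressed.
    """
    if non_skill_drop == 0:
        raise ValueError("non_skill_drop is zero; selectivity is undefined")
    return skill_drop / abs(non_skill_drop)

# Rounded (skill drop, non-skill drop) pairs from the table.
rows = {"L19": (-0.0861, 0.0532), "L21": (0.0011, 0.0044)}
for layer, (skill, non_skill) in rows.items():
    print(layer, round(selectivity(skill, non_skill), 2))
```

Because the table entries are rounded, recomputed values can differ slightly from the reported column: L21 gives 0.25 from the rounded inputs versus the reported 0.24.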
LoRA training (r=8, alpha=16, target q/k/v/o_proj) on the 1.5B model required batch_size=1 and gradient checkpointing to fit in 8GB VRAM — more constrained than 0.5B's batch_size=2. Training converged to loss 0.0098.
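To give a sense of how small the trained adapter is, the following sketch counts LoRA parameters for the reported configuration. The q/o projection widths follow from d_model=1536; the k/v width of 256 is an assumption (2 key/value heads of dimension 128 under GQA) and should be checked against the model config.

```python
# LoRA hyperparameters and model dimensions reported in the text.
r, alpha = 8, 16
n_layers, d_model = 28, 1536

# Assumed projection shapes (in_features, out_features). The k/v width of 256
# assumes 2 key/value heads of dimension 128 under GQA; verify before relying on it.
proj_shapes = {
    "q_proj": (d_model, 1536),
    "k_proj": (d_model, 256),
    "v_proj": (d_model, 256),
    "o_proj": (1536, d_model),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted matrix."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in proj_shapes.values())
total = per_layer * n_layers
scaling = alpha / r  # multiplier applied to the low-rank update

print(per_layer, total, scaling)  # 77824 2179072 2.0
```

Under these assumptions the adapter holds roughly 2.2M trainable parameters, a small fraction of the ~1.54B base parameters, so the memory pressure comes mainly from activations and the frozen base weights rather than from the adapter itself.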
| Task Family | Base→Trained KL | 0.5B Comparison |
|---|---|---|
| json_schema | 6.47 | Largest shift at both scales |
| copying | 0.54 | — |
| dead_code | 0.42 | — |
| delimiter_tracking | 0.28 | — |
| factual_recall | 0.16 | — |
| arithmetic | 0.08 | — |
| code_semantics | 0.08 | — |
| code_syntax | 0.05 | Minimal shift (pre-trained skill) |
JSON schema training produces the largest distributional shift at both scales, confirming it as the skill most affected by LoRA. Code tasks (syntax, semantics) show minimal shift, suggesting these capabilities are largely pre-trained and LoRA fine-tunes them only marginally.
The correlation between adapter weight norm and ablation effect is 0.54 at 1.5B, down from 0.5B's 0.85. Adapter norms are remarkably uniform across all 28 layers (3.33–3.45), but effects still peak at late layers (L23–L27):
| Layer | Total KL | Adapter Norm |
|---|---|---|
| L27 | 13.64 | 3.42 |
| L26 | 12.70 | 3.39 |
| L25 | 12.00 | 3.39 |
| L24 | 11.56 | 3.37 |
| L23 | 10.92 | 3.41 |
The weaker correlation means adapter norms are less predictive of functional impact at 1.5B. At 0.5B, norm and effect co-located at late layers (corr 0.85). At 1.5B, norms are flat but effects still peak late — the decoupling suggests the residual stream at late layers is more sensitive to perturbation, not that the weights are larger there.
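The norm-versus-effect comparison can be sketched with a plain Pearson correlation. This assumes the reported figure is a Pearson coefficient computed over all 28 layers, which the text does not state. The example uses only the five table rows, so it will not reproduce the reported 0.54.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Five late layers from the table (L27 down to L23), not the full 28-layer data.
adapter_norm = [3.42, 3.39, 3.39, 3.37, 3.41]
total_kl = [13.64, 12.70, 12.00, 11.56, 10.92]

r_late = pearson(adapter_norm, total_kl)
print(f"correlation over the five listed layers: {r_late:.2f}")
```

With norms confined to a 3.33–3.45 band, even small measurement noise in the norm can move the coefficient substantially, which is one reason a flat norm profile is a weak predictor of effect.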
| Config | Layers Skipped | Mean KL | Top-5 Overlap |
|---|---|---|---|
| skip_weakest_1 | L15 | 9.50 | 0% |
| skip_mid_5 | L4–L8 | 6.58 | 10% |
| skip_6_layers | L4,5,8,11,15,16 | 9.15 | 2.86% |
| skip_8_layers | L4,5,8,10,11,14,15,16 | 9.15 | 2.86% |
| Exit Layer | Layers Skipped | Mean KL | Argmax Match | Speedup |
|---|---|---|---|---|
| L27 (full) | 0 | ∞ | 0% | 1.00× |
| L26 | 1 | ∞ | 0% | 1.04× |
| L25 | 2 | ∞ | 7.14% | 1.08× |
| L23 | 4 | 12.60 | 7.14% | 1.17× |
| L17 | 10 | 7.21 | 0% | 1.56× |
The 10% top-5 overlap in skip_mid_5 is the first positive efficiency signal across both studies: at 0.5B, all 10 skip configurations gave 0%. While 10% is far from practical, it suggests that the mid layers L4–L8 at 1.5B may carry partially redundant computation. Structured pruning with retraining, rather than zero-ablation, might recover this redundancy at larger scales.
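The two efficiency metrics in the tables can be sketched as follows. This assumes top-5 overlap is the fraction of the base model's top-5 tokens that remain in the modified model's top-5, and that the speedup column is the idealized ratio of total layers to layers executed. Both definitions are inferred, and the function names are ours.

```python
def top_k_overlap(base_logits, modified_logits, k=5):
    """Fraction of the base top-k token ids also present in the modified top-k."""
    def top_k(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return set(order[:k])
    return len(top_k(base_logits) & top_k(modified_logits)) / k

def idealized_speedup(n_layers: int, layers_skipped: int) -> float:
    """Speedup if per-layer cost is uniform and nothing else changes."""
    return n_layers / (n_layers - layers_skipped)

# Toy logits over a 10-token vocabulary (made-up values).
base = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
reversed_logits = list(reversed(base))
print(top_k_overlap(base, base))             # 1.0
print(top_k_overlap(base, reversed_logits))  # 0.0

# Reproduces the speedup column for the 28-layer model.
for skipped in (1, 2, 4, 10):
    print(skipped, round(idealized_speedup(28, skipped), 2))
# 1 1.04
# 2 1.08
# 4 1.17
# 10 1.56
```

The idealized figure ignores embedding, unembedding, and memory-bandwidth costs, so measured wall-clock gains would likely be smaller.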
| ID | Hypothesis | Status |
|---|---|---|
| H001 | Universal hub persists at same relative depth | REJECTED — migrates from L2 (8%) to L26 (93%) |
| H002 | Specialist attention heads emerge at scale | SUPPORTED — max head KL 22× stronger (0.046 → 1.02) |
| H003 | Steering leverage scales with capacity | REJECTED — collapses 70× (0.213 → 0.003) |
| H004 | Skill knockout selectivity preserved at scale | REJECTED — drops ~48,000× (11,654× → 0.24×) |
| H005 | Cross-model transfer zone defined by relative depth | SUPPORTED — final ~10% at both scales |
| H006 | MLP dominance scales with parameters | REJECTED — MLP effects weaken 3× despite 3× params |
| H007 | Mid-layer redundancy emerges at scale | PARTIALLY SUPPORTED — skip_mid_5 gives 10% overlap |
| H008 | Adapter norm predicts functional impact | WEAKENED — correlation drops 0.85 → 0.54 |
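
The headline ratios cited in the hypothesis table can be re-derived from the reported values. This is a small arithmetic check using only numbers that appear in the text; the dictionary keys are descriptive labels of our own.

```python
# (numerator, denominator) pairs taken from values reported in the text.
reported = {
    "max head KL, 1.5B vs 0.5B": (1.02, 0.046),
    "best steering boost, 0.5B vs 1.5B": (0.213, 0.003),
    "knockout selectivity, 0.5B vs 1.5B": (11654, 0.24),
    "max MLP ablation KL, 0.5B vs 1.5B": (8.12, 2.58),
}

ratios = {name: num / den for name, (num, den) in reported.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.1f}x")
```

The results round to about 22×, 71×, 48,558×, and 3.1×, matching the 22×, 70×, ~48,000×, and 3× figures quoted above.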
"Scaling from 0.5B to 1.5B does not merely amplify the atlas — it transforms it. The hub migrates, heads specialize, MLPs recede, steering collapses, and skills entangle. The causal architecture itself is scale-dependent."