Qwen2.5-1.5B: Scaling the Atlas

Bilawal Riaz
Mechanistic Interpretability Atlas
2026
TL;DR — 9 causal experiments on a 1.5B-parameter transformer (3× the 0.5B study) reveal that the universal routing hub migrates from L2 to L26 (from 8% to 93% depth), specialist attention heads emerge (max head KL 1.02, 22× stronger than 0.5B), MLP effects weaken 3× (despite 3× more parameters), steering collapses 70× (best boost 0.003 vs 0.213), and skill knockout selectivity drops ~48,000× (from 11,654× to 0.24×). Cross-model transfer shifts to L25–L27 but maintains its proportional depth invariant (final ~10%). Naive layer skipping remains fatal — but one mid-skip config shows the first hint of redundancy (10% top-5 overlap).
⚙️
Model & scope: Qwen2.5-1.5B · 28 transformer layers · 12 attention heads (GQA) · d_model=1536 · vocab 151,936 · ~1.54B params · bf16 on RTX 2070 Super (8GB). 9 experiments · 8 task families · LoRA training with gradient checkpointing (batch_size=1, loss 0.0098). Same causal-intervention methodology as the 0.5B study for direct comparability.
Abstract

We present a mechanistic interpretability atlas of Qwen2.5-1.5B, a 1.5B-parameter transformer and a 3× scale-up from our prior 0.5B study. Using the same causal intervention suite (layer ablation, MLP ablation, head ablation, steering vectors, LoRA training perturbation, cross-model activation transfer, skill knockout, and adapter ablation), we map how this larger model processes information across 8 task families. We find that the universal routing hub shifts from L2 (at 0.5B) to L26 (mean ablation KL 13.70), that individual attention heads become 22× more impactful (max head KL 1.02 vs 0.046), that steering becomes 70× weaker (best boost 0.003 vs 0.213), and that skill knockout selectivity collapses from 11,654× to 0.24×. The cross-model transfer zone shifts from L21–L23 to L25–L27, with recovery increasing monotonically toward the final layer. MLP ablation is 3× weaker despite 3× more parameters. All layer-skipping configurations remain fatal (top-5 overlap of 0–2.86%, except one mid-skip at 10%), and early exit at any layer fails.

Key Findings
The six headline results, each with its confidence level and evidence.

L26 is the universal hub, migrated from L2 · Confidence: High

Layer-level ablation · 8 task families · 28 layers
Zero-ablating Layer 26 causes the largest mean KL (13.70 nats) across all families. The hub shifts from L2 (8% depth at 0.5B) to L26 (93% depth at 1.5B). Secondary hubs: L6 (13.28), L14 (13.01) — the hub role is now distributed across three layers, not singular.

Specialist attention heads emerge · Confidence: Medium

Head ablation · 6 families · 12 heads × 28 layers
Max single-head KL is 1.02 (arithmetic L0 H3) — 22× stronger than 0.5B's 0.046. L0 H3 appears across arithmetic, code syntax, and copying, a potential cross-task specialist. At 0.5B, heads were fully distributed (200× smaller than layer effects); at 1.5B, the gap closes to 13×.

Steering collapses 70× · Confidence: Medium

Steering sweep · L2 · factual recall direction
Best probability boost is 0.003, compared to 0.5B's 0.213, a 70× reduction. For "The capital of Italy is ", even at the optimal strength (−2.0), target probability only reaches 0.010 from a baseline of 0.0009. Linear steering at L2 is nearly inert at 1.5B, which suggests representations are too entangled for single-direction manipulation at this layer.

Skill knockout: 48,000× less selective · Confidence: Medium

Negative steering · factual_recall · 5 layers
Best selectivity is L21 at 0.24× — meaning the knockout suppresses non-skill tokens 4× more than skill tokens (anti-selective). At 0.5B, L19 achieved 11,654× selectivity. Most 1.5B layers show large negative selectivity (−1.6 to −77.4). Skills are deeply entangled at scale.

Cross-model transfer shifts to L25–L27 · Confidence: Medium

Trained→base activation transfer · 28 layers
Recovery increases monotonically: L27 = 99.9%, L26 = 99.2%, L25 = 98.7%, L24 = 97.9%. The transfer zone shifts 4 layers deeper from 0.5B's L21–L23, but maintains the same proportional depth invariant: final ~10% of layers at both scales. Recovery is more uniform — L24 reaches 97.9% vs 0.5B's L20 at 87%.

MLP effects weaken 3× at scale · Confidence: Medium

MLP ablation · 8 families · 28 layers
Max MLP KL is 2.58 (L0), down from 0.5B's 8.12 (L2 MLP): a 3× reduction despite 3× more parameters. An early-layer MLP dominates at both scales (L2 at 0.5B, L0 at 1.5B), but the strongest MLP carries only 19% of the top layer's effect (vs 42% at 0.5B). The computational burden appears to shift from MLPs toward attention heads at scale.
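The headline ratios in these findings can be rechecked from the quoted values alone. The sketch below is only arithmetic on numbers stated in this atlas; the helper name `fold_change` is ours and purely illustrative.

```python
# Values quoted in this atlas as (0.5B figure, 1.5B figure).
findings = {
    "max head KL": (0.046, 1.02),
    "best steering boost": (0.213, 0.003),
    "knockout selectivity": (11654.0, 0.24),
    "max MLP KL": (8.12, 2.58),
}


def fold_change(small_model, large_model):
    """Return (ratio >= 1, direction) for the move from 0.5B to 1.5B."""
    if large_model >= small_model:
        return large_model / small_model, "up"
    return small_model / large_model, "down"


for name, (v_05b, v_15b) in findings.items():
    ratio, direction = fold_change(v_05b, v_15b)
    print(f"{name}: {ratio:,.1f}x {direction}")

# Share of the top layer's ablation KL (L26, 13.70) carried by the top MLP (L0, 2.58).
mlp_share = 2.58 / 13.70
print(f"top MLP share of top layer KL: {mlp_share:.0%}")
```

The rounded figures used in the text (22×, 70×, ~48,000×, 3×, 19%) are approximations of these ratios.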
Architecture Map
The hub migrates to late layers, attention heads gain specialization, and MLPs lose relative dominance.

Layer ablation: the hub migration

At 0.5B (24 layers), L2 was the singular universal hub at 8% depth. At 1.5B (28 layers), the hub role is distributed across three layers spanning the network:

| Layer | Mean KL | Depth | Role |
| --- | --- | --- | --- |
| L26 | 13.70 | 93% | Primary hub: late integration & routing |
| L6 | 13.28 | 21% | Secondary hub: early integration |
| L14 | 13.01 | 50% | Secondary hub: mid integration |
| L5 | 10.44 | 18% | Supporting early layer |
| L9 | 10.34 | 32% | Supporting mid layer |

The top 3 layers are within 0.69 nats of each other — no single layer dominates as overwhelmingly as L2 did at 0.5B. This suggests the model distributes routing across depth rather than concentrating it in one layer.
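The depth percentages used throughout are consistent with dividing the layer index by the layer count. A minimal sketch, assuming that convention (the function name `relative_depth` is hypothetical):

```python
def relative_depth(layer_index, n_layers):
    """Layer index as a fraction of the total layer count."""
    return layer_index / n_layers


hub_05b = relative_depth(2, 24)    # L2 in the 24-layer 0.5B model
hub_15b = relative_depth(26, 28)   # L26 in the 28-layer 1.5B model

# Mean ablation KL of the three near-equal peaks at 1.5B.
top3 = {"L26": 13.70, "L6": 13.28, "L14": 13.01}
spread = max(top3.values()) - min(top3.values())

print(f"0.5B hub depth: {hub_05b:.0%}, 1.5B hub depth: {hub_15b:.0%}")
print(f"spread across top 3 layers: {spread:.2f} nats")
```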

Component atlas

| Component | Role | Evidence |
| --- | --- | --- |
| L26 (residual) | Universal routing hub, late integration | Ablation KL 13.70 mean across all families |
| L0 MLP | Strongest MLP, but 3× weaker than 0.5B's L2 MLP | MLP ablation KL 2.58 |
| L0 H3 (attention head) | Cross-task specialist: arithmetic, code syntax, copying | Head ablation KL 1.02 (arithmetic), 0.35 (code), 0.26 (copying) |
| L27 | Final-layer transfer point, 99.9% cross-model recovery | Cross-model patching |
| L21 | Best (but weak) skill knockout point | Skill knockout selectivity 0.24× |
| Heads (12, GQA) | Partially specialized: max KL 1.02, 22× stronger than 0.5B | Head ablation |
Component Mapping
Layer, MLP, and head ablation results — how causal structure redistributes at 1.5B.

Layer ablation

Zero-ablating each of the 28 layers and measuring mean KL across 8 task families reveals a multi-peaked importance distribution. Unlike 0.5B's sharp L2 peak, 1.5B shows three near-equal peaks (L26, L6, L14) with a long tail of supporting layers. The mean per-layer KL ranges from 5.40 (L23) to 13.70 (L26).
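To make the metric concrete, here is a minimal sketch of the KL divergence (in nats) between the intact model's next-token distribution and the distribution after one layer is zero-ablated. The toy logits are hypothetical, and the direction KL(intact || ablated) is our assumption; the text above does not state which direction was used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical toy logits over a 4-token vocabulary.
intact_logits = [4.0, 1.0, 0.5, -2.0]   # full model
ablated_logits = [0.2, 0.1, 0.3, 0.0]   # one layer's output zeroed

kl = kl_divergence(softmax(intact_logits), softmax(ablated_logits))
print(f"KL(intact || ablated) = {kl:.3f} nats")
```

A per-layer score like those reported above would then be the mean of this quantity over prompts and task families.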

MLP ablation

MLP-specific ablation shows L0 MLP dominates (KL 2.58), with L1 (1.83) and L27 (1.17) following. The MLP contribution is 3× weaker than at 0.5B (max 2.58 vs 8.12). Late MLPs (L27) gain relative importance — third place at 1.5B, absent from the top 5 at 0.5B. This suggests MLPs play a smaller role in the larger model's computation, with attention picking up the slack.

Head ablation

The emergence of specialist heads is the most notable component-level change. At 0.5B, the maximum head effect was 0.046 — all heads were near-equal and negligible. At 1.5B, the maximum reaches 1.02, with identifiable specialists:

  • L0 H3 — arithmetic (1.02), code syntax (0.35), copying (0.26): a cross-task computational specialist
  • L0 H6 — code syntax (0.51), delimiter tracking (0.44): a structural/syntax specialist
  • L0 H10 — delimiter tracking (0.44): a format specialist
  • L0 H5 — factual recall (0.20): a knowledge-access specialist

All top heads are in L0, suggesting that the first attention layer develops specialized routing at 1.5B — a role that was distributed across all heads at 0.5B.
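One common way to ablate a single head is to zero its slice of the concatenated per-head output before the output projection. The sketch below illustrates that idea on a plain list; whether the experiments here ablate at exactly this point is an assumption, and `ablate_head` is a hypothetical helper.

```python
def ablate_head(concat_heads, head, n_heads):
    """Zero one head's slice of the concatenated per-head attention output."""
    if len(concat_heads) % n_heads != 0:
        raise ValueError("vector length must be divisible by n_heads")
    if not 0 <= head < n_heads:
        raise ValueError("head index out of range")
    head_dim = len(concat_heads) // n_heads
    start = head * head_dim
    out = list(concat_heads)                      # leave the input untouched
    out[start:start + head_dim] = [0.0] * head_dim
    return out


D_MODEL, N_HEADS = 1536, 12       # from the model summary: 12 heads, d_model=1536
concat = [1.0] * D_MODEL          # toy stand-in for one token's concatenated head outputs
ablated = ablate_head(concat, head=3, n_heads=N_HEADS)   # H3, as in L0 H3

print(D_MODEL // N_HEADS)         # per-head width
print(sum(ablated))               # remaining non-zero mass in the toy vector
```

With grouped-query attention the key/value projections are shared across heads, but each query head still produces its own output slice, so this kind of per-head masking remains well defined.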

Causal Interventions
Steering, cross-model patching, and skill knockout — what scales and what collapses.

Steering: the 70× collapse

Steering L2 with a factual recall direction (sv_norm 9.625) produces a dramatically weaker effect than at 0.5B. The best probability boost is 0.003, compared to 0.5B's 0.213 — a 70× reduction. For "The capital of Italy is ", the target probability moves from 0.000919 to at most 0.010315 (at strength −2.0). For "The capital of Spain is ", it moves from 0.000038 to 0.000353.

The KL divergence at moderate steering strengths (±0.5 to ±1.0) is minimal (0.004–0.036), with meaningful KL only at extreme strengths (0.76 at −4.0). This suggests the model's representations are too entangled for a single linear steering direction to produce meaningful behavioral change.
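For readers unfamiliar with the intervention, a steering sweep can be sketched as adding a scaled direction to the residual-stream vector at one layer and repeating over a grid of strengths. The update rule, the toy vectors, and the strength grid below are assumptions for illustration; in particular, the text does not say whether the strength multiplies the raw vector or a unit-normalized one.

```python
def steer(hidden, direction, strength):
    """Return hidden + strength * direction for one residual-stream vector."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + strength * d for h, d in zip(hidden, direction)]


# Hypothetical 3-dimensional toy vectors; real vectors would have d_model entries.
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]

# Hypothetical strength grid covering the range discussed above.
strengths = [-4.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 4.0]
sweep = {s: steer(hidden, direction, s) for s in strengths}

for s, vec in sweep.items():
    print(s, vec)
```

In a real sweep, each steered vector would be written back into the forward pass and the target-token probability and KL against the unsteered output would be recorded per strength.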

⚠️
Critical scaling finding. The steering leverage that made skill manipulation practical at 0.5B (3.3× boost, 11,654× knockout selectivity) is largely unavailable at 1.5B. Linear steering may be fundamentally limited at scale — non-linear interventions (SAE-based steering, distributed steering, activation editing) may be required.

Cross-model patching: proportional shift

| Layer | Mean Recovery | 0.5B Equivalent |
| --- | --- | --- |
| L27 | 99.9% | L23 (100%) |
| L26 | 99.2% | L22 (97%) |
| L25 | 98.7% | L21 (95%) |
| L24 | 97.9% | L20 (87%) |
| L23 | 97.4% | |
| L22 | 95.5% | |

The top transfer layers are the final 3 layers at both scales — a robust proportional depth invariant (final ~10%). Recovery is more uniform at 1.5B: L24 achieves 97.9% versus 0.5B's L20 at 87%, suggesting a wider transfer zone at scale.
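The recovery metric is not defined explicitly above. One common normalization, shown here purely as an assumption, measures how much of the base-to-trained gap is closed when a trained activation is patched into the base model. The second helper checks the "final ~10%" depth claim from the layer counts.

```python
def recovery(kl_base_vs_trained, kl_patched_vs_trained):
    """Fraction of the base-to-trained gap closed by patching (1.0 = full recovery)."""
    if kl_base_vs_trained <= 0:
        raise ValueError("base-to-trained KL must be positive")
    return 1.0 - kl_patched_vs_trained / kl_base_vs_trained


def tail_fraction(n_tail_layers, n_layers):
    """Share of the network covered by the last n_tail_layers layers."""
    return n_tail_layers / n_layers


# Hypothetical toy KL values, not measurements from this study.
print(recovery(2.0, 0.5))

# Final three layers at each scale: L21-L23 of 24, L25-L27 of 28.
print(f"{tail_fraction(3, 24):.1%}", f"{tail_fraction(3, 28):.1%}")
```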

Skill knockout: entanglement at scale

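| Layer | Selectivity | Skill Drop | Non-Skill Drop | SV Norm |
| --- | --- | --- | --- | --- |
| L2 | −24.55 | −0.0095 | −0.0004 | 8.88 |
| L3 | −77.41 | −0.0099 | −0.0001 | 10.63 |
| L16 | −9.31 | −0.0033 | +0.0004 | 26.13 |
| L19 | −1.62 | −0.0861 | +0.0532 | 36.50 |
| L21 | +0.24 | +0.0011 | +0.0044 | 49.50 |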

The best selectivity (L21: 0.24×) means the skill drop is only 24% of the non-skill drop — the knockout is anti-selective. At 0.5B, L19 achieved 11,654× selectivity (skill suppressed 11,654× more than non-skill). The ~48,000× collapse suggests skills are deeply entangled in the 1.5B model's representations and cannot be isolated with linear steering.
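The selectivity column appears consistent with the skill drop divided by the magnitude of the non-skill drop. This definition is inferred from the tabulated values rather than stated in the text, so treat the sketch as an assumption. It reproduces the L19 row closely and the L21 row approximately; the other rows depend on non-skill drops that are rounded to one significant digit in the table.

```python
def selectivity(skill_drop, non_skill_drop):
    """Inferred definition: skill drop relative to the size of the non-skill drop."""
    if non_skill_drop == 0:
        raise ValueError("non-skill drop must be non-zero")
    return skill_drop / abs(non_skill_drop)


# (skill drop, non-skill drop) as tabulated for two layers.
rows = {"L19": (-0.0861, 0.0532), "L21": (0.0011, 0.0044)}
for layer, (skill, non_skill) in rows.items():
    print(layer, round(selectivity(skill, non_skill), 2))

# Collapse factor between the best 0.5B and best 1.5B selectivity.
collapse = 11654 / 0.24
print(f"{collapse:,.0f}x")
```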

Training Perturbation
LoRA training at 1.5B scale — convergence, VRAM constraints, and base-to-trained divergence.

LoRA training (r=8, alpha=16, target q/k/v/o_proj) on the 1.5B model required batch_size=1 and gradient checkpointing to fit in 8GB VRAM — more constrained than 0.5B's batch_size=2. Training converged to loss 0.0098.
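As a reminder of what r=8 and alpha=16 imply, the standard LoRA update adds a low-rank term scaled by alpha/r to each targeted weight matrix. The sketch below is pure-Python arithmetic, not the training code used here; the square 1536×1536 projection shape is an assumption based on d_model=1536.

```python
def lora_scaling(r, alpha):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_delta(A, B, r, alpha):
    """delta_W = (alpha / r) * B @ A, with B of shape d_out x r and A of shape r x d_in."""
    scale = lora_scaling(r, alpha)
    d_out, d_in = len(B), len(A[0])
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d_in)]
        for i in range(d_out)
    ]


print(lora_scaling(8, 16))                       # update scale for r=8, alpha=16
adapter = lora_param_count(1536, 1536, 8)        # one square projection (assumed shape)
print(adapter, f"{adapter / (1536 * 1536):.2%} of the full matrix")

# Tiny rank-1 example to show the shape of the update.
print(lora_delta(A=[[1.0, 2.0]], B=[[3.0], [4.0]], r=1, alpha=2))
```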

Base-to-trained KL by task family

| Task Family | Base→Trained KL | 0.5B Comparison |
| --- | --- | --- |
| json_schema | 6.47 | Largest shift at both scales |
| copying | 0.54 | |
| dead_code | 0.42 | |
| delimiter_tracking | 0.28 | |
| factual_recall | 0.16 | |
| arithmetic | 0.08 | |
| code_semantics | 0.08 | |
| code_syntax | 0.05 | Minimal shift (pre-trained skill) |

JSON schema training produces the largest distributional shift at both scales, confirming it as the skill most affected by LoRA. Code tasks (syntax, semantics) show minimal shift, suggesting these capabilities are largely pre-trained and LoRA fine-tunes them only marginally.

Advanced Interventions
Adapter ablation and the weakening norm-effect relationship at scale.

Adapter ablation: norm-effect decoupling

The correlation between adapter weight norm and ablation effect is 0.54 at 1.5B, down from 0.5B's 0.85. Adapter norms are remarkably uniform across all 28 layers (3.33–3.45), but effects still peak at late layers (L23–L27):

| Layer | Total KL | Adapter Norm |
| --- | --- | --- |
| L27 | 13.64 | 3.42 |
| L26 | 12.70 | 3.39 |
| L25 | 12.00 | 3.39 |
| L24 | 11.56 | 3.37 |
| L23 | 10.92 | 3.41 |

The weaker correlation means adapter norms are less predictive of functional impact at 1.5B. At 0.5B, norm and effect co-located at late layers (corr 0.85). At 1.5B, norms are flat but effects still peak late — the decoupling suggests the residual stream at late layers is more sensitive to perturbation, not that the weights are larger there.
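A norm-effect correlation of this kind can be computed as follows. The sketch assumes a Pearson coefficient (the text says only "correlation") and uses just the five late-layer rows tabulated above, so it will not reproduce the 0.54 figure, which covers all 28 layers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Rows L27, L26, L25, L24, L23 from the adapter ablation table.
adapter_norm = [3.42, 3.39, 3.39, 3.37, 3.41]
total_kl = [13.64, 12.70, 12.00, 11.56, 10.92]

r_late = pearson(adapter_norm, total_kl)
print(f"Pearson r over the five late layers: {r_late:.2f}")
```

With norms confined to a 3.33–3.45 band, small changes in norm carry little information, which is one reason a weak correlation is unsurprising here.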

Efficiency Insights
Layer skipping and early exit — still fatal, but with the first crack in the wall.
Two hard negative results, with one partial exception. (1) Layer skipping: top-5 overlap is 0–2.86% in 3 of 4 configurations (mean KL 9.15–9.50). The exception is skip_mid_5 (skipping L4–L8), which preserves 10% top-5 overlap at mean KL 6.58, the first hint of mid-layer redundancy at scale. (2) Early exit: 0–7.14% argmax match at all exit points; no configuration produces usable output.

Layer skipping configurations

| Config | Layers Skipped | Mean KL | Top-5 Overlap |
| --- | --- | --- | --- |
| skip_weakest_1 | L15 | 9.50 | 0% |
| skip_mid_5 | L4–L8 | 6.58 | 10% |
| skip_6_layers | L4,5,8,11,15,16 | 9.15 | 2.86% |
| skip_8_layers | L4,5,8,10,11,14,15,16 | 9.15 | 2.86% |
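Top-5 overlap can be read as the share of the full model's five most likely next tokens that also appear in the skipped model's top five. The sketch below implements that reading on hypothetical toy distributions; the exact aggregation across prompts used for the table is not specified in the text.

```python
def top_k_indices(probs, k=5):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def top_k_overlap(full_probs, skipped_probs, k=5):
    """Fraction of the full model's top-k tokens kept in the skipped model's top-k."""
    shared = top_k_indices(full_probs, k) & top_k_indices(skipped_probs, k)
    return len(shared) / k


# Hypothetical next-token distributions over an 8-token toy vocabulary.
full_model = [0.30, 0.25, 0.15, 0.10, 0.08, 0.05, 0.04, 0.03]
layers_skipped = [0.02, 0.03, 0.04, 0.30, 0.06, 0.25, 0.20, 0.10]

overlap = top_k_overlap(full_model, layers_skipped)
print(f"top-5 overlap: {overlap:.0%}")
```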

Early exit

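| Exit Layer | Layers Skipped | Mean KL | Argmax Match | Speedup |
| --- | --- | --- | --- | --- |
| L27 (full) | 0 | | 0% | 1.00× |
| L26 | 1 | | 0% | 1.04× |
| L25 | 2 | | 7.14% | 1.08× |
| L23 | 4 | 12.60 | 7.14% | 1.17× |
| L17 | 10 | 7.21 | 0% | 1.56× |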
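The speedup column matches an idealized layer-count ratio, total layers divided by layers actually executed. Whether the figures were derived this way or measured is an assumption on our part; the sketch simply shows that the ratio reproduces the tabulated values. It ignores embedding and unembedding cost.

```python
N_LAYERS = 28  # transformer layers in Qwen2.5-1.5B


def early_exit_speedup(n_layers, layers_skipped):
    """Idealized speedup from executing fewer layers."""
    if not 0 <= layers_skipped < n_layers:
        raise ValueError("layers_skipped must be in [0, n_layers)")
    return n_layers / (n_layers - layers_skipped)


speedups = {k: round(early_exit_speedup(N_LAYERS, k), 2) for k in (0, 1, 2, 4, 10)}
for skipped, factor in speedups.items():
    print(f"skip {skipped:>2} layers: {factor:.2f}x")
```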

The redundancy signal

The 10% top-5 overlap in skip_mid_5 is the first positive efficiency signal across both studies. At 0.5B, ALL 10 skip configurations gave 0%. While 10% is far from practical, it suggests that mid layers (L4–L8) at 1.5B may carry partially redundant computation. Structured pruning with retraining — not zero-ablation — might recover this redundancy at larger scales.

Limitations
What this atlas cannot yet claim — stated plainly.
  • Single seed — all results from one random seed. Confidence capped at MEDIUM (except L26 hub at HIGH). Multi-seed replication needed for publication.
  • VRAM-constrained training — LoRA at batch_size=1 with gradient checkpointing may produce different internal changes than higher-batch training.
  • 8 task families (vs 12 at 0.5B) — refusal/compliance, verbosity, variable renaming, and uncertainty tasks are missing, limiting direct comparison on those dimensions.
  • Steering tested at L2 only — the steering collapse may be layer-specific. Testing at L26 (the new hub) might yield different results, but was not attempted due to compute constraints.
  • No checkpoint timeline — the core circuit lock-in finding (step 10 at 0.5B) is not verified at 1.5B. Whether the two-phase training architecture persists at scale is unknown.
  • Zero ablation — creates out-of-distribution activations. Mean/resample ablation would be more principled.
  • Short synthetic prompts — 5–15 tokens. Results may not transfer to natural language or longer contexts.
  • Single model family — both 0.5B and 1.5B are Qwen2.5. Findings may be family-specific. Cross-family validation needed.
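To illustrate the zero-ablation limitation listed above, the sketch below contrasts zero ablation with mean ablation on a toy batch of activations. Mean ablation replaces each activation with the batch mean, which stays closer to values the downstream layers have seen. The helper names and toy data are hypothetical.

```python
def zero_ablate(activations):
    """Replace every activation vector with zeros."""
    return [[0.0] * len(row) for row in activations]


def mean_ablate(activations):
    """Replace every activation vector with the per-dimension mean over the batch."""
    n, dim = len(activations), len(activations[0])
    mean = [sum(row[j] for row in activations) / n for j in range(dim)]
    return [list(mean) for _ in activations]


# Hypothetical activations: 2 examples, 2 dimensions.
acts = [[1.0, 2.0], [3.0, 6.0]]

print(zero_ablate(acts))
print(mean_ablate(acts))
```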
Open Hypotheses
| ID | Hypothesis | Status |
| --- | --- | --- |
| H001 | Universal hub persists at same relative depth | REJECTED: migrates from L2 (8%) to L26 (93%) |
| H002 | Specialist attention heads emerge at scale | SUPPORTED: max head KL 22× stronger (0.046 → 1.02) |
| H003 | Steering leverage scales with capacity | REJECTED: collapses 70× (0.213 → 0.003) |
| H004 | Skill knockout selectivity preserved at scale | REJECTED: drops ~48,000× (11,654× → 0.24×) |
| H005 | Cross-model transfer zone defined by relative depth | SUPPORTED: final ~10% at both scales |
| H006 | MLP dominance scales with parameters | REJECTED: MLP effects weaken 3× despite 3× params |
| H007 | Mid-layer redundancy emerges at scale | PARTIALLY SUPPORTED: skip_mid_5 gives 10% overlap |
| H008 | Adapter norm predicts functional impact | WEAKENED: correlation drops 0.85 → 0.54 |

"Scaling from 0.5B to 1.5B does not merely amplify the atlas — it transforms it. The hub migrates, heads specialize, MLPs recede, steering collapses, and skills entangle. The causal architecture itself is scale-dependent."
