Qwen2.5-1.5B: Scaling the Atlas

TL;DR — 9 causal experiments on a 1.5B-parameter transformer (3× the 0.5B study) reveal that the universal routing hub migrates from L2 to L26 (from 8% to 93% depth), specialist attention heads emerge (max head KL 1.02, 22× stronger than 0.5B), MLP effects weaken 3× (despite 3× more parameters), steering collapses 70× (best boost 0.003 vs 0.213), and skill knockout selectivity drops ~48,000× (from 11,654× to 0.24×). Cross-model transfer shifts to L25–L27 but maintains its proportional depth invariant (final ~10%). Naive layer skipping remains fatal — but one mid-skip config shows the first hint of redundancy (10% top-5 overlap).

Model & scope: Qwen2.5-1.5B · 28 transformer layers · 12 attention heads (GQA) · d_model=1536 · vocab 151,936 · ~1.54B params · bf16 on RTX 2070 Super (8GB). 9 experiments · 8 task families · LoRA training with gradient checkpointing (batch_size=1, loss 0.0098). Same causal-intervention methodology as the 0.5B study for direct comparability.

Abstract

We present a mechanistic interpretability atlas of Qwen2.5-1.5B, a 1.5B-parameter transformer and a 3× scale-up from our prior 0.5B study. Using the same causal intervention suite (layer ablation, MLP ablation, head ablation, steering vectors, LoRA training perturbation, cross-model activation transfer, skill knockout, and adapter ablation), we map how this larger model processes information across 8 task families. We find that the universal routing hub shifts from L2 (at 0.5B) to L26 (mean ablation KL 13.70), that individual attention heads become 22× more impactful (max head KL 1.02 vs 0.046), that steering becomes 70× weaker (best boost 0.003 vs 0.213), and that skill knockout selectivity collapses from 11,654× to 0.24×. Cross-model patching recovery rises monotonically with depth, and the best transfer zone shifts from L21–L23 to L25–L27. MLP ablation is 3× weaker despite 3× more parameters. Every layer-skipping configuration remains fatal (top-5 overlap of 0–2.86%, except one mid-skip at 10%), and early exit fails at every layer tested.
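
The scale ratios quoted in the abstract can be re-derived from the paired 0.5B and 1.5B values it reports. The sketch below is only a consistency check on those published numbers; the dictionary keys and the `fold_change` helper are illustrative names, not part of the study's code.

```python
# Paired headline values copied from the abstract: (0.5B value, 1.5B value).
reported = {
    "max_head_kl": (0.046, 1.02),
    "best_steering_boost": (0.213, 0.003),
    "knockout_selectivity": (11654.0, 0.24),
    "max_mlp_kl": (8.12, 2.58),
}


def fold_change(a: float, b: float) -> float:
    """Return how many times larger the bigger of two positive values is."""
    return max(a, b) / min(a, b)


ratios = {name: fold_change(a, b) for name, (a, b) in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.1f}x")
```

The printed fold changes line up with the rounded figures in the text (22×, 70×, ~48,000×, 3×).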

Key Findings

The six headline results, each with its confidence level and evidence.

L26 is the universal hub, migrated from L2 · Confidence: High

Layer-level ablation · 8 task families · 28 layers

Zero-ablating Layer 26 causes the largest mean KL (13.70 nats) across all families. The hub shifts from L2 (8% depth at 0.5B) to L26 (93% depth at 1.5B). Secondary hubs: L6 (13.28), L14 (13.01) — the hub role is now distributed across three layers, not singular.

Specialist attention heads emerge · Confidence: Medium

Head ablation · 6 families · 12 heads × 28 layers

Max single-head KL is 1.02 (L0 H3 on arithmetic), 22× stronger than 0.5B's 0.046. L0 H3 also ranks highly on code syntax and copying, which makes it a candidate cross-task specialist. At 0.5B, head effects were diffuse and roughly 200× smaller than layer effects; at 1.5B, the gap narrows to about 13× (13.70 / 1.02).

Steering collapses 70× · Confidence: Medium

Steering sweep · L2 · factual recall direction

Best probability boost is 0.003, compared to 0.5B's 0.213: a 70× reduction. For "The capital of Italy is ", even at the optimal strength (−2.0), target probability only reaches 0.010 from a baseline of 0.0009. Linear steering at L2 is nearly inert at 1.5B; one reading is that representations are too entangled for single-direction manipulation.

Skill knockout: 48,000× less selective · Confidence: Medium

Negative steering · factual_recall · 5 layers

Best selectivity is L21 at 0.24×, meaning the knockout suppresses non-skill tokens about 4× more than skill tokens (anti-selective). At 0.5B, L19 achieved 11,654× selectivity. The other four tested layers show negative selectivity (−1.6 to −77.4). This suggests skills are far more entangled at 1.5B.

Cross-model transfer shifts to L25–L27 · Confidence: Medium

Trained→base activation transfer · 28 layers

Recovery increases monotonically: L27 = 99.9%, L26 = 99.2%, L25 = 98.7%, L24 = 97.9%. The transfer zone shifts 4 layers deeper from 0.5B's L21–L23, but maintains the same proportional depth invariant: final ~10% of layers at both scales. Recovery is more uniform — L24 reaches 97.9% vs 0.5B's L20 at 87%.

MLP effects weaken 3× at scale · Confidence: Medium

MLP ablation · 8 families · 28 layers

Max MLP KL is 2.58 (L0 MLP), down from 0.5B's 8.12 (L2 MLP): a 3× reduction despite 3× more parameters. An early-layer MLP is the strongest at both scales (L2 at 0.5B, L0 at 1.5B), but the strongest MLP carries only 19% of the top layer's effect (2.58 / 13.70), versus 42% at 0.5B. The causal weight appears to shift from MLPs toward attention heads at scale.

Architecture Map

The hub migrates to late layers, attention heads gain specialization, and MLPs lose relative dominance.

Layer ablation: the hub migration

At 0.5B (24 layers), L2 was the singular universal hub at 8% depth. At 1.5B (28 layers), the hub role is distributed across three layers spanning the network:

Layer	Mean KL	Depth	Role
L26	13.70	93%	Primary hub — late integration & routing
L6	13.28	21%	Secondary hub — early integration
L14	13.01	50%	Secondary hub — mid integration
L5	10.44	18%	Supporting early layer
L9	10.34	32%	Supporting mid layer

The top 3 layers are within 0.69 nats of each other — no single layer dominates as overwhelmingly as L2 did at 0.5B. This suggests the model distributes routing across depth rather than concentrating it in one layer.
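
The depth percentages in the table are consistent with dividing the layer index by the layer count (24 at 0.5B, 28 at 1.5B). That formula is an inference from the reported numbers rather than a stated definition; the sketch below reproduces the percentages and the 0.69-nat spread under that assumption.

```python
def relative_depth(layer_index: int, num_layers: int) -> float:
    """Assumed definition: layer index as a fraction of total layer count."""
    return layer_index / num_layers


# (layer index, number of layers) for the hubs named in the text.
hubs = {
    "0.5B L2": (2, 24),
    "1.5B L26": (26, 28),
    "1.5B L6": (6, 28),
    "1.5B L14": (14, 28),
}

depth_percent = {
    name: round(100 * relative_depth(idx, n)) for name, (idx, n) in hubs.items()
}

# Spread between the strongest and third-strongest layer (mean KL, nats).
top3_mean_kl = [13.70, 13.28, 13.01]
spread = round(max(top3_mean_kl) - min(top3_mean_kl), 2)

print(depth_percent, spread)
```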

Component atlas

Component	Role	Evidence
L26 (residual)	Universal routing hub — late integration	Ablation KL 13.70 mean across all families
L0 MLP	Strongest MLP, but 3× weaker than 0.5B's L2 MLP	MLP ablation KL 2.58
L0 H3 (attention head)	Cross-task specialist — arithmetic, code syntax, copying	Head ablation KL 1.02 (arithmetic), 0.35 (code), 0.26 (copying)
L27	Final-layer transfer point — 99.9% cross-model recovery	Cross-model patching
L21	Best (but weak) skill knockout point	Skill knockout selectivity 0.24×
Heads (12, GQA)	Partially specialized — max KL 1.02, 22× stronger than 0.5B	Head ablation

Component Mapping

Layer, MLP, and head ablation results — how causal structure redistributes at 1.5B.

Layer ablation

Zero-ablating each of the 28 layers and measuring mean KL across 8 task families reveals a multi-peaked importance distribution. Unlike 0.5B's sharp L2 peak, 1.5B shows three near-equal peaks (L26, L6, L14) with a long tail of supporting layers. The mean per-layer KL ranges from 5.40 (L23) to 13.70 (L26).
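
To make the metric concrete, here is a minimal toy sketch of zero-ablation scored by KL divergence. It is not the study's code: the "model" is just a sum of fixed per-layer residual updates followed by an unembedding, all numbers are made up, and the KL direction (clean relative to ablated) is an assumption. In a real transformer, ablating a layer also changes what every later layer computes, which this toy ignores.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def forward(layer_updates, unembed, ablate=None):
    """Sum residual updates (skipping layer `ablate`), then project to logits."""
    resid = [0.0] * len(layer_updates[0])
    for idx, update in enumerate(layer_updates):
        if idx == ablate:
            continue  # zero-ablation: this layer writes nothing to the stream
        resid = [r + u for r, u in zip(resid, update)]
    return [sum(w * r for w, r in zip(row, resid)) for row in unembed]


# Hypothetical toy values: 3 layers, residual width 2, vocabulary of 3 tokens.
layer_updates = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

clean = softmax(forward(layer_updates, unembed))
kl_per_layer = [
    kl_divergence(clean, softmax(forward(layer_updates, unembed, ablate=i)))
    for i in range(len(layer_updates))
]
most_important = max(range(len(kl_per_layer)), key=kl_per_layer.__getitem__)
print(kl_per_layer, most_important)
```

Averaging such per-layer KL values over prompts and task families would give a per-layer importance profile of the kind summarized above.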

MLP ablation

MLP-specific ablation shows L0 MLP dominates (KL 2.58), with L1 (1.83) and L27 (1.17) following. The MLP contribution is 3× weaker than at 0.5B (max 2.58 vs 8.12). Late MLPs (L27) gain relative importance — third place at 1.5B, absent from the top 5 at 0.5B. This suggests MLPs play a smaller role in the larger model's computation, with attention picking up the slack.

Head ablation

The emergence of specialist heads is the most notable component-level change. At 0.5B, the maximum head effect was 0.046 — all heads were near-equal and negligible. At 1.5B, the maximum reaches 1.02, with identifiable specialists:

L0 H3 — arithmetic (1.02), code syntax (0.35), copying (0.26): a cross-task computational specialist
L0 H6 — code syntax (0.51), delimiter tracking (0.44): a structural/syntax specialist
L0 H10 — delimiter tracking (0.44): a format specialist
L0 H5 — factual recall (0.20): a knowledge-access specialist

All top heads are in L0, suggesting that the first attention layer develops specialized routing at 1.5B — a role that was distributed across all heads at 0.5B.
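
One common way to ablate a single head is to zero its slice of the concatenated head outputs before the output projection. The sketch below shows only that indexing step on a plain list. It assumes each head owns a contiguous block of width d_model / num_heads (1536 / 12 = 128); whether the study ablated heads at exactly this point is not stated.

```python
D_MODEL = 1536
NUM_HEADS = 12
HEAD_DIM = D_MODEL // NUM_HEADS  # 128, assuming equal-width contiguous heads


def head_slice(head: int) -> slice:
    """Positions owned by `head` in the concatenated attention output."""
    start = head * HEAD_DIM
    return slice(start, start + HEAD_DIM)


def ablate_head(concat_heads, head: int):
    """Return a copy of the concatenated head outputs with one head zeroed."""
    out = list(concat_heads)
    out[head_slice(head)] = [0.0] * HEAD_DIM
    return out


# Dummy activation vector; a real one would come from the attention module.
concat = [1.0] * D_MODEL
ablated = ablate_head(concat, head=3)  # H3, as in "L0 H3"
print(sum(ablated))
```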

Causal Interventions

Steering, cross-model patching, and skill knockout — what scales and what collapses.

Steering: the 70× collapse

Steering L2 with a factual recall direction (sv_norm 9.625) produces a dramatically weaker effect than at 0.5B. The best probability boost is 0.003, compared to 0.5B's 0.213 — a 70× reduction. For "The capital of Italy is ", the target probability moves from 0.000919 to at most 0.010315 (at strength −2.0). For "The capital of Spain is ", it moves from 0.000038 to 0.000353.

The KL divergence at moderate steering strengths (±0.5 to ±1.0) is minimal (0.004–0.036), with meaningful KL only at extreme strengths (0.76 at −4.0). This suggests the model's representations are too entangled for a single linear steering direction to produce meaningful behavioral change.
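
For readers unfamiliar with the intervention, linear steering adds a scaled direction to a hidden state and measures how the target token's probability moves. The toy sketch below illustrates a strength sweep; the hidden state, direction, unembedding, and strength grid are all hypothetical, and the sign convention need not match the study's (where the best strength was −2.0).

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer(hidden, direction, strength):
    """Add `strength` times the steering direction to the hidden state."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def target_prob(hidden, unembed, target):
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)[target]


# Hypothetical toy setup: width-3 hidden state, identity unembedding.
hidden = [0.2, -0.1, 0.4]
direction = [1.0, 0.0, 0.0]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
strengths = [-4.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 4.0]

sweep = {s: target_prob(steer(hidden, direction, s), unembed, target=0) for s in strengths}
boost = {s: p - sweep[0.0] for s, p in sweep.items()}  # change vs. unsteered
best_strength = max(boost, key=boost.get)
print(best_strength, boost[best_strength])
```

In this toy the direction aligns perfectly with the target logit, so the boost is large; the 1.5B result corresponds to the opposite regime, where the best boost over the sweep stays near zero.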

Critical scaling finding. The steering leverage that made skill manipulation practical at 0.5B (3.3× boost, 11,654× knockout selectivity) is largely unavailable at 1.5B. Linear steering may be fundamentally limited at scale — non-linear interventions (SAE-based steering, distributed steering, activation editing) may be required.

Cross-model patching: proportional shift

Layer	Mean Recovery	0.5B Equivalent
L27	99.9%	L23 (100%)
L26	99.2%	L22 (97%)
L25	98.7%	L21 (95%)
L24	97.9%	L20 (87%)
L23	97.4%	—
L22	95.5%	—

The top transfer layers are the final 3 layers at both scales (3 of 24 at 0.5B, 3 of 28 at 1.5B), which is what we call the proportional depth invariant (roughly the final 10% of layers). Recovery is more uniform at 1.5B: L24 achieves 97.9% versus 0.5B's L20 at 87%, suggesting a wider transfer zone at scale.

Skill knockout: entanglement at scale

Layer	Selectivity	Skill Drop	Non-Skill Drop	SV Norm
L2	−24.55	−0.0095	−0.0004	8.88
L3	−77.41	−0.0099	−0.0001	10.63
L16	−9.31	−0.0033	+0.0004	26.13
L19	−1.62	−0.0861	+0.0532	36.50
L21	+0.24	+0.0011	+0.0044	49.50

The best selectivity (L21: 0.24×) means the skill drop is only 24% of the non-skill drop — the knockout is anti-selective. At 0.5B, L19 achieved 11,654× selectivity (skill suppressed 11,654× more than non-skill). The ~48,000× collapse suggests skills are deeply entangled in the 1.5B model's representations and cannot be isolated with linear steering.
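
The selectivity column is not defined explicitly. The values in the table are consistent with skill drop divided by the absolute non-skill drop, so the sketch below uses that formula as an assumption. With the 4-decimal drops shown in the table it matches L19 and L21 closely and reproduces the sign of every row; the remaining gaps for L2, L3, and L16 are plausibly rounding in the tiny non-skill drops.

```python
# (layer, skill_drop, non_skill_drop, reported_selectivity) from the table.
rows = [
    ("L2", -0.0095, -0.0004, -24.55),
    ("L3", -0.0099, -0.0001, -77.41),
    ("L16", -0.0033, 0.0004, -9.31),
    ("L19", -0.0861, 0.0532, -1.62),
    ("L21", 0.0011, 0.0044, 0.24),
]


def selectivity(skill_drop: float, non_skill_drop: float) -> float:
    """Assumed definition: skill drop relative to the size of the non-skill drop."""
    return skill_drop / abs(non_skill_drop)


computed = {layer: selectivity(s, n) for layer, s, n, _ in rows}

for layer, _, _, reported in rows:
    print(f"{layer}: computed {computed[layer]:+.2f}, reported {reported:+.2f}")

# Scale comparison quoted in the text: 0.5B best vs 1.5B best.
collapse = 11654 / 0.24
print(f"collapse: ~{collapse:,.0f}x")
```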

Training Perturbation

LoRA training at 1.5B scale — convergence, VRAM constraints, and base-to-trained divergence.

LoRA training (r=8, alpha=16, target q/k/v/o_proj) on the 1.5B model required batch_size=1 and gradient checkpointing to fit in 8GB VRAM — more constrained than 0.5B's batch_size=2. Training converged to loss 0.0098.
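
As a rough sense of scale, the number of trainable LoRA parameters for this configuration can be estimated by hand. The sketch assumes grouped-query attention with 2 key/value heads of width 128 (so k_proj and v_proj map 1536 to 256), which is not stated in this write-up, and counts only the two low-rank matrices per adapted projection.

```python
RANK = 8
D_MODEL = 1536
NUM_LAYERS = 28
KV_DIM = 2 * 128  # assumption: 2 KV heads x head_dim 128 under GQA

# (input width, output width) of each adapted projection.
projections = {
    "q_proj": (D_MODEL, D_MODEL),
    "k_proj": (D_MODEL, KV_DIM),
    "v_proj": (D_MODEL, KV_DIM),
    "o_proj": (D_MODEL, D_MODEL),
}


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to one linear layer."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in projections.values())
total = per_layer * NUM_LAYERS
fraction = total / 1.54e9  # relative to the ~1.54B base parameters

print(per_layer, total, f"{100 * fraction:.3f}%")
```

Under these assumptions the adapter is a small fraction of one percent of the base model, so the 8GB constraint comes mainly from activations and base weights rather than from the adapter itself.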

Base-to-trained KL by task family

Task Family	Base→Trained KL	0.5B Comparison
json_schema	6.47	Largest shift at both scales
copying	0.54	—
dead_code	0.42	—
delimiter_tracking	0.28	—
factual_recall	0.16	—
arithmetic	0.08	—
code_semantics	0.08	—
code_syntax	0.05	Minimal shift (pre-trained skill)

JSON schema training produces the largest distributional shift at both scales, confirming it as the skill most affected by LoRA. Code tasks (syntax, semantics) show minimal shift, suggesting these capabilities are largely pre-trained and LoRA fine-tunes them only marginally.

Advanced Interventions

Adapter ablation and the weakening norm-effect relationship at scale.

Adapter ablation: norm-effect decoupling

The correlation between adapter weight norm and ablation effect is 0.54 at 1.5B, down from 0.5B's 0.85. Adapter norms are remarkably uniform across all 28 layers (3.33–3.45), but effects still peak at late layers (L23–L27):

Layer	Total KL	Adapter Norm
L27	13.64	3.42
L26	12.70	3.39
L25	12.00	3.39
L24	11.56	3.37
L23	10.92	3.41

The weaker correlation means adapter norms are less predictive of functional impact at 1.5B. At 0.5B, norm and effect co-located at late layers (corr 0.85). At 1.5B, norms are flat but effects still peak late — the decoupling suggests the residual stream at late layers is more sensitive to perturbation, not that the weights are larger there.
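
The norm-effect correlation can be illustrated with a plain Pearson coefficient. The sketch below runs it on only the five rows listed in the table, so its result is not expected to equal the reported 0.54, which we take to be computed over all 28 layers (an assumption; the per-layer data is not reproduced here).

```python
import math

# Top-5 rows from the table: (layer, total KL, adapter norm).
rows = [
    ("L27", 13.64, 3.42),
    ("L26", 12.70, 3.39),
    ("L25", 12.00, 3.39),
    ("L24", 11.56, 3.37),
    ("L23", 10.92, 3.41),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


norms = [norm for _, _, norm in rows]
effects = [kl for _, kl, _ in rows]
r_top5 = pearson(norms, effects)
print(f"Pearson r on the five listed layers: {r_top5:.2f}")
```

Because the norms span only 3.37 to 3.42 in these rows, small differences in norm carry little information about effect size, which is the decoupling described above.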

Efficiency Insights

Layer skipping and early exit — still fatal, but with the first crack in the wall.

Two hard negative results, with one partial exception. (1) Layer skipping: top-5 overlap is 0–2.86% in 3 of 4 configurations (mean KL 9.15–9.50). The exception is skip_mid_5 (skipping L4–L8), which preserves 10% top-5 overlap at mean KL 6.58, the first hint of mid-layer redundancy at scale. (2) Early exit: 0–7.14% argmax match at all exit points; no configuration produces usable output.

Layer skipping configurations

Config	Layers Skipped	Mean KL	Top-5 Overlap
skip_weakest_1	L15	9.50	0%
skip_mid_5	L4–L8	6.58	10%
skip_6_layers	L4,5,8,11,15,16	9.15	2.86%
skip_8_layers	L4,5,8,10,11,14,15,16	9.15	2.86%
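
The top-5 overlap metric can be sketched as the fraction of the full model's five most likely next tokens that also appear in the skipped model's top five. That definition, and averaging it over prompts to get the table's percentages, are assumptions; the logits below are made up.

```python
def top_k(logits, k=5):
    """Indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def top_k_overlap(base_logits, skipped_logits, k=5):
    """Fraction of the base model's top-k tokens kept by the skipped model."""
    return len(top_k(base_logits, k) & top_k(skipped_logits, k)) / k


# Hypothetical logits over a 10-token vocabulary.
base = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
skipped = [0.0, 1.0, 2.0, 3.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0]

overlap = top_k_overlap(base, skipped)
print(f"top-5 overlap: {overlap:.0%}")
```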

Early exit

Exit Layer	Layers Skipped	Mean KL	Argmax Match	Speedup
L27 (full)	0	∞	0%	1.00×
L26	1	∞	0%	1.04×
L25	2	∞	7.14%	1.08×
L23	4	12.60	7.14%	1.17×
L17	10	7.21	0%	1.56×
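
The Speedup column matches an idealized estimate in which cost is proportional to the number of layers actually run, that is 28 / (28 − layers skipped). Whether the column was measured or derived this way is not stated, so treat the formula as an assumption that happens to reproduce the table.

```python
NUM_LAYERS = 28


def ideal_speedup(layers_skipped: int, num_layers: int = NUM_LAYERS) -> float:
    """Idealized speedup if runtime scales with the number of layers executed."""
    return num_layers / (num_layers - layers_skipped)


# Layers skipped at each exit point, taken from the table.
exit_points = {"L27": 0, "L26": 1, "L25": 2, "L23": 4, "L17": 10}
speedups = {name: round(ideal_speedup(k), 2) for name, k in exit_points.items()}
print(speedups)
```

This ignores embedding, unembedding, and memory-bound overheads, so real wall-clock gains would likely be smaller.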

The redundancy signal

The 10% top-5 overlap in skip_mid_5 is the first positive efficiency signal across both studies. At 0.5B, all 10 skip configurations gave 0%. While 10% is far from practical, it suggests that mid layers (L4–L8) at 1.5B may carry partially redundant computation. Structured pruning with retraining, rather than zero-ablation, might recover this redundancy at larger scales.

Limitations

What this atlas cannot yet claim — stated plainly.

Single seed — all results from one random seed. Confidence capped at MEDIUM (except L26 hub at HIGH). Multi-seed replication needed for publication.
VRAM-constrained training — LoRA at batch_size=1 with gradient checkpointing may produce different internal changes than higher-batch training.
8 task families (vs 12 at 0.5B) — refusal/compliance, verbosity, variable renaming, and uncertainty tasks are missing, limiting direct comparison on those dimensions.
Steering tested at L2 only — the steering collapse may be layer-specific. Testing at L26 (the new hub) might yield different results, but was not attempted due to compute constraints.
No checkpoint timeline — the core circuit lock-in finding (step 10 at 0.5B) is not verified at 1.5B. Whether the two-phase training architecture persists at scale is unknown.
Zero ablation — creates out-of-distribution activations. Mean/resample ablation would be more principled.
Short synthetic prompts — 5–15 tokens. Results may not transfer to natural language or longer contexts.
Single model family — both 0.5B and 1.5B are Qwen2.5. Findings may be family-specific. Cross-family validation needed.

Open Hypotheses

ID	Hypothesis	Status
H001	Universal hub persists at same relative depth	REJECTED — migrates from L2 (8%) to L26 (93%)
H002	Specialist attention heads emerge at scale	SUPPORTED — max head KL 22× stronger (0.046 → 1.02)
H003	Steering leverage scales with capacity	REJECTED — collapses 70× (0.213 → 0.003)
H004	Skill knockout selectivity preserved at scale	REJECTED — drops ~48,000× (11,654× → 0.24×)
H005	Cross-model transfer zone defined by relative depth	SUPPORTED — final ~10% at both scales
H006	MLP dominance scales with parameters	REJECTED — MLP effects weaken 3× despite 3× params
H007	Mid-layer redundancy emerges at scale	PARTIALLY SUPPORTED — skip_mid_5 gives 10% overlap
H008	Adapter norm predicts functional impact	WEAKENED — correlation drops 0.85 → 0.54

"Scaling from 0.5B to 1.5B does not merely amplify the atlas — it transforms it. The hub migrates, heads specialize, MLPs recede, steering collapses, and skills entangle. The causal architecture itself is scale-dependent."
