d_model=896 · d_head=64 · d_mlp=4864 · vocab 151,936 · ~0.49B params · bf16 on RTX 2070 Super (8GB). 22 experiments · 12 task families · 92 examples · 17 clean/corrupt pairs · 10 LoRA adapters · 5 training checkpoints.
We present a mechanistic interpretability atlas of Qwen2.5-0.5B, a 0.5B-parameter transformer. Using causal interventions — layer ablation, activation patching, steering vectors, and LoRA training perturbation — we map how this small model processes information across 12 task families. We find that Layer 2 acts as a universal routing hub with positional specialization (HIGH confidence), that LoRA training rewires where skills live in a task-specific manner (rejecting uniform concentration), and that a core circuit (L2/L7/L9) locks in within the first 10% of training. We demonstrate cross-model activation transfer, selective skill knockout via negative steering, and a norm-effect separation in adapter weights. Across 22 experiments, we build a reproducible causal atlas connecting behaviours to components, with implications for small-model optimization and targeted skill injection.
Finding 1: L2 is a universal importance hub with positional specialization. High
Zero-ablating Layer 2 causes the largest KL divergence across all 12 task families (0.5–11.5 nats). Key observations:
Finding 2: L0 MLP and L2 MLP are the two most important MLP components. Medium
MLP ablation reveals L2 MLP has the highest effect (max KL 11.26), with L0 MLP second. This confirms L2's role is driven by its MLP subcomponent, not just residual stream magnitude.
Finding 3: Individual head effects are small (max KL 0.046), suggesting distributed processing. Medium
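Among the heads tested, ablation effects are more than two orders of magnitude smaller than layer-level effects (max single-head KL 0.046 versus 19.11 for the top layer). No single head dominates, which suggests that attention in Qwen2.5-0.5B operates through distributed head contributions rather than specialist heads.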
Finding 4: L2 steering with factual direction causally boosts target token probability 3.3×. Medium
Steering L2 with a factual recall direction increases "Rome" probability from 0.064 to 0.213 for "capital of Italy". Negative steering suppresses it. However, extreme steering (s ≥ +2) causes degeneration (Chinese characters, repetition), indicating a finite steering budget.
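The steering mechanism can be illustrated with a minimal sketch. It is a toy, not the experiment's code: the 3-token vocabulary, the orthogonal unembedding rows, and the direction vector are all hypothetical, and the steered vector is unembedded directly instead of passing through the remaining layers as it would after an L2 intervention.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unembed(hidden, unembedding):
    # logits[j] = dot(unembedding[j], hidden)
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]

def steer(hidden, direction, strength):
    # Add strength * direction to the residual-stream vector.
    return [h + strength * d for h, d in zip(hidden, direction)]

# Hypothetical 3-token vocabulary with orthogonal unembedding rows.
UNEMBED = [
    [1.0, 0.0, 0.0],  # token 0: the target token
    [0.0, 1.0, 0.0],  # token 1
    [0.0, 0.0, 1.0],  # token 2
]
hidden = [0.0, 1.5, 1.0]     # toy residual-stream state
direction = [1.0, 0.0, 0.0]  # toy "factual" direction aligned with token 0

def target_prob(strength):
    # Probability of the target token after steering with the given strength.
    return softmax(unembed(steer(hidden, direction, strength), UNEMBED))[0]

for s in (-1.0, 0.0, 1.0, 2.0):
    print(f"s={s:+.1f}  p(target)={target_prob(s):.3f}")
```

In this toy, positive strengths raise the target probability and negative strengths lower it. The degeneration reported at large strengths arises from off-distribution activations in a real model and is not reproduced by a linear sketch like this one.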
Finding 5: Each skill concentrates in DIFFERENT layers after LoRA training. Medium
The hypothesis that training universally concentrates skills into early layers (H002) is REJECTED. Each skill family has its own concentration pattern:
Targeted intervention must be skill-specific — there is no universal "training target" layer.
Finding 8: Core circuit (L2/L7/L9) locks in by step 10 (first 10% of training). Medium
The JSON core circuit stabilizes at step 10 and drifts <1% through step 100. Loss drops from 0.587 (step 10) to 0.062 (step 100). Secondary layers (L15, L6) continue shifting (+2.85/+2.73), suggesting a two-phase training process: rapid core circuit formation followed by secondary layer refinement.
Finding 9: Adapter norms peak at late layers (L20–L23) but ablation effects peak at early layers (L0–L2). Medium
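This norm-effect separation is a key architectural finding. Training writes the largest weight changes to late layers, but the functional impact (measured by ablation) is concentrated in early layers. The original interpretation was that effects propagate upstream: the adapter modifies late layers, but the information that matters for behavior flows through early layers. The later adapter-ablation experiment (Finding 14) places the adapter's own effect at late layers (L21–L23), and hypothesis H006 was rejected on that basis, so this interpretation should be read as superseded.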
Finding 10: Adapters can be combined with varying interference. Medium
The delimiter adapter's extreme behavior may indicate format-specific overfitting. The clean stacking of factual + json suggests these skills occupy orthogonal subspaces.
Finding 12: Trained activations can partially transfer learned behavior to the base model. Medium
Cross-model patching reveals that trained model activations at specific layers can transfer learned behavior into the base model. Top transfer layers: L23 (recovery=1.000), L22 (recovery=0.966), L21 (recovery=0.947). The LoRA adapter's learned behavior is partially encoded in the activation patterns at these layers, not solely in the weight modifications.
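The recovery numbers can be read as the fraction of the base-to-trained gap that patching closes. The sketch below assumes that definition; the report does not state its exact formula, and the metric values used here are hypothetical.

```python
def recovery(base_metric, trained_metric, patched_metric):
    """Fraction of the base-to-trained gap closed by patching (assumed definition)."""
    gap = trained_metric - base_metric
    if gap == 0:
        raise ValueError("base and trained metrics are identical; recovery is undefined")
    return (patched_metric - base_metric) / gap

# Hypothetical log-probabilities of the correct token, not measured values.
base_logp = -4.0     # base model, no patching
trained_logp = -0.5  # LoRA-trained model
patched_logp = -0.7  # base model with trained activations patched in at one layer

print(f"recovery = {recovery(base_logp, trained_logp, patched_logp):.3f}")
```

Under this definition, 1.0 means the patched base model matches the trained model on the chosen metric and 0.0 means patching changed nothing.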
Finding 13: Negative steering can selectively suppress learned skills. Medium
For factual_recall, the best knockout was at L19 with selectivity ratio 11,654×. Negative steering at moderate strengths (−1.0 to −2.0) can suppress skill-specific tokens while preserving non-skill behavior. Higher strengths (−4.0 to −8.0) cause broader degradation. Learned skills can be selectively removed without full model retraining.
Finding 14: Adapter norm and ablation effect are strongly correlated, which argues against upstream propagation. Medium
The correlation between adapter weight norm and ablation effect is 0.855, a strong positive relationship: layers with larger adapter norms also show larger ablation effects. Only L12 combines a low adapter norm with a high ablation effect, the pattern upstream propagation would predict. The top adapter ablation effect layers are L23 (KL=0.872), L22 (KL=0.809), and L21 (KL=0.723), the same late layers where adapter norms peak. This contradicts hypothesis H006 (adapter weights write to late layers but the functional effects propagate through early layers): the adapter's effect sits at the late layers it modifies.
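For reference, a Pearson coefficient ranges from −1 to +1, and values near +1 mean that layers with larger adapter norms also show larger ablation effects. The sketch below shows the computation on hypothetical per-layer values; these are not the report's measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-layer values for illustration only.
adapter_norms = [1.0, 2.0, 3.0, 4.0]
ablation_kl = [0.2, 0.1, 0.6, 0.7]

print(f"r = {pearson(adapter_norms, ablation_kl):.3f}")
```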
The model has clear positional specialization across layers. Different layers route different token positions through different pathways within the same layer stack:
| Component | Role | Evidence |
|---|---|---|
| L2 (residual + MLP) | Universal routing hub · first+last position | Ablation KL 0.5–11.5 across all families |
| L0 MLP | Second-strongest across all families; absorbs JSON (+2.99) after LoRA | MLP ablation · trained-vs-base delta |
| L22 | Unembedding pathway · last-token exclusive | Position ablation · cross-model 97% recovery |
| L9 | Instruction-sensitive layer | Position ablation (first 5.66 / last 9.20) |
| L19 | Skill-specific suppression point (factual) | Skill knockout 11,654× selectivity |
| Heads (14, GQA) | Distributed: max single-head KL 0.046 (over 200× smaller than layer effects) | Head ablation |
The full layer ablation reveals a multi-peaked importance distribution. Layer 2 stands alone as the dominant hub (mean KL 19.11), but several secondary peaks appear:
| Layer | Mean KL | Role |
|---|---|---|
| L2 | 19.11 | Universal routing hub — first+last position |
| L0 | 13.52 | Second-highest; MLP-driven |
| L9 | 11.14 | Instruction-sensitive |
| L7 | 10.96 | Balanced first+last routing |
| L22 | 10.52 | Unembedding pathway (last-token exclusive) |
| L15 | 3.37 | Weakest layer — still essential for correct output |
The L2 ablation effect for each of the 12 task families (mean KL, in nats) is: [18.42, 21.48, 18.54, 16.50, 21.77, 17.31, 16.71, 20.68, 20.38, 19.12, 21.17, 17.23]. These per-family values average to the 19.11 reported in the table. The highest effects occur on factual recall (21.77) and arithmetic (21.48), while the lowest is on delimiter tracking (16.50).
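As a quick consistency check, the twelve per-family values above can be averaged to confirm that they reproduce the layer-level figure in the table. The list is copied from the text; the order of families is not stated in the source, so no names are attached.

```python
from statistics import mean

# Per-family L2 ablation effects (KL, nats), copied from the text above.
l2_effects = [18.42, 21.48, 18.54, 16.50, 21.77, 17.31,
              16.71, 20.68, 20.38, 19.12, 21.17, 17.23]

print(f"mean   = {mean(l2_effects):.2f}")
print(f"max    = {max(l2_effects):.2f}")
print(f"min    = {min(l2_effects):.2f}")
print(f"spread = {max(l2_effects) - min(l2_effects):.2f}")
```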
MLP ablation isolates the feedforward subcomponents within each layer. L2 MLP has the highest single-component effect (max KL 11.26), confirming that L2's dominance is driven by its MLP sublayer rather than just residual stream magnitude. L0 MLP is second. All other MLPs have substantially lower effects. This two-component dominance suggests that the feedforward networks at layers 0 and 2 carry a disproportionate share of the computation the model's outputs depend on.
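The ablation measurements in this section follow one pattern: run the model, zero one component's output, and compare the two next-token distributions with a KL divergence. A minimal sketch on a toy residual stack is given below. The weights, dimensions, and the choice to zero a block's additive contribution are assumptions for illustration; the actual experiments use forward hooks on the Hugging Face model.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward(hidden, layers, unembedding, ablate=None):
    """Toy residual stack; `ablate` is the index of the layer whose output is zeroed."""
    for index, weights in enumerate(layers):
        update = [math.tanh(x) for x in matvec(weights, hidden)]
        if index == ablate:
            update = [0.0] * len(update)  # zero ablation of this layer's contribution
        hidden = [h + u for h, u in zip(hidden, update)]
    return softmax(matvec(unembedding, hidden))

# Hypothetical weights: three 2x2 layers and a 3-token unembedding.
LAYERS = [
    [[0.9, -0.4], [0.3, 0.8]],
    [[0.1, 0.0], [0.0, 0.1]],
    [[-0.5, 0.7], [0.6, 0.2]],
]
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [1.0, -0.5]

clean = forward(x, LAYERS, UNEMBED)
for i in range(len(LAYERS)):
    ablated = forward(x, LAYERS, UNEMBED, ablate=i)
    print(f"layer {i}: KL(clean || ablated) = {kl_divergence(clean, ablated):.4f} nats")
```

Ranking components by this KL gives an importance map of the kind reported in the tables, with the caveat that zeroing pushes activations off-distribution.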
Head ablation across 6 layers × 14 heads finds no individual head with a large effect. The maximum single-head KL is 0.046, roughly 415× smaller than the top layer effect (19.11) and roughly 245× smaller than the top MLP effect (11.26). This points to distributed attention processing: the 14 query heads (grouped-query attention with 2 KV heads) appear to share the computational load rather than act as specialists. Because only 6 layers were tested, specialist heads in untested layers cannot be ruled out. This pattern contrasts with reports on larger models, where specific heads take on identifiable roles.
The model exhibits a clear positional architecture where different token positions are routed through distinct layer pathways:
| Position | Dominant Layer | Mean KL Effect | Interpretation |
|---|---|---|---|
| Last (prediction) | L22 | 14.55 | Unembedding gateway — exclusively affects final token |
| First (instruction) | L2 | 3.34 | Instruction routing — processes the prompt prefix |
| Last (at L2) | L2 | 5.03 | Prediction routing — processes the output prefix |
| Operator/delimiter | All layers | ~0 | Near-zero effect — operators flow through without processing |
This positional specialization suggests the model processes instruction tokens and prediction tokens through different pathways within the same layer stack, with operators and delimiters serving as near-transparent pass-throughs.
Training follows a two-phase process: a fast structural phase (up to step 10) in which the core circuit locks in, and a slow refinement phase (steps 10–100) in which skill-specific components are tuned. The practical implication is direct: early training steps establish the processing architecture, while later steps fine-tune skill-specific components. This pattern was observed for the JSON adapter on a single seed and still needs replication.
Finding 6: L0 MLP concentration peaks at r=4. Higher rank distributes rather than concentrates. Medium
| Rank | L0 MLP Effect | Total Adapter Norm | Character |
|---|---|---|---|
| r=1 | 15.77 | 6.14 | Most surgically precise |
| r=2 | — | — | Intermediate |
| r=4 | 15.77 | — | Peak L0 concentration |
| r=8 | — | — | Default config |
| r=16 | 13.94 | 22.92 | Distributed across layers |
Lower rank produces more localized adapters. At r=4, the adapter concentrates its effect at L0 MLP (one of the model's two strongest MLP components). The table reports the same L0 MLP effect (15.77) at r=1, so in the reported data the peak is shared between r=1 and r=4. At r=16, the effect distributes across multiple layers, diluting the L0 concentration (13.94). This has implications for efficient skill injection: r=4 may be the optimal precision/coverage tradeoff. Total adapter norm grows sublinearly with rank: going from 6.14 at r=1 to 22.92 at r=16 is a 3.7× increase for a 16× increase in rank, close to √r scaling. Higher rank therefore adds proportionally more parameters without proportionally more norm or impact.
Finding 7: o_proj gives the largest single-module effect; v_proj is the most efficient per parameter. Medium
| Module Config | Params | L0 Effect | Efficiency (Effect/Param) |
|---|---|---|---|
| o_proj-only | 344K | +3.64 | 10.6×10⁻⁶ per param (largest absolute effect) |
| v_proj-only | 197K | +2.75 | 14.0×10⁻⁶ per param (highest efficiency) |
| MLP-only | 3.3M | +1.92 | 0.6×10⁻⁶ per param (lowest efficiency) |
| q_proj-only | — | — | — |
| attn_all | — | — | — |
| all_linear | — | — | — |
The o_proj (output projection) writes directly to the residual stream and produces the largest L0 effect of the single-module configurations: +3.64 with only 344K parameters (0.07% of the model). Per parameter, however, v_proj is more efficient (14.0×10⁻⁶ versus 10.6×10⁻⁶), reaching +2.75 with 197K parameters. MLP-only requires 3.3M parameters (roughly 10× more than o_proj) but achieves a smaller effect (+1.92). A plausible explanation is that o_proj's output feeds immediately into the residual stream, while MLP changes are mediated through the feedforward computation.
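The parameter counts in the table can be reproduced from the model dimensions, since a LoRA adapter of rank r on a d_in × d_out projection adds r·(d_in + d_out) parameters. The sketch below rests on several assumptions not stated next to the table: 24 layers (L0 to L23), rank 8 (the "default config" row of the rank table), 2 KV heads for v_proj, and an MLP-only config that targets the gate, up, and down projections.

```python
def lora_params(d_in, d_out, rank, n_layers):
    """Parameters in LoRA factors A (rank x d_in) and B (d_out x rank), summed over layers."""
    return rank * (d_in + d_out) * n_layers

D_MODEL = 896
D_HEAD = 64
D_MLP = 4864
N_KV_HEADS = 2  # from the GQA description in the head-ablation discussion
N_LAYERS = 24   # assumption: layers are indexed L0 to L23
RANK = 8        # assumption: the default rank

params = {
    "o_proj-only": lora_params(D_MODEL, D_MODEL, RANK, N_LAYERS),
    "v_proj-only": lora_params(D_MODEL, N_KV_HEADS * D_HEAD, RANK, N_LAYERS),
    # Assumption: gate and up map d_model -> d_mlp, down maps d_mlp -> d_model,
    # so all three projections have the same LoRA parameter count.
    "MLP-only": 3 * lora_params(D_MODEL, D_MLP, RANK, N_LAYERS),
}
# L0 effects copied from the table.
l0_effect = {"o_proj-only": 3.64, "v_proj-only": 2.75, "MLP-only": 1.92}

for name, count in params.items():
    per_param = l0_effect[name] / count
    print(f"{name:12s} params={count:>9,d}  effect/param={per_param * 1e6:.1f}e-6")
```

Under these assumptions the counts come out at 344,064, 196,608, and 3,317,760, which round to the table's 344K, 197K, and 3.3M.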
Training on 5 different dataset shards and comparing the resulting component maps reveals that each skill family writes to a unique set of layers:
| Skill Family | Concentration Layers | Pattern |
|---|---|---|
| factual_recall | L3, L16, L19 | Distributed across early-mid-late |
| code_semantics | L1, L10, L21 | Spans early processing to late output |
| json_schema | L6, L12, L13 | Mid-layer concentration |
| copying | Dispersed | No single critical circuit |
| delimiter_tracking | Fully absorbed | 0 ablation sensitivity post-training |
This is the direct rejection of hypothesis H002 ("LoRA concentrates skill into early layers"). Each skill has a unique fingerprint. Delimiter tracking is particularly notable — after training, it becomes completely absorbed into the model's baseline processing (zero ablation sensitivity), meaning the model has internalized this skill so thoroughly that removing any layer no longer affects it.
Tracking the component map across 5 checkpoints (steps 10, 25, 50, 75, 100) reveals a clear two-phase training architecture:
Practical implication: Early training steps are critical for establishing the processing architecture. Short training runs (10 steps) may be sufficient for basic skill acquisition, while longer runs refine skill-specific components. This has direct implications for efficient fine-tuning budgets.
Weighted merging of adapters trained on different skills reveals which skills can coexist:
| Adapter Pair | Interaction | Effect |
|---|---|---|
| factual + json | Synergistic | +2.35 factual, +1.17 json |
| code + json | Compatible | Moderate interference |
| delimiter + any | Destructive | −7 to −16 nats degradation |
The factual + json pairing is synergistic — both skills improve when combined. This suggests they occupy orthogonal subspaces in the model's representation. The delimiter adapter is destructive when stacked, likely due to format-specific overfitting that interferes with other skills' processing patterns. Practical implication: Multi-skill models can be built by stacking compatible adapters without retraining.
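One way to picture weighted merging and the orthogonal-subspace reading is to treat each adapter as a low-rank weight update and compare the updates directly. The sketch below is an illustration under assumptions: the rank-1 factors are hypothetical, merging is taken to be a weighted sum of the updates B·A, and cosine similarity under the Frobenius inner product stands in as the overlap measure. The report does not specify its merge procedure at this level of detail.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_delta(B, A):
    """Weight update contributed by one adapter: delta_W = B @ A."""
    return matmul(B, A)

def merge(deltas, weights):
    """Weighted sum of adapter updates."""
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [[sum(w * d[i][j] for d, w in zip(deltas, weights)) for j in range(cols)]
            for i in range(rows)]

def cosine(d1, d2):
    """Cosine similarity of two updates under the Frobenius inner product."""
    flat1 = [x for row in d1 for x in row]
    flat2 = [x for row in d2 for x in row]
    dot = sum(x * y for x, y in zip(flat1, flat2))
    norm1 = math.sqrt(sum(x * x for x in flat1))
    norm2 = math.sqrt(sum(y * y for y in flat2))
    return dot / (norm1 * norm2)

# Hypothetical rank-1 adapters on a 3x3 weight matrix.
B_fact, A_fact = [[1.0], [0.0], [0.0]], [[0.0, 2.0, 0.0]]
B_json, A_json = [[0.0], [1.0], [0.0]], [[1.0, 0.0, 0.0]]

d_fact = lora_delta(B_fact, A_fact)
d_json = lora_delta(B_json, A_json)
merged = merge([d_fact, d_json], [0.5, 0.5])

print("cosine(factual, json) =", cosine(d_fact, d_json))
print("merged update =", merged)
```

A cosine near zero means the two updates write to non-overlapping directions, which is the situation in which stacking would be expected to interfere least.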
Finding 15: Naive layer skipping destroys output — all layers are necessary, even "weak" ones. High (negative result)
We tested 10 layer-skip configurations, from skipping single weak layers (L15, L4, L8) to skipping 8 layers at once. Results:
| Config | Layers Skipped | Top-5 Overlap | KL Divergence |
|---|---|---|---|
| skip_weakest_1 (L15) | 1 | 0% | 9.02 nats |
| skip_weak_1 (L4) | 1 | 0% | — |
| skip_weak_1 (L8) | 1 | 0% | — |
| skip_multiple | 2–8 | 0% | 7.9–13.8 nats |
Even skipping L15 alone (the weakest layer, with a mean ablation KL of 3.37) produces a KL of 9.02 nats and 0% top-5 overlap. The "weak" label means the layer contributes less to the KL when ablated, not that it can be safely removed. In this setup, every layer is needed for correct output.
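The top-5 overlap column can be computed as the share of the clean model's top-5 next tokens that survive in the modified model's top-5. The sketch below assumes that definition, which the report does not spell out, and uses hypothetical logit vectors.

```python
def top_k(logits, k=5):
    """Indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])

def top_k_overlap(clean_logits, modified_logits, k=5):
    """Fraction of the clean top-k tokens that remain in the modified top-k."""
    shared = top_k(clean_logits, k) & top_k(modified_logits, k)
    return len(shared) / k

# Hypothetical logits over a 10-token vocabulary.
clean_logits = [9.0, 8.0, 7.0, 6.0, 5.0, 1.0, 0.5, 0.2, 0.1, 0.0]
skipped_logits = [0.0, 0.1, 0.2, 0.5, 1.0, 5.0, 6.0, 7.0, 8.0, 9.0]

print(top_k_overlap(clean_logits, clean_logits))
print(top_k_overlap(clean_logits, skipped_logits))
```

An overlap of 0% therefore means that none of the five tokens the intact model ranks highest appear among the modified model's five highest.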
Finding 16: Early exit at L22 (unembedding layer) does not work naively. High (negative result)
Despite L22 being the unembedding pathway (97% recovery in cross-model patching), projecting L22's hidden state through the lm_head gives:
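The reason is that every layer transforms the residual stream. L22's hidden state is the input to L23, not the final representation. The lm_head expects L23's output after the final normalization, so L22's output cannot be projected directly to the vocabulary. The 97% patching recovery does not contradict this, because in cross-model patching L23 still runs on the patched state.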
The efficiency experiments are negative for naive inference optimization, but the atlas reveals where efficiency gains ARE possible:
The following null and negative results are reported to avoid publication bias:
Key methodological decisions and their rationale:
| ID | Decision | Rationale |
|---|---|---|
| D001 | HF native hooks instead of TransformerLens | GQA incompatibility — TransformerLens does not support Qwen2.5's grouped-query attention |
| D002 | LoRA instead of full SFT | VRAM constraint — full fine-tuning OOMs on 8GB RTX 2070 Super |
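| D003 | Zero ablation instead of mean | Simpler implementation; zeroing pushes activations off-distribution, so effects are typically larger than under mean ablation and are best read as upper bounds |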
| D004 | Single seed | VRAM budget limits throughput — multi-seed replication needed |
| D005 | Short synthetic prompts (5–15 tokens) | Cleaner interpretability — natural language introduces confounds |
| D006 | Aero as primary compute host | RTX 2070 Super (8GB) — only available GPU |
| D007 | Bundle-based GitHub push | Aero has no GitHub authentication configured |
All 22 experiments are fully reproducible via the scripts in the repository:
python scripts/run_*.py (17 scripts total, from baseline through adapter ablation).
python scripts/generate_publication_report.py.
| ID | Hypothesis | Status |
|---|---|---|
| H001 | L2 is a general-purpose routing hub | SUPPORTED (with positional nuance) |
| H002 | LoRA concentrates skill into early layers | REJECTED (skill-specific) |
| H003 | Higher rank distributes skill | SUPPORTED |
| H004 | o_proj is the key skill injection pathway | SUPPORTED for JSON |
| H005 | Factual and algorithmic tasks use different circuits | WEAKENED (both depend on L2) |
| H006 | Adapter norms write late, effects propagate upstream | UPDATED → REJECTED (effect IS at late layers, corr 0.85) |
| H007 | L22 is the unembedding pathway | SUPPORTED (last-position exclusive) |
"The efficiency experiments are negative for naive inference optimization — but the atlas reveals exactly where efficiency gains are possible: training, parameters, and selective skill manipulation."