Phase 2: Repeatable Small-Model Surgery

Phase 3 Revision (June 2026): The 1.5B hub was originally reported as L26 (93% depth) based on testing with fewer task families. Phase 3 multi-seed replication with the full 12-family suite revealed the true hub is L14 (50% depth). Hub migration pattern: L2 (8%) → L14 (50%) → L34 (94%). All replicated across 3 seeds (std=0.0). See Phase 3 for details.

🧭

Phase 2 thesis: Small language models have scale-dependent control surfaces. A behaviour may be localized, steerable, and surgically editable at one scale, but distributed, entangled, and resistant at another. Therefore specialist small models need causal atlases, not just benchmark scores.

Bottom line: Confirmed with a Phase 3 revision. The universal hub migrates from L2 (8% depth) at 0.5B to L14 (50%) at 1.5B to L34 (94%) at 3B when tested with the full task suite. Steering doesn't collapse at scale — it migrates and gets stronger. Each model size needs its own atlas.

The Hub Migration

The dominant causal hub changes with scale. Phase 2 found the migration; Phase 3 revised the 1.5B location after broadening the task suite.

Qwen2.5-0.5B

L2

8% of depth · 24 layers

→

Qwen2.5-1.5B

L14

50% of depth · 28 layers

→

Qwen2.5-3B

L34

94% of depth · 36 layers
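The depth percentages in this diagram are the hub layer index divided by the model's layer count. A minimal sketch of that arithmetic, assuming the L-numbers are 0-based layer indices (`depth_percent` is an illustrative helper, not part of the project code):

```python
def depth_percent(layer_index: int, num_layers: int) -> int:
    """Return a layer's relative depth as a rounded percentage of the layer count."""
    if not 0 <= layer_index < num_layers:
        raise ValueError("layer_index must be in [0, num_layers)")
    return round(100 * layer_index / num_layers)


# Hub layers and layer counts as reported in this section.
hubs = {
    "Qwen2.5-0.5B": (2, 24),
    "Qwen2.5-1.5B": (14, 28),
    "Qwen2.5-3B": (34, 36),
}

for model, (layer, total) in hubs.items():
    print(f"{model}: L{layer} sits at {depth_percent(layer, total)}% of depth")
```

The same helper reproduces the earlier Phase 2 figure for the 1.5B model: `depth_percent(26, 28)` gives 93.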

⚠️

Cross-family exception: SmolLM2-1.7B showed a flat Phase 2 ablation profile where no layer was clearly more important than the others. Hub position is architecture-specific, not purely depth-dependent. Each model family needs its own atlas.

Headline Findings

8 findings from Phase 2, ranked by evidence strength.

NEW FINDING

Hub migration is real and monotonic

0.5B: L2 (8% depth) → 1.5B: L14 (50%, Phase 3 revision from the narrower Phase 2 L26 result) → 3B: L34 (94%). At 3B, the hub is the second-to-last layer. Top 5 layers at 3B in Phase 2: L34 (221.3), L27 (192.3), L13 (187.5), L22 (171.5), L18 (160.7). Evidence: layer ablation across 3 scales, revised by Phase 3 full-suite replication.
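The 3B score for L34 matches the "Hub max KL" row of the cross-scale table, which suggests each layer score is a KL divergence between the intact model's next-token distribution and the ablated model's. How scores are aggregated over prompts and positions is not stated here, so the following is only a sketch of the per-position quantity, with made-up logits:

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) when q has zero entries."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy logits over a 3-token vocabulary (illustrative values, not measurements).
baseline = softmax([2.0, 1.0, 0.1])
ablated = softmax([0.1, 1.0, 2.0])

print(f"KL(baseline || ablated) = {kl_divergence(baseline, ablated):.3f} nats")
```

A layer whose ablation barely changes the output distribution scores near zero, while a hub layer scores high.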

NEW FINDING

Steering did NOT collapse at 1.5B — it migrated and got 3× stronger

Phase 1 claimed steering "collapsed" at 1.5B (0.003 boost vs 0.213 at 0.5B). Phase 2 tested the RIGHT layers. Results: L6 = +3.14, L21 = +4.64, L26 = −5.88, multi-layer = −7.20. Best single-layer boost at 0.5B was only +2.16 (L12). Steering leverage INCREASES with scale when targeting the correct hub layers. The earlier "collapse" was testing the wrong layer (L2).

CONFIRMED

Layer skipping remains fatal at all scales

0% top-5 overlap for all skip configurations at 0.5B, 1.5B, and 3B. Every layer is necessary. Naive inference compression via layer removal is invalid. Structured pruning with recovery training is the only viable path. Evidence: layer skipping tests across 3 scales.
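The "top-5 overlap" metric is not defined in this section. One plausible reading is the fraction of the intact model's five most likely next tokens that remain in the top five after layers are skipped. A sketch under that assumption, with toy logits:

```python
def top_k_ids(logits, k=5):
    """Return the set of indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def top_k_overlap(baseline_logits, modified_logits, k=5):
    """Fraction of the baseline top-k token ids that survive in the modified top-k."""
    base = top_k_ids(baseline_logits, k)
    modified = top_k_ids(modified_logits, k)
    return len(base & modified) / k


# Toy 10-token vocabulary (illustrative values only).
intact = [9.0, 8.0, 7.0, 6.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0]
skipped = [0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 6.0, 7.0, 8.0, 9.0]

print(top_k_overlap(intact, intact))
print(top_k_overlap(intact, skipped))
```

Under this reading, 0% overlap means none of the intact model's top-five candidates appears in the skipped model's top five.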

CONFIRMED

L0 MLP is always the top MLP

Across 0.5B, 1.5B, and 3B, L0 MLP consistently has the largest ablation effect among MLP blocks. Early processing matters regardless of scale. This is the one structural invariant found across scales.

NEW FINDING

Cross-family hubs are architecture-specific

SmolLM2-1.7B (24 layers) did not show a Qwen-like hub under the Phase 2 reduced atlas: the layer profile was effectively flat, so the recorded L0 winner should be treated as a tie artifact. This is fundamentally different from Qwen's hub at L2 (0.5B), L14 (1.5B), or L34 (3B). Hub position depends on architecture, task suite, and measurement sensitivity. Evidence: cross-family reduced atlas.

PARTIALLY CONFIRMED

Ablation method affects absolute effect sizes but preserves rank ordering

6 ablation types compared (zero, mean, resample, clean→corrupt, corrupt→clean, random patch). Zero ablation gives the largest effects. Mean ablation and resample give smaller but correlated effects. Top-3 layer ranking is preserved across methods for most task families. Evidence: ablation controls at 0.5B (24 layers × 6 types) and 1.5B (28 layers × 6 types).
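The three simplest ablation types differ only in what replaces the activation being knocked out. The sketch below shows that difference on plain Python lists, plus a top-3 ranking comparison across methods. The patching variants (clean→corrupt, corrupt→clean, random patch) are not covered, and all numbers are hypothetical placeholders, not results from these experiments:

```python
import random


def zero_ablate(activation):
    """Replace the activation with zeros."""
    return [0.0] * len(activation)


def mean_ablate(activation, reference_batch):
    """Replace the activation with the per-dimension mean over a reference batch."""
    count = len(reference_batch)
    return [
        sum(ref[i] for ref in reference_batch) / count
        for i in range(len(activation))
    ]


def resample_ablate(activation, reference_batch, rng):
    """Replace the activation with one drawn from a different prompt's activations."""
    return list(rng.choice(reference_batch))


def top_k_layers(effects, k=3):
    """Return the k layer names with the largest effect, in descending order."""
    return sorted(effects, key=effects.get, reverse=True)[:k]


def same_top_k(effects_by_method, k=3):
    """True if every ablation method produces the same ordered top-k layers."""
    rankings = [top_k_layers(effects, k) for effects in effects_by_method.values()]
    return all(ranking == rankings[0] for ranking in rankings)


# Hypothetical per-layer effects: absolute sizes differ, rank order does not.
toy_effects = {
    "zero": {"L0": 9.0, "L1": 4.0, "L2": 7.0, "L3": 1.0},
    "mean": {"L0": 3.0, "L1": 1.5, "L2": 2.5, "L3": 0.4},
}

reference = [[1.0, 2.0], [3.0, 4.0]]
print(zero_ablate([5.0, 6.0]))
print(mean_ablate([5.0, 6.0], reference))
print(resample_ablate([5.0, 6.0], reference, random.Random(0)))
print(same_top_k(toy_effects))
```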

PARTIALLY CONFIRMED

code_semantics is the most surgically separable skill

Skill Separability Score (SSS): code_semantics = 0.361 (best), json_schema = 0.221, copying = 0.219, factual_recall = 0.218, delimiter_tracking = 0.215. Code semantics has the highest insertion gain (+3.70) and moderate localization sharpness. Collateral damage is zero for all skills tested. Evidence: skill separability benchmark on 0.5B, 5 skills, 5 operations each.

PARTIALLY CONFIRMED

Hub stability varies by task across prompt lengths

Long-task robustness tested at short (5–30 tokens), medium (30–150), and long (150–1000). Hub is stable for JSON and code tasks but shifts for factual recall and copying. Steering effectiveness degrades with prompt length for factual recall. LoRA adapter loading fails across model sizes (adapter is model-specific). Evidence: Block I robustness tests on 0.5B and 1.5B.
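The three length ranges share their endpoints (30 and 150 tokens), so any bucketing code has to pick a side. The sketch below assumes each boundary belongs to the longer bucket; that choice, and the `length_bucket` helper itself, are assumptions rather than the project's definition:

```python
def length_bucket(num_tokens: int) -> str:
    """Assign a prompt to a length bucket; shared boundaries go to the longer bucket."""
    if 5 <= num_tokens < 30:
        return "short"
    if 30 <= num_tokens < 150:
        return "medium"
    if 150 <= num_tokens <= 1000:
        return "long"
    return "out_of_range"


for n in (5, 29, 30, 149, 150, 1000, 1001):
    print(n, length_bucket(n))
```

Grouping steering results by bucket makes it easier to see whether an effect measured on short prompts still holds at the lengths used in deployment.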

Cross-Scale Metrics

Key metrics across 0.5B, 1.5B, and 3B.

Metric	0.5B	1.5B	3B	Trend
Architecture	24L, 14H, d=896	28L, 12H, d=1536	36L, 16H, d=2048	Deeper, wider
Universal hub layer	L2	L14	L34	Nonlinear migration
Hub depth (%)	8%	50%	94%	Nonlinear migration
Hub max KL	19.11	13.70	221.34	Dips at 1.5B, then ~16× higher at 3B
MLP max effect (L0)	8.12	2.58	pending	3× weaker at 1.5B
Head max effect	0.046	1.023	pending	22× stronger at 1.5B
Best steering boost	+2.16	+4.64	pending	2× stronger at 1.5B
Multi-layer steering	+1.17	−7.20	pending	~6× larger magnitude at 1.5B, sign reversed
Layer skipping	0% overlap	0% overlap	0% overlap	Always fatal
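The multipliers in the Trend column can be checked directly from the 0.5B and 1.5B columns. A short sketch of that arithmetic, comparing magnitudes and flagging sign changes separately (`fold_change` and `sign_flipped` are illustrative helper names):

```python
def fold_change(before: float, after: float) -> float:
    """Ratio of magnitudes, after / before."""
    return abs(after) / abs(before)


def sign_flipped(before: float, after: float) -> bool:
    """True when the two values have opposite signs."""
    return (before > 0) != (after > 0)


# (0.5B value, 1.5B value) pairs copied from the table above.
metrics = {
    "MLP max effect (L0)": (8.12, 2.58),
    "Head max effect": (0.046, 1.023),
    "Best steering boost": (2.16, 4.64),
    "Multi-layer steering": (1.17, -7.20),
}

for name, (small, large) in metrics.items():
    ratio = fold_change(small, large)
    note = " (sign reversed)" if sign_flipped(small, large) else ""
    print(f"{name}: {ratio:.2f}x{note}")
```

A ratio below 1 means the effect is weaker at 1.5B; its reciprocal gives the "times weaker" figure.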

Experiment Blocks

9 blocks, prioritized by value. Seven are fully complete; Block D finished 5 of 8 sub-experiments and Block H is partial.

Block	Description	Models	Status
A	Parity verification	1.5B	DONE
B	Steering migration	0.5B, 1.5B	DONE
C	Ablation controls (6 types)	0.5B, 1.5B	DONE
D	Third scale point (3B)	3B	DONE (5/8 sub-experiments)
E	Cross-family replication	SmolLM2-1.7B	DONE
F	Adapter surgery	0.5B, 1.5B	DONE
G	Skill separability benchmark	0.5B	DONE
H	Deobfuscation surgery	0.5B, 1.5B	PARTIAL
I	Long-task robustness	0.5B, 1.5B	DONE

Practical Rules for Small-Model Brain Surgery

10 actionable rules from Phase 2 evidence.

RULE 1

Each model size needs its own atlas

The hub moves from L2 to L14 to L34 across a 6× scale-up (0.5B to 3B), per the Phase 3 full-suite replication. What works at 0.5B does not automatically work at 1.5B. Never transfer intervention locations across scales without re-mapping.

RULE 2

Each architecture family needs its own atlas

SmolLM2 did not show a Qwen-like hub under the reduced atlas. Qwen has L2/L14/L34 across the tested scales. Hub position is architecture-specific. Do not assume cross-architecture transfer.

RULE 3

Test steering at ALL candidate hub layers, not just the ablation winner

Phase 1 missed 1.5B steering because it only tested L2. Testing L6, L21, L26, L27 revealed stronger steering at 1.5B. Always sweep all plausible candidate layers, not just the current ablation winner.

RULE 4

Steering direction can reverse at scale

At 0.5B, positive steering at L2 boosts factual recall. At 1.5B, tested candidate layers respond to negative steering (L21) or have reversed effects (L26). Always test both directions.
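One way to apply Rules 3 and 4 together is to sweep every candidate layer at both signs of each strength and rank by effect magnitude rather than by positive boost. The sketch assumes steering adds a scaled vector to the hidden state, which this section does not confirm, and `measure_effect` is a hypothetical callback standing in for a real model run:

```python
def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction, element-wise."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


def sweep_both_signs(layers, strengths, measure_effect):
    """Evaluate every layer at +s and -s for each positive strength s.

    measure_effect(layer, alpha) is supplied by the caller and returns a
    signed effect for steering that layer with coefficient alpha.
    """
    results = {}
    for layer in layers:
        for s in strengths:
            for alpha in (s, -s):
                results[(layer, alpha)] = measure_effect(layer, alpha)
    return results


def strongest(results):
    """Return the (layer, alpha) key with the largest effect magnitude."""
    return max(results, key=lambda key: abs(results[key]))


def toy_effect(layer, alpha):
    """Toy stand-in for a model run: deeper layers respond with reversed sign."""
    return alpha * (1.0 if layer < 10 else -2.0)


results = sweep_both_signs(layers=[2, 21], strengths=[1.0, 2.0], measure_effect=toy_effect)
print(steer([1.0, 2.0], [1.0, 0.0], -2.0))
print(strongest(results), results[strongest(results)])
```

Ranking by magnitude keeps layers with reversed effects in view, so they are not discarded just because the positive direction looks weak.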

RULE 5

Every layer is necessary — do not skip layers

0% top-5 overlap for all skip configurations at all scales. Use structured pruning with recovery training if you need inference compression.

RULE 6

Use multiple ablation methods, not just zero ablation

Zero ablation overestimates some effects. Mean ablation and resample ablation give more nuanced maps. Claim stability should be checked against at least 2 methods.

RULE 7

LoRA adapters are model-specific — do not load across model sizes

Loading a 0.5B adapter into 1.5B or 3B causes dimension mismatch errors. Each model size needs its own trained adapters.
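The mismatch follows from the hidden sizes in the cross-scale table (896, 1536, 2048). A LoRA update is the product of a `(d_out, rank)` matrix and a `(rank, d_in)` matrix, so it only fits a weight with the same outer dimensions. The sketch assumes a square hidden-to-hidden projection and a hypothetical rank of 8:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraAdapterShape:
    rank: int
    d_in: int   # the A matrix has shape (rank, d_in)
    d_out: int  # the B matrix has shape (d_out, rank)


def fits(adapter: LoraAdapterShape, weight_shape: tuple) -> bool:
    """True if B @ A has the same shape as the target weight (d_out, d_in)."""
    d_out, d_in = weight_shape
    return (adapter.d_out, adapter.d_in) == (d_out, d_in)


# Hidden sizes from the cross-scale table; rank 8 is an arbitrary example value.
hidden_sizes = {"0.5B": 896, "1.5B": 1536, "3B": 2048}
adapter_for_small = LoraAdapterShape(rank=8, d_in=896, d_out=896)

for name, d in hidden_sizes.items():
    print(name, fits(adapter_for_small, (d, d)))
```

Running a check like this before loading gives a clear message in place of a low-level shape error.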

RULE 8

code_semantics is the best candidate for surgical skill injection

Highest SSS (0.361), highest insertion gain (+3.70), zero collateral damage. If you need one skill to surgically add to a small model, code semantics is the cleanest target.

RULE 9

Steering effectiveness may degrade with prompt length

Long-task robustness shows hub stability varies by task. Factual recall steering degrades at longer prompts. Test your steering interventions at the prompt lengths you'll actually use.

RULE 10

L0 MLP is always important — invest in early-layer analysis

Across the Qwen scales tested, L0 MLP consistently has the largest MLP ablation effect. Early processing matters at every scale measured. If you're doing quick triage, start with L0.

Method

Phase 2 methodology additions.

📋

Phase 2 added: Multi-method ablation (6 types: zero, mean, resample, clean→corrupt, corrupt→clean, random patch). Steering migration sweeps (7 layers × 8 strengths × 4 vector types). Adapter surgery (norm profiles, compatibility matrix, rank truncation). Skill separability benchmark (5 operations per skill). Cross-family reduced atlas. Long-prompt robustness testing. Experiment registry with 35 entries. Claim card system. Reproducible configs in YAML. Phase 3 tightens these findings with multi-seed full-suite replication.
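The steering sweep dimensions (7 layers × 8 strengths × 4 vector types) imply 224 configurations per model. The sketch below enumerates such a grid. The specific layers, strengths, and vector-type names are hypothetical placeholders, since only the grid dimensions are given here:

```python
from itertools import product

# Placeholder values chosen only to match the stated grid dimensions.
layers = [2, 6, 12, 14, 21, 26, 27]                        # 7 layers
strengths = [-8.0, -4.0, -2.0, -1.0, 1.0, 2.0, 4.0, 8.0]   # 8 strengths
vector_types = ["type_a", "type_b", "type_c", "type_d"]    # 4 vector types

grid = [
    {"layer": layer, "strength": strength, "vector_type": vector_type}
    for layer, strength, vector_type in product(layers, strengths, vector_types)
]

print(len(grid))
print(grid[0])
```

Each dictionary could be serialized as one entry in a YAML config or experiment registry, so that every run is traceable to its exact settings.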

"The atlas is NOT transferable across scales or architectures. Each model needs its own map. The control surface changes with scale, family, task type, and fine-tuning method."

35 registered experiments · 16 task families · 3 model scales · 2 architectures · 6 ablation types
Results: experiments/results/ · Reports: reports/phase2/