MI-Atlas

Mechanistic Interpretability of Small Language Models

Phase 1–3 · Qwen2.5 · LFM2.5-230M · 4 scales · 2 architectures · SFT · Data format ablation

What this is: A reproducible causal atlas connecting model behaviours to internal components — which layers, heads, and MLPs do what, how LoRA training rewires them, and whether learned skills can be transferred or selectively knocked out. Every claim carries a metric, a counterfactual, a control, and a confidence level. No overclaiming, no ignoring null results.

Analysis Posts

Start with the one-page atlas if you want the shareable version. The longer notes below keep the experiment detail, caveats, and current Phase 9 work.

MI-Atlas: One-Page Causal Map

Shareable summary · claims audit · caveats included

A single readable page covering the causal atlas, LoRA targeting, steering, negative results, and the claims that survive the current audit.

Qwen2.5-0.5B: A Causal Atlas

22 experiments · 24 layers · 14 heads GQA · 0.49B params

The complete atlas: L2 as a universal hub (HIGH confidence), skill-specific LoRA concentration, core-circuit lock-in by step 10, monotonic recovery under cross-model patching, L19 skill knockout at 11,654× selectivity, and why naive layer skipping fails.

02 · IN PROGRESS

Qwen2.5-1.5B: Scaling the Atlas

Experiments running · 28 layers · 1.5B params

Scaling the causal atlas 3×. Does the universal hub persist, shift, or fragment? Does head specialization emerge at scale? Live analysis as experiments complete.

0.5B vs 1.5B: What Changes at 3× Scale?

Cross-model comparison

Side-by-side: architecture, universal hub layer, MLP dominance, head specialization, LoRA concentration, circuit lock-in, knockout selectivity, and layer-skipping viability across a 3× parameter scale-up.

04 · QUALITATIVE

Qualitative Analysis: The Vibe Check

30 prompts × 6 configs · Prose quality, creativity, and quantization impact

Does the output actually read well? 0.5B/1.5B at bf16, 8-bit, and 4-bit NF4 scored on speed, repetition & degeneration, and constraint adherence. Finding: 1.5B is qualitatively much better (3–4× fewer loops, 2× better constraint following), and 4-bit NF4 is the sweet spot — 8-bit is the slowest quantization.

Phase 2: Repeatable Small-Model Surgery

9 blocks · 3 scales · 2 architectures · 35 registered Phase 2 experiments

Hub migration found and then tightened by Phase 3: the original L26 result at 1.5B was revised to L14 with the full 12-family suite. Steering did NOT collapse at 1.5B — it migrated and strengthened. Cross-family hubs are architecture-specific.

Phase 3: Gap Closure and Gem Discovery

18 blocks · 9 seeds · 36 result files · 5 models

Atlas-guided LoRA validated: 13.8× fewer params, equal accuracy. Hub migration replicated across 9 seeds (all std=0.0). Hub revised from L26 to L14 at 1.5B (narrow suite artifact). Coder hub flip: L2→L22 at same scale. Quantization amplifies steering 476×.

07 · NEW ARCHITECTURE

LFM2.5-230M: Hybrid Architecture Atlas

230M params · 14 layers (8 conv + 6 attn) · Liquid AI

Hybrid conv+attention atlas for LFM2.5-230M. L0 (conv) is the strongest measured hub. Early layers are 3.3× more important than late layers in this run. Conv MLPs are 2.12× stronger than attention MLPs. L4_H11 is the strongest recurring head. Atlas-guided LoRA: 65K params, 14× less behavior shift.

08 · SFT SWEEP

LFM2.5-230M: SFT Experiment Sweep

39 experiments · 10 datasets · 4 optimizers · 6 hyperparameter dimensions

What actually matters when fine-tuning a 230M model? Dataset format (multi-turn concise) beats hyperparameters by 5×. Best dataset: smol-magpie-ultra (loss 1.27). Best optimizer: Adafactor. Hub structure preserved across all 39 experiments.

09 · FORMAT ABLATION

Phase 9: Data Format Ablation

300 canonical examples · 6 formats · 153 eval prompts · Judge-based scoring

Isolating data shape as a variable. Same content, 6 representations — from flat Alpaca to multi-turn concise to structured terse. Controlled experiment to answer: what is the optimal information shape for small-model SFT?

Method

Causal interventions, not correlations. Every claim is placed on the evidence ladder.

Zero ablation removes a component's output and measures the KL divergence between the clean and ablated next-token distributions. Activation patching swaps activations between clean and corrupted runs. Steering vectors inject the direction mean(positive) − mean(negative) into the model's activations. LoRA training perturbation trains low-rank adapters on specific skills and compares component maps before and after training. Cross-model patching transfers activations from the trained model into the base model. Skill knockout applies negative steering to suppress learned skills.
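
A minimal sketch of two of these interventions, zero ablation scored by KL divergence and the mean(positive) − mean(negative) steering direction, using only the Python standard library. It is an illustration under stated assumptions, not the project's code: the toy "model" simply sums per-component logit contributions, the KL direction is taken as KL(clean || ablated), and every name and number (component_a, component_b, the helper functions) is hypothetical rather than a measured result.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def toy_logits(component_outputs, ablate=None):
    """Toy model (assumption): logits are the sum of component contributions.

    Passing `ablate` zeroes that component's contribution (zero ablation).
    """
    vocab = len(next(iter(component_outputs.values())))
    logits = [0.0] * vocab
    for name, out in component_outputs.items():
        if name == ablate:
            continue  # zero ablation: this component contributes nothing
        logits = [a + b for a, b in zip(logits, out)]
    return logits


def ablation_effect(component_outputs, name):
    """KL(clean || ablated) over the next-token distribution."""
    clean = softmax(toy_logits(component_outputs))
    ablated = softmax(toy_logits(component_outputs, ablate=name))
    return kl_divergence(clean, ablated)


def steering_vector(positive, negative):
    """mean(positive) - mean(negative), computed per hidden dimension."""
    dim = len(positive[0])
    mean_pos = [sum(v[i] for v in positive) / len(positive) for i in range(dim)]
    mean_neg = [sum(v[i] for v in negative) / len(negative) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]


# Hypothetical contributions to a 3-token vocabulary (made-up numbers).
components = {
    "component_a": [2.0, 0.0, -1.0],  # large contribution
    "component_b": [0.1, 0.0, 0.0],   # small contribution
}
effects = {name: ablation_effect(components, name) for name in components}

# Rank components by how much removing them shifts the output distribution.
for name, kl in sorted(effects.items(), key=lambda kv: -kv[1]):
    print(f"{name}: KL = {kl:.4f}")

# Hypothetical 2-dimensional activations for positive and negative prompts.
direction = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [2.0, 1.0]])
print("steering direction:", direction)
```

In a real transformer the same pattern would run through forward hooks on the chosen layer, head, or MLP; the ranking by KL is what makes a component a "hub" candidate, and skill knockout would add a negative multiple of the steering direction instead of a positive one.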

"Every claim needs a metric, a counterfactual, a control, and a confidence level. If a conclusion sounds exciting, attack it harder before reporting it."

Phase 1–3 atlas · 39 SFT experiments · 6 format ablations · 3 model scales · 2 architectures
Fully reproducible: python scripts/run_*.py