Mechanistic interpretability plus training perturbation

A causal atlas for small language models.

I tested where behavior moves inside small models, then used that map to target LoRA, steering, ablation, patching, and negative controls. The result is not a universal theory of language models. It is a practical rule: map the model you plan to edit before you edit it.


5 models or model families mapped

12 default task families in the main suite

39 LFM2.5-230M SFT runs in the sweep

13.8x fewer LoRA params on the JSON result

What got mapped

The atlas links behavior to components, training changes, activations, and interventions. The core measurements are causal: remove, patch, steer, train, compare.

Ablation

Which layers matter?

Residual, MLP, head, module, and position-specific ablations measure how much output distributions move when a component is removed.
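A minimal sketch of the layer-ablation measurement, assuming a toy model whose logits are just the sum of per-layer updates. The numbers in `toy_updates` are made up for illustration, and a real transformer is not linear like this; the point is only the loop: remove one component, rerun, and score how far the output distribution moved (here with KL divergence).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, with a small eps for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def forward(layer_updates, ablate=None):
    """Toy residual stream: logits are the sum of per-layer updates.
    Zero-ablating layer `ablate` simply drops its update."""
    logits = [0.0] * len(layer_updates[0])
    for i, update in enumerate(layer_updates):
        if i == ablate:
            continue
        logits = [a + b for a, b in zip(logits, update)]
    return softmax(logits)

def ablation_effects(layer_updates):
    """KL between the intact output and the output with each layer removed."""
    base = forward(layer_updates)
    return [
        kl_divergence(base, forward(layer_updates, ablate=i))
        for i in range(len(layer_updates))
    ]

# Hypothetical per-layer logit contributions over a 3-token vocabulary.
toy_updates = [
    [0.1, 0.0, -0.1],
    [0.2, -0.1, 0.0],
    [3.0, -1.0, -2.0],
    [0.0, 0.1, 0.0],
]

effects = ablation_effects(toy_updates)
hub = max(range(len(effects)), key=effects.__getitem__)
print("per-layer KL:", [round(e, 4) for e in effects])
print("largest effect at layer", hub)
```

On a real model the same loop would run through forward hooks on the chosen component, averaged over prompts from each task family.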

Patching

Can clean activations recover behavior?

Activation patching tests whether a component carries behavior by transferring clean or trained activations into a different run.
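A minimal sketch of component-level patching, assuming a toy two-layer residual model with scalar activations. The layer functions and inputs are hypothetical. The model is run on a clean input and the component outputs are cached, then it is run on a corrupted input with one cached output swapped in. The score is the fraction of the clean-versus-corrupt gap that the swap recovers. Components are patched one at a time because overwriting the whole residual stream at a layer restores everything downstream by construction.

```python
def run(x, layers, patch=None):
    """Run the toy residual model. `patch` maps (layer, name) -> activation
    and overrides that component's output when present."""
    patch = patch or {}
    h = x
    cache = {}
    for i, layer in enumerate(layers):
        delta = 0.0
        for name, fn in layer.items():
            out = patch.get((i, name), fn(h))
            cache[(i, name)] = out
            delta += out
        h = h + delta  # residual update
    return h, cache

# Hypothetical components; a real model would expose these via hooks.
layers = [
    {"attn": lambda h: 2.0 * h, "mlp": lambda h: 0.1 * h},
    {"attn": lambda h: 0.1 * h, "mlp": lambda h: 0.5 * h},
]

clean_out, clean_cache = run(1.0, layers)   # clean prompt
corrupt_out, _ = run(0.0, layers)           # corrupted prompt

recovery = {}
for key, clean_act in clean_cache.items():
    patched_out, _ = run(0.0, layers, patch={key: clean_act})
    recovery[key] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

best = max(recovery, key=recovery.get)
for key, value in sorted(recovery.items()):
    print(key, round(value, 3))
print("most behavior recovered by", best)
```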

Training

Where does LoRA write?

Adapters are trained on controlled task families, then compared against the base model by layer, module, norm, and downstream behavior.

Steering

Can directions move behavior?

Activation additions test whether a direction can increase, suppress, or damage a behavior. Steering claims remain control-sensitive.
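A minimal sketch of why the controls matter, assuming a difference-of-means steering vector and a linear readout. The activations and readout weights are hypothetical. The steering vector's effect is compared against random directions of the same norm; if the random directions move the readout about as much, the effect is not specific to the task direction. A shuffled-label control (recompute the vector after permuting labels) follows the same pattern and is not shown.

```python
import math
import random

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def random_direction(dim, length, rng):
    """Gaussian direction rescaled to a fixed norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = length / norm(v)
    return [x * scale for x in v]

def readout_shift(direction, readout, alpha):
    """Change in a linear readout after adding alpha * direction."""
    return alpha * sum(w * d for w, d in zip(readout, direction))

# Hypothetical activations for prompts with and without the behavior.
pos_acts = [[2.0, 0.1, -0.2, 0.0], [1.8, -0.1, 0.1, 0.2]]
neg_acts = [[-0.1, 0.0, 0.1, 0.1], [0.1, 0.2, -0.1, -0.1]]
readout = [1.0, 0.0, 0.0, 0.0]  # hypothetical behavior logit
alpha = 1.0

steer = [p - n for p, n in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
steer_effect = readout_shift(steer, readout, alpha)

rng = random.Random(0)
control_effects = [
    abs(readout_shift(random_direction(len(steer), norm(steer), rng), readout, alpha))
    for _ in range(200)
]
control_mean = sum(control_effects) / len(control_effects)

print("steering effect:", round(steer_effect, 3))
print("mean |random-vector effect|:", round(control_mean, 3))
```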

Formats

What data shape trains cleanly?

The data-format ablation holds content fixed across 6 formats. Real judge scores from mimo-v2.5 show the base model outperforming every adapter. Lower training loss went with lower judge scores: the formats that were harder to fit (structured terse, single-turn chat) produced the better adapters.

Negative results

What failed?

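Full SFT ran out of memory on 8GB VRAM, naive layer skipping broke outputs, and some early patching and scoring setups were uninformative: full-residual patching was trivial, and early clean/corrupt pairs had tokenization problems.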

The claims I would share

These are the claims that survive the current audit. Confidence is intentionally conservative.

Claim	Evidence	Confidence	Details
Qwen2.5-0.5B has a stable L2 causal hub under the main suite.	L2 is top under ablation across 12 families. Phase 3 replicated hub=L2 across seeds 42, 137, and 2026 with std=0.0. Position-specific ablation shows first-token and last-token effects, not a uniform layer.	High	Open 0.5B atlas
Qwen hub location changes with scale and task suite breadth.	Full-suite hubs: Qwen2.5-0.5B L2, Qwen2.5-1.5B L14, Qwen2.5-3B L34. Each replicated across 3 seeds with std=0.0. The 1.5B hub was revised from L26 to L14 after widening the suite.	High	Open Phase 3
Atlas-guided LoRA is parameter-efficient on JSON schema following.	On Qwen2.5-0.5B JSON, atlas-guided LoRA used 319K trainable params and reached exact_match=1.000. All-linear also reached 1.000 but used 4.4M params. That is 13.8x more parameters.	High for JSON	Open LoRA evidence
Generic conclusions do not transfer cleanly across tasks.	Factual and code tasks did not match the clean JSON story. For code semantics, all-linear won exact match; atlas-guided had lower loss than random-matched but did not reach exact-match parity.	Medium	Open task split
Adapter effects are late-layer effects, not upstream propagation.	Adapter-only ablation on the JSON adapter gives norm-effect correlation around 0.85, with effects concentrated around L19-L23. The older "norm-effect separation" story is refuted.	Medium	Open adapter notes
Naive layer skipping is not a free speedup.	Layer skipping produced 0% top-5 token overlap across tested skip configs in the audited claim. Recovery fine-tuning is still open work.	High negative result	Open skip result
4-bit NF4 is the practical inference choice in the tested 1.5B setup.	Qwen2.5-1.5B bf16 ran at 18.8 tok/s, 4-bit NF4 at 17.1 tok/s, and 8-bit at 9.0 tok/s. Quality stayed close in the qualitative checks. Causal-surface drift is still under test.	Medium-high	Open quant notes
Training loss is inversely correlated with behavioral quality at 230M scale.	Real mimo-v2.5 judge on 153 eval prompts: base model (3.17) beats all 8 adapters. Best adapter: single-turn chat (2.60). Worst: quality bsmagpie (1.93). Loss-quality correlation r ≈ −0.7. 300-example SFT causes catastrophic overfitting.	High	Open format ablation
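The layer-skipping row reports top-5 token overlap. One plausible way to compute that metric is sketched below, assuming it means the share of the base model's top-5 next-token candidates that also appear in the modified model's top 5. The exact definition used in the audit is not stated here, and the logits are toy values.

```python
def top_k_overlap(base_logits, variant_logits, k=5):
    """Fraction of the base model's top-k token ids that also appear
    in the variant model's top-k, for one next-token position."""
    def top_ids(logits):
        order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
        return set(order[:k])
    return len(top_ids(base_logits) & top_ids(variant_logits)) / k

# Toy 10-token vocabulary.
base = [float(i) for i in range(10)]       # top-5 ids: 5..9
same = list(base)
reversed_logits = list(reversed(base))     # top-5 ids: 0..4

print(top_k_overlap(base, same))             # identical rankings
print(top_k_overlap(base, reversed_logits))  # disjoint top-5 sets
```

A score of 0.0 means none of the base model's five most likely tokens survive in the modified model's top five.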

How the story changed

The useful part of the project is not that every early guess held up. Several did not. That is the point of the atlas.

Phase 1

Initial Qwen2.5-0.5B atlas. Found L2, LoRA rewiring, late-layer trained activations, and a suspiciously strong factual knockout result.

Phase 2

Scale and controls. Some claims held; some shifted. SmolLM2 did not support a simple universal hub rule.

Phase 3

Multi-seed replication and practical tests. The strongest result is atlas-guided LoRA for JSON with 13.8x fewer parameters.

Phase 8–9

The Phase 8 SFT sweep (39 runs) found that dataset format mattered 5x more than hyperparameters. The Phase 9 format ablation, scored by a real judge, showed the base model beating all adapters at 230M scale. Lower loss went with lower judge scores, which points to overfitting rather than better behavior.

LFM2 SFT sweep

This is the applied training branch of the atlas: 39 supervised fine-tuning runs on LFM2.5-230M across datasets, optimizers, LoRA rank, target modules, learning rate, steps, and data volume.

Training loss

Dataset choice moved the result most.

In the Phase 8 sweep, smol-magpie-ultra reached loss 1.25 while Alpaca sat around 6.07. The dataset gap was much larger than optimizer or learning-rate gaps.


Optimizer

Adafactor was the practical default.

On smol-magpie-ultra, Adafactor gave loss 1.267 and KL 0.109 while using less memory than AdamW. Lion drifted more in this setup.


LoRA rank

Rank 8 was the efficiency point.

Higher rank helped less with each step up. Rank 8 used 213K trainable params in the sweep; rank 32 used 852K for a smaller additional loss gain.
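The parameter counts follow from how LoRA is built: each adapted linear layer of shape (d_in, d_out) gets two low-rank matrices, so the trainable count is rank * (d_in + d_out) per layer and scales linearly with rank. A small sketch, with hypothetical layer shapes that are not the actual LFM2.5-230M configuration:

```python
def lora_param_count(linear_shapes, rank):
    """Trainable LoRA params: each adapted (d_in, d_out) linear adds
    A (rank x d_in) and B (d_out x rank), i.e. rank * (d_in + d_out)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in linear_shapes)

# Hypothetical target set; shapes are illustrative only.
shapes = [(1024, 1024), (1024, 4096)]

for rank in (4, 8, 16, 32):
    print(rank, lora_param_count(shapes, rank))

# Reported counts from the sweep, in thousands of trainable params.
reported = {8: 213, 32: 852}
ratio = reported[32] / reported[8]
print("rank 32 / rank 8 param ratio:", ratio)
```

The reported counts are consistent with this: 852K is 4x 213K, matching the 4x increase in rank over the same target modules.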


Target modules

Hub-targeted LoRA kept drift low.

The hub plus o_proj setup used about 65K params and KL 0.039. All-linear used far more params and moved the model more, while reaching lower loss.


Format ablation

Base model beats all adapters.

Real mimo-v2.5 judge scores: base model overall 3.17 vs best adapter 2.60 (single-turn chat). Loss is inversely correlated with quality (r ≈ −0.7). Formats harder to learn from (structured terse, single-turn chat) produced better adapters. 300-example SFT caused catastrophic overfitting at 230M scale.


Key finding

Loss does not predict quality.

Multi-turn verbose: best loss (1.37), worst judge score (1.99). Structured terse: worst loss (1.83), 2nd-best judge score (2.52). H4 strongly confirmed. This is the most important finding from Phase 9R.
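A sketch of how a loss-versus-judge correlation can be computed, using only the two (loss, judge score) pairs quoted above. Two points can only give r = ±1, so this illustrates the sign convention rather than reproducing any reported coefficient. With raw loss on one axis, "lower loss, worse score" shows up as a positive r; correlating loss improvement (negated loss) with the score flips the sign. Which convention the project's own analysis uses is not stated in this section.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# The two pairs stated above: multi-turn verbose, structured terse.
final_loss = [1.37, 1.83]
judge_score = [1.99, 2.52]

r_raw_loss = pearson_r(final_loss, judge_score)
r_loss_improvement = pearson_r([-loss for loss in final_loss], judge_score)

print("loss vs judge score:", round(r_raw_loss, 3))
print("negated loss vs judge score:", round(r_loss_improvement, 3))
```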


Side results worth keeping

These are useful, but I would not sell them as final laws.

Hybrid model

LFM2.5-230M behaves differently from Qwen.

In the LFM2.5-230M pass, L0 had the largest measured effect and early-layer steering moved outputs more. This used a corrected operator/MLP hook because full-layer zeroing caused cascade-zero artifacts.


Data format

Format affects loss but not quality the way you'd expect.

The 6-format ablation holds content fixed. Multi-turn verbose gave the lowest loss (1.37) but the worst judge score (1.99). Structured terse had the worst loss (1.83) but the 2nd-best judge score (2.52). Lower loss went with lower judge scores here, which suggests that at 230M scale lower loss mostly reflects overfitting.


SFT recipe

The useful LFM2 recipe is narrow.

For memory-constrained LFM2 SFT, the best current default is good multi-turn data, Adafactor, rank 8, and hub-targeted modules when low drift matters.


Steering

Steering can work, but controls matter.

Steering looked stronger when tested at the right layers in larger Qwen models. Random-vector and shuffled-label controls are still required before treating this as a clean task-specific intervention.


Prompt length

Short prompts are not the whole story.

Long-task checks suggest some hub locations and steering effects change with prompt length. This limits how far short synthetic prompts can be stretched.


What I do not claim

This is the part that keeps the page honest.

No universal law

The hub locations are model-specific. Even within Qwen, scale and training change the answer. Across architectures, the story changes again.

No probe-only claims

The page does not rely on attention maps, probes, or SAE labels alone. Claims need an intervention, a metric, and a control path.

No hidden nulls

Full SFT did not fit on 8GB VRAM. Full-residual patching was trivial. Early clean/corrupt pairs had tokenization problems. JSON knockout had limited room to suppress targets.

No publication claim yet

This is a reproducible research project, not a finished paper. The strongest results are ready to build on; several controls are still open.

Practical rules

What I would actually do if I had to edit a small model tomorrow.

1. Map first.

Run ablation and patching on the exact model, task family, prompt style, and quantization you plan to use.

2. Use a broad task suite.

The 1.5B hub moved from L26 to L14 when the suite widened. Narrow suites can give confident wrong answers.

3. Target adapters by evidence.

Atlas-guided LoRA was clearly parameter-efficient for JSON. Do not assume the same recipe wins every task.

4. Treat steering as fragile.

Layer, strength, prompt length, and quantization all matter. Add random-vector and shuffled-label controls.

5. Do not skip layers blindly.

The naive skip experiments failed hard. If you want speed, test recovery fine-tuning rather than deleting layers and hoping.

6. Log the misses.

The negative results were not cleanup. They changed the method and killed bad claims.
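Rules 1 and 2 can be illustrated with a small sketch of hub selection, assuming the hub is taken as the layer with the largest mean ablation effect across the task families in the suite. The family names and effect sizes below are hypothetical, chosen only to show how a narrow suite and a wide suite can point at different layers.

```python
from statistics import mean

def hub_layer(effects_by_family, families=None):
    """Layer index with the largest mean ablation effect over the
    selected task families (all families when none are given)."""
    families = families or list(effects_by_family)
    n_layers = len(effects_by_family[families[0]])
    averaged = [
        mean(effects_by_family[family][layer] for family in families)
        for layer in range(n_layers)
    ]
    return max(range(n_layers), key=averaged.__getitem__)

# Hypothetical per-layer ablation effects for a 4-layer model.
effects_by_family = {
    "json":  [0.1, 0.2, 0.9, 0.3],
    "facts": [0.2, 0.8, 0.3, 0.1],
    "code":  [0.1, 0.9, 0.2, 0.2],
}

narrow_hub = hub_layer(effects_by_family, families=["json"])
wide_hub = hub_layer(effects_by_family)

print("hub with JSON only:", narrow_hub)
print("hub with the full suite:", wide_hub)
```

Running the same selection per seed and checking that the chosen layer does not change is the replication check behind the std=0.0 figures in the claims table.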