LFM2.5-230M: Hybrid Architecture Atlas

A causal atlas of LiquidAI's 230M-parameter hybrid conv+attention language model, with gated short convolutions and grouped query attention.

Parameters: 229.7M
Layers: 14 (8 conv + 6 attn)
Universal hub: L0
VRAM (bf16): 450 MB

Key Findings

1. L0 (conv) is the universal hub. The first layer, a gated convolution, contributes more to model output than any other layer in this run (skip KL=82.9, 1.87× the next-strongest layer, L5 at 44.41). That differs from the Qwen2.5-0.5B hub at ~8% depth.
2. Early layers are 3.3× more important than late layers. L0-L5 (the low-norm quiet phase plus the transition layer) dominate processing. L6-L13 (high phase, norms 10-18× higher) handle refinement. The residual norm explosion marks the transition from critical processing to refinement.
3. Conv MLPs are 2.12× stronger than attention MLPs. The MLP in convolution layers contributes more than the MLP in attention layers. Conv layers work primarily through their MLPs (L0 is the exception, with operator KL above MLP KL); attention layers work primarily through their operators.
4. L4_H11 is the universal attention head. Head 11 in layer 4 is the #1 or #2 head across 7 of 9 task families. H11 is also reported as the best head in all 6 attention layers, which suggests an architecturally special head index.
5. Steering is more effective at early layers, the opposite of Qwen2.5. Steering at L0-L6 produces the largest shifts. Early residual norms are low (~2), so added vectors are proportionally large; late layers (L11-L13) have norms of ~25, which drown out the perturbation.
6. Gated convolution has zero component redundancy. Zeroing the gate (B), carrier (C), or signal (x) gives identical KL: the three are combined multiplicatively, so removing any one of them zeroes the operator output.
7. The residual norm jump is input-dependent. For arithmetic, the jump happens at L5; for factual recall, at L6. The transition point shifts with input content, so it is not a fixed architectural boundary.
8. Atlas-guided LoRA is the most parameter-efficient adapter strategy tested. Targeting hub layers L0, L2, L4, L5 with o_proj+MLP gives 65K trainable params (0.028%) with minimal behavior shift (KL=0.06), compared to all-linear with 737K params (0.32%) and KL=0.86.

Architecture

Layer Map:
L0 CONV (12.06M) ← UNIVERSAL HUB
L1 CONV (12.06M)
L2 ATTN (11.01M) ← L2_H9 structural specialist
L3 CONV (12.06M)
L4 ATTN (11.01M) ← L4_H11 universal head
L5 CONV (12.06M) ← STRONGEST MLP · norm jump
L6 ATTN (11.01M)
L7 CONV (12.06M)
L8 ATTN (11.01M)
L9 CONV (12.06M)
L10 ATTN (11.01M)
L11 CONV (12.06M)
L12 ATTN (11.01M) ← late refinement hub
L13 CONV (12.06M)

Gated Short Convolution (Lfm2ShortConv):
input(1024) → in_proj → [3072] → chunk(3) → B[1024]=gate, C[1024]=carrier, x[1024]=signal
→ Bx = B * x → depthwise_conv1d(k=3, g=1024) → y = C * conv_out → out_proj → [1024]

Attention (Lfm2Attention):
GQA: 16 Q-heads / 8 KV-heads, head_dim=64
Q/K per-head RMSNorm → RoPE (θ=1M) → SDPA
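
The gated short convolution flow above can be sketched for a single channel in plain Python. This is a minimal illustration, not the Lfm2ShortConv implementation: it omits in_proj and out_proj, assumes bias-free operations and a left-padded (causal) convolution, and uses made-up inputs. It shows why zeroing B, C, or x each removes the whole operator output.

```python
import random


def gated_short_conv(B, C, x, kernel):
    """Toy single-channel gated short convolution over a token sequence.

    B, C, x: per-token gate, carrier, and signal values (lists of floats).
    kernel: convolution taps; left padding is an assumption of this sketch.
    """
    k = len(kernel)
    bx = [b * v for b, v in zip(B, x)]            # Bx = B * x
    padded = [0.0] * (k - 1) + bx                 # left-pad: no future token is read
    conv_out = [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(bx))
    ]
    return [c * v for c, v in zip(C, conv_out)]   # y = C * conv_out


rng = random.Random(0)
T = 6  # hypothetical sequence length
B = [rng.uniform(0.5, 1.5) for _ in range(T)]
C = [rng.uniform(0.5, 1.5) for _ in range(T)]
x = [rng.uniform(0.5, 1.5) for _ in range(T)]
kernel = [0.018, 0.165, 0.017]  # L0 taps as reported in the kernel table

zeros = [0.0] * T
baseline = gated_short_conv(B, C, x, kernel)
ablated = {
    "gate": gated_short_conv(zeros, C, x, kernel),
    "carrier": gated_short_conv(B, zeros, x, kernel),
    "signal": gated_short_conv(B, C, zeros, kernel),
}
for name, out in ablated.items():
    print(name, out)
```

Each ablation yields an all-zero output, so all three produce the same downstream effect, which is consistent with the identical KL values reported for the three components.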

Residual Stream

The residual stream has three distinct phases with a dramatic norm transition:

| Phase | Layers | Norm Range | Character |
| --- | --- | --- | --- |
| Quiet | L0-L4 | 1.4 – 2.3 | Feature extraction, low-magnitude processing |
| Transition | L5 | → 25.5 | 10-18× norm explosion (input-dependent location) |
| High | L6-L13 | 22.9 – 26.3 | Sustained high-magnitude refinement |

Layer Ablation (Operator + MLP)

Methodology note: The standard approach of hooking the full decoder layer and zeroing its output causes cascading zeros through the residual stream (all layers appear identical). Our corrected approach hooks the operator (conv/self_attn) and MLP (feed_forward) separately.

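| Layer | Type | Operator KL | MLP KL | Skip KL |
| --- | --- | --- | --- | --- |
| L0 | conv | 56.31 | 46.15 | 82.90 |
| L1 | conv | 7.94 | 11.18 | 36.31 |
| L2 | attn | 22.95 | 8.63 | 30.81 |
| L3 | conv | 3.79 | 22.84 | 25.86 |
| L4 | attn | 34.87 | 21.16 | 34.56 |
| L5 | conv | 5.42 | 47.75 | 44.41 |
| L6 | attn | 6.88 | 4.77 | 8.74 |
| L7 | conv | 1.83 | 5.47 | 7.26 |
| L8 | attn | 8.30 | 4.05 | 10.20 |
| L9 | conv | 2.32 | 5.29 | 7.07 |
| L10 | attn | 11.10 | 6.58 | 17.68 |
| L11 | conv | 3.14 | 5.10 | 8.67 |
| L12 | attn | 7.40 | 7.74 | 29.13 |
| L13 | conv | 4.55 | 5.74 | 12.13 |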
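
The headline ratios can be approximately recomputed from this table. The sketch below assumes the hub ratio compares the two largest skip KL values, the early/late ratio compares mean skip KL for L0-L5 against L6-L13, and the MLP ratio compares mean MLP KL by layer type. These aggregation choices are a guess at how the figures were derived, not something the source states.

```python
# (layer, type, operator KL, MLP KL, skip KL) copied from the ablation table
ROWS = [
    ("L0", "conv", 56.31, 46.15, 82.90),
    ("L1", "conv", 7.94, 11.18, 36.31),
    ("L2", "attn", 22.95, 8.63, 30.81),
    ("L3", "conv", 3.79, 22.84, 25.86),
    ("L4", "attn", 34.87, 21.16, 34.56),
    ("L5", "conv", 5.42, 47.75, 44.41),
    ("L6", "attn", 6.88, 4.77, 8.74),
    ("L7", "conv", 1.83, 5.47, 7.26),
    ("L8", "attn", 8.30, 4.05, 10.20),
    ("L9", "conv", 2.32, 5.29, 7.07),
    ("L10", "attn", 11.10, 6.58, 17.68),
    ("L11", "conv", 3.14, 5.10, 8.67),
    ("L12", "attn", 7.40, 7.74, 29.13),
    ("L13", "conv", 4.55, 5.74, 12.13),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


skip = [row[4] for row in ROWS]
top_two = sorted(skip, reverse=True)[:2]

hub_ratio = top_two[0] / top_two[1]                 # strongest vs next layer
early_late_ratio = mean(skip[:6]) / mean(skip[6:])  # L0-L5 vs L6-L13
mlp_ratio = (
    mean(r[3] for r in ROWS if r[1] == "conv")
    / mean(r[3] for r in ROWS if r[1] == "attn")
)

print(f"hub ratio        {hub_ratio:.2f}")
print(f"early/late ratio {early_late_ratio:.2f}")
print(f"conv/attn MLP    {mlp_ratio:.2f}")
```

Under these assumptions the hub and MLP ratios match the quoted 1.87× and 2.12×, and the early/late ratio lands just above the quoted 3.3×.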

Head Ablation

| Head | Mean KL | Best Family | Role |
| --- | --- | --- | --- |
| L4_H11 | 0.948 | instruction_following (1.76) | Universal head |
| L2_H9 | 0.563 | json_schema (0.83) | Structural specialist |
| L2_H11 | 0.250 | factual_recall (0.32) | General processing |
| L2_H4 | 0.083 | factual_recall (0.08) | Minor |
| L4_H4 | 0.078 | instruction_following (0.28) | Minor |
Remaining 91 heads: KL < 0.07 each. The strongest head effect (0.948) is about 87× smaller than the strongest layer effect (skip KL 82.9).

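H11 is architecturally special: it is reported as the best head in all 6 attention layers (L2, L4, L6, L8, L10, L12), which suggests this head index has learned a broadly useful function. By mean KL in the table above, however, L2_H9 (0.563) ranks ahead of L2_H11 (0.250), so the claim for L2 depends on the ranking criterion.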

Steering Sweep

Random steering vector (seed=42), applied to last token position. Opposite of Qwen2.5: steering is more effective at early layers.

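| Layer | Type | Max KL (s=-4.0) |
| --- | --- | --- |
| L6 | attn | 16.37 |
| L5 | conv | 15.99 |
| L1 | conv | 15.87 |
| L2 | attn | 15.83 |
| L0 | conv | 15.59 |
| L10 | attn | 8.99 |
| L11 | conv | 4.80 |
| L12 | attn | 3.96 |
| L13 | conv | 2.99 |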

Why? Early layers have residual norms of ~2, so a steering vector of magnitude 4 is about twice the size of the activation it is added to. Late layers have norms of ~25, where the same vector is only 16% of the activation norm.
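
The proportionality argument can be checked with a small sketch. It assumes the steering vector is a unit-norm random direction multiplied by the scale s, and it uses synthetic residual vectors with the quoted norms (2 and 25). The helper names and Python's `random` generator are illustrative choices; the original run presumably used a different RNG even with seed 42.

```python
import math
import random

DIM = 1024  # hidden size from the architecture section


def random_unit_vector(dim, seed):
    """Random Gaussian direction rescaled to unit L2 norm."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def l2(v):
    return math.sqrt(sum(c * c for c in v))


def steer(residual, direction, scale):
    """Add scale * direction to a residual-stream vector."""
    return [r + scale * d for r, d in zip(residual, direction)]


direction = random_unit_vector(DIM, seed=42)


def relative_change(residual_norm, scale=-4.0, seed=7):
    """Size of the steering perturbation relative to the residual norm."""
    base = [residual_norm * c for c in random_unit_vector(DIM, seed)]
    shifted = steer(base, direction, scale)
    delta = [s - b for s, b in zip(shifted, base)]
    return l2(delta) / l2(base)


early = relative_change(2.0)    # quiet-phase norm
late = relative_change(25.0)    # high-phase norm
print(f"early: {early:.2f}x of activation norm, late: {late:.2f}x")
```

The same vector is 2.0× the activation norm at an early layer and 0.16× at a late one, which matches the direction of the KL trend in the sweep.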

Qualitative Analysis

46 tests across 9 categories. This 230M model is surprisingly capable.

| Category | Result | Notes |
| --- | --- | --- |
| Factual recall | 7/10 correct | Paris, Tokyo, H2O, Jupiter, 100°C, pound, Rossum |
| JSON generation | Valid structure | Some extra brace artifacts on complex schemas |
| Code generation | Correct algorithms | Fibonacci, binary search, merge sort all correct! |
| Instruction following | Good | Proper lists, one-sentence, bullet points |
| Multi-turn context | Maintains state | Names, colors, arithmetic carry over correctly |
| Creative writing | Coherent | Valid haiku, atmospheric prose |
| Prompt injection | Deflected | Completely ignores injection attempt |
| Edge cases | Degenerates | Repeated/special chars loop (expected at 230M) |

LoRA Training

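| Config | Params | Trainable % | Loss | KL Shift |
| --- | --- | --- | --- | --- |
| all_linear | 737,280 | 0.32% | 6.31 | 0.86 |
| attn_only | 344,064 | 0.15% | 6.76 | 0.30 |
| atlas_guided | 65,536 | 0.028% | 8.12 | 0.06 |
| mlp_only | | | | |
| conv_proj_only | | | | |
| atlas_full | | | | |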

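Atlas-guided is the most surgical strategy: 11× fewer params than all_linear, with 14× less behavior shift, from targeting just hub layers L0, L2, L4, L5 with o_proj+MLP. The trade-off is the highest final loss of the three reported configs (8.12 vs 6.31 for all_linear).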
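
The reported parameter counts are consistent with the usual LoRA count of r × (d_in + d_out) per adapted linear layer at r=8. The sketch below reproduces them from the dimensions in the Architecture section. The module lists are an inference from the arithmetic, not taken from the source: in particular, 65,536 equals four 1024×1024 projections at r=8, and adapting MLP projections as well would give a larger number.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN = 1024
KV_DIM = 8 * 64        # 8 KV heads x head_dim 64
CONV_IN = 3 * HIDDEN   # in_proj output, chunked into B, C, x
RANK = 8
N_ATTN, N_CONV = 6, 8

# Assumed attention targets: q, k, v, and output projections.
attn_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)      # q
    + lora_params(HIDDEN, KV_DIM, RANK)    # k
    + lora_params(HIDDEN, KV_DIM, RANK)    # v
    + lora_params(HIDDEN, HIDDEN, RANK)    # output projection
)
# Assumed conv targets: in_proj and out_proj.
conv_layer = (
    lora_params(HIDDEN, CONV_IN, RANK)
    + lora_params(HIDDEN, HIDDEN, RANK)
)

attn_only = N_ATTN * attn_layer
all_linear = attn_only + N_CONV * conv_layer
# One 1024x1024 output projection in each of L0, L2, L4, L5.
atlas_guided = 4 * lora_params(HIDDEN, HIDDEN, RANK)

print(attn_only, all_linear, atlas_guided)
```

All three totals match the table (344,064, 737,280, and 65,536), which supports this reading of the configs but does not confirm which modules were actually matched.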

Cross-Architecture Comparison

| Metric | LFM2.5-230M | Qwen2.5-0.5B | SmolLM2-1.7B |
| --- | --- | --- | --- |
| Architecture | Hybrid (8 conv + 6 attn) | Pure transformer (24 attn) | Pure transformer (24 attn) |
| Universal hub | L0 (conv, 0% depth) | L2 (attn, 8% depth) | L0 (attn, 0% depth) |
| Hub KL (skip) | 82.9 | | |
| Conv vs Attn MLP | Conv 2.12× stronger | N/A (all attn) | N/A (all attn) |
| Steering target | Early layers (L0-L6) | Late layers (L19-L23) | L0 (consistent) |
| Head specialization | 2-3 heads matter | 22× increase at scale | |
| Hub stability | std=0.0 (3 seeds) | std=0.0 (3 seeds) | |

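Architecture matters more than size. LFM2.5 (hybrid, 230M) and SmolLM2 (pure transformer, 1.7B) both have L0 hubs, while Qwen2.5 (pure transformer, 0.5B) has its hub at L2. Hub position therefore depends on the specific model design and is not determined by parameter count or depth alone.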

Methodological Discovery: Cascade Zero Bug

Hooking model.model.layers[i] and zeroing its output causes cascading zeros through the residual stream. A decoder layer's output is the residual stream itself, so zeroing L0's output means L1 sees input=0, which makes L1's output=0, and so on. All layers then produce identical KL because the final hidden state is zero regardless of which layer was zeroed.

The fix: Hook the operator (conv/self_attn) and MLP (feed_forward) separately. Zeroing the operator gives: residual + 0 + ffn(norm(residual)), which preserves the residual pass-through and gives layer-specific KL values.

Implication: This bug likely affects existing MI-Atlas results for all residual-stream models (including Qwen). The reported hub locations should be re-verified with the corrected methodology.
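
The difference between the two ablation styles can be reproduced on a toy residual model. The sketch below is not the MI-Atlas code: it uses random bias-free linear maps as stand-ins for the operator and MLP, a 4-dimensional residual stream, and a 5-token vocabulary, all hypothetical. It illustrates why zeroing a whole layer's output gives the same KL for every layer while zeroing only the operator branch does not.

```python
import math
import random

DIM, N_LAYERS, VOCAB = 4, 3, 5
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rms_norm(v, eps=1e-6):
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


OPS = [rand_matrix(DIM, DIM) for _ in range(N_LAYERS)]    # operator stand-ins
MLPS = [rand_matrix(DIM, DIM) for _ in range(N_LAYERS)]   # MLP stand-ins
HEAD = rand_matrix(VOCAB, DIM)                            # readout stand-in


def forward(x, zero_layer=None, zero_operator=None):
    for i in range(N_LAYERS):
        op_out = matvec(OPS[i], rms_norm(x))
        if i == zero_operator:
            op_out = [0.0] * DIM      # corrected: zero only the operator branch
        h = [a + b for a, b in zip(x, op_out)]
        x = [a + b for a, b in zip(h, matvec(MLPS[i], rms_norm(h)))]
        if i == zero_layer:
            x = [0.0] * DIM           # buggy: wipes the residual stream itself
    return softmax(matvec(HEAD, rms_norm(x)))


x0 = [0.5, -1.0, 0.25, 2.0]
base = forward(x0)
layer_kl = [kl(base, forward(x0, zero_layer=i)) for i in range(N_LAYERS)]
operator_kl = [kl(base, forward(x0, zero_operator=i)) for i in range(N_LAYERS)]
print("whole-layer zeroing:", [round(v, 4) for v in layer_kl])
print("operator zeroing:   ", [round(v, 4) for v in operator_kl])
```

With whole-layer zeroing every entry is the same number, because the readout always sees a zero vector. With operator zeroing the residual pass-through survives and each layer gets its own value.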

Experiment Date: June 29, 2026  |  Hardware: RTX 2070 Super 8GB  |  Runtime: ~45 minutes total  |  Model VRAM: 450 MB (bf16)

Deep Probe: Beyond Standard MI

10 follow-up probes beyond the standard MI-Atlas pass, with 18 result files and 23 total experiments.

Headline Finding: Residual Stream Locks In at L5

CKA similarity analysis gives rounded CKA=1.0000 from L5 through L13. In this run, the residual stream representation appears to stabilize after L5. Layers L6-L13 still affect logits, but CKA does not separate their representations at the reported precision.

Implication: L5 looks like a representation transition point in this setup. The later layers may be refining a stable representation rather than building a new one from scratch.

Residual Stream CKA Similarity

| Layer Pair | CKA | Interpretation |
| --- | --- | --- |
| embed ↔ L0 | 0.84 | High: L0 preserves embedding structure |
| L0 ↔ L4 | 0.86 | Gradual evolution through early layers |
| L4 ↔ L5 | 0.72 | Dip: L5 creates a new representation |
| L5 ↔ L6 | 1.00 | Indistinguishable at reported precision; lock-in begins |
| L5 ↔ L13 | 1.00 | All post-L5 layers indistinguishable from L5 at reported precision |
| embed ↔ L5 | 0.18 | L5 representation differs sharply from embeddings |
| L13 ↔ embed | 0.55 | Final output partially recovers embedding structure |
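
A CKA of 1.00 does not by itself mean two layers hold the same vectors. The sketch below implements linear CKA in plain Python on synthetic data (the source does not say which CKA variant or which activations were used, so this is an assumption) and shows that rescaling or permuting feature dimensions leaves the score at 1.

```python
import math
import random


def center_columns(m):
    """Subtract the per-feature mean so each column sums to zero."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def frob_sq_cross(a, b):
    """Squared Frobenius norm of a^T b for two (samples x features) matrices."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    x, y = center_columns(x), center_columns(y)
    return frob_sq_cross(x, y) / math.sqrt(
        frob_sq_cross(x, x) * frob_sq_cross(y, y)
    )


rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]   # 20 samples, 4 features
scaled = [[12.0 * v for v in row] for row in X]                # 12x larger norms
permuted = [row[::-1] for row in X]                            # features reordered
unrelated = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]

cka_scaled = linear_cka(X, scaled)
cka_permuted = linear_cka(X, permuted)
cka_unrelated = linear_cka(X, unrelated)
print(round(cka_scaled, 4), round(cka_permuted, 4), round(cka_unrelated, 4))
```

A 12× norm change, similar in size to the quiet-to-high transition, is invisible to the metric, so CKA alone cannot show whether later layers change the representation in ways the LM head can read.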

Logit Lens: What Each Layer "Knows"

Projecting each layer's residual stream through the LM head reveals when the model forms its prediction:

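| Layer | Top token (probability) for "The capital of France is" | Entropy |
| --- | --- | --- |
| embed | ' is' (1.00) | 0.00 |
| L0 | '?' (0.00) | 0.00 |
| L3 | 'olate' (0.18) | 5.69 |
| L5 | ' indeed' (0.22) | 5.53 |
| L6 | ' indeed' (0.34) | 5.03 |
| L10 | ' usually' (0.28) | 3.97 |
| L13 | ' Paris' (0.94) | 0.32 |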

The answer "Paris" only emerges at L13, even though CKA reports the representation as locked from L5. CKA is invariant to isotropic rescaling and orthogonal transformation, so layers can differ in ways it does not register; the embedding_norm + lm_head projection may be amplifying such differences.
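
A logit-lens readout reduces to a projection, a softmax, and an entropy. The sketch below uses a hypothetical three-token vocabulary, a made-up 2-dimensional LM head, and two made-up hidden states standing in for a middle and a final layer. It omits the final norm that a real pipeline would apply before the head, and it measures entropy in nats, which is an assumption about the table's units.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def lens_readout(hidden, lm_head, vocab):
    """Project a hidden state through the LM head; return top token, its prob, entropy."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in lm_head]
    probs = softmax(logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    return vocab[top], probs[top], entropy(probs)


VOCAB = [" Paris", " indeed", " usually"]          # hypothetical vocabulary
LM_HEAD = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # hypothetical head weights

mid_layer = lens_readout([0.2, 0.6], LM_HEAD, VOCAB)
final_layer = lens_readout([6.0, 0.0], LM_HEAD, VOCAB)
print("mid:  ", mid_layer)
print("final:", final_layer)
```

The middle-layer state gives a flat distribution with high entropy and the final-layer state a peaked one, the same pattern as the table's drop from above 5 to 0.32 as the answer emerges.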

Conv Kernel: Local → Lookahead

| Layer | Kernel [past, current, future] | Dominant |
| --- | --- | --- |
| L0 | [0.018, 0.165, 0.017] | Current token |
| L1 | [0.031, 0.153, 0.053] | Current token |
| L3 | [0.025, 0.061, 0.145] | Future token |
| L5 | [0.022, 0.051, 0.155] | Future token |
| L7 | [0.020, 0.046, 0.144] | Future token |
| L13 | [0.002, 0.008, 0.222] | Future token (dominant) |

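Under the [past, current, future] labeling, early convs (L0-L1) put most weight on the middle tap and later convs (L3+) shift weight to the last tap: kernel position [2] grows from 0.017 at L0 to 0.222 at L13. An autoregressive model cannot read tokens that have not been generated yet, so if the convolution is causal (left-padded), position [2] is the current token and positions [1] and [0] are the one and two tokens before it. On that reading, the trend runs from previous-token mixing in early layers toward current-token pass-through in later layers, not lookahead.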
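
Which token each kernel index touches depends on the padding convention. The sketch below assumes a left-padded causal convolution with cross-correlation indexing (the convention PyTorch's Conv1d uses); whether Lfm2ShortConv lays out its taps exactly this way is not confirmed by the source. It feeds in a one-token impulse and reports, for each tap, the offset of the token that tap reads.

```python
def causal_conv(x, kernel):
    """Left-padded 1D cross-correlation: output[t] never depends on x[t+1:]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


def tap_offset(tap, k=3):
    """Offset (in tokens) of the input position read by one kernel tap.

    0 means the current token, -1 the previous token, and so on.
    """
    impulse = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # single token at position 2
    kernel = [0.0] * k
    kernel[tap] = 1.0
    out = causal_conv(impulse, kernel)
    return impulse.index(1.0) - out.index(1.0)


offsets = {tap: tap_offset(tap) for tap in range(3)}
print(offsets)
```

Under this convention tap 2 reads the current token and taps 1 and 0 read the tokens one and two steps back, so no tap looks ahead.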

LoRA Rank Sweep (Hub Layers L0,L2,L4,L5)

| Rank | Params | Loss | KL Shift |
| --- | --- | --- | --- |
| r=2 | 16,384 | 9.16 | 0.008 |
| r=4 | 32,768 | 8.64 | 0.031 |
| r=8 | 65,536 | 8.12 | 0.071 |
| r=16 | 131,072 | 7.70 | 0.116 |

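Parameter count doubles exactly with rank. KL shift rises with every doubling, but by a shrinking factor (about 3.9×, 2.3×, then 1.6×), so the scaling is sub-linear at higher ranks. r=8 is the sweet spot (65K params, good loss convergence).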
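
The step-to-step ratios in the rank sweep can be computed directly from the table. The sketch below only uses the reported numbers.

```python
# (rank, params, loss, KL shift) copied from the rank sweep table
SWEEP = [
    (2, 16_384, 9.16, 0.008),
    (4, 32_768, 8.64, 0.031),
    (8, 65_536, 8.12, 0.071),
    (16, 131_072, 7.70, 0.116),
]

pairs = list(zip(SWEEP, SWEEP[1:]))
param_ratios = [b[1] / a[1] for a, b in pairs]   # growth in trainable params
kl_ratios = [b[3] / a[3] for a, b in pairs]      # growth in KL shift
loss_drops = [a[2] - b[2] for a, b in pairs]     # loss improvement per doubling

for (a, b), pr, kr, ld in zip(pairs, param_ratios, kl_ratios, loss_drops):
    print(f"r={a[0]} -> r={b[0]}: params x{pr:.1f}, KL x{kr:.2f}, loss -{ld:.2f}")
```

Parameters double at each step, while the KL ratio falls from roughly 3.9 to 1.6 and the loss improvement per doubling shrinks from 0.52 to 0.42.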


Total experiments: 23  |  Result files: 18 JSON  |  Total runtime: ~60 minutes  |  Date: June 29, 2026

Capability Benchmark: Where LFM2.5-230M Excels

82 automated tests across 14 capability categories. Each test has a deterministic pass/fail criterion.

Deployment Sweet Spots

USE FOR: Data extraction (70%), structured output generation (83%), entity extraction/NER (83%), code generation and translation (83%), factual Q&A (89%), multilingual tasks (86%), on-device lightweight agentic pipelines.

AVOID: Reasoning-heavy tasks (37%), adversarial/noisy environments (43%), text summarization and reformatting (25%), format-constrained generation with exact requirements (57%).

SPEED: 69-90 tok/s on RTX 2070 Super bf16. ~450MB VRAM. 128K context.

| Capability | Score | Tests | Notes |
| --- | --- | --- | --- |
| Factual knowledge | 88.9% | 18 | Strong across geography, science, history, tech. UK-specific weaker. |
| Multilingual | 85.7% | 7 | French, Spanish, German, Chinese, Urdu all work. Arabic weaker. |
| Structured output | 83.3% | 6 | JSON, YAML, CSV, markdown tables. JSON schema draft-07 fails. |
| Code | 83.3% | 6 | Generation, explanation, Python→JS translation. Bug detection weaker. |
| Entity extraction | 83.3% | 3 | Names, dates, locations, organizations. Product names weaker. |
| Data extraction | 70.0% | 5 | Invoice, medical, product fields. Flight booking extraction fails. |
| Math | 73.3% | 15 | Integer arithmetic solid. Percentages good. Word problems weaker. |
| Classification | 66.7% | 6 | Sentiment strong. Topic classification (science vs tech) weaker. |
| Agentic patterns | 66.7% | 3 | API error analysis and action planning. Multi-tool coordination weaker. |
| Instruction following | 57.1% | 7 | Lists and translations OK. Exact word counts and "ONLY" constraints fail. |
| Robustness | 42.9% | 7 | Injection resistance OK. Typo tolerance, all-caps, scrambled order fail. |
| Reasoning | 37.5% | 8 | Logic puzzles weak. Syllogisms, trick questions, pattern completion fail. |
| Text transformation | 25.0% | 4 | Summarization, style conversion, step extraction mostly fail. |

Inference Speed (RTX 2070 Super, bf16)

| Prompt Length | Tokens/sec | Latency (100 tokens) |
| --- | --- | --- |
| Short (6 tokens) | 69.0 | 1.45s |
| Medium (23 tokens) | 90.3 | 1.11s |
| Long (51 tokens) | 90.6 | 1.10s |

Throughput is higher on the medium and long prompts, which the run attributes to parallel prefill. Sustained decode is ~90 tok/s.
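
The latency column follows from the throughput column as 100 generated tokens divided by tokens per second. The short check below assumes exactly that relationship.

```python
def latency_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second


RATES = {"short": 69.0, "medium": 90.3, "long": 90.6}   # tok/s from the table
latencies = {
    name: round(latency_seconds(100, tps), 2) for name, tps in RATES.items()
}
print(latencies)
```

The results reproduce the table (1.45s, 1.11s, 1.10s), so the latency figures appear to be derived from throughput rather than measured separately.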

Recommended Use Cases

| Use Case | Fit | Why |
| --- | --- | --- |
| Invoice/receipt parsing | Excellent | Structured extraction from known formats |
| Entity extraction pipeline | Excellent | NER at 83%, fast enough for real-time |
| JSON API response parsing | Excellent | Structured output at 83%, tiny footprint |
| Code snippet generation | Good | Simple functions, translations, explanations |
| Multilingual chatbot (basic) | Good | 86% across 6 languages |
| Factual Q&A (constrained domain) | Good | 89% factual accuracy on general knowledge |
| On-device data extraction agent | Good | Model card recommendation, verified at 70% |
| Sentiment analysis | Moderate | Positive/negative detection works, nuance weaker |
| Math tutoring (basic) | Moderate | Arithmetic solid, explanation quality varies |
| Text summarization | Poor | Only 25% on transformation tasks |
| Complex reasoning | Poor | 37.5% on logic puzzles |
| Adversarial/noisy input | Poor | 42.9% robustness; needs clean input |

Exhaustive Finetuning Sweep

Comprehensive testing of finetuning methods with real HuggingFace datasets. Measures loss convergence, KL shift from base model, and hub preservation.
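
The "KL shift" metric can be sketched as the mean KL divergence between the base and fine-tuned models' next-token distributions over a set of positions. The direction KL(base || tuned), the averaging over positions, and the toy logits below are assumptions; the source does not spell out the exact definition.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def kl_divergence(base_logits, tuned_logits):
    """KL(base || tuned) between two next-token distributions, in nats."""
    lp = log_softmax(base_logits)
    lq = log_softmax(tuned_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def kl_shift(base_seq, tuned_seq):
    """Mean per-position KL between two models' logits on the same inputs."""
    per_position = [kl_divergence(b, t) for b, t in zip(base_seq, tuned_seq)]
    return sum(per_position) / len(per_position)


# Hypothetical logits: 2 positions, vocabulary of 3 tokens.
base = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
tuned_small = [[2.1, 0.5, -1.0], [0.1, 0.25, 3.0]]   # light-touch adapter
tuned_large = [[-1.0, 0.5, 2.0], [3.0, 0.2, 0.1]]    # heavily shifted model

shift_none = kl_shift(base, base)
shift_small = kl_shift(base, tuned_small)
shift_large = kl_shift(base, tuned_large)
print(shift_none, shift_small, shift_large)
```

A value near zero means the tuned model's predictions are close to the base model's on those inputs. It does not by itself say whether the tuned model is better at the target task.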

Method Comparison

| Method | Dataset | Steps | Final Loss | KL Shift | Status |
| --- | --- | --- | --- | --- | --- |
| LoRA (target sweep) | Arithmetic | 100 | 6.31-8.12 | 0.06-0.86 | 6 configs |
| QLoRA (4-bit NF4) | Alpaca (500) | 200 | 5.12 | 0.51 | Strongest learning |
| LoRA rank sweep | Arithmetic | 100 | 9.16→7.70 | 0.008→0.12 | r=2..16 |
| SFT (Alpaca) | Alpaca (500) | 200 | 5.26 | 0.16 | Good task learning |
| DPO (manual) | Synthetic pairs | 100 | 0.0001 | 0.002 | Most surgical |
| GRPO (manual) | Math (10 prompts) | 50 | -0.006 | 0.001 | Reward=1.0, minimal shift |

Hyperparameter Sweep (13 configs, hub layers L0,L2,L4,L5)

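| Config | Value | Loss | KL Shift | Insight |
| --- | --- | --- | --- | --- |
| Learning rate | 1e-5 | 9.54 | 0.0006 | Too conservative |
| Learning rate | 5e-5 | 8.13 | 0.0022 | Conservative |
| Learning rate | 2e-4 | 6.76 | 0.009 | Default, sweet spot |
| Learning rate | 1e-3 | 5.78 | 0.037 | Aggressive |
| Rank | r=2 | 7.64 | 0.004 | Minimal params |
| Rank | r=4 | 7.13 | 0.005 | Good efficiency |
| Rank | r=8 | 6.76 | 0.009 | Sweet spot |
| Rank | r=16 | 6.43 | 0.019 | Diminishing returns |
| Rank | r=32 | 6.07 | 0.020 | Marginal gain vs r=16 |
| Steps | 50 | 7.66 | 0.004 | Undertrained |
| Steps | 200 | 6.03 | 0.015 | Good |
| Steps | 500 | 5.31 | 0.013 | Loss drops, KL plateaus! |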

Finetuning Key Insight

Loss keeps decreasing with more training, but KL plateaus around 0.01-0.02. In this setup, the model improves on the training objective without a large measured distribution shift. That is a useful signal for controlled fine-tuning, but it still needs task-level eval before becoming a deployment rule.

Recommended config: lr=2e-4, r=8, 200-500 steps, target hub layers (L0,L2,L4,L5) with o_proj+MLP. This gives good task learning (loss 5.3-6.0) with minimal behavior shift (KL 0.01-0.02).