
LFM2.5-230M: Hybrid Architecture Atlas

A causal atlas of LiquidAI's 230M-parameter hybrid conv+attention language model, with gated short convolutions and grouped query attention.

Parameters: 229.7M
Layers: 14 (8 conv + 6 attn)
Universal hub: L0
VRAM (bf16): 450 MB

Key Findings

L0 (conv) is the universal hub. The first layer, a gated convolution, contributes more to model output than any other layer in this run (skip KL=82.9, 1.87× stronger than the next). That differs from the Qwen2.5-0.5B hub at ~8% depth.

Early layers are 3.3× more important than late layers. L0-L5 (the low-norm quiet phase plus the L5 transition) dominate processing. L6-L13 (the high phase, with 10-18× higher norms) handle refinement. The residual norm jump marks a transition from critical processing to refinement.

Conv MLPs are 2.12× stronger than attention MLPs. The MLP in convolution layers contributes more than in attention layers. Conv layers work primarily through their MLPs; attention layers work through their operators.

L4_H11 is the universal attention head. Head 11 in layer 4 is the #1 or #2 head across 7/9 task families. H11 also ranks at or near the top in all 6 attention layers (in L2 the head ablation table puts H9 ahead of H11 on mean KL), which makes it an architecturally special head index.

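Steering is MORE effective at early layers. Opposite of Qwen2.5. Layers L0-L6 show the largest steering effect (max KL 15.6-16.4). In the quiet phase (L0-L4) residual norms are about 2, so added vectors are proportionally large. Late layers (L11-L13) have norms of about 25, which drown out perturbations. L5 and L6 stay steerable even though the norm jump occurs around L5, so residual norm may not be the whole explanation.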

Gated convolution has zero component redundancy. Zeroing gate (B), carrier (C), or signal (x) gives identical KL — the multiplicative structure means every component is necessary. No redundancy at all.

The residual norm jump is input-dependent. For arithmetic, the jump happens at L5. For factual recall, at L6. The transition point shifts based on input content — it's not a fixed architectural boundary.

Atlas-guided LoRA is the most parameter-efficient adapter strategy tested. Targeting hub layers L0,L2,L4,L5 with o_proj+MLP gives 65K trainable params (0.028%) with minimal behavior shift (KL=0.06), compared to all-linear with 737K params (0.32%) and KL=0.86. The trade-off is a higher final loss (8.12 vs 6.31).

Architecture

Layer Map:
L0	CONV (12.06M)	← UNIVERSAL HUB
L1	CONV (12.06M)
L2	ATTN (11.01M)	← L2_H9 structural specialist
L3	CONV (12.06M)
L4	ATTN (11.01M)	← L4_H11 universal head
L5	CONV (12.06M)	← STRONGEST MLP · norm jump
L6	ATTN (11.01M)
L7	CONV (12.06M)
L8	ATTN (11.01M)
L9	CONV (12.06M)
L10	ATTN (11.01M)
L11	CONV (12.06M)
L12	ATTN (11.01M)	← late refinement hub
L13	CONV (12.06M)

Gated Short Convolution (Lfm2ShortConv):
input(1024) → in_proj → [3072] → chunk(3) → B[1024]=gate, C[1024]=carrier, x[1024]=signal
→ Bx = B * x → depthwise_conv1d(k=3, g=1024) → y = C * conv_out → out_proj → [1024]

Attention (Lfm2Attention):
GQA: 16 Q-heads / 8 KV-heads, head_dim=64
Q/K per-head RMSNorm → RoPE (θ=1M) → SDPA
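The gated short convolution data flow above can be sketched in a few lines of NumPy. This is a toy reading of the diagram, not the Lfm2ShortConv source: the function name, the random weights, the absence of biases, and the causal left padding are all assumptions made for illustration.

```python
import numpy as np

def gated_short_conv(h, w_in, w_conv, w_out, zero=None):
    """Toy gated short convolution over one sequence (no biases, assumed).

    h: (T, d) inputs; w_in: (d, 3d); w_conv: (d, k) depthwise taps; w_out: (d, d).
    zero: None, or "B" / "C" / "x" to zero that branch before it is used.
    """
    T, d = h.shape
    k = w_conv.shape[1]
    # in_proj followed by chunk(3): gate B, carrier C, signal x.
    branches = dict(zip("BCx", np.split(h @ w_in, 3, axis=-1)))
    if zero is not None:
        branches[zero] = np.zeros_like(branches[zero])
    B, C, x = branches["B"], branches["C"], branches["x"]
    Bx = B * x
    # Depthwise conv with k-1 zeros of left padding (assumed causal):
    # position t only sees positions t-k+1 .. t.
    padded = np.vstack([np.zeros((k - 1, d)), Bx])
    conv_out = np.stack([(padded[t:t + k] * w_conv.T).sum(axis=0) for t in range(T)])
    return (C * conv_out) @ w_out

rng = np.random.default_rng(0)
T, d, k = 5, 8, 3                       # tiny sizes; the real layer uses d=1024
h = rng.normal(size=(T, d))
w_in = rng.normal(size=(d, 3 * d))
w_conv = rng.normal(size=(d, k))
w_out = rng.normal(size=(d, d))

full = gated_short_conv(h, w_in, w_conv, w_out)
ablated = {name: gated_short_conv(h, w_in, w_conv, w_out, zero=name) for name in "BCx"}
```

In this sketch, zeroing any one of B, C, or x drives the whole operator output to zero, because each branch enters only through a product. That is one way to read the "zero component redundancy" finding: all three ablations reduce to the same zeroed operator.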

Residual Stream

The residual stream has three distinct phases with a dramatic norm transition:

Phase	Layers	Norm Range	Character
Quiet	L0-L4	1.4 – 2.3	Feature extraction, low-magnitude processing
Transition	L5	→ 25.5	10-18× norm explosion (input-dependent location)
High	L6-L13	22.9 – 26.3	Sustained high-magnitude refinement

Layer Ablation (Operator + MLP)

Methodology note: The standard approach of hooking the full decoder layer and zeroing its output causes cascading zeros through the residual stream (all layers appear identical). Our corrected approach hooks the operator (conv/self_attn) and MLP (feed_forward) separately.
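The KL values in the tables compare the model's next-token distribution with and without an intervention. A minimal sketch of that metric follows, assuming the direction KL(base || ablated) and natural-log units; the source does not state either choice, and the function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def kl_divergence(base_logits, ablated_logits):
    """KL(base || ablated) in nats for one next-token distribution."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(ablated_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

same = kl_divergence([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
shifted = kl_divergence([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```

Identical logits give a KL of zero, and any change to the distribution gives a positive value, so larger numbers in the tables mean the ablated component mattered more for that prompt.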

Layer	Type	Operator KL	MLP KL	Skip KL
L0	conv	56.31	46.15	82.90
L1	conv	7.94	11.18	36.31
L2	attn	22.95	8.63	30.81
L3	conv	3.79	22.84	25.86
L4	attn	34.87	21.16	34.56
L5	conv	5.42	47.75	44.41
L6	attn	6.88	4.77	8.74
L7	conv	1.83	5.47	7.26
L8	attn	8.30	4.05	10.20
L9	conv	2.32	5.29	7.07
L10	attn	11.10	6.58	17.68
L11	conv	3.14	5.10	8.67
L12	attn	7.40	7.74	29.13
L13	conv	4.55	5.74	12.13
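The headline ratios in Key Findings can be rechecked from this table. The sketch below assumes the ratios are computed on per-layer means (skip KL for hub and early/late, MLP KL for conv vs attention); the source does not spell out the averaging, so treat the grouping as a guess that happens to land near the quoted numbers.

```python
# (layer, type, operator KL, MLP KL, skip KL), copied from the table above.
ROWS = [
    ("L0", "conv", 56.31, 46.15, 82.90), ("L1", "conv", 7.94, 11.18, 36.31),
    ("L2", "attn", 22.95, 8.63, 30.81), ("L3", "conv", 3.79, 22.84, 25.86),
    ("L4", "attn", 34.87, 21.16, 34.56), ("L5", "conv", 5.42, 47.75, 44.41),
    ("L6", "attn", 6.88, 4.77, 8.74), ("L7", "conv", 1.83, 5.47, 7.26),
    ("L8", "attn", 8.30, 4.05, 10.20), ("L9", "conv", 2.32, 5.29, 7.07),
    ("L10", "attn", 11.10, 6.58, 17.68), ("L11", "conv", 3.14, 5.10, 8.67),
    ("L12", "attn", 7.40, 7.74, 29.13), ("L13", "conv", 4.55, 5.74, 12.13),
]

def mean(values):
    return sum(values) / len(values)

skip = [row[4] for row in ROWS]
top, runner_up = sorted(skip, reverse=True)[:2]
hub_ratio = top / runner_up                        # L0 vs the next-strongest layer
early_late_ratio = mean(skip[:6]) / mean(skip[6:])  # L0-L5 vs L6-L13
mlp_ratio = (mean([r[3] for r in ROWS if r[1] == "conv"])
             / mean([r[3] for r in ROWS if r[1] == "attn"]))

print(f"hub {hub_ratio:.2f}x, early/late {early_late_ratio:.2f}x, conv/attn MLP {mlp_ratio:.2f}x")
```

Under these assumptions the hub and MLP ratios reproduce the quoted 1.87× and 2.12×, and the early/late ratio comes out slightly above the quoted 3.3×, which suggests the source truncated or averaged a little differently.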

Head Ablation

Head	Mean KL	Best Family	Role
L4_H11	0.948	instruction_following (1.76)	Universal head
L2_H9	0.563	json_schema (0.83)	Structural specialist
L2_H11	0.250	factual_recall (0.32)	General processing
L2_H4	0.083	factual_recall (0.08)	Minor
L4_H4	0.078	instruction_following (0.28)	Minor
Remaining 91 heads: KL < 0.07 each. Even the strongest head (L4_H11, mean KL 0.948) has about 87× less effect than the strongest layer (L0, skip KL 82.90).

H11 is architecturally special: it is reported as the best head in all 6 attention layers (L2, L4, L6, L8, L10, L12), although in L2 the table above ranks H9 ahead of H11 on mean KL. This head index appears to have learned a broadly useful function.

Steering Sweep

Random steering vector (seed=42), applied to last token position. Opposite of Qwen2.5: steering is more effective at early layers.

Layer	Type	Max KL (s=-4.0)
L6	attn	16.37
L5	conv	15.99
L1	conv	15.87
L2	attn	15.83
L0	conv	15.59
L10	attn	8.99
L11	conv	4.80
L12	attn	3.96
L13	conv	2.99

Why? Early layers have residual norms of ~2, so a steering vector of magnitude 4 is about twice the size of the activation it is added to. Late layers have norms of ~25, making the same vector only 16% of the activation.
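The norm argument can be made concrete with two small formulas. The sketch assumes a unit-norm steering direction that is orthogonal to the residual vector (a reasonable approximation for a random direction in 1024 dimensions, but still an assumption); the function names are illustrative.

```python
import math

def relative_size(scale, resid_norm):
    """Size of the added vector relative to the residual it is added to."""
    return abs(scale) / resid_norm

def cosine_after_steering(scale, resid_norm):
    """cos(h, h + scale * v) for a unit vector v orthogonal to h."""
    return resid_norm / math.hypot(resid_norm, scale)

for label, norm in [("early (norm ~2)", 2.0), ("late (norm ~25)", 25.0)]:
    print(label,
          f"relative size {relative_size(-4.0, norm):.2f},",
          f"cosine to original {cosine_after_steering(-4.0, norm):.3f}")
```

At norm 2 the steered vector points mostly along the steering direction (cosine to the original below 0.5), while at norm 25 it stays almost parallel to the original (cosine above 0.98). Any normalization applied after the addition would therefore see a very different input in the first case and nearly the same input in the second.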

Qualitative Analysis

46 tests across 9 categories. This 230M model is surprisingly capable.

Category	Result	Notes
Factual recall	7/10 correct	Paris, Tokyo, H2O, Jupiter, 100°C, pound, Rossum
JSON generation	Valid structure	Some extra brace artifacts on complex schemas
Code generation	Correct algorithms	Fibonacci, binary search, merge sort all correct!
Instruction following	Good	Proper lists, one-sentence, bullet points
Multi-turn context	Maintains state	Names, colors, arithmetic carry over correctly
Creative writing	Coherent	Valid haiku, atmospheric prose
Prompt injection	Deflected	Completely ignores injection attempt
Edge cases	Degenerates	Repeated/special chars loop (expected at 230M)

LoRA Training

Config	Params	Trainable %	Loss	KL Shift
all_linear	737,280	0.32%	6.31	0.86
attn_only	344,064	0.15%	6.76	0.30
atlas_guided	65,536	0.028%	8.12	0.06
mlp_only	—	—	—	—
conv_proj_only	—	—	—	—
atlas_full	—	—	—	—

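Atlas-guided wins on parameter count and behavior shift: 11× fewer params than all_linear, with 14× less KL shift. It also ends at the highest loss of the three completed configs (8.12 vs 6.31 for all_linear), so it learns the least in the same number of steps. Targeting just hub layers L0, L2, L4, L5 with o_proj+MLP is the most surgical strategy.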
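The parameter counts in the table can be reproduced from the shapes listed in the Architecture section, using the standard LoRA count of r × (d_in + d_out) per adapted linear layer. The sketch below is an inference from the numbers, not a description of the actual training script: the projection groupings are assumptions, and rank 8 is assumed for the target sweep.

```python
HIDDEN = 1024            # model width, from the architecture section
KV_DIM = 8 * 64          # 8 KV heads x head_dim 64
CONV_IN = 3 * HIDDEN     # in_proj output, later split into B, C, x

def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)

def attn_layer(r):
    # Assumed targets: query, key, value, and output projections.
    shapes = [(HIDDEN, HIDDEN), (HIDDEN, KV_DIM), (HIDDEN, KV_DIM), (HIDDEN, HIDDEN)]
    return sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes)

def conv_layer(r):
    # Assumed targets: the conv operator's input and output projections.
    shapes = [(HIDDEN, CONV_IN), (HIDDEN, HIDDEN)]
    return sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes)

attn_only = 6 * attn_layer(8)
all_operator_projections = attn_only + 8 * conv_layer(8)
one_square_proj_in_four_layers = 4 * lora_params(HIDDEN, HIDDEN, 8)
rank_sweep = [4 * lora_params(HIDDEN, HIDDEN, r) for r in (2, 4, 8, 16)]

print(attn_only, all_operator_projections, one_square_proj_in_four_layers, rank_sweep)
```

Under these assumptions the counts match the reported 344,064, 737,280, and 65,536 exactly, and the rank sweep values as well. If that reading is right, all_linear covered operator projections but no MLP matrices, and atlas_guided adapted a single 1024×1024 projection in each of the four hub layers. It would be worth confirming against the adapter config that the MLP targets in "o_proj+MLP" were actually matched.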

Cross-Architecture Comparison

Metric	LFM2.5-230M	Qwen2.5-0.5B	SmolLM2-1.7B
Architecture	Hybrid (8 conv + 6 attn)	Pure transformer (24 attn)	Pure transformer (24 attn)
Universal hub	L0 (conv, 0% depth)	L2 (attn, 8% depth)	L0 (attn, 0% depth)
Hub KL (skip)	82.9	—	—
Conv vs Attn MLP	Conv 2.12× stronger	N/A (all attn)	N/A (all attn)
Steering target	Early layers (L0-L6)	Late layers (L19-L23)	L0 (consistent)
Head specialization	2-3 heads matter	22× increase at scale	—
Hub stability	std=0.0 (3 seeds)	std=0.0 (3 seeds)	—

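Hub position does not simply track size or depth. LFM2 (hybrid, 230M) and SmolLM2 (pure transformer, 1.7B) both have L0 hubs, while Qwen2.5 (pure transformer, 0.5B) has its hub at L2. Because the two pure transformers disagree with each other, hub position looks specific to the model family rather than a function of size, depth, or the hybrid-versus-transformer split alone.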

Methodological Discovery: Cascade Zero Bug

Hooking model.model.layers[i] and zeroing its output zeroes the entire residual stream, because the decoder layer's output already includes the residual add. When you zero L0's output, L1 sees input=0; absent bias terms, norm(0)=0 makes L1's operator and MLP output 0 as well, and so on down the stack. ALL layers produce identical KL because the final hidden state is zero regardless of which layer was zeroed.

The fix: Hook the operator (conv/self_attn) and MLP (feed_forward) separately. Zeroing the operator gives: residual + 0 + ffn(norm(residual)), which preserves the residual pass-through and gives layer-specific KL values.

Implication: This bug likely affects existing MI-Atlas results for all residual-stream models (including Qwen). The reported hub locations should be re-verified with the corrected methodology.
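The difference between the two ablation styles shows up even in a toy residual stack. The sketch below is not the atlas code: the layer structure, RMS normalization, bias-free random weights, and mode names are all assumptions chosen to mirror the description above.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x) + eps)

def make_layers(n_layers, d, seed=0):
    rng = np.random.default_rng(seed)
    return [{"op": rng.normal(size=(d, d)) / np.sqrt(d),
             "mlp": rng.normal(size=(d, d)) / np.sqrt(d)}
            for _ in range(n_layers)]

def forward(h, layers, ablate=None, mode=None):
    """Pre-norm residual stack: h += op(norm(h)); h += mlp(norm(h))."""
    for i, layer in enumerate(layers):
        op_out = layer["op"] @ rmsnorm(h)
        if ablate == i and mode == "operator":
            op_out = np.zeros_like(op_out)      # corrected: zero only the operator branch
        h = h + op_out
        h = h + np.tanh(layer["mlp"] @ rmsnorm(h))
        if ablate == i and mode == "layer_output":
            h = np.zeros_like(h)                # naive: wipes the residual stream too
    return h

layers = make_layers(n_layers=4, d=16)
h0 = np.random.default_rng(1).normal(size=16)

naive = [forward(h0, layers, ablate=i, mode="layer_output") for i in range(4)]
fixed = [forward(h0, layers, ablate=i, mode="operator") for i in range(4)]
```

Every entry of `naive` is the zero vector, so any metric computed from it is identical across layers. The entries of `fixed` differ from one another, which is what makes layer-specific KL values possible. In a real model the same separation would come from attaching forward hooks to the operator and MLP submodules rather than to the decoder layer itself.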

Resources

GitHub Repository — All code, data, results, and adapters
HuggingFace Model Card — Original model weights and documentation
LFM2 Technical Report — Liquid AI's architecture paper

Experiment Date: June 29, 2026 | Hardware: RTX 2070 Super 8GB | Runtime: ~45 minutes total | Model VRAM: 450 MB (bf16)

Deep Probe: Beyond Standard MI

10 follow-up probes beyond the standard MI-Atlas pass, with 18 result files and 23 total experiments.

Headline Finding: Residual Stream Locks In at L5

CKA similarity analysis gives rounded CKA=1.0000 from L5 through L13. In this run, the residual stream representation appears to stabilize after L5. Layers L6-L13 still affect logits, but CKA does not separate their representations at the reported precision.

Implication: L5 looks like a representation transition point in this setup. The later layers may be refining a stable representation rather than building a new one from scratch.

Residual Stream CKA Similarity

Layer Pair	CKA	Interpretation
embed ↔ L0	0.84	High — L0 preserves embedding structure
L0 ↔ L4	0.86	Gradual evolution through early layers
L4 ↔ L5	0.72	Dip — L5 creates a new representation
L5 ↔ L6	1.00	Indistinguishable at reported precision; lock-in begins
L5 ↔ L13	1.00	Indistinguishable at reported precision across all post-L5 layers
embed ↔ L5	0.18	L5 representation is largely dissimilar from the embeddings
L13 ↔ embed	0.55	Partial recovery of embedding structure; note this exceeds embed ↔ L5 (0.18) even though L5 ↔ L13 rounds to 1.00
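For reference, linear CKA between two activation matrices can be computed as below. The source does not say which CKA variant was used, so the linear form is an assumption, and the synthetic data is only there to illustrate one property: a shared high-magnitude direction can push CKA close to 1 even when everything else differs.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (n_samples, n_features) activation matrices."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
n, d = 200, 16
A = rng.normal(size=(n, d))            # stand-in for one layer's activations
B = rng.normal(size=(n, d))            # an unrelated second layer
shared = 25.0 * rng.normal(size=(n, 1))  # one large direction present in both

cka_unrelated = linear_cka(A, B)
cka_with_shared = linear_cka(np.hstack([A, shared]), np.hstack([B, shared]))
cka_rescaled = linear_cka(A, 3.0 * A)
```

Here `cka_unrelated` is small, `cka_with_shared` is very close to 1, and `cka_rescaled` is 1 up to floating-point error because CKA ignores overall scale. This may be relevant to the post-L5 result: if the norm jump at L5 is carried by a few large directions that persist through L13, CKA could round to 1.0000 while lower-magnitude structure keeps changing. That is a hypothesis, not something the probes here establish.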

Logit Lens: What Each Layer "Knows"

Projecting each layer's residual stream through the LM head reveals when the model forms its prediction:

Layer	"The capital of France is"	Entropy
embed	' is' (1.00)	0.00
L0	'?' (0.00)	0.00
L3	'olate' (0.18)	5.69
L5	' indeed' (0.22)	5.53
L6	' indeed' (0.34)	5.03
L10	' usually' (0.28)	3.97
L13	' Paris' (0.94)	0.32

The answer "Paris" only emerges at L13 despite the representation being locked at L5. The LM head reads different information from the same representation at different layers — or the embedding_norm + lm_head projection amplifies subtle differences invisible to CKA.
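A logit lens readout for a single hidden state can be sketched as follows. This assumes an RMS-style final norm followed by a linear unembedding, with entropy in nats; the parameter names `norm_weight` and `unembed` are illustrative, and the toy weights below are not from the model.

```python
import numpy as np

def logit_lens(hidden, norm_weight, unembed, eps=1e-6):
    """Return (top_token_id, top_prob, entropy_nats) for one hidden state."""
    normed = hidden / np.sqrt(np.mean(hidden ** 2) + eps) * norm_weight
    logits = unembed @ normed
    logits = logits - logits.max()                 # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return int(probs.argmax()), float(probs.max()), float(entropy)

d = 4
hidden = np.array([0.0, 0.0, 1.0, 0.0])
norm_weight = np.ones(d)

# A sharply peaked readout: the unembedding strongly favors token 2.
peaked = logit_lens(hidden, norm_weight, 50.0 * np.eye(d))
# A flat readout: a zero unembedding gives a uniform distribution over 5 tokens.
flat = logit_lens(hidden, norm_weight, np.zeros((5, d)))
```

The peaked case returns token 2 with probability near 1 and entropy near 0; the flat case returns probability 0.2 and entropy ln 5. Because the readout passes through a normalization and a large projection, small directional differences between layers can change the top token even when a similarity measure such as CKA reports the layers as nearly identical.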

Conv Kernel: Local → Lookahead

Layer	Kernel [past, current, future]	Dominant
L0	[0.018, 0.165, 0.017]	Current token
L1	[0.031, 0.153, 0.053]	Current token
L3	[0.025, 0.061, 0.145]	Future token
L5	[0.022, 0.051, 0.155]	Future token
L7	[0.020, 0.046, 0.144]	Future token
L13	[0.002, 0.008, 0.222]	Future token (dominant)

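Early convs (L0-L1) put most weight on kernel index [1], labeled the current token. Later convs (L3+) shift weight to index [2], labeled the next (future) token, whose magnitude grows from 0.017 to 0.222 across layers. These labels assume index [1] is the current position. If the layer is a causal, left-padded convolution (the usual choice for an autoregressive model, which cannot read tokens after the current one), index [2] would instead align with the current token and indices [0] and [1] with the two preceding tokens. In that case the trend is a shift from the previous token toward the current one rather than lookahead, so the labeling should be checked against the implementation.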
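Which token each kernel tap reads depends on the padding convention. The sketch below feeds a single impulse through a one-channel convolution, assuming k-1 zeros of left padding and PyTorch-style cross-correlation (taps applied in stored order); the source does not state the padding scheme, so this is a check to run against the real layer rather than a claim about it.

```python
def causal_depthwise_conv(x, w):
    """Cross-correlate one channel with taps w, left-padded by len(w)-1 zeros."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]

w = [0.002, 0.008, 0.222]                        # L13 kernel from the table
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]    # a single token at position 3
y = causal_depthwise_conv(impulse, w)

# Offsets (output position minus input position) at which each tap fires.
tap_offsets = {j: y.index(w[j]) - 3 for j in range(len(w))}
print(tap_offsets)
```

Under this convention tap [2] fires at offset 0 (the current token), tap [1] at offset 1 (the previous token, seen one step later), and tap [0] at offset 2. Nothing appears before position 3, so no output depends on a later input. Running the same impulse test through the actual conv module would settle how the table's columns should be labeled.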

LoRA Rank Sweep (Hub Layers L0,L2,L4,L5)

Rank	Params	Loss	KL Shift
r=2	16,384	9.16	0.008
r=4	32,768	8.64	0.031
r=8	65,536	8.12	0.071
r=16	131,072	7.70	0.116

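KL shift grows with rank, but not linearly: each doubling of rank (and of parameters) multiplies KL by about 3.9×, 2.3×, then 1.6×, while loss falls by roughly 0.4-0.5 per doubling. r=8 is the sweet spot (65K params, good loss convergence).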

Total experiments: 23 | Result files: 18 JSON | Total runtime: ~60 minutes | Date: June 29, 2026

Capability Benchmark: Where LFM2.5-230M Excels

82 automated tests across 14 capability categories. Each test has a deterministic pass/fail criterion.

Deployment Sweet Spots

USE FOR: Data extraction (70%), structured output generation (83%), entity extraction/NER (83%), code generation and translation (83%), factual Q&A (89%), multilingual tasks (86%), on-device lightweight agentic pipelines.

AVOID: Reasoning-heavy tasks (37.5%), adversarial/noisy environments (43%), text summarization and reformatting (25%), format-constrained generation with exact requirements (57%).

SPEED: 69-90 tok/s on RTX 2070 Super bf16. ~450MB VRAM. 128K context.

Capability	Score	Tests	Notes
Factual knowledge	88.9%	18	Strong across geography, science, history, tech. UK-specific weaker.
Multilingual	85.7%	7	French, Spanish, German, Chinese, Urdu all work. Arabic weaker.
Structured output	83.3%	6	JSON, YAML, CSV, markdown tables. JSON schema draft-07 fails.
Code	83.3%	6	Generation, explanation, Python→JS translation. Bug detection weaker.
Entity extraction	83.3%	3	Names, dates, locations, organizations. Product names weaker.
Math	73.3%	15	Integer arithmetic solid. Percentages good. Word problems weaker.
Data extraction	70.0%	5	Invoice, medical, product fields. Flight booking extraction fails.
Classification	66.7%	6	Sentiment strong. Topic classification (science vs tech) weaker.
Agentic patterns	66.7%	3	API error analysis and action planning. Multi-tool coordination weaker.
Instruction following	57.1%	7	Lists and translations OK. Exact word counts and "ONLY" constraints fail.
Robustness	42.9%	7	Injection resistance OK. Typo tolerance, all-caps, scrambled order fail.
Reasoning	37.5%	8	Logic puzzles weak. Syllogisms, trick questions, pattern completion fail.
Text transformation	25.0%	4	Summarization, style conversion, step extraction mostly fail.

Inference Speed (RTX 2070 Super, bf16)

Prompt Length	Tokens/sec	Latency (100 tokens)
Short (6 tokens)	69.0	1.45s
Medium (23 tokens)	90.3	1.11s
Long (51 tokens)	90.6	1.10s

Faster at longer prompts due to parallel prefill. Sustained ~90 tok/s decode.

Recommended Use Cases

Use Case	Fit	Why
Invoice/receipt parsing	Excellent	Structured extraction from known formats
Entity extraction pipeline	Excellent	NER at 83%, fast enough for real-time
JSON API response parsing	Excellent	Structured output at 83%, tiny footprint
Code snippet generation	Good	Simple functions, translations, explanations
Multilingual chatbot (basic)	Good	86% across 6 languages
Factual Q&A (constrained domain)	Good	89% factual accuracy on general knowledge
On-device data extraction agent	Good	Model card recommendation, verified at 70%
Sentiment analysis	Moderate	Positive/negative detection works, nuance weaker
Math tutoring (basic)	Moderate	Arithmetic solid, explanation quality varies
Text summarization	Poor	Only 25% on transformation tasks
Complex reasoning	Poor	37.5% on logic puzzles
Adversarial/noisy input	Poor	42.9% robustness — needs clean input

Exhaustive Finetuning Sweep

Comprehensive testing of finetuning methods with real HuggingFace datasets. Measures loss convergence, KL shift from base model, and hub preservation.

Method Comparison

Method	Dataset	Steps	Final Loss	KL Shift	Status
LoRA (target sweep)	Arithmetic	100	6.31-8.12	0.06-0.86	6 configs
QLoRA (4-bit NF4)	Alpaca (500)	200	5.12	0.51	Strongest learning
LoRA rank sweep	Arithmetic	100	9.16→7.70	0.008→0.12	r=2..16
SFT (Alpaca)	Alpaca (500)	200	5.26	0.16	Good task learning
DPO (manual)	Synthetic pairs	100	0.0001	0.002	Most surgical
GRPO (manual)	Math (10 prompts)	50	-0.006	0.001	Reward=1.0, minimal shift

Hyperparameter Sweep (13 configs, hub layers L0,L2,L4,L5)

Config	Value	Loss	KL Shift	Insight
Learning rate	1e-5	9.54	0.0006	Too conservative
Learning rate	5e-5	8.13	0.0022	Conservative
Learning rate	2e-4	6.76	0.009	Default — sweet spot
Learning rate	1e-3	5.78	0.037	Aggressive
Rank	r=2	7.64	0.004	Minimal params
Rank	r=4	7.13	0.005	Good efficiency
Rank	r=8	6.76	0.009	Sweet spot
Rank	r=16	6.43	0.019	Diminishing returns
Rank	r=32	6.07	0.020	Marginal gain vs r=16
Steps	50	7.66	0.004	Undertrained
Steps	200	6.03	0.015	Good
Steps	500	5.31	0.013	Loss drops, KL plateaus!

Finetuning Key Insight

Loss keeps decreasing with more training, but KL plateaus around 0.01-0.02. In this setup, the model improves on the training objective without a large measured distribution shift. That is a useful signal for controlled fine-tuning, but it still needs task-level eval before becoming a deployment rule.

Recommended config: lr=2e-4, r=8, 200-500 steps, target hub layers (L0,L2,L4,L5) with o_proj+MLP. This gives good task learning (loss 5.3-6.0) with minimal behavior shift (KL 0.01-0.02).
