A causal atlas of LiquidAI's 230M-parameter hybrid conv+attention language model, with gated short convolutions and grouped query attention.
The residual stream has three distinct phases with a dramatic norm transition:
| Phase | Layers | Norm Range | Character |
|---|---|---|---|
| Quiet | L0-L4 | 1.4 – 2.3 | Feature extraction, low-magnitude processing |
| Transition | L5 | → 25.5 | 10-18× norm explosion (input-dependent location) |
| High | L6-L13 | 22.9 – 26.3 | Sustained high-magnitude refinement |
Methodology note: The standard approach of hooking the full decoder layer and zeroing its output causes cascading zeros through the residual stream (all layers appear identical). Our corrected approach hooks the operator (conv/self_attn) and MLP (feed_forward) separately.
| Layer | Type | Operator KL | MLP KL | Skip KL |
|---|---|---|---|---|
| L0 | conv | 56.31 | 46.15 | 82.90 |
| L1 | conv | 7.94 | 11.18 | 36.31 |
| L2 | attn | 22.95 | 8.63 | 30.81 |
| L3 | conv | 3.79 | 22.84 | 25.86 |
| L4 | attn | 34.87 | 21.16 | 34.56 |
| L5 | conv | 5.42 | 47.75 | 44.41 |
| L6 | attn | 6.88 | 4.77 | 8.74 |
| L7 | conv | 1.83 | 5.47 | 7.26 |
| L8 | attn | 8.30 | 4.05 | 10.20 |
| L9 | conv | 2.32 | 5.29 | 7.07 |
| L10 | attn | 11.10 | 6.58 | 17.68 |
| L11 | conv | 3.14 | 5.10 | 8.67 |
| L12 | attn | 7.40 | 7.74 | 29.13 |
| L13 | conv | 4.55 | 5.74 | 12.13 |
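The per-layer values in the table above can be aggregated by operator type. The sketch below averages the MLP KL column separately over conv and attention layers; averaging is an assumption about how the "Conv 2.12× stronger" figure reported later in the comparison table was obtained, not something the source states.

```python
# MLP KL per layer, copied from the ablation table above.
mlp_kl = {
    # L0, L1, L3, L5, L7, L9, L11, L13
    "conv": [46.15, 11.18, 22.84, 47.75, 5.47, 5.29, 5.10, 5.74],
    # L2, L4, L6, L8, L10, L12
    "attn": [8.63, 21.16, 4.77, 4.05, 6.58, 7.74],
}


def mean(values):
    return sum(values) / len(values)


conv_mean = mean(mlp_kl["conv"])
attn_mean = mean(mlp_kl["attn"])
ratio = conv_mean / attn_mean

print(f"conv MLP mean KL: {conv_mean:.2f}")
print(f"attn MLP mean KL: {attn_mean:.2f}")
print(f"conv / attn ratio: {ratio:.2f}")
```

With the table values as given, the ratio rounds to 2.12, which matches the reported figure.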
| Head | Mean KL | Best Family | Role |
|---|---|---|---|
| L4_H11 | 0.948 | instruction_following (1.76) | Universal head |
| L2_H9 | 0.563 | json_schema (0.83) | Structural specialist |
| L2_H11 | 0.250 | factual_recall (0.32) | General processing |
| L2_H4 | 0.083 | factual_recall (0.08) | Minor |
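| L4_H4 | 0.078 | instruction_following (0.28) | Minor |

Remaining 91 heads: KL < 0.07 each. Individual head effects are far smaller than layer effects: the strongest head (L4_H11, mean KL 0.948) is about 87× weaker than the strongest layer ablation (L0 skip, KL 82.90).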
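H11 stands out: it is reported as the best head in all 6 attention layers (L2, L4, L6, L8, L10, L12), although in L2 the table above ranks H9 higher by mean KL. The same head index being prominent at every depth suggests it has learned a broadly useful function.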
Random steering vector (seed=42), applied at the last token position. Unlike Qwen2.5, where late layers are the steering target, steering here is more effective at early layers.
| Layer | Type | Max KL (s=-4.0) |
|---|---|---|
| L6 | attn | 16.37 |
| L5 | conv | 15.99 |
| L1 | conv | 15.87 |
| L2 | attn | 15.83 |
| L0 | conv | 15.59 |
| L10 | attn | 8.99 |
| L11 | conv | 4.80 |
| L12 | attn | 3.96 |
| L13 | conv | 2.99 |
Why? Early layers have residual norms of ~2, so a steering vector of magnitude 4 is about twice the size of the activation it is added to. Late layers have norms of ~25, making the same vector only 16% of the activation.
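The norm argument is plain arithmetic, sketched below. It assumes the steering vector has norm 4 (the |s| in the table header applied to a unit-norm direction) and uses the approximate residual norms quoted in the text; `relative_magnitude` is a hypothetical helper name.

```python
def relative_magnitude(steer_norm, residual_norm):
    """Ratio of the steering-vector norm to the residual-stream norm."""
    return steer_norm / residual_norm


STEER_NORM = 4.0  # assumed: |s| = 4.0 applied to a unit-norm direction

for phase, residual_norm in [("early (norm ~2)", 2.0), ("late (norm ~25)", 25.0)]:
    rel = relative_magnitude(STEER_NORM, residual_norm)
    print(f"{phase}: steering vector is {rel:.0%} of the residual norm")
```

This prints 200% for the early phase and 16% for the late phase.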
46 tests across 9 categories. This 230M model is surprisingly capable.
| Category | Result | Notes |
|---|---|---|
| Factual recall | 7/10 correct | Paris, Tokyo, H2O, Jupiter, 100°C, pound, Rossum |
| JSON generation | Valid structure | Some extra brace artifacts on complex schemas |
| Code generation | Correct algorithms | Fibonacci, binary search, merge sort all correct! |
| Instruction following | Good | Proper lists, one-sentence, bullet points |
| Multi-turn context | Maintains state | Names, colors, arithmetic carry over correctly |
| Creative writing | Coherent | Valid haiku, atmospheric prose |
| Prompt injection | Deflected | Completely ignores injection attempt |
| Edge cases | Degenerates | Repeated/special chars loop (expected at 230M) |
| Config | Params | Trainable % | Loss | KL Shift |
|---|---|---|---|---|
| all_linear | 737,280 | 0.32% | 6.31 | 0.86 |
| attn_only | 344,064 | 0.15% | 6.76 | 0.30 |
| atlas_guided | 65,536 | 0.028% | 8.12 | 0.06 |
| mlp_only | — | — | — | — |
| conv_proj_only | — | — | — | — |
| atlas_full | — | — | — | — |
Atlas-guided gives the smallest footprint: 11× fewer params than all_linear with 14× less behavior shift, at the cost of a higher final loss (8.12 vs 6.31). Targeting just the hub layers L0, L2, L4, L5 with o_proj+MLP is the most surgical strategy.
| Metric | LFM2.5-230M | Qwen2.5-0.5B | SmolLM2-1.7B |
|---|---|---|---|
| Architecture | Hybrid (8 conv + 6 attn) | Pure transformer (24 attn) | Pure transformer (24 attn) |
| Universal hub | L0 (conv, 0% depth) | L2 (attn, 8% depth) | L0 (attn, 0% depth) |
| Hub KL (skip) | 82.9 | — | — |
| Conv vs Attn MLP | Conv 2.12× stronger | N/A (all attn) | N/A (all attn) |
| Steering target | Early layers (L0-L6) | Late layers (L19-L23) | L0 (consistent) |
| Head specialization | 2-3 heads matter | 22× increase at scale | — |
| Hub stability | std=0.0 (3 seeds) | std=0.0 (3 seeds) | — |
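Hub position looks architecture-specific rather than set by size or depth alone. LFM2.5 (hybrid, 230M) and SmolLM2 (pure transformer, 1.7B) both have L0 hubs, while Qwen2.5 (pure transformer, 0.5B) has its hub at L2. Note that the two pure transformers disagree with each other, so "architecture" here means more than the hybrid vs. pure-transformer split.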
Hooking model.model.layers[i] and zeroing its output causes cascading zeros through the residual stream. When you zero L0's output, L1 sees input=0, which makes L1's output=0, and so on. ALL layers produce identical KL because the final hidden state is zero regardless of which layer was zeroed.
The fix: Hook the operator (conv/self_attn) and MLP (feed_forward) separately. Zeroing the operator gives: residual + 0 + ffn(norm(residual)), which preserves the residual pass-through and gives layer-specific KL values.
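The difference between the two ablations can be reproduced in a toy residual stack without any framework. This is a minimal sketch, not the actual hook code: the layers are random bias-free matrices, `forward`, `make_layers`, and the dimensions are all hypothetical, and the real model's conv/self_attn and feed_forward modules are replaced by plain matrix products.

```python
import math
import random


def rms_norm(x, eps=1e-6):
    """Bias-free RMS norm; maps the zero vector to the zero vector."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def make_layers(n_layers, dim, seed=0):
    """Toy weights: one operator matrix and one FFN matrix per layer."""
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    return [(rand_matrix(), rand_matrix()) for _ in range(n_layers)]


def forward(x, layers, zero_layer=None, zero_operator=None):
    for i, (w_op, w_ffn) in enumerate(layers):
        op_out = matvec(w_op, rms_norm(x))
        if i == zero_operator:
            op_out = [0.0] * len(x)  # corrected ablation: operator branch only
        x = [a + b for a, b in zip(x, op_out)]
        x = [a + b for a, b in zip(x, matvec(w_ffn, rms_norm(x)))]
        if i == zero_layer:
            x = [0.0] * len(x)  # buggy ablation: wipes the residual stream too
    return x


layers = make_layers(n_layers=4, dim=3)
x0 = [1.0, -2.0, 0.5]
buggy = [forward(x0, layers, zero_layer=i) for i in range(4)]
fixed = [forward(x0, layers, zero_operator=i) for i in range(4)]
print("whole-layer zeroing gives identical outputs:", all(o == buggy[0] for o in buggy))
print("operator-only zeroing gives identical outputs:", all(o == fixed[0] for o in fixed))
```

In this toy, zeroing the whole layer output returns the zero vector no matter which layer is ablated, while operator-only zeroing keeps the residual pass-through and so yields a different final state per layer.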
Implication: This bug likely affects existing MI-Atlas results for all residual-stream models (including Qwen). The reported hub locations should be re-verified with the corrected methodology.
Experiment Date: June 29, 2026 | Hardware: RTX 2070 Super 8GB | Runtime: ~45 minutes total | Model VRAM: 450 MB (bf16)
10 follow-up probes beyond the standard MI-Atlas pass, with 18 result files and 23 total experiments.
CKA similarity analysis gives rounded CKA=1.0000 from L5 through L13. In this run, the residual stream representation appears to stabilize after L5. Layers L6-L13 still affect logits, but CKA does not separate their representations at the reported precision.
Implication: L5 looks like a representation transition point in this setup. The later layers may be refining a stable representation rather than building a new one from scratch.
| Layer Pair | CKA | Interpretation |
|---|---|---|
| embed ↔ L0 | 0.84 | High — L0 preserves embedding structure |
| L0 ↔ L4 | 0.86 | Gradual evolution through early layers |
| L4 ↔ L5 | 0.72 | Dip — L5 creates a new representation |
| L5 ↔ L6 | 1.00 | Indistinguishable at reported precision; lock-in begins |
| L5 ↔ L13 | 1.00 | Indistinguishable at reported precision across all post-L5 layers |
| embed ↔ L5 | 0.18 | L5 representation shares little structure with the embeddings |
| L13 ↔ embed | 0.55 | Final output partially recovers embedding structure |
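A CKA of 1.00 does not mean two layers hold the same numbers. The sketch below implements linear CKA in plain Python and shows that it is unchanged by rescaling or rotating the activations. Whether the analysis used linear or kernel CKA is not stated in the source, so the linear form here is an assumption, and the small `acts` matrix is made-up data.

```python
import math


def center_columns(m):
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same row count."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between activations x (n x d1) and y (n x d2)."""
    x, y = center_columns(x), center_columns(y)
    return cross_frobenius_sq(x, y) / math.sqrt(
        cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y)
    )


# Made-up activations: 4 samples, 2 features.
acts = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.5]]
scaled = [[10.0 * v for v in row] for row in acts]  # same geometry, 10x the norm
rotated = [[-b, a] for a, b in acts]                # 90-degree rotation

print(f"CKA(acts, scaled)  = {linear_cka(acts, scaled):.4f}")
print(f"CKA(acts, rotated) = {linear_cka(acts, rotated):.4f}")
```

Both comparisons give 1.0000, even though the scaled copy has ten times the norm and the rotated copy has different coordinates. This is consistent with later layers still changing logits while CKA stays at 1.00.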
Projecting each layer's residual stream through the LM head reveals when the model forms its prediction:
| Layer | "The capital of France is" | Entropy |
|---|---|---|
| embed | ' is' (1.00) | 0.00 |
| L0 | '?' (0.00) | 0.00 |
| L3 | 'olate' (0.18) | 5.69 |
| L5 | ' indeed' (0.22) | 5.53 |
| L6 | ' indeed' (0.34) | 5.03 |
| L10 | ' usually' (0.28) | 3.97 |
| L13 | ' Paris' (0.94) | 0.32 |
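The answer "Paris" only emerges at L13 even though CKA reports the representation as locked from L5. Two explanations are possible: the LM head reads different information from the same representation at different layers, or the embedding_norm + lm_head projection amplifies subtle differences that CKA does not register. The second reading fits the CKA table, where embed ↔ L5 (0.18) and L13 ↔ embed (0.55) differ even though L5 ↔ L13 rounds to 1.00.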
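The projection step behind the table can be sketched as follows. This is a toy illustration under stated assumptions: the vocabulary, the 2-d hidden states, and the `unembed` matrix are all made up, the final norm (embedding_norm) is omitted, and entropy is computed in nats because the table does not state its unit.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def logit_lens(hidden, unembed, vocab):
    """Project one hidden state through an unembedding matrix (vocab x dim)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    probs = softmax(logits)
    top = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[top], probs[top], entropy(probs)


# Hypothetical 3-token vocabulary and 2-d hidden states, for illustration only.
vocab = [" Paris", " indeed", " usually"]
unembed = [[4.0, 0.0], [0.0, 1.0], [0.0, 0.5]]

for name, hidden in [("mid layer", [0.2, 1.0]), ("final layer", [2.0, 0.1])]:
    token, prob, ent = logit_lens(hidden, unembed, vocab)
    print(f"{name}: top={token!r} p={prob:.2f} entropy={ent:.2f}")
```

On a real model the same loop would run over each layer's residual stream at the last token position, with the model's own norm and LM head in place of `unembed`.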
| Layer | Kernel [past, current, future] | Dominant |
|---|---|---|
| L0 | [0.018, 0.165, 0.017] | Current token |
| L1 | [0.031, 0.153, 0.053] | Current token |
| L3 | [0.025, 0.061, 0.145] | Future token |
| L5 | [0.022, 0.051, 0.155] | Future token |
| L7 | [0.020, 0.046, 0.144] | Future token |
| L13 | [0.002, 0.008, 0.222] | Future token (dominant) |
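Early convs (L0-L1) put most weight on kernel position [1]; later convs (L3+) shift it to position [2], whose weight grows from 0.017 to 0.222 across layers. The table labels these positions past, current, and future, but a causal convolution cannot read tokens that have not been generated yet. Under the usual left-padded layout, positions [0], [1], [2] would instead correspond to the tokens at t-2, t-1, and t, in which case the shift is from the previous token toward the current one rather than a look-ahead. The mapping should be checked against the model's padding convention.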
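Which token each kernel position reads depends on the padding convention. The sketch below assumes a left-padded (causal) size-3 convolution, which is a common layout but is not confirmed for this model by the source, and feeds it an impulse to show where each tap places it. `causal_conv1d` is a hypothetical helper, not the model's implementation.

```python
def causal_conv1d(signal, kernel):
    """1-D convolution with left zero-padding of len(kernel) - 1 (causal)."""
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal)
    return [
        sum(kernel[k] * padded[t + k] for k in range(len(kernel)))
        for t in range(len(signal))
    ]


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]  # a single "token" at position 2

for tap in range(3):
    kernel = [1.0 if k == tap else 0.0 for k in range(3)]
    out = causal_conv1d(impulse, kernel)
    lag = out.index(1.0) - impulse.index(1.0)
    print(f"kernel position [{tap}] reads the token {lag} step(s) back")
```

Under this layout position [2] reads the current token (0 steps back), position [1] the previous token, and position [0] the token two steps back; no tap reaches a future position.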
| Rank | Params | Loss | KL Shift |
|---|---|---|---|
| r=2 | 16,384 | 9.16 | 0.008 |
| r=4 | 32,768 | 8.64 | 0.031 |
| r=8 | 65,536 | 8.12 | 0.071 |
| r=16 | 131,072 | 7.70 | 0.116 |
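KL shift rises monotonically with rank but not by a fixed factor: successive doublings multiply KL by roughly 3.9×, 2.3×, and 1.6×, while params double exactly. r=8 is the sweet spot (65K params, good loss convergence).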
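The growth per rank doubling can be read directly off the table; the short check below computes it from the reported values.

```python
# (rank, trainable params, KL shift) from the rank sweep table.
rank_sweep = [
    (2, 16_384, 0.008),
    (4, 32_768, 0.031),
    (8, 65_536, 0.071),
    (16, 131_072, 0.116),
]

param_ratios = []
kl_ratios = []
for (r0, p0, kl0), (r1, p1, kl1) in zip(rank_sweep, rank_sweep[1:]):
    param_ratios.append(p1 / p0)
    kl_ratios.append(kl1 / kl0)
    print(f"r={r0} -> r={r1}: params x{p1 / p0:.1f}, KL x{kl1 / kl0:.1f}")
```

Params double at every step, while the KL multiplier falls from about 3.9 to about 1.6, so KL growth flattens as rank increases.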
Total experiments: 23 | Result files: 18 JSON | Total runtime: ~60 minutes | Date: June 29, 2026
82 automated tests across 14 capability categories. Each test has a deterministic pass/fail criterion.
USE FOR: Data extraction (70%), structured output generation (83%), entity extraction/NER (83%), code generation and translation (83%), factual Q&A (89%), multilingual tasks (86%), on-device lightweight agentic pipelines.
AVOID: Reasoning-heavy tasks (37.5%), adversarial/noisy environments (43%), text summarization and reformatting (25%), format-constrained generation with exact requirements (57%).
SPEED: 69-90 tok/s on RTX 2070 Super bf16. ~450MB VRAM. 128K context.
| Capability | Score | Tests | Notes |
|---|---|---|---|
| Factual knowledge | 88.9% | 18 | Strong across geography, science, history, tech. UK-specific weaker. |
| Multilingual | 85.7% | 7 | French, Spanish, German, Chinese, Urdu all work. Arabic weaker. |
| Structured output | 83.3% | 6 | JSON, YAML, CSV, markdown tables. JSON schema draft-07 fails. |
| Code | 83.3% | 6 | Generation, explanation, Python→JS translation. Bug detection weaker. |
| Entity extraction | 83.3% | 3 | Names, dates, locations, organizations. Product names weaker. |
| Math | 73.3% | 15 | Integer arithmetic solid. Percentages good. Word problems weaker. |
| Data extraction | 70.0% | 5 | Invoice, medical, product fields. Flight booking extraction fails. |
| Classification | 66.7% | 6 | Sentiment strong. Topic classification (science vs tech) weaker. |
| Agentic patterns | 66.7% | 3 | API error analysis and action planning. Multi-tool coordination weaker. |
| Instruction following | 57.1% | 7 | Lists and translations OK. Exact word counts and "ONLY" constraints fail. |
| Robustness | 42.9% | 7 | Injection resistance OK. Typo tolerance, all-caps, scrambled order fail. |
| Reasoning | 37.5% | 8 | Logic puzzles weak. Syllogisms, trick questions, pattern completion fail. |
| Text transformation | 25.0% | 4 | Summarization, style conversion, step extraction mostly fail. |
| Prompt Length | Tokens/sec | Latency (100 tokens) |
|---|---|---|
| Short (6 tokens) | 69.0 | 1.45s |
| Medium (23 tokens) | 90.3 | 1.11s |
| Long (51 tokens) | 90.6 | 1.10s |
Faster at longer prompts due to parallel prefill. Sustained ~90 tok/s decode.
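The latency column follows from the throughput column: it is the time to emit 100 tokens at the measured rate. A quick check, using only the table values:

```python
# Tokens/sec from the throughput table.
throughput = {"short": 69.0, "medium": 90.3, "long": 90.6}


def latency_seconds(tokens_per_sec, n_tokens=100):
    """Time to generate n_tokens at a constant rate."""
    return n_tokens / tokens_per_sec


for name, tps in throughput.items():
    print(f"{name}: {latency_seconds(tps):.2f}s per 100 tokens")
```

This reproduces the 1.45s, 1.11s, and 1.10s figures in the table.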
| Use Case | Fit | Why |
|---|---|---|
| Invoice/receipt parsing | Excellent | Structured extraction from known formats |
| Entity extraction pipeline | Excellent | NER at 83%, fast enough for real-time |
| JSON API response parsing | Excellent | Structured output at 83%, tiny footprint |
| Code snippet generation | Good | Simple functions, translations, explanations |
| Multilingual chatbot (basic) | Good | 86% across 6 languages |
| Factual Q&A (constrained domain) | Good | 89% factual accuracy on general knowledge |
| On-device data extraction agent | Good | Model card recommendation, verified at 70% |
| Sentiment analysis | Moderate | Positive/negative detection works, nuance weaker |
| Math tutoring (basic) | Moderate | Arithmetic solid, explanation quality varies |
| Text summarization | Poor | Only 25% on transformation tasks |
| Complex reasoning | Poor | 37.5% on logic puzzles |
| Adversarial/noisy input | Poor | 42.9% robustness — needs clean input |
Comprehensive testing of finetuning methods with real HuggingFace datasets. Measures loss convergence, KL shift from base model, and hub preservation.
| Method | Dataset | Steps | Final Loss | KL Shift | Status |
|---|---|---|---|---|---|
| LoRA (target sweep) | Arithmetic | 100 | 6.31-8.12 | 0.06-0.86 | 6 configs |
| QLoRA (4-bit NF4) | Alpaca (500) | 200 | 5.12 | 0.51 | Strongest learning |
| LoRA rank sweep | Arithmetic | 100 | 9.16→7.70 | 0.008→0.12 | r=2..16 |
| SFT (Alpaca) | Alpaca (500) | 200 | 5.26 | 0.16 | Good task learning |
| DPO (manual) | Synthetic pairs | 100 | 0.0001 | 0.002 | Most surgical |
| GRPO (manual) | Math (10 prompts) | 50 | -0.006 | 0.001 | Reward=1.0, minimal shift |
| Config | Value | Loss | KL Shift | Insight |
|---|---|---|---|---|
| Learning rate | 1e-5 | 9.54 | 0.0006 | Too conservative |
| Learning rate | 5e-5 | 8.13 | 0.0022 | Conservative |
| Learning rate | 2e-4 | 6.76 | 0.009 | Default — sweet spot |
| Learning rate | 1e-3 | 5.78 | 0.037 | Aggressive |
| Rank | r=2 | 7.64 | 0.004 | Minimal params |
| Rank | r=4 | 7.13 | 0.005 | Good efficiency |
| Rank | r=8 | 6.76 | 0.009 | Sweet spot |
| Rank | r=16 | 6.43 | 0.019 | Diminishing returns |
| Rank | r=32 | 6.07 | 0.020 | Marginal gain vs r=16 |
| Steps | 50 | 7.66 | 0.004 | Undertrained |
| Steps | 200 | 6.03 | 0.015 | Good |
| Steps | 500 | 5.31 | 0.013 | Loss drops, KL plateaus! |
Loss keeps decreasing with more training, but KL plateaus around 0.01-0.02. In this setup, the model improves on the training objective without a large measured distribution shift. That is a useful signal for controlled fine-tuning, but it still needs task-level eval before becoming a deployment rule.
Recommended config: lr=2e-4, r=8, 200-500 steps, target hub layers (L0,L2,L4,L5) with o_proj+MLP. In the step sweep above, 200-500 steps gave good task learning (loss 5.3-6.0) with minimal behavior shift (KL 0.01-0.02).
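The KL shift used throughout these tables compares the base and fine-tuned next-token distributions. A minimal sketch of that measurement is below. The direction KL(base || tuned), averaging over positions, and the unit (nats) are assumptions, since the source does not specify them, and the logits are made-up values over a 4-token vocabulary.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_divergence(base_logits, tuned_logits):
    """KL(base || tuned) in nats for one next-token distribution."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(tuned_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl_shift(base_batch, tuned_batch):
    """Average KL over paired positions from the two models."""
    pairs = list(zip(base_batch, tuned_batch))
    return sum(kl_divergence(b, t) for b, t in pairs) / len(pairs)


# Hypothetical logits at two positions, before and after fine-tuning.
base = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.2, 0.1]]
tuned = [[2.1, 0.4, 0.1, -1.0], [0.0, 2.8, 0.3, 0.1]]

print(f"mean KL shift: {mean_kl_shift(base, tuned):.4f}")
```

In practice the two logit batches would come from running the same evaluation prompts through the base and the adapted model.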