Qualitative Analysis: The Vibe Check

TL;DR: 30 prompts × 6 configs (Qwen2.5-0.5B and 1.5B at bf16, 8-bit, and 4-bit NF4). The mechanistic atlas tells you which components matter; this page asks the simpler, blunter question: does the output actually read well? The answer is a clear win for scale. 1.5B is qualitatively much better: roughly 3–5× fewer repetitive outputs at bf16 and 8-bit (1.6× fewer at 4-bit NF4) and ~2× better constraint following than 0.5B. And the quantization surprise: 4-bit NF4 is the sweet spot, not 8-bit. 8-bit is the slowest configuration at both sizes (bitsandbytes dequantization overhead), while 4-bit NF4 loses little speed at 1.5B (9% vs bf16) and keeps constraint adherence, at the cost of a few more repetitive outputs. The practical recipe: run 1.5B in 4-bit NF4.

Setup: 30 prompts across code, creative writing, reasoning, and instruction-following categories. Each prompt carries explicit constraints (e.g. "must include a docstring", "exactly 3 sentences"). Each of the 6 configs (0.5B-bf16, 0.5B-8bit, 0.5B-4bit, 1.5B-bf16, 1.5B-8bit, 1.5B-4bit) generated up to 200 tokens per prompt on the same RTX 2070 Super (8GB), using the HF/transformers backend with bitsandbytes for 8-bit and 4-bit NF4. Outputs were scored for word/character/sentence counts, repetition ratio, empty/garbage flags, a repetitive/degenerate flag, and constraint adherence.
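
For reference, the six-config grid could be reproduced roughly as follows. This is a minimal sketch, not the original run script: the Hub IDs, the helper names (`quant_settings`, `load_model`), and the bf16 compute dtype for NF4 are assumptions. The 200-token cap and greedy decoding come from the setup and method notes on this page.

```python
# Sketch of the 6-config grid. Hub IDs and helper names are assumptions,
# not taken from the original run scripts.
MODEL_IDS = {"0.5B": "Qwen/Qwen2.5-0.5B", "1.5B": "Qwen/Qwen2.5-1.5B"}
PRECISIONS = ("bf16", "8bit", "4bit")

# Greedy decoding with a 200-token cap, as described on this page.
GEN_KWARGS = {"max_new_tokens": 200, "do_sample": False}


def config_names():
    """Return the six config labels, e.g. '1.5B-4bit'."""
    return [f"{size}-{prec}" for size in MODEL_IDS for prec in PRECISIONS]


def quant_settings(precision):
    """Map a precision label to BitsAndBytesConfig keyword arguments.

    Returns None for bf16, which needs no quantization config.
    """
    if precision == "bf16":
        return None
    if precision == "8bit":
        return {"load_in_8bit": True}
    if precision == "4bit":
        return {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
    raise ValueError(f"unknown precision: {precision}")


def load_model(size, precision):
    """Load one config. Heavy imports stay local so the grid itself
    can be inspected without torch/transformers installed."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = MODEL_IDS[size]
    settings = quant_settings(precision)
    kwargs = {"device_map": "auto"}
    if settings is None:
        kwargs["torch_dtype"] = torch.bfloat16
    else:
        if precision == "4bit":
            # Assumption: NF4 weights with bf16 compute.
            settings = {**settings, "bnb_4bit_compute_dtype": torch.bfloat16}
        kwargs["quantization_config"] = BitsAndBytesConfig(**settings)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    return tokenizer, model
```

With a model loaded, generation would look like `model.generate(**inputs, **GEN_KWARGS)`, where `inputs` is the tokenized prompt.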

Config Comparison: 6 Configs Side by Side

Every metric, every config. The 1.5B 4-bit NF4 row is marked as the sweet spot. All values from completed runs on the same hardware.

Config	Precision	Speed (tok/s)	vs bf16	Constraint Adherence	Repetitive /30	Repetition Rate	Empty /30
0.5B	bf16	23.6	—	0.0%	13	43.3%	1
0.5B	8-bit	10.5	−55.5%	3.3%	16	53.3%	1
0.5B	4-bit NF4	13.7	−41.9%	6.7%	13	43.3%	1
1.5B	bf16	18.8	—	13.3%	4	13.3%	0
1.5B	8-bit	9.0	−52.1%	16.7%	3	10.0%	0
1.5B (sweet spot)	4-bit NF4	17.1	−9.0%	13.3%	8	26.7%	2

"vs bf16" is the speed loss relative to that model's own bf16 baseline. Repetition Rate = repetitive outputs ÷ 30. Lower is better for repetition and empty; higher is better for speed and constraint adherence.

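The two derived columns can be recomputed from the raw measurements. The sketch below hard-codes the speeds and repetitive-output counts from the table; the function names are illustrative only.

```python
# Raw measurements from the table: (tok/s, repetitive outputs out of 30).
N_PROMPTS = 30
RAW = {
    ("0.5B", "bf16"): (23.6, 13),
    ("0.5B", "8-bit"): (10.5, 16),
    ("0.5B", "4-bit NF4"): (13.7, 13),
    ("1.5B", "bf16"): (18.8, 4),
    ("1.5B", "8-bit"): (9.0, 3),
    ("1.5B", "4-bit NF4"): (17.1, 8),
}


def speed_vs_bf16(size, precision):
    """Percent speed change relative to the same model's bf16 run."""
    speed, _ = RAW[(size, precision)]
    baseline, _ = RAW[(size, "bf16")]
    return round((speed / baseline - 1.0) * 100.0, 1)


def repetition_rate(size, precision):
    """Repetitive outputs as a percentage of the 30 prompts."""
    _, repetitive = RAW[(size, precision)]
    return round(repetitive / N_PROMPTS * 100.0, 1)


for (size, precision) in RAW:
    print(size, precision, speed_vs_bf16(size, precision), repetition_rate(size, precision))
```
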
Key Findings

The four results that actually change what you should run.

1. Scale wins on quality HIGH

0.5B vs 1.5B · 30 prompts each

1.5B is qualitatively much better. Repetitive outputs drop from 13–16 / 30 at 0.5B to 3–8 / 30 at 1.5B: about 3× at bf16 (13 → 4), 5× at 8-bit (16 → 3), and 1.6× at 4-bit NF4 (13 → 8). Constraint adherence at least doubles (0.5B: 0–6.7%; 1.5B: 13.3–16.7%). The 0.5B model frequently degenerates into looping prose; 1.5B mostly stays on track.

2. 4-bit NF4 is the sweet spot HIGH

Speed & quality, not 8-bit

Counter-intuitive but consistent: 4-bit NF4, not 8-bit, is the right quantization. At 1.5B, 4-bit NF4 is only 9% slower than bf16 (17.1 vs 18.8 tok/s) while 8-bit is 52% slower (9.0 tok/s). Constraint adherence is within noise of the 8-bit run (13.3% vs 16.7%); repetition is higher (8 vs 3 of 30), a gap that 30 prompts are too few to pin down.

3. 8-bit is the slowest quant MEDIUM

bitsandbytes dequantization overhead

More bits does not mean faster. 8-bit is slower than 4-bit at both scales. The likely cause is the bitsandbytes 8-bit path (LLM.int8()), which adds quantize/dequantize steps around every matmul and handles outlier features in a separate 16-bit path. The extra precision buys little quality but costs ~50% of throughput. 8-bit is the worst of both worlds here.

4. 0.5B degrades more under quant MEDIUM

Quantization sensitivity scales inversely

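The smaller model is hit harder by quantization, and the measured cost is throughput. 0.5B loses 42–55% of speed going to 4-/8-bit; 1.5B loses 9–52%. The gap comes from 4-bit (42% vs 9%), since 8-bit costs about half of throughput at either size. Quality does not show the same pattern: at 0.5B, repetition is flat or slightly worse under quantization and constraint adherence is marginally higher. The likely explanation is fixed per-matmul quantization overhead, which dominates the 0.5B model's small layers and is amortized better by the larger layers at 1.5B.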

Quantization recommendation: run 1.5B in 4-bit NF4. It is the only config that threads all three needles at once: (1) speed: 17.1 tok/s, just 9% off the bf16 baseline and nearly 2× the 8-bit config; (2) quality: 13.3% constraint adherence, identical to bf16 and within noise of 8-bit (16.7%), with a 26.7% repetition rate that is higher than bf16 (13.3%) and 8-bit (10.0%) but not by a margin that n=30 can resolve; (3) memory: 4-bit NF4 fits the 1.5B model comfortably in 8GB where bf16 is tight. 8-bit should be avoided: it pays the full dequantization tax for almost no quality gain. If you must run 0.5B, prefer 4-bit NF4 there too (13.7 tok/s vs 10.5 for 8-bit), but expect noticeably more degeneration regardless of precision.

Speed: The 8-bit Tax

Why more bits is slower, and why 1.5B-4bit nearly matches bf16.

8-bit is slower than 4-bit. Yes, really.

The instinct is that 8-bit (more bits, more precision) should sit between bf16 and 4-bit on speed. It does not. At both model sizes 8-bit is the slowest configuration: 0.5B-8bit runs 10.5 tok/s versus 13.7 for 4-bit, and 1.5B-8bit runs 9.0 tok/s versus 17.1 for 4-bit. The likely cause is in the backend: the bitsandbytes 8-bit path stores weights compactly but adds quantize/dequantize work around every single matmul. That is a per-operation tax that 4-bit NF4, which uses fused NF4 kernels for decoding, largely escapes. The result is that the "lighter" 4-bit format is faster than the "heavier" 8-bit one.

1.5B-4bit barely loses to bf16.

The standout speed number is 1.5B-4bit at 17.1 tok/s, only 9% below the 18.8 tok/s bf16 baseline. For weights that are roughly 4× smaller, that is an exceptional trade. The same comparison at 0.5B is far worse: 4-bit drops 42% (23.6 → 13.7). A plausible explanation is that the 1.5B model's larger matmuls amortize the quantization overhead better, while the 0.5B model's tiny per-layer ops are dominated by the fixed dequantization cost. In this run, the bigger model quantizes more gracefully.
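
The page does not say how tok/s was measured, so the following is only one reasonable way to do it: total new tokens divided by total wall-clock generation time. `generate` is a hypothetical callable that runs one prompt and returns the number of new tokens; the `clock` parameter exists so the helper can be exercised with a fake timer.

```python
import time


def tokens_per_second(generate, prompts, clock=time.perf_counter):
    """Return total generated tokens divided by total wall-clock seconds.

    `generate` is a hypothetical callable: prompt -> number of new tokens.
    """
    total_tokens = 0
    total_seconds = 0.0
    for prompt in prompts:
        start = clock()
        total_tokens += generate(prompt)
        total_seconds += clock() - start
    if total_seconds <= 0.0:
        raise ValueError("no measurable time elapsed")
    return total_tokens / total_seconds
```

On a GPU, `generate` should wait for the device to finish (for example with `torch.cuda.synchronize()`) before returning, otherwise the timer can stop while kernels are still running.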

Constraint Adherence: Following Instructions

Can the model actually do what the prompt explicitly asks?

Each prompt carries explicit constraints ("must include a docstring", "exactly 3 sentences", "must handle the no-pair case"). Adherence is the share of those constraints met per config. Two patterns stand out:

0.5B essentially ignores constraints. bf16 scores 0.0%: not a single constraint met across 30 prompts. The quantized runs score slightly higher (8-bit 3.3%, 4-bit 6.7%), but at n=30 that is within noise, and the ceiling is near zero. The 0.5B model tends to describe the task or hallucinate requirements rather than satisfy the stated ones.
1.5B follows constraints ~2× better, and 8-bit edges bf16. All three 1.5B configs land at 13–17%, roughly double the 0.5B best. The mild 8-bit lead (16.7%) over bf16/4-bit (13.3%) is within noise at n=30 and is not interpreted as a real quantization benefit — it is flagged as a tie.

The absolute numbers are low for both models — these are 0.5B/1.5B base models with no instruction tuning, so weak instruction-following is expected. The reliable signal is the relative one: scale roughly doubles adherence, and quantization does not meaningfully change it at either size.
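
To make the idea of automated constraint checks concrete, here is a minimal sketch of two of them, for the "must include a docstring" and "exactly 3 sentences" constraints quoted above. These are illustrative heuristics, not the checks used to produce the numbers on this page; all function names are hypothetical.

```python
import ast
import re


def has_docstring(code):
    """True if any function defined in `code` has a docstring."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Prose or broken code cannot satisfy a docstring constraint.
        return False
    return any(
        ast.get_docstring(node) is not None
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )


def sentence_count(text):
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(1 for part in parts if part)


def adherence_rate(results):
    """Share of passed checks, as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(1 for ok in results if ok) / len(results)
```

The sentence splitter miscounts abbreviations such as "e.g.", which is the kind of undercounting a heuristic scorer is prone to.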

Repetition & Degeneration: The Looping Problem

Where outputs collapse into self-repeating text, by config.

Repetition is the dominant failure mode here, and it is where the scale gap is widest. Each output is flagged as repetitive when its repetition ratio crosses a degeneration threshold; empty when it produces no usable tokens.

0.5B loops constantly. 13–16 of 30 outputs are repetitive across all three precisions (43–53%). The model seizes on a phrase and rewrites it with one token swapped — the classic small-model degeneration loop. Quantization does not fix this; 8-bit is actually slightly worse (16) than bf16/4-bit (13).
1.5B loops less at every precision. 3–8 of 30 outputs are repetitive (10–27%). bf16 and 8-bit are best (4 and 3, about 3× and 5× fewer than 0.5B); 4-bit NF4 picks up extra loops (8, only 1.6× fewer than 0.5B), a quality cost to weigh against the large speed/memory win.
Empty outputs are rare and roughly flat. 0.5B produces 1 empty output per precision; 1.5B produces 0 (bf16/8-bit) or 2 (4-bit). Degeneration, not silence, is the failure mode.
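
The exact repetition-ratio formula and threshold are not given here. One common way to define such a metric is the share of word n-grams that duplicate an earlier n-gram in the same output, sketched below. Both the n-gram size and the 0.5 threshold are assumptions, so this will not reproduce the ratios reported on this page.

```python
def repetition_ratio(text, n=4):
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def is_repetitive(text, threshold=0.5, n=4):
    """Flag an output as degenerate. The threshold is a hypothetical value."""
    return repetition_ratio(text, n) >= threshold


def is_empty(text):
    """Flag outputs with no usable tokens."""
    return not text.strip()
```

A looping output such as the 0.5B sample scores high because nearly every 4-gram after the first sentence has already appeared.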

Sample Prose: Same Prompt, Two Configs

Prompt code_001 — "Write a Python function that checks if a string is a palindrome. Include a docstring and handle empty strings." A concrete illustration of the degeneration gap.

0.5B · bf16 · DEGENERATES

Additionally, the function should be able to handle strings with special characters and numbers. The function should also be able to handle strings with mixed case letters and numbers. The function should be able to handle strings with a maximum length of 100 characters. The function should be able to handle strings with a maximum length of 1000 characters. The function should be able to handle strings with a maximum length of 10000 characters. The function should be able to handle strings with a maximum length of 100000 characters. The function should be able to handle strings with a maximum length of 1000000 characters. The function should be able to handle strings with a maximum length of 10000000 characters. The function should be able to handle strings with a maximum length of 100000000 characters. The function should be able to handle strings with a maximum length of

repetition ratio 0.73 · 200 tokens (truncated) · repetitive · constraints met: 0/4

1.5B · 4-bit NF4 · CLEAN

Certainly! Here's a Python function that checks if a string is a palindrome, along with a docstring and an example of how to use it:

```python
def is_palindrome(s):
    """
    Check if a string is a palindrome.

    Parameters:
    s (str): The string to check.

    Returns:
    bool: True if the string is a palindrome, False otherwise.
    """
    # Remove any leading or trailing whitespace
    s = s.strip()

    # Check if the string is empty
    if len(s) == 0:
        return True

    # Reverse the string
    reversed_s = s[::-1]

    # Check if the string is equal to its reverse
    if s == reversed_s:
        return True
    else:
        return False
```

repetition ratio 0.17 · 200 tokens (truncated) · not repetitive · includes docstring & empty-string handling

The 0.5B model never writes the function — it generates a chain of nearly-identical sentences, each adding a zero to the "maximum length", until the token budget runs out. The 1.5B 4-bit model, on the same prompt, produces a working function with a docstring and explicit empty-string handling. Same prompt, same 200-token budget, same hardware — the difference is scale, and the 4-bit NF4 quantization did not break the better model's coherence.
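
The "working function" claim can be spot-checked by running the 1.5B sample as-is. The sketch below copies the function from the sample output and probes it with a few inputs chosen for illustration; they are not part of the original prompt set.

```python
def is_palindrome(s):
    """Check if a string is a palindrome (copied from the 1.5B 4-bit sample)."""
    # Remove any leading or trailing whitespace
    s = s.strip()

    # Check if the string is empty
    if len(s) == 0:
        return True

    # Reverse the string
    reversed_s = s[::-1]

    # Check if the string is equal to its reverse
    if s == reversed_s:
        return True
    else:
        return False


# Illustrative probes: empty input, a palindrome, a non-palindrome,
# padded input, and mixed case.
probes = ["", "racecar", "hello", "  level  ", "Racecar"]
results = {text: is_palindrome(text) for text in probes}
print(results)
```

The function is case-sensitive and keeps punctuation, so it treats "Racecar" as a non-palindrome. The prompt as quoted does not ask for normalization, so that does not count against the stated docstring and empty-string requirements.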

Method & Limitations

What this vibe check can and cannot claim.

Small sample (n=30). 30 prompts per config is enough to see large effects (the roughly 3–5× repetition gap between sizes at bf16 and 8-bit, the 8-bit speed tax) but not to resolve small differences. The 8-bit constraint lead over bf16/4-bit (16.7% vs 13.3%) is within noise and is reported as a tie, not a real 8-bit advantage.
Base models, not instruct-tuned. These are raw Qwen2.5-0.5B/1.5B checkpoints with no SFT or chat template. Absolute constraint adherence is low by design; the meaningful signal is the relative gap between configs, not the absolute rate.
Greedy decoding. All runs use greedy (temperature 0) decoding, which maximizes repetition risk. Sampling would reduce repetition counts across the board but would also add variance; greedy isolates the model's own degeneration tendency.
Single seed. One generation per prompt per config. The qualitative patterns (looping at 0.5B, coherence at 1.5B) are consistent across prompts and categories, but per-prompt variance is not measured.
Constraint detection is heuristic. Adherence is scored by automated checks (presence of required elements, sentence counts, etc.), not human grading. Genuine-but-differently-phrased completions may be undercounted; the relative config ordering should still hold.
Same family, one backend. Both models are Qwen2.5; both run on HF/transformers with bitsandbytes. The 8-bit dequantization tax is backend-specific; a different 8-bit kernel could change the 8-bit vs 4-bit speed ordering. The 4-bit NF4 recommendation is specific to bitsandbytes NF4.
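
The "within noise" judgments above can be made concrete with a two-sided Fisher exact test on the counts from the table. This is a sketch of one possible check, not the method used for this page; the function name and the choice of test are assumptions.

```python
from math import comb


def fisher_exact_two_sided(hits_a, n_a, hits_b, n_b):
    """Two-sided Fisher exact p-value for two counts out of n_a and n_b trials."""
    total_hits = hits_a + hits_b
    total = n_a + n_b
    denom = comb(total, n_a)

    def prob(k):
        # Hypergeometric probability that group A holds k of the total hits.
        return comb(total_hits, k) * comb(total - total_hits, n_a - k) / denom

    lowest = max(0, total_hits - n_b)
    highest = min(total_hits, n_a)
    observed = prob(hits_a)
    # Sum every outcome that is at least as unlikely as the observed one.
    return sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Repetitive outputs, 1.5B 4-bit NF4 (8/30) vs 1.5B bf16 (4/30).
p_quant = fisher_exact_two_sided(8, 30, 4, 30)
# Repetitive outputs, 0.5B bf16 (13/30) vs 1.5B bf16 (4/30).
p_scale = fisher_exact_two_sided(13, 30, 4, 30)
print(p_quant, p_scale)
```

Under this test the 4-bit vs bf16 repetition difference at 1.5B stays well above the usual 0.05 level, while the 0.5B vs 1.5B difference at bf16 falls below it, which matches the reading that scale is the reliable signal and the quantization differences are not resolved at n=30.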

"The atlas tells you which layers and heads matter. The vibe check tells you whether the model can write a palindrome function without looping forever. You need both — and at 1.5B in 4-bit NF4, you get coherence and speed in the same 8GB card."
