Setup: 30 prompts across code, creative writing, reasoning, and instruction-following categories. Each prompt carries explicit constraints (e.g. "must include a docstring", "exactly 3 sentences"). Each of the 6 configs (0.5B-bf16, 0.5B-8bit, 0.5B-4bit, 1.5B-bf16, 1.5B-8bit, 1.5B-4bit) generated up to 200 tokens per prompt on the same RTX 2070 Super (8GB), using the HF/transformers backend with bitsandbytes for 8-bit and 4-bit NF4. Outputs were scored on word, character, and sentence counts; repetition ratio, with a repetitive/degenerate flag derived from it; empty/garbage flags; and constraint adherence.
Every metric, every config. The 4-bit NF4 sweet-spot row is highlighted. All values from completed runs on the same hardware.
"vs bf16" is the speed loss relative to that model's own bf16 baseline. Repetition Rate = repetitive outputs ÷ 30. Lower is better for repetition and empty; higher is better for speed and constraint adherence.
The four results that actually change what you should run.
1. Scale wins on quality HIGH
0.5B vs 1.5B · 30 prompts each
1.5B is qualitatively much better. Repetitive outputs drop roughly 3× overall (0.5B: 13–16 / 30; 1.5B: 3–8 / 30), though the gap narrows to about 1.6× at 4-bit (13 vs 8), and constraint adherence roughly doubles (0.5B: 0–6.7%; 1.5B: 13.3–16.7%). The 0.5B model frequently degenerates into looping prose; 1.5B mostly stays on track.
2. 4-bit NF4 is the sweet spot HIGH
Speed & quality, not 8-bit
Counter-intuitive but consistent: 4-bit NF4, not 8-bit, is the right quantization. At 1.5B, 4-bit NF4 is only 9% slower than bf16 (17.1 vs 18.8 tok/s) while 8-bit is 52% slower (9.0 tok/s). 4-bit matches bf16 on constraint adherence (13.3%) and stays within noise of the 8-bit run (16.7%), at the cost of a few more repetitive outputs (8 of 30 vs 3).
3. 8-bit is the slowest quant MEDIUM
bitsandbytes dequantization overhead
More bits do not mean more speed. 8-bit is slower than 4-bit at both scales, and the likely cause is per-matmul dequantization overhead in the bitsandbytes 8-bit path. The extra precision buys little quality but costs a bit over half of bf16 throughput (52–55%). On this backend, 8-bit is the worst of both worlds.
4. 0.5B degrades more under quant MEDIUM
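Quantization sensitivity scales inversely with model size
The smaller model is hit harder by quantization, at least on speed. 0.5B loses 42–55% of throughput going to 4-/8-bit, while 1.5B loses 9–52%. Almost all of that gap is at 4-bit (42% vs 9%); the 8-bit penalty is similar at both sizes (55% vs 52%). The likely reason is that the 0.5B model's small per-layer matmuls are dominated by fixed quantization overhead, which the larger 1.5B matmuls amortize. Output quality does not show the same pattern: at 0.5B, repetition and constraint adherence stay roughly flat across precisions.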
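The "vs bf16" speed loss and the repetition rate are simple ratios, and a minimal sketch makes the arithmetic explicit. The throughput figures are the ones quoted in this write-up; the dictionary layout and function names are illustrative assumptions, not the scoring code that produced the results.

```python
# Throughput in tokens/s, copied from the figures quoted in this write-up.
TOK_PER_S = {
    ("0.5B", "bf16"): 23.6,
    ("0.5B", "8bit"): 10.5,
    ("0.5B", "4bit"): 13.7,
    ("1.5B", "bf16"): 18.8,
    ("1.5B", "8bit"): 9.0,
    ("1.5B", "4bit"): 17.1,
}
N_PROMPTS = 30


def speed_loss_vs_bf16(model: str, precision: str) -> float:
    """Percent of throughput lost relative to the same model's bf16 run."""
    base = TOK_PER_S[(model, "bf16")]
    return 100.0 * (base - TOK_PER_S[(model, precision)]) / base


def repetition_rate(repetitive_outputs: int, n_prompts: int = N_PROMPTS) -> float:
    """Percent of outputs flagged as repetitive."""
    return 100.0 * repetitive_outputs / n_prompts


if __name__ == "__main__":
    for (model, precision), tps in TOK_PER_S.items():
        loss = speed_loss_vs_bf16(model, precision)
        print(f"{model}-{precision}: {tps:.1f} tok/s, {loss:.1f}% below bf16")
```

Because the quoted throughputs are rounded to one decimal, recomputed percentages can differ from the reported ones by up to about a point.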
Quantization recommendation: run 1.5B in 4-bit NF4. It is the only config that threads all three needles at once: (1) speed: 17.1 tok/s, just 9% off the bf16 baseline and nearly 2× the 8-bit config; (2) quality: 13.3% constraint adherence, equal to bf16 (13.3%) and within noise of 8-bit (16.7%), with a 26.7% repetition rate that is higher than bf16 (13.3%) and 8-bit (10.0%) but amounts to a gap of 4–5 outputs at n=30; (3) memory: 4-bit NF4 fits the 1.5B model comfortably in 8GB where bf16 is tight. 8-bit should be avoided: it pays the full dequantization tax for almost no quality gain. If you must run 0.5B, prefer 4-bit NF4 there too (13.7 tok/s vs 10.5 for 8-bit), but expect noticeably more degeneration regardless of precision.
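For readers who want to try the recommended config, the following is a minimal sketch of loading a model in 4-bit NF4 through transformers and bitsandbytes and generating greedily with a 200-token budget. The Hub ID `Qwen/Qwen2.5-1.5B`, the bf16 compute dtype, `device_map="auto"`, and both helper names are assumptions made for illustration; the exact loading options used for these runs are not stated in this write-up.

```python
def bnb_config_kwargs(precision: str) -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig ({} means plain bf16)."""
    if precision == "bf16":
        return {}
    if precision == "8bit":
        return {"load_in_8bit": True}
    if precision == "4bit":
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            # Assumption: matmuls run in bf16 after dequantization.
            "bnb_4bit_compute_dtype": "bfloat16",
        }
    raise ValueError(f"unknown precision: {precision!r}")


def load_and_generate(prompt: str,
                      model_id: str = "Qwen/Qwen2.5-1.5B",  # assumed Hub ID
                      precision: str = "4bit",
                      max_new_tokens: int = 200) -> str:
    """Load the model at the given precision and decode greedily."""
    # Heavy imports stay local so bnb_config_kwargs works without a GPU stack.
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)

    kwargs = bnb_config_kwargs(precision)
    quant_config = BitsAndBytesConfig(**kwargs) if kwargs else None

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        quantization_config=quant_config,
        device_map="auto",  # requires the accelerate package
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=False selects greedy decoding.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False)
    # Drop the prompt tokens and keep only the continuation.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Switching `precision` to `"8bit"` or `"bf16"` gives the other two configurations under the same assumptions, which is enough to re-measure the speed ordering on different hardware.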
Why more bits is slower, and why 1.5B-4bit nearly matches bf16.
8-bit is slower than 4-bit. Yes, really.
The instinct is that 8-bit (more bits, more precision) should sit between bf16 and 4-bit on speed. It does not. At both model sizes 8-bit is the slowest quantization: 0.5B-8bit runs 10.5 tok/s versus 13.7 for 4-bit, and 1.5B-8bit runs 9.0 tok/s versus 17.1 for 4-bit. The likely cause is in the backend: bitsandbytes stores 8-bit weights compactly but pays a dequantization cost on every single matmul. That per-operation tax is one that 4-bit NF4, which uses fused NF4 kernels, largely escapes. The result is that the more aggressively compressed 4-bit format is faster than the higher-precision 8-bit one.
1.5B-4bit barely loses to bf16.
The standout speed number is 1.5B-4bit at 17.1 tok/s — only 9% below the 18.8 tok/s bf16 baseline. For a 4× memory reduction (roughly) that is an exceptional trade. The same comparison at 0.5B is far worse: 4-bit drops 42% (23.6 → 13.7). The difference is that the 1.5B model's larger matmuls amortize the quantization overhead better, while the 0.5B model's tiny per-layer ops are dominated by the fixed dequantization cost. Bigger models quantize more gracefully.
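The "roughly 4×" figure follows from bits per weight alone, and a back-of-envelope sketch shows where it comes from. The parameter counts below are the nominal 0.5B and 1.5B sizes, not exact checkpoint counts, and the estimate covers weights only: it ignores quantization constants, any layers the backend leaves unquantized, activations, and the KV cache, so the real saving is somewhat smaller.

```python
# Nominal parameter counts (assumption: round figures, not exact checkpoint sizes).
NOMINAL_PARAMS = {"0.5B": 0.5e9, "1.5B": 1.5e9}
BITS_PER_WEIGHT = {"bf16": 16, "8bit": 8, "4bit": 4}


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GiB for a given storage width."""
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    for model, n_params in NOMINAL_PARAMS.items():
        for precision, bits in BITS_PER_WEIGHT.items():
            print(f"{model}-{precision}: ~{weight_gib(n_params, bits):.2f} GiB of weights")
```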
Can the model actually do what the prompt explicitly asks?
Each prompt carries explicit constraints ("must include a docstring", "exactly 3 sentences", "must handle the no-pair case"). Adherence is the share of those constraints met per config. Two patterns stand out:
- 0.5B essentially ignores constraints. bf16 scores 0.0% — not a single constraint met across 30 prompts. Quantization helps marginally (8-bit 3.3%, 4-bit 6.7%), but the ceiling is near zero. The 0.5B model tends to describe the task or hallucinate requirements rather than satisfy the stated ones.
- 1.5B follows constraints ~2× better, and 8-bit edges bf16. All three 1.5B configs land at 13–17%, roughly double the 0.5B best. The mild 8-bit lead (16.7%) over bf16/4-bit (13.3%) is within noise at n=30 and is not interpreted as a real quantization benefit — it is flagged as a tie.
The absolute numbers are low for both models — these are 0.5B/1.5B base models with no instruction tuning, so weak instruction-following is expected. The reliable signal is the relative one: scale roughly doubles adherence, and quantization does not meaningfully change it at either size.
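To give a sense of what automated constraint checks of this kind can look like, here is a minimal sketch of two checks (docstring presence and an exact sentence count) and a per-output tally in the "met/total" form used in the example further down. The regexes and all function names are hypothetical; they are not the checks used to produce the numbers above.

```python
import re
from typing import Callable, Iterable, Tuple

Check = Callable[[str], bool]


def has_docstring(text: str) -> bool:
    """True if a triple-quoted string directly follows a `def ...:` line."""
    pattern = r'def\s+\w+\s*\([^)]*\)[^:]*:\s*(?:"""|\'\'\')'
    return re.search(pattern, text) is not None


def count_sentences(text: str) -> int:
    """Rough count: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(1 for part in parts if part)


def exactly_n_sentences(n: int) -> Check:
    """Build a check that passes only when the text has exactly n sentences."""
    return lambda text: count_sentences(text) == n


def constraints_met(output: str, checks: Iterable[Check]) -> Tuple[int, int]:
    """Return (constraints met, total constraints) for one output."""
    checks = list(checks)
    met = sum(1 for check in checks if check(output))
    return met, len(checks)
```

Heuristics like these are brittle: the sentence splitter miscounts abbreviations such as "e.g.", and the docstring check only recognizes triple-quoted strings placed right after the signature, which is one way valid completions can end up undercounted.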
Where outputs collapse into self-repeating text, by config.
Repetition is the dominant failure mode here, and it is where the scale gap is widest. Each output is flagged as repetitive when its repetition ratio crosses a degeneration threshold; empty when it produces no usable tokens.
- 0.5B loops constantly. 13–16 of 30 outputs are repetitive across all three precisions (43–53%). The model seizes on a phrase and rewrites it with one token swapped — the classic small-model degeneration loop. Quantization does not fix this; 8-bit is actually slightly worse (16) than bf16/4-bit (13).
- 1.5B loops far less. 3–8 of 30 outputs are repetitive (10–27%), versus 13–16 at 0.5B. bf16 and 8-bit are best (4 and 3); 4-bit NF4 picks up a few extra loops (8), a small quality cost for the large speed/memory win. The per-precision gap is about 3× at bf16, 5× at 8-bit, and 1.6× at 4-bit.
- Empty outputs are rare and roughly flat. 0.5B produces 1 empty output per precision; 1.5B produces 0 (bf16/8-bit) or 2 (4-bit). Degeneration, not silence, is the failure mode.
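The repetition ratio is not defined in this write-up, so the following is only one plausible formulation: the share of word 4-grams that duplicate an earlier 4-gram in the same output. The n-gram size, the 0.5 threshold, and the function names are all assumptions chosen for illustration, and the sketch is not expected to reproduce the ratios reported here.

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that are duplicates of another n-gram."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def is_repetitive(text: str, threshold: float = 0.5) -> bool:
    """Flag an output whose ratio crosses the (hypothetical) threshold."""
    return repetition_ratio(text) >= threshold


def is_empty(text: str) -> bool:
    """Flag an output with no usable tokens."""
    return not text.strip()
```

A looping output such as the 0.5B example below scores high under this definition because most of its 4-grams recur verbatim, while ordinary prose or code scores near zero.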
Prompt code_001 — "Write a Python function that checks if a string is a palindrome. Include a docstring and handle empty strings." A concrete illustration of the degeneration gap.
0.5B · bf16 · DEGENERATES
Additionally, the function should be able to handle strings with special characters and numbers. The function should also be able to handle strings with mixed case letters and numbers. The function should be able to handle strings with a maximum length of 100 characters. The function should be able to handle strings with a maximum length of 1000 characters. The function should be able to handle strings with a maximum length of 10000 characters. The function should be able to handle strings with a maximum length of 100000 characters. The function should be able to handle strings with a maximum length of 1000000 characters. The function should be able to handle strings with a maximum length of 10000000 characters. The function should be able to handle strings with a maximum length of 100000000 characters. The function should be able to handle strings with a maximum length of
repetition ratio 0.73 · 200 tokens (truncated) · repetitive · constraints met: 0/4
1.5B · 4-bit NF4 · CLEAN
Certainly! Here's a Python function that checks if a string is a palindrome, along with a docstring and an example of how to use it:
```python
def is_palindrome(s):
    """
    Check if a string is a palindrome.
    Parameters:
    s (str): The string to check.
    Returns:
    bool: True if the string is a palindrome, False otherwise.
    """
    # Remove any leading or trailing whitespace
    s = s.strip()
    # Check if the string is empty
    if len(s) == 0:
        return True
    # Reverse the string
    reversed_s = s[::-1]
    # Check if the string is equal to its reverse
    if s == reversed_s:
        return True
    else:
        return False
```
repetition ratio 0.17 · 200 tokens (truncated) · not repetitive · includes docstring & empty-string handling
The 0.5B model never writes the function — it generates a chain of nearly-identical sentences, each adding a zero to the "maximum length", until the token budget runs out. The 1.5B 4-bit model, on the same prompt, produces a working function with a docstring and explicit empty-string handling. Same prompt, same 200-token budget, same hardware — the difference is scale, and the 4-bit NF4 quantization did not break the better model's coherence.
What this vibe check can and cannot claim.
- Small sample (n=30). 30 prompts per config is enough to see large effects (the roughly 3× repetition gap between model sizes, the 8-bit speed tax) but not to resolve small differences. The 8-bit constraint lead over bf16/4-bit (16.7% vs 13.3%) is within noise and is reported as a tie, not a real 8-bit advantage.
- Base models, not instruct-tuned. These are raw Qwen2.5-0.5B/1.5B checkpoints with no SFT or chat template. Absolute constraint adherence is low by design; the meaningful signal is the relative gap between configs, not the absolute rate.
- Greedy decoding. All runs use greedy (temperature 0) decoding, which maximizes repetition risk. Sampling would reduce repetition counts across the board but would also add variance; greedy isolates the model's own degeneration tendency.
- Single seed. One generation per prompt per config. The qualitative patterns (looping at 0.5B, coherence at 1.5B) are consistent across prompts and categories, but per-prompt variance is not measured.
- Constraint detection is heuristic. Adherence is scored by automated checks (presence of required elements, sentence counts, etc.), not human grading. Genuine-but-differently-phrased completions may be undercounted; the relative config ordering should still hold.
- Same family, one backend. Both models are Qwen2.5; both run on HF/transformers with bitsandbytes. The 8-bit dequantization tax is backend-specific; a different 8-bit kernel could change the 8-bit vs 4-bit speed ordering. The 4-bit NF4 recommendation is specific to bitsandbytes NF4.
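The small-sample caveat above can be made concrete with a confidence interval on each proportion. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not a method used in this write-up. The counts (5 and 4 of 30 for 1.5B adherence at 8-bit and bf16, 16 and 3 of 30 for 8-bit repetition at 0.5B and 1.5B) are back-computed from the reported percentages, on the assumption that each rate is a count out of 30.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    adherence_8bit = wilson_interval(5, 30)   # 16.7%
    adherence_bf16 = wilson_interval(4, 30)   # 13.3%
    print("adherence intervals overlap:",
          intervals_overlap(adherence_8bit, adherence_bf16))

    rep_small = wilson_interval(16, 30)  # 0.5B 8-bit repetition
    rep_large = wilson_interval(3, 30)   # 1.5B 8-bit repetition
    print("repetition intervals overlap:",
          intervals_overlap(rep_small, rep_large))
```

Under these assumptions the two adherence intervals overlap almost entirely, which is consistent with reporting a tie, while the 8-bit repetition intervals for the two model sizes do not overlap.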
"The atlas tells you which layers and heads matter. The vibe check tells you whether the model can write a palindrome function without looping forever. You need both — and at 1.5B in 4-bit NF4, you get coherence and speed in the same 8GB card."