Evaluating 8 models for transcript → SOAP note generation
Run clinical note generation locally. Patient data never leaves the device.
Find the best small model that doesn't fabricate medical facts.
MedGemma 1.5 4B · MedGemma 4B · Llama3 Medical COT
Llama 3.2 3B · Gemma 3n E2B · Phi-4 Mini · Qwen 3.5 4B
Gemini 2.5 Flash
The ceiling. How close can local models get?
Tests hallucination resistance.
Will the model invent content?
Balanced signal.
Standard clinical encounters.
Tests completeness.
Will the model drop content?
SOAP note template with entity marking instructions provided to all models.
6 dimensions, each scored 0–3 by LLM-as-judge
Claude Opus
No conflict of interest
Gemini 2.5 Pro
Bias experiment
Equal weights across all 6 dimensions — Claude Opus judge
Gemma 3n leads local models. MedGemma 4B trails by 8 points.
Side-by-side review of generated notes told a different story.
Cleaner formatting, well-structured SOAP sections. But on dense visits: missing critical details.
Messier structure, content in wrong sections. But on dense visits: capturing more clinical content.
The scores weren't reflecting clinical utility.
When MedGemma puts exam findings in the wrong SOAP section, the judge penalizes both completeness AND template adherence.
Same clinical error counted twice.
Transcript 14 — Multimorbid patient
MedGemma captures labs, exam findings, and dialysis counseling… but places them in the wrong section.
Judge docks completeness AND format.
Models that capture more content get penalized more for structural errors.
"Alert and oriented"
In virtually every clinical note. Flagged as fabrication.
"No known drug allergies"
Standard documentation default. Flagged as fabrication.
Models that behave like real clinicians get penalized.
Perverse incentive: omitting standard documentation is "safer" under this rubric.
Entity Marking
Broken for all local models: 0–3% scores.
Carries the same weight as hallucination.
Template Adherence
Structure weighted equally with content.
A well-formatted empty note scores higher than messy completeness.
The rubric rewards structure over substance.
1. Misplaced content penalizes template adherence only, not completeness
Eliminates double-penalization
2. Clinical defaults exempt from hallucination scoring
"A&O", "NKDA", "well-appearing", "no acute distress"
3. Weighted scoring replaces equal weights
Hallucination 2x · Completeness 2x · Entity Marking 0.5x
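The reweighting reduces to a weighted average over the 0–3 judge scores. A minimal sketch, assuming the three dimensions not named above (instruction following, format, no duplication) stay at 1x; the dimension names and example scores are illustrative, not taken from the results:

```python
# Weighted rubric score: a sketch of the scheme above.
# Assumption: unlisted dimensions keep a weight of 1x.
WEIGHTS = {
    "hallucination": 2.0,
    "completeness": 2.0,
    "instruction_following": 1.0,
    "format": 1.0,
    "entity_marking": 0.5,
    "no_duplication": 1.0,
}

def weighted_score(dims: dict) -> float:
    """Weighted average of 0-3 judge scores, rescaled to 0-100."""
    total = sum(WEIGHTS[d] * s for d, s in dims.items())
    max_total = sum(3 * w for w in WEIGHTS.values())
    return 100 * total / max_total

# Hypothetical note: strong on content, weak on entity marking.
# Under equal weights the 0/3 on entities would drag it down hard;
# at 0.5x it costs far less.
scores = {
    "hallucination": 2, "completeness": 3, "instruction_following": 1,
    "format": 1, "entity_marking": 0, "no_duplication": 3,
}
print(round(weighted_score(scores), 1))  # 66.7
```

Down-weighting entity marking to 0.5x means the dimension that is broken for all local models (0–3% scores) can no longer swing rankings as much as hallucination or completeness.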
Re-scored all 200 SOAP notes with both judges.
Weighted scoring — Claude Opus judge
The gap between Gemma 3n and MedGemma 4B: 8 pts → 1 pt
Score change after rubric fix + weighted scoring
The old rubric was unfairly penalizing MedGemma's content-heavy approach.
| Model | Halluc. | Complete | Instruct | Format | Entities | No Dup |
|---|---|---|---|---|---|---|
| Gemma 3n | 56% | 35% | 39% | 61% | 24% | 84% |
| MedGemma 4B | 53% | 48% | 28% | 44% | 3% | 88% |
Which failure mode is more acceptable?
Not self-preference bias — Gemini only gives itself +3 over Claude's score.
Real finding: judges disagree on what constitutes failure.
Qwen case study:
Qwen outputs a 60K-character thinking dump with the note buried inside.
"This is not a clinical note."
Scores the output as a product.
Finds the buried note.
Scores the clinical content.
For our use case — output must be directly usable — Claude is the right judge.
Completeness wins over structure.
Physicians can fix bad formatting. They cannot add back missing content.
48% completeness vs 35% — a 13-point gap on the dimension that matters most.
Gemma 3n's risks
Factual errors in family history
Note duplication bug
Dangerous, hard to catch
MedGemma 4B's issues
Template leakage (post-processable)
Section reordering
Fixable in post-processing
Both at ~50% of frontier quality. Physician review is mandatory regardless.
evals/plan.md — evaluation design and rubric
evals/reflections/ — analysis and methodology notes
evals/results/ — raw scores and model outputs