Finding the Right Model
for Offline Clinical Notes

Evaluating 8 models for transcript → SOAP note generation

KasaMD Offline

Context · The Goal

8GB
Apple Silicon target
0
Cloud dependencies
0
Tolerable hallucinations

Run clinical note generation locally. Patient data never leaves the device.
Find the best small model that doesn't fabricate medical facts.

Models · The Candidates

7 Local Models (3–4B params, Q4 quantization)

MedGemma 1.5 4B · MedGemma 4B · Llama3 Medical COT
Llama 3.2 3B · Gemma 3n E2B · Phi-4 Mini · Qwen 3.5 4B

1 Cloud Frontier

Gemini 2.5 Flash

The ceiling. How close can local models get?

Data · The Test Suite

  • 25 doctor dictation transcripts across primary care
  • Distribution mirrors real-world visit frequency (CDC NAMCS)
  • Density mix to stress-test both failure modes:

Sparse

Tests hallucination resistance.
Will the model invent content?

Medium

Balanced signal.
Standard clinical encounters.

Dense

Tests completeness.
Will the model drop content?
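The sparse/medium/dense split isn't defined quantitatively in the source; a minimal bucketing sketch, with illustrative word-count thresholds (the numbers are assumptions, not from the methodology):

```python
# Hypothetical density bucketing for dictation transcripts.
# Thresholds are illustrative guesses, not from the eval plan.
def density(transcript: str) -> str:
    n = len(transcript.split())
    if n < 150:
        return "sparse"   # little signal: tests hallucination resistance
    if n < 500:
        return "medium"   # standard clinical encounter
    return "dense"        # heavy signal: tests completeness
```

Stratifying by a measurable proxy like word count keeps the two failure modes (inventing vs. dropping content) separable in the scores.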

All models received the same SOAP note template with entity-marking instructions.

Method · The Scoring Rubric v1

6 dimensions, each scored 0–3 by LLM-as-judge

Hallucination — fabricated medical content
Completeness — clinical facts captured
Instruction Following — prompt adherence
Template Adherence — SOAP structure
Entity Marking — bracket notation
Duplication — repeated content
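The source doesn't specify how the six 0–3 judge scores roll up into the leaderboard percentages; a minimal sketch, assuming an equal-weight average normalized to 100 (the aggregation itself is an assumption):

```python
# Hypothetical rollup: six 0-3 judge scores into one percentage.
# Dimension names follow the rubric; equal weighting matches Run 1.
DIMENSIONS = [
    "hallucination", "completeness", "instruction_following",
    "template_adherence", "entity_marking", "duplication",
]
MAX_SCORE = 3

def note_score(scores: dict[str, int]) -> float:
    """Equal-weight average of 0-3 dimension scores, as a percentage."""
    total = sum(scores[d] for d in DIMENSIONS)
    return 100 * total / (MAX_SCORE * len(DIMENSIONS))

# A note scoring 2 on every dimension lands at ~67%.
print(round(note_score({d: 2 for d in DIMENSIONS}), 1))  # 66.7
```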

Primary Judge

Claude Opus

No conflict of interest

Second Judge

Gemini 2.5 Pro

Bias experiment

Run 1 · The First Leaderboard

Equal weights across all 6 dimensions — Claude Opus judge

Gemini Flash
78%
Gemma 3n
49%
MedGemma 4B
41%
Phi-4 Mini
34%
Llama 3.2 3B
31%
MedGemma 1.5
21%
Llama3 COT
7%
Qwen 3.5
7%

Gemma 3n leads local models. MedGemma 4B trails by 8 points.

Something Didn't Add Up

Side-by-side review of generated notes told a different story.

Gemma 3n

Cleaner formatting, well-structured SOAP sections. But on dense visits: missing critical details.

vs

MedGemma 4B

Messier structure, content in wrong sections. But on dense visits: capturing more clinical content.

The scores weren't reflecting clinical utility.

Problem 1 · Double-Penalization

When MedGemma puts exam findings in the wrong SOAP section, the judge penalizes both completeness AND template adherence.

Same clinical error counted twice.

Transcript 14 — Multimorbid patient

MedGemma captures labs, exam findings, and dialysis counseling… but places them in the wrong section.

Judge docks completeness AND format.

Models that capture more content get penalized more for structural errors.

Problem 2 · Clinical Boilerplate as Hallucination

"Alert and oriented"

In virtually every clinical note. Flagged as fabrication.

"No known drug allergies"

Standard documentation default. Flagged as fabrication.

Models that behave like real clinicians get penalized.

Perverse incentive: omitting standard documentation is "safer" under this rubric.

Problem 3 · Equal Weighting

Entity Marking

Broken for all local models: 0–3% scores.

Carries the same weight as hallucination.

Template Adherence

Structure weighted equally with content.

A well-formatted empty note can outscore a messy but complete one.

The rubric rewards structure over substance.

Rubric v2 · The Fix

1. Misplaced content penalizes template adherence only, not completeness

Eliminates double-penalization

2. Clinical defaults exempt from hallucination scoring

"A&O", "NKDA", "well-appearing", "no acute distress"

3. Weighted scoring replaces equal weights

Hallucination 2x · Completeness 2x · Entity Marking 0.5x
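A sketch of how the stated weights could combine into a single percentage. Only the 2x, 2x, and 0.5x weights come from the rubric; the 1x defaults on the remaining three dimensions are an assumption:

```python
# Sketch of the v2 weighted rollup over 0-3 judge scores.
# 2x/2x/0.5x are from the rubric; the 1.0 defaults are assumptions.
WEIGHTS = {
    "hallucination": 2.0,
    "completeness": 2.0,
    "instruction_following": 1.0,
    "template_adherence": 1.0,
    "entity_marking": 0.5,
    "duplication": 1.0,
}
MAX_SCORE = 3

def weighted_score(scores: dict[str, int]) -> float:
    """Weighted average of 0-3 dimension scores, as a percentage."""
    num = sum(WEIGHTS[d] * scores[d] for d in scores)
    den = MAX_SCORE * sum(WEIGHTS.values())
    return 100 * num / den

# A complete-but-messy note: strong content, weak structure.
messy = {"hallucination": 3, "completeness": 3, "instruction_following": 1,
         "template_adherence": 1, "entity_marking": 0, "duplication": 3}
print(round(weighted_score(messy), 1))  # 75.6
```

Under equal weights the same note would score lower, which is exactly the shift the re-weighting is meant to produce: content-heavy, structure-light notes stop being punished twice.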

Re-scored all 200 SOAP notes with both judges.

Run 2 · Updated Leaderboard

Weighted scoring — Claude Opus judge

Gemini Flash
81%
Gemma 3n
50%
MedGemma 4B
49%
Phi-4 Mini
41%
Llama 3.2 3B
37%
MedGemma 1.5
29%
Llama3 COT
9%
Qwen 3.5
7%

The gap between Gemma 3n and MedGemma 4B: 8 pts → 1 pt

Shift · Before vs After

Score change after rubric fix + weighted scoring

MedGemma 4B
+8 pts
41% → 49%
Phi-4 Mini
+7 pts
34% → 41%
MedGemma 1.5
+8 pts
21% → 29%
Llama 3.2 3B
+6 pts
31% → 37%
Gemini Flash
+3 pts
78% → 81%
Gemma 3n
+1 pt
49% → 50%

The old rubric was unfairly penalizing MedGemma's content-heavy approach.

Head to Head · Gemma 3n vs MedGemma 4B

              Halluc.  Complete  Instruct  Format  Entities  No Dup
Gemma 3n        56%      35%       39%      61%      24%      84%
MedGemma 4B     53%      48%       28%      44%       3%      88%
+13 pts
MedGemma completeness lead
+17 pts
Gemma format lead

Which failure mode is more acceptable?

Bias Check · Claude vs Gemini as Judge

Not self-preference bias: as judge, Gemini rates its own Flash outputs only 3 points higher than Claude does.

Real finding: judges disagree on what constitutes failure.

Qwen case study:

Qwen outputs a 60K-character thinking dump with the note buried inside.

Claude: 7%

"This is not a clinical note."
Scores the output as a product.

Gemini: 73%

Finds the buried note.
Scores the clinical content.

For our use case — output must be directly usable — Claude is the right judge.

Recommendation: MedGemma 4B

Completeness wins over structure.

Physicians can fix bad formatting. They cannot add back missing content.

48% completeness vs 35% — a 13-point gap on the dimension that matters most.

Gemma 3n's risks

Factual errors in family history

Note duplication bug

Dangerous, hard to catch

MedGemma 4B's issues

Template leakage (post-processable)

Section reordering

Fixable in post-processing
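As an illustration of "fixable in post-processing", a hypothetical cleanup pass. The leakage patterns and section-header format here are assumptions for the sketch, not MedGemma's actual output format:

```python
import re

# Hypothetical post-processing for MedGemma's known failure modes:
# strip lines that look like leaked template instructions, then
# re-emit sections in canonical SOAP order. Patterns are illustrative.
SOAP_ORDER = ["Subjective", "Objective", "Assessment", "Plan"]

def clean_note(raw: str) -> str:
    # Drop lines that look like leaked template instructions.
    lines = [ln for ln in raw.splitlines()
             if not re.match(r"\s*\[(Insert|Instructions?)\b", ln)]
    text = "\n".join(lines)
    # Split on SOAP headers (capturing group keeps the header names)
    # and rebuild in canonical order; preamble before the first
    # header is dropped.
    parts = re.split(r"^(Subjective|Objective|Assessment|Plan):\s*$",
                     text, flags=re.MULTILINE)
    sections = dict(zip(parts[1::2], parts[2::2]))
    return "\n".join(f"{h}:\n{sections[h].strip()}"
                     for h in SOAP_ORDER if h in sections)
```

Structural errors are mechanical and scriptable; missing clinical content is not, which is the core of the recommendation above.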

Both at ~50% of frontier quality. Physician review is mandatory regardless.

Next · What's Next

  • Score H&P and DAP templates to confirm MedGemma holds across note types
  • Post-processing pipeline for MedGemma's known failure modes
  • Prompt engineering: simpler template, few-shot examples
  • Target: close the gap from 49% toward frontier quality

Full methodology, scores, and reflections in evals/

evals/plan.md — evaluation design and rubric

evals/reflections/ — analysis and methodology notes

evals/results/ — raw scores and model outputs
