Evaluating 8 models for transcript → SOAP note generation
Run clinical note generation locally. Patient data never leaves the device.
Find the best small model that doesn't fabricate medical facts.
MedGemma 1.5 4B · MedGemma 4B · Llama3 Medical COT
Llama 3.2 3B · Gemma 3n E2B · Phi-4 Mini · Qwen 3.5 4B
Gemini 2.5 Flash
The ceiling. How close can local models get?
Tests hallucination resistance.
Will the model invent content?
Balanced signal.
Standard clinical encounters.
Tests completeness.
Will the model drop content?
SOAP note template with entity marking instructions provided to all models.
6 dimensions, each scored 0–3 by LLM-as-judge
Claude Opus
No conflict of interest
Gemini 2.5 Pro
Bias experiment
Equal weights across all 6 dimensions — Claude Opus judge
Gemma 3n leads local models. MedGemma 4B trails by 8 points.
Side-by-side review of generated notes told a different story.
Cleaner formatting, well-structured SOAP sections. But on dense visits: missing critical details.
Messier structure, content in wrong sections. But on dense visits: capturing more clinical content.
The scores weren't reflecting clinical utility.
When MedGemma puts exam findings in the wrong SOAP section, the judge penalizes both completeness AND template adherence.
Same clinical error counted twice.
Transcript 14 — Multimorbid patient
MedGemma captures labs, exam findings, and dialysis counseling… but places them in the wrong section.
Judge docks completeness AND format.
Models that capture more content get penalized more for structural errors.
"Alert and oriented"
In virtually every clinical note. Flagged as fabrication.
"No known drug allergies"
Standard documentation default. Flagged as fabrication.
Models that behave like real clinicians get penalized.
Perverse incentive: omitting standard documentation is "safer" under this rubric.
Entity Marking
Broken for all local models: 0–3% scores.
Carries the same weight as hallucination.
Template Adherence
Structure weighted equally with content.
A well-formatted empty note scores higher than messy completeness.
The rubric rewards structure over substance.
1. Misplaced content penalizes template adherence only, not completeness
Eliminates double-penalization
2. Clinical defaults exempt from hallucination scoring
"A&O", "NKDA", "well-appearing", "no acute distress"
3. Weighted scoring replaces equal weights
Hallucination 2x · Completeness 2x · Entity Marking 0.5x
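The reweighting reduces to a weighted average over the 0–3 judge scores. A minimal sketch, assuming the three dimensions not named above (instruction following, format, no duplication) stay at 1x; the dimension names and example scores are illustrative, not taken from the results:

```python
# Weighted rubric score: a sketch of the scheme above.
# Assumption: unlisted dimensions keep a weight of 1x.
WEIGHTS = {
    "hallucination": 2.0,
    "completeness": 2.0,
    "instruction_following": 1.0,
    "format": 1.0,
    "entity_marking": 0.5,
    "no_duplication": 1.0,
}

def weighted_score(dims: dict) -> float:
    """Weighted average of 0-3 judge scores, rescaled to 0-100."""
    total = sum(WEIGHTS[d] * s for d, s in dims.items())
    max_total = sum(3 * w for w in WEIGHTS.values())
    return 100 * total / max_total

# Hypothetical note: strong on content, weak on entity marking.
# Under equal weights the 0/3 on entities would drag it down hard;
# at 0.5x it costs far less.
scores = {
    "hallucination": 2, "completeness": 3, "instruction_following": 1,
    "format": 1, "entity_marking": 0, "no_duplication": 3,
}
print(round(weighted_score(scores), 1))  # 66.7
```

Down-weighting entity marking to 0.5x means the dimension that is broken for all local models (0–3% scores) can no longer swing rankings as much as hallucination or completeness.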
Re-scored all 200 SOAP notes with both judges.
Weighted scoring — Claude Opus judge
The gap between Gemma 3n and MedGemma 4B: 8 pts → 1 pt
Score change after rubric fix + weighted scoring
The old rubric was unfairly penalizing MedGemma's content-heavy approach.
| Model | Halluc. | Complete | Instruct | Format | Entities | No Dup |
|---|---|---|---|---|---|---|
| Gemma 3n | 56% | 35% | 39% | 61% | 24% | 84% |
| MedGemma 4B | 53% | 48% | 28% | 44% | 3% | 88% |
Which failure mode is more acceptable?
Not self-preference bias — Gemini only gives itself +3 over Claude's score.
Real finding: judges disagree on what constitutes failure.
Qwen case study:
Qwen outputs a 60K-character thinking dump with the note buried inside.
"This is not a clinical note."
Scores the output as a product.
Finds the buried note.
Scores the clinical content.
For our use case — output must be directly usable — Claude is the right judge.
Completeness wins over structure.
Physicians can fix bad formatting. They cannot add back missing content.
48% completeness vs 35% — a 13-point gap on the dimension that matters most.
Gemma 3n's risks
Factual errors in family history
Note duplication bug
Dangerous, hard to catch
MedGemma 4B's issues
Template leakage (post-processable)
Section reordering
Fixable in post-processing
Both at ~50% of frontier quality. Physician review is mandatory regardless.
evals/plan.md — evaluation design and rubric
evals/reflections/ — analysis and methodology notes
evals/results/ — raw scores and model outputs