Fine-tuning LLMs for healthcare

Challenge

Healthcare is one of the hardest domains to deploy AI in. The data is sensitive, the tasks are complex, and the public foundation models most teams reach for first are built to refuse the exact requests this client needed answered.

Three constraints framed the project:

Data residency. Member records, claims, prior-authorisation notes — none of it could leave the client’s private cloud. That ruled out public inference endpoints, hosted fine-tuning services, and most managed eval tooling.
Multiple overlapping tasks. One pipeline needed to do entity extraction over claim narratives, classification of denial reasons, summarisation of long medical histories, drafting of appeal letters, and grounded Q&A against policy documents. Standing up five narrow models would have been operationally untenable.
Public-model risk. Any model the client depended on needed to still exist a year from now, unchanged in behaviour. Public endpoints get deprecated, rate-limited, silently re-tuned, or pulled with weeks of notice. That isn’t a risk a regulated insurer can carry.

Solution

We treated model selection, eval design, and data strategy as decisions that come before training — not artefacts that emerge from it.

Pick the base model before you write training code. We shortlisted candidates on benchmarks that mirrored the actual tasks — clinical entity recognition, long-context summarisation, constrained-format generation — not general leaderboard rank. Then we checked licences. A model you can’t legally fine-tune is a model you’ve already wasted time on.
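As a minimal sketch of that shortlisting logic: drop anything the licence review rules out, then rank what is left by mean score on the task-mirroring benchmarks, ignoring general leaderboard rank. The candidate names, scores, and field names below are hypothetical and are not results from the project. Filtering before ranking gives the same shortlist as ranking first and checking licences afterwards.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Candidate:
    name: str                      # hypothetical model identifier
    task_scores: dict[str, float]  # scores on task-mirroring benchmarks, 0..1
    leaderboard_rank: int          # general rank, deliberately ignored below
    licence_allows_finetune: bool  # outcome of the licence review


def shortlist(candidates: list[Candidate], tasks: list[str], top_k: int = 2) -> list[str]:
    """Rank licence-cleared candidates by mean score on the tasks that matter."""
    eligible = [c for c in candidates if c.licence_allows_finetune]
    ranked = sorted(
        eligible,
        key=lambda c: mean(c.task_scores[t] for t in tasks),
        reverse=True,
    )
    return [c.name for c in ranked[:top_k]]


# Illustrative numbers only; they are not results from the project.
TASKS = ["clinical_ner", "long_context_summarisation", "constrained_generation"]
CANDIDATES = [
    Candidate("model-a", {"clinical_ner": 0.81, "long_context_summarisation": 0.62,
                          "constrained_generation": 0.70}, 1, False),
    Candidate("model-b", {"clinical_ner": 0.78, "long_context_summarisation": 0.66,
                          "constrained_generation": 0.74}, 4, True),
    Candidate("model-c", {"clinical_ner": 0.70, "long_context_summarisation": 0.71,
                          "constrained_generation": 0.69}, 2, True),
]

print(shortlist(CANDIDATES, TASKS))
```

In this made-up example the top-ranked leaderboard model never reaches the shortlist, because its licence flag fails before any score is considered.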

Design for multiple tasks from day one. We structured the fine-tune as a single multi-task problem. Adding data for one task can degrade another, and the only way through that is fast, instrumented iteration. One well-tuned model serving five workflows is dramatically cheaper to run than five narrow ones, and it makes the eval surface tractable.
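One way to set up a single multi-task fine-tune is to tag every example with its task and draw the training mix by per-task weight, so the balance can be adjusted between iterations. The sketch below assumes that approach. The task names follow the five workflows described earlier, while the instruction prefixes, record fields, and weights are hypothetical.

```python
import random
from collections import Counter

# Instruction prefixes are illustrative assumptions, one per workflow.
TASK_PREFIX = {
    "entity_extraction": "Extract the entities from this claim narrative:",
    "denial_classification": "Classify the denial reason:",
    "history_summarisation": "Summarise this medical history:",
    "appeal_drafting": "Draft an appeal letter for this case:",
    "policy_qa": "Answer using only the policy excerpt provided:",
}


def format_example(task: str, source: str, target: str) -> dict[str, str]:
    """Render one record as a prompt/completion pair with the task made explicit."""
    return {"task": task, "prompt": f"{TASK_PREFIX[task]}\n{source}", "completion": target}


def sample_mixture(pools: dict[str, list], weights: dict[str, float], n: int, seed: int = 0) -> list:
    """Draw n training examples, choosing the task for each draw by weight."""
    rng = random.Random(seed)
    tasks = list(pools)
    picks = rng.choices(tasks, weights=[weights[t] for t in tasks], k=n)
    return [rng.choice(pools[t]) for t in picks]


# Placeholder pools; real pools would hold the labelled data for each task.
pools = {
    t: [format_example(t, f"source text {i}", f"target {i}") for i in range(3)]
    for t in TASK_PREFIX
}
weights = {t: 1.0 for t in TASK_PREFIX}
weights["appeal_drafting"] = 2.0  # hypothetical: upweight a task that is lagging

batch = sample_mixture(pools, weights, n=1000)
print(Counter(example["task"] for example in batch))
```

Keeping the weights in one dictionary makes each change to the mix an explicit, loggable experiment parameter, which is what instrumented iteration needs.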

Set up evals before you touch training. Every task got its own evaluation criteria before a single training run started. Entity detection was scored on recall against a held-out gold set. Q&A was judged by a stronger model with task-specific rubrics. Generation tasks combined automatic metrics with structured human review. If your evals aren’t in place first, your experiments produce noise, not signal.
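To make the recall criterion concrete, here is a minimal sketch of entity recall against a gold set. It assumes entities are compared as exact (type, text) pairs after normalisation. The matching rule and the sample entities are assumptions for illustration, not the project’s actual scoring rules.

```python
def entity_recall(gold: set[tuple[str, str]], predicted: set[tuple[str, str]]) -> float:
    """Fraction of gold (entity_type, text) pairs that the model recovered."""
    if not gold:
        return 1.0  # nothing to find; treated as perfect by convention here
    return len(gold & predicted) / len(gold)


def corpus_recall(gold_docs: list[set], predicted_docs: list[set]) -> float:
    """Micro-averaged recall: pool every gold entity across documents."""
    found = sum(len(g & p) for g, p in zip(gold_docs, predicted_docs))
    total = sum(len(g) for g in gold_docs)
    return found / total if total else 1.0


# Invented example: the model finds two of three gold entities and adds one
# spurious entity, which recall deliberately does not penalise.
gold = {("DRUG", "metformin"), ("DIAGNOSIS", "type 2 diabetes"), ("PROCEDURE", "mri")}
predicted = {("DRUG", "metformin"), ("DIAGNOSIS", "type 2 diabetes"), ("DRUG", "aspirin")}

print(round(entity_recall(gold, predicted), 3))
```

Recall alone rewards over-prediction, so a harness built this way would normally track precision alongside it, even when recall is the headline number.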

Synthetic first, vendor second. We didn’t wait for perfect data. Synthetic datasets came first: cheap to generate, and good enough to stress-test the model candidates and validate the eval setup. When vendor data was procured (expensive, and requiring explicit training-use licensing), we used it as seed material to generate further labelled examples inside the private cloud. Everything went into the final training run. The layered approach kept us moving without cutting corners on data governance.
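The text does not say how the synthetic data was generated. As a stand-in, the sketch below uses simple template filling, which is the cheapest method that yields gold labels for free and is enough to exercise an eval pipeline end to end. The template, slot values, and field names are all hypothetical.

```python
import random

# Hypothetical slot values and template; real generation would be far richer.
SLOTS = {
    "PROCEDURE": ["knee arthroscopy", "cardiac MRI", "sleep study"],
    "DENIAL_REASON": ["missing prior authorisation", "out-of-network provider"],
    "PROVIDER": ["Northside Clinic", "Lakeview Imaging"],
}
TEMPLATE = "Claim for {PROCEDURE} at {PROVIDER} was denied due to {DENIAL_REASON}."


def make_example(rng: random.Random) -> dict:
    """Fill the template and keep the chosen slot values as gold labels."""
    chosen = {slot: rng.choice(values) for slot, values in SLOTS.items()}
    text = TEMPLATE.format(**chosen)
    entities = sorted(chosen.items())  # list of (entity_type, text) pairs
    return {"text": text, "entities": entities, "source": "synthetic"}


def make_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n labelled examples reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


dataset = make_dataset(5)
for row in dataset[:2]:
    print(row["text"], row["entities"])
```

The "source" field is one way to keep provenance attached to every record, so that synthetic and vendor-derived examples can still be told apart after they are mixed for training.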

QLoRA, and the discipline to switch base models mid-project. QLoRA kept compute costs manageable through hundreds of experiments. But here’s the part most playbooks skip: as data volume grew, we re-ran model selection. The model that led at ten thousand examples didn’t lead at a hundred thousand. We switched mid-project when the data profile changed. It was the right call, and it would not have been visible without the eval scaffolding already in place.
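Some rough arithmetic shows why QLoRA keeps hundreds of experiments affordable: the frozen base weights are stored in 4 bits instead of 16, and only small low-rank adapters are trained. The model size and dimensions below are hypothetical, and the estimate covers weights only. It ignores quantisation metadata, activations, and optimiser state.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, ignoring quantisation metadata and activations."""
    return n_params * bits_per_param / 8 / 2**30


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter on a d_in x d_out matrix adds A (r x d_in) and B (d_out x r)."""
    return rank * (d_in + d_out)


# Hypothetical 7B-parameter model: 32 layers, four 4096 x 4096 attention
# projections adapted per layer, LoRA rank 16.
N_PARAMS = 7e9
LAYERS, HIDDEN, RANK, MATRICES_PER_LAYER = 32, 4096, 16, 4

fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)
trainable = LAYERS * MATRICES_PER_LAYER * lora_trainable_params(HIDDEN, HIDDEN, RANK)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {nf4_gib:.1f} GiB")
print(f"trainable adapter params: {trainable:,} ({trainable / N_PARAMS:.2%} of base)")
```

Under these assumptions the adapters are a fraction of a percent of the base model, which is why re-running model selection on a new base model is affordable.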

From training to production. Final models were quantised, merged, and exported in a production-ready format. We benchmarked inference hardware across configurations on the client’s own infrastructure — because a model that’s accurate but too slow or too expensive to serve at scale isn’t production-ready, it’s a prototype.
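A hardware comparison of this kind can be sketched as a small timing harness: run the same prompts through each configuration, then compare tail latency, throughput, and cost per request. The `generate` callable, the stub below, and the hourly price are placeholders. A real run would call the serving stack under test.

```python
import statistics
import time
from typing import Callable


def benchmark(generate: Callable[[str], str], prompts: list[str], warmup: int = 2) -> dict[str, float]:
    """Time each call and report latency percentiles plus request throughput."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm caches before measuring

    latencies = []
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start

    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    cuts = statistics.quantiles(latencies, n=20, method="inclusive")
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": cuts[18],
        "requests_per_s": len(prompts) / wall,
    }


def cost_per_thousand(requests_per_s: float, hourly_cost: float) -> float:
    """Serving cost per 1,000 requests at the measured sequential throughput."""
    return hourly_cost / (requests_per_s * 3600) * 1000


def fake_generate(prompt: str) -> str:
    """Stand-in for a model call so the harness runs without a model."""
    time.sleep(0.001)
    return prompt.upper()


result = benchmark(fake_generate, [f"prompt {i}" for i in range(20)])
print(result)
print(cost_per_thousand(result["requests_per_s"], hourly_cost=4.0))  # hypothetical price
```

This measures one request at a time. A production benchmark would also sweep batch size and concurrency, since those usually dominate cost at scale.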

Results

The system shipped as a single model serving all five workflows from the client’s private cloud. The wins worth highlighting are the ones that protected the project from the failure modes that sink most healthcare LLM efforts.

Privacy

The model was trained, evaluated, and served entirely inside the client’s private cloud. Synthetic data generation, vendor-data augmentation, fine-tuning, eval runs, and production inference all happened behind the same firewall as the source records. No prompt, completion, or embedding ever left the perimeter.

Beyond the public guardrails

Off-the-shelf models refuse most of the requests this product needed to answer — summarising a member’s medical history, drafting an appeal that cites prior decisions, extracting entities from a denial narrative. A fine-tuned model trained on the client’s own conventions doesn’t have to negotiate with a generic safety policy designed for consumer chat. The behaviour was tuned to the client’s compliance posture, not somebody else’s.

Evaluation

Every release was scored against pre-set, task-specific criteria. No vibes-based “this seems better.” Adding capability to one task can quietly regress another; without a real eval harness you find out in production. The harness also made the mid-project base-model switch possible — we could prove the new model was better on the metrics that mattered before committing to it.
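A release gate of this kind can be sketched as a per-task comparison between the current baseline and a candidate. The gate blocks promotion if any task regresses beyond an agreed tolerance or if the tasks that were supposed to improve did not. The metric names, scores, and tolerances are hypothetical, and every metric is assumed to be higher-is-better.

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: dict[str, float]) -> dict[str, float]:
    """Return tasks where the candidate drops below baseline by more than the allowed margin."""
    failed = {}
    for task, base_score in baseline.items():
        delta = candidate[task] - base_score
        if delta < -tolerance.get(task, 0.0):  # default: no drop allowed
            failed[task] = delta
    return failed


def should_promote(baseline: dict[str, float], candidate: dict[str, float],
                   tolerance: dict[str, float], must_improve: list[str]) -> bool:
    """Promote only if nothing regresses and every must-improve task actually improves."""
    no_regressions = not regressions(baseline, candidate, tolerance)
    improved = all(candidate[t] > baseline[t] for t in must_improve)
    return no_regressions and improved


# Illustrative scores only; they are not results from the project.
baseline = {"entity_recall": 0.90, "qa_rubric": 0.75, "summary_review": 0.80}
candidate_a = {"entity_recall": 0.92, "qa_rubric": 0.70, "summary_review": 0.81}
candidate_b = {"entity_recall": 0.92, "qa_rubric": 0.76, "summary_review": 0.81}
tolerance = {"qa_rubric": 0.01}

print(regressions(baseline, candidate_a, tolerance))
print(should_promote(baseline, candidate_a, tolerance, ["entity_recall"]))
print(should_promote(baseline, candidate_b, tolerance, ["entity_recall"]))
```

The same comparison works for a base-model switch: treat the incumbent as the baseline and the challenger as the candidate, both evaluated on identical held-out sets.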

Availability

The client owns the weights, the serving stack, and the evaluation infrastructure. There is no public endpoint that can be deprecated, rate-limited, silently re-tuned, or pulled. A year from now, the model behaves the same. Five years from now, if they want it to keep behaving the same, they can.

The steps above are learnable. What’s harder to learn is which shortcuts will cost you later — especially in a domain where a wrong output isn’t just a bad user experience. That’s the work.