Building AI Tools
for Life Sciences

Automated PII adjudication & autonomous documentation agents

Wesley Beckner — 2026

Judgify — What & Why

Clinical NLP pipelines use named-entity recognition to find sensitive data—patient names, IDs, locations, demographics. But NER models over-tag. A mention of "Dr. Smith" as an attending physician isn't the same as "John Smith" the patient.

Judgify is an automated PII adjudication pipeline: it takes NER-tagged clinical text, decides which entities are actually patient-identifying, and trains specialized binary classifiers to make that decision at scale.

graph LR A["Raw Clinical Text"] --> B["NER Tagging"] B --> C["16K entities found"] C --> D["Judgify Adjudication"] D --> E["2.1K confirmed PII"] style A fill:#111,stroke:#333,color:#fff style B fill:#111,stroke:#333,color:#fff style C fill:#111,stroke:#FF6B6B,color:#FF6B6B style D fill:#111,stroke:#00D4FF,color:#00D4FF style E fill:#111,stroke:#00FF88,color:#00FF88

The gap between "entity found" and "entity is patient PII" is where Judgify operates.

Judgify — The Pipeline

Six phases take raw text through to scored, production-ready output:

graph LR P0["Phase 0
Entity Recognition
NER on raw text"] --> P1["Phase 1
Prompt Refinement
3-model delta analysis"] P1 --> P2["Phase 2
LLM Annotation
Ground-truth labeling"] P2 --> P3["Phase 3
Encoder Training
ModernBERT classifiers"] P3 --> P4["Phase 4
Production Inference
Multi-GPU scoring"] P4 --> P5["Phase 5
Evaluation
Benchmarks & reports"] style P0 fill:#111,stroke:#00D4FF,color:#fff style P1 fill:#111,stroke:#00D4FF,color:#fff style P2 fill:#111,stroke:#00D4FF,color:#fff style P3 fill:#111,stroke:#00D4FF,color:#fff style P4 fill:#111,stroke:#00D4FF,color:#fff style P5 fill:#111,stroke:#00D4FF,color:#fff

LLM Phases (0–2)

Phase 0 — NER extraction (CPU or GPU fleet)
Phase 1 — Run 3 models on samples; iterate prompts until ≥95% agreement
Phase 2 — Full-scale annotation with converged prompts

Encoder Phases (3–5)

Phase 3 — Train 3 binary classifiers (name/ID, location, demographic)
Phase 4 — Score the full dataset at ~30 min wall time via GPU dispatch
Phase 5 — Precision/recall benchmarks, stratified sampling for review

Judgify — MCP Architecture

MCP exposes the pipeline as 15+ typed tools instead of one monolithic prompt. The operator's machine never touches EC2 directly—it sends control signals via SSM to the compute instance, which orchestrates GPU provisioning.

graph LR OP["Operator"] -->|"MCP
(stdio)"| MCP["MCP Server"] MCP -->|"SSM only"| T3["Compute Instance
Orchestrator"] T3 -->|"EC2 API
(instance profile)"| GPU["GPU Fleet"] T3 <--> S3["Object Storage"] T3 <--> LLM["LLM API"] GPU --> S3 style OP fill:#111,stroke:#00D4FF,color:#fff style MCP fill:#0a1a2a,stroke:#00D4FF,color:#00D4FF style T3 fill:#0a1a2a,stroke:#00FF88,color:#00FF88 style GPU fill:#111,stroke:#333,color:#fff style S3 fill:#111,stroke:#333,color:#fff style LLM fill:#111,stroke:#333,color:#fff

Permission Boundaries

Operator (Mac) — SSM-only. Cannot launch EC2 instances directly.
Compute instance — full EC2/S3 via instance profile. Launches, monitors, and terminates GPU workers.
GPU workers — ephemeral. Auto-terminate on completion, tagged for safe cleanup.

Control Flow

Approval gate — MCP returns a checkpoint; user confirms before any GPU provisioning.
Phase gating — prerequisites enforced in code. Phase 3 refuses to run if Phase 2 hasn't completed.
Fire-and-poll — long-running jobs return a handle; poll for progress.

Judgify — Hardening the MCP

Structured tools are only useful if they're reliable. The hardening process uses automated cold-start regression: each run creates fresh infrastructure, runs all 5 phases on real data, then tears everything down.

graph LR A["Create fresh
instance"] --> B["Bootstrap
code + deps"] B --> C["Run Phases
0 → 5"] C --> D{"Clean
run?"} D -->|Yes| E["Log + pick
next dataset"] D -->|No| F["Fix bug
in-loop"] F --> A E --> A style A fill:#111,stroke:#00D4FF,color:#fff style B fill:#111,stroke:#00D4FF,color:#fff style C fill:#111,stroke:#00D4FF,color:#fff style D fill:#111,stroke:#FFD700,color:#FFD700 style E fill:#111,stroke:#00FF88,color:#00FF88 style F fill:#111,stroke:#FF6B6B,color:#FF6B6B

31 Consecutive clean runs

6 Distinct data sources

3 Input format variants

~16 min Cold-start end-to-end

Philosophy: Each run varies data shape (raw text vs. pre-tagged), source domain (clinical notes, lab reports, fragments), and scale. Bugs found during the loop are fixed in-loop—they're features of the testing process, not failures.

DocHound — What & Why

Documentation drifts from code. Links break, version strings go stale, CLI examples reference flags that were renamed three PRs ago. Nobody notices until a new engineer hits a dead end.

DocHound is an autonomous documentation auditing agent. It watches repositories, detects stale or incorrect documentation, and submits fix PRs—automatically.

graph LR A["main branch
changes"] --> B["DocHound
audits docs"] B --> C{"Issues
found?"} C -->|Yes| D["Opens fix PR
with edits"] D --> E["Human reviews"] E -->|Comments| F["LLM revises
force-pushes"] F --> D C -->|No| G["No action"] style A fill:#111,stroke:#333,color:#fff style B fill:#0a1a2a,stroke:#00D4FF,color:#00D4FF style C fill:#111,stroke:#FFD700,color:#FFD700 style D fill:#111,stroke:#00FF88,color:#00FF88 style E fill:#111,stroke:#333,color:#fff style F fill:#0a1a2a,stroke:#00D4FF,color:#00D4FF style G fill:#111,stroke:#333,color:#666

Not a linter. It reads your codebase structure, manifests, and source—then verifies that what your docs claim matches what the code does.

DocHound — Deployment Architecture

Production deployment on EKS via Flux GitOps. LLM calls route through a cluster AI Gateway—the pod never holds API keys directly.

graph LR FLUX["Flux
GitOps"] --> ECR["Container
Registry"] ECR --> POD["DocHound
Pod"] POD --> GW["AI Gateway"] GW --> LLM["LLM API"] POD --> GH["GitHub API"] POD --> PROM["Prometheus"] style FLUX fill:#111,stroke:#333,color:#fff style ECR fill:#111,stroke:#333,color:#fff style POD fill:#0a1a2a,stroke:#00D4FF,color:#00D4FF style GW fill:#0a1a2a,stroke:#00FF88,color:#00FF88 style LLM fill:#111,stroke:#333,color:#fff style GH fill:#111,stroke:#333,color:#fff style PROM fill:#111,stroke:#333,color:#fff

Flux — reconciles desired state from git; pulls OCI images on commit.
Container Registry — stores multi-stage Docker images (Chainguard base, non-root).
DocHound Pod — async poller with FastAPI service layer (/healthz, /readyz).

AI Gateway — cluster-internal proxy that handles LLM auth. Pod sends OpenAI-compatible requests with a dummy key.
GitHub API — GitHub App auth mints short-lived JWT tokens. No long-lived PATs.
Prometheus — polls created, errors, audit duration metrics.

DocHound — Tiered Check System

Checks are layered by cost and confidence. Deterministic checks run first; LLM augments where needed.

Tier 3 — Broad LLM

Full-document audit. Architecture accuracy, prose quality, structural claims.

Expensive. Only runs on full audits, not diff-driven.

Tier 2 — Targeted LLM

LLM validates specific claims. CLI examples, install instructions, API usage.

Narrow scope, moderate cost.

Tier 1 — Deterministic

Dead links, code element refs, version mismatches, hardcoded paths, CLI surface gaps.

Free, fast, always auto-fixable. No LLM required.

Key design: Tier 1 findings are passed as structured JSON context to the LLM in Tiers 2–3. The LLM confirms, rejects, or extends them. If the LLM call fails, Tier 1 findings pass through unmodified—the pipeline degrades gracefully, never silently.

DocHound — PR Philosophy

One PR per repo. Always update, never create multiples.

sequenceDiagram participant Main as main participant DH as DocHound participant PR as dochound/fix Main->>DH: New commits DH->>DH: Audit (Tier 1 → LLM) DH->>PR: Create or force-push fixes Note over PR: Human comments PR-->>DH: Feedback DH->>PR: LLM reverts/edits, force-push

One PR per repo — never spam; always update the existing branch.
Living document — PR evolves as main advances.

Feedback loop — human comments trigger LLM-driven reverts or edits.
Idempotent — SHA-tracked state prevents duplicate work.

1 Deterministic first, LLM where it adds value. Regex catches dead links for free. LLMs confirm architectural claims. Layer them.
2 Test the tooling, not just the product. 31 cold-start runs proved the MCP reliable. The testing loop found bugs the tools never would have.
3 MCP as the interface layer. Structured tools with typed inputs and enforced prerequisites let AI assistants operate complex infrastructure safely.

Wesley Beckner — wabbazzar.com

Building AI Toolsfor Life Sciences

Judgify — What & Why

Judgify — The Pipeline

LLM Phases (0–2)

Encoder Phases (3–5)

Judgify — MCP Architecture

Permission Boundaries

Control Flow

Judgify — Hardening the MCP

DocHound — What & Why

DocHound — Deployment Architecture

DocHound — Tiered Check System

DocHound — PR Philosophy

Building AI Tools
for Life Sciences