Mammography Classification

2026
Python PyTorch EfficientNet Active Learning

DICOM view classification with minimal labels.

Need to sort millions of mammography DICOMs—is this a magnification view? Is a biopsy tool present? DICOM metadata is inconsistently recorded across machines and time periods, so you can't trust it.

Used EfficientNet-B0 with active learning. Label 50 images, train, review what it got wrong, label those, repeat. The model tells you which samples are most informative to label next.
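That label-train-review loop can be sketched as entropy-based uncertainty sampling (a minimal plain-Python sketch; in the real system the probabilities come from EfficientNet-B0, and the function and sample names here are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_informative(unlabeled_probs, k):
    """Pick the k unlabeled samples the model is least sure about.

    unlabeled_probs: {sample_id: [class probabilities]} from the current model.
    Returns sample ids sorted by descending predictive entropy.
    """
    ranked = sorted(unlabeled_probs,
                    key=lambda s: entropy(unlabeled_probs[s]),
                    reverse=True)
    return ranked[:k]

# One round of the label -> train -> review loop:
probs = {
    "img_a": [0.98, 0.02],   # confident: low entropy, skip
    "img_b": [0.55, 0.45],   # uncertain: high entropy, label next
    "img_c": [0.70, 0.30],
}
print(most_informative(probs, 2))  # → ['img_b', 'img_c']
```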

Hit 90% accuracy with only 400 labeled samples. Proved the concept works; now building it out with the team for full deployment.

Active learning classification diagram

Constrained Decoding for Medical JSON

2025
Python LLM llama.cpp vLLM

Schema-guaranteed JSON extraction from pathology reports.

Prompt-only approaches hit ~67% valid JSON on pathology reports. Not good enough when you need to process 17k reports and feed them into downstream pipelines.

Built a two-stage generation system: reason first in free text, then emit constrained JSON. Iterated through multiple stacks (Outlines → XGrammar → llama.cpp → vLLM), profiling bottlenecks until it was fast enough for production.
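A toy version of the two-stage contract (the real constraining happens at the decoder level via Outlines/XGrammar grammars; this stdlib sketch only shows the reason-then-emit split and a schema check, and the sentinel and field names are illustrative):

```python
import json

# Illustrative target schema: field names are made up for this sketch.
SCHEMA = {"specimen": str, "malignant": bool, "grade": int}

def split_stages(model_output: str):
    """Stage 1 is free-text reasoning; stage 2 is the JSON after a sentinel."""
    reasoning, _, payload = model_output.partition("###JSON###")
    return reasoning.strip(), payload.strip()

def validate(payload: str):
    """Parse and type-check against the schema; raise on any violation."""
    obj = json.loads(payload)
    for key, typ in SCHEMA.items():
        if not isinstance(obj.get(key), typ):
            raise ValueError(f"field {key!r} missing or not {typ.__name__}")
    return obj

raw = ('The report describes clustered calcifications... likely grade 2. '
       '###JSON### {"specimen": "breast", "malignant": true, "grade": 2}')
reasoning, payload = split_stages(raw)
print(validate(payload))
```

With grammar-level constraining, stage 2 can't produce invalid JSON at all; the validator is then just a belt-and-suspenders check before downstream pipelines.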

180x speedup (3 min → 1 sec per report), 100% schema validity. Currently scaling to 1M+ reports.

JSON schema validation diagram

Caring Contacts RAG

2025
Python RAG BM25 Clinical NLP

Personalized discharge letters for psychiatric patients.

Psychiatric patients benefit from personalized follow-up after discharge. The letters need to reference specific details from their stay to feel genuine—generic templates don't work.

Built a RAG system using BM25 for fast retrieval and a reranker for precision. Experimented with chunking at document, paragraph, and proposition levels—proposition-level worked best for pulling personal details. Added prompt guardrails for safety and tone.
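The first-stage retriever can be sketched as plain BM25 scoring (a minimal pure-Python version; the production system pairs this with a reranker, and the chunks below are made-up examples of proposition-level retrieval):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score tokenized docs against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Proposition-level chunks from a (made-up) stay record:
chunks = [
    "patient enjoyed the art therapy group".split(),
    "medication adjusted on day three".split(),
    "patient plays guitar and discussed music".split(),
]
scores = bm25_scores("guitar music".split(), chunks)
print(chunks[scores.index(max(scores))])  # the guitar/music chunk ranks first
```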

Shipped the concept to production in 3 weeks. Now in an active clinical trial. Presented at ISBD 2025 (Japan) and IASR 2025 (Boston).

RAG retrieval pipeline diagram

PHI De-identification

2025
Python NER Synthetic Data GPT-4

Hospital-specific PHI detection via synthetic data.

Off-the-shelf models trained on MIMIC/i2b2 miss hospital-specific patterns—our MRN format, local clinic names, physician signatures. Needed something that works on our data.

Built a synthetic data pipeline: SQL queries to sample real PHI patterns from the clinical DB, generate fake clinical text with GPT-4 using XML placeholders, then swap real identifiers back in and train NER on the result.
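The placeholder swap at the end of that pipeline is simple string surgery, and it yields gold NER spans for free (a sketch; the tag names and sampled identifiers below are illustrative, not real PHI):

```python
import re

def fill_placeholders(synthetic_text, sampled_phi):
    """Replace XML-style placeholders from the generator with identifiers
    sampled from the clinical DB, recording gold NER spans as we go."""
    spans, out, cursor = [], [], 0
    for m in re.finditer(r"<(\w+)/>", synthetic_text):
        out.append(synthetic_text[cursor:m.start()])
        value = sampled_phi[m.group(1)]
        start = sum(len(s) for s in out)        # char offset of the value
        out.append(value)
        spans.append((start, start + len(value), m.group(1)))
        cursor = m.end()
    out.append(synthetic_text[cursor:])
    return "".join(out), spans

text = "Seen by <PHYSICIAN/> at <CLINIC/>, MRN <MRN/>."
phi = {"PHYSICIAN": "Dr. Kim", "CLINIC": "Eastside Clinic", "MRN": "A12-99881"}
filled, gold = fill_placeholders(text, phi)
print(filled)
print(gold)  # character spans become NER training labels
```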

Improved F1 from 45% → 75% with my pipeline. Mentored a student who extended the approach to 90%.

PHI redaction diagram

FHIR Extraction Pipeline

2024
Python PostgreSQL FHIR

700x speedup on clinical report extraction.

Needed to pull 2M+ clinical reports for research. Existing queries took 120 days for a full batch—completely unusable for iterative work.

Reverse-engineered the undocumented Observation → DiagnosticReport structure (nobody had written it down) and added temporary PostgreSQL indexes on the join keys.

120 days → 4 hours (700x speedup). Also discovered 500k missing pathology reports from old import errors and built a system to reconstruct them.

FHIR data pipeline diagram

Microcalcification Classification from Radiology Reports

2024
Python RadBERT LLMs

Published — IGTxIMNO 2024 (Oral Presentation)

Compared three approaches for extracting microcalcification status from breast imaging reports: BI-RADS-segmented BERT, RadBERT with truncation, and zero/few-shot LLMs (Yi-34B, Mixtral, Meditron-70B, Qwen-72B).

RadBERT won at 94% accuracy. But the LLM results were more interesting: Yi-34B hit 79% with minimal tuning—beating models twice its size. Meditron-70B collapsed from 72% to 34% with few-shot learning. Bigger ≠ better for medical text.

Model comparison chart

NLP to OMOP Pipeline

2024
Python fhir.resources fhirpy Aidbox OMOP CDM nbdev

NLP inference is useless if the output can't go anywhere.

Had DeBERTa classifiers extracting microcalcification status and BI-RADS categories from radiology reports. The models worked. But clinical research needs structured data in standardized formats—FHIR for interoperability, OMOP for analytics. Model outputs were stranded as JSON blobs.

Built the full pipeline: report in → NLP inference → FHIR resources → OMOP tables.

The FHIR piece: Following the HL7 Breast Radiology IG, I extended fhir.resources with custom profiles. The IG is deeply nested—a microcalcification finding requires Device → Composition → Report Section → Findings BiLateral Breast Section → Mg Findings Observation → Calcification Observation → Calcification Presence Component (where the NLP label finally lives). Each resource references children by ID, built bottom-up. Created a class hierarchy to handle modality-specific codes while sharing structure.
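Bottom-up construction with children referenced by ID can be sketched with plain dicts (resource types follow the IG, but the field details here are heavily simplified illustrations, not the compliant profiles):

```python
import uuid

def make_resource(resource_type, **fields):
    """Minimal FHIR-ish resource with an assignable id."""
    return {"resourceType": resource_type, "id": str(uuid.uuid4()), **fields}

def ref(resource):
    """Parents reference children by id, so children must exist first."""
    return {"reference": f"{resource['resourceType']}/{resource['id']}"}

# Build leaf-first: the NLP label lives at the bottom of the nesting.
presence = make_resource("Observation",
    code={"text": "Calcification Presence"},
    valueCodeableConcept={"text": "Positive"})      # <- the NLP output
calcification = make_resource("Observation",
    hasMember=[ref(presence)])
mg_findings = make_resource("Observation",
    hasMember=[ref(calcification)])
report = make_resource("DiagnosticReport",
    result=[ref(mg_findings)])

print(report["result"][0]["reference"])
```

The real pipeline does this with `fhir.resources` classes and custom profiles rather than raw dicts, but the build order is the same: leaves first, references upward.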

Terminology mapping: NLP outputs "Positive"/"Negative". FHIR wants RadLex codes (RDE1556_present). OMOP wants SNOMED (10828004). Built a YAML-based mapping layer—single function call converts NLP labels to any target coding system.
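The mapping layer reduces to a nested lookup (in production the table lives in YAML; it is inlined here so the sketch is self-contained, and only the codes stated above are real):

```python
# In production this table is loaded from YAML; inlined for a runnable sketch.
# Only the "Positive" codes below come from the real mapping.
CODE_MAP = {
    "microcalcification": {
        "Positive": {"radlex": "RDE1556_present", "snomed": "10828004"},
    }
}

def to_code(concept, nlp_label, system):
    """Single entry point: NLP label -> code in the target coding system."""
    try:
        return CODE_MAP[concept][nlp_label][system]
    except KeyError:
        raise KeyError(f"no {system} mapping for {concept}/{nlp_label}") from None

print(to_code("microcalcification", "Positive", "radlex"))  # → RDE1556_present
print(to_code("microcalcification", "Positive", "snomed"))  # → 10828004
```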

Aidbox integration: Extended fhirpy's AsyncFHIRResource with a proper upsert method—search by device name/manufacturer, update if exists, create if not. The stock library didn't support this.

What I learned: Healthcare interoperability is 80% understanding the data models. The Breast Radiology IG has nested Observations five levels deep. If you don't map this out first, you'll build something that "works" but isn't compliant.

FHIR to OMOP pipeline diagram

side projects

Pronunciation App

2026
Python Whisper LLM

LLM-as-judge for my daughter's reading practice.

My daughter reads Harry Potter but says "marriage" for "mirage". Strong comprehension, weak decoding. Wanted something that could give her real-time feedback.

Built a pipeline: speech → IPA transcription → compare against gold pronunciation → LLM-as-judge with rubric scoring. Built it in evening sessions with her as the tester.
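The compare step can be sketched as edit distance over IPA phoneme sequences (a minimal sketch; the real scoring goes through an LLM judge with a rubric, and the IPA transcriptions below are illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences, O(len(b)) space."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete
                                     dp[j - 1] + 1,      # insert
                                     prev + (pa != pb))  # substitute
    return dp[-1]

def pronunciation_score(heard, gold):
    """1.0 = perfect match, 0.0 = nothing right."""
    return 1 - edit_distance(heard, gold) / max(len(heard), len(gold))

gold = ["m", "ɪ", "ɹ", "ɑː", "ʒ"]    # "mirage"
heard = ["m", "æ", "ɹ", "ɪ", "dʒ"]   # closer to "marriage"
print(round(pronunciation_score(heard, gold), 2))  # → 0.4
```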

Working prototype. Customer base of 1, retention 100%.

Speech to IPA pipeline

Roblox Game Guide

2026
Lua Roblox Studio Tutorial

Teaching game dev to a 7-year-old.

Wanted to teach my daughter game development. Documented the process as a step-by-step tutorial so other parents could do the same.

She built her first game—a Harry Potter themed escape room. Deployed the guide on Vercel.

Roblox game blocks and code

SEC Form D Job Board

2025
Python DuckDB FastHTML Claude API

Turned SEC filings into a job hunting pipeline.

Every company raising $5M-$60M in the US files a Form D with the SEC. That's Series A/B territory. The data is public, machine-readable, and nobody's using it for job hunting.

Built a pipeline that pulls the daily SEC EDGAR index, filters for the funding range, and stores everything in DuckDB via append-only JSONL. The catch: Form D is noisy—real estate funds, SPVs, holding companies all file too. Added an LLM filter: OpenAI classifies each filing as "real startup" or "fund/SPV" based on name + industry + filing patterns.
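The ingest step reduces to filter-then-append (a pure-Python sketch; the field names are simplified from the real Form D XML, and the OpenAI fund/SPV classifier is stubbed with a crude keyword check):

```python
import json, os, tempfile

def in_funding_range(filing, lo=5_000_000, hi=60_000_000):
    """Keep offerings in the $5M-$60M Series A/B territory."""
    amount = filing.get("total_offering_amount")
    return amount is not None and lo <= amount <= hi

def looks_like_fund(filing):
    """Stub for the LLM classifier: keyword check on the entity name."""
    name = filing["entity_name"].lower()
    return any(w in name for w in ("fund", "spv", "holdings", "real estate"))

def append_filings(path, filings):
    """Append-only JSONL store: one filtered filing per line."""
    kept = [f for f in filings
            if in_funding_range(f) and not looks_like_fund(f)]
    with open(path, "a") as fh:
        for f in kept:
            fh.write(json.dumps(f) + "\n")
    return kept

filings = [
    {"entity_name": "Acme Robotics Inc", "total_offering_amount": 12_000_000},
    {"entity_name": "Sunbelt Real Estate Fund II", "total_offering_amount": 30_000_000},
    {"entity_name": "Tiny LLC", "total_offering_amount": 200_000},
]
path = os.path.join(tempfile.mkdtemp(), "formd.jsonl")
kept = append_filings(path, filings)
print([f["entity_name"] for f in kept])  # → ['Acme Robotics Inc']
```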

FastHTML dashboard shows filtered companies. Click one and it triggers a Claude research agent that pulls their website, careers page, tech stack signals, and generates a cold outreach report. The insight: recently funded startups are hiring but haven't posted jobs yet. Form D filings are leading indicators.

SEC EDGAR to job board pipeline

RARO Paper Implementation

2025
Python PyTorch Research

Attempted replication of "Escaping the Verifier".

Read "Escaping the Verifier: Learning to Reason via Demonstrations". The idea was compelling—train a model to generate reasoning chains that fool a verifier. Wanted to see if it worked.

Implemented it on Colab, trained it. It got worse. Learned why—the training signal wasn't strong enough to overcome the base model's priors. Sometimes negative results are the most educational.

Training loss chart showing failure

Semantic Routers for MoE

2025
Python PyTorch MoE GPT-2

Cohere fellowship pitch. Got rejected. Revisiting at tiny scale.

Applied for Cohere's fellowship: "take one of our papers, propose a change." Chose their BAM paper on upcycling FFNs into Mixture-of-Experts. The limitation: linear routers do domain matching only. Legal tokens go to the law expert, but cross-domain skills are missed.

Proposed learnable expert embeddings—"cue cards" encoding capabilities like symbolic reasoning, not just domains. Router matches hidden states to embeddings, discovering unexpected expertise.
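The routing idea can be sketched as similarity matching between a hidden state and expert embeddings (a toy pure-Python sketch with hand-set 3-d vectors; in the proposal the embeddings are learned, the router uses a softmax over the similarities, and all names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def route(hidden_state, expert_embeddings, top_k=1):
    """Match a token's hidden state against expert 'cue cards' and return
    the best-matching experts (softmax weighting omitted for brevity)."""
    ranked = sorted(expert_embeddings,
                    key=lambda name: cosine(hidden_state, expert_embeddings[name]),
                    reverse=True)
    return ranked[:top_k]

experts = {
    "law":      [1.0, 0.0, 0.0],
    "code":     [0.0, 1.0, 0.0],
    "symbolic": [0.0, 0.5, 0.9],   # a capability cue card, not a domain
}
h = [0.1, 0.4, 0.9]  # hidden state for a token doing symbolic manipulation
print(route(h, experts))  # → ['symbolic']
```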

Grabbed some online GPUs, trained GPT-2 domain experts (Law, Code, Math, General), built the MoE, and observed the router. Got rejected. Now revisiting at tiny scale first—Karpathy's advice.

Router matching hidden states to expert embeddings

Radiomics vs. Learned Features

2025
Python PyRadiomics SimpleITK Random Forest

Bored of Titanic, used pathology images instead.

Wanted to learn Random Forests properly. Couldn't stomach another Titanic tutorial. Grabbed PCam instead—262k pathology patches of metastatic tissue from whole slide images.

Converted H5 to ImageNet format, extracted traditional radiomics features (GLCM, GLRLM, texture stats), trained a classifier. It worked, but barely.
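GLCM texture features of the kind PyRadiomics computes can be sketched by hand (a minimal pure-Python version for a single horizontal offset and toy 4-level patches; PyRadiomics adds normalization, multiple angles, and dozens of statistics):

```python
def glcm(image, levels):
    """Gray-level co-occurrence matrix for the (0, +1) horizontal offset."""
    m = [[0] * levels for _ in range(levels)]
    for row in image:
        for a, b in zip(row, row[1:]):   # count adjacent gray-level pairs
            m[a][b] += 1
    return m

def contrast(m):
    """GLCM contrast: (i - j)^2 weighted by normalized co-occurrence."""
    total = sum(sum(r) for r in m)
    return sum((i - j) ** 2 * v / total
               for i, r in enumerate(m) for j, v in enumerate(r))

smooth = [[0, 0, 1, 1], [0, 0, 1, 1]]   # gradual gray-level transitions
noisy  = [[0, 3, 0, 3], [3, 0, 3, 0]]   # abrupt high-contrast transitions
print(contrast(glcm(smooth, 4)) < contrast(glcm(noisy, 4)))  # → True
```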

Turns out handcrafted radiomics is mid. You need learned feature extractors. The boring way would've taught me the same sklearn API with less insight.

Pathology image to feature matrix

Bookkeeping Agent

2025
Python Claude DuckDB Pydantic

Built tools for a financial agent from scratch.

Interviewing for a role at a financial agents company. Wanted to understand the patterns by building, not reading.

Started with file I/O tools, added Pydantic structured extraction for bank statements. Hit context limits on large files—split by pages. Calculator tool failed on transcription errors, pivoted to DuckDB for SQL queries directly on CSVs. That was the breakthrough.
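The SQL-over-transcription move can be sketched with stdlib sqlite3 standing in for DuckDB (DuckDB queries CSVs natively, so this sketch loads the CSV first; the statement fields are illustrative):

```python
import csv, io, sqlite3

# A tiny stand-in for an extracted bank-statement CSV.
STATEMENT = """date,merchant,amount
2025-01-03,Starbucks,6.50
2025-01-10,Starbucks,6.50
2025-01-15,Grocery,82.10
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tx (date TEXT, merchant TEXT, amount REAL)")
rows = list(csv.DictReader(io.StringIO(STATEMENT)))
con.executemany("INSERT INTO tx VALUES (:date, :merchant, :amount)", rows)

# The agent emits SQL instead of doing arithmetic on transcribed numbers,
# so one bad field can't silently corrupt a running total.
total = con.execute(
    "SELECT SUM(amount) FROM tx WHERE merchant = 'Starbucks'"
).fetchone()[0]
print(total)  # → 13.0
```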

Ran it on my own statements. Found I was spending $70/month on Starbucks. Bought a coffee machine for $40.

SQL query on financial data