Engineers who haven't spent time in insurance document processing tend to underestimate the loss run problem. It looks like a table extraction task. It is not a table extraction task.
Loss runs are the historical claims record for a risk — typically requested from prior carriers going back five to ten years, and required as part of any specialty lines submission. Every carrier that produces them has its own format. Some carriers generate clean, structured PDFs from their policy admin systems. Many produce PDFs that were originally word processor documents, printed to PDF and sometimes rescanned. A meaningful percentage arrive as photocopies of printouts from systems that no longer exist.
The format taxonomy
When we started building Undwrlyft's loss run extraction pipeline, we catalogued the format variation we were encountering. The diversity is not minor. Within a single submission packet, an underwriter might receive loss runs from three prior carriers, each with a fundamentally different document structure:
- Column-first layouts: The year of loss runs down the left side, with claim detail columns across. Common among legacy regional carriers. The "Year" column may be labeled "Accident Year," "Policy Year," or "Report Year." These are not the same thing, and the difference matters significantly for loss development analysis.
- Claim-first layouts: Each row is a claim, with policy period and accident date as separate fields. Requires aggregation to produce year summaries. Common in more modern PAS-generated reports.
- Period-summary layouts: Pre-aggregated by policy period, with totals only — no individual claim detail. Useful for pricing, but limited for large-loss identification.
- Carrier-specific formats with embedded narrative: Some carriers wrap claim summaries in narrative text rather than tables. The incurred loss figure might appear in a sentence: "Total incurred as of 12/31/2024: $142,500." A rule-based parser will miss this entirely.
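The aggregation step that claim-first layouts require can be sketched roughly as follows. This is a minimal illustration, not a description of Undwrlyft's pipeline: the `Claim` record, its field names, and the `summarize_by_policy_year` helper are all assumptions made for the example. Amounts are held as `Decimal` to avoid float rounding.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Claim:
    # Hypothetical claim-level record; field names are illustrative only.
    claim_id: str
    policy_start: date   # start of the policy period the claim falls under
    accident_date: date
    paid: Decimal
    incurred: Decimal
    is_open: bool


def summarize_by_policy_year(claims):
    """Roll claim-level rows up into one summary per policy period start year."""
    summary = defaultdict(lambda: {"claims": 0, "open": 0,
                                   "paid": Decimal("0"), "incurred": Decimal("0")})
    for c in claims:
        row = summary[c.policy_start.year]
        row["claims"] += 1
        row["open"] += int(c.is_open)
        row["paid"] += c.paid
        row["incurred"] += c.incurred
    return dict(sorted(summary.items()))


claims = [
    Claim("A-1", date(2021, 7, 1), date(2021, 9, 14), Decimal("12000"), Decimal("12000"), False),
    Claim("A-2", date(2021, 7, 1), date(2022, 3, 2), Decimal("5000"), Decimal("40000"), True),
    Claim("B-1", date(2022, 7, 1), date(2022, 11, 30), Decimal("0"), Decimal("7500"), True),
]
by_year = summarize_by_policy_year(claims)
```

Claim A-2 has a 2022 accident date but rolls into the 2021 policy year, which is the accident-year versus policy-year distinction noted in the list above. Grouping by `accident_date.year` instead would produce a different summary from the same rows.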
Why column headers are the hardest part
Even within tabular formats, the column header problem is severe. The fields that matter to an underwriter — paid losses, incurred losses, allocated loss adjustment expense (ALAE), open reserves, and whether a claim is open or closed — appear under a variety of labels depending on the carrier's internal terminology.
"Incurred" might appear as: Incurred, Total Incurred, Gross Incurred, Net Incurred, Case Incurred, or just a column with a dollar sign and no label at all when the header appears on a prior page and the current page is a continuation. "Paid" can be Paid Loss, Paid to Date, Cash Paid, or Loss Payments. ALAE is particularly inconsistent — it appears as ALAE, DCC (Defense and Cost Containment), ALE, Allocated Expense, or is folded into a combined "Losses and LAE" total without separation.
For a pricing underwriter trying to calculate a pure loss rate or an experience modification, these distinctions are not academic. A combined losses-and-LAE figure used as a pure loss input will overstate the loss ratio. A case-only incurred figure omits IBNR development. Getting this wrong propagates into pricing.
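A rule-based baseline for the header problem might look like the sketch below. The alias table is built only from the labels mentioned above and is not exhaustive; the canonical field names, the `normalize_header` function, and the review flag are assumptions for illustration. The point is that combined or qualified labels are surfaced for review rather than silently mapped.

```python
import re

# Illustrative alias table built from the labels mentioned above; not exhaustive.
# The canonical field names are assumptions for this sketch.
HEADER_ALIASES = {
    "paid": {"paid", "paid loss", "paid to date", "cash paid", "loss payments"},
    "incurred": {"incurred", "total incurred", "gross incurred",
                 "net incurred", "case incurred"},
    "alae": {"alae", "dcc", "ale", "allocated expense"},
}
# Labels that combine loss and expense and must not be used as a pure loss input.
COMBINED_LABELS = {"losses and lae", "loss and lae", "loss & lae"}


def normalize_header(raw):
    """Return (canonical_field, needs_review) for a raw column label."""
    label = re.sub(r"\s+", " ", raw or "").strip().lower()
    if not label or label == "$":
        return None, True  # unlabeled column, e.g. on a continuation page
    if label in COMBINED_LABELS:
        return "loss_and_lae_combined", True
    for field, aliases in HEADER_ALIASES.items():
        if label in aliases:
            # Gross/net/case qualifiers change the meaning, so flag them.
            qualified = label.split()[0] in {"gross", "net", "case"}
            return field, qualified
    return None, True  # unknown label: never guess


print(normalize_header("Paid to Date"))
print(normalize_header("Net Incurred"))
print(normalize_header("Losses and LAE"))
```

A lookup like this only covers labels someone has already seen, which is why it works as a baseline and not as a complete answer to carrier-specific terminology.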
The OCR layer failure mode
Generic OCR — the kind you get from off-the-shelf document processing APIs — handles the text recognition layer reasonably well on high-quality PDFs. It fails in several characteristic ways on loss runs specifically:
First, tabular structure recovery. OCR engines extract text, but most do not reliably recover which text belongs to which table cell when columns are narrow, fonts are small, or lines are absent. A loss run printed without grid lines — which is common — gets extracted as a flat stream of numbers with ambiguous column assignment. Reconstructing which dollar figure belongs to which year, carrier, and field from a flat text stream requires structural inference that generic OCR does not perform.
Second, number accuracy in dense layouts. In a table with eight columns and fifteen rows of claim data, OCR character substitution errors (a "1" read as "l", a "0" read as "O") compound. The error rate on individual characters may be low, but across the 120 cells of such a table, the probability of at least one numeric error is materially higher. For financial data, one corrupted figure in a loss summary is not a minor issue.
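To make the compounding concrete, here is a back-of-the-envelope calculation. The inputs are assumptions chosen for illustration, not measured rates: errors are treated as independent, each cell is taken to hold about six characters, and the per-character substitution rate is set at 0.1%.

```python
def prob_at_least_one_error(char_error_rate, n_chars):
    """P(at least one misread), assuming independent per-character errors."""
    return 1.0 - (1.0 - char_error_rate) ** n_chars


# Assumed figures for illustration only: 8 columns x 15 rows,
# ~6 characters per cell, 0.1% per-character substitution rate.
cells = 8 * 15
chars = cells * 6
p_table = prob_at_least_one_error(0.001, chars)
print(f"{chars} characters -> {p_table:.1%} chance of at least one error")
```

With these assumed inputs the table holds 720 characters and has roughly even odds of containing at least one misread, even though each individual character is read correctly 99.9% of the time.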
Third, multi-page document continuity. Loss runs often span multiple pages, with headers only on the first page. Generic OCR processes each page independently. Column alignment across pages — knowing that column 3 on page 2 maps to "Paid" because that was the column header on page 1 — requires document-level context that standard OCR lacks.
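One simple form of document-level context is carrying headers forward from the last page that had them. The sketch below assumes a hypothetical intermediate format (a list of page dicts with an optional `"headers"` list and a `"rows"` list) and uses matching column counts as the only consistency check; a production system would need stronger alignment evidence.

```python
def carry_headers_forward(pages):
    """Assign headers to continuation pages from the last page that had them.

    `pages` is a list of dicts with an optional "headers" list and a "rows"
    list (a hypothetical intermediate format for this sketch).
    """
    resolved = []
    last_headers = None
    for number, page in enumerate(pages, start=1):
        headers = page.get("headers")
        inherited = False
        if headers:
            last_headers = headers
        elif last_headers and all(len(r) == len(last_headers) for r in page["rows"]):
            headers, inherited = last_headers, True
        else:
            headers = None  # no prior header or column count mismatch: needs review
        resolved.append({"page": number, "headers": headers,
                         "inherited": inherited, "rows": page["rows"]})
    return resolved


pages = [
    {"headers": ["Year", "Claims", "Paid", "Incurred"],
     "rows": [["2019", "2", "1,500", "4,000"]]},
    {"rows": [["2020", "1", "0", "9,250"]]},   # continuation page, no header
    {"rows": [["2021", "3", "700"]]},          # column count does not match
]
result = carry_headers_forward(pages)
```

Here column 3 on page 2 resolves to "Paid" because of the page 1 header, while page 3 is left without headers and flagged by omission instead of being force-fitted.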
Where language models help — and where they don't
The genuine advance that large language models bring to this problem is semantic field normalization: the ability to recognize that "Net Incurred" and "Gross Case Loss" and "Total Losses to Date" may all be pointing at the same pricing field depending on context, rather than requiring an explicit mapping for each carrier's terminology.
We're not saying LLMs solve the loss run problem by themselves. They don't. LLMs hallucinate numbers. A language model asked to extract a specific dollar figure from a complex table will sometimes produce a plausible-looking but wrong figure — especially when the table is dense and the values are close together. For financial document extraction, hallucinated numbers are not an acceptable failure mode.
The architecture that works combines OCR for text recognition, layout analysis for structural recovery, and language model inference for semantic field mapping — with the actual numeric values always grounded in the OCR output, never generated by the language model. The LLM's job is to identify which OCR-extracted text belongs to which field, not to supply values it cannot see.
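The grounding rule can be sketched as follows. This is an assumed interface, not the production design: `ocr_tokens` is a flat list of OCR strings, and `llm_mapping` is a hypothetical model output that names a token index per field. The model chooses which token is the field; the number itself is always parsed from the OCR text.

```python
import re
from decimal import Decimal

# Accepts forms like "142500", "$142,500", "1,200.50", "(1,200.50)".
MONEY = re.compile(r"^\$?\(?\d[\d,]*(?:\.\d{1,2})?\)?$")


def parse_money(text):
    """Parse an OCR token as money; return None if it does not look like money."""
    token = text.strip()
    if not MONEY.match(token):
        return None
    negative = "(" in token  # accounting-style parentheses mean a negative amount
    value = Decimal(re.sub(r"[^\d.]", "", token))
    return -value if negative else value


def ground_fields(ocr_tokens, llm_mapping):
    """Build field values from OCR tokens only.

    `llm_mapping` maps a field name to an index into `ocr_tokens`.
    Anything that cannot be resolved to a parseable OCR token is rejected.
    """
    values, rejected = {}, []
    for field, index in llm_mapping.items():
        in_range = isinstance(index, int) and 0 <= index < len(ocr_tokens)
        amount = parse_money(ocr_tokens[index]) if in_range else None
        if amount is None:
            rejected.append(field)  # route to human review instead of guessing
        else:
            values[field] = amount
    return values, rejected


tokens = ["Total", "incurred", "as", "of", "12/31/2024:", "$142,500"]
mapping = {"incurred": 5, "paid": 9}  # index 9 does not exist in the OCR output
values, rejected = ground_fields(tokens, mapping)
```

Because the mapping carries indices and not numbers, a model that points at a nonexistent or non-numeric token produces a rejected field, never a plausible-looking invented value.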
A real-world illustration
Consider a specialty contractor E&S submission we processed in 2024: a California environmental contractor with prior coverage from three different carriers over a seven-year period. The loss runs included a structured PDF from a regional carrier with clean tabular output, a photocopied two-page document with partial column headers from a carrier that had since been acquired, and a letter-format document from a London-based reinsurer that reported losses in GBP with USD conversion notes in footnotes.
Each of those documents required a different extraction approach. The currency conversion footnote problem alone — recognizing that the GBP figures needed to be flagged for the underwriter rather than converted programmatically at whatever exchange rate the system might apply — required context-aware handling that a simple table extractor would not manage.
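One way to express "flag, don't convert" in an output schema is sketched below. The `ExtractedAmount` record, its field names, and the `record_amount` helper are hypothetical; the only behavior taken from the text above is that foreign-currency figures stay in their reported currency and are marked for the underwriter.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ExtractedAmount:
    # Hypothetical record type; the field names are assumptions for this sketch.
    field: str
    amount: Decimal
    currency: str        # ISO 4217 code as reported in the source document
    needs_review: bool
    note: str = ""


def record_amount(field, amount, currency, base_currency="USD"):
    """Keep the amount in its reported currency and flag it; never convert."""
    foreign = currency != base_currency
    note = (f"Reported in {currency}; conversion left to the underwriter."
            if foreign else "")
    return ExtractedAmount(field, amount, currency, needs_review=foreign, note=note)


gbp = record_amount("incurred", Decimal("85000"), "GBP")
usd = record_amount("incurred", Decimal("142500"), "USD")
```

The figures in the usage lines are made up for the example. Storing the currency next to every amount keeps a GBP value from being summed into a USD total by accident.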
The underwriter still reviewed every extracted value against the source document before the submission advanced. That review took twelve minutes. Their previous process, working from raw PDFs, took ninety. The accuracy of the extracted data was high enough that the review was confirmation rather than reconstruction.
What this means for procurement
When specialty carriers evaluate loss run extraction capabilities, the right question is not "can your system parse a clean, well-formatted loss run PDF?" Any modern extraction system can do that. The right request is: show me extraction on the worst five formats from my specific prior carrier mix. Include one photocopied document. Include one that spans seven pages. Include one where the header is on a different page than the data.
Format diversity is not a corner case in E&S. It is the baseline condition. Any extraction system that hasn't been specifically built and tested on the range of specialty lines loss run formats that actually appear in production will underperform on the cases that matter most — the complex submissions where accurate data extraction has the highest value for the underwriting decision.