How do DOJ dataset numbers map to specific Epstein investigations and court files?
Executive summary
The Justice Department organized its public Epstein release into numbered "Data Sets" (Data Sets 1–12, plus related court-record pages) rather than by case name; together those datasets contain about 3.5 million pages responsive to the Epstein Files Transparency Act, drawn from multiple prosecutions and investigations [1] [2]. The DOJ's public materials identify the primary source cases (the Florida and New York Epstein prosecutions, the Maxwell prosecution, probes into Epstein's death, a Florida butler matter, FBI investigations and an OIG review), but the agency publishes no simple one-to-one key mapping each dataset number to a single case file. Researchers must instead use each dataset's file listing and embedded metadata to trace specific documents back to particular court files or grand jury exhibits [2] [3].
1. How the DOJ organized the release and what the dataset numbers mean
The DOJ split its production into numbered Data Sets (for example, Data Sets 9–12 were part of a December 2025 upload) and a broader "Epstein Library" and court-records hub, with individual dataset landing pages that host the released files and descriptions [4] [5] [6] [1] [7]. The agency framed those datasets as batches of collected files rather than as labels for discrete prosecutions: the dataset numbers are administrative containers for review, redaction and publication work, not canonical case identifiers [1] [2].
2. What cases and investigations supplied the material
DOJ disclosure materials list the primary sources feeding the release: the Florida and New York cases against Jeffrey Epstein, the New York case against Ghislaine Maxwell, New York matters connected to investigations into Epstein's death, a Florida case involving a former Epstein butler, multiple FBI investigations, and the Office of Inspector General probe into Epstein's death [2]. News outlets reporting on the late-2025/early-2026 releases likewise note that the newest batches included grand-jury presentations, court records, and interview transcripts (including internal DOJ interviews) tied to earlier federal inquiries [3].
3. Why there is no simple “dataset-to-case” index in the DOJ release
The DOJ's production was driven by responsiveness to statutory criteria and judicial orders, redaction protocols, and privilege reviews. Because it drew overlapping material from multiple case files (including duplicates between SDNY and SDFL files), many documents legitimately belong to more than one case file and so resist a one-to-one mapping to a single prosecution docket [2]. Judicial constraints (grand jury secrecy rules) and redaction decisions further complicate any neat mapping, and the agency withheld some items as privileged or as unrelated to the designated cases [2].
4. Practical method to map dataset entries to specific court files
Mapping documents is necessarily a forensic exercise: use the DOJ dataset landing pages and file-level metadata (file names, Bates ranges, docket numbers, exhibit labels, grand-jury exhibit tags and internal DOJ labels) to trace each item to a source prosecution or investigation. Reporters and researchers have, for example, identified grand-jury exhibits and interview transcripts among the releases that can be cross-checked against SDNY and SDFL dockets and the Maxwell case records [3] [7]. Congressional releases and curated extracts (such as the House Oversight batch of ~33,295 pages) offer additional indexed windows into particular subsets of the production, but they do not replace the file-by-file tracing that researchers must do [8].
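The metadata tracing described above can be sketched in code. The snippet below scans file names for docket-number and Bates-stamp patterns of the kind researchers look for. Note the hedge: the DOJ release does not document a fixed naming scheme, so the file names, regex patterns, and the `DOJOIG` Bates prefix here are all hypothetical illustrations, not the actual conventions of the repository; real tracing requires adapting the patterns to what appears in each dataset's listing.

```python
import re

# Hypothetical patterns -- illustrative only, not the DOJ's actual naming scheme.
# Federal criminal dockets look like "19-cr-00490"; Bates stamps are typically an
# uppercase prefix followed by a zero-padded serial number.
DOCKET_RE = re.compile(r"\d{2}-(?:cr|cv|mj)-\d{3,5}")
BATES_RE = re.compile(r"([A-Z]{2,10})[-_](\d{5,9})")

def extract_clues(filename: str) -> dict:
    """Pull docket-number and Bates-stamp clues out of a single file name."""
    dockets = DOCKET_RE.findall(filename)
    bates = [f"{prefix}-{num}" for prefix, num in BATES_RE.findall(filename)]
    return {"file": filename, "dockets": dockets, "bates": bates}

# Invented example file names, for illustration only.
files = [
    "DataSet09_19-cr-00490_exhibit_GX204.pdf",
    "DataSet11_DOJOIG_00012345_transcript.pdf",
]
for f in files:
    print(extract_clues(f))
```

Clues extracted this way are starting points, not answers: a docket-number hit still has to be confirmed against the actual SDNY or SDFL docket, and a Bates prefix against the producing investigation.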
5. Known gaps, risks and what the released datasets reveal (and do not reveal)
DOJ says it identified over six million potentially responsive pages and released roughly 3.5 million after review and redactions, meaning significant material was screened out or withheld and some documents were duplicated across investigative files [9] [2]. Independent reporting flagged redaction failures, with unredacted images and victim names appearing in the public set, underscoring the risk of using the raw repository without cross-checking redactions and court orders [10]. The department's earlier internal memo and the subsequent litigation context (including contested claims about a "client list") also mean the released collections must be interpreted alongside DOJ statements and court rulings, not read as a definitive single narrative about uncharged third parties [11].
6. Bottom line for researchers and journalists
There is no DOJ-published cheat-sheet that maps "Data Set X = Case Y" across the entire corpus. Instead, the Data Sets are batched repositories containing material from the multiple sources identified by the DOJ (Florida, New York, Maxwell, death probes, FBI and OIG), and precise linkage requires per-file examination of metadata, exhibits and docket references available on the dataset pages and court-record portals [2] [4] [7]. Where dataset-level descriptions exist, they point to the contributing investigations, but authoritative mapping depends on document-level evidence and cross-referencing with court dockets and congressional disclosures [8].