Which independent databases and newsroom projects have built searchable indexes of the DOJ Epstein releases, and how do they document provenance?
Executive summary
Three types of non‑DOJ searchable indexes have emerged around the massive Jeffrey Epstein releases: newsroom and independent databases that ingested DOJ files (for example, Google’s Pinpoint collection and COURIER’s repository), congressional releases that republished subsets of records, and collaborative newsroom projects that indexed and reported on the material. Each documents provenance differently: some link back to DOJ dataset identifiers and timestamps, while others preserve original DOJ filenames or create independent transcriptions and copies of items the DOJ later deleted. Reporting remains incomplete on the full inventory of independent indexes [1] [2] [3] [4] [5].
1. Google Pinpoint (journalist studio) — searchable collection built on DOJ datasets and labeled uploads
Google’s Pinpoint instance is being used by at least one newsroom to host a dedicated searchable collection of DOJ Epstein files. Its public listing states that the project has uploaded multiple DOJ data sets (Data Sets 1–8 and 12 at the time of listing), promises further uploads as DOJ releases continue, explicitly notes that it will retain items deleted by the DOJ, and invites tips from users who find notable documents [1]. The Pinpoint page therefore signals provenance by tying records to named DOJ data set numbers and by maintaining a changelog-style note listing which DOJ datasets have been uploaded, which helps researchers trace an item back to a particular DOJ production [1].
2. COURIER Newsroom / independent repository — scraped DOJ releases, independent transcriptions, claims of preserving deleted items
COURIER published an independent searchable repository that it describes as containing everything the DOJ “wants hidden”: court filings, images, audio, video, and both DOJ‑provided and independently produced transcriptions of audio content. Presenting DOJ originals alongside the outlet’s own copies and transcriptions is an explicit provenance practice, and COURIER asserts the repository contains items the DOJ later deleted [2]. COURIER frames its provenance documentation as a side‑by‑side comparison (DOJ transcription vs. COURIER transcription) and says it retrieved the initial DOJ release before portions were removed, which it uses to justify preserving the original filenames and media. That claim is sourced to COURIER’s reporting rather than to the DOJ itself, so independent verification of preserved deletions rests on the outlet’s archive rather than on a government index [2].
3. Congressional and federal repositories — Oversight Committee and DOJ official datasets as provenance anchors
Congressional releases, such as the House Oversight Committee’s publication of a 33,295‑page packet provided by the DOJ, function as alternative public repositories. They explicitly note their source as materials the Department provided to the committee, creating a separate provenance trail that links specific pages to the committee’s subpoena and release timeline [3]. The DOJ’s official Epstein library and dataset pages remain the canonical provenance anchors: the department hosts the “Epstein Library,” dataset pages, and a public statement describing the scope of the production, including the fact that files came from five primary investigative sources and warnings that the production may include fake or publicly submitted materials. These DOJ pages identify data‑set numbers and state the agency’s caveats about duplicates and unvetted public submissions [6] [5] [7] [8] [9].
4. Collaborative newsroom projects — pooled reporting with shared indexes but outlet‑level provenance
Several news outlets coordinated coverage and shared findings from the releases; reporting notes that journalists “worked together to examine the documents and share information” while each newsroom remained responsible for its own independent reporting [4]. This model produces shared indexes or trackers in some cases, but it documents provenance primarily through citations to the DOJ dataset and each outlet’s own internal handling rather than through a single, uniform cataloguing standard [4]. Collaboration improves cross‑verification of notable documents but can leave provenance fragmented across outlets: some cite DOJ dataset identifiers, some publish copies or transcriptions, and some rely on the committee or other repositories as source anchors [4] [5].
5. How provenance is documented — common practices and limits in current reporting
Across the independent projects described in reporting, provenance is most commonly documented by referencing DOJ dataset numbers or the Oversight Committee release, by preserving original DOJ filenames or timestamps when possible, and by publishing side‑by‑side transcriptions or copies of media that an outlet claims to have captured before DOJ deletions [1] [2]. The DOJ’s own pages explicitly warn about over‑collection, duplicates, and potentially false public submissions, an important provenance caveat that independent databases sometimes echo but do not fully reconcile with their own archives [7] [9]. Reporting does not provide a definitive master list of every independent index built to date, nor a standardized metadata schema shared across projects, so researchers must consult the individual project pages (Pinpoint, COURIER, congressional releases, and the DOJ library) to map a document back to an original DOJ production or to the committee subpoena [1] [2] [3] [6].
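Because no shared metadata schema exists across these projects, a researcher reconciling them must improvise one. The sketch below is purely hypothetical: the class name, field names, and example values are invented here to illustrate the provenance facts the projects record in prose (DOJ dataset number, original filename, capture time, transcription source, and an outlet's deletion claim); none of them come from any actual project.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

# Hypothetical sketch only: no standardized schema is used across the
# projects described above. Every field name here is invented for
# illustration, not taken from Pinpoint, COURIER, or the DOJ library.

@dataclass
class ProvenanceRecord:
    doj_dataset: Optional[int]           # e.g. DOJ "Data Set 8", when known
    original_filename: Optional[str]     # DOJ filename, preserved verbatim
    captured_at: Optional[str]           # ISO timestamp of the outlet's copy
    transcription_source: Optional[str]  # "DOJ", "outlet", or None
    deleted_by_doj: bool = False         # the outlet's claim, not DOJ-confirmed
    anchors: list = field(default_factory=list)  # e.g. committee release pages

# Invented example values, for illustration only.
record = ProvenanceRecord(
    doj_dataset=8,
    original_filename="example-record-001.pdf",
    captured_at="2025-01-01T00:00:00Z",
    transcription_source="outlet",
    deleted_by_doj=True,
)
print(asdict(record))
```

A flat record like this makes the section's caveat concrete: an item whose `deleted_by_doj` flag is set can only be verified against the outlet's own archive, since no government index carries the corresponding entry.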