How can I clean and merge ICE detention datasets across years to track individuals or statuses?
Executive summary
Merging ICE detention datasets across years is feasible but requires careful handling of changing identifiers, dataset boundaries, and processing quirks. Vera’s ICE Detention Trends team documents linked detention histories from 2008–mid‑2025 but warns that dataset boundaries and differing record structures can create false discontinuities [1] [2]. Some Deportation Data Project and FOIA releases include linked identifiers that permit person‑level tracing within a release, but those identifiers often differ between releases, blocking straight merges without reconciliation [3] [4].
1. Know your sources and their limits: federal, FOIA, and third‑party feeds
ICE’s own public dashboards and statistics are authoritative for recent administrative counts, but they are described as “fluid” until the fiscal year locks and can change in format or definition over time [5]. The DHS OHSS Persist Datasets and monthly tables are constructed from ICE’s Enforcement Integrated Database and note corrections to historical processing errors (for example, FY2013–14 book‑out location corrections in January 2025), so researchers must treat older dumps and later corrections as distinct source editions [6] [7].
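Treating editions as distinct can be as simple as stamping every row with the release it came from at load time. A minimal pandas sketch, where the release labels and file paths are placeholders rather than actual ICE/OHSS file names:

```python
# A minimal sketch of loading each release as a distinct "source edition" and
# tagging rows with provenance. Labels and paths below are placeholders.
from pathlib import Path
import pandas as pd

RELEASES = {
    "ohss_persist_older_dump": "data/raw/ohss_persist_older.csv",
    "ohss_persist_corrected": "data/raw/ohss_persist_corrected.csv",
}

def load_release(label: str, path: str) -> pd.DataFrame:
    """Read one release and stamp every row with its source edition."""
    df = pd.read_csv(path, dtype=str, low_memory=False)
    df["source_release"] = label
    df["source_file"] = Path(path).name
    df["source_row"] = df.index  # row position inside the original file
    return df

frames = [load_release(label, path) for label, path in RELEASES.items()]
# Keep editions side by side rather than overwriting older dumps with corrections.
all_editions = pd.concat(frames, ignore_index=True)
```

Keeping editions side by side (instead of replacing older files) preserves the ability to show how a figure changed after an agency correction.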
2. Identifier mismatch is the central practical problem
Some historical ICE releases obtained via FOIA include unique identifiers that link the arrest, detainer, and detention tables within those releases, enabling trace‑through analyses “anonymously” inside a single release [3]. However, identifier schemes differ across separate ICE data dumps (for example, earlier ACLU/FOIA tables versus the most recent releases), so a person tracked in one release will not match automatically to a record in another without an explicit reconciliation strategy [3] [1].
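Within a single release, the linked identifier makes joins straightforward. A minimal sketch, assuming a hypothetical shared column name ("unique_id") and illustrative file names; actual column names vary by release:

```python
# A minimal sketch of a within-release join on a hypothetical shared identifier
# ("unique_id") linking the arrest, detainer, and detention tables of one release.
import pandas as pd

arrests = pd.read_csv("release_a/arrests.csv", dtype=str)
detainers = pd.read_csv("release_a/detainers.csv", dtype=str)
detentions = pd.read_csv("release_a/detentions.csv", dtype=str)

# Within one release these IDs are consistent, so plain merges work.
linked = (
    detentions
    .merge(detainers, on="unique_id", how="left", suffixes=("", "_detainer"))
    .merge(arrests, on="unique_id", how="left", suffixes=("", "_arrest"))
)

# The same "unique_id" values will NOT line up with a different release,
# so never merge across releases on this column without reconciliation.
```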
3. Use published projects and code as blueprints
Vera’s Detention Trends dashboard and technical appendix show practical solutions: treating each person’s detention history as one or more stints, concatenating stints, and handling transfers through how facility codes are treated; their repo and appendix describe how they joined detention and bond data and how merged fields can contain multiple values that you must collapse or select among [1] [2] [8]. The Deportation Data Project and UWCHR code repositories also provide cleaning pipelines (concatenation, parsing, headcount calculation) that you can adapt rather than reinventing the wheel [3] [9].
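As one example of the kind of routine those pipelines provide, here is a minimal daily headcount sketch (not the UWCHR or Vera code itself); the "book_in_date" and "book_out_date" column names are assumptions:

```python
# A minimal sketch of a daily headcount routine: count how many stays are open
# on each calendar day. Column names are assumptions, not a documented schema.
import pandas as pd

def daily_headcount(stays: pd.DataFrame, start: str, end: str) -> pd.Series:
    """Return the number of open stays on each calendar day in [start, end]."""
    stays = stays.copy()
    stays["book_in_date"] = pd.to_datetime(stays["book_in_date"], errors="coerce")
    stays["book_out_date"] = pd.to_datetime(stays["book_out_date"], errors="coerce")
    days = pd.date_range(start, end, freq="D")
    counts = []
    for day in days:
        open_mask = (stays["book_in_date"] <= day) & (
            stays["book_out_date"].isna() | (stays["book_out_date"] > day)
        )
        counts.append(open_mask.sum())
    return pd.Series(counts, index=days, name="headcount")
```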
4. Practical steps to clean and harmonize across years
Start by documenting each file’s variable definitions and date ranges; OHSS and Vera both explicitly flag date‑boundary discontinuities (e.g., Oct 1 boundaries and Dataset I vs. Dataset III effects), so treat those as potential structural breaks [6] [1]. Normalize facility codes (Vera counts each unique facility code as distinct), standardize date/time formats, collapse multi‑value bond or status fields according to your analytic rules, and create a canonical “stay” record by merging contiguous stints, as sketched below [2] [1]. Maintain provenance columns so every row points back to the original release and row; corrections and republishing mean older files may be retained intentionally [6].
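The stay‑construction step can be expressed as a small grouping routine. A minimal sketch, assuming a person‑level key ("person_id"), parsed stint dates, and a one‑day gap tolerance that you should adjust to your own definition of a transfer versus a new stay:

```python
# A minimal sketch of collapsing contiguous stints into a single "stay" record.
# Column names and the gap tolerance are assumptions to adapt to your data.
import pandas as pd

def collapse_stints(stints: pd.DataFrame, max_gap_days: int = 1) -> pd.DataFrame:
    """Merge stints whose gap from the previous stint is <= max_gap_days."""
    stints = stints.sort_values(["person_id", "stint_start"]).copy()
    prev_end = stints.groupby("person_id")["stint_end"].shift()
    gap = (stints["stint_start"] - prev_end).dt.days
    # A new stay begins when there is no previous stint or the gap is too large.
    stints["stay_id"] = (gap.isna() | (gap > max_gap_days)).cumsum()
    stays = stints.groupby(["person_id", "stay_id"]).agg(
        stay_start=("stint_start", "min"),
        stay_end=("stint_end", "max"),
        facilities=("facility_code", list),  # preserve the transfer sequence
        n_stints=("stint_start", "size"),
    ).reset_index()
    return stays
```

Keeping the ordered facility list on each stay also gives you the facility‑sequence field that helps with cross‑release matching in the next step.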
5. Reconciliation techniques when identifiers differ
When FOIA/ICE identifiers are not consistent across years, available sources describe two complementary approaches. First, probabilistic matching on stable demographic and event fields (arrival/arrest date ranges, facility sequences, age, sex, nationality) can link records that likely belong to the same individual across releases; Vera’s work implies that stay‑level sequences and facility histories improve matching [1] [2]. Second, exploit releases that do contain linked identifiers to build mapping tables where overlapping periods exist; the Deportation Data Project notes that some releases include linked IDs within that release even if they are not matchable to later releases [3] [4].
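A simple starting point for the first approach is to block candidate pairs on coarse demographic fields and then score agreement on event‑level fields. This sketch is an illustration, not a documented crosswalk; the column names, weights, and thresholds are assumptions to validate against a hand‑checked sample:

```python
# A minimal sketch of one reconciliation pass: block on coarse fields, then
# score agreement on event-level fields. All names/thresholds are assumptions.
import pandas as pd

def candidate_matches(release_a: pd.DataFrame, release_b: pd.DataFrame) -> pd.DataFrame:
    """Cross-release candidates blocked on nationality/sex, scored on dates and facilities."""
    # Assumes dates are already parsed to datetime in both frames.
    blocked = release_a.merge(
        release_b, on=["citizenship_country", "sex"], suffixes=("_a", "_b")
    )
    date_diff = (blocked["book_in_date_a"] - blocked["book_in_date_b"]).abs().dt.days
    blocked["score"] = (
        (date_diff <= 2).astype(int)
        + (blocked["birth_year_a"] == blocked["birth_year_b"]).astype(int)
        + (blocked["first_facility_a"] == blocked["first_facility_b"]).astype(int)
    )
    return blocked[blocked["score"] >= 3]  # keep only strong candidates for review
```

Where releases overlap in time and one of them carries linked IDs, the same scored pairs can seed a mapping table that you then apply to the releases without IDs.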
6. Watch for reporting artifacts that look like real changes
Researchers have found that population changes near dataset cut‑dates can reflect compilation differences rather than actual custody shifts: Vera explicitly warns that differences across Oct 1 boundaries may reflect how ICE compiled different datasets, not true population changes [1]. Similarly, DHS/OHSS corrections to earlier published fields (e.g., book‑out arrest location) show that a row’s meaning can change when it is reprocessed [6].
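One cheap diagnostic is to compare the headcount immediately before and after known dataset boundaries and flag implausible jumps for manual review. A minimal sketch, reusing the daily headcount series from the earlier example; the 10% threshold is an assumption to tune:

```python
# A minimal sketch of a boundary-artifact check: flag large day-over-day jumps
# at dataset boundaries (e.g., Oct 1) that may reflect compilation changes
# rather than real custody shifts. The tolerance is an assumption.
import pandas as pd

def flag_boundary_jumps(headcount: pd.Series, boundaries: list[str], tol: float = 0.10) -> list[str]:
    flagged = []
    for b in boundaries:
        b = pd.Timestamp(b)
        before, after = headcount.get(b - pd.Timedelta(days=1)), headcount.get(b)
        if before and after and abs(after - before) / before > tol:
            flagged.append(f"{b.date()}: {before} -> {after} (possible dataset boundary effect)")
    return flagged

# Example: check fiscal-year boundaries in a daily headcount series.
# flag_boundary_jumps(headcount, ["2023-10-01", "2024-10-01"])
```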
7. Ethical, legal, and methodological guardrails
The Deportation Data Project emphasizes that many ICE releases are anonymized and intended for aggregate research but also that merges enabling “pathways” can reconstruct sensitive trajectories if done carelessly; treat any person‑level work with appropriate privacy safeguards and document anonymization limits [3] [4]. Also preserve source citations and versioning — OHSS explicitly maintains previously published reports “as originally approved” so you can show how figures evolved [6].
8. Where to go next: toolchain and reference materials
Follow Vera’s public repo and technical appendix for concrete cleaning scripts and data models used on a 2008–2025 span [2] [8]. Use the Deportation Data Project’s hosted FOIA releases as canonical inputs and the UWCHR code examples for concatenation and headcount routines [3] [9]. Consult DHS OHSS monthly tables for the official Persist Dataset definitions and to flag where agency corrections may require reprocessing [7] [6].
Limitations: available sources do not document field‑level mapping tables between every pair of ICE releases and therefore do not supply a turnkey crosswalk for IDs; researchers must build reconciliation logic themselves (not found in current reporting).