How can I clean and merge ICE detention datasets across years to track individuals or statuses?

Checked on December 4, 2025

Executive summary

Merging ICE detention datasets across years is feasible but requires careful handling of changing identifiers, dataset boundaries, and processing quirks. Vera’s ICE Detention Trends team documents linked detention histories from 2008 to mid-2025 but warns that dataset boundaries and differing record structures can create false discontinuities [1] [2]. The Deportation Data Project and FOIA releases include linked identifiers in some releases that permit person-level tracing within a release, but those identifiers often differ between releases, blocking straightforward merges without a reconciliation step [3] [4].

1. Know your sources and their limits: federal, FOIA, and third‑party feeds

ICE’s own public dashboards and statistics are authoritative for recent administrative counts but are described as “fluid” until the fiscal year locks, and they can change in format or definition over time [5]. The DHS OHSS Persist Datasets and monthly tables are constructed from ICE’s Enforcement Integrated Database and note corrections to historical processing errors (for example, FY2013–14 book-out location corrections published in January 2025), so researchers must treat older dumps and later corrections as distinct source editions [6] [7].

2. Identifier mismatch is the central practical problem

Some historical ICE releases obtained via FOIA include unique identifiers that link arrest, detainer, and detention tables within those releases, enabling trace-through analyses “anonymously” inside that release [3]. Those same releases, however, use different identifier schemes across separate ICE data dumps (for example, earlier ACLU/FOIA tables vs. the most recent releases), meaning a person tracked in one release won’t automatically match a record in another without an explicit reconciliation strategy [3] [1].
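An illustrative pandas sketch of this mismatch, with made-up identifiers and column names (not the actual FOIA schemas): a key join works inside one release but silently matches nothing against a later release keyed under a different scheme.

```python
import pandas as pd

# Hypothetical within-release tables: this release's own "anon_id"
# links its detention and detainer tables to each other.
detentions = pd.DataFrame({
    "anon_id": ["A1", "A2"],
    "book_in": ["2019-03-01", "2019-04-15"],
})
detainers = pd.DataFrame({
    "anon_id": ["A1", "A2"],
    "detainer_date": ["2019-02-20", "2019-04-01"],
})

# Within one release, a plain key join works.
linked = detentions.merge(detainers, on="anon_id", how="left")
assert linked["detainer_date"].notna().all()

# A later release re-keys the same people under a different scheme,
# so the same join silently links nothing across releases.
later_release = pd.DataFrame({
    "anon_id": ["X901", "X902"],
    "removal_date": ["2020-01-10", "2020-02-05"],
})
cross = detentions.merge(later_release, on="anon_id", how="left")
assert cross["removal_date"].isna().all()  # no cross-release matches
```

The silent failure is the danger: the merge raises no error, it just produces empty joins, which is why an explicit reconciliation step is unavoidable.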

3. Use published projects and code as blueprints

Vera’s Detention Trends dashboard and technical appendix show practical solutions: treating detention histories as one or more stints per person, concatenating stints, and handling transfers through their treatment of facility codes; their repo and appendix describe how they joined detention and bond data, and note that merged fields can contain multiple values you must collapse or select among [1] [2] [8]. The Deportation Data Project and UWCHR code repositories also provide cleaning pipelines (concatenation, parsing, headcount calculation) that you can adapt rather than reinventing the wheel [3] [9].

4. Practical steps to clean and harmonize across years

Start by documenting each file’s variable definitions and date ranges; OHSS and Vera both explicitly flag date-boundary discontinuities (e.g., Oct 1 boundaries and Dataset I vs. Dataset III effects), so treat those as potential structural breaks [6] [1]. Normalize facility codes (Vera counts each unique facility code as distinct), standardize date/time formats, collapse multi-value bond or status fields per your analytic rules, and create a canonical “stay” record by merging contiguous stints [2] [1]. Maintain provenance columns so every row points back to its original release and row; corrections and republishing mean older files may be retained intentionally [6].
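The stay-construction step can be sketched in pandas. All column names here (anon_id, book_in, book_out, facility, src) are hypothetical, not Vera’s actual schema; the rule that a same-day transfer continues a stay is a simplified stand-in for Vera’s stint logic.

```python
import pandas as pd

# Hypothetical stint-level rows from one release; "src" is a provenance
# column pointing back to the originating file and row number.
stints = pd.DataFrame({
    "anon_id": ["A1", "A1", "A1", "A2"],
    "facility": ["FAC1", "FAC2", "FAC1", "FAC3"],
    "book_in": pd.to_datetime(
        ["2019-03-01", "2019-03-10", "2019-06-01", "2019-04-15"]),
    "book_out": pd.to_datetime(
        ["2019-03-10", "2019-03-20", "2019-06-05", "2019-05-01"]),
    "src": ["r2019.csv:12", "r2019.csv:13", "r2019.csv:41", "r2019.csv:77"],
}).sort_values(["anon_id", "book_in"])

# A stint starts a new "stay" unless it begins the day the person's
# previous stint ended (treated here as a same-day transfer).
prev_out = stints.groupby("anon_id")["book_out"].shift()
stints["stay_id"] = (stints["book_in"] != prev_out).cumsum()

# Collapse contiguous stints into canonical stay records, keeping the
# facility sequence and the provenance trail for every merged row.
stays = stints.groupby(["anon_id", "stay_id"]).agg(
    book_in=("book_in", "min"),
    book_out=("book_out", "max"),
    facilities=("facility", list),
    src=("src", ";".join),
).reset_index()
```

Note that each person’s first stint compares its book_in against NaT, which is never equal to any date, so it always opens a new stay; the concatenated src field preserves the row-level provenance the section above recommends.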

5. Reconciliation techniques when identifiers differ

When FOIA/ICE identifiers aren’t consistent across years, available sources describe two complementary approaches. First, probabilistic matching on stable demographic and event fields (arrival/arrest date ranges, facility sequences, age, sex, nationality) can link records that likely belong to the same individual across releases; Vera’s work implies using stay-level sequences and facility history to improve match quality [1] [2]. Second, exploit releases that do contain linked identifiers to build mapping tables where their time periods overlap; the Deportation Data Project notes some releases include linked IDs within that release even if they are not matchable to later releases [3] [4].
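A toy sketch of the first approach, blocking plus scoring, in pandas. The fields, the 3-day date tolerance, the equal weighting, and the 0.8 acceptance threshold are all illustrative assumptions, not a published methodology; production linkage would use a dedicated framework and validated weights.

```python
import pandas as pd

# Hypothetical person-level summaries from two releases that share
# no identifier scheme. Field names are illustrative.
rel_a = pd.DataFrame({
    "id_a": ["A1", "A2"],
    "sex": ["M", "F"],
    "nationality": ["HND", "GTM"],
    "first_book_in": pd.to_datetime(["2019-03-01", "2019-04-15"]),
    "facilities": [("FAC1", "FAC2"), ("FAC3",)],
})
rel_b = pd.DataFrame({
    "id_b": ["X901", "X902"],
    "sex": ["M", "M"],
    "nationality": ["HND", "MEX"],
    "first_book_in": pd.to_datetime(["2019-03-02", "2019-07-01"]),
    "facilities": [("FAC1", "FAC2"), ("FAC9",)],
})

# Block on coarse demographics to limit candidate pairs, then score.
cand = rel_a.merge(rel_b, on=["sex", "nationality"], suffixes=("_a", "_b"))

def score(row):
    # Date proximity: full credit within an assumed 3-day tolerance.
    date_ok = abs((row["first_book_in_a"] - row["first_book_in_b"]).days) <= 3
    # Facility-history overlap (Jaccard), echoing Vera's idea of using
    # stay/facility sequences to strengthen matches.
    fa, fb = set(row["facilities_a"]), set(row["facilities_b"])
    jacc = len(fa & fb) / len(fa | fb)
    return 0.5 * date_ok + 0.5 * jacc

cand["score"] = cand.apply(score, axis=1)
matches = cand[cand["score"] >= 0.8][["id_a", "id_b", "score"]]
```

Any matches produced this way should be treated as probable links, with the score threshold and field weights documented alongside the results.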

6. Watch for reporting artifacts that look like real changes

Researchers have found that population changes near dataset cut‑dates can reflect compilation differences rather than actual custody shifts — Vera explicitly warns differences across Oct 1 boundaries may reflect how ICE compiled different datasets, not true population changes [1]. Similarly, DHS/OHSS corrections to earlier published fields (e.g., book‑out arrest location) show a row can change meaning when reprocessed [6].
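A minimal sketch of a boundary check, assuming an illustrative daily headcount table; the 1,000-person threshold is an arbitrary placeholder you would tune, and the point is only to flag cut-date jumps for manual review rather than treat them as real custody shifts.

```python
import pandas as pd

# Hypothetical daily detained-population series spanning an
# Oct 1 fiscal-year boundary. Column names are illustrative.
daily = pd.DataFrame({
    "date": pd.to_datetime(
        ["2019-09-29", "2019-09-30", "2019-10-01", "2019-10-02"]),
    "population": [50000, 50100, 46200, 46150],
})
daily["change"] = daily["population"].diff()

# Flag a large day-over-day jump landing exactly on the Oct 1
# boundary as a suspected compilation artifact, not a real shift.
boundary = (daily["date"].dt.month == 10) & (daily["date"].dt.day == 1)
daily["suspect_break"] = boundary & (daily["change"].abs() > 1000)
```

Flagged rows would then be cross-checked against the dataset editions on either side of the boundary before any trend conclusions are drawn.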

7. Ethical, legal, and methodological guardrails

The Deportation Data Project emphasizes that many ICE releases are anonymized and intended for aggregate research but also that merges enabling “pathways” can reconstruct sensitive trajectories if done carelessly; treat any person‑level work with appropriate privacy safeguards and document anonymization limits [3] [4]. Also preserve source citations and versioning — OHSS explicitly maintains previously published reports “as originally approved” so you can show how figures evolved [6].

8. Where to go next: toolchain and reference materials

Follow Vera’s public repo and technical appendix for concrete cleaning scripts and data models used on a 2008–2025 span [2] [8]. Use the Deportation Data Project’s hosted FOIA releases as canonical inputs and the UWCHR code examples for concatenation and headcount routines [3] [9]. Consult DHS OHSS monthly tables for the official Persist Dataset definitions and to flag where agency corrections may require reprocessing [7] [6].

Limitations: available sources do not provide field-level mapping tables between every ICE release, and therefore do not supply a turnkey ID crosswalk; researchers must build reconciliation logic themselves (not found in current reporting).

Want to dive deeper?
What unique identifiers help reliably match ICE detainees across different years of datasets?
Which data cleaning steps remove duplicates and correct inconsistent fields in ICE detention records?
How can probabilistic record linkage be implemented to merge ICE datasets with limited identifiers?
What privacy and legal considerations apply when tracking individuals in ICE detention data?
Which open-source tools and workflows are best for longitudinal analysis of ICE detention statuses?