What public datasets and studies provide reliable pay-disparity figures for race and gender in software engineering?
Executive summary
Reliable sources for pay-disparity figures in software engineering include the Stack Overflow Annual Developer Survey, whose public data academic work has begun to use to measure cross-country gender gaps, and large salary-survey providers such as Payscale and Hired, whose reports are widely cited [1] [2] [3]. Major aggregator reports (Payscale, Dice, Glassdoor, Hired) publish occupation-level or industry slices along with both controlled and uncontrolled gap estimates that are useful for race and gender analysis [3] [4] [5] [6].
1. What data sources researchers actually use — public, private, and hybrid
Scholars and analysts rely on a mix of sources: academic studies often reuse large public developer surveys (for example, Stack Overflow's annual Developer Survey, which at least one cross-country pay-gap paper uses explicitly), while industry reports from Payscale, Hired, Glassdoor, Dice and similar firms supply large samples and controlled-gap estimates that researchers and journalists cite [1] [2] [3] [5] [4]. Each source has tradeoffs: Stack Overflow offers developer-specific signals and is open for academic reuse [1] [2], while Payscale and Glassdoor provide controlled estimates and occupational breakdowns but rely on employee-reported or platform-collected data with their own sampling biases [3] [5].
2. Strengths: developer‑focused depth and controlled analyses
Stack Overflow’s survey is sector‑targeted and includes salary, role, experience and demographic fields that let academics test within‑field variation and run cross‑country comparisons — precisely what Prakash & Yadav used in their cross‑country gender pay study for software developers [1] [2]. Payscale’s Gender Pay Gap Report provides "controlled" estimates that adjust for job, experience and other compensable factors, which helps separate compositional effects from within‑job pay differentials [3].
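The distinction between uncontrolled and controlled gaps can be made concrete with a toy regression. The data below are fabricated for illustration (not drawn from any of the cited reports); the "controlled" figure is the coefficient on a gender indicator after adjusting for seniority, a minimal sketch that mirrors the spirit of a controlled estimate without reproducing Payscale's actual method.

```python
import numpy as np

# Fabricated toy data: columns are (senior role, woman, salary in $k).
# Pay here depends only on seniority, but more of the women are junior.
rows = [
    (0, 0, 100), (0, 0, 100), (1, 0, 120), (1, 0, 120),   # four men
    (0, 1, 100), (0, 1, 100), (0, 1, 100), (1, 1, 120),   # four women
]
senior = np.array([r[0] for r in rows], dtype=float)
female = np.array([r[1] for r in rows], dtype=float)
salary = np.array([r[2] for r in rows], dtype=float)

# Uncontrolled gap: raw difference in mean pay, driven here purely by
# role composition.
uncontrolled = salary[female == 0].mean() - salary[female == 1].mean()

# Controlled gap: OLS coefficient on the `female` indicator after
# adjusting for seniority.
X = np.column_stack([np.ones_like(salary), senior, female])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
controlled = -coef[2]  # sign flipped so a positive number favors men

print(f"uncontrolled gap: {uncontrolled:.1f}k")      # prints 5.0k
# abs() only guards against a "-0.0" display of the near-zero coefficient
print(f"controlled gap:   {abs(controlled):.1f}k")   # prints 0.0k
```

In this toy case the entire raw gap is compositional, so the controlled gap vanishes; real survey data would show some mix of compositional and within-job components.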
3. Weaknesses and sampling caveats you must expect
Platform surveys and employer-reported datasets skew toward whoever chooses to participate: tech job platforms overrepresent active job seekers or platform users, and salaries self-reported to Glassdoor or Payscale can under- or over-represent certain firms or regions [5] [3]. Academic reuse of Stack Overflow data is powerful but limited by self-selection (developers who use Stack Overflow and choose to answer the salary items), and many large reports do not fully disclose their raw sampling frames or race/ethnicity classifications [2] [1].
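One standard mitigation for such participation skew is post-stratification: reweight the self-selected sample so that known strata match an external benchmark such as Census shares. A minimal sketch with fabricated numbers (the shares and salaries below are invented, and region stands in for any stratum, such as race or gender by occupation):

```python
# Fabricated example: a survey oversamples US respondents (70% of the
# sample) relative to a hypothetical benchmark population (50%).
sample = {"US": (0.7, 130.0), "EU": (0.3, 80.0)}   # (sample share, mean salary $k)
benchmark_share = {"US": 0.5, "EU": 0.5}           # assumed population shares

# The naive estimate inherits the sampling skew; the reweighted estimate
# swaps the sample shares for the benchmark shares.
naive = sum(share * mean for share, mean in sample.values())
reweighted = sum(benchmark_share[region] * mean
                 for region, (_, mean) in sample.items())
print(naive, reweighted)  # 115.0 105.0
```

Reweighting only corrects for strata you can observe and benchmark; it cannot fix selection on unobserved traits, which is why the self-selection caveat above remains even for weighted platform reports.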
4. Race and intersectional analysis: where the evidence is strongest and where it’s thin
Multiple industry analyses and news reports find that the gender gap is larger for Black and Hispanic women and that pay gaps vary sharply by race and state: Bankrate and other analysts using Census Bureau data show groups such as Black and Hispanic women earning markedly less relative to white men [7] [8]. Hired and Statista compilations have long documented pay rankings by race and ethnicity in tech (Asian men, white men, Hispanic men, Asian women in some samples), but these are platform-specific snapshots and often lack consistent cross-firm weighting [9] [6].
5. How to combine sources to get the most reliable picture
Triangulate: use Stack Overflow for developer‑specific breakdowns and international comparisons, Payscale for controlled job‑level gaps in the U.S., and platform reports (Hired, Glassdoor, Dice) to validate trends and market‑level magnitudes. Academics explicitly cite Stack Overflow when doing cross‑country work; Payscale’s GPGR supplies controlled/uncontrolled comparisons that clarify how much of a gap is compositional versus within‑job [1] [2] [3].
6. Practical next steps for researchers or HR professionals
If you want reproducible analysis, start with Stack Overflow’s raw survey for developer populations (used by peer‑reviewed work) and supplement with Payscale’s GPGR to run controlled comparisons; cross‑check with Hired/Glassdoor/Dice for employer/market coverage and with Census Bureau analyses for population‑level race/gender baselines [1] [2] [3] [7]. Where race is crucial, prefer sources that publish race/ethnicity breakdowns and document methodology [6] [7].
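As a starting point, the reproducible pipeline described above might look as follows. The column names (Gender, DevType, ConvertedCompYearly) match recent Stack Overflow survey schemas but should be verified against the specific release you download; the inline CSV here is a fabricated stand-in for the real export.

```python
import io
import pandas as pd

# Stand-in for the downloaded survey CSV; real releases have the same
# general shape but far more columns and respondents.
csv = io.StringIO("""Gender,DevType,ConvertedCompYearly
Man,Back-end developer,120000
Woman,Back-end developer,110000
Man,Front-end developer,100000
Woman,Front-end developer,
""")
df = pd.read_csv(csv)

# Drop respondents with missing salary, then compare medians within role
# to limit compositional effects (a crude analogue of a "controlled"
# comparison; a regression with more covariates would be the next step).
clean = df.dropna(subset=["ConvertedCompYearly"])
medians = clean.groupby(["DevType", "Gender"])["ConvertedCompYearly"].median()
print(medians)
```

Note that rows with missing salary simply vanish from the comparison, which is itself a selection effect worth reporting alongside any gap estimate.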
7. Limitations and competing viewpoints
Academic reuse of Stack Overflow yields high relevance to developers but is vulnerable to self-selection and international comparability issues [1] [2]. Payscale reports argue that Technology, Engineering & Science can "achieve pay equity" when controlled, a finding that contrasts with many news narratives emphasizing persistent racialized gaps, so treat "controlled" parity claims as conditional on the covariates used [3] [7]. Available sources do not mention a single, definitive public dataset that measures race-and-gender pay disparities for software engineers without important sampling or methodological caveats.