UK Biobank health data keeps ending up on GitHub

by Cynddl on 4/23/2026, 1:58:03 PM

I'm a researcher studying privacy, and I started tracking the DMCA notices that UK Biobank sends to GitHub. So far I have tracked 110 notices, targeting 197 code repositories by 170 developers across the world.

The exposure of Biobank data on GitHub is the latest in a long series of governance challenges for UK Biobank. (My colleague and I have an editorial in the BMJ about this: <a href="http://bmj.com/cgi/content/full/bmj.s660?ijkey=dEot4dJZGZGXeG1&keytype=ref" rel="nofollow">http://bmj.com/cgi/content/full/bmj.s660?ijkey=dEot4dJZGZGXe...</a>). The most recent came today, with information on all half a million members listed for sale on Alibaba.

Looking at the takedown notices, we often see specific files being targeted rather than entire repositories. This is possibly to justify the copyright infringement claim that a takedown notice requires, though I am not a copyright expert. It is clear, however, that they only use DMCA notices as a last resort, for GitHub users they cannot identify and who were likely not given access in the first place. A quarter of the files are genetic/genomic data. Tabular data account for another large share and could contain phenotype or health records.

https://biobank.rocher.lc

Comments

by: michaelt

> It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further.

To me it seems rather naive to have done that.

After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.

If you want to ensure compliance before a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.

Too late to do anything about it now though :(

4/23/2026, 9:06:32 PM

by: captn3m0

Took me 5 minutes to find more: <a href="https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a653f6025506c984ee91d528525aded2022f2/Disease_Longevity_UKBB.ipynb#L72" rel="nofollow">https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a65...</a> (uses the Date of Birth column).

And some information on how they were distributing it to researchers: <a href="https://github.com/broadinstitute/ml4h/blob/master/ingest/ukbb_csv_bigquery/README.md" rel="nofollow">https://github.com/broadinstitute/ml4h/blob/master/ingest/uk...</a>

> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.

> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.

<a href="https://biobank.ctsu.ox.ac.uk/crystal/download.cgi" rel="nofollow">https://biobank.ctsu.ox.ac.uk/crystal/download.cgi</a>
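
The quoted README warns that the unpack tools may fail silently when disk space runs out. One way to guard against that is to check free space before starting. The following is only a minimal Python sketch: the function name `has_room_to_unpack` and the default headroom factor of 3 are hypothetical and do not come from UK Biobank documentation or the ml4h repository, and the sketch deliberately does not invoke the unpack tools themselves.

```python
import os
import shutil


def has_room_to_unpack(encrypted_path, work_dir, headroom_factor=3.0):
    """Return True if work_dir has enough free space to unpack safely.

    "Enough" is defined here as headroom_factor times the size of the
    encrypted file. The factor is an illustrative guess, not a documented
    requirement of the UK Biobank tools.
    """
    needed_bytes = os.path.getsize(encrypted_path) * headroom_factor
    free_bytes = shutil.disk_usage(work_dir).free
    return free_bytes >= needed_bytes
```

A wrapper script could call this before each decrypt step and stop with an explicit error when it returns False, instead of relying on the tools to report the failure.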

4/23/2026, 10:03:35 PM

by: mil22

The irony is, they don’t even provide the data to the participants themselves.

4/23/2026, 10:52:07 PM

by: John7878781

What are the pros/cons of just open-sourcing everything for future biobank projects?

4/23/2026, 9:10:18 PM