UK Biobank health data keeps ending up on GitHub
by Cynddl on 4/23/2026, 1:58:03 PM
I'm a researcher studying privacy, and I started tracking the DMCA notices that UK Biobank sends to GitHub. So far I have tracked 110 notices, targeting 197 code repositories by 170 developers across the world.<p>The exposure of Biobank data on GitHub is the latest in a long series of governance challenges for UK Biobank. (My colleague and I have an editorial in the BMJ about this: <a href="http://bmj.com/cgi/content/full/bmj.s660?ijkey=dEot4dJZGZGXeG1&keytype=ref" rel="nofollow">http://bmj.com/cgi/content/full/bmj.s660?ijkey=dEot4dJZGZGXe...</a>). The most recent came today, with information on all half a million members listed for sale on Alibaba.<p>Looking at the takedown notices, we often see specific files being targeted rather than entire repositories (possibly to justify the copyright infringement claim that a takedown notice requires; I'm not a copyright expert). It is clear, though, that they only use DMCA notices as a last resort, for GitHub users they cannot identify and who were likely never given access in the first place. A quarter of the targeted files contain genetic or genomic data. Tabular data account for another large share and could contain phenotype or health records.
Comments
by: michaelt
<i>> It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further.</i><p>To me it seems rather naive to have done that.<p>After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.<p>If you want to ensure compliance <i>before</i> a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.<p>Too late to do anything about it now though :(
4/23/2026, 9:06:32 PM
by: captn3m0
Took me 5 minutes to find more: <a href="https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a653f6025506c984ee91d528525aded2022f2/Disease_Longevity_UKBB.ipynb#L72" rel="nofollow">https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a65...</a> (Uses Date of Birth column).<p>And some information on how they were distributing it to researchers: <a href="https://github.com/broadinstitute/ml4h/blob/master/ingest/ukbb_csv_bigquery/README.md" rel="nofollow">https://github.com/broadinstitute/ml4h/blob/master/ingest/uk...</a><p>> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.<p>> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.<p><a href="https://biobank.ctsu.ox.ac.uk/crystal/download.cgi" rel="nofollow">https://biobank.ctsu.ox.ac.uk/crystal/download.cgi</a>
4/23/2026, 10:03:35 PM
by: mil22
The irony is, they don’t even provide the data to the participants themselves.
4/23/2026, 10:52:07 PM
by: John7878781
What are the pros/cons of just making all the data open for future biobank projects?
4/23/2026, 9:10:18 PM