Hacker News Viewer

UK Biobank health data keeps ending up on GitHub

by Cynddl on 4/23/2026, 1:58:03 PM

I'm a researcher studying privacy, and I started tracking the DMCA notices that UK Biobank sends to GitHub. I have tracked 110 notices filed so far, targeting 197 code repositories by 170 developers across the world.

The exposure of Biobank data on GitHub is the latest in a long series of governance challenges for UK Biobank. (My colleague and I have an editorial in the BMJ about this: http://bmj.com/cgi/content/full/bmj.s660?ijkey=dEot4dJZGZGXeG1&keytype=ref). The most recent incident came today, with information on all half a million members listed for sale on Alibaba.

Looking at the takedown notices, we often see specific files being targeted rather than entire repositories. This is possibly to justify the copyright infringement claim that a takedown notice requires, though I'm not a copyright expert. It is clear, however, that they use DMCA notices only as a last resort, for GitHub users they cannot identify and who were likely not given access in the first place. A quarter of the files are genetic/genomic data. Tabular data account for another large share and could contain phenotype or health records.

https://biobank.rocher.lc

Comments

by: michaelt

> It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further.

To me it seems rather naive to have done that.

After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.

If you want to ensure compliance before a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.

Too late to do anything about it now though :(

4/23/2026, 9:06:32 PM


by: captn3m0

Took me 5 minutes to find more: https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a653f6025506c984ee91d528525aded2022f2/Disease_Longevity_UKBB.ipynb#L72 (Uses Date of Birth column).

And some information on how they were distributing it to researchers: https://github.com/broadinstitute/ml4h/blob/master/ingest/ukbb_csv_bigquery/README.md

> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.

> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.

https://biobank.ctsu.ox.ac.uk/crystal/download.cgi
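
The quoted README warns that the unpack tools may fail silently when disk space runs out. A minimal sketch of a pre-flight check that could guard against that, assuming only the Python standard library: the function name `has_room_to_unpack` and the `expansion_factor` default are hypothetical and not taken from the README or from UK Biobank's tooling, and the sketch does not invoke `ukbunpack` or `ukbconv` itself.

```python
import os
import shutil


def has_room_to_unpack(enc_path, work_dir, expansion_factor=3.0):
    """Return True if work_dir has enough free space to unpack enc_path.

    expansion_factor is an assumed safety margin (how many times larger
    than the encrypted file the unpacked output might be); it is a
    placeholder to tune, not a documented figure.
    """
    needed_bytes = os.path.getsize(enc_path) * expansion_factor
    free_bytes = shutil.disk_usage(work_dir).free
    return free_bytes >= needed_bytes
```

A wrapper script could call this before each decryption step and stop with an explicit error, rather than relying on the tools to report a full disk.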

4/23/2026, 10:03:35 PM


by: mil22

The irony is, they don’t even provide the data to the participants themselves.

4/23/2026, 10:52:07 PM


by: John7878781

What are the pros/cons of just open-sourcing everything for future biobank projects?

4/23/2026, 9:10:18 PM