We collected 10k hours of neuro-language data in our basement

by nee1r on 12/8/2025, 5:33:13 PM

Comments

by: n7ck

Hey I'm Nick, and I originally came to Conduit as a data participant! After my session, I started asking questions about the setup to the people working there, and apparently I asked good questions, so they hired me. Since I joined, we've gone from <1k hours to >10k hours, and I've been really excited by how much our whole setup has changed. I've been implementing lots of improvements to the whole data pipeline and the operations side. Now that we train lots of models on the data, the model results also inform how we collect data (e.g. we care a lot less about noise now that we have more data). We're definitely still improving the whole system, but at this point, we've learned a lot that I wish someone had told us when we started, so we thought we'd share it in case any of you are doing human data collection. We're all also very curious to get any feedback from the community!

12/8/2025, 5:50:21 PM

by: asgraham

Really cool dataset! Love seeing people actually doing the hard work of generating data rather than just trying to analyze what exists (I say this as someone who’s gone out of his way to avoid data collection). Have you played at all with thought-to-voice? Intuitively I’d think EEG readout would be more reliable for spoken rather than typed words, especially if you’re not controlling for keyboard fluency.

12/8/2025, 8:06:14 PM

by: in-silico

It's interesting that the model generalizes to unseen participants. I was under the impression that everyone's brain patterns were different enough that the model would need to be retrained for new users. Though, I suppose if the model had LLM-like context where it kept track of brain data and speech/typing from earlier in the conversation then it could perform in-context learning to adapt to the user.

12/8/2025, 7:47:34 PM

by: whatshisface

What's the plan for after this mind reading helmet works reliably?

12/8/2025, 8:29:36 PM

by: titzer

I lol'd at the hardware "patch" that kept the software from crashing--removing all but the alpha-numeric keys (!?). Holy cow, you had time to collect thousands of hours of neurotraces but couldn't sanitize the inputs to remove a stray [? That sounds...funky.

12/8/2025, 7:21:10 PM

by: ag8

This is a cool setup, but naively it feels like it would require hundreds of thousands of hours of data to train a decent generalizable model that would be useful for consumers. Are there plans to scale this up, or is there reason to believe that tens of thousands of hours are enough?

12/8/2025, 6:02:30 PM

by: richardfeynman

This is an interesting dataset to collect, and I wonder whether there will be applications for it beyond what you're currently thinking. A couple of questions: What's the relationship between the number of hours of neurodata you collect and the quality of your predictions? Does it help to get less data from more people, or more data from fewer people?

12/8/2025, 6:26:39 PM

by: devanshp

Cool post! I'm somewhat curious whether the data quality scoring has actually translated into better data; do you have numbers on how much more of your data is useful for training vs in May?

12/8/2025, 6:46:21 PM

by: Gormisdomai

The example sentences generated “only from neural data” at the top of this article seem surprisingly accurate to me, like, not exact matches but much better than what I would expect even from 10k hours: “the room seemed colder” -> “there was a breeze even a gentle gust”

12/8/2025, 5:56:50 PM

by: wiwillia

Really interested in how accuracy improves with the scale of the data set. Non-invasive thought-to-action would be a whole new interaction paradigm.

12/8/2025, 6:27:52 PM

by: rajlego

Did you consider trying to collect data in a much poorer country that still has high quality English? e.g. the Philippines

12/8/2025, 6:54:21 PM

by: mishajw

Interesting dataset! I'm curious what kind of results you would get with just EEG, compared to multiple modalities? Why do multiple modalities end up being important?

12/8/2025, 5:49:22 PM

by: ArjunPanicksser

Makes sense that CL ends up being the best for recruiting first-time participants. Curious what other things you tried for recruitment and how useful they were?

12/8/2025, 5:47:20 PM

by: estitesc

Loved watching this unfold in our basement. : )

12/8/2025, 7:03:05 PM

by: g413n

what's the basis for converting hours of neural data to a number of tokens? is that counting the paired text tokens?

12/8/2025, 5:49:40 PM

by: dang

[under-the-rug stub] [see https://news.ycombinator.com/item?id=45988611 for explanation]

12/8/2025, 7:13:57 PM
