Rust Threads on the GPU
by PaulHoule on 4/9/2026, 7:20:10 PM
https://www.vectorware.com/blog/threads-on-gpu/
Comments
by: nynx
I don’t understand why this is a useful effort. It seems like a solution in search of a problem. It’s going to be incredibly easy to end up with hopelessly inefficient programs that need a full redesign in a normal GPU programming model to be useful.
4/14/2026, 3:15:43 AM
by: kevmo314
Isn't this turning a GPU into a slower CPU? It's not like CPUs are slow; in fact, they're quite a bit faster than any single GPU thread. If code is written in a GPU-unaware way, it's not going to take advantage of the reasons for being on the GPU in the first place.
4/14/2026, 4:34:18 AM
by: gpm
Is this proprietary, or something I can play around with? I can't find a repo.
4/14/2026, 5:07:26 AM
by: 20k
This programming model seems like the wrong one, and I think it's based on some faulty assumptions.

> Another advantage of this approach is that it prevents divergence by construction. Divergence occurs when lanes within a warp take different branches. Because thread::spawn() maps one closure to one warp, every lane in that warp runs the same code. There is no way to express divergent branching within a single std::thread, so divergence cannot occur

This is extremely problematic: being able to write divergent code between lanes is good. Virtually all high performance GPGPU code I've ever written contains divergent code paths!

> The worst case is that a workload only uses one lane per warp and the remaining lanes sit idle. But idle lanes are strictly better than divergent lanes: idle lanes waste capacity while divergent lanes serialize execution

This is where I think it falls apart a bit, and we need to dig into GPU architecture to find out why. A lot of people think that GPUs are a bunch of executing threads that are grouped into warps executing in lockstep. This is an overly restrictive model of how they work, and it misses a lot of the reality.

GPUs are a collection of threads that are broken up into local work groups. These share fast on-chip shared memory, which can be used for fast intra-work-group communication. Work groups are split up into subgroups - which map to warps - that can communicate extra fast.

This is the first problem with this model: it neglects the local work group execution unit. To get adequate performance, you have to set the work group size much higher than the size of a warp: at least 64 for a 32-wide warp, and in general 128-256 is a better size. Different warps in a local work group make true independent progress, so if you don't take this into account in Rust, you're in for a bad time and you'll run into races. To get good performance and cache management, these warps need to be executing the *same* code.
Trying to have a task-per-warp is a really bad move for performance.

> Each warp has its own program counter, its own register file, and can execute independently from other warps

The second problem: it used to be true that all threads in a warp executed in lockstep, with strict on/off masks for thread divergence, but this is no longer true for modern GPUs - the above is just wrong. On a modern GPU, each *thread* has its own program counter and call stack, and can independently make forward progress. Divergent threads can have better throughput than you'd expect on a modern GPU, as hardware gets more capable at handling this. Divergence isn't bad, it's just something you have to manage - and hardware architectures are rapidly improving here.

Say we have two warps, both running the same code, where half of each warp splits at a divergence point. Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non-divergent warps - and bam, divergence solved at the hardware level. But notice that to get this hardware acceleration, we need to actually use the GPU programming model to its fullest.

The key mistake is to assume that the warp model is always going to stick rigidly to being strictly wide SIMD units with a funny programming model, but we already ditched that concept a while back on GPUs, around the Volta era. As time goes on, this model will only increasingly diverge from how GPUs actually work under the hood, which seems like an error. Right now, even with just the local work group problems, I'd guess you're leaving ~50% of your performance on the table, which seems like a bit of a problem when the entire reason to use a GPU is performance!
4/14/2026, 5:58:28 AM