Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data (2024)
by tosh on 5/23/2026, 12:11:47 PM
https://www.thonking.ai/p/strangely-matrix-multiplications
Comments
by: dan_sbl
> For example, when the GPU is fully idle, nvidia-smi tells me that it’s only pulling 88W of power.
I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?
5/27/2026, 3:14:28 PM
by: ggambetta
I'd have guessed multiply-by-0 and multiply-by-1 can be special-cased to run much faster and simpler code paths, like you'd do when writing MUL for a processor that doesn't have it (I <3 z80)
5/27/2026, 5:04:06 PM
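To illustrate the software-multiply idea in the comment above, here is a minimal sketch of a shift-and-add multiply with early exits for 0 and 1. The function name `software_mul` and the step counter are hypothetical, added only for illustration. This shows what a hand-written MUL routine on a CPU without a multiply instruction might do. It is not a claim about how GPU matmul hardware behaves, and another comment in this thread attributes the article's effect to power draw rather than special-cased code paths.

```python
def software_mul(a: int, b: int) -> tuple[int, int]:
    """Multiply two non-negative ints by shift-and-add.

    Returns (product, loop_iterations) so the cost of each path is visible.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative operands")

    # Fast paths: no loop iterations are needed for 0 or 1.
    if a == 0 or b == 0:
        return 0, 0
    if a == 1:
        return b, 0
    if b == 1:
        return a, 0

    # General path: one iteration per bit of b.
    product = 0
    steps = 0
    while b:
        if b & 1:          # lowest bit of b set: add the shifted a
            product += a
        a <<= 1            # a * 2
        b >>= 1            # move to the next bit of b
        steps += 1
    return product, steps
```

With this sketch, `software_mul(0, 123)` and `software_mul(1, 77)` return without entering the loop, while `software_mul(13, 11)` walks all four bits of 11.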
by: nzach
I went in expecting to find 'branch prediction'[0] as the answer, but apparently things are even more complex nowadays.
[0] - https://stackoverflow.com/questions/11227809/why-is-conditional-processing-of-a-sorted-array-faster-than-of-an-unsorted-array
5/27/2026, 2:27:29 PM
by: jetsamflotsam
I feel like many of the comments missed the point or didn't read the article. What I believe this article is stating (and I've read this many times during my PhD for various reasons) is that the input data distribution affects how many transistor state changes occur during multiplication. Since these switching events account for a large portion of energy loss and heat generation, the clocks won't be throttled as much for certain data patterns.
I believe there was a workshop paper from SC24 that ran more experiments on this, but I can't find it now.
5/27/2026, 4:20:50 PM
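As a rough illustration of the point in the comment above, the sketch below counts how many bits change between consecutive float32 values in a stream. This is an assumed, simplified proxy for switching activity, not a model of any real GPU datapath or of its power draw. The helper names `float32_bits` and `mean_bit_flips` are hypothetical.

```python
import random
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def mean_bit_flips(values) -> float:
    """Average number of bits that differ between consecutive values."""
    bits = [float32_bits(v) for v in values]
    flips = [bin(a ^ b).count("1") for a, b in zip(bits, bits[1:])]
    return sum(flips) / len(flips)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

random_data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # "unpredictable" inputs
zeros = [0.0] * n                                       # "predictable" inputs
ones = [1.0] * n

print("random:", mean_bit_flips(random_data))
print("zeros: ", mean_bit_flips(zeros))
print("ones:  ", mean_bit_flips(ones))
```

Constant streams toggle no bits at all between consecutive values, while normally distributed values toggle many bits on average. Under the assumption that fewer toggles means less dynamic power, this matches the direction of the effect described in the comment.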
by: jayd16
I can't tell from the blog: is this actually verified, or is it a theory plus numbers showing plausibility?
I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.
5/27/2026, 2:56:53 PM
by: amelius
Sounds like a side channel attack waiting to happen.
5/27/2026, 3:17:47 PM
by: gdevenyi
People have been noticing the effects of this in local LLM inference. Power limiting seems to improve overall performance!
5/27/2026, 12:53:42 PM
by: bitwize
It wouldn't surprise me to see some ML algorithm in silico somewhere to select faster matmul paths on favorable data. Yo dawg, I heard you like AI, so we put some AI in your AI so you can infer while you're inferring.
5/27/2026, 3:20:20 PM
by: evanjrowley
Designing for predictable execution flow is one of the advantages of Tenstorrent hardware.
https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrrent-processors.gmi
https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-tensix-programming-model.gmi
https://arxiv.org/html/2604.03279
5/27/2026, 2:46:17 PM