Hacker News Viewer

Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data (2024)

by tosh on 5/23/2026, 12:11:47 PM

https://www.thonking.ai/p/strangely-matrix-multiplications

Comments

by: dan_sbl

> For example, when the GPU is fully idle, nvidia-smi tells me that it’s only pulling 88W of power.

I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?

5/27/2026, 3:14:28 PM


by: ggambetta

I'd have guessed multiply-by-0 and multiply-by-1 can be special-cased to run much faster and simpler code paths, like you'd do when writing MUL for a processor that doesn't have it (I <3 z80)

5/27/2026, 5:04:06 PM
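
A minimal sketch of the software-multiply idea ggambetta describes above, assuming non-negative integer operands. The function name `software_mul` is hypothetical, and this illustrates a shift-and-add routine for a CPU without a MUL instruction, not how GPU matrix units work.

```python
def software_mul(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative operands")

    # Fast paths: multiplying by 0 or 1 needs no loop at all.
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a

    # General path: classic shift-and-add, one iteration per bit of b.
    result = 0
    while b:
        if b & 1:          # lowest bit of b set: add the shifted a
            result += a
        a <<= 1            # a doubles for the next bit position
        b >>= 1            # move on to the next bit of b
    return result


print(software_mul(13, 11))
print(software_mul(0, 12345))
print(software_mul(1, 12345))
```

The fast paths skip the loop entirely, which is where the speedup would come from on a processor that multiplies in software.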


by: nzach

I went in expecting to find 'branch prediction'[0] as the answer, but apparently things are even more complex nowadays.

[0] - https://stackoverflow.com/questions/11227809/why-is-conditional-processing-of-a-sorted-array-faster-than-of-an-unsorted-array

5/27/2026, 2:27:29 PM


by: jetsamflotsam

I feel like many of the comments missed the point or didn't read the article. What I believe this article is stating (and I've read this many times during my PhD for various reasons), is that the input data distributions affect how many transistor state changes there are during multiplication. Since these events are a large portion of energy loss/heat generation, the clocks won't be throttled as much for certain data patterns.

There was a workshop paper from SC24 that did more experiments around this I believe. I can't find it now though.

5/27/2026, 4:20:50 PM
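
A rough way to see the data dependence jetsamflotsam describes above is to count how many bits change between consecutive operands. The sketch below is only a proxy: it assumes float32 encoding and treats the Hamming distance between successive bit patterns as a stand-in for switching activity. The helper names `float32_bits` and `mean_bit_flips` are hypothetical, and nothing here measures real GPU power.

```python
import random
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def mean_bit_flips(values) -> float:
    """Average Hamming distance between consecutive float32 bit patterns."""
    bits = [float32_bits(v) for v in values]
    flips = [bin(a ^ b).count("1") for a, b in zip(bits, bits[1:])]
    return sum(flips) / len(flips)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

random_data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # "unpredictable" input
constant_data = [1.0] * n                              # "predictable" input

print("random  :", mean_bit_flips(random_data))
print("constant:", mean_bit_flips(constant_data))
```

Constant input produces no bit changes at all between consecutive values, while normally distributed input changes a large share of the 32 bits on each step. Under the comment's reasoning, fewer changes would mean less dynamic power and less clock throttling.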


by: jayd16

I can't tell from the blog, is this actually verified or is it theory and then numbers showing plausibility?

I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.

5/27/2026, 2:56:53 PM


by: amelius

Sounds like a side channel attack waiting to happen.

5/27/2026, 3:17:47 PM


by: gdevenyi

People have been noticing the effects of this in local LLM inference. Power limiting seems to improve overall performance!

5/27/2026, 12:53:42 PM


by: bitwize

It wouldn't surprise me to see some ML algorithm *in silico* somewhere to select faster matmul paths on favorable data. Yo dawg, I heard you like AI, so we put some AI in your AI so you can infer while you're inferring.

5/27/2026, 3:20:20 PM


by: evanjrowley

Designing for predictable execution flow is one of the advantages of Tenstorrent hardware.

https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrrent-processors.gmi

https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-tensix-programming-model.gmi

https://arxiv.org/html/2604.03279

5/27/2026, 2:46:17 PM

