Hacker News Viewer

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

by matt_d on 4/19/2026, 5:58:16 AM

https://arxiv.org/abs/2604.15039

Comments

by: martinald

Maybe I'm missing something in this paper, but this seems to me to be just pretty "standard" caching stuff, albeit:

a) very time sensitive
b) huge files
c) scoped per user

Sort of reminds me of video streaming on CDNs for live video (but per user)?

I still think the big win is going to come from time-of-use/live capacity pricing. In a pure economics sense you want to charge a lot for inference when it's oversubscribed and far less when it's off peak (see electricity markets).

We have seen this with Anthropic's peak times, but it's very blunt currently. We also saw this with batch processing back in the day, but that breaks down because agents are 'chatty' and need to send new responses ASAP. You can't wait ages for each response; it would take weeks to do a simple agentic task if you had to wait hours between turns.

So I think what we'll see is async agents queued up, where you can then decide when to run them: either 'immediately' for time-sensitive stuff (for more $$$) or 'best effort', where they can be scheduled to run whenever the provider wants (3am, say). If you also have diagnostics showing that agent task xyz usually takes y tokens total, you can do far more efficient scheduling. This also reduces the amount of KV-cache gymnastics significantly, as you can dedicate that agent task to a certain rack and schedule it all efficiently.

tl;dr: I think the issues with inference efficiency need to be solved at a higher abstraction level, per agent "task", not purely per chat message. If you can schedule a load of agentic use cases off peak, you don't need to preempt them, because there is spare capacity by nature.
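The two-tier queuing the comment describes can be sketched as follows. This is a minimal illustration, not anything from the paper: all class and parameter names (`AgentScheduler`, `est_tokens`, `token_budget`) are hypothetical, and the per-window token budget stands in for the commenter's "dedicate that agent task to a certain rack" idea.

```python
import heapq
from itertools import count

IMMEDIATE, BEST_EFFORT = 0, 1  # lower value = higher priority

class AgentScheduler:
    """Two-tier task queue: IMMEDIATE tasks may dispatch in any window,
    BEST_EFFORT tasks only during off-peak windows. Both tiers are
    packed against a per-window token budget using each task's
    estimated total token count (the 'task xyz takes y tokens'
    diagnostic from the comment)."""

    def __init__(self):
        self._q = []
        self._seq = count()  # FIFO tie-break within a tier

    def submit(self, name, est_tokens, tier=BEST_EFFORT):
        heapq.heappush(self._q, (tier, next(self._seq), name, est_tokens))

    def dispatch(self, off_peak, token_budget):
        """Return the batch of task names to run in this window."""
        batch, leftover = [], []
        while self._q:
            tier, seq, name, est = heapq.heappop(self._q)
            runnable = (tier == IMMEDIATE) or off_peak
            if runnable and est <= token_budget:
                token_budget -= est
                batch.append(name)
            else:
                leftover.append((tier, seq, name, est))
        for item in leftover:  # anything not dispatched waits
            heapq.heappush(self._q, item)
        return batch
```

Under this sketch, a peak-hours dispatch runs only the immediate tier, while an off-peak dispatch drains best-effort tasks until the window's token budget is spent:

```python
sched = AgentScheduler()
sched.submit("triage-bug", 2_000, tier=IMMEDIATE)
sched.submit("nightly-refactor", 50_000)   # best effort
sched.submit("weekly-report", 30_000)      # best effort

sched.dispatch(off_peak=False, token_budget=100_000)  # ["triage-bug"]
sched.dispatch(off_peak=True, token_budget=60_000)    # ["nightly-refactor"]
```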

4/22/2026, 12:53:59 PM