Show HN: TurboQuant for vector search – 2-4 bit compression
by justsomeguy1996 on 3/29/2026, 11:10:44 AM
https://github.com/RyanCodrai/py-turboquant
Comments
by: pidtom
I built TurboQuant+ (<a href="https://github.com/TheTom/llama-cpp-turboquant" rel="nofollow">https://github.com/TheTom/llama-cpp-turboquant</a>), the llama.cpp implementation of this paper with extensions: asymmetric K/V compression, boundary layer protection, sparse V dequant, and, as of this week, weight compression (TQ4_1S) that shrinks models 28-42% on disk with minimal quality loss. 5k+ stars, 50+ community testers across Metal, CUDA, and AMD HIP.<p>Cool to see the same WHT + Lloyd-Max math applied to vector search. The data-oblivious codebook property is exactly what makes it work for online KV cache compression too. No calibration, no training, just quantize and go.<p>If anyone is running local LLMs and wants to try it: <a href="https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md" rel="nofollow">https://github.com/TheTom/turboquant_plus/blob/main/docs/get...</a>
4/3/2026, 5:01:11 PM
by: justsomeguy1996
I built a Python implementation of Google's TurboQuant paper (ICLR 2026) for vector search. The key thing that makes this different from PQ and other quantization methods: it's fully data-oblivious. The codebook is derived from math (not trained on your data), so you can add vectors online without ever rebuilding the index. Each vector encodes independently in ~4ms at d=1536.<p>The repo reproduces the benchmarks from Section 4.4 of the paper — recall@1@k on GloVe (d=200) and OpenAI embeddings (d=1536, d=3072). At 4-bit on d=1536, you get 0.967 recall@1@1 with 8x compression. At 2-bit, 0.862 recall@1@1 with ~16x compression.<p>Paper: <a href="https://arxiv.org/abs/2504.19874" rel="nofollow">https://arxiv.org/abs/2504.19874</a>
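To make the data-oblivious idea concrete, here's a rough sketch of the recipe: random sign flips plus a fast Walsh-Hadamard transform make the coordinates look Gaussian regardless of the data, so a single fixed Lloyd-Max codebook works for every vector. This is not the repo's actual API; the 2-bit levels are the textbook Lloyd-Max values for a standard Gaussian, used here purely for illustration.

```python
import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform, orthonormal scaling.
    Requires len(x) to be a power of two; H is then its own inverse."""
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

# Fixed 2-bit Lloyd-Max codebook for a standard Gaussian
# (classic Max/Lloyd levels -- no training on your data).
CODEBOOK_2BIT = np.array([-1.510, -0.4528, 0.4528, 1.510])

def encode(v, signs):
    # Sign flips + WHT act as a cheap random rotation, so each
    # coordinate is approximately Gaussian whatever the input was.
    z = fwht((v * signs).astype(np.float64))
    scale = np.linalg.norm(z) / np.sqrt(len(z))  # per-vector scale
    codes = np.argmin(np.abs(z[:, None] / scale - CODEBOOK_2BIT[None, :]), axis=1)
    return codes.astype(np.uint8), scale

def decode(codes, scale, signs):
    z_hat = CODEBOOK_2BIT[codes] * scale
    # The normalized WHT is self-inverse; then undo the sign flips.
    return fwht(z_hat.copy()) * signs

d = 1024
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=d)
v = rng.standard_normal(d)
codes, scale = encode(v, signs)       # 2 bits/dim + one float scale
v_hat = decode(codes, scale, signs)
cos = v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))
```

Because the codebook is fixed, each vector encodes independently with no index rebuild; the reconstruction keeps high cosine similarity to the original even at 2 bits per dimension.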
3/29/2026, 11:10:54 AM