[-] theunknownmuncher@lemmy.world 11 points 1 day ago* (last edited 1 day ago)

> TurboQuant compresses the cache to just 3 bits per value, down from the standard 16

Uh, pretty much no one uses a 16-bit KV cache, so it's extremely dubious that this specific quant technique is relevant to memory stocks at all...

We already have q4 and q8 KV cache quantization. LLM performance is highly sensitive to KV cache quantization, though, so q4 is probably only reasonable with specific models that don't suck as badly under it, and that's likely true of this new technique too.
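For scale, here's the back-of-envelope arithmetic behind those bit widths. The model dimensions below (32 layers, 32 KV heads, head dim 128, roughly Llama-2-7B-shaped) are illustrative assumptions, not numbers from the article:

```python
# Rough KV cache sizing at different precisions. Dimensions are
# hypothetical (Llama-2-7B-like); only the bits-per-value ratios matter.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bits=16):
    """Total bytes for keys + values across all layers at the given precision."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V tensors
    return seq_len * values_per_token * bits // 8

for bits in (16, 8, 4, 3):
    gib = kv_cache_bytes(seq_len=32_768, bits=bits) / 2**30
    print(f"{bits:>2}-bit KV cache @ 32k context: {gib:.1f} GiB")
```

The q8 and q4 caches people already run thus give 2x and 4x savings over fp16; 3 bits only shaves that to ~5.3x.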

> reducing its memory footprint by at least six times without, according to Google’s benchmarks, any measurable loss in accuracy.

Extreme doubt.

> At 4-bit precision, the algorithm delivered up to an eight-times speedup in computing attention on Nvidia H100 GPUs compared to the uncompressed 32-bit baseline.

Again, no one is using 32-bit values for the KV cache. It's like bragging about how fast the latest car is by comparing it to a horse and buggy.
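The "8x" figure is exactly what you'd predict from bytes moved if the kernel is memory-bandwidth-bound, and it only comes out that big because of the fp32 baseline. A sketch of the same arithmetic against realistic baselines (my assumption of an ideal bandwidth-bound model, not the article's methodology):

```python
# Ideal speedup for a memory-bound attention kernel is just the ratio
# of bytes read per value. The baseline choice does all the work here.

def bandwidth_speedup(baseline_bits, quant_bits):
    """Best-case speedup from reading fewer bits per cached value."""
    return baseline_bits / quant_bits

print(bandwidth_speedup(32, 4))  # vs fp32 baseline: 8.0 (the quoted figure)
print(bandwidth_speedup(16, 4))  # vs the fp16 default: 4.0
print(bandwidth_speedup(8, 4))   # vs an existing q8 cache: 2.0
```

Against what people actually deploy, the same kernel would headline at 2-4x, not 8x.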

this post was submitted on 31 Mar 2026
8 points (72.2% liked)

Tech


A community for high quality news and discussion around technological advancements and changes
