Google's TurboQuant compresses AI memory by 6x, rattles chip stocks
(thenextweb.com)
Uh, pretty much no one uses a 16-bit KV cache, so it's extremely dubious that this specific quant technique is relevant to memory stocks at all...
We already have q4 and q8 KV cache quantization. LLM performance is highly sensitive to KV cache quantization, though, so q4 is probably only reasonable with specific models that don't suck as badly under it, and the same likely goes for this new quantization technique.
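For anyone unfamiliar, here is roughly what q8-style KV cache quantization does, as a minimal numpy sketch. The shapes and the per-token absmax scheme are illustrative only, not the article's actual TurboQuant method:

    import numpy as np

    # Minimal sketch of absmax int8 ("q8"-style) KV cache quantization.
    # Shapes are hypothetical; a real cache is [layers, heads, seq, head_dim].
    kv = np.random.randn(4096, 128).astype(np.float16)  # one layer's keys

    # Per-token absmax scaling: each row gets its own fp16 scale factor.
    scale = np.abs(kv).max(axis=1, keepdims=True) / 127.0
    q = np.round(kv / scale).astype(np.int8)  # stored: int8 values + scales

    # Dequantize on read; this round-trip error is what degrades quality.
    kv_hat = q.astype(np.float16) * scale
    print("fp16 bytes:", kv.nbytes, " q8 bytes:", q.nbytes + scale.nbytes)
    print("max abs error:", np.abs(kv - kv_hat).max())

The storage drops roughly in half versus fp16, and the reconstruction error printed at the end is exactly the kind of noise that some models tolerate and others don't.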
Extreme doubt.
Again, no one is using 32-bit values for the KV cache. It's like bragging about how fast the latest car is by comparing it to a horse and buggy.
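To put numbers on that, here is a back-of-envelope sizing sketch. The config is a made-up Llama-style example (layer count, KV heads, and head dim are not from the article):

    # Back-of-envelope KV cache sizing for a hypothetical model config.
    layers, kv_heads, head_dim = 32, 8, 128

    def kv_bytes_per_token(bits):
        # Keys and values across all layers and KV heads.
        return 2 * layers * kv_heads * head_dim * bits / 8

    for bits in (32, 16, 8, 4):
        print(f"{bits:>2}-bit: {kv_bytes_per_token(bits) / 1024:.0f} KiB/token, "
              f"{kv_bytes_per_token(32) / kv_bytes_per_token(bits):.0f}x vs fp32, "
              f"{kv_bytes_per_token(16) / kv_bytes_per_token(bits):.1f}x vs fp16")

Under these assumptions, 4-bit is 8x smaller than fp32 but only 4x smaller than fp16, so how impressive a "6x" headline number is depends entirely on whether the baseline is fp32, fp16, or the q8 caches already in common use.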