> TurboQuant compresses the cache to just 3 bits per value, down from the standard 16
Uh, pretty much no one uses a 16-bit KV cache, so it's extremely dubious that this specific quant technique is relevant to memory stocks at all...
We already have q4 and q8 KV cache quantization. LLM performance is highly sensitive to KV cache quantization, though, so q4 is probably only reasonable with specific models that don't suck as badly under it, and the same likely goes for this new technique.
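For intuition, here's a minimal NumPy sketch of the per-group scale-and-round quantization that q4/q8-style KV cache modes are built on. The group size of 32 and the symmetric scheme are illustrative assumptions, not any particular implementation, but the trend is the point: each bit you drop roughly doubles the round-trip error, which is why 4-bit is already borderline and 3-bit only gets hairier.

```python
# Minimal sketch of per-group symmetric quantization (assumed group size 32),
# the general idea behind q4/q8-style KV cache formats.
import numpy as np

def quant_roundtrip(x, bits, group=32):
    """Quantize x to signed `bits`-bit ints per group, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit, 127 for 8-bit
    groups = x.reshape(-1, group)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                    # avoid div-by-zero on all-zero groups
    q = np.clip(np.round(groups / scale), -qmax, qmax)
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
kv = rng.standard_normal(4096).astype(np.float32)  # stand-in for a KV slice
for bits in (8, 4, 3):
    err = np.abs(kv - quant_roundtrip(kv, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```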
> reducing its memory footprint by at least six times without, according to Google's benchmarks, any measurable loss in accuracy.
Extreme doubt.
> At 4-bit precision, the algorithm delivered up to an eight-times speedup in computing attention on Nvidia H100 GPUs compared to the uncompressed 32-bit baseline.
Again, no one is using 32-bit values for the KV cache. It's like bragging about how fast the latest car is by comparing it to a horse and buggy.
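Some back-of-the-envelope arithmetic makes the point. This sketch assumes a Llama-2-7B-shaped model (32 layers, 32 KV heads, head dim 128; many real deployments use GQA with fewer KV heads) at a 32k context, and ignores the scale/zero-point overhead that real quantized formats carry:

```python
# Rough KV cache footprint at various bit widths for an assumed
# Llama-2-7B-like shape; ignores per-group scale overhead.
layers, kv_heads, head_dim, seq_len = 32, 32, 128, 32_768

def kv_cache_gib(bits):
    values = 2 * layers * kv_heads * head_dim * seq_len  # K and V
    return values * bits / 8 / 2**30

for label, bits in [("fp32", 32), ("fp16", 16), ("q8", 8), ("q4", 4), ("3-bit", 3)]:
    print(f"{label:>5}: {kv_cache_gib(bits):6.2f} GiB "
          f"({kv_cache_gib(16) / kv_cache_gib(bits):.1f}x vs fp16)")
```

Even taking fp16 as the baseline, 3 bits buys ~5.3x, not six; against the q4 caches already in use it's ~1.3x, and only against an fp32 cache does the ratio reach ~10x, which is exactly where a baseline like that flatters the headline.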