Seems there's not a lot of talk about relatively unknown finetunes these days, so I'll start posting more!
OpenBuddy's been on my radar, but this one is very interesting: QwQ 32B, post-trained on OpenBuddy's dataset, apparently with QAT applied (though the details are kinda unclear) and context-extended. Observations:
- Quantized with exllamav2, it seems to show lower distortion levels than normal QwQ. It works conspicuously well at 4.0bpw and 3.5bpw (conversion sketched after the list).
- Seems good at long context. Have not tested 200K, but it's quite excellent in the 64K range.
- Works fine in English.
- The chat template is funky. It seems to mix up the <think> and <|think|> tags in particular (why don't they just use ChatML?), and it needs some wrangling with your own template (see the sketch after this list).
- Seems smart, can't say if it's better or worse than QwQ yet, other than it doesn't seem to "suffer" below 3.75bpw like QwQ does.
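For the quantization point above, this is roughly the exllamav2 conversion involved; a minimal sketch, assuming the exllamav2 repo is checked out and the model is downloaded locally. The paths are placeholders and the flags are from memory, so check the exllamav2 README:

```python
# Minimal sketch: making the 4.0bpw quant with exllamav2's convert.py.
# Paths are placeholders; flag names are from memory, verify against the exllamav2 README.
import subprocess

subprocess.run(
    [
        "python", "convert.py",
        "-i", "/models/openbuddy-qwq-32b",          # unquantized HF model dir
        "-o", "/tmp/exl2-work",                     # scratch dir for measurement passes
        "-cf", "/models/openbuddy-qwq-32b-4.0bpw",  # finished quant output dir
        "-b", "4.0",                                # target average bits per weight
    ],
    check=True,
)
```

Drop `-b` to 3.5 for the smaller quant.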
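And on the template point, here's a minimal sketch of what I mean by wrangling, assuming a ChatML-style layout with an explicit think-tag opener works. The exact tag layout is an assumption on my part, not the model card's, so compare against what the model actually emits before relying on it:

```python
# Minimal sketch of a hand-rolled prompt builder. The ChatML turn markers plus an
# explicit <think> opener are an assumption, not the official template; compare
# against the model's actual output before trusting it.
def build_prompt(messages, system="You are a helpful assistant."):
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Open the assistant turn and force the reasoning block to start.
    parts.append("<|im_start|>assistant\n<think>\n")
    return "".join(parts)

print(build_prompt([{"role": "user", "content": "Why is the sky blue?"}]))
```

Most frontends let you express the same thing in their own template format; the point is just to pin the turn markers and the think tag down explicitly instead of trusting the bundled template.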
Also, I reposted this from /r/locallama, which I feel the community generally should do going forward. Given its spirit, it seems like we should be on Lemmy instead?
Ah. You can still run them in exllamav2, but you're probably better off with ik_llama.cpp then:
https://github.com/ikawrakow/ik_llama.cpp
It supports special "KT" quantizations (aka trellis quants, similar to exllamav3's), and will work with Vulkan (or ROCm?) on your 6800.
Quantizing yourself is not too bad, but if you want, just ping me, and I can make some 16GB KT quants, or point you to how to do it yourself.
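The DIY route is basically the stock llama.cpp workflow; a rough sketch, assuming a built ik_llama.cpp and an existing f16 GGUF of the model. The binary and type names are from memory, so double-check them against the ik_llama.cpp README:

```python
# Rough sketch: producing a KT (trellis) quant with ik_llama.cpp's quantize tool.
# Assumes a built ik_llama.cpp and an f16 GGUF of the model already on disk; the
# binary name (llama-quantize) and type name (IQ4_KT) are from memory -- verify.
import subprocess

subprocess.run(
    [
        "./llama-quantize",
        "/models/openbuddy-qwq-32b-f16.gguf",     # input GGUF (placeholder path)
        "/models/openbuddy-qwq-32b-iq4_kt.gguf",  # output quant
        "IQ4_KT",                                 # ~4 bpw; IQ3_KT / IQ2_KT go smaller
    ],
    check=True,
)
```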
It's also a good candidate for Qwen3 30B with a little CPU offloading. ik_llama.cpp is specifically optimized for MoE offloading.
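For the MoE case, the trick is keeping attention and shared tensors on the GPU while pushing the expert tensors to system RAM; roughly like this, where the flag names and the `-ot` regex are from memory and the model path is a placeholder, so verify against the repo:

```python
# Rough sketch: serving a MoE GGUF (e.g. Qwen3 30B A3B) with ik_llama.cpp's server,
# keeping everything on the GPU except the expert tensors, which stay in system RAM.
# Flag names (-ngl, -ot/--override-tensor) and the regex are from memory -- verify.
import subprocess

subprocess.run(
    [
        "./llama-server",
        "-m", "/models/qwen3-30b-a3b-iq4_kt.gguf",  # placeholder path
        "-c", "32768",        # context length
        "-ngl", "99",         # offload all layers to the GPU...
        "-ot", "exps=CPU",    # ...then override the expert tensors back onto the CPU
        "--port", "8080",
    ],
    check=True,
)
```

With an A3B-style MoE only a few billion parameters are active per token, so the CPU-side experts hurt a lot less than you'd expect.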