submitted 1 month ago* (last edited 1 month ago) by Xylight@lemdro.id to c/localllama@sh.itjust.works
[-] afk_strats@lemmy.world 6 points 1 month ago

I'm not sure if it's a me issue, but that's a static image for me. I figure you meant to post the part where they throw a brick into it.

Also, if this post was serious: how does a highly quantized model compare to something less quantized but with fewer parameters? I haven't seen benchmarks other than perplexity, which isn't a good measure of capability.

[-] Xylight@lemdro.id 5 points 1 month ago

It's a WebP animation. Maybe your client doesn't display it right; I'll replace it with a GIF.

Regarding your other question, I tend to see better results with a higher parameter count + lower precision than with a lower parameter count + higher precision. That's just based on "vibes" though; I haven't done any real testing. From what I've seen, Q4 is the lowest safe quantization, and beyond that the performance really starts to drop off. Unfortunately, even at 1-bit quantization I can't run GLM 4.6 on my system.
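As a rough illustration of why that trade-off comes up, here's a back-of-the-envelope sketch of how much memory the weights alone take at different quant levels. The bits-per-weight figures are approximate averages for common GGUF quant types, and the model sizes in the comparison are hypothetical, not any specific release:

```python
# Rough weights-only sizing; ignores KV cache and runtime overhead.
# Bits-per-weight values are approximate averages for common GGUF
# quant types (assumption: real quants mix tensor types, so actual
# files vary a bit).
BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.5,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.6,
}

def approx_size_gb(n_params_billions: float, quant: str) -> float:
    """Approximate in-memory size of the weights in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billions * 1e9 * bits / 8 / 1e9

# Hypothetical comparison for a small VRAM budget: a bigger model at Q4
# lands in roughly the same footprint as a smaller model at Q8.
for params_b, quant in [(14.0, "Q4_K_M"), (8.0, "Q8_0")]:
    print(f"{params_b:.0f}B at {quant}: ~{approx_size_gb(params_b, quant):.1f} GB")
```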

[-] hendrik@palaver.p3x.de 3 points 1 month ago* (last edited 1 month ago)

What's higher precision for you? What I remember from the old ggml measurements is that going lower than Q3 rarely makes sense, and at around Q3 you'd start thinking about switching to a smaller variant. On the other hand, everything above Q6 only shows marginal differences in perplexity, so Q6, Q8, and full precision are basically the same thing.

[-] Xylight@lemdro.id 3 points 1 month ago* (last edited 1 month ago)

As a memory-poor user (hence the 8 GB VRAM card), I consider Q8+ to be high precision, Q4-Q5 mid-to-low precision (what I typically use), and anything below that low precision.

[-] hendrik@palaver.p3x.de 3 points 1 month ago* (last edited 1 month ago)

Thanks. That sounds reasonable. By the way, you're not the only poor person around; I don't even own a graphics card. I'm not a gamer, so I never saw any reason to buy one before I took an interest in AI. I do inference on my CPU, which is connected to more than 8 GB of memory. It's just slow 😉 But I guess I'm fine with that. I don't rely on AI, it's just tinkering, and I'm patient. And a few times a year I'll rent some cloud GPU by the hour. Maybe one day I'll buy one myself.

[-] afk_strats@lemmy.world 2 points 1 month ago

That fixed it.

I am a fan of this quant cook. He often posts perplexity charts.

https://huggingface.co/ubergarm

All of his quants require ik_llama, which works best with Nvidia CUDA, but they can do a lot with RAM + VRAM or even hard drive + RAM. I don't know if 8 GB is enough for everything.
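For illustration only, here's a minimal sketch of the RAM + VRAM split idea using llama-cpp-python (not ik_llama itself, which is a separate fork with its own options); the model path and layer count are placeholders, not a recommendation:

```python
# Minimal sketch of splitting a GGUF model between VRAM and system RAM
# with llama-cpp-python. Model path and layer count are hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_M.gguf",  # placeholder local GGUF file
    n_gpu_layers=20,  # layers offloaded to the GPU; 0 = CPU/RAM only
    n_ctx=4096,       # context window; larger contexts need more memory
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The general trick is the same everywhere: whatever doesn't fit in VRAM stays in system RAM, at the cost of speed.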

[-] hendrik@palaver.p3x.de 4 points 1 month ago* (last edited 1 month ago)

I think perplexity is still central to evaluating models. It's notoriously difficult to come up with other ways to measure these things.
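For anyone curious, perplexity is just the exponent of the average negative log-probability the model assigns to a test text, so a quantized model can be compared directly against its full-precision parent on the same data. A toy sketch with made-up numbers:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from natural-log per-token probabilities:
    PPL = exp(-mean(log p)). Lower is better."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Made-up log-probs for two variants of the same model on the same text.
full_precision = [-1.90, -2.10, -1.75, -2.05]
heavily_quantized = [-2.20, -2.45, -2.00, -2.35]
print(f"full precision : {perplexity(full_precision):.2f}")
print(f"heavy quant    : {perplexity(heavily_quantized):.2f}")
```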

[-] afansfw@lemmynsfw.com 1 points 1 month ago

Unsloth did a test, and their dynamic quants were competitive even at 1-bit in the Aider Polyglot benchmark: https://docs.unsloth.ai/new/unsloth-dynamic-ggufs-on-aider-polyglot

[-] afk_strats@lemmy.world 1 points 1 month ago