submitted 2 days ago* (last edited 2 days ago) by wittycomputer@feddit.org to c/asklemmy@lemmy.ml
iceberg314@midwest.social 2 points 1 day ago

I also have a 5060 Ti with 16GB of VRAM. I tend to run GPT-OSS:20B or Qwen3:14B with a context of ~30k, and I have a custom system prompt in Open WebUI for the style of response I like. That takes up about 14GB of my 16GB of VRAM.
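In case it helps anyone reproduce this, here's a minimal sketch of that setup hit directly over Ollama's REST API (which is what Open WebUI typically sits in front of). The model name, port, and prompt text are just placeholders for whatever you actually run:

```python
# Minimal sketch, assuming an Ollama backend on its default port.
# Model name and prompt text are placeholders.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"

payload = {
    "model": "qwen3:14b",
    "messages": [
        # The custom system prompt that steers the response style
        {"role": "system", "content": "Answer concisely; show code first."},
        {"role": "user", "content": "How do I reverse a list in Python?"},
    ],
    "options": {
        "num_ctx": 30720,  # ~30k context window; the main VRAM cost
    },
    "stream": False,
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

Bumping `num_ctx` is what eats the VRAM headroom, so that's the knob to turn first if you're spilling past 16GB.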

But yeah, it's slower and not as "smart" as the cloud-based models. Still, I think the inconvenience of the speed, and of having to fact-check answers and test generated code, is worth it for the privacy and environmental trade-offs.

Hexarei@beehaw.org 3 points 1 day ago

I've had good success on similar hardware (5070 + more RAM) with GLM-4.7-Flash, using llama.cpp's --cpu-moe flag: I can get up to 150k context with it at ~20 tok/sec. I've also found it a lot better than GPT-OSS for agentic use. It puts in a much more in-depth reasoning effort, so while it spends more tokens, it seems worth it for the end result.
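Roughly, the invocation looks like the sketch below (wrapped in Python's subprocess just to keep it self-contained; the GGUF filename and the exact numbers are placeholders, not tested values from my setup):

```python
# Rough sketch of launching llama.cpp's llama-server with --cpu-moe.
# The GGUF path is a placeholder; context size and layer count are
# illustrative assumptions, not tuned values.
import subprocess

cmd = [
    "llama-server",
    "-m", "GLM-4.7-Flash.Q4_K_M.gguf",  # placeholder model file
    "-c", "150000",           # large context, as described above
    "-ngl", "99",             # offload all non-expert layers to the GPU
    "--cpu-moe",              # keep MoE expert weights in system RAM
    "--port", "8080",
]

# llama-server then serves an OpenAI-compatible API on the given port.
subprocess.run(cmd, check=True)
```

The trick is that --cpu-moe keeps the big MoE expert tensors in system RAM while the attention layers and KV cache stay on the GPU, which is what frees up room for that much context on a 16GB card.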
