this post was submitted on 22 Feb 2026
31 points (67.4% liked)
Asklemmy
That is why I like small, specialized, locally hosted AI. It runs acceptably fast and quiet on my gaming PC, it's private, and I can feed it knowledge in small doses for specific topics and projects.
Which model do you use, and what are your specs? I ran a couple using an RTX 5060 with 16 GB, and it's too slow to be usable for larger models, while the smaller ones are mostly useless.
I also have a 5060 (Ti) with 16 GB of VRAM. I tend to use GPT-OSS:20B or Qwen3:14B with a context of ~30k. I have a custom system prompt in Open WebUI for the style of response I like. That takes up about 14 GB of my 16 GB of VRAM.
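If it helps anyone reproduce a setup like this, it can be sketched as an Ollama Modelfile (the base tag, context value, and system prompt text here are assumptions for illustration, not an exact config):

```
# Hypothetical Modelfile - adjust the base tag and prompt to taste
FROM qwen3:14b
PARAMETER num_ctx 30720
SYSTEM "Answer concisely and flag anything you are unsure about."
```

Then build it with `ollama create my-qwen -f Modelfile` and select `my-qwen` in Open WebUI, so the context size and system prompt travel with the model instead of living in per-chat settings.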
But yeah, it is slower and not as "smart" as the cloud-based models. Still, I think the inconvenience of the speed and having to fact-check/test code is worth the privacy and environmental trade-offs.
I've had good success on similar hardware (5070 + more RAM) with GLM-4.7-Flash, using llama.cpp's `--cpu-moe` flag - I can get up to 150k context with it at ~20 tok/sec. I've found it to be a lot better for agentic use than GPT-OSS as well; it seems to put in a much deeper reasoning effort, so while it spends more tokens, it seems worth it for the end result.
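For anyone who wants to try this, a minimal `llama-server` invocation along these lines would look roughly like the sketch below. The GGUF filename and quant are placeholders I made up; the point is that `--cpu-moe` keeps the MoE expert weights in system RAM, so the dense layers plus a large KV cache fit in 16 GB of VRAM:

```shell
# Sketch only - model path and quant are hypothetical
llama-server \
  -m GLM-4.7-Flash-Q4_K_M.gguf \
  -ngl 99 \        # offload all layers to the GPU
  --cpu-moe \      # but keep MoE expert weights in system RAM
  -c 150000        # large context fits in the freed VRAM
```

The trade-off is that expert layers run on the CPU each token, which is why throughput lands around 20 tok/sec rather than full-GPU speeds.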