submitted 1 week ago* (last edited 1 week ago) by TheCornCollector@piefed.zip to c/localllama@sh.itjust.works

The Qwen3.5 models are still the best local models I’ve used, so I’m excited to see how this updated version performs.

top 22 comments
[-] SuspciousCarrot78@lemmy.world 1 points 5 days ago* (last edited 5 days ago)

Wasn't there initially a CoT collapse issue in the Qwen 3.5 series when using llama.cpp? Was that fixed for 3.6?

Also, is the /no_think feature still gated behind temp=x settings?

EDIT: Yes, on the second question.

  • Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  • Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
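For anyone wanting to apply those, they map directly onto llama.cpp's llama-server sampler flags. A sketch (the model filename is a placeholder, substitute your own GGUF):

```shell
# Non-thinking / general-task settings from the list above.
# Model path is a placeholder.
llama-server -m Qwen3.6-35B-UD_Q4_K_XL.gguf \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 \
  --presence-penalty 1.5 --repeat-penalty 1.0
```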

Interesting new feature?

*Preserve Thinking

By default, only the thinking blocks generated while handling the latest user message are retained, resulting in a pattern commonly known as interleaved thinking. Qwen3.6 has additionally been trained to preserve and leverage thinking traces from historical messages. You can enable this behavior by setting the preserve_thinking option: ...

This capability is particularly beneficial for agent scenarios, where maintaining full reasoning context can enhance decision consistency and, in many cases, reduce overall token consumption by minimizing redundant reasoning. Additionally, it can improve KV cache utilization, optimizing inference efficiency in both thinking and non-thinking modes.*
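The docs snippet cuts off before showing where preserve_thinking actually goes. As a sketch, assuming an OpenAI-compatible endpoint that accepts it as an extra request field (the endpoint, port, and field placement are my assumptions; only the option name comes from the quote):

```shell
# Only "preserve_thinking" comes from the quoted docs; the rest of the
# payload shape and the endpoint are my assumptions.
cat > payload.json <<'EOF'
{
  "model": "qwen3.6",
  "messages": [{"role": "user", "content": "Continue the plan from earlier."}],
  "preserve_thinking": true
}
EOF
# Sanity-check the JSON, then send it to a (hypothetical) local server.
python3 -m json.tool payload.json
curl -sS http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" -d @payload.json || true
```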

[-] FrankLaskey@lemmy.ml 1 points 6 days ago* (last edited 6 days ago)

Was just looking at the benchmarks for it on artificialanalysis.ai and it looks great for the size. Probably the best available for general use under 40b parameters, I'd say. Even more impressive are the agentic capabilities and the fact that it's actually decent on hallucinations (not amazing, given it's a small-to-medium-size model, but decent).

[-] TheCornCollector@piefed.zip 2 points 6 days ago* (last edited 5 days ago)

I've been using it for the past few days and the output quality seems on par with, or slightly better than, 3.5 27b. The biggest issue is the token usage, which has exploded with this revision: it can easily reason for 20k-25k tokens on a question where the Qwen3.5 models used 10k. Since it runs more than 3 times faster, it still finishes earlier than the 27b, but I won't have any context/VRAM left to ask multiple questions.

Artificial Analysis has similar findings: in their bar graph of output tokens across models, Qwen3.6 35b sits at 140 million versus 100 million for Qwen3.5 35b.

[-] FrankLaskey@lemmy.ml 2 points 5 days ago

Yes I did see that as well. That does seem to be the real Achilles heel here. Will have to try it myself to see how much it exacerbates context size limitations given I would be running it on a single 24 GB VRAM GPU. I wonder if adjusting reasoning effort parameters could make a difference without affecting quality too much?
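For what it's worth, llama.cpp's llama-server does have a --reasoning-budget flag, though last I checked it was all-or-nothing rather than a tunable effort knob. A sketch (filename is a placeholder):

```shell
# 0 disables thinking entirely; -1 (the default) leaves it unrestricted.
# As far as I know there's no intermediate budget yet.
llama-server -m Qwen3.6-35B-UD_Q4_K_XL.gguf --reasoning-budget 0
```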

[-] locuester@lemmy.zip 1 points 6 days ago

Need one with more parameters. The quality here is way lower than something like the Qwen 3.5 397 at 4-bit quant.

[-] venusaur@lemmy.world 7 points 1 week ago

What kind of system requirements to run this new model decently?

[-] TheCornCollector@piefed.zip 6 points 1 week ago* (last edited 2 days ago)

I’m running it with the UD_Q4_K_XL quant on a 24 GB 7900 XTX at ~130* tokens/s. Since it’s an MoE model, CPU inference with 32 GB RAM should be doable, but I won’t make any promises on speed.

*Edit: I had a configuration issue on my llama.cpp that reduced the performance. It was limited to 85 tk/s but that was user error on my part.
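In case it helps anyone reproduce this setup: recent llama.cpp builds can fetch a GGUF straight from Hugging Face. A sketch (the repo and quant tag below are placeholders, not a real upload; substitute the actual one):

```shell
# -hf downloads and caches the quant from Hugging Face on first run.
# Repo and quant tag are placeholders.
llama-server -hf some-org/Qwen3.6-35B-GGUF:UD-Q4_K_XL \
  --n-gpu-layers 99 --ctx-size 32768
```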

[-] ericwdhs@discuss.online 0 points 5 days ago

Can I ask you what GPU driver version you're running? I'm running a 7900 XTX as well and recently encountered some stability issues after a driver update (trying to support gaming and AI stuff at the same time). The latest version I could find as a recommendation for similar issues was 24.12.1.

[-] TheCornCollector@piefed.zip 1 points 5 days ago

Ah, I don’t know anything about Windows. I’m using Linux, and both the latest ROCm (7.2.2) and latest Vulkan (26.0.5) packages work without issues for combined gaming and AI. For reference, my reported numbers were with Vulkan at zero context.

[-] ericwdhs@discuss.online 2 points 4 days ago

Thanks! I'm migrating all my PCs to Linux anyway and just haven't gotten to the AI stuff yet, so it sounds like that might fix itself.

[-] venusaur@lemmy.world 2 points 1 week ago

Thanks! That sounds expensive. Hopefully 24GB VRAM gets cheaper or models get more efficient soon.

[-] Jakeroxs@sh.itjust.works 3 points 1 week ago

You'd want to wait until smaller models for 3.6 are released; I'd assume that'll be soon.

[-] venusaur@lemmy.world 1 points 1 week ago

Thanks! I’m hoping to run at least 20B. Idk if I can do that fast enough without 24GB. Seems to be the sweet spot.

[-] fonix232@fedia.io 1 points 1 week ago

Wonder what the wombo-combo of Ryzen AI APU can do with this.

Time to fire up the trusty 370.

[-] Infinite@lemmy.zip 5 points 1 week ago

Probably 24 GB VRAM and 32-64 GB RAM for minimum specs with 4-bit quantization. This is a beefy boi.

[-] venusaur@lemmy.world 2 points 1 week ago

Thanks! Not for me yet. Hope to save up enough to get 24GB VRAM in near future.

[-] altphoto@lemmy.today 3 points 1 week ago

How do you install it? I'm using AI lab on Podman, which can import new models, but I haven't been able to load any models other than the ones it comes with. The ones I tried loading just crash when I start the service.

[-] TheCornCollector@piefed.zip 8 points 1 week ago

I agree with the other commenters' suggestions; just wanted to add that I personally run llama.cpp directly with the built-in llama-server. For a single-user server this works great, and llama.cpp is almost always at the forefront of model support.

[-] MalReynolds@slrpnk.net 6 points 1 week ago* (last edited 1 week ago)

Not sure about AI lab (although I also use Podman, I prefer llama-swap), but pretty much everything uses llama.cpp under the hood, which usually takes a day or three to support a new architecture. Although I seem to recall them being ready for Qwen3.5 on day one due to collaboration.

I find it's best to give it a week or so for the dust to settle (even if it 'works', the best parameters, quantization bugs, etc. take a while to shake out) unless there's a huge motivation.

Also, benches are more like guidelines than actual rules; it's best to run your own on your own use cases.

[-] altphoto@lemmy.today 2 points 1 week ago

Okay, thanks for the help! I'll give llama swap a shot.

[-] MalReynolds@slrpnk.net 3 points 1 week ago* (last edited 1 week ago)

AI lab

Had a quick squiz at it, and if it meets your needs, I'd just wait, unless you want to get into the lower levels of things (and run Linux). Note that llama-swap is just an inference server; you'll need something like Open WebUI on top for chat, etc.
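If you do end up trying llama-swap, its config is just a YAML file mapping model names to llama-server commands. A minimal sketch from memory (model names and paths are placeholders):

```shell
# Write a minimal llama-swap config; llama-swap substitutes ${PORT} itself
# when it launches the backend. Names and paths are placeholders.
cat > config.yaml <<'EOF'
models:
  "qwen3.6-35b":
    cmd: |
      llama-server --port ${PORT}
      -m /models/Qwen3.6-35B-UD_Q4_K_XL.gguf
EOF
```

Then point llama-swap at it with something like `llama-swap --config config.yaml --listen :8080` (going from memory here, check its README for the exact flags).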

[-] cecilkorik@piefed.ca 3 points 1 week ago

I have no experience with AI lab or what formats it supports, but you can download raw model files from Hugging Face, or use a tool like Ollama that can pull them (either from its own model repos or from Hugging Face directly; there's an Ollama copy-paste command on Hugging Face). There are lots of formats, but GGUF is usually the easiest and comes as a single file.
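For the Ollama route, pulling straight from a Hugging Face GGUF repo looks like this (the hf.co/ prefix is real Ollama functionality, but the repo name below is a placeholder; copy the real command from the model page):

```shell
# Pull and run a GGUF directly from a Hugging Face repo via Ollama.
# Repo and quant tag are placeholders.
ollama run hf.co/some-org/Qwen3.6-35B-GGUF:Q4_K_M
```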

this post was submitted on 16 Apr 2026
43 points (95.7% liked)
