What kind of system requirements are needed to run this new model decently?
I’m running it with the UD_Q4_K_XL quant on a 24GB 7900 XTX at ~130* tokens/s. Since it’s an MoE model, CPU inference with 32 GB of RAM should be doable, but I won’t make any promises on speed.
*Edit: I had a configuration issue in my llama.cpp setup that was reducing performance; it was limited to 85 tk/s, but that was user error on my part.
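For anyone wanting to reproduce something similar, here's a minimal sketch of launching llama-server with the model fully offloaded to the GPU; the GGUF filename, context size, and port are placeholders, not my exact command:

```python
# Sketch: launch llama-server from Python with all layers offloaded to VRAM.
# The GGUF filename, context size, and port below are placeholders.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "model-UD-Q4_K_XL.gguf",  # placeholder GGUF path
    "-ngl", "99",                    # offload all layers to the GPU
    "-c", "16384",                   # context window (adjust to VRAM)
    "--port", "8080",                # OpenAI-compatible API port
])
```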
Thanks! That sounds expensive. Hopefully 24GB VRAM gets cheaper or models get more efficient soon.
You'd want to wait until smaller models for 3.6 are released; I'd assume that'll be soon.
Thanks! I’m hoping to run at least 20B. Idk if I can do that fast enough without 24GB. Seems to be the sweet spot.
Wonder what the wombo-combo of Ryzen AI APU can do with this.
Time to fire up the trusty 370.
Can I ask you what GPU driver version you're running? I'm running a 7900 XTX as well and recently encountered some stability issues after a driver update (trying to support gaming and AI stuff at the same time). The latest version I could find as a recommendation for similar issues was 24.12.1.
Ah, I don’t know anything about Windows. I’m using Linux, and both the latest ROCm (7.2.2) and latest Vulkan (26.0.5) packages work without issues for combined gaming and AI. For reference, my reported numbers were with Vulkan at zero context.
Thanks! I'm migrating all my PCs to Linux anyway and just haven't gotten to the AI stuff yet, so it sounds like that might fix itself.
Probably 24 GB VRAM and 32-64 GB RAM for minimum specs with 4-bit quantization. This is a beefy boi.
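As a rough back-of-envelope check on those numbers, here's a sketch; the ~35B parameter count and ~4.5 bits per weight are illustrative assumptions, not this model's actual specs:

```python
# Rough memory estimate for a quantized GGUF: weights ~= params * bpw / 8,
# plus a few GB of headroom for KV cache and runtime buffers.
def quant_weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Hypothetical example: ~35B parameters at ~4.5 bpw (roughly Q4_K territory)
weights = quant_weights_gb(35, 4.5)   # ~19.7 GB of weights
total = weights + 4                   # + ~4 GB KV cache / overhead (rough guess)
print(f"~{weights:.1f} GB weights, ~{total:.1f} GB total")
```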
Thanks! Not for me yet. Hope to save up enough to get 24GB of VRAM in the near future.
How do you install it? I'm using AI Lab on Podman, which can import new models, but I haven't been able to load any models other than the ones it comes with. The ones I tried loading just crash when I start the service.
I agree with the other commenters' suggestions; I just wanted to add that I personally run llama.cpp directly with the built-in llama-server. For a single-user server this seems to work great, and it's almost always at the forefront of model support.
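As a quick illustration of that setup: llama-server exposes an OpenAI-compatible API, so any standard client can talk to it. A minimal sketch, assuming a server is already listening on a placeholder port:

```python
# Sketch: query a locally running llama-server via its OpenAI-compatible API.
# Assumes llama-server is already running on port 8080 (placeholder).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="local",  # model name is mostly ignored when a single model is loaded
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```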
Not sure about AI Lab (I also use Podman, but prefer llama-swap), but pretty much everything uses llama.cpp under the hood, which usually takes a day or three to get set up for a new architecture. Although I seem to recall them being ready for Qwen 3.5 on day one due to collaboration.
I find it's worth giving it a week or so for the dust to settle (even if it 'works', the best parameters, quantization bugs, etc. take a while to shake out), unless there's a huge motivation.
Also, benchmarks are more like guidelines than actual rules; it's best to run your own on your own use cases.
Okay, thanks for the help! I'll give llama swap a shot.
> AI Lab
Had a quick squiz at it, and if it meets your needs I'd just wait, unless you want to get into the lower levels of things (and run Linux). llama-swap is just an inference server; you'll also need something like Open WebUI for chat, etc.
I have no experience with AI Lab or what formats it supports, but you can download raw model files from Hugging Face, or use a tool like Ollama that can pull them (either from its own model repos or from Hugging Face directly; there's an Ollama copy-paste command on Hugging Face). There are lots of formats, but GGUF is usually the easiest and is a single file.
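For example, a single GGUF file can also be fetched programmatically with the huggingface_hub package; the repo and filename below are placeholders for whichever quant you pick:

```python
# Sketch: download one GGUF file from Hugging Face into the local cache.
# repo_id and filename are placeholders -- substitute the actual repo and quant.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="some-org/some-model-GGUF",
    filename="some-model-Q4_K_M.gguf",
)
print("Saved to:", path)
```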
Wasn't there initially a CoT collapse issue in the Qwen 3.5 series when using llama.cpp? Was that fixed for 3.6?
Also, is the /no_think feature still gated behind temp=x settings?
EDIT: Yes, on the second question.
- Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
- Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
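For reference, a sketch of passing the general-task settings through an OpenAI-compatible endpoint (e.g. llama-server); top_k, min_p, and the repetition penalty aren't standard OpenAI fields, so they go through extra_body, and the exact field names and whether they're honored depend on the backend:

```python
# Sketch: apply the recommended general-task sampling settings via an
# OpenAI-compatible API (endpoint and extra_body field names are backend-dependent).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Summarize this paragraph: ..."}],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={            # non-standard sampler fields (llama.cpp-style names)
        "top_k": 20,
        "min_p": 0.0,
        "repeat_penalty": 1.0,
    },
)
print(resp.choices[0].message.content)
```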
Interesting new feature?
*Preserve Thinking
By default, only the thinking blocks generated while handling the latest user message are retained, resulting in the pattern commonly known as interleaved thinking. Qwen3.6 has been additionally trained to preserve and leverage thinking traces from historical messages. You can enable this behavior by setting the preserve_thinking option: ...
This capability is particularly beneficial for agent scenarios, where maintaining full reasoning context can enhance decision consistency and, in many cases, reduce overall token consumption by minimizing redundant reasoning. Additionally, it can improve KV cache utilization, optimizing inference efficiency in both thinking and non-thinking modes.*
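Here's a sketch of what enabling that could look like when rendering the prompt with transformers; passing preserve_thinking through apply_chat_template (and the reasoning_content message field) is my assumption based on the excerpt above, not a confirmed API:

```python
# Sketch (assumed API): keep historical thinking traces in the rendered prompt
# by forwarding preserve_thinking to the chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/some-qwen-model")  # placeholder repo id

messages = [
    {"role": "user", "content": "Step 1 of the task..."},
    {"role": "assistant", "content": "Done.",
     "reasoning_content": "...thinking from the earlier turn..."},  # assumed field
    {"role": "user", "content": "Now do step 2..."},
]

prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    preserve_thinking=True,  # assumption: extra kwargs are forwarded to the template
)
print(prompt)
```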
Need one with more parameters. The quality here is way lower than something like the Qwen 3.5 397 4-bit quant.
I was just looking at the benchmarks for it on artificialanalysis.ai and it looks great for the size. Probably the best available for general use if you're looking for something under 40B parameters, I'd say. Even more impressive are the agentic capabilities and the fact that it's actually decent in terms of hallucinations (not amazing, given it's a small-to-medium-size model, but decent).
I've been using it for the past few days and the output quality seems to be on par with, or slightly better than, 3.5 27B. The biggest issue is the token usage, which has exploded with this revision. It can easily reason for 20k-25k tokens on a question where the Qwen 3.5 models used 10k. Since it runs more than 3 times faster, it still finishes earlier than the 27B, but I won't have any context/VRAM left to ask multiple questions.
Artificial Analysis has similar findings.

Yes, I did see that as well. That does seem to be the real Achilles' heel here. I'll have to try it myself to see how much it exacerbates context-size limitations, given that I'd be running it on a single 24 GB VRAM GPU. I wonder if adjusting reasoning-effort parameters could make a difference without affecting quality too much?
