Be specific!
- What model size (or model) are you looking to host?
- At what context length?
- What kind of speed (tokens/s) do you need?
- Is it just for you, or for many people? How many? In other words, does the serving need to be parallel?
In other words, it depends, but the sweet spot option for a self-hosted rig, OP, is probably:
- One 5090 or A6000 Ada GPU, or maybe 2x 3090s/4090s, underclocked.
- A cost-effective EPYC CPU/mobo
- At least 256 GB of DDR5 (rough speed math in the sketch below)
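To sanity-check whether a rig like that is fast enough for you, a back-of-envelope estimate helps: generation on a big MoE is mostly memory-bandwidth-bound, since each token only reads the active experts. Here's a minimal Python sketch; the active-parameter count, bytes-per-weight, and bandwidth figures are assumptions you should swap for your own quant and hardware:

```python
# Back-of-envelope: CPU-side generation speed for a big MoE is roughly
# memory-bandwidth-bound, because each token only touches the active experts.
# All numbers below are illustrative assumptions, not measurements.

ACTIVE_PARAMS = 37e9    # DeepSeek-V3/R1 activates ~37B of its 671B params per token
BYTES_PER_PARAM = 0.55  # ~4.4 bits/weight for a Q4-ish GGUF quant (assumption)
EFFECTIVE_BW = 350e9    # sustained bandwidth of 12-channel DDR5 EPYC, bytes/s (assumption)

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
tokens_per_sec = EFFECTIVE_BW / bytes_per_token

print(f"~{bytes_per_token / 1e9:.0f} GB read per token")
print(f"upper bound ≈ {tokens_per_sec:.1f} tok/s (real numbers land below this)")
```

Real-world throughput lands below that ceiling, but on a many-channel EPYC it's still comfortably faster than reading speed.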
Now run ik_llama.cpp, and you can serve DeepSeek 671B faster than you can read, without burning your house down with H200s: https://github.com/ikawrakow/ik_llama.cpp
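Once the server is up, you can measure what you're actually getting from the client side. This sketch assumes ik_llama.cpp's server exposes the same OpenAI-compatible HTTP endpoint it inherits from upstream llama.cpp, listening on localhost:8080; host, port, prompt, and the "model" string are placeholders, so adjust them to your launch flags:

```python
# Minimal client-side speed check against a running ik_llama.cpp server.
# Assumes the OpenAI-compatible /v1/chat/completions endpoint on localhost:8080.
# Requires: pip install requests
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "deepseek",  # typically ignored/echoed by the server; placeholder
    "messages": [{"role": "user", "content": "Give me three uses for a spare 3090."}],
    "max_tokens": 200,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
data = resp.json()
elapsed = time.time() - start

print(data["choices"][0]["message"]["content"])

usage = data.get("usage", {})
if usage.get("completion_tokens"):
    print(f"\n~{usage['completion_tokens'] / elapsed:.1f} tok/s end-to-end")
```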
It will also handle dots.llm, Kimi, and pretty much any of the mega MoEs du jour.
But there are all sorts of niches. In a nutshell, don't think "How much do I need for AI?" but "What is my target use case, what model is good for that, and what's the best runtime for it?" Then build your rig around that.