It's because modern consumer GPUs don't have enough VRAM to load 24B models. I want to run Mistral Small locally.
Maybe take a look at systems with the newer AMD SoCs first. They use the system's RAM and come with a proper NPU; once ollama or mistral.rs support those, they might give you sufficient performance for your needs at a much lower cost (including power consumption). Depending on how NPU support gets implemented, it might even become possible to use the NPU and GPU in tandem, which would probably allow fairly powerful models to run on consumer-grade hardware at reasonable speed.
Thanks, but will NPUs integrated alongside the CPU ever match the performance of a discrete GPU?
Depends on which GPU you compare it with, what model you use, what kind of RAM it has to work with, et cetera. NPUs are purpose-built chips, after all. Unfortunately the whole tech is still very young, so we'll have to wait for stuff like ollama to introduce native support before we can make an apples-to-apples comparison. The raw numbers do look promising, though.
It wouldn't even matter. OP doesn't understand how any of this works, and is instead just running rampant calling everything bullshit 😂
I'd prefer that you reply with examples/an explanation of what I'm doing wrong instead of cursing
Look at all your replies, chum. Everyone is already telling you.
This is false: Mistral Small 24B at Q4_K_M quantization is 15GB; Q8 is 26GB. A 3090/4090/5090 with 24GB, or two cards with 16GB each (I recommend the 4060 Ti 16GB), will run this model fine, and in a single computer. Like others have said, 10GbE would be a huge bottleneck, and it's simply not necessary to distribute a 24B model across multiple machines.
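For anyone who wants to sanity-check those figures, here's a rough back-of-the-envelope sketch in Python. The bits-per-weight values are approximations I'm assuming for llama.cpp-style quants (real GGUF files mix tensor types, so actual sizes vary a bit):

```python
# Rough estimate of weight size / VRAM needed for a quantized model.
# Effective bits-per-weight values are assumed approximations for
# llama.cpp-style quants, not exact file sizes.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def weight_size_gb(params_billions: float, quant: str) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    total_bits = params_billions * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"Mistral Small 24B @ {quant}: ~{weight_size_gb(24, quant):.1f} GB "
          "(plus KV cache and runtime overhead)")
```

That lands at roughly 14-15 GB for Q4_K_M and 25-26 GB for Q8, matching the numbers above; the KV cache then adds a few more GB depending on context length, which is why a single 24GB card is comfortable at Q4_K_M but Q8 wants to be split across two cards.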
Thank you, but which consumer motherboard + CPU combo gives me 32 lanes of PCIe Gen 4 neatly divided into two x16 slots so I can put two GPUs in? I only asked this question because I was going to buy used computers and stuff a GPU in each.
Your point about networking is valid, and I'm hesitant to invest in 25GbE right now.
You don't need the cards to have full bandwidth; the only time it matters is when you're loading the model onto the card. You want a motherboard with x16-length slots, but even x4 connections would be good enough. Running the model doesn't need a lot of bandwidth. Remember, you only load the model once and then reuse it.
An x4 PCIe Gen 4 slot has a theoretical transfer rate of ~7.8 GiB/s (after overhead), and an x16 slot has ~31.5 GiB/s, so disk I/O is likely your limit even for an x4 slot.
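To put those rates in perspective, here's a quick sketch of the one-time load cost for a ~15 GiB model over the links being discussed. All rates are theoretical peaks, and the NVMe figure is an assumption for a fast Gen4 drive, so real-world times will be somewhat longer:

```python
# One-time cost of loading a ~15 GiB model over various links.
# Rates are theoretical peaks (GiB/s); real throughput is lower.
MODEL_SIZE_GIB = 15

LINK_GIB_PER_S = {
    "PCIe 4.0 x4": 7.8,
    "PCIe 4.0 x16": 31.5,
    "10GbE network": 1.16,   # 10 Gbit/s ≈ 1.16 GiB/s
    "Gen4 NVMe SSD": 7.0,    # assumed fast sequential read
}

for link, rate in LINK_GIB_PER_S.items():
    print(f"{link:14s}: ~{MODEL_SIZE_GIB / rate:4.1f} s to load the model")
```

Once the weights are on the card, inference only moves prompts, generated tokens, and (in multi-GPU splits) relatively small activations across the link, which is why an x4 connection is generally fine after loading is done.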
I see. That solves a lot of the headaches I imagined I would have. Thank you so much for clearing that up
I assume you're talking about a CUDA implementation here. There are ways to do this with that system, and even sub-projects that expand on it. I'm mostly pointing out how pointless it is for you to do this. What a waste of time and money.
Edit: others are also pointing this out, but I'm still being downvoted. Mkay.
Used 3090s go for $800. I was planning to wait for Arc B580s to come down in price and buy a few. The reason for the networked setup is that I couldn't find enough PCIe lanes in any of the used computers I was looking at. If there's either an affordable card with good performance and 48GB of VRAM, or an affordable motherboard + CPU combo with plenty of PCIe lanes for under $200, then I'll gladly drop the distributed AI idea. I just need lots of VRAM, and this is the only way I could think of to get it.
Thanks
PLEASE look back at the crypto mining rush of a decade ago. I implore you.
You're buying into something that doesn't exist.