If you're running a consumer-level GPU, you'll be operating with 24GB of VRAM at most (RTX 4090, RTX 3090, or Radeon 7900XTX).
90b model ≈ 90GB at 8-bit quantization (roughly one byte per weight, plus some extra depending on your context size and general overhead, but as a ballpark estimate, going by the model size is good enough). You would need to drop down to 2-bit quantization to have any hope of fitting it on a single consumer GPU. At that point you'd probably be better off using a smaller model with less aggressive quantization, like a 32b model at 4-bit quantization.
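If you want to sanity-check that arithmetic yourself, here's a quick back-of-the-envelope sketch in Python (the 1.2 overhead factor is just an assumed fudge for context/KV cache and runtime overhead, not a measured number):

```python
def approx_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Weights-only estimate, scaled by an assumed overhead factor for context/KV cache."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 * overhead

for bits in (16, 8, 4, 2):
    print(f"90B @ {bits}-bit: ~{approx_vram_gb(90, bits):.0f} GB")
# 90B @ 16-bit: ~216 GB
# 90B @ 8-bit: ~108 GB
# 90B @ 4-bit: ~54 GB
# 90B @ 2-bit: ~27 GB
```

Even at 2-bit you're right at the edge of a single 24GB card once context is factored in, which is why the smaller-model-at-4-bit route usually wins.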
So forget about consumer GPUs for that size of model. Instead, look at systems with integrated (unified) memory, like a Mac with 96-128GB of memory, or something similar. HP has announced a mini PC that might work well, and Nvidia has announced a dedicated AI box as well, but neither of those is available for purchase yet.
You could also consider using multiple consumer GPUs. You might be able to get multiple RTX 3090s for cheaper than a Mac with the same amount of memory. But then you'll be using several times more power to run it, so keep that in mind.
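If you do go the multi-GPU route, here's a rough sketch of what loading can look like with Hugging Face transformers + bitsandbytes (the model name is a placeholder, and the 4-bit config is just one reasonable choice, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-90b-model"  # placeholder, swap in the model you actually want

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights to cut VRAM to roughly a quarter of fp16
    bnb_4bit_compute_dtype=torch.float16,  # compute still happens in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # lets accelerate shard the layers across every GPU it can see
)
```

Going by the numbers above, a 4-bit 90b wants ~50GB+, so you're realistically looking at three 3090s to run it comfortably, with the power draw to match.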