
Do I need industry-grade GPUs, or can I scrape by getting decent tps with a consumer-level GPU?

[-] red@lemmy.zip 2 points 47 minutes ago

All you need is two 3090s, a decent CPU, and a lot of RAM; then you should be able to get around 10 tps.

[-] Sylovik@lemmy.world 4 points 9 hours ago

For LLMs you should look at AirLLM. I suppose there are no convenient integrations with local chat tools yet, but an issue has already been opened at Ollama.

[-] red@lemmy.zip 1 points 49 minutes ago

This is useless; llama.cpp already does what AirLLM does (offloading to the CPU), but it's actually faster. So just use Ollama.
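
For reference, a minimal sketch of talking to a local Ollama server with the `ollama` Python package; the model tag is just an example, and it assumes the server is running on its default port:

```python
import ollama  # pip install ollama; assumes an Ollama server on the default port

# Ollama/llama.cpp decide how many layers fit in VRAM and run the rest
# on the CPU from system RAM -- no manual offloading setup needed.
response = ollama.chat(
    model="llama3.1:70b",  # example tag -- use whatever model you have pulled
    messages=[{"role": "user", "content": "Why does CPU offloading slow down inference?"}],
)
print(response["message"]["content"])
```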

[-] muntedcrocodile@lemm.ee 1 points 9 hours ago

That looks like exactly the sort of thing I want. Is there any existing solution to get it to behave like an Ollama instance? (I have a bunch of services pointed at an Ollama instance running in Docker.)

[-] Sylovik@lemmy.world 2 points 1 hour ago

You may try Harbor. The description claims to provide an OpenAI-compatible API.
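
If it really is OpenAI-compatible, a minimal sketch with the standard `openai` client would look like this; the base URL, API key, and model name are placeholders, so check what your own Harbor/Ollama setup actually exposes:

```python
from openai import OpenAI

# Any OpenAI-compatible server can be used by overriding base_url.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # placeholder -- e.g. Ollama's OpenAI-compatible endpoint
    api_key="not-needed-locally",          # local servers usually ignore the key
)

reply = client.chat.completions.create(
    model="llama3.1:70b",                  # whatever model your server exposes
    messages=[{"role": "user", "content": "Say hi"}],
)
print(reply.choices[0].message.content)
```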

[-] GenderNeutralBro@lemmy.sdf.org 4 points 10 hours ago* (last edited 10 hours ago)

If you're running a consumer level GPU, you'll be operating with 24GB of VRAM max (RTX 4090, RTX 3090, or Radeon 7900XTX).

A 90B model needs roughly 90 GB at 8-bit quantization (plus some extra depending on your context size and general overhead, but as a ballpark estimate, going by the model size is good enough). You would need to drop down to 2-bit quantization to have any hope of fitting it on a single consumer GPU. At that point you'd probably be better off using a smaller model with less aggressive quantization, like a 32B model at 4-bit.
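That ballpark math in code form (the ~20% overhead factor is a rough assumption, not a measured number):

```python
def vram_estimate_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights only, plus ~20% for context/overhead (assumption)."""
    weight_gb = params_billion * bits / 8  # 1B params at 8-bit is roughly 1 GB
    return weight_gb * overhead

print(vram_estimate_gb(90, 8))  # ~108 GB -> far beyond a single 24 GB card
print(vram_estimate_gb(90, 2))  # ~27 GB  -> still doesn't fit in 24 GB
print(vram_estimate_gb(32, 4))  # ~19 GB  -> fits a 3090/4090
```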

So forget about consumer GPUs for that size of model. Instead, you can look at systems with integrated memory, like a Mac with 96-128GB of memory, or something similar. HP has announced a mini PC that might be good, and Nvidia has announced a dedicated AI box as well. Neither of those is available for purchase yet, though.

You could also consider using multiple consumer GPUs. You might be able to get multiple RTX 3090s for cheaper than a Mac with the same amount of memory. But then you'll be using several times more power to run it, so keep that in mind.

[-] hendrik@palaver.p3x.de 5 points 19 hours ago* (last edited 19 hours ago)

I'd say you're looking for something like an 80GB VRAM GPU. That'd be industry grade (an Nvidia A100, for example).

And to squeeze it into 80GB the model would need to be quantized to 4 or 5 bits. There are some LLM VRAM calculators available where you can put in your numbers, like this one.

Another option would be to rent these things by the hour in some datacenter (at about $2 to $3 per hour). Or do inference on a CPU with a wide memory interface. Like an Apple M3 processor or an AMD Epyc. But these are pricey, too. And you'd need to buy them alongside an equal amount of (fast) RAM.
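
A quick sanity check on why memory bandwidth matters so much for those CPU/unified-memory options; the bandwidth figures below are rough, order-of-magnitude assumptions:

```python
def rough_tps(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Token generation is roughly memory-bandwidth-bound: each generated token
    reads (almost) all the weights, so tps ~ bandwidth / model size.
    Ignores KV cache, batching, and compute limits."""
    return mem_bandwidth_gb_s / model_size_gb

# A 90B model at ~4-bit is ~45 GB of weights
print(rough_tps(400, 45))   # ~9 tps at ~400 GB/s (Apple-M-class unified memory, roughly)
print(rough_tps(2000, 45))  # ~44 tps at ~2 TB/s (A100-class HBM, roughly)
```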

[-] ALERT@sh.itjust.works 3 points 18 hours ago
[-] tpWinthropeIII@lemmy.world 3 points 18 hours ago

The new $3,000 Nvidia Digits has 128 GB of fast RAM in an Apple-M4-like unified-memory configuration, reportedly. Nvidia claims it is twice as fast as an Apple setup, at least at inference. Four of these stacked can run a 405B model, again according to Nvidia.

In my case I want the graphics power of a GPU and its VRAM for other purposes as well, so I'd rather buy a graphics card. But regarding a 90B model, I do wonder if it is possible with two A6000s at 64 GB and a 3-bit quant.

[-] muntedcrocodile@lemm.ee 1 points 16 hours ago

Huh, so basically sidestepping the GPU issue entirely and essentially just using some other special piece of silicon with fast (but conventional) RAM. I still don't understand why you can't distribute a large LLM over many different processors, each holding a section of the parameters in memory.

[-] tpWinthropeIII@lemmy.world 1 points 6 hours ago

Not exactly. Digits still uses a Blackwell GPU; it just uses unified RAM as virtual VRAM instead of actual VRAM. The GPU is probably a downclocked Blackwell. Speculation I've seen is that these are defective and repurposed Blackwells, which is good for us. By defective I mean they can't run at full speed or are projected to have the cracking-die problem, etc.

[-] fhein@lemmy.world 2 points 18 hours ago

You have to specify which quantization you find acceptable, and which context size you require. I think the most affordable option to run large models locally is still getting multiple RTX3090 cards, and I guess you probably need 3 or 4 of those depending on quantization and context.

[-] Gladaed@feddit.org 2 points 19 hours ago

Roughly that much VRAM, i.e. 90 GB.

[-] Fisch@discuss.tchncs.de 2 points 20 hours ago

The biggest issue will be your VRAM. If you don't have enough of it (which is very likely; even the 8B models I use need ~10 GB), you'll have to use a GGUF model, which keeps the parts that don't fit in VRAM in system RAM and runs them on the CPU, and that heavily slows it down.
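
For illustration, this is roughly what that GGUF split looks like with llama-cpp-python; the file path, layer count, and context size are made-up examples to tune to your hardware:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA/ROCm support)

llm = Llama(
    model_path="models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=40,  # layers that fit in VRAM; the rest run from system RAM on the CPU
    n_ctx=8192,       # the context window also eats VRAM
)

out = llm("Q: How much VRAM does a 90B model need?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```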

this post was submitted on 10 Jan 2025
11 points (92.3% liked)

LocalLLaMA


Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.
