
Do I need industry-grade GPUs, or can I scrape by getting decent tps with a consumer-level GPU?

[-] red@lemmy.zip 2 points 47 minutes ago

All you need is two 3090s, a decent CPU, and a lot of RAM; then you should be able to get around 10 tps.

[-] Sylovik@lemmy.world 4 points 9 hours ago

For LLMs you should look at AirLLM. I suppose there are no convenient integrations with local chat tools yet, but an issue has already been opened at Ollama.

[-] red@lemmy.zip 1 points 49 minutes ago

This is useless; llama.cpp already does what AirLLM does (offloading to the CPU), but it's actually faster. So just use Ollama.
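
For reference, a minimal sketch of talking to a local Ollama server with the `ollama` Python package; the model tag is just an example, and it assumes the server is running on its default port:

```python
import ollama  # pip install ollama; assumes an Ollama server on the default port

# Ollama/llama.cpp decide how many layers fit in VRAM and run the rest
# on the CPU from system RAM -- no manual offloading setup needed.
response = ollama.chat(
    model="llama3.1:70b",  # example tag -- use whatever model you have pulled
    messages=[{"role": "user", "content": "Why does CPU offloading slow down inference?"}],
)
print(response["message"]["content"])
```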

[-] muntedcrocodile@lemm.ee 1 points 9 hours ago

That looks like exactly the sort of thing I want. Is there any existing solution to get it to behave like an Ollama instance? (I have a bunch of services pointed at an Ollama instance running in Docker.)

[-] Sylovik@lemmy.world 2 points 1 hour ago

You may try Harbor. The description claims to provide an OpenAI-compatible API.
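
If it really is OpenAI-compatible, a minimal sketch with the standard `openai` client would look like this; the base URL, API key, and model name are placeholders, so check what your own Harbor/Ollama setup actually exposes:

```python
from openai import OpenAI

# Any OpenAI-compatible server can be used by overriding base_url.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # placeholder -- e.g. Ollama's OpenAI-compatible endpoint
    api_key="not-needed-locally",          # local servers usually ignore the key
)

reply = client.chat.completions.create(
    model="llama3.1:70b",                  # whatever model your server exposes
    messages=[{"role": "user", "content": "Say hi"}],
)
print(reply.choices[0].message.content)
```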

[-] GenderNeutralBro@lemmy.sdf.org 4 points 10 hours ago* (last edited 10 hours ago)

If you're running a consumer level GPU, you'll be operating with 24GB of VRAM max (RTX 4090, RTX 3090, or Radeon 7900XTX).

A 90B model needs roughly 90 GB at 8-bit quantization (plus some extra depending on your context size and general overhead, but as a ballpark estimate, going by the model size is good enough). You would need to drop down to 2-bit quantization to have any hope of fitting it on a single consumer GPU. At that point you'd probably be better off using a smaller model with less aggressive quantization, like a 32B model at 4-bit.
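That ballpark math in code form (the ~20% overhead factor is a rough assumption, not a measured number):

```python
def vram_estimate_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights only, plus ~20% for context/overhead (assumption)."""
    weight_gb = params_billion * bits / 8  # 1B params at 8-bit is roughly 1 GB
    return weight_gb * overhead

print(vram_estimate_gb(90, 8))  # ~108 GB -> far beyond a single 24 GB card
print(vram_estimate_gb(90, 2))  # ~27 GB  -> still doesn't fit in 24 GB
print(vram_estimate_gb(32, 4))  # ~19 GB  -> fits a 3090/4090
```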

So forget about consumer GPUs for that size of model. Instead, you can look at systems with integrated memory, like a Mac with 96-128GB of memory, or something similar. HP has announced a mini PC that might be good, and Nvidia has announced a dedicated AI box as well. Neither of those is available for purchase yet, though.

You could also consider using multiple consumer GPUs. You might be able to get multiple RTX 3090s for cheaper than a Mac with the same amount of memory. But then you'll be using several times more power to run it, so keep that in mind.

[-] hendrik@palaver.p3x.de 5 points 19 hours ago* (last edited 19 hours ago)

I'd say you're looking for something like an 80GB VRAM GPU. That'd be industry grade (an Nvidia A100, for example).

And to squeeze it into 80GB the model would need to be quantized to 4 or 5 bits. There are some LLM VRAM calculators available where you can put in your numbers, like this one.

Another option would be to rent these things by the hour in some datacenter (at about $2 to $3 per hour). Or do inference on a CPU with a wide memory interface. Like an Apple M3 processor or an AMD Epyc. But these are pricey, too. And you'd need to buy them alongside an equal amount of (fast) RAM.
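
A quick sanity check on why memory bandwidth matters so much for those CPU/unified-memory options; the bandwidth figures below are rough, order-of-magnitude assumptions:

```python
def rough_tps(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Token generation is roughly memory-bandwidth-bound: each generated token
    reads (almost) all the weights, so tps ~ bandwidth / model size.
    Ignores KV cache, batching, and compute limits."""
    return mem_bandwidth_gb_s / model_size_gb

# A 90B model at ~4-bit is ~45 GB of weights
print(rough_tps(400, 45))   # ~9 tps at ~400 GB/s (Apple-M-class unified memory, roughly)
print(rough_tps(2000, 45))  # ~44 tps at ~2 TB/s (A100-class HBM, roughly)
```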

[-] ALERT@sh.itjust.works 3 points 18 hours ago
[-] tpWinthropeIII@lemmy.world 3 points 18 hours ago

The new $3,000 Nvidia Digits has 128 GB of fast RAM in an Apple-M4-like unified-memory configuration, reportedly. Nvidia claims it is twice as fast as an Apple setup, at least at inference. Four of these stacked can run a 405B model, again according to Nvidia.

In my case I want the graphics power of a GPU and its VRAM for other purposes as well, so I'd rather buy a graphics card. But regarding a 90B model, I do wonder if it is possible with two A6000s at 64 GB and a 3-bit quant.

[-] muntedcrocodile@lemm.ee 1 points 16 hours ago

Huh, so basically sidestepping the GPU issue entirely and essentially just using some other special piece of silicon with fast (but conventional) RAM. I still don't understand why you can't distribute a large LLM over many different processors, each holding a section of the parameters in memory.

[-] tpWinthropeIII@lemmy.world 1 points 6 hours ago

Not exactly. Digits still uses a Blackwell GPU; it just uses unified RAM as virtual VRAM instead of actual VRAM. The GPU is probably a downclocked Blackwell. Speculation I've seen is that these are defective and repurposed Blackwells, which is good for us. By defective I mean they can't run at full speed or are projected to have the cracking-die problem, etc.

[-] fhein@lemmy.world 2 points 18 hours ago

You have to specify which quantization you find acceptable, and which context size you require. I think the most affordable option to run large models locally is still getting multiple RTX3090 cards, and I guess you probably need 3 or 4 of those depending on quantization and context.

[-] Gladaed@feddit.org 2 points 19 hours ago

Roughly that much VRAM, i.e. 90 GB.

[-] Fisch@discuss.tchncs.de 2 points 20 hours ago

The biggest issue will be your VRAM. If you don't have enough of it (which is very likely; even the 8B models I use need ~10 GB), you'll have to use a GGUF model, which keeps the parts that don't fit in VRAM in system RAM and runs them on the CPU, and that heavily slows it down.
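
For illustration, this is roughly what that GGUF split looks like with llama-cpp-python; the file path, layer count, and context size are made-up examples to tune to your hardware:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA/ROCm support)

llm = Llama(
    model_path="models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=40,  # layers that fit in VRAM; the rest run from system RAM on the CPU
    n_ctx=8192,       # the context window also eats VRAM
)

out = llm("Q: How much VRAM does a 90B model need?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```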

this post was submitted on 10 Jan 2025
11 points (92.3% liked)

LocalLLaMA


Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.
