
Do I need industry-grade GPUs, or can I scrape by and get decent TPS with a consumer-level GPU?

[-] Sylovik@lemmy.world 4 points 9 hours ago

For LLMs, you should look at AirLLM. I don't think there are convenient integrations with local chat tools yet, but an issue has already been opened on Ollama.
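For illustration, a minimal sketch of AirLLM's layer-by-layer usage along the lines of its README; the model ID and generation settings here are placeholders, not a tested recipe:

```python
# Minimal AirLLM sketch (assumes the AutoModel API shown in AirLLM's README;
# the model ID and generation settings are illustrative placeholders).
from airllm import AutoModel

MAX_LENGTH = 128

# AirLLM streams layers from disk one at a time, which is how a 70B model
# can run on a single consumer GPU, at the cost of much lower tokens/s.
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_text = ["What is the capital of the United States?"]
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_LENGTH,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(generation_output.sequences[0]))
```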

[-] red@lemmy.zip 1 point 45 minutes ago

This is useless; llama.cpp already does what AirLLM does (offloading to CPU), but it's actually faster. So just use Ollama.
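For comparison, here's roughly how partial offloading looks with llama-cpp-python: the `n_gpu_layers` parameter controls how many layers live on the GPU while the rest stay on the CPU. The model path and layer count below are just placeholders for your own setup:

```python
# Partial GPU offload with llama-cpp-python: layers that don't fit in VRAM
# stay on the CPU. Model path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b.Q4_K_M.gguf",  # any local GGUF file
    n_gpu_layers=20,  # number of transformer layers to keep on the GPU
)

out = llm("Q: How many planets are in the solar system? A:", max_tokens=32)
print(out["choices"][0]["text"])
```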

[-] muntedcrocodile@lemm.ee 1 point 8 hours ago

That looks like exactly the sort of thing I want. Is there any existing solution to get it to behave like an Ollama instance? (I have a bunch of services pointed at an Ollama instance running in Docker.)

[-] Sylovik@lemmy.world 2 points 1 hour ago

You could try Harbor. The description claims it provides an OpenAI-compatible API.
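As a rough sketch of how an OpenAI-compatible endpoint slots into existing services (the base URL, API key, and model name are assumptions; adjust them to whatever Harbor, or Ollama's own /v1 endpoint, exposes in your deployment):

```python
# Pointing the standard OpenAI client at a local OpenAI-compatible server.
# Base URL and model name are assumptions; adjust for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # e.g. Ollama's OpenAI-compatible API
    api_key="not-needed-locally",          # most local servers ignore the key
)

resp = client.chat.completions.create(
    model="llama3",  # placeholder model name
    messages=[{"role": "user", "content": "Hello from a local endpoint"}],
)
print(resp.choices[0].message.content)
```

Anything that already speaks the OpenAI API (most local chat frontends do) can then be repointed by changing only the base URL.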
