What is a good model that runs on 6GB VRAM?
(discuss.online)
A community for discussing LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
Try the IQ4_XS quant of Mistral Nemo.
If you want a more roleplay-oriented model with more creativity, at the cost of other capabilities, you can try the ArliAI finetune of Nemo.
If you want the model to remember things over a long conversation, you need to increase its context size. Context takes VRAM too, so you can offload fewer layers to the GPU to make room for it, drop to a smaller quant, or switch to a smaller model like Llama 8B.
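To make that trade-off concrete, here is a rough back-of-the-envelope sketch in Python. The numbers are my assumptions, not measurements: roughly 12.2B parameters, 40 layers, 8 KV heads, and a head dimension of 128 for Mistral Nemo, about 4.25 bits per weight for IQ4_XS, an fp16 KV cache, and a hypothetical 512 MiB set aside for compute buffers. It also pretends every layer costs the same, so treat the result as a starting point for tuning, not an exact answer.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of the KV cache: K and V each hold n_kv_heads * head_dim
    values per token per layer (bytes_per_elem=2 assumes fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights."""
    return n_params * bits_per_weight / 8

def gpu_layers_that_fit(vram_bytes, n_params, bits_per_weight, n_layers,
                        n_kv_heads, head_dim, n_ctx,
                        overhead_bytes=512 * 1024 ** 2):
    """Rough count of layers that fit in VRAM, assuming each layer costs
    an equal share of the weights plus an equal share of the KV cache.
    overhead_bytes is a guessed reserve for buffers and the desktop."""
    total = (weight_bytes(n_params, bits_per_weight)
             + kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx))
    per_layer = total / n_layers
    usable = max(vram_bytes - overhead_bytes, 0)
    return min(n_layers, int(usable // per_layer))

if __name__ == "__main__":
    # Assumed Mistral Nemo shape and IQ4_XS bit width (see note above).
    for n_ctx in (4096, 8192, 16384, 32768):
        layers = gpu_layers_that_fit(6 * GIB, 12.2e9, 4.25, 40, 8, 128, n_ctx)
        kv_gib = kv_cache_bytes(40, 8, 128, n_ctx) / GIB
        print(f"ctx={n_ctx:6d}  kv_cache={kv_gib:.2f} GiB  gpu_layers~{layers}")
```

Running it prints one line per context size, so you can see the estimated number of GPU layers shrink as the context grows.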
Can't you just increase context length at the cost of paging and slowdown?
At some point you'll run out of VRAM on the GPU. Offloading some of the model's layers to system RAM makes room for more context, but it also makes inference slower.
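If you run the model with llama.cpp, the two knobs in question are `--n-gpu-layers` and `--ctx-size`. A minimal sketch that just builds the command line, assuming the `llama-cli` binary is on your PATH; the model filename and the example values are hypothetical placeholders:

```python
def llama_cli_args(model_path, n_gpu_layers, ctx_size):
    """Build an argument list for llama.cpp's llama-cli.
    Fewer GPU layers leaves more VRAM free for a larger context."""
    return [
        "llama-cli",
        "-m", model_path,                     # path to the GGUF file
        "--n-gpu-layers", str(n_gpu_layers),  # layers kept on the GPU
        "--ctx-size", str(ctx_size),          # context window in tokens
    ]

if __name__ == "__main__":
    # Hypothetical filename and values; tune the numbers for your card.
    args = llama_cli_args("mistral-nemo-IQ4_XS.gguf", 28, 16384)
    print(" ".join(args))
    # To actually launch it: import subprocess; subprocess.run(args, check=True)
```

Lower the layer count until the out-of-memory errors stop, then check whether the generation speed is still tolerable.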
Yes, but if he's world building, a larger, slower model might just be an acceptable compromise.
I was getting OOM errors doing speech-to-text on my 4070 Ti. I know (now) that I should have gone for the 3090 Ti. Such is life.
At a certain point, so many layers get pushed to system RAM that inference becomes incredibly slow. You don't want to wait hours for the model to generate a single response.