Hope you feel better soon
LocalLLaMA
Community to discuss about LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
Thanks. I shouldn't have said I felt sad about it; that's a little hyperbolic. I was just a little bothered. I'm much happier about finding a model that pushes my AI to its maximum potential while still being usable in real time.
I run Mixtral on my CPU
How surprising! I did not see anything about Mistral Small on HN, so I tried it out and it seems pretty good for its size. Thanks for sharing!
Oh, and you HAVE to try the new Qwen 2.5 14B.
The whole lineup is freaking sick, with the 32B outscoring Llama 3.1 70B in a lot of benchmarks, and in personal use it feels super smart.
You can try a smaller IQ3 imatrix quantization to speed it up, but 22B is indeed tight for 8GB.
If someone comes out with an AQLM for it, it might completely fit in VRAM, but I'm not sure it would even work for a Pascal card TBH.
Thanks for the recommendation. Today I tried out Mistral Small IQ4_XS while running kobold in a headless terminal environment to squeeze out that last bit of VRAM. With that, I was able to bump the offloaded GPU layers from 28 to 34. Token speed went up from 2.7 t/s to 3.7 t/s, which is roughly a 37% speed increase. I imagine going to Q3 would make things even faster or allow for a bump in context size.
I appreciate you recommending Qwen too, I'll look into it.
A Qwen 2.5 14B IQ3_M should completely fit in your VRAM, with longish context, with acceptable quality.
An IQ4_XS will just barely overflow but should still be fast at short context.
And while I have not tried it yet, the 14B is allegedly smart.
Also, what I do on my PC is hook up my monitor to the iGPU so the GPU's VRAM is completely empty, lol.
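For anyone who wants to sanity check those fit estimates, here is a rough back-of-the-envelope sketch. It assumes about 14.8B parameters for Qwen 2.5 14B, the nominal bits-per-weight figures that llama.cpp lists for each quant type, an 8 GB card, and a made-up 1 GB allowance for KV cache and buffers (`overhead_gb` is purely a placeholder, tune it for your context size). Real GGUF files run a bit larger because some tensors are kept at higher precision, so treat it as a ballpark only.

```python
# Rough VRAM fit estimate for a quantized GGUF model.
# Assumptions (not from the thread): nominal bits-per-weight values as listed
# by llama.cpp's quantize tool, and a hypothetical fixed overhead for KV cache
# and compute buffers.
BITS_PER_WEIGHT = {
    "IQ3_XXS": 3.06,
    "IQ3_M": 3.66,
    "IQ4_XS": 4.25,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    total_bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


def fits(params_billion: float, quant: str,
         vram_gb: float = 8.0, overhead_gb: float = 1.0) -> bool:
    """True if weights plus the assumed overhead stay within the VRAM budget."""
    return weights_gb(params_billion, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        size = weights_gb(14.8, quant)  # 14.8B is an assumed parameter count
        verdict = "fits" if fits(14.8, quant) else "overflows"
        print(f"{quant}: ~{size:.2f} GB of weights, {verdict} in 8 GB")
```

With these assumptions the IQ3_M lands under the budget and the IQ4_XS lands slightly over it, which lines up with the estimates above.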
Hey @brucethemoose, hope you don't mind if I ping you one more time. Today I loaded up Qwen 14B and 32B. Yes, 32B (Q3_K_S). I didn't do much testing with the 14B, but it spoke well and fast. To be honest, I was more excited to play with the 32B once I found out it would run. It just barely makes the mark of tolerable speed at just under 2 t/s (really more like 1.7 with some context loaded in). I really do mean barely; the people who think 5 t/s is slow would eat their heart out. But that reasoning and coherence? Off the charts. I like the way it speaks more than Mistral Small, too. So wow, just wow, is all I can say. I can't believe all the good models that came out in such a short time and the leaps made in the past two months. Thank you again for recommending Qwen. I don't think I would have tried the 32B without your input.
Good! Try the IQ3_M, IQ3_XS, and IQ3_XXS quantizations as well, especially if you try a 14B, as they "squeeze" the model into less space better than the Q3_K quantizations.
Yeah, I'm liking the 32B as well. If you are looking for speed just for utilitarian Q/A, you might want to keep a DeepSeek Coder V2 Lite GGUF on hand, as it's uber fast partially offloaded.
Read up on the Hermes 3 technical paper and you'll realize it's the best one. Running the 8B model with the correct initial system prompt makes it as smart as GPT-4o.
The linked paper was a good read. Thank you.
Ironically, if you ask ChatGPT to write you an initial system prompt for Hermes that sounds similar to its own, it will essentially share a trade secret with you and give up portions of its system prompt to make your self-hosted 8B LLM perform like a commercial one.