this post was submitted on 21 Feb 2025
LocalLLaMA
Thank you so much for the suggestion! I tried Q8 of the model you mentioned, and I am very impressed with the results! The output itself was exactly what I wanted, though the speed was a little on the slower side. Loading my previous conversation with a context of over 15k tokens took about 10 minutes to produce the first response, but later messages were much faster. The web UI loses connection almost every time, though, so I just manually copy the response from the terminal window into the web UI to save it for future context. I am currently downloading the Q6 model, and might experiment with going even lower for faster speeds and more stability, if the quality of the output doesn't degrade too much.
Q4 will give you about 98% of the quality of Q8 at roughly twice the speed, plus room for much longer context lengths.
If you don't need the full context length, you can try loading the model with a shorter context window; that frees up VRAM so you can offload more layers to the GPU, which makes inference faster.
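For example, here's a minimal sketch using llama-cpp-python with a GGUF file; the model path, context size, and layer count are placeholders you'd tune for your own hardware:

```python
# Sketch: trade context length for GPU offload with llama-cpp-python.
# The model path, n_ctx, and n_gpu_layers are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model-Q4_K_M.gguf",  # hypothetical path
    n_ctx=8192,        # shorter context window -> smaller KV cache -> less VRAM needed
    n_gpu_layers=35,   # with the saved VRAM you can offload more layers to the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize our conversation so far."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```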
And you can usually configure your inference engine to keep the model loaded at all times, so you don't lose so much time on the first request after starting it up.
Ollama attempts to dynamically pick the right context length for your request, but in my experience that just results in really inconsistent and long times to first token.
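To illustrate both points with Ollama, here's a hedged sketch against its REST API: keep_alive keeps the model resident and options.num_ctx pins the context length instead of letting it vary per request. The model name below is a placeholder.

```python
# Sketch: call Ollama's REST API with the model kept loaded and a fixed context length.
# The model name is a placeholder; adjust num_ctx to what fits in your VRAM.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "your-model:latest",      # hypothetical model name
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
        "keep_alive": -1,                  # keep the model loaded indefinitely
        "options": {"num_ctx": 16384},     # pin the context length explicitly
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```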
The nice thing about vLLM is that your model is always loaded, so you don't have to worry about that. But then again, it needs much more VRAM.
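For reference, a minimal vLLM sketch; the model name, max_model_len, and gpu_memory_utilization are placeholders. vLLM pre-allocates VRAM up front and keeps the model resident for the life of the process, which is why it avoids the load-time problem but costs more memory.

```python
# Sketch: vLLM loads the model once and keeps it resident for the life of the process.
# Model name and memory settings are placeholders, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-model",     # hypothetical Hugging Face model id
    max_model_len=16384,             # cap the context to shrink the KV-cache reservation
    gpu_memory_utilization=0.90,     # fraction of VRAM vLLM is allowed to pre-allocate
)

outputs = llm.generate(
    ["Summarize our conversation so far."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```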