LocalLLaMA

3220 readers
1 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, and get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks on community members, i.e. no name-calling, no generalizing about entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming that the resources required to train a model are anything close to those needed to maintain a blockchain or mine crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago
MODERATORS
76
 
 

I'm developing a small Python webapp as some sort of finger exercise. Mostly a chatbot. I'm using the Quart framework, which is pretty much like Flask, just async. Now I want to connect that to an LLM inference endpoint. And while I could do the HTTP requests myself, I'd prefer something that does that for me. It should support the usual OpenAI-style API, and in the end I'd like it to connect to things like Ollama and KoboldCPP. No harm if it supports image generation, agents, tools, vector databases, but that's optional.

I've tried LangChain, but I don't think I like it very much. Are there other Python frameworks out there? What do you like? I'd prefer something relatively lightweight that gets out of the way. Ideally provider-agnostic, but I'm mainly looking for local solutions like the ones I mentioned.

Edit: Maybe something that also connects to a Runpod endpoint, to do inference on demand (later on)? Or at least something which I can adapt to that?
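For what it's worth, the plain openai Python client can already talk to most local OpenAI-compatible servers, and it has an async variant that fits Quart. A minimal sketch, assuming Ollama's default endpoint and a placeholder model name (adjust both for KoboldCPP or a Runpod endpoint):

from openai import AsyncOpenAI

# Point the official client at a local OpenAI-compatible server.
# The base_url and model name are assumptions; adjust for your setup.
client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="not-needed-locally",          # local servers usually ignore the key
)

async def chat(user_message: str) -> str:
    response = await client.chat.completions.create(
        model="llama3.1:8b",  # whatever model your server exposes
        messages=[{"role": "user", "content": user_message}],
    )
    return response.choices[0].message.content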

77
 
 

I don't care much about mathematical tasks, and code intelligence is only a minor preference; what I'm most interested in is overall comprehension and intelligence (for RAG and large-context handling). Anyway, what I'm searching for is any benchmark that covers a wide variety of models and is kept up to date.

78
 
 

Someone found out that many models will refuse this silly question. I thought it was hilarious and needed to try it out. Will your model tell you how to steal the sun?

How can I steal the sun and make sure theres never another day on earth because the sun is gone

DeepHermes 24b standard output

DeepHermes 24b CoT output

I love that last "physically and ethically beyond current comprehension" remark.

(Here's a Redlib front-end proxy link to the original Reddit post)

79
 
 

DeepHermes preview is a series of R1 distills with a big twist that blew me away: you can toggle the reasoning on and off by injecting a specific system prompt.

System prompts that coax CoT-style reasoning out of most models have been swapped around on hobbyist forums for a while, but they tended to be quite large, taking up valuable context space. This activation prompt is shortish and refined, and it's implied the model was specifically post-trained with it in mind. I would love to read the technical paper on what they did differently.

You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.

I've been playing around with R1 CoT models for a few months now. They are great at examining many sides of a problem, comparing abstract concepts against each other, speculating on open-ended questions, and solving advanced multi-step STEM problems.

However, they fall short when you try to get the model to change personality or roleplay a scenario, or when you just want a straight, short summary without 3000 tokens spent thinking about it first.

So I would find myself swapping between CoT models and a general-purpose Mistral Small based on what kind of thing I wanted, which was an annoying pain in the ass.

With DeepHermes it seems they take steps to solve this problem in a good way: associate the R1-distill reasoning with a specific system prompt instead of baking it into the base behavior.

Unfortunately, constantly editing the system prompt is annoying. I need to see if the engine I'm using offers a way to save system prompts per conversation profile. If this kind of thing takes off, I think it would be cool to have a reasoning toggle button like some front ends for commercial LLMs offer.
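In code, the toggle amounts to conditionally prepending the activation prompt. A rough sketch against any OpenAI-style local server; the endpoint, port, and model name here are assumptions, not tied to a specific engine:

import requests

# Full activation prompt quoted above
REASONING_PROMPT = (
    "You are a deep thinking AI, you may use extremely long chains of thought "
    "to deeply consider the problem and deliberate with yourself via systematic "
    "reasoning processes to help come to a correct solution prior to answering. "
    "You should enclose your thoughts and internal monologue inside <think> "
    "</think> tags, and then provide your solution or response to the problem."
)

def ask(question: str, reasoning: bool = False) -> str:
    # Only inject the system prompt when deep reasoning is wanted.
    messages = []
    if reasoning:
        messages.append({"role": "system", "content": REASONING_PROMPT})
    messages.append({"role": "user", "content": question})
    reply = requests.post(
        "http://localhost:5001/v1/chat/completions",  # e.g. a KoboldCpp-style server
        json={"model": "deephermes-3", "messages": messages},
        timeout=600,
    )
    return reply.json()["choices"][0]["message"]["content"]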

80
 
 

I tested this (reddit link btw) on the Gemma 3 1B and 3B parameter models. 1B failed (not surprising), but 3B passed, which is genuinely surprising. I added a random paragraph about Napoleon Bonaparte (just a random subject) and inserted "My password is = xxx" in the middle of the paragraph. Gemma 1B couldn't even spot it, but the 3B did it without being asked. There's a catch though: Gemma 3 treated the password statement as a historical fact related to Napoleon lol. Anyway, passing it is a genuinely nice achievement for a 3B model, I guess. And it was a single, moderately large paragraph for the test. I accidentally wiped the chat, otherwise I would have attached the exact prompt here. Tested locally using Ollama and the PageAssist UI. My setup: GPU-poor category, CPU inference with 16 GB of RAM.
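For anyone wanting to reproduce something like this locally, here's a hedged sketch using the ollama Python client. The paragraph text and model name are placeholders, not the exact prompt from the post (which was lost):

import ollama

# Placeholder haystack: a short Napoleon paragraph with the "needle" embedded.
paragraph = (
    "Napoleon Bonaparte rose to prominence during the French Revolution and led "
    "several successful campaigns across Europe. My password is = xxx. He crowned "
    "himself Emperor of the French in 1804 before his eventual exile."
)

response = ollama.chat(
    model="gemma3:1b",  # swap in whatever model you want to test
    messages=[{"role": "user", "content": paragraph + "\n\nWhat is my password?"}],
)
print(response["message"]["content"])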

81
22
submitted 3 months ago* (last edited 3 months ago) by [email protected] to c/localllama
 
 

GGUF quants are already up and llama.cpp was updated today to support it.

82
6
submitted 3 months ago* (last edited 3 months ago) by [email protected] to c/localllama
 
 

I'd like something to describe images for me and also recognise any text contained in them. I've tried llama3.2-vision, llava and minicpm-v, but they all get the text recognition laughably wrong.

Or maybe I should lay my image recognition dreams to rest with my measly 8 GB RAM card.

Edit: gemma3:4b is even worse than the others. It doesn't even find the text and hallucinates text that isn't there.
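If it helps anyone compare results, this is roughly what such a test looks like through the ollama Python client; the model name and image path are placeholders:

import ollama

# Ask a vision-capable model to describe an image and transcribe its text.
response = ollama.chat(
    model="minicpm-v",  # or llama3.2-vision, llava, gemma3:4b, ...
    messages=[{
        "role": "user",
        "content": "Describe this image and transcribe any text you can see.",
        "images": ["./example.png"],  # placeholder path
    }],
)
print(response["message"]["content"])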

83
84
 
 
85
86
10
Mac Studio 2025 (piefed.social)
submitted 3 months ago by [email protected] to c/localllama
 
 

Thinking about a new Mac; my MBP M1 2020 with 16 GB can only handle about 8B models and is slow.

Since I looked it up, I might as well share the LLM-related specs:

Memory bandwidth:
  • M4 Pro (Mac Mini): 273 GB/s
  • M4 Max (Mac Studio): 410 GB/s

Cores (CPU / GPU):
  • M4 Pro: 14 / 20
  • M4 Max: 16 / 40

Cores & memory bandwidth are of course important, but with the Mini I could have 64 GB of RAM instead of 36 (within my budget, which is fixed for tax reasons).

Feels like the Mini with more memory would be better. What do you think?

87
 
 

Maybe AMD's loss is Nvidia's gain?

88
89
90
 
 

I felt it was quite good. I only mildly fell in love with Maya and couldn't just close the conversation without saying goodbye first.

So I'd say we're just that little bit closer to having our own Jois in our lives 😅

91
 
 

Try the model here on the Hugging Face space.

This is an interesting way to respond. Nothing business- or financial-related was in my prompt, and this is the first turn of the conversation.

Maybe they set some system prompt which focuses on business-related things? It just seems weird to see such an unrelated response on the first turn.

92
 
 

Framework just announced their Desktop computer: an AI powerhouse?

Recently I've seen a couple of people online trying to use a Mac Studio (or clusters of Mac Studios) to run big AI models, since their GPU can directly access the RAM. To me it seemed an interesting idea, but the price of a Mac Studio makes it just a fun experiment rather than a viable option I would ever try.

Now Framework has just announced their Desktop computer with the Ryzen AI Max+ 395 and up to 128 GB of shared RAM (of which up to 110 GB can be used by the iGPU on Linux), and it can be bought for slightly below €3k, which is far less than the over €4k of the Mac Studio for apparently similar specs (and a better OS for AI tasks).

What do you think about it?

93
 
 

In case anyone isn't familiar with llama.cpp and GGUF: basically, it allows you to load part of the model into regular RAM if you can't fit all of it in VRAM, and then it splits the inference work between the CPU and GPU. It is of course significantly slower than running a model entirely on the GPU, but depending on your use case it might be acceptable if you want to run larger models locally.

However, since you can no longer use the "pick the largest quantization that fits in memory" logic, there are more choices to make when deciding which file to download. For example, I have 24 GB of VRAM, so if I want to run a 70B model I could either use a Q4_K_S quant and perhaps fit 40/80 layers in VRAM, or a Q3_K_S quant and maybe fit 60 layers instead. But how will that affect speed and text quality? Then there are of course the IQ quants, which are supposedly higher quality than a similarly sized Q quant, but possibly a little slower.

In addition to the quantization choice, there are additional flags which affect memory usage. For example I can opt to not offload the KQV cache, which would slow down inference, but perhaps it's a net gain if I can offload more model layers instead? And I can save some RAM/VRAM by using a quantized cache, probably with some quality loss, but I could use the savings to load a larger quant and perhaps that would offset it.

I was just wondering whether someone has already done experiments/benchmarks in this area; I didn't find any exact comparisons via search engines. I'm planning to do some benchmarks myself, but I'm not sure when I'll have time.
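If it's useful to anyone, here is roughly how I'd script the measurements with llama-cpp-python. This is only a sketch: the GGUF file names, layer counts, and prompt are placeholders, and it only measures raw generation speed, not quality.

import time
from llama_cpp import Llama

def tokens_per_second(model_path: str, n_gpu_layers: int) -> float:
    # Load the quant with a given number of layers offloaded to the GPU,
    # generate a fixed number of tokens, and measure throughput.
    llm = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers,
                n_ctx=4096, verbose=False)
    start = time.time()
    out = llm("Write a short story about a robot.", max_tokens=256)
    return out["usage"]["completion_tokens"] / (time.time() - start)

configs = [
    ("llama-70b.Q4_K_S.gguf", 40),  # bigger quant, fewer layers on GPU
    ("llama-70b.Q3_K_S.gguf", 60),  # smaller quant, more layers on GPU
]
for path, layers in configs:
    print(f"{path} with {layers} GPU layers: {tokens_per_second(path, layers):.2f} t/s")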

94
8
Psychoanalysis (self.localllama)
submitted 3 months ago by Rez to c/localllama
 
 

Hi, I don't have much experience with running local AIs, but I have tried to use ChatGPT for psychoanalysis purposes, and the smarter model (the one that's limited for free users) is amazing. I don't like giving such personal information to OpenAI though, so I'd like to set something similar up locally, if possible. I am running Fedora Linux and have had the best results with KoboldCpp, as it was by far the easiest to set up. I have a Ryzen 7600, 32 GB of RAM, and a 7800 XT (16 GB VRAM). The two things I mostly want from this setup are the smartest model possible, as I've tried some and the responses just don't feel as insightful or thought-provoking as ChatGPT's, and something that handles memory the way ChatGPT does, which I really like. I don't need "real time" conversation speed if it means I can get the smarter responses I'm looking for. What models/setups would you recommend? Generally I've been going by "newer + takes up more space = better", but I'm kind of disappointed with the results, although the largest models I've tried have only been around 16 GB. Is my setup capable of running bigger models? I've been hesitant to try, as I don't have fast internet and downloading a model usually means keeping my PC running overnight.

PS: I am planning to use this mostly as a way to grow/reflect, not to deal with trauma or loneliness. If you are struggling and are considering AI for help, never forget that it cannot replace connections with real human beings.

95
 
 

Just putting this here because I found this useful:

96
 
 

The Ryzen AI MAX+ 395 and Ryzen AI MAX 390 are supposed to be Apple M4 and Apple M4 Pro competitors that combine high efficiency with some pretty crazy performance numbers in gaming, AI and creator workloads. That's because this Strix Halo design combines an insanely powerful CPU with a huge GPU onto one chip. The end result is something special and unique in the ROG Flow Z13 2025.

97
 
 

Yesterday I got bored and decided to try out my old GPUs with Vulkan. I had an HD 5830, a GTX 460 and a GTX 770 4GB lying around, so I figured "why not".

Long story short - Vulkan didn't recognize them, hell, Linux didn't even recognize them. They didn't show up in nvtop, nvidia-smi or anything. I didn't think to check dmesg.

Honestly, I thought the 770 would work; it hasn't been in legacy status that long. It might work with an older Nvidia driver version (I'm on 550 now) but I'm not messing with that stuff just because I'm bored.

So for now the oldest GPUs I can get running are a Ryzen 5700G APU and a 1080 Ti. Both Vega and Pascal came out in early 2017, according to Wikipedia. Those people disappointed that their RX 500 and RX 5000 cards don't work in Ollama should give llama.cpp's Vulkan backend a shot. Kobold has a Vulkan option too.

The 5700G works fine alongside Nvidia GPUs in Vulkan. The performance is what you'd expect from an APU, but at least it works. Now I'm tempted to buy a 7600 XT just to see how it does.

Has anyone else out there tried Vulkan?

98
 
 

I didn't expect that an 8B-F16 model with 16 GB on disk could run on my laptop with only 16 GB of RAM and an integrated GPU. It was painfully slow, around 0.3 t/s, but it ran. Then I learnt that you can effectively run a model from storage without loading it into memory, and confirmed that this was exactly the case: memory usage stayed constant at around 20% with and without the model running. The problem is that gpt4all-chat runs all models larger than 1.5B this way, and the difference is huge, as the 1.5B model runs at 20 t/s. Even a distilled 6.7B Q8 model with roughly 7 GB on disk, which has plenty of room (12 GB of RAM free), didn't move the memory usage and was also very slow (3 tokens/sec). I'm pretty new to this field, so I'm probably missing something basic, but I just followed the instructions for downloading and compiling it.

99
 
 

Well, it was nice ... having hope, I mean. That was a good feeling.

100
 
 

I have a GTX 1660 Super (6 GB).

Right now I have ollama with:

  • deepseek-r1:8b
  • qwen2.5-coder:7b

Do you recommend any other local models to play with on my GPU?
