LocalLLaMA

3208 readers
1 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, and get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks on community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago
MODERATORS

I took a practice math test and would like to have it graded by an LLM, since I can't find the answer key online. I have 20GB of VRAM, but I'm on Intel Arc so I can't run Gemma 3. I would prefer models from ollama.com, 'cause I'm not deep enough down the rabbit hole to try Hugging Face stuff yet and don't have time right now.
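
Not a model recommendation, but for the grading loop itself, here's a minimal sketch using the official ollama Python client. The model tag and prompt wording are placeholder assumptions, not tested picks:

# pip install ollama -- requires a running ollama server with the model pulled
import ollama

MODEL = "qwen2.5:14b"  # placeholder tag; substitute whatever fits your 20GB VRAM

def grade(question, student_answer):
    # Ask the model to work the problem itself, then judge the student answer.
    response = ollama.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You are a strict math grader. "
                "Solve the problem yourself first, then mark the student answer "
                "correct or incorrect and show your working."},
            {"role": "user", "content": f"Question: {question}\n"
                                        f"Student answer: {student_answer}"},
        ],
    )
    return response["message"]["content"]

print(grade("What is the derivative of x^2?", "2x"))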

submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/localllama
 
 

This fork introduces a Radio Station feature where AI generates continuous radio music. The process involves two key components:

  • LLM: generates the lyrics for the songs.
  • ACE: composes the music for the generated lyrics.

Due to the limitations of slower PCs, the demo video includes noticeable gaps (approximately 4 minutes) between the generated songs.

If your computer struggles to stream songs continuously, increasing the buffer size will result in a longer initial delay but fewer gaps between songs (until the buffer is depleted again).
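
That trade-off is the usual bounded producer/consumer one. A minimal sketch of the idea (not the fork's actual code; the function bodies are stand-ins):

import queue
import threading
import time

# Larger buffer = longer initial delay, but fewer gaps until it drains again.
BUFFER_SIZE = 3  # number of pre-generated songs to hold
songs = queue.Queue(maxsize=BUFFER_SIZE)

def generate_song(n):
    # Stand-in for the LLM lyrics + ACE composition step (the slow part).
    time.sleep(2)
    return f"song {n}"

def producer():
    n = 0
    while True:
        song = generate_song(n)
        songs.put(song)  # blocks while the buffer is full
        n += 1

def consumer():
    while True:
        song = songs.get()  # blocks (an audible gap) when the buffer is empty
        print("playing", song)
        time.sleep(1)  # stand-in for playback time

threading.Thread(target=producer, daemon=True).start()
consumer()  # runs forever, like a radio station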

By default the app attempts to load the model file gemma-3-12b-it-abliterated.q4_k_m.gguf from the same directory. However, you can also use alternative LLMs. Note that the quality of the generated lyrics will vary depending on the LLM's capabilities.
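
As a generic sketch of that load-from-the-script's-directory-with-override pattern (again not the fork's actual code), using llama-cpp-python:

# pip install llama-cpp-python
import sys
from pathlib import Path
from llama_cpp import Llama

# Default model sits beside the script; allow an alternative path as argv[1].
default = Path(__file__).parent / "gemma-3-12b-it-abliterated.q4_k_m.gguf"
model_path = Path(sys.argv[1]) if len(sys.argv) > 1 else default

llm = Llama(model_path=str(model_path), n_ctx=4096)
out = llm("Write one line of song lyrics about rain.", max_tokens=64)
print(out["choices"][0]["text"])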

32B olmo-2 03/25 (huggingface.co)
submitted 1 month ago by [email protected] to c/localllama
 
 

Model: OLMo-2 32B (03/25)

https://arxiv.org/abs/2501.00656

"We release all OLMo 2 artifacts openly -- models at 7B and 13B scales, both pretrained and post-trained, including their full training data, training code and recipes, training logs and thousands of intermediate checkpoints. "


Hi, I'm not too informed about LLMs, so I'll appreciate any correction to what I might be getting wrong. I have a collection of books I would like to train an LLM on, so I could use it as a quick source of information on the topics covered by the books. Is this feasible?


Something I always liked about NousResearch is how they seemingly try to understand cognition in a more philosophical, metaphysically symbolic way, and they aren't afraid to let you know it. I think their unique view may allow them to find new perspectives that enable advancement in the field. Check out AscensionMaze in particular; the wording they use is just fascinating.


I'm interested in really leveraging the full capabilities of local AI, for code generation and everything else. Let me know what you people are using.


It's amazing how far open source LLMs have come.

Qwen3-32B recreated the Windows 95 Starfield screensaver as a web app, with the bonus feature of enabling "warp drive" on click. This was generated with reasoning disabled (/no_think), using a 4-bit quant running locally on a 4090.

Here's the result: https://codepen.io/mekelef486/pen/xbbWGpX

Model: Qwen3-32B-Q4_K_M.gguf (Unsloth quant)

Llama.cpp Server Docker Config:

docker run \
-p 8080:8080 \
-v /path/to/models:/models \
--name llama-cpp-qwen3-32b \
--gpus all \
ghcr.io/ggerganov/llama.cpp:server-cuda \
-m /models/qwen3-32b-q4_k_m.gguf \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 65 \
--ctx-size 13000 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0

System Prompt:

You are a helpful expert and aid. Communicate clearly and succinctly. Avoid emojis.

User Prompt:

Create a simple web app that uses javascript to visualize a simple starfield, where the user is racing forward through the stars from a first person point of view like in the old Microsoft screensaver. Stars must be uniformly distributed. Clicking inside the window enables "warp speed" mode, where the visualization speeds up and star trails are added. The app must be fully contained in a single HTML file. /no_think
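
For anyone who'd rather drive the server from a script than the web UI, a minimal sketch against its OpenAI-compatible endpoint (the port matches the docker config above; the bracketed text stands in for the full user prompt):

# pip install requests -- assumes the llama.cpp server above is running
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful expert and aid. "
                "Communicate clearly and succinctly. Avoid emojis."},
            {"role": "user", "content": "Create a simple web app [full prompt above] /no_think"},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
    },
)
print(resp.json()["choices"][0]["message"]["content"])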

Qwen3 "Leaked" (huggingface.co)
submitted 1 month ago by [email protected] to c/localllama
 
 

Qwen3 was apparently posted early, then quickly pulled from HuggingFace and Modelscope. The large ones are MoEs, per screenshots from Reddit:

Including a 235B (22B active) and a 30B (3B active).

Context appears to 'only' be 32K unfortunately: https://huggingface.co/qingy2024/Qwen3-0.6B/blob/main/config_4b.json

But it's possible they're still training them to 256K, per screenshots from Reddit.

Take it all with a grain of salt; configs could change with the official release, but it appears it is happening today.

submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/localllama
 
 

This is one of the "smartest" models you can fit on a 24GB GPU now, with no offloading and very little quantization loss. It feels big and insightful, like a better (albeit dry) Llama 3.3 70B with thinking, and with more STEM world knowledge than QwQ 32B, but it comfortably fits thanks to the new exl3 quantization!

[chart: quantization loss]

You need to use a backend that supports exl3, like (at the moment) text-generation-webui or (soon) TabbyAPI.


I would like my model to know the code libraries I use and help me write code with them. I use llama.cpp's server and web UI for inference, but I have no clue how to get started with RAG, since it seems it is not natively supported by llama.cpp's server implementation. It almost looks like I would need to code my own agent.

I am not interested in commercial offerings or APIs. If you use RAG, how do you do it?
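
For what it's worth, a hand-rolled retrieval loop over llama-server's OpenAI-compatible endpoint is only a few dozen lines. A minimal sketch, assuming the server is on localhost:8080 and using a small local embedding model (the toy corpus, the foolib names, and the model choices are all placeholders):

# pip install sentence-transformers requests numpy
# Assumes llama-server is already running, e.g.:
#   llama-server -m model.gguf --port 8080
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

# Toy "library docs" corpus; in practice, chunk your real docs and readmes.
chunks = [
    "foolib.connect(url) opens a connection and returns a Session object.",
    "Session.query(sql) runs a query and returns rows as dicts.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question, k=2):
    # Cosine similarity: vectors are normalized, so a dot product suffices.
    q = embedder.encode([question], normalize_embeddings=True)[0]
    best = np.argsort(chunk_vecs @ q)[::-1][:k]
    return [chunks[i] for i in best]

def ask(question):
    context = "\n".join(retrieve(question))
    r = requests.post(
        "http://localhost:8080/v1/chat/completions",  # llama-server's OpenAI-style API
        json={"messages": [
            {"role": "system", "content": "Answer using this documentation:\n" + context},
            {"role": "user", "content": question},
        ]},
    )
    return r.json()["choices"][0]["message"]["content"]

print(ask("How do I run a query with foolib?"))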


I'm currently running Gemma 3. It is really good overall, but one thing that is frustrating is the relentless positivity.

Is there a way to make it more critical?

I'm not looking for it to say "that is a shit idea", but less of the "that is a great observation" or "you've made a really insightful point", etc.

If a human was talking like that, I'd be suspicious of their motives. Since it is a machine, I don't think it is trying to manipulate me; I think the programming is just set too positive.

It may also be cultural; as a rule, New Zealanders are less emotive in our communication, and the LLM (to me) feels like an overly positive American.
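
One thing that sometimes helps is steering hard with a system prompt. A minimal sketch using the ollama Python client; the prompt wording and model tag are just illustrative, and how well it sticks varies a lot by model:

# pip install ollama -- assumes a running ollama server with a Gemma 3 tag pulled
import ollama

SYSTEM = (
    "Be blunt and critical. Do not compliment the user or their questions. "
    "Lead with flaws, risks, and counterarguments. No pleasantries."
)

reply = ollama.chat(
    model="gemma3:12b",  # illustrative tag; use whatever build you actually run
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "I plan to store my app's passwords in plaintext."},
    ],
)
print(reply["message"]["content"])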


Seems there's not a lot of talk about relatively unknown finetunes these days, so I'll start posting more!

OpenBuddy's been on my radar, but this one is very interesting: QwQ 32B, post-trained on OpenBuddy's dataset, apparently with QAT applied (though it's kinda unclear) and context-extended. Observations:

  • Quantized with exllamav2, it seems to show lower distortion levels than normal QwQ. It works conspicuously well at 4.0bpw and 3.5bpw.

  • Seems good at long context. Have not tested 200K, but it's quite excellent in the 64K range.

  • Works fine in English.

  • The chat template is funky. It seems to mix up the <think> and <|think|> tags in particular (why don't they just use ChatML?), and needs some wrangling with your own template.

  • Seems smart, can't say if it's better or worse than QwQ yet, other than it doesn't seem to "suffer" below 3.75bpw like QwQ does.

Also, I reposted this from /r/LocalLLaMA, as I feel the community generally should do going forward; with its open-source spirit, it seems like we should be on Lemmy instead.


Just thinking about making this a monthly post: which model are you using? What are the positives and negatives?


The Trump administration is considering new restrictions on the Chinese AI lab DeepSeek that would limit it from buying Nvidia’s AI chips and potentially bar Americans from accessing its AI services, The New York Times reported on Wednesday.


Let's go! Lossless CPU inference

view more: ‹ prev next ›