LocalLLaMA

2656 readers

Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

founded 2 years ago
1
 
 

Trying something new, going to pin this thread as a place for beginners to ask what may or may not be stupid questions, to encourage both the asking and answering.

Depending on activity level I'll either make a new one once in a while, or I'll just leave this one up forever as a place to learn and ask.

When asking a question, try to make it clear what your current knowledge level is and where you may have gaps; that should help people provide more useful, concise answers!

2
 
 

cross-posted from: https://lemmy.world/post/2219010

Hello everyone!

We have officially hit 1,000 subscribers! How exciting!! Thank you for being a member of [email protected]. Whether you're a casual passerby, a hobby technologist, or an up-and-coming AI developer - I sincerely appreciate your interest and support in a future that is free and open for all.

It can be hard to keep up with the rapid developments in AI, so I have decided to pin this at the top of our community to be a frequently updated LLM-specific resource hub and model index for all of your adventures in FOSAI.

The ultimate goal of this guide is to become a gateway resource for anyone looking to get into free open-source AI (particularly text-based large language models). I will be doing a similar guide for image-based diffusion models soon!

In the meantime, I hope you find what you're looking for! Let me know in the comments if there is something I missed so that I can add it to the guide for everyone else to see.


Getting Started With Free Open-Source AI

Have no idea where to begin with AI / LLMs? Try starting with our Lemmy Crash Course for Free Open-Source AI.

When you're ready to explore more resources, see our FOSAI Nexus - a hub for all of the major FOSS & FOSAI projects on the cutting/bleeding edge of technology.

If you're looking to jump right in, I recommend downloading oobabooga's text-generation-webui and installing one of the LLMs from TheBloke below.

Try both GGML and GPTQ variants to see which model type performs to your preference. See the hardware tables to get a better idea of which parameter size you might be able to run (3B, 7B, 13B, 30B, 70B).

8-bit System Requirements

| Model | VRAM Used | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|---|
| LLaMA-7B | 9.2GB | 10GB | 3060 12GB, 3080 10GB | 24 GB |
| LLaMA-13B | 16.3GB | 20GB | 3090, 3090 Ti, 4090 | 32 GB |
| LLaMA-30B | 36GB | 40GB | A6000 48GB, A100 40GB | 64 GB |
| LLaMA-65B | 74GB | 80GB | A100 80GB | 128 GB |

4-bit System Requirements

| Model | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|
| LLaMA-7B | 6GB | GTX 1660, 2060, AMD 5700 XT, RTX 3050, 3060 | 6 GB |
| LLaMA-13B | 10GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000 | 12 GB |
| LLaMA-30B | 20GB | RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100 | 32 GB |
| LLaMA-65B | 40GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 | 64 GB |

*System RAM (not VRAM) is used to initially load a model. You can use swap space if you do not have enough RAM to support your LLM.

When in doubt, try starting with 3B or 7B models and work your way up to 13B+.
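
If your card or model size isn't in the tables, a quick back-of-the-envelope estimate (my own rule of thumb, not an exact figure) is parameters × bits-per-weight ÷ 8, plus some headroom for context and buffers:

```python
# Back-of-the-envelope VRAM estimate: params * bits/8, plus ~20% for
# context, activations and buffers. A rule of thumb, not an exact figure.
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    return params_billion * bits_per_weight / 8 * overhead

for size in (3, 7, 13, 30, 70):
    print(f"{size}B: 4-bit ~{estimate_vram_gb(size, 4):.1f} GB, "
          f"8-bit ~{estimate_vram_gb(size, 8):.1f} GB")
```

Real usage also depends on context length and the loader, so treat the tables above as the better reference.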

FOSAI Resources

Fediverse / FOSAI

LLM Leaderboards

LLM Search Tools


Large Language Model Hub

Download Models

oobabooga

text-generation-webui - a big community favorite Gradio web UI by oobabooga, designed for running almost any free, open-source large language model downloaded from HuggingFace, including (but not limited to) LLaMA, llama.cpp (GGML), GPT-J, Pythia, OPT, and many others. Its goal is to become the AUTOMATIC1111/stable-diffusion-webui of text generation. It is highly compatible with many formats.

Exllama

A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs.

gpt4all

Open-source assistant-style large language models that run locally on your CPU. GPT4All is an ecosystem to train and deploy powerful and customized large language models that run locally on consumer-grade processors.
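
If you'd rather script it than use the chat app, GPT4All also ships Python bindings; here's a minimal sketch (the model filename is just an illustrative example from their catalog):

```python
# Minimal GPT4All usage sketch; runs fully on CPU by default.
# pip install gpt4all  - the model file is downloaded on first use.
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # illustrative model name
with model.chat_session():
    reply = model.generate("Explain what a quantized LLM is in one sentence.",
                           max_tokens=128)
    print(reply)
```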

TavernAI

The original branch of software SillyTavern was forked from. This chat interface offers very similar functionality, but has less cross-client compatibility with other chat and API interfaces (compared to SillyTavern).

SillyTavern

Developer-friendly, Multi-API (KoboldAI/CPP, Horde, NovelAI, Ooba, OpenAI+proxies, Poe, WindowAI(Claude!)), Horde SD, System TTS, WorldInfo (lorebooks), customizable UI, auto-translate, and more prompt options than you'd ever want or need. Optional Extras server for more SD/TTS options + ChromaDB/Summarize. Based on a fork of TavernAI 1.2.8

Koboldcpp

A self-contained distributable from Concedo that exposes llama.cpp function bindings, allowing it to be used via a simulated Kobold API endpoint. What does that mean? You get llama.cpp with a fancy UI, persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios, and everything Kobold and Kobold Lite have to offer - all in a tiny package around 20 MB in size, excluding model weights.
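
Since Koboldcpp serves a Kobold-compatible HTTP API (on port 5001 by default, if you haven't changed it), you can also drive it from scripts instead of the bundled UI; a rough sketch against the /api/v1/generate endpoint:

```python
# Query a locally running Koboldcpp instance over its Kobold-style API.
# Assumes the default local port; adjust if you launched it differently.
import requests

payload = {
    "prompt": "Write a two-sentence scene set in a rainy cyberpunk city.",
    "max_length": 120,      # tokens to generate
    "temperature": 0.7,
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```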

KoboldAI-Client

This is a browser-based front-end for AI-assisted writing with multiple local & remote AI models. It offers the standard array of tools, including Memory, Author's Note, World Info, Save & Load, adjustable AI settings, formatting options, and the ability to import existing AI Dungeon adventures. You can also turn on Adventure mode and play the game like AI Dungeon Unleashed.

h2oGPT

h2oGPT is a large language model (LLM) fine-tuning framework and chatbot UI with document question-answer capabilities. Documents help ground LLMs against hallucinations by providing context relevant to the instruction. h2oGPT is a fully permissive Apache V2 open-source project for 100% private and secure use of LLMs and document embeddings for document question-answering.


Models

The Bloke

The Bloke is a developer who frequently releases quantized (GPTQ) and optimized (GGML) open-source, user-friendly versions of AI Large Language Models (LLMs).

These conversions of popular models can be configured and installed on personal (or professional) hardware, bringing bleeding-edge AI to the comfort of your home.

Support TheBloke here.
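
If you prefer grabbing files from a script rather than through a web UI, the huggingface_hub package can pull a single quantized file from one of TheBloke's repos; the repo and filename below are illustrative, so check the model card for the exact names:

```python
# Download one quantized model file from a TheBloke repo on HuggingFace.
# pip install huggingface_hub  - repo_id/filename are illustrative examples.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGML",           # example repo
    filename="llama-2-7b.ggmlv3.q4_K_M.bin",      # example 4-bit GGML file
    local_dir="models",
)
print("Saved to", path)
```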


70B


30B


13B


7B


More Models


GL, HF!

Are you an LLM Developer? Looking for a shoutout or project showcase? Send me a message and I'd be more than happy to share your work and support links with the community.

If you haven't already, consider subscribing to the free open-source AI community at [email protected] where I will do my best to make sure you have access to free open-source artificial intelligence on the bleeding edge.

Thank you for reading!

3
4
10
Mac Studio 2025 (piefed.social)
submitted 4 days ago by [email protected] to c/localllama
 
 

Thinking about a new Mac; my MBP M1 2020 (16 GB) can only handle about 8B models and is slow.

Since I looked it up, I might as well share the LLM-related specs:

Memory bandwidth:
  • M4 Pro (Mac Mini): 273 GB/s
  • M4 Max (Mac Studio): 410 GB/s

Cores (CPU / GPU):
  • M4 Pro: 14 / 20
  • M4 Max: 16 / 40

Cores & memory bandwidth are of course important, but with the Mini I could have 64 GB of RAM instead of 36 (within my budget, which is fixed for tax reasons).
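
A rough way I've been thinking about it (crude estimate, ignoring overhead): single-stream inference is mostly memory-bandwidth-bound, so an upper bound on speed is roughly bandwidth divided by the size of the loaded model:

```python
# Crude upper bound: tokens/s ≈ memory bandwidth / size of the loaded model,
# since single-user inference is mostly memory-bandwidth-bound.
for name, bw in (("M4 Pro, 273 GB/s", 273), ("M4 Max, 410 GB/s", 410)):
    for model, gb in (("8B Q4 (~5 GB)", 5), ("70B Q4 (~40 GB)", 40)):
        print(f"{name}: {model} -> ~{bw / gb:.0f} tok/s upper bound")
```

By that logic the Max is ~1.5x faster on anything that fits on both, but only the 64 GB Mini could load a 70B Q4 at all.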

Feels like the Mini with more memory would be better. What do you think?

5
 
 

Maybe AMD's loss is Nvidia's gain?

6
7
8
 
 

I felt it was quite good. I only mildly fell in love with Maya, and couldn't just close the conversation without saying goodbye first.

So I'd say we're just that little bit closer to having our own Jois in our lives 😅

9
 
 

Try the model here on the huggingface space

This is an interesting way to respond. Nothing business or financial related was in my prompt and this is the first turn in the conversation.

Maybe they set some system prompt which focuses on business-related things? It just seems weird to see such an unrelated response on the first turn.

10
 
 

Framework just announced their Desktop computer: an AI powerhouse?

Recently I've seen a couple of people online trying to use a Mac Studio (or clusters of Mac Studios) to run big AI models, since their GPU can directly access the RAM. To me it seemed an interesting idea, but the price of a Mac Studio makes it just a fun experiment rather than a viable option I would ever try.

Now, Framework has just announced their Desktop computer with the Ryzen AI Max+ 395 and up to 128GB of shared RAM (of which up to 110GB can be used by the iGPU on Linux), and it can be bought for slightly below €3k, which is far less than the €4k+ of a Mac Studio with apparently similar specs (and a better OS for AI tasks).

What do you think about it?

11
 
 

In case anyone isn't familiar with llama.cpp and GGUF, basically it allows you to load part of the model to regular RAM if you can't fit all of it in VRAM, and then it splits the inference work between CPU and GPU. It is of course significantly slower than running a model entirely on GPU, but depending on your use case it might be acceptable if you want to run larger models locally.

However, since you can no longer use the "pick the largest quantization that fits in memory" logic, there are more choices to make when choosing which file to download. For example I have 24GB VRAM, so if I want to run a 70B model I could either use a Q4_K_S quant and perhaps fit 40/80 layers in VRAM, or a Q3_K_S quant and maybe fit 60 layers instead, but how will it affect speed and text quality? Then there are of course IQ quants, which are supposedly higher quality than a similar size Q quant, but possibly a little slower.

In addition to the quantization choice, there are additional flags which affect memory usage. For example I can opt to not offload the KQV cache, which would slow down inference, but perhaps it's a net gain if I can offload more model layers instead? And I can save some RAM/VRAM by using a quantized cache, probably with some quality loss, but I could use the savings to load a larger quant and perhaps that would offset it.

Was just wondering if someone has already done experiments/benchmarks in this area, did not find any exact comparisons on search engines. Planning to do some benchmarks myself but not sure when I have time.
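
For reference, these are the knobs I'm planning to sweep: in llama.cpp they are -ngl/--n-gpu-layers, --no-kv-offload, and --cache-type-k/--cache-type-v, and the llama-cpp-python bindings expose at least the first two. A rough sketch of the benchmark loop (paths and layer counts are illustrative):

```python
# Rough benchmark sketch with llama-cpp-python; paths/layer counts are illustrative.
# pip install llama-cpp-python (built with GPU support)
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-70b.Q4_K_S.gguf",  # swap in a Q3_K_S etc. to compare
    n_gpu_layers=40,     # e.g. 40 of 80 layers in VRAM; raise until you run out
    n_ctx=4096,
    offload_kqv=True,    # False keeps the KV cache in system RAM, freeing VRAM for layers
)

t0 = time.time()
out = llm("Summarize the plot of Hamlet in one paragraph.", max_tokens=256)
tokens = out["usage"]["completion_tokens"]
print(f"{tokens / (time.time() - t0):.2f} tok/s")
```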

12
8
Psychoanalysis (self.localllama)
submitted 2 weeks ago by Rez to c/localllama
 
 

Hi, I don't have much experience with running local AIs, but I have tried to use ChatGPT for psychoanalysis purposes, and the smarter (but limited for free users) model is amazing. I don't like giving such personal information to OpenAI though, so I'd like to set something similar up locally, if possible.

I am running Fedora Linux and have had the best results with KoboldCpp, as it was by far the easiest to set up. I have a Ryzen 7600, 32 GB of RAM, and a 7800 XT (16 GB VRAM). The two things I mostly want from this setup are the smartest model possible, as I've tried some and the responses just don't feel as insightful or thought-provoking as ChatGPT's, and the way it handles memory, which I really like. I don't need "real time" conversation speed if it means I can get the smarter responses that I am looking for.

What models/setups would you recommend? Generally, I've been going for newer + takes up more space = better, but I'm kind of disappointed with the results, although the largest models I've tried have only been around 16 GB. Is my setup capable of running bigger models? I've been hesitant to try, as I don't have fast internet and downloading a model usually means keeping my PC running overnight.

PS, I am planning to use this mostly as a way to grow/reflect, not dealing with trauma or loneliness. If you are struggling and are considering AI for help, never forget that it can not replace connections with real human beings.

13
 
 

Just putting this here because I found this useful:

14
 
 

The Ryzen AI MAX+ 395 and Ryzen AI MAX 390 are supposed to be Apple M4 and Apple M4 Pro competitors that combine high efficiency with some pretty crazy performance numbers in gaming, AI and creator workloads. That's because this Strix Halo design combines an insanely powerful CPU with a huge GPU onto one chip. The end result is something special and unique in the ROG Flow Z13 2025.

15
 
 

Yesterday I got bored and decided to try out my old GPUs with Vulkan. I had an HD 5830, a GTX 460, and a GTX 770 4GB lying around, so I figured "Why not".

Long story short - Vulkan didn't recognize them, hell, Linux didn't even recognize them. They didn't show up in nvtop, nvidia-smi or anything. I didn't think to check dmesg.

Honestly, I thought the 770 would work; it hasn't been in legacy status that long. It might work with an older Nvidia driver version (I'm on 550 now) but I'm not messing with that stuff just because I'm bored.

So for now the oldest GPUs I can get running are a Ryzen 5700G APU and a 1080 Ti. Both Vega and Pascal came out in early 2017 according to Wikipedia. Those people disappointed that their RX 500 and RX 5000 series cards don't work in Ollama should give llama.cpp's Vulkan backend a shot. Kobold has a Vulkan option too.

The 5700G works fine alongside Nvidia GPUs in Vulkan. The performance is what you'd expect from an APU, but at least it works. Now I'm tempted to buy a 7600 XT just to see how it does.

Has anyone else out there tried Vulkan?

16
 
 

I didn't expect an 8B-F16 model with 16GB on disk could be run on my laptop with only 16GB of RAM and an integrated GPU. It was painfully slow, like 0.3 t/s, but it ran. Then I learnt that you can effectively run a model from storage without loading it into memory, and confirmed that this was exactly the case: the memory usage stayed constant at around 20% whether or not the model was running.

The problem is that gpt4all-chat is running all the models greater than 1.5B in this way, and the difference is huge, as the 1.5B model runs at 20 t/s. Even a distilled 6.7B_Q8 model with roughly 7GB on disk, which has plenty of room (12GB RAM free), didn't move the memory usage and was also very slow (3 tokens/sec). I'm pretty new to this field so I'm probably missing something basic, but I just followed the instructions for downloading and compiling it.
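
From what I've read since, this sounds like memory-mapping: the GGUF file gets mapped rather than copied into RAM, so memory usage barely moves but every token pays for disk reads. The llama-cpp-python bindings (a different tool, but the same llama.cpp machinery underneath) let you toggle this explicitly, which is how I plan to confirm it; paths below are illustrative:

```python
# Checking mmap vs. fully loading weights with llama-cpp-python.
# Paths are illustrative; pick a model that actually fits in free RAM.
from llama_cpp import Llama

# Default behaviour: the GGUF file is memory-mapped, so pages are read from
# disk on demand -> memory usage looks low, but generation can crawl.
llm_mapped = Llama(model_path="models/distill-6.7B-Q8_0.gguf", use_mmap=True)

# Force the weights to be loaded into RAM up front, which is what you want
# when the model comfortably fits in free memory (use_mlock can pin it there).
llm_loaded = Llama(model_path="models/distill-6.7B-Q8_0.gguf",
                   use_mmap=False, use_mlock=False)
```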

17
 
 

Well, it was nice ... having hope, I mean. That was a good feeling.

18
 
 

I have a GTX 1660 Super (6 GB).

Right now I have ollama with:

  • deepseek-r1:8b
  • qwen2.5-coder:7b

Do you recommend any other local models to play with on my GPU?

19
 
 

One might question why an RX 9070 card would need so much memory, but increased capacity can serve purposes beyond gaming, such as Large Language Model (LLM) support for AI workloads. Additionally, it’s worth noting that RX 9070 cards will use 20 Gbps memory, much slower than the RTX 50 series, which features 28-30 Gbps GDDR7 variants. So, while capacity may increase, bandwidth likely won’t.
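
For context, the bandwidth math (assuming the rumored 256-bit bus, which may be wrong) works out as follows:

```python
# GDDR bandwidth (GB/s) = per-pin speed (Gbps) * bus width (bits) / 8
def bandwidth_gbs(gbps_per_pin, bus_width_bits):
    return gbps_per_pin * bus_width_bits / 8

print(bandwidth_gbs(20, 256))  # 640.0 - RX 9070 class, assuming a 256-bit bus
print(bandwidth_gbs(28, 256))  # 896.0 - same bus width with 28 Gbps GDDR7
print(bandwidth_gbs(30, 256))  # 960.0 - ... or 30 Gbps GDDR7
```

So the extra capacity would help fit bigger models, but the per-token speed gap versus GDDR7 cards would remain.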

20
21
 
 

Closing session, speech by Modi, JD Vance, Ursula von der Leyen

22
23
 
 

Sorry I keep posting about Mistral but if you check: https://chat.mistral.ai/chat

I dunno how they do it, but some of these answers are lightning fast:

Fast inference dramatically improves the user experience for chat and code generation – two of the most popular use-cases today. In the example above, Mistral Le Chat completes a coding prompt instantly while other popular AI assistants take up to 50 seconds to finish.

For this initial release, Cerebras will focus on serving text-based queries for the Mistral Large 2 model. When using Cerebras Inference, Le Chat will display a “Flash Answer ⚡” icon on the bottom left of the chat interface.

24
 
 

Example of it working in action: https://streamable.com/ueh3sj

Paper: https://arxiv.org/abs/2502.03382

Samples: https://hf.co/spaces/kyutai/hibiki-samples

Inference code: https://github.com/kyutai-labs/hibiki

Models: https://huggingface.co/kyutai

From kyutai on X: Meet Hibiki, our simultaneous speech-to-speech translation model, currently supporting FR to EN.

Hibiki produces spoken and text translations of the input speech in real-time, while preserving the speaker’s voice and optimally adapting its pace based on the semantic content of the source speech.

Based on objective and human evaluations, Hibiki outperforms previous systems for quality, naturalness and speaker similarity and approaches human interpreters.

https://x.com/kyutai_labs/status/1887495488997404732

Neil Zeghidour on X: https://x.com/neilzegh/status/1887498102455869775

25