LocalLLaMA

3156 readers
25 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text predictors like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago
MODERATORS
1
 
 

It seems Mistral finally released their own version of Small 3.1 2503 with a CoT reasoning pattern embedded. Before this, the best CoT finetune of Small was DeepHermes, built on DeepSeek's R1 distill patterns. According to the technical report, Mistral baked their own reasoning patterns into this one, so it's not just another DeepSeek distill finetune.

HuggingFace

Blog

Magistral technical paper

2
 
 

I'm limited to 24GB of VRAM, and I need pretty large context for my use-case (20k+). I tried "Qwen3-14B-GGUF:Q6_K_XL," but it doesn't seem to like calling tools more than a couple times, no matter how I prompt it.

Tried using "SuperThoughts-CoT-14B-16k-o1-QwQ-i1-GGUF:Q6_K" and "DeepSeek-R1-Distill-Qwen-14B-GGUF:Q6_K_L," but Ollama or LangGraph gives me an error saying these don't support tool calling.
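For anyone curious what this boils down to under the hood, a minimal tool-calling request against Ollama's /api/chat endpoint looks roughly like the sketch below (the model name and tool schema are placeholders, so treat it as an illustration rather than my exact setup):

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:14b",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Return the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'

If a model's chat template has no tool-call section, the backend refuses requests like this outright, which seems to be what's happening with the errors above.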

3
 
 
  • It seems like it'll be the best local model that can be run fast if you have a lot of RAM and medium VRAM.
  • It uses a shared expert (like deepseek and llama4) so it'll be even faster on partial offloaded setups.
  • There are a ton of options for fine-tuning or training from one of their many partially trained checkpoints.
  • I'm hoping for a good reasoning finetune. Hoping Nous does it.
  • It has a unique voice because it has very little synthetic data in it.

llama.cpp support is in the works, and hopefully won't take too long since its architecture is reused from other models llama.cpp already supports.
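Once support lands, partial offload should look like the usual llama.cpp invocation; a rough sketch (the filename is a placeholder, -ngl is how many layers go to VRAM while the rest stay in system RAM):

# raise -ngl until VRAM is full; with a shared expert the always-active weights
# are what benefit most from sitting on the GPU
./llama-server -m new-moe-model-Q4_K_M.gguf -ngl 30 -c 8192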

Are y'all as excited as I am? Also is there any other upcoming release that you're excited for?

4
28
submitted 3 days ago* (last edited 3 days ago) by [email protected] to c/localllama
 
 

Hello. Our community, c/localllama, has always been and continues to be a safe haven for those who wish to learn about the creation and local usage of 'artificial intelligence' machine learning models to enrich their daily lives and provide a fun hobby to dabble in. We come together to apply this new computational technology in ways that protect our privacy and build upon a collective effort to better understand how this can help humanity as an open source technology stack.

Unfortunately, we have been receiving an uptick in negative interactions from those outside our community recently. This is largely due to the current political tensions caused by our association with the popular and powerful tech companies who pioneered modern machine learning models for business and profit, as well as unsavory tech-bro individuals who care more about money than ethics. These kinds of users continue to create animosity toward the entire field of machine learning and everyone associated with it, through their illegal stealing of private data to train base models and very real threats to disrupt the economy by destroying jobs through automation.

There are legitimate criticisms to be had: the cost of creating models, how the art they produce is devoid of the soulful touch of human creativity, and how corporations are attempting to disrupt lives for profit instead of enriching them.

I did not want to be heavy-handed with censorship/mod actions prior to this post because I believe that echo chambers are bad and genuine understanding requires discussion between multiple conflicting perspectives.

However, a lot of the negative comments we receive lately aren't made in good faith, with valid criticisms of the corporations or technologies and an intimate understanding of them. No, instead it's base-level mudslinging by people with emotionally charged vendettas making nasty comments of no substance. Common examples are comparing models to NFTs, name-calling our community members as blind zealots for thinking models could ever be used to help people, and spreading misinformation with cherry-picked unreliable sources to manipulatively exaggerate environmental impact/resource consumption.

While I am against echo chambers, I am also against our community being harassed and dragged down by bad actors who just don't understand what we do or how this works. You shouldn't have to be subjected to the same brain-rot antagonism with every post made here.

So I'm updating the guidelines by adding some rules I intend to enforce. I'm still debating whether or not to retroactively remove infringing comments from previous posts, but be sure any new posts and comments will be moderated based on the following guidelines.

RULES:

Rule: No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Reason: More or less self-explanatory; personal character attacks and childish mudslinging against community members are toxic.

Rule: No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Reason: This is a piss-poor whataboutism argument. It claims something that is blatantly untrue while attempting to discredit the entire field by stapling the animosity everyone has toward crypto/NFTs onto ML. Models already do more than cryptocurrency ever has. Models can generate text, pictures, and audio. Models can view/read/hear text, pictures, and audio. Models may simulate aspects of cognitive thought patterns to attempt to speculate or reason through a given problem. Once they are trained they can be copied and locally hosted for many thousands of years, which factors into the initial energy cost vs. power consumed over time equation.

Rule: No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text predictors like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."

Reason: There are grains of truth to the reductionist statement that LLMs rely on mathematical statistics and probability for their outputs. The same can be said for humans, the statistical patterns in our own language, and how our neurons come together to predict the next word in the sentence we type out. It's the intricate complexity of the process and the way information is processed that makes all the difference. ML models involve an entire college course worth of advanced mathematics and STEM concepts: hyperdimensional matrices that plot the relationships between pieces of information, and intricate hidden layers made of perceptrons connecting billions of parameters into vast abstraction mappings. There were also major innovations and discoveries made in the 2000s, which we didn't have in the early days of computing, that made modern model training possible. All of that is a little more complicated than what your phone's autocorrect does, and the people who make the lazy reductionist comparison just don't care about the nuances.

Rule: No implying that models are devoid of purpose or potential for enriching people's lives.

Reason: Models are tools with great potential for helping people, through the creation of accessibility software for the disabled and by enabling doctors to better heal the sick through advanced medical diagnostic techniques. The perceived harm models are capable of causing, such as job displacement, is rooted in our flawed late-stage capitalist society's pressure for increased profit margins at the expense of everyone and everything.

If you have any proposals for rule additions or wording changes I will hear you out in the comments. Thank you for choosing to browse and contribute to this space.

5
16
submitted 3 days ago* (last edited 3 days ago) by HumanPerson to c/localllama
 
 

I just set up a new dedicated AI server that is quite fast by my standards. I have it running with OpenWebUI and would like to integrate it with other services. I think it would be cool to have something like copilot where I can be writing code in a text editor and have it add a readme function or something like that. I have also used some RAG stuff and like it, but I think it would be cool to have a RAG that can access live data, like having the most up to date docker compose file and nginx configs for when I ask it about server stuff. So, what are you integrating your AI stuff with, and how can I get started?
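For context, most editor/copilot-style integrations just talk to an OpenAI-compatible HTTP endpoint, which llama.cpp's llama-server and Ollama both expose (and OpenWebUI can sit in front of). A minimal sketch, with host, port, and model name as placeholders:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "system", "content": "You write concise README sections."},
      {"role": "user", "content": "Write a README blurb for my docker compose setup."}
    ]
  }'

Anything that speaks that API (editor plugins, scripts, RAG frontends) can be pointed at the same server.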

6
 
 

It looks like AI has followed crypto, chip-wise, in going CPU > GPU > ASIC.

GPUs, while dominant in training large models, are often too power-hungry and costly for efficient inference at scale. This is opening new opportunities for specialized inference hardware, a market where startups like Untether AI were early pioneers.

In April, then-CEO Chris Walker had highlighted rising demand for Untether’s chips as enterprises sought alternatives to high-power GPUs. “There’s a strong appetite for processors that don’t consume as much energy as Nvidia’s energy-hungry GPUs that are pushing racks to 120 kilowatts,” Walker told CRN. Walker left Untether AI in May.

Hopefully the training part of AI goes to ASICs to reduce costs and energy use, while GPUs continue to improve at inference and increase VRAM sizes to the point that AI requires nothing special to run locally.

7
9
submitted 1 week ago* (last edited 1 week ago) by [email protected] to c/localllama
 
 

Sorry team, flipped the URLs around to prevent overflow from lemmy.world users

https://fly.io/blog/youre-all-nuts/

8
 
 

This seems like it's a less-than-positive development for running AI on consumer-grade hardware.

9
 
 

Hey everybody. I'm just getting into LLMs. Total noob. I started using llama-server's web interface, but I'm experimenting with a frontend called SillyTavern. It looks much more powerful, but there's still a lot I don't understand about it, and some design choices I found confusing.

I'm trying the Harbinger-24B model to act as a D&D-style DM, and to run one party character while I control another. I tried several general purpose models too, but I felt the Harbinger purpose-built adventure model was noticeably superior for this.

I'll write a little about my experience with it, and then some thoughts about LLMs and D&D. (Or D&D-ish. I'm not fussy about the exact thing, I just want that flavour of experience).

General Experience

I've run two scenarios. My first try was a 4/10 for my personal satisfaction, and the 2nd was 8/10. I made no changes to the prompts or anything between, so that's all due to the story the model settled into. I'm trying not to give the model any story details, so it makes everything up, and I won't know about it in advance. The first story the model invented was so-so. The second was surprisingly fun. It had historical intrigue, a tie-in to a dark family secret from ancestors of the AI-controlled char, and the dungeon-diving mattered to the overarching story. Solid marks.

My suggestion for others trying this is, if you don't get a story you like out of the model, try a few more times. You might land something much better.

The Good

Harbinger provided a nice mixture of combat and non-combat. I enjoy combat, but I also like solving mysteries and advancing the plot by talking to NPCs or finding a book in the town library, as long as it feels meaningful.

It writes fairly nice descriptions of areas you encounter, and thoughts for the AI-run character.

It seems to know D&D spells and abilities. It lets you use them in creative but very reasonable ways you could do in a pen and paper game, but can't do in a standard CRPG engine. It might let you get away with too much, so you have to keep yourself honest.

The Bad

You may have to try multiple times until the RNG gives you a nice story. You could also inject a story in the base prompt, but I want the LLM to act as a DM for me, where I'm going in completely blind. Also, in my first 4/10 game, the LLM forced really bad "main character syndrome" on me. The whole thing was about me, me, me, I'm special! I found that off-putting, but the 2nd 8/10 attempt wasn't like that at all.

As an LLM, it's loosey-goosey about things like inventory, spells, rules, and character progression.

I had a difficult time giving the model OOC instructions. OOC tended to be "heard" by other characters.

Thoughts about fantasy-adventure RP and LLMs

I feel like the LLM is very good at providing descriptions, situations, and locations. It's also very good at understanding how you're trying to be creative with abilities and items, and it lets you solve problems in creative ways. It's more satisfying than a normal CRPG engine in this way.

As an LLM though, it lets you steer things in ways you shouldn't be able to in an RPG with fixed rules: it won't reliably disallow a spell you don't know, or remember how many feet of rope you're carrying. I enjoy the character leveling and crunchy stats part of pen-and-paper or CRPGs, and I haven't found a good way to get the LLM to do that without just handling everything manually and whacking it into the context.

That leads me to think that using an LLM for creativity inside a non-LLM framework to enforce rules, stats, spells, inventory, and abilities might be phenomenal. Maybe AI-dungeon does that? Never tried, and anyway I want local. A hybrid system like that might be scriptable somehow, but I'm too much of a noob to know.

10
 
 

Hey everybody, brand new to running local LLMs, so I'm learning as I go. Also brand new to lemmy.

I have a 16 GB VRAM card, and I was running some models that would overflow 16GB by using the CPU+RAM to run some of the layers. It worked, but was very slow, even for only a few layers.

Well, I noticed llama.cpp has an rpc-server feature, so I tried it. It was very easy to use. Linux here, but probably similar on Windows or Mac. I had an older gaming rig sitting around with a GTX 1080 in it. Much slower than my 4080, but using it to run a few layers is still FAR faster than using the CPU. Night and day, almost.

The main drawbacks I've experienced so far are,

  • By default it tries to split the model evenly between machines. That's fine if you have the same card in all of them, but I wanted to put as much of the model as possible on the fastest card. You can do that using the --tensor-split parameter, but it requires some experimenting to get it right.

  • It loads the rpc machine's part of the model across the network every time you start the server, which can be slow on a 1-gigabit network. I didn't see any way to tell rpc-server to load the model from a local copy. It makes my startups go from 1-2 seconds up to 30-50 seconds.

  • Q8 quantized KV cache works, but Q4 does not.

Lots of people may not be able to run 2 or 3 GPUs in one PC, but might have another PC they can add over the network. Worth a try, I'd say, if you want more VRAM space.
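For anyone wanting to try it, the setup boils down to something like this sketch (both machines need llama.cpp built with -DGGML_RPC=ON; the IP, port, and split ratio are just examples, and flag details may differ by version, so check --help):

# on the spare box with the GTX 1080
./rpc-server -p 50052

# on the main box with the 4080, pointing at the remote backend
./llama-server -m model.gguf -ngl 99 \
  --rpc 192.168.1.50:50052 \
  --tensor-split 3,1

The 3,1 split is what puts roughly three quarters of the layers on the faster card; as mentioned above, which number maps to which device takes some experimenting.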

11
 
 

AI bros won't hype this up for the news, for sure, but 480x the energy doesn't sound optimistic enough for replacement.

12
 
 

WOAH

13
14
 
 

So, I've never really used RAG before. I kept hearing about it and understood it in concept, so I wanted to play around with it. I play a tabletop map-drawing game called DELVE that has lots of rules to keep track of, requiring lots of manual page flips on my slow ereader.

I wanted to see if it was possible to teach my local model about DELVE through my engine's basic RAG system, which works by providing a big text file and doing keyword searches in it based off user input. The engine pulls in local chunks of the text document around important keywords as context. This is different from the more complex and sophisticated versions of RAG that work off fancy vector space math or something like that. Whatever, kobold is my engine of choice and I'm happy to have something that half works and is simple enough to grasp.

So the first step was converting the game manual PDF to a text file. I did this using Calibre's ebook-convert tool. Then I looked through it and tried to clean up some glaring issues; inevitably there's going to be jank converting PDF unicode or tables.

sudo apt install calibre

ebook-convert delve.pdf delve.txt

Once it's cleaned up a little I threw it into Kobold Lite's Context > TextDB tab and enabled it.

It's cool how such a simple process manages to let the model leverage local datasets. I was shocked when it correctly recited proper game rules.
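If you want a feel for what this kind of keyword-window retrieval does under the hood, a crude shell approximation (not what Kobold actually runs, just the same idea) is grabbing a few lines of context around each keyword hit and stuffing that into the prompt:

# pull 3 lines of context around every mention of a keyword from the converted manual
grep -n -i -C 3 "bloodrune" delve.txt | head -40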

Full Input/Output shown in screenshot

Input: How to play delve?

Output:

Okay, I need to figure out how to explain how to play DELVE based on the provided info snippets. Let's start by breaking down the key components mentioned.

First, the materials required are pencil, eraser, grid paper or blank paper, notepad for resources and trade goods, a deck of standard playing cards without jokers initially, and tokens for various purposes like dwarves, enemies, etc. The starting setup involves drawing a stairwell at the top center as the entrance. Resources (hearts ♥) and Trade Goods (diamonds ♦) start at 20 each. Then there's mention of Soldiers being placed at the entrance. Jokers are removed from the deck at first but added later at depth 5.

The turn structure seems to involve the following steps: prepare, explore, retreat. Each turn has phases like exploring, checking threats, downtime. There's talk about expanding the hold by adding rooms, handling threats through combat or avoidance, managing resources and trade goods.

The Bloodrunes are optional challenges that offer boons but also transport to a nightmare layer upon choosing them. Each rune has specific effects. For example, the Rune of Nightmares immediately starts the player in a nightmare layer with rewards for defeating ancient monsters.

There's a part about using cards instead of dice. Drawing suits corresponds to D4 results, while colors correspond to D2. That's important because the game uses these rolls frequently.

Rooms have various functions. Smelters let you convert ore into metal, Armories provide soldiers, Quarries generate stone, etc. Each room has costs and benefits. Some advanced rooms like Deep Delve or Burglar Alarms have special abilities.

Threats include monsters, thieves, and other events. Combat uses Strength and Defense with armor effects. Thieves steal half your trade goods unless you have burglar alarms. Monsters can cause damage or drain resources.

Trade is done by visiting the Marketplace on certain turns where you exchange goods. The Marketplace deck has different cards affecting trade values. You can sell excess goods or buy needed items at varying rates.

Ancient Ones are powerful enemies encountered as you delve deeper. They have high stats and unique abilities. Beating them gives significant rewards but poses risks if not prepared.

Endgame conditions involve reaching depth 10, accumulating wealth, or being overrun by threats. If you win, you get to set up a new hold with better starting conditions.

I need to structure this explanation step-by-step, covering setup, turn phases, actions in each phase, room building, handling threats, trading, and endgame conditions. Also mention the Bloodrunes as optional challenges with their own rules. Highlight how card draws replace dice rolls for D4 and D2. Maybe note that grid paper isn't mandatory, but it helps visualize the hold layout.

How to Play DELVE: Step-by-Step Guide

  1. Setup

    Materials Needed:
    Pencil/eraser, grid/notebook paper, notebook (for tracking resources), standard playing cards (Jokers removed initially), tokens (coins, dice, etc.).
    Start Your Hold:
    Draw a central stairwell on the top row of grid paper as the entrance.
    Resources:
    Begin with 20 Resources (♥) and 20 Trade Goods (♦) in your notebook. Add 5 Soldiers (tokenized) at the entrance.

  2. Deck & Game Layers

    Remove Jokers from the deck. At Depth 5, re-add them for "Legendary Finds."
    Use cards instead of dice:
    Roll 1D4: Suit = ♠=1, ♣=2, ♦=3, ♥=4.
    Roll 1D2: Color = Black=1, Red=2.

  3. Turn Structure

    Prepare Phase:

     Spend resources to build rooms (e.g., Smelter, Armory) or upgrade defenses.  
     Manage soldiers and prepare for threats.  
    

    Explore Phase:

     Draw two cards from the deck: choose one to reveal a room, threat, or resource.  
     Build rooms by placing them on grid paper (cost varies by type).  
    

    Retreat Phase:

     Resolve combat, theft, or other threats. Sold-out rooms generate income (e.g., Quarry → Stone).  
    
  4. Optional Challenges: Bloodrunes

    Place a Bloodrune (e.g., "Rune of Nightmares") on the first row before starting. Each rune triggers immediate effects (e.g., starting in a nightmare layer) but offers boons like combat bonuses.

  5. Rooms & Upgrades

    Basic Rooms:
    Smelter (3♥): Convert ore to metal.
    Armory (5♥): Add soldiers (1 per turn).
    Quarry (2♥): Gather stone.
    Advanced Rooms:
    Deep Delve (10♥): Explore deeper layers faster.
    Burglar Alarms (7♥): Mitigate thief losses.

  6. Threats

    Combat: Roll strength vs. enemy defense (armor reduces damage). Soldiers protect rooms.
    Thieves: Steal half your Trade Goods unless you have Burglar Alarms.
    Ancient Ones: Powerful bosses at deeper levels (e.g., "The Devourer") with unique abilities and high rewards.

  7. Trading & Marketplace

    Visit the Marketplace every 3 turns by drawing cards: adjust supply/demand for resources. Sell excess goods or buy needed items at fluctuating rates.

  8. Victory Conditions

Win by reaching Depth 10, accumulating 200+ Trade Goods, or surviving until the deck runs out. Lose if overrun by threats, depleted of resources, or defeated in combat.

  1. Bloodrune Example: Rune of Nightmares

    Start immediately on a nightmare layer without access to the Marketplace. Defeat ancient monsters for powerful rewards but risk doom.

Tips:

Use grid paper to visualize your hold’s layout.  
Balance room upgrades with threat mitigation.  
Bloodrunes add difficulty but enhance replayability.  

Enjoy delving into the depths! 🌋⚔️🗡️

15
16
 
 

Here's a link to the paper; the relevant paragraph starts around the end of page 8. Thank you hendrik! https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf

17
20
submitted 2 weeks ago* (last edited 2 weeks ago) by [email protected] to c/localllama
 
 

Devstral is an agentic LLM for software engineering tasks built under a collaboration between Mistral AI and All Hands AI 🙌. Devstral excels at using tools to explore codebases, editing multiple files, and powering software engineering agents. The model achieves remarkable performance on SWE-bench, which positions it as the #1 open source model on this benchmark.

It is finetuned from Mistral-Small-3.1, therefore it has a long context window of up to 128k tokens. As a coding agent, Devstral is text-only; the vision encoder was removed from Mistral-Small-3.1 before fine-tuning.

Learn more about Devstral in our blog post. Key Features:

Agentic coding: Devstral is designed to excel at agentic coding tasks, making it a great choice for software engineering agents.

Lightweight: with its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM, making it an appropriate model for local deployment and on-device use.

Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.

Context Window: A 128k context window.

Tokenizer: Utilizes a Tekken tokenizer with a 131k vocabulary size.

Learn more about Devstral in our blog post.

https://mistral.ai/news/devstral

18
 
 

My groupchats use those react emoji all the time. Maybe they could train a model to classify with those. Then use that classifier to help RL models into being funny.

All my funniest groupchats are on Snapchat.

I don't think this would be ethical, but it could be effective.

19
11
submitted 3 weeks ago* (last edited 3 weeks ago) by [email protected] to c/localllama
 
 

Gemma 3n includes the following key features:

Audio input: Process sound data for speech recognition, translation, and audio data analysis.

Visual and text input: Multimodal capabilities let you handle vision, sound, and text to help you understand and analyze the world around you.

PLE caching: Per-Layer Embedding (PLE) parameters contained in these models can be cached to fast, local storage to reduce model memory run costs. Learn more

MatFormer architecture: Matryoshka Transformer architecture allows for selective activation of the model's parameters per request to reduce compute cost and response times. Learn more

Conditional parameter loading: Bypass loading of vision and audio parameters in the model to reduce the total number of loaded parameters and save memory resources. Learn more

Wide language support: Wide linguistic capabilities, trained in over 140 languages.

32K token context: Substantial input context for analyzing data and handling processing tasks.

20
 
 

Some days ago ROCm 6.4 was officially added to the Arch repositories - which is great - but it made my current setup completely explode - which is less great - and currently I don't have the necessary will to go and come back from gdb hell...

So I've taken this opportunity to set up a podman (docker alternative) container to use the older, and for me working, ROCm 6.3.3. On the plus side this has made it even easier to test new things and do random stuff: I will probably port my Vulkan setup too, at a later date.

Long story short I've decided to clean it up a bit, place a bunch of links and comments, and share it with you all in the hope it will help someone out.

You still need to handle the necessary requirements on your host system to make everything work, but I've complete trust in you! Even if it doesn't work, it is a starting point that I hope will give some direction on what to do.

BTW I'm not an expert in this field, so some things can be undoubtedly improved.

Assumptions

  • To make this simpler I will consider, and advise using, this kind of folder structure:
base_dir
 ├─ROCm_debian_dev
 │  └─ Dockerfile
 └─llamacpp_rocm6.33
    ├─ logs
    │   └─ logfile.log
    ├─ workdir
    │   └─ entrypoint.sh
    ├─ Dockerfile
    └─ compose.yaml
  • I've tested this on Arch Linux. You can probably make it work on basically any current, and not too old distro, but it's untested.

  • You should follow the basic requirements from the AMD documentation, and cross your fingers. You can probably find a more precise guide on your distro wiki. Or just install any and all ROCm and HIP related SDKs. Sigh.

  • I'm using podman, which is an alternative to docker. It has some idiosyncrasies - which I will not get into because they would require another full write-up, so if you use docker it is possible you'll need to modify some things. I can't help you there.

  • This is given with no warranty: if your computer catches on fire, it is on you (code MIT/Apache 2 license, the one you prefer; text CC BY-SA 4.0). More at the end.

  • You should know what 'generation' of card yours is. ROCm works in mysterious ways and each card has its problems. Generally you can just steam roll forward, with no care, but you still need to find which HSA_OVERRIDE_GFX_VERSION your card needs to run under. For example for a rx6600xt/rx6650xt it would be gfx1030 and HSA_OVERRIDE_GFX_VERSION=10.3.0. Some info here: Compatibility Matrix You can (not so) easily search for the correct gfx and HSA codes on the web. I don't think the 9xxx series is currently supported, but I could be wrong.

  • There's an official Docker image in the llama.cpp repository, you could give that one a go. Personally I like doing them myself, so I understand what is going on when I inevitably bleed on the edge - in fact I didn't even consider the existence of an official Dockerfile until after writing this post.. Whelp. Still, they are two different approaches, pick your poison.

Dockerfile(s)

These can, at a high level, be described as the recipes with which we will set up the container image that will compile and run llama.cpp for us.

I will put two Dockerfiles here: one can be used as a fixed base, while the second one can be re-built every time you want to update llama.cpp.

Now, this will build a fresh image each time. We could use a volume (like a virtual directory shared between the host machine and the container) to just git pull the new code instead of cloning, but that would almost completely disregard the benefit of running this in a container. TLDR: For now don't overthink it and go with the flow.

Base image

This is a pretty basic recipe: it gets the official dev-ubuntu image by AMD and then augments it to be suitable for our needs. You can easily use other versions of ROCm (for example dev-ubuntu-24.04:6.4-complete) or even of Ubuntu. You can find the filtered list of the images here: Link

Could we use a lighter image? Yes. Should we? Probably. Maybe next time.

tbh I've tried other images with no success, or they needed too much effort for a minimal reward: this Just Works™. YMMV.

base_dir/ROCm_debian_dev/Dockerfile

# This is the one that currently works for me, you can
# select a different one:
#   https://hub.docker.com/r/rocm/dev-ubuntu-24.04/tags
FROM docker.io/rocm/dev-ubuntu-24.04:6.3.3-complete
# 6.4.0
# FROM docker.io/rocm/dev-ubuntu-24.04:6.4-complete

# We update and then install some stuff.
# In theory we could delete more things to make the final
# image slimmer.
RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    cmake \
    libcurl4-openssl-dev \
    && rm -rf /var/lib/apt/lists/*

It is a big image, over 30GB in size (around 6GB to download for 6.3.3-complete and around 4GB for 6.4-complete).

Let's build it:

cd base_dir/ROCm_debian_dev/
podman build -t rocm-6.3.3_ubuntu-dev:latest .

This will build it and add it to your local images (you can see them with podman images) with the name rocm-6.3.3_ubuntu-dev and the tag latest. You can change them as you see fit, obviously. You can even give multiple tags to the same image, a common way is to have a more specific tag and then add the tag latest to the last one you have generated, so you don't have to change the other scripts that reference it. More info here: podman tag
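For example, to build with a specific tag and then also mark that same image as latest (the tag names here are just an example):

podman build -t rocm-6.3.3_ubuntu-dev:2025-06-01 .
podman tag rocm-6.3.3_ubuntu-dev:2025-06-01 rocm-6.3.3_ubuntu-dev:latest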

The real image

The second image is the one that will handle the llama.cpp[server|bench] compilation and then execution, and you need to customize it:

  • You should modify the number after the -j based on the number of virtual cores that your CPU has, minus one. You can probably use nproc in a terminal to check for it.
  • You have to change the AMDGPU_TARGETS code based on your gfx version! Pay attention, because the correct one is probably not the one returned by rocminfo; for example the rx6650xt is gfx1032, but that is not directly supported by ROCm. You have to use the supported (and basically identical) gfx1030 instead.
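Two quick checks that help with both points (the grep pattern is just a rough filter):

# virtual cores; use this minus one as the -j value
nproc
# gfx target(s) your GPU reports; you may still need to map it to a supported
# one, e.g. gfx1032 -> gfx1030 as described above
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u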

If you want to compile with a ROCm image after 6.3 you need to swap the commented lines. Still, no idea if it works or if it is even supported by llama.cpp.

More info, and some tips, here: Link

base_dir/llamacpp_rocm6.33/Dockerfile

FROM localhost/rocm-6.3.3_ubuntu-dev:latest

# This could be shortened, but I like to have multiple
# steps to make it clear, and show how to achieve
# things in different ways.
WORKDIR /app
RUN git clone https://github.com/ggml-org/llama.cpp.git
WORKDIR /app/llama.cpp
RUN mkdir build_hip
WORKDIR build_hip
# This will run the cmake configuration.
# Pre  6.4 -DAMDGPU_TARGETS=gfx1030
RUN HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S .. -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
# Post 6.4 -DGPU_TARGETS=gfx1030
# RUN HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S .. -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
# Here we build the binaries, both for the server and the bench.
RUN cmake --build . --config Release -j7 --target llama-server
RUN cmake --build . --config Release -j7 --target llama-bench

To build this one we will need to use a different command:

cd base_dir/llamacpp_rocm6.33/
podman build --no-cache -t rocm-6.3.3_llamacpp:b1234 .

As you can see we have added the --no-cache long flag; this is to make sure that the image actually gets rebuilt, otherwise it would just keep producing the same image over and over from the cache - because the recipe didn't change. This time the tag is a b1234 placeholder; you should use the current release build number or the current commit short hash of llama.cpp (you can easily find them when you start the binary, or by going to the GitHub page) to remember at which point you compiled, and use the dynamic latest tag as a supplementary bookmark. The current date is a good candidate too.

If something doesn't feel right - for example your GPU is not running when you make a request to the server - you should try to read the configuration step logs, to see that everything required has been correctly set up and there are no errors.

Let's compose it up

Now that we have two images that have compiled without any kind of error we can use them to reach our goal. I've heavily commented it, so just read and modify it directly. Don't worry too much about all the lines, but if you are curious - and you should be - you can easily search for them and find a bunch of explanations that are surely better than what I could write here without occupying too much space.

Being a yaml file - bless the soul of whoever decided that - pay attention to the whitespaces! They matter!

We will use two volumes: one will point to the folder where you have downloaded your GGUF files; the second one will point to where we have the entrypoint.sh file. We are putting the script into a volume instead of baking it into the container so you can easily modify it to experiment.

A small image that you could use as a benchmark to see if everything is working is Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf.

base_dir/llamacpp_rocm6.33/compose.yaml

# Benchmark image: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf
# bechmark command:
#    ./bin/llama-bench -t 7 -m /app/models/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0
#    ./bin/llama-bench -t 7 -m /app/models/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf -ngl 99
services:
    llamacpp-server:
        # If you have renamed the image, change it here too!
        image: localhost/rocm-6.3.3_llamacpp:latest
        # The subsequent two lines are needed to enter the image and directly use bash:
        # start it with [podman-compose up -d|docker compose up -d]
        # and then docker attach to the container with
        # [podman|docker] attach ID
        # You'll need to change the entrypoint.sh file too, just with the
        # shebang and a line straight up calling `bash`, as content.
        stdin_open: true
        tty: true
        # end bash section, Comment those two lines if you don't need shell
        # access. Or leave them.
        group_add:
            # The video group is needed on most distros to access the GPU
            # the render group is not present in some and needed
            # in others. Try it out.
            - "video" # 985 # video group - "render" # 989 # render
        environment:
            # FIXME: Change this with the right one!
            # If you have a wrong one it will _not work_.
            - HSA_OVERRIDE_GFX_VERSION=10.3.0
        devices:
            - /dev/kfd:/dev/kfd
            - /dev/dri:/dev/dri
        cap_add:
            - SYS_PTRACE
        logging:
            # The default logging driver is journald, which I despise
            # because it can pollute it up pretty hard.
            #
            # The none driver will not save the logs anywhere.
            # You can still attach to the container, but you will lose
            # the lines before the attachment.
            # driver: none
            #
            # The json-file option is deprecated, so we will use the
            # k8s-file one.
            # You can use `podman-compose logs -f` to keep tabs, and it will not
            # pollute the system journal.
            # Remember to `podman-compose down` to stop the container.
            # `ctrl+c`ing the logs will do nothing.
            driver: k8s-file
            options:
                max-size: "10m"
                max-file: "3"
                # You should probably use an absolute path.
                # Really.
                path: ./logs/logfile.log
        # This is mostly a fix for how podman net stack works.
        # If you are offline when starting the image it would just not
        # start, erroring out. Making it in host mode solves this
        # but it has other cons.
        # Reading the issue(https://github.com/containers/podman/issues/21896) it is
        # probably fixed, but I still have to test it out.
        # It mainly means that you can't have multiple of these running because they will
        # take the same port. Luckily you can change the port from the llama-server
        # command in the entrypoint.sh script.
        network_mode: "host"
        ipc: host
        security_opt:
            - seccomp:unconfined
        # These you really need to CHANGE.
        volumes:
            # FIXME: Change these paths! Only the left side before the `:`.
            #        Use absolute paths.
            - /path/on/your/machine/where/the/ggufs/are:/app/models
            - /path/to/rocm6.3.3-llamacpp/workdir:/workdir
        # It doesn't work with podman-compose
        # restart: no
        entrypoint: "/workdir/entrypoint.sh"
        # To make it easy to use I've added a number of env variables
        # with which you can set the llama.cpp command params.
        # More info in the bash script, but they are quite self explanatory.
        command:
            - "${MODEL_FILENAME:-Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf}"
            - "${GPU_LAYERS:-22}"
            - "${CONTEXT_SIZE:-8192}"
            - "${CALL_TYPE:-bench}"
            - "${CPU_THREADS:-7}"

Now that you have meticulously modified the above file let's talk about the script that will launch llama.cpp.

base_dir/llamacpp_rocm6.33/workdir/entrypoint.sh

#!/bin/bash
cd /app/llama.cpp/build_hip || exit 1
MODEL_FILENAME=${1:-"Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf"}
GPU_LAYERS=${2:-"22"}
CONTEXT_SIZE=${3:-"8192"}
CALL_TYPE=${4:-"server"}
CPU_THREADS=${5:-"7"}

if [ "$CALL_TYPE" = "bench" ]; then
  ./bin/llama-bench -t "$CPU_THREADS" -m /app/models/"$MODEL_FILENAME" -ngl "$GPU_LAYERS"
elif [ "$CALL_TYPE" = "fa-bench" ]; then
  ./bin/llama-bench -t "$CPU_THREADS" -m /app/models/"$MODEL_FILENAME" -ngl "$GPU_LAYERS" -fa 1 -ctk q4_0 -ctv q4_0
elif [ "$CALL_TYPE" = "server" ]; then
  ./bin/llama-server -t "$CPU_THREADS" -c "$CONTEXT_SIZE" -m /app/models/"$MODEL_FILENAME" -fa -ngl "$GPU_LAYERS" -ctk q4_0 -ctv q4_0
else
  echo "Valid modalities are \"bench\", \"fa-bench\" or \"server\""
  exit 1
fi

exit 0

This is straightforward. It enters the folder (inside the container) where we built the binaries and then calls the right command, decided by an env var. I've set it up to handle some common options, so you don't have to change the script every time you want to run a different model or change the number of layers loaded into VRAM.

The beauty of it is that you could put a .env file in the llamacpp_rocm6.33 folder with the params you want to use, and just start the container.

An example .env file could be:

base_dir/llamacpp_rocm6.33/.env

MODEL_FILENAME=Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf
GPU_LAYERS=99
CONTEXT_SIZE=8192
CALL_TYPE=bench
CPU_THREADS=7

Some notes:

  • For now it uses flash attention by default with a quantized context. You can avoid this by deleting the -fa and the -ctk q4_0 -ctv q4_0. Experiment around.
  • You could add more params or environment variables: it is easy to do. How about one for the port number? (See the sketch right after this list.)
  • Find more info about llama.cpp server here: Link.
  • And the bench here: Link.
  • For now I've set up three modes: the server, a plain bench, and a bench with FlashAttention enabled (server, bench, fa-bench).
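A sketch of what a port parameter could look like, following the same pattern as the existing variables (untested, and the names are up to you):

# in entrypoint.sh, next to the other defaults:
SERVER_PORT=${6:-"8080"}
# then append --port "$SERVER_PORT" to the llama-server line, and in
# compose.yaml's command list add another entry:
#   - "${SERVER_PORT:-8080}"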

Time to start it

Starting it is just a command away:

cd base_dir/llamacpp_rocm6.33/
podman-compose up -d
podman-compose logs -f

When everything is completely loaded, open your browser and go to http://127.0.0.1:8080/ to be welcomed by the llama.cpp webui and test if the GPU is being used. (I've my fingers crossed for you!)

Now that everything is working, have fun with your waifus and/or husbandos! ..Sorry, I meant, be productive with your helpful assistant!

When you are done, in the same folder, run podman-compose down to mercilessly kill them off.

Licensing

I know, I know. But better safe than sorry.

All the code, configurations and comments in them not otherwise already under other licenses or under copyright by others, are dual licensed under the MIT and Apache 2 licenses, Copyright 2025 [[email protected]](https://feddit.it/u/Mechanize) . Take your pick.

All the other text of the post © 2025 by [email protected] is licensed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/

21
 
 

If you are an agent builder, these three protocols should be all you need

  • MCP gives agents tools
  • A2A allows agents to communicate with other agents
  • AG-UI brings your agents to the frontend, so they can engage with users.

Is there anything I'm missing?

22
 
 

"While the B60 is designed for powerful 'Project Battlematrix' AI workstations sold as full systems ranging from $5,000 to $10,000, it will carry a roughly $500 per-unit price tag."

23
24
 
 

from 10b0t0mized: I miss the days when I had to go through a humiliation ritual before getting my questions answered.

Nowadays you can just ask your questions of an infinitely patient entity. AI is really terrible.

25