LocalLLaMA

26

16

Best Upgrade Path for my Desktop (lemm.ee)

submitted 7 months ago by [email protected] to c/localllama

5 comments fedilink

Current situation: I've got a desktop with 16 GB of DDR4 RAM, a 1st gen Ryzen CPU from 2017, and an AMD RX 6800 XT GPU with 16 GB VRAM. I can 7 - 13b models extremely quickly using ollama with ROCm (19+ tokens/sec). I can run Beyonder 4x7b Q6 at around 3 tokens/second.

I want to get to a point where I can run Mixtral 8x7b at Q4 quant at an acceptable token speed (5+/sec). I can run Mixtral Q3 quant at about 2 to 3 tokens per second. Q4 takes an hour to load, and assuming I don't run out of memory, it also runs at about 2 tokens per second.

What's the easiest/cheapest way to get my system to be able to run the higher quants of Mixtral effectively? I know that I need more RAM Another 16 GB should help. Should I upgrade the CPU?

As an aside, I also have an older Nvidia GTX 970 lying around that I might be able to stick in the machine. Not sure if ollama can split across different brand GPUs yet, but I know this capability is in llama.cpp now.

Thanks for any pointers!

27

12

I'm I the only one blown away by AI? (lemmy.zip)

submitted 7 months ago by [email protected] to c/localllama

16 comments fedilink

Recently OpenAI released GPT-4o

Video I found explaining it: https://youtu.be/gy6qZqHz0EI

Its a little creepy sometimes but the voice inflection is kind of wild. What I the to be alive.

28

31

Mozilla's Llamafile 0.8.2 Scores Big With New AVX2 Performance Optimizations (www.phoronix.com)

submitted 7 months ago by [email protected] to c/localllama

3 comments fedilink

29

9

What is your average token usage (inference) pr day with your particular workflow ? (lemmy.ml)

submitted 7 months ago by [email protected] to c/localllama

0 comments fedilink

I am planning my first ai-lab setup, and was wondering how many tokens different AI-workflows/agent network eat up on an average day. For instance talking to an AI all day, have devlin running 24/7 or whatever local agent workflow is running.

Oc model inference speed and type of workflow influences most of these networks, so perhaps it's easier to define number of token pr project/result ?

So I were curious about what typical AI-workflow lemmies here run, and how many tokens that roughly implies on average, or on a project level scale ? Atmo I don't even dare to guess.

Thanks..

30

25

Llama 3 Establishes Meta as the Leader in “Open” AI (spectrum.ieee.org)

submitted 7 months ago by [email protected] to c/localllama

2 comments fedilink

31

12

Eric Hartford on X: "I am super excited to announce that I've accepted a position with @TensorWaveCloud - focused on training AI models with @AMDInstinct technologies!" (twitter.com)

submitted 8 months ago by [email protected] to c/localllama

0 comments fedilink

Hartford is credited as creator of Dolphin-Mistral, Dolphin-Mixtral and lots of other stuff.

He's done a huge amount of work on uncensored models.

32

11

Meta's Llama 3 will force OpenAI and other AI giants to up their game (www.itpro.com)

submitted 8 months ago by [email protected] to c/localllama

10 comments fedilink

33

35

Meta releases Llama 3, claims it's among the best open models available (www.yahoo.com)

submitted 8 months ago by [email protected] to c/localllama

2 comments fedilink

34

28

New Mistral model is out (twitter.com)

submitted 8 months ago* (last edited 8 months ago) by [email protected] to c/localllama

7 comments fedilink

From Simon Willison: "Mistral tweet a link to a 281GB magnet BitTorrent of Mixtral 8x22B—their latest openly licensed model release, significantly larger than their previous best open model Mixtral 8x7B. I’ve not seen anyone get this running yet but it’s likely to perform extremely well, given how good the original Mixtral was."

35

67

Meta confirms that its Llama 3 open source LLM is coming in the next month (techcrunch.com)

submitted 8 months ago by [email protected] to c/localllama

4 comments fedilink

36

37

LLaMA Now Goes Faster on CPUs (justine.lol)

submitted 8 months ago by [email protected] to c/localllama

1 comments fedilink

37

8

What's the current recommendation for an anime oriented model? (ani.social)

submitted 8 months ago* (last edited 8 months ago) by [email protected] to c/localllama

0 comments fedilink

I've been using tie-fighter which hasn't been too bad with lorebooks in tavern.

38

6

Devika is an Agentic AI Software Engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to ach (github.com)

submitted 8 months ago by [email protected] to c/localllama

0 comments fedilink

39

17

Dock GPU to Laptop or to small SOC? (feddit.de)

submitted 8 months ago by [email protected] to c/localllama

7 comments fedilink

Afaik most LLMs run purely on the GPU, dont they?

So if I have an Nvidia Titan X with 12GB of RAM, could I plug this into my laptop and offload the load?

I am using Fedora, so getting the NVIDIA drivers would be... fun and already probably a dealbreaker (wouldnt want to run proprietary drivers on my daily system).

I know that using ExpressPort adapters people where able to use GPUs externally, and this is possible with thunderbolt too, isnt it?

The question is, how well does this work?

Or would using a small SOC to host a webserver for the interface and do all the computing on the GPU make more sense?

I am curious about the difficulties here, ARM SOC and proprietary drivers? Laptop over USB-c (maybe not thunderbolt?) and a GPU just for the AI tasks...

40

17

AnythingLLM | The ultimate AI business intelligence tool (useanything.com)

submitted 8 months ago by [email protected] to c/localllama

9 comments fedilink

Linux package available like LM Studio

41

24

Open web UI - a web UI primarily for ollama that has a bunch of useful functionally (github.com)

submitted 8 months ago* (last edited 8 months ago) by [email protected] to c/localllama

1 comments fedilink

42

10

Evolving New Foundation Models: Unleashing the Power of Automating Model Development (sakana.ai)

submitted 8 months ago by [email protected] to c/localllama

0 comments fedilink

arXiv: https://arxiv.org/abs/2403.13187 [cs.NE]
GitHub: https://github.com/SakanaAI/evolutionary-model-merge

43

5

GaLore: Advancing Large Model Training on Consumer-grade Hardware (huggingface.co)

submitted 8 months ago by [email protected] to c/localllama

0 comments fedilink

arXiv: https://arxiv.org/abs/2403.03507 [cs.LG]

44

20

Mistral 7B v0.2 Base (released at SHACK15sf hackathon) (github.com)

submitted 8 months ago* (last edited 8 months ago) by [email protected] to c/localllama

1 comments fedilink

GitHub: https://github.com/mistralai-sf24/hackathon
X: https://twitter.com/MistralAILabs/status/1771670765521281370

New release: Mistral 7B v0.2 Base (Raw pretrained model used to train Mistral-7B-Instruct-v0.2)
🔸 https://models.mistralcdn.com/mistral-7b-v0-2/mistral-7B-v0.2.tar
🔸 32k context window
🔸 Rope Theta = 1e6
🔸 No sliding window
🔸 How to fine-tune:

45

77

Ollama now supports AMD graphics cards (ollama.com)

submitted 9 months ago by [email protected] to c/localllama

4 comments fedilink

But in all fairness, it's really llama.cpp that supports AMD.

Now looking forward to the Vulkan support!

46

11

T-Ragx - Enhancing Translation with RAG-Powered LLMs (github.com)

submitted 9 months ago by [email protected] to c/localllama

3 comments fedilink

Excited to share my T-Ragx project! And here are some additional learnings for me that might be interesting to some:

vector databases aren't always the best option
- Elasticsearch or custom retrieval methods might work even better in some cases
LoRA is incredibly powerful for in-task applications
The pace of the LLM scene is astonishing
- TowerInstruct and ALMA-R translation LLMs launched while my project was underway
Above all, it was so fun!

Please let me know what you think!

47

19

My personal collection of interesting models I've quantized from the past week (yes, just week) (twitter.com)

submitted 9 months ago by noneabove1182 to c/localllama

4 comments fedilink

So you don't have to click the link, here's the full text including links:

Some of my favourite @huggingface models I've quantized in the last week (as always, original models are linked in my repo so you can check out any recent changes or documentation!):

@shishirpatil_ gave us gorilla's openfunctions-v2, a great followup to their initial models: https://huggingface.co/bartowski/gorilla-openfunctions-v2-exl2

@fanqiwan released FuseLLM-VaRM, a fusion of 3 architectures and scales: https://huggingface.co/bartowski/FuseChat-7B-VaRM-exl2

@IBM used a new method called LAB (Large-scale Alignment for chatBots) for our first interesting 13B tune in awhile: https://huggingface.co/bartowski/labradorite-13b-exl2

@NeuralNovel released several, but I'm a sucker for DPO models, and this one uses their Neural-DPO dataset: https://huggingface.co/bartowski/Senzu-7B-v0.1-DPO-exl2

Locutusque, who has been making the Hercules dataset, released a preview of "Hyperion": https://huggingface.co/bartowski/hyperion-medium-preview-exl2

@AjinkyaBawase gave an update to his coding models with code-290k based on deepseek 6.7: https://huggingface.co/bartowski/Code-290k-6.7B-Instruct-exl2

@Weyaxi followed up on the success of Einstein v3 with, you guessed it, v4: https://huggingface.co/bartowski/Einstein-v4-7B-exl2

@WenhuChen with TIGER lab released StructLM in 3 sizes for structured knowledge grounding tasks: https://huggingface.co/bartowski/StructLM-7B-exl2

and that's just the highlights from this past week! If you'd like to see your model quantized and I haven't noticed it somehow, feel free to reach out :)

48

40

[Paper] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (huggingface.co)

submitted 9 months ago* (last edited 8 months ago) by [email protected] to c/localllama

18 comments fedilink

From the abstract: "Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}."

Would allow larger models with limited resources. However, this isn't a quantization method you can convert models to after the fact, Seems models need to be trained from scratch this way, and to this point they only went as far as 3B parameters. The paper isn't that long and seems they didn't release the models. It builds on the BitNet paper from October 2023.

"the matrix multiplication of BitNet only involves integer addition, which saves orders of energy cost for LLMs." (no floating point matrix multiplication necessary)

"1-bit LLMs have a much lower memory footprint from both a capacity and bandwidth standpoint"

Edit: Update: additional FAQ published

49

12

Gemma 2B vs Phi-2 (lemmy.world)

submitted 9 months ago by [email protected] to c/localllama

3 comments fedilink

50

17

NVIDIA Chat With RTX (www.nvidia.com)

submitted 10 months ago by [email protected] to c/localllama

6 comments fedilink

This is an interesting demo, but it has some drawbacks I can already see:

It's Windows only (maybe Win11 only, the documentation isn't clear)
It only works with RTX 30 series and up
It's closed source, so you have no idea if they're uploading your data somewhere

The concept is great, having an LLM to sort through your local files and help you find stuff, but it seems really limited.

I think you could get the same functionality(and more) by writing an API for text-gen-webui.

more info here: https://videocardz.com/newz/nvidia-unveils-chat-with-rtx-ai-chatbot-powered-locally-by-geforce-rtx-30-40-gpus