LocalLLaMA

3321 readers

1 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Lets explore cutting edge open source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members. I.E no namecalling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency. I.E no comparing the usefulness of models to that of NFTs, no comparing the resource usage required to train a model is anything close to maintaining a blockchain/ mining for crypto, no implying its just a fad/bubble that will leave people with nothing of value when it burst.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms. I.E statements such as "llms are basically just simple text predictions like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>.

Rule 4 - No implying that models are devoid of purpose or potential for enriching peoples lives.

founded 2 years ago

MODERATORS

201

LLMs made into single-file executables with llamafile (hacks.mozilla.org)

submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama

4 comments fedilink

Seems like a really cool project. Lowering the barrier to entry of locally run models. As llamacpp supports a ton of models, I imagine it be easy to adapt this for other models other than the prebuilt ones.

202

Unsloth: 80% faster 50% less memory LLM finetuning (github.com)

submitted 2 years ago by [email protected] to c/localllama

1 comments fedilink

203

I'm having a fantastic time with this model. (huggingface.co)

submitted 2 years ago by [email protected] to c/localllama

6 comments fedilink

I've been using TheBlokes Q8 of https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B, but now this one (https://huggingface.co/Weyaxi/OpenHermes-2.5-neural-chat-7b-v3-1-7B) I think is killing it. Has anyone else tested it?

204

Orca 2: Teaching Small Language Models How to Reason (www.microsoft.com)

submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama

7 comments fedilink

Orca 2 released by Microsoft!

Full weights here:

https://huggingface.co/microsoft/Orca-2-7b

https://huggingface.co/microsoft/Orca-2-13b

My own exllamav2 quants here:

https://huggingface.co/bartowski/Orca-2-7b-exl2

https://huggingface.co/bartowski/Orca-2-13b-exl2

GGUF from TheBloke (links to GPTQ/AWQ inside it):

https://huggingface.co/TheBloke/Orca-2-7B-GGUF

https://huggingface.co/TheBloke/Orca-2-13B-GGUF

205

Hundreds of OpenAI employees threaten to resign and join Microsoft (www.theverge.com)

submitted 2 years ago by noneabove1182 to c/localllama

8 comments fedilink

206

Catch me if you can! How to beat GPT-4 with a 13B model | LMSYS Org (lmsys.org)

submitted 2 years ago by noneabove1182 to c/localllama

0 comments fedilink

LMSYS examines how improper data decontamination can lead to artificially inflated scores

207

TensorRT-LLM evaluation of the new H200 GPU achieves 11,819 tokens/s on Llama2-13B (github.com)

submitted 2 years ago by noneabove1182 to c/localllama

0 comments fedilink

H200 is up to 1.9x faster than H100. This performance is enabled by H200's larger, faster HBM3e memory.

https://nvidianews.nvidia.com/news/nvidia-supercharges-hopper-the-worlds-leading-ai-computing-platform

208

... (programming.dev)

submitted 2 years ago* (last edited 1 year ago) by [email protected] to c/localllama

0 comments fedilink

CogVLM: Visual Expert for Pretrained Language Models

Presents CogVLM, a powerful open-source visual language foundation model that achieves SotA perf on 10 classic cross-modal benchmarks

repo: https://github.com/THUDM/CogVLM abs: https://arxiv.org/abs/2311.03079

209

ExUI - a lightweight web UI for ExLlamaV2 by turboderp (github.com)

submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama

0 comments fedilink

The creator of ExLlamaV2 (turboderp) has released a lightweight web UI for running exllamav2, it's quite nice! Missing some stuff from text-generation-webui, but makes up for it by being very streamlined and clean

I've made a docker image for it for anyone who may want to try it out, GitHub repo here:

https://github.com/noneabove1182/exui-docker

And for finding models to run with exllamav2 I've been uploading several here:

https://huggingface.co/bartowski

Enjoy!

210

... (programming.dev)

submitted 2 years ago* (last edited 1 year ago) by [email protected] to c/localllama

2 comments fedilink

article: https://x.ai

trained a prototype LLM (Grok-0) with 33 billion parameters. This early model approaches LLaMA 2 (70B) capabilities on standard LM benchmarks but uses only half of its training resources. In the last two months, we have made significant improvements in reasoning and coding capabilities leading up to Grok-1, a state-of-the-art language model that is significantly more powerful, achieving 63.2% on the HumanEval coding task and 73% on MMLU.

211

New "Context Shifting" feature in KoboldCPP 1.48 (github.com)

submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama

2 comments fedilink

"This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing."

This means a major speed increase for people like me who rely on (slow) CPU inference (or big models). Consider a chatbot scenario and a long chat where old lines of dialogue need to be evicted from the context to stay within the (4096 token) context size. Previously the context had to be re-computed starting with the first changed/now missing token. This feature detects that, deletes the affected tokens from the KV cache and shifts the subsequent tokens in the KV cache so it can be re-used. Avoiding a computationally expensive re-calculation.

This is probably also more or less related to recent advancements like Streaming-LLM

This won't help once text gets inserted "in the middle" or the prompt gets changed in another way. But I managed to connect KoboldCPP as a backend for SillyTavern/Oobabooga and now I'm able to have unlimited length conversations without waiting excessively, once the chat history hits max tokens and the frontend starts dropping text.

It's just a clever way to re-use the KV cache in one specific case. But I've wished for this for quite some time.

212

LangChain Web UI - ollama (github.com)

submitted 2 years ago by [email protected] to c/localllama

0 comments fedilink

A python Gradio web UI built using GPT4AllEmbeddings

213

Talk with LLaMA AI in your terminal (github.com)

submitted 2 years ago by [email protected] to c/localllama

0 comments fedilink

214

why reddit? (lemmy.world)

submitted 2 years ago by [email protected] to c/localllama

13 comments fedilink

I don't get why a group of users that are willing to run their own LLMs locally and do not want to relay on centralized corporations like openAI or google prefer to discuss using a centralized site like Reddit

215

Phind V7 subjectively performing at GPT4 levels for coding (news.ycombinator.com)

submitted 2 years ago by noneabove1182 to c/localllama

5 comments fedilink

Phind is now using a V7 of their model for their own platform, as they have found that people overall prefer that output vs GPT4. This is extremely impressive because it's not just a random benchmark that can be gamed, but instead crowd sourced opinion on real tasks

The one place everything still lags behind GPT4 is question comprehension, but this is a huge accomplishment

Blog post: https://www.phind.com/blog/phind-model-beats-gpt4-fast

note: they've only open released V2 of their model, hopefully they release newer versions soon.. would love to play with them outside their sandbox

216

Min P sampler (an alternative to Top K/Top P) has been merged into llama.cpp (github.com)

submitted 2 years ago by noneabove1182 to c/localllama

0 comments fedilink

Very interesting new sampler, does a better drop at filtering out extremely unlikely tokens when the most likely tokens are less confident, from the results it seems to pretty reliably improve quality with no noticeable downside

217

HUGE dataset released for open source use (together.ai)

submitted 2 years ago by noneabove1182 to c/localllama

4 comments fedilink

30T tokens, 20.5T in English, allegedly high quality, can't wait to see people start putting it to use!

218

I've started uploading quants of exllama v2 models, taking requests (huggingface.co)

submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama

0 comments fedilink

Finally got a nice script going that automates most of the process. Uploads will all be same format, with each bit per weight going into its own branch.

the first two I did don't have great READMEs but the rest will look like this one: https://huggingface.co/bartowski/Mistral-7B-claude-chat-exl2

Also taking recommendations on anything you want to see included in readme or quant levels

219

Nearly 10% of people ask AI chatbots for explicit content. Will it lead LLMs astray? [Article from October 3] (www.zdnet.com)

submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama

18 comments fedilink

They are referencing this paper: LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset from September 30.

The paper itself provides some insight on how people use LLMs and the distribution of the different use-cases.

The researchers had a look at conversations with 25 LLMs. Data is collected from 210K unique IP addresses in the wild on their Vicuna demo and Chatbot Arena website.

220

Text Generation Web-UI has been updated to CUDA 12.1, and with it new docker images are needed (self.localllama)

submitted 2 years ago by noneabove1182 to c/localllama

0 comments fedilink

For anyone who happens to be using my docker images or Dockerfiles for their text-gen-webui, it all started breaking this week when Oobabooga's work was updated to support 12.1

As such, I have updated my docker images and fixed a bunch of issues in the build process. Also been awhile since I posted it here.

You can find all the details here:

https://github.com/noneabove1182/text-generation-webui-docker

It requires driver version 535.113.01

Happy LLMing!

221

Single Digit tokenization improves LLM math abilities by up to 70x (twitter.com)

submitted 2 years ago by noneabove1182 to c/localllama

2 comments fedilink

From the tweet (minus pictures):

Language models are bad a basic math.

GPT-4 has right around 0% accuracy rate on 5 digit multiplication.

Most open models can't even add. Why is that?

There are a few reasons why numbers are hard. The main one is Tokenization. When training a tokenizer from scratch, you take a large corpus of text and find the minimal byte-pair encoding for a chosen vocabulary size.

This means, however, that numbers will almost certainly not have unique token representations. "21" could be a single token, or ["2", "1"]. 143 could be ["143"] or ["14", "3"] or any other combination.

A potential fix here would be to force single digit tokenization. The state of the art for the last few years is to inject a space between every digit when creating the tokenizer and when running the model. This means 143 would always be tokenized as ["1", "4", "3"].

This helps boost performance, but wastes tokens while not fully fixing the problem.

A cool fix might be xVal! This work by The Polymathic AI Collaboration suggests a generic [NUM] token which is then scaled by the actual value of the number!

If you look at the red lines in the image above, you can get an intuition for how that might work.

It doesn't capture a huge range or high fidelity (e.g., 7.4449 vs 7.4448) but they showcase some pretty convincing results on sequence prediction problems that are primarily numeric.

For example, they want to train a sequence model on GPS conditioned temperature forecasting

They found a ~70x improvement over standard vanilla baselines and a 2x improvement over really strong baselines.

One cool side effect is that deep neural networks might be really good at regression problems using this encoding scheme!

222

Musical notation (lemmy.dbzer0.com)

submitted 2 years ago by [email protected] to c/localllama

4 comments fedilink

Would adding musical notations to LLMs training data allow it to create music since notations are a lot like normal languages? Or does it do so already?

223

Are Local LLMs Useful in Incident Response? - SANS Internet Storm Center (isc.sans.edu)

submitted 2 years ago by [email protected] to c/localllama

0 comments fedilink

224

Dolphin 2.0 based on mistral-7b released by Eric Hartford (huggingface.co)

submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama

1 comments fedilink

Model is trained on his own orca style dataset as well as some airoboros apparently to increase creativity

Quants:

https://huggingface.co/TheBloke/dolphin-2.0-mistral-7B-GPTQ

https://huggingface.co/TheBloke/dolphin-2.0-mistral-7B-GGUF

https://huggingface.co/TheBloke/dolphin-2.0-mistral-7B-AWQ

225

Beginner questions thread (self.localllama)

submitted 2 years ago by noneabove1182 to c/localllama

28 comments fedilink

Trying something new, going to pin this thread as a place for beginners to ask what may or may not be stupid questions, to encourage both the asking and answering.

Depending on activity level I'll either make a new one once in awhile or I'll just leave this one up forever to be a place to learn and ask.

When asking a question, try to make it clear what your current knowledge level is and where you may have gaps, should help people provide more useful concise answers!