LocalLLaMA

176

14

What are your favorite models so far? (self.localllama)

submitted 2 years ago* (last edited 2 years ago) by AsAnAILanguageModel to c/localllama

12 comments fedilink

I think it's a good idea to share experiences about LLMs here, since benchmarks can only give a very rough overview of how well a model performs.

So please share how much you're using LLMs, what you use them for, and how well they perform at those tasks. For example, here are my answers to these questions:

Usage

I use LLMs daily for work and for random questions that I would previously use web search for.

I mainly use LLMs for reasoning heavy tasks, such as assisting with math or programming. Other frequent tasks include proofreading, helping with bureaucracy, or assisting with writing when it matters.

Models

The one I find most impressive at the moment is TheBloke/airoboros-l2-70B-gpt4-1.4.1-GGML/airoboros-l2-70b-gpt4-1.4.1.ggmlv3.q2_K.bin. It often manages to reason correctly on questions where most other models I tried fail, even though most humans wouldn't fail them. I was surprised that something using only 2.5 bits per weight on average could produce anything but garbage. The downside is that loading times are rather long, so I wouldn't ask it a question if I didn't want to wait (time to first token is almost 50s!). I'd love to hear how bigger quantizations or the unquantized versions perform.

Another one that made a good impression on me is Qwen-7B-Chat (demo). It manages to correctly answer some questions where even some llama2-70b finetunes fail, ~~but so far I'm getting memory leaks when running it on my M1 mac in fp16 mode, so I didn't use it a lot.~~ (this has been fixed it seems!)

All other models I briefly tried were not too useful. It's nice to be able to run them locally, but they were so much worse than ChatGPT that it's often not even worth considering them.

177

17

I want to train / fine-tune a model (self.localllama)

submitted 2 years ago by Naked_Yoga to c/localllama

1 comments fedilink

I want to train, or more likely fine-tune, a model on about 20 years' worth of email and text data that I've collected.

The goal would be to train it how to respond like me in simple cases.

Is there a particular base model I should start with?

I'm also interested in anyone's experience in doing this kind of thing themselves.

178

19

What is wrong with LLM benchmarks, and why are we still using them? (lemmy.micheal65536.duckdns.org)

submitted 2 years ago by [email protected] to c/localllama

12 comments fedilink

You are probably familiar with the long list of various benchmarks that new models are tested on and compared against. These benchmarks are supposedly designed to assess the model's ability to perform in various aspects of language understanding, logical reasoning, information recall, and so on.

However, while I understand the need for an objective and scientific measurement scale, I have long felt that these benchmarks are not particularly representative of the actual experience of using the models. For example, people will claim that a model performs at "some percentage of GPT-3" and yet not one of these models has ever been able to produce correctly-functioning code for any non-trivial task or follow a line of argument/reasoning. Talking to GPT-3 I have felt that the model has an actual in-depth understanding of the text, question, or argument, whereas other models that I have tried always feel as though they have only a superficial/surface-level understanding regardless of what the benchmarks claim.

My most recent frustration, and the one that prompted this post, is regarding the newly-released OpenOrca preview 2 model. The benchmark numbers claim that it performs better than other 13B models at the time of writing, supposedly outperforms Microsoft's own published benchmark results for their yet-unreleased model, and scores an "average" result of 74.0% against GPT-3's 75.7% while the LLaMa model that I was using previously apparently scores merely 63%.

I've used GPT-3 (text-davinci-003), and this model does not "come within comparison" of it. Even giving it as much of a fair chance as I can, giving it plenty of leeway and benefit of the doubt, not only can it still not write correct code (or even valid code in a lot of cases) but it is significantly worse at it than LLaMa 13B (which is also pretty bad). This model does not understand basic reasoning and fails at basic reasoning tasks. It will write a long step-by-step explanation of what it claims that it will do, but the answer itself contradicts the provided steps or the steps themselves are wrong/illogical. The model has only learnt to produce "step by step reasoning" as an output format, and has a worse understanding of what that actually means than any other model does when asked to "explain your reasoning" (at least, for other models that I have tried, asking them to explain their reasoning produces at least a marginal improvement in coherence).

There is something wrong with these benchmarks. They do not relate to real-world performance. They do not appear to be measuring a model's ability to actually understand the prompt/task, but possibly only measuring its ability to provide an output that "looks correct" according to some format. These benchmarks are not a reliable way to compare model performance and as long as we keep using them we will keep producing models that score higher on benchmarks and claim to perform "almost as good as GPT-3" but yet fail spectacularly in any task/prompt that I can think of to throw at them.

(I keep using coding as an example however I have also tried other tasks besides code as I realise that code is possibly a particularly challenging task due to requirements like needing exact syntax. My interpretation of the various models' level of understanding is based on experience across a variety of tasks.)

179

10

PSA for any docker users, strange driver mismatch (self.localllama)

submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama

0 comments fedilink

As some may know, I maintain Docker images for a few of the available tools, and I noticed I was suddenly getting an NVML mismatch error. For the life of me I could not figure out what the issue was, and I tried so many things. I finally noticed that the Docker image had a special driver, 535.86.10, while my host had 535.86.05. After figuring that out, I looked into it and added this to my Dockerfile:

RUN apt-get update && apt-get remove --purge -y nvidia-* && \
    apt-get install -y --allow-downgrades nvidia-driver-535/jammy-updates

And voila, problem solved! I'm not sure what driver the Docker CUDA image was using; it might be some special dev driver that was causing a mismatch between the container and the host.

Only started happening as of the latest driver update released late last month
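For anyone who wants to check for this kind of mismatch, here is a minimal sketch that compares two driver version strings. The helper names are made up for illustration, and the two versions are simply the ones quoted in the post; how you collect them on your own host and inside your container is up to you.

```python
def parse_driver_version(version: str) -> tuple[int, ...]:
    """Turn a dotted driver version like '535.86.10' into (535, 86, 10)."""
    return tuple(int(part) for part in version.strip().split("."))


def drivers_match(host: str, container: str) -> bool:
    """True only when both version strings describe the same driver build."""
    return parse_driver_version(host) == parse_driver_version(container)


host_version = "535.86.05"       # version on the host (taken from the post)
container_version = "535.86.10"  # version inside the image (taken from the post)

if not drivers_match(host_version, container_version):
    print(f"Mismatch: host {host_version} vs container {container_version}")
```

Comparing parsed tuples rather than raw strings avoids false alarms from stray whitespace or leading zeros.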

180

9

What General LLM/AI Resources are out there? (lemmy.world)

submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama

0 comments fedilink

I wanted to make this post so we can share all the resources we have with each other on anything machine learning related.

Please feel free to add all of your resources as well even if they are duplicates.

PS: The best way to grow our lemmy community is to produce high quality posts.

Some ideas of things you could share:

What people do you follow for AI? Such as on YT, Twitter, etc.
What other social media forums provide great information?
What GUI do you use for local LLMs?
What parameters are "best"?
Is there a Wiki you use?
Where do you go to learn about LLMs/AI/Machine Learning?
How do you find quality models?
What awesome GitHub repositories do you know?
What do you think would be useful to share?

General Information - Awesome

Awesome-LLM: https://github.com/Hannibal046/Awesome-LLM
Awesome Jailbreaks: https://github.com/0xk1h0/ChatGPT_DAN
Awesome Prompts: https://github.com/f/awesome-chatgpt-prompts
Prompt-Engineering-Guide: https://github.com/dair-ai/Prompt-Engineering-Guide
AI Explained (Great channel for AI news): https://piped.video/channel/UCNJ1Ymd5yFuUPtn21xtRbbw
Lex Fridman (In depth podcasts): https://piped.video/channel/UCSHZKyawb77ixDdsGog4iWA

LLM Leaderboards:

LLM Logic Tests: https://docs.google.com/spreadsheets/d/1NgHDxbVWJFolq8bLvLkuPWKC7i_R6I6W/edit#gid=2011456595
llm-leaderboard: https://github.com/LudwigStumpp/llm-leaderboard
Chat leaderboard: https://chat.lmsys.org/?leaderboard
Gotzmann LLM Score v2.4: https://docs.google.com/spreadsheets/d/1ikqqIaptv2P4_15Ytzro46YysCldKY7Ub2wcX5H1jCQ/edit#gid=0
LLM Worksheet: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=0
CanAiCode Leaderboard: https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
AlpacaEval Leaderboard: https://tatsu-lab.github.io/alpaca_eval/
Measuring Massive Multitask Language Understanding: https://github.com/hendrycks/test
Awesome-LLM-Benchmark: https://github.com/SihyeongPark/Awesome-LLM-Benchmark

Places to Find Models

Discovery the LLMs: https://llm.extractum.io/
Open LLM Models List: https://github.com/underlines/awesome-marketing-datascience/blob/master/llm-model-list.md
OSS_LLMs: https://docs.google.com/spreadsheets/d/1PtrPwDV8Wcdhzh-N_Siaofc2R6TImebnFvv0GuCCzdo/edit#gid=0
OpenLLaMA: An Open Reproduction of LLaMA: https://github.com/openlm-research/open_llama
open-llms: https://github.com/eugeneyan/open-llms

Training & Datasets

Uncensored Models: https://erichartford.com/uncensored-models
LLMsPracticalGuide: https://github.com/Mooler0410/LLMsPracticalGuide
awesome-chatgpt-dataset: https://github.com/voidful/awesome-chatgpt-dataset
awesome-instruction-dataset: https://github.com/yaodongC/awesome-instruction-dataset

There are still many more resources out there I'm sure. Please share what you use to try to keep up with the fast pace of AI development.

I hope some of my resources have helped you! I'm eager to hear what other resources are out there!

181

21

Any suggestions for this community? (self.localllama)

submitted 2 years ago by noneabove1182 to c/localllama

20 comments fedilink

Just wondering if anyone has any suggestions to keep things moving and growing. I was thinking of doing a daily quantized models post just to keep up with TheBloke. Thoughts?

182

6

Open-Orca has released their second preview of OpenChat - Hugging Face (huggingface.co)

submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama

5 comments fedilink

This one is based on Llama 2. The first one worked very well for rule and structure following with guidance, so I'm highly intrigued to see if this lives up to the previous one.

183

15

Vicuna v1.5 Has Been Released! (lemmy.world)

submitted 2 years ago by [email protected] to c/localllama

1 comments fedilink

Click Here to be Taken to the Megathread!

from [email protected]

Vicuna v1.5 Has Been Released!

Shoutout to [email protected] for catching this in an earlier post.

Given Vicuna was a widely appreciated member of the original Llama series, it'll be exciting to see this model evolve and adapt with fresh datasets and new training and fine-tuning approaches.

Feel free to use this megathread to chat about Vicuna and any of your experiences with Vicuna v1.5!

Starting off with Vicuna v1.5

TheBloke is already sharing models!

Vicuna v1.5 GPTQ

7B

Vicuna-7B-v1.5-GPTQ

Vicuna-7B-v1.5-16K-GPTQ

13B

Vicuna-13B-v1.5-GPTQ

Vicuna Model Card

Model Details

Vicuna is a chat assistant fine-tuned from Llama 2 on user-shared conversations collected from ShareGPT.

Developed by: LMSYS

Model type: An auto-regressive language model based on the transformer architecture

License: Llama 2 Community License Agreement

Finetuned from model: Llama 2

Model Sources

Repository: https://github.com/lm-sys/FastChat

Blog: https://lmsys.org/blog/2023-03-30-vicuna/

Paper: https://arxiv.org/abs/2306.05685

Demo: https://chat.lmsys.org/

Uses

The primary use of Vicuna is for research on large language models and chatbots. The target userbase includes researchers and hobbyists interested in natural language processing, machine learning, and artificial intelligence.

How to Get Started with the Model

Command line interface: https://github.com/lm-sys/FastChat#vicuna-weights

APIs (OpenAI API, Huggingface API): https://github.com/lm-sys/FastChat/tree/main#api

Training Details

Vicuna v1.5 is fine-tuned from Llama 2 using supervised instruction. The model was trained on approximately 125K conversations from ShareGPT.com.

For additional details, please refer to the "Training Details of Vicuna Models" section in the appendix of the linked paper.

Evaluation Results

Vicuna is evaluated using standard benchmarks, human preferences, and LLM-as-a-judge. For more detailed results, please refer to the paper and leaderboard.

184

14

is the 4k context length of llama2 for real? (self.localllama)

submitted 2 years ago by actuallyacat to c/localllama

13 comments fedilink

I've been using airoboros-l2-70b for writing fiction, and while overall I'd describe the results as excellent and better than any llama1 model I've used, it doesn't seem to be living up to the promise of 4k token sequence length.

Around 2500 tokens, output quality degrades rapidly: the model either starts repeating previous text verbatim or becomes incoherent (grammar, punctuation and capitalization disappear, and the output becomes a salad of vaguely related words).

Any other experiences with llama2 and long context? Does the base model work better? Are other fine tunes behaving similarly? I'll try myself eventually, but the 70b models are chunky downloads, and experimentation takes a while at 1 t/s.

(I'm using GGML Q4_K_M on kobold.cpp, with rope scaling off like you're supposed to do with llama2)

185

18

(Deleted for not relevant anymore) (piped.video)

submitted 2 years ago* (last edited 3 months ago) by [email protected] to c/localllama

36 comments fedilink

(Deleted for not relevant anymore)

186

16

Small guide to run Llama.cpp on windows with discrete AMD GPU (lemm.ee)

submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama

5 comments fedilink

Hi!

I have an ASUS AMD Advantage Edition laptop (https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/) that runs Windows. I still haven't found time to install Linux and set it up the way I like, even after more than a year.

I'm just dropping a small write-up for the set-up that I'm using with llama.cpp to run on the discrete GPU using CLBlast.

You can use Kobold, but it's meant more for role-playing stuff and I wasn't really interested in that. Funny thing is, Kobold can be set up to use the discrete GPU if needed.

For starters you'd need llama.cpp itself from here: https://github.com/ggerganov/llama.cpp/tags.

Pick the clblast version, which will help offload some computation over to the GPU. Unzip the download to a directory. I unzipped it to a folder called this: "D:\Apps\llama"
You'll need an LLM now, which can be obtained from HuggingFace or wherever you'd like. Just note that it should be in ggml format. If in doubt, note that the ggml models from HuggingFace have "ggml" written somewhere in the filename. The ones I downloaded were "nous-hermes-llama2-13b.ggmlv3.q4_1.bin" and "Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin"
Move the models to the llama directory you made above. That makes life much easier.
You don't really need to navigate to the directory using Explorer. Just open Powershell where-ever and you can also do cd D:\Apps\llama\
Here comes the fiddly part. You need to get the device ids for the GPU. An easy way to check this is to use "GPU caps viewer", go to the tab titled OpenCl and check the dropdown next to "No. of CL devices".

The discrete GPU is normally listed second, after the integrated GPU. In my case the integrated GPU was gfx90c and the discrete one was gfx1031c.
In the powershell window, you need to set the relevant variables that tell llama.cpp what opencl platform and devices to use. If you're using AMD driver package, opencl is already installed, so you needn't uninstall or reinstall drivers and stuff.

$env:GGML_OPENCL_PLATFORM = "AMD"

$env:GGML_OPENCL_DEVICE = "1"
Check if the variables are exported properly

Get-ChildItem env:GGML_OPENCL_PLATFORM
Get-ChildItem env:GGML_OPENCL_DEVICE

This should return the following:

Name                 Value
GGML_OPENCL_PLATFORM AMD
GGML_OPENCL_DEVICE   1

If GGML_OPENCL_PLATFORM doesn't show AMD, try exporting this: $env:GGML_OPENCL_PLATFORM = "AMD"
Once these are set properly, run llama.cpp using the following:

D:\Apps\llama\main.exe -m D:\Apps\llama\Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin -ngl 33 -i --threads 8 --interactive-first -r "### Human:"

OR

replace Wizard with nous-hermes-llama2-13b.ggmlv3.q4_1.bin or whatever llm you'd like. I like to play with 7B, 13B with 4_0 or 5_0 quantized llms. You might need to trawl through the fora here to find parameters for temperature, etc that work for you.
Checking if these work, I've posted the content at pastebin since formatting these was a paaaain: https://pastebin.com/peSFyF6H
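If you'd rather script the steps above than type them into PowerShell each time, something like the following might work. Treat it as a rough sketch: the function names are my own, the paths, flags and environment variable values are copied from the post, and I haven't verified it on other driver or llama.cpp versions.

```python
import os
import subprocess


def build_launch(model_path: str, device: str = "1", platform: str = "AMD"):
    """Return (argv, env) mirroring the PowerShell steps from the post."""
    env = dict(os.environ)
    env["GGML_OPENCL_PLATFORM"] = platform  # OpenCL platform to select
    env["GGML_OPENCL_DEVICE"] = device      # index of the discrete GPU
    argv = [
        r"D:\Apps\llama\main.exe",
        "-m", model_path,
        "-ngl", "33",
        "-i",
        "--threads", "8",
        "--interactive-first",
        "-r", "### Human:",
    ]
    return argv, env


def launch(model_path: str) -> None:
    """Start llama.cpp with the environment prepared above."""
    argv, env = build_launch(model_path)
    subprocess.run(argv, env=env, check=True)


argv, env = build_launch(r"D:\Apps\llama\Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin")
print(" ".join(argv))
```

Calling launch(...) with a model path starts the interactive session; build_launch on its own only prepares the command, so it is safe to run anywhere.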

salient features @ gfx1031c (6800M discrete graphics):
llama_print_timings: load time = 60188.90 ms
llama_print_timings: sample time = 3.58 ms / 103 runs ( 0.03 ms per token, 28770.95 tokens per second)
llama_print_timings: prompt eval time = 7133.18 ms / 43 tokens ( 165.89 ms per token, 6.03 tokens per second)
llama_print_timings: eval time = 13003.63 ms / 102 runs ( 127.49 ms per token, 7.84 tokens per second)
llama_print_timings: total time = 622870.10 ms

salient features @ gfx90c (cezanne architecture integrated graphics):
llama_print_timings: load time = 26205.90 ms
llama_print_timings: sample time = 6.34 ms / 103 runs ( 0.06 ms per token, 16235.81 tokens per second)
llama_print_timings: prompt eval time = 29234.08 ms / 43 tokens ( 679.86 ms per token, 1.47 tokens per second)
llama_print_timings: eval time = 118847.32 ms / 102 runs ( 1165.17 ms per token, 0.86 tokens per second)
llama_print_timings: total time = 159929.10 ms
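To compare the two runs, the generation speed can be recomputed from the "eval time" lines. This is only a small sketch: the regular expression is my own guess at a pattern that fits the log lines shown above, not an official llama.cpp log parser.

```python
import re

# Matches e.g. "eval time = 13003.63 ms / 102 runs" (also "... / 43 tokens").
TIMING = re.compile(r"eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:runs|tokens)")


def tokens_per_second(line: str) -> float:
    """Recompute tokens per second from one llama_print_timings line."""
    match = TIMING.search(line)
    if match is None:
        raise ValueError(f"not a timing line: {line!r}")
    milliseconds, count = float(match.group(1)), int(match.group(2))
    return count / (milliseconds / 1000.0)


discrete = "llama_print_timings: eval time = 13003.63 ms / 102 runs ( 127.49 ms per token, 7.84 tokens per second)"
integrated = "llama_print_timings: eval time = 118847.32 ms / 102 runs ( 1165.17 ms per token, 0.86 tokens per second)"

speedup = tokens_per_second(discrete) / tokens_per_second(integrated)
print(f"discrete GPU generates roughly {speedup:.1f}x faster")
```

On the numbers above, the discrete GPU comes out at roughly nine times the generation speed of the integrated one.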

Edit: added pastebin since I actually forgot to link it. https://pastebin.com/peSFyF6H

187

10

What are the best models you use? (lemmy.world)

submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama

8 comments fedilink

Leaderboard scores often can be a bit misleading since there are other factors to consider.

Censorship: Is the model censored?
Verbosity: How concise is the output?
Intelligence: Does the model know what it is talking about?
Hallucination: How much does the model make up facts?
Domain Knowledge: What specialization does the model have?
Size: Best models for 70b, 30b, 7b respectively.

And much more! What models do you use and would recommend to everyone?

The model that has caught my attention the most personally is the original 65b Llama. It seems genuine and truly has a personality. Everyone should chat with the original non-fine tuned version if they can get a chance. It's an experience that is quite unique within the sea of "As an AI language model" openai tunes.

188

8

Large language models, explained with a minimum of math and jargon (www.understandingai.org)

submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama

-1 comments fedilink

cross posted from : https://sh.itjust.works/post/1851078

thank you [email protected]

189

25

Llama 2 thinks it unethical to have books about fictional characters (lemmy.world)

submitted 2 years ago by [email protected] to c/localllama

5 comments fedilink

190

7

Meta’s Open Source Llama Upsets the AI Horse Race (www.wired.com)

submitted 2 years ago by [email protected] to c/localllama

4 comments fedilink

191

9

Huggingface Text Generation Inference adds exllama support (github.com)

submitted 2 years ago by noneabove1182 to c/localllama

0 comments fedilink

This is actually a pretty big deal: exllama is by far the most performant inference engine out there for CUDA. The strangest thing, though, is that the PR claims it works for starcoder, which is a non-llama model:

https://github.com/huggingface/text-generation-inference/pull/553

So I'm extremely curious to see what this brings...

192

5

A nice write up for LMQL (analyticsindiamag.com)

submitted 2 years ago by noneabove1182 to c/localllama

1 comments fedilink

For the uninitiated, LMQL is one of a few offerings in the AI space (amongst others like guidance and guardrails) that allow for finer control of the AI's output and let you guarantee some patterns, which is hugely helpful for a whole variety of use cases (tool use, safe chatbots, parseable data)

193

19

What is better: higher quantization or higher parameter count? (yiffit.net)

submitted 2 years ago by [email protected] to c/localllama

16 comments fedilink

For example, does a 13B parameter model at 2_K quantization perform worse than a 7B parameter model at 8bit or 16bit?
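This doesn't answer the quality question, but the memory side of the trade-off is easy to estimate. The sketch below assumes roughly 2.5 bits per weight for a 2-bit k-quant, which is only a ballpark figure (real files mix bit widths and are often somewhat larger), and it counts the weights alone, ignoring context and runtime overhead.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


candidates = {
    "13B @ ~2.5 bits/weight (assumed)": approx_weight_gb(13, 2.5),
    "7B @ 8 bits/weight": approx_weight_gb(7, 8),
    "7B @ 16 bits/weight": approx_weight_gb(7, 16),
}

for name, gigabytes in candidates.items():
    print(f"{name}: ~{gigabytes:.1f} GB")
```

Under these assumptions the heavily quantized 13B model is the smallest of the three, so the real question is whether it stays more capable than the 7B model after losing that much precision.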

194

3

What determines the format of the prompt template? (lemmy.world)

submitted 2 years ago by [email protected] to c/localllama

5 comments fedilink

I'm trying to learn more about LLMs, but I haven't found any explanation for what determines which prompt template format a model requires.

For example meta-llama's llama-2 requires this format:

...[INST] and <<SYS>> tags, BOS and EOS tokens...

But if I instead download TheBloke's version of llama-2, the prompt template should instead be:

SYSTEM: ...

USER: {prompt}

ASSISTANT:

I thought this would have been determined by how the original training data was formatted, but afaik TheBloke only converted the llama-2 models from one format to another. Looking at the documentation for the GGML format I don't see anything related to the prompt being embedded in the model file.

Anyone who understands this stuff who could point me in the right direction?
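To make the difference concrete, here is a small sketch that builds both layouts as plain strings. The Llama 2 chat layout follows my understanding of Meta's published chat format (BOS/EOS tokens are left out because the tokenizer normally adds them), and the exact whitespace in both functions is an assumption. Either way, the template is just text that the model saw during fine-tuning, not something stored in the model file.

```python
def llama2_chat_prompt(system: str, user: str) -> str:
    """Single-turn prompt in the Llama 2 chat layout (assumed whitespace)."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def system_user_assistant_prompt(system: str, user: str) -> str:
    """The SYSTEM/USER/ASSISTANT layout quoted in the post."""
    return f"SYSTEM: {system}\nUSER: {user}\nASSISTANT:"


system = "You are a helpful assistant."
user = "What is a prompt template?"

print(llama2_chat_prompt(system, user))
print("---")
print(system_user_assistant_prompt(system, user))
```

Since both are ordinary strings, it's easy to try each on the same model and compare the answers.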

195

8

Meta’s Llama 2 Elbows Into a Still Very Open Field (spectrum.ieee.org)

submitted 2 years ago by [email protected] to c/localllama

5 comments fedilink

196

11

(Deleted for not relevant anymore) (lemmy.world)

submitted 2 years ago* (last edited 3 months ago) by [email protected] to c/localllama

2 comments fedilink

(Deleted for not relevant anymore)

197

29

llama2.c: Inference Llama 2 in one file of pure C by Andrej Karpathy (github.com)

submitted 2 years ago by noneabove1182 to c/localllama

0 comments fedilink

Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!

With this code you can train the Llama 2 LLM architecture from scratch in PyTorch, then save the weights to a raw binary file, then load that into one ~simple 500-line C file (run.c) that inferences the model, simply in fp32 for now. On my cloud Linux devbox a dim 288 6-layer 6-head model (~15M params) inferences at ~100 tok/s in fp32, and about the same on my M1 MacBook Air. I was somewhat pleasantly surprised that one can run reasonably sized models (few ten million params) at highly interactive rates with an approach this simple.
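As a sanity check on the "~15M params" figure, here is a rough parameter count for a Llama-style model. Only dim 288 and 6 layers come from the quote above; the feed-forward width of 768, the vocabulary size of 32000, and the shared embedding/classifier weights are my assumptions about that small model, so treat the result as an estimate.

```python
def llama_param_count(dim: int, n_layers: int, hidden_dim: int,
                      vocab_size: int, shared_classifier: bool = True) -> int:
    """Rough parameter count for a Llama-style transformer."""
    embedding = vocab_size * dim
    attention = 4 * dim * dim            # wq, wk, wv, wo (no grouped-query attention)
    feed_forward = 3 * dim * hidden_dim  # w1, w2, w3 of the gated feed-forward block
    norms = 2 * dim                      # two RMSNorm weight vectors per layer
    per_layer = attention + feed_forward + norms
    classifier = 0 if shared_classifier else vocab_size * dim
    final_norm = dim
    return embedding + n_layers * per_layer + final_norm + classifier


# hidden_dim and vocab_size below are assumptions, not values from the post.
count = llama_param_count(dim=288, n_layers=6, hidden_dim=768, vocab_size=32000)
print(f"{count:,} parameters, about {count * 4 / 1e6:.1f} MB in fp32")
```

The head count doesn't appear because it doesn't change the size of the attention matrices when every head has its own keys and values.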

https://twitter.com/karpathy/status/1683143097604243456

198

23

Leaked GPT-4 Architecture: Demystifying Its Impact & The 'Mixture of Experts' Explained (with code) (www.youtube.com)

submitted 2 years ago by noneabove1182 to c/localllama

0 comments fedilink

Written form : https://github.com/clint-kristopher-morris/Tutorials/blob/main/mixture-of-experts/MoE-Paper-Review.ipynb

199

26

Dolphin (based on Llama 1) released by Eric Hartford! (huggingface.co)

submitted 2 years ago by noneabove1182 to c/localllama

0 comments fedilink

This model is based on llama1, so it is for non-commercial use only. Future versions will be trained on llama2 and other open models that are suitable for commercial use.

This model is uncensored. I have filtered the dataset to remove alignment and bias. This makes the model compliant to any requests. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant to any requests, even unethical ones. Please read my blog post about uncensored models. https://erichartford.com/uncensored-models You are responsible for any content you create using this model. Enjoy responsibly.

Quants can of course be found from TheBloke:

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GGML

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GPTQ

200

9

chargoddard's frankensteined 22B llama2 (huggingface.co)

submitted 2 years ago by noneabove1182 to c/localllama

2 comments fedilink

This is Llama 2 13b with some additional attention heads from original-flavor Llama 33b frankensteined on.

Fine-tuned on ~10M tokens from RedPajama to settle in the transplants a little.

Not intended for use as-is - this model is meant to serve as a base for further tuning, hopefully with a greater capacity for learning than 13b.