LocalLLaMA


A community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

founded 2 years ago
176
14
submitted 2 years ago* (last edited 2 years ago) by AsAnAILanguageModel to c/localllama
 
 

I think it's a good idea to share experiences about LLMs here, since benchmarks can only give a very rough overview of how well a model performs.

So please share how much you're using LLMs, what you use them for, and how well they perform at those tasks. For example, here are my answers to these questions:

Usage

I use LLMs daily for work and for random questions that I would previously use web search for.

I mainly use LLMs for reasoning-heavy tasks, such as assisting with math or programming. Other frequent tasks include proofreading, helping with bureaucracy, or assisting with writing when it matters.

Models

The one I find most impressive at the moment is TheBloke/airoboros-l2-70B-gpt4-1.4.1-GGML/airoboros-l2-70b-gpt4-1.4.1.ggmlv3.q2_K.bin. It often manages to reason correctly on questions where most other models I tried fail, even though most humans wouldn't. I was surprised that something using only 2.5 bits per weight on average could produce anything but garbage. The downside is that loading times are rather long, so I wouldn't ask it a question if I didn't want to wait (time to first token is almost 50 s!). I'd love to hear how bigger quantizations or the unquantized versions perform.

Another one that made a good impression on me is Qwen-7B-Chat (demo). It manages to correctly answer some questions where even some llama2-70b finetunes fail, ~~but so far I'm getting memory leaks when running it on my M1 mac in fp16 mode, so I didn't use it a lot.~~ (this has been fixed it seems!)

All other models I briefly tried were not too useful. It's nice to be able to run them locally, but they were so much worse than ChatGPT that it's often not even worth considering them.

177
 
 

I want to train, or more likely fine-tune, a model on about 20 years' worth of email and text data that I've collected.

The goal would be to train it to respond like me in simple cases.

Is there a particular base model I should start with?

I'm also interested in anyone's experience in doing this kind of thing themselves.

178
 
 

You are probably familiar with the long list of various benchmarks that new models are tested on and compared against. These benchmarks are supposedly designed to assess the model's ability to perform in various aspects of language understanding, logical reasoning, information recall, and so on.

However, while I understand the need for an objective and scientific measurement scale, I have long felt that these benchmarks are not particularly representative of the actual experience of using the models. For example, people will claim that a model performs at "some percentage of GPT-3" and yet not one of these models has ever been able to produce correctly-functioning code for any non-trivial task or follow a line of argument/reasoning. Talking to GPT-3 I have felt that the model has an actual in-depth understanding of the text, question, or argument, whereas other models that I have tried always feel as though they have only a superficial/surface-level understanding regardless of what the benchmarks claim.

My most recent frustration, and the one that prompted this post, is regarding the newly-released OpenOrca preview 2 model. The benchmark numbers claim that it performs better than other 13B models at the time of writing, supposedly outperforms Microsoft's own published benchmark results for their yet-unreleased model, and scores an "average" result of 74.0% against GPT-3's 75.7% while the LLaMa model that I was using previously apparently scores merely 63%.

I've used GPT-3 (text-davinci-003), and this model does not "come within comparison" of it. Even giving it as much of a fair chance as I can, giving it plenty of leeway and benefit of the doubt, not only can it still not write correct code (or even valid code in a lot of cases) but it is significantly worse at it than LLaMa 13B (which is also pretty bad). This model does not understand basic reasoning and fails at basic reasoning tasks. It will write a long step-by-step explanation of what it claims that it will do, but the answer itself contradicts the provided steps or the steps themselves are wrong/illogical. The model has only learnt to produce "step by step reasoning" as an output format, and has a worse understanding of what that actually means than any other model does when asked to "explain your reasoning" (at least, for other models that I have tried, asking them to explain their reasoning produces at least a marginal improvement in coherence).

There is something wrong with these benchmarks. They do not relate to real-world performance. They do not appear to be measuring a model's ability to actually understand the prompt/task, but possibly only measuring its ability to provide an output that "looks correct" according to some format. These benchmarks are not a reliable way to compare model performance and as long as we keep using them we will keep producing models that score higher on benchmarks and claim to perform "almost as good as GPT-3" but yet fail spectacularly in any task/prompt that I can think of to throw at them.

(I keep using coding as an example; however, I have also tried other tasks besides code, as I realise that code is possibly a particularly challenging task due to requirements like needing exact syntax. My interpretation of the various models' level of understanding is based on experience across a variety of tasks.)

179
10
submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama
 
 

As some may know, I maintain Docker images for a few of the available tools. I noticed I was suddenly getting an NVML mismatch error and, for the life of me, could not figure out what the issue was. After trying many things, I finally noticed that the Docker image had some special driver, 535.86.10, while my host had 535.86.05. After figuring that out, I looked into it and added this to my Dockerfile:

    RUN apt-get update && apt-get remove --purge -y nvidia-* && \
        apt-get install -y --allow-downgrades nvidia-driver-535/jammy-updates

And voila, problem solved! I'm not sure what driver the Docker CUDA image was using. It might be some special dev driver, and it was causing a mismatch between the container and the host.

Only started happening as of the latest driver update released late last month
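The mismatch described above comes down to two driver version strings that differ only in the last component. Here is a minimal Python sketch of that comparison, assuming both versions have already been read as strings (for example from `nvidia-smi` on the host and inside the container). The function names are hypothetical, and nothing here queries a real driver:

```python
def parse_driver_version(version: str) -> tuple[int, ...]:
    """Split a dotted driver version such as '535.86.10' into integers."""
    return tuple(int(part) for part in version.strip().split("."))


def versions_match(host: str, container: str) -> bool:
    """Return True only when host and container report the same driver version."""
    return parse_driver_version(host) == parse_driver_version(container)


# Versions taken from the post: the container had 535.86.10, the host 535.86.05.
host_version = "535.86.05"
container_version = "535.86.10"
mismatch = not versions_match(host_version, container_version)
print(f"host={host_version} container={container_version} mismatch={mismatch}")
```

A check like this could run at container start so that a mismatch is reported up front instead of surfacing later as an NVML error.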

180
9
submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama
 
 

I wanted to make this post so we can share all the resources we have with each other on anything machine learning related.

Please feel free to add all of your resources as well even if they are duplicates.

PS: The best way to grow our Lemmy community is to produce high-quality posts.

Some ideas of things you could share:

  • What people do you follow for AI? Such as on YT, Twitter, etc.
  • What other social media forums provide great information?
  • What GUI do you use for local LLMs?
  • What parameters are "best"?
  • Is there a Wiki you use?
  • Where do you go to learn about LLMs/AI/Machine Learning?
  • How do you find quality models?
  • What Awesome GitHub repositories do you know?
  • What do you think would be useful to share?

General Information - Awesome

LLM Leaderboards:

Places to Find Models

Training & Datasets

There are still many more resources out there I'm sure. Please share what you use to try to keep up with the fast pace of AI development.

I hope some of my resources have helped you! I'm eager to hear what other resources are out there!

181
 
 

Just wondering if anyone has any suggestions to keep things moving and growing. I was thinking of doing a daily quantized-models post just to keep up with TheBloke. Thoughts?

182
 
 

This one is based on Llama 2. The first one worked very well for rule and structure following with guidance, so I'm highly intrigued to see if this one lives up to its predecessor.

183
 
 

Click Here to be Taken to the Megathread!

from [email protected]

Vicuna v1.5 Has Been Released!

Shoutout to [email protected] for catching this in an earlier post.

Given Vicuna was a widely appreciated member of the original Llama series, it'll be exciting to see this model evolve and adapt with fresh datasets and new training and fine-tuning approaches.

Feel free to use this megathread to chat about Vicuna and any of your experiences with Vicuna v1.5!

Starting off with Vicuna v1.5

TheBloke is already sharing models!

Vicuna v1.5 GPTQ

7B

13B


Vicuna Model Card

Model Details

Vicuna is a chat assistant fine-tuned from Llama 2 on user-shared conversations collected from ShareGPT.

Developed by: LMSYS

  • Model type: An auto-regressive language model based on the transformer architecture
  • License: Llama 2 Community License Agreement
  • Finetuned from model: Llama 2

Model Sources

Uses

The primary use of Vicuna is for research on large language models and chatbots. The target userbase includes researchers and hobbyists interested in natural language processing, machine learning, and artificial intelligence.

How to Get Started with the Model

Training Details

Vicuna v1.5 is fine-tuned from Llama 2 using supervised instruction fine-tuning. The model was trained on approximately 125K conversations from ShareGPT.com.

For additional details, please refer to the "Training Details of Vicuna Models" section in the appendix of the linked paper.

Evaluation Results

Vicuna Evaluation Results

Vicuna is evaluated using standard benchmarks, human preferences, and LLM-as-a-judge. For more detailed results, please refer to the paper and leaderboard.

184
 
 

I've been using airoboros-l2-70b for writing fiction, and while overall I'd describe the results as excellent and better than any llama1 model I've used, it doesn't seem to be living up to the promise of 4k token sequence length.

At around 2500 tokens, output quality degrades rapidly: the model either starts repeating previous text verbatim or becomes incoherent (grammar, punctuation, and capitalization disappear, and the output becomes a salad of vaguely related words).

Any other experiences with llama2 and long context? Does the base model work better? Are other fine tunes behaving similarly? I'll try myself eventually, but the 70b models are chunky downloads, and experimentation takes a while at 1 t/s.

(I'm using GGML Q4_K_M on kobold.cpp, with rope scaling off like you're supposed to do with llama2)

185
18
submitted 2 years ago* (last edited 3 months ago) by [email protected] to c/localllama
 
 

(Deleted because it is no longer relevant)

186
 
 

Hi!

I have an ASUS AMD Advantage Edition laptop (https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/) that runs Windows. I still haven't found the time to install Linux and set it up the way I like, even after more than a year.

I'm just dropping a small write-up of the setup that I'm using with llama.cpp to run on the discrete GPU using CLBlast.

You can use Kobold, but it is meant more for role-playing stuff and I wasn't really interested in that. Funny thing is, Kobold can also be set up to use the discrete GPU if needed.

  1. For starters you'd need llama.cpp itself from here: https://github.com/ggerganov/llama.cpp/tags.

    Pick the clblast version, which will help offload some computation over to the GPU. Unzip the download to a directory. I unzipped it to a folder called this: "D:\Apps\llama"

  2. You'd need an LLM now, and that can be obtained from HuggingFace or wherever you'd like. Just note that it should be in ggml format. If in doubt, the models from HuggingFace have "ggml" written somewhere in the filename. The ones I downloaded were "nous-hermes-llama2-13b.ggmlv3.q4_1.bin" and "Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin"

  3. Move the models to the llama directory you made above. That makes life much easier.

  4. You don't really need to navigate to the directory using Explorer. Just open PowerShell wherever and run cd D:\Apps\llama\

  5. Here comes the fiddly part. You need to get the device IDs for the GPUs. An easy way to check this is to use "GPU Caps Viewer": go to the tab titled OpenCL and check the dropdown next to "No. of CL devices".

    The discrete GPU is normally listed second, after the integrated GPU. In my case the integrated GPU was gfx90c and the discrete one was gfx1031c.

  6. In the PowerShell window, you need to set the relevant variables that tell llama.cpp which OpenCL platform and device to use. If you're using the AMD driver package, OpenCL is already installed, so you needn't uninstall or reinstall drivers and stuff.

    $env:GGML_OPENCL_PLATFORM = "AMD"

    $env:GGML_OPENCL_DEVICE = "1"

  7. Check if the variables are exported properly

    Get-ChildItem env:GGML_OPENCL_PLATFORM
    Get-ChildItem env:GGML_OPENCL_DEVICE

    This should return the following:

    Name                           Value
    GGML_OPENCL_PLATFORM           AMD
    GGML_OPENCL_DEVICE             1

    If GGML_OPENCL_PLATFORM doesn't show AMD, try exporting this: $env:GGML_OPENCL_PLATFORM = "AMD"

  8. Once these are set properly, run llama.cpp using the following:

    D:\Apps\llama\main.exe -m D:\Apps\llama\Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin -ngl 33 -i --threads 8 --interactive-first -r "### Human:"

    Alternatively, replace the Wizard-Vicuna file with nous-hermes-llama2-13b.ggmlv3.q4_1.bin or whatever LLM you'd like. I like to play with 7B and 13B LLMs quantized at 4_0 or 5_0. You might need to trawl through the fora here to find parameters (temperature, etc.) that work for you.

  9. Checking if these work, I've posted the content at pastebin since formatting these was a paaaain: https://pastebin.com/peSFyF6H

    salient features @ gfx1031c (6800M discrete graphics):
    llama_print_timings: load time = 60188.90 ms
    llama_print_timings: sample time = 3.58 ms / 103 runs ( 0.03 ms per token, 28770.95 tokens per second)
    llama_print_timings: prompt eval time = 7133.18 ms / 43 tokens ( 165.89 ms per token, 6.03 tokens per second)
    llama_print_timings: eval time = 13003.63 ms / 102 runs ( 127.49 ms per token, 7.84 tokens per second)
    llama_print_timings: total time = 622870.10 ms

    salient features @ gfx90c (cezanne architecture integrated graphics):
    llama_print_timings: load time = 26205.90 ms
    llama_print_timings: sample time = 6.34 ms / 103 runs ( 0.06 ms per token, 16235.81 tokens per second)
    llama_print_timings: prompt eval time = 29234.08 ms / 43 tokens ( 679.86 ms per token, 1.47 tokens per second)
    llama_print_timings: eval time = 118847.32 ms / 102 runs ( 1165.17 ms per token, 0.86 tokens per second)
    llama_print_timings: total time = 159929.10 ms
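To compare the two runs without reading the numbers by eye, the eval-time lines can be parsed directly. This is a small Python sketch, assuming the `llama_print_timings` line layout shown above; the regular expression and function name are illustrative and not part of llama.cpp:

```python
import re

# Eval-time lines copied from the timings above.
DISCRETE = ("llama_print_timings: eval time = 13003.63 ms / 102 runs "
            "( 127.49 ms per token, 7.84 tokens per second)")
INTEGRATED = ("llama_print_timings: eval time = 118847.32 ms / 102 runs "
              "( 1165.17 ms per token, 0.86 tokens per second)")

# Captures the total milliseconds and the number of runs/tokens.
PATTERN = re.compile(r"eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:runs|tokens)")


def tokens_per_second(line: str) -> float:
    """Recompute throughput from the total time and token count on one line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not an eval-time line: {line!r}")
    total_ms = float(match.group(1))
    count = int(match.group(2))
    return count / (total_ms / 1000.0)


discrete_tps = tokens_per_second(DISCRETE)
integrated_tps = tokens_per_second(INTEGRATED)
speedup = discrete_tps / integrated_tps
print(f"discrete: {discrete_tps:.2f} tok/s, integrated: {integrated_tps:.2f} tok/s")
print(f"speedup: {speedup:.1f}x")
```

On these figures the discrete GPU generates tokens roughly nine times faster than the integrated one.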

Edit: added pastebin since I actually forgot to link it. https://pastebin.com/peSFyF6H

187
10
submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama
 
 

Leaderboard scores often can be a bit misleading since there are other factors to consider.

  • Censorship: Is the model censored?
  • Verbosity: How concise is the output?
  • Intelligence: Does the model know what it is talking about?
  • Hallucination: How much does the model make up facts?
  • Domain Knowledge: What specialization does the model have?
  • Size: Best models for 70b, 30b, 7b respectively.

And much more! What models do you use and would recommend to everyone?

The model that has caught my attention the most personally is the original 65b Llama. It seems genuine and truly has a personality. Everyone should chat with the original non-fine-tuned version if they get a chance. It's an experience that is quite unique within the sea of "As an AI language model" OpenAI tunes.

188
8
submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama
 
 
191
 
 

This is actually a pretty big deal: exllama is by far the most performant inference engine out there for CUDA. The strangest thing is that the PR claims it works for starcoder, which is a non-llama model:

https://github.com/huggingface/text-generation-inference/pull/553

So I'm extremely curious to see what this brings...

192
5
A nice write up for LMQL (analyticsindiamag.com)
submitted 2 years ago by noneabove1182 to c/localllama
 
 

For the uninitiated, LMQL is one of a few offerings in the AI space (amongst others like guidance and guardrails) that allow finer control of the AI's output and let you guarantee some patterns, which is hugely helpful for a whole variety of use cases (tool use, safe chatbots, parseable data).

193
 
 

For example, does a 13B parameter model at q2_K quantization perform worse than a 7B parameter model at 8-bit or 16-bit?
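Part of what makes this comparison interesting is the memory footprint of each option. The Python sketch below estimates the size of the weights alone. It assumes roughly 2.5 bits per weight for q2_K (the average quoted in another post on this page) and ignores context buffers and file overhead. It says nothing about output quality, which is the actual question:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# (parameters in billions, assumed average bits per weight)
configs = {
    "13B at ~2.5 bits": (13, 2.5),
    "7B at 8 bits": (7, 8),
    "7B at 16 bits": (7, 16),
}

sizes = {name: weight_size_gb(p, b) for name, (p, b) in configs.items()}
for name, size in sizes.items():
    print(f"{name}: {size:.1f} GB")
```

Under these assumptions the heavily quantized 13B model is the smallest of the three, so the question is really whether the extra parameters outweigh the precision lost.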

194
 
 

I'm trying to learn more about LLMs, but I haven't found any explanation for what determines which prompt template format a model requires.

For example meta-llama's llama-2 requires this format:

...INST and <> tags, BOS and EOS tokens...

But if I instead download TheBloke's version of llama-2, the prompt template should instead be:

SYSTEM: ...

USER: {prompt}

ASSISTANT:

I thought this would have been determined by how the original training data was formatted, but afaik TheBloke only converted the llama-2 models from one format to another. Looking at the documentation for the GGML format, I don't see anything related to the prompt being embedded in the model file.
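In practice a prompt template is plain string formatting, applied by whatever front end builds the prompt before tokenization. The Python sketch below renders the same message in both styles. The Llama-2-chat layout is written from memory of Meta's published chat format and should be treated as an assumption, the BOS/EOS tokens are left to the tokenizer, and both function names are hypothetical:

```python
def llama2_chat_prompt(system: str, user: str) -> str:
    """Llama-2-chat style: an [INST] wrapper containing a <<SYS>> block (assumed layout)."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def system_user_assistant_prompt(system: str, user: str) -> str:
    """SYSTEM/USER/ASSISTANT style, as in the second template quoted above."""
    return f"SYSTEM: {system}\nUSER: {user}\nASSISTANT:"


system = "You are a helpful assistant."
user = "What is a prompt template?"

chat_prompt = llama2_chat_prompt(system, user)
plain_prompt = system_user_assistant_prompt(system, user)
print(chat_prompt)
print("---")
print(plain_prompt)
```

Both functions produce ordinary text; which one works better depends on the format the model saw during fine-tuning.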

Anyone who understands this stuff who could point me in the right direction?

196
11
submitted 2 years ago* (last edited 3 months ago) by [email protected] to c/localllama
 
 

(Deleted because it is no longer relevant)

197
 
 

Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!

With this code you can train the Llama 2 LLM architecture from scratch in PyTorch, then save the weights to a raw binary file, then load that into one ~simple 500-line C file (run.c) that inferences the model, simply in fp32 for now. On my cloud Linux devbox a dim 288 6-layer 6-head model (~15M params) inferences at ~100 tok/s in fp32, and about the same on my M1 MacBook Air. I was somewhat pleasantly surprised that one can run reasonably sized models (few ten million params) at highly interactive rates with an approach this simple.

https://twitter.com/karpathy/status/1683143097604243456
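The "~15M params" figure can be sanity-checked with a rough parameter count for a Llama-style transformer. In the Python sketch below, dim 288 and 6 layers come from the quote above, while the feed-forward hidden size (768), the vocabulary size (32000), and the shared embedding/output matrix are assumptions not stated in the post:

```python
def llama_param_count(dim: int, n_layers: int, hidden_dim: int,
                      vocab_size: int, shared_classifier: bool = True) -> int:
    """Rough parameter count for a Llama-style decoder (no biases)."""
    embedding = vocab_size * dim
    attention = 4 * dim * dim            # wq, wk, wv, wo (assumes n_kv_heads == n_heads)
    feed_forward = 3 * dim * hidden_dim  # gated feed-forward: w1, w2, w3
    norms = 2 * dim                      # two RMSNorm weight vectors per layer
    per_layer = attention + feed_forward + norms
    final_norm = dim
    classifier = 0 if shared_classifier else vocab_size * dim
    return embedding + n_layers * per_layer + final_norm + classifier


# dim and n_layers from the post; hidden_dim and vocab_size are assumed values.
total = llama_param_count(dim=288, n_layers=6, hidden_dim=768, vocab_size=32000)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

With these assumed values the count comes to a little over 15 million, in line with the figure quoted, and most of it sits in the token embedding table.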

199
 
 

This model is based on llama1, so it is for non-commercial use only. Future versions will be trained on llama2 and other open models that are suitable for commercial use.

This model is uncensored. I have filtered the dataset to remove alignment and bias. This makes the model compliant to any requests. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant to any requests, even unethical ones. Please read my blog post about uncensored models. https://erichartford.com/uncensored-models You are responsible for any content you create using this model. Enjoy responsibly.

Quants can of course be found from TheBloke:

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GGML

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GPTQ

200
 
 

This is Llama 2 13b with some additional attention heads from original-flavor Llama 33b frankensteined on.

Fine-tuned on ~10M tokens from RedPajama to settle in the transplants a little.

Not intended for use as-is - this model is meant to serve as a base for further tuning, hopefully with a greater capacity for learning than 13b.
