LocalLLaMA

2292 readers
16 users here now

Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

founded 2 years ago
151
 
 

I want to train, or more likely fine-tune, a model on about 20 years worth of email and text data that I've collected.

The goal would be to train it how to respond like me in simple cases.

Is there a particular base model I should start with?

I'm also interested in anyone's experience in doing this kind of thing themselves.
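
For reference, the rough shape of what I've been imagining is a LoRA fine-tune (with Hugging Face peft) on pairs of "message I received" / "reply I sent". The sketch below is just to show the shape of it, not something I've run: the base model name, the emails.jsonl file and its formatting, and the hyperparameters are all placeholder assumptions.

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "meta-llama/Llama-2-7b-hf"   # placeholder; any chat-friendly base model
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token

    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # attention projections, typical for LLaMA-style models
        task_type="CAUSAL_LM",
    ))

    # emails.jsonl is hypothetical: one {"text": "<their message>\n\n### My reply:\n<my reply>"} per line
    data = load_dataset("json", data_files="emails.jsonl")["train"]
    data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024), batched=True)

    Trainer(
        model=model,
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
        args=TrainingArguments(output_dir="email-lora", per_device_train_batch_size=1,
                               gradient_accumulation_steps=8, num_train_epochs=1,
                               learning_rate=2e-4, logging_steps=10),
    ).train()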

152
 
 

You are probably familiar with the long list of various benchmarks that new models are tested on and compared against. These benchmarks are supposedly designed to assess the model's ability to perform in various aspects of language understanding, logical reasoning, information recall, and so on.

However, while I understand the need for an objective and scientific measurement scale, I have long felt that these benchmarks are not particularly representative of the actual experience of using the models. For example, people will claim that a model performs at "some percentage of GPT-3" and yet not one of these models has ever been able to produce correctly-functioning code for any non-trivial task or follow a line of argument/reasoning. Talking to GPT-3 I have felt that the model has an actual in-depth understanding of the text, question, or argument, whereas other models that I have tried always feel as though they have only a superficial/surface-level understanding regardless of what the benchmarks claim.

My most recent frustration, and the one that prompted this post, is regarding the newly-released OpenOrca preview 2 model. The benchmark numbers claim that it performs better than other 13B models at the time of writing, supposedly outperforms Microsoft's own published benchmark results for their yet-unreleased model, and scores an "average" result of 74.0% against GPT-3's 75.7% while the LLaMa model that I was using previously apparently scores merely 63%.

I've used GPT-3 (text-davinci-003), and this model does not "come within comparison" of it. Even giving it as much of a fair chance as I can, giving it plenty of leeway and benefit of the doubt, not only can it still not write correct code (or even valid code in a lot of cases) but it is significantly worse at it than LLaMa 13B (which is also pretty bad). This model does not understand basic reasoning and fails at basic reasoning tasks. It will write a long step-by-step explanation of what it claims that it will do, but the answer itself contradicts the provided steps or the steps themselves are wrong/illogical. The model has only learnt to produce "step by step reasoning" as an output format, and has a worse understanding of what that actually means than any other model does when asked to "explain your reasoning" (at least, for other models that I have tried, asking them to explain their reasoning produces at least a marginal improvement in coherence).

There is something wrong with these benchmarks. They do not relate to real-world performance. They do not appear to be measuring a model's ability to actually understand the prompt/task, but possibly only measuring its ability to provide an output that "looks correct" according to some format. These benchmarks are not a reliable way to compare model performance, and as long as we keep using them we will keep producing models that score higher on benchmarks and claim to perform "almost as good as GPT-3" yet fail spectacularly at any task/prompt that I can think of to throw at them.

(I keep using coding as an example however I have also tried other tasks besides code as I realise that code is possibly a particularly challenging task due to requirements like needing exact syntax. My interpretation of the various models' level of understanding is based on experience across a variety of tasks.)

153
10
submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 
 

As some may know, I maintain a few Docker images of some available tools, and I noticed I was suddenly getting an NVML mismatch error. For the life of me I could not figure out what the issue was and tried so many things. I finally noticed that the Docker image had a special driver, 535.86.10, where my host had 535.86.05. After figuring that out, I added this to my Dockerfile:

RUN apt-get update && apt-get remove --purge -y nvidia-* && \
    apt-get install -y --allow-downgrades nvidia-driver-535/jammy-updates

And voila, problem solved! I'm not sure what driver the Docker CUDA image was using; it might be some special dev driver that was causing a mismatch between the container and the host.

This only started happening as of the latest driver update, released late last month.

154
9
submitted 1 year ago* (last edited 1 year ago) by cll7793@lemmy.world to c/localllama
 
 

I wanted to make this post so we can share all the resources we have with each other on anything machine learning related.

Please feel free to add all of your resources as well even if they are duplicates.

PS: The best way to grow our Lemmy community is to produce high-quality posts.

Some ideas of things you could share:

  • Which people do you follow for AI (e.g., on YouTube, Twitter, etc.)?
  • What other social media forums provide great information?
  • What GUI do you use for local LLMs?
  • What parameters are "best"?
  • Is there a Wiki you use?
  • Where do you go to learn about LLMs/AI/Machine Learning?
  • How do you find quality models?
  • What Awesome GitHub repositories do you know of?
  • What do you think would be useful to share?

General Information - Awesome

LLM Leaderboards:

Places to Find Models

Training & Datasets

There are still many more resources out there I'm sure. Please share what you use to try to keep up with the fast pace of AI development.

I hope some of my resources have helped you! I'm eager to hear what other resources are out there!

155
 
 

Just wondering if anyone has any suggestions to keep things moving and growing. I was thinking of doing a daily quantized-models post just to keep up with TheBloke. Thoughts?

156
 
 

This one is based on Llama 2. The first one worked very well for rule and structure following with guidance, so I'm highly intrigued to see if this one lives up to its predecessor.

157
 
 

Click Here to be Taken to the Megathread!

from !fosai@lemmy.world

Vicuna v1.5 Has Been Released!

Shoutout to GissaMittJobb@lemmy.ml for catching this in an earlier post.

Given Vicuna was a widely appreciated member of the original Llama series, it'll be exciting to see this model evolve and adapt with fresh datasets and new training and fine-tuning approaches.

Feel free to use this megathread to chat about Vicuna and any of your experiences with Vicuna v1.5!

Starting off with Vicuna v1.5

TheBloke is already sharing models!

Vicuna v1.5 GPTQ

7B

13B


Vicuna Model Card

Model Details

Vicuna is a chat assistant fine-tuned from Llama 2 on user-shared conversations collected from ShareGPT.

Developed by: LMSYS

  • Model type: An auto-regressive language model based on the transformer architecture
  • License: Llama 2 Community License Agreement
  • Finetuned from model: Llama 2

Model Sources

Uses

The primary use of Vicuna is for research on large language models and chatbots. The target userbase includes researchers and hobbyists interested in natural language processing, machine learning, and artificial intelligence.

How to Get Started with the Model
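
The original card points to the Model Sources above (LMSYS's FastChat repo) for the full instructions. As a rough sketch, loading it through Hugging Face transformers should look something like the following (the repo id and prompt preamble are my assumptions; check the LMSYS page for the exact names):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "lmsys/vicuna-13b-v1.5"   # assumed repo id; verify on the LMSYS Hugging Face page
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

    # Vicuna expects its USER/ASSISTANT conversation style
    prompt = ("A chat between a curious user and an artificial intelligence assistant. "
              "USER: Hello, who are you? ASSISTANT:")
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    print(tok.decode(out[0], skip_special_tokens=True))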

Training Details

Vicuna v1.5 is fine-tuned from Llama 2 using supervised instruction fine-tuning. The model was trained on approximately 125K conversations collected from ShareGPT.com.

For additional details, please refer to the "Training Details of Vicuna Models" section in the appendix of the linked paper.

Evaluation Results

Vicuna Evaluation Results

Vicuna is evaluated using standard benchmarks, human preferences, and LLM-as-a-judge. For more detailed results, please refer to the paper and leaderboard.

158
 
 

I've been using airoboros-l2-70b for writing fiction, and while overall I'd describe the results as excellent and better than any llama1 model I've used, it doesn't seem to be living up to the promise of 4k token sequence length.

Around 2,500 tokens, output quality degrades rapidly: it either starts repeating previous text verbatim or becomes incoherent (grammar, punctuation, and capitalization disappear, and the output turns into a salad of vaguely related words).

Any other experiences with llama2 and long context? Does the base model work better? Are other fine-tunes behaving similarly? I'll try it myself eventually, but the 70b models are chunky downloads, and experimentation takes a while at 1 t/s.

(I'm using GGML Q4_K_M on kobold.cpp, with rope scaling off like you're supposed to do with llama2)

159
18
submitted 1 year ago* (last edited 1 month ago) by cll7793@lemmy.world to c/localllama
 
 

(Deleted; no longer relevant)

160
 
 

Hi!

I have an ASUS AMD Advantage Edition laptop (https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/) that runs Windows. I still haven't found time to install Linux and set it up the way I like, even after more than a year.

I'm just dropping a small write-up of the setup I'm using with llama.cpp to run on the discrete GPU using CLBlast.

You can use Kobold, but it's meant for more role-playing stuff and I wasn't really interested in that. Funny thing is, Kobold can also be set up to use the discrete GPU if needed.

  1. For starters you'd need llama.cpp itself from here: https://github.com/ggerganov/llama.cpp/tags.

    Pick the CLBlast version, which will help offload some computation to the GPU. Unzip the download to a directory; I unzipped mine to "D:\Apps\llama".

  2. You'll need an LLM now, which can be obtained from HuggingFace or wherever else you'd like. Just note that it should be in GGML format; if in doubt, the models from HuggingFace will have "ggml" written somewhere in the filename. The ones I downloaded were "nous-hermes-llama2-13b.ggmlv3.q4_1.bin" and "Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin"

  3. Move the models to the llama directory you made above. That makes life much easier.

  4. You don't really need to navigate to the directory using Explorer. Just open PowerShell anywhere and run cd D:\Apps\llama\

  5. Here comes the fiddly part. You need to get the device IDs for the GPU. An easy way to check this is to use "GPU Caps Viewer": go to the tab titled OpenCL and check the dropdown next to "No. of CL devices".

    The discrete GPU is normally listed second, after the integrated GPU. In my case the integrated GPU was gfx90c and the discrete one was gfx1031c.

  6. In the PowerShell window, set the environment variables that tell llama.cpp which OpenCL platform and device to use. If you're using the AMD driver package, OpenCL is already installed, so you needn't uninstall or reinstall any drivers.

    $env:GGML_OPENCL_PLATFORM = "AMD"

    $env:GGML_OPENCL_DEVICE = "1"

  7. Check if the variables are exported properly

    Get-ChildItem env:GGML_OPENCL_PLATFORM
    Get-ChildItem env:GGML_OPENCL_DEVICE

    This should return the following:

    Name                   Value
    ----                   -----
    GGML_OPENCL_PLATFORM   AMD
    GGML_OPENCL_DEVICE     1

    If GGML_OPENCL_PLATFORM doesn't show AMD, try exporting this: $env:GGML_OPENCL_PLATFORM = "AMD"

  8. Once these are set properly, run llama.cpp using the following:

    D:\Apps\llama\main.exe -m D:\Apps\llama\Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin -ngl 33 -i --threads 8 --interactive-first -r "### Human:"

    OR

    Replace the Wizard model with nous-hermes-llama2-13b.ggmlv3.q4_1.bin or whatever LLM you'd like. I like to play with 7B and 13B models with 4_0 or 5_0 quantization. You might need to trawl through the fora here to find parameters for temperature, etc. that work for you.

  9. To check whether these work, I've posted the output at pastebin since formatting it here was a paaaain: https://pastebin.com/peSFyF6H

    salient features @ gfx1031c (6800M discrete graphics):
    llama_print_timings: load time = 60188.90 ms
    llama_print_timings: sample time = 3.58 ms / 103 runs ( 0.03 ms per token, 28770.95 tokens per second)
    llama_print_timings: prompt eval time = 7133.18 ms / 43 tokens ( 165.89 ms per token, 6.03 tokens per second)
    llama_print_timings: eval time = 13003.63 ms / 102 runs ( 127.49 ms per token, 7.84 tokens per second)
    llama_print_timings: total time = 622870.10 ms

    salient features @ gfx90c (cezanne architecture integrated graphics):
    llama_print_timings: load time = 26205.90 ms
    llama_print_timings: sample time = 6.34 ms / 103 runs ( 0.06 ms per token, 16235.81 tokens per second)
    llama_print_timings: prompt eval time = 29234.08 ms / 43 tokens ( 679.86 ms per token, 1.47 tokens per second)
    llama_print_timings: eval time = 118847.32 ms / 102 runs ( 1165.17 ms per token, 0.86 tokens per second)
    llama_print_timings: total time = 159929.10 ms

Edit: added pastebin since I actually forgot to link it. https://pastebin.com/peSFyF6H

161
10
submitted 1 year ago* (last edited 1 year ago) by cll7793@lemmy.world to c/localllama
 
 

Leaderboard scores can often be a bit misleading, since there are other factors to consider.

  • Censorship: Is the model censored?
  • Verbosity: How concise is the output?
  • Intelligence: Does the model know what it is talking about?
  • Hallucination: How often does the model make up facts?
  • Domain Knowledge: What specialization does the model have?
  • Size: Best models for 70b, 30b, 7b respectively.

And much more! What models do you use and would recommend to everyone?

The model that has caught my attention the most personally is the original 65b Llama. It seems genuine and truly has a personality. Everyone should chat with the original non-fine-tuned version if they get a chance. It's an experience that is quite unique within the sea of "As an AI language model" OpenAI tunes.

162
8
submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 
 
163
 
 
164
165
 
 

This is actually a pretty big deal: exllama is by far the most performant inference engine out there for CUDA. The strangest thing is that the PR claims it works for StarCoder, which is a non-llama model:

https://github.com/huggingface/text-generation-inference/pull/553

So I'm extremely curious to see what this brings...

166
5
A nice write up for LMQL (analyticsindiamag.com)
submitted 1 year ago by noneabove1182 to c/localllama
 
 

For the uninitiated, LMQL is one of a few offerings in the AI space (alongside others like guidance and guardrails) that allows for finer control of the AI's output and lets you guarantee certain patterns, which is hugely helpful for a whole variety of use cases (tool use, safe chatbots, parseable data).

167
 
 

For example, does a 13B parameter model at Q2_K quantization perform worse than a 7B parameter model at 8-bit or 16-bit?

168
 
 

I'm trying to learn more about LLMs, but I haven't found any explanation for what determines which prompt template format a model requires.

For example meta-llama's llama-2 requires this format:

...[INST] and <<SYS>> tags, BOS and EOS tokens...

But if I instead download TheBloke's version of llama-2, the prompt template should instead be:

SYSTEM: ...

USER: {prompt}

ASSISTANT:
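
Written out, the two styles I mean look roughly like this (my own approximation based on the two model cards, just to make the question concrete):

    def llama2_chat_prompt(system: str, user: str) -> str:
        # format described on meta-llama's llama-2 chat model card
        return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

    def thebloke_card_prompt(system: str, user: str) -> str:
        # SYSTEM/USER/ASSISTANT style listed on TheBloke's card
        return f"SYSTEM: {system}\nUSER: {user}\nASSISTANT:"

    print(llama2_chat_prompt("You are a helpful assistant.", "Hello!"))
    print(thebloke_card_prompt("You are a helpful assistant.", "Hello!"))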

I thought this would have been determined by how the original training data was formatted, but afaik TheBloke only converted the llama-2 models from one format to another. Looking at the documentation for the GGML format, I don't see anything related to the prompt being embedded in the model file.

Anyone who understands this stuff who could point me in the right direction?

169
170
11
submitted 1 year ago* (last edited 1 month ago) by cll7793@lemmy.world to c/localllama
 
 

(Deleted; no longer relevant)

171
 
 

Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!

With this code you can train the Llama 2 LLM architecture from scratch in PyTorch, then save the weights to a raw binary file, then load that into one ~simple 500-line C file (run.c) that inferences the model, simply in fp32 for now. On my cloud Linux devbox a dim 288 6-layer 6-head model (~15M params) inferences at ~100 tok/s in fp32, and about the same on my M1 MacBook Air. I was somewhat pleasantly surprised that one can run reasonably sized models (few ten million params) at highly interactive rates with an approach this simple.

https://twitter.com/karpathy/status/1683143097604243456
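
Side note: this isn't karpathy's actual export code, but the "save the weights to a raw binary file" step is conceptually just flattening the fp32 tensors in a fixed, agreed-upon order that the C side reads back, something like:

    import numpy as np

    def export_raw_fp32(tensors, path):
        """Write tensors back-to-back as little-endian fp32, in a fixed order the reader must mirror."""
        with open(path, "wb") as f:
            for _name, t in tensors:
                f.write(np.asarray(t, dtype="<f4").ravel().tobytes())

    # toy example: two fake "layers" written in a known order
    export_raw_fp32([("wte", np.random.randn(32, 8)), ("w1", np.random.randn(8, 8))], "model.bin")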

172
173
 
 

This model is based on llama1, so it is for non-commercial use only. Future versions will be trained on llama2 and other open models that are suitable for commercial use.

This model is uncensored. I have filtered the dataset to remove alignment and bias. This makes the model compliant to any requests. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant to any requests, even unethical ones. Please read my blog post about uncensored models. https://erichartford.com/uncensored-models You are responsible for any content you create using this model. Enjoy responsibly.

Quants can of course be found from TheBloke:

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GGML

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GPTQ

174
 
 

This is Llama 2 13b with some additional attention heads from original-flavor Llama 33b frankensteined on.

Fine-tuned on ~10M tokens from RedPajama to settle in the transplants a little.

Not intended for use as-is - this model is meant to serve as a base for further tuning, hopefully with a greater capacity for learning than 13b.

175
 
 

Disclaimer: take this with a grain of salt; it is heavily researched, but only by me, and I am no expert. I will link the relevant studies at the bottom.

For starters, a quick summary of the quantization process for anyone who needs some background. Basically, we're trying to convert these really accurate weights from 16-bit floating point (FP16) to 4-bit values to save on size. Saving size is important because we want to fit really big models into a smaller amount of VRAM. To do this, we take a weight, quantize it, then observe the loss of the model, adjusting all non-quantized weights accordingly in an attempt to minimize this loss. It's important to note that the numbers you see on models - 7B, 13B, 33B etc. - correspond to the total number of weights, so doing this one at a time is… not ideal. Also note that these weights are represented in matrices, with one row for every output of the previous layer and one column for every input in the current layer, so in this context a "weight vector" represents the set of weights connecting that neuron to each neuron in the previous layer. There were some proposed solutions similar to what we have now, but I'll skip them for a semblance of brevity.

Okay, that’s the backstory. Onto the main story…

Enter Binary-Coding Quantization, a "versatile non-uniform quantization scheme." The concept is that, when quantizing a weight vector, we squash the values by dividing them all by a scaling factor (equal to the largest absolute value in the weight vector divided by the maximum value our quantization can represent), and we save that factor for when we later need to de-quantize.

From here, we observe that weight vectors can be grouped in such a way that their scaling factor can be shared across them with minimal loss. This is where groupsize comes in: it indicates how many weight vectors are grouped together to share a single scaling factor. The more of them we group together, the less information we need to save, but conversely more information is lost, as accuracy has to be sacrificed to share the same scaling factor.

And that’s it! You now know what groupsize means and why it changes the model’s size!

Next up is actorder.

Activation Order is a method of examining which weight vectors make most sense to quantize first in order to maintain important information. Originally this was done by greedily selecting whichever weight vector would result in the least loss, but it was observed that this method is actually barely better than random selection. With this in mind, a new solution was proposed. We start by observing which columns have the largest activation magnitude, that is, the weight vectors which most contribute to the final output of the model because they cause the most neuron activations.

After gathering that information, we start our quantization with those values, because that means they will most closely reflect their original values after the full quantization is done. Remember, after we quantize a vector, that's it, it's locked in. That means that if we left some of our important vectors until the end, not only might they have been adjusted several times during the process, but more importantly there would remain very few extra columns that we could adjust to make up for the quantization loss. So starting with these values, i.e. act-order or desc_act (the two terms are used interchangeably), should result in a minor increase in performance.
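
Again purely as a toy illustration (not the real implementation), the ordering step amounts to something like ranking columns by their average activation magnitude on some calibration data and then quantizing in that order:

    import numpy as np

    def activation_order(calibration_acts):
        """Rank input columns by mean absolute activation on calibration data.

        calibration_acts: shape (n_samples, n_inputs), activations feeding this layer.
        Returns column indices, most "active" first; those get quantized first.
        """
        importance = np.abs(calibration_acts).mean(axis=0)
        return np.argsort(-importance)

    rng = np.random.default_rng(0)
    # pretend calibration activations where a few columns clearly dominate
    acts = rng.normal(size=(256, 8)) * np.array([5.0, 0.1, 3.0, 0.2, 1.0, 0.05, 2.0, 0.5])

    print("quantization order:", activation_order(acts))  # prints indices from most to least active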

Side note, I’m not positive at this time why it results in an increase to model size, my best guess is that it involves rearranging the vectors in memory in ways that are no longer optimal and can’t be properly mapped into the VRAM without wasting space, but that’s a pure guess and I would love if someone chimed in with more info. My other guess is that groupsize is either not applied to those sensitive weight vectors, or that they’re applied more selectively (grouping sensitive vectors with non-sensitive vectors) and that difference results in a change. If anyone has any ideas please feel free to enlighten me.

And that’s it! To sum it up, group size means quantizing in groups rather than individually, resulting in smaller models that are quantized faster, and act order means to quantize in order of activation magnitude to try to preserve as much of the important information as possible.

If you stuck through that wall of text, thanks! I hope it was insightful (and accurate)

Sources:

https://arxiv.org/abs/2206.09557 (group size explanation)

https://arxiv.org/abs/2306.02272 (act order explanation)
