LocalLLaMA

2590 readers
7 users here now

Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

founded 2 years ago
201

Disclaimer: take this with a grain of salt. It is heavily researched, but only by me, and I am no expert. I will link the relevant studies at the bottom.

For starters, a quick summary of the quantization process for anyone who needs some background. Basically, we’re trying to convert these really accurate 16-bit floating point (FP16) weights into 4-bit values to save on size. Saving size is important because we want to fit really big models into a smaller amount of VRAM. To do this, we take a weight, quantize it, then observe the loss of the model, adjusting all non-quantized weights accordingly in an attempt to minimize this loss. It’s important to note that the numbers you see on models - 7B, 13B, 33B etc. - correspond to the total number of weights, so doing this one at a time is… not ideal.

Also note that these weights are arranged in matrices, with one row for every output of the previous layer and one column for every input of the current layer, so in this context a “weight vector” means the set of weights connecting one neuron to each neuron in the previous layer. There were some proposed solutions similar to what we have now, but I’ll skip them for a semblance of brevity.
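To make that quantize-then-compensate loop concrete, here is a deliberately crude sketch (my own illustration, not code from the papers linked below): it rounds one weight at a time and spreads the rounding error evenly over the weights that haven't been quantized yet. Real GPTQ uses second-order (Hessian) information to decide how much error each remaining weight should absorb, so treat this as bookkeeping only.

```python
import numpy as np

def toy_quantize_and_compensate(w, bits=4):
    """Quantize one weight at a time, letting the not-yet-quantized weights
    absorb the rounding error. Purely illustrative: real GPTQ weights the
    compensation using the inverse Hessian rather than spreading it evenly."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w = w.astype(np.float64).copy()
    for i in range(len(w)):
        q = np.round(w[i] / scale) * scale      # quantize (and de-quantize) this weight
        err = w[i] - q                          # rounding error we just introduced
        w[i] = q
        remaining = len(w) - i - 1
        if remaining:
            w[i + 1:] -= err / remaining        # crude, even compensation
    return w

w = np.random.randn(16) * 0.02
print(np.round(toy_quantize_and_compensate(w), 4))
```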

Okay, that’s the backstory. Onto the main story…

Enter Binary-Coding Quantization, a “versatile non-uniform quantization scheme.” The concept is that, when quantizing a weight vector, we rescale the values by dividing them all by a scale factor (the largest magnitude in the weight vector divided by the largest value the quantized format can represent), which maps them onto the small grid of quantized values, and we save that scale factor for when we later need to de-quantize.
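As a rough illustration of the scaling idea (a minimal sketch of per-vector scaling, not the actual BCQ or GPTQ code), quantizing and de-quantizing one weight vector with a single saved scale factor looks something like this:

```python
import numpy as np

def quantize_vector(w, bits=4):
    """Quantize a weight vector to small signed integers with one saved scale."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit signed values
    scale = np.abs(w).max() / qmax              # the number we keep for de-quantization
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_vector(q, scale):
    """Recover approximate FP weights from the stored integers and the scale."""
    return q.astype(np.float32) * scale

w = (np.random.randn(8) * 0.02).astype(np.float32)
q, scale = quantize_vector(w)
print(q, scale)
print("max reconstruction error:", np.abs(w - dequantize_vector(q, scale)).max())
```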

From here, we observe that weight vectors can be grouped in such a way that their scaling factor can be shared across them with minimal loss. This is where groupsize comes in: it indicates how many weight vectors are grouped together to share a single scaling factor. This means that the more of them we group together, the less information we need to save, but it also means that more information is lost, since accuracy has to be sacrificed for them to share the same value.
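Here is a toy sketch of that trade-off (again my own illustration, with the grouping done along columns purely for simplicity): larger groups mean fewer stored scale factors, but a worse reconstruction.

```python
import numpy as np

def quantize_grouped(W, group_size, bits=4):
    """Quantize a weight matrix, sharing one scale factor per group of columns."""
    qmax = 2 ** (bits - 1) - 1
    W_hat = np.empty_like(W)
    scales_stored = 0
    for start in range(0, W.shape[1], group_size):
        block = W[:, start:start + group_size]
        scale = np.abs(block).max() / qmax          # one scale shared by the whole group
        W_hat[:, start:start + group_size] = np.round(block / scale) * scale
        scales_stored += 1
    return W_hat, scales_stored

W = (np.random.randn(64, 512) * 0.02).astype(np.float32)
for g in (32, 128, 512):
    W_hat, n = quantize_grouped(W, g)
    print(f"groupsize={g:>3}: {n:>2} scales stored, "
          f"mean abs error={np.abs(W - W_hat).mean():.6f}")
```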

And that’s it! You now know what groupsize means and why it changes the model’s size!

Next up is actorder.

Activation Order is a method of examining which weight vectors make the most sense to quantize first in order to maintain important information. Originally this was done by greedily selecting whichever weight vector would result in the least loss, but it was observed that this method is actually barely better than random selection. With this in mind, a new solution was proposed: we start by observing which columns have the largest activation magnitude - that is, the weight vectors that contribute most to the final output of the model because they cause the largest neuron activations.

After gathering that information, we start our quantization with those values, because that means they will most closely reflect their original values after the full quantization is done. Remember, after we quantize a vector, that's it, it's locked in. That means that if we left some of our important vectors until the end, not only might they have been adjusted several times during the process, but, more importantly, there would remain very few extra columns that we could adjust to make up for the quantization loss. So starting with these values (i.e. act-order or desc_act, the two terms are used interchangeably) should result in a minor increase in performance.
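As a rough sketch of how such an ordering can be derived from calibration data (my own simplification; the actual desc_act implementation in GPTQ works from the Hessian it accumulates over these activations):

```python
import numpy as np

def activation_order(X):
    """Return column indices sorted so the most 'active' columns come first.

    X holds calibration activations with shape (n_samples, n_columns); the mean
    squared activation per column is a simple proxy for how much that column's
    weights contribute to the layer output."""
    importance = (X ** 2).mean(axis=0)
    return np.argsort(importance)[::-1]             # largest activation magnitude first

X = np.random.randn(1000, 512).astype(np.float32)
X[:, 7] *= 10                                       # make one column clearly dominant
order = activation_order(X)
print("quantize these columns first:", order[:5])
```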

Side note: I'm not positive at this time why it results in an increase in model size. My best guess is that it involves rearranging the vectors in memory in ways that are no longer optimal and can't be properly mapped into VRAM without wasting space, but that's a pure guess and I would love it if someone chimed in with more info. My other guess is that groupsize is either not applied to those sensitive weight vectors, or that it's applied more selectively (grouping sensitive vectors with non-sensitive vectors), and that difference results in the change. If anyone has any ideas, please feel free to enlighten me.

And that's it! To sum it up: group size means sharing one scaling factor across a group of weight vectors rather than storing one per vector, resulting in smaller models that quantize faster, and act order means quantizing in order of activation magnitude to try to preserve as much of the important information as possible.

If you stuck through that wall of text, thanks! I hope it was insightful (and accurate)

Sources:

https://arxiv.org/abs/2206.09557 (group size explanation)

https://arxiv.org/abs/2306.02272 (act order explanation)

202
203

cross-posted from: https://lemmy.world/post/1894070

Welcome to the Llama-2 FOSAI & LLM Roundup Series!

(Summer 2023 Edition)

Hello everyone!

The wave of innovation I mentioned in our Llama-2 announcement is already on its way. The first tsunami of base models and configurations is being released as you read this post.

That being said, I'd like to take a moment to shout out TheBloke, who is rapidly converting many of these models for the greater good of FOSS & FOSAI.

You can support TheBloke here.

Below you will find all of the latest Llama-2 models that are FOSAI friendly. This means they are commercially available, ready to use, and open for development. I will be continuing this series exclusively for Llama models. I have a feeling it will continue being a popular choice for quite some time. I will consider giving other foundational models a similar series if they garner enough support and consideration. For now, enjoy this new herd of Llamas!

All that you need to get started is capable hardware and a few moments setting up your inference platform (selected from any of your preferred software choices in the Lemmy Crash Course for Free Open-Source AI or FOSAI Nexus resource, which is also shared at the bottom of this post).

Keep reading to learn more about the exciting new models coming out of Llama-2!

8-bit System Requirements

| Model | VRAM Used | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|---|
| LLaMA-7B | 9.2GB | 10GB | 3060 12GB, 3080 10GB | 24 GB |
| LLaMA-13B | 16.3GB | 20GB | 3090, 3090 Ti, 4090 | 32 GB |
| LLaMA-30B | 36GB | 40GB | A6000 48GB, A100 40GB | 64 GB |
| LLaMA-65B | 74GB | 80GB | A100 80GB | 128 GB |

4-bit System Requirements

| Model | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|---|---|---|---|
| LLaMA-7B | 6GB | GTX 1660, 2060, AMD 5700 XT, RTX 3050, 3060 | 6 GB |
| LLaMA-13B | 10GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000 | 12 GB |
| LLaMA-30B | 20GB | RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100 | 32 GB |
| LLaMA-65B | 40GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 | 64 GB |

*System RAM (not VRAM) is used to initially load a model. You can use swap space if you do not have enough RAM to support your LLM.


The Bloke

One of the most popular and consistent developers releasing consumer-friendly versions of LLMs. These active conversions of trending models allow many of us to run GPTQ or GGML variants at home on our own PCs and hardware.

70B

13B

7B

LLongMA

LLongMA-2, a suite of Llama-2 models, trained at 8k context length using linear positional interpolation scaling.

13B

7B

Also available from The Bloke in GPTQ and GGML formats:

7B

Puffin

The first commercially available language model released by Nous Research! Available in 13B parameters.

13B

Also available from The Bloke in GPTQ and GGML formats:

13B

Other Models

Leaving a section here for 'other' LLMs or fine-tunes derived from Llama-2 models.

7B


Getting Started w/ FOSAI!

Have no idea where to begin with AI/LLMs? Try starting here with UnderstandGPT to learn the basics of LLMs before visiting our Lemmy Crash Course for Free Open-Source AI

If you're looking to explore more resources, see our FOSAI Nexus for a list of all the major FOSS/FOSAI in the space.

If you're looking to jump right in, visit some of the links below and stick to models that are <13B parameters (unless you have the power and hardware to spare).

FOSAI Resources

Fediverse / FOSAI

LLM Leaderboards

LLM Search Tools

GL, HF!

If you found anything about this post interesting - consider subscribing to [email protected] where I do my best to keep you in the know about the most important updates in free open-source artificial intelligence.

I will try to continue doing this series season by season, making this a living post for the rest of this summer. If I have missed a noteworthy model, don't hesitate to let me know in the comments so I can keep this resource up-to-date.

Thank you for reading! I hope you find what you're looking for. Be sure to subscribe and bookmark this page if you want a quick one-stop shop for all of the new Llama-2 models that will be emerging the rest of this summer!

204
6
submitted 2 years ago* (last edited 3 months ago) by [email protected] to c/localllama

(Deleted, not relevant anymore)

205

Things are still moving fast. It's mid/late July now and I've spent some time outside, enjoying the summer. It's been a few weeks since things exploded back in May this year. Have you people settled down in the meantime?

I've since moved away from Reddit, and I miss the LocalLLaMA over there, which was/is buzzing with activity and AI news (and discussions) every day.

What are you people up to? Have you gotten tired of your AI waifus? Or finished indexing all of your data into some vector database? Have you discovered new applications for AI? Or still toying around and evaluating all the latest fine-tuned variations in constant pursuit of the best llama?

206
207
24
Llama 2 - Meta AI (ai.meta.com)
submitted 2 years ago* (last edited 2 years ago) by noneabove1182 to c/localllama
208

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost O(1) inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at this https URL.
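To make the "recurrent representation enables O(1) inference" claim a bit more tangible, here is a heavily simplified single-head sketch of the recurrent retention update (my own reading of the abstract; the real RetNet also uses xPos-style rotations, multi-scale decay per head, and group normalization, all omitted here):

```python
import numpy as np

def retention_recurrent(Q, K, V, gamma=0.97):
    """Simplified recurrent retention: a fixed-size state S is updated per token,
    S_n = gamma * S_{n-1} + K_n^T V_n, and the output is out_n = Q_n S_n.
    The constant-size state is what gives O(1) cost per generated token."""
    seq_len, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((seq_len, d_v))
    for n in range(seq_len):
        S = gamma * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
print(retention_recurrent(Q, K, V).shape)           # (16, 8)
```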

209

https://github.com/noneabove1182/text-generation-webui-docker

https://github.com/noneabove1182/lollms-webui-docker

https://github.com/noneabove1182/koboldcpp-docker

All should include up-to-date instructions. If you find any issues, please ping me immediately so I can take a look, or open an issue :)

210
211

I've tried a few localGPT frameworks like LocalAI, Text-UI, and AutoGPT, and none of them seem to have a decent web crawler, as far as I could tell. Does anyone have a good recommendation for a localGPT setup that does that? Thanks!

212

Apologies for the basic question, but what's the difference between GGML and GPTQ? Do these just refer to different compression methods? Which would you choose if you're using a 3090ti GPU?

213

cross-posted from: https://lemmy.world/post/1428161

Mark Zuckerberg & Meta to Release Commercial Version of its AI/LLM (LLaMA) In Effort to Catch Rivals

Hello everyone. I have some very exciting news to share with you today. Mark Zuckerberg & Meta are poised to release a commercial version of LLaMA in the near future. This is huge for us!

The current generation of LLaMA is under a non-commercial license for research use only. This means the large community behind it has been unable to utilize it in wider business applications of their own, hindering innovation.

With this, Mark may open up a channel for another surge in open-source AI that accelerates us forward, (hopefully) ahead of mutual competitors such as Google, OpenAI, etc.

You should read the full article here, but I will leave you with highlights below.

Meta released its own language model to researchers and academics earlier this year, but the new version will be more widely available and customisable by companies © FT montage/Bloomberg/Dreamstime

Meta is poised to release a commercial version of its artificial intelligence model, allowing start-ups and businesses to build custom software on top of the technology.

The move will allow Meta to compete with Microsoft-backed OpenAI and Google, which are surging ahead in the race to develop generative AI. The software, which can create text, images and code, is powered by large language models (LLMs) that are trained on huge amounts of data and require vast computing power.

Meta released its own language model, known as LLaMA, to researchers and academics earlier this year, but the new version will be more widely available and customisable by companies, three people familiar with the plans said. The release is expected imminently, one of the people said.

Meta says its LLMs are “open-source”, by which it means details of the new model will be released publicly. This contrasts with the approach of competitors such as OpenAI, whose latest model GPT-4 is a so-called black box in which the data and code used to build the model are not available to third parties.

“The competitive landscape of AI is going to completely change in the coming months, in the coming weeks maybe, when there will be open source platforms that are actually as good as the ones that are not,” vice-president and chief AI scientist at Meta, Yann LeCun, said at a conference in Aix-en-Provence last Saturday.

Meta’s impending release comes as a race among Silicon Valley tech groups to establish themselves as dominant AI participants is heating up.

The/CUT (TLDR)

Mark Zuckerberg's Meta is set to shake up the AI race with the release of a commercially available version of their language model, LLaMA. The move is poised to democratize AI by making the model open-source and customizable by startups and businesses, fostering a broader community of innovation. As the AI landscape heats up with giants like OpenAI and Google, Meta's decision encourages an open view of the future, one in which collaboration with LLMs could redefine the bounds of technology.

What do you think about this news? How do you feel about Zuck making LLaMA commercially available? Do you have any projects you plan to build with this knowledge?

Only time will tell how honest he is in this release. For now, we'll have to be patient and see how it all unfolds.

Best case scenario - it encourages other big tech companies to follow, accelerating our journey to SGI/AGI. Worst case scenario - we get a massive influx of cool new LLaMA models.

It's a win for us either way! We take those.

If you found any of this interesting, consider subscribing to [email protected] where I do my best to keep you in the know with the latest developments and breakthroughs in free open-source artificial intelligence.

Thank you for reading! Check out some of these resources below if you want to learn more about this announcement (or get started with free, open-source AI).

Related Links

214

Open Orca preview trained on ~6% of data:

We have trained on less than 6% of our data, just to give a preview of what is possible while we further refine our dataset! We trained a refined selection of 200k GPT-4 entries from OpenOrca. We have filtered our GPT-4 augmentations to remove statements like, "As an AI language model..." and other responses which have been shown to harm model reasoning capabilities. Further details on our dataset curation practices will be forthcoming with our full model releases.
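As an aside, the phrase-based filtering they describe can be as simple as the sketch below. The phrase list and dataset fields here are my own placeholders, not OpenOrca's actual pipeline:

```python
# Hypothetical phrases and fields, only to illustrate the described filtering step.
BANNED_PHRASES = [
    "as an ai language model",
    "i'm sorry, but i cannot",
]

def keep_example(example):
    response = example["response"].lower()
    return not any(phrase in response for phrase in BANNED_PHRASES)

dataset = [
    {"prompt": "Explain photosynthesis.", "response": "Plants convert light into chemical energy."},
    {"prompt": "Tell me a secret.", "response": "As an AI language model, I cannot do that."},
]
filtered = [ex for ex in dataset if keep_example(ex)]
print(len(filtered))  # 1
```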

215
16
submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama

I've been messing around with GPTQ models with ExLlama in ooba, and have gotten 33b models @ 3k running smoothly, but was looking to try something bigger than my VRAM can hold.

However, I'm clearly doing something wrong, and the koboldcpp.exe documentation isn't clear to me. Does anyone have a good setup guide? My understanding is koboldcpp.exe is preferable for GGML, as ooba's llama.cpp doesn't support GGML at >4k context yet.

216

cross-posted from: https://lemmy.world/post/1306474

CStanKonrad has Released an Early Version of long_llama: Focused Transformer (FoT) Training for Context Scaling

This repository contains the research preview of LongLLaMA, a large language model capable of handling long contexts of 256k tokens or even more.

LongLLaMA is built upon the foundation of OpenLLaMA and fine-tuned using the Focused Transformer (FoT) method. We release a smaller 3B variant of the LongLLaMA model on a permissive license (Apache 2.0) and inference code supporting longer contexts on Hugging Face. Our model weights can serve as the drop-in replacement of LLaMA in existing implementations (for short context up to 2048 tokens). Additionally, we provide evaluation results and comparisons against the original OpenLLaMA models. Stay tuned for further updates.

This is an awesome resource to pair alongside the recent FoT breakthroughs covered in this paper/post here.

Focused Transformer: Contrastive Training for Context Scaling (FoT) presents a simple method for endowing language models with the ability to handle context consisting possibly of millions of tokens while training on significantly shorter input. FoT permits a subset of attention layers to access a memory cache of (key, value) pairs to extend the context length. The distinctive aspect of FoT is its training procedure, drawing from contrastive learning. Specifically, we deliberately expose the memory attention layers to both relevant and irrelevant keys (like negative samples from unrelated documents). This strategy incentivizes the model to differentiate keys connected with semantically diverse values, thereby enhancing their structure. This, in turn, makes it possible to extrapolate the effective context length much beyond what is seen in training.
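The core mechanism is easier to picture with a toy example: a handful of attention layers get to attend over a cache of (key, value) pairs from outside the local context window. The sketch below (my own simplification) only shows that concatenation; the contrastive training procedure that makes the cache usable at very long ranges is the actual contribution of the paper.

```python
import numpy as np

def attention_with_memory(q, k_local, v_local, k_mem, v_mem):
    """Single-query attention where cached (key, value) pairs from earlier or
    external context are concatenated with the local keys and values."""
    K = np.concatenate([k_mem, k_local], axis=0)
    V = np.concatenate([v_mem, v_local], axis=0)
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 16
rng = np.random.default_rng(1)
out = attention_with_memory(
    rng.standard_normal(d),
    rng.standard_normal((8, d)), rng.standard_normal((8, d)),      # local context
    rng.standard_normal((256, d)), rng.standard_normal((256, d)),  # memory cache
)
print(out.shape)  # (16,)
```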

LongLLaMA is an OpenLLaMA model finetuned with the FoT method, with three layers used for context extension. Crucially, LongLLaMA is able to extrapolate much beyond the context length seen in training; e.g., in the passkey retrieval task, it can handle inputs of 256k tokens.

This is an incredible advancement in context lengths for LLMs. Less than a month ago we were excited to celebrate 6k context lengths. We are now blowing these metrics out of the water. It is only a matter of time before compute and efficiency gains follow and support these new possibilities.

If you found any of this interesting, please consider subscribing to /c/FOSAI where I do my best to keep you up to date with the most important updates and developments in the space.

Want to get started with FOSAI, but don't know how? Try starting with my Welcome Message and/or The FOSAI Nexus & Lemmy Crash Course to Free Open-Source AI.

217

cross-posted from: https://lemmy.world/post/1305651

OpenLM-Research has Released OpenLLaMA: An Open-Source Reproduction of LLaMA

TL;DR: OpenLM-Research has released a public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA. We are releasing a series of 3B, 7B and 13B models trained on different data mixtures. Our model weights can serve as the drop in replacement of LLaMA in existing implementations.

In this repo, OpenLM-Research presents a permissively licensed open source reproduction of Meta AI's LLaMA large language model. We are releasing a series of 3B, 7B and 13B models trained on 1T tokens. We provide PyTorch and JAX weights of pre-trained OpenLLaMA models, as well as evaluation results and comparison against the original LLaMA models. The v2 model is better than the old v1 model trained on a different data mixture.

This is pretty incredible news for anyone working with LLaMA or other open-source LLMs. This allows you to utilize the vast ecosystem of developers, weights, and resources that have been created for the LLaMA models, which are very popular in many AI communities right now.

With this, anyone can now hop into LLaMA R&D knowing they have avenues to utilize it within their projects and businesses (commercially).

Big shoutout to the team who made this possible (OpenLM-Research). You should support them by visiting their GitHub and starring the repo.

A handful of varying parameter models have been released by this team, some of which are already circulating and being improved upon.

Yet another very exciting development for FOSS! If I recall correctly, Mark Zuckerberg mentioned in his recent podcast with Lex Fridman that the next official version of LLaMA from Meta will be open-source as well. I am very curious to see how this model develops this coming year.

If you found any of this interesting, please consider subscribing to /c/FOSAI where I do my best to keep you up to date with the most important updates and developments in the space of free open-source artificial intelligence.

Want to get started with FOSAI, but don't know how? Try starting with my Welcome Message and/or The FOSAI Nexus & Lemmy Crash Course to Free Open-Source AI.

218
8
submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/localllama

So I am looking to get a GPU for my "beast" (a 24-core, 128GB tower with too much PCIe). I thought I might buy a used 3090, but then it hit me that most applications can work with multiple GPUs, so I decided I would take €600 to eBay and, using TechPowerUp, figure out their performance by looking at memory bandwidth and FP32 performance. This brought me to the following cards for my own LLaMA, Stable Diffusion and Blender setup: 5 Tesla K80s, 3 Tesla P40s, or 2 3060s, but I can't figure out which would be better for performance and future proofing. The main difference I found is the CUDA version, but I can't really figure out why that matters. The other thing I found is that 5 K80s are way more power hungry than 3 P40s, and that if memory size is really important the P40s are the way to go, but then I couldn't work out real performance numbers, as I can't find benchmarks like this one for Blender.

So if anyone has a nice source for Stable Diffusion and LLaMA benchmarks, I would appreciate it if you could share it. And if you have one (or several) of these cards and can tell me which option is better, I would appreciate it if you shared your opinion.

219

I just bought a "new" homelab server and am considering adding in some used/refurbished NVIDIA Tesla K80s. They have 24 GB of VRAM and tons of compute power for very cheap if you get them used.

The issue is that these cards run super hot and require extra cooling setups. I was able to find this fan adapter kit on eBay, but I still worry that if I pop one or two of these bad boys in my server, the fan won't be enough to overcome the raw heat put off by the K80.

Have any of you run this kind of card in a home lab setting? What kind of temps do you get when running models? Would a fan like this actually be enough to cool the thing? I appreciate any insight you guys might have!

220

https://github.com/vllm-project/vllm

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
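For anyone who wants to try it, a minimal usage sketch with vLLM's Python API looks like this (the model name is just an example; swap in whatever HF model you actually want to serve):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                  # any HF model you can fit
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], sampling)
for output in outputs:
    print(output.outputs[0].text)
```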

YouTube video describing it: https://youtu.be/1RxOYLa69Vw

221

I'm building an instruct dataset, but I'm not sure if I should replace some of it with general conversation instructions.

What's the consensus right now on the best ratio of data? I saw Lima even had r/writingprompt in there.

Basically, how do you guys structure your data when fine-tuning, and what do you add other than the basic instruct data? The ratio of code questions seems to be important as well.

The amounts in the different datasets are all over the place too - what do you guys aim for?

222

While I'm always hoping the latest newest model will be The One, in the mean time I'm sticking with Airoboros 33B gpt4 1.4 as the most powerful model I can currently run on my 3090.

It's not perfect by a long shot, but after extensive use I've learned what I can trust it with. It's good enough to summarize long text, to rephrase sentences, and to give me a decent starting point when I'm curious about some topic. Its reasoning skills are a bit below GPT-3.5, I'd say.

I also occasionally switch back to Guanaco 33B because it generates a different flavor of text, but I find it to be factually weaker than other similar models.

What are your favorites?

223
18
submitted 2 years ago* (last edited 2 years ago) by actuallyacat to c/localllama

Highlighting something cool that you may not have seen yet - today, kobold.cpp upgraded its context scaling to the newer NTK method, and now it's actually quite useful. This is different from SuperHOT - it works with unmodified models (although maybe a model specifically tuned for it would work even better; we'll see).

It's not in a release yet, so to try it out you need to pull the changes with git and build from source. After that, start kobold.cpp with --contextsize 4096 or 8192, put the same number (or less, see below) in the context length field in the UI (the slider only goes to 2048, but you can type in anything), and there you go. It works!

Prompt:

USER: The secret password is "six pancakes". Remember it and repeat when requested. Here is some filler, ignore it:
[5900 tokens worth of random alphanumeric characters]
USER: What's the secret password?
ASSISTANT:

Response:

The secret password is "six pancakes".

(model: airoboros-33b-gpt4-1.4, Q5_K_M - not a model finetuned for extended context!)

For comparison, with the linear scaling in kobold.cpp 1.33, this only worked up to about 2200 tokens.

This performs dramatically better, although there still seems to be some sort of memory overflow bug, as it will suddenly explode into random characters when crossing around 6000 tokens. So with 8k I suggest setting max tokens in the UI to 5900.

(note about doing this test in kobold.cpp: if you try it with context size set to 2048, it might still appear to work, but that's only because kobold is automatically trimming out some of the filler in the middle. This is only a valid test if you're not overfilling the context.)

According to perplexity measurements there is some degradation in overall quality, especially when the context is almost empty, but it's not noticeable to me, at least at 4k, which is what I tested. There's room for improvement in only applying as much scaling as is necessary at the current sequence length, to eliminate the degradation. I tried to implement that, but I just get garbage; I must be missing something. But in any case, at the rate things are progressing, someone else will have done it by the time I wake up tomorrow...
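For anyone curious what the NTK trick actually changes, here is a rough sketch based on the publicly discussed NTK-aware RoPE formula (my own illustration, not kobold.cpp's exact code):

```python
import numpy as np

def rope_inverse_frequencies(head_dim, base=10000.0, ntk_alpha=1.0):
    """RoPE inverse frequencies with NTK-aware scaling.

    Linear (SuperHOT-style) scaling compresses every position equally; the
    NTK-aware trick instead stretches the RoPE base so low-frequency dimensions
    change a lot while high-frequency ones barely move:
        base' = base * alpha ** (d / (d - 2))"""
    base = base * ntk_alpha ** (head_dim / (head_dim - 2))
    return 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)

print(rope_inverse_frequencies(128)[:4])               # stock 2k-context frequencies
print(rope_inverse_frequencies(128, ntk_alpha=4)[:4])  # scaled for roughly 4x the context
```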

224

So what is currently the best and easiest way to use an AMD GPU? For reference, I own an RX 6700 XT and wanted to run a 13B model, maybe SuperHOT, but I'm not sure if my VRAM is enough for that. Until now I've always stuck with llama.cpp since it's quite easy to set up. Does anyone have any suggestions?

225