noneabove1182

joined 2 years ago
[–] noneabove1182 1 points 1 year ago (1 children)

I'm not really sure I follow; it's just a simplification. The most appropriate phrasing, I guess, would be "given that A belongs to B, does it know B 'owns' A", like the example given: "A is the son of B, is B the parent of A".

[–] noneabove1182 5 points 1 year ago

To start, everything you're saying is entirely correct

However, the existence of emergent behaviours like chain-of-thought reasoning shows that there's more to this than pure text prediction: the model picks up patterns that were never explicitly trained, so it's entirely reasonable to ask whether it can recognize reversed relationships.

Hallucinations are a vital part of understanding these models; they might not be a long-term problem, but getting models to recognize what they actually know to be true is extremely important for the growth and adoption of LLMs.

I think there's a lot more to the training and the generation of text than you're giving it credit for. The simplest way to explain it is as text prediction, but there's far too much depth to the training and the model to say that's all it is.

At the end of the day it's just a fun, thought-provoking post :) but when Andrej Karpathy says he doesn't have a great intuition for how LLM knowledge works (though in fairness he theorizes the same as you, directional learning), I think we can at least agree that none of us knows for sure what is correct!

[–] noneabove1182 2 points 1 year ago

Yeah, I guess I meant more that it just doesn't get nearly as much attention, but you're right, some is starting to appear and that's quite nice.

[–] noneabove1182 16 points 1 year ago (10 children)

My biggest problem with vaping coverage is that there's basically no distinction made between the e-cigarettes this article addresses and vaping dry herbs... I would love to read up on it and any possible health concerns, but it's rarely discussed.

[–] noneabove1182 3 points 1 year ago (2 children)

Woah, this is pretty interesting stuff. I wonder how practical it is to do; I don't see a repo offering a script or anything, so it may be quite involved, but it looks promising. Anything that reduces size while maintaining performance is huge right now.

[–] noneabove1182 2 points 1 year ago

Good question. At the time I made it there wasn't a good option, and the one in the main repo is very comprehensive but overwhelming; I wanted to make one that was straightforward and easier to digest, so you can see what's actually happening.

[–] noneabove1182 0 points 1 year ago

Not much detail in the article, but at least it's not slated for phones for now.

[–] noneabove1182 3 points 1 year ago

I've been using this model a bit locally and I gotta say I'm very impressed; it might finally push me to get a 3090.

[–] noneabove1182 2 points 1 year ago

Humans destroying it mainly

[–] noneabove1182 3 points 1 year ago

Agreed! A lot of amazing work is being done by those people, so it's great to see a large organization with the power to influence change get behind them. It's a very big deal for keeping the ball rolling.

[–] noneabove1182 1 points 1 year ago

Does it not? I've never received a single spam message on Telegram in my 6 years of use :S

[–] noneabove1182 5 points 1 year ago

Even Sony dropped theirs, miss it so much

 

Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!

With this code you can train the Llama 2 LLM architecture from scratch in PyTorch, then save the weights to a raw binary file, then load that into one ~simple 500-line C file (run.c) that inferences the model, simply in fp32 for now. On my cloud Linux devbox a dim 288 6-layer 6-head model (~15M params) inferences at ~100 tok/s in fp32, and about the same on my M1 MacBook Air. I was somewhat pleasantly surprised that one can run reasonably sized models (few ten million params) at highly interactive rates with an approach this simple.

https://twitter.com/karpathy/status/1683143097604243456
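The export side of this is simple in principle. As a rough sketch (not the actual export script from the llama2.c repo; the real run.c expects a specific header and tensor ordering), flattening PyTorch fp32 tensors into a raw binary could look something like:

```python
import torch

def export_fp32(model: torch.nn.Module, path: str) -> None:
    """Toy export: dump every parameter as a flat run of raw fp32 values.

    NOTE: this is only a sketch of the idea. The real llama2.c export
    writes a small header (dim, n_layers, n_heads, vocab size, ...) and a
    fixed tensor ordering that run.c expects; none of that is reproduced here.
    """
    with open(path, "wb") as f:
        for _name, param in model.named_parameters():
            param.detach().cpu().to(torch.float32).numpy().tofile(f)

# usage (hypothetical tiny model):
# export_fp32(my_tiny_llama, "model.bin")
```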

 

This model is based on llama1, so it is for non-commercial use only. Future versions will be trained on llama2 and other open models that are suitable for commercial use.

This model is uncensored. I have filtered the dataset to remove alignment and bias. This makes the model compliant to any requests. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant to any requests, even unethical ones. Please read my blog post about uncensored models. https://erichartford.com/uncensored-models You are responsible for any content you create using this model. Enjoy responsibly.
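For anyone wondering what "implement your own alignment layer" could look like in practice, a bare-bones (and entirely hypothetical) sketch is just a filter wrapped around generation; the generate callable and the blocklist below are placeholders, not anything shipped with the model:

```python
# Hypothetical sketch of a minimal "alignment layer" around an uncensored
# model served behind an API. The generate() callable and the refusal
# policy are placeholders; real deployments typically run a proper
# moderation classifier over both the prompt and the output.
from typing import Callable

BLOCKED_TOPICS = ["example-disallowed-topic"]  # placeholder policy

def guarded_generate(generate: Callable[[str], str], prompt: str) -> str:
    lowered = prompt.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "Sorry, I can't help with that."
    output = generate(prompt)
    if any(topic in output.lower() for topic in BLOCKED_TOPICS):
        return "Sorry, I can't help with that."
    return output
```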

Quants can of course be found from TheBloke:

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GGML

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GPTQ

 

https://github.com/noneabove1182/text-generation-webui-docker (updated to 1.3.1 and has a fix for gqa to run llama2 70B)

https://github.com/noneabove1182/lollms-webui-docker (v3.0.0)

https://github.com/noneabove1182/koboldcpp-docker (updated to 1.36)

All should include up-to-date instructions. If you find any issues, please ping me immediately so I can take a look, or open an issue :)

 

This is Llama 2 13b with some additional attention heads from original-flavor Llama 33b frankensteined on.

Fine-tuned on ~10M tokens from RedPajama to settle in the transplants a little.

Not intended for use as-is - this model is meant to serve as a base for further tuning, hopefully with a greater capacity for learning than 13b.

 

Disclaimer: take this with a grain of salt. It is heavily researched, but only by me, and I am no expert; I will link the relevant studies at the bottom.

For starters, a quick summary of the quantization process for anyone who needs some background. Basically we’re trying to convert these really accurate floating point 16-bit (FP16) weights to 4-bit values to save on size. Saving size is important because we want to fit really big models into a smaller amount of VRAM. To do this, we take a weight, quantize it, then observe the loss of the model, adjusting all non-quantized weights accordingly in an attempt to minimize this loss. It’s important to note that the numbers you see on models - 7B, 13B, 33B etc - correspond to the total number of weights, so doing this one at a time is… not ideal. Also note that these weights are represented in matrices, with one row for every output of the previous layer and one column for every input in the current layer, so in this context a “weight vector” represents the set of weights connecting a neuron to each neuron in the previous layer. There were some proposed solutions similar to what we have now, but I’ll skip them for a semblance of brevity.
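To put the size savings in perspective, here's the back-of-the-envelope arithmetic for a 13B model (ignoring the small overhead of scaling factors and any layers left unquantized):

```python
params = 13_000_000_000              # "13B" = ~13 billion weights
fp16_gib = params * 2 / 1024**3      # 2 bytes per weight   -> ~24.2 GiB
int4_gib = params * 0.5 / 1024**3    # 0.5 bytes per weight -> ~6.1 GiB

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```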

Okay, that’s the backstory. Onto the main story…

Enter Binary-Coding Quantization, a “versatile non-uniform quantization scheme.” The concept is that, when quantizing a weight vector, we squash the values onto the 4-bit grid by dividing them all by a scaling factor (equal to the largest value in the weight vector divided by the largest value the quantization can represent) and rounding, and we save that scaling factor for when we later need to de-quantize.
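In code, that scale-and-round step looks roughly like the plain absmax version below. The actual binary-coding scheme described in the linked paper is more involved, so treat this as a simplified sketch of the core idea:

```python
import numpy as np

def quantize_absmax_4bit(w: np.ndarray):
    """Quantize one weight vector to signed 4-bit ints with a single scale."""
    qmax = 7                              # signed 4-bit range is [-8, 7]
    scale = np.abs(w).max() / qmax        # saved for de-quantization
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```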

From here, we observe that weight vectors can be grouped in such a way that their scaling factor can be shared across them with minimal loss. This is where groupsize comes in: it indicates how many weight vectors are grouped together to share a single scaling factor. This means that the more of them we group together, the less information we need to save, but conversely the more information is lost, since accuracy has to be sacrificed for everything in the group to share the same value.
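Extending that sketch, groupsize just means one scale gets stored per group of values instead of one per vector. A toy version (reusing the hypothetical quantize_absmax_4bit from above, and flattening everything to 1-D for simplicity) might look like:

```python
import numpy as np

def quantize_grouped(w: np.ndarray, groupsize: int = 128):
    """Toy grouped quantization: one shared scale per `groupsize` weights."""
    qs, scales = [], []
    for start in range(0, w.size, groupsize):
        group = w[start:start + groupsize]
        q, scale = quantize_absmax_4bit(group)   # from the sketch above
        qs.append(q)
        scales.append(scale)
    # larger groupsize -> fewer scales stored (smaller file),
    # but each scale has to cover more values -> more rounding error
    return np.concatenate(qs), np.array(scales, dtype=np.float32)
```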

And that’s it! You now know what groupsize means and why it changes the model’s size!

Next up is actorder.

Activation Order is a method of examining which weight vectors make most sense to quantize first in order to maintain important information. Originally this was done by greedily selecting whichever weight vector would result in the least loss, but it was observed that this method is actually barely better than random selection. With this in mind, a new solution was proposed. We start by observing which columns have the largest activation magnitude, that is, the weight vectors which most contribute to the final output of the model because they cause the most neuron activations.

After gathering that information, we start our quantization with those values, because that means they will most closely reflect their original values after the full quantization is done. Remember, after we quantize a vector, that’s it, it’s locked in. That means that if we left some of our important vectors until the end, not only might they have been adjusted several times during the process, but, more importantly, there would remain very few extra columns that we could adjust to make up for the quantization loss. So starting with these values, i.e. act-order or desc_act (the terms are used interchangeably), should result in a minor increase in performance.
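As a rough illustration of the ordering step: collect activation statistics from a small calibration set and quantize columns in descending order of how strongly they fire. Real implementations derive this ordering from calibration statistics (e.g. squared inputs), so treat the function below as a hypothetical sketch rather than the actual algorithm:

```python
import numpy as np

def act_order(calib_activations: np.ndarray) -> np.ndarray:
    """Return column indices sorted by descending activation magnitude.

    calib_activations: (n_samples, n_columns) inputs to the layer,
    gathered by running a small calibration set through the model.
    """
    magnitude = np.square(calib_activations).sum(axis=0)
    return np.argsort(-magnitude)   # biggest contributors first

# the quantizer then walks the columns in this order, so the most
# influential ones are locked in early and the rest absorb the error
```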

Side note: I’m not positive at this time why it results in an increase in model size. My best guess is that it involves rearranging the vectors in memory in ways that are no longer optimal and can’t be mapped into VRAM without wasting space, but that’s a pure guess and I would love it if someone chimed in with more info. My other guess is that groupsize is either not applied to those sensitive weight vectors, or that it’s applied more selectively (grouping sensitive vectors with non-sensitive ones) and that difference accounts for the change. If anyone has any ideas, please feel free to enlighten me.

And that’s it! To sum it up, group size means quantizing in groups rather than individually, resulting in smaller models that are quantized faster, and act order means to quantize in order of activation magnitude to try to preserve as much of the important information as possible.

If you stuck through that wall of text, thanks! I hope it was insightful (and accurate)

Sources:

https://arxiv.org/abs/2206.09557 (group size explanation)

https://arxiv.org/abs/2306.02272 (act order explanation)

24
Llama 2 - Meta AI (ai.meta.com)
submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost O(1) inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at this https URL.
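The recurrent form is the easiest one to picture: the whole context gets folded into a fixed-size state, which is why decoding is O(1) per token. Below is a toy single-head sketch of that update, following the paper's S_n = γ·S_{n-1} + K_nᵀV_n with output Q_n·S_n, but leaving out the multi-scale decay, rotations, and normalization, so treat it as illustrative only:

```python
import numpy as np

def recurrent_retention(q, k, v, gamma: float = 0.9):
    """Toy single-head recurrent retention: O(1) state per step.

    q, k, v: (seq_len, d) projections of the input sequence. The paper
    adds per-head decay rates, xpos-style rotations, and group norm,
    all of which are omitted here.
    """
    seq_len, d = q.shape
    state = np.zeros((d, d))          # fixed-size state, independent of seq_len
    outputs = np.zeros((seq_len, d))
    for n in range(seq_len):
        state = gamma * state + np.outer(k[n], v[n])   # K_n^T V_n
        outputs[n] = q[n] @ state                      # Q_n S_n
    return outputs
```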

 

https://github.com/noneabove1182/text-generation-webui-docker

https://github.com/noneabove1182/lollms-webui-docker

https://github.com/noneabove1182/koboldcpp-docker

All should include up-to-date instructions. If you find any issues, please ping me immediately so I can take a look, or open an issue :)
