Disclaimer: take this with a grain of salt. It's heavily researched, but only by me, and I'm no expert; the relevant studies are linked at the bottom.

For starters, a quick summary of the quantization process for anyone who needs some background. Basically, we're trying to convert these really accurate floating point 16-bit (FP16) weights into 4-bit values to save on size. Saving size is important because we want to fit really big models into a smaller amount of VRAM. To do this, we take a weight, quantize it, then observe the loss of the model, adjusting all not-yet-quantized weights accordingly in an attempt to minimize that loss. It's important to note that the numbers you see on models - 7B, 13B, 33B etc. - correspond to the total number of weights, so doing this one at a time is… not ideal.

Also note that these weights are stored in matrices, with one row for every neuron in the current layer and one column for every input coming from the previous layer, so in this context a "weight vector" means the set of weights connecting one neuron to each neuron in the previous layer. There were some proposed solutions similar to what we have now, but I'll skip them for a semblance of brevity.
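To make that concrete, here's a minimal sketch in plain NumPy of naive round-to-nearest 4-bit quantization of a single weight vector. It deliberately skips the error-compensation step described above (adjusting the remaining weights after each step) and isn't from any real quantization library; it just shows the basic FP16 → 4-bit mapping and how much precision gets lost.

```python
import numpy as np

# Naive round-to-nearest 4-bit quantization of one weight vector.
# NOT the error-compensating process described above; illustration only.

def quantize_4bit(weights: np.ndarray):
    # 4 bits give 16 levels; map the weights onto integers 0..15
    scale = (weights.max() - weights.min()) / 15
    zero_point = weights.min()
    q = np.round((weights - zero_point) / scale).astype(np.uint8)
    return q, scale, zero_point

def dequantize_4bit(q, scale, zero_point):
    return q.astype(np.float16) * scale + zero_point

w = np.random.randn(8).astype(np.float16)   # a tiny "weight vector"
q, s, z = quantize_4bit(w)
print(w)
print(dequantize_4bit(q, s, z))             # close to w, but not identical
```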

Okay, that’s the backstory. Onto the main story…

Enter Binary-Coding Quantization, a "versatile non-uniform quantization scheme." The concept is that, when quantizing a weight vector, we squash the values into the quantized range by dividing them all by a scaling factor (the largest absolute value in the weight vector divided by the largest value your quantization can represent) and save that factor for when we later need to de-quantize.
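Here's roughly what that looks like in code - a simplified symmetric per-vector scheme, not the exact BCQ formulation from the paper; the names and the signed 4-bit range are just for illustration.

```python
import numpy as np

QMAX = 7  # largest magnitude representable with signed 4-bit integers (-8..7)

def quantize_vector(v: np.ndarray):
    scale = np.abs(v).max() / QMAX            # one scaling factor per weight vector
    q = np.clip(np.round(v / scale), -8, 7).astype(np.int8)
    return q, scale                           # the scale is saved for de-quantization

def dequantize_vector(q: np.ndarray, scale: float):
    return q.astype(np.float32) * scale

v = np.random.randn(16).astype(np.float32)
q, scale = quantize_vector(v)
print(np.abs(v - dequantize_vector(q, scale)).max())  # the quantization error
```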

From here, we observe that weight vectors can be grouped in such a way that a single scaling factor can be shared across all of them with minimal loss. This is where groupsize comes in: it indicates how many weight vectors are grouped together to share one scaling factor. The more vectors we group together, the fewer scaling factors we need to store, but conversely the more information is lost, since every vector in the group has to sacrifice accuracy to share the same value.
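A sketch of that trade-off, again with illustrative names rather than anything from a real quantizer: the only change from the previous snippet is that one scale is now shared by a whole group of vectors, so fewer scales get stored (smaller model) at the cost of a worse fit per vector.

```python
import numpy as np

QMAX = 7  # signed 4-bit range is -8..7

def quantize_grouped(W: np.ndarray, groupsize: int):
    # W holds one weight vector per row; every `groupsize` consecutive
    # rows share a single scaling factor.
    Q = np.empty_like(W, dtype=np.int8)
    scales = []
    for start in range(0, W.shape[0], groupsize):
        group = W[start:start + groupsize]
        scale = np.abs(group).max() / QMAX        # one scale for the whole group
        Q[start:start + groupsize] = np.clip(np.round(group / scale), -8, 7)
        scales.append(scale)
    return Q, np.array(scales)

W = np.random.randn(4096, 128).astype(np.float32)
_, s32 = quantize_grouped(W, groupsize=32)
_, s128 = quantize_grouped(W, groupsize=128)
print(len(s32), len(s128))  # 128 scales vs 32 scales: bigger groups -> less to store
```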

And that’s it! You now know what groupsize means and why it changes the model’s size!

Next up is actorder.

Activation Order is a method of deciding which weight vectors make the most sense to quantize first in order to preserve the important information. Originally this was done by greedily selecting whichever weight vector would result in the least loss, but it was observed that this method is actually barely better than random selection. With this in mind, a new solution was proposed. We start by observing which columns have the largest activation magnitude - that is, the weight vectors that contribute most to the model's output because their corresponding activations are the strongest.

After gathering that information, we start our quantization with those vectors, because quantizing them first means they will most closely reflect their original values once the full quantization is done. Remember, after we quantize a vector, that's it, it's locked in. If we left some of our important vectors until the end, not only would they have been adjusted several times during the process, but, more importantly, very few un-quantized columns would remain to absorb the quantization loss. So starting with these values, i.e. act-order or desc_act (the terms are used interchangeably), should result in a minor increase in performance.
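To illustrate just the ordering step (not the full quantization loop), here's a hedged sketch of how you might rank columns by activation magnitude from a small set of calibration activations; the function and variable names are made up for the example.

```python
import numpy as np

def activation_order(calib_activations: np.ndarray) -> np.ndarray:
    # calib_activations: (num_samples, num_inputs) activations collected
    # for this layer's inputs from a small calibration set.
    magnitude = np.abs(calib_activations).mean(axis=0)  # per-column importance
    return np.argsort(-magnitude)                       # most important columns first

acts = np.random.randn(256, 4096).astype(np.float32)    # stand-in calibration data
order = activation_order(acts)
print(order[:10])  # these columns would be quantized first
```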

Side note: I'm not positive at this time why act-order increases model size. My best guess is that it involves rearranging the vectors in memory in ways that are no longer optimal and can't be mapped into VRAM without wasting space, but that's a pure guess and I'd love it if someone chimed in with more info. My other guess is that groupsize either isn't applied to those sensitive weight vectors, or is applied more selectively (grouping sensitive vectors with non-sensitive ones), and that difference accounts for the change. If anyone has any ideas, please feel free to enlighten me.

And that’s it! To sum it up, group size means quantizing in groups rather than individually, resulting in smaller models that are quantized faster, and act order means to quantize in order of activation magnitude to try to preserve as much of the important information as possible.

If you stuck through that wall of text, thanks! I hope it was insightful (and accurate)

Sources:

https://arxiv.org/abs/2206.09557 (group size explanation)

https://arxiv.org/abs/2306.02272 (act order explanation)
