Large language models (LLMs) achieve impressive performance on a wide variety of tasks, such as language modeling and code generation. However, they are also very large. For example, Llama 2 70B has 70 billion parameters, which require 140 GB of memory to store in half precision. This creates practical challenges, such as needing multiple GPUs just to serve a single LLM. To address these issues, researchers have developed compression methods that reduce the size of models without destroying performance.
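To make the 140 GB figure concrete, here is a back-of-the-envelope calculation (weights only; it ignores activations and KV cache, and the parameter count is the nominal 70 billion):

```python
# Rough memory estimate for storing Llama 2 70B weights in half precision.
num_params = 70e9          # 70 billion parameters (nominal)
bytes_per_param_fp16 = 2   # fp16/bf16 uses 2 bytes per parameter

weight_memory_gb = num_params * bytes_per_param_fp16 / 1e9
print(f"fp16 weights: {weight_memory_gb:.0f} GB")  # -> 140 GB
```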
One class of methods, post-training quantization, compresses trained model weights into lower-precision formats to reduce memory requirements. For example, quantizing a model from 16-bit to 2-bit precision reduces its size by 8x, meaning that even Llama 2 70B would fit on a single 24 GB GPU. In this work, we introduce QuIP#, which combines lattice codebooks with incoherence processing to create state-of-the-art 2-bit quantized models. Together, these two techniques allow QuIP# to significantly close the gap between 2-bit quantized LLMs and unquantized 16-bit models.
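The sketch below illustrates the storage arithmetic behind 2-bit quantization using naive round-to-nearest with a single per-tensor scale. It is not QuIP#'s method (incoherence processing and lattice codebooks are precisely what it adds to recover accuracy); it only shows why 2-bit storage is 8x smaller than 16-bit.

```python
import torch

# Toy round-to-nearest 2-bit quantizer: NOT QuIP#, just the storage math.
def quantize_2bit_rtn(w: torch.Tensor):
    levels = torch.tensor([-1.5, -0.5, 0.5, 1.5])              # 2 bits -> 4 codebook entries
    scale = w.abs().mean() / levels.abs().mean()                # crude per-tensor scale
    codes = (w.unsqueeze(-1) - scale * levels).abs().argmin(-1) # nearest level index, 0..3
    return codes.to(torch.uint8), scale

w = torch.randn(4096, 4096)                                     # one transformer weight matrix
codes, scale = quantize_2bit_rtn(w)
w_hat = scale * torch.tensor([-1.5, -0.5, 0.5, 1.5])[codes.long()]  # dequantized weights

fp16_gb = w.numel() * 2 / 1e9      # 16-bit storage: 2 bytes per weight
bit2_gb = w.numel() * 0.25 / 1e9   # 2-bit storage: 0.25 bytes per weight (packed)
print(f"fp16 {fp16_gb:.3f} GB vs 2-bit {bit2_gb:.3f} GB -> {fp16_gb / bit2_gb:.0f}x smaller")
```

At this ratio, 140 GB of fp16 weights shrinks to about 17.5 GB, which is why the quantized model fits within a 24 GB GPU's memory.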
If anyone else wonders how that compares to llama.cpp's "2bit" quantization, here is the in-depth discussion: https://github.com/ggerganov/llama.cpp/discussions/4327