this post was submitted on 26 Jul 2023
LocalLLaMA
Community to discuss LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
https://github.com/ggerganov/llama.cpp#quantization
https://github.com/ggerganov/llama.cpp/pull/1684
Regarding your question: 13B at Q2_K seems to be roughly on par with 7B at 16-bit and 8-bit; there isn't much of a difference between them (look at the perplexity values, lower is better). The second link has a nice graph.
Most people don't go as low as 2-bit though. It's considerably worse than 4-bit.
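For anyone reading those tables: perplexity is just the exponential of the average negative log-likelihood per token, so a lower value means the model assigns more probability to the reference text. A minimal illustrative sketch (plain Python with hypothetical token log-probabilities, not tied to llama.cpp's actual implementation):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned to each
    reference token (values <= 0). Lower perplexity = better fit.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical example: a model that assigns ~37% probability on average
# to each next token has perplexity of roughly e^1 ≈ 2.72.
logprobs = [-0.9, -1.1, -1.0, -1.0]
print(perplexity(logprobs))  # ≈ 2.72
```

If I recall correctly, the numbers in the llama.cpp README table come from its perplexity example run over wikitext-2, but the formula above is the gist of what is being compared.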
These are good sources. To add one more, the GPTQ paper discusses perplexity in detail across several quantization bit-widths and model sizes:
https://arxiv.org/abs/2210.17323
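To give a feel for why 2-bit hurts so much more than 4-bit: with n bits you only get 2^n representable levels per scaling block, so each bit you drop roughly doubles the rounding error. Here's a simplified round-to-nearest sketch in Python; to be clear, this is not the GPTQ algorithm (which additionally compensates quantization error using second-order information) nor llama.cpp's k-quants, just the naive baseline:

```python
import numpy as np

def quantize_rtn(weights, bits):
    """Naive symmetric round-to-nearest quantization of one weight block.

    With `bits` bits there are 2**bits levels, so the grid spacing (and
    the worst-case rounding error) roughly doubles per bit removed.
    """
    levels = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 1 for 2-bit
    scale = np.abs(weights).max() / levels  # one scale per block
    q = np.clip(np.round(weights / scale), -levels - 1, levels)
    return q * scale                        # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight block
for bits in (8, 4, 2):
    err = np.abs(quantize_rtn(w, bits) - w).mean()
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

Real methods like GPTQ and the k-quants do considerably better than this naive rounding, but that scaling is why 2-bit models take a noticeable perplexity hit while 4-bit stays fairly close to fp16.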