this post was submitted on 27 Sep 2023
6 points (80.0% liked)
LocalLLaMA
you are viewing a single comment's thread
The abstract is meant to pull in random readers, so it's understandable they'd lay a bit of groundwork for what the paper is about, even if it comes across as simple and unnecessarily wordy.
LoRA is still considered the gold standard for efficient fine-tuning, which is why most comparisons are made against it rather than QLoRA, which is more of a hack on top of it. They both have their advantages, but they're pretty distinct techniques.
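For anyone who hasn't looked under the hood, this is roughly the LoRA idea in plain PyTorch: a frozen pretrained layer plus a trainable low-rank update. It's a hand-rolled sketch with placeholder rank/alpha values, not code from either paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of the LoRA idea: a frozen pretrained linear layer plus a
    trainable low-rank update B @ A added to its output."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # the pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction; only A and B train.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```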
Another thing worth pointing out is that 4-bit quantization is not actually just converting each 16-bit weight into 4 bits (at least not in the GPTQ style). A quantization factor is also saved alongside the 4-bit codes, so more information can be recovered from the final quantized model than a naive "multiply everything by 4" would give you.
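To make the "quantization factor" point concrete, here's a toy group-wise 4-bit quantizer that stores a scale and zero point per group of weights. The group size and min/max scheme are just assumptions for illustration, and this is not GPTQ itself (GPTQ additionally uses second-order information when choosing the codes):

```python
import torch

def quantize_4bit_groupwise(w: torch.Tensor, group_size: int = 128):
    """Toy asymmetric 4-bit quantization with one scale and zero point per
    group of weights; the scale/zero point are the extra metadata stored
    alongside the 4-bit codes."""
    groups = w.reshape(-1, group_size)
    w_min = groups.min(dim=1, keepdim=True).values
    w_max = groups.max(dim=1, keepdim=True).values
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)   # 4 bits -> 16 levels
    zero = torch.round(-w_min / scale)
    codes = torch.clamp(torch.round(groups / scale + zero), 0, 15).to(torch.uint8)
    return codes, scale, zero

def dequantize(codes, scale, zero):
    # Reconstruct approximate weights from the codes plus the stored metadata.
    return (codes.float() - zero) * scale

w = torch.randn(4096 * 128)
codes, scale, zero = quantize_4bit_groupwise(w)
w_hat = dequantize(codes, scale, zero).reshape(w.shape)
print((w - w_hat).abs().max())   # small reconstruction error, not a naive bit cast
```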
QA-LoRA vs QLoRA: I think my distinction is the same as what you said; it's just about the starting and ending state. QLoRA, though, also introduced a bunch of other techniques to make it work, like double quantization, the NormalFloat (NF4) data type, and paged optimizers.
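For reference, those pieces show up directly in how QLoRA is usually set up through the transformers/peft/bitsandbytes stack. This is a generic sketch with a placeholder model name and hyperparameters, not anything from the QA-LoRA paper:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# NF4 + double quantization, as exposed by bitsandbytes via transformers.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat data type
    bnb_4bit_use_double_quant=True,        # double quantization of the scales
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Paged optimizers come in on the trainer side, e.g. optim="paged_adamw_8bit"
# in TrainingArguments.
```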
It's also worth pointing out that not understanding it has nothing to do with intellect; it's just a matter of how much foundational knowledge you have. I don't understand most of the math, but I've read enough of the papers to understand to some degree what's going on.
The one thing I can't quite figure out: I know QLoRA stays competitive with a regular LoRA because it attaches adapters to more layers of the transformer than a standard LoRA does, but I don't see any specific mention of QA-LoRA following that same method, which I would think is needed to maintain quality.
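By "more layers" I mean the QLoRA paper attaches adapters to every linear layer of each transformer block instead of just the attention query/value projections. Roughly, the difference looks like this in a peft LoraConfig (module names assume Llama-style naming; rank/alpha are placeholders):

```python
from peft import LoraConfig

# Narrow, default-style LoRA: adapters only on the attention q/v projections.
narrow = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# QLoRA-paper-style: adapters on every linear layer in the transformer block.
wide = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections (Llama naming)
    ],
)
```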
Overall you're right, though: this paper is a bit on the weaker side. That said, if it works then it works, and it's a pretty decent discovery; the paper alone just doesn't guarantee that.
Thanks