noneabove1182

joined 2 years ago
[–] noneabove1182 2 points 1 year ago (1 children)

The abstract is meant to pull in random readers, so it's understandable they'd lay a bit of foundation about what the paper will be about, even if it seems rather simple and unnecessarily wordy

LoRA is still considered the gold standard in efficient fine-tuning, so that's why a lot of comparisons are made to it instead of QLoRA, which is more of a hack. They both have their advantages, but they're pretty distinct.

Another thing worth pointing out is that 4-bit quantization is not actually just converting all the 16-bit weights into 4 bits (at least, not in the GPTQ style). A quantization scale factor is also saved, so there's more information that can be recovered from the final quantization than just "multiply everything by 4".
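To make that concrete, here's a toy sketch of group-wise 4-bit quantization with a stored per-group scale (this is illustrative only, not the actual GPTQ algorithm, and the group size is just an example):

```python
import numpy as np

def quantize_4bit(weights, group_size=64):
    # Each group of weights gets its own scale; 4-bit signed range is -8..7.
    groups = weights.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q, scales):
    # The stored scales let us recover an approximation of the original weights.
    return q.astype(np.float32) * scales

w = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s).reshape(w.shape)
print(np.abs(w - w_hat).max())  # small reconstruction error thanks to the stored scales
```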

QA-LoRA vs QLoRA: I think my distinction is the same as what you said; it's just about the starting and ending state. QLoRA also introduced a lot of other techniques, though, like double quantization, the NormalFloat (NF4) data type, and paged optimizers, to make it work.
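For reference, this is roughly how a couple of those pieces show up when loading a model in 4-bit with the Hugging Face stack; a minimal sketch using transformers' BitsAndBytesConfig (the model id is just an example, and paged optimizers live on the training side so they aren't shown here):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 data type + double quantization + 4-bit loading, as popularized by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat 4-bit
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Example model id, purely for illustration.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
```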

It's also worth pointing out that not understanding it has nothing to do with intellect; it's just a matter of how much foundational knowledge you have. I don't understand most of the math, but I've read enough of the papers to understand to some degree what's going on.

The one thing I can't quite figure out is this: I know QLoRA is competitive with LoRA because it trains more layers of the transformer than LoRA does, but I don't see any specific mention of QA-LoRA following that same method, which I would think is needed to maintain the quality.

Overall you're right, though: this paper is a bit on the weaker side. That said, if it works then it works, and it's a pretty decent discovery, but the paper alone doesn't guarantee that.

[–] noneabove1182 4 points 1 year ago (1 children)

By far the biggest pain point with Sony: their software is clean, stable, and fast, with an acceptable release cadence, but their promise of only 2 years of updates is completely unacceptable these days.

Wish there was any way at all to influence them

[–] noneabove1182 2 points 1 year ago (3 children)

I wrote that summary; maybe it would help if I knew your knowledge level? Which parts didn't make sense?

[–] noneabove1182 2 points 1 year ago

There are plenty of smaller projects around that attempt to solve similar problems: metagpt, agent os, gpt-pilot, gpt-engineer, autochain, etc.

I'm sure several of them would love a hand; you should check them out on GitHub!

[–] noneabove1182 1 points 1 year ago (1 children)

It seems reasonably realistic if you compare it to Code Interpreter, which was able to recognize packages it hadn't installed and go seek them out. I don't think it's outside the scope for it to recognize which module isn't installed and install it.

Even now, regular models will suggest the packages to install before you execute the code they provide.
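As a rough illustration of that loop, here's a minimal, hedged sketch of how an agent might notice a missing module and install it on the fly (the package name is just an example):

```python
import importlib
import subprocess
import sys

def import_or_install(module_name, pip_name=None):
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install the missing package into the current environment, then retry the import.
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or module_name])
        return importlib.import_module(module_name)

requests = import_or_install("requests")  # illustrative package
```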

[–] noneabove1182 1 points 1 year ago (5 children)

Sure, I can try to add a couple of lines on top of the abstract just to give a super brief synopsis.

In this case it would be something like:

This paper discusses a new technique for creating a LoRA for an already quantized model. This differs from QLoRA, which quantizes the full model on the fly to create a quantized LoRA. With this approach you can take your small quantized model and work with it as is, saving a ton of resources and speeding up the process massively.
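To make the core idea concrete, here's a minimal toy sketch (not the paper's exact method) of adding a trainable low-rank adapter on top of frozen, already-quantized weights; the shapes and rank are purely illustrative:

```python
import numpy as np

d_in, d_out, rank = 64, 64, 8

W_q = np.round(np.random.randn(d_out, d_in))   # stand-in for frozen, quantized base weights
A = np.random.randn(d_out, rank) * 0.01        # trainable low-rank factor
B = np.random.randn(rank, d_in) * 0.01         # trainable low-rank factor

def forward(x):
    # Effective weight is W_q + A @ B; only A and B would be trained,
    # and the base model is never de/re-quantized during training.
    return x @ (W_q + A @ B).T

x = np.random.randn(2, d_in)
print(forward(x).shape)  # (2, 64)
```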

[–] noneabove1182 2 points 1 year ago

Not a glowing review that this is accidentally not a reply to a comment. :p

[–] noneabove1182 2 points 1 year ago

This is great and comes with a very interesting model!

I wonder if they slide the window in any clever way or if it's just a naive slide; it could probably be pretty smart if you discarded tokens that receive minimal attention anyway, to focus on the important text.
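Roughly what I mean, as a hedged toy sketch contrasting the two ideas (window size and attention scores here are purely illustrative):

```python
from collections import deque

WINDOW = 4096

def naive_slide(context_tokens, new_token):
    # Naive slide: just keep the most recent WINDOW tokens; oldest fall off automatically.
    context = deque(context_tokens, maxlen=WINDOW)
    context.append(new_token)
    return list(context)

def attention_pruned(context_tokens, attn_scores, new_token, keep=WINDOW - 1):
    # Smarter variant: keep the tokens the model attends to most, regardless of position.
    ranked = sorted(range(len(context_tokens)), key=lambda i: attn_scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve original ordering of the survivors
    return [context_tokens[i] for i in kept] + [new_token]
```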

For now, this is awesome!

[–] noneabove1182 1 points 1 year ago

The good news is that if you do it wrong, much like regular speculative generation, you will still get the same result the full model would output at the end, so there won't be any loss in quality, just a loss in speed.

It's definitely a good point though; finding the optimal configuration is the difference between a slowdown/minimal speedup and a potentially huge speedup.
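For anyone wondering why the output can't degrade, here's a toy greedy-decoding sketch of speculative generation; the model interface is hypothetical, and a real implementation verifies all the draft tokens in one batched forward pass (and uses rejection sampling when sampling):

```python
def speculative_step(full_model, draft_model, context, k=4):
    # Draft model cheaply proposes k tokens (hypothetical interface:
    # model.next_token(tokens) returns the argmax next token id).
    draft, tokens = [], list(context)
    for _ in range(k):
        t = draft_model.next_token(tokens)
        draft.append(t)
        tokens.append(t)

    # Full model verifies: accept draft tokens only while it agrees with them,
    # so the final output always matches what the full model would have produced.
    accepted, tokens = [], list(context)
    for t in draft:
        if full_model.next_token(tokens) == t:
            accepted.append(t)
            tokens.append(t)
        else:
            # First disagreement: take the full model's token instead and stop.
            accepted.append(full_model.next_token(tokens))
            break
    return accepted
```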

[–] noneabove1182 1 points 1 year ago

Somehow this is even more confusing because that code hasn't been touched in 3 months, maybe just took them that long to validate? Will have to read through it, thanks!

[–] noneabove1182 3 points 1 year ago

Yeah fair point, I'll make sure to include better links in the future :) typically post from mobile so it's annoying but doable

 

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows

Seems like a great resource for all things embeddings related, give it a look!
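A minimal semantic-search sketch to show the flavor of it; the config keys and model name are from my reading of the txtai docs, so treat them as assumptions:

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})

data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
]

# Index (id, text, tags) tuples, then run a natural-language query against them.
embeddings.index([(i, text, None) for i, text in enumerate(data)])
print(embeddings.search("climate change", 1))
```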

submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 

These are the full weights; the quants are already incoming from TheBloke. I'll update this post when they're fully uploaded.

From the author(s):

WizardLM-70B V1.0 achieves a substantial and comprehensive improvement on coding, mathematical reasoning and open-domain conversation capacities.

This model is license-friendly and follows the same license as Meta Llama-2.

Next version is in training and will be public together with our new paper soon.

For more details, please refer to:

Model weight: https://huggingface.co/WizardLM/WizardLM-70B-V1.0

Demo and Github: https://github.com/nlpxucan/WizardLM

Twitter: https://twitter.com/WizardLM_AI

GGML quant posted: https://huggingface.co/TheBloke/WizardLM-70B-V1.0-GGML

GPTQ quant repo posted, but still empty (GPTQ is a lot slower to make): https://huggingface.co/TheBloke/WizardLM-70B-V1.0-GPTQ

 

Refactored codebase - now a single unified turbopilot binary that provides support for codegen and starcoder style models.

Support for starcoder, wizardcoder and santacoder models

Support for CUDA 11 and 12

Seems interesting; it looks like it supports WizardCoder with GPU offloading. If StarCoder also gets GPU offloading that would be great, but I would need to test. If it also works with the new StabilityAI coding models, that would be very interesting.

submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 

Text from them:

Calling all model makers, or would-be model creators! Chai asked me to tell you all about their open source LLM leaderboard:

Chai is running a totally open LLM competition. Anyone is free to submit a llama-based LLM via our Python package 🐍 It gets deployed to users on our app. We collect the metrics and rank the models! If you place high enough on our leaderboard you'll win money 🥇

We've paid out over $10,000 in prizes so far. 💰

Come to our discord and check it out!

https://discord.gg/chai-llm

Link to latest board for the people who don't feel like joining a random discord just to see results:

https://cdn.discordapp.com/attachments/1134163974296961195/1138833170838589471/image1.png

submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 

As some may know, I maintain a few Docker images of some available tools, and I noticed I was suddenly getting an NVML driver/library version mismatch. For the life of me I could not figure out what the issue was and tried so many things. I finally noticed that the Docker image had some special driver, 535.86.10, where my host had 535.86.05. After figuring that out, I added this to my Dockerfile:

RUN apt-get update && apt-get remove --purge -y nvidia-* && \
    apt-get install -y --allow-downgrades nvidia-driver-535/jammy-updates

And voilà, problem solved! I'm not sure what driver the Docker CUDA image was using; it might be some special dev driver, and it was causing a mismatch between the container and the host.

Only started happening as of the latest driver update released late last month

 

Just wondering if anyone has any suggestions to keep things moving and growing. I was thinking of doing a daily quantized-models post just to keep up with TheBloke. Thoughts?

 

This one is based on Llama 2. The first one worked very well for rule and structure following with guidance, so I'm highly intrigued to see if this lives up to the previous one.

 

I want to make a wiki for /c/localllama, but I'm not sure if there's a known place that's nice for making free wikis. Anyone got suggestions on what's being used widely on Lemmy?

submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 
 

This is actually a pretty big deal. ExLlama is by far the most performant inference engine out there for CUDA, but the strangest thing is that the PR claims it works for StarCoder, which is a non-llama model:

https://github.com/huggingface/text-generation-inference/pull/553

So I'm extremely curious to see what this brings...

A nice write up for LMQL (analyticsindiamag.com)
submitted 1 year ago by noneabove1182 to c/localllama
 

For the uninitiated, LMQL is one of a few offerings in the AI space (amongst others like guidance and guardrails) that allow for finer control of the AI's output and let you guarantee certain patterns, which is hugely helpful for a whole variety of use cases (tool use, safe chatbots, parseable data).
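Conceptually (this is not the LMQL API, just a hedged toy sketch), "guaranteeing a pattern" boils down to masking the model's next-token scores so only tokens that keep the output valid stay selectable:

```python
import math

def constrained_argmax(scores, vocab, text_so_far, is_valid):
    # Skip any token that would break the required output pattern, keep the best survivor.
    best_id, best_score = None, -math.inf
    for token_id, token_text in enumerate(vocab):
        if is_valid(text_so_far + token_text) and scores[token_id] > best_score:
            best_id, best_score = token_id, scores[token_id]
    return best_id

# Toy demo: the raw best token is "maybe", but a yes/no-only constraint forces "yes".
vocab = ["yes", "no", "maybe"]
scores = [1.0, 0.5, 2.0]
print(vocab[constrained_argmax(scores, vocab, "", lambda s: s in ("yes", "no"))])  # "yes"
```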
