noneabove1182

joined 2 years ago
[–] noneabove1182 2 points 1 year ago (1 children)

The abstract is meant to pull in random readers, so it's understandable they'd lay a bit of foundation about what the paper will be about, even if it seems rather simple and unnecessarily wordy

LoRA is still considered the gold standard in efficient fine-tuning, so that's why a lot of comparisons are made to it instead of QLoRA, which is more of a hack. They both have their advantages, but they're pretty distinct.

Another thing worth pointing out is that 4-bit quantization is not actually just converting all the 16-bit weights into 4 bits (at least, not in the GPTQ style). A quantization scale factor is also saved, so there's more information that can be recovered from the final quantization than just "multiply everything by 4".
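To make that concrete, here's a toy sketch of group-wise 4-bit quantization with a stored per-group scale (this is illustrative only, not the actual GPTQ algorithm, and the group size is just an example):

```python
import numpy as np

def quantize_4bit(weights, group_size=64):
    # Each group of weights gets its own scale; 4-bit signed range is -8..7.
    groups = weights.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q, scales):
    # The stored scales let us recover an approximation of the original weights.
    return q.astype(np.float32) * scales

w = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s).reshape(w.shape)
print(np.abs(w - w_hat).max())  # small reconstruction error thanks to the stored scales
```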

QA-LoRA vs QLoRA: I think my distinction is the same as what you said; it's just about the starting and ending state. QLoRA also introduced a lot of other techniques, though, like double quantization, the NormalFloat (NF4) data type, and paged optimizers, to make it work.
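For reference, this is roughly how a couple of those pieces show up when loading a model in 4-bit with the Hugging Face stack; a minimal sketch using transformers' BitsAndBytesConfig (the model id is just an example, and paged optimizers live on the training side so they aren't shown here):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 data type + double quantization + 4-bit loading, as popularized by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat 4-bit
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Example model id, purely for illustration.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
```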

It's also worth pointing out that not understanding it has nothing to do with intellect; it's just a matter of how much foundational knowledge you have. I don't understand most of the math, but I've read enough of the papers to understand to some degree what's going on.

The one thing I can't quite figure out is this: I know QLoRA is competitive with LoRA because it trains more layers of the transformer than LoRA does, but I don't see any specific mention of QA-LoRA following that same method, which I would think is needed to maintain the quality.

Overall you're right, though: this paper is a bit on the weaker side. That said, if it works then it works, and it's a pretty decent discovery, but the paper alone doesn't guarantee that.

[–] noneabove1182 4 points 1 year ago (1 children)

By far the biggest pain point with Sony: their software is clean, stable, and fast, with an acceptable release cadence, but their promise of only 2 years of updates is completely unacceptable these days.

Wish there was any way at all to influence them

[–] noneabove1182 2 points 1 year ago (3 children)

I wrote that summary; maybe it would help if I knew your knowledge level? Which parts didn't make sense?

[–] noneabove1182 2 points 1 year ago

There are plenty of smaller projects around that attempt to solve similar problems: metagpt, agent os, gpt-pilot, gpt-engineer, autochain, etc.

I'm sure several of them would love a hand; you should check them out on GitHub!

[–] noneabove1182 1 points 1 year ago (1 children)

It seems reasonably realistic if you compare it to Code Interpreter, which was able to recognize packages it hadn't installed and go seek them out. I don't think it's outside the scope for it to recognize which module isn't installed and install it.

Even now, regular models will suggest the packages to install before you execute the code they provide.
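As a rough illustration of that loop, here's a minimal, hedged sketch of how an agent might notice a missing module and install it on the fly (the package name is just an example):

```python
import importlib
import subprocess
import sys

def import_or_install(module_name, pip_name=None):
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install the missing package into the current environment, then retry the import.
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or module_name])
        return importlib.import_module(module_name)

requests = import_or_install("requests")  # illustrative package
```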

[–] noneabove1182 1 points 1 year ago (5 children)

Sure, I can try to add a couple of lines on top of the abstract just to give a super brief synopsis.

In this case it would be something like:

This paper discusses a new technique for creating a LoRA for an already quantized model. This differs from QLoRA, which quantizes the full model on the fly to create a quantized LoRA. With this approach you can take your small quantized model and work with it as is, saving a ton of resources and speeding up the process massively.
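To make the core idea concrete, here's a minimal toy sketch (not the paper's exact method) of adding a trainable low-rank adapter on top of frozen, already-quantized weights; the shapes and rank are purely illustrative:

```python
import numpy as np

d_in, d_out, rank = 64, 64, 8

W_q = np.round(np.random.randn(d_out, d_in))   # stand-in for frozen, quantized base weights
A = np.random.randn(d_out, rank) * 0.01        # trainable low-rank factor
B = np.random.randn(rank, d_in) * 0.01         # trainable low-rank factor

def forward(x):
    # Effective weight is W_q + A @ B; only A and B would be trained,
    # and the base model is never de/re-quantized during training.
    return x @ (W_q + A @ B).T

x = np.random.randn(2, d_in)
print(forward(x).shape)  # (2, 64)
```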

[–] noneabove1182 2 points 1 year ago

Not a glowing review that this is accidentally not a reply to a comment. :p

[–] noneabove1182 2 points 1 year ago

This is great and comes with a very interesting model!

I wonder if they slide the window in any clever way or if it's just a naive slide; it could probably be pretty smart if you discarded tokens that receive minimal attention anyway, to focus on the important text.
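Roughly what I mean, as a hedged toy sketch contrasting the two ideas (window size and attention scores here are purely illustrative):

```python
from collections import deque

WINDOW = 4096

def naive_slide(context_tokens, new_token):
    # Naive slide: just keep the most recent WINDOW tokens; oldest fall off automatically.
    context = deque(context_tokens, maxlen=WINDOW)
    context.append(new_token)
    return list(context)

def attention_pruned(context_tokens, attn_scores, new_token, keep=WINDOW - 1):
    # Smarter variant: keep the tokens the model attends to most, regardless of position.
    ranked = sorted(range(len(context_tokens)), key=lambda i: attn_scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve original ordering of the survivors
    return [context_tokens[i] for i in kept] + [new_token]
```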

For now, this is awesome!

[–] noneabove1182 1 points 1 year ago

The good news is that if you do it wrong, much like regular speculative generation, you will still get the same result the full model would output at the end, so there won't be any loss in quality, just a loss in speed.

It's definitely a good point though; finding the optimal configuration is the difference between a slowdown/minimal speedup and a potentially huge speedup.
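For anyone wondering why the output can't degrade, here's a toy greedy-decoding sketch of speculative generation; the model interface is hypothetical, and a real implementation verifies all the draft tokens in one batched forward pass (and uses rejection sampling when sampling):

```python
def speculative_step(full_model, draft_model, context, k=4):
    # Draft model cheaply proposes k tokens (hypothetical interface:
    # model.next_token(tokens) returns the argmax next token id).
    draft, tokens = [], list(context)
    for _ in range(k):
        t = draft_model.next_token(tokens)
        draft.append(t)
        tokens.append(t)

    # Full model verifies: accept draft tokens only while it agrees with them,
    # so the final output always matches what the full model would have produced.
    accepted, tokens = [], list(context)
    for t in draft:
        if full_model.next_token(tokens) == t:
            accepted.append(t)
            tokens.append(t)
        else:
            # First disagreement: take the full model's token instead and stop.
            accepted.append(full_model.next_token(tokens))
            break
    return accepted
```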

[–] noneabove1182 1 points 1 year ago

Somehow this is even more confusing because that code hasn't been touched in 3 months, maybe just took them that long to validate? Will have to read through it, thanks!

[–] noneabove1182 3 points 1 year ago

Yeah fair point, I'll make sure to include better links in the future :) typically post from mobile so it's annoying but doable

 

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows

Seems like a great resource for all things embeddings related, give it a look!
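A minimal semantic-search sketch to show the flavor of it; the config keys and model name are from my reading of the txtai docs, so treat them as assumptions:

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})

data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
]

# Index (id, text, tags) tuples, then run a natural-language query against them.
embeddings.index([(i, text, None) for i, text in enumerate(data)])
print(embeddings.search("climate change", 1))
```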

submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 

These are the full weights; the quants are already incoming from TheBloke. I'll update this post when they're fully uploaded.

From the author(s):

WizardLM-70B V1.0 achieves a substantial and comprehensive improvement on coding, mathematical reasoning and open-domain conversation capacities.

This model is license-friendly and follows the same license as Meta Llama-2.

Next version is in training and will be public together with our new paper soon.

For more details, please refer to:

Model weight: https://huggingface.co/WizardLM/WizardLM-70B-V1.0

Demo and Github: https://github.com/nlpxucan/WizardLM

Twitter: https://twitter.com/WizardLM_AI

GGML quant posted: https://huggingface.co/TheBloke/WizardLM-70B-V1.0-GGML

GPTQ quant repo posted, but still empty (GPTQ is a lot slower to make): https://huggingface.co/TheBloke/WizardLM-70B-V1.0-GPTQ

 

Refactored codebase - now a single unified turbopilot binary that provides support for codegen and starcoder style models.

Support for starcoder, wizardcoder and santacoder models

Support for CUDA 11 and 12

Seems interesting; it looks like it supports WizardCoder with GPU offloading. If StarCoder also gets GPU offloading that would be great, but I would need to test. If it also works with the new StabilityAI coding models, that would be very interesting.

submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 

Text from them:

Calling all model makers, or would-be model creators! Chai asked me to tell you all about their open source LLM leaderboard:

Chai is running a totally open LLM competition. Anyone is free to submit a llama-based LLM via our Python package 🐍 It gets deployed to users on our app. We collect the metrics and rank the models! If you place high enough on our leaderboard you'll win money 🥇

We've paid out over $10,000 in prizes so far. 💰

Come to our discord and check it out!

https://discord.gg/chai-llm

Link to latest board for the people who don't feel like joining a random discord just to see results:

https://cdn.discordapp.com/attachments/1134163974296961195/1138833170838589471/image1.png

submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 

As some may know, I maintain a few Docker images of some available tools, and I noticed I was suddenly getting an NVML driver/library version mismatch. For the life of me I could not figure out what the issue was and tried so many things. I finally noticed that the Docker image had some special driver, 535.86.10, where my host had 535.86.05. After figuring that out, I added this to my Dockerfile:

RUN apt-get update && apt-get remove --purge -y nvidia-* && \
    apt-get install -y --allow-downgrades nvidia-driver-535/jammy-updates

And voilà, problem solved! I'm not sure what driver the Docker CUDA image was using; it might be some special dev driver, and it was causing a mismatch between the container and the host.

Only started happening as of the latest driver update released late last month

 

Just wondering if anyone has any suggestions to keep things moving and growing. I was thinking of doing a daily quantized-models post just to keep up with TheBloke. Thoughts?

 

This one is based on Llama 2. The first one worked very well for rule and structure following with guidance, so I'm highly intrigued to see if this lives up to the previous one.

 

I want to make a wiki for /c/localllama, but I'm not sure if there's a known place that's nice for making free wikis. Anyone got suggestions on what's being used widely on Lemmy?

submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 
 

This is actually a pretty big deal. ExLlama is by far the most performant inference engine out there for CUDA, but the strangest thing is that the PR claims it works for StarCoder, which is a non-llama model:

https://github.com/huggingface/text-generation-inference/pull/553

So I'm extremely curious to see what this brings...

A nice write up for LMQL (analyticsindiamag.com)
submitted 1 year ago by noneabove1182 to c/localllama
 

For the uninitiated, LMQL is one of a few offerings in the AI space (amongst others like guidance and guardrails) that allow for finer control of the AI's output and let you guarantee certain patterns, which is hugely helpful for a whole variety of use cases (tool use, safe chatbots, parseable data).
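Conceptually (this is not the LMQL API, just a hedged toy sketch), "guaranteeing a pattern" boils down to masking the model's next-token scores so only tokens that keep the output valid stay selectable:

```python
import math

def constrained_argmax(scores, vocab, text_so_far, is_valid):
    # Skip any token that would break the required output pattern, keep the best survivor.
    best_id, best_score = None, -math.inf
    for token_id, token_text in enumerate(vocab):
        if is_valid(text_so_far + token_text) and scores[token_id] > best_score:
            best_id, best_score = token_id, scores[token_id]
    return best_id

# Toy demo: the raw best token is "maybe", but a yes/no-only constraint forces "yes".
vocab = ["yes", "no", "maybe"]
scores = [1.0, 0.5, 2.0]
print(vocab[constrained_argmax(scores, vocab, "", lambda s: s in ("yes", "no"))])  # "yes"
```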
