noneabove1182

[–] noneabove1182 3 points 9 months ago

I use text-generation-webui mostly. If you're only using GGUF files (llama.cpp), koboldcpp is a really good option

A lot of it is the automatic prompt formatting; there are probably 5-10 specific formats in common use, and using the right one for your model is very important for optimal output. TheBloke usually lists the prompt format in his model cards, which is handy
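
For anyone who hasn't run into this yet, here's roughly what "prompt format" means in code. The two templates below (Alpaca-style and ChatML-style) are just common examples I'm using for illustration; always double-check the exact strings against the model card.

```python
# Two common prompt templates; which one to use depends on how the model was fine-tuned.
# These strings are illustrative; check the model card for the exact format.

ALPACA = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{prompt}\n\n### Response:\n"
)

CHATML = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def build_prompt(template: str, prompt: str, system: str = "You are a helpful assistant.") -> str:
    """Fill in a prompt template before sending it to the model."""
    return template.format(prompt=prompt, system=system)

print(build_prompt(ALPACA, "Explain what a GGUF file is."))
```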

RoPE and YaRN refer to extending the default context of a model through hacky (but functional) methods and probably deserve their own write-up

[–] noneabove1182 3 points 9 months ago

Yeah so those are mixed; it's definitely not putting each individual weight at 2 bits because, as you said, that's very small. I don't think it even averages out to 2 bits, it's more like 2.56

You can read some details here on bits per weight: https://huggingface.co/TheBloke/LLaMa-30B-GGML/blob/8c7fb5fb46c53d98ee377f841419f1033a32301d/README.md#explanation-of-the-new-k-quant-methods

Unfortunately this is not the whole story either, as the k-quants further combine different bits per weight: q2_k, for example, uses Q4_K for some of the weights and Q2_K for others, resulting in more like 2.8 bits per weight overall
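
As a rough back-of-the-envelope sketch (the 12/88 split below is made up purely to illustrate the weighted average, it's not the actual llama.cpp allocation):

```python
# Toy example: effective bits per weight of a mixed k-quant is just a weighted average.
# The split below is invented for illustration; llama.cpp chooses per-tensor which
# quant type to use, so real files will differ.
mix = {
    4.5: 0.12,   # fraction of weights stored at ~4.5 bpw (Q4_K, incl. scales/mins)
    2.56: 0.88,  # fraction of weights stored at ~2.56 bpw (Q2_K, incl. scales/mins)
}

effective_bpw = sum(bits * frac for bits, frac in mix.items())
print(f"effective bits per weight: {effective_bpw:.2f}")  # ~2.79 with these made-up numbers
```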

Generally speaking you'll want to use Q4_K_M unless going smaller really benefits you (like you can fit the full thing on GPU)

Also, the bigger the model you have (70B vs 7B) the lower you can go on quantization bits before it degrades to complete garbage

[–] noneabove1182 3 points 9 months ago (2 children)

If you're using llama.cpp chances are you're already using a quantized model; if not then yes, you should be. Unfortunately, without crazy fast RAM you're basically limited to 7B models if you want any amount of speed (5-10 tokens/s)
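
If you'd rather drive llama.cpp from Python than through a UI, a minimal sketch with the llama-cpp-python bindings looks something like this (the model path is a placeholder, and defaults vary a bit between versions):

```python
# Minimal llama-cpp-python sketch for running a quantized GGUF model.
# The model path below is a placeholder, not a real file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # quantized 7B model
    n_ctx=4096,        # context window
    n_threads=8,       # CPU threads; tune to your machine
    n_gpu_layers=0,    # >0 offloads layers to the GPU if built with GPU support
)

out = llm("Q: Why use a quantized model?\nA:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```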

[–] noneabove1182 1 points 9 months ago* (last edited 9 months ago) (1 children)

Great little machine on paper; I was hoping for a bit more out of the RAM, but sadly it's really low throughput due to the narrow memory bus. The CPU is quite impressive though, as is the cooling

[–] noneabove1182 2 points 9 months ago

I'm looking forward to trying it today. I think this might make a good RAG model based on the Orca 2 paper, but testing will be needed

[–] noneabove1182 2 points 9 months ago (3 children)

According to the config it looks like it's only 4096, and they specify in the arXiv paper that they kept the training data under that length, so it must be 4096. I'm sure people will extend it soon like they have with others

[–] noneabove1182 3 points 9 months ago (1 children)

Got any other articles? That one doesn't make it out to be all that bad

"Without that catalyst, I don't see an angle to a near term mutually agreeable merger of Nintendo and MS and I don't think a hostile action would be a good move, so we are playing the long game."

doesn't sound like meddling to me, just wanting to mutually merge, and who wouldn't want that as a CEO lol

[–] noneabove1182 4 points 9 months ago (3 children)

They've been doing orders of magnitude better in recent years. I'm never thrilled about aggressive vertical integration, but of all the massive corporations Microsoft is pretty high up on my personal list for trust (which, yeah, is a pretty low bar compared to Amazon/Google/etc)

[–] noneabove1182 5 points 9 months ago (1 children)

Wtf? This is a weird take lol

[–] noneabove1182 2 points 9 months ago (1 children)

Hope it comes out soon, those are some nice QOL updates :)

[–] noneabove1182 2 points 9 months ago* (last edited 9 months ago)

While the drama around X and Musk can't be overstated, it's still great to see more players in the open model world (assuming this gets properly opened)

One thing that'll hold it back (for people like us at least) is developer support, so I'm quite curious to see how this plays out with things like GPTQ and llama.cpp

[–] noneabove1182 1 points 9 months ago (1 children)

I almost wonder if they have, but they're holding it back until they have something more game-changing. Because let's be honest, if Gemini releases and just says "we're better than GPT-4", people won't flock to it; they need a standout feature to make people want to switch

 

Recent years have witnessed the rapid development of large language models (LLMs). Despite the strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization meanwhile decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. Code will be made available at this https URL.
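
To give a rough feel for the "group-wise" part (this is a generic simulated INT4 group quantizer I threw together for illustration, not the QA-LoRA implementation), each group of weights gets its own scale and zero-point, which is where the extra quantization degrees of freedom come from:

```python
# Simplified illustration of group-wise INT4 quantization (not the QA-LoRA code).
# Each group of `group_size` weights gets its own scale/zero-point, which is the
# extra quantization freedom the paper is talking about.
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 32, bits: int = 4):
    levels = 2 ** bits - 1
    w = w.reshape(-1, group_size)                        # one row per group
    w_min = w.min(axis=1, keepdims=True)                 # per-group zero-point
    scale = (w.max(axis=1, keepdims=True) - w_min) / levels
    q = np.clip(np.round((w - w_min) / scale), 0, levels)  # integer codes
    return q.astype(np.uint8), scale, w_min

def dequantize(q, scale, w_min):
    return q * scale + w_min

w = np.random.randn(4096)                 # pretend this is one weight row
q, scale, zero = quantize_groupwise(w)
w_hat = dequantize(q, scale, zero).reshape(-1)
print("max abs error:", np.abs(w - w_hat).max())
```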

 

Promising increase in context. Obviously we've seen other methods like YaRN and RoPE scaling, but it's nice to see Meta validating some methods and hopefully releasing the models themselves!

 

Very detailed video covering a range of LLM topics from limitations to tips to fine tuning

Covers both OpenAI's code interpreter and local models with RAG

Worth the watch! At least for subsections that interest you

 

I know it's not exactly local LLM related, but it's a large player and I figure it might be worth discussing; very interesting move.

We're now seeing Google, Microsoft, and Amazon with major investments in AI. Apple is obviously also doing their own thing, just a bit quieter. Very interesting for the future.

That said, the wording of "safer AI" makes me think it'll be all the more important to have our local models that aren't needlessly censored by corporations who think they know better.

Thoughts?

 

Reversal knowledge in this case being: if the LLM knows that A is B, does it also know that B is A? Apparently the answer is a pretty resounding no! I'd be curious to see if some CoT affects the results at all

 

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The drafting stage generates draft tokens at a slightly lower quality but more quickly, which is achieved by selectively skipping certain intermediate layers during drafting. Subsequently, the verification stage employs the original LLM to validate those draft output tokens in one forward pass. This process ensures the final output remains identical to that produced by the unaltered LLM, thereby maintaining output quality. The proposed method requires no additional neural network training and no extra memory footprint, making it a plug-and-play and cost-effective solution for inference acceleration. Benchmarks with LLaMA-2 and its fine-tuned models demonstrated a speedup of up to 1.73x.

With all the interest around speculative decoding using a smaller model, this presents an interesting opportunity to speed up without needing the extra space for a draft model
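
For anyone curious what that drafting/verification loop looks like, here's a toy greedy sketch. The `full_next` and `draft_next` callables are hypothetical stand-ins (in self-speculative decoding the draft is the same model with some layers skipped), and a real implementation verifies all the draft tokens in a single forward pass rather than one at a time:

```python
# Toy greedy speculative decoding loop. `full_next` and `draft_next` are
# hypothetical callables mapping a token sequence to the next token id.
from typing import Callable, List

def speculative_generate(
    full_next: Callable[[List[int]], int],
    draft_next: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int = 64,
    k: int = 4,                      # draft tokens per round
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft k tokens cheaply (skipped-layer forward passes).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify with the full model: accept drafts while they match what
        #    greedy decoding would have produced, stop at the first mismatch.
        accepted = []
        for i in range(k):
            target = full_next(tokens + accepted)   # the "correct" greedy token here
            accepted.append(target)
            if target != draft[i]:
                break                               # first mismatch: discard remaining drafts
        tokens.extend(accepted)
    return tokens
```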

submitted 11 months ago* (last edited 11 months ago) by noneabove1182 to c/localllama
 

Linked is the new repo; it's still in relatively early stages but it does work.

I'm using it in oobabooga's text-generation-webui with the OLD GPTQ format, so not even the new stuff, and on my 3060 I see a genuine >200% increase in speed:

Exllama v1

Output generated in 21.84 seconds (9.16 tokens/s, 200 tokens, context 135, seed 1891621432)

Exllama v2

Output generated in 6.23 seconds (32.10 tokens/s, 200 tokens, context 135, seed 313599079)

Absolutely crazy, all settings are the same. And it's not just a burst at the front, it lasts:

Output generated in 22.40 seconds (31.92 tokens/s, 715 tokens, context 135, seed 717231733)
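
Quick sanity check on those numbers, just arithmetic on the log lines above:

```python
# Sanity-checking the reported throughput from the log lines above.
v1_tps = 200 / 21.84        # ~9.16 tokens/s (ExLlama v1)
v2_tps = 200 / 6.23         # ~32.1 tokens/s (ExLlama v2, same settings)
sustained = 715 / 22.40     # ~31.9 tokens/s over the longer 715-token run
print(f"v2 speedup: {v2_tps / v1_tps:.1f}x")  # ~3.5x, i.e. well over a 200% increase
```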

And this is using the old format; ExLlama v2 includes a new way to quantize, allowing for much more granular bitrates.

Turbo went with a really cool approach here: you set a target bits per weight, say 3.5, and it automatically assigns appropriate quant levels to the appropriate weights to achieve maximum quality where it counts, spending more bits on important weights and sacrificing more on unimportant ones. Very cool stuff!
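
As a toy illustration of that kind of budgeted allocation (this is not ExLlamaV2's actual algorithm, which measures quantization error on calibration data; the tensor sizes and sensitivity scores below are made up):

```python
# Toy bit-allocation sketch: hit a target average bits-per-weight by giving
# more bits to "important" tensors first. NOT ExLlamaV2's real algorithm;
# the sizes and sensitivity scores below are invented for illustration.
target_bpw = 3.5
options = [2.5, 3.0, 4.0, 5.0, 6.0]            # candidate bitrates per tensor

# (name, number of weights, made-up sensitivity score)
tensors = [
    ("attn_q",   16_000_000, 0.9),
    ("attn_k",   16_000_000, 0.7),
    ("ffn_up",   44_000_000, 0.4),
    ("ffn_down", 44_000_000, 0.8),
]

def avg_bpw(assign):
    total = sum(n for _, n, _ in tensors)
    return sum(assign[name] * n for name, n, _ in tensors) / total

# Start everything at the lowest bitrate, then bump the most sensitive tensors
# to the highest level that still keeps the average under the target.
assign = {name: options[0] for name, _, _ in tensors}
for name, _, _ in sorted(tensors, key=lambda t: -t[2]):
    for bits in options:
        trial = {**assign, name: bits}
        if avg_bpw(trial) <= target_bpw:
            assign = trial

print(assign, f"avg = {avg_bpw(assign):.2f} bpw")
```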

Get your latest oobabooga webui and start playing!

https://github.com/oobabooga/text-generation-webui

https://github.com/noneabove1182/text-generation-webui-docker

Some models in the new format from turbo: https://huggingface.co/turboderp

 

A potentially useful compatibility layer between guidance and llama-cpp-python, have yet to try it but looks promising at first glance!

 

✅WizardCoder-34B surpasses GPT-4, ChatGPT-3.5 and Claude-2 on HumanEval with 73.2% pass@1

 

~~Today~~ August 23rd, Telegram celebrates its 10th birthday – with our biggest update yet. Over the past decade we’ve built hundreds of new features that are now used by over 800 million people. In this update, we launch Stories – with a unique dual camera mode, granular privacy settings, flexible duration options and much more

 

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.
