LocalLLaMA


A community for discussing LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

founded 1 year ago
101
 
 

AutoGen is a framework that enables development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.

Git repo here: https://github.com/microsoft/autogen

102
 
 

Recent years have witnessed a rapid development of large language models (LLMs). Despite the strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization meanwhile decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. Code will be made available at this https URL.
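To make "group-wise" a bit more concrete, here is a minimal sketch of group-wise min-max quantization in plain Python. It is not the QA-LoRA code, and it only illustrates the quantization side (each group of weights gets its own scale and zero point); the function names and the flat-list layout are assumptions made for the example.

```python
from typing import List, Tuple


def quantize_groupwise(
    weights: List[float], group_size: int, bits: int = 4
) -> Tuple[List[int], List[float], List[float]]:
    """Min-max quantize a flat weight list, one (scale, zero) pair per group."""
    if len(weights) % group_size:
        raise ValueError("len(weights) must be a multiple of group_size")
    levels = 2 ** bits - 1  # 15 for INT4
    codes, scales, zeros = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # A constant group would give scale 0; fall back to 1.0 in that case.
        scale = (hi - lo) / levels or 1.0
        scales.append(scale)
        zeros.append(lo)
        codes.extend(round((w - lo) / scale) for w in group)
    return codes, scales, zeros


def dequantize_groupwise(
    codes: List[int], scales: List[float], zeros: List[float], group_size: int
) -> List[float]:
    """Map integer codes back to floats using each group's scale and zero."""
    return [
        zeros[i // group_size] + scales[i // group_size] * q
        for i, q in enumerate(codes)
    ]
```

Smaller groups mean more scale/zero pairs, which is the extra "degree of freedom of quantization" the abstract refers to: weights with very different ranges no longer have to share one coarse grid.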

103
 
 

A promising increase in context length. We've already seen other methods such as YaRN and RoPE scaling, but it's nice to see Meta validating some of them and, hopefully, releasing the models themselves!

104
 
 

A very detailed video covering a range of LLM topics, from limitations to tips to fine-tuning.

Covers both OpenAI's Code Interpreter and local models with RAG.

Worth the watch, at least for the subsections that interest you!

105
 
 

I know it's not exactly local-LLM related, but it involves a large player, so I figured it might be worth discussing. A very interesting move.

We're now seeing Google, Microsoft, and Amazon with major investments in AI. Apple is obviously also doing their own thing, just a bit quieter. Very interesting for the future.

That said, the wording of "safer AI" makes me think it'll be all the more important to have our local models that aren't needlessly censored by corporations who think they know better.

Thoughts?

106
 
 

Reversal knowledge here means: if the LLM knows that A is B, does it also know that B is A? Apparently the answer is a pretty resounding no! I'd be curious to see whether some CoT prompting would affect the results at all.

107
 
 

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The drafting stage generates draft tokens at a slightly lower quality but more quickly, which is achieved by selectively skipping certain intermediate layers during drafting. Subsequently, the verification stage employs the original LLM to validate those draft output tokens in one forward pass. This process ensures the final output remains identical to that produced by the unaltered LLM, thereby maintaining output quality. The proposed method requires no additional neural network training and no extra memory footprint, making it a plug-and-play and cost-effective solution for inference acceleration. Benchmarks with LLaMA-2 and its fine-tuned models demonstrated a speedup of up to 1.73×.

With all the interest around speculative decoding using a smaller model, this presents an interesting opportunity to speed up without needing the extra space for a draft model
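The draft-then-verify loop from the abstract can be sketched with toy stand-ins. This is only an illustration under loud assumptions: `full_model` and `draft_model` are made-up deterministic functions (in the paper the draft is the same network with some layers skipped), decoding is greedy, and the per-position check stands in for what a real LLM does in a single forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]
VOCAB = 50  # toy vocabulary size (assumption)


def full_model(context: List[int]) -> int:
    """Toy stand-in for the full LLM's greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_model(context: List[int]) -> int:
    """Toy stand-in for the faster draft: wrong on every 4th context length."""
    guess = full_model(context)
    return guess if len(context) % 4 else (guess + 1) % VOCAB


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    full: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (tokens, number of verification passes)."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    verify_passes = 0
    while len(tokens) < target_len:
        # Drafting: cheaply propose up to k tokens.
        drafted: List[int] = []
        for _ in range(min(k, target_len - len(tokens))):
            drafted.append(draft(tokens + drafted))
        # Verification: counted as one pass; the toy checks position by position.
        verify_passes += 1
        for i, tok in enumerate(drafted):
            expected = full(tokens + drafted[:i])
            if tok != expected:
                # Keep the accepted prefix, then the full model's own token.
                tokens.extend(drafted[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(drafted)  # every drafted token was accepted
    return tokens, verify_passes
```

Because a drafted token is only kept when it matches what the full model would have produced, the result is the same as plain greedy decoding with the full model; the saving comes from needing fewer full-model passes than generated tokens whenever drafts are accepted.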

108
109
110
 
 

Linked is the new repo, it's still in relatively early stages but does work.

I'm using it in oobabooga text-gen-ui and the OLD GPTQ format, so not even the new stuff, and on my 3060 I see a genuine >200% increase in speed:

Exllama v1

Output generated in 21.84 seconds (9.16 tokens/s, 200 tokens, context 135, seed 1891621432)

Exllama v2

Output generated in 6.23 seconds (32.10 tokens/s, 200 tokens, context 135, seed 313599079)

Absolutely crazy, all settings are the same. And it's not just a burst at the front, it lasts:

Output generated in 22.40 seconds (31.92 tokens/s, 715 tokens, context 135, seed 717231733)
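As a quick sanity check on those log lines, the tokens/s figures and the speedup can be recomputed from the token counts and timings. The numbers are copied from the three outputs above; the dictionary keys are just labels chosen for this example.

```python
# (tokens generated, seconds) taken from the three log lines above
runs = {
    "ExLlama v1": (200, 21.84),
    "ExLlama v2": (200, 6.23),
    "ExLlama v2 (715 tokens)": (715, 22.40),
}


def tokens_per_second(tokens: int, seconds: float) -> float:
    return tokens / seconds


rates = {name: tokens_per_second(*run) for name, run in runs.items()}
speedup = rates["ExLlama v2"] / rates["ExLlama v1"]
percent_increase = (speedup - 1) * 100

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} tokens/s")
print(f"speedup: {speedup:.2f}x ({percent_increase:.0f}% increase)")
```

That works out to roughly 3.5x, or about a 250% increase, and the longer 715-token run holds essentially the same rate as the 200-token one.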

And this is with the old format. ExLlama v2 also includes a new way to quantize, allowing for much more granular bitrates.

Turbo went with a really cool approach here: you set a target bits per weight, say 3.5, and it automatically assigns quant levels to the weights to get the most performance where it counts, preserving precision in important weights and sacrificing more on unimportant ones. Very cool stuff!
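To give a feel for how a fractional target like 3.5 bits per weight can come out of whole-number quant levels, here is a toy allocator. It is a hypothetical greedy sketch, not ExLlamaV2's actual algorithm: the layer names, sizes, "sensitivity" scores, and the list of bit levels are all invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Layer = Tuple[str, int, float]  # (name, number of weights, sensitivity score)


def allocate_bits(
    layers: List[Layer],
    target_bpw: float,
    levels: Sequence[int] = (2, 3, 4, 5, 6, 8),
) -> Tuple[Dict[str, int], float]:
    """Greedily raise bit widths, most sensitive layers first, within a budget."""
    total_weights = sum(n for _, n, _ in layers)
    budget = target_bpw * total_weights  # total bits we may spend
    used = levels[0] * total_weights     # start every layer at the lowest level
    if used > budget:
        raise ValueError("target_bpw is below the lowest available level")
    choice = {name: 0 for name, _, _ in layers}  # index into `levels`
    order = sorted(layers, key=lambda layer: layer[2], reverse=True)
    upgraded = True
    while upgraded:  # keep making rounds until no upgrade fits the budget
        upgraded = False
        for name, n, _ in order:
            i = choice[name]
            if i + 1 < len(levels):
                extra = (levels[i + 1] - levels[i]) * n
                if used + extra <= budget:
                    choice[name] = i + 1
                    used += extra
                    upgraded = True
    bits = {name: levels[i] for name, i in choice.items()}
    return bits, used / total_weights


# Made-up layers of equal size with made-up sensitivity scores.
example_layers = [
    ("attn", 1000, 0.9),
    ("mlp_up", 1000, 0.5),
    ("mlp_down", 1000, 0.1),
    ("embed", 1000, 0.3),
]
bits, average_bpw = allocate_bits(example_layers, target_bpw=3.5)
print(bits, average_bpw)
```

With these invented inputs, the two most sensitive layers end up at 4 bits and the other two at 3 bits, which averages to the 3.5 target. A real quantizer would measure the error each choice introduces instead of relying on a hand-assigned score.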

Get your latest oobabooga webui and start playing!

https://github.com/oobabooga/text-generation-webui

https://github.com/noneabove1182/text-generation-webui-docker

Some models in the new format from turbo: https://huggingface.co/turboderp

111
 
 

Hi,

Just like the title says:

I'm trying to run:

With:

  • koboldcpp:v1.43 using HIPBLAS on a 7900XTX / Arch Linux

Running:

--stream --unbantokens --threads 8 --usecublas normal

I get very limited output with lots of repetition.

Illustration

I mostly didn't touch the default settings:

Settings

Does anyone know how I can make things run better?

EDIT: Sorry for multiple posts, Fediverse bugged out.

112
 
 

A potentially useful compatibility layer between guidance and llama-cpp-python. I have yet to try it, but it looks promising at first glance!

113
26
How usable are AMD GPUs? (lemmy.dbzer0.com)
submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/localllama
 
 

Heyho, I'm currently on an RTX 3070 but want to upgrade to an RX 7900 XT.

I see that AMD installers are there, but is it all smooth sailing? How well do AMD cards compare to NVidia in terms of performance?

I'd mainly use oobabooga but would also love to try some other backends.

Anyone here with one of the newer AMD cards that could talk about their experience?

EDIT: To clear things up a little bit: I am on Linux, and I'd say I am quite experienced with it. I know how to handle a card swap and I know where to get my drivers from. I know of the gaming performance difference between NVidia and AMD. Those are the main reasons I want to switch to AMD. Now I just want to hear from someone who ALSO has Linux + AMD what their experience with Oobabooga and Automatic1111 is when using ROCm, for example.

114
23
Pygmalion-2 has been released (pygmalionai.github.io)
submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/localllama
 
 

I might be a bit late to the party, but for those of you that like ERP and fiction writing:

Introducing Pygmalion-2

The people from Pygmalion have released a new model, usable for roleplaying, conversation and storywriting. It is based on Llama 2 and has been trained on SFW and NSFW roleplay, fictional stories and instruction following conversations. It is available in two sizes, 7b and 13b parameters. They're also releasing a mix with MythoMax-L2 called Mythalion 13B.

Furthermore, they're (once again) announcing a website with character sharing and inference (later in October).

For reference: Pygmalion-6b was a well-known dialogue model for (lewd) roleplay in the times before LLaMA. It was followed by an underwhelming successor based on LLaMA (Pygmalion-7b). In their new blog post, they promise that the new model is an improvement.

(Personally, I'm curious how it performs compared to MythoMax. There aren't many models around that excel at roleplay or have been designed specifically for that use case.)

115
116
117
 
 

The main reason I ask is that my current favorite model is a Llama 2 70B Q4_1 GGML model quantized by TheBloke. Here's the thing though: it was labeled as "Instruct", but it defaults to chat in the settings in Oobabooga/Textgen. Every other model I have tried to use for technical help and Python/Bash snippets has failed to meet my expectations for (skeptically acceptable) accuracy. This 70B is powerful enough that I can prompt it to generate code snippets, and if the code creates an error, pasting the error into the prompt almost always yields a solution in a single correction. Other models I have tried this paste-error technique on often crash, 'dig in their heels' insisting they are correct, or fail in several different ways, like overfitting that forces resetting the context tokens.

For whatever reason, the specific 70B model I am using has far exceeded my expectations, but I must use it under very specific conditions in Oobabooga/Textgen. It must be set to: chat, llama.cpp, the "divine intellect" parameter preset, and the character profile set to the default of "None."

For whatever reason, deviation from these settings ruins the accuracy of code snippets. Speculatively/intuitively, if I try to use the instruct prompt, or a new persistent character profile, it seems like there is an issue in the way the previous context is handled. In a single session the context seems to drift. In any case, code seems to always have errors and paste corrections fail.

I can't contextualize this issue with such large models. I have had the same issues with smaller models regardless of the settings I have tried. I have written or modified a dozen scripts between Bash and Python using this 70B in chat mode. It is a bit of a pain because the prompt input/output is not proper markdown for code, so I have to correct for whitespace scope and have a reasonable understanding of the code syntax, but for the most part, I don't need to make corrections to specific lines of output. Is this rare, or is it a quirk of the model quantization, llama.cpp, Textgen, or something else? Has anyone else experienced something like this? Am I just super lucky to have found a chance combination that works really well at snippets combined with my prompting/coding skill level? I haven't had much success with the code-specific LLMs either. I'm not sure why this model is doing so well for me.

118
 
 

✅WizardCoder-34B surpasses GPT-4, ChatGPT-3.5 and Claude-2 on HumanEval with 73.2% pass@1
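For anyone unsure what pass@1 measures: the usual definition is the unbiased pass@k estimator from the original HumanEval (Codex) paper, sketched below. The sampling setup behind the 73.2% figure isn't stated in this post, so the one-sample-per-problem usage is an assumption, and `benchmark_pass_at_k` is a helper name made up for this example.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Assumed setup: one sample per problem on HumanEval's 164 problems.
solved, total = 120, 164
results = [(1, 1)] * solved + [(1, 0)] * (total - solved)
print(f"pass@1 = {benchmark_pass_at_k(results, k=1):.1%}")
```

With a single sample per problem, pass@1 is simply the fraction of problems solved, so 73.2% on HumanEval corresponds to about 120 of 164 problems.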

119
 
 

Is it just memory bandwidth? Or is it that AMD is not supported well enough by PyTorch for most products? Or some combination of those?

120
 
 

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.

121
 
 

Hugging Face Transformers now officially supports AutoGPTQ. This is a pretty huge deal and signals much wider adoption of quantized models, which is great for everyone!

122
 
 

The airoboros package now includes an API server similar to OpenAI chat completions.

7b/13b LMoE packages available on my 🤗

github link:

https://github.com/jondurbin/airoboros#lmoe

123
 
 

Meta just released a multimodal model for speech translation. It can do speech recognition and translation into both text and speech, supporting nearly 100 input and output languages (35 for speech output). SeamlessM4T is released under CC BY-NC 4.0.

Abstract

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems composed of multiple subsystems performing translation progressively, putting scalable and high-performing unified speech translation systems out of reach. To address these gaps, we introduce SeamlessM4T—Massively Multilingual & Multimodal Machine Translation—a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations, dubbed SeamlessAlign. Filtered and combined with human labeled and pseudo-labeled data (totaling 406,000 hours), we developed the first multilingual system capable of translating from and into English for both speech and text. On Fleurs, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous state-of-the-art in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. On CVSS and compared to a 2-stage cascaded model for speech-to-speech translation, SeamlessM4T-Large’s performance is stronger by 58%. 
Preliminary human evaluations of speech-to-text translation outputs evinced similarly impressive results; for translations from English, XSTS scores for 24 evaluated languages are consistently above 4 (out of 5). For into English directions, we see significant improvement over WhisperLarge-v2’s baseline for 7 out of 24 languages. To further evaluate our system, we developed Blaser 2.0, which enables evaluation across speech and text with similar accuracy compared to its predecessor when it comes to quality estimation. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks (average improvements of 38% and 49%, respectively) compared to the current state-of-the-art model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Compared to the state-of-the-art, we report up to 63% of reduction in added toxicity in our translation outputs. Finally, all contributions in this work—including models, inference code, finetuning recipes backed by our improved modeling toolkit Fairseq2, and metadata to recreate the unfiltered 470,000 hours of SeamlessAlign — are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication.

124
 
 

Hugging Face released IDEFICS, an 80B open-access visual language model replicating DeepMind's unreleased Flamingo. Built entirely on public data, it's the first of its size available openly. Part of its training utilized OBELICS, a dataset with 141M web pages, 353M images, and 115B text tokens from Common Crawl.

125
 
 

With release 0.5.0, PEFT now officially supports fine-tuning quantized GPTQ models! This is a pretty big deal, as it allows you to download a much smaller model for fine-tuning!
