Yesterday Mistral AI released a new language model called Mistral 7B. @[email protected] already posted the sliding attention part here in LocalLLaMA yesterday. But I think the model and the company behind it are even more noteworthy, and the release of the model is worth its own post.

Mistral 7B is not based on Llama. And they claim it outperforms Llama2 13B on all benchmarks (at its size of 7B). It has additional coding abilities and an 8k sequence length. And it's released under the Apache 2.0 license. ~~So truly an 'open' model, usable without restrictions.~~ [Edit: Unfortunately I couldn't find the dataset or a paper. They call it 'open-weight'. So my conclusion regarding the open-ness might be a bit premature. We'll see.]

(It uses Grouped-query attention and Sliding Window Attention.)
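To make those two terms concrete, here is a minimal sketch in plain Python. It is not Mistral's code: the function names and the toy sizes are made up for illustration. The first function builds a causal mask in which each position only sees the most recent `window` positions; the second shows how grouped-query attention lets several query heads share one key/value head.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (no looking ahead) and limited to the most recent `window` positions.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Grouped-query attention: consecutive query heads share one key/value head."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


# Toy sizes chosen for readability, not Mistral 7B's real configuration.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

sharing = [kv_head_for_query_head(h, n_q_heads=8, n_kv_heads=2) for h in range(8)]
print(sharing)
```

Each printed row is one query position. The band of `x` marks never grows wider than the window, which is why attention cost per token stays bounded as the sequence gets longer.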

Also worth noting: Mistral AI (the company) is based in Paris. They are one of the few big European AI startups and raised $113 million in funding in June.

Details are in Mistral AI's announcement
TechCrunch news article including information about the company
They released a base/foundation model and an instruction-tuned one on HuggingFace
And llama.cpp is already compatible, and GGUF versions are out there.

I've tried it and it indeed looks promising. It certainly has features that distinguish it from Llama. And I like the competition. Our world is currently completely dominated by Meta. And if it performs exceptionally well at its size, I hope people pick up on it and fine-tune it for all kinds of specific tasks. (The lack of a dataset and detail regarding the training could be a downside, though. These were not included in this initial release of the model.)

EDIT 2023-10-12: Paper released at: https://arxiv.org/abs/2310.06825 (But I'd say no new information in it, they mostly copied their announcement)

As of now, it is clear they don't want to publish any details about the training.

104

13

Microsoft's latest LLM agent: autogen (microsoft.github.io)

submitted 1 year ago by noneabove1182 to c/localllama

6 comments fedilink

AutoGen is a framework that enables development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.

Git repo here: https://github.com/microsoft/autogen

105

6

QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (arxiv.org)

submitted 1 year ago by noneabove1182 to c/localllama

8 comments fedilink

Recent years have witnessed a rapid development of large language models (LLMs). Despite the strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization meanwhile decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. Code will be made available at this https URL.
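For anyone unsure what "group-wise" means on the quantization side, here is a rough sketch of generic min-max group-wise quantization in plain Python. This is an illustration only and not the QA-LoRA implementation: the function names, the example weights, and the min-max scheme are assumptions. The point it shows is that every group of weights gets its own scale and offset, so a few large values in one group do not ruin the precision of the others.

```python
def quantize_groupwise(weights, group_size, bits=4):
    """Quantize each group of weights with its own scale and offset (min-max)."""
    levels = 2**bits - 1  # 15 steps for INT4
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
        codes = [round((w - lo) / scale) for w in chunk]
        groups.append((scale, lo, codes))
    return groups


def dequantize_groupwise(groups):
    """Map the integer codes back to approximate floating-point weights."""
    return [lo + scale * code for scale, lo, codes in groups for code in codes]


# Hypothetical weights: the second group has a much wider range than the first.
weights = [0.10, -0.20, 0.05, 0.40, 3.0, -2.5, 1.0, 0.0]
groups = quantize_groupwise(weights, group_size=4)
restored = dequantize_groupwise(groups)

for scale, lo, codes in groups:
    print(f"scale={scale:.4f} offset={lo:+.2f} codes={codes}")
```

With a single scale for the whole row, the small weights in the first group would share the coarse step size needed by the second group. Smaller groups give quantization more freedom, at the cost of storing more scales.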

106

6

Effective Long-Context Scaling of Foundation Models | Research - AI at Meta (ai.meta.com)

submitted 1 year ago by noneabove1182 to c/localllama

1 comments fedilink

Promising increase in context. Obviously we've seen other methods like YaRN and RoPE scaling, but it's nice to see Meta validating some methods and hopefully releasing the models themselves!

107

9

Jeremy Howard: A Hackers' Guide to Language Models (youtu.be)

submitted 1 year ago by noneabove1182 to c/localllama

0 comments fedilink

Very detailed video covering a range of LLM topics from limitations to tips to fine tuning

Covers both OpenAI's Code Interpreter and local models with RAG

Worth the watch! At least for subsections that interest you

108

1

Amazon investing in Anthropic - Expanding access to safer AI with Amazon (www.anthropic.com)

submitted 1 year ago by noneabove1182 to c/localllama

0 comments fedilink

I know it's not exactly local LLM related, but it's a large player and I figure it might be worth discussing. Very interesting move.

We're now seeing Google, Microsoft, and Amazon with major investments in AI. Apple is obviously also doing their own thing, just a bit quieter. Very interesting for the future.

That said, the wording of "safer AI" makes me think it'll be all the more important to have our local models that aren't needlessly censored by corporations who think they know better.

Thoughts?

109

15

Very interesting thread about reversal knowledge (twitter.com)

submitted 1 year ago by noneabove1182 to c/localllama

10 comments fedilink

Reversal knowledge in this case means: if the LLM knows that A is B, does it also know that B is A? Apparently the answer is a pretty resounding no! I'd be curious to see whether some CoT would affect the results at all.

110

6

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding (arxiv.org)

submitted 1 year ago by noneabove1182 to c/localllama

2 comments fedilink

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The drafting stage generates draft tokens at a slightly lower quality but more quickly, which is achieved by selectively skipping certain intermediate layers during drafting. Subsequently, the verification stage employs the original LLM to validate those draft output tokens in one forward pass. This process ensures the final output remains identical to that produced by the unaltered LLM, thereby maintaining output quality. The proposed method requires no additional neural network training and no extra memory footprint, making it a plug-and-play and cost-effective solution for inference acceleration. Benchmarks with LLaMA-2 and its fine-tuned models demonstrated a speedup up to 1.73×.

With all the interest around speculative decoding using a smaller model, this presents an interesting opportunity to speed up without needing the extra space for a draft model
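A toy sketch of the draft-then-verify loop, assuming greedy decoding. Nothing here comes from the paper's code: `full_next` and `draft_next` are hypothetical stand-ins (simple arithmetic rules) for the full model and the layer-skipping draft. A real implementation checks all draft tokens in one batched forward pass, while this sketch calls the full model once per token for clarity.

```python
def full_next(context):
    """Stand-in for the full model's greedy next token (hypothetical toy rule)."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Stand-in for the cheaper layer-skipping draft: usually right, sometimes wrong."""
    token = full_next(context)
    return token if len(context) % 4 else (token + 1) % 50


def greedy_decode(prompt, n_new):
    """Baseline: plain greedy decoding with the full model."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(full_next(out))
    return out


def speculative_decode(prompt, n_new, draft_len=4):
    """Draft a few tokens cheaply, then keep only what the full model agrees with."""
    out = list(prompt)
    target = len(prompt) + n_new
    while len(out) < target:
        # Drafting stage: guess draft_len tokens with the cheap model.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(out + draft))
        # Verification stage: accept matching tokens, correct the first mismatch.
        for token in draft:
            expected = full_next(out)
            out.append(expected)
            if expected != token or len(out) == target:
                break
    return out


print(speculative_decode([1, 2, 3], 20) == greedy_decode([1, 2, 3], 20))
```

Because every appended token is the full model's own choice, the result matches plain greedy decoding exactly. The speedup in practice comes from how many draft tokens get accepted per verification pass.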

111

28

Distilling step-by-step: Outperforming larger language models with less training data and smaller model sizes (blog.research.google)

submitted 1 year ago by [email protected] to c/localllama

3 comments fedilink

112

12

Efficient Fine-Tuning for Llama-v2-7b on a Single GPU (www.youtube.com)

submitted 1 year ago by [email protected] to c/localllama

0 comments fedilink

113

25

Exllama V2 released! Available in Ooba! Big speed upgrades! (github.com)

submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama

3 comments fedilink

Linked is the new repo, it's still in relatively early stages but does work.

I'm using it in oobabooga text-gen-ui and the OLD GPTQ format, so not even the new stuff, and on my 3060 I see a genuine >200% increase in speed:

Exllama v1

Output generated in 21.84 seconds (9.16 tokens/s, 200 tokens, context 135, seed 1891621432)

Exllama v2

Output generated in 6.23 seconds (32.10 tokens/s, 200 tokens, context 135, seed 313599079)

Absolutely crazy, all settings are the same. And it's not just a burst at the front, it lasts:

Output generated in 22.40 seconds (31.92 tokens/s, 715 tokens, context 135, seed 717231733)
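The numbers in those log lines can be double-checked with a few lines of Python. This is just arithmetic on the token counts and timings quoted above; the dictionary labels are made up for readability.

```python
# (tokens generated, seconds taken), copied from the log lines above.
runs = {
    "Exllama v1": (200, 21.84),
    "Exllama v2": (200, 6.23),
    "Exllama v2 (long)": (715, 22.40),
}

rates = {name: tokens / seconds for name, (tokens, seconds) in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} tokens/s")

speedup = rates["Exllama v2"] / rates["Exllama v1"]
increase_pct = (speedup - 1) * 100
print(f"speedup: {speedup:.2f}x, i.e. a {increase_pct:.0f}% increase")
```

That works out to roughly a 3.5x speedup, which is where the ">200% increase" comes from, and the longer 715-token run holds nearly the same rate.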

And this is using the old format. Exllama v2 also includes a new way to quantize, allowing for much more granular bitrates.

Turbo went with a really cool approach here: you set a target bits per weight, say 3.5, and it'll automatically assign appropriate quant levels to the weights to achieve maximum performance where it counts, preserving more data in important weights and sacrificing more on unimportant ones. Very cool stuff!
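To illustrate how a fractional target like 3.5 bits per weight can come out of mixing quant levels, here is a small sketch. It only computes the weighted average; it is not ExLlamaV2's actual allocation algorithm, and the weight counts and bit widths below are invented for the example.

```python
def average_bits_per_weight(allocation):
    """allocation: list of (number_of_weights, bits_used_for_those_weights)."""
    total_weights = sum(count for count, _ in allocation)
    total_bits = sum(count * bits for count, bits in allocation)
    return total_bits / total_weights


# Hypothetical split: a small share of important weights kept at higher precision.
allocation = [
    (1_000_000, 6),    # most important weights
    (3_000_000, 4),    # middle tier
    (4_000_000, 2.5),  # least important weights
]

print(average_bits_per_weight(allocation))
```

The design idea is that the average is what determines file size and VRAM use, so precision can be spent unevenly as long as the average hits the target.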

Get your latest oobabooga webui and start playing!

https://github.com/oobabooga/text-generation-webui

https://github.com/noneabove1182/text-generation-webui-docker

Some models in the new format from turbo: https://huggingface.co/turboderp

114

8

[Help] Trying to run a local Story telling model with KoboldCpp (kbin.social)

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/localllama

16 comments fedilink

Hi,

Just like the title says:

I'm trying to run:

https://huggingface.co/TheBloke/WizardLM-Uncensored-SuperCOT-StoryTelling-30B-SuperHOT-8K-GGML

With:

koboldcpp:v1.43 using HIPBLAS on a 7900XTX / Arch Linux

Running :

--stream --unbantokens --threads 8 --usecublas normal

I get very limited output with lots of repetition.

Illustration

I mostly didn't touch the default settings:

Settings

Does anyone know how I can make things run better?

EDIT: Sorry for multiple posts, Fediverse bugged out.

115

17

GitHub - nicholasyager/llama-cpp-guidance: A guidance compatibility layer for llama-cpp-python (github.com)

submitted 1 year ago by noneabove1182 to c/localllama

0 comments fedilink

A potentially useful compatibility layer between guidance and llama-cpp-python. I have yet to try it, but it looks promising at first glance!

116

26

How usable are AMD GPUs? (lemmy.dbzer0.com)

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/localllama

47 comments fedilink

Heyho, I'm currently on an RTX 3070 but want to upgrade to an RX 7900 XT.

I see that AMD installers are there, but is it all smooth sailing? How well do AMD cards compare to NVidia in terms of performance?

I'd mainly use oobabooga but would also love to try some other backends.

Anyone here with one of the newer AMD cards that could talk about their experience?

EDIT: To clear things up a little bit. I am on Linux, and I'd say I am quite experienced with it. I know how to handle a card swap and I know where to get my drivers from. I know of the gaming performance difference between NVidia and AMD. Those are the main reasons I want to switch to AMD. Now I just want to hear from someone who ALSO has Linux + AMD what their experience with Oobabooga and Automatic1111 is when using ROCm, for example.

117

23

Pygmalion-2 has been released (pygmalionai.github.io)

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/localllama

10 comments fedilink

I might be a bit late to the party, but for those of you that like ERP and fiction writing:

Introducing Pygmalion-2

The people from Pygmalion have released a new model, usable for roleplaying, conversation and storywriting. It is based on Llama 2 and has been trained on SFW and NSFW roleplay, fictional stories and instruction following conversations. It is available in two sizes, 7b and 13b parameters. They're also releasing a mix with MythoMax-L2 called Mythalion 13B.

Furthermore, they're (once again) announcing a website with character sharing and inference (later in October).

For reference: Pygmalion-6b has been a well known dialogue model for (lewd) roleplay in the times before LLaMA. It had been followed up with an underwhelming successor based on LLaMA (Pygmalion-7b). In their new blogpost they promise to have improved with their new model.

(Personally, I'm curious how it performs compared to MythoMax. There aren't many models around that excel at roleplay or have been designed specifically for that use case.)

118

18

Beating GPT-4 on HumanEval with a Fine-Tuned CodeLlama-34B (www.phind.com)

submitted 1 year ago by [email protected] to c/localllama

1 comments fedilink

119

18

Supporting the Open Source AI Community | Andreessen Horowitz (a16z.com)

submitted 1 year ago by noneabove1182 to c/localllama

4 comments fedilink

120

14

What is your favorite offline LLM for technical utility, and have you noticed anything unexpected about certain models? (lemmy.world)

submitted 1 year ago by [email protected] to c/localllama

3 comments fedilink

The main reason I ask is that my current favorite model is a Llama 2 70B Q4_1 GGML model quantized by The Bloke. Here's the thing though: it was labeled as "Instruct", but it defaults to chat in settings in Oobabooga/Textgen. Every other model I have tried to use for technical help and python/bash snippets has failed to meet my expectations for (skeptically acceptable) accuracy. This 70B is powerful enough that I can prompt it to generate code snippets, and if the code creates an error, by pasting the error into the prompt, it almost always generates a solution in a single correction. Other models I have tried to use this paste-error technique on often crash, 'dig in their heels' insisting they are correct, or fail in several different ways, like overfitting that forces resetting context tokens.

For whatever reason, the specific 70B model I am using has far exceeded my expectations, but I must use it with very specific conditions in Oobabooga/Textgen. It must be set to: chat, llama.cpp, the "divine intellect" parameter preset, and the character profile set to the default of "None."

For whatever reason, deviation from these settings ruins the accuracy of code snippets. Speculatively/intuitively, if I try to use the instruct prompt, or a new persistent character profile, it seems like there is an issue in the way the previous context is handled. In a single session the context seems to drift. In any case, code seems to always have errors and paste corrections fail.

I can't contextualize this issue with such large models. I have had the same issues with smaller models regardless of settings I have tried. I have written or modified a dozen scripts between bash and python using this 70B in chat mode. It is a bit of a pain because the prompt input/output is not proper markdown for code so I have to correct for whitespace scope and have a reasonable understanding of the code syntax, but for the most part, I don't need to make corrections to specific lines of output. Is this rare, an issue/quirk with: the model quantization, llama.cpp, Textgen, other? Has anyone else experienced something like this? Am I just super lucky to have found a chance combination that works really well at snippets combined with my prompting/coding skill level? I haven't had much success with the code specific LLMs either. I'm not sure why this model is doing so well for me.

121

21

WizardLM introduce the newest WizardCoder 34B based on Code Llama (twitter.com)

submitted 1 year ago by noneabove1182 to c/localllama

2 comments fedilink

✅WizardCoder-34B surpasses GPT-4, ChatGPT-3.5 and Claude-2 on HumanEval with 73.2% pass@1
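For context on the metric: pass@1 is the fraction of problems solved by the first generated sample. A minimal sketch of the commonly used unbiased pass@k estimator follows, where `n` samples are generated per problem and `c` of them pass the tests. The per-problem counts in the example are invented, and this is not WizardLM's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c passing."""
    if n - c < k:
        return 1.0  # too few failing samples to fill k draws without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical results: (samples generated, samples that passed) per problem.
per_problem = [(10, 10), (10, 3), (10, 0), (10, 7)]

score = sum(pass_at_k(n, c, k=1) for n, c in per_problem) / len(per_problem)
print(f"pass@1 = {score:.1%}")
```

For k=1 the estimator reduces to c/n per problem, so with a single greedy sample per problem the benchmark score is simply the share of problems solved.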

122

22

Is there a good reason why AMD APUs just aren't used with massive amounts of (V)RAM just like the Mac M2 is? (lemmy.ca)

submitted 1 year ago by [email protected] to c/localllama

14 comments fedilink

Is it just memory bandwidth? Or is it that AMD is not supported well enough by PyTorch for most products? Or some combination of those?
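On the bandwidth part of the question, a common back-of-the-envelope estimate is that generating one token requires reading roughly all model weights from memory once, so bandwidth divided by model size gives an upper bound on tokens per second. The sketch below uses that simplification; the bandwidth figures and model size are illustrative placeholders, not measurements of any specific APU or Mac.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling: every generated token streams all weights from memory once."""
    return bandwidth_gb_s / model_size_gb


model_size_gb = 40.0  # placeholder, e.g. a large model after 4-bit quantization

# Placeholder bandwidth figures for two kinds of memory systems.
systems = {
    "dual-channel desktop RAM (assumed 80 GB/s)": 80.0,
    "wide unified memory (assumed 400 GB/s)": 400.0,
}

for name, bandwidth in systems.items():
    ceiling = max_tokens_per_second(bandwidth, model_size_gb)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Under these assumptions, having enough RAM to hold the model is not sufficient on its own: the ceiling scales directly with how fast that memory can be read.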

123

14

Code Llama: Open Foundation Models for Code | Meta AI Research (ai.meta.com)

submitted 1 year ago by noneabove1182 to c/localllama

2 comments fedilink

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.

124

20

Making LLMs lighter with AutoGPTQ and transformers (huggingface.co)

submitted 1 year ago by noneabove1182 to c/localllama

0 comments fedilink

Hugging Face transformers now officially supports AutoGPTQ. This is a pretty huge deal and signals much wider adoption of quantized model support, which is great for everyone!

125

8

Jon Durbin: Finished up a first stab at LMoE - LoRA mixture of experts (twitter.com)

submitted 1 year ago by noneabove1182 to c/localllama

0 comments fedilink

The airoboros package now includes an API server similar to OpenAI chat completions.

7b/13b LMoE packages available on my 🤗

github link:

https://github.com/jondurbin/airoboros#lmoe