noneabove1182

joined 2 years ago
[–] noneabove1182 1 points 1 year ago (1 children)

I'm not really sure I follow; it's just a simplification. The most appropriate phrasing, I guess, would be "given that A belongs to B, does it know B 'owns' A", like the example given: "A is the son of B, is B the parent of A".

[–] noneabove1182 5 points 1 year ago

To start, everything you're saying is entirely correct

However, the existence of emergent behaviours like chain-of-thought reasoning shows that there's more to this than pure text prediction: the model picks up patterns that were never explicitly trained, so it's entirely reasonable to ask whether it can recognize reversed relationships.

Hallucinations are a vital part of understanding these models; they might not be a long-term problem, but getting models to recognize what they actually know to be true is extremely important for the growth and adoption of LLMs.

I think there's a lot more to the training and the generation of text than you're giving it credit for. The simplest way to explain it is as text prediction, but there's far too much depth to the training and the model to say that's all it is.

At the end of the day it's just a fun, thought-provoking post :) but when Andrej Karpathy says he doesn't have a great intuition for how LLM knowledge works (though in fairness he theorizes the same as you, directional learning), I think we can at least agree that none of us knows for sure what is correct!

[–] noneabove1182 2 points 1 year ago

Yeah, I guess I meant more that it just doesn't get nearly as much attention, but you're right, some is starting to appear and that's quite nice.

[–] noneabove1182 16 points 1 year ago (10 children)

My biggest problem with vaping coverage is that there's basically no distinction made between the e-cigarettes this article addresses and vaping dry herbs... I would love to read up on it and any possible health concerns, but it's rarely discussed.

[–] noneabove1182 3 points 1 year ago (2 children)

Woah, this is pretty interesting stuff. I wonder how practical it is to do; I don't see a repo offering a script or anything, so it may be quite involved, but it looks promising. Anything that reduces size while maintaining performance is huge right now.

[–] noneabove1182 2 points 1 year ago

Good question. At the time I made it there wasn't a good option, and the one in the main repo is very comprehensive but overwhelming; I wanted to make one that was straightforward and easier to digest, so you can see what's actually happening.

[–] noneabove1182 0 points 1 year ago

Not much detail in the article, but at least it's not slated for phones for now.

[–] noneabove1182 3 points 1 year ago

I've been using this model a bit locally and I gotta say I'm very impressed; it might finally push me to get a 3090.

[–] noneabove1182 2 points 1 year ago

Humans destroying it mainly

[–] noneabove1182 3 points 1 year ago

Agreed! A lot of amazing work is being done by those people, so it's great to see a large organization with the power to influence change get behind them. It's a very big deal for keeping the ball rolling.

[–] noneabove1182 1 points 1 year ago

Does it not? I've never received a single spam message on Telegram in my 6 years of use :S

[–] noneabove1182 5 points 1 year ago

Even Sony dropped theirs, miss it so much

 

Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!

With this code you can train the Llama 2 LLM architecture from scratch in PyTorch, then save the weights to a raw binary file, then load that into one ~simple 500-line C file (run.c) that inferences the model, simply in fp32 for now. On my cloud Linux devbox a dim 288 6-layer 6-head model (~15M params) inferences at ~100 tok/s in fp32, and about the same on my M1 MacBook Air. I was somewhat pleasantly surprised that one can run reasonably sized models (few ten million params) at highly interactive rates with an approach this simple.

https://twitter.com/karpathy/status/1683143097604243456
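The export side of this is simple in principle. As a rough sketch (not the actual export script from the llama2.c repo; the real run.c expects a specific header and tensor ordering), flattening PyTorch fp32 tensors into a raw binary could look something like:

```python
import torch

def export_fp32(model: torch.nn.Module, path: str) -> None:
    """Toy export: dump every parameter as a flat run of raw fp32 values.

    NOTE: this is only a sketch of the idea. The real llama2.c export
    writes a small header (dim, n_layers, n_heads, vocab size, ...) and a
    fixed tensor ordering that run.c expects; none of that is reproduced here.
    """
    with open(path, "wb") as f:
        for _name, param in model.named_parameters():
            param.detach().cpu().to(torch.float32).numpy().tofile(f)

# usage (hypothetical tiny model):
# export_fp32(my_tiny_llama, "model.bin")
```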

 

This model is based on llama1, so it is for non-commercial use only. Future versions will be trained on llama2 and other open models that are suitable for commercial use.

This model is uncensored. I have filtered the dataset to remove alignment and bias. This makes the model compliant to any requests. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant to any requests, even unethical ones. Please read my blog post about uncensored models. https://erichartford.com/uncensored-models You are responsible for any content you create using this model. Enjoy responsibly.
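For anyone wondering what "implement your own alignment layer" could look like in practice, a bare-bones (and entirely hypothetical) sketch is just a filter wrapped around generation; the generate callable and the blocklist below are placeholders, not anything shipped with the model:

```python
# Hypothetical sketch of a minimal "alignment layer" around an uncensored
# model served behind an API. The generate() callable and the refusal
# policy are placeholders; real deployments typically run a proper
# moderation classifier over both the prompt and the output.
from typing import Callable

BLOCKED_TOPICS = ["example-disallowed-topic"]  # placeholder policy

def guarded_generate(generate: Callable[[str], str], prompt: str) -> str:
    lowered = prompt.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "Sorry, I can't help with that."
    output = generate(prompt)
    if any(topic in output.lower() for topic in BLOCKED_TOPICS):
        return "Sorry, I can't help with that."
    return output
```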

Quants can of course be found from TheBloke:

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GGML

https://huggingface.co/TheBloke/Dolphin-Llama-13B-GPTQ

 

https://github.com/noneabove1182/text-generation-webui-docker (updated to 1.3.1 and has a fix for gqa to run llama2 70B)

https://github.com/noneabove1182/lollms-webui-docker (v3.0.0)

https://github.com/noneabove1182/koboldcpp-docker (updated to 1.36)

All should include up-to-date instructions. If you find any issues, please ping me immediately so I can take a look, or open an issue :)

 

This is Llama 2 13b with some additional attention heads from original-flavor Llama 33b frankensteined on.

Fine-tuned on ~10M tokens from RedPajama to settle in the transplants a little.

Not intended for use as-is - this model is meant to serve as a base for further tuning, hopefully with a greater capacity for learning than 13b.

 

Disclaimer: take this with a grain of salt. It is heavily researched, but only by me, and I am no expert; I will link the relevant studies at the bottom.

For starters, a quick summary of the quantization process for anyone who needs some background. Basically we’re trying to convert these really accurate floating point 16-bit (FP16) weights to 4-bit values to save on size. Saving size is important because we want to fit really big models into a smaller amount of VRAM. To do this, we take a weight, quantize it, then observe the loss of the model, adjusting all non-quantized weights accordingly in an attempt to minimize this loss. It’s important to note that the numbers you see on models - 7B, 13B, 33B etc - correspond to the total number of weights, so doing this one at a time is… not ideal. Also note that these weights are represented in matrices, with one row for every output of the previous layer and one column for every input in the current layer, so in this context a “weight vector” represents the set of weights connecting a neuron to each neuron in the previous layer. There were some proposed solutions similar to what we have now, but I’ll skip them for a semblance of brevity.
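To put the size savings in perspective, here's the back-of-the-envelope arithmetic for a 13B model (ignoring the small overhead of scaling factors and any layers left unquantized):

```python
params = 13_000_000_000              # "13B" = ~13 billion weights
fp16_gib = params * 2 / 1024**3      # 2 bytes per weight   -> ~24.2 GiB
int4_gib = params * 0.5 / 1024**3    # 0.5 bytes per weight -> ~6.1 GiB

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```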

Okay, that’s the backstory. Onto the main story…

Enter Binary-Coding Quantization, a “versatile non-uniform quantization scheme.” The concept is that, when quantizing a weight vector, we squash the values onto the 4-bit grid by dividing them all by a scaling factor (equal to the largest value in the weight vector divided by the largest value the quantization can represent) and rounding, and we save that scaling factor for when we later need to de-quantize.
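In code, that scale-and-round step looks roughly like the plain absmax version below. The actual binary-coding scheme described in the linked paper is more involved, so treat this as a simplified sketch of the core idea:

```python
import numpy as np

def quantize_absmax_4bit(w: np.ndarray):
    """Quantize one weight vector to signed 4-bit ints with a single scale."""
    qmax = 7                              # signed 4-bit range is [-8, 7]
    scale = np.abs(w).max() / qmax        # saved for de-quantization
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```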

From here, we observe that weight vectors can be grouped in such a way that their scaling factor can be shared across them with minimal loss. This is where groupsize comes in: it indicates how many weight vectors are grouped together to share a single scaling factor. This means that the more of them we group together, the less information we need to save, but conversely the more information is lost, since accuracy has to be sacrificed for everything in the group to share the same value.
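Extending that sketch, groupsize just means one scale gets stored per group of values instead of one per vector. A toy version (reusing the hypothetical quantize_absmax_4bit from above, and flattening everything to 1-D for simplicity) might look like:

```python
import numpy as np

def quantize_grouped(w: np.ndarray, groupsize: int = 128):
    """Toy grouped quantization: one shared scale per `groupsize` weights."""
    qs, scales = [], []
    for start in range(0, w.size, groupsize):
        group = w[start:start + groupsize]
        q, scale = quantize_absmax_4bit(group)   # from the sketch above
        qs.append(q)
        scales.append(scale)
    # larger groupsize -> fewer scales stored (smaller file),
    # but each scale has to cover more values -> more rounding error
    return np.concatenate(qs), np.array(scales, dtype=np.float32)
```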

And that’s it! You now know what groupsize means and why it changes the model’s size!

Next up is actorder.

Activation Order is a method of examining which weight vectors make most sense to quantize first in order to maintain important information. Originally this was done by greedily selecting whichever weight vector would result in the least loss, but it was observed that this method is actually barely better than random selection. With this in mind, a new solution was proposed. We start by observing which columns have the largest activation magnitude, that is, the weight vectors which most contribute to the final output of the model because they cause the most neuron activations.

After gathering that information, we start our quantization with those values, because that means they will most closely reflect their original values after the full quantization is done. Remember, after we quantize a vector, that’s it, it’s locked in. That means that if we left some of our important vectors until the end, not only might they have been adjusted several times during the process, but, more importantly, there would remain very few extra columns that we could adjust to make up for the quantization loss. So starting with these values, i.e. act-order or desc_act (the terms are used interchangeably), should result in a minor increase in performance.
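As a rough illustration of the ordering step: collect activation statistics from a small calibration set and quantize columns in descending order of how strongly they fire. Real implementations derive this ordering from calibration statistics (e.g. squared inputs), so treat the function below as a hypothetical sketch rather than the actual algorithm:

```python
import numpy as np

def act_order(calib_activations: np.ndarray) -> np.ndarray:
    """Return column indices sorted by descending activation magnitude.

    calib_activations: (n_samples, n_columns) inputs to the layer,
    gathered by running a small calibration set through the model.
    """
    magnitude = np.square(calib_activations).sum(axis=0)
    return np.argsort(-magnitude)   # biggest contributors first

# the quantizer then walks the columns in this order, so the most
# influential ones are locked in early and the rest absorb the error
```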

Side note: I’m not positive at this time why it results in an increase in model size. My best guess is that it involves rearranging the vectors in memory in ways that are no longer optimal and can’t be mapped into VRAM without wasting space, but that’s a pure guess and I would love it if someone chimed in with more info. My other guess is that groupsize is either not applied to those sensitive weight vectors, or that it’s applied more selectively (grouping sensitive vectors with non-sensitive ones) and that difference accounts for the change. If anyone has any ideas, please feel free to enlighten me.

And that’s it! To sum it up, group size means quantizing in groups rather than individually, resulting in smaller models that are quantized faster, and act order means to quantize in order of activation magnitude to try to preserve as much of the important information as possible.

If you stuck through that wall of text, thanks! I hope it was insightful (and accurate)

Sources:

https://arxiv.org/abs/2206.09557 (group size explanation)

https://arxiv.org/abs/2306.02272 (act order explanation)

24
Llama 2 - Meta AI (ai.meta.com)
submitted 1 year ago* (last edited 1 year ago) by noneabove1182 to c/localllama
 

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost O(1) inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at this https URL.
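The recurrent form is the easiest one to picture: the whole context gets folded into a fixed-size state, which is why decoding is O(1) per token. Below is a toy single-head sketch of that update, following the paper's S_n = γ·S_{n-1} + K_nᵀV_n with output Q_n·S_n, but leaving out the multi-scale decay, rotations, and normalization, so treat it as illustrative only:

```python
import numpy as np

def recurrent_retention(q, k, v, gamma: float = 0.9):
    """Toy single-head recurrent retention: O(1) state per step.

    q, k, v: (seq_len, d) projections of the input sequence. The paper
    adds per-head decay rates, xpos-style rotations, and group norm,
    all of which are omitted here.
    """
    seq_len, d = q.shape
    state = np.zeros((d, d))          # fixed-size state, independent of seq_len
    outputs = np.zeros((seq_len, d))
    for n in range(seq_len):
        state = gamma * state + np.outer(k[n], v[n])   # K_n^T V_n
        outputs[n] = q[n] @ state                      # Q_n S_n
    return outputs
```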

 

https://github.com/noneabove1182/text-generation-webui-docker

https://github.com/noneabove1182/lollms-webui-docker

https://github.com/noneabove1182/koboldcpp-docker

All should include up-to-date instructions. If you find any issues, please ping me immediately so I can take a look, or open an issue :)
