I am just learning in this space and could be wrong about this, but... The GGML and GPTQ models are nice for getting started with AI in Oobabooga. The range of available models is odd to navigate, and it's hard to understand how they compare across all the different quantization types, settings, and features. I still don't understand a lot of it. One of the main things I didn't (and still don't fully) understand is how some models do not state a quantization like GGML/GPTQ, but still work using Transformers. I tried some of these by chance at first, then avoided them because they take longer to load initially.
Yesterday I created my first LoRAs and learned through trial and error that the only models I can use to train a LoRA are the ones that use Transformers and can be set to 8-bit mode. Even using GGML/GPTQ models with 8-bit quantization, I could not use them to make a LoRA. It could be my software setup, but I think there is either a fundamental aspect of these models I haven't learned yet, or it is a limitation of Oobabooga's implementation. Either way, the key takeaway: to make a LoRA, load a Transformers-based model in Oobabooga, and be sure the "load in 8 bit" box is checked.
I didn't know what to expect with this, and haven't come across many examples, so I put off trying it until now. I have a 12th-gen i7 with 20 logical cores and a 16GB 3080 Ti in a laptop. I can convert an entire novel into a text file and load it as raw text (the raw text file tab) for training in Oobabooga using the default settings. If my machine has some assistance with cooling, I can create the LoRA in 40 minutes using the default settings and a 7B model. This has a mild effect. IIRC the default rank of the LoRA network (what I'd called the "weight") is 32. If this is turned up to 96-128, it will have a more noticeable effect on personality. It still won't substantially improve Q&A accuracy, but it may improve quality to some extent.
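To give a sense of what that rank number controls: a LoRA leaves the base weights frozen and trains a small low-rank update on top of each weight matrix. Here's a toy numpy sketch (illustrative only, not Oobabooga's code; the dimensions and alpha value are just assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4096     # hidden size of a 7B-class model (illustrative)
rank = 32    # the default rank; the post bumps this to 96-128

# Frozen base weight (stand-in for one projection matrix in the model).
W = rng.standard_normal((d, d)).astype(np.float32)

# The LoRA adapter is just two skinny matrices; only these get trained.
A = rng.standard_normal((rank, d)).astype(np.float32) * 0.01
B = np.zeros((d, rank), dtype=np.float32)  # B starts at zero, so the LoRA starts as a no-op

alpha = 64  # scaling; the effective update is (alpha / rank) * B @ A
W_adapted = W + (alpha / rank) * (B @ A)

# Rank controls adapter size and capacity, which is why raising it
# to 96-128 makes the LoRA's effect more noticeable.
base_params = W.size            # 16777216
lora_params = A.size + B.size   # 262144 at rank 32
print(base_params, lora_params)
```

So even at rank 128 the adapter is a tiny fraction of the base model, which is why LoRA training fits on a single laptop GPU at all.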
I first tested with a relatively small Wikipedia article on Leto II (Dune character), formatted for this purpose manually. This didn't change anything substantially. Then I tried with the entire God Emperor of Dune e-book as raw text. This had garbage results, probably due to all the front matter before the book even starts and the terrible text formatting extracted from an e-book. The last dataset I tried was the book text only, with everything reflowed using a Linux bash script I wrote to alter newline characters, spacing, and page gaps. Then I manually edited with find-and-replace to remove special characters and any formatting oddballs I could find. This was the first LoRA I made where the 7B model's tendency to hallucinate seemed more evident than issues with my LoRA. For instance, picking the name of a minor character that occurs 3 times across 2 sentences of the LoRA text and prompting about it results in random, unrelated output. The overall character identity is also weak, despite a strong character profile and a 1.8MB text file for the LoRA.
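The reflow step looks roughly like this (a Python sketch of the idea; my actual script was bash, and the regexes here are just one way to do it):

```python
import re

def reflow(text: str) -> str:
    """Rough e-book cleanup: collapse page gaps, unwrap hard line
    breaks inside paragraphs, and squeeze leftover spacing."""
    # Collapse runs of 2+ blank lines (page gaps) into one paragraph break.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    # Join single newlines inside a paragraph into spaces.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squeeze repeated spaces/tabs left over from the extraction.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = "It was\na dark\nnight.\n\n\n\nA new\nparagraph."
print(reflow(sample))  # "It was a dark night.\n\nA new paragraph."
```

Special characters and one-off formatting oddballs still need the manual find-and-replace pass afterward.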
This is just the perspective from a beginner's first attempt. Actually tuning this with a bit of experience will produce far better results. I'm just trying to say, if you're new to this and just poking around, try making a LoRA. It is quite easy to do.
FYI, quantization is scaling those models down to a much lower resolution: usually from long floating-point numbers to 8-bit (that is, integer numbers from -128 to 127), or with GGML even lower. 4-bit is only 16 different numbers. They do a bit of trickery there to get the most out of it.
You get a big speedup and can fit many more of those simpler numbers into memory this way.
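To make that concrete, here is a toy symmetric 8-bit round-trip in numpy (a sketch only; real GGML/GPTQ schemes quantize per block and are considerably cleverer):

```python
import numpy as np

# Toy symmetric 8-bit quantization: map float32 weights onto the
# integer range -127..127 with a single scale factor for the tensor.
weights = np.array([0.031, -0.42, 0.0017, 0.99, -1.2], dtype=np.float32)

scale = np.abs(weights).max() / 127.0   # one float kept alongside the int8 data
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale  # what the model actually computes with

print(q)        # small integers, 1 byte each instead of 4
print(dequant)  # close to the originals, but not identical
print(np.abs(weights - dequant).max())  # the quantization error: lossy
```

Each weight now costs 1 byte instead of 4, which is where the memory savings come from, at the price of that small reconstruction error on every value.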
Ticking the 'Load in 8bit' box does that conversion on the fly. Downloading an already pre-quantized GPTQ gets you to the same point, but you don't need to handle the large original file in the first place, and the conversion process is a bit more sophisticated.
If a quantization is not mentioned, it's probably the original, non-shrunken-down version.
But mind that this is a lossy process. Typically, people who want to continue processing the model itself (i.e., fine-tune it, build LoRAs, etc.) take the full-resolution 'original' version to do so, not the low-resolution quantized one. But then you're back at two or four times more data to fit into your graphics card.
I'm not sure what Oobabooga does in the background: whether it takes the quantized version for the LoRA, or the original if you used 'load in 8bit'.
I think I read a paper about LoRA on quantized models. I don't know how it works, but it seems to be possible nowadays. I'm not sure about the quality implications. Usually, if you start with something low-resolution and use that 'degraded' data to modify things, the result won't be perfect. But that might not be a concern of yours if the result is good enough and you don't need five-figure hardware to pull it off.
Idk. If you're trying to go anywhere with it, maybe read up on it. There is a free ML course on Hugging Face, and there are lots of guides and other info scattered around in places like this. I'm very aware this is a steep learning curve; I also started without any knowledge of LLMs in June when Llama took off, and I'm not a pro.
If I might ask: what is your motivation behind training a LoRA? I mean, except for doing it for the sake of it. Do you want to generate some literature and get somewhere? Have you tried asking it politely to generate a continuation of the XY saga? Maybe it knows the fandom well enough. You could even provide it with one or two paragraphs and see if it picks up on the writing style.
Have fun and keep posting about your adventures...
Hey, thanks for the info. Probably the biggest mystery for me still is what "loading shards" means in the terminal output when a full model with transformers is loaded.
My goal right now is to learn the differences between LoRAs and embeddings in practice. I want to get further into the computer science curriculum on my own. I tend to get hung up on some point, and get lost in the weeds trying to find answers to my questions.
I want to explore the potential to create both a professor type of model loaded with a few books, courseware, and transcribed lectures, and a student type of model that only has access to information as I encounter it.
I probably won't achieve much with these naive objectives, but I'll likely learn a thing or two along the way. I get the impression that this method of creating a model could be quite powerful. It seems like highly curated and well-tested tuning for purpose-built models is the future. I think individualized education is probably the most powerful potential of LLMs.
As a peripheral curiosity, I want to know how an LLM might interact with the Forth programming language. Everything in Forth can be made into a single word, or token. The language is threaded, with an interpreter. I want to know if combining these two makes something completely new. Like, can Forth give an LLM persistent memory, or more?
Messing with Dune stuff is just a way to explore the basics of how modifying a model works.
A model is usually something like tens of gigabytes in size. To make it easier to store on older file systems and to distribute the amount of data, the one giant file gets split up into several chunks, i.e. several smaller files. Those fragments are called 'shards'. If your frontend says "loading shards", it is reading all those ...part1, ...part2, ...part3 files into memory and recombining them. I think they kinda hijacked the term 'shard' from the database people, idk.
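If you want to picture it, here's a toy sketch loosely modeled on the index-file scheme Hugging Face Transformers uses for sharded checkpoints (plain lists stand in for tensors; the file and tensor names are made up):

```python
import json

# Two fake shard files, each holding part of the model's weights.
shards = {
    "model-00001-of-00002.bin": {"embed.weight": [1, 2], "layer0.weight": [3, 4]},
    "model-00002-of-00002.bin": {"layer1.weight": [5, 6], "head.weight": [7, 8]},
}

# The index file maps every tensor name to the shard that contains it.
index = {"weight_map": {name: fname
                        for fname, tensors in shards.items()
                        for name in tensors}}

# "Loading shards" is essentially this loop: open each chunk in turn
# and merge its tensors back into one state dict.
state_dict = {}
for fname in sorted(set(index["weight_map"].values())):
    state_dict.update(shards[fname])  # a real loader would deserialize the file here

print(json.dumps(index["weight_map"], indent=2))
print(sorted(state_dict))
```

So the shards carry the data, and a small index on the side says which tensor lives where; once everything is merged you have the same single model you'd have gotten from one giant file.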
LoRAs and embeddings are two entirely different things.
I like computer science myself. I think it's a fascinating subject. Be aware it sometimes has a steep learning curve. You'll experience disappointment. And sometimes it's hard work (and determination) to learn the basics first in order to do things properly. And you unfortunately(?) also picked one of the more complicated topics. Don't get discouraged. You'll definitely learn things along the way! 🙃 If you run into problems: try to learn it in a structured way. Get a good book or one of the good(!) free online courses. There are many people who try to do it themselves and end up getting stuck. Instead, you'll want someone who has a good understanding of computer science (and knows how to teach) to tell you in which order to learn things. If you do it randomly, you might be setting yourself up for failure. It's hard work... However: you're allowed to have fun and play around. Just be aware of that fact.
Yeah. I've heard about that knowledge distillation and models learning from bigger models. I'm not really an expert, though. But there are a few scientific papers about that idea, out there.
Yeah. I think they can become a powerful tool to assist in teaching. And education is super important. But it's definitely still a long way before they can do more than grade your assignments and help you find your mistakes without your teacher. I wouldn't want a current AI to teach me facts (they come up with fake facts all the time), nor would I trust its ability to teach things in a reasonable manner.
I don't know much about Forth. I know how stack-machines work. I'm not sure if that aligns in any way with how transformer language models work so the two of them would develop something like 'symbiosis'. But maybe I didn't think about it enough.
Thanks for the reply. That makes sense about the shards.
As far as tuning with LoRAs goes, I have very limited expectations. I plan on trying a LangChain database soon, and I have higher expectations for that experiment.
Forth is interesting because you can make basically anything a word. Like, I can make a word that is a pointer to the flags in a register, or a word that is a register. I can make a word that can be called, consists of two previously assigned words, and takes the word for the flag state and copies it to the word for the register. In Forth, everything can be a single word, and words can be combined all the way up to a complete operating system. Overall it is very linear, with very little syntax; it is very much a language about what word comes next. My curiosity is what happens if an LLM is given an objective in the Forth interpreter, where Forth can influence the context tokens and a simple conditional-branching program can prompt the LLM to iterate solutions.
Like, let's say I want a bash find command to do something very specific, there is a sandbox terminal to test with, and the LLM has an accessible database of the manpage and help message and is trained on Stack Overflow data. I can already try a model like this and it will give me a command, but it won't work most of the time. The command will be ~80% correct. If I alter the prompt, I get a different 80%, with the error in a different place. So the correct info is present, but I can't access it.

So what happens if Forth could prompt the first question, then test the results and conditionally branch? Maybe it reframes the previous command's output as a prompt to correct the bad command. Maybe it stores the part of the command that works and prompts further. Maybe it goes meta and prompts the LLM to make a new Forth word to test and execute. Once the objective is reached, the Forth interpreter embeds the working word into the LLM with a strong weight that denotes its programming power and tested effectiveness. Now the LLM has a way to call a Forth word that does something effective.

It could be like adversarial machine learning, but harnessing an LLM and the hardware in a way that lets it make progress, self-correct, and store the results. Forth takes away most of the issues of programming syntax and complexity associated with generating code. The required syntax for Forth can be self-generated with a single word used to create it. The power of Forth is that EVERYTHING can be made into a single word.
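The test-and-retry core of that idea can be sketched without any Forth at all. Here is a minimal Python mock-up of the generate, test, re-prompt cycle; `ask_llm` and `run_in_sandbox` are hypothetical stand-ins (not real APIs), hard-coded so the loop's behavior is visible:

```python
def ask_llm(prompt):
    # Stub model: pretend it needs one round of error feedback before
    # it quotes the glob correctly (the "~80% correct" situation).
    if "stderr" in prompt:
        return "find . -name '*.txt' -mtime -7"
    return "find . -name *.txt -mtime -7"  # unquoted glob: the flawed first answer

def run_in_sandbox(command):
    # Stub sandbox: flags the unquoted glob as a failure.
    if "'*.txt'" in command:
        return True, ""
    return False, "find: paths must precede expression"

def solve(task, max_tries=5):
    """Generate a command, test it, and feed failures back as new prompts."""
    prompt = task
    for _ in range(max_tries):
        command = ask_llm(prompt)
        ok, stderr = run_in_sandbox(command)
        if ok:
            return command  # a verified result, worth storing for reuse
        # Conditional branch: reframe the failure as the next prompt.
        prompt = f"{task}\nPrevious attempt: {command}\nIts stderr: {stderr}\nFix it."
    return None

print(solve("Find .txt files modified in the last week."))
```

The Forth angle in the post would replace this Python driver with Forth words, and the final step (storing the verified word back into the model) is the speculative part; the loop itself is just ordinary conditional branching around an LLM call.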