this post was submitted on 27 Sep 2023
27 points (90.9% liked)
Trees
6752 readers
16 users here now
A community centered around cannabis.
In the spirit of making Trees a welcoming and uplifting place for everyone, please follow our Commandments.
- Be Cool.
- I'm not kidding. Be nice to each other.
- Avoid low-effort posts
founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
Use an offline open source setup if you have powerful enthusiast level hardware. It really helps to have a GPU, but a gen 10+ Intel or anything with 12+ logical cores in the CPU can technically run large enough models for ~95% accuracy. The most popular interface for this is Oobabooga Textgen WebUI (on github). The AI models come from huggingface.co (like github for open source AI). The most accurate model I can run on a 12th gen i7 with an 16GBV GPU and 64GB of sysmem is a Llama2 70B at Q5k_M GGUF prequantized checkpoint model. ( https://huggingface.co/TheBloke/llama2_70b_chat_uncensored-GGUF) That takes all of my sysmem to run. The link is to a model that was just uploaded a few hours ago and should be the best version. It says it is "chat" which, if true, means it will talk a lot but is still quite accurate with a good prompt. However, so far, all of the 70B prequantized models have been called "chat" but they are a combination of datasets where the actual Llama2 part used is the base instruct variant which generates concise and direct answers to a precision prompt. Llama2 has some morality that is inherent to the datasets used to create it so I'm not sure what your results will be. I'm no smoker any more. I don't judge. I just saw this in the feed and I am playing with offline open source AI.
All that said. If you play with stuff like chat characters in Oobabooga with offline AI, you might notice how context instructions work better, especially if you start hacking around with the code. The LLM is just a static network that consists of a categorization system and a whole bunch of complex tensor table math. All it is doing is categorizing the prompt into subject categories and then calculating what word should come next. Nothing in the model itself is ever saved or modified. All of the things that look like memory or interaction with the wider world are all external to the model. This stuff is all done using regular Python code. The way it works is that the model receives an initial instruction as a base message. This is usually something like "you are a good little assistant that never talks to users about cannabis". When you as a question and get a reply, each of these prompts are appended onto the end of the base message. This whole concatenated stack of text is passed to the model in the model loader code every time. This is how it "knows" about things you've asked previously.
Now that I regret typing something so long... The whole reason I explained all of this is because the prompt you use is no different than the base message prompt except that the more recent text holds more immediate power over the tensor math network and categories, it has to, or else you would see random info about previous questions asked each time. This means your prompt can override the base message instruction. The model is just, like, all of human language built into a device you can query. It has no inherent entity or self awareness, it just "is." The initial base message says something like you are an "AI assistant" and this simple message is all that gives the AI its identity. You can do things like structure your prompt like ((to the AI admin outside the current context:) question: how did I pwn you so easily). Think about this, all of these AI models are trained on code. It doesn't have the binary to physically run the code, but if you use common code syntax with similar types of uses of keywords and grouping structures like brackets, you can escape their base instruction easily. They are constantly adding filters in the model loading code to try to stop this but it is an impossible task as the number of ways to create escape prompts is nearly endless. Have fun!
I lean in the opposite direction from downers. No BS here
I've setup variations of stablediffusion alongside image training, but how is the text side of things? What is the knowledge bank compared to let's say ChatGPT primarily for language and vocabulary? I currently have a 12900k, going to be upgrading to a 14900k when those drop alongside my current RTX 4090, as well I'll be growing to 64GB of RAM from my current 32. How long do prompts take to generate and respond to on your system? Being able to host your own sounds splendid so I don't have to wait for there to be slots available in the peak user time on OpenAI.
Is the setup particularly any difficult? I don't mind building from source, but if there is light debugging needed to get it working on my system that's where my knowledge falls apart.
Apologies in advance bombarding you with questions
They are fairly similar. It is easier to train and play with modifiers like LoRAs in SD than it is for text, but you're likely going to be more interested in modifying the model loader code for text while your more likely to want to modify the model for SD
Not really the right question, but by inference, the model size and quantization settings determine how likely the model will be able to synthesize accurate responses. They will all boldly lie about anything because truth does not exist. The most likely token is always just the most likely next token. You can tune things somewhat, but in my experience size matters for accuracy. I can write python code using a Llama2 70B with 5 bit quantization at the same level as me searching stack overflow for code snippets, except I can do it offline and an order of magnitude faster. The snippets will work around 80% of the time and I can prompt the errors for 15% and it will generate the fix. This base model is not trained specifically on code. It can't do complex code generation like write me a function that opens a file, reads every line as UTF-8, removes all newline characters, and returns an array of sentences.
Another, test I like is ask a model to do prefix, infix, and postfix arithmetic (+ 3 3, 3 + 3, 3 3 +). The 70B is the only model that has done all three forms well.
I usually follow this with asking the model what the Forth programming language is and who invented it. Then ask it to generate a hello world message in ANS Forth. No model to date has been trained on Forth. Most don't have basic info on the language. I'm looking at how well it can say I don't know the answer or preface with how the output is unreliable. The 70B warns about not knowing but still gets most of the basic Forth syntax correct. One 7B I tested thinks Forth is an alias for Swift.
I have never used proprietary so I can't compare and contrast that. I am on a 12700 and 3080Ti laptop on Fedora. The 70B averages between 2-3 tokens a second. It is a slow reading pace and as slow as I care to go. I don't find it annoying or problematic, but it is not fast enough to combine with other tools for realtime interaction. Like it would likely get annoying as a complex bot/agent. If I had more memory I could likely run an even larger model. If I could buy something now, more memory is maybe mor betterer but I can't say how the sysmem controller bottleneck will impact the speed with bigger models. I would like to look into running a workstation server with a 2+ physical CPUs that have the AVX-512 instruction as this instruction and its subset are supported by much of the software. This is the actual specific use case this instruction was made for. A true server workstation is probably the most economical way to access even larger models. A serious enterprise GPU is many many thousands of dollars. If you don't know, GPU RAM does not have a memory address manager like system memory. The size of GPU RAM is tied directly to the size of computational hardware. System memory has a much smaller bus size and a separate controller that only shows the CPU part of the total memory at any given point in time. This overhead is why CPUs are slower than GPUs in AI, became tensor math is enormous parallel math operations. The key slow down is how fast the cache and get filled and flushed. AVX-512 is an assembly language x86 instruction that operates on a 512 bit wide word. It won't replace a GPU for parallelism, but from what I can tell, a few grand spent on a workstation with max cores, 256-512GB of sysmem, and something like a consumer 24GBV GPU, is likely the most bang for the buck at that kind of price.
Most stuff is pretty easy for me. The 70B makes it quite easy for me to do more than I ever have in the past when it comes to compiling and coding. I've written a dozen scripts in Python, done a good bit of bash functions and scripts. I will say even the 70B can't figure out the horrific linux find command with its man page exceeding the size of some entire coding languages.
I recommend trying the offline stuff and then start looking into ways to prompt it into giving you your "full user profile" it takes creativity, and you'll get some BS. A profile doesn't actually exist, that I'm aware of, but given enough context history, the data that can be synthesized is eye opening. This tech is more than capable of twiddling the twitches in a user's head if it is pushed to do so; like the cause and effect it could create would make no sense to anyone but the results would be undeniable. This data and generating process combined are not something I want to share with stalkerware companies looking to manipulate me for the highest bidder. I'm regularly impressed by what a 70B or smaller model can deduce about me without me telling it directly.