A lot of the speed depends on hardware. Generally, in my experience so far the most accurate models are the largest you can run in 4 bit. I can barely run a Llama2 70B GGML in 4 bit with a 16 layers on a 3080Ti and everything else on 64GB of DDR5. A solid block of 200-300 words takes about 3 minutes to complete, but the quality of the results is well worth it. I also use a WizardLM 30B with 4 bit GGML. It takes around 2 minutes for an equivalent output. Anything in the 7-20B range is like asking the average teenager for technical help. It is possible to have a functional smalltalk conversation with one, but don't hand them your investment portfolio, ask them to toss a new clutch in the car, or secure a corporate server rack even if they claim expertise. Maybe with some embeddings and a bunch of tuning better results are possible. I have only tried 2 13B's a dozen 7B's half a dozen ~30B's and 2 70B's.
this post was submitted on 14 Aug 2023
10 points (91.7% liked)
LocalLLaMA
2293 readers
2 users here now
Community to discuss about LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
founded 2 years ago
MODERATORS
I'm still on 3060Ti, but then speed isn't my biggest concern. I'm primarily focused on reasonably accurate "understanding" of the source material. I got pretty good results with GPT 4, but I feel like focusing my training data could help avoid irrelevant responses.
Can you give us a link to the software you use? Or is the ttrpg assistant something you develop?
Edit: Sry. I didn't get it. You write you're feeding rulebooks into privateGPT. I somehow thought you had something roleplay specific.