LocalLLaMA

Vicuna-33B-1-3-SuperHOT-8K-GPTQ (huggingface.co)

submitted 2 years ago by [email protected] to c/localllama

[–] actuallyacat 3 points 2 years ago

That's what llama.cpp and kobold.cpp do: the KV cache is the last thing that gets offloaded to the GPU, so you can offload the weights and keep the cache in RAM. Neither supports SuperHOT right now, though.

Multi-query attention (MQA) models like Falcon-40B or MPT are going to be better for large context lengths. Their KV cache is tiny, so even blown up 16x it's not a problem.
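
To put rough numbers on that, here's a back-of-the-envelope sketch of how KV cache size scales. The formula is the usual one (keys plus values, per layer, per KV head, per token). The function name `kv_cache_bytes` is made up for this example, and the layer/head counts below are assumptions based on the published model configs as I remember them, so treat them as illustrative rather than exact.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a single sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


GIB = 1024 ** 3

# Assumed shape for a LLaMA-33B-style model with full multi-head attention:
# 60 layers, 52 KV heads (one per query head), head dimension 128.
mha_8k = kv_cache_bytes(n_layers=60, n_kv_heads=52, head_dim=128, context_len=8192)

# Assumed shape for a Falcon-40B-style model that shares a few KV heads
# across many query heads: 60 layers, 8 KV heads, head dimension 64.
shared_kv_8k = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=64, context_len=8192)

# Strict MQA (a single KV head) shrinks the cache further still.
mqa_8k = kv_cache_bytes(n_layers=60, n_kv_heads=1, head_dim=64, context_len=8192)

print(f"multi-head attention, 8K context: {mha_8k / GIB:.2f} GiB")
print(f"8 shared KV heads, 8K context:    {shared_kv_8k / GIB:.2f} GiB")
print(f"single KV head, 8K context:       {mqa_8k / GIB:.2f} GiB")
```

Under these assumed shapes the full multi-head cache is around 12 GiB at 8K context in fp16, while the shared-KV-head variant stays under 1 GiB. The cache grows linearly with context length, so stretching the context multiplies both numbers by the same factor, but the small one stays manageable.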