this post was submitted on 30 Jun 2023
15 points (100.0% liked)
LocalLLaMA
2268 readers
1 users here now
Community to discuss about LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
That's what llama.cpp and kobold.cpp do, the KV cache is the last thing that gets offloaded so you can offload weights and keep the cache in RAM. Although neither support SuperHOT right now.
MQA models like Falcon-40B or MPT are going to be better for large context lengths. They have a tiny KV cache so even blown up 16x it's not a problem.