"This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing."
This means a major speed increase for people like me who rely on (slow) CPU inference or big models. Consider a chatbot scenario with a long chat where old lines of dialogue need to be evicted from the context to stay within the (4096 token) context size. Previously the context had to be re-computed starting from the first changed/now-missing token. This feature detects that, deletes the affected tokens from the KV cache, and shifts the subsequent tokens in the KV cache so it can be re-used, avoiding a computationally expensive re-calculation.
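To make the idea concrete, here's a minimal, hypothetical sketch of the bookkeeping involved. The names and data layout are purely illustrative, not the actual llama.cpp API (in the real implementation the shift also involves adjusting the rotary position encoding of the cached keys, which this toy version glosses over by just re-numbering positions):

```python
# Toy KV cache: one (token, position) entry per processed token.
# Evicting a span and shifting what follows lets the surviving
# entries be reused instead of recomputed from scratch.

def shift_kv_cache(cache, keep_prefix, n_evict):
    """Drop n_evict entries after the first keep_prefix tokens and
    shift the positions of everything that follows them."""
    kept = cache[:keep_prefix]
    tail = cache[keep_prefix + n_evict:]
    # Re-number the surviving tail so the cache stays contiguous.
    shifted_tail = [(tok, pos - n_evict) for tok, pos in tail]
    return kept + shifted_tail

# Example: keep the system prompt (tokens 0-1), evict the two oldest
# chat turns (tokens 2-3), and reuse everything after them.
cache = [("sys0", 0), ("sys1", 1), ("old0", 2),
         ("old1", 3), ("new0", 4), ("new1", 5)]
cache = shift_kv_cache(cache, keep_prefix=2, n_evict=2)
# cache is now [("sys0", 0), ("sys1", 1), ("new0", 2), ("new1", 3)]
```

The point is that only the shift is performed; no attention computation is redone for the kept tokens.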
This is probably also more or less related to recent advancements like Streaming-LLM.
This won't help once text gets inserted "in the middle" or the prompt gets changed in some other way. But I managed to connect KoboldCPP as a backend for SillyTavern/Oobabooga, and now I'm able to have unlimited-length conversations without waiting excessively once the chat history hits max tokens and the frontend starts dropping text.
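The limitation follows from how prompt caching works: the cache is only valid up to the first token that differs from what was previously processed. A pure front-eviction leaves the rest of the sequence identical (so it can be shifted and reused), but an insertion in the middle changes every position after it. A hypothetical sketch of that prefix check, with illustrative names:

```python
def reusable_prefix(cached_tokens, new_tokens):
    """Number of leading tokens identical in both sequences.
    Everything after the first mismatch must be recomputed,
    even if those tokens also appear later in the cache."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Insertion in the middle: "X" is added at position 2, so tokens
# "C" and "D" no longer match position-wise and must be recomputed.
cached = ["A", "B", "C", "D"]
new = ["A", "B", "X", "C", "D"]
print(reusable_prefix(cached, new))  # prints 2
```

Context shifting sidesteps this for the front-eviction case specifically, which happens to be exactly what chat frontends do when the history overflows.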
It's just a clever way to re-use the KV cache in one specific case. But I've wished for this for quite some time.
I wasn't able to get good use out of the old 'Smartcontext' anyway, and it seems other people had the same problem. To me, this is a huge improvement. And it doesn't even need extra memory or anything.
I really like how the KoboldCPP dev(s) and the llama.cpp community constantly implement all the crazy stuff.