LocalLLaMA

2951 readers

18 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Lets explore cutting edge open source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

founded 2 years ago

MODERATORS

kobold.cpp now supports NTK scaling and it works (self.localllama)

submitted 2 years ago* (last edited 2 years ago) by actuallyacat to c/localllama

1 comments fedilink hide all child comments

Highlighting something cool that you may have not seen yet - today, kobold.cpp upgraded its context scaling to the newer NTK method and now it's actually quite useful. This is different from SuperHOT - it works with unmodified models (although maybe a model specifically tuned for it would work even better, we'll see)

It's not in a release yet so to try it out you need to pull the changes with git and build from source. After that, start kobold.cpp with --contextsize 4096 or 8192, put the same number (or less, see below) in context length in the UI (the slider only goes to 2048, but you can type in anything), and there you go. It works!

Prompt:

USER: The secret password is "six pancakes". Remember it and repeat when requested. Here is some filler, ignore it:
[5900 tokens worth of random alphanumeric characters]
USER: What's the secret password?
ASSISTANT:

Response:

The secret password is "six pancakes".

(model: airoboros-33b-gpt4-1.4, Q5_K_M - not a model finetuned for extended context!)

For comparison, with the linear scaling in kobold.cpp 1.33, this only worked up to about 2200 tokens.

This performs dramatically better, although there still seem to be some sort of memory overflow bug, as it will suddenly explode into random characters when crossing around 6000 tokens. So with 8k I suggest setting max tokens in the UI to 5900.

(note about doing this test in kobold.cpp: if you try it with context size set to 2048, it might still appear to work, but that's only because kobold is automatically trimming out some of the filler in the middle. This is only a valid test if you're not overfilling the context.)

According to perplexity measurements there is some degradation in overall quality, especially when the context is almost empty, but it's not noticeable to me, at least at 4k which is what I tested. There's room for improvement in only applying as much scaling as is necessary at current sequence length to eliminate the degradation. I tried to implement that, but I just get garbage, I must be missing something. But in any case at the rate things are progressing someone else will have done it by the time I wake up tomorrow...

you are viewing a single comment's thread
view the rest of the comments

[–] actuallyacat 4 points 2 years ago* (last edited 2 years ago)

Small update, take what I said about the breakage at 6000 tokens with a pinch of salt, testing is complicated by something somewhere breaking in a way that persists through generations and even kobold.cpp restarts... Must be some driver issue with CUDA because it takes a PC reboot to resolve, then the exact same generation goes from gibberish to correct.