this post was submitted on 17 Jun 2023
12 points (92.9% liked)

LocalLLaMA

Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

Hey, I'm working on some local LLM applications, and my goal is to run the smallest model possible without crippling performance. I'm already using 4-bit GPTQ, but I want something smaller. These models were trained on a massive amount of data, but my use case only touches a very small fraction of it, so I'd imagine it's possible to cut away large chunks of the model that I don't care about. I'm wondering whether there has been any work on pruning LLMs based on their runtime behavior on "real world" data, rather than static pruning based on the model weights alone. Something like: you run the model a bunch of times on your actual data and monitor the neuron activations to inform some kind of pruning process. Does anyone here know of something like that?
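To make the idea concrete, here is a toy sketch in plain Python of what activation-informed pruning could look like. It is not an existing LLM pruning tool: the one-hidden-layer ReLU network, the weights, the calibration inputs, and the helper names (`activation_scores`, `prune_neurons`) are all made up for illustration. It records the mean activation of each hidden neuron on the calibration data and keeps only the most active ones.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def hidden_activations(w_in, x):
    """Hidden-layer outputs of a toy one-hidden-layer ReLU network."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]

def forward(w_in, w_out, x):
    """Full forward pass: input -> ReLU hidden layer -> linear output."""
    h = hidden_activations(w_in, x)
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]

def activation_scores(w_in, calibration_inputs):
    """Mean activation of each hidden neuron over the calibration set."""
    totals = [0.0] * len(w_in)
    for x in calibration_inputs:
        for i, a in enumerate(hidden_activations(w_in, x)):
            totals[i] += a
    return [t / len(calibration_inputs) for t in totals]

def prune_neurons(w_in, w_out, scores, keep):
    """Keep the `keep` highest-scoring hidden neurons and drop the rest."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    new_w_in = [w_in[i] for i in kept]                     # rows = neurons
    new_w_out = [[row[i] for i in kept] for row in w_out]  # columns = neurons
    return new_w_in, new_w_out, kept

# Hypothetical toy weights: neurons 0-1 read features 0-1, neurons 2-3 read features 2-3.
w_in = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
w_out = [[1.0, -1.0, 2.0, 0.5], [0.5, 0.5, -1.0, 1.0]]

# Hypothetical "real world" data that only ever uses features 0 and 1.
calibration = [[1.0, 2.0, 0.0, 0.0], [0.5, 0.1, 0.0, 0.0], [2.0, 1.0, 0.0, 0.0]]

scores = activation_scores(w_in, calibration)
small_w_in, small_w_out, kept = prune_neurons(w_in, w_out, scores, keep=2)

full_out = [forward(w_in, w_out, x) for x in calibration]
small_out = [forward(small_w_in, small_w_out, x) for x in calibration]
```

In this contrived case the pruned neurons never fire on the calibration data, so the half-size network reproduces the original outputs on it. A real LLM will not be this clean: rarely active units can still matter, so the quality loss would need to be measured on held-out data from the actual use case.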

top 2 comments
[email protected] 2 points 1 year ago

The closest thing I know of is distillation. You can google it to find a few resources (e.g. https://huggingface.co/papers/2306.08543). I don't know if that's what you're looking for.
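For anyone unfamiliar with distillation, the core idea is to train a small "student" model to match the output distribution of a large "teacher". A minimal sketch of the classic temperature-softened KL loss follows. This is the textbook formulation, not necessarily the objective used in the linked paper, and the function names and example logits are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between the temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical logits over a 4-token vocabulary.
teacher = [4.0, 1.0, 0.5, -2.0]
student = [2.5, 1.5, 0.0, -1.0]

loss_mismatch = distillation_loss(teacher, student)
loss_match = distillation_loss(teacher, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. For a narrow use case, the teacher outputs would presumably be collected on data from that use case only.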

[email protected] 2 points 1 year ago

I don't know about that, but you could try GGML (llama.cpp). It supports quantization down to 2 bits, so that might be small enough.
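To give a feel for what 2 bits per weight means, here is a rough sketch of generic block-wise round-to-nearest quantization. It is not GGML's actual format or algorithm. The function names and the example weights are made up for illustration. Each block of weights is mapped onto four evenly spaced levels between its minimum and maximum.

```python
def quantize_block(values, bits=2):
    """Map a block of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Round to the nearest level and clamp to guard against float drift.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    """Reconstruct approximate floats from the codes, scale, and offset."""
    return [lo + c * scale for c in codes]

# Hypothetical block of weights.
weights = [-0.9, -0.3, 0.0, 0.2, 0.4, 0.9]

codes, scale, lo = quantize_block(weights, bits=2)
restored = dequantize_block(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With only four levels per block, the reconstruction error can reach half the step size (`scale / 2`), which is why very low bit widths tend to cost more quality than 4-bit quantization.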