this post was submitted on 17 Jun 2023
12 points (92.9% liked)

LocalLLaMA

2777 readers
43 users here now

Welcome to LocalLLama! This is a community to discuss local large language models such as LLama, Deepseek, Mistral, and Qwen.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support eachother and share our enthusiasm in a positive constructive way.

founded 2 years ago
MODERATORS
 

Hey, I'm working on some local LLM applications and my goal is to run the smallest model possible without crippling performance. I'm already using 4 bit GPTQ but I want something smaller. These models have been trained on such a massive amount of data but my specific use case only touches a very very small fraction of that, so I would imagine it's possible to cut away large chunks of the model that I don't care about. I'm wondering if there has been any work on runtime pruning of LLMs (not just static pruning based on model weights) based on "real world" data. Something like: you run the model a bunch of times with your actual data and monitor the neuron activations to inform some kind of pruning process. Does anyone here know about something like that?

top 2 comments
sorted by: hot top controversial new old
[โ€“] [email protected] 2 points 2 years ago

The closest that I know is distillation, you can google to get few resources (e.g. https://huggingface.co/papers/2306.08543). I don't know if it is what you are looking for

[โ€“] [email protected] 2 points 2 years ago

I don't know about that, but you could try GGML (llama.cpp). It has quantization up to 2-bits so that might be small enough.