this post was submitted on 14 Jul 2023
7 points (100.0% liked)

LocalLLaMA


Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.


Apologies for the basic question, but what's the difference between GGML and GPTQ? Do these just refer to different compression methods? Which would you choose if you're using a 3090ti GPU?

top 2 comments
[–] Mechanize@feddit.it 2 points 1 year ago (1 children)

As far as I know, they are different types of quantization.

The main difference to keep in mind as an end user is that, currently, GPTQ needs the entire model loaded into VRAM (your GPU's memory), while GGML can split layers between system RAM and VRAM.
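
If it helps, here's a minimal sketch of that layer splitting using the llama-cpp-python bindings (built with GPU support). The model path, layer count, and prompt are placeholders, so treat it as an illustration rather than a recipe:

    # Minimal sketch of GGML layer offloading via llama-cpp-python.
    # Assumes: pip install llama-cpp-python, compiled with CUDA support.
    # The model path and layer count are hypothetical placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-13b.ggmlv3.q4_K_M.bin",  # local GGML file
        n_gpu_layers=40,  # layers offloaded to VRAM; the rest stay in system RAM
        n_ctx=2048,       # context window size
    )

    output = llm("Q: What is quantization? A:", max_tokens=64)
    print(output["choices"][0]["text"])

Raising n_gpu_layers until you run out of VRAM is the usual way to find the sweet spot for a given card.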

Performance-wise, I think it depends on the foundation model used. I know someone (The_Bloke?) did some testing a while back, but I read it on Reddit and don't feel like digging it up.
There's an interesting post on Hugging Face (Link), but it's pretty old and things could have changed (GGML, for example, has gone through several iterations).

I'm just going by memory, so take everything I wrote with a pinch of salt. I've never personally used GPTQ.

[–] markon@lemmy.world 1 point 1 year ago

Also, llama.cpp offers very fast performance with GGML models compared to using Transformers, and it's sometimes faster than ExLlama.
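
For the GPTQ side, which would have to fit entirely in the 3090 Ti's 24 GB of VRAM, here's a rough sketch assuming the AutoGPTQ library; the checkpoint name is a hypothetical placeholder:

    # Rough sketch of loading a GPTQ model fully into VRAM with AutoGPTQ.
    # Assumes: pip install auto-gptq transformers
    # The repo id below is a made-up example, not a real checkpoint.
    from auto_gptq import AutoGPTQForCausalLM
    from transformers import AutoTokenizer

    model_id = "TheBloke/Example-13B-GPTQ"  # hypothetical quantized model

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoGPTQForCausalLM.from_quantized(
        model_id,
        device="cuda:0",      # the whole model must fit in GPU memory
        use_safetensors=True,
    )

    inputs = tokenizer("What is quantization?", return_tensors="pt").to("cuda:0")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))

Roughly speaking, a 4-bit 13B GPTQ model fits comfortably in 24 GB; 33B is a tighter squeeze, and beyond that GGML's RAM/VRAM split becomes the practical option.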