this post was submitted on 10 Apr 2024
28 points (93.8% liked)
LocalLLaMA
you are viewing a single comment's thread
I thought MoEs had to be loaded entirely into (V)RAM, and that the inference speedup comes from only needing a fraction of the weights (the active experts) to compute each token. But since the choice of experts can differ for every token, you need all of them ready; otherwise you end up shuffling data between disk <-> RAM <-> VRAM and performance suffers.
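To make that concrete, here is a minimal sketch of per-token top-k expert routing. It assumes a simple linear gating network, and the class name `SimpleMoE` plus the per-token loop are purely illustrative (real implementations batch tokens by expert); it just shows why all expert weights must stay resident while only a few run per token:

```python
# Hypothetical minimal MoE layer: all experts resident, top-k active per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Every expert must be loaded in memory, because the router
        # can pick any of them for any token.
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # gating network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). The router scores every expert per token...
        scores = self.router(x)                         # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # per-token choice
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # ...but only top_k experts actually run for each token,
        # which is where the compute saving comes from.
        for t in range(x.size(0)):
            for k in range(self.top_k):
                out[t] += weights[t, k] * self.experts[idx[t, k]](x[t])
        return out
```

The compute per token scales with `top_k`, but the memory footprint scales with `num_experts`, since you can't predict which experts the next token will route to.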