this post was submitted on 30 Oct 2023
35 points (94.9% liked)
LocalLLaMA
2269 readers
3 users here now
Community to discuss about LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
I think the implication is more stating that this dataset is even more useful if you don't jam the whole thing into your training but instead further filter it to a reasonable number of tokens, around 5T, and train on that subset instead
I could be incorrect, cause they do explicitly say deduplicating, but it's phrased oddly either way