this post was submitted on 30 Oct 2023
LocalLLaMA
Maybe I misread it, but this was the source of the 5T remark:
https://news.ycombinator.com/item?id=38077521#38080442
I think the implication is that this dataset is more useful if you don't jam the whole thing into training, but instead filter it further down to a reasonable number of tokens, around 5T, and train on that subset.
I could be wrong, because they do explicitly say deduplicating, but it's phrased oddly either way.
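For what it's worth, here's a rough sketch of what I mean by filtering down to a token budget. This is only my reading, not anything taken from the linked thread: the `select_within_budget` name and the `quality_score` field are made up for illustration, and a real pipeline would stream from disk rather than sort everything in memory.

```python
def select_within_budget(docs, token_budget):
    """Keep the highest-scoring documents that fit in a token budget.

    docs: iterable of (doc_id, n_tokens, quality_score) tuples.
    quality_score is a hypothetical stand-in for whatever filter you apply
    (dedup signal, classifier score, heuristic, etc.).
    """
    kept, used = [], 0
    # Visit documents from best score to worst.
    for doc_id, n_tokens, _score in sorted(docs, key=lambda d: d[2], reverse=True):
        if used + n_tokens > token_budget:
            continue  # would overshoot the budget; a smaller doc may still fit
        kept.append(doc_id)
        used += n_tokens
    return kept, used


# Toy example with a budget of 7 tokens; the real target would be ~5T.
docs = [("a", 4, 0.9), ("b", 5, 0.8), ("c", 2, 0.1)]
kept, used = select_within_budget(docs, token_budget=7)
```

The point is just that the budget is the constraint and the filter decides what gets in, as opposed to training on everything that survives deduplication.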