this post was submitted on 30 Oct 2023
35 points (94.9% liked)
LocalLLaMA
2909 readers
25 users here now
Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Lets explore cutting edge open source neural network technology together.
Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.
As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.
founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
Maybe I misread it but this was the source of the 5T remark…
https://news.ycombinator.com/item?id=38077521#38080442
I think the implication is more stating that this dataset is even more useful if you don't jam the whole thing into your training but instead further filter it to a reasonable number of tokens, around 5T, and train on that subset instead
I could be incorrect, cause they do explicitly say deduplicating, but it's phrased oddly either way