this post was submitted on 30 Oct 2023
35 points (94.9% liked)

LocalLLaMA


30T tokens, 20.5T of them in English, and allegedly high quality. Can't wait to see people start putting it to use!

Related GitHub repo: https://github.com/togethercomputer/RedPajama-Data

[–] [email protected] 3 points 2 years ago (3 children)

Looks like the deduped dataset is about 5T tokens. Nothing to sneeze at, for sure.

[–] noneabove1182 1 points 2 years ago (1 children)

I thought they claimed the deduped dataset is the 20.5T number; where did you see 5T? Either way, that would still be awesome, especially when you consider the theory that model quality is limited mostly by dataset quality, and Llama 2 was trained on only 2T. This could be huge.

[–] [email protected] 3 points 2 years ago* (last edited 2 years ago) (1 children)

Maybe I misread it, but this was the source of the 5T remark:

https://news.ycombinator.com/item?id=38077521#38080442

What we make available is:

(A) the dataset after pre-processing the raw CommonCrawl data (e.g., text extraction and language identification) and some minimal filtering; and (B) for each document in (A), 40+ pre-computed "features" (which we call "quality annotations") that you can use to further filter or deduplicate it. For example, one such feature is "how similar this document is to Wikipedia".

(A) is around 30T tokens, but you might want to use features in (B) to further filter/dedup it down, e.g., to 5T. For example, if in your application documents similar to Wikipedia are the most helpful documents, you can take the top documents with the highest score for the feature "how similar this document is to Wikipedia". Of course, the really interesting case happens when you consider a larger subset of these features (or maybe even automatically learn what the best way of filtering it is).

Our goal is to make this as flexible as possible so that you can fit it into your own application. What we have released is both (A) and (B).

If you have any questions, please let us know! Thanks for your interest; have fun with the data!
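To make the quoted workflow concrete, here's a minimal sketch of filtering documents by one of the per-document quality annotations. The field name `wiki_similarity` and the threshold are made up for illustration; the real annotation names and schema are documented in the RedPajama-Data repo.

```python
# Hypothetical sketch: keep only documents whose quality annotation
# (here, an assumed "wiki_similarity" score) clears a threshold.
# Field names and threshold are illustrative, not the real schema.

def filter_by_quality(docs, annotations, key="wiki_similarity", threshold=0.5):
    """Pair each document with its annotation record and keep the ones
    whose chosen quality score is at or above the threshold."""
    kept = []
    for doc, ann in zip(docs, annotations):
        if ann.get(key, 0.0) >= threshold:
            kept.append(doc)
    return kept

docs = [{"text": "a wiki-like article"}, {"text": "spammy page"}]
annotations = [{"wiki_similarity": 0.9}, {"wiki_similarity": 0.1}]
print(filter_by_quality(docs, annotations))  # only the wiki-like doc survives
```

In practice you'd combine several annotations (or learn a weighting over them, as the quote suggests) rather than thresholding a single score, but the mechanics are the same: the 30T-token pool (A) plus the annotations (B) let you carve out whatever subset fits your application.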

[–] noneabove1182 3 points 2 years ago

I think the implication is that this dataset is even more useful if you don't jam the whole thing into your training run, but instead filter it down to a reasonable number of tokens, around 5T, and train on that subset.

I could be wrong, because they do explicitly say "deduplicate", but it's phrased oddly either way.
