this post was submitted on 30 Oct 2023
35 points (94.9% liked)

LocalLLaMA

30T tokens, 20.5T of them in English, and allegedly high quality. Can't wait to see people start putting it to use!

Related GitHub repo: https://github.com/togethercomputer/RedPajama-Data
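
For anyone who wants to poke at it before committing to a full download, here's a minimal sketch of streaming the sample config from Hugging Face. The repo id, the `sample` config name, and the `raw_content` field are taken from the dataset card, so treat them as assumptions if the card has changed:

```python
# Minimal sketch: stream a few documents from the RedPajama-Data-V2 sample
# config instead of downloading the full 30T-token corpus.
# Repo id / config / field names follow the HF dataset card (assumptions).
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",        # small sample config; the full config takes snapshot/language args
    split="train",
    streaming=True,       # iterate lazily, nothing is fully downloaded
    # trust_remote_code=True,  # newer datasets versions may require this for script-based datasets
)

for doc in ds.take(3):
    print(doc["raw_content"][:200])  # first 200 chars of each document
```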

[–] noneabove1182 3 points 1 year ago

I think the implication is that this dataset becomes even more useful if you don't jam the whole thing into your training run, but instead filter it further down to a reasonable number of tokens, around 5T, and train on that subset instead.

I could be wrong, because they do explicitly say deduplicating, but it's phrased oddly either way.
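
To make the filtering idea concrete, here's a rough sketch of what "filter down to a subset" could look like using the quality signals the dataset ships with. The `quality_signals`, `ccnet_perplexity`, and `ccnet_length` names come from the dataset card; the threshold and token budget are made-up illustration values, and character length is only a crude proxy for tokens:

```python
# Rough sketch: keep low-perplexity documents until a token budget is reached.
# Signal names follow the HF dataset card; cutoffs are arbitrary illustrations.
import json
from datasets import load_dataset

TOKEN_BUDGET = 5_000_000    # stand-in for "around 5T" at toy scale
MAX_PERPLEXITY = 300.0      # arbitrary quality cutoff

ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",
    split="train",
    streaming=True,
)

kept_ids, budget_used = [], 0
for doc in ds:
    signals = json.loads(doc["quality_signals"])
    # Each signal is a list of [start, end, score] spans; document-level
    # signals have a single span covering the whole text.
    ppl = signals["ccnet_perplexity"][0][2]
    length = signals["ccnet_length"][0][2]
    if ppl is not None and ppl < MAX_PERPLEXITY:
        kept_ids.append(doc["doc_id"])
        budget_used += int(length)  # character count as a crude token proxy
    if budget_used >= TOKEN_BUDGET:
        break

print(f"kept {len(kept_ids)} docs, ~{budget_used} chars")
```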