this post was submitted on 25 Jun 2023
Please correct me if I'm mistaken, but isn't the Reddit dataset used to train LLMs from before ChatGPT became widely known? I was under the impression that data from that point onwards was poisoned and not useful for training purposes.
I can't seem to find it now, but I remember a ~90 GB .zip "megadb" upload that got passed around a lot on the machine learning subreddits. It was a snapshot of Reddit from before some cutoff date.