this post was submitted on 20 Apr 2024
11 points (64.9% liked)

LocalLLaMA

Community to discuss about LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

founded 1 year ago
[email protected] 3 points 7 months ago (1 child)

access to the training data

That's just not realistic. There are too many legal problems with that.

Besides, Llama 3 was trained on 15 trillion tokens. Whatcha gonna do with something like that?

[email protected] 1 point 7 months ago* (last edited 7 months ago)

Hmm. Sure, the legal issues are why it is the way it is. That doesn't necessarily mean it should be that way... But it's more complicated than that.

With the dataset, I'm sure people could figure out something to do with it. There are community-curated datasets, and previous attempts to recreate models, like RedPajama... Sure, this is a lot more data, but other people are making progress, too. And if not that, we could at least have a look at it, do some research, run some statistics... Maybe use parts of it for something else. That's the spirit of the free software movement.

I'm a bit split on the topic. FOSS doesn't translate directly to ML models. Not being able to recreate something isn't how it's supposed to be. But a model isn't software either, and it works differently. Releasing datasets would give us some progress and put the tools in the hands of people other than just the big tech companies, who are free to violate copyright law. But we'd still be missing the millions needed to afford the compute to train a model anyways.