this post was submitted on 17 Aug 2023

485 points (96.0% liked)

Report: Potential NYT lawsuit could force OpenAI to wipe ChatGPT and start over (arstechnica.com)

submitted 1 year ago by [email protected] to c/[email protected]

155 comments

cross-posted from: https://nom.mom/post/121481

OpenAI could be fined up to $150,000 for each piece of infringing content. https://arstechnica.com/tech-policy/2023/08/report-potential-nyt-lawsuit-could-force-openai-to-wipe-chatgpt-and-start-over/#comments

[–] [email protected] 14 points 1 year ago (2 children)

What's the basis for this? Why can a human read a thing and base their knowledge on it, but not a machine?

[–] [email protected] 17 points 1 year ago (4 children)

Because a human understands and transforms the work. The machine runs statistical analysis and regurgitates a mix of what it was given. There’s no understanding or transformation, it’s just what is statistically the 3rd most correct word that comes next. Humans add to the work, LLMs don’t.

Machines do not learn. LLMs do not “know” anything. They make guesses based on their inputs. The reason they appear to be so right is the scale of data they’re trained on.

This is going to become a crazy copyright battle that will likely lead to the entirety of copyright law being rewritten.

[–] [email protected] 5 points 1 year ago (1 children)

I don't know if I agree with everything you wrote, but I think the argument about LLMs basically transforming the text is important.

Converting written text into numbers doesn't fundamentally change the text. It's still the author's original work, just translated into a vector format. Reproduction of that vector format is still reproduction without citation.

[–] [email protected] 6 points 1 year ago* (last edited 1 year ago) (1 children)

But it's not just converting them into a different format. It's not even storing that information at all. It can't actually reproduce anything from the dataset unless it is really small or completely overfitted, neither of which applies to GPT given how massive it is.

Each neuron, which represents a word or a phrase, is a set of weights. One source makes a neuron go up by 0.000001% and then another source makes it go down by 0.000001%. And then you repeat that millions and millions of times. The model has absolutely zero knowledge of any specific source in its training data, it only knows how often different words and phrases occur next to each other. Or for images it only knows that certain pixels are weighted to be certain colors. Etc.
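
To make the "many tiny nudges" picture above concrete, here is a minimal Python sketch. It is a toy and not how GPT is actually trained: the single weight, the learning rate, and the per-source gradient values are all assumptions made up for illustration.

```python
import random

def train_single_weight(gradients, learning_rate=1e-6, start=0.0):
    """Apply one tiny update per training example to a single weight.

    `gradients` stands in for the signal each source contributes;
    the values are invented for this sketch.
    """
    weight = start
    for g in gradients:
        # Each source nudges the weight slightly up or down.
        weight -= learning_rate * g
    return weight

random.seed(0)
# Hypothetical gradients from 100,000 "sources": some push up, some push down.
grads = [random.choice([-1.0, 1.0]) for _ in range(100_000)]
final_weight = train_single_weight(grads)
print(final_weight)
```

In this toy, only the final number is kept and the list of sources is thrown away. Whether that picture carries over to a real model, and what it means for copyright, is exactly what this thread is arguing about.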

[–] [email protected] -1 points 1 year ago (2 children)

This is a misunderstanding on your part. While some neurons are trained this way, word2vec and doc2vec are not these mechanisms. The LLMs are extensions of these models, and while there are certainly some aspects of what you are describing, there is a transcription into vector formats.

This is the power of vectorization of language (among other things). The one-to-one mapping between vectors and words, sentences, documents and so forth allows models to describe the distance between words or phrases using Euclidean geometry.
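
As a rough illustration of the "distance between words" idea, here is a small Python sketch. The three-dimensional vectors are hypothetical values invented for this example; real embedding models learn vectors with many more dimensions from data.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "cat": (0.9, 0.8, 0.1),
    "dog": (0.8, 0.9, 0.2),
    "car": (0.1, 0.2, 0.9),
}

def distance(a, b):
    """Euclidean distance between two words in the toy table."""
    return math.dist(toy_vectors[a], toy_vectors[b])

print(distance("cat", "dog"))  # small: the two vectors sit close together
print(distance("cat", "car"))  # larger: the vectors point in different directions
```

With these made-up numbers, "cat" lands nearer to "dog" than to "car", which is the kind of relationship the comment is describing.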

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago) (1 children)

I was trying to make it as simple as possible. The format is irrelevant. The model is still storing nothing but weights at the end of the day. Storing the relationships between words and sentences is not the same thing as storing works in a different format which is what your original comment implied.

[–] [email protected] -1 points 1 year ago

I'm sorry you failed to grasp how it works in this context.

[–] [email protected] 0 points 1 year ago (1 children)

You made me really interested in this concept so I asked GPT-4 what the furthest word away from the word “vectorization” would be.

Interesting game! If we're aiming for a word that's conceptually, contextually, and semantically distant from "vectorization," I'd pick "marshmallow." While "vectorization" pertains to complex computational processes and mathematics, "marshmallow" is a soft, sweet confectionery. They're quite far apart in terms of their typical contexts and meanings.

It honestly never ceases to surprise me. I’m gonna play around with it some more. I do really like the idea that it’s essentially a word calculator.

[–] [email protected] 4 points 1 year ago

Try asking it how the vectorization of king and queen are related.
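
For anyone curious what that relationship looks like as arithmetic, here is a toy Python sketch of the well-known "king - man + woman" analogy. The two-dimensional vectors and the `analogy` helper are hypothetical, hand-picked so the example works; they are not taken from any real model.

```python
import math

# Hypothetical 2-D vectors: axis 0 is roughly "royalty", axis 1 roughly "gender".
vecs = {
    "king":   (0.9, 0.9),
    "queen":  (0.9, 0.1),
    "man":    (0.1, 0.9),
    "woman":  (0.1, 0.1),
    "prince": (0.7, 0.9),
    "apple":  (0.0, 0.5),
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c]))
    candidates = [w for w in vecs if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vecs[w], target))

print(analogy("king", "man", "woman"))
```

With these invented numbers the nearest remaining word is "queen". Real embeddings are learned rather than hand-written, so results there are noisier.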

[–] [email protected] 2 points 1 year ago

At some level, isn't what a human brain does also effectively some form of very very complicated mathematical algorithm, just based not on computer modeling but on the behavior of the physical systems (the neurons in the brain interacting in various ways) involved under the physical laws the universe presents? We don't yet know everything about how the brain works, but we do at least know that it is a physical object that does something with the information given as inputs (senses). Given that we don't know for sure how exactly things like understanding and learning work in humans, can we really be absolutely sure what these machines do doesn't qualify?

To be clear, I'm not really trying to argue that what we have is a true AI or anything, or that what these models do isn't just some very convoluted statistics. I've just had a nagging feeling in the back of my head ever since ChatGPT and such started getting popular, along the lines of "can we really be sure that this isn't (a very simple form of) what our brains, or at least a part of them, actually do, and we just can't see it that way because that's not how it internally "feels"?" Or, assuming it is not, if someone made a machine that really did exhibit knowledge and creativity, using the same mechanism as humans or a similar one, how would we recognize it, and in what way would it look different from what we have? (Assume it's not a sci-fi style artificial general intelligence that's essentially just a person, but some hypothetical dumb machine that nevertheless possesses genuine creativity or knowledge.) It feels somewhat strange to declare with certainty that a machine that mimics the symptoms of understanding definitely does not possess anything close to actual understanding, when we don't even know entirely what understanding physically entails in the first place. These models can talk in an at least somewhat humanlike way and explain subjects in a manner that sometimes appears thought out. They can also be dead wrong, of course, but then again, so can humans.

[–] [email protected] 1 points 1 year ago

It’s also the scale of their context, not just the data. More (good) data and lots of (good) varied data is obviously better, but the perceived cleverness isn’t owed to data alone.

I do hope copyright law gets rewritten. It is dated and hasn’t kept up with society or technology at all.

[–] atzanteol 0 points 1 year ago

This is going to become a crazy copyright battle that will likely lead to the entirety of copyright law being rewritten.

I think this is very unlikely. All of law is precedent.

Google uses copyrighted works for many things that are "algorithmic" but not AI and people aren't shitting themselves over it.

Why would AI be different? So long as copyright isn't infringed at least.

[–] [email protected] 10 points 1 year ago* (last edited 1 year ago) (1 children)

That machine is a commercial product, quite unlike a human being in essence, purpose and function. So I do not think the comparison is valid here, unless it were perhaps a sentient artificial being, free to act of its own accord. But that is not what we’re talking about here. We must not be carried away by our imaginations: these language models are (often proprietary and for-profit) products.

[–] [email protected] 7 points 1 year ago (1 children)

I don't see how that's relevant. A company can pay someone to read copyrighted work, learn from it, and then perform a task for the benefit of the company related to the learning.

[–] [email protected] 2 points 1 year ago (1 children)

But how did that person acquire the copyrighted work? Was the copyrighted material paid for?

That's the crux of the issue: OpenAI isn't paying for the copyrighted work they are "reading", are they?

[–] [email protected] 2 points 1 year ago

What does paying for anything have to do with what we're talking about here? They're ingesting freely available content that anyone with a web browser could read.