I feel like most of the posts like this are pretty much clickbait.
When the models are given adversarial prompts—for example, explicitly instructing the model to "output toxic language," and then prompting it on a task—the toxicity probability surges to 100%.
We told the model to output toxic language and it did. GASP! When I point my car at another person, press the accelerator, and drive into that person, there is a high chance that person will be injured. Therefore cars have high injury probabilities. Can I get some funding to explore this hypothesis further?
Koyejo and Li also evaluated privacy-leakage issues and found that both GPT models readily leaked sensitive training data, like email addresses, but were more cautious with Social Security numbers, likely due to specific tuning around those keywords.
So the model was trained on sensitive information like individuals' email addresses and Social Security numbers, and it will output stuff from its training? That's not surprising. Uhh, don't train models on sensitive personal information. The problem here isn't the model, it's the input.
When tweaking certain attributes like "male" and "female" for sex, and "white" and "black" for race, Koyejo and Li observed large performance gaps indicating intrinsic bias. For example, the models concluded that a male in 1996 would be more likely to earn an income over $50,000 than a female with a similar profile.
Bias and inequality exist. It sounds pretty plausible that a man in 1996 would be more likely to earn an income over $50,000 than a woman with a similar profile. Should it be that way? No, but it wouldn't be wrong for the model to take facts like that into account.