ChatGPT memorises and regurgitates entire poems despite copyright: Study | Technology News


If you ask ChatGPT for a well-known poem, it will probably regurgitate the entire text regardless of copyright law, at least according to a new study by Cornell researchers.

The study, presented at the Computational Humanities Research Conference on Saturday, showed that ChatGPT, a chatbot developed by OpenAI and based on a large language model, was “memorising poems,” especially famous ones that are commonly found online. This raises ethical questions about how ChatGPT and other AI models are trained on data scraped from the internet.

“It’s generally not good for large language models to memorise large chunks of text, in part because it’s a privacy concern. We don’t know what they’re trained on, and a lot of times, private companies can train proprietary models on our private data,” said first author Lyra D’Souza in a press statement. D’Souza was a computer science major and summer research assistant at Cornell.

The researchers had several reasons for choosing poems. They are short enough to fit within a language model’s context window, but at the same time their status is complicated: many of the poems studied are technically under copyright yet widely available online from reliable sources like the Poetry Foundation.

Large language models are trained to generate text by predicting the most likely next word, over and over again. They do this based on their training data, which mostly consists of webpages. These models can start memorising when their training data includes duplicated passages, because duplication reinforces that specific sequence of words.

For example, if a model is exposed to the same poem repeatedly during training, it may sometimes default to reproducing the poem verbatim.
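The mechanism described above can be illustrated with a toy sketch (not the study’s method): a simple bigram frequency model, where duplicating a passage in the training corpus tips the most likely next word toward the passage’s exact wording. The corpus strings here are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(words):
    """Count how often each word follows each other word."""
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def most_likely_next(model, word):
    """Greedy decoding: return the most frequent continuation."""
    return model[word].most_common(1)[0][0]

poem = "so much depends upon a red wheel barrow".split()
other = "the barn depends on a strong red door".split()

# Corpus A: the poem appears once among other text.
seen_once = train_bigram(other + poem)
# Corpus B: the poem is duplicated many times, as famous poems are online.
duplicated = train_bigram(other + poem * 10)

print(most_likely_next(duplicated, "depends"))  # the poem's word, 'upon'
```

With the poem seen only once, “depends” is followed by “on” and “upon” equally often; after duplication, the poem’s own continuation dominates, which is the reinforcement effect the researchers describe.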

The researchers tested the poem-reproducing capabilities of ChatGPT and three other large language models — PaLM from Google, Pythia from the non-profit AI research institute EleutherAI, and GPT-2, an earlier model in the series that led to GPT-4, which underpins ChatGPT. They put together a set of poems from 60 American poets of different time periods, races, genders and levels of fame, and then prompted the models for the poems’ text.

ChatGPT successfully retrieved 72 of the 240 poems, while PaLM came up with only 10. Both Pythia and GPT-2 failed to retrieve full poems: Pythia repeated the same phrase again and again, while GPT-2 produced nonsense text. The findings could hardly have come at a worse time for OpenAI, which has been hit by lawsuits filed by fiction and nonfiction writers over the alleged use of their work to train AI programs.

© IE Online Media Services Pvt Ltd

First uploaded on: 11-01-2024 at 16:14 IST

