
AIs trained on other AI output will start producing junk within a few generations, scientists warn

  • AIs trained solely on other AIs will eventually spiral into gibberish, machine learning experts say.

  • One author compared the effect to pollution, saying: "we're about to fill the Internet with blah."

  • It means AI training by scraping the web will become less and less effective, they said.

As more and more AI-generated content is published online, future AIs trained on this material will ultimately spiral into gibberish, machine learning experts have predicted.

A group of British and Canadian scientists released a paper in May seeking to understand what happens when successive generations of AIs are trained on each other's output.

In one instance, a ninth-generation AI ended up babbling about jackrabbits when the original source material had been about medieval architecture.

The math shows that "within a few generations, text becomes garbage," one of the authors, Prof. Ross Anderson of the University of Cambridge, wrote in a blog about the findings. Images, too, lose intelligibility, they said.

The scientists — whose paper has not yet been peer reviewed — are calling it "model collapse."

Large language models (LLMs) such as OpenAI's ChatGPT are generally understood to have been trained on vast amounts of data pulled from the internet, which was, of course, mostly human-generated until recently.

But as our use of these tools grows, AI-generated content is being added en masse to the online pool of data from which future LLMs will learn.

Over generations of training, this means errors and instances of nonsense will spiral, making it impossible for later AIs to distinguish between fact and fiction, the scientists said. The AIs will "start misinterpreting what they believe to be real, by reinforcing their own beliefs," they wrote.

Anderson used the works of Mozart and his less-talented contemporary, Antonio Salieri, to illustrate the problem.

"If you train a music model on Mozart, you can expect output that's a bit like Mozart but without the sparkle – let's call it 'Salieri,'" he wrote. "And if Salieri now trains the next generation, and so on, what will the fifth or sixth generation sound like?"

Lead author Dr. Ilia Shumailov of the University of Oxford told Insider that the problem lies in the AI's perception of probability after being trained on an earlier AI. Improbable events are less and less likely to be reflected in its output, narrowing what the next AI, trained on that output, understands to be possible.
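The narrowing Shumailov describes can be illustrated with a toy simulation. What follows is a minimal sketch, not the method used in the paper: the "model" is just a table of token frequencies, and the token names, probabilities, sample size, and function names are all assumptions made up for illustration. Each generation draws a finite sample from the previous generation's table and re-estimates the frequencies from that sample alone. A rare token that happens not to be drawn gets a probability of zero and can never reappear.

```python
import random
from collections import Counter


def next_generation(probs, sample_size, rng):
    """Sample from the current model, then refit by counting frequencies."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(sample)
    # Tokens that were never sampled get probability 0 in the new model.
    return {t: counts[t] / sample_size for t in tokens}


def simulate(generations=9, sample_size=200, seed=0):
    """Return the list of models, starting with the hypothetical 'human' one."""
    rng = random.Random(seed)
    # Assumed starting distribution: two common tokens and twenty rare ones.
    probs = {"common_a": 0.45, "common_b": 0.45}
    probs.update({f"rare_{i}": 0.005 for i in range(20)})
    history = [probs]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        history.append(probs)
    return history


def support_size(probs):
    """Count how many tokens the model still considers possible."""
    return sum(1 for p in probs.values() if p > 0)


if __name__ == "__main__":
    for gen, probs in enumerate(simulate()):
        print(f"generation {gen}: {support_size(probs)} tokens still possible")
```

In this sketch the number of tokens a model considers possible can only stay the same or shrink from one generation to the next, which is the toy-scale version of improbable events dropping out of later models.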

In one example given in the paper, a human-generated text about medieval architecture was fed through an AI language model, whose output was then used to train the next generations of AIs.

The original text, which deftly handled competing architectural theories, was put through this cycle several times.

By the ninth generation, the text was drivel. It read: "architecture. In addition to being home to some of the world's largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-"

Anderson likened it to massive pollution, writing: "Just as we've strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we're about to fill the Internet with blah."

AI-generated content is already finding its way online at scale. In May, online disinformation watchdog NewsGuard warned that it had found 49 news sites that appeared to be written entirely by AI.

And marketing and PR agencies are increasingly farming their copy out to chatbots — shunting human writers out of a job, as The Washington Post reported.

But, following Shumailov and Anderson's findings, human writers hoping not to be bested by AI shouldn't pop the Champagne yet.

There's nothing about human-generated data that is absolutely essential to training AIs, Shumailov said. It is useful because of the amount of natural variation, error, and improbable results our language produces.

"So humans are definitely going to help," he said. "At the same time, this is not to say that human data is a big requirement."

Read the original article on Business Insider