How generative AI is erasing our knowledge

The image that vendors paint of their LLMs (large language models, the technology behind ChatGPT, Claude, etc.) is that of an answer machine – the knowledge of humanity just one prompt away. In reality, they are at best a simulacrum of such a machine, their answers mere platitudes from an immortal idiot. Taking a step back, it’s rather the opposite: instead of providing universal access to knowledge, LLMs are destroying our knowledge and cultural heritage. And I’m not referring to Elon Musk wanting to use Grok to rewrite history¹, or to the inevitable errors (aka “hallucinations”) in the answers. I mean this very literally.

How does book burning work in a digital world?

There are two ways to eradicate a book: the more obvious one is to make it hard or impossible to get hold of. The book is banned from libraries, taken off the market, and existing copies are tracked down and destroyed. This is difficult enough in an analogue world where books are printed on paper. In a digital world, where copies of the book can exist on computers distributed around the world and can be created at the touch of a button, it is even more difficult.

[Image: the cantina scene from Star Wars: Episode IV – A New Hope. Caption: Han shot second.]

An alternative – and possibly even more effective – approach is to render the book meaningless: a novel is more than a string of words, a film more than a series of images. They tell a story, with twists and turns and decisions made by the characters. Instead of finding and destroying all copies of the story, you can rewrite it and offer an alternative at every turning point, the opposite of every decision. As these variants of the story spread, they begin to blur with the original. And at some point, no one knows for sure anymore: who shot first – Han or Greedo?

The diabolical part is that even if a copy of the original eventually turns up, it doesn’t simply undo the damage. Without consensus on what the canonical version is, there is no correct version, and thus the story is completely erased.

This is because information consists not only of what has been said or written, but also of what has not been said. One bit of information (I can’t let you get away without at least a little detour into information theory) corresponds to one decision between two alternatives – and if both variants suddenly appear equally valid, the decision is undone and the bit is erased. Counterintuitively, adding content to the corpus of human texts at random does not increase the amount of information but decreases it. Only the distinction between texts worth reading and texts not worth reading turns a text into information; before that, it is just noise in the multitude of possible texts.
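
To make the bookkeeping concrete, here is a minimal Python sketch (my own illustration – the probabilities are stand-ins for how plausible each version of the story appears to readers):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

# A settled canon: everyone agrees on who shot first, so there is no
# uncertainty left - the decision is worth one full, stored bit.
print(entropy([1.0]))        # 0.0 bits of uncertainty

# After the corpus is flooded with rewritten variants, both versions
# appear equally plausible: readers must guess, and the bit that the
# original decision encoded is erased.
print(entropy([0.5, 0.5]))   # 1.0 bit of uncertainty
```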

What does this have to do with LLMs?

The emergence of large language models in 2018 marked a technological breakthrough in text processing. Instead of working at the level of bits and bytes, we suddenly had a tool that operates at the level of text itself. This was an enormous achievement and a consequential paradigm shift. It is also the reason why we often have the impression that LLMs create meaningful knowledge: they fundamentally break with our expectations of how we interact with text:

[Image: a seemingly endless cylinder made of stacked books. Caption: Every book that was never written.]

We are used to authors typing their stories character by character on a keyboard. With each keystroke, they narrow down the number of possible stories until, with the final full stop, they settle on a single one. But they could just as well be searching for their story in a hypothetical library containing all possible stories. The degree of originality involved in reaching for the right volume on the infinite shelf would be the same as if they had written it word for word. Intuitively, this feels wrong – but the only real objection is that such a library is impossible, not that the equivalence fails.

This is where LLMs come into play. In a way, they are more like an infinite library than a keyboard: the innovation of LLMs is that they make it possible to interact directly with language instead of bytes. An LLM does not generate a random string of letters, but syntactically correct and even semantically plausible text. In the end, however, it is still just a random text from the set of all possible texts – weighted according to the stochastic model underlying the LLM.
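
As a toy sketch of this “weighted draw from the library” (the probability table below is invented for illustration; a real LLM computes such next-word weights with a neural network, but its final sampling step is exactly this kind of draw):

```python
import random

# Invented next-word probabilities - a stand-in for the stochastic
# model underlying an LLM.
NEXT_WORD = {
    "<start>": {"han": 0.6, "greedo": 0.4},
    "han": {"shot": 1.0},
    "greedo": {"shot": 1.0},
    "shot": {"first": 0.5, "second": 0.5},
    "first": {"<end>": 1.0},
    "second": {"<end>": 1.0},
}

def generate():
    """Draw one text from the set of all possible texts, weighted by the model."""
    word, text = "<start>", []
    while True:
        options = NEXT_WORD[word]
        word = random.choices(list(options), weights=list(options.values()))[0]
        if word == "<end>":
            return " ".join(text)
        text.append(word)

print(generate())  # e.g. "han shot second" - plausible, yet still a random draw
```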

And while that is impressive, it is also the standard by which we must measure the information content of AI-generated texts.

How much information is in AI-generated texts?

The information contained in AI-generated texts does not lie in the fact that they are formally correct English texts. Grammar and vocabulary do indeed contain a wealth of knowledge, but this can be assumed to be available to both the author and the reader, as it is a requirement for communication to be possible in the first place.

The only sources of information that contribute to the text are the prompt and the subsequent selection process. The upper limit for the information content is therefore determined by the length of the prompt and the (logarithm of the) number of candidate texts available for selection. The amount of training data used for the LLM is irrelevant here, as it does not constitute new information: randomly recombining existing information does not generate new information, but merely adds noise to the text corpus. Of course, this does not mean that summaries can’t offer added value. However, the information must be deliberately selected and accurately reproduced.
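
A back-of-the-envelope sketch of that upper limit (the one-bit-per-character figure is an assumption, roughly in line with Shannon’s classic estimate for the entropy of English prose):

```python
from math import log2

def new_information_bound_bits(prompt: str, num_candidates: int) -> float:
    """Crude ceiling on the new information in a generated text:
    the bits in the prompt plus the bits spent selecting one of the
    generated candidates."""
    BITS_PER_CHAR = 1.0  # assumed entropy of English text per character
    return len(prompt) * BITS_PER_CHAR + log2(num_candidates)

# A 58-character prompt with best-of-8 selection caps the new
# information at 61 bits - no matter how long the output is.
prompt = "Write an essay on book burning in the digital age, please."
print(new_information_bound_bits(prompt, 8))  # 61.0
```

Even generous assumptions barely change the picture: selecting the best of a thousand drafts adds only log₂ 1000 ≈ 10 bits.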

So if the prompt is not at least as long as the generated text, or if thousands of variants have been generated, the information density of AI-generated texts is minimal. This does not mean that an AI-generated text cannot contain new information for the reader: my argument concerns the broader implications of AI-generated texts, and a statement that is trivial relative to the underlying text corpus is not automatically trivial for every reader.

Conclusion

Combining these thoughts reveals a hidden danger in the use of generative AI: with every AI-generated text that finds its way into the corpus of human literature without being recognisable as such, our cultural heritage becomes a little more blurred. Precisely because these texts are not distinctly new creations but fit inconspicuously into the gaps between existing works, those works lose their distinctiveness and are obscured by arbitrariness. And this does not even take into account the damage done to the credibility of texts, images and videos by the negligent and deliberate omnipresence of misinformation.

Large parts of the internet are already generated by AI², and AI-generated books are starting to appear among those written by humans. If we don’t manage to control the spread of AI-generated content, we could, within a few decades, face a loss of knowledge that would dwarf the loss of books in late antiquity.

The problem cannot be solved purely through political means (e.g. mandatory labelling of AI content³) or individual behaviour (e.g. boycotting AI content). Rather, it requires a social rethink: a consensus that the potential individual and short-term gains from AI are not worth the social damage caused by its irresponsible use.


  1. “We will use Grok 3.5 (maybe we should call it 4), which has advanced reasoning, to rewrite the entire corpus of human knowledge, adding missing information and deleting errors. Then retrain on that. Far too much garbage in any foundation model trained on uncorrected data.” – Elon Musk wants to use AI to rewrite humanity’s knowledge in accordance with his own ideas. ↩︎

  2. Estimates of the proportion of AI-generated content on the internet vary between 10% and 70%. As the credibility of these figures varies just as greatly, I will refrain from citing sources at this point. However, there is consensus that it is a lot and is growing rapidly. ↩︎

  3. AI Act, Art. 50, para. 2: “Providers of AI systems, including general-purpose AI systems, generating synthetic audio, image, video or text content, shall ensure that the outputs of the AI system are marked in a machine-readable format and detectable as artificially generated or manipulated.” ↩︎
