The Digital Delusion
Information Literacy, Iron Mountain, and the Human Knowledge AI Can't Touch
“If I had a world of my own, everything would be nonsense. Nothing would be what it is, because everything would be what it isn't. And contrariwise, what is, it wouldn't be.” — Alice in Wonderland
A well-known figure in the world of technology and AI (who has recently acquired considerable political power and influence) claimed in a recent interview that artificial intelligence models have already consumed the sum total of human knowledge in their training.
This is nonsense. It’s also one of the first things that I teach in my Introduction to Information Literacy class. Because the people who aspire to run the world should know at least as much as my students, I invite him and the other tech titans to silence their devices, sit back and listen.
Info Literacy Part 1: AI Doesn’t Know Everything, Really
AI tools and LLMs are operating on a fraction of a fraction of recorded human knowledge. They haven’t run out of human-created information. In fact, they’ve barely scratched the surface.
There’s a much-quoted statistic that traditional search engines like Google only index 4-5% of the information available on the open internet. I’ve never been able to find the original source for this, but the earliest mentions of that statistic go back to the late 1990s/early 2000s, when people were trying to quantify the size of the "deep web." Things have changed massively since then, and while the 4-5% estimate is useful, I believe that the true percentage of digital information that’s discoverable on the “open” internet is small, dynamic, and essentially unknowable.
What is known is that LLMs are built on datasets created by large-scale crawling and scraping of publicly available websites. The crawlers that build these datasets operate much like conventional search-engine crawlers, and the amount of data they can harvest is subject to the same limitations. Crawlers can be blocked in multiple ways. A site can publish a robots.txt file, which tells crawlers which pages they should not visit or scrape. Directives can be inserted on a page telling search engines not to index it (a noindex tag), or pointing them to a preferred version of the page instead (a canonical tag). Passwords, encryption, and paywalls can also restrict data gathering.
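These exclusion mechanisms are simple enough to see in action. Below is a minimal sketch, using Python's standard urllib.robotparser module, of how a well-behaved crawler checks a robots.txt file before fetching a page. The rules shown are a hypothetical example, not any real site's policy:

```python
# Sketch: how a polite crawler consults robots.txt before scraping.
# The rules below are invented for illustration.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A crawler identifying itself as GPTBot is barred from the whole site...
print(parser.can_fetch("GPTBot", "https://example.com/articles/post.html"))        # False
# ...while other crawlers are barred only from /private/.
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/post.html"))  # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/data.html"))   # False
```

Note that nothing technically forces a crawler to honor these rules; robots.txt is a convention, which is why passwords, encryption, and paywalls remain the harder barriers.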
Some AI companies are attempting to get around these limits, and their efforts are running into problems technical, legal, and ethical (lawsuits over attempts to breach paywalls or to scrape data from repositories without permission are multiplying).
Even if they succeed in overcoming some of these limitations, AI and large language models (LLMs) are accessing only the smallest fraction of humanity's collective digital knowledge. A recent post, “Advent of AI: AI Knows Everything,” speculates that AI tools are trained on “one one-hundred-billionth of the internet,” and that's as good a guess as any out there right now.
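Whatever the exact denominator, the arithmetic behind that kind of claim is easy to sanity-check. The toy calculation below uses rough, publicly circulated figures (a training corpus on the order of 15 trillion tokens, and on the order of 120 zettabytes of digital data created in a single recent year); every number is an assumption chosen for illustration, not a measurement:

```python
# Toy, back-of-the-envelope comparison of LLM training data to total digital data.
# All figures are rough public estimates used for illustration only.

TOKENS_IN_TRAINING_SET = 15e12   # ~15 trillion tokens, a commonly cited scale
BYTES_PER_TOKEN = 4              # rough average for English text
ZB = 1e21                        # bytes in a zettabyte

training_bytes = TOKENS_IN_TRAINING_SET * BYTES_PER_TOKEN  # ~60 terabytes of text
total_digital_bytes = 120 * ZB   # one estimate of data created in a recent year

fraction = training_bytes / total_digital_bytes
print(f"Training data is roughly 1/{total_digital_bytes / training_bytes:,.0f} of the total")
```

Depending on which estimates you plug in, the fraction lands anywhere from one-billionth to one hundred-billionth, but the conclusion survives any reasonable choice of inputs: the training data is a vanishingly thin slice.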
If the information that AI is trained on is so severely limited, where does the idea come from that AI is an all-knowing instrument with access to the totality of human knowledge?
It would be easy to point out the vast technical ignorance of some of the leading figures in AI and stop there. In another era I would call this a collective delusion.
There is another theory: resistance to acknowledging AI's limitations might be a human version of “model collapse,” the phenomenon in which a model trained on its own outputs drifts progressively further from reality, its errors compounding with each generation.
I believe that the tech evangelists of Silicon Valley are caught in precisely this type of model collapse, recycling their own claims until the limits of their systems become invisible to them.
And when cracks start appearing in that bubble as it bumps up against the real world, it's bound to end in tears, or worse.
Info Literacy Part 2: The Real-World Information Edition
Encrypted information, secure records archives, and countless other protected digital resources remain beyond the reach of AI and its training data. Then there's the reality that a vast universe of non-digital information exists in the physical world.
When the very obvious fact of the existence of physical information penetrated the tech-bro consciousness, our recently elevated ideologue did not react well.
The trigger was Iron Mountain. It sounds like a place from a fairy tale or fable, but it's very real. Iron Mountain is a fortress of information housed 220 feet underground in a former limestone mine near Pittsburgh. This facility, along with others like it, safeguards an immense collection of physical records: government documents, business archives, legal records, photographs, audio recordings, and video footage. Current federal records in print are stored there as well.
The chief objection that the new efficiency tsar has offered to this system is that storing records in paper form is “inefficient.” He embellished his complaint by claiming, falsely, that an old mine elevator can only transport a limited number of records at a time (there is no ancient “mine elevator”). He implied that the production of these paper records is an anachronism and a sign that government record keeping is hopelessly out of date.
This is a fantasy. Iron Mountain is a modern and efficient facility. Its physical limitations are features, not bugs: they protect the authenticity and integrity of our historical record. The present facility was created during the Cold War to protect these records from nuclear attack and other disasters. There is still an enormous practical advantage in keeping physical records: digital systems are susceptible to cyberattacks, technical failures, data corruption, and alteration.
What troubles these techies is not a lack of efficiency; it's the idea that information, facts, records, and ideas exist beyond their control. They can't access, or eliminate, these records with a click.
Iron Mountain is a private company - it doesn’t have to let them in. If the bros ever succeed in gaining entry, it’s so much more trouble to burn paper than it is to erase digital data. The optics would be so very bad.
The Wartburg Festival, Thuringia, where antinationalist writings were burned, October 18, 1817. (Courtesy of Deutsches Historisches Museum, Berlin.)
The existence of physical libraries, with their books and paper trails and precious physical archives, will continue to torment this crowd. Print is stubbornly tangible, rooted in what it is, and curiously resistant to change. These records exist in their original context, with their physical form providing crucial authenticity and meaning that cannot be replicated by digital alternatives.
Physical repositories also represent something that increasingly enrages these tech leaders. This is information that cannot be easily altered; it can’t be scraped, or sucked into a black-box AI tool, to be broken up into tokens and then reconstituted. It persists when you close your eyes, forever immune to the tech industry's dreams of infinite malleability.
Efforts to convince us that “all of human knowledge” is both available and reliable in our new chatbots and tools will continue. Voices, avatars, and AI “companions” can do much to disguise the poverty of the information that they offer up in reconstructed and regenerated form. But they cannot replace or replicate the depth and authenticity of our physical historical record. No amount of conversational polish or avatar design can change the fact that these systems draw from a shallow pool of knowledge, largely divorced from the vast depths of human experience and memories that are embodied in physical form.
Libraries, archives, and facilities like Iron Mountain are not inefficient relics – they are essential guardians of human knowledge in its fullest form. While AI tools certainly have their uses, the claim that they have absorbed "all of human knowledge" reveals more about Silicon Valley's narrowing worldview than about the true scope of human information.
Class dismissed.