The AI Knowledge Trap — Omitted Information May Be Lost Forever
AI protagonists like Sam Altman, Elon Musk, Jensen Huang, Google, Meta, Microsoft, and a zillion entrepreneurs who see artificial intelligence as a path to instant riches can’t stop talking (and talking) about the wonders of AI. Elon Musk says that, coupled with the humanoid robots he is creating, artificial intelligence will usher in a wondrous new world free from poverty and illness.
But there is a contrarian point of view, and it has nothing to do with the gigawatts of power needed to run the data centers that make AI possible or the issue of cooling them in areas where water is already in short supply. Instead, it argues that, by omitting oral histories and languages that are not predominant in the world, large language models exclude significant sources of information and marginalize people in less dominant cultures.
Deepak Varuvel Dennison is a PhD student at Cornell University. His research explores responsible AI, with a focus on designing and evaluating systems that serve the needs of the majority world. In a recent article for Aeon, he argues that “huge swathes of human knowledge are missing from the internet. By definition, generative AI is shockingly ignorant, too.”
Elon Musk says robotic surgeons will be far more skillful than humans, and yet Dennison tells the story of his father, who found a traditional remedy for a tumor that conventional doctors believed was malignant. He treated it with a special herb-infused oil provided by a vaithiyar — a doctor who practices Siddha medicine in his home state of Tamil Nadu in India. Siddha medicine is not included in any of the large language models in use today.
What To Leave In, What To Leave Out
“I find it hard to believe my dad’s herbal concoctions worked, but I have also since come to realize that the seemingly all-knowing internet I so readily trusted contains huge gaps — and in a world of AI, it’s about to get worse,” he wrote. “I study what it takes to design responsible AI systems. My work has been revealing to me how the digital world reflects profound power imbalances in knowledge, and how this is amplified by generative AI.” Here is the crux of Dennison’s argument:
“The early internet was dominated by the English language and Western institutions, and this imbalance has hardened over time, leaving whole worlds of human knowledge and experience undigitized. Now with the rise of GenAI — which is trained on this available digital corpus — that asymmetry threatens to become entrenched.
“For many people, GenAI is becoming their primary way to learn about the world. A large scale study published in September 2025, analyzing how people have been using ChatGPT since its launch in November 2022, revealed that around half the queries were for practical guidance, or to seek information.
“These systems may appear neutral, but they are far from it. The most popular models privilege dominant epistemologies — typically Western and institutional — while marginalizing alternative ways of knowing, especially those encoded in oral traditions, embodied practice and the languages considered ‘low-resource’ in the computing world, such as Hindi or Swahili, both spoken by hundreds of millions.
“By amplifying these hierarchies, GenAI risks contributing to the erasure of systems of understanding that have evolved over centuries, disconnecting future generations from vast bodies of insights and wisdom that were never encoded yet remain essential to human ways of knowing. What’s at stake then isn’t just representation — it’s the resilience and diversity of knowledge itself.”
AI & Prior Knowledge
Readers can probably think of several similar instances in which a knowledge base was erased. Indigenous people all around the world have had their languages and cultures erased, accidentally or deliberately, by more dominant cultures. Much of what the Incas and Aztecs knew has been lost. Native people in the US, Canada, and Australia were forced to learn new languages and forbidden to refer to their prior cultures. Much harsher cultural erasure was visited on those brought to the New World by slavery.
In the digital world, many documents stored on floppy disks, Zip drives, magnetic tape, or CD-ROMs can no longer be recovered because the hardware and software needed to read them are no longer available. Dennison adds:
“GenAI is trained with massive datasets of text from sources like books, articles, websites and transcripts, hence the name ‘large language model.’ But this training data is far from the sum total of human knowledge. As well as oral cultures, many languages are underrepresented or absent. To understand why this matters, we must first recognize that languages serve as vessels for knowledge.
“They are not merely communication tools, but repositories of specialized understanding. Each language carries entire worlds of human experience and insight developed over centuries — the rituals and customs that shape communities, distinctive ways of seeing beauty and creating art, deep familiarity with specific landscapes and natural systems, spiritual and philosophical worldviews, subtle vocabularies for inner experiences, specialized expertise in various fields, frameworks for organizing society and justice, collective memories and historical narratives, healing traditions, and intricate social bonds.”
The Value Of Local Knowledge
An example of how historical narratives need to be preserved can be found in building homes that are appropriate to their environment. In parts of India, houses are made from local materials, a topic that Dharan Ashok, chief architect at Thannal, knows a great deal about. He agreed there is a strong connection between language and local ecological knowledge, and that this in turn underpins Indigenous architectural knowledge.
While modern construction is largely synonymous with concrete and steel, Indigenous building methods were deeply ecological. They relied on materials available in the surrounding environment, with biopolymers derived from native plants playing a significant role instead of concrete.
On its website, the company says, “At Thannal Natural Homes, we believe the earth beneath our feet is not just a material, but a living partner in the making of shelter. Our work stands for 0 percent cement, fully natural construction, rooted in the conviction that homes should breathe with us and return to the soil without harm.”
Ashok said the greatest challenge is that a great deal of human knowledge is undocumented, passed down orally through native languages. It is often held by just a few elders, and when they die, it is lost. He recently missed an opportunity to learn how to make a specific type of limestone-based brick when the last person with knowledge of the technique died.
The Danger Of Unintended Bias
“When AI systems lack adequate exposure to a language, they have blind spots in their comprehension of human experience,” Dennison explains. Common Crawl, one of the largest public sources of training data for AI, contains more than 300 billion web pages spanning 18 years, but the majority of those pages are in English. Hindi is the third most spoken language in the world, yet it accounts for only 0.2 percent of the data available on Common Crawl. Tamil is spoken by more than 86 million people, yet it represents just 0.04 percent of the data.
English is spoken by about 20 percent of the global population, but it dominates the digital space by a wide margin. Other colonial languages such as French, Italian, and Portuguese, with far fewer speakers than Hindi, are better represented.
In the computing world, approximately 97 percent of the world’s languages are classified as “low-resource,” yet many of them are spoken by millions of people and carry centuries of rich linguistic heritage. A study from 2020 showed 88 percent of the world’s languages are severely neglected in AI technologies.
Colonialism In The Digital World
In her book Decolonizing Methodologies (1999), the Māori scholar Linda Tuhiwai Smith emphasized that colonialism profoundly disrupted local knowledge systems — and the cultural and intellectual foundations upon which they were built — by severing ties to land, language, history and social structures. Smith’s insights reveal how these processes are not confined to a single region but form part of a broader legacy that continues to shape how knowledge is produced and valued. It is on this distorted foundation that today’s digital and GenAI systems are built. Of course, conservative initiatives that seek to downplay or eliminate some sources of ethnic knowledge play a key role in what gets included in LLM databases as well.
How Distortions Occur
Dennison explains that LLMs often amplify dominant patterns in a way that distorts their original proportions — an effect often called “mode amplification.” If the training data includes 60 percent references to pizza, 30 percent to pasta, and 10 percent to biriyani as favorite foods, you might expect the model to produce answers in the same proportions if asked the same question 100 times. In reality, LLMs tend to overproduce the most frequent answer.
Pizza may appear more than 60 times, while less frequent items like biriyani may be underrepresented or omitted altogether, because LLMs are optimized to predict the most probable next “token” — the next word or word fragment in a sequence — which leads to a disproportionate emphasis on high-likelihood responses. Because of uneven internal knowledge representation and mode amplification in output generation, LLMs often reinforce dominant cultural patterns and ideas.
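The mode-amplification effect described above can be illustrated with a toy simulation. The sketch below is not how a real LLM works internally — it simply sharpens the hypothetical pizza/pasta/biriyani distribution the way low-temperature sampling does, then compares draw counts to show the most frequent answer crowding out the rare ones:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical training-data proportions for "favorite food"
foods = ["pizza", "pasta", "biriyani"]
true_probs = [0.60, 0.30, 0.10]

def sharpen(probs, temperature):
    """Rescale probabilities as p**(1/T) and renormalize.
    Temperatures below 1 concentrate mass on the most likely item."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def sample_many(probs, n=1000):
    """Draw n answers from the distribution and tally them."""
    return Counter(random.choices(foods, weights=probs, k=n))

faithful = sample_many(true_probs)                  # roughly matches training proportions
amplified = sample_many(sharpen(true_probs, 0.5))   # low temperature: mode amplified

print("faithful:", dict(faithful))
print("amplified:", dict(amplified))
```

At a temperature of 0.5, pizza's share rises from 60 percent to roughly 78 percent, while biriyani's falls from 10 percent to about 2 percent, mirroring the distortion the article describes.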
Things get skewed further through reinforcement learning from human feedback, which fine-tunes GenAI models based on human preferences. This inevitably embeds the values and worldviews of their creators into the models themselves.
“Ask ChatGPT about a controversial topic and you’ll get a diplomatic response that sounds like it was crafted by a panel of lawyers and HR professionals who are overly eager to please you. Ask Grok the same question and you might get a sarcastic quip followed by a politically charged take that would fit right in at a certain tech billionaire’s dinner party,” Dennison writes.
The Sum Of The Parts
It is common to say the loss of Indigenous knowledge is a tragedy only for local communities, but Dennison suggests each loss impacts the world at large. Human knowledge is like the natural world — deeply interdependent in ways that may not be obvious.
For instance, when Yellowstone National Park eradicated wolves in the early 20th century, there were a number of unexpected ecological consequences. Without wolves to keep their numbers in check, the elk population exploded. The elk overgrazed vegetation and altered the landscape. Riverbanks eroded, tree growth stalled, and the broader ecosystem suffered. When wolves were reintroduced decades later, the system began to heal, vegetation rebounded, songbirds returned, and even the behavior of rivers changed.
Dennison’s premise is that the health of a system depends on the presence of all its parts, even those that might seem inconsequential. The same principle applies to human knowledge.
“The disappearance of local knowledge is not a trivial loss. It is a disruption to the larger web of understanding that sustains both human and ecological well being. Just as biological species have evolved to thrive in specific local environments, human knowledge systems are adapted to the particularities of place. When these systems are disrupted, the consequences can ripple far beyond their point of origin,” he suggests.
Living Up To The Hype
AI is being touted as the most significant technological advance in human history, and maybe it is. But if it excludes much of human experience — including knowledge that is handed down orally — it will fall far short of fulfilling its promise. It may even lead to a dangerous over-reliance on flawed information. The danger is greatest when it comes to addressing an overheating planet. Absent access to the most relevant data from all sources, AI may lead us further down the path of destruction.
It is perhaps instructive to remember the famous line from the early days of computer technology — Garbage In, Garbage Out. While we are bombarded with statements extolling the virtues of artificial intelligence and are rushing to build new nuclear, coal, and methane powered generating stations to power the data centers needed to make AI a reality, few are taking the time to ask one critical question: Is AI giving us accurate answers or just telling us what it thinks we want to hear — or what people like Elon Musk, Peter Thiel, and our political leaders want us to hear?
CleanTechnica readers, being well above average, are free to formulate their own answers to that question, with or without the assistance of AI.