Research shows that while artificial intelligence excels at tasks like coding and podcast generation, it struggles to answer advanced history questions accurately.
The researchers tested OpenAI's GPT-4, Meta's Llama, and Google's Gemini using a newly developed benchmark called Hist-LLM.
This benchmark relies on the Seshat Global History Databank, a comprehensive database of historical knowledge.
The study, presented at last month's NeurIPS AI conference, produced disappointing results, according to TechCrunch.
GPT-4 Turbo performed best, but its accuracy was only about 46%, just slightly better than random guessing.
“Although LLMs are impressive, they still lack the depth needed for advanced history,” says study co-author Maria del Rio-Chanona, an associate professor at University College London.
“They're good at basic facts, but fail at nuanced, doctoral-level historical research.”
The researchers found that LLMs often extrapolate from prominent historical data but struggle with more obscure details.
For example, GPT-4 incorrectly stated that scale armor existed in ancient Egypt during a specific period, when in fact the technology did not appear there until 1,500 years later.
Similarly, the model erroneously claimed that ancient Egypt had a professional standing army during a certain period, perhaps because information about standing armies in other ancient empires, such as Persia, is far more widespread.
“If you're told A and B 100 times and C only once, you're more likely to remember A and B,” del Rio-Chanona explained.
Another concern was potential bias.
OpenAI's GPT-4 and Meta's Llama models performed especially poorly on questions about regions such as sub-Saharan Africa, pointing to limitations in their training data.
“These biases suggest that LLMs reflect gaps in historical documentation rather than an unbiased representation of history,” said Peter Turchin, the study's lead researcher.
Despite these limitations, researchers are hopeful that AI can help historians in the future.
They plan to improve the Hist-LLM benchmark by incorporating more diverse data sources and increasing the complexity of questions.
“Our findings highlight areas where LLMs need improvement, but also show their potential to support historical research,” the paper concludes.
As AI continues to evolve, experts say human historians remain invaluable for interpreting complex historical narratives and ensuring the accuracy of scholarly research.