A new study finds that artificial intelligence is not good at history

AI may excel at certain tasks like coding or generating podcasts, but it struggles to pass a high-level history exam, a new study has found.

A team of researchers has created a new benchmark for testing three large language models (LLMs) – OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini – on historical questions. The Hist-LLM benchmark tests the accuracy of answers against the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.
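
To make the setup concrete, here is a minimal, purely illustrative Python sketch of how a benchmark like Hist-LLM might score a model’s answers against a ground-truth databank. The names and sample questions below are hypothetical and are not drawn from the study’s actual code or data.

```python
# Illustrative sketch only: scoring model answers against a
# ground-truth databank, in the spirit of a benchmark like Hist-LLM.
# The names here (Question, score, answer_fn) are hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    prompt: str        # the question posed to the model
    ground_truth: str  # the answer recorded in the databank, e.g. "no"

def score(questions: list[Question], answer_fn: Callable[[str], str]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = sum(
        1
        for q in questions
        if answer_fn(q.prompt).strip().lower() == q.ground_truth.lower()
    )
    return correct / len(questions)

# Toy usage: a "model" that always answers "yes" gets both of the
# article's example questions wrong, since the databank records "no".
questions = [
    Question("Did ancient Egypt have a professional standing army in this period?", "no"),
    Question("Did scaled armor exist in ancient Egypt in this period?", "no"),
]
print(score(questions, lambda prompt: "yes"))  # prints 0.0
```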

The results, presented last month at the high-profile artificial intelligence conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it achieved only about 46% accuracy – not much better than random guessing.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, one of the study’s co-authors and an associate professor of computer science at University College London.

Researchers shared with TechCrunch examples of historical questions that the LLMs got wrong. For example, GPT-4 Turbo was asked whether scaled armor existed during a specific time period in ancient Egypt. The LLM answered yes, even though the technology did not appear in Egypt until 1,500 years later.

Why are LLMs so bad at answering technical history questions when they can be so good at answering very complex questions about things like programming? Del Rio-Chanona told TechCrunch that it is likely because LLMs tend to extrapolate from highly prominent historical data, making it difficult for them to retrieve more obscure historical knowledge.

For example, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a specific historical period. While the correct answer is no, the LLM incorrectly answered that it did. That is likely because there is a lot of published information about other ancient empires, such as Persia, that did have standing armies.

“If you were told A and B 100 times, and C once, and then asked a question about C, you might just remember A and B and try to extrapolate from that,” del Rio-Chanona said.

The researchers also identified other trends, including that the OpenAI and Llama models performed worse on questions about certain regions, such as sub-Saharan Africa, suggesting potential biases in their training data.

Peter Turchin, who led the study and is a faculty member at CSH, said the results show that LLMs are still no substitute for humans in certain fields.

But the researchers remain hopeful that LLMs can help historians in the future. They are working to refine their benchmark by including more data from underrepresented regions and adding more complex questions.

“Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential of these models to aid historical research,” the paper states.
