Researchers at the startup Patronus AI have found that large language models (LLMs) such as GPT frequently falter when answering questions derived from Securities and Exchange Commission (SEC) filings. The best-performing model in their tests, OpenAI's GPT-4 Turbo, answered only 79% of questions correctly. Anand Kannappan, co-founder of Patronus AI, called that level of performance 'absolutely unacceptable' for applications intended for production.
Challenging Terrain for AI in Regulated Industries
The results highlight the challenges AI models face in regulated industries such as finance, where precision and reliability are essential. The models often stumbled, producing incorrect figures or refusing to answer outright. The stakes are significant: the ability to accurately summarize or query SEC filings could offer a competitive edge in the financial sector.
FinanceBench: A New Benchmark for Financial-Sector AI Performance
The benchmark behind the study, dubbed FinanceBench, comprises more than 10,000 questions and answers derived from SEC filings. Patronus AI intends it as a standard for measuring AI performance in the financial sector. Answering the questions demands not only text extraction but also light mathematical reasoning, adding to the complexity and rigor of the test.
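To make the evaluation setup concrete, the sketch below shows how accuracy on a question-answer benchmark of this kind can be scored. This is a minimal illustration only: the dataset format, the toy questions, and the `mock_model` stub are hypothetical stand-ins, not the actual FinanceBench data or Patronus AI's methodology.

```python
# Minimal sketch of scoring a QA benchmark by exact-match accuracy.
# The dataset layout and mock_model are hypothetical illustrations,
# not the real FinanceBench format or evaluation harness.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def score(dataset, answer_fn) -> float:
    """Return the fraction of questions the answer function gets exactly right."""
    correct = sum(
        normalize(answer_fn(item["question"])) == normalize(item["answer"])
        for item in dataset
    )
    return correct / len(dataset)

# Toy stand-in data: two extraction questions and one light-math question.
toy_dataset = [
    {"question": "What was FY2022 revenue?", "answer": "$4.1 billion"},
    {"question": "Who is the CEO?", "answer": "Jane Doe"},
    {"question": "What was revenue growth from FY2021 to FY2022?", "answer": "5%"},
]

def mock_model(question: str) -> str:
    """A placeholder 'model' that answers two of the three questions correctly."""
    lookup = {
        "What was FY2022 revenue?": "$4.1 Billion",
        "Who is the CEO?": "Jane Doe",
        "What was revenue growth from FY2021 to FY2022?": "7%",
    }
    return lookup[question]

print(round(score(toy_dataset, mock_model), 2))  # prints 0.67
```

Real evaluations are usually more forgiving than exact match (allowing numeric tolerance or paraphrase), but the accuracy figure reported in the study is the same kind of correct-over-total ratio computed here.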
Ensuring Reliable AI Implementations in Business
Patronus AI builds software tools for automated LLM testing, aiming to ensure that AI implementations in business are reliable and do not disseminate misleading information. As reliance on AI grows across sectors, that focus on accuracy and reliability is becoming more critical than ever.