Trase is back at the top of the GAIA leaderboard for general AI assistants, this time marking a major milestone: Trase is the first general AI agent to hold the highest published test score, at nearly 67%, alongside a validation score of 70.3% – figures that outperform official results from industry giants like Google, Microsoft, Hugging Face, and H2O.ai. Even more remarkable, we achieved these scores in far fewer passes, spending roughly 1/100th the cost per query of our competitors.
We first topped the GAIA leaderboard – a benchmark that tests AI systems on multi-step reasoning, multimodal data handling, real-time web browsing, and general tool usage – in September 2024, leapfrogging competitors with an accuracy score of 35.55%. Since then, we have nearly doubled that result, reaching an overall test score of 66.78% by refining our agents, identifying edge cases where they had previously given incorrect responses due to network errors, and addressing those failures so they do not recur. At Trase, we have repeatedly shown that our specialized approach to building agentic AI – combined with domain-specific data integration – lets us consistently outperform competitors, and do so cost-effectively.
Validation vs. Full Test
In February, OpenAI announced that Deep Research had reached the top of the validation set of the GAIA leaderboard with an accuracy score of 72.57%. The validation set consists of 150+ published questions that participants can use to tune their AI models and gauge performance before submitting to the full test set. The test set is a more rigorous evaluation of 300+ non-trivial questions, each with a single unambiguous answer, giving entrants a straightforward, ranked way to demonstrate the effectiveness and sophistication of their technologies. Hosted on the machine learning platform Hugging Face, GAIA poses questions that cannot be easily answered from the large web scrapes typically used to train large language models (LLMs); solving each one requires different levels of tooling and autonomy. In addition, some questions come with files that reflect real-world use cases – images, audio, and documents ranging from Word, Excel, and PowerPoint to PDF, JSON, XML, CSV, and Zip, among others.
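Because each GAIA question has a single unambiguous answer, submissions are scored by comparing a model's answer string against the ground truth after light normalization. The sketch below is illustrative only – the normalization rules shown (trimming, lowercasing, stripping trailing periods and thousands separators) are assumptions, not GAIA's official scorer:

```python
def normalize(answer: str) -> str:
    """Illustrative answer normalization; GAIA's official scorer
    applies its own rules, which may differ."""
    # Trim whitespace, lowercase, drop a trailing period and thousands separators.
    return answer.strip().lower().rstrip(".").replace(",", "")

def quasi_exact_match(prediction: str, ground_truth: str) -> bool:
    """Score a prediction against the single unambiguous ground-truth answer."""
    return normalize(prediction) == normalize(ground_truth)

print(quasi_exact_match("1,024.", "1024"))   # numeric formatting differences ignored
print(quasi_exact_match("Paris", " paris ")) # case and whitespace ignored
```

Either pass counts as correct under this scheme; any substantive deviation from the ground-truth answer does not.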
Cost to Win
Cost of submission is driven primarily by two factors: the complexity of each question and the number of runs required to arrive at an answer. The more complex the query, the more reasoning or “deep thinking” agents must do. Like a human tackling a difficult problem, agents sometimes attempt an answer, fail, and have to rethink and retry, resulting in multiple runs that further increase total cost. Deep Research’s accuracy and Level 1-3 scores were aggregated internally from 64 attempts per question. At $17-$20 per task in low-compute mode, that ran OpenAI anywhere from ~$1,100-$1,300 per question. By contrast, Trase, which answered most questions in a single pass, spent roughly $10 in compute per query. This cost differential not only highlights Trase’s cost-efficiency and strong value proposition for enterprise customers; it also underscores how our model-agnostic architecture and our focus on outcome-supervised reinforcement learning allow us to deliver the best agentic AI available without the need for expensive infrastructure.
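The per-question figures above follow directly from the reported numbers: 64 attempts per question at $17-$20 per attempt. A quick sanity check of that arithmetic (all figures are the approximations quoted above, not measured values):

```python
# Rough per-question cost comparison using the figures quoted above.
ATTEMPTS_PER_QUESTION = 64            # Deep Research attempts aggregated per question
COST_PER_ATTEMPT = (17.0, 20.0)       # reported low-compute cost range per task, USD
TRASE_COST_PER_QUERY = 10.0           # approximate Trase compute cost, single pass

low = ATTEMPTS_PER_QUESTION * COST_PER_ATTEMPT[0]   # 64 * 17 = 1088
high = ATTEMPTS_PER_QUESTION * COST_PER_ATTEMPT[1]  # 64 * 20 = 1280

print(f"Deep Research: ~${low:,.0f}-${high:,.0f} per question")
print(f"Trase: ~${TRASE_COST_PER_QUERY:,.0f} per query "
      f"({low / TRASE_COST_PER_QUERY:.0f}x-{high / TRASE_COST_PER_QUERY:.0f}x cheaper)")
```

The multiplied range lands at $1,088-$1,280, consistent with the ~$1,100-$1,300 figure and the roughly 1/100th-the-cost claim.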
Enterprise Ready and Available
The fundamental difference between Trase and other agentic AI platforms, like OpenAI DeepResearch, Google AgentSpace, Salesforce Agentforce, and Microsoft Copilot Agents, is threefold: Trase fine-tunes models on industry-specific data, producing more accurate, higher-quality results; it provides a secure, scalable platform and configures solutions for organizations, even those without extensive IT teams; and it scales across the organization to capture industry-specific opportunities, like clinical-trial matching.
Data privacy and security are at the forefront of everything that Trase does, which is why we isolate customer data and provide full auditability and explainability. While other agentic systems, including OpenAI’s, may excel at general-purpose web browsing and “universal” tasks, they face limitations with private data and in compliance-heavy environments like healthcare (HIPAA) and finance (SOC 2).
This last-mile customization and data integration capability – often the most challenging part of implementing AI in real-world workflows – is precisely where Trase stands out. Our approach ensures that even customers without extensive in-house AI teams can efficiently adopt and scale advanced AI solutions.
For more information about how Trase can help you unlock new levels of productivity with AI agents or to schedule a demo, click here.
Build With Us
Our formula for success begins with those who dare to look beyond what’s possible.
Get Started