Evaluation results

We evaluated freeact using three state-of-the-art models:

  • claude-3-5-sonnet-20241022
  • claude-3-5-haiku-20241022
  • gemini-2.0-flash-exp

The evaluation was performed on the m-ric/agents_medium_benchmark_2 dataset, developed by the smolagents team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:

[Figure: per-model results on the GAIA, GSM8K, and SimpleQA subsets]
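If you want to inspect the benchmark tasks yourself, the dataset can be loaded with the Hugging Face `datasets` library. The sketch below is illustrative only; the split name and row schema are assumptions, so check the dataset card on the Hub for the actual structure.

```python
# Minimal sketch: inspect the evaluation dataset with the
# Hugging Face `datasets` library. The "train" split and the
# row contents are assumptions; see the dataset card for details.
from datasets import load_dataset

ds = load_dataset("m-ric/agents_medium_benchmark_2", split="train")
print(ds)     # dataset summary: features and number of rows
print(ds[0])  # first task, e.g. question text and source benchmark
```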

Comparing freeact with smolagents, both using claude-3-5-sonnet-20241022, we observed the following outcomes (evaluation conducted on 2025-01-07; reference data here):

[Figure: freeact vs. smolagents results with claude-3-5-sonnet-20241022]

Notably, freeact achieved these results with zero-shot prompting, while the smolagents implementation uses few-shot prompting. To ensure a fair comparison, both evaluations used identical protocols and tools. You can find all evaluation details here.
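To make the distinction concrete, the snippet below contrasts the two prompting styles in generic chat-message form. It is illustrative only and does not reproduce freeact's or smolagents' actual prompt templates; the task and examples are made up.

```python
# Illustrative contrast between zero-shot and few-shot prompting.
# Not the actual freeact or smolagents prompts.
task = "How many positive integers below 100 are divisible by 7?"

# Zero-shot: the model receives the task alone, with no examples.
zero_shot_messages = [
    {"role": "user", "content": task},
]

# Few-shot: worked examples are prepended before the actual task,
# demonstrating the expected answer format.
few_shot_messages = [
    {"role": "user", "content": "How many multiples of 3 are below 10?"},
    {"role": "assistant", "content": "3, 6, 9 -> 3 multiples."},
    {"role": "user", "content": task},
]
```

Few-shot prompting typically trades extra input tokens for more reliable output formatting, which makes freeact's zero-shot results in this comparison noteworthy.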