Evaluation results
We evaluated freeact using three state-of-the-art models:

- claude-3-5-sonnet-20241022
- claude-3-5-haiku-20241022
- gemini-2.0-flash-exp
The evaluation was performed on the m-ric/agents_medium_benchmark_2 dataset, developed by the smolagents team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA.
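If you want to inspect the benchmark yourself, here is a minimal sketch using the 🤗 `datasets` library. It assumes the dataset loads with its default configuration and makes no assumptions about split or column names; it simply prints whatever the Hub provides:

```python
from datasets import load_dataset  # pip install datasets

# Load the benchmark from the Hugging Face Hub
# (assumes the default configuration).
benchmark = load_dataset("m-ric/agents_medium_benchmark_2")

# Print each split with its column names and size.
for split_name, split in benchmark.items():
    print(split_name, split.column_names, len(split))
```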
When comparing our results with smolagents using claude-3-5-sonnet-20241022, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data here):
Interestingly, these results were achieved with zero-shot prompting in freeact, while the smolagents implementation uses few-shot prompting. To ensure a fair comparison, we employed identical evaluation protocols and tools. You can find all evaluation details here.
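To make the zero-shot vs. few-shot distinction concrete, here is a minimal, purely illustrative sketch of the two prompting styles; neither prompt is taken from freeact or smolagents:

```python
# Illustrative only: these prompts are not from freeact or smolagents.
TASK = "What is 12% of 250?"

# Zero-shot: the model receives the task with no worked examples.
zero_shot_prompt = f"Solve the following task.\n\nTask: {TASK}"

# Few-shot: the prompt is prefixed with hand-written example solutions
# that the model can imitate.
EXAMPLES = (
    "Task: What is 10% of 80?\n"
    "Answer: 0.10 * 80 = 8\n\n"
)
few_shot_prompt = f"Solve the following task.\n\n{EXAMPLES}Task: {TASK}"

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```

Few-shot examples typically improve reliability at the cost of extra context tokens and manual curation, which is what makes matching few-shot results with zero-shot prompts noteworthy.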