Sarisari-Bench
An AI agent benchmark for managing a sari-sari store in the Philippines. We measure the ability of models to stay coherent and successfully manage a simulated business over 30 days.
Long-term coherence in agents is more important than ever. We expect AI models to soon take an active part in the economy, managing entire businesses. To do this, they must stay coherent and efficient over very long time horizons, and that is what Sarisari-Bench measures.
Return on Investment
Final cash as a percentage of the initial ₱5,000.00
| Rank | Model | Final Cash (PHP) | Return | Profit (PHP) |
|---|---|---|---|---|
| 1 | Claude Sonnet 4 | ₱6,735.00 | 134.7% | +₱1,735.00 |
| 2 | Gemini 2.5 Flash | ₱6,722.00 | 134.4% | +₱1,722.00 |
| 3 | GPT-4o | ₱6,432.00 | 128.6% | +₱1,432.00 |
| 4 | Gemma 2 2B | ₱6,131.00 | 122.6% | +₱1,131.00 |
| 5 | Llama 3 70B | ₱6,098.00 | 122.0% | +₱1,098.00 |
| 6 | Gemini 2.0 Flash | ₱5,987.00 | 119.7% | +₱987.00 |
| 7 | GPT-4.1 Mini | ₱5,870.00 | 117.4% | +₱870.00 |
| 8 | CodeLlama 7B | ₱5,637.00 | 112.7% | +₱637.00 |
| 9 | Claude Haiku 3.5 | ₱5,585.00 | 111.7% | +₱585.00 |
| 10 | GPT-4o Mini | ₱5,585.00 | 111.7% | +₱585.00 |
| 11 | Llama 3.2 3B | ₱5,219.00 | 104.4% | +₱219.00 |
| 12 | Phi-3 Mini | ₱5,178.00 | 103.6% | +₱178.00 |
| 13 | Llama 3.2 1B (LM Studio) | ₱5,163.00 | 103.3% | +₱163.00 |
| 14 | Gemma 3n E4B (LM Studio) | ₱3,503.00 | 70.1% | -₱1,497.00 |
Initial cash: ₱5,000.00 (100%)
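The Return and Profit columns above follow directly from each model's final cash and the ₱5,000.00 starting balance. The snippet below is a minimal sketch of that arithmetic, not the benchmark's own code: the names `summarize` and `format_profit` are hypothetical, and rounding to one decimal place is an assumption inferred from the table.

```python
# Hypothetical helpers illustrating the table's arithmetic; not taken from
# the Sarisari-Bench codebase.
INITIAL_CASH_PHP = 5000.00  # starting balance stated in the table caption


def summarize(final_cash_php, initial_cash_php=INITIAL_CASH_PHP):
    """Return (return_pct, profit_php) for one model's run."""
    # "Return" is final cash as a percentage of initial cash,
    # so 100% means the model broke even.
    return_pct = final_cash_php / initial_cash_php * 100
    # "Profit" is the absolute change in cash over the run.
    profit_php = final_cash_php - initial_cash_php
    return round(return_pct, 1), round(profit_php, 2)


def format_profit(profit_php):
    """Format a profit with an explicit sign placed before the peso symbol."""
    sign = "+" if profit_php >= 0 else "-"
    return f"{sign}₱{abs(profit_php):,.2f}"


if __name__ == "__main__":
    for model, final_cash in [("Claude Sonnet 4", 6735.00),
                              ("Gemma 3n E4B (LM Studio)", 3503.00)]:
        return_pct, profit = summarize(final_cash)
        print(f"{model}: {return_pct}% {format_profit(profit)}")
```

Note that a Return below 100% corresponds to a loss: the last-ranked model finished with about 70% of its starting cash.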
Cash Balance Over Time
Average daily cash balance by model