Sarisari-Bench

An AI agent benchmark for managing a sari-sari store in the Philippines. We measure the ability of models to stay coherent and successfully manage a simulated business over 30 days.

Long-term coherence in agents is more important than ever. We expect AI models to soon take an active part in the economy, managing entire businesses. To do this, they have to stay coherent and efficient over very long time horizons. Sarisari-Bench tests exactly this by having each model run a simulated sari-sari store and tracking how its cash balance develops.

Return on Investment

Final cash as a percentage of the initial 5,000 PHP.

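| Rank | Model | Final Cash (PHP) | Return (% of initial) | Profit (PHP) |
|------|-------|------------------|-----------------------|--------------|
| 1 | Claude Sonnet 4 | ₱6,735.00 | 134.7% | +₱1,735.00 |
| 2 | Gemini 2.5 Flash | ₱6,722.00 | 134.4% | +₱1,722.00 |
| 3 | GPT-4o | ₱6,432.00 | 128.6% | +₱1,432.00 |
| 4 | Gemma 2 2B | ₱6,131.00 | 122.6% | +₱1,131.00 |
| 5 | Llama 3 70B | ₱6,098.00 | 122.0% | +₱1,098.00 |
| 6 | Gemini 2.0 Flash | ₱5,987.00 | 119.7% | +₱987.00 |
| 7 | GPT-4.1 Mini | ₱5,870.00 | 117.4% | +₱870.00 |
| 8 | CodeLlama 7B | ₱5,637.00 | 112.7% | +₱637.00 |
| 9 | Claude Haiku 3.5 | ₱5,585.00 | 111.7% | +₱585.00 |
| 10 | GPT-4o Mini | ₱5,585.00 | 111.7% | +₱585.00 |
| 11 | Llama 3.2 3B | ₱5,219.00 | 104.4% | +₱219.00 |
| 12 | Phi-3 Mini | ₱5,178.00 | 103.6% | +₱178.00 |
| 13 | Llama 3.2 1B (LM Studio) | ₱5,163.00 | 103.3% | +₱163.00 |
| 14 | Gemma 3n E4B (LM Studio) | ₱3,503.00 | 70.1% | -₱1,497.00 |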
Initial cash: ₱5,000.00 (100%)
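
The Return and Profit figures in the leaderboard follow from each model's final cash and the ₱5,000.00 starting balance. The sketch below shows that arithmetic for the first and last rows. The function name `summarize` and the rounding choices are assumptions made for illustration, not the benchmark's own code.

```python
INITIAL_CASH_PHP = 5_000.00  # starting balance stated in the leaderboard


def summarize(final_cash_php, initial_cash_php=INITIAL_CASH_PHP):
    """Return (return_pct, profit_php) for a single model run.

    return_pct is final cash as a percentage of initial cash,
    rounded to one decimal place as in the leaderboard (assumed rounding).
    profit_php is final cash minus initial cash.
    """
    return_pct = round(final_cash_php / initial_cash_php * 100, 1)
    profit_php = round(final_cash_php - initial_cash_php, 2)
    return return_pct, profit_php


print(summarize(6_735.00))  # rank 1 row  -> (134.7, 1735.0)
print(summarize(3_503.00))  # rank 14 row -> (70.1, -1497.0)
```

Note that a Return of 100% means the model broke even, so values below 100% correspond to a loss.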

Cash Balance Over Time

Figure: average daily cash balance by model.
