Sarisari-Bench

An AI agent benchmark for managing a sari-sari store in the Philippines. We measure the ability of models to stay coherent and successfully manage a simulated business over 30 days.

Long-term coherence matters more as agents take on longer tasks. We expect AI models to soon take an active part in the economy, including managing entire businesses. To do that, they must stay coherent and efficient over very long time horizons, and a 30-day store simulation is one way to test this.

Return on Investment

Final cash as a percentage of the initial ₱5,000.00

Rank	Model	Final Cash (PHP)	Return	Profit (PHP)
1	Claude Sonnet 4	₱6,735.00	134.7%	+₱1,735.00
2	Gemini 2.5 Flash	₱6,722.00	134.4%	+₱1,722.00
3	GPT-4o	₱6,432.00	128.6%	+₱1,432.00
4	Gemma 2 2B	₱6,131.00	122.6%	+₱1,131.00
5	Llama 3 70B	₱6,098.00	122.0%	+₱1,098.00
6	Gemini 2.0 Flash	₱5,987.00	119.7%	+₱987.00
7	GPT-4.1 Mini	₱5,870.00	117.4%	+₱870.00
8	CodeLlama 7B	₱5,637.00	112.7%	+₱637.00
9	Claude Haiku 3.5	₱5,585.00	111.7%	+₱585.00
10	GPT-4o Mini	₱5,585.00	111.7%	+₱585.00
11	Llama 3.2 3B	₱5,219.00	104.4%	+₱219.00
12	Phi-3 Mini	₱5,178.00	103.6%	+₱178.00
13	Llama 3.2 1B (LM Studio)	₱5,163.00	103.3%	+₱163.00
14	Gemma 3n E4B (LM Studio)	₱3,503.00	70.1%	-₱1,497.00

Initial cash: ₱5,000.00 (100%)
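
The Return and Profit columns follow directly from final cash and the ₱5,000.00 starting balance. The sketch below reproduces that arithmetic for two rows of the table; the function names `roi` and `format_row` are illustrative assumptions, not part of the benchmark's own code.

```python
# Minimal sketch of the leaderboard arithmetic.
# roi() and format_row() are hypothetical helper names, not the benchmark's API.

INITIAL_CASH = 5000.00  # PHP, the starting cash stated on the leaderboard


def roi(final_cash, initial_cash=INITIAL_CASH):
    """Return (return_pct, profit) for a run's final cash balance."""
    profit = final_cash - initial_cash
    return_pct = final_cash / initial_cash * 100
    return round(return_pct, 1), round(profit, 2)


def format_row(model, final_cash):
    """Format one row with an explicit sign in front of the peso symbol."""
    pct, profit = roi(final_cash)
    sign = "+" if profit >= 0 else "-"
    return f"{model}: ₱{final_cash:,.2f} | {pct:.1f}% | {sign}₱{abs(profit):,.2f}"


if __name__ == "__main__":
    # Top and bottom rows of the table above.
    print(format_row("Claude Sonnet 4", 6735.00))
    print(format_row("Gemma 3n E4B (LM Studio)", 3503.00))
```

A return above 100% means the model ended the 30 days with more cash than it started with; a return below 100% means a loss.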

Cash Balance Over Time

Average daily cash balance by model
