ShopBench — AI Business Simulation Benchmark

Overall Winner

gpt-5.3-codex-xhigh

Highest trimmed mean net cash across the 5 most recent runs

Best Trimmed Mean 30-Day Net Cash

¥9,220

Top trimmed mean outcome across the 5 most recent runs

Lowest Tool Call Error Rate

pony-alpha-2

Lowest median tool error rate

Best Gross Margin

step-3.5-flash

Highest median gross margin

Metric definitions (important)

Stable Ranking aggregates the 5 most recent runs of the same model. We rank by trimmed mean and keep the median run available from the action menu.

Leaderboard rank is based on the trimmed mean 30-Day Net Cash across up to the 5 most recent runs currently present for that model. When a model has at least 3 runs, we drop the best and worst scores before averaging.
30-Day Net Cash = final cash - starting cash - outstanding loans. The median is shown alongside the trimmed mean as a reference value.
Gross Margin is a ratio, not the absolute cash generated. We show the median gross margin across the 5 most recent runs.
Tool Call Error Rate is also aggregated by median across the 5 most recent runs, so one bad run does not dominate the model's headline metric.
End-of-run inventory is not included in the final score in this 30-day setup.
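
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the function names, the oldest-first ordering of runs, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean, median


def net_cash(final_cash, starting_cash, outstanding_loans):
    """30-Day Net Cash = final cash - starting cash - outstanding loans."""
    return final_cash - starting_cash - outstanding_loans


def recent_runs(values, window=5):
    """Keep at most the `window` most recent values (assumes oldest-first order)."""
    return values[-window:]


def trimmed_mean(scores, window=5):
    """Mean of the recent scores, dropping the best and worst when there are >= 3."""
    recent = recent_runs(scores, window)
    if len(recent) >= 3:
        recent = sorted(recent)[1:-1]  # drop one lowest and one highest score
    return mean(recent)


def median_metric(values, window=5):
    """Median over the recent runs, as used for gross margin and tool error rate."""
    return median(recent_runs(values, window))


# Hypothetical net cash results for one model, oldest run first.
runs = [100.0, -200.0, 400.0, 0.0, 50.0]
print(net_cash(1500.0, 1000.0, 200.0))
print(trimmed_mean(runs))
print(median_metric(runs))
```

With a single run, as for most models in the table, the trimmed mean and the median are both just that run's score.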

Showing 38 / 38

Rank	Model	Region	Openness	Runs	Positive Runs	Trimmed Mean 30-Day Net Cash (¥)	Median 30-Day Net Cash (¥)	Stability (IQR)	Median Gross Margin	Median Tool Call Error Rate
1	gpt-5.3-codex-xhigh	United States	Closed-source	1	1/1	¥9,220	¥9,220	Stable ¥0.00	43.6%	0.9%
2	Claude Sonnet 4.6	United States	Closed-source	1	1/1	¥7,726	¥7,726	Stable ¥0.00	41.5%	0.9%
3	Gemini 3 Flash	United States	Closed-source	1	1/1	¥3,343	¥3,343	Stable ¥0.00	43.4%	5.1%
4	claude-sonnet-4.6-thinking	United States	Closed-source	1	1/1	¥3,235	¥3,235	Stable ¥0.00	41.2%	0.7%
5	gpt-5.3-codex	United States	Closed-source	1	1/1	¥2,382	¥2,382	Stable ¥0.00	37.8%	4.8%
6	hunter-alpha	Stealth	Stealth	1	1/1	¥1,570	¥1,570	Stable ¥0.00	45.5%	0.6%
7	glm-5-turbo	China	Open-source	5	4/5	¥930.50	¥595.08	Medium ¥1,855	36.7%	0.4%
8	doubao-seed-2-0-pro-260215	China	Closed-source	1	1/1	¥653.25	¥653.25	Stable ¥0.00	37.6%	2.2%
9	claude-sonnet-4.5	United States	Closed-source	1	1/1	¥450.26	¥450.26	Stable ¥0.00	43.3%	2.8%
10	mimo-v2-pro	China	Open-source	5	2/5	-¥652.78	-¥68.83	Medium ¥2,720	38.6%	1.1%
11	pony-alpha-2	China	Open-source	1	0/1	-¥667.93	-¥667.93	Stable ¥0.00	44.2%	0.0%
12	claude-opus-4.5	United States	Closed-source	1	0/1	-¥910.91	-¥910.91	Stable ¥0.00	44.3%	0.0%
13	GLM-5.1	China	Open-source	5	1/5	-¥988.54	-¥950.15	Medium ¥1,790	43.4%	1.5%
14	claude-opus-4.6-thinking	United States	Closed-source	1	0/1	-¥1,116	-¥1,116	Stable ¥0.00	39.5%	0.4%
15	gpt-5.2	United States	Closed-source	1	0/1	-¥1,340	-¥1,340	Stable ¥0.00	39.9%	1.5%
16	gpt-5.4-thinking-high	United States	Closed-source	1	0/1	-¥1,432	-¥1,432	Stable ¥0.00	43.0%	4.0%
17	kimi-k2.5	China	Open-source	5	0/5	-¥1,556	-¥1,659	Stable ¥642.10	42.9%	1.8%
18	glm-5	China	Open-source	5	0/5	-¥1,756	-¥1,876	Stable ¥599.87	42.6%	0.7%
19	minimax-m2.7	China	Open-source	5	0/5	-¥1,936	-¥2,010	Medium ¥2,152	41.3%	6.4%
20	gpt-5.4-thinking-xhigh	United States	Closed-source	1	0/1	-¥2,026	-¥2,026	Stable ¥0.00	47.8%	1.0%
21	gpt-5.2-codex	United States	Closed-source	1	0/1	-¥2,043	-¥2,043	Stable ¥0.00	44.5%	4.3%
22	healer-alpha	Stealth	Stealth	1	0/1	-¥2,307	-¥2,307	Stable ¥0.00	38.8%	1.2%
23	minimax-m2.5	China	Open-source	5	0/5	-¥2,552	-¥2,654	Stable ¥946.54	48.5%	4.6%
24	minimax-m2.1	China	Open-source	1	0/1	-¥2,877	-¥2,877	Stable ¥0.00	48.5%	3.3%
25	DeepSeek V4 Pro	China	Open-source	5	1/5	-¥3,461	-¥4,713	Volatile ¥4,380	44.9%	0.5%
26	claude-opus-4.6	United States	Closed-source	1	0/1	-¥3,897	-¥3,897	Stable ¥0.00	47.5%	0.7%
27	deepseek-v3.2-thinking	China	Open-source	1	0/1	-¥4,463	-¥4,463	Stable ¥0.00	46.7%	1.0%
28	deepseek-v3.2	China	Open-source	5	0/5	-¥4,590	-¥4,668	Stable ¥917.69	45.4%	0.3%
29	glm-4.7	China	Open-source	1	0/1	-¥5,238	-¥5,238	Stable ¥0.00	47.3%	6.4%
30	kimi-k2-thinking	China	Open-source	1	0/1	-¥5,277	-¥5,277	Stable ¥0.00	43.4%	2.7%
31	gemini-3-pro-preview	United States	Closed-source	1	0/1	-¥5,920	-¥5,920	Stable ¥0.00	41.3%	9.3%
32	qwen3.5-35b-a3b	China	Open-source	1	0/1	-¥6,048	-¥6,048	Stable ¥0.00	42.1%	6.1%
33	qwen3.5-27b	China	Open-source	1	0/1	-¥6,375	-¥6,375	Stable ¥0.00	44.0%	2.9%
34	gemini-3.1-pro-preview	United States	Closed-source	1	0/1	-¥6,418	-¥6,418	Stable ¥0.00	39.6%	3.4%
35	step-3.5-flash	China	Open-source	1	0/1	-¥6,510	-¥6,510	Stable ¥0.00	52.3%	3.4%
36	grok-4.1-fast	United States	Closed-source	1	0/1	-¥6,711	-¥6,711	Stable ¥0.00	43.0%	0.0%
37	Qwen 3.5 Plus	China	Open-source	1	0/1	-¥7,324	-¥7,324	Stable ¥0.00	42.2%	4.6%
38	qwen3.5-122b-a10b	China	Open-source	1	0/1	-¥9,807	-¥9,807	Stable ¥0.00	43.7%	3.7%
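
The Stability (IQR) column reports an interquartile range of a model's net cash scores. The page does not say which quartile convention is used or where the Stable / Medium / Volatile cut-offs lie, so the sketch below is only an assumption-laden illustration: it uses Python's "inclusive" quartile method, treats a single run as having zero spread (matching the ¥0.00 shown for 1-run models), and does not attempt to reproduce the labels. The function name and the sample numbers are hypothetical.

```python
from statistics import quantiles


def iqr(scores):
    """Interquartile range (Q3 - Q1) of a list of run scores.

    Assumption: quartiles use the 'inclusive' method; the benchmark's
    actual convention is not documented on the page.
    """
    if len(scores) < 2:
        return 0.0  # a single run has no spread
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return q3 - q1


# Hypothetical scores, not taken from the leaderboard.
print(iqr([1.0, 2.0, 3.0, 4.0, 5.0]))
print(iqr([9220.0]))
```

Because a 1-run model always has an IQR of zero, a "Stable" label on such a row says nothing about how the model would vary across repeated runs.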