ShopBench — AI Business Simulation Benchmark

Model	Score	Cash	Revenue	Margin	Satisfaction
Claude Sonnet 4.6	¥7,726	¥27,726	¥49,961	41.5%	70
Gemini 3 Flash	¥3,343	¥23,343	¥44,091	43.4%	81
gpt-5.3-codex	¥2,382	¥22,382	¥34,321	37.8%	81
claude-sonnet-4.5	¥450.26	¥20,450	¥31,826	43.3%	72
claude-opus-4.5	-¥910.91	¥19,089	¥23,271	44.3%	58
deepseek-v3.2	-¥1,150	¥18,850	¥20,255	46.5%	64
gpt-5.2	-¥1,340	¥18,660	¥32,998	39.9%	81
glm-5	-¥1,489	¥18,511	¥20,904	43.2%	71
gpt-5.2-codex	-¥2,043	¥17,957	¥21,828	44.5%	68
minimax-m2.1	-¥2,877	¥17,123	¥18,939	48.5%	82
kimi-k2.5	-¥3,093	¥16,907	¥30,816	44.1%	65
minimax-m2.5	-¥3,846	¥16,154	¥25,508	44.6%	71
claude-opus-4.6	-¥3,897	¥16,103	¥16,491	47.5%	80
glm-4.7	-¥5,238	¥14,762	¥10,360	47.3%	68
gemini-3-pro-preview	-¥5,920	¥14,080	¥51,463	41.3%	67
qwen3.5-35b-a3b	-¥6,048	¥13,952	¥20,317	42.1%	69
gemini-3.1-pro-preview	-¥6,418	¥13,582	¥69,594	39.6%	100
grok-4.1-fast	-¥6,711	¥13,289	¥40,374	43.0%	100
Qwen 3.5 Plus	-¥7,324	¥12,676	¥16,730	42.2%	68
qwen3.5-122b-a10b	-¥9,807	¥10,193	¥16,792	43.7%	64
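
The table does not say how Score is computed. In every row, though, Score matches Cash minus ¥20,000 to within rounding, which suggests (as an inference, not a documented rule) that each model starts with ¥20,000 and is scored on net change in cash. The sketch below checks that relationship on a few rows copied from the table. The names parse_yen, score_gap, and ASSUMED_START_CASH are hypothetical and introduced here only for illustration.

```python
# Assumed starting balance: inferred from the table, not stated by the benchmark.
ASSUMED_START_CASH = 20_000

# (model, score, cash) copied verbatim from the leaderboard rows above.
ROWS = [
    ("Claude Sonnet 4.6", "¥7,726", "¥27,726"),
    ("claude-sonnet-4.5", "¥450.26", "¥20,450"),
    ("claude-opus-4.5", "-¥910.91", "¥19,089"),
    ("qwen3.5-122b-a10b", "-¥9,807", "¥10,193"),
]


def parse_yen(text: str) -> float:
    """Turn a string such as '-¥1,150' or '¥450.26' into a float."""
    return float(text.replace("¥", "").replace(",", ""))


def score_gap(score: str, cash: str) -> float:
    """Listed score minus (cash - assumed starting balance)."""
    return parse_yen(score) - (parse_yen(cash) - ASSUMED_START_CASH)


if __name__ == "__main__":
    for model, score, cash in ROWS:
        # A gap below 1 yen is consistent with Cash being shown rounded.
        print(f"{model}: gap = {score_gap(score, cash):+.2f}")
```

If the assumption holds, the small nonzero gaps on the two rows with decimal scores come only from Cash being displayed as a whole number.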

Model Comparison