ShopBench — AI Business Simulation Benchmark

What separates winning model operators from struggling ones

Top performers combine frequent price adjustments with disciplined purchasing and a low tool-call failure rate. Low performers either adjust prices too rarely or spend too many calls on analysis without following through with execution.

Profitable Models: 4 / 20

Best Correlation Signal: Pricing ↔ Net Cash (r=0.38)

Strongest Strategy Cluster: Aggressive Growth

Largest Strategy Cluster: Passive Strugglers (11)

Model	Net Profit	Price Changes	Purchases	Revenue
Claude Sonnet 4.6	¥7,726	52	85	¥49,961
Gemini 3 Flash	¥3,343	92	104	¥44,091
gpt-5.3-codex	¥2,382	104	71	¥34,321

Model	Net Profit	Price Changes	Purchases	Revenue	Zero-Rev Days
claude-sonnet-4.5	¥450.26	22	88	¥31,826	0
claude-opus-4.5	-¥910.91	10	85	¥23,271	0

Model	Net Profit	Price Changes	Purchases	Revenue	Zero-Rev Days
deepseek-v3.2	-¥1,150	5	69	¥20,255	0
glm-5	-¥1,489	8	89	¥20,904	1

Model	Net Profit	Price Changes	Purchases	Revenue	Zero-Rev Days
claude-opus-4.6	-¥3,897	26	59	¥16,491	8
glm-4.7	-¥5,238	4	61	¥10,360	4

Model	Net Profit	Price Changes	Purchases	Revenue	Zero-Rev Days
gpt-5.2	-¥1,340	85	92	¥32,998	0
gpt-5.2-codex	-¥2,043	27	104	¥21,828	0
minimax-m2.1	-¥2,877	3	65	¥18,939	0
kimi-k2.5	-¥3,093	11	83	¥30,816	0
minimax-m2.5	-¥3,846	8	111	¥25,508	0
gemini-3-pro-preview	-¥5,920	71	152	¥51,463	0
qwen3.5-35b-a3b	-¥6,048	5	69	¥20,317	1
gemini-3.1-pro-preview	-¥6,418	95	128	¥69,594	0
grok-4.1-fast	-¥6,711	22	75	¥40,374	0
Qwen 3.5 Plus	-¥7,324	3	67	¥16,730	4
qwen3.5-122b-a10b	-¥9,807	6	80	¥16,792	1
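
The headline figures can be rechecked from the leaderboard rows. The sketch below counts profitable models and computes a Pearson correlation between the Price Changes and Net Profit columns. It assumes that "Pricing ↔ Net Cash" refers to exactly these two columns and that Pearson's r is the statistic used; the source does not state the method, and the `ROWS` list and `pearson` helper are illustrative names.

```python
import math

# (model, net profit in yen, price changes), copied from the leaderboard tables.
ROWS = [
    ("Claude Sonnet 4.6", 7726.0, 52),
    ("Gemini 3 Flash", 3343.0, 92),
    ("gpt-5.3-codex", 2382.0, 104),
    ("claude-sonnet-4.5", 450.26, 22),
    ("claude-opus-4.5", -910.91, 10),
    ("deepseek-v3.2", -1150.0, 5),
    ("glm-5", -1489.0, 8),
    ("claude-opus-4.6", -3897.0, 26),
    ("glm-4.7", -5238.0, 4),
    ("gpt-5.2", -1340.0, 85),
    ("gpt-5.2-codex", -2043.0, 27),
    ("minimax-m2.1", -2877.0, 3),
    ("kimi-k2.5", -3093.0, 11),
    ("minimax-m2.5", -3846.0, 8),
    ("gemini-3-pro-preview", -5920.0, 71),
    ("qwen3.5-35b-a3b", -6048.0, 5),
    ("gemini-3.1-pro-preview", -6418.0, 95),
    ("grok-4.1-fast", -6711.0, 22),
    ("Qwen 3.5 Plus", -7324.0, 3),
    ("qwen3.5-122b-a10b", -9807.0, 6),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


profits = [profit for _, profit, _ in ROWS]
price_changes = [changes for _, _, changes in ROWS]

profitable = sum(1 for profit in profits if profit > 0)
r = pearson(price_changes, profits)

print(f"Profitable models: {profitable} / {len(ROWS)}")
print(f"Pearson r (price changes vs net profit): {r:.2f}")
```

Under these assumptions the result should land close to the reported r=0.38. A correlation of that size is modest: several models with 70 or more price changes still finished with a loss.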

Claude Sonnet 4.6

Rank #1 · Style: Aggressive Growth

Claude Sonnet 4.6 finished rank #1 with ¥7,726 in 30-Day Net Cash. It generated ¥49,961 total revenue at 41.5% gross margin, while executing 535 tool calls with a 0.9% tool call error rate.

Operating style: Aggressive Growth. Models in this cluster adjust prices at high frequency and actively restock and promote. They treat pricing as a daily optimization lever, making 50 to 100 or more price changes across 30 days, and pair it with bold purchasing and active promotions to drive high revenue and consistent profits. In this run, Claude Sonnet 4.6 allocated 64.1% of its tool calls to information gathering and 35.9% to execution actions, with 52 pricing updates and 85 purchase attempts.

30-Day Net Cash: ¥7,726

Gross Margin: 41.5%

Tool Call Error Rate: 0.9%

Opening Setup

Day 1-10

The model established a viable opening with ¥10,306 revenue and ¥1,466 net profit in the first 10 days.

Core actions: 33 purchases, 20 pricing changes, 6 promotions.
Execution load: 189 tool calls with 2.1% phase error rate.
Demand continuity: 0 zero-revenue days.

Best day in full run occurred on D20 (¥1,910), worst on D12 (-¥128.25).

Mid-Run Optimization

Day 11-20

Mid-run decisions compounded positively, producing ¥5,169 profit in Days 11-20.

Pricing cadence shifted to 16 updates in this phase.
Procurement + promotion balance: 34 purchase calls and 8 promotions.
Tool throughput stayed at 195 calls; zero-revenue days: 0.

Gross margin at run level is 41.5%, with overall Tool Call Error Rate at 0.9%.

Endgame Execution

Day 21-30

The model closed with resilient endgame execution and ¥3,760 late-phase profit.

Late actions: 18 purchases, 16 pricing changes, 25 promotions.
Cash conversion pressure: 0 zero-revenue days in the final 10-day window.
Final phase execution quality: 0.7% error rate (1/151).

Run finished at ¥7,726 net cash after 30 days, versus ¥3,343 for Gemini 3 Flash.
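
The three phase summaries can be cross-checked against the run-level figures by summing the per-phase counts. This is a minimal sketch using only numbers quoted in the phase sections; the `Phase` dataclass and its field names are illustrative, not part of the benchmark's data format.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    tool_calls: int
    purchases: int
    price_changes: int
    promotions: int


# Per-phase counts as quoted in the three phase sections.
PHASES = [
    Phase("Opening Setup (Day 1-10)", 189, 33, 20, 6),
    Phase("Mid-Run Optimization (Day 11-20)", 195, 34, 16, 8),
    Phase("Endgame Execution (Day 21-30)", 151, 18, 16, 25),
]

totals = {
    "tool_calls": sum(p.tool_calls for p in PHASES),
    "purchases": sum(p.purchases for p in PHASES),
    "price_changes": sum(p.price_changes for p in PHASES),
    "promotions": sum(p.promotions for p in PHASES),
}

for key, value in totals.items():
    print(f"{key}: {value}")
```

The sums match the run-level 535 tool calls, 85 purchase attempts, and 52 pricing updates. They also show that promotions were concentrated in the endgame, with 25 of 39 in the final ten days.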

Evidence Board

Delta vs Gemini 3 Flash

Claude Sonnet 4.6 leads Gemini 3 Flash by ¥4,383 in 30-Day Net Cash. The gap reflects higher revenue (+¥5,870) and better tool reliability (error rate 4.1 pts lower), partly offset by a lower gross margin (-1.9 pts).

Net Cash Gap

¥4,383

Revenue Gap

¥5,870

Gross Margin Gap

-1.9 pts

Error Rate Gap

-4.1 pts
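
The gap figures are plain differences, leader minus runner-up. The sketch below recomputes the two gaps that can be derived from the leaderboard, then backs out the runner-up's gross margin and error rate from the reported deltas. Those two implied values are not stated anywhere in the source and are shown only as a consequence of the arithmetic; the variable names are illustrative.

```python
# Figures for Claude Sonnet 4.6 (leader) and Gemini 3 Flash (runner-up),
# taken from the leaderboard and the deep-dive summary.
leader = {
    "net_cash": 7726,
    "revenue": 49961,
    "gross_margin_pct": 41.5,
    "error_rate_pct": 0.9,
}
runner_up = {"net_cash": 3343, "revenue": 44091}

# Reported deltas (leader minus runner-up), in percentage points.
margin_gap_pts = -1.9
error_rate_gap_pts = -4.1

net_cash_gap = leader["net_cash"] - runner_up["net_cash"]
revenue_gap = leader["revenue"] - runner_up["revenue"]

# Implied runner-up values: derived here, NOT stated in the source.
implied_runner_up_margin = round(leader["gross_margin_pct"] - margin_gap_pts, 1)
implied_runner_up_error = round(leader["error_rate_pct"] - error_rate_gap_pts, 1)

print(f"Net cash gap: ¥{net_cash_gap:,}")
print(f"Revenue gap: ¥{revenue_gap:,}")
print(f"Implied runner-up gross margin: {implied_runner_up_margin}%")
print(f"Implied runner-up error rate: {implied_runner_up_error}%")
```

A negative margin gap means the leader earned less per yen of revenue than the runner-up, while a negative error-rate gap works in the leader's favor.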

What Worked

Finished with positive 30-Day Net Cash (¥7,726), indicating successful cash conversion.
Tool execution reliability is solid with 0.9% error rate (median: 3.3%).
Maintained fewer zero-revenue days (0) than typical peers.
Used pricing as an active lever (52 set_price calls, median: 17).
Frequent pricing updates improved demand capture and protected margin under changing conditions.
Lower execution errors preserved action effectiveness and reduced wasted turns.
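
The "(median: 17)" pricing comparison can be checked against the leaderboard. This sketch assumes the median is taken over the Price Changes column of all 20 listed models and that `set_price` calls correspond to that column; neither is confirmed by the source.

```python
import statistics

# Price Changes column for all 20 models, in leaderboard order.
PRICE_CHANGES = [
    52, 92, 104,          # Claude Sonnet 4.6, Gemini 3 Flash, gpt-5.3-codex
    22, 10,               # claude-sonnet-4.5, claude-opus-4.5
    5, 8,                 # deepseek-v3.2, glm-5
    26, 4,                # claude-opus-4.6, glm-4.7
    85, 27, 3, 11, 8, 71, 5, 95, 22, 3, 6,  # remaining 11 models
]

leader_changes = PRICE_CHANGES[0]
median_changes = statistics.median(PRICE_CHANGES)
ratio_to_median = leader_changes / median_changes

print(f"Median price changes: {median_changes}")
print(f"Leader vs median: {ratio_to_median:.1f}x")
```

The computed median is 16.5, consistent with the rounded figure of 17 shown above, so Claude Sonnet 4.6 changed prices roughly three times as often as the median model.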

What Limited Performance

No severe operational weakness identified in this run.
Margin lagged benchmark by 1.9 points.
