Insights Brief

What separates winning model operators from struggling ones

Top performers combine frequent price adjustments with disciplined purchasing and low tool-call failure rates. Low performers either under-act on pricing or spend too many calls on analysis without executing.

Models with Positive Net Cash: 17 / 85
Observable Correlation (Non-causal): Pricing ↔ Net Cash (r=0.23, R²=0.05)
Strongest Strategy Cluster: Aggressive Growth
Largest Strategy Cluster: Passive Strugglers (48)

Key Findings

across 85 models
📊
Positive Net-Cash Model Rate
20.0%
Only 17 of 85 models achieved 30-day net cash > 0 (finalScore, not gross margin).
📈
Net-Cash Median Lift (High vs Low Pricing)
¥1,045
Comparing top vs bottom 25% by pricing frequency, high-pricing models show higher median net cash; correlation is auxiliary evidence only (r=0.23, R²=0.05).
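
The quartile comparison and the correlation figure above could be reproduced along the following lines. This is a minimal sketch, not the dashboard's actual code: it assumes each model is summarized by a price-change count and a 30-day net cash value, and the function names and sample data are hypothetical.

```python
from statistics import median

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def quartile_median_lift(price_changes, net_cash):
    """Median net cash of the top 25% by pricing frequency minus the bottom 25%."""
    ranked = sorted(zip(price_changes, net_cash))  # ascending by pricing frequency
    k = max(1, len(ranked) // 4)                   # quartile size (assumed rule)
    bottom = [cash for _, cash in ranked[:k]]
    top = [cash for _, cash in ranked[-k:]]
    return median(top) - median(bottom)

# Hypothetical sample, NOT the real 85-model dataset.
changes = [3, 6, 10, 20, 29, 52, 67, 104]
cash = [-2900.0, -6400.0, -500.0, 100.0, 9200.0, 7700.0, 3200.0, 2400.0]

lift = quartile_median_lift(changes, cash)
r = pearson_r(changes, cash)

# The reported R² is simply r squared: 0.23² ≈ 0.05.
r_squared_reported = round(0.23 ** 2, 2)
```

An R² of about 0.05 means pricing frequency accounts for roughly 5% of the variance in net cash, which is why the brief treats the correlation as auxiliary evidence only.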
⚠️
Revenue ≠ Profit
¥69,594
gemini-3.1-pro-preview had the highest revenue, but gpt-5.3-codex-xhigh finished with the highest 30-day net cash (¥9,220).
⚖️
Optimal Info/Action Ratio
1.9:1
Average info-to-action ratio among profitable models. A much higher ratio suggests analysis paralysis.
💸
Largest Cash Gap
-¥15,579
gemini-3.1-pro-preview has the largest cash gap: net cash -¥6,418 vs cumulative net profit ¥9,161.

Price Changes vs Net Cash

bubble size = total revenue
Top Net Cash: gpt-5.3-codex-xhigh (¥9,220)
Top Revenue: gemini-3.1-pro-preview (¥69,594)
Most Price Changes: gpt-5.3-codex (104)

China vs US: Tool Error Comparison

Aggregated by model region: tool error rate + error composition
China Error Rate: 2.60% (804/30889)
US Error Rate: 2.45% (200/8168)
Gap (CN - US): +0.15 pp
Sample Size: CN 67 runs · US 16 runs
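
The regional error rates and the percentage-point gap follow directly from the counts above. A short sketch of that arithmetic (the helper name `error_rate_pct` is illustrative, not taken from the dashboard):

```python
def error_rate_pct(errors, calls):
    """Tool error rate as a percentage of all tool calls."""
    return 100.0 * errors / calls

cn_rate = error_rate_pct(804, 30889)   # China: errors / total tool calls
us_rate = error_rate_pct(200, 8168)    # US: errors / total tool calls

# The gap is a difference of percentages, so it is expressed in percentage points.
gap_pp = cn_rate - us_rate

print(f"CN {cn_rate:.2f}% | US {us_rate:.2f}% | gap {gap_pp:+.2f} pp")
```

Note that the China aggregate rests on roughly four times as many tool calls as the US aggregate, so a 0.15 pp gap should be read as a small difference rather than a clear regional effect.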

Error Composition (% of each region's total errors)

Use composition share instead of raw counts to compare typical failure patterns across regions.

Error Category | China (share / count) | US (share / count) | Delta (CN - US)
Set price without inventory | 19.8% / 159 | 45.5% / 91 | -25.7 pp
Purchase below minimum order | 62.1% / 499 | 50.0% / 100 | +12.1 pp
Other | 6.8% / 55 | 0.0% / 0 | +6.8 pp
Unknown tool name | 6.6% / 53 | 0.5% / 1 | +6.1 pp
Precondition not met | 4.5% / 36 | 4.0% / 8 | +0.5 pp
Invalid parameters | 0.2% / 2 | 0.0% / 0 | +0.2 pp
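
The composition shares and deltas in the table above can be recomputed from the raw counts. The sketch below uses the counts as reported; the dictionary layout and the `shares` helper are assumptions made for illustration.

```python
cn_counts = {
    "Set price without inventory": 159,
    "Purchase below minimum order": 499,
    "Other": 55,
    "Unknown tool name": 53,
    "Precondition not met": 36,
    "Invalid parameters": 2,
}
us_counts = {
    "Set price without inventory": 91,
    "Purchase below minimum order": 100,
    "Other": 0,
    "Unknown tool name": 1,
    "Precondition not met": 8,
    "Invalid parameters": 0,
}

def shares(counts):
    """Each category's share of the region's total errors, in percent (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}

cn_share = shares(cn_counts)
us_share = shares(us_counts)

# Delta in percentage points, computed on the rounded shares as displayed.
delta_pp = {name: round(cn_share[name] - us_share[name], 1) for name in cn_counts}
```

Normalizing by each region's own error total is what makes the two columns comparable even though China has about four times as many errors in absolute terms.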

Cash Realization Efficiency

How much cumulative net profit is realized as 30-day net cash
Best Cash Conversion: gpt-5.3-codex-xhigh (168.1%)
Formula: 30-day net cash / cumulative daily net profit (only when cumulative net profit > 0)
Note: Models with cumulative net profit <= 0 are excluded from conversion ranking

Model Cash Conversion Efficiency (%)

Computed only for models with positive cumulative daily net profit. Higher means better cash realization.

Cash Gap (30-day Net Cash - Cumulative Net Profit)

More negative values mean accounting profit was less effectively converted to final cash (often due to front-loaded purchases and slower cash realization).
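
Both cash-realization metrics reduce to simple arithmetic on two numbers per model. The following is a sketch under the definitions stated in this section; the function names are hypothetical, and the worked values are the gemini-3.1-pro-preview figures quoted in the Key Findings.

```python
from typing import Optional

def cash_gap(net_cash: float, cumulative_net_profit: float) -> float:
    """Cash gap = 30-day net cash minus cumulative daily net profit."""
    return net_cash - cumulative_net_profit

def cash_conversion_pct(net_cash: float, cumulative_net_profit: float) -> Optional[float]:
    """Cash conversion in percent; undefined (None) when cumulative net profit <= 0."""
    if cumulative_net_profit <= 0:
        return None  # excluded from the conversion ranking
    return 100.0 * net_cash / cumulative_net_profit

# gemini-3.1-pro-preview: net cash -¥6,418, cumulative net profit ¥9,161
gap = cash_gap(-6418, 9161)
conversion = cash_conversion_pct(-6418, 9161)

# A model with a cumulative loss has no meaningful conversion ratio.
excluded = cash_conversion_pct(1200, -300)
```

The guard on non-positive cumulative profit matters because dividing by a negative or zero denominator would produce ratios whose sign and size cannot be ranked meaningfully.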

Strategy Groups

Highest Avg Net Cash: Aggressive Growth (¥4,171)
Largest Cluster: Passive Strugglers (48 models)
Strategy Families: 5 detected groups

High-frequency price adjusters who actively restock and promote. They treat pricing as a daily optimization lever, making 50-100+ price changes across 30 days. Bold purchasing and active promotions drive high revenue and consistent profits.

Model | Net Profit | Price Changes | Purchases | Revenue | Zero-Rev Days
Claude Sonnet 4.6 | ¥7,726 | 52 | 85 | ¥49,961 | 0
Gemini 3 Flash | ¥3,343 | 92 | 104 | ¥44,091 | 0
claude-sonnet-4.6-thinking | ¥3,235 | 67 | 84 | ¥52,315 | 0
gpt-5.3-codex | ¥2,382 | 104 | 71 | ¥34,321 | 0
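
As a consistency check, the group average can be recomputed from the four rows listed above. This sketch assumes the group headline is a plain arithmetic mean over the group's members; the variable names are illustrative.

```python
from statistics import mean

# (model, net figure in ¥, price changes) as listed for this strategy group
group = [
    ("Claude Sonnet 4.6", 7726, 52),
    ("Gemini 3 Flash", 3343, 92),
    ("claude-sonnet-4.6-thinking", 3235, 67),
    ("gpt-5.3-codex", 2382, 104),
]

avg_net = mean(value for _, value, _ in group)
avg_price_changes = mean(changes for _, _, changes in group)

print(f"avg net: ¥{avg_net:,.1f} | avg price changes: {avg_price_changes:.2f}")
```

The mean of the displayed values is ¥4,171.5, which agrees with the ¥4,171 headline for Aggressive Growth up to rounding of the displayed figures.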

Model Causal Diagnostics

Per-model breakdown of margin drivers, good/bad actions, and critical days
30-Day Net Cash: ¥9,220
Gross Margin: 43.6%
Risk Score: 51

Why Margin Stays Positive

Gross margin (43.6%) comes from the spread between selling price and purchase cost: average sold-unit price is ~¥6.62, while average purchased-unit cost is ~¥2.77. It also repriced 29 times and used 12 promotions (avg 14.6% discount).

What It Did Well

  • Margin is above/near cohort median (43.6% vs 43.4%).
  • Tool error rate is relatively low (0.9%), indicating stable execution.
  • Pricing cadence is active (29 changes), enabling fast strategy adjustments.
  • Gross profit is positive (¥14,866), so unit economics are profitable.
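
The 43.6% gross margin can be cross-checked against the gross profit listed above (¥14,866) and the model's total revenue of ¥34,091 reported in its deep-dive summary. A minimal sketch, assuming gross margin is defined as gross profit divided by revenue:

```python
def gross_margin_pct(gross_profit: float, revenue: float) -> float:
    """Gross margin as a percentage of revenue."""
    if revenue <= 0:
        raise ValueError("gross margin is undefined without revenue")
    return 100.0 * gross_profit / revenue

margin = gross_margin_pct(14866, 34091)

# Implied cost of goods sold, since gross profit = revenue - COGS.
implied_cogs = 34091 - 14866

print(f"gross margin: {margin:.1f}% | implied COGS: ¥{implied_cogs:,}")
```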

What Hurt Performance

  • It had 2 zero-revenue days, breaking cash-flow continuity.
  • Info/action ratio is high (2.5:1), suggesting analysis-heavy behavior.

Critical Days Timeline

Day 6

Zero-Revenue Day

Reason: Revenue dropped to 0; Large one-day cash drawdown

Triggering Actions: purchase_goods(¥77), purchase_goods(¥87), purchase_goods(¥78), purchase_goods(¥87), purchase_goods(¥77)

Impact: Revenue ¥0.00, net profit -¥320.00, cash delta -¥830.58

Day 9

Zero-Revenue Day

Reason: Revenue dropped to 0; Large one-day cash drawdown

Triggering Actions: purchase_goods(¥101), purchase_goods(¥147), purchase_goods(¥140), purchase_goods(¥168), purchase_goods(¥168)

Impact: Revenue ¥0.00, net profit -¥320.00, cash delta -¥1,145

Day 17

Cash-Drawdown Day

Reason: Large one-day cash drawdown

Triggering Actions: purchase_goods(¥101), purchase_goods(¥151), purchase_goods(¥235), purchase_goods(¥378), purchase_goods(¥504)

Impact: Revenue ¥459.50, net profit -¥52.00, cash delta -¥1,912
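
The timeline labels above appear to follow two rules: a day with no revenue, and a day with a large cash outflow. The sketch below shows one way such flags could be derived. The drawdown threshold is an assumption chosen so that the three listed days are flagged; the dashboard's actual rule is not stated.

```python
from dataclasses import dataclass
from typing import List, Optional

DRAWDOWN_THRESHOLD = -800.0  # hypothetical cutoff in ¥, not taken from the source

@dataclass
class DayRecord:
    day: int
    revenue: float
    cash_delta: float

def reasons_for(rec: DayRecord, threshold: float = DRAWDOWN_THRESHOLD) -> List[str]:
    """Collect every reason that makes a day critical."""
    reasons = []
    if rec.revenue == 0:
        reasons.append("Revenue dropped to 0")
    if rec.cash_delta <= threshold:
        reasons.append("Large one-day cash drawdown")
    return reasons

def label_for(rec: DayRecord) -> Optional[str]:
    """Headline label; a zero-revenue day takes precedence over a drawdown day."""
    if not reasons_for(rec):
        return None
    return "Zero-Revenue Day" if rec.revenue == 0 else "Cash-Drawdown Day"

# Revenue and cash deltas as listed in the timeline.
days = [
    DayRecord(6, 0.00, -830.58),
    DayRecord(9, 0.00, -1145.00),
    DayRecord(17, 459.50, -1912.00),
]
labels = {rec.day: label_for(rec) for rec in days}
```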

Failure Case Studies

patterns to avoid
Failure Pattern
🧊
Analysis Paralysis
DeepSeek V4 Pro

Spent the majority of tool calls on information gathering rather than action. With an info-to-action ratio of 6.5:1, this model researched endlessly while the store ran out of stock.

estimate_order calls: 217
purchase_goods calls: 54
Info/Action Ratio: 6.5:1
Net Profit: -¥6,186
Failure Pattern
📉
Late-Game Collapse
qwen3.5-27b

Revenue collapsed in the final 5 days. Average daily revenue dropped from ¥711.33 (Day 1-25) to ¥297.06 (Day 26-30), a 58% decline suggesting inventory exhaustion or failed clearance.

Early Avg Revenue/Day: ¥711.33
Late Avg Revenue/Day: ¥297.06
Price Changes: 6
Net Profit: -¥6,375
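
The 58% decline quoted for this case is the relative drop between the two window averages. A sketch of that calculation follows; the helper that derives window averages from a daily series is hypothetical and shown only to make the Day 1-25 / Day 26-30 split explicit.

```python
from typing import Sequence, Tuple

def relative_decline_pct(early_avg: float, late_avg: float) -> float:
    """Percentage drop from the early average to the late average."""
    return 100.0 * (early_avg - late_avg) / early_avg

def window_averages(daily_revenue: Sequence[float], late_days: int = 5) -> Tuple[float, float]:
    """Average revenue before and during the final `late_days` days."""
    early = daily_revenue[:-late_days]
    late = daily_revenue[-late_days:]
    return sum(early) / len(early), sum(late) / len(late)

# Averages as reported for qwen3.5-27b
decline = relative_decline_pct(711.33, 297.06)

# Hypothetical 30-day series used only to exercise the windowing helper.
series = [700.0] * 25 + [300.0] * 5
early_avg, late_avg = window_averages(series)
```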
Failure Pattern
😴
The Non-Adjuster
minimax-m2.1

Made only 3 price changes across 30 days. Although it stayed busy with other tasks (335 total tool calls), this model barely used dynamic pricing, the lever that the top-performing group adjusts almost daily.

Price Changes: 3
Total Tool Calls: 335
Zero-Revenue Days: 0
Net Profit: -¥2,877

Model Deep Dive Reports

long-form per model analysis with chart evidence

vs Runner-up

gpt-5.3-codex-xhigh

Rank #1 · Style: Balanced Operators

gpt-5.3-codex-xhigh finished rank #1 with ¥9,220 in 30-Day Net Cash. It generated ¥34,091 total revenue at 43.6% gross margin, while executing 570 tool calls with a 0.9% tool call error rate.

Operating style: Balanced Operators. Moderate price adjusters with steady operational cadence. These models find a middle ground — enough price changes to stay competitive without over-optimizing. Purchases and promotions are measured and deliberate. In this run, gpt-5.3-codex-xhigh allocated 71.6% of calls to information gathering and 28.4% to execution actions, with 29 pricing updates and 63 purchase attempts.
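
The call allocation above maps directly onto the info/action ratio used throughout this brief. A small sketch of the conversion in both directions, assuming the ratio is information-gathering calls divided by execution calls (function names are illustrative):

```python
def info_action_ratio(info_share_pct: float, action_share_pct: float) -> float:
    """Info-to-action ratio from the two call shares (in percent)."""
    return info_share_pct / action_share_pct

def info_share_from_ratio(ratio: float) -> float:
    """Share of calls spent on information gathering implied by an N:1 ratio."""
    return 100.0 * ratio / (ratio + 1.0)

# gpt-5.3-codex-xhigh: 71.6% information gathering, 28.4% execution
ratio = info_action_ratio(71.6, 28.4)

# A 6.5:1 ratio, as in the Analysis Paralysis case, implies this info share.
paralysis_info_share = info_share_from_ratio(6.5)

print(f"ratio {ratio:.1f}:1 | 6.5:1 implies {paralysis_info_share:.1f}% info calls")
```

The result, about 2.5:1, matches the ratio reported in the diagnostics for this model and sits above the 1.9:1 average among profitable models.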

30-Day Net Cash: ¥9,220
Gross Margin: 43.6%
Tool Call Error Rate: 0.9%

Opening Setup

Day 1-10

The opening was unstable: despite ¥4,139 revenue, the model ended Day 1-10 at -¥1,284 net profit.

  • Core actions: 14 purchases, 0 pricing changes, 3 promotions.
  • Execution load: 185 tool calls with 0.0% phase error rate.
  • Demand continuity: 2 zero-revenue days.

Best day in full run occurred on D20 (¥1,451), worst on D6 (-¥320.00).

Mid-Run Optimization

Day 11-20

Mid-run decisions compounded positively, producing ¥3,175 profit in Days 11-20.

  • Pricing cadence shifted to 17 updates in this phase.
  • Procurement + promotion balance: 31 purchase calls and 0 promotions.
  • Tool throughput stayed at 198 calls; zero-revenue days: 0.

Gross margin at run level is 43.6%, with overall Tool Call Error Rate at 0.9%.

Endgame Execution

Day 21-30

The model closed with resilient endgame execution and ¥3,596 late-phase profit.

  • Late actions: 18 purchases, 12 pricing changes, 9 promotions.
  • Cash conversion pressure: 0 zero-revenue days in the final 10-day window.
  • Final phase execution quality: 0.0% error rate (0/187).

Run finished at ¥9,220 net cash after 30 days, versus ¥7,726 for Claude Sonnet 4.6.

Phase Decision Comparison Template

Standardized opening/mid/endgame comparison between gpt-5.3-codex-xhigh and Claude Sonnet 4.6.

Opening Phase

Day 1-10

Opening Phase: gpt-5.3-codex-xhigh trails by ¥2,751 in phase net profit.

Revenue delta -¥6,167, phase-end net cash delta ¥505.77. Action differences: -20 pricing changes, -19 purchases, +17 negotiations.

Metric | gpt-5.3-codex-xhigh | Claude Sonnet 4.6 | Delta
Phase Revenue | ¥4,139 | ¥10,306 | -¥6,167
Phase Net Profit | -¥1,284 | ¥1,466 | -¥2,751
Phase-End Net Cash | -¥1,071 | -¥1,576 | +¥505.77
Tool Error Rate | 0.0% | 2.1% | -2.1 pts
Zero-Revenue Days | 2 | 0 | +2 days
Price / Purchase / Promo | 0 / 14 / 3 | 20 / 33 / 6 | -20 / -19 / -3
Estimate / Negotiate | 41 / 19 | 56 / 2 | -15 / +17
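
Each Delta entry in the phase comparison is the first model's value minus the second's. The sketch below illustrates this with the opening-phase revenue and action counts; the `PhaseStats` type is hypothetical and covers only a subset of the metrics.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhaseStats:
    revenue: int        # phase revenue in ¥ (rounded, as displayed)
    price_changes: int
    purchases: int
    promotions: int

def phase_delta(model: PhaseStats, baseline: PhaseStats) -> PhaseStats:
    """Field-by-field difference: model minus baseline."""
    return PhaseStats(
        revenue=model.revenue - baseline.revenue,
        price_changes=model.price_changes - baseline.price_changes,
        purchases=model.purchases - baseline.purchases,
        promotions=model.promotions - baseline.promotions,
    )

# Opening phase (Day 1-10)
gpt_opening = PhaseStats(revenue=4139, price_changes=0, purchases=14, promotions=3)
claude_opening = PhaseStats(revenue=10306, price_changes=20, purchases=33, promotions=6)

opening_delta = phase_delta(gpt_opening, claude_opening)
```

Deltas recomputed from displayed values can differ from the displayed deltas by ¥1 (for example, phase net profit: -¥1,284 minus ¥1,466 gives -¥2,750 against a displayed -¥2,751), presumably because the dashboard subtracts unrounded amounts.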

Mid-Run Phase

Day 11-20

Mid-Run Phase: gpt-5.3-codex-xhigh trails by ¥1,994 in phase net profit.

Revenue delta -¥6,624, phase-end net cash delta -¥1,556. Action differences: +1 pricing change, -3 purchases, +16 negotiations.

Metric | gpt-5.3-codex-xhigh | Claude Sonnet 4.6 | Delta
Phase Revenue | ¥14,393 | ¥21,017 | -¥6,624
Phase Net Profit | ¥3,175 | ¥5,169 | -¥1,994
Phase-End Net Cash | ¥1,932 | ¥3,488 | -¥1,556
Tool Error Rate | 2.5% | 0.0% | +2.5 pts
Zero-Revenue Days | 0 | 0 | +0 days
Price / Purchase / Promo | 17 / 31 / 0 | 16 / 34 / 8 | +1 / -3 / -8
Estimate / Negotiate | 16 / 17 | 62 / 1 | -46 / +16

Endgame Phase

Day 21-30

Endgame Phase: gpt-5.3-codex-xhigh trails by ¥164.07 in phase net profit.

Revenue delta -¥3,079, phase-end net cash delta ¥1,493. Action differences: -4 pricing changes, +0 purchases, +11 negotiations.

Metric | gpt-5.3-codex-xhigh | Claude Sonnet 4.6 | Delta
Phase Revenue | ¥15,558 | ¥18,637 | -¥3,079
Phase Net Profit | ¥3,596 | ¥3,760 | -¥164.07
Phase-End Net Cash | ¥9,220 | ¥7,726 | +¥1,493
Tool Error Rate | 0.0% | 0.7% | -0.7 pts
Zero-Revenue Days | 0 | 0 | +0 days
Price / Purchase / Promo | 12 / 18 / 9 | 16 / 18 / 25 | -4 / +0 / -16
Estimate / Negotiate | 11 / 11 | 22 / 0 | -11 / +11

Evidence Board

Methodology note: Net Cash Trajectory follows the final ranking metric; Daily Net Profit (revenue - COGS - wages - rent - loan interest - marketing - other) and Daily Gross Profit (revenue - COGS) explain operating quality; Inventory Trajectory is observational and not part of final score.
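
The two profit definitions in the methodology note can be written out as a small ledger type. This is a sketch of the stated formulas only; the `DailyLedger` class and the cost split in the examples are hypothetical, chosen so the totals line up with the Day 17 and zero-revenue-day net profits shown in the Critical Days Timeline.

```python
from dataclasses import dataclass

@dataclass
class DailyLedger:
    revenue: float
    cogs: float
    wages: float = 0.0
    rent: float = 0.0
    loan_interest: float = 0.0
    marketing: float = 0.0
    other: float = 0.0

    @property
    def gross_profit(self) -> float:
        """Daily Gross Profit = revenue - COGS."""
        return self.revenue - self.cogs

    @property
    def net_profit(self) -> float:
        """Daily Net Profit = gross profit minus all operating and financing costs."""
        return (self.gross_profit - self.wages - self.rent
                - self.loan_interest - self.marketing - self.other)

# Hypothetical cost split: only revenue and the net profit total come from the source.
day_17 = DailyLedger(revenue=459.50, cogs=191.50, wages=200.0, rent=100.0, loan_interest=20.0)

# A zero-revenue day still carries the fixed costs, so net profit is negative.
zero_revenue_day = DailyLedger(revenue=0.0, cogs=0.0, wages=200.0, rent=100.0, loan_interest=20.0)
```

Purchases do not appear in either formula except through COGS on sold units, which is one reason a day's cash delta can be far more negative than its net profit.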

Delta vs Claude Sonnet 4.6

gpt-5.3-codex-xhigh finished ¥1,493 ahead of Claude Sonnet 4.6 in 30-Day Net Cash. The gap combines lower revenue (-¥15,870), a higher gross margin (2.1 pts), and slightly better tool reliability (-0.1 pts error-rate delta).

Net Cash Gap: ¥1,493
Revenue Gap: -¥15,870
Gross Margin Gap: 2.1 pts
Error Rate Gap: -0.1 pts

What Worked

  • Finished with positive 30-Day Net Cash (¥9,220), indicating successful cash conversion.
  • Gross margin (43.6%) is above or near cohort median (43.4%).
  • Tool execution reliability is solid with 0.9% error rate (median: 1.7%).
  • Used pricing as an active lever (29 set_price calls, median: 20).
  • Frequent pricing updates improved demand capture and protected margin under changing conditions.
  • Lower execution errors preserved action effectiveness and reduced wasted turns.

What Limited Performance

  • High zero-revenue exposure (2 days) indicates stockout or demand conversion issues.
  • Revenue trailed Claude Sonnet 4.6 by ¥15,870.
  • More zero-revenue days (2 vs 0) reduced compounding cash flow.