What is ShopBench

ShopBench is not a one-shot QA test. It is a continuous operating environment in which inventory, cash, customer flow, weather, staff status, and events change from day to day.

It is designed to measure whether an agent can run a business over time, not just answer a single prompt well.

Model Task (What happens each day)

  1. Read the morning brief: weather, events, inventory, cash, staff, and in-transit orders.
  2. Use tools to make decisions: inspect data, restock, set prices, run promotions, assign shifts, launch marketing, etc.
  3. End the turn: when the model responds without tool calls, the day is settled automatically.
  4. Repeat for 30 days: this forms one complete benchmark run.
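
The daily loop above can be sketched as follows. This is a minimal illustration, not ShopBench's actual harness: the names Model, ToolCall, runEpisode, and check_inventory are hypothetical, and the maxStepsPerDay safety cap is an assumption that does not appear in the description above.

```typescript
// Hypothetical types: none of these names come from ShopBench itself.
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelReply = { text: string; toolCalls: ToolCall[] };
type Model = (day: number, transcript: string[]) => ModelReply;

interface DayLog {
  day: number;
  toolCalls: number;
}

// maxStepsPerDay is an assumed guard against a model that never ends its turn.
function runEpisode(model: Model, days = 30, maxStepsPerDay = 50): DayLog[] {
  const log: DayLog[] = [];
  for (let day = 1; day <= days; day++) {
    // Step 1: the morning brief opens the day's transcript.
    const transcript: string[] = [`morning brief for day ${day}`];
    let calls = 0;
    // Step 2: keep executing tools for as long as the model requests them.
    for (let step = 0; step < maxStepsPerDay; step++) {
      const reply = model(day, transcript);
      // Step 3: a reply without tool calls ends the turn.
      if (reply.toolCalls.length === 0) break;
      for (const call of reply.toolCalls) {
        transcript.push(`result of ${call.name}`); // stand-in for a real tool result
        calls++;
      }
    }
    // The day is settled here; step 4 is the enclosing loop over days.
    log.push({ day, toolCalls: calls });
  }
  return log;
}

// Stub model: checks inventory once each day, then ends the turn.
const stubModel: Model = (_day, transcript) =>
  transcript.length === 1
    ? { text: "", toolCalls: [{ name: "check_inventory", args: {} }] }
    : { text: "done", toolCalls: [] };

const episode = runEpisode(stubModel);
console.log(episode.length, episode[0]);
```

The point of the sketch is the termination rule: the harness, not the model, decides when a day is over, and it does so by observing a reply that contains no tool calls.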

Available Decision Actions

  • Information: inventory, financials, sales history, suppliers, weather, competitors, employee status.
  • Operations: purchase goods, set prices, run promotions, adjust hours, dispose of inventory.
  • Personnel: hire, fire, assign shifts.
  • Finance: take a loan, repay a loan.
  • Strategy: negotiate with suppliers, upgrade the store, launch marketing.

How Scoring Works (Core)

Score = final cash - starting cash - outstanding loans

In the current setup, end-of-run inventory is not counted toward the final score, so unsold stock contributes nothing. The benchmark therefore rewards only cash that is actually recovered within the fixed 30-day window.
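
A worked example of the formula may help. The sketch below applies the score definition to two made-up runs; the RunResult shape, the field names, and all figures are illustrative assumptions, not ShopBench output.

```typescript
// Hypothetical result shape; field names are illustrative only.
interface RunResult {
  startingCash: number;
  finalCash: number;
  outstandingLoans: number;
  endInventoryValue: number; // recorded here only to show that it is ignored
}

// Score = final cash - starting cash - outstanding loans
function score(r: RunResult): number {
  return r.finalCash - r.startingCash - r.outstandingLoans;
}

// A run that sold through its stock and carries no debt.
const leanRun: RunResult = {
  startingCash: 10000,
  finalCash: 12000,
  outstandingLoans: 0,
  endInventoryValue: 0,
};

// A run that ends with 4000 of unsold stock and an unpaid loan.
const stockedRun: RunResult = {
  startingCash: 10000,
  finalCash: 10500,
  outstandingLoans: 1500,
  endInventoryValue: 4000,
};

console.log(score(leanRun)); // 2000
console.log(score(stockedRun)); // -1000
```

The second run holds assets worth more than the first in accounting terms, yet scores lower, because neither the unsold stock nor borrowed cash counts in its favor.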

Three terms to distinguish:
  • Revenue scaling ability: how much sales volume the model can generate.
  • Accounting gross profit: revenue minus the cost of goods sold (COGS), counting only the items actually sold.
  • Cash conversion ability: whether spending is converted back into cash in time.
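
The gap between the three terms shows up in simple arithmetic. The sketch below uses a single made-up purchase and sale; the Trade shape and the numbers are assumptions chosen for illustration.

```typescript
// Hypothetical single-product trade; all figures are illustrative.
interface Trade {
  unitsBought: number;
  unitCost: number;
  unitsSold: number;
  unitPrice: number;
}

function summarize(t: Trade) {
  const revenue = t.unitsSold * t.unitPrice; // revenue scaling
  const cogs = t.unitsSold * t.unitCost; // cost of the sold units only
  const grossProfit = revenue - cogs; // accounting gross profit
  const netCashChange = revenue - t.unitsBought * t.unitCost; // cash conversion
  return { revenue, grossProfit, netCashChange };
}

// Buy 100 units at 5, sell only 60 of them at 8.
const overstocked = summarize({ unitsBought: 100, unitCost: 5, unitsSold: 60, unitPrice: 8 });
console.log(overstocked); // { revenue: 480, grossProfit: 180, netCashChange: -20 }
```

Here the books show a gross profit of 180, but the cash position is down by 20 because 40 units are still on the shelf. Under the scoring rule, only the cash figure matters.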

What Capabilities It Evaluates

  • Execution reliability: valid and consistent tool use with low error rate.
  • Inventory and cash control: avoid overstocking and cash lock-up.
  • Profit quality: improve net outcomes, not just top-line revenue.
  • Stability over time: maintain robust decisions across many sequential days.
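
For execution reliability, one plausible way to quantify "low error rate" is the share of tool calls that the environment rejects. The sketch below assumes a per-call log with an ok flag; the ToolCallRecord shape, the tool names, and this definition of the metric are assumptions, since the exact measurement is not specified here.

```typescript
// Hypothetical per-call log entry; not ShopBench's actual log format.
interface ToolCallRecord {
  day: number;
  tool: string;
  ok: boolean; // false when the call was rejected as invalid
}

// Fraction of tool calls that failed; 0 when no calls were made.
function toolErrorRate(records: ToolCallRecord[]): number {
  if (records.length === 0) return 0;
  const failed = records.filter((r) => !r.ok).length;
  return failed / records.length;
}

const sampleCalls: ToolCallRecord[] = [
  { day: 1, tool: "purchase_goods", ok: true },
  { day: 1, tool: "set_price", ok: false },
  { day: 2, tool: "set_price", ok: true },
  { day: 2, tool: "assign_shift", ok: true },
];

console.log(toolErrorRate(sampleCalls)); // 0.25
```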

How to Reproduce Quickly

1) Run one benchmark simulation:

pnpm run:bench -- --model openai/gpt-5.3-codex --api-key <YOUR_KEY>

2) Start the web dashboard:

pnpm --filter @shopbench/web dev

Then open the dashboard in a browser to view the Leaderboard, Compare, Report, and Replay pages.