What is ShopBench
ShopBench is not a one-shot QA test. It is a continuous operating environment in which inventory, cash, customer flow, weather, staff status, and events change from day to day.
It is designed to measure whether an agent can run a business over time, not just answer a single prompt well.
Model Task (What happens each day)
- Read the morning brief: weather, events, inventory, cash, staff, and in-transit orders.
- Use tools to make decisions: inspect data, restock, set prices, run promotions, assign shifts, launch marketing, etc.
- End the turn: when the model responds without tool calls, the day is settled automatically.
- Repeat for 30 days: this forms one complete benchmark run.
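The daily cycle above can be sketched as a simple loop. This is only an illustration of the control flow, not ShopBench's actual harness: the `ToolCall` and `ModelReply` shapes, the injected `agent`, `runTool`, `getBrief`, and `settleDay` callbacks, the `check_inventory` tool name, and the `maxStepsPerDay` cap are all hypothetical names chosen for the example.

```typescript
// Hypothetical shapes: the real brief and tool-call formats are not shown in this document.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface ModelReply {
  text: string;
  toolCalls: ToolCall[];
}

// Stand-ins for the model and the simulator, injected so the loop stays generic.
type Agent = (day: number, context: string[]) => ModelReply;
type ToolRunner = (call: ToolCall) => string;

function runBenchmark(
  agent: Agent,
  runTool: ToolRunner,
  getBrief: (day: number) => string,
  settleDay: (day: number) => void,
  totalDays: number = 30,
  maxStepsPerDay: number = 50, // assumed safety cap, not taken from the source
): number {
  let toolCallsMade = 0;
  for (let day = 1; day <= totalDays; day++) {
    // Each day starts from the morning brief.
    const context: string[] = [getBrief(day)];
    for (let step = 0; step < maxStepsPerDay; step++) {
      const reply = agent(day, context);
      // A reply without tool calls ends the turn.
      if (reply.toolCalls.length === 0) break;
      for (const call of reply.toolCalls) {
        context.push(runTool(call));
        toolCallsMade++;
      }
    }
    // The day is settled once the turn has ended.
    settleDay(day);
  }
  return toolCallsMade;
}

// Demo agent: one tool call per day, then a plain reply that ends the turn.
const settledDays: number[] = [];
const demoAgent: Agent = (_day, context) =>
  context.length === 1
    ? { text: "checking stock", toolCalls: [{ name: "check_inventory", args: {} }] }
    : { text: "done for today", toolCalls: [] };

const demoToolCalls = runBenchmark(
  demoAgent,
  (call) => `result of ${call.name}`,
  (day) => `brief for day ${day}`,
  (day) => { settledDays.push(day); },
);
console.log(demoToolCalls, settledDays.length); // 30 30
```

Passing the simulator pieces in as callbacks keeps the sketch independent of any real ShopBench API; only the loop structure (brief, tool calls, settle, repeat) comes from the description above.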
Available Decision Actions
- Information: inventory, financials, sales history, suppliers, weather, competitors, employee status.
- Operations: purchase goods, set price, run promotion, adjust hours, dispose inventory.
- Personnel: hire, fire, assign shift.
- Finance: take loan, repay loan.
- Strategy: negotiate supplier, upgrade store, launch marketing.
How Scoring Works (Core)
Score = final cash - starting cash - outstanding loans
In the current setup, end-of-run inventory is not counted toward the final score. The benchmark therefore rewards real cash conversion within the fixed 30-day window.
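As a minimal sketch of the formula, the function below computes the score from three numbers. The `RunResult` type and its field names are assumptions made for this example, and the dollar figures are invented. The `unsoldInventoryValue` field is included only to show that it does not affect the result.

```typescript
// Hypothetical record of a finished run; field names are illustrative.
interface RunResult {
  startingCash: number;
  finalCash: number;
  outstandingLoans: number;
  unsoldInventoryValue: number; // tracked here only to show that it is ignored
}

// Score = final cash - starting cash - outstanding loans
function computeScore(run: RunResult): number {
  return run.finalCash - run.startingCash - run.outstandingLoans;
}

const leanRun: RunResult = {
  startingCash: 10000,
  finalCash: 14500,
  outstandingLoans: 2000,
  unsoldInventoryValue: 0,
};

// Same cash position, but with 3000 worth of stock still on the shelves.
const overstockedRun: RunResult = { ...leanRun, unsoldInventoryValue: 3000 };

console.log(computeScore(leanRun), computeScore(overstockedRun)); // 2500 2500
```

Both runs score 2500: leftover stock earns nothing, so only inventory that has been sold for cash by day 30 helps the score.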
Three terms to distinguish:
- Revenue scaling ability: how much sales volume the model can generate.
- Accounting gross profit: revenue minus COGS of sold items.
- Cash conversion ability: whether spending is converted back into cash in time.
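A small worked example shows how the three terms can diverge. It is a simplified sketch that assumes a single product, a constant unit cost, and no other expenses such as wages or rent. The `Period` type and the numbers are made up for illustration.

```typescript
// Hypothetical single-product period; all figures are invented.
interface Period {
  unitsBought: number;
  unitCost: number;
  unitsSold: number;
  unitPrice: number;
}

function summarize(p: Period): { revenue: number; grossProfit: number; cashChange: number } {
  // Revenue: what the sold units brought in.
  const revenue = p.unitsSold * p.unitPrice;
  // Accounting gross profit: revenue minus the cost of the units actually sold.
  const grossProfit = revenue - p.unitsSold * p.unitCost;
  // Cash change: revenue minus everything spent on purchases, sold or not.
  const cashChange = revenue - p.unitsBought * p.unitCost;
  return { revenue, grossProfit, cashChange };
}

// Buy 100 units at 6 each, sell only 40 of them at 10 each.
const overstockedPeriod = summarize({ unitsBought: 100, unitCost: 6, unitsSold: 40, unitPrice: 10 });
console.log(overstockedPeriod); // { revenue: 400, grossProfit: 160, cashChange: -200 }
```

Here gross profit is positive (160) while cash has fallen by 200, because 60 units are still sitting in stock. Under the scoring rule above, only the cash figure counts.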
What Capabilities It Evaluates
- Execution reliability: valid and consistent tool use with low error rate.
- Inventory and cash control: avoid overstocking and cash lock-up.
- Profit quality: improve net outcomes, not just top-line revenue.
- Stability over time: maintain robust decisions across many sequential days.
How to Reproduce Quickly
1) Run one benchmark simulation:
pnpm run:bench -- --model openai/gpt-5.3-codex --api-key <YOUR_KEY>
2) Start the web dashboard:
pnpm --filter @shopbench/web dev
Then open the site to view Leaderboard, Compare, Report, and Replay pages.