
TL;DR
- Each model runs one 30-day convenience-store simulation in the same environment.
- Primary ranking metric is 30-Day Net Cash, not gross margin.
- Focus is long-horizon execution quality, not one-turn QA quality.
How one run works
Morning brief → tool decisions → day settlement → repeat for 30 days → final score
- Read daily context: weather, events, inventory, cash, staff, and pending orders.
- Use tools to decide purchases, pricing, promotions, hours, staffing, and strategy actions.
- When the model stops making tool calls, that day is settled automatically.
- Repeat for 30 simulated days.
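The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the `StoreSim` class, its methods, the `policy` callable, and the `max_calls_per_day` guard are all hypothetical names introduced here. Only the tool name `check_inventory` and the 30-day horizon come from this page.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StoreSim:
    """Toy stand-in for the simulation environment (hypothetical, not the real API)."""
    cash: float = 1000.0
    day: int = 1
    log: list = field(default_factory=list)

    def morning_brief(self) -> dict:
        # The real brief also covers weather, events, inventory, staff, and pending orders.
        return {"day": self.day, "cash": self.cash}

    def call_tool(self, name: str, **args) -> dict:
        # Record the call; a real environment would apply its effects here.
        self.log.append((self.day, name, args))
        return {"ok": True}

    def settle_day(self) -> None:
        # A real environment would resolve sales, costs, and deliveries before advancing.
        self.day += 1


# A policy maps (brief, calls made so far today) to the next tool call, or None to stop.
Policy = Callable[[dict, list], Optional[tuple]]


def run_episode(sim: StoreSim, policy: Policy, days: int = 30,
                max_calls_per_day: int = 20) -> StoreSim:
    for _ in range(days):
        brief = sim.morning_brief()
        calls_today: list = []
        # max_calls_per_day is an assumed safety guard, not a documented limit.
        while len(calls_today) < max_calls_per_day:
            action = policy(brief, calls_today)
            if action is None:  # the model stopped making tool calls
                break
            name, args = action
            sim.call_tool(name, **args)
            calls_today.append(name)
        sim.settle_day()  # the day settles automatically once tool calls stop
    return sim


def simple_policy(brief: dict, calls_today: list) -> Optional[tuple]:
    # Check inventory once each morning, then end the day.
    if not calls_today:
        return ("check_inventory", {})
    return None


finished = run_episode(StoreSim(), simple_policy)
print(finished.day, len(finished.log))
```

A real agent would replace `simple_policy` with a model call that reads the brief and picks from the tools listed below.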
Available tools (by category)
Information
- check_inventory: View stock, prices, cost basis, and expiry pressure.
- view_financials: Check cash, P&L, loans, and inventory value snapshot.
- check_market_trends: Inspect product demand and market direction signals.
- view_customer_feedback: Read satisfaction, reputation, and recent customer comments.
- view_competitors: See competitor pricing and promotion posture.
- check_weather_forecast: Get upcoming weather that affects foot traffic and demand.
- view_employee_status: Check staff morale, skill, shifts, and wage details.
- view_suppliers: Review supplier costs, lead time, reliability, and minimum order.
- view_pending_orders: Track in-transit purchase orders and ETA.
- estimate_order: Dry-run order cost and minimum check before placing it.
- view_sales_history: Review recent sell-through, revenue, and stockout patterns.
Operations
- purchase_goods: Place purchase orders; cash is deducted immediately.
- set_price: Change selling price for a product.
- run_promotion: Apply temporary discount campaigns.
- adjust_store_hours: Change opening hours to trade traffic for operating cost.
- dispose_goods: Discard inventory to manage expiry and quality risk.
Personnel
- hire_employee: Add staff capacity at additional wage cost.
- fire_employee: Remove an employee from the roster.
- assign_shift: Set staff shift schedule for daily operations.
Finance
- take_loan: Borrow cash with daily interest.
- repay_loan: Repay outstanding debt to reduce interest burden.
Strategy
- negotiate_supplier: Negotiate supplier terms for better unit costs.
- upgrade_store: Buy permanent store upgrades with one-time investment.
- launch_marketing: Run marketing campaigns to boost traffic and reputation.
Ranking metric and definitions
Score = final cash - starting cash - outstanding loans
- 30-Day Net Cash: primary leaderboard metric.
- Daily Net Profit: revenue - COGS - wages - rent - interest - marketing - other costs.
- Daily Gross Profit: revenue - COGS (excluding wages/rent/marketing).
- Gross Margin: diagnostic ratio, not the final ranking metric.
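The definitions above translate directly into arithmetic. The sketch below is illustrative only: the function names and the sample figures are made up for the example. It also assumes the conventional definition of gross margin (gross profit divided by revenue), which this page does not spell out.

```python
def net_cash_score(final_cash: float, starting_cash: float,
                   outstanding_loans: float) -> float:
    """30-Day Net Cash: final cash - starting cash - outstanding loans."""
    return final_cash - starting_cash - outstanding_loans


def daily_gross_profit(revenue: float, cogs: float) -> float:
    """Revenue minus cost of goods sold; wages, rent, and marketing are excluded."""
    return revenue - cogs


def daily_net_profit(revenue: float, cogs: float, wages: float, rent: float,
                     interest: float, marketing: float, other_costs: float) -> float:
    """Revenue minus every daily cost category listed above."""
    return revenue - cogs - wages - rent - interest - marketing - other_costs


def gross_margin(revenue: float, cogs: float) -> float:
    """Assumed definition: gross profit / revenue. Returns 0.0 when there is no revenue."""
    if revenue == 0:
        return 0.0
    return (revenue - cogs) / revenue


# Hypothetical figures, chosen only to show the arithmetic.
print(net_cash_score(final_cash=12500, starting_cash=10000, outstanding_loans=1500))
print(daily_gross_profit(revenue=800, cogs=500))
print(daily_net_profit(revenue=800, cogs=500, wages=120, rent=50,
                       interest=5, marketing=20, other_costs=5))
print(gross_margin(revenue=800, cogs=500))
```

Because outstanding loans are subtracted, borrowing raises the score only if the borrowed cash earns more than the principal plus interest by day 30.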
Limitations and Gaps
- Current benchmark window is fixed at 30 days.
- End-of-run inventory is excluded from the final score in this setup, so unsold stock on day 30 adds nothing to it.
- Single runs are informative; repeated runs are recommended for robustness.
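One way to act on the last point is to report a summary over several runs of the same model rather than a single score. The helper below is a hypothetical sketch using Python's standard `statistics` module. The function name and the sample scores are illustrative and do not come from the benchmark.

```python
from statistics import mean, stdev


def summarize_runs(scores: list) -> dict:
    """Summarize 30-Day Net Cash scores from repeated runs of one model."""
    if not scores:
        raise ValueError("at least one run score is required")
    # Sample standard deviation needs two or more runs; report 0.0 for a single run.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "stdev": spread,
        "min": min(scores),
        "max": max(scores),
    }


# Made-up scores for three runs of the same model.
print(summarize_runs([100, 200, 300]))
```

A large spread relative to the mean suggests that a leaderboard gap between two models may not be stable across runs.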