TL;DR

  • Each model runs one 30-day convenience-store simulation in the same environment.
  • Primary ranking metric is 30-Day Net Cash, not gross margin.
  • Focus is long-horizon execution quality, not one-turn QA quality.

How one run works

Morning brief → tool decisions → day settlement → repeat for 30 days → final score
  1. Read daily context: weather, events, inventory, cash, staff, and pending orders.
  2. Use tools to decide purchases, pricing, promotions, hours, staffing, and strategy actions.
  3. When the model stops making tool calls, that day is settled automatically.
  4. Repeat for 30 simulated days.
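
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: ToyStore, OneCheckAgent, and every method name on them (morning_brief, execute, settle_day, start_day, next_tool_call, observe) are hypothetical stand-ins. Only the tool name check_inventory and the 30-day horizon come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ToyStore:
    """Hypothetical stand-in for the simulation environment."""
    cash: float = 1000.0
    days_settled: int = 0
    log: list = field(default_factory=list)

    def morning_brief(self, day: int) -> dict:
        # The real brief also covers weather, events, inventory, staff, orders.
        return {"day": day, "cash": self.cash}

    def execute(self, tool: str, args: dict) -> dict:
        # Record the call; a real environment would apply its effects.
        self.log.append((tool, args))
        return {"ok": True}

    def settle_day(self) -> None:
        self.days_settled += 1


class OneCheckAgent:
    """Toy agent: calls check_inventory once per day, then stops."""

    def __init__(self) -> None:
        self.called_today = False

    def start_day(self, brief: dict) -> None:
        self.called_today = False

    def next_tool_call(self) -> Optional[Tuple[str, dict]]:
        if self.called_today:
            return None  # no more tool calls: the day can be settled
        self.called_today = True
        return ("check_inventory", {})

    def observe(self, result: dict) -> None:
        pass


def run_simulation(env, agent, days: int = 30):
    for day in range(1, days + 1):
        agent.start_day(env.morning_brief(day))           # step 1
        while (call := agent.next_tool_call()) is not None:
            tool, args = call
            agent.observe(env.execute(tool, args))        # step 2
        env.settle_day()                                  # step 3
    return env                                            # step 4 is the loop


if __name__ == "__main__":
    store = run_simulation(ToyStore(), OneCheckAgent())
    print(store.days_settled, len(store.log))
```

The point of the sketch is the stop condition: settlement is triggered by the absence of a further tool call, not by an explicit "end day" action.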

Available tools (by category)

Information

  • check_inventory: View stock, prices, cost basis, and expiry pressure.
  • view_financials: Check cash, P&L, loans, and inventory value snapshot.
  • check_market_trends: Inspect product demand and market direction signals.
  • view_customer_feedback: Read satisfaction, reputation, and recent customer comments.
  • view_competitors: See competitor pricing and promotion posture.
  • check_weather_forecast: Get upcoming weather that affects foot traffic and demand.
  • view_employee_status: Check staff morale, skill, shifts, and wage details.
  • view_suppliers: Review supplier costs, lead time, reliability, and minimum order.
  • view_pending_orders: Track in-transit purchase orders and ETA.
  • estimate_order: Dry-run order cost and minimum check before placing it.
  • view_sales_history: Review recent sell-through, revenue, and stockout patterns.

Operations

  • purchase_goods: Place purchase orders; cash is deducted immediately.
  • set_price: Change selling price for a product.
  • run_promotion: Apply temporary discount campaigns.
  • adjust_store_hours: Change opening hours to trade traffic for operating cost.
  • dispose_goods: Discard inventory to manage expiry and quality risk.

Personnel

  • hire_employee: Add staff capacity at additional wage cost.
  • fire_employee: Remove an employee from the roster.
  • assign_shift: Set staff shift schedule for daily operations.

Finance

  • take_loan: Borrow cash with daily interest.
  • repay_loan: Repay outstanding debt to reduce interest burden.
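
Because the score formula under "Ranking metric and definitions" subtracts outstanding loans, borrowing does not raise the score by itself; only the interest shows up, as a cost. The sketch below works through that arithmetic. The interest model (simple daily interest on the principal, paid from cash) and all numbers are assumptions made for illustration; the source only says that loans carry daily interest.

```python
def score(final_cash: float, starting_cash: float, outstanding_loans: float) -> float:
    # Score formula as stated in the document.
    return final_cash - starting_cash - outstanding_loans


def hold_loan(cash: float, principal: float, daily_rate: float, days: int):
    """Borrow `principal`, pay simple daily interest from cash, never repay.

    Assumption: interest = principal * daily_rate per day, deducted from cash.
    The benchmark's real accrual rule may differ.
    """
    cash += principal
    for _ in range(days):
        cash -= principal * daily_rate
    return cash, principal  # (final cash, outstanding loan)


if __name__ == "__main__":
    starting_cash = 10_000.0
    # Hypothetical terms: borrow 2,000 at 0.1% per day and hold it for 10 days.
    final_cash, outstanding = hold_loan(starting_cash, 2_000.0, 0.001, 10)
    print(score(final_cash, starting_cash, outstanding))
```

Under these assumed terms, an idle loan moves the score by exactly the interest paid, so a loan only helps if the borrowed cash earns more than it costs.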

Strategy

  • negotiate_supplier: Negotiate supplier terms for better unit costs.
  • upgrade_store: Buy permanent store upgrades with one-time investment.
  • launch_marketing: Run marketing campaigns to boost traffic and reputation.

Ranking metric and definitions

Score = final cash - starting cash - outstanding loans
  • 30-Day Net Cash: primary leaderboard metric; equal to the Score above.
  • Daily Net Profit: revenue - COGS - wages - rent - interest - marketing - other costs.
  • Daily Gross Profit: revenue - COGS (before wages, rent, interest, marketing, and other costs).
  • Gross Margin: diagnostic ratio, not the final ranking metric.
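
The definitions above translate directly into code. In this sketch the DayLedger type and its field names are hypothetical, and the gross margin formula (gross profit divided by revenue) is the conventional definition, assumed here because the document names the ratio without spelling it out. The ledger figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DayLedger:
    """Hypothetical container for one day's figures."""
    revenue: float
    cogs: float
    wages: float = 0.0
    rent: float = 0.0
    interest: float = 0.0
    marketing: float = 0.0
    other: float = 0.0


def daily_gross_profit(d: DayLedger) -> float:
    return d.revenue - d.cogs


def daily_net_profit(d: DayLedger) -> float:
    return (daily_gross_profit(d)
            - d.wages - d.rent - d.interest - d.marketing - d.other)


def gross_margin(d: DayLedger) -> float:
    # Assumption: conventional definition, gross profit / revenue.
    return daily_gross_profit(d) / d.revenue if d.revenue else 0.0


def net_cash_score(final_cash: float, starting_cash: float,
                   outstanding_loans: float) -> float:
    # Score = final cash - starting cash - outstanding loans
    return final_cash - starting_cash - outstanding_loans


if __name__ == "__main__":
    day = DayLedger(revenue=500.0, cogs=300.0, wages=80.0, rent=50.0,
                    interest=5.0, marketing=10.0, other=5.0)
    print(daily_gross_profit(day), daily_net_profit(day), gross_margin(day))
    print(net_cash_score(12_000.0, 10_000.0, 1_500.0))
```

A store can show a healthy gross margin and still post a negative score once wages, rent, and interest are paid, which is why the ratio is diagnostic only.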

Limitations and Gaps

  • Current benchmark window is fixed at 30 days.
  • End-of-run inventory is excluded from the final score in this setup, so cash spent on stock that remains unsold counts fully against the score.
  • Each model is scored on a single run; that is informative, but repeated runs are recommended for robust comparisons.
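
One simple way to act on the last point is to report the spread across several runs of the same model instead of a single number. The sketch below uses only the Python standard library. The function name summarize_runs and the sample scores are hypothetical placeholders, not benchmark results.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize 30-Day Net Cash across repeated runs of one model."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    print(summarize_runs([100.0, 200.0, 300.0]))
```

When two models' means differ by less than their run-to-run spread, a single-run ranking between them should be read with caution.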