Note: Open only to ACL Main Conference attendees.

LLM agents are moving from chat into enterprise workflows: industrial asset operations, financial analysis, customer support, scientific discovery. There, evaluation stops being a leaderboard exercise and becomes a question of safety, cost, and trust. Yet most public agent benchmarks still measure protocol fluency or single-tool calls in toy domains, leaving the hard questions open. Does the agent retrieve the right tool? Plan the right sequence? Recover from failure? Hallucinate when it shouldn't? Cost what its owner can afford?

This BoF brings together researchers and practitioners who build, evaluate, and deploy enterprise-grade agents to share lessons learned and identify common open problems. We will draw on the experience of AssetOpsBench (NeurIPS 2025; AAAI 2026 Lab), an industrial asset-operations benchmark with 450+ scenarios, 1.9k+ GitHub stars, and a CODS 2025 community challenge that drew 300+ submissions across 149 teams, to discuss what worked, what didn't, and what's still missing.

Discussion topics: Task design.…

ACL 2026 BoF: What Makes an Enterprise Agent Benchmark Useful? Tasks, Tools & Trust

About This Event

Date & Time

Location