ACL 2026 BoF: What Makes an Enterprise Agent Benchmark Useful? Tasks, Tools & Trust

About This Event

Note: Open only to ACL Main Conference attendees.

LLM agents are moving from chat into enterprise workflows: industrial asset operations, financial analysis, customer support, scientific discovery. There, evaluation stops being a leaderboard exercise and becomes a question of safety, cost, and trust. Yet most public agent benchmarks still measure protocol fluency or single-tool calls in toy domains, leaving the hard questions open. Does the agent retrieve the right tool? Plan the right sequence? Recover from failure? Hallucinate when it shouldn't? Cost what its owner can afford?

This BoF brings together researchers and practitioners who build, evaluate, and deploy enterprise-grade agents to share lessons learned and identify common open problems. We will draw on the experience of AssetOpsBench (NeurIPS 2025; AAAI 2026 Lab), an industrial asset-operations benchmark with 450+ scenarios, 1.9k+ GitHub stars, and a CODS 2025 community challenge that drew 300+ submissions across 149 teams, to discuss what worked, what didn't, and what's still missing.

Discussion topics: Task design.

See the rest of the description and register on Luma.

Date & Time

Tuesday, July 7, 2026

11:00 AM - 12:30 PM

Location

Manchester Grand Hyatt San Diego, 1 Market Pl, San Diego, CA 92101, USA