About This Event

Can an AI rebuild a paper from scratch (code, experiments, results) with nothing but the PDF? PaperBench is the test. Let's read it closely and argue about what it really measures.

- PaperBench: Evaluating AI's Ability to Replicate AI Research (OpenAI)
- MLE-bench: Evaluating ML Agents on Machine Learning Engineering (OpenAI)
- RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents (METR)
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (Sakana AI)
- Can LLMs Generate Novel Research Ideas? (Si, Yang & Hashimoto)

Bring the paper that changed your mind about what these systems can do. RSVP to unlock the address.

See the rest of the description and register on Luma.

PaperBench: Evaluating AI’s Ability to Replicate AI Research
