
Reading Group
About This Event
Food and Research :)

Paper: Pre-Training Under Infinite Compute (Kim et al., 2025) https://arxiv.org/abs/2509.14786

TL;DR: Compute is growing 4x every year while data grows 1.03x. So, how do we train a model when constrained by data and unconstrained by compute? The authors limit themselves to a 200M token corpus and try to train the best model they can. They find 1) scaling epochs or parameters leads to overfitting, 2) heavy regularization (30x more weight decay than standard practice) is optimal, 3) ensembling models and averaging logits (or distilling into one model) is an efficient scaling mechanism.

Some seed questions to think about:
Can we say synthetic data generation also scales with compute? If so, does the paper's premise collapse?
What would a "post-training under infinite compute" paper look like? What experiments would you run?
On top of data and compute, RL is also constrained by feedback/verification. Does that change anything?
This paper finds that both distilling from an ensemble of trained models and self-distillation help. Does this "free lunch" imply that…
See the rest of the description and register on Luma.
Date & Time
Sunday, June 21, 2026
8:00 PM - 9:30 PM
Location
500 Washington St, San Francisco, CA 94111, USA