Food and Research :)

Paper: Pre-Training Under Infinite Compute (Kim et al., 2025)
https://arxiv.org/abs/2509.14786

TL;DR: Compute is growing 4x every year while data grows only 1.03x. So how do we train a model when we are constrained by data but unconstrained by compute? The authors limit themselves to a 200M-token corpus and try to train the best model they can. They find that 1) scaling epochs or parameters leads to overfitting, 2) heavy regularization (30x more weight decay than standard practice) is optimal, and 3) ensembling models and averaging their logits (or distilling into one model) is an efficient scaling mechanism.

Some seed questions to think about:
Can we say synthetic data generation also scales with compute? If so, does the paper's premise collapse?
What would a "post-training under infinite compute" paper look like? What experiments would you run?
On top of data and compute, RL is also constrained by feedback/verification. Does that change anything?
This paper finds that both distilling from an ensemble of trained models and self-distillation help. Does this "free lunch" imply that…

Reading Group
