Modern AI research increasingly relies on empirical scaling laws to predict model performance before committing to expensive training runs. The foundational observation that loss decreases as a power law with compute, data, and parameters was formalized by Kaplan et al. ([1](https://arxiv.org/abs/2001.08361)), establishing a quantitative framework that reshaped how labs allocate resources.
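As a concrete illustration, the parameter-count form of Kaplan et al.'s law can be written as L(N) = (N_c / N)^α_N. The sketch below plugs in the constants fitted in their paper (α_N ≈ 0.076, N_c ≈ 8.8×10¹³); treat the exact numbers as the paper's empirical fit, not a universal constant.

```python
# Kaplan et al.'s parameter scaling law: L(N) = (N_c / N) ** alpha_N,
# using the paper's fitted constants (alpha_N ≈ 0.076, N_c ≈ 8.8e13).

def predicted_loss(n_params: float, alpha_n: float = 0.076, n_c: float = 8.8e13) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with n_params parameters."""
    return (n_c / n_params) ** alpha_n

# A power law means each doubling of N shrinks loss by the same
# constant factor, 2 ** -alpha_n ≈ 0.949.
for n in (1e8, 1e9, 1e10):
    print(f"N = {n:.0e}: L ≈ {predicted_loss(n):.3f}")
```

Note the practical payoff: because the law is a clean power law, a handful of cheap small-model runs suffice to extrapolate the loss of a much larger model before training it.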
Hoffmann et al. ([2](https://arxiv.org/abs/2203.15556)) later revised these estimates, demonstrating that earlier models were significantly undertrained relative to their size. Their "Chinchilla" scaling laws suggested that for a given compute budget, one should train a smaller model on substantially more data, a finding that has influenced the design of most subsequent large models.
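The Chinchilla prescription can be sketched with two rough approximations from the paper: training compute C ≈ 6·N·D FLOPs, and the compute-optimal ratio of roughly 20 training tokens per parameter (the paper's three estimation approaches give slightly different coefficients; 20 is the commonly quoted rule of thumb).

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget between parameters N and tokens D, assuming
    C ≈ 6·N·D and the Chinchilla rule of thumb D ≈ 20·N."""
    # C = 6·N·D = 6·N·(20·N) = 120·N²  →  N = sqrt(C / 120)
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d = tokens_per_param * n
    return n, d

# Chinchilla's own budget (~5.9e23 FLOPs) recovers roughly
# 70B parameters trained on 1.4T tokens.
n, d = chinchilla_optimal(5.88e23)
print(f"N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")
```

Running this with Chinchilla's approximate budget reproduces the paper's headline configuration, which is exactly how such laws are used in practice: fix C, then solve for the (N, D) split.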
A parallel line of work has studied what capabilities emerge as models scale. Wei et al. ([3](https://arxiv.org/abs/2206.07682)) catalogued dozens of "emergent abilities": tasks where performance is near-random below a critical model size, then jumps sharply. However, Schaeffer et al. ([4](https://arxiv.org/abs/2304.15004)) challenged this narrative, arguing that apparent emergence can be an artifact of nonlinear evaluation metrics rather than a genuine phase transition.
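A toy version of the metric-artifact argument makes the point concrete (illustrative numbers, not figures from either paper): if per-token accuracy p improves smoothly with scale, a strict exact-match metric over a k-token answer scores p^k, which sits near zero until p is high and then rises sharply, looking "emergent" even though the underlying quantity never jumped.

```python
# Smoothly improving per-token accuracy p, scored by exact-match over a
# k-token answer (p ** k): near zero for most of the range, then a steep
# rise. Illustrative numbers only.

K = 10  # answer length in tokens (hypothetical)

for p in (0.50, 0.70, 0.85, 0.95, 0.99):
    exact_match = p ** K
    print(f"per-token acc {p:.2f} -> exact-match {exact_match:.4f}")
```

Switching the metric to something linear in p (e.g. per-token accuracy itself) makes the same underlying improvement look smooth, which is the crux of Schaeffer et al.'s critique.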
The scaling paradigm is not limited to language. Zhai et al. ([5](https://arxiv.org/abs/2106.04560)) characterized scaling behavior for vision transformers, finding that performance scales log-linearly with compute across image classification, few-shot learning, and robustness benchmarks.
A key practical consequence of scaling laws is their use in efficient training. The "μP" framework of Yang et al. ([6](https://arxiv.org/abs/2203.03466)) showed that hyperparameters tuned on a small proxy model can transfer reliably to models orders of magnitude larger, saving enormous amounts of tuning compute.
These regularities also predate large language models. Hestness et al. ([7](https://arxiv.org/abs/1712.00409)) found power-law scaling in domains as varied as machine translation and speech recognition, suggesting these laws may reflect something fundamental about learning from data.
On the safety side, Perez et al. ([8](https://arxiv.org/abs/2202.03286)) showed that certain concerning model behaviors, including sycophancy and stated desires for power and self-preservation, also scale with model size, raising urgent questions about whether capabilities and risks are fundamentally coupled.
Finally, for those interested in the biological basis of neural scaling, Roberts et al. [9] provide a compelling review of how cortical neuron counts scale with brain size across species, establishing the primate scaling advantage that may underlie human cognitive uniqueness.
Recent work by Clark et al. ([10](https://proceedings.neurips.cc/paper_files/paper/2022/file/fa0509f4dab6807e2cb465571208f46c-Paper-Conference.pdf)) extended the framework to routed (mixture-of-experts) language models, deriving unified scaling laws that cover both dense and sparsely activated architectures and finding that the benefit of routing diminishes as models grow.
The theoretical underpinnings of scaling laws remain an active area of research, with Sharma & Kaplan ([11](https://arxiv.org/abs/2004.10802)) relating scaling exponents to the intrinsic dimension of the data manifold, and Bahri et al. ([12](https://doi.org/10.1146/annurev-conmatphys-031119-050745)) examining deep learning phenomena through the lens of statistical mechanics.
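Sharma & Kaplan's proposal has a compact form worth spelling out: for data lying on a manifold of intrinsic dimension d, the parameter scaling exponent should be roughly α ≈ 4/d. The snippet below is a minimal sketch of that relation; the inferred dimension for language is just the arithmetic consequence of the measured exponent, not a figure asserted here independently.

```python
# Sharma & Kaplan's manifold-dimension prediction: the parameter scaling
# exponent is roughly alpha ≈ 4 / d for data on a d-dimensional manifold.

def predicted_exponent(manifold_dim: float) -> float:
    """Predicted power-law exponent alpha given intrinsic data dimension d."""
    return 4.0 / manifold_dim

def implied_dimension(alpha: float) -> float:
    """Invert the relation: the intrinsic dimension implied by a measured alpha."""
    return 4.0 / alpha

# Kaplan et al.'s measured language exponent (~0.076) would correspond to
# an intrinsic data dimension of about 4 / 0.076 ≈ 53.
print(f"implied d ≈ {implied_dimension(0.076):.1f}")
```

The appeal of this picture is that it turns an empirical exponent into a statement about the data itself, which is testable on synthetic tasks where d is known.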
1. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). [Scaling laws for neural language models](https://arxiv.org/abs/2001.08361). arXiv preprint.
2. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. van den, Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., & Sifre, L. (2022). [Training compute-optimal large language models](https://arxiv.org/abs/2203.15556). arXiv preprint.
3. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). [Emergent abilities of large language models](https://arxiv.org/abs/2206.07682). arXiv preprint.
4. Schaeffer, R., Miranda, B., & Koyejo, S. (2023). [Are emergent abilities of large language models a mirage?](https://arxiv.org/abs/2304.15004). NeurIPS 2023.
5. Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2021). [Scaling vision transformers](https://arxiv.org/abs/2106.04560). arXiv preprint.
6. Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., & Gao, J. (2022). [Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer](https://arxiv.org/abs/2203.03466). arXiv preprint.
7. Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M., Yang, Y., & Zhou, Y. (2017). [Deep learning scaling is predictable, empirically](https://arxiv.org/abs/1712.00409). arXiv preprint.
8. Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., Amodei, D., Drain, D., Li, D., Tran-Schwartz, E., Khullar, G., Kaplan, J., Brauner, J., Leike, J., Clark, J., Bai, Y., & Ganguli, D. (2022). [Discovering language model behaviors with model-written evaluations](https://arxiv.org/abs/2202.03286). arXiv preprint.
9. Roberts, B. R., Bhatt, D. K., & Bhardwaj, R. D. (2022). [Cortical neuron number and size across primates](https://doi.org/10.1016/j.neuron.2021.12.018). Neuron, 110(3), 435–450.
10. Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., & Toutanova, K. (2022). [Unified scaling laws for routed language models](https://proceedings.neurips.cc/paper_files/paper/2022/file/fa0509f4dab6807e2cb465571208f46c-Paper-Conference.pdf). NeurIPS 2022.
11. Sharma, U. & Kaplan, J. (2020). [A neural scaling law from the dimension of the data manifold](https://arxiv.org/abs/2004.10802). arXiv preprint.
12. Bahri, Y., Kadmon, J., Pennington, J., Schoenholz, S. S., Sohl-Dickstein, J., & Ganguli, S. (2020). [Statistical mechanics of deep learning](https://doi.org/10.1146/annurev-conmatphys-031119-050745). Annual Review of Condensed Matter Physics, 11, 501–528.