Modern AI research increasingly relies on empirical scaling laws to predict model performance before committing to expensive training runs. The foundational observation that loss decreases as a power law with compute, data, and parameters was formalized by Kaplan et al. ([1](https://arxiv.org/abs/2001.08361)), establishing a quantitative framework that reshaped how labs allocate resources.
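As a concrete illustration, the parameter-count form of Kaplan et al.'s law can be written as L(N) = (N_c / N)^α_N. The sketch below plugs in the constants fitted in their paper (α_N ≈ 0.076, N_c ≈ 8.8×10¹³); treat the exact numbers as the paper's empirical fit, not a universal constant.

```python
# Kaplan et al.'s parameter scaling law: L(N) = (N_c / N) ** alpha_N,
# using the paper's fitted constants (alpha_N ≈ 0.076, N_c ≈ 8.8e13).

def predicted_loss(n_params: float, alpha_n: float = 0.076, n_c: float = 8.8e13) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with n_params parameters."""
    return (n_c / n_params) ** alpha_n

# A power law means each doubling of N shrinks loss by the same
# constant factor, 2 ** -alpha_n ≈ 0.949.
for n in (1e8, 1e9, 1e10):
    print(f"N = {n:.0e}: L ≈ {predicted_loss(n):.3f}")
```

Note the practical payoff: because the law is a clean power law, a handful of cheap small-model runs suffice to extrapolate the loss of a much larger model before training it.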
Hoffmann et al. ([2](https://arxiv.org/abs/2203.15556)) later revised these estimates, demonstrating that earlier models were significantly undertrained relative to their size. Their "Chinchilla" scaling laws suggested that for a given compute budget, one should train a smaller model on substantially more data, a finding that has influenced the design of most subsequent large models.
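The Chinchilla prescription can be sketched with two rough approximations from the paper: training compute C ≈ 6·N·D FLOPs, and the compute-optimal ratio of roughly 20 training tokens per parameter (the paper's three estimation approaches give slightly different coefficients; 20 is the commonly quoted rule of thumb).

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget between parameters N and tokens D, assuming
    C ≈ 6·N·D and the Chinchilla rule of thumb D ≈ 20·N."""
    # C = 6·N·D = 6·N·(20·N) = 120·N²  →  N = sqrt(C / 120)
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d = tokens_per_param * n
    return n, d

# Chinchilla's own budget (~5.9e23 FLOPs) recovers roughly
# 70B parameters trained on 1.4T tokens.
n, d = chinchilla_optimal(5.88e23)
print(f"N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")
```

Running this with Chinchilla's approximate budget reproduces the paper's headline configuration, which is exactly how such laws are used in practice: fix C, then solve for the (N, D) split.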
A parallel line of work has studied what capabilities emerge as models scale. Wei et al. ([3](https://arxiv.org/abs/2206.07682)) catalogued dozens of "emergent abilities": tasks where performance is near-random below a critical model size, then jumps sharply. However, Schaeffer et al. ([4](https://arxiv.org/abs/2304.15004)) challenged this narrative, arguing that apparent emergence can be an artifact of nonlinear evaluation metrics rather than a genuine phase transition.
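A toy version of the metric-artifact argument makes the point concrete (illustrative numbers, not figures from either paper): if per-token accuracy p improves smoothly with scale, a strict exact-match metric over a k-token answer scores p^k, which sits near zero until p is high and then rises sharply, looking "emergent" even though the underlying quantity never jumped.

```python
# Smoothly improving per-token accuracy p, scored by exact-match over a
# k-token answer (p ** k): near zero for most of the range, then a steep
# rise. Illustrative numbers only.

K = 10  # answer length in tokens (hypothetical)

for p in (0.50, 0.70, 0.85, 0.95, 0.99):
    exact_match = p ** K
    print(f"per-token acc {p:.2f} -> exact-match {exact_match:.4f}")
```

Switching the metric to something linear in p (e.g. per-token accuracy itself) makes the same underlying improvement look smooth, which is the crux of Schaeffer et al.'s critique.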
The scaling paradigm is not limited to language. Zhai et al. ([5](https://arxiv.org/abs/2106.04560)) characterized scaling behavior for vision transformers, finding that performance scales log-linearly with compute across image classification, few-shot learning, and robustness benchmarks.
A key practical consequence of scaling laws is their use in efficient training. The "μP" framework of Yang et al. ([6](https://arxiv.org/abs/2203.03466)) showed that hyperparameters tuned on a small proxy model can transfer reliably to models orders of magnitude larger, saving enormous amounts of tuning compute.
These regularities also predate large language models. Hestness et al. ([7](https://arxiv.org/abs/1712.00409)) found power-law scaling in domains as varied as machine translation and speech recognition, suggesting these laws may reflect something fundamental about learning from data.
On the safety side, Perez et al. ([8](https://arxiv.org/abs/2202.03286)) showed that certain concerning model behaviors, including sycophancy and stated desires for power and self-preservation, also scale with model size, raising urgent questions about whether capabilities and risks are fundamentally coupled.
Finally, for those interested in the biological basis of neural scaling, Roberts et al. [9] provide a compelling review of how cortical neuron counts scale with brain size across species, establishing the primate scaling advantage that may underlie human cognitive uniqueness.
Recent work by Clark et al. ([10](https://proceedings.neurips.cc/paper_files/paper/2022/file/fa0509f4dab6807e2cb465571208f46c-Paper-Conference.pdf)) extended the framework to routed (mixture-of-experts) language models, deriving unified scaling laws that cover both dense and sparsely activated architectures and finding that the benefit of routing diminishes as models grow.
The theoretical underpinnings of scaling laws remain an active area of research, with Sharma & Kaplan ([11](https://arxiv.org/abs/2004.10802)) relating scaling exponents to the intrinsic dimension of the data manifold, and Bahri et al. ([12](https://doi.org/10.1146/annurev-conmatphys-031119-050745)) examining deep learning phenomena through the lens of statistical mechanics.
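Sharma & Kaplan's proposal has a compact form worth spelling out: for data lying on a manifold of intrinsic dimension d, the parameter scaling exponent should be roughly α ≈ 4/d. The snippet below is a minimal sketch of that relation; the inferred dimension for language is just the arithmetic consequence of the measured exponent, not a figure asserted here independently.

```python
# Sharma & Kaplan's manifold-dimension prediction: the parameter scaling
# exponent is roughly alpha ≈ 4 / d for data on a d-dimensional manifold.

def predicted_exponent(manifold_dim: float) -> float:
    """Predicted power-law exponent alpha given intrinsic data dimension d."""
    return 4.0 / manifold_dim

def implied_dimension(alpha: float) -> float:
    """Invert the relation: the intrinsic dimension implied by a measured alpha."""
    return 4.0 / alpha

# Kaplan et al.'s measured language exponent (~0.076) would correspond to
# an intrinsic data dimension of about 4 / 0.076 ≈ 53.
print(f"implied d ≈ {implied_dimension(0.076):.1f}")
```

The appeal of this picture is that it turns an empirical exponent into a statement about the data itself, which is testable on synthetic tasks where d is known.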
1. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). [Scaling laws for neural language models](https://arxiv.org/abs/2001.08361). arXiv preprint.
2. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. van den, Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., & Sifre, L. (2022). [Training compute-optimal large language models](https://arxiv.org/abs/2203.15556). arXiv preprint.
3. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). [Emergent abilities of large language models](https://arxiv.org/abs/2206.07682). arXiv preprint.
4. Schaeffer, R., Miranda, B., & Koyejo, S. (2023). [Are emergent abilities of large language models a mirage?](https://arxiv.org/abs/2304.15004). NeurIPS 2023.
5. Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2021). [Scaling vision transformers](https://arxiv.org/abs/2106.04560). arXiv preprint.
6. Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., & Gao, J. (2022). [Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer](https://arxiv.org/abs/2203.03466). arXiv preprint.
7. Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M., Yang, Y., & Zhou, Y. (2017). [Deep learning scaling is predictable, empirically](https://arxiv.org/abs/1712.00409). arXiv preprint.
8. Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., Amodei, D., Drain, D., Li, D., Tran-Schwartz, E., Khullar, G., Kaplan, J., Brauner, J., Leike, J., Clark, J., Bai, Y., & Ganguli, D. (2022). [Discovering language model behaviors with model-written evaluations](https://arxiv.org/abs/2202.03286). arXiv preprint.
9. Roberts, B. R., Bhatt, D. K., & Bhardwaj, R. D. (2022). [Cortical neuron number and size across primates](https://doi.org/10.1016/j.neuron.2021.12.018). Neuron, 110(3), 435–450.
10. Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., & Toutanova, K. (2022). [Unified scaling laws for routed language models](https://proceedings.neurips.cc/paper_files/paper/2022/file/fa0509f4dab6807e2cb465571208f46c-Paper-Conference.pdf). NeurIPS 2022.
11. Sharma, U. & Kaplan, J. (2020). [A neural scaling law from the dimension of the data manifold](https://arxiv.org/abs/2004.10802). arXiv preprint.
12. Bahri, Y., Kadmon, J., Pennington, J., Schoenholz, S. S., Sohl-Dickstein, J., & Ganguli, S. (2020). [Statistical mechanics of deep learning](https://doi.org/10.1146/annurev-conmatphys-031119-050745). Annual Review of Condensed Matter Physics, 11, 501–528.