Energy and Policy Considerations for Deep Learning in NLP

This paper [1] quantifies the financial and environmental costs (CO2 emissions) of training deep networks, and draws attention to the inequality between academia and industry in terms of computational resources. Although the paper presents its case using NLP architectures, the issues it discusses are equally relevant to the computer vision community.

The paper compares the CO2 emitted by familiar activities (e.g., the lifetime emissions of a car, including fuel) against that of training common NLP models (e.g., a transformer). Table 1 shows that training a transformer network can emit significantly more CO2 than a car does over its entire lifetime.

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.
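As I understand the paper's methodology, emissions are estimated by converting measured hardware power draw into energy, scaling by a datacenter power usage effectiveness (PUE) of 1.58, and applying the EPA's U.S. average carbon intensity of 0.954 lbs CO2 per kWh. A minimal sketch of that conversion (the wattages in the example are illustrative placeholders, not the paper's measurements):

```python
def co2_lbs(hours, cpu_w, dram_w, gpu_w, n_gpus, pue=1.58, lbs_per_kwh=0.954):
    """Estimate training emissions (lbs CO2e) from average power draw:
    energy in kWh, scaled by datacenter PUE, times grid carbon intensity.
    The PUE (1.58) and U.S. average intensity (0.954 lbs/kWh) are the
    constants reported in the paper; wattages vary by hardware."""
    kwh = hours * (cpu_w + dram_w + n_gpus * gpu_w) / 1000.0
    return pue * kwh * lbs_per_kwh

# Illustrative numbers only: 84 hours on 8 GPUs at typical power draws.
print(round(co2_lbs(hours=84, cpu_w=100, dram_w=25, gpu_w=250, n_gpus=8), 1))
```

The formula makes clear why emissions scale linearly with both training time and GPU count, which is what drives the large gap between a single transformer and the neural-architecture-search runs in Table 2.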

Then, the paper compares both the CO2 emissions and the training cost of four network architectures: Transformer [2], ELMo [3], BERT [4], and GPT-2 [5]. While a transformer's emissions can reach five times (5x) a car's lifetime emissions, Table 2 shows that the transformer's training cost is tiny compared to recent models (e.g., GPT-2).

Table 2: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD).

The cost of training a neural network once is different from the research and development (R&D) cost: R&D can multiply these costs thousands of times, since it requires retraining to evaluate different architecture variants and hyperparameters. To quantify the R&D cost of a new model, the authors study the logs of all training runs required to develop the Linguistically-Informed Self-Attention (LISA) model [6].

Project LISA spanned 172 days (approx. 6 months). During that time, 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Job lengths ranged from a minimum of 3 minutes (indicating a crash) to a maximum of 9 days, with an average of 52 hours. Table 3 shows that while training a single model is relatively cheap, the full R&D cost of developing LISA is extremely expensive.

Table 3: Estimated cost in terms of cloud compute and electricity for training: (1) a single model (2) a single tune and (3) all models trained during R&D.
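The reported job statistics alone convey the scale. A rough back-of-the-envelope calculation (my own arithmetic, not a figure from [1]) multiplies the job count by the average job length:

```python
# Rough scale estimate from the job statistics reported for project LISA;
# this is my own back-of-the-envelope arithmetic, not a number from [1].
num_jobs = 4789        # total training jobs over the 172-day project
avg_job_hours = 52     # average job length in hours

total_hours = num_jobs * avg_job_hours      # total job-hours of compute
total_years = total_hours / (24 * 365)      # expressed in machine-years

print(f"{total_hours:,} job-hours ≈ {total_years:.0f} machine-years")
```

Roughly a quarter-million job-hours (on the order of decades of single-machine compute) for one model's R&D cycle, which is why Table 3's "all models" row dwarfs the single-model cost.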

Finally, the paper proposes three recommendations to reduce costs and improve equity in the research community:

1. Authors should report training time and sensitivity to hyperparameters.

  • The training time, computational resources, and sensitivity to hyperparameters should be reported for new models. This enables fair comparison across models, allowing consumers to accurately assess whether the required computational resources are compatible with their setting.

2. Researchers should prioritize computationally efficient hardware and algorithms.

  • Developers should seek and implement more efficient alternatives to brute-force grid search for hyperparameter tuning. For example, Bayesian hyperparameter search should be integrated into deep learning frameworks (e.g., PyTorch and TensorFlow).

3. Academic researchers need equitable access to computation resources.

  • The experiments for the LISA project [6] were developed outside academia. State-of-the-art accuracies are possible thanks to industry's access to large-scale compute.
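On the second recommendation, a minimal sketch of why moving beyond brute-force grid search reduces cost (random search stands in here for smarter methods; the objective function, hyperparameter ranges, and budgets are all hypothetical):

```python
import random

# Hypothetical validation-loss surface standing in for a full training
# run; in practice each call here would be one expensive training job.
def val_loss(lr, dropout):
    return (lr - 0.01) ** 2 + (dropout - 0.3) ** 2

# Grid search: cost grows multiplicatively with each hyperparameter.
lrs = [0.001, 0.005, 0.01, 0.05, 0.1]
dropouts = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
grid = [(lr, d) for lr in lrs for d in dropouts]        # 30 training jobs

# Random search: a fixed, smaller budget often finds a comparable
# configuration; Bayesian optimization goes further by steering each
# new sample toward regions the previous results suggest are promising.
random.seed(0)
samples = [(10 ** random.uniform(-3, -1), random.uniform(0.0, 0.5))
           for _ in range(10)]                          # 10 training jobs

best_grid = min(grid, key=lambda p: val_loss(*p))
best_rand = min(samples, key=lambda p: val_loss(*p))
print(len(grid), "grid jobs vs", len(samples), "random jobs")
```

The point is budget control: the grid's job count explodes with each added hyperparameter, while sampling-based methods fix the number of training runs up front.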

My comments:

  • The paper is short (5 pages) and well-written. It covers interconnected topics related to energy and policy for deep learning.


[1] Strubell, Emma, Ananya Ganesh, and Andrew McCallum. “Energy and policy considerations for deep learning in NLP.” arXiv preprint arXiv:1906.02243 (2019).

[2] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

[3] Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018).

[4] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[5] Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI blog 1.8 (2019): 9.

[6] Strubell, Emma, et al. “Linguistically-informed self-attention for semantic role labeling.” arXiv preprint arXiv:1804.08199 (2018).


I write reviews on computer vision papers. Writing tips are welcome.