Energy and Policy Considerations for Deep Learning in NLP

This paper [1] quantifies the financial and environmental costs (CO2 emissions) of training a deep network. It also draws attention to the inequality between academia and industry in terms of computational resources. The paper uses NLP architectures to make its case, yet the issues it raises are just as relevant to the computer vision community.

The paper compares the CO2 emitted by familiar activities (e.g., a car's lifetime emissions) with the CO2 emitted by training a common NLP model (e.g., a transformer). Table 1 shows that training a transformer network can emit significantly more CO2 than a fuel car does over its lifetime.

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.

Then, the paper compares both the CO2 emissions and the training cost of four network architectures: Transformer [2], ELMo [3], BERT [4], and GPT-2 [5]. While a transformer tuned with neural architecture search emits roughly five times (5x) a fuel car's lifetime emissions, Table 2 shows that the cost of training a plain transformer is tiny compared to recent models (e.g., GPT-2).

Table 2: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD).

The cost of training a single neural network is different from the research and development (R&D) cost. R&D of new models multiplies these costs thousands of times, because it requires retraining to evaluate different architecture variants and hyperparameters. To quantify the R&D cost of a new model, the authors study the logs of all training runs required to develop the Linguistically-Informed Self-Attention (LISA) model [6].

Project LISA spanned a period of 172 days (approx. 6 months). During that time, 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Job lengths ranged from a minimum of 3 minutes (indicating a crash) to a maximum of 9 days, with an average of 52 hours. Table 3 shows that while training a single model is relatively cheap, the full R&D cost required to develop LISA is far higher.

Table 3: Estimated cost in terms of cloud compute and electricity for training: (1) a single model (2) a single tune and (3) all models trained during R&D.
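
The job statistics above allow a rough back-of-envelope estimate of the scale of the LISA R&D effort. The sketch below uses only the figures quoted in this review. Because the 52-hour average is rounded, the totals are approximations and are not the values reported in the paper. All variable names are my own.

```python
# Figures quoted in this review of the LISA project logs.
PROJECT_DAYS = 172
GRID_SEARCHES = 123
TOTAL_JOBS = 4789
MEAN_JOB_HOURS = 52  # rounded average, so the results below are approximate

# Average number of jobs launched per hyperparameter grid search.
jobs_per_search = TOTAL_JOBS / GRID_SEARCHES

# Total time spent running jobs, summed over all jobs.
total_job_hours = TOTAL_JOBS * MEAN_JOB_HOURS
total_job_days = total_job_hours / 24

# Average number of jobs that would need to run at the same time to fit
# this much job time into the project's calendar duration.
mean_concurrent_jobs = total_job_days / PROJECT_DAYS

print(f"jobs per grid search:    {jobs_per_search:.1f}")
print(f"total job time (hours):  {total_job_hours}")
print(f"total job time (days):   {total_job_days:.0f}")
print(f"average concurrent jobs: {mean_concurrent_jobs:.1f}")
```

Under these assumptions, the development effort amounts to tens of jobs running around the clock for the whole project, which is why the R&D cost dwarfs the cost of training one model.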

The paper closes with three recommendations to reduce costs and improve equity in the research community. The three recommendations are:

1. Authors should report training time and sensitivity to hyperparameters.

  • The training time, computational resources, and hyperparameter sensitivity should be reported for new models. This will enable fair comparison across models, allowing consumers to accurately assess whether the required computational resources are compatible with their setting.

2. Researchers should prioritize computationally efficient hardware and algorithms.

  • Developers should seek and implement more efficient alternatives to brute-force grid search for hyperparameter tuning. For example, Bayesian hyperparameter search should be integrated into deep learning tools (e.g., PyTorch and TensorFlow).

3. Academic researchers need equitable access to computation resources.

  • The experiments for the LISA project [6] were developed outside academia. Its state-of-the-art accuracy was possible thanks to industry access to large-scale compute.
  • Limiting research to rich industry labs will hurt the research community in multiple ways. First, it stifles creativity. Researchers with good ideas will not execute their ideas if large-scale resources are not available. Second, it promotes the already problematic “rich get richer” cycle of research funding, where successful and well-funded groups receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services (e.g., Google Cloud and Microsoft Azure).
  • Yet, the cost of these cloud resources is two times (2x) the actual hardware cost per project. Unlike money spent on cloud compute, the purchased resources would continue to pay off as resources are shared across many projects. Unfortunately, non-profit educational institutions lack the initial funding required to build compute-centers (computer clusters). Accordingly, the paper [1] suggests that it is more cost-effective for academic researchers to pool resources to build shared compute-centers.
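
To illustrate the second recommendation, here is a minimal sketch of why brute-force grid search gets expensive: the number of training jobs is the product of the number of values tried for each hyperparameter. The search space and the budget of 20 trials are hypothetical. Random sampling is used only as a simple stand-in for a fixed-budget search; it is not Bayesian optimization, which would additionally use the results of earlier trials to choose the next configuration.

```python
import itertools
import random

# Hypothetical search space, chosen only for illustration.
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "dropout": [0.1, 0.2, 0.3],
    "num_layers": [2, 4, 6, 8],
    "batch_size": [32, 64, 128],
}


def grid_configs(space):
    """Enumerate every combination of values (brute-force grid search)."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


def random_configs(space, budget, seed=0):
    """Sample a fixed number of configurations, independent of grid size."""
    rng = random.Random(seed)
    return [{key: rng.choice(values) for key, values in space.items()}
            for _ in range(budget)]


grid = grid_configs(search_space)
sampled = random_configs(search_space, budget=20)

# Each configuration corresponds to one training job.
print(f"grid search jobs:     {len(grid)}")
print(f"budgeted search jobs: {len(sampled)}")
```

Adding one more hyperparameter with three candidate values would triple the grid, while the budgeted search would still launch 20 jobs.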

My comments:

  • The paper is short (5 pages) and well-written. It covers interconnected topics related to energy and policy for deep learning.
  • As a graduate student, there is very little I can contribute to the paper's topic. Yet, I think graduate students should be aware of these energy and policy discussions.
  • While reading the *equitable access to computation resources* recommendation, I kept remembering the quotation “life is not fair”. It is true, but we should not accept life as it is. We should strive, within our capabilities, to make life fairer.
  • The paper proposes that academic researchers pool resources to build shared compute-centers. Of course, shared resources are better than either no resources or cloud compute. However, shared resources are not ideal, because they tend to be overused and used inefficiently. Concretely, a researcher will use shared resources recklessly to maximize their own return on investment (ROI). With shared resources, there is no incentive to balance the researcher's interest against the interest of their fellow researchers. This argument is beautifully presented in this movie scene [7].

Resources:

[1] Strubell, Emma, Ananya Ganesh, and Andrew McCallum. “Energy and policy considerations for deep learning in NLP.” arXiv preprint arXiv:1906.02243 (2019).

[2] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

[3] Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018).

[4] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[5] Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI blog 1.8 (2019): 9.

[6] Strubell, Emma, et al. “Linguistically-informed self-attention for semantic role labeling.” arXiv preprint arXiv:1804.08199 (2018).

[7] Governing Dynamics: Ignore the Blonde — A Beautiful Mind (3/11) Movie CLIP

I write reviews on computer vision papers. Writing tips are welcomed.