F Reproducibility and Environmental Impact

Table F.1: Variable training parameters used in the experiments of this study. MTL stands for multitask learning.

|                  | PC (Ch. 3) | ET (Ch. 3) | Probes (Ch. 3) | PC (Ch. 4) | ET (Ch. 4) | RA (Ch. 4) | ALBERT (Ch. 5) | GPT-2 (Ch. 5) |
|------------------|------------|------------|----------------|------------|------------|------------|----------------|---------------|
| fine-tuning      | standard   | MTL        | MTL            | standard   | MTL        | standard   | MTL            | MTL           |
| granularity      | sent.      | sent.      | sent.          | sent.      | word       | sent.      | word           | word          |
| freeze LM \(w\)  |            |            |                |            |            |            |                |               |
| weighted loss    |            |            |                |            |            |            |                |               |
| CV folds         | 5          | 5          | 5              |            |            |            |                |               |
| early stopping   |            |            |                |            |            |            |                |               |
| training epochs  | 15         | 15         | 5              | 15         | 15         | 15         | 15             | 15            |
| patience         | 5          | 5          |                | 5          | 5          | 5          | 5              | 5             |
| evaluation steps | 20         | 40         |                | 20         | 100        | 80         | 100            | 100           |

Tools Experiments were executed on an Ubuntu 18.04 LTS server, using an NVIDIA Tesla K40 GPU with 12GB of RAM and CUDA 10.1. The relevant Python libraries used throughout the study, with their respective versions, are: 🤗 transformers 2.11.0 for accessing pre-trained Transformer language models, farm 0.4.5 for multitask learning, torch 1.3.0 as the deep learning backend, and syntaxgym 0.5.3 for the experiments of Chapter 5. Python 3.6.3 was used for all training scripts. A custom adaptation of the Oxforddown template was used to typeset this thesis. Code for reproducibility purposes is available at https://github.com/gsarti/interpreting-complexity.
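A minimal sketch for verifying that the pinned versions listed above are installed is shown below; it assumes the libraries are distributed on PyPI under the names transformers, farm, torch and syntaxgym.

```python
# Sketch: print the Python interpreter and library versions used in this study.
# Assumes the PyPI distribution names given in the lead-in; adjust if needed.
import sys
import pkg_resources

print("Python:", sys.version.split()[0])  # expected: 3.6.3

for package in ("transformers", "farm", "torch", "syntaxgym"):
    version = pkg_resources.get_distribution(package).version
    print(f"{package}: {version}")  # expected: 2.11.0, 0.4.5, 1.3.0, 0.5.3
```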

Model Training Table F.1 presents the set of variable training parameters used across the experiments of this study. In addition, a set of fixed parameters was shared by all experiments: a batch size of 32 observations, a maximum sequence length of 128 tokens, a linear training schedule with one-tenth of the total steps used as warmup steps, the AdamW optimizer (Loshchilov and Hutter 2019) with a weight decay of \(0.01\), and a learning rate of \(10^{-5}\). No hyperparameter search was performed due to time limitations.
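As an illustrative sketch (not the training scripts used in the experiments), the fixed optimization setup can be instantiated with the 🤗 transformers 2.11 API roughly as follows; the model class and the total number of optimization steps are placeholders that depend on the specific task and corpus.

```python
# Sketch of the fixed optimization setup: AdamW with weight decay 0.01,
# learning rate 1e-5, and a linear schedule with 10% of the steps as warmup.
from transformers import (
    AdamW,
    AlbertForSequenceClassification,
    get_linear_schedule_with_warmup,
)

# Placeholder model: any pre-trained Transformer with a task-specific head.
model = AlbertForSequenceClassification.from_pretrained("albert-base-v2")

total_steps = 1000  # placeholder: (num. examples / batch size of 32) * num. epochs
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=total_steps // 10,  # one-tenth of total steps
    num_training_steps=total_steps,
)
```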

Tokenization All tokenizers used in the experiments operate on cased text and are based on the SentencePiece approach (Kudo and Richardson 2018) for ALBERT and on a custom version of Byte-Pair Encoding (BPE) tokenization (Sennrich, Haddow, and Birch 2016) that treats whitespace as part of tokens for GPT-2. The default AlbertTokenizer and GPT2Tokenizer classes available in the 🤗 transformers library were used with their respective pretrained tokenizers. The resulting vocabularies have a size of 30,000 tokens for ALBERT and 50,257 tokens for GPT-2, including special tokens.
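The two tokenizers can be loaded and inspected as in the following sketch; the example sentence is arbitrary and serves only to show the difference between the two subword schemes.

```python
# Load the pretrained tokenizers described above and inspect their vocabularies.
from transformers import AlbertTokenizer, GPT2Tokenizer

albert_tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(len(albert_tokenizer))  # 30000, SentencePiece vocabulary
print(len(gpt2_tokenizer))    # 50257, byte-pair encoding vocabulary

sentence = "Reading this sentence takes some effort."
print(albert_tokenizer.tokenize(sentence))  # SentencePiece pieces, '▁' marks word starts
print(gpt2_tokenizer.tokenize(sentence))    # BPE tokens, 'Ġ' marks a preceding space
```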

Architecture The default parameters of the 🤗 transformers checkpoints of ALBERT and GPT-2 (specifically, albert-base-v2 and gpt2 in the Model Hub) were used for this study. Concretely, this means an embedding size of 128 and a hidden size of 768 (with a feed-forward intermediate size of 3072) for ALBERT, a tied embedding-hidden size of 768 for GPT-2, 12 Transformer blocks each using 12 heads for multi-head self-attention, and a smoothed variant of the Gaussian Error Linear Unit (GELU) as nonlinearity (Hendrycks and Gimpel 2016). GPT-2 has embedding and attention dropout rates of 0.1 and a layer normalization (Ba, Kiros, and Hinton 2016) epsilon of \(10^{-5}\), while ALBERT employs a classifier dropout rate of 0.1 and a layer normalization epsilon of \(10^{-12}\).
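These values can be read directly from the default configuration objects of the two checkpoints, for instance as in the following sketch (attribute names are those of the 🤗 transformers configuration classes).

```python
# Inspect the default configurations of the albert-base-v2 and gpt2 checkpoints.
from transformers import AlbertConfig, GPT2Config

albert_config = AlbertConfig.from_pretrained("albert-base-v2")
gpt2_config = GPT2Config.from_pretrained("gpt2")

print(albert_config.embedding_size,       # 128
      albert_config.hidden_size,          # 768
      albert_config.intermediate_size,    # 3072
      albert_config.num_hidden_layers,    # 12
      albert_config.num_attention_heads,  # 12
      albert_config.layer_norm_eps)       # 1e-12

print(gpt2_config.n_embd,                 # 768, tied embedding-hidden size
      gpt2_config.n_layer,                # 12
      gpt2_config.n_head,                 # 12
      gpt2_config.layer_norm_epsilon,     # 1e-5
      gpt2_config.embd_pdrop,             # 0.1
      gpt2_config.attn_pdrop)             # 0.1
```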

CO2 Emissions Related to Experiments Experiments were conducted using the private infrastructure of the ItaliaNLP Lab at the Institute for Computational Linguistics “A. Zampolli” (ILC-CNR) in Pisa, which has an estimated carbon efficiency of 0.321 kgCO\(_2\)eq/kWh (Moro and Lonza 2018). A cumulative total of roughly 100 hours of computation was performed on a Tesla K40 GPU (TDP of 245W). Total emissions are estimated at 7.86 kgCO\(_2\)eq. Estimates were obtained using the Machine Learning Impact Calculator presented in Lacoste et al. (2019).
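For reference, the estimate follows directly from runtime, power draw and carbon intensity: \(100\,\text{h} \times 0.245\,\text{kW} \times 0.321\,\text{kgCO}_2\text{eq/kWh} \approx 7.86\,\text{kgCO}_2\text{eq}\).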

Detailed reports of all experimental runs were produced automatically using the MLflow tool and are available at the following address: https://public-mlflow.deepset.ai/#/experiments/99.
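As a generic sketch of how such run reports can be pushed to an MLflow tracking server with the mlflow Python client (the tracking URI, experiment, run and metric names below are placeholders, not the ones used in this study):

```python
# Generic MLflow logging sketch; all names and values below are placeholders.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")   # placeholder tracking server
mlflow.set_experiment("complexity-experiments")    # placeholder experiment name

with mlflow.start_run(run_name="albert-pc-mtl"):   # placeholder run name
    mlflow.log_params({"batch_size": 32, "max_seq_len": 128, "learning_rate": 1e-5})
    mlflow.log_metric("eval_loss", 0.42, step=100)  # placeholder metric value
```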

References

Ba, Jimmy, J. Kiros, and Geoffrey E. Hinton. 2016. “Layer Normalization.” ArXiv Pre-Print 1607.06450. https://arxiv.org/abs/1607.06450.

Hendrycks, Dan, and Kevin Gimpel. 2016. “Gaussian Error Linear Units (GELUs).” ArXiv Pre-Print 1606.08415. https://arxiv.org/abs/1606.08415.

Kudo, Taku, and John Richardson. 2018. “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 66–71. Brussels, Belgium: Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-2012.

Lacoste, Alexandre, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. “Quantifying the Carbon Emissions of Machine Learning.” ArXiv Pre-Print 1910.09700. https://arxiv.org/abs/1910.09700.

Loshchilov, Ilya, and Frank Hutter. 2019. “Decoupled Weight Decay Regularization.” In Proceedings of the 7th International Conference on Learning Representations (ICLR’19).

Moro, Alberto, and Laura Lonza. 2018. “Electricity Carbon Intensity in European Member States: Impacts on GHG Emissions of Electric Vehicles.” Transportation Research Part D: Transport and Environment 64. Elsevier: 5–14.

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–25. Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162.