Tuning parameters of deep neural network training algorithms pays off: a computational study

  • Corrado Coppola [1]; Lorenzo Papa [1]; Marco Boresta [2]; Irene Amerini [1]; Laura Palagi [1]
    1. [1] Università di Roma La Sapienza, Rome, Italy
    2. [2] Istituto di Analisi dei Sistemi ed Informatica “Antonio Ruberti”, Consiglio Nazionale delle Ricerche, Via dei Taurini 19, 00185 Rome, Italy
  • Published in: TOP, ISSN-e 1863-8279, ISSN 1134-5764, Vol. 32, Special issue 3, 2024 (Issue devoted to: Mathematical Optimization and Machine Learning), pp. 579-620
  • Language: English
  • DOI: 10.1007/s11750-024-00683-x
  • Abstract
    • The paper aims to investigate the impact of optimization algorithms on the training of deep neural networks, with a focus on the interaction between the optimizer and generalization performance. In particular, we analyze the behavior of state-of-the-art optimization algorithms in relation to their hyperparameter settings, to assess how robust the local solution they reach is with respect to the choice of the starting point. We conduct extensive computational experiments using nine open-source optimization algorithms to train deep convolutional neural network architectures on a multi-class image classification task. Specifically, we consider several architectures, varying the number of layers and of neurons per layer, to evaluate the impact of different width and depth structures on computational optimization performance. We show that the optimizers often return different local solutions and highlight the strong correlation between the quality of the solution found and the generalization capability of the trained network. We also discuss the role of hyperparameter tuning and show how a tuned hyperparameter setting can be re-used for the same task on different problems, achieving better efficiency and generalization performance than a default setting.

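The abstract describes comparing several first-order optimizers, each under its own hyperparameter setting, when training convolutional networks for multi-class image classification, and judging the result by generalization on held-out data. The Python sketch below only illustrates the shape of such an experiment; it assumes PyTorch/torchvision and CIFAR-10, and the small network, the optimizer list, and the hyperparameter values are hypothetical placeholders rather than the paper's actual architectures or tuned settings.

# A minimal sketch, not the paper's code: it assumes PyTorch and torchvision are
# installed and uses CIFAR-10 as a stand-in image classification task. The small
# CNN, the optimizer list, and the hyperparameter values are hypothetical.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader


def make_cnn(num_classes: int = 10) -> nn.Module:
    # A small convolutional network; width and depth could be varied to study
    # their effect on the optimization, as the abstract describes.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
        nn.Linear(128, num_classes),
    )


# Illustrative per-optimizer settings; in a real study these would come from tuning.
OPTIMIZER_CONFIGS = {
    "sgd_momentum": lambda params: torch.optim.SGD(params, lr=0.05, momentum=0.9),
    "adam":         lambda params: torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999)),
    "adagrad":      lambda params: torch.optim.Adagrad(params, lr=0.01),
    "rmsprop":      lambda params: torch.optim.RMSprop(params, lr=1e-3, alpha=0.99),
}


def train_and_evaluate(opt_name: str, epochs: int = 2, batch_size: int = 128) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tfm = T.Compose([T.ToTensor()])
    train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=tfm)
    test_set = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=tfm)
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=256)

    torch.manual_seed(0)  # fix the starting point so runs differ only by the optimizer
    model = make_cnn().to(device)
    optimizer = OPTIMIZER_CONFIGS[opt_name](model.parameters())
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

    # Held-out accuracy as a proxy for generalization performance.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in test_loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total


if __name__ == "__main__":
    for name in OPTIMIZER_CONFIGS:
        print(f"{name}: test accuracy = {train_and_evaluate(name):.3f}")

With the random seed fixed, the starting point is the same for every run, so differences in final test accuracy reflect the different solutions the optimizers reach; a search over learning rate, momentum, and batch size would play the role of the hyperparameter tuning studied in the paper.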
