Information theory for deep learning

   Information-theoretic approaches in deep learning have attracted considerable recent interest due to intriguing fundamental results and new hypotheses. Applying information theory to deep neural networks (DNNs) may provide novel tools for explainable AI via the estimation of information flows [1,2], as well as new ways to encourage models to extract and generalize information [1,3,4,5]. Information theory also underlies recent results on the generalization capabilities and robustness of DNNs [4].


   We plan to cover several research topics at the intersection of deep learning and information theory. Firstly, we are going to explore the process of deep learning through the lens of the information bottleneck principle. Our recent work [6] reveals an interesting connection between certain features of the loss curve and the information plane plots: we observe several so-called “compression phases”, the first of which coincides with the rapid decrease of the loss function. We plan to further investigate this and other interesting phenomena, such as neural collapse and “grokking”.
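   For reference, the information bottleneck principle [3] frames learning as a trade-off between compressing the input and preserving information about the target; a standard formulation of the objective is

      \min_{p(t \mid x)} \; I(X; T) - \beta \, I(T; Y), \qquad \beta > 0,

   where X is the input, Y is the target, T is the (possibly stochastic) representation, and beta controls the compression-prediction trade-off. Information plane plots track the pair (I(X; T), I(T; Y)) over training, which is how the compression phases mentioned above are observed.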


   Secondly, we aim to improve existing information-based representation learning approaches. We plan to modify the well-established Deep InfoMax [5] self-supervised representation learning method to allow for automatic distribution matching, i.e., learning representations that follow a prescribed distribution. This is important for several downstream applications. We are also interested in developing information-theoretic methods for representation disentanglement.
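   As a rough illustration of the kind of objective involved (not the exact Deep InfoMax loss, which relies on local/global feature discriminators and an adversarial prior-matching term), the sketch below combines an InfoNCE-style mutual information objective with a simple moment-based penalty that pushes representations towards a standard normal prior. All module names, dimensions, and the augmentation scheme are illustrative placeholders.

# Minimal sketch: contrastive MI maximization plus a crude distribution-matching
# penalty, in the spirit of (but not identical to) Deep InfoMax [5].
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Toy encoder mapping flattened inputs to a low-dimensional representation."""

    def __init__(self, in_dim: int = 784, z_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def infonce_objective(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style contrastive objective; this value plus log(batch size)
    is a lower bound on the mutual information between the two views."""
    logits = z_a @ z_b.t()                       # similarity scores for all pairs
    labels = torch.arange(z_a.size(0))           # positives lie on the diagonal
    return -F.cross_entropy(logits, labels)      # larger value = tighter bound


def prior_matching_penalty(z: torch.Tensor) -> torch.Tensor:
    """Crude moment-matching penalty pushing z towards a standard normal prior.

    Deep InfoMax uses an adversarial discriminator for prior matching; a moment
    penalty is used here only to keep the sketch short and dependency-free.
    """
    mean_term = z.mean(dim=0).pow(2).mean()
    var_term = (z.var(dim=0) - 1.0).pow(2).mean()
    return mean_term + var_term


if __name__ == "__main__":
    encoder = Encoder()
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

    x = torch.rand(128, 784)                               # stand-in batch of images
    x_aug = (x + 0.05 * torch.randn_like(x)).clamp(0, 1)   # simple noise augmentation

    z, z_aug = encoder(x), encoder(x_aug)
    loss = -infonce_objective(z, z_aug) + 0.1 * prior_matching_penalty(z)

    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"single-step loss: {loss.item():.4f}")

   In this sketch the moment penalty stands in for the distribution matching discussed above; in practice one would replace it with a more expressive matching mechanism.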


   Finally, with the growing number of applications of information theory to deep learning, accurate estimation of information-theoretic quantities becomes increasingly important. We therefore also plan to develop advanced neural estimators of mutual information and entropy. In [6], we use autoencoders to compress the data and estimate the mutual information between the compressed representations. In [7], normalizing flows are utilized to obtain closed-form expressions for mutual information. In our further research, we plan to harness the expressive power of diffusion models to assist the estimation of information-theoretic quantities.
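   As background for the estimators discussed above, one widely used family of neural estimators relies on variational lower bounds on mutual information, such as the Donsker-Varadhan bound behind MINE. The sketch below (with illustrative names and hyperparameters, and not one of the estimators from [6,7]) trains a small critic network to estimate the mutual information of two correlated Gaussian variables, for which the ground truth is known in closed form.

# Minimal sketch of a Donsker-Varadhan (MINE-style) neural mutual information
# estimator on synthetic correlated Gaussians; names and settings are illustrative.
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Scores (x, y) pairs; trained to separate joint samples from shuffled ones."""

    def __init__(self, dim: int = 1, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)


def dv_lower_bound(critic: Critic, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan lower bound: E_joint[f] - log E_marginals[exp(f)]."""
    joint_term = critic(x, y).mean()
    y_shuffled = y[torch.randperm(y.size(0))]    # break the pairing to mimic p(x)p(y)
    marginal_term = torch.logsumexp(critic(x, y_shuffled), dim=0) - torch.log(
        torch.tensor(float(y.size(0)))
    )
    return joint_term - marginal_term


if __name__ == "__main__":
    rho = 0.8                                    # correlation of the toy Gaussians
    critic = Critic()
    opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    for step in range(2000):
        x = torch.randn(512, 1)
        y = rho * x + (1.0 - rho**2) ** 0.5 * torch.randn(512, 1)
        mi_estimate = dv_lower_bound(critic, x, y)
        loss = -mi_estimate                      # maximize the lower bound
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Closed-form MI for these Gaussians: -0.5 * log(1 - rho^2) ≈ 0.511 nats.
    print(f"estimated MI ≈ {mi_estimate.item():.3f} nats")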


   Figure: Information plane plots for an MNIST classifier. The lower-left parts of panels (b)-(d) correspond to the first epochs. We use 95% asymptotic confidence intervals for the mutual information estimates obtained from the compressed data. The colormap shows the difference in loss between consecutive epochs.

Grants:

1. 2020–2021, Russian Foundation for Basic Research (Scientific Mentoring), grant 19-37-51036, “Information-theory based analysis of deep neural networks”.

References:
1. R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. CoRR, vol. abs/1703.00810, 2017.
2. Z. Goldfeld, E. van den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating information flow in deep neural networks. Proceedings of Machine Learning Research, vol. 97, pp. 2299–2308, 2019.
3. N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368–377, 1999.
4. K. Kawaguchi, Z. Deng, X. Ji, and J. Huang. How does information bottleneck help deep learning? In Proceedings of the 40th International Conference on Machine Learning, PMLR 202:16049–16096, 2023.
5. R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv:1808.06670, 2019.
6. I. Butakov, A. Tolmachev, S. Malanchuk, A. Neopryatnaya, A. Frolov, and K. Andreev. Information bottleneck analysis of deep neural networks via lossy compression. In The Twelfth International Conference on Learning Representations, 2024.
7. I. Butakov, A. Tolmachev, S. Malanchuk, A. Neopryatnaya, and A. Frolov. Mutual information estimation via normalizing flows. arXiv preprint arXiv:2403.02187, 2024.