Information theory for deep learning

Information-theoretic approaches to deep learning have attracted recent interest due to intriguing fundamental results and new hypotheses. Applying information theory to deep neural networks (DNNs) may provide novel tools for explainable AI via estimation of information flows [1,2], as well as new ways to encourage models to extract and generalize information [1,3,4,5]. Information theory also serves as a basis for some recent results on the generalization capabilities and robustness of DNNs [4].

We plan to cover several research topics at the intersection of deep learning and information theory. First, we are going to explore the process of deep learning through the lens of the information bottleneck principle. Our recent work [6] reveals an interesting connection between certain features of the plot of the loss function over time and the information plane plots: we observe several so-called “compression phases”, with the first one coinciding with the rapid decrease of the loss function. We plan to further investigate this and other interesting phenomena, such as neural collapse and “grokking”.
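For reference, the information bottleneck principle mentioned above is commonly written as a trade-off between compressing the input and preserving information about the target. The notation below is the standard one from the literature and is an assumption here, not taken from [6]: X is the input, Y the target, T an intermediate representation (for example, the activations of a layer), and β > 0 a trade-off coefficient.

```latex
% Information bottleneck objective (standard form; notation assumed, not from [6]).
% T is a representation of X, so that Y -> X -> T forms a Markov chain.
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)
% The data processing inequality bounds what T can retain about Y:
% I(T; Y) \le I(X; Y).
```

Under this notation, an information plane plot tracks the pair (I(X;T), I(T;Y)) over the course of training, and a “compression phase” is usually understood as a stretch of training during which I(X;T) decreases.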

Second, we aim to improve existing information-based representation learning approaches. We plan to modify the well-established Deep InfoMax [5] self-supervised representation learning method to allow for automatic distribution matching, i.e., learning representations that follow a specified distribution. This is an important task for several downstream applications. We are also interested in developing information-theoretic methods for representation disentanglement.

Finally, with the growing number of applications of information theory to deep learning, accurate estimation of information-theoretic quantities becomes more and more important. That is why we also plan to develop advanced neural estimators of mutual information and entropy. In [6], we use autoencoders to compress the data and estimate the mutual information between the compressed representations. In [7], normalizing flows are used to make closed-form expressions for mutual information applicable. In the course of this research, we plan to harness the expressive power of diffusion models to assist the estimation of information-theoretic quantities.
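To illustrate what a closed-form expression for mutual information can look like, the sketch below uses the well-known formula for jointly Gaussian variables, I(X;Y) = ½ (log det Σ_X + log det Σ_Y − log det Σ), where Σ is the joint covariance. This is only an assumed example of such an expression, not a reproduction of the method in [7]; the function names `gaussian_mi` and `estimate_gaussian_mi` are hypothetical, and the plug-in estimate is valid only if the data really are (close to) jointly Gaussian.

```python
import numpy as np


def gaussian_mi(cov, dim_x):
    """MI in nats between the first dim_x coordinates and the remaining
    coordinates of a jointly Gaussian vector with covariance `cov`.
    (Hypothetical helper, for illustration only.)"""
    cov = np.asarray(cov, dtype=float)
    cov_x = cov[:dim_x, :dim_x]   # marginal covariance of X
    cov_y = cov[dim_x:, dim_x:]   # marginal covariance of Y
    # slogdet returns (sign, log|det|); log-determinants are more
    # numerically stable than computing determinants directly.
    _, logdet_x = np.linalg.slogdet(cov_x)
    _, logdet_y = np.linalg.slogdet(cov_y)
    _, logdet_xy = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_xy)


def estimate_gaussian_mi(x, y):
    """Plug-in estimate from samples x of shape (n, dx) and y of shape (n, dy),
    assuming the pair (x, y) is jointly Gaussian."""
    joint = np.hstack([x, y])
    cov = np.cov(joint, rowvar=False)  # rows are samples, columns are variables
    return gaussian_mi(cov, x.shape[1])


# Toy check on a bivariate Gaussian with correlation rho, where the
# exact value is known to be -0.5 * log(1 - rho**2).
rng = np.random.default_rng(0)
rho = 0.8
true_cov = np.array([[1.0, rho], [rho, 1.0]])
samples = rng.multivariate_normal(np.zeros(2), true_cov, size=20000)

mi_true = gaussian_mi(true_cov, 1)
mi_est = estimate_gaussian_mi(samples[:, :1], samples[:, 1:])
print(f"exact MI: {mi_true:.4f} nats, plug-in estimate: {mi_est:.4f} nats")
```

The formula applies only to Gaussian data, which is why a learned invertible transformation is useful here: mutual information is invariant under invertible maps applied separately to each variable, so a closed-form expression evaluated on transformed data still measures the mutual information of the original data, provided the transformed pair is well described by the assumed family.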

Information plane plots for the MNIST classifier. The lower left parts of plots (b)-(d) correspond to the first epochs. We use 95% asymptotic CIs for the MI estimates obtained from the compressed data. The colormap represents the difference in loss between two consecutive epochs.