Deep Learning: Mathematical Foundations and Applications to Information Science | IEEE Information Theory Society

IEEE Journal on Selected Areas in Information Theory

Table of Contents

JSAIT cover 1 deep learning

2020

Deep Learning: Mathematical Foundations and Applications to Information Science

Guest editors

Richard Baraniuk

Alex Dimakis

Negar Kiyavash

Sewoong Oh

Rebecca Willett

JSAIT's first special issue will focus on the mathematical foundations of deep learning as well as applications across information science.

Welcome to the IEEE Journal on Selected Areas in Information Theory (JSAIT)

Andrea J. Goldsmith

I would like to warmly welcome our readers to this inaugural special issue of JSAIT, the Information Theory Society’s first new journal since the IRE Transactions on Information Theory launched in 1953. The society’s desire to expand its technical scope, incubate new research directions, catalyze connections with other disciplines, and highlight new and emerging applications formed the impetus for the new journal. To achieve these goals, JSAIT publishes special issues focusing on the intersections of information theory with fields such as machine learning, statistics, biology, finance, computer science, and physics. There will also be special issues for emerging topics of broad interest that are firmly within Information Theory. Each special issue will have at least one tutorial aimed to benefit people in our field and to forge connections with people in other fields that information theory has impacted or can impact.

READ ON IEEE Xplore

Guest Editorial

Richard Baraniuk Alex Dimakis Negar Kiyavash Sewoong Oh Rebecca Willett

Welcome to the first issue of the Journal on Selected Areas in Information Theory (JSAIT) focusing on Deep Learning: Mathematical Foundations and Applications to Information Science.

READ ON IEEE Xplore

Physical Layer Communication via Deep Learning

Hyeji Kim Sewoong Oh Pramod Viswanath

Reliable digital communication is a primary workhorse of the modern information age. The disciplines of communication, coding, and information theories drive the innovation by designing efficient codes that allow transmissions to be robustly and efficiently decoded. Progress in near optimal codes is made by individual human ingenuity over the decades, and breakthroughs have been, befittingly, sporadic and spread over several decades. Deep learning is a part of daily life where its successes can be attributed to a lack of a (mathematical) generative model. Deep learning empirically fits neural network models to the data, and the result has been extremely potent. In yet other applications, the data is generated by a simple model and performance criterion mathematically precise and training/test samples infinitely abundant, but the space of algorithmic choices is enormous (example: chess). Deep learning has recently shown strong promise in these problems too (example: alphazero). The latter scenario is a good description of communication theory. The mathematical models underlying canonical communication channels allow one to sample an unlimited amount of data to train and test the communication algorithms ((encoder, decoder) pairs) and the metric of bit (or block) error rate allows for mathematically precise evaluation. Motivated by the successes of deep learning in mathematically well defined and extremely challenging tasks of chess, Go, and protein folding, we posit that deep learning methods can play a crucial role in solving core goals of coding theory: designing new (encoder, decoder) pairs that improve state of the art performance over canonical channel models. This manuscript surveys recent advances towards demonstrating this hypothesis, by focusing on strengthening a specific family of coding methods - sequential codes (convolutional codes) and codes that use them as basic building blocks (Turbo codes) - via deep learning methods. New state of the art results are derived on several canonical channels, including the AWGN channel with feedback.

READ ON IEEE Xplore

The Information Bottleneck Problem and its Applications in Machine Learning

Ziv Goldfeld Yury Polyanskiy

Inference capabilities of machine learning (ML) systems skyrocketed in recent years, now playing a pivotal role in various aspect of society. The goal in statistical learning is to use data to obtain simple algorithms for predicting a random variable Y from a correlated observation X. Since the dimension of X is typically huge, computationally feasible solutions should summarize it into a lower-dimensional feature vector T, from which Y is predicted. The algorithm will successfully make the prediction if T is a good proxy of Y, despite the said dimensionality-reduction. A myriad of ML algorithms (mostly employing deep learning (DL)) for finding such representations T based on real-world data are now available. While these methods are effective in practice, their success is hindered by the lack of a comprehensive theory to explain it. The information bottleneck (IB) theory recently emerged as a bold information-theoretic paradigm for analyzing DL systems. Adopting mutual information as the figure of merit, it suggests that the best representation T should be maximally informative about Y while minimizing the mutual information with X. In this tutorial we survey the information-theoretic origins of this abstract principle, and its recent impact on DL. For the latter, we cover implications of the IB problem on DL theory, as well as practical algorithms inspired by it. Our goal is to provide a unified and cohesive description. A clear view of current knowledge is important for further leveraging IB and other information-theoretic ideas to study DL models.

READ ON IEEE Xplore

Deep Learning Techniques for Inverse Problems in Imaging

Gregory Ongie Ajil Jalal Christopher A. Metzler Richard G. Baraniuk Alexandros G. Dimakis Rebecca Willett

Recent work in machine learning shows that deep neural networks can be used to solve a wide variety of inverse problems arising in computational imaging. We explore the central prevailing themes of this emerging area and present a taxonomy that can be used to categorize different problems and reconstruction methods. Our taxonomy is organized along two central axes: (1) whether or not a forward model is known and to what extent it is used in training and testing, and (2) whether or not the learning is supervised or unsupervised, i.e., whether or not the training relies on access to matched ground truth image and measurement pairs. We also discuss the tradeoffs associated with these different reconstruction approaches, caveats and common failure modes, plus open problems and avenues for future work.

READ ON IEEE Xplore

Expression of Fractals Through Neural Network Functions

Nadav Dym Barak Sober Ingrid Daubechies

To help understand the underlying mechanisms of neural networks (NNs), several groups have studied the number of linear regions ℓ of piecewise linear (PwL) functions, generated by deep neural networks (DNN). In particular, they showed that ℓ can grow exponentially with the number of network parameters p, a property often used to explain the advantages of deep over shallow NNs. Nonetheless, a dimension argument shows that DNNs cannot generate all PwL functions with ℓ linear regions when ℓ > p.It is thus natural to seek to characterize specific families of functions with ℓ > p linear regions that can be constructed by DNNs. Iterated Function Systems (IFS) recursively construct a sequence of PwL functions Fk with a number of linear regions which is exponential in k. We show that Fk can be generated by a NN using only O(k) parameters. IFS are used extensively to generate natural-looking landscape textures in artificial images as well as for compression of natural images. The surprisingly good performance of this compression suggests that human visual system may lock in on self-similarities. The combination of this phenomenon with the capacity of DNNs to efficiently approximate IFS may contribute to the success of DNNs, particularly striking for image processing tasks.

READ ON IEEE Xplore

Harmless Interpolation of Noisy Data in Regression

Vidya Muthukumar Kailas Vodrahalli Vignesh Subramanian Anant Sahai

A continuing mystery in understanding the empirical success of deep neural networks is their ability to achieve zero training error and generalize well, even when the training data is noisy and there are more parameters than data points. We investigate this overparameterized regime in linear regression, where all solutions that minimize training error interpolate the data, including noise. We lower-bound the fundamental generalization (mean-squared) error of any interpolating solution in the presence of noise, and show that this bound decays to zero with the number of features. Thus, overparameterization can be beneficial in ensuring harmless interpolation of noise. We discuss two root causes for poor generalization that are complementary in nature - signal “bleeding” into a large number of alias features, and overfitting of noise by parsimonious feature selectors. For the sparse linear model with noise, we provide a hybrid interpolating scheme that mitigates both these issues and achieves order-optimal MSE over all possible interpolating solutions.

READ ON IEEE Xplore

Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks

Samet Oymak Mahdi Soltanolkotabi

Many modern neural network architectures are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels including random noise. However, given the highly nonconvex nature of the training landscape it is not clear what level and kind of overparameterization is required for first order methods to converge to a global optima that perfectly interpolate any labels. A number of recent theoretical works have shown that for very wide neural networks where the number of hidden units is polynomially large in the size of the training data gradient descent starting from a random initialization does indeed converge to a global optima. However, in practice much more moderate levels of overparameterization seems to be sufficient and in many cases overparameterized models seem to perfectly interpolate the training data as soon as the number of parameters exceed the size of the training data by a constant factor. Thus there is a huge gap between the existing theoretical literature and practical experiments. In this paper we take a step towards closing this gap. Focusing on shallow neural nets and smooth activations, we show that (stochastic) gradient descent when initialized at random converges at a geometric rate to a nearby global optima as soon as the square-root of the number of network parameters exceeds the size of the training data. Our results also benefit from a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs).

READ ON IEEE Xplore

Sample Compression, Support Vectors, and Generalization in Deep Learning

Christopher Snyder Sriram Vishwanath

Even though Deep Neural Networks (DNNs) are widely celebrated for their practical performance, they possess many intriguing properties related to depth that are difficult to explain both theoretically and intuitively. Understanding how weights in deep networks coordinate together across layers to form useful learners has proven challenging, in part because the repeated composition of nonlinearities has proved intractable. This paper presents a reparameterization of DNNs as a linear function of a feature map that is locally independent of the weights. This feature map transforms depth-dependencies into simple tensor products and maps each input to a discrete subset of the feature space. Then, using a max-margin assumption, the paper develops a sample compression representation of the neural network in terms of the discrete activation state of neurons induced by s “support vectors”. The paper shows that the number of support vectors s relates with learning guarantees for neural networks through sample compression bounds, yielding a sample complexity of O(ns/E) for networks with n neurons. Finally, the number of support vectors s is found to monotonically increase with width and label noise but decrease with depth.

READ ON IEEE Xplore

Tightening Mutual Information-Based Bounds on Generalization Error

Yuheng Bu Shaofeng Zou Venugopal V. Veeravalli

An information-theoretic upper bound on the generalization error of supervised learning algorithms is derived. The bound is constructed in terms of the mutual information between each individual training sample and the output of the learning algorithm. The bound is derived under more general conditions on the loss function than in existing studies; nevertheless, it provides a tighter characterization of the generalization error. Examples of learning algorithms are provided to demonstrate the tightness of the bound, and to show that it has a broad range of applicability. Application to noisy and iterative algorithms, e.g., stochastic gradient Langevin dynamics (SGLD), is also studied, where the constructed bound provides a tighter characterization of the generalization error than existing results. Finally, it is demonstrated that, unlike existing bounds, which are difficult to compute and evaluate empirically, the proposed bound can be estimated easily in practice.

READ ON IEEE Xplore

Extracting Robust and Accurate Features via a Robust Information Bottleneck

Ankit Pensia Varun Jog Po-Ling Loh

We propose a novel strategy for extracting features in supervised learning that can be used to construct a classifier which is more robust to small perturbations in the input space. Our method builds upon the idea of the information bottleneck, by introducing an additional penalty term that encourages the Fisher information of the extracted features to be small when parametrized by the inputs. We present two formulations where the relevance of the features to output labels is measured using either mutual information or MMSE. By tuning the regularization parameter, we can explicitly trade off the opposing desiderata of robustness and accuracy when constructing a classifier. We derive optimal solutions to both robust information bottleneck formulations when the inputs and outputs are jointly Gaussian, proving that the optimally robust features are also jointly Gaussian in this setting. We also propose methods for optimizing variational bounds on the robust information bottleneck objectives in general settings using stochastic gradient descent, which may be implemented efficiently in neural networks. Our experimental results for synthetic and real data sets show that the proposed feature extraction methods indeed produce classifiers with increased robustness to perturbations.

READ ON IEEE Xplore

A Fourier-Based Approach to Generalization and Optimization in Deep Learning

Farzan Farnia Jesse M. Zhang David N. Tse

The success of deep neural networks stems from their ability to generalize well on real data; however, et al. have observed that neural networks can easily overfit randomly-generated labels. This observation highlights the following question: why do gradient methods succeed in finding generalizable solutions for neural networks while there exist solutions with poor generalization behavior? In this work, we use a Fourier-based approach to study the generalization properties of gradient-based methods over 2-layer neural networks with band-limited activation functions. Our results indicate that in such settings if the underlying distribution of data enjoys nice Fourier properties including bandlimitedness and bounded Fourier norm, then the gradient descent method can converge to local minima with nice generalization behavior. We also establish a Fourier-based generalization error bound for band-limited function spaces, applicable to 2-layer neural networks with general activation functions. This generalization bound motivates a grouped version of path norms for measuring the complexity of 2-layer neural networks with ReLU-type activation functions. We empirically demonstrate that regularization of the group path norms results in neural network solutions that can fit true labels without losing test accuracy while not overfitting random labels.

READ ON IEEE Xplore

An Information-Theoretic Approach to Unsupervised Feature Selection for High-Dimensional Data

Shao-Lun Huang Xiangxiang Xu Lizhong Zheng

In this paper, we propose an information-theoretic approach to design the functional representations to extract the hidden common structure shared by a set of random variables. The main idea is to measure the common information between the random variables by Watanabe's total correlation, and then find the hidden attributes of these random variables such that the common information is reduced the most given these attributes. We show that these attributes can be characterized by an exponential family specified by the eigen-decomposition of some pairwise joint distribution matrix. Then, we adopt the log-likelihood functions for estimating these attributes as the desired functional representations of the random variables, and show that such representations are informative to describe the common structure. Moreover, we design both the multivariate alternating conditional expectation (MACE) algorithm to compute the proposed functional representations for discrete data, and a novel neural network training approach for continuous or high-dimensional data. Furthermore, we show that our approach has deep connections to existing techniques, such as Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, linear principal component analysis (PCA), and consistent functional map, which establishes insightful connections between information theory and machine learning. Finally, the performances of our algorithms are validated by numerical simulations.

READ ON IEEE Xplore

MaxiMin Active Learning in Overparameterized Model Classes

Mina Karzand Robert D. Nowak

Generating labeled training datasets has become a major bottleneck in Machine Learning (ML) pipelines. Active ML aims to address this issue by designing learning algorithms that automatically and adaptively select the most informative examples for labeling so that human time is not wasted labeling irrelevant, redundant, or trivial examples. This paper proposes a new approach to active ML with nonparametric or overparameterized models such as kernel methods and neural networks. In the context of binary classification, the new approach is shown to possess a variety of desirable properties that allow active learning algorithms to automatically and efficiently identify decision boundaries and data clusters.

READ ON IEEE Xplore

DeepJSCC-f: Deep Joint Source-Channel Coding of Images With Feedback

David Burth Kurka Deniz Gündüz

We consider wireless transmission of images in the presence of channel output feedback. From a Shannon theoretic perspective feedback does not improve the asymptotic end-to-end performance, and separate source coding followed by capacity-achieving channel coding, which ignores the feedback signal, achieves the optimal performance. It is well known that separation is not optimal in the practical finite blocklength regime; however, there are no known practical joint source-channel coding (JSCC) schemes that can exploit the feedback signal and surpass the performance of separation-based schemes. Inspired by the recent success of deep learning methods for JSCC, we investigate how noiseless or noisy channel output feedback can be incorporated into the transmission system to improve the reconstruction quality at the receiver. We introduce an autoencoder-based JSCC scheme, which we call DeepJSCC-f , that exploits the channel output feedback, and provides considerable improvements in terms of the end-to-end reconstruction quality for fixed-length transmission, or in terms of the average delay for variable-length transmission. To the best of our knowledge, this is the first practical JSCC scheme that can fully exploit channel output feedback, demonstrating yet another setting in which modern machine learning techniques can enable the design of new and efficient communication methods that surpass the performance of traditional structured coding-based designs.

READ ON IEEE Xplore

Deepcode: Feedback Codes via Deep Learning

Hyeji Kim Yihan Jiang Sreeram Kannan Sewoong Oh Pramod Viswanath

The design of codes for communicating reliably over a statistically well defined channel is an important endeavor involving deep mathematical research and wide-ranging practical applications. In this work, we present the first family of codes obtained via deep learning, which significantly outperforms state-of-the-art codes designed over several decades of research. The communication channel under consideration is the Gaussian noise channel with feedback, whose study was initiated by Shannon; feedback is known theoretically to improve reliability of communication, but no practical codes that do so have ever been successfully constructed. We break this logjam by integrating information theoretic insights harmoniously with recurrent-neural-network based encoders and decoders to create novel codes that outperform known codes by 3 orders of magnitude in reliability and achieve a 3dB gain in terms of SNR. We also demonstrate several desirable properties of the codes: (a) generalization to larger block lengths, (b) composability with known codes, and (c) adaptation to practical constraints. This result also has broader ramifications for coding theory: even when the channel has a clear mathematical model, deep learning methodologies, when combined with channel-specific information-theoretic insights, can potentially beat state-of-the-art codes constructed over decades of mathematical research.

READ ON IEEE Xplore

LEARN Codes: Inventing Low-Latency Codes via Recurrent Neural Networks

Yihan Jiang Hyeji Kim Himanshu Asnani Sreeram Kannan Sewoong Oh Pramod Viswanath

Designing channel codes under low-latency constraints is one of the most demanding requirements in 5G standards. However, a sharp characterization of the performance of traditional codes is available only in the large block-length limit. Guided by such asymptotic analysis, code designs require large block lengths as well as latency to achieve the desired error rate. Tail-biting convolutional codes and other recent state-of-the-art short block codes, while promising reduced latency, are neither robust to channel-mismatch nor adaptive to varying channel conditions. When the codes designed for one channel (e.g., Additive White Gaussian Noise (AWGN) channel) are used for another (e.g., non-AWGN channels), heuristics are necessary to achieve non-trivial performance. In this paper, we first propose an end-to-end learned neural code, obtained by jointly designing a Recurrent Neural Network (RNN) based encoder and decoder. This code outperforms canonical convolutional code under block settings. We then leverage this experience to propose a new class of codes under low-latency constraints, which we call Low-latency Efficient Adaptive Robust Neural (LEARN) codes. These codes outperform state-of-the-art low-latency codes and exhibit robustness and adaptivity properties. LEARN codes show the potential to design new versatile and universal codes for future communications via tools of modern deep learning coupled with communication engineering insights.

READ ON IEEE Xplore

Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations

Debraj Basu Deepesh Data Can Karakus Suhas N. Diggavi

Communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression or computing local models and mixing them iteratively. In this paper, we propose Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence for Qsparse-local-SGD in the distributed setting for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet and show that it results in significant savings over the state-of-the-art, in the number of bits transmitted to reach target accuracy.

READ ON IEEE Xplore

Learning-Based Coded Computation

Jack Kosaian K. V. Rashmi Shivaram Venkataraman

Recent advances have shown the potential for coded computation to impart resilience against slowdowns and failures that occur in distributed computing systems. However, existing coded computation approaches are either unable to support non-linear computations, or can only support a limited subset of non-linear computations while requiring high resource overhead. In this work, we propose a learning-based coded computation framework to overcome the challenges of performing coded computation for general non-linear functions. We show that careful use of machine learning within the coded computation framework can extend the reach of coded computation to imparting resilience to more general non-linear computations. We showcase the applicability of learning-based coded computation to neural network inference, a major workload in production services. Our evaluation results show that learning-based coded computation enables accurate reconstruction of unavailable results from widely deployed neural networks for a variety of inference tasks such as image classification, speech recognition, and object localization. We implement our proposed approach atop an open-source prediction serving system and show its promise in alleviating slowdowns that occur in neural network inference. These results indicate the potential for learning-based approaches to open new doors for the use of coded computation for broader, non-linear computations.

READ ON IEEE Xplore

On Distributed Quantization for Classification

Osama A. Hanna Yahya H. Ezzeldin Tara Sadjadpour Christina Fragouli Suhas Diggavi

We consider the problem of distributed feature quantization, where the goal is to enable a pretrained classifier at a central node to carry out its classification on features that are gathered from distributed nodes through communication constrained channels. We propose the design of distributed quantization schemes specifically tailored to the classification task: unlike quantization schemes that help the central node reconstruct the original signal as accurately as possible, our focus is not reconstruction accuracy, but instead correct classification. Our work does not make any a priori distributional assumptions on the data, but instead uses training data for the quantizer design. Our main contributions include: we prove NP-hardness of finding optimal quantizers in the general case; we design an optimal scheme for a special case; we propose quantization algorithms, that leverage discrete neural representations and training data, and can be designed in polynomial-time for any number of features, any number of classes, and arbitrary division of features across the distributed nodes. We find that tailoring the quantizers to the classification task can offer significant savings: as compared to alternatives, we can achieve more than a factor of two reduction in terms of the number of bits communicated, for the same classification accuracy.

READ ON IEEE Xplore

Energy-Reliability Limits in Nanoscale Feedforward Neural Networks and Formulas

Avhishek Chatterjee Lav R. Varshney

Due to energy-efficiency requirements, computational systems are now being implemented using noisy nanoscale semiconductor devices whose reliability depends on energy consumed. We study circuit-level energy-reliability limits for deep feedforward neural networks (multilayer perceptrons) built using such devices, and en route also establish the same limits for formulas (boolean tree-structured circuits). To obtain energy lower bounds, we extend Pippenger's mutual information propagation technique for characterizing the complexity of noisy circuits, since small circuit complexity need not imply low energy. Many device technologies require all gates to have the same electrical operating point; in circuits of such uniform gates, we show that the minimum energy required to achieve any non-trivial reliability scales superlinearly with the number of inputs. Circuits implemented in emerging device technologies like spin electronics can, however, have gates operate at different electrical points; in circuits of such heterogeneous gates, we show energy scaling can be linear in the number of inputs. Building on our extended mutual information propagation technique and using crucial insights from convex optimization theory, we develop an algorithm to compute energy lower bounds for any given boolean tree under heterogeneous gates. This algorithm runs in linear time in number of gates, and is therefore practical for modern circuit design. As part of our development we find a simple procedure for energy allocation across circuit gates with different operating points and neural networks with differently-operating layers.

READ ON IEEE Xplore

Functional Error Correction for Robust Neural Networks

Kunping Huang Paul H. Siegel Anxiao Jiang

When neural networks (NeuralNets) are implemented in hardware, their weights need to be stored in memory devices. As noise accumulates in the stored weights, the NeuralNet's performance will degrade. This paper studies how to use error correcting codes (ECCs) to protect the weights. Different from classic error correction in data storage, the optimization objective is to optimize the NeuralNet's performance after error correction, instead of minimizing the Uncorrectable Bit Error Rate in the protected bits. That is, by seeing the NeuralNet as a function of its input, the error correction scheme is function-oriented. A main challenge is that a deep NeuralNet often has millions to hundreds of millions of weights, causing a large redundancy overhead for ECCs, and the relationship between the weights and its NeuralNet's performance can be highly complex. To address the challenge, we propose a Selective Protection (SP) scheme, which chooses only a subset of important bits for ECC protection. To find such bits and achieve an optimized tradeoff between ECC's redundancy and NeuralNet's performance, we present an algorithm based on deep reinforcement learning. Experimental results verify that compared to the natural baseline scheme, the proposed algorithm can achieve substantially better performance for the functional error correction task.

READ ON IEEE Xplore

Stochastic Gradient Coding for Straggler Mitigation in Distributed Learning

Rawad Bitar Mary Wootters Salim El Rouayheb

We consider distributed gradient descent in the presence of stragglers. Recent work on gradient coding and approximate gradient coding have shown how to add redundancy in distributed gradient descent to guarantee convergence even if some workers are stragglers-that is, slow or non-responsive. In this work we propose an approximate gradient coding scheme called Stochastic Gradient Coding (SGC), which works when the stragglers are random. SGC distributes data points redundantly to workers according to a pair-wise balanced design, and then simply ignores the stragglers. We prove that the convergence rate of SGC mirrors that of batched Stochastic Gradient Descent (SGD) for the ℓ2 loss function, and show how the convergence rate can improve with the redundancy. We also provide bounds for more general convex loss functions. We show empirically that SGC requires a small amount of redundancy to handle a large number of stragglers and that it can outperform existing approximate gradient codes when the number of stragglers is large.

READ ON IEEE Xplore

Information-Theoretic Lower Bounds for Compressive Sensing With Generative Models

Zhaoqiang Liu Jonathan Scarlett

It has recently been shown that for compressive sensing, significantly fewer measurements may be required if the sparsity assumption is replaced by the assumption the unknown vector lies near the range of a suitably-chosen generative model. In particular, in (Bora et al., 2017) it was shown roughly O(k log L) random Gaussian measurements suffice for accurate recovery when the generative model is an L-Lipschitz function with bounded k-dimensional inputs, and O(kdlogw) measurements suffice when the generative model is a k-input ReLU network with depth d and width w. In this paper, we establish corresponding algorithm-independent lower bounds on the sample complexity using tools from minimax statistical analysis. In accordance with the above upper bounds, our results are summarized as follows: (i) We construct an L-Lipschitz generative model capable of generating group-sparse signals, and show that the resulting necessary number of measurements is Ω(k logL); (ii) Using similar ideas, we construct ReLU networks with high depth and/or high depth for which the necessary number of measurements scales as Ω(kd logw/logn) (with output dimension n), and in some cases Ω(kd logw). As a result, we establish that the scaling laws derived in (Bora et al., 2017) are optimal or near-optimal in the absence of further assumptions.

READ ON IEEE Xplore

Understanding GANs in the LQG Setting: Formulation, Generalization and Stability

Soheil Feizi Farzan Farnia Tony Ginart David Tse

Generative Adversarial Networks (GANs) have become a popular method to learn a probability model from data. In this paper, we provide an understanding of basic issues surrounding GANs including their formulation, generalization and stability on a simple LQG benchmark where the generator is Linear, the discriminator is Quadratic and the data has a high-dimensional Gaussian distribution. Even in this simple benchmark, the GAN problem has not been well-understood as we observe that existing state-of-the-art GAN architectures may fail to learn a proper generative distribution owing to (1) stability issues (i.e., convergence to bad local solutions or not converging at all), (2) approximation issues (i.e., having improper global GAN optimizers caused by inappropriate GAN's loss functions), and (3) generalizability issues (i.e., requiring large number of samples for training). In this setup, we propose a GAN architecture which recovers the maximum-likelihood solution and demonstrates fast generalization. Moreover, we analyze global stability of different computational approaches for the proposed GAN and highlight their pros and cons. Finally, through experiments on MNIST and CIFAR-10 datasets, we outline extensions of our model-based approach to design GANs in more complex setups than the considered Gaussian benchmark.

READ ON IEEE Xplore

Solving Inverse Problems via Auto-Encoders

Pei Peng Shirin Jalali Xin Yuan

Compressed sensing (CS) is about recovering a structured signal from its under-determined linear measurements. Starting from sparsity, recovery methods have steadily moved towards more complex structures. Emerging machine learning tools such as generative functions that are based on neural networks are able to learn general complex structures from training data. This makes them potentially powerful tools for designing CS algorithms. Consider a desired class of signals Q, Q ⊂ ℝn, and a corresponding generative function g : Uk → ℝn, U ⊂ ℝ, such that supxEQminuϵUk 1/√(n)∥lig(u) - x∥ ≤ δ. A recovery method based on g seeks g(u) with minimum measurement error. In this paper, the performance of such a recovery method is studied, under both noisy and noiseless measurements. In the noiseless case, roughly speaking, it is proven that, as k and n grow without bound and δ converges to zero, if the number of measurements (m) is larger than the input dimension of the generative model (k), then asymptotically, almost lossless recovery is possible. Furthermore, the performance of an efficient iterative algorithm based on projected gradient descent is studied. In this case, an auto-encoder is used to define and enforce the source structure at the projection step. The auto-encoder is defined by encoder and decoder (generative) functions f : ℝn → Uk and g : Uk → B.n, respectively. We theoretically prove that, roughly, given m > 40k log1/δ measurements, such an algorithm converges to the vicinity of the desired result, even in the presence of additive white Gaussian noise. Numerical results exploring the effectiveness of the proposed method are presented.

READ ON IEEE Xplore

PacGAN: The Power of Two Samples in Generative Adversarial Networks

Zinan Lin Ashish Khetan Giulia Fanti Sewoong Oh

Generative adversarial networks (GANs) are innovative techniques for learning generative models of complex data distributions from samples. Despite remarkable improvements in generating realistic images, one of their major shortcomings is the fact that in practice, they tend to produce samples with little diversity, even when trained on diverse datasets. This phenomenon, known as mode collapse, has been the main focus of several recent advances in GANs. Yet there is little understanding of why mode collapse happens and why recently-proposed approaches mitigate mode collapse. We propose a principled approach to handle mode collapse called packing. The main idea is to modify the discriminator to make decisions based on multiple samples from the same class, either real or artificially generated. We borrow analysis tools from binary hypothesis testing—in particular the seminal result of (Blackwell, 1953)—to prove a fundamental connection between packing and mode collapse. We show that packing naturally penalizes generators with mode collapse, thereby favoring generator distributions with less mode collapse during the training process. Numerical experiments on benchmark datasets suggests that packing provides significant improvements in practice as well.

READ ON IEEE Xplore

Inference With Deep Generative Priors in High Dimensions

Parthe Pandit Mojtaba Sahraee-Ardakan Sundeep Rangan Philip Schniter Alyson K. Fletcher

Deep generative priors offer powerful models for complex-structured data, such as images, audio, and text. Using these priors in inverse problems typically requires estimating the input and/or hidden signals in a multi-layer deep neural network from observation of its output. While these approaches have been successful in practice, rigorous performance analysis is complicated by the non-convex nature of the underlying optimization problems. This paper presents a novel algorithm, Multi-Layer Vector Approximate Message Passing (ML-VAMP), for inference in multi-layer stochastic neural networks. ML-VAMP can be configured to compute maximum a priori (MAP) or approximate minimum mean-squared error (MMSE) estimates for these networks. We show that the performance of ML-VAMP can be exactly predicted in a certain high-dimensional random limit. Furthermore, under certain conditions, ML-VAMP yields estimates that achieve the minimum (i.e., Bayes-optimal) MSE as predicted by the replica method. In this way, ML-VAMP provides a computationally efficient method for multi-layer inference with an exact performance characterization and testable conditions for optimality in the large-system limit.

READ ON IEEE Xplore

Deep Learning: Mathematical Foundations and Applications to Information Science

Prospective authors are requested to submit new, unpublished manuscripts for inclusion in the upcoming event described in this call for papers.

READ ON IEEE Xplore