## SCHEDULE

All lectures are in room 11.13 in Casa Convalescència.

###### WEDNESDAY, 30 June

08:30-09:20 REGISTRATION

09:10-09:20 Welcome

09:20-10:00 Nicolò Cesa-Bianchi (University of Milan): A Regret Analysis of Bilateral Trade

short break

10:10-10:50 Alessandro Rudi (INRIA - Paris): Effective models for non-negative functions, with applications to optimization, optimal transport and probability representation & inference

10:50-11:30 COFFEE BREAK

11:30-12:10 Steffen Lauritzen (University of Copenhagen): Locally associated graphical models and mixed convex exponential families

short break

12:20-13:00 Anna Ben-Hamou (Sorbonne Université): Speeding up Markov chains with random permutations

13:00-15:00 LUNCH BREAK

*** FREE AFTERNOON (with working spaces)

16:30-17:00 AFTERNOON COFFEE

###### THURSDAY, 1 July

09:20-10:00 Judith Rousseau (Oxford University): Semiparametric and nonparametric Bayesian inference in hidden Markov models

short break

10:10-10:50 Silvio Lattanzi (Google): Exact Clusters Recovery with Same-Cluster Queries

10:50-11:30 COFFEE BREAK

11:30-12:10 Andreas Maurer (Istituto Italiano di Tecnologia): Useful bounds for U-statistics

short break

12:20-13:00 Rémi Bardenet (CNRS): Monte Carlo integration with repulsive point processes

13:00-15:00 LUNCH BREAK

15:00-15:40 Wicher Bergsma (London School of Economics): Regression modelling with I-priors

short break

15:50-16:30 Małgorzata Bogdan (University of Wrocław): SLOPE and its application for Gaussian graphical models

16:30-17:00 AFTERNOON COFFEE

19:00 DRINKS & SNACKS

###### FRIDAY, 2 July

09:20-10:00 Alessandro Rinaldo (Carnegie Mellon University): A modern look at inference in linear regression: assumption-lean validity, normal approximations and high dimensions.

short break

10:10-10:50 Mohamed Ndaoud (ESSEC, University of Southern California): Minimax Supervised Clustering in the Anisotropic Gaussian Mixture Model: interpolation is all you need

10:50-11:30 COFFEE BREAK

11:30-12:10 Umut Şimşekli (INRIA - Paris): Towards Building a Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

short break

12:20-13:00 Matteo Papini (Universitat Pompeu Fabra): Leveraging Good Representations in Linear Contextual Bandits and MDPs

13:00-15:00 LUNCH BREAK

15:00-15:40 Pedro Delicado (Universitat Politècnica de Catalunya): Understanding complex predictive models through Ghost Variables

short break

15:50-16:30 Benjamin Recht (UC Berkeley): A story about machine learning: Historical thoughts about modern prediction

16:30-17:00 AFTERNOON COFFEE

20:00 CONFERENCE DINNER (Camping Mar)

###### SATURDAY, 3 July

09:30-10:10 Lorenzo Rosasco (University of Genoa): Interpolation and learning with scale dependent kernels

10:10-11:00 COFFEE BREAK

11:00-11:40 Xiaoyi Mai (CNRS): A random matrix approach for high-dimensional semi-supervised learning by graph regularization

short break

12:00-12:40 Emmanuel Candès (Stanford University): Reliable predictions? Counterfactual predictions? Equitable treatment? Some recent progress in predictive inference

*** FREE AFTERNOON

## ABSTRACTS

Rémi Bardenet (CNRS)

Monte Carlo integration with repulsive point processes

Monte Carlo integration is the workhorse of Bayesian inference, but the mean square error of Monte Carlo estimators decreases slowly, typically as 1/N, where N is the number of integrand evaluations. This becomes a bottleneck in Bayesian applications where evaluating the integrand can take tens of seconds, like in the life sciences, where evaluating the likelihood often requires solving a large system of differential equations. I will present two approaches to faster Monte Carlo rates using interacting particle systems. First, I will show how results from random matrix theory lead to a stochastic version of Gaussian quadrature in any dimension d, with mean square error decreasing as 1/N^{1+1/d}. This quadrature is based on determinantal point processes, which can be argued to be the kernel machine of point processes. Second, I will show how to further take this error rate down assuming the integrand is smooth. In particular, I will give a tight error bound when the integrand belongs to any arbitrary reproducing kernel Hilbert space, using a mixture of determinantal point processes tailored to that space. This mixture is reminiscent of volume sampling, a randomized experimental design used in linear regression.
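
The baseline 1/N rate mentioned above is easy to check numerically. The sketch below (plain iid Monte Carlo in Python, not the determinantal quadrature of the talk) estimates a simple integral and shows the mean square error dropping by roughly a factor of 4 when N is quadrupled:

```python
import random

def mc_estimate(f, n, rng):
    """Plain Monte Carlo estimate of the integral of f over [0, 1]
    using n iid uniform samples."""
    return sum(f(rng.random()) for _ in range(n)) / n

def mse(f, true_value, n, reps, rng):
    """Empirical mean square error over independent repetitions."""
    errs = [(mc_estimate(f, n, rng) - true_value) ** 2 for _ in range(reps)]
    return sum(errs) / reps

rng = random.Random(0)
f = lambda x: x * x                 # integrand with known integral 1/3
e1 = mse(f, 1 / 3, 100, 200, rng)   # N = 100
e2 = mse(f, 1 / 3, 400, 200, rng)   # N = 400
# Quadrupling N should cut the MSE by roughly 4: the 1/N rate.
print(e1 / e2)
```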

Anna Ben-Hamou (Sorbonne Université)

Speeding up Markov chains with random permutations

Markov chains are widely used to sample approximately from a given distribution. Some chains however might take a very long time to mix. This raises the question of how to speed up a chain to make it mix faster: can one design a simple perturbation of the chain which ensures fast mixing? In this talk, we will be interested in the following perturbation: take a bijection on the state space and consider the chain which alternates between steps governed by the original chain and deterministic steps governed by the bijection. We will first show that if the bijection satisfies some expansion condition with respect to the original chain, then the mixing time of the permuted chain is logarithmic in the size of the state space, whatever the initial chain, provided it satisfies some mild assumptions. Then, we will see that almost all bijections do the job: if the bijection is chosen uniformly at random, then the permuted chain has cutoff at a time which is characterised by the entropy of the initial chain. This is joint work with Yuval Peres.
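
The perturbation described above is easy to simulate. The following toy sketch (a lazy random walk on a 64-cycle standing in for the original chain, and a uniformly random permutation as the bijection — both illustrative choices) compares the distance to stationarity of the plain and permuted chains after the same number of steps:

```python
import random

n = 64
uniform = [1.0 / n] * n

def walk_step(p):
    # one step of the lazy random walk on the n-cycle:
    # stay with prob. 1/2, move to each neighbour with prob. 1/4
    return [0.5 * p[i] + 0.25 * p[(i - 1) % n] + 0.25 * p[(i + 1) % n]
            for i in range(n)]

def permute(p, sigma):
    # deterministic step: the mass at state i moves to state sigma(i)
    q = [0.0] * n
    for i, mass in enumerate(p):
        q[sigma[i]] += mass
    return q

def tv(p, q):
    # total variation distance between two distributions
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(1)
sigma = list(range(n))
rng.shuffle(sigma)                  # a uniformly random bijection

plain = [1.0] + [0.0] * (n - 1)    # both chains start from a point mass
perm = plain[:]
for _ in range(30):
    plain = walk_step(plain)
    perm = permute(walk_step(perm), sigma)

tv_plain, tv_perm = tv(plain, uniform), tv(perm, uniform)
print(tv_plain, tv_perm)           # the permuted chain is far closer to uniform
```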

Wicher Bergsma (London School of Economics and Political Science)

Regression modelling with I-priors

Standard objective priors, such as the Jeffreys prior, g-prior, or reference prior, have the drawback that they can only be used for finite dimensional parameters. We introduce a new objective prior based on the Fisher information, which can be used for both finite and infinite dimensional parameters. The posterior mean or mode then provides a regularized estimator of the parameter. The I-prior is generally defined as a maximum entropy prior and has the intuitively appealing property that the more information is available on a linear functional of the parameter, the larger the prior variance, and the smaller the influence of the prior mean on the posterior distribution.

The I-prior methodology is particularly attractive from a computational point of view for estimating a parametric or nonparametric regression function. The I-prior for a regression function is very simple, namely Gaussian with covariance kernel proportional to the Fisher information on the regression function. We use the I-prior methodology to give a unifying framework for estimating a variety of regression models, including varying coefficient, multilevel, longitudinal models, and models with functional covariates and responses.

Advantages compared to competing methods, such as Gaussian process regression or Tikhonov regularization, are ease of estimation and model comparison. In particular, we develop an EM algorithm with a simple E and M step for estimating hyperparameters, facilitating estimation for complex models. We also propose a novel parsimonious model formulation, requiring a single scale parameter for each (possibly multidimensional) covariate and no further parameters for interaction effects. This simplifies estimation because fewer hyperparameters need to be estimated, and also simplifies model comparison of models with the same covariates but different interaction effects; in this case, the model with the highest estimated likelihood can be selected.

Using a number of widely analysed real data sets we show that predictive performance of our methodology is competitive. An R-package implementing the methodology is available (Jamil, 2020).

Małgorzata Bogdan (University of Wrocław)

SLOPE and its application for Gaussian graphical models

Sorted L-One Penalized Estimator is a relatively new convex optimization method for estimating parameters of high dimensional statistical models. SLOPE reduces the dimension by eliminating some parameters as well as by averaging values of similar estimators. In this talk we will discuss recent theoretical results on the properties of SLOPE as well as its application to graphical models.
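
The sorted-ℓ₁ penalty itself takes only a few lines to state. A minimal sketch (the lambda sequence and coefficients are arbitrary illustrative values):

```python
def slope_penalty(beta, lambdas):
    """Sorted L-One (SLOPE) penalty: sum_i lambda_i * |beta|_(i), where
    |beta|_(1) >= |beta|_(2) >= ... and lambda_1 >= lambda_2 >= ... >= 0."""
    mags = sorted((abs(b) for b in beta), reverse=True)
    return sum(lam * m for lam, m in zip(lambdas, mags))

beta = [0.5, -2.0, 1.0]
# Decreasing lambdas: the largest coefficient receives the largest penalty.
print(slope_penalty(beta, [3.0, 2.0, 1.0]))   # 3*2.0 + 2*1.0 + 1*0.5 = 8.5
# With all lambdas equal, SLOPE reduces to the ordinary L1 (lasso) penalty.
print(slope_penalty(beta, [1.0, 1.0, 1.0]))   # 2.0 + 1.0 + 0.5 = 3.5
```

The ties that the penalty induces between similar coefficient magnitudes are what drives the averaging (clustering) effect mentioned in the abstract.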

Emmanuel Candès (Stanford University)

Reliable predictions? Counterfactual predictions? Equitable treatment? Some recent progress in predictive inference

Recent progress in machine learning provides us with many potentially effective tools to learn from datasets of ever increasing sizes and make useful predictions. How do we know that these tools can be trusted in critical and high-sensitivity systems? This talk introduces statistical ideas to ensure that the learned models satisfy some crucial properties, especially reliability and fairness (in the sense that the models need to apply to individuals in an equitable manner). To achieve these important objectives, we shall not “open up the black box” and try to understand its underpinnings. Rather, we discuss broad methodologies that can be wrapped around any black box to produce results that can be trusted and are equitable. The bulk of the talk will be about causal predictive inference. That is, answering questions of the following type: predict with confidence what the outcome would have been if a patient had not been treated.
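
One of the black-box wrappers alluded to above, split conformal prediction, can be sketched in a few lines. This toy example (synthetic data and a deliberately crude predictor, both illustrative assumptions) calibrates a residual quantile on held-out data and checks the resulting marginal coverage:

```python
import random

rng = random.Random(0)

def model(x):
    # any black-box predictor can go here
    return 2.0 * x

# Synthetic data: y = 2x + Gaussian noise.
data = [(x, 2.0 * x + rng.gauss(0, 1))
        for x in (rng.random() for _ in range(2000))]
calib, test = data[:1000], data[1000:]

# Split conformal: take an upper quantile of the calibration residuals ...
alpha = 0.1
scores = sorted(abs(y - model(x)) for x, y in calib)
k = min(len(scores) - 1, int((1 - alpha) * (len(scores) + 1)))
q = scores[k]

# ... and wrap an interval of that half-width around every new prediction.
covered = sum(1 for x, y in test if abs(y - model(x)) <= q)
coverage = covered / len(test)
print(coverage)   # close to the 90% target, with no assumption on the model
```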

Nicolò Cesa-Bianchi (University of Milan)

A Regret Analysis of Bilateral Trade

Bilateral trade, a fundamental topic in economics, models the problem of intermediating between a seller and a buyer willing to trade a good for which they hold private valuations. In this paper, we investigate the regret of bilateral trade assuming no prior knowledge on the private valuations. Our main contribution is a complete characterization of the regret regimes for fixed-price mechanisms with different models of feedback and private valuations, using as benchmark the best fixed price in hindsight.
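
A fixed-price mechanism and the hindsight benchmark can be sketched as follows (the valuation sequence and the price grid are illustrative assumptions, not the paper's feedback models):

```python
import random

def gain(p, s, b):
    """Gain from trade of a fixed-price mechanism: posting price p, trade
    happens iff seller valuation s <= p <= buyer valuation b, with gain b - s."""
    return (b - s) if s <= p <= b else 0.0

rng = random.Random(0)
# A hypothetical sequence of (seller, buyer) valuations in [0, 1].
vals = [(rng.random(), rng.random()) for _ in range(500)]

# Best fixed price in hindsight (the regret benchmark), over a grid.
grid = [i / 100 for i in range(101)]
totals = {p: sum(gain(p, s, b) for s, b in vals) for p in grid}
best_p = max(totals, key=totals.get)

# Regret of always posting p = 0.5 against the best fixed price in hindsight.
regret = totals[best_p] - totals[0.5]
print(best_p, regret)
```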

Joint work with: T. Cesari, R. Colomboni, F. Fusco, and S. Leonardi

Pedro Delicado (Universitat Politècnica de Catalunya)

Understanding complex predictive models through Ghost Variables

In the context of Interpretable Machine Learning, we propose a procedure for assigning a relevance measure to each explanatory variable in a complex predictive model. We assume that we have a training set to fit the model and a test set to check its out-of-sample performance. The individual relevance of a given variable is computed by comparing the model predictions for the test set with those for a modified test set, in which the variable of interest is substituted by its "ghost variable", defined as the prediction of this variable as a function of the remaining explanatory variables. Additionally, the joint relevance of the predicting variables is measured through the eigenvalues of the "relevance matrix", that is, the covariance matrix of the vectors of individual effects. We prove that in simple models, such as linear or additive regression, the proposed measures are good approximations of the explanatory variables' significance statistics. Given that our proposals are model-agnostic, they provide a way to generalize these significance measures to any algorithmic prediction model (such as neural networks). We illustrate our proposals with simulated examples and the analysis of a real data set on rental housing.
This is a joint work with Daniel Peña (based on the article arXiv:1912.06407).
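
A minimal numpy sketch of the ghost-variable computation, using ordinary least squares both as the predictive model and for the ghosts, on synthetic data (illustrative choices, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends strongly on x1, weakly on x2; x3 is pure noise.
n = 600
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
Xtr, Xte, ytr = X[:400], X[400:], y[:400]

def lsq_predict(A, b, A_new):
    # least squares fit on (A, b), predictions on A_new
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return A_new @ coef

pred = lsq_predict(Xtr, ytr, Xte)

relevance = []
for j in range(3):
    others = [k for k in range(3) if k != j]
    # Ghost variable: predict x_j from the remaining covariates ...
    ghost = lsq_predict(Xtr[:, others], Xtr[:, j], Xte[:, others])
    X_ghost = Xte.copy()
    X_ghost[:, j] = ghost
    # ... and measure how much the model's predictions move when x_j
    # is replaced by its ghost.
    pred_ghost = lsq_predict(Xtr, ytr, X_ghost)
    relevance.append(float(np.mean((pred - pred_ghost) ** 2)))

print(relevance)   # x1 dominates, x3 is near zero
```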

Silvio Lattanzi (Google)

Exact Clusters Recovery with Same-Cluster Queries

We study the cluster recovery problem in the semi-supervised active clustering framework. Given a finite set of input points, and an oracle revealing whether any two points lie in the same cluster, our goal is to recover all clusters exactly using as few queries as possible. To this end, we first relax the spherical $k$-means cluster assumption of Ashtiani et al. to allow for arbitrary ellipsoidal clusters with margin. We show that, even in this much more general setting, it is still possible to recover the latent clustering exactly using a number of queries that scales only logarithmically with the number of input points.

We then turn our attention to the more challenging non-convex setting. We introduce a structural characterization of clusters, called $(\beta,\gamma)$-convexity, that can be applied to any finite set of points equipped with a metric (or even a semimetric, as the triangle inequality is not needed). Using $(\beta,\gamma)$-convexity, we can translate natural density properties of clusters (which include, for instance, clusters that are strongly non-convex in $\mathbb{R}^d$) into a graph-theoretic notion of convexity. By exploiting this convexity notion, we design a deterministic algorithm that recovers $(\beta,\gamma)$-convex clusters efficiently.

Joint work with M. Bressan, N. Cesa-Bianchi, A. Paudice.
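
The query model can be illustrated with a deliberately naive recovery scheme. The sketch below spends O(nk) oracle queries, far more than the logarithmic bounds of the talk, but it shows what a same-cluster oracle looks like (the ground-truth clustering is a hypothetical example):

```python
def recover_clusters(points, same_cluster):
    """Recover a partition exactly using only same-cluster oracle queries.
    Naive scheme: compare each point against one representative per known
    cluster, so O(n * k) queries -- only an illustration of the oracle model,
    not the query-efficient algorithms of the talk."""
    clusters, queries = [], 0
    for p in points:
        for c in clusters:
            queries += 1
            if same_cluster(p, c[0]):   # query against the representative
                c.append(p)
                break
        else:
            clusters.append([p])        # p starts a new cluster
    return clusters, queries

# Hypothetical ground truth: residue classes mod 3.
truth = lambda a, b: a % 3 == b % 3
clusters, queries = recover_clusters(list(range(12)), truth)
print([sorted(c) for c in clusters], queries)
```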

Steffen Lauritzen (University of Copenhagen)

Locally associated graphical models and mixed convex exponential families

The notion of multivariate total positivity has proved to be useful in finance and psychology but may be too restrictive in other applications. We propose a concept of local association, where highly connected components in a graphical model are positively associated and study its properties. Our main motivation comes from gene expression data, where graphical models have become a popular exploratory tool. The models are instances of what we term mixed convex exponential families and we show that a mixed dual likelihood estimator has simple exact properties for such families as well as asymptotic properties similar to the maximum likelihood estimator. We further relax the positivity assumption by penalizing negative partial correlations in what we term the positive graphical lasso. Finally, we develop a GOLAZO algorithm based on block-coordinate descent that applies to a number of optimization procedures that arise in the context of graphical models, including the estimation problems described above. We derive results on the existence of the optimum for such problems. (based on the article arXiv:2008.04688)

Xiaoyi Mai (CNRS)

A random matrix approach for high-dimensional semi-supervised learning by graph regularization

In modern machine learning, we often encounter tasks with high-dimensional and numerous data vectors. There is currently a growing volume of work on characterizing the exact performance of machine learning methods in an asymptotic regime where the dimension of the feature vectors and the number of data samples are comparably large. The approach of high-dimensional asymptotic analysis is of particular interest for understanding and improving learning algorithms confronted with the challenge of high-dimensional data. Our study of semi-supervised learning by graph regularization starts with a unified analysis of classical Laplacian regularization methods, revealing a series of consequences that are consistent with the practical behavior on high-dimensional data. In particular, our analysis shows that Laplacian regularization, despite being a common semi-supervised learning approach, fails to learn effectively from both labelled and unlabelled data of high dimensionality. Motivated by our theoretical findings, we propose a new method of centered regularization, with a theoretically established consistent semi-supervised learning performance in the same high-dimensional regime.
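
Classical Laplacian regularization, the starting point of the analysis, can be sketched on a toy graph: fix the labelled values and solve the harmonic system L_uu f_u = -L_ul f_l for the unlabelled ones (the graph and labels below are illustrative):

```python
import numpy as np

# Tiny similarity graph: two triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(1)) - W          # unnormalized graph Laplacian

labeled = {0: 1.0, 5: -1.0}        # one labelled node per class
unlab = [i for i in range(n) if i not in labeled]

# Laplacian-regularized (harmonic) solution: minimize f^T L f subject to
# the labels, i.e. solve L_uu f_u = -L_ul f_l.
f = np.zeros(n)
for i, v in labeled.items():
    f[i] = v
L_uu = L[np.ix_(unlab, unlab)]
L_ul = L[np.ix_(unlab, list(labeled))]
f_l = np.array([labeled[i] for i in labeled])
f[unlab] = np.linalg.solve(L_uu, -L_ul @ f_l)
print(f)   # unlabelled nodes inherit the sign of their side of the graph
```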

Andreas Maurer (Istituto Italiano di Tecnologia)

Useful bounds for U-statistics

I will present a Bernstein inequality for nondegenerate U-statistics with bounded kernels. The inequality reduces to the standard Bernstein inequality for order one; it gives the correct variance term and avoids the combinatorial catastrophe which the decoupling techniques incur for U-statistics of higher order. The bound can be made data-dependent by estimation of the variance term, and it applies equally to the computationally more attractive incomplete U-statistics. Time permitting, I will also give a bound for the suprema of U-processes in terms of Gaussian width.
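
For intuition, here is a complete order-2 U-statistic computed by brute force; with the kernel h(x, y) = (x - y)^2 / 2 it reproduces exactly the unbiased sample variance:

```python
from itertools import combinations

def u_statistic(kernel, sample, order=2):
    """Complete U-statistic: average of the kernel over all subsets of the
    given order. (Incomplete U-statistics average over a subsample of these.)"""
    combos = list(combinations(sample, order))
    return sum(kernel(*c) for c in combos) / len(combos)

# Order-2 kernel whose U-statistic is the unbiased sample variance.
h = lambda x, y: (x - y) ** 2 / 2
data = [1.0, 2.0, 4.0, 7.0]
u = u_statistic(h, data)

mean = sum(data) / len(data)
s2 = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
print(u, s2)   # the two agree
```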

Mohamed Ndaoud (ESSEC, University of Southern California)

Minimax Supervised Clustering in the Anisotropic Gaussian Mixture Model: interpolation is all you need

We study the non-asymptotic supervised clustering problem under the two-component anisotropic Gaussian mixture model in high dimensions. We first derive a lower and a matching upper bound for the minimax risk of clustering. Unlike in low dimensions, the LDA classifier turns out to be sub-optimal in a minimax sense. Next, we characterize precisely the risk of regularized supervised least squares classifiers under $\ell_2$ regularization. As a conclusion, we show that the interpolating solution (the solution with zero training error) outperforms the regularized ones, under mild assumptions on the covariance of the noise. Our analysis also shows that interpolation takes the best of the averaging classifier and the LDA classifier based on whether the signal is aligned with the covariance of the noise or not, where we define the notion of alignment. To the best of our knowledge, this peculiar phenomenon has not yet been investigated in the rapidly growing literature around interpolation. [Joint work with Y. Shen and S. Minsker]
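
The interpolating classifier can be sketched with a minimum-norm least squares fit in the overparameterized regime (the mixture parameters below are illustrative, and this toy uses an isotropic rather than anisotropic noise covariance):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-component Gaussian mixture in high dimension: label y = +-1, mean y*mu.
n, d = 50, 200                        # overparameterized: d > n
mu = np.zeros(d)
mu[:10] = 1.0
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu[None, :] + rng.normal(size=(n, d))

# Minimum-norm interpolating linear classifier: w = X^+ y satisfies Xw = y.
w = np.linalg.pinv(X) @ y
train_err = float(np.mean(np.sign(X @ w) != y))   # zero: it interpolates

# Fresh test data from the same mixture.
y_te = rng.choice([-1.0, 1.0], size=500)
X_te = y_te[:, None] * mu[None, :] + rng.normal(size=(500, d))
test_acc = float(np.mean(np.sign(X_te @ w) == y_te))
print(train_err, test_acc)
```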

Matteo Papini (Universitat Pompeu Fabra)

Leveraging Good Representations in Linear Contextual Bandits and MDPs

The linear contextual bandit literature is mostly focused on the design of efficient learning algorithms for a given representation. However, a contextual bandit problem may admit multiple linear representations, each one with different characteristics that directly impact the regret of the learning algorithm. In particular, recent works showed that there exist "good" representations for which constant problem-dependent regret can be achieved. We first provide a precise characterization of "good" representations. We then propose a novel selection algorithm able to adapt to the best representation in a set of candidates. We show that the regret is indeed never worse than the regret obtained by running LinUCB on the best representation (up to a factor that is just logarithmic in the number of representations). As a result, our algorithm achieves constant regret whenever a "good" candidate representation is available. We further show that the algorithm may still achieve constant regret by implicitly constructing a "good" representation, even when none of the candidates is "good". Finally, we extend these results to low-rank Markov Decision Processes.
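
A minimal LinUCB loop for a single fixed linear representation, compared against a uniform-random baseline (the problem instance, exploration coefficient, and noise level are illustrative assumptions; the representation-selection algorithm of the talk is not implemented here):

```python
import numpy as np

rng = np.random.default_rng(0)

d, K, T = 5, 10, 2000
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)        # hidden reward parameter
arms = rng.normal(size=(K, d))        # fixed arm features

# LinUCB with a ridge design matrix: pick the arm maximizing the optimistic
# estimate <theta_hat, x> + beta * ||x||_{A^{-1}}.
A = np.eye(d)
b = np.zeros(d)
beta = 1.0
reward_ucb = 0.0
for _ in range(T):
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    widths = np.sqrt(np.einsum('kd,de,ke->k', arms, A_inv, arms))
    x = arms[int(np.argmax(arms @ theta_hat + beta * widths))]
    r = x @ theta + rng.normal(scale=0.1)
    A += np.outer(x, x)
    b += r * x
    reward_ucb += r

# Uniform-random baseline over the same horizon.
reward_rand = sum(arms[rng.integers(K)] @ theta + rng.normal(scale=0.1)
                  for _ in range(T))
print(reward_ucb, reward_rand)        # LinUCB collects far more reward
```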

Benjamin Recht (UC Berkeley)

A story about machine learning: Historical thoughts about modern prediction

I’ll tell a history of statistical prediction, beginning with Wiener and Rosenblatt and ending with contemporary machine learning. This history will highlight the cyclical rediscovery of pattern recognition and the subsequent disillusionment with its shortcomings. I will describe how much of our theoretical understanding of statistical learning has not deepened in over half a century, and trace how the empirical standards of the field arose from technological and social developments.

Most of the material for this talk will be drawn from a new graduate textbook, “Patterns, Predictions, and Actions: A Story About Machine Learning,” co-authored with Moritz Hardt.

Alessandro Rinaldo (Carnegie Mellon University)

A modern look at inference in linear regression: assumption-lean validity, normal approximations and high dimensions

We derive finite sample bounds on the Normal approximation to the law of the least squares estimator of the projection parameters normalized by the sandwich-based standard errors in assumption-lean/distribution-free settings. In particular, we do not assume a linear regression function and only require the existence of finitely many moments for the response and the covariates. Using the bootstrap, we construct simultaneous confidence sets for the projection parameters in the form of hyper-rectangles and establish finite sample bounds on their coverage and accuracy. Our rates exhibit an explicit dependence on the dimension and other features of the data generating distribution. We obtain analogous results for the partial correlations among the entries of sub-Gaussian vectors.

Joint work with Arun Kumar Kuchibhotla and Larry Wasserman
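
The sandwich-based standard errors mentioned above can be sketched as follows (HC0 variant of the robust covariance estimator; the heteroskedastic synthetic data is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# OLS with heteroskedastic noise: classical standard errors are off,
# the sandwich estimator remains valid.
n, d = 2000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta = np.array([1.0, 2.0, -1.0])
noise = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))   # variance depends on x
y = X @ beta + noise

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Sandwich (heteroskedasticity-robust, HC0) covariance:
# (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
meat = X.T @ (X * resid[:, None] ** 2)
sandwich = XtX_inv @ meat @ XtX_inv
se = np.sqrt(np.diag(sandwich))
print(beta_hat, se)
```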

Lorenzo Rosasco (University of Genoa)

Interpolation and learning with scale dependent kernels

We study the learning properties of nonparametric ridgeless least squares. In particular, we consider the common case of estimators defined by scale-dependent (Matérn) kernels, and focus on the role of scale and smoothness. These estimators interpolate the data, and the scale can be shown to control their stability to noise and sampling. Larger scales, corresponding to smoother functions, improve stability with respect to sampling. However, smaller scales, corresponding to more complex functions, improve stability to noise.

We will discuss to what extent these results can explain the learning curves observed for large overparameterized models. Our analysis combines probabilistic results with analytic techniques from interpolation theory.
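
The interpolation property is easy to verify numerically. A sketch with a Gaussian kernel at one fixed scale (the scale value and data are illustrative; the stability trade-off discussed in the abstract is not quantified here):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(A, B, scale):
    # squared distances via broadcasting, then the Gaussian kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * scale ** 2))

# Ridgeless (minimum-norm) kernel estimator: coefficients alpha = K^{-1} y.
X = np.linspace(-1, 1, 20)[:, None]
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=20)

K = gaussian_kernel(X, X, scale=0.15)
alpha = np.linalg.solve(K, y)

# The estimator interpolates: it reproduces the noisy training labels.
train_residual = float(np.max(np.abs(K @ alpha - y)))
print(train_residual)   # numerically zero
```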

Judith Rousseau (Oxford University)

Semiparametric and nonparametric Bayesian inference in hidden Markov models

In this work we are interested in inference in hidden Markov models with finite state space and nonparametric emission distributions. Since the seminal paper of Gassiat et al. (2016), it is known that in such models the transition matrix Q and the emission distributions F1, …, FK are identifiable, up to label switching. We propose an (almost) Bayesian method to simultaneously estimate Q at the √n rate and the emission distributions at the usual nonparametric rates. To do so, we first consider a prior π on Q and F1, …, FK which leads to a marginal posterior distribution on Q that satisfies the Bernstein–von Mises property, and thus to an efficient estimator of Q. We then combine the marginal posterior on Q with another posterior distribution on the emission distributions, following the cut-posterior approach, to obtain a posterior which also concentrates around the emission distributions at the minimax rates. In addition, an important intermediate result of our work is an inversion inequality which allows us to upper bound the L1 norms between the emission densities by the L1 norms between the marginal densities of 3 consecutive observations. (joint work with D. Moss (Oxford))
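
For concreteness, here is the forward algorithm on a small fully discrete HMM (a stand-in for the nonparametric emission distributions of the talk); as a sanity check, the marginal law of three consecutive observations, the object behind the inversion inequality, sums to one:

```python
from itertools import product

# A small discrete HMM: 2 hidden states, transition matrix Q, and (here,
# discrete) emission distributions F over 3 symbols -- illustrative values.
Q = [[0.9, 0.1],
     [0.2, 0.8]]
F = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]]
init = [0.5, 0.5]

def likelihood(obs):
    """Forward algorithm: P(obs) under the HMM."""
    alpha = [init[k] * F[k][obs[0]] for k in range(2)]
    for o in obs[1:]:
        alpha = [sum(alpha[j] * Q[j][k] for j in range(2)) * F[k][o]
                 for k in range(2)]
    return sum(alpha)

# The marginal distribution of 3 consecutive observations is a probability
# distribution over all 3^3 observation triples.
total = sum(likelihood(obs) for obs in product(range(3), repeat=3))
print(total)   # 1.0 up to rounding
```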

Alessandro Rudi (INRIA - Paris)

Effective models for non-negative functions, with applications to optimization, optimal transport and probability representation & inference

Linear models are the workhorse in many fields of applied mathematics, from approximation and learning to the solution of PDEs. Their effectiveness is mainly due to 1) the linearity in the parameters, and 2) the fact that linear models are concise approximators, i.e., they require few units to approximate well wide classes of functions, such as smooth functions. This usually leads to algorithms that are convex and able to provably tame the curse of dimensionality for very smooth problems.

However, not all problems can be defined in terms of signed functions. Important problems in the fields of nonconvex optimization, optimal transport and probability representation & inference are defined in terms of non-negative functions, whose models usually do not share all the properties of linear ones.

In this talk we derive a model for non-negative functions with the same properties as linear models. We show how using this model in the applications mentioned above leads to convex algorithms that tame the curse of dimensionality.
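
The basic construction can be sketched directly: parametrize f(x) = phi(x)^T B phi(x) with B positive semidefinite, so that f is non-negative everywhere while remaining linear in the parameter B (the feature map below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# PSD model for non-negative functions: f(x) = phi(x)^T B phi(x), B = A^T A.
m = 20
w = rng.normal(size=m)
c = rng.uniform(0, 2 * np.pi, size=m)
A = rng.normal(size=(m, m))
B = A.T @ A                        # any positive semidefinite parameter matrix

def phi(x):
    # an arbitrary (random Fourier-type) feature map
    return np.cos(w * x + c)

def f(x):
    v = phi(x)
    return v @ B @ v               # = ||A phi(x)||^2 >= 0 by construction

grid = np.linspace(-3, 3, 200)
vals = np.array([f(x) for x in grid])
print(vals.min())                  # non-negativity is structural, not fitted
```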

Umut Şimşekli (INRIA - Paris)

Towards Building a Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

In this talk, I will focus on the 'tail behavior' of SGD in deep learning. I will first empirically illustrate that heavy tails arise in the gradient noise (i.e., the difference between the stochastic gradient and the true gradient). Accordingly, I will propose to model the gradient noise as a heavy-tailed α-stable random vector, and to analyze SGD as a discretization of a stochastic differential equation (SDE) driven by a stable process. As opposed to classical SDEs that are driven by a Brownian motion, SDEs driven by stable processes can incur 'jumps', which force the SDE (and its discretization) to transition from 'narrow minima' to 'wider minima', as proven by existing metastability theory and the extensions that we proved recently. These results open up a different perspective and shed more light on the view that SGD 'prefers' wide minima. In the second part of the talk, I will focus on the generalization properties of such heavy-tailed SDEs and show that the generalization error can be controlled by the Hausdorff dimension of the trajectories of the SDE, which is closely linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of 'capacity metric'. Finally, if time permits, I will talk about the 'originating cause' of such heavy-tailed behavior and present theoretical results which show that heavy tails can even emerge in very sterile settings, such as linear regression with iid Gaussian data.

The talk will be based on the following papers:
U. Şimşekli, L. Sagun, M. Gürbüzbalaban, "A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks", ICML 2019
U. Şimşekli, O. Sener, G. Deligiannidis, M. A. Erdogdu, "Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks", NeurIPS 2020
M. Gurbuzbalaban, U. Simsekli, L. Zhu, "The Heavy-Tail Phenomenon in SGD", ICML 2021
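
The α-stable noise model at the heart of the talk can be sampled with the classical Chambers–Mallows–Stuck construction; the sketch below (parameters illustrative) contrasts its tail with Gaussian noise:

```python
import math
import random

rng = random.Random(0)

def symmetric_stable(alpha):
    """Chambers-Mallows-Stuck sampler for a symmetric alpha-stable variable."""
    V = rng.uniform(-math.pi / 2, math.pi / 2)
    W = rng.expovariate(1.0)
    return (math.sin(alpha * V) / math.cos(V) ** (1 / alpha)
            * (math.cos(V - alpha * V) / W) ** ((1 - alpha) / alpha))

n = 20000
stable = [symmetric_stable(1.5) for _ in range(n)]   # heavy-tailed noise model
gauss = [rng.gauss(0, 1) for _ in range(n)]

# Heavy tails in action: excursions beyond 10 are routine for the stable
# samples and essentially impossible for the Gaussian ones.
tail_stable = sum(abs(x) > 10 for x in stable)
tail_gauss = sum(abs(x) > 10 for x in gauss)
print(tail_stable, tail_gauss)
```

Such large excursions are the 'jumps' that, in the SDE picture, move the iterates out of narrow minima.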