## Max-Information and Differential Privacy

The goal of machine learning is, given a dataset, to produce classifiers and models that generalize well to unseen data instances of the problem (that is, assuming the dataset is sampled according to some distribution $P$, we want to find classifiers that perform well on fresh samples from $P$). More generally, statistical data analysis is concerned with estimating properties of the underlying data distribution, rather than properties that are specific to the finite data set at hand. The generalization error of an algorithm measures how well it performs on unseen examples from the same distribution as its data.

In practice, the same dataset may be used at several stages of a machine learning algorithm. Reasoning about the generalization error of such ``adaptive" algorithms is challenging, and very little theory exists about such algorithms. For example, a common practice is to perform feature selection on a dataset, and then use the features for some supervised learning task. When these two steps are performed on the same dataset, it is no longer clear that the results obtained from the combined algorithm will generalize.

In this talk, through the notion of 'approximate max-information', we show how one can bound the generalization error of an adaptive algorithm when each stage of the algorithm is differentially private. The talk will be self-contained.

Based on recent work with Ryan Rogers, Aaron Roth and Adam Smith.

## Counting the number of triangles in sublinear time

Counting the number of triangles in a graph has been extensively studied, with work that look at both worst-case running time guarantees and empirical performance. But the vast majority of work considers algorithms that, at the very least, need to read the whole graph---a step that can be prohibitively expensive when the graph is large.

In this talk, we describe an algorithm, due to Eden, Levi, Ron and Seshadri (2015), that approximates the number of triangles in a graph in sublinear time. The algorithm can make three kinds of queries: degree queries, vertex-pair queries and neighbor queries. When these queries are unit cost, the algorithm produces an estimate with relative error at most ? in time at most $(n / t^{1/3} + m^{3/2} / t )⋅poly(log n,1/?)$, where $t$ is the true number of triangles.

Paper: Talya Eden, Amit Levi, Dana Ron, C. Seshadhri. "Approximately Counting Triangles in Sublinear Time", FOCS 2015.

## A quad tree-based distance function for faster approximate geometric bipartite matching

In a paper from STOC 2012, the authors tackle the minimum cost bipartite matching problem in the geometric setting. Specifically, both sets of points lie in R^d and the cost function is the distance between the two points under some L_p norm. The authors provide an algorithm for an (1+ε) factor approximation which runs in O(n poly(logn, 1/ε)) time. They achieve this by utilizing a quad tree-based distance function d(.,.) to approximate the L_p norm. Furthermore, they restrict the total length of augmenting paths created while finding this approximation by modifying the costs for certain edges. This enables them to efficiently find and augment paths quickly enough to achieve their bound.

Paper: Sharathkumar, R., and Agarwal, P. K., "A near-linear time ε-approximation algorithm for geometric bipartite matching.", STOC 2012.