# Theoretical Computer Science Group

## Theory Seminar Schedule

Date Speaker Title
Jan 16, 2007 Piotr Berman Tight Approximability Results for Test Set Problems in Bioinformatics
Jan 23, 2007 Adam Smith Calibrating Noise to Sensitivity in Private Data Analysis
Jan 30, 2007 Sofya Raskhodnikova Instance-based Noise in Private Data Analysis
Feb 07, 2007 Sean Hallgreen Quantum Algorithms and Cryptography (Part of Departmental Colloquium)
Feb 13, 2007 Shiva Kasiviswanathan Spanners for Disk Graphs
Feb 20, 2007 Piotr Berman Minimum Digraph Problem
Feb 27, 2007 Shiva Kasiviswanathan Faster All-Pairs Approximate Shortest Path
Mar 06, 2007 Sergey Orshanskiy Finding, counting and approximating the number of small subgraphs
Mar 20, 2007 Robert D. Kleinberg Online Markets and Optimal Stopping (Part of Departmental Colloquium)
Apr 03, 2007 Shiva Kasiviswanathan Approximately Counting Embeddings into Random Graphs
Apr 18, 2007 Jonathan Katz Universally-Composable Multi-Party Computation Using Tamper-Proof Hardware (Part of Departmental Colloquium)
Apr 24, 2007 Martin Furer Faster Integer Multiplication
Apr 27, 2007 Ryan O’Donnell Understanding Parallel Repetition Requires Understanding Foams (Part of Departmental Colloquium)
May 01, 2007 Alex Andoni Approximate Nearest Neighbor Problem: Near-Optimal Hashing Algorithms (Part of Departmental Colloquium)

If you would like to speak, please email: theory at cse psu dot edu.

If you want to receive announcements of upcoming talks join the theory mailing list.

Archives

### Tight Approximability Results for Test Set Problems in Bioinformatics

In this paper, we investigate the test set problem and its variations that appear in a variety of applications. In general, we are given a universe of objects to be “distinguished” by a family of “tests”, and we want to find the smallest sufficient collection of tests. In the simplest version, a test is a subset of the universe and two objects are distinguished by our collection if one test contains exactly one of them. Variations allow tests to be multi-valued functions or unions of “basic” tests, and different notions of the term distinguished. An important version of this problem that has applications in DNA sequence analysis has the universe consisting of strings over a small alphabet and tests that are detecting presence (or absence) of a substring. For most versions of the problem, including the latter, we establish matching lower and upper bounds on approximation ratio. When tests can be formed as unions of basic tests, we show that the problem is as hard as the graph-coloring problem. We conclude by reporting preliminary computational results on the implementations of our algorithmic approaches for the minimum cost probe set problems on a data set used by Borneman et al.

### Calibrating Noise to Sensitivity in Private Data Analysis

Consider a trusted server that holds a database of sensitive information. The server would like to reveal global statistics about the population in the database, and yet hide information specific to individuals. I will discuss a recent line of work exploring the tradeoff between these conflicting goals -- first, how the goals can be formulated precisely and second, to what extent they can both be satisfied. The technical results I will mention come mostly from a paper in TCC 2006.

The main contribution is a new, simple framework for "output perturbation". Given a query function F mapping databases to reals, we say the "true" query answer is the result of applying F to the database. To protect privacy, the server perturbs the true answer by adding random noise and sends the result as the query answer. In this setting, a strong notion of privacy can be guaranteed by calibrating the standard deviation of the noise to the "sensitivity" of the function F.

The second contribution is series of bounds on the power of special classes of privacy mechanisms. These show that various kinds of interaction add significantly to the "utility" of the mechanism from the users' point of view.

Based on joint work with Cynthia Dwork, Frank McSherry, Kobbi Nissim and Enav Weinreb.

### Instance-based Noise in Private Data Analysis

We introduce a new, generic framework for private data analysis. Our framework allows one to release functions f of the data with instance-based additive noise. That is, the noise magnitude is determined not only by the function we are trying to release, as in the global sensitivity framework presented by Adam, but also on the database itself. We calibrate the noise magnitude to the "smooth sensitivity" of f on the database x — a measure of how variable f is in the neighborhood of the instance x. The new framework greatly expands the applicability of output perturbation, a technique for protecting individuals' privacy by adding small random noise to the released statistics. To our knowledge, this is the first formal analysis of any kind of the effect on privacy of instance-based noise.

Our framework raises many interesting algorithmic questions. We show how to compute or approximate the smooth sensitivity and apply our framework for a variety of functions. We also give a generic procedure based on sampling that computes safe noise levels for a large class of functions without explicitly approximating sensitivity. We illustrate the procedure by applying it to the k-means clustering problem.

### Spanners for Disk Graphs

A disk graph is an intersection graph of a set of disks with arbitrary radii in the plane. In this paper, we consider the problem of efficient construction of sparse spanners of disk (ball) graphs with support for fast distance queries. These problems are motivated by issues arising from topology control and routing in wireless networks.

We present the first algorithm for constructing spanners of ball graphs. For the special case where all the balls have the same radius, we show that the spanner construction has randomized complexity almost equivalent to the construction of a Euclidean minimum spanning tree. Previously known constructions of spanners of unit ball graphs have time complexity much closer to n^2. Additionally, these spanners have a small vertex separator (hereditary), which is then exploited for fast answering of distance queries. The results on geometric graph separators might be of independent interest.

Joint work with Martin Furer. The results in this talk appear in WAOA 2006 and another unpublished paper. Both are available at: www.cse.psu.edu/~kasivisw/research.html

### Minimum Digraph Problem

Minimum digraph is defined for a directed graph. A valid solution is a subset of arcs such that we can walk from node u to node v using the subset if and only if we can walk from u to v using all the arcs. We can either minimize the number of edges in the solution or the cost. We will discuss approximation algorithms for these problems.

### Faster All-Pairs Approximate Shortest Path

Minimum digraph is defined for a directed graph. Let G=(V,E) be an weighted undirected graph on |V|=n vertices and |E|=m edges, and let d be its shortest path metric. We say that an algorithm solves the all-pairs shortest paths problem with (a,b)-approximation if for any pair of vertices u,v, the estimate f(u,v) produced by the algorithm satisfies: d(u,v) <= f(u,v) <= ad(u,v) + bh(u,v). Here, h(u,v) is the heaviest edge on a shortest path between u and v. In this talk, I will present some recent results for faster (a,b)-approximation and observations for improving some of these results.

### Finding, counting and approximating the number of small subgraphs

Let G = (V,E) be an undirected graph. For any subset of vertices we can consider the induced subgraph, G[v_1...v_n] obtained by keeping all edges connecting the vertices from the subset. For a small graph H, find the number of induced subgraphs of G, isomorphic to H, exactly or approximately. A simpler question is to test whether G is H-free.

I will present and discuss some current results. My favourite ones are: one can test if the graph contains a triangle (and find it, if it does) in O(E^1.41) = O(E^(2a/(a+1))), where a < 2.376 is the exponent of matrix multiplication. One can count the number of K4's in a graph in O(E^1.69) = O(E^((a+1)/2)). One can also sample a random fraction q of all connected induced subgraphs of size k in an unbiased way. However, so far there is even no O(N^3) algorithm to detect if the graph contains K4 as a subgraph (if the graph is dense, i.e. E^1.69 > N^3).

### Online Markets and Optimal Stopping

The growth of online retail and advertising markets into a multi-billion dollar industry has been accompanied by an explosion of research activity studying the issues which arise from the interaction of computational and incentive constraints in these markets. This talk with focus on the theory of markets in which users arrive and depart over time, and the interplay between these problems and optimal stopping theory, a branch of probability which addresses rules for deciding when to perform given actions while observing a random sequence.(The famous "secretary problem" is one example of such a problem.) On one hand, optimal stopping theory provides a toolkit of techniques for designing approximately-optimal online market mechanisms. On the other hand, the study of online markets motivates us to formulate (and, sometimes, solve) novel questions in optimal stopping theory, including generalizations of the secretary problem.

### Approximately Counting Embeddings into Random Graphs

Let H be a graph, and let C_H(G) be the number of (subgraph isomorphic) copies of H contained in a graph G. We investigate the fundamental problem of estimating C_H(G).

Previous results cover only a few specific instances of this general problem, for example, the case when H has degree at most one (monomer-dimer problem). In this paper, we present the first general subcase of the subgraph isomorphism counting problem which is almost always efficiently approximable. The results rely on a new graph decomposition technique. Informally, the decomposition is a labeling of vertices such that every edge is between vertices with different labels and for every vertex all neighbors with a higher label have the same label. The labeling implicitly generates a sequence of bipartite graphs which permits us to break the problem of counting embeddings of large subgraphs into that of counting embeddings of small subgraphs. Using this method, we present a simple randomized algorithm for the counting problem. For all graphs H and G, the algorithm is an unbiased estimator. Furthermore, for all graphs H having a decomposition where each of the bipartite graph generated is small and almost all graphs G, the algorithm is a fully polynomial randomized approximation scheme.

We show that the graph classes for which we almost always obtain a fully polynomial randomized approximation scheme includes graphs of degree at most two, bounded-degree forests, bounded-width grid graphs, subdivision of bounded-degree graphs, and major subclasses of outerplanar graphs and series-parallel graphs, whereas square grid graphs are excluded. Our general technique can easily be applied to proving many more similar results.

Joint work with Martin Furer.

### Universally-Composable Multi-Party Computation Using Tamper-Proof Hardware

Protocols proven secure within the universal composability framework satisfy strong and desirable security properties. Unfortunately, it is known that within a “plain” network model, secure computation of general functionalities without an honest majority is impossible. This has prompted researchers to propose various setup assumptions'' with which to augment the bare UC framework in order to bypass this severe negative result. Existing setup assumptions seem to inherently require some trusted party (or parties) to initialize the setup in the real world. Taking a different direction, in this talk I will propose a new assumption --- more along the lines of a physical assumption regarding the existence of tamper-proof hardware --- which also suffices to circumvent the impossibility result mentioned above. I will suggest that by exploring various physical assumptions of this sort, it may be possible to achieve secure computation without relying on any trusted parties.

### Faster Integer Multiplication

For more than 35 years, the fastest known method for integer multiplication has been the Schonhage-Strassen algorithm running in time O(n \log n \log \log n). Under very restrictive conditions there is a corresponding \Omega(n \log n) lower bound. The prevailing conjecture has always been that the complexity of an optimal algorithm is \Theta(n \log n). I move a major step towards closing this gap from above with an algorithm running in time n \log n 2^{O(\log^* n)}. The model of computation is the multitape Turing machine or boolean circuits.

### Understanding Parallel Repetition Requires Understanding Foams

Motivated by important problems in complexity theory (the Unique Games Conjecture) and the foundations of quantum mechanics (QM vs. local hidden variable theories), we investigate the best parameters obtainable in the Parallel Repetition Theorem. Unfortunately, we don't get very far. But, it's because we get blocked by the following seemingly hard question from the geometry of "foams": What is the least surface area of a cell that tiles R^d by the lattice Z^d? Very little about this foam problem is known. It is so vexing that I will offer monetary rewards for baby steps of progress.

This is joint work with Uri Feige (Microsoft) and Guy Kindler (Weizmann).

### Approximate Nearest Neighbor Problem: Near-Optimal Hashing Algorithms

I will present a new algorithm for the nearest neighbor problem in high dimensional spaces, which is defined as follows. Given a data set of n points in d-dimensional space R^d, construct a data structure that, for any query point q, reports the data point closest to q. The problem has many natural applications in different areas of computer science. A few examples are: classification (in machine learning), similarity search (in biological, text or visual databases), data quantization (in signal processing and compression), etc. Most of these applications involve high-dimensional data. Unfortunately, many classic algorithms for geometric search problems, such as kd-trees, do not scale well as the dimension d increases. On the positive side, for the approximate version of the nearest neighbor problem, researchers have been able to find much faster algorithms. One such widely used algorithm is one based on Locality-Sensitive Hashing (LSH).I will present a new LSH algorithm that improves significantly over the previous results, and is provably near-optimal from a class of hashing algorithms. Our new algorithm achieves n^{r} query time and n^{1+r} space for r=1/c^2+o(1), where c=1+\epsilon is the approximation ratio. My talk will include a brief overview of the past work on the problem. I will also mention more practical ramifications of our algorithm. This is joint work with Piotr Indyk, and is based on a FOCS'06 paper.