## Adding faulted data to produce accurate results, faster

March 5, 2012

In computer algorithms, there is this thing called ‘soft heap‘, due to Princeton computer scientist Bernard Chazelle (2000), which is basically corrupting a certain percentage of data when building the data structure. When first heard of that, I can’t imagine how on earth could such a thing be anywhere useful.

Indeed it is, and the reason it works is quite interesting to me. Hence I’m writing this post about finding the (real) $k$-th smallest element in a set via this corrupted data structure. By the way, this was presented to me in a very interesting lecture by Robert Tarjan. For you computer scientists: be warned, I really don’t know much about the subject.

Part 1: Heap

Naively sorting a set (i.e. compare the first two elements, put the smaller first; compare the 2nd and 3rd, move the third past the 2nd if it’s smaller; then compare the 3rd and the first, etc.) could take $\sum_{i=1}^N \sim O(N^2)$ many comparisons. One can of course do better, the current known fastest sort takes $O(N \log N)$ operations.

One way to achieve $O(N \log N)$ is the so-called heap sort:

A heap is a binary tree with a number (key) in each node and the child of a node always has key larger than that of the node. (i.e. the minimum element is at the root)

Now the problem of sorting breaks into two parts:

i) build a heap with our given keys
ii) sort elements in a heap

As we can see, if we manage to build a heap with all branches having roughly the same length (computer scientist call that a balanced binary tree), there will be $2^n$ elements at level $n$. The height of the heap is $\log_2 (N+1) \sim O(\log N)$. Main idea of the sort is to go through all elements and aim for doing a constant times the height many operations at each element (then we would get our $N \log N$).

Building the heap: Suppose we already have the first $n$ element put in a heap, occupying all nodes in the first $k$ levels and some nodes in the $k+1$-th level. We put the $n+1$-th element at the first empty node on the $k+1$-th level; compare it to its parent, swap if it’s smaller than it’s parent; compare to its parent again… Do this until it’s larger than it’s parent. Note that 1) swap does not change the shape of the tree. 2) when the process stops, we end up with a new heap. Hence inserting this new element takes at most twice the height many operations (compare and swap once at each level).

In total, building the heap takes no more than $N \times 2 \log N \sim O(N \log N)$ operations. (in fact one can do this in $O(N)$ time, but for our purpose it’s not needed)

Sort from the heap: Take the root as the minimum element, delete it from the heap; now compare the two elements in level 1, put the smaller into the root, now compare the two children of the empty node, etc. (if at some stage the empty node has only one child then just move it, stop if the empty node has no child.) At the end, delete the leaf node that’s empty. Again this process takes no more than
twice the height of the tree many operations.

Combining the above two steps, we have $N \log N$ operations.

As one can guess heaps does much more than sorting stuff, once we spent our time to build data into a heap, we can save such structure and update it once the data set changes:

Inserting element: $\sim O(\log N)$ (put the element at the next available leaf and move upwards)
Find minimum: $\sim O(1)$ (well, the min is at the root)
Delete minimum: $\sim O(\log N)$ (Delete, move up the smaller level-2 element, etc.)

OK~ so much for that, let’s move to the more interesting business:

Part 2: Error

The idea is: given a set of data with keys, instead of putting them into a heap, we put them into another structure (i.e. ‘soft heap’) such that the structure assigns a fake key every element and treat it as the real key. It does so in a way that:

i) All elements have fake keys larger than or equal to their real key.
ii) Only a certain percentage ( $\varepsilon$) of elements have fake key different (larger than) real key.

The advantage of doing that is, unlike heap, it now takes only $\sim O(1)$ (amortized) time to delete the minimum element, while building it still takes $O(N)$ time. (Explaining exactly how this is done requires another post which I might write…for now, check out Chazelle ’00.)

Meanwhile, I guess the most curious thing is, how can we get any uncorrupted results with a corrupted data structure? As an example, let’s see how to find the (genuine) medium with data structure having $\varepsilon$ precent of corrupted keys:

The mean observation is that, since all fake keys are larger than real keys, by deleting the smallest $1/2-\varepsilon$ of elements (according to fake keys), the genuine medium will be larger than all deleted elements (in real key) and hence surely remains in the set.

Now we have reduced the problem to finding the $N\varepsilon/2$-th smallest element in a set of cardinality $(1/2+\varepsilon)N$. Of course we can solve this by putting the elements in a reversed soft heap (i.e. all real keys are changed to their negative value), delete the smallest $(1/2+\varepsilon)$ (fake key) elements, etc.

At each step, we reduce the size of the data set by at least a factor of $(1/2+\varepsilon)$, and the amortized running time at each step is $O(n)$ (building the soft heap) $+ O(n)$ (deleting about half of the elements) which is, in total $\sim O(n)$.

Since the size shrinks in constant factor, the total running time is simply a geometric series, hence $\sim O(N)$.

Remarks:

1. In the above argument, $\varepsilon$ can be anything less than $1/2$ hence in fact quite a lot of data are allowed to be corrupted!

2. It doesn’t matter whether we are finding the medium or the $k$-th smallest element, indeed, after the first step in the above argument we are literally finding the $k$-th smallest element. The general rule being simply taking the negative of the keys before building the soft heap whenever $k.

3. The problem of finding the $k$-th smallest element is very well studied and algorithms that do this are called selection algorithms. Obviously our $O(N)$ running time is optimum since it takes that much time to read off the data set! However, there are many other ways to achieve this linear running time, some are not even amortized, but as far as I know, one cannot do this with a real heap.