GT

Putting more thoughts on heap and heapify

In a recent conversation, I was asked what the time complexity of heapify is. I said O(N * Log N) confidently. After putting more thoughts on it, I found it’s not.

I read through the source code of Python3’s builtin lib heapq, which provides a handful of methods such as heapify, heappop and heappush. I found there is a bit more complicated math behind it, and a trade-off that the author made in the implementation.

Just be clear, in this article, heap is binary min-heap.

What is the complexity of heapify?

The builtin method heapify converts a given list into a min-heap. A min-heap means the first element is the smallest, while any children has larger value than their parent node. The detail of building a heap is very well elaborated in Wikipedia and CLRS.

If you frequently use heap and know a little bit internal operation of heap like me, you might think confidently that heapify is an N*logN operation(at least I did). But after an in-depth mathematical analysis, the conclusion is it could be O(N) or N*LogN.

Let’s start.

There are two important operations when building the heap: sift_up and sift_down.

sift_down

sift_down is the operation that moves the element towards the root node if its value smaller than parent. Here is the code.

def sift_down(heap, bottom, i):
    ele = heap[i]
    while i > bottom:
        parent_of_i = (i - 1) >> 1
        if heap[i] < heap[parent_of_i]:
            heap[i] = heap[parent_of_i]
            i = parent_of_i
        else:
            break
    heap[i] = ele

The idea is fairly simple, and the complexity is quite obvious: O(Log N). Not that in some text book like CLRS, the sifting process swaps values.

sift_up

sift_up is the operation that moves the element towards the leaf if its value larger than any child. There are two ideas on sift_up.

The first one is straight forward as described above:

  1. Comparing it’s value with child(if any).
  2. Swap with the child if its value larger than element.(Choose the smaller one if both children have larger value)
  3. Until reach leaf.

The code will be:

def sift_up(heap, pos):
    while pos < len(heap):
        left, right = pos * 2 + 1, pos * 2 + 2
        smaller = pos
        if left < len(heap) and heap[left] < heap[smaller]:
            smaller = left
        if right < len(heap) and heap[right] < heap[smaller]:
            smaller = right
        if smaller != pos:
            heap[pos], heap[smaller] = heap[smaller], heap[pos]
            pos = smaller
        else:
            break

The complexity is O(Log N) at first glance, but if we look at it closely, the complexity depends on which level of the element.

The second idea of sift_up is:

  1. Moving the element all the way up to leaf, choosing the path of smaller child.
  2. Sift down to the level no lower than original position.

The code here is copied from Python 3 lib:

def _siftup(heap, pos):
    endpos = len(heap)
    startpos = pos
    newitem = heap[pos]
    childpos = 2*pos + 1   
    while childpos < endpos:
        rightpos = childpos + 1
        if rightpos < endpos and heap[rightpos]<heap[childpos]:
            childpos = rightpos
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1
    heap[pos] = newitem
    _siftdown(heap, startpos, pos)
    

This idea looks cumbersome since it almost doubled the moving path, but we’ll get back to it later.

Anyway, either idea could be used in building a heap, and we now get back to the topic: what is the time complexity of heapify?

heapify is fairly simple since the sift_up and sift_down are introduced.

def heapify(heap):
    for i in reversed(range(len(heap)//2)):
        _siftup(x, i)

People like me would say the time complexity is so obvious: O(N*Log N). But let’s do some math here.

Assuming there are h levels of the heap. At the leaf leve, there are 2^h nodes, but heapify does not operate on this level. At the next level, nodes number is 2^(h-1), and each node might sift 1 level. At 3rd level there are 2^(h-2) nodes and each might move 2 level.

Therefore, we get the total moves for the heap is:

This result looks a bit frustrating, but if we can recall some basic knowledge from middle school: the sum of geometric series whose common ratio smaller than 1:

Then we take derivative on both sides with respect to x:

If we put x = 1/2 here:

Divided by 2 on both sides:

Get back to our T(n):

Remember the n = 2h-1 is the total number of heap elements, so we have h = log2n+1.

Putting them together, we get: T(n) <= 2*n - 2

Clearly, O(N) = N.

Which sift_up is better?

We have introduced two different sift_up methods above: a faster one and a slower one.

In Python’s heapq.py library, the algorithm is implemented in the slower way. The reason behind this is heappop need sift_up too. (Not finished.)