Recipe 19.14. Merging Sorted Sequences
Credit: Sébastien Keim, Raymond Hettinger, Danny
Yoo
Problem
You have several sorted sequences (iterables) and need to iterate on
the overall sorted sequence that results from
"merging" these sequences.
Solution
A generator is clearly the right tool for the job, in the general
case (i.e., when you might not have enough memory to comfortably hold
all the sequences). Implementing the generator is made easy by the
standard library module heapq, which supplies
functions to implement the "heap"
approach to priority queues:
import heapq
def merge(*subsequences):
# prepare a priority queue whose items are pairs of the form
# (current-value, iterator), one each per (non-empty) subsequence
heap = [ ]
for subseq in subsequences:
iterator = iter(subseq)
for current_value in iterator:
# subseq is not empty, therefore add this subseq's pair
# (current-value, iterator) to the list
heap.append((current_value, iterator))
break
# make the priority queue into a heap
heapq.heapify(heap)
while heap:
# get and yield lowest current value (and corresponding iterator)
current_value, iterator = heap[0]
yield current_value
for current_value in iterator:
# subseq is not finished, therefore add this subseq's pair
# (current-value, iterator) back into the priority queue
heapq.heapreplace(heap, (current_value, iterator))
break
else:
# subseq has been exhausted, therefore remove it from the queue
heapq.heappop(heap)
Discussion
The need for "merging" sorted
subsequences into a larger sorted sequence is reasonably frequent. If
the amount of data is small enough to fit entirely in memory without
problems, then the best approach is to build a list by concatenating
all subsequences, then sort the list:
def smallmerge(*subsequences):
result = [ ]
for subseq in subsequences: result.extend(subseq)
result.sort( )
return result
The sort method of list objects is based on a
sophisticated natural merge algorithm, able to
take advantage of existing sorted subsequences in the list
you're sorting; therefore, this approach is quite
fast, as well as simple (and general, since this
approach's correctness does not
depend on all subsequences being already sorted). If you can choose
this approach, it has many other advantages. For example,
smallmerge works fine even if one of the
subsequences isn't perfectly sorted
to start with; and in Python 2.4, you may add a generic keywords
argument **kwds to smallmerge and
pass it right along to the result.sort( ) step, to
achieve the flexibility afforded in that version by the
cmp=, key=, and
reverse= arguments to list's
sort method.
However, you sometimes deal with large sequences, which might not
comfortably fit in memory all at the same time (e.g., your sequences
might come from files on disk, or be computed on the fly, item by
item, by other generators). When this happens, this
recipe's generator will enable you to perform your
sequence merging while consuming a very moderate amount of extra
memory (dependent only on the number of subsequences, not on the
number of items in the subsequences).
The recipe's implementation uses a classic
sequence-merging algorithm based on a priority queue, which, in turn,
lets it take advantage of the useful heapq module
in the Python Standard Library. heapq offers
functions to implement a priority queue through the data structure
known as a heap.
A heap is any list
H such that, for any valid index
0<=i<len(H),
H[i]<=H[2*i+1], and
H[i]<=H[2*i+2] (if 2*i+1 and
2*i+2 are also valid indices into
H). This heap
property is fast to establish on an arbitrary list
(function heapify does that) and very fast to
re-establish after altering or removing the smallest item (and
functions heapreplace and
heappop do that). The smallest item is always
H[0]
(it's easy to see that the
"heap" property implies this), and
being able to find the smallest item instantly makes heaps an
excellent implementation of priority queues.
In this recipe, we use as items in the
"heap" a
"pair" (i.e., two-items tuple) for
each subsequence that is not yet exhausted (i.e., each subsequence
through which we have not yet fully iterated). As its first item,
each pair has the "current item" in
the corresponding subsequence and, as its second item, an iterator
over that subsequence. At each iteration step, we yield the smallest
"current item", then we advance the
corresponding iterator and re-establish the
"heap" property; when an iterator
is exhausted, we remove the corresponding pair from the
"heap" (so that, clearly,
we're finished when the
"heap" is emptied). Note the idiom
that we use to advance an iterator by one step, dealing with the
possibility that the iterator is exhausted:
for current_value in iterator:
# if we get here the iterator was not empty, current_value was
# its first value, and the iterator has been advanced one step
...use pair (current_value, iterator)...
# we break at once as we only wanted the |