Notes On Sorting In Python H. James Hoover 2017-02-03 Version 1.2 These comments are intended for Python 3, but they apply to sorting for any programming language. * Contents of Collections Suppose you have an enumerable collection c of objects from some set X. You can make a list of the object in c as follows: l = [ item for item in c ] So from now on we will work with the list l of items. * Total Orderings The sorting problem is to take the list l and produce a new list s that has the same items as l, but they are in increasing order in s. Note: sorting in place means that the original list l is replaced by s. But by "in order" which possible ordering do we mean? So before we can sort, we need to specify our ordering relation for objects in X. We need to say what it means to be equal and less than. In the computing world, we define equality to mean "equal with respect to the things that are important". So for example "0", "00", and "000" are strings that are all equal with respect to the integer they represent, even though they are different strings. The same holds for comparing things. We need to define what it means to be less than in terms of what we consider important for the computation. Thus we might want that "Any" is less than "apple" because we are ignoring case. So we need to specify equality (eq) and less than (lt) -- we use eq and lt to indicate that these are not necessarily the usual = and < relations. Then we can specify the less than or equal ordering (le) as a le b iff a eq b or a lt b We say that le is a total order on set X if for all a, b, c in X, it satisfies: antisymmetry: if a le b and b le a then a eq b transitivity: if a le b and b le c then a le c totality: a le b or b le a If le is a total order, then we can sort lists of objects from X under the ordering le. Note that if we specify eq and lt, then they must be consistent, that is a eq b iff not (a lt b) and not (b lt a) And of course, if you begin with le then you have a eq b iff (a le b) and (b le a) a lt b iff (a le b) and not (b le a) We can also specify an order using a comparison function, cmp, that for all a, b in X satisfies a lt b iff cmp(a, b) < 0 a eq b iff cmp(a, b) = 0 a gt b iff cmp(a, b) > 0 where greater than is defined as: a gt b iff not (a le b) Anyone who has used the qsort function from the C library is familiar with comparison functions. * How do you check if something is sorted? def in_order(l): # all ( l ) returns True if every element in l is True when considered # as a Boolean return all( [ l[i-1] <= l[i] for i in range(1, len(l)) ] ) Note: just because you have a <= relation doesn't mean that you can sort. That is, it isn't always the case that a <= b or b <= a. Consider when working with sets: neither { 1, 2 } <= { 3, 4 } or { 3, 4 } <= { 1, 2 } Python's sorted() routine has no way of knowing that the < relation is not a total order, so it will happily sort a set, but it is not in order. For example: sorted( [ {3, 4}, {1, 2}, {1, 2, 3}, {4, 1} ] ) [{3, 4}, {1, 2}, {1, 2, 3}, {1, 4}] in_order( sorted( [ {3, 4}, {1, 2}, {1, 2, 3}, {4, 1} ] ) ) False * What Does It Mean To Sort? We can now define what it means to sort. Sorting the list l under the total order le means to compute a permutation p of {0, 1, ..., len(l)-1} such that for 0 <= i < j < len(l) we have l[p[i]] le l[p[j] That is, p[i] says that the element in position p[i] should be placed into position i of s. Then a sorted version of l is s = [ l[p[i]] for i in range(0, len(p) ) ] or more directly, since p is processed in order by list abstraction s = [ l[j] for j in p ] Note that in the case where l contains duplicate elements, there will be many permutations p that satisfy the above property. This can be resolved by ensuring that the sort is stable: A stable sort produces a unique permutation p such that, in addition satisfies for all 0 <= i < j < len(l) if l[i] eq l[j] then p[i] < p[j] That is, the sort maintains the relative ordering of the original elements in l under the eq and lt relations. In python, you can sort any list of objects that have a total ordering relation by simply doing s = sorted(l) which returns a list s with the same contents of l but ordered. * Lexicographical Order Relations Between List-like Things What about sorting things like lists of tuples, or lists of lists? Once we have an equality relation eq, we can use it to define equality for lists (and other enumerable containers). Two lists are equal if they are the same length and their elements match. That is, for all lists a, b; a eq b iff len(a) = len(b) and for i in range(0, len(a) ) we have a[i] eq b[i] If you have list-like things, the ordering of their elements can be used to generate an ordering between list-like things. For example, if you have tuples, you order them by comparing their entries: (1, 2) < (1, 3) (2, 1) > (1, 1) (1, 2) = (1, 2) This can be generalized to cases where the lists are of different lengths. This is called a lexicographical ordering, and is what we generally use between strings - a string being a list of characters. Since characters have an ordering, we can order two lists of characters as follows - the recursive definition is for clarity, you would convert the tail recursion into a loop for efficiency. def lexle(a, b): # empty lists are equal if len(a) == 0 and len(b) == 0: return True if len(a) == 0 : return True if len(b) == 0 : return False # if here, at least one element in each list if a[0] < b[0] : return True if a[0] > b[0] : return False return lexle(a[1:], b[1:]) Then sorting will work on lists of list-like things, for example if l is [ ("a", "x"), ("b", "y"), ("c", "z"), ("a", "y"), ("b", "z"), ("c", "x") ] then sorted(l) is [('a', 'x'), ('a', 'y'), ('b', 'y'), ('b', 'z'), ('c', 'x'), ('c', 'z')] * Extracting The Permutation From A Sort In general a sort does not give you a permutation, but sorts the actual list. You can get the permutation by the trick of pairing the original list contents with their position, and then sorting on the first position: l = ['x', 'y', 'a', 'b', 'aa' ] lp = [ (l[i], i) for i in range (0, len(l) ) ] [('x', 0), ('y', 1), ('a', 2), ('b', 3), ('aa', 4)] sp = sorted( lp ) [('a', 2), ('aa', 4), ('b', 3), ('x', 0), ('y', 1)] p = [ item[1] for item in sp ] [2, 4, 3, 0, 1] s = [ l[j] for j in p ] ['a', 'aa', 'b', 'x', 'y'] Note This example is an instance of the functional programming idiom known as decorate-sort-undecorate (the Perl version is called the Schwartzian Transform). There are many cases where we want to sort a collection of items by some property of the items. For example we might want to sort coordinate pairs (x, y) by their distance from the origin. In that case we want to compute a key that we decorate each item with, sort by that key, and then remove the decoration. Consider def dist(p): return p[0]**2 + p[1]**2 l = [ (1, 2), (2, 1), (1, 1), (2, 2), (3, 2), (1, 3) ] # decorate ld = [ (dist(p), p) for p in l ] [(5, (1, 2)), (5, (2, 1)), (2, (1, 1)), (8, (2, 2)), (13, (3, 2)), (10, (1, 3))] # sort - but doing some unnecessary extra work sd = sorted(ld) [(2, (1, 1)), (5, (1, 2)), (5, (2, 1)), (8, (2, 2)), (10, (1, 3)), (13, (3, 2))] # un-decorate s = [ e for (d, e) in sd ] [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 2)] Or another example, suppose we want to sort a list of strings by their length: l = [ "xyz", "abcdef", "abc", "xyzt", "abcd" ] # decorate ld = [ (len(s), s) for s in l ] [(3, 'xyz'), (6, 'abcdef'), (3, 'abc'), (4, 'xyzt'), (4, 'abcd')] # sort [(3, 'abc'), (3, 'xyz'), (4, 'abcd'), (4, 'xyzt'), (6, 'abcdef')] # un-decorate ['abc', 'xyz', 'abcd', 'xyzt', 'abcdef'] Now these particular examples are actually doing all kinds of unneeded work because it is sorting the decorated tuples, for example (5, (1, 2)), or (5, "abcde")), when all that is needed is to sort on the decoration, that is 5. Not only does the sort do extra work, it also reorders the strings of equal lengths - for example "xyz" comes before "abc" in the original list while in the sorted list "xyz" is placed after "abc" because the sorting order is (3, 'abc') < (3, 'xyz') This means that the sort is not stable, that is the order of "equal" elements should be preserved in the sorted list. The Python sorted function has the decorate-sort-undecorate functionality built in. If you supply a key function, then the items to be sorted are first each decorated with the key, then sorted by the value of the key, and then undecorated. So the above examples would be s = sorted(l, key=dist) and s = sorted(l, key=len) Using a key function we could sort by ascending 1st component s = sorted(l, key=lambda p: p[0]) to get [(1, 2), (1, 1), (1, 3), (2, 1), (2, 2), (3, 2)] Note how the ordering of the elements when the 1st component matches is the same as in the original list, illustrating that the sort is stable, We can sort by ascending 1st item and descending 2nd using "-" to flip to sort order. s = sorted(l, key=lambda p: (p[0], -p[1]) ) gives [(1, 3), (1, 2), (1, 1), (2, 2), (2, 1), (3, 2)] Returning to the problem of extracting the permutation needed for a sort, we decorate the sequence 0, ..., len(l)-1, i.e, range(0, len(l) ), with the element at that position in the original list. Sorting by the element then moves each index into the proper position in the permutation. So we can compute the permutation necessary to sort list l directly as sorted(range(0, len(l)), key=(lambda i: l[i]) ) Here is a another example. Sort words by length, then lexicographical l = [ "a", "b", "zeta", "zero", "alpha" ] sorted(l, key=lambda w: (len(w), w) ) is ['a', 'b', 'zero', 'zeta', 'alpha'] And since the induced lexicographical orderings are recursive, there is a well-defined notion of sorting a list of lists. l = [ [1, 2, 3], [], [1, 2], [1, 3, 4], [1, 3, 5] ] sorted(l) is [[], [1, 2], [1, 2, 3], [1, 3, 4], [1, 3, 5]] and for tuples with different type elements l = [ ("a", 5), ("b", 7), ("c", 5), ("a", 1), ("b", 8), ("c", 5) ] sorted(l) is [('a', 1), ('a', 5), ('b', 7), ('b', 8), ('c', 5), ('c', 5)] Finally, returning to the original comment on collections, you don't need to put a collection into a list to sort it, the collection just needs to be enumerable: c = { ("a", 5), ("b", 7), ("c", 5), ("a", 1), ("b", 8), ("c", 5) } sorted(c) is [('a', 1), ('a', 5), ('b', 7), ('b', 8), ('c', 5)] or even this dict d = { "a":5, "b":7, "c":5, "a":1, "bb":8, "cc":5 } sorted(d) is just the keys: ['a', 'b', 'bb', 'c', 'cc'] while sorted(d.items()) is the (key, value) pairs [('a', 1), ('b', 7), ('bb', 8), ('c', 5), ('cc', 5)] and sorted([ (i[1], i[0]) for i in d.items() ] ) is the reverse relation from values to keys [(1, 'a'), (5, 'c'), (5, 'cc'), (7, 'b'), (8, 'bb')] which can be made into a reverse dictionary map by rd = { } for k, v in m: if k not in rd: rd[k] = [ v ] else: rd[k].append(v) which is {8: ['bb'], 1: ['a'], 5: ['c', 'cc'], 7: ['b']} * But why doesn't Python's sort have an option for supplying a compare function? If you are familiar with the C library qsort function, it has a calling sequence like this: void qsort(void *base, size_t nel, size_t width, int (*compar)(const void *, const void *)); where the compar function compares two elements a, b and returns <0, 0, >0 depending on ab. There is no corresponding sort option in Python 3. This can make the following sort problem look impossible: l = [ ("a", "x"), ("b", "y"), ("c", "z"), ("a", "y"), ("b", "z"), ("c", "x") ] Sort l by ascending 1st component, and descending 2nd component. One way to do this is via a radix sort, where we first sort by the 2nd component, reverse the list to get descending order, and then sort by the first component. The sort stability preserves the descending order of the 2nd component. l1 = sorted(l, key=lambda p: p[1]) # reverse in place l1.reverse() s = sorted(l1, key=lambda p: p[0]) Or as one command s = sorted( reversed(sorted(l, key=lambda p: p[1])), key=lambda p: p[0]) Or doing everything in place: l.sort(key=lambda p: p[1]) l.reverse() l.sort(key=lambda p: p[0]) Sorted actually has a reverse option, since it is used so frequently, so this also works: l.sort(key=lambda p: p[1], reverse=True) l.sort(key=lambda p: p[0]) * Reversing Natural Ordering Via Wrapper Classes But how do you do the previous sort with one key function? Python has this idea (documented where?) that all classes should have a natural ordering. The problem is that you cannot override the natural ordering with your own user-defined compare function. So if you want to compare two tuples, where the order of the second tuple is reversed, you need to override the order relation for the second element. To do this, you can wrap the original object in a new one, that explicitly reverses the order. class RevOrder(object): """ RevOrder(obj) creates a new instance that is has the total order behaviour of obj reversed. i.e. lt in the original becomes a gt in the new. Thus sorts will be in descending order. """ def __init__(self, obj, *args): self.obj = obj def __str__(self): return self.obj.__str__() def __repr__(self): return self.obj.__repr__() def __lt__(self, other): return self.obj > other.obj def __gt__(self, other): return self.obj < other.obj def __eq__(self, other): return self.obj == other.obj def __le__(self, other): return self.obj >= other.obj def __ge__(self, other): return self.obj <= other.obj def __ne__(self, other): return self.obj != other.obj So this will work s = sorted(l, key=lambda p: (p[0], RevOrder(p[1]) )) Or if you have a compare function at your disposal, then you can build a new wrapper class that overrides the original classes ordering relation by using the functools.cmp_to_key(func) The cmp_to_key function is defined this way, and for each argument instantiates a new object that behaves like the original except for the new comparison operation. def cmp_to_key(mycmp): 'Convert a cmp= function into a key= function' class K(object): def __init__(self, obj, *args): self.obj = obj def __lt__(self, other): return mycmp(self.obj, other.obj) < 0 def __gt__(self, other): return mycmp(self.obj, other.obj) > 0 def __eq__(self, other): return mycmp(self.obj, other.obj) == 0 def __le__(self, other): return mycmp(self.obj, other.obj) <= 0 def __ge__(self, other): return mycmp(self.obj, other.obj) >= 0 def __ne__(self, other): return mycmp(self.obj, other.obj) != 0 return K def cmp_pair(a, b): if a[0] < b[0] : return -1 if a[0] > b[0] : return 1 if a[1] > b[1] : return -1 if a[1] < b[1] : return 1 return 0 So s = sorted(l, key=functools.cmp_to_key(cmp_pair) ) Software Engineering Note: Both RevOrder and the class K from cmp_to_key are derived off of the base class "object", so we are extending the behaviour of object. But object doesn't have the underlying rich comparisons so we are not violating the Liskov Principle of Substitution which says that derived classes must have the same behaviours as their parent classes with respect to existing features. Remarks: Note 1: For Perl aficionados, the general form of the Schwartzian Transform is: @sorted = map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [$_, foo($_)] } @unsorted; Note 2: It seems a Python dogma that objects always have a single "natural" order relation (if they have one at all), and you are not normally allowed to override it. This simply doesn't make sense from an algebraic perspective. And the fact that sort has a reverse=True option hints at this. After all, they should have said to use reversed after a sort if you want it in reverse order. reverse=True is supplying a very limited override of the order relation. References: https://docs.python.org/3.5/reference/expressions.html#comparisons https://docs.python.org/3.5/tutorial/classes.html