Selection algorithm
In computer science, a selection algorithm is an algorithm for finding the kth smallest number in a list (such a number is called the kth order statistic). This includes the cases of finding the minimum, maximum, and median elements. There are worst-case linear time selection algorithms. Selection is a subproblem of more complex problems like the nearest neighbor problem and shortest path problems.
The term "selection" is used in other contexts in computer science, including the stage of a genetic algorithm in which genomes are chosen from a population for later breeding; see Selection (genetic algorithm). This article addresses only the problem of determining order statistics.
Selection by sorting
Selection can be reduced to sorting by sorting the list and then extracting the desired element. This method is efficient when many selections need to be made from a list, in which case only one initial, expensive sort is needed, followed by many cheap extraction operations. In general, this method requires O(n log n) time, where n is the length of the list.
Linear minimum/maximum algorithms
Linear time algorithms to find minimums or maximums work by iterating over the list and keeping track of the minimum or maximum element so far.
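As an illustration (not part of the original pseudocode), a minimal Python sketch of the linear minimum scan:
def minimum(values):
    # Single pass, tracking the smallest element seen so far: O(n) time.
    it = iter(values)
    best = next(it)  # raises StopIteration for an empty input
    for v in it:
        if v < best:
            best = v
    return best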
Nonlinear general selection algorithm
Using the same ideas used in minimum/maximum algorithms, we can construct a simple, but inefficient general algorithm for finding the kth smallest or kth largest item in a list, requiring O(kn) time, which is effective when k is small. To accomplish this, we simply find the most extreme value and move it to the beginning until we reach our desired index. This can be seen as an incomplete selection sort. Here is the minimum-based algorithm:
function select(list[1..n], k)
    for i from 1 to k
        minIndex = i
        minValue = list[i]
        for j from i+1 to n
            if list[j] < minValue
                minIndex = j
                minValue = list[j]
        swap list[i] and list[minIndex]
    return list[k]
Other advantages of this method are:
- After locating the jth smallest element, it requires only O(j + (k−j)²) time to find the kth smallest element, or only O(k) for k ≤ j.
- It can be done with linked list data structures, whereas the one based on partition requires random access.
Partition-based general selection algorithm
A general selection algorithm that is efficient in practice, but has poor worst-case performance, was conceived by the inventor of quicksort, C.A.R. Hoare, and is known as Hoare's selection algorithm or quickselect.
In quicksort, there is a subprocedure called partition that can, in linear time, group a list (ranging from indices left to right) into two parts: those less than a certain element, and those greater than or equal to the element. Here is pseudocode that performs a partition about the element list[pivotIndex]:
function partition(list, left, right, pivotIndex)
    pivotValue := list[pivotIndex]
    swap list[pivotIndex] and list[right]  // Move pivot to end
    storeIndex := left
    for i from left to right-1  // scan the remaining elements; the pivot now sits at list[right]
        if list[i] < pivotValue
            swap list[storeIndex] and list[i]
            increment storeIndex
    swap list[right] and list[storeIndex]  // Move pivot to its final place
    return storeIndex
In quicksort, we recursively sort both branches, leading to best-case Ω(n log n) time. However, when doing selection, we already know which partition our desired element lies in, since the pivot is in its final sorted position, with all elements less than it preceding it and all elements greater than or equal to it following it. Thus a single recursive call locates the desired element in the correct partition:
function select(list, left, right, k)
    if left = right  // If the list contains only one element
        return list[left]  // Return that element
    select pivotIndex between left and right
    pivotNewIndex := partition(list, left, right, pivotIndex)
    pivotDist := pivotNewIndex - left + 1
    // The pivot is in its final sorted position;
    // pivotDist is its 1-based rank within list[left..right]
    if pivotDist = k
        return list[pivotNewIndex]
    else if k < pivotDist
        return select(list, left, pivotNewIndex - 1, k)
    else
        return select(list, pivotNewIndex + 1, right, k - pivotDist)
Note the resemblance to quicksort: just as the minimum-based selection algorithm is a partial selection sort, this is a partial quicksort, generating and partitioning only O(log n) of its O(n) partitions. This simple procedure has expected linear performance, and, like quicksort, has quite good performance in practice. It is also an in-place algorithm, requiring only constant memory overhead, since the tail recursion can be eliminated with a loop like this:
function select(list, left, right, k)
    loop
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        pivotDist := pivotNewIndex - left + 1
        if pivotDist = k
            return list[pivotNewIndex]
        else if k < pivotDist
            right := pivotNewIndex - 1
        else
            k := k - pivotDist
            left := pivotNewIndex + 1
Like quicksort, the performance of the algorithm is sensitive to the pivot that is chosen. If bad pivots are consistently chosen, this degrades to the minimum-based selection described previously, and so can require as much as O(n²) time. David Musser describes a "median-of-3 killer" sequence that can force the well-known median-of-three pivot selection algorithm to fail with worst-case behavior (see the Introselect section below).
Linear general selection algorithm - Median of Medians algorithm
A worst-case linear algorithm for the general case of selecting the kth largest element was published by Blum, Floyd, Pratt, Rivest and Tarjan in their 1973 paper "Time bounds for selection", sometimes called BFPRT after the last names of the authors. It is based on the quickselect algorithm and is also known as the median-of-medians algorithm.
Although quickselect is linear-time on average, it can require quadratic time with poor pivot choices (consider the case of pivoting around the smallest element at each step). The solution to make it O(n) in the worst case is to consistently find "good" pivots. A good pivot is one for which we can establish that a constant proportion of elements fall both below and above it.
The Select algorithm divides the list into groups of five elements. (Leftover elements are ignored for now.) Then, for each group of five, the median is calculated (an operation that can potentially be made very fast if the five values can be loaded into registers and compared). (If sorting in-place, then these medians are moved into one contiguous block in the list.) Select is then called recursively on this sublist of n/5 elements to find their true median. Finally, the "median of medians" is chosen to be the pivot.
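A hedged Python sketch of the idea (using extra lists for clarity rather than the in-place partitioning described above; the function name and the base-case cutoff of 10 are illustrative):
def median_of_medians_select(a, k):
    # Return the kth smallest element of a (1-based k), worst-case O(n).
    if len(a) <= 10:                      # small base case: just sort
        return sorted(a)[k - 1]
    # Median of each group of five (the final, shorter group included).
    groups = [sorted(a[i:i + 5]) for i in range(0, len(a), 5)]
    medians = [g[len(g) // 2] for g in groups]
    # Recursively choose the median of the medians as the pivot.
    pivot = median_of_medians_select(medians, (len(medians) + 1) // 2)
    lows = [x for x in a if x < pivot]
    pivots = [x for x in a if x == pivot]
    highs = [x for x in a if x > pivot]
    if k <= len(lows):
        return median_of_medians_select(lows, k)
    if k <= len(lows) + len(pivots):
        return pivot
    return median_of_medians_select(highs, k - len(lows) - len(pivots))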
Properties of pivot
The chosen pivot is both less than and greater than half of the elements in the list of medians, which is around n/10 elements for each half. Each of these elements is a median of 5, making it less than 2 other elements and greater than 2 other elements outside the block. Hence, the pivot is less than 3(n/10) elements outside the block, and greater than another 3(n/10) elements outside the block. Thus the chosen median splits the elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm. To visualize:
         12 | 15 | 11 |  2 |  9 |  5 |  0 |  7 |  3 | 21 | 44 | 40 |  1 | 18 | 20 | 32 | 19 | 35 | 37 | 39
         13 | 16 | 14 |  8 | 10 | 26 |  6 | 33 |  4 | 27 | 49 | 46 | 52 | 25 | 51 | 34 | 43 | 56 | 72 | 79
Medians  17 | 23 | 24 | 28 | 29 | 30 | 31 | 36 | 42 | 47 | 50 | 55 | 58 | 60 | 63 | 65 | 66 | 67 | 81 | 83
         22 | 45 | 38 | 53 | 61 | 41 | 62 | 82 | 54 | 48 | 59 | 57 | 71 | 78 | 64 | 80 | 70 | 76 | 85 | 87
         96 | 95 | 94 | 86 | 89 | 69 | 68 | 97 | 73 | 92 | 74 | 88 | 99 | 84 | 75 | 90 | 77 | 93 | 98 | 91
(red = "(one of the two possible) median of medians", here 47 or 50, the two middle entries of the Medians row; gray = "number < red", white = "number > red")
5-tuples are shown here sorted by median, for clarity. Sorting the tuples is not necessary because we only need the median for use as pivot element.
Note that all elements above/left of the red (30% of the 100 elements) are less, and all elements below/right of the red (another 30% of the 100 elements) are greater.
Proof of O(n) running time
The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians is 20% of the size of the list, while the other recursive call recurses on at most 70% of the list, making the running time
T(n) ≤ T(n/5) + T(7n/10) + O(n)
The O(n) is for the partitioning work (we visited each element a constant number of times, in order to form them into O(n) groups and take each median in O(1) time).
Since n/5 + 7n/10 = 9n/10, each level of the recursion does at most 9/10 as much work as the one above it. From this, one can then show that T(n) ≤ c*n*(1 + (9/10) + (9/10)² + ...) = O(n).
Important notes
Although this approach optimizes quite well, it is typically outperformed in practice by the expected linear algorithm with random pivot choices.
The worst-case algorithm can be used to construct a worst-case O(n log n) quicksort algorithm, by using it to find the median at every step.
Introselect
David Musser's well-known introsort achieves practical performance comparable to quicksort while preserving O(n log n) worst-case behavior by creating a hybrid of quicksort and heapsort. In the same paper, Musser introduced an "introspective selection" algorithm, popularly called introselect, which combines Hoare's algorithm with the worst-case linear algorithm described above to achieve worst-case linear selection with performance similar to Hoare's algorithm. It works by optimistically starting out with Hoare's algorithm and only switching to the worst-case linear algorithm if it recurses too many times without making sufficient progress. Simply limiting the recursion to constant depth is not good enough, since this would make the algorithm switch on all sufficiently large lists. Musser discusses a couple of simple approaches:
- Keep track of the list of sizes of the subpartitions processed so far. If at any point k recursive calls have been made without halving the list size, for some small positive k, switch to the worst-case linear algorithm.
- Sum the size of all partitions generated so far. If this exceeds the list size times some small positive constant k, switch to the worst-case linear algorithm. This sum is easy to track in a single scalar variable.
Both approaches limit the recursion depth to O(k log n), which is O(log n) since k is a predetermined constant. The paper suggested that more research on introselect was forthcoming, but as of 2007 it has not appeared.
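A rough Python sketch of the idea (a simplified depth budget stands in for Musser's exact tests, and plain sorting stands in for the worst-case linear fallback, which in the real algorithm would be median of medians):
import random

def introselect(a, k):
    # kth smallest (1-based): randomized quickselect with a fallback once
    # the iteration budget runs out.
    a = list(a)
    budget = 2 * max(1, len(a)).bit_length()   # O(log n) iterations allowed
    while True:
        if len(a) == 1:
            return a[0]
        if budget == 0:
            return sorted(a)[k - 1]            # deterministic fallback
        budget -= 1
        pivot = random.choice(a)
        lows = [x for x in a if x < pivot]
        pivots = [x for x in a if x == pivot]
        if k <= len(lows):
            a = lows
        elif k <= len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            a = [x for x in a if x > pivot]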
Selection as incremental sorting
One of the advantages of the sort-and-index approach, as mentioned, is its ability to amortize the sorting cost over many subsequent selections. However, sometimes the number of selections that will be done is not known in advance, and may be either small or large. In these cases, we can adapt the algorithms given above to simultaneously select an element while partially sorting the list, thus accelerating future selections.
Both the selection procedure based on minimum-finding and the one based on partitioning can be seen as a form of partial sort. The minimum-based algorithm sorts the list up to the given index, and so clearly speeds up future selections, especially of smaller indexes. The partition-based algorithm does not achieve the same behaviour automatically, but can be adapted to remember its previous pivot choices and reuse them wherever possible, avoiding costly partition operations, particularly the top-level one. The list becomes gradually more sorted as more partition operations are done incrementally; no pivots are ever "lost." If desired, this same pivot list could be passed on to quicksort to reuse, again avoiding many costly partition operations.
Using data structures to select in sublinear time
Given an unorganized list of data, linear time (Ω(n)) is required to find the minimum element, because we have to examine every element (otherwise, we might miss it). If we organize the list, for example by keeping it sorted at all times, then selecting the kth largest element is trivial, but then insertion requires linear time, as do other operations such as combining two lists.
The strategy to find an order statistic in sublinear time is to store the data in an organized fashion using suitable data structures that facilitate the selection. Two such data structures are tree-based structures and frequency tables.
When only the minimum (or maximum) is needed, a good approach is to use a heap, which is able to find the minimum (or maximum) element in constant time, while all other operations, including insertion, are O(log n) or better. More generally, a self-balancing binary search tree
can easily be augmented to make it possible to both insert an element and find the kth largest element in O(log n) time. We simply store in each node a count of how many descendants it has, and use this to determine which path to follow. The information can be updated efficiently since adding a node only affects the counts of its O(log n) ancestors, and tree rotations only affect the counts of the nodes involved in the rotation.
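A minimal Python sketch of the count-augmented tree (shown for the kth smallest; the kth largest is symmetric). An ordinary, unbalanced BST is used here for brevity; a self-balancing tree would maintain the same size fields during rotations to guarantee O(log n):
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1                      # number of nodes in this subtree

def subtree_size(node):
    return node.size if node else 0

def insert(node, key):
    # Plain BST insert that keeps the subtree sizes up to date.
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + subtree_size(node.left) + subtree_size(node.right)
    return node

def kth_smallest(node, k):
    # 1-based rank query guided by the stored subtree sizes.
    left = subtree_size(node.left)
    if k == left + 1:
        return node.key
    if k <= left:
        return kth_smallest(node.left, k)
    return kth_smallest(node.right, k - left - 1)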
Another simple strategy is based on some of the same concepts as the hash table. When we know the range of values beforehand, we can divide that range into h subintervals and assign these to h buckets. When we insert an element, we add it to the bucket corresponding to the interval it falls in. To find the minimum or maximum element, we scan from the beginning or end for the first nonempty bucket and find the minimum or maximum element in that bucket. In general, to find the kth element, we maintain a count of the number of elements in each bucket, then scan the buckets from left to right adding up counts until we find the bucket containing the desired element, then use the expected linear-time algorithm to find the correct element in that bucket.
If we choose h of size roughly sqrt(n), and the input is close to uniformly distributed, this scheme can perform selections in expected O(sqrt(n)) time. Unfortunately, this strategy is also sensitive to clustering of elements in a narrow interval, which may result in buckets with large numbers of elements (clustering can be eliminated through a good hash function, but finding the element with the kth largest hash value isn't very useful). Additionally, like hash tables, this structure requires table resizings to maintain efficiency as elements are added and n becomes much larger than h². A useful case of this is finding an order statistic or extremum in a finite range of data. Using the above table with bucket interval 1 and maintaining counts in each bucket is much superior to other methods. Such hash tables are like frequency tables used to classify the data in descriptive statistics.
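A small Python sketch of the frequency-table special case (bucket interval 1 over a known integer range; names are illustrative):
def kth_smallest_in_range(values, k, lo, hi):
    # Count occurrences of each integer in [lo, hi], then walk the counts
    # until the bucket holding the kth smallest value is reached.
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1
    remaining = k
    for offset, c in enumerate(counts):
        if remaining <= c:
            return lo + offset
        remaining -= c
    raise IndexError("k exceeds the number of values")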
Selecting k smallest or largest elements
Another fundamental selection problem is that of selecting the k smallest or k largest elements, which is particularly useful where we want to present just the "top k" of an unsorted list, such as the top 100 corporations by gross sales.
Application of simple selection algorithms
We can use the linear-time selection solution discussed above to find the kth smallest (or kth largest) element, then run through the list in linear time and choose all elements no larger (respectively, no smaller) than that element. If the resulting list needs to be sorted, this can be done in O(k log k) time after the fact.
Direct application of the quicksort-based selection algorithm
The quicksort-based selection algorithm can be used to find the k smallest or the k largest elements. To find the k smallest elements, find the kth smallest element using the median-of-medians selection algorithm. After the partition that finds the kth smallest element, all elements smaller than the kth smallest element will be to its left and all elements larger will be to its right. Thus all elements from the 1st to the kth element inclusive constitute the k smallest elements. The time complexity is linear in n, the total number of elements.
Data structure-based solutions
Another simple method is to add each element of the list into an ordered set data structure, such as a heap or self-balancing binary search tree, with at most k elements. Whenever the data structure has more than k elements, we remove the largest element, which can be done in O(log k) time. Each insertion operation also takes O(log k) time, resulting in O(n log k) time overall.
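For example, using Python's heapq (a min-heap, so values are negated to simulate the size-k max-heap that evicts the largest of the kept elements):
import heapq

def k_smallest(values, k):
    # Keep the k smallest items seen so far in a size-k heap: O(n log k).
    heap = []                              # stores negated values
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:                 # v beats the largest kept value
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)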
It is possible to transform the list into a heap in Θ(n) time, and then traverse the heap using a modified breadth-first search algorithm that places the elements in a priority queue (instead of the ordinary queue that is normally used in a BFS), and terminate the scan after traversing exactly k elements. As the queue size remains O(k) throughout the traversal, it would require O(k log k) time to complete, leading to a time bound of O(n + k log k) on this algorithm.
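A Python sketch of this heap traversal (the candidate queue holds (value, position) pairs for at most O(k) heap nodes at a time):
import heapq

def k_smallest_via_heap(values, k):
    # Θ(n) heapify, then k pops from a small candidate priority queue,
    # pushing each popped node's children: O(n + k log k) overall.
    heap = list(values)
    heapq.heapify(heap)
    result = []
    candidates = [(heap[0], 0)] if heap else []
    while candidates and len(result) < k:
        value, i = heapq.heappop(candidates)
        result.append(value)
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(heap):
                heapq.heappush(candidates, (heap[child], child))
    return result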
We can achieve an O(log n) time solution using skip lists. Skip lists are sorted data structures that allow insertion, deletion and indexed retrieval in O(log n) time. Thus, for any given percentile, we can insert a new element into (and possibly delete an old element from) the list in O(log n), calculate the corresponding index(es) and finally access the percentile value in O(log n) time. This is useful, for example, for maintaining a running median.
Optimised sorting algorithms
More efficient than any of these are specialized partial sorting algorithms based on mergesort and quicksort. The simplest is the quicksort variation: there is no need to recursively sort partitions which only contain elements that would fall after the kth place in the end (starting from the "left" boundary). Thus, if the pivot falls in position k or later, we recurse only on the left partition:
function quicksortFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        quicksortFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < left + k
            quicksortFirstK(list, pivotNewIndex+1, right, k)
The resulting algorithm requires an expected time of only O(n + k log k), and is quite efficient in practice, especially if we substitute selection sort when k becomes small relative to n. However, the worst-case time complexity is still very bad, in the case of a bad pivot selection. Pivot selection along the lines of the worst-case linear time selection algorithm could be used to get better worst-case performance.
Even better is if we don't require those k items to be themselves sorted. Losing that requirement means we can ignore all partitions that fall entirely before or after the kth place. We recurse only into the partition that actually contains the kth element itself.
function quickfindFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        if pivotNewIndex > left + k  // new condition
            quickfindFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < left + k
            quickfindFirstK(list, pivotNewIndex+1, right, k+left-pivotNewIndex-1)
The resulting algorithm requires an expected time of only O(n), which is the best such an algorithm can hope for.
A simpler formulation of a worst-case O(n) algorithm is as follows:
- simply use the median-of-medians selection algorithm described above to find the kth element in worst-case O(n) time
- use the partition operation from quicksort (which is O(n)) to separate the elements less than the kth element from those greater than it
Tournament Algorithm
Another method is the tournament algorithm. The idea is to conduct a knockout minimal round tournament to decide the ranks. It first organises the games (comparisons) between adjacent pairs and moves the winners to the next round until the championship (the first best) is decided. It also constructs the tournament tree along the way. Now the second best element must be among the direct losers to the winner, and these losers can be found by walking the binary tree in O(log n) time. It organises another tournament to decide the second best among these potential elements. The third best must be one among the losers of the second best in either of the two tournament trees. The approach continues until we find k elements. This algorithm takes O(n + k log n) complexity, which for any fixed k independent of n is O(n).
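A hedged Python sketch for the k = 2 case (smallest and second smallest); the general k case repeats the same idea on the recorded losers:
def two_smallest(values):
    # Knockout tournament: pair up entries, keep winners, and remember who
    # lost directly to each winner. The second smallest must be among the
    # direct losers to the champion (at most about log2(n) of them).
    entries = [(v, []) for v in values]    # (value, losers to this value)
    assert len(entries) >= 2
    while len(entries) > 1:
        nxt = []
        for i in range(0, len(entries) - 1, 2):
            a, b = entries[i], entries[i + 1]
            winner, loser = (a, b) if a[0] <= b[0] else (b, a)
            winner[1].append(loser[0])
            nxt.append(winner)
        if len(entries) % 2:               # odd entry gets a bye
            nxt.append(entries[-1])
        entries = nxt
    champion, losers = entries[0]
    return champion, min(losers)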
Lower bounds
In The Art of Computer Programming, Donald E. Knuth discussed a number of lower bounds for the number of comparisons required to locate the k smallest entries of an unorganized list of n items (using only comparisons). There is a trivial lower bound of n − 1 for the minimum or maximum entry. To see this, consider a tournament where each game represents one comparison. Since every player except the winner of the tournament must lose a game before we know the winner, we have a lower bound of n − 1 comparisons.
The story becomes more complex for other indexes. To find the k smallest values requires at least this many comparisons:
n − k + ⌈log₂ n⌉ + ⌈log₂ (n − 1)⌉ + ... + ⌈log₂ (n − k + 2)⌉
This bound is achievable for k=2 but better, more complex bounds exist for larger k.
Language support
Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is C++, which provides a templated nth_element method with a guarantee of expected linear time. It is implied but not required that it is based on Hoare's algorithm by its requirement of expected linear time. (Ref section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E); see also the SGI STL description of nth_element.)
C++ also provides the partial_sort algorithm, which solves the problem of selecting the smallest k elements (sorted), with a time complexity of O(n log k). No algorithm is provided for selecting the greatest k elements, since this should be done by inverting the ordering predicate.
For Perl, the module Sort::Key::Top, available from CPAN, provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures.
Python's standard library (since 2.4) includes heapq.nsmallest and heapq.nlargest, returning sorted lists in O(n + k log n) time.
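For example (expected output shown as comments):
import heapq

data = [9, 4, 7, 1, 8, 3, 6]
print(heapq.nsmallest(3, data))                    # [1, 3, 4]
print(heapq.nlargest(2, data))                     # [9, 8]
print(heapq.nsmallest(2, data, key=lambda x: -x))  # same as nlargest(2, data)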
Because language support for sorting is more ubiquitous, the simplistic approach of sorting followed by indexing is preferred in many environments despite its disadvantage in speed. Indeed, for lazy languages, this simplistic approach can even achieve the best complexity possible for the k smallest/greatest sorted (with maximum/minimum as a special case) if the sort is lazy enough.
Online selection algorithm
In certain selection problems, selection must be online, that is, an element can only be selected from a sequential input at the instance of observation and each selection, respectively refusal, is irrevocable. The problem is to select, under these constraints, a specific element of the input sequence (as for example the largest or the smallest value) with largest probability. This problem can be tackled by the Odds algorithm, which is known to be optimal under an independence condition. The algorithm is also optimal itself, with the number of operations being linear in the length of the input.
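A brief Python sketch of the core computation in the odds algorithm (for the last-success stopping problem; assumes independent events with known success probabilities p_j < 1):
def odds_algorithm(probabilities):
    # Sum the odds p/(1-p) from the last event backwards until the sum first
    # reaches 1; the optimal strategy stops on the first "success" from that
    # index onward. Returns (start_index, probability_of_winning).
    odds_sum, q_product, start = 0.0, 1.0, 0
    for j in range(len(probabilities) - 1, -1, -1):
        p = probabilities[j]
        odds_sum += p / (1.0 - p)
        q_product *= 1.0 - p
        if odds_sum >= 1.0:
            start = j
            break
    return start, q_product * odds_sum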
External links
- Design and Analysis of Algorithms, for a detailed explanation of the recurrence relation for the median-of-medians