I initially had a Python need for finding the lexically “largest” files in a directory with many files, and solved it by

- reading the directory contents into a list with `os.listdir` (currently waiting for `os.scandir`, coming in Python 3.5)
- sorting this list
- slicing out the last n elements.
This was straightforward, but conceptually it did not quite feel right to have to sort the entire list, when many elements could be discarded immediately when only the largest n were of interest.
I started to try different approaches and benchmark them, to see whether there were any speed-ups to be gained. It was certainly “fast enough” already for my purposes, so the motivation was pure curiosity.
The `heapq.nlargest` function from the standard-library `heapq` module initially seemed to be exactly what I was looking for, but after digging through the implementation, it still felt like there should be a more efficient way. The min-heap structure was certainly interesting to explore further in this context, though.
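Its usage is a one-liner; here is a quick check against the sort-and-slice result (the data is illustrative):

```python
import heapq
import random

random.seed(42)  # reproducible illustrative data
data = [random.randrange(10**6) for _ in range(10**5)]

# heapq.nlargest returns the n largest elements, largest first,
# without sorting the whole list.
top5 = heapq.nlargest(5, data)
assert top5 == sorted(data, reverse=True)[:5]
```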
For picking the 5 largest numbers in the largest list tested, consisting of 10^7 randomly distributed positive integers, the times (on a certain computer) were distributed approximately like:

- initial naive solution described above: 16.2 seconds
- `heapq.nlargest`: 1.53 seconds
- my fastest variation for this element count (`nlargest_list3`): 0.500 seconds

but there are large fluctuations between the algorithms depending on the number of input elements.
Note that the `heapq.nlargest` solution is backed by native C code in the CPython interpreter, whereas `nlargest_list3` relies on relatively slow Python function calls.
At a factor of 3, the performance difference is not that noticeable in practice, and the use cases are probably rare. I will not pursue this matter further, but it was interesting to think about the problem and design different solutions.
In the exploratory vein, I thought it would be interesting to use the CLI of the built-in `timeit` module for this purpose. An advantage of the CLI (invoked through `python3 -m timeit`) over the Python interface is that the CLI automatically performs an adaptive number of runs depending on how long each run takes.
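For example, a single function can be timed from the shell like this (the setup and data here are illustrative, not the repository's actual benchmark):

```shell
# Time heapq.nlargest on 10^5 random integers; the timeit CLI picks
# the number of loops adaptively based on how long one run takes.
python3 -m timeit \
    -s 'import heapq, random; data = [random.randrange(10**6) for _ in range(10**5)]' \
    'heapq.nlargest(5, data)'
```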
I wrote a small benchmarking program that reads parameters from a configuration file and performs testing on all functions with a certain prefix imported from a configurable module.
I used `matplotlib` to generate predefined comparison images. Scripts are included in the repository. Sample figures are shown below.
- Every tested function plotted at its full range. Not very readable (see later images for more clarity), but shown to illustrate the general tendencies.
- Comparison of initialization strategies 1 and 2 for the new functions.
- Comparison of initialization strategies 1 and 3 for the new functions.
- Comparison of initialization strategies 2 and 3 for the new functions.
- Comparison of the reference functions and the new functions with initialization strategy 2.