I initially had a Python need for finding the lexically “largest” files in a directory with many files, and solved it by

- reading the directory contents into a list with `os.listdir` (currently waiting for `os.scandir`, coming in Python 3.5)
- sorting this list
- slicing out the last n elements.
This was straightforward, but conceptually it did not quite feel right to have to sort the entire list, when many elements could be discarded immediately when only the largest n were of interest.
I started to try different approaches and benchmark them, to see whether there were any speed-ups to be gained. It was certainly “fast enough” already for my purposes, so the motivation was pure curiosity.
The `heapq.nlargest` function from the standard-library `heapq` module initially seemed to be exactly what I was looking for, but after digging through the implementation, it still felt like there should be a more efficient way. The min-heap structure was certainly interesting to explore further in this context, though.
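Its usage is a one-liner; here is a quick check against the sort-and-slice result (the data is illustrative):

```python
import heapq
import random

random.seed(42)  # reproducible illustrative data
data = [random.randrange(10**6) for _ in range(10**5)]

# heapq.nlargest returns the n largest elements, largest first,
# without sorting the whole list.
top5 = heapq.nlargest(5, data)
assert top5 == sorted(data, reverse=True)[:5]
```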
For picking the 5 largest numbers in the largest list tested, consisting of 10^7 randomly distributed positive integers, the times (on a certain computer) were distributed approximately like:

- initial naive solution described above: 16.2 seconds
- `heapq.nlargest`: 1.53 seconds
- my fastest variation for this element count (`nlargest_list3`): 0.500 seconds

but there are large fluctuations between the algorithms depending on the number of input elements.
Note that the `heapq.nlargest` solution is backed by native C code in the CPython interpreter, whereas `nlargest_list3` relies on relatively slow Python function calls.
At a factor of 3, the performance difference is not that noticeable in practice, and the use cases are probably rare. I will not pursue this matter further, but it was interesting to think about the problem and design different solutions.
In the exploratory vein, I thought it would be interesting to use the CLI of the built-in `timeit` module for this purpose. An advantage of the CLI (invoked through `python3 -m timeit`) over the Python interface is that the CLI automatically performs an adaptive number of runs depending on how long each run takes.
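For example, a single function can be timed from the shell like this (the setup and data here are illustrative, not the repository's actual benchmark):

```shell
# Time heapq.nlargest on 10^5 random integers; the timeit CLI picks
# the number of loops adaptively based on how long one run takes.
python3 -m timeit \
    -s 'import heapq, random; data = [random.randrange(10**6) for _ in range(10**5)]' \
    'heapq.nlargest(5, data)'
```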
I wrote a small benchmarking program that reads parameters from a configuration file and performs testing on all functions with a certain prefix imported from a configurable module.
I used `matplotlib` to generate predefined comparison images. Scripts are included in the repository. Sample figures are shown below.
- Every tested function plotted at its full range. Not very readable (see later images for more clarity), but shown to illustrate the general tendencies.
- Comparison of initialization strategies 1 and 2 for the new functions.
- Comparison of initialization strategies 1 and 3 for the new functions.
- Comparison of initialization strategies 2 and 3 for the new functions.
- Comparison of the reference functions and the new functions with initialization strategy 2.