nlargest – exploring different algorithms for finding the n largest items in an iterable

Why?

I initially needed, in Python, to find the lexically “largest” file names in a directory containing many files, and solved it by (a code sketch follows this list)

  1. reading the directory contents into a list with os.listdir (while waiting for os.scandir, which arrives in Python 3.5),
  2. sorting this list, and
  3. slicing out the last n elements.
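
A minimal sketch of this naive approach (the function name and parameters are mine, not the repository's):

    import os

    def nlargest_naive(path, n):
        """Return the n lexically largest file names in a directory
        by sorting everything and slicing off the tail."""
        names = os.listdir(path)  # read all directory entries into a list
        names.sort()              # sort the full list lexically
        return names[-n:]         # slice out the last n elements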

This was straightforward, but it did not feel conceptually right to sort the entire list when only the n largest elements were of interest and most elements could be discarded immediately.

I started trying different approaches and benchmarking them to see whether there were any speed-ups to be gained. The original solution was certainly “fast enough” for my purposes, so the motivation was pure curiosity.

The heapq.nlargest function from the standard-library heapq module initially seemed to be exactly what I was looking for, but after digging through its implementation, I still felt there should be a more efficient way. The min-heap structure was certainly interesting to explore further in this context, though (see the example below).
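
For reference, a small self-contained example of heapq.nlargest together with the bounded min-heap idea it is built on (the input data is made up):

    import heapq

    data = [7, 2, 9, 4, 1, 8, 5]

    # Standard-library one-liner, roughly O(len(data) * log(n)):
    print(heapq.nlargest(3, data))  # [9, 8, 7]

    # The underlying idea: keep a min-heap of the n best items seen so
    # far; the heap root is the smallest current candidate and is
    # evicted whenever a larger item arrives.
    heap = data[:3]
    heapq.heapify(heap)
    for item in data[3:]:
        if item > heap[0]:
            heapq.heappushpop(heap, item)
    print(sorted(heap, reverse=True))  # [9, 8, 7]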

For picking the 5 largest numbers from the largest list tested, consisting of 10^7 randomly generated positive integers, the approximate timings (on one particular computer) were:

  • initial naïve solution described above: 16.2 seconds
  • heapq.nlargest: 1.53 seconds
  • my fastest variation for this element count (nlargest_list3): 0.500 seconds

but the relative performance of the algorithms fluctuates considerably with the number of input elements.

Note that the heapq.nlargest solution is implemented in C inside the CPython interpreter, whereas nlargest_list3 has to go through comparatively slow Python-level function calls.
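
The repository's list-based variants are not reproduced here, but as a rough, hypothetical illustration of that family of approaches: keep a small sorted list of the current candidates and reject most items with a cheap cutoff comparison. This is my sketch, not the actual nlargest_list3:

    import bisect
    from itertools import islice

    def nlargest_list_sketch(n, iterable):
        """Hypothetical list-based variant, NOT the repository's code:
        maintain an ascending list of the n best items seen so far;
        most items fail the cheap cutoff test and never touch the list."""
        it = iter(iterable)
        # Initialization: seed the candidate list with the first n items
        # (assumes the iterable yields at least n items).
        best = sorted(islice(it, n))
        cutoff = best[0]
        for item in it:
            if item > cutoff:              # cheap rejection of most items
                bisect.insort(best, item)  # insert at the sorted position
                del best[0]                # drop the now-superfluous smallest
                cutoff = best[0]
        return best                        # ascending order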

Outlook

At a factor of 3, the performance difference is not that noticeable in practice, and the use case is probably not that common. I will not pursue the matter further, but it was interesting to think about the problem and architect different solutions.

Benchmarking

In the same exploratory vein, I thought it would be interesting to use the CLI of the built-in timeit module for this purpose. An advantage of the CLI (invoked via python3 -m timeit) over its Python interface is that the CLI automatically performs an adaptive number of runs depending on how long each run takes.
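
An illustrative invocation (the list size and contents are arbitrary, not the parameters used in the actual benchmarks):

    python3 -m timeit \
        -s 'import heapq, random; data = [random.randrange(10**6) for _ in range(10**5)]' \
        'heapq.nlargest(5, data)'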

I wrote a small benchmarking program that reads parameters from a configuration file and benchmarks all functions with a certain prefix, imported from a configurable module.
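
A hypothetical sketch of such a driver, assuming a config file with a [benchmark] section holding the module name and function prefix, and benchmarked functions with an (n, iterable) signature (none of these names come from the repository):

    import configparser
    import importlib
    import subprocess

    def run_benchmarks(config_path):
        """Hypothetical benchmarking driver, NOT the repository's program."""
        cfg = configparser.ConfigParser()
        cfg.read(config_path)
        module_name = cfg.get('benchmark', 'module')
        prefix = cfg.get('benchmark', 'prefix')

        module = importlib.import_module(module_name)
        for name in sorted(dir(module)):
            if not name.startswith(prefix):
                continue
            setup = 'from {} import {}; data = list(range(10**5))'.format(
                module_name, name)
            stmt = '{}(5, data)'.format(name)  # assumed (n, iterable) signature
            # Delegate the timing to the adaptive timeit CLI described above.
            out = subprocess.check_output(
                ['python3', '-m', 'timeit', '-s', setup, stmt])
            print(name, out.decode().strip())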

Plotting

I used matplotlib to generate predefined comparison images; the plotting scripts are included in the repository. Sample figures are listed below, followed by a rough sketch of the plotting approach.

  • Every tested function plotted over its full range. Not very readable (see the later images for more clarity), but shown to illustrate the general tendencies. (figure: benchmark_output_2015-08-31_16_00_20__plot_all_series__axis_large)

  • As above, but plotted up to fewer elements. (figure: benchmark_output_2015-08-31_16_00_20__plot_all_series__axis_small)

  • Comparison of initialization strategies 1 and 2 for the new functions. (figure: benchmark_output_2015-08-31_16_00_20__plot_init_1_against_init_2__axis_small)

  • Comparison of initialization strategies 1 and 3 for the new functions. (figure: benchmark_output_2015-08-31_16_00_20__plot_init_1_against_init_3__axis_small)

  • Comparison of initialization strategies 2 and 3 for the new functions. (figure: benchmark_output_2015-08-31_16_00_20__plot_init_2_against_init_3__axis_small)

  • Comparison of the reference functions and the new functions with initialization strategy 2. (figure: benchmark_output_2015-08-31_16_00_20__plot_ref_against_init_2__axis_small)
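
A rough sketch of the plotting step; the function, data structure, and most numbers are illustrative (only the 10^7 timings echo the figures quoted above), not the repository's actual script:

    import matplotlib.pyplot as plt

    def plot_series(results, filename):
        """Plot one timing curve per benchmarked function and save it.
        `results` maps a function name to (element_counts, timings)."""
        fig, ax = plt.subplots()
        for name, (counts, timings) in sorted(results.items()):
            ax.plot(counts, timings, marker='o', label=name)
        ax.set_xscale('log')
        ax.set_xlabel('number of input elements')
        ax.set_ylabel('time per call (seconds)')
        ax.legend()
        fig.savefig(filename)

    # Illustrative numbers only:
    plot_series({'heapq.nlargest': ([10**3, 10**5, 10**7], [0.0003, 0.02, 1.53]),
                 'nlargest_list3': ([10**3, 10**5, 10**7], [0.0002, 0.007, 0.500])},
                'comparison.png')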

License

Apache License, version 2.0.
