Skip to content

compact encoding & very fast retrieval of frequent item sets with FP-Growth

Notifications You must be signed in to change notification settings

mbrhd/fp-growth-for-frequent-itemsets

 
 

Repository files navigation

FP-Growth: compact storage & very fast retrieval of frequent item sets

Build Status

GitHub Logo

=======

to run the tests for this library:

python3 runtests.py -v

applications:

  • identifying unique sequences eg of pageviews for clickstream analysis

  • type-ahead

definitions:

  • fp-tree: a graph that provides a compact encoding of transaction data and enables efficient retrieval of frequent itemsets (frequently recurring sequences of elements that comprise the transaction data)

  • fp-growth: the technique for building an fp-tree and populating it with transaction data, then efficiently extracing frequent itemsets from it

  • header table: a dict whose keys are the unique items comprising the transactions (data rows); the value for each key is a list of len 2; the first item is the item frequency, the second is a pointer which points to the first occurrence of that item in the tree; this pointer, along w/ the 'node_link" attribute for each node, allow efficient traversal of all instances of an item type (eg, 'A') across the tree branches; this node (pointed to in the htab) represents the terminus of the path defined by nodes of the same type within a single fptree.

  • prefix path: the nodes comparising the path between root node & the item being queried; prefix paths are used to createa a conditional fp-tree

  • conditional pattern base: collection of prefix paths

2 key (novel) elements of this FP-Growth implementation:

(i) can be queried in a natural way for user-clickstream analytics, e.g.,
	which page is a user most often likely to visit given an immediate prior path?

(ii) along with min support, includes a filter for minimum sequence length

some useful references on the subject of fp-trees and fo-growth:

Prof Christian Borgelt's Site inclues implementations of FP-Growth in C and a large number of academic references on this subject

Frequent Itemset Mining Implementations Repository a large collection of FP-Growth implementations (src) and accompanying papers

Hareendra Perera's blog

- a concise but atomic explanation of the FP-Growth algorithm; the focus of this excellent Post is the algorithmic data flow--ie, it omits many details relating to the FP-Tree data structure (eg, it doesn't mention the header table); but it's this meticulous selection of detail that i found so valuable when building the FP-Growth implementation in this repository.

About

compact encoding & very fast retrieval of frequent item sets with FP-Growth

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 50.2%
  • Python 49.8%