FP-Growth: compact storage & very fast retrieval of frequent item sets

=======

to run the tests for this library:

python3 runtests.py -v

applications:

identifying unique sequences eg of pageviews for clickstream analysis
type-ahead

definitions:

fp-tree: a graph that provides a compact encoding of transaction data and enables efficient retrieval of frequent itemsets (frequently recurring sequences of elements that comprise the transaction data)
fp-growth: the technique for building an fp-tree and populating it with transaction data, then efficiently extracing frequent itemsets from it
header table: a dict whose keys are the unique items comprising the transactions (data rows); the value for each key is a list of len 2; the first item is the item frequency, the second is a pointer which points to the first occurrence of that item in the tree; this pointer, along w/ the 'node_link" attribute for each node, allow efficient traversal of all instances of an item type (eg, 'A') across the tree branches; this node (pointed to in the htab) represents the terminus of the path defined by nodes of the same type within a single fptree.
prefix path: the nodes comparising the path between root node & the item being queried; prefix paths are used to createa a conditional fp-tree
conditional pattern base: collection of prefix paths

2 key (novel) elements of this FP-Growth implementation:

(i) can be queried in a natural way for user-clickstream analytics, e.g.,
	which page is a user most often likely to visit given an immediate prior path?

(ii) along with min support, includes a filter for minimum sequence length

some useful references on the subject of fp-trees and fo-growth:

Prof Christian Borgelt's Site inclues implementations of FP-Growth in C and a large number of academic references on this subject

Frequent Itemset Mining Implementations Repository a large collection of FP-Growth implementations (src) and accompanying papers

Hareendra Perera's blog

- a concise but atomic explanation of the FP-Growth algorithm; the focus of this excellent Post is the algorithmic data flow--ie, it omits many details relating to the FP-Tree data structure (eg, it doesn't mention the header table); but it's this meticulous selection of detail that i found so valuable when building the FP-Growth implementation in this repository.

Name		Name	Last commit message	Last commit date
Latest commit History 167 Commits
data		data
fpgrowth		fpgrowth
ipython-notebooks		ipython-notebooks
tests		tests
utilities		utilities
.gitignore		.gitignore
.travis.yml		.travis.yml
README.md		README.md
TODO.txt		TODO.txt
fptree.json		fptree.json
fptree_build.py		fptree_build.py
fptree_config.py		fptree_config.py
fptree_mine.py		fptree_mine.py
fptree_query.py		fptree_query.py
requirements.txt		requirements.txt
runtests.py		runtests.py

mbrhd/fp-growth-for-frequent-itemsets

Folders and files

Latest commit

History

Repository files navigation

FP-Growth: compact storage & very fast retrieval of frequent item sets

About

Resources

Stars

Watchers

Forks

Languages