The project is to model user's patterns and behavior, then using this model we can detect anomalies and outliers of the user's sequence of actions

zshwuhan/Detecting-anomalies-in-user-trajectories

 
 

TribeFlow

  1. Home
  2. Datasets
  3. Reproducibility
  4. Competing Methods

Contains the TribeFlow (previously node-sherlock) source code.

Dependencies

The python dependencies are:

  • Mpi4Py
  • numpy
  • scipy
  • cython
  • pandas
  • plac

You will also need to install and setup:

  • OpenMP
  • MPI

How to install dependencies

Easy way: Install Anaconda Python and set it up as your default environment.

Hard way: Use pip or your package manager to install the dependencies.

pip install numpy
pip install scipy
pip install cython
pip install pandas
pip install mpi4py
pip install plac

Use your package manager (apt on Ubuntu, Homebrew on a Mac) to install OpenMP and MPI. These are the package managers I tested with; other environments should work as well.

How to compile

Simply type make

make

How to use

Either run python setup.py install to install the package, or use it directly from the package folder through the run_script.sh command.

How to parse datasets: Use the scripts/trace_converter.py script. It has a built-in help option.

For command line help:

$ python scripts/trace_converter.py -h
$ python main.py -h

Running with mpi

$ mpiexec -np 4 python main.py [OPTIONS]

Running TribeFlow from other python code:

Check the api_singlecore_example.py file

Example

Converting the Trace

Let's assume we have a trace like the Last.FM trace from Oscar Celma. In this example, each line is of the form:

userid \t timestamp \t musicbrainz-artist-id \t artist-name \t
musicbrainz-track-id \t track-name

For instance:

user_000001 2009-05-01T09:17:36Z    c74ee320-1daa-43e6-89ee-f71070ee9e8f
Impossible Beings   952f360d-d678-40b2-8a64-18b4fa4c5f8Dois Pólos

First, we want to convert this file to our input format. We do this with the scripts/trace_converter.py script. Let's have a look at the options from this script:

$ python scripts/trace_converter.py -h
usage: trace_converter.py [-h] [-d DELIMITER] [-l LOOPS] [-r SORT] [-f FMT]
                          [-s SCALE] [-k SKIP_HEADER] [-m MEM_SIZE]
                          original_trace tstamp_column hypernode_column
                          obj_node_column

positional arguments:
  original_trace        The name of the original trace
  tstamp_column         The column of the time stamp
  hypernode_column      The column of the time hypernode
  obj_node_column       The column of the object node

optional arguments:
  -h, --help            show this help message and exit
  -d DELIMITER, --delimiter DELIMITER
                        The delimiter
  -l LOOPS, --loops LOOPS
                        Consider loops
  -r SORT, --sort SORT  Sort the trace
  -f FMT, --fmt FMT     The format of the date in the trace
  -s SCALE, --scale SCALE
                        Scale the time by this value
  -k SKIP_HEADER, --skip_header SKIP_HEADER
                        Skip these first k lines
  -m MEM_SIZE, --mem_size MEM_SIZE
                        Memory Size (the markov order is m - 1)

The positional (required) arguments are:

  • original_trace is the input file
  • tstamp_column is the column of the time stamp
  • hypernode_column is the column of the users (called hypernodes since they can be playlists as well)
  • obj_node_column is the column of the objects of interest

We can convert the file with the following line:

python scripts/trace_converter.py scripts/test_parser.dat 1 0 2 -d$'\t' \
        -f'%Y-%m-%dT%H:%M:%SZ' > trace.dat

Here, we are saying that column 1 holds the timestamps, column 0 the users, and column 2 the objects (artist ids). Columns are counted from 0. The delimiter -d is a tab. The time stamp format -f is '%Y-%m-%dT%H:%M:%SZ'.
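To make the column indices and the date format concrete, here is a minimal sketch that splits one tab-separated line and parses its timestamp with the same format string. The helper name pick_columns and the sample values are hypothetical, and this is not the converter's own code; it only illustrates what the three positional arguments point at.

```python
from datetime import datetime, timezone


def pick_columns(line, tstamp_column, hypernode_column, obj_node_column,
                 delimiter="\t", fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Return (seconds since epoch, user, object) for one trace line."""
    fields = line.rstrip("\n").split(delimiter)
    when = datetime.strptime(fields[tstamp_column], fmt)
    # Assumption: the trailing "Z" in the trace means UTC.
    when = when.replace(tzinfo=timezone.utc)
    return when.timestamp(), fields[hypernode_column], fields[obj_node_column]


# Hypothetical line: user, timestamp, object id, object name.
sample = "user_a\t2009-05-01T09:17:36Z\tartist_x\tSome Artist"
print(pick_columns(sample, 1, 0, 2))
```

The call mirrors the command line above: 1 for the timestamp column, 0 for the user column, 2 for the object column.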

Adding memory

Use the -m argument to increase the burst size (the B parameter in the paper).

python scripts/trace_converter.py scripts/test_parser.dat 1 0 2 -d$'\t' \
        -f'%Y-%m-%dT%H:%M:%SZ' -m 3 > trace.dat
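As a rough illustration of what keeping more memory means, the sketch below groups each user's consecutive objects into fixed-length windows. The function sliding_windows, its window parameter, and the toy events are hypothetical; how -m maps to the window length, and the converter's actual output format, are defined by scripts/trace_converter.py and are not reproduced here.

```python
from collections import defaultdict


def sliding_windows(events, window):
    """Yield (user, tuple of `window` consecutive objects), per user, in time order."""
    by_user = defaultdict(list)
    # Events are (timestamp, user, object); sorting orders them by time first.
    for _tstamp, user, obj in sorted(events):
        by_user[user].append(obj)
    for user, objs in by_user.items():
        for i in range(len(objs) - window + 1):
            yield user, tuple(objs[i:i + window])


events = [(1, "u1", "a"), (2, "u1", "b"), (3, "u2", "x"),
          (4, "u1", "c"), (5, "u2", "y")]
print(list(sliding_windows(events, 2)))
```

Users with fewer events than the window length produce no windows in this sketch.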

Learning the Model

The example below runs TribeFlow with the same options used for every result in the paper. Explaining the parameters:

  • -np 20: Number of cores for execution.
  • 100: Number of topics.
  • output.h5: The model file.
  • --kernel eccdf: The kernel heuristic for inter-event time estimation. It is ECCDF based, as described in the paper. A Student's t kernel is also available.
  • --residency_priors 1 99: The priors for the inter-event time estimation.
  • --leaveout 0.3: Fraction of transitions to leave out.
  • --num_iter 2000: Number of iterations.
  • --num_batches 20: Number of split/merge moves.

The example below uses 20 cores:

$ mpiexec -np 20 python main.py trace.dat 100 output.h5 \
    --kernel eccdf --residency_priors 1 99 \
    --leaveout 0.3 --num_iter 2000 --num_batches 20

Predictions

The mean reciprocal rank script will generate predictions and save them to the given files. Just run:

$ PYTHONPATH=. python scripts/mrr.py output.h5 rss.dat predictions.dat

output.h5 is the trained model.
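For readers unfamiliar with the metric, the sketch below computes the mean reciprocal rank from ranked candidate lists using its standard definition. It is not the code of scripts/mrr.py, and it does not read rss.dat or predictions.dat; the function name and the toy data are hypothetical.

```python
def mean_reciprocal_rank(ranked_lists, true_items):
    """Average of 1 / rank of the true item; a missing true item contributes 0."""
    total = 0.0
    for ranking, truth in zip(ranked_lists, true_items):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)  # ranks start at 1
    return total / len(true_items)


# Three predictions: true item ranked 1st, ranked 2nd, and absent.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "z"]
print(mean_reciprocal_rank(rankings, truths))
```

A value closer to 1 means the true next object tends to appear near the top of the ranking.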

Other useful scripts

Similar to the script above, you can use the scripts:

  1. view_topics.py to print a summary of the topics with most likely objects
  2. printmat.py to print either an O by O matrix or a Z by Z matrix
  3. plotmat-toyplot.py to generate the Z by Z matrix in the ISMIR jazz paper
  4. fancyplot.py to generate the Miles Davis plot in the ISMIR jazz paper

Datasets

Below is the list of datasets explored in the paper. We also curated links to various other timestamped datasets that can be exploited by TribeFlow and by future efforts.

Datasets used in the paper:

  1. LastFM-1k
  2. LastFM-Our
  3. FourSQ: This dataset was removed from the original website but is still available in archived form. Other, more recent, FourSQ datasets are available. See below.
  4. Brightkite
  5. Yes

Other datasets, some of them more recent, that can be explored with TribeFlow:

  1. Newer FourSQ
  2. Million Music Tweet
  3. Movie Ratings
  4. Twitter
  5. Gowalla
  6. Yelp
  7. Best Buy

Basically, anything with users (playlists, actors, etc. also work), objects, and timestamps.

In the example folder we have some sub-sampled datasets that can be used to better understand the method.

Reproducibility

The current version of the code may not be the exact version used in any of the papers that employ TribeFlow. However, I am tagging the commits closest to each paper. Please check the tags if you want to run the exact version of TribeFlow used in a given paper.

Competing Methods
