DenseHMM

DenseHMM is a modification of Hidden Markov Models (HMMs) that makes it possible to learn dense vector representations of both the hidden states and the observables via gradient-descent methods. The code accompanies our paper "DenseHMM: Learning Hidden Markov Models by Learning Dense Representations" and allows the results therein to be reproduced.

Overview

  • DenseHMM uses a parameter-efficient, non-linear matrix factorization to describe the transition probabilities of HMMs (see the sketch after this list).
  • Two approaches to model training: (a) EM optimization with a gradient-based M-step, or (b) direct optimization of observation co-occurrences, which scales better than EM-based multi-step schemes.
  • Competitive model performance in extensive empirical evaluations.
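
To give an intuition for the factorization, here is a minimal numpy sketch (not the repository's implementation; variable names and dimensions are illustrative). Each hidden state gets learnable "outgoing" and "incoming" embedding vectors, and the transition matrix is obtained via a softmax over their dot products; emission probabilities are factorized analogously with observable embeddings.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes: n hidden states, d-dimensional representations.
n_states, dim = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(n_states, dim))  # "outgoing" state representations
Z = rng.normal(size=(n_states, dim))  # "incoming" state representations

# Transition probabilities as a softmax kernel over embedding dot products:
# A[i, j] ~ exp(u_i . z_j), normalized over j.
A = softmax(U @ Z.T, axis=1)
assert np.allclose(A.sum(axis=1), 1.0)  # every row is a valid distribution

In the EM variant, such representations are updated by gradient descent inside the M-step; in the co-occurrence variant, they are fitted directly to empirical observation co-occurrence statistics.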

Baselines

DenseHMM is compared to various hidden Markov models. We base our code on the hmmlearn library.
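
For reference, a baseline HMM with discrete emissions can be fitted with hmmlearn roughly as follows. This is a minimal sketch with made-up toy data; the actual baseline configurations live in models.py and experiment.py. Depending on the hmmlearn version, the discrete-emission class is MultinomialHMM (older releases) or CategoricalHMM (newer releases).

import numpy as np
from hmmlearn import hmm

# Toy integer-coded observation sequences (hypothetical data with 4 symbols).
seqs = [np.array([0, 1, 2, 1, 0]), np.array([3, 2, 2, 1])]
X = np.concatenate(seqs).reshape(-1, 1)  # hmmlearn expects a single 2D column
lengths = [len(s) for s in seqs]         # marks the sequence boundaries

# Standard HMM with discrete emissions, trained via EM (Baum-Welch).
model = hmm.MultinomialHMM(n_components=3, n_iter=50, random_state=0)
model.fit(X, lengths)
print(model.score(X, lengths))           # total log-likelihood of the data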

Installation

Conda Virtual Environment

We used a conda environment on Debian GNU/Linux 9. Use the provided dense_hmm.yml to create this environment as follows:

conda env create --name dense_hmm --file=dense_hmm.yml
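
Afterwards, activate the environment before running any of the scripts or notebooks:

conda activate dense_hmm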

Datasets

Penn Treebank

We use the Natural Language Toolkit (nltk Python module, version 3.4.5 as specified in dense_hmm.yml) to download the Penn Treebank dataset. We obtained the sequences in April 2020 using (as in data.py):

import nltk
nltk.download('treebank')

from nltk.corpus import treebank
sequences = treebank.tagged_sents()
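
Each entry of sequences is a list of (word, POS-tag) pairs. As a rough illustration of how such sequences can be turned into the integer-coded observations an HMM expects, continuing the snippet above (the actual preprocessing, including the choice of observables, is handled in data.py):

# Hypothetical encoding: map POS tags to integer symbol ids.
tag_vocab = sorted({tag for sent in sequences for _, tag in sent})
tag_to_id = {tag: i for i, tag in enumerate(tag_vocab)}
obs_sequences = [[tag_to_id[tag] for _, tag in sent] for sent in sequences]
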
RCSB PDB Protein Sequences

We downloaded the RCSB PDB protein sequences in October 2019 from https://www.rcsb.org/#Subcategory-download_sequences. We used the gzipped FASTA file containing all PDB sequences. Once downloaded, put the pdb_seqres.txt.gz file in the data directory.
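
The file is a gzipped FASTA file. A minimal sketch for reading the sequences with the Python standard library (the relative path below assumes you run it from the repository root; the actual loading is handled in data.py):

import gzip

sequences, current = [], []
with gzip.open("data/pdb_seqres.txt.gz", "rt") as f:
    for line in f:
        if line.startswith(">"):          # a FASTA header starts a new record
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.strip())
if current:                               # flush the last record
    sequences.append("".join(current))

print(len(sequences), "protein sequences loaded")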

Quick Start

Model Training

Paper section 4
  • The following Jupyter notebook contains the source for running the experiments of section 4: start_matrix_fit_experiment.ipynb. Just run all cells of the notebook. This will create a new directory in the same folder as the notebook, in which the results are stored.
Paper section 5
  • The following files contain the source for running the experiments of section 5:
    • data.py (data pre-processing),
    • experiment.py (parses experiment parameters, starts experiments),
    • models.py (standard HMM and DenseHMM models),
    • utils.py (various utility functions used throughout the package),
    • hmmc/_hmmc.c (from hmmlearn, function for the E-step was modified to log additional data),
    • start_protein_experiment.ipynb,
    • start_synthetic_experiment.ipynb,
    • start_penntree_experiment.ipynb.
  • In the Jupyter notebooks listed above, please set the ROOT_PATH variable to the directory containing the source files (ROOT_PATH must end with a /); see the example after this list.
  • During training, log-likelihood scores, model parameters and sequence samples are written to a new directory created in ROOT_PATH. These values are collected in a dictionary that is subsequently used for evaluation and to create visualizations.
  • Run the Jupyter notebooks to start the respective model training.
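
For example (the path below is a hypothetical placeholder; point it at wherever you placed the source files):

ROOT_PATH = "/home/user/dense-hmm/"  # note the trailing slash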

Model Evaluation

Paper section 4
  • The following Jupyter notebook contains the source for evaluating the experiments of section 4: evaluate_matrix_fit_experiment.ipynb. Fill in the exp_dir path in the notebook and run all cells.
Paper section 5
  • The following files contain the source for evaluating the experiments of section 5:

    • utils.py
    • plot.py
    • evaluate.ipynb
  • Fill in the paths in evaluate.ipynb and run the cells to evaluate and plot the results.

  • Due to random train-test splits and random initializations, the obtained results might slightly deviate from those reported in the paper.

Used Hardware & Runtimes

All experiments were conducted on an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz and an NVIDIA Tesla V100 GPU.

Using the training parameters specified in the Jupyter notebooks, we observed the following approximate runtimes:

Paper section 4

The matrix fit experiment usually takes less than 34 h.

Paper section 5

Penn Treebank training:

  • Training a standard HMM usually takes less than 4 min.
  • Training a DenseHMM in cooc mode usually takes less than 2 min.
  • Training a DenseHMM in EM mode usually takes less than 6 min.
  • A single experiment usually takes less than 16 min.
  • A whole experiment run (100 experiments) usually takes less than 27 h.

Protein training:

  • Fitting a DenseHMM model in EM mode usually takes less than 12 min.
  • Fitting a dense cooc model in cooc mode usually takes less than 1 min.
  • Fitting standard HMM models usually takes less than 8 min.
  • A single experiment run usually takes less than 30 min.
  • A whole experiment run (100 experiments) usually takes less than 48 h.

Synthetic training:

  • Fitting the standard HMM models usually takes less than 20 s.
  • Fitting the DenseHMM models usually takes less than 40 s.
  • A single experiment usually takes less than 2 min.
  • A whole experiment run (100 experiments) usually takes less than 4 h.

License

DenseHMM is released under the MIT license.

Citing DenseHMM

If you use or reference DenseHMM in your research, please use the following BibTeX entry.

@article{densehmm,
  author  = {Joachim Sicking and Maximilian Pintz and Maram Akila and Tim Wirtz},
  title   = {DenseHMM: Learning Hidden Markov Models by Learning Dense Representations},
  journal = {NeurIPS 2020 Workshop on Learning Meaningful Representations of Life (LMRL)},
  year    = {2020}
}
