DenseHMM

DenseHMM is a modification of Hidden Markov Models (HMMs) that makes it possible to learn dense vector representations of both the hidden states and the observables via gradient-descent methods. The code accompanies our paper "DenseHMM: Learning Hidden Markov Models by Learning Dense Representations" and allows the results therein to be reproduced.

Overview

  • DenseHMM uses a parameter-efficient, non-linear matrix factorization to describe the transition probabilities of HMMs (see the sketch after this list).
  • Two approaches to model training: (a) EM optimization with a gradient-based M-step, or (b) direct optimization of observation co-occurrences, which scales better than EM-based multi-step schemes.
  • Competitive model performance in extensive empirical evaluations.
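
To give an intuition for the factorization, here is a minimal numpy sketch (not the repository's implementation; variable names and dimensions are illustrative). Each hidden state gets learnable "outgoing" and "incoming" embedding vectors, and the transition matrix is obtained via a softmax over their dot products; emission probabilities are factorized analogously with observable embeddings.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes: n hidden states, d-dimensional representations.
n_states, dim = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(n_states, dim))  # "outgoing" state representations
Z = rng.normal(size=(n_states, dim))  # "incoming" state representations

# Transition probabilities as a softmax kernel over embedding dot products:
# A[i, j] ~ exp(u_i . z_j), normalized over j.
A = softmax(U @ Z.T, axis=1)
assert np.allclose(A.sum(axis=1), 1.0)  # every row is a valid distribution

In the EM variant, such representations are updated by gradient descent inside the M-step; in the co-occurrence variant, they are fitted directly to empirical observation co-occurrence statistics.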

Baselines

DenseHMM is compared to various hidden Markov models. We base our code on the hmmlearn library.
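
For reference, a baseline HMM with discrete emissions can be fitted with hmmlearn roughly as follows. This is a minimal sketch with made-up toy data; the actual baseline configurations live in models.py and experiment.py. Depending on the hmmlearn version, the discrete-emission class is MultinomialHMM (older releases) or CategoricalHMM (newer releases).

import numpy as np
from hmmlearn import hmm

# Toy integer-coded observation sequences (hypothetical data with 4 symbols).
seqs = [np.array([0, 1, 2, 1, 0]), np.array([3, 2, 2, 1])]
X = np.concatenate(seqs).reshape(-1, 1)  # hmmlearn expects a single 2D column
lengths = [len(s) for s in seqs]         # marks the sequence boundaries

# Standard HMM with discrete emissions, trained via EM (Baum-Welch).
model = hmm.MultinomialHMM(n_components=3, n_iter=50, random_state=0)
model.fit(X, lengths)
print(model.score(X, lengths))           # total log-likelihood of the data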

Installation

Conda Virtual Environment

We used a conda environment on Debian GNU/Linux 9. Use the provided dense_hmm.yml to create this environment as follows:

conda env create --name dense_hmm --file=dense_hmm.yml
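
Afterwards, activate the environment before running any of the scripts or notebooks:

conda activate dense_hmm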

Datasets

Penn Treebank

We use the Natural Language Toolkit (nltk Python module, version 3.4.5 as specified in dense_hmm.yml) to download the Penn Treebank dataset. We obtained the sequences in April 2020 using (as in data.py):

import nltk
nltk.download('treebank')

from nltk.corpus import treebank
sequences = treebank.tagged_sents()
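
Each entry of sequences is a list of (word, POS-tag) pairs. As a rough illustration of how such sequences can be turned into the integer-coded observations an HMM expects, continuing the snippet above (the actual preprocessing, including the choice of observables, is handled in data.py):

# Hypothetical encoding: map POS tags to integer symbol ids.
tag_vocab = sorted({tag for sent in sequences for _, tag in sent})
tag_to_id = {tag: i for i, tag in enumerate(tag_vocab)}
obs_sequences = [[tag_to_id[tag] for _, tag in sent] for sent in sequences]
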
RCSB PDB Protein Sequences

We downloaded the RCSB PDB protein sequences in October 2019 from https://www.rcsb.org/#Subcategory-download_sequences. We used the gzipped FASTA file containing all PDB sequences. Once downloaded, put the pdb_seqres.txt.gz file in the data directory.
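
The file is a gzipped FASTA file. A minimal sketch for reading the sequences with the Python standard library (the relative path below assumes you run it from the repository root; the actual loading is handled in data.py):

import gzip

sequences, current = [], []
with gzip.open("data/pdb_seqres.txt.gz", "rt") as f:
    for line in f:
        if line.startswith(">"):          # a FASTA header starts a new record
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.strip())
if current:                               # flush the last record
    sequences.append("".join(current))

print(len(sequences), "protein sequences loaded")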

Quick Start

Model Training

Paper section 4
  • The following Jupyter notebook contains the source for running the experiments of section 4: start_matrix_fit_experiment.ipynb. Just run all cells of the notebook. This will create a new directory in the same folder as the notebook, in which the results are stored.
Paper section 5
  • The following files contain the source for running the experiments of section 5:
    • data.py (data pre-processing),
    • experiment.py (parses experiment parameters, starts experiments),
    • models.py (standard HMM and DenseHMM models),
    • utils.py (various utility functions used throughout the package),
    • hmmc/_hmmc.c (from hmmlearn, function for the E-step was modified to log additional data),
    • start_protein_experiment.ipynb,
    • start_synthetic_experiment.ipynb,
    • start_penntree_experiment.ipynb.
  • In the Jupyter notebooks listed above, please set the ROOT_PATH variable to the directory containing the source files (ROOT_PATH must end with a /); see the example after this list.
  • During training, log-likelihood scores, model parameters and sequence samples are written to a new directory created in ROOT_PATH. These values are collected in a dictionary that is subsequently used for evaluation and to create visualizations.
  • Run the Jupyter notebooks to start the respective model training.
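
For example (the path below is a hypothetical placeholder; point it at wherever you placed the source files):

ROOT_PATH = "/home/user/dense-hmm/"  # note the trailing slash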

Model Evaluation

Paper section 4
  • The following Jupyter notebook contains the source for evaluating the experiments of section 4: evaluate_matrix_fit_experiment.ipynb. Fill in the exp_dir path in the notebook and run all cells.
Paper section 5
  • The following files contain the source for evaluating the experiments of section 5:

    • utils.py
    • plot.py
    • evaluate.ipynb
  • Fill in the paths in evaluate.ipynb and run the cells to evaluate and plot the results.

  • Due to random train-test splits and random initializations, the obtained results might slightly deviate from those reported in the paper.

Used Hardware & Runtimes

All experiments were conducted on an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz and an NVIDIA Tesla V100 GPU.

Using the training parameters specified in the Jupyter notebooks, we observed the following approximate runtimes:

Paper section 4

The matrix fit experiment usually takes less than 34 h.

Paper section 5

Penn Treebank training:

  • Training a standard HMM usually takes less than 4 min.
  • Training a DenseHMM in cooc mode usually takes less than 2 min.
  • Training a DenseHMM in EM mode usually takes less than 6 min.
  • A single experiment usually takes less than 16 min.
  • A whole experiment run (100 experiments) usually takes less than 27 h.

Protein training:

  • Fitting a DenseHMM model in EM mode usually takes less than 12 min.
  • Fitting a dense cooc model in cooc mode usually takes less than 1 min.
  • Fitting standard HMM models usually takes less than 8 min.
  • A single experiment run usually takes less than 30 min.
  • A whole experiment run (100 experiments) usually takes less than 48 h.

Synthetic training:

  • Fitting the standard HMM models usually takes less than 20 s.
  • Fitting the DenseHMM models usually takes less than 40 s.
  • A single experiment usually takes less than 2 min.
  • A whole experiment run (100 experiments) usually takes less than 4 h.

License

DenseHMM is released under the MIT license.

Citing DenseHMM

If you use or reference DenseHMM in your research, please use the following BibTeX entry.

@article{densehmm,
  author  = {Joachim Sicking and Maximilian Pintz and Maram Akila and Tim Wirtz},
  title   = {DenseHMM: Learning Hidden Markov Models by Learning Dense Representations},
  journal = {NeurIPS 2020 Workshop on Learning Meaningful Representations of Life (LMRL)},
  year    = {2020}
}
