This repository implements a Span-Based AutoEncoder for text data.
The model is structured as follows. For each sentence, we:
1. tokenize and embed the source words
2. compute contextualized word embeddings by feeding the embedded sentence to a bidirectional LSTM
3. extract all possible spans of width up to N from the encoder outputs (the embeddings from step 2)
4. prune the spans using a feed-forward network scorer that trains end-to-end with the rest of the model; only the top K spans are kept after this step (see the sketch after this list)
5. reconstruct the sentence with a decoder LSTM that attends to the spans remaining after step 4 (also sketched below)
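For concreteness, here is a minimal PyTorch sketch of steps 3 and 4 (span enumeration and pruning). Everything in it is an illustrative assumption: the names, dimensions, and endpoint-concatenation span representation are not this repository's actual API.

```python
# Illustrative sketch only; the real model lives in span_ae and is built
# on AllenNLP abstractions.
import torch


def enumerate_spans(seq_len: int, max_width: int):
    """All (start, end) index pairs, inclusive, of width up to max_width."""
    return [
        (start, end)
        for start in range(seq_len)
        for end in range(start, min(start + max_width, seq_len))
    ]


class SpanPruner(torch.nn.Module):
    """Scores each span with a feed-forward net and keeps the top K.
    The scorer is an ordinary module, so it trains end-to-end."""

    def __init__(self, span_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.scorer = torch.nn.Sequential(
            torch.nn.Linear(span_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, 1),
        )

    def forward(self, span_embeddings: torch.Tensor, top_k: int):
        scores = self.scorer(span_embeddings).squeeze(-1)  # (num_spans,)
        top_scores, top_indices = scores.topk(min(top_k, scores.numel()))
        return span_embeddings[top_indices], top_scores, top_indices


# Toy usage: 10 encoder states of size 32, spans of width <= 3,
# span embedding = concatenation of the two endpoint states, keep the best 5.
encoder_states = torch.randn(10, 32)  # stand-in for the BiLSTM outputs
spans = enumerate_spans(seq_len=10, max_width=3)
span_embs = torch.stack(
    [torch.cat([encoder_states[s], encoder_states[e]]) for s, e in spans]
)
kept_embs, kept_scores, kept_idx = SpanPruner(span_dim=64)(span_embs, top_k=5)
```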
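In the same hypothetical spirit, here is one step of the decoder from step 5: an LSTM cell whose input combines the previous word embedding with an attention summary of the kept spans. Plain dot-product attention is used here for brevity; the repository's decoder may score spans differently.

```python
import torch


class AttentiveDecoderStep(torch.nn.Module):
    """One decoder step: an LSTM cell fed with the previous word embedding
    concatenated with an attention-weighted summary of the kept spans."""

    def __init__(self, word_dim: int, span_dim: int, hidden_dim: int):
        super().__init__()
        self.cell = torch.nn.LSTMCell(word_dim + span_dim, hidden_dim)
        self.query = torch.nn.Linear(hidden_dim, span_dim)

    def forward(self, prev_word_emb, h, c, span_embs):
        # prev_word_emb: (word_dim,); h, c: (hidden_dim,); span_embs: (K, span_dim)
        weights = torch.softmax(span_embs @ self.query(h), dim=0)  # (K,)
        context = weights @ span_embs                              # (span_dim,)
        inp = torch.cat([prev_word_emb, context]).unsqueeze(0)     # batch of one
        h, c = self.cell(inp, (h.unsqueeze(0), c.unsqueeze(0)))
        return h.squeeze(0), c.squeeze(0)


# Toy usage with the 5 kept span embeddings from the sketch above; a full
# decoder would loop over time steps and project h to vocabulary logits.
step = AttentiveDecoderStep(word_dim=16, span_dim=64, hidden_dim=32)
h, c = torch.zeros(32), torch.zeros(32)
h, c = step(torch.randn(16), h, c, torch.randn(5, 64))
```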
The code is written using the AllenNLP library. Follow the installation procedure from the AllenNLP website (the library is available via pip, e.g. `pip install allennlp`).
To run this code, simply execute:

```
python -m allennlp.run train configs/experiment_gpu.json --serialization-dir models/baseline --include-package span_ae
```
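The `--include-package span_ae` flag tells AllenNLP to import the `span_ae` package before reading the config, so the model class registered there becomes visible to the config-driven trainer. Roughly, the registration looks like the sketch below; the registered name and class name are guesses for illustration, so check the package source for the real ones.

```python
from allennlp.models.model import Model


@Model.register("span_ae")  # hypothetical name; see span_ae's source
class SpanAutoEncoder(Model):
    # The encoder, span scorer, and decoder would be constructed here from
    # the parameters given in configs/experiment_gpu.json.
    pass
```

With such a registration in place, a `"model": {"type": "span_ae"}` entry in the JSON config resolves to this class.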