EncSum

Dependency

  • python >= 3.6
  • keras >= 2.2
  • tensorflow >= 1.8
  • h5py
  • scikit-learn
  • lxml
  • nltk

ROUGE-1.5.5: Place the ROUGE-1.5.5 root folder (the one containing the ROUGE-1.5.5.pl script) next to the encsum_rtv folder.

svm_rank: Add the svm_rank root folder (the one containing the svm_rank_learn and svm_rank_classify binaries) to $PATH.
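Before running the ranking steps, it can help to verify that both binaries actually resolve on $PATH. A minimal sketch (check_svm_rank is a hypothetical helper, not part of this repo):

```python
import shutil

def check_svm_rank(which=shutil.which):
    """Return the svm_rank binaries that are NOT reachable on $PATH.
    (A hypothetical sanity-check helper, not part of this repo.)"""
    return [b for b in ("svm_rank_learn", "svm_rank_classify") if which(b) is None]
```

An empty result means both binaries were found.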

Experiment

  • Training EncSum Model
  • COLIEE 2018 Task 1
  • SUM-HOLJ Summarization Task

Training EncSum Model

path/to/coliee2018/train/corpus/dir, path/to/coliee2018/test/corpus/dir: the COLIEE 2018 train and test folders, each containing an IR.xml file.

path/to/glove/embedding/file: path to the text-format word embeddings file glove.840B.300d.txt.
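The GloVe file is plain text: one token followed by its 300 floats per line. A minimal sketch of reading that format (load_glove is illustrative, not the repo's loader; splitting from the right guards against tokens that themselves contain spaces, which do occur in glove.840B.300d.txt):

```python
import numpy as np

def load_glove(lines, dim=300):
    """Parse GloVe's text format: `token v1 v2 ... v_dim` per line."""
    vocab, vectors = {}, []
    for line in lines:
        parts = line.rstrip("\n").rsplit(" ", dim)  # token may contain spaces
        vocab[parts[0]] = len(vectors)
        vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return vocab, np.stack(vectors)
```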

  • Preprocessing
python -m encsum_rtv.preprocess_coliee2018 \
    --mode train \
    --input-dir path/to/coliee2018/train/corpus/dir \
    --output-dir data/coliee2018/preprocessed/train \
    --embeddings-file path/to/glove/embedding/file 
  • Training Neural Net
python -m encsum_rtv.encsum encsum_nn \
    --train \
    --data-dir data/coliee2018/preprocessed/train/numeric \
    --model-dir model \
    --mini-epoch-factor 50 \
    --nb-epochs 5 

COLIEE 2018 Task 1

path/to/coliee2018/train/corpus/dir, path/to/coliee2018/test/corpus/dir: the COLIEE 2018 train and test folders, each containing an IR.xml file.

  • Preprocessing
python -m encsum_rtv.preprocess_coliee2018 \
    --mode test \
    --input-dir path/to/coliee2018/test/corpus/dir \
    --output-dir data/coliee2018/preprocessed/test \
    --vocab-file data/coliee2018/preprocessed/train/numeric/emb_vocab.json 
  • Generating EncSum representation

Train:

python -m encsum_rtv.encsum encsum_nn \
    --infer-encsum \
    --data-dir data/coliee2018/preprocessed/train/numeric \
    --model-dir model \
    --output-dir data/coliee2018/preprocessed/train/encsum

python -m encsum_rtv.encsum encsum_relevance \
    --encsum-feature-file data/coliee2018/preprocessed/train/encsum/paras.encsum.npz \
    --output-file data/coliee2018/preprocessed/train/encsum/encsum_relevance.npz \
    --meta-file path/to/coliee2018/train/corpus/dir/IR.xml

Test:

python -m encsum_rtv.encsum encsum_nn \
    --infer-encsum \
    --data-dir data/coliee2018/preprocessed/test/numeric \
    --model-dir model \
    --output-dir data/coliee2018/preprocessed/test/encsum

python -m encsum_rtv.encsum encsum_relevance \
    --encsum-feature-file data/coliee2018/preprocessed/test/encsum/paras.encsum.npz \
    --output-file data/coliee2018/preprocessed/test/encsum/encsum_relevance.npz \
    --meta-file path/to/coliee2018/test/corpus/dir/IR.xml
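The intermediate feature files above are NumPy .npz archives; they can be inspected with plain numpy (the array names inside depend on what the pipeline wrote):

```python
import numpy as np

def describe_npz(path):
    """List array names and shapes stored in an .npz archive."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}
```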
  • Extracting Lexical Features

Train:

python -m encsum_rtv.lexical \
    --text-dir data/coliee2018/preprocessed/train/text \
    --numeric-dir data/coliee2018/preprocessed/train/numeric \
    --meta-file path/to/coliee2018/train/corpus/dir/IR.xml \
    --cpu-count $CPU_COUNT

Test:

python -m encsum_rtv.lexical \
    --text-dir data/coliee2018/preprocessed/test/text \
    --numeric-dir data/coliee2018/preprocessed/test/numeric \
    --meta-file path/to/coliee2018/test/corpus/dir/IR.xml \
    --cpu-count $CPU_COUNT

--cpu-count $CPU_COUNT: accelerate feature extraction with multiprocessing across $CPU_COUNT worker processes.
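Feature extraction is independent per document, which is why multiprocessing helps; roughly (a sketch with a placeholder extractor standing in for the real lexical features):

```python
from multiprocessing import Pool

def extract_features(doc):
    # Placeholder per-document work; the real extractor computes
    # lexical-overlap (ROUGE-style) features here.
    return len(doc.split())

def run_all(docs, cpu_count=1):
    """Map the extractor over documents, in parallel when cpu_count > 1."""
    if cpu_count > 1:
        with Pool(cpu_count) as pool:
            return pool.map(extract_features, docs)
    return [extract_features(d) for d in docs]
```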

  • Train and Evaluate Ranker
python -m encsum_rtv.l2r train \
    --model-file model/svm_rank.dat \
    --feature-files data/coliee2018/preprocessed/train/encsum/encsum_relevance.npz \
        data/coliee2018/preprocessed/train/numeric/rouge*.npz \
    --gold-file data/coliee2018/preprocessed/train/numeric/relevance.npz

python -m encsum_rtv.l2r predict \
    --model-file model/svm_rank.dat \
    --feature-files data/coliee2018/preprocessed/test/encsum/encsum_relevance.npz \
        data/coliee2018/preprocessed/test/numeric/rouge*.npz \
    --output-file-prefix output/coliee2018/test \
    --select-top 10 \
    --meta-file path/to/coliee2018/test/corpus/dir/IR.xml

python -m encsum_rtv.l2r evaluate \
    --submission-file output/coliee2018/test.submission.txt \
    --gold-file path/to/task1_true_answer/file
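--select-top 10 keeps the ten highest-scoring candidate cases per query. The selection itself amounts to the following (a sketch assuming one ranker score per candidate):

```python
import numpy as np

def select_top(candidate_ids, scores, k=10):
    """Return the k candidate ids with the highest scores."""
    order = np.argsort(scores)[::-1]  # indices of highest scores first
    return [candidate_ids[i] for i in order[:k]]
```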

SUM-HOLJ Summarization Task

  • Preprocessing
python -m encsum_rtv.preprocess_holj \
    path/to/raw/corpus/dir \
    data/holj/preprocessed
  • Generating Summary
python -m encsum_rtv.generate_summary \
    --corpus-dir data/holj/preprocessed \
    --output-dir output/holj \
    --config-file model/score_model.yaml \
    --weight-file model/model_weights.hdf5 \
    --vocab-file model/emb_vocab.json \
    --summary-mode sentence_selection \
    --top-anchor 0.10

--top-anchor 0.10: select the top 10% of a document's sentences as its summary.
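For a 120-sentence document, --top-anchor 0.10 therefore yields a 12-sentence summary. A sketch of that budget computation (the exact rounding and minimum-length behaviour are assumptions, not read from the code):

```python
def summary_length(n_sentences, top_anchor=0.10):
    # Keep the top fraction of sentences, never fewer than one
    # (the rounding behaviour here is a guess).
    return max(1, round(top_anchor * n_sentences))
```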

  • Evaluation
python -m encsum_rtv.evaluate_summary \
    --corpus-dir data/holj/preprocessed \
    --predict-dir output/holj

Other Summarization Task

Train EncSum Model using other corpus

  • Preprocessing
python -m encsum_rtv.preprocess_docs \
    --corpus-dir path/to/corpus/dir \
    --embeddings-file path/to/glove/embedding/file \
    --output-data data/other/task/preprocessed

path/to/corpus/dir should contain document pairs named docname.sentences and docname.summary, for example:

doc1.sentences
doc1.summary
doc2.sentences
doc2.summary
...
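The preprocessor expects every docname to appear with both extensions. A sketch of how such a corpus could be validated beforehand (pair_corpus is a hypothetical helper, not part of this repo):

```python
from pathlib import Path

def pair_corpus(filenames):
    """Return stems that have both a .sentences and a .summary file."""
    kinds = {}
    for name in filenames:
        p = Path(name)
        if p.suffix in (".sentences", ".summary"):
            kinds.setdefault(p.stem, set()).add(p.suffix)
    return sorted(s for s, k in kinds.items() if k == {".sentences", ".summary"})
```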
  • Train Neural Net
python -m encsum_rtv.encsum encsum_nn \
    --train \
    --doctypes "{'summary': 'summary', 'sentences': 'sentences'}" \
    --data-dir data/other/task/preprocessed \
    --model-dir model \
    --mini-epoch-factor 50 \
    --nb-epochs 5 
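The --doctypes value is a Python dict literal mapping document roles to filename extensions. Such a string can be parsed safely with ast.literal_eval (a sketch of one plausible parsing, not necessarily the repo's):

```python
import ast

def parse_doctypes(arg):
    """Parse a --doctypes-style dict literal from the command line."""
    doctypes = ast.literal_eval(arg)  # parses literals only, never runs code
    if not isinstance(doctypes, dict):
        raise ValueError("--doctypes must be a dict literal")
    return doctypes
```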

Generating Summary

  • Generating Summary
python -m encsum_rtv.generate_summary \
    --corpus-dir path/to/corpus/dir \
    --output-dir output/summary/dir \
    --config-file model/score_model.yaml \
    --weight-file model/model_weights.hdf5 \
    --vocab-file model/emb_vocab.json \
    --summary-mode sentence_selection \
    --top-anchor 0.10

--top-anchor 0.10: select the top 10% of a document's sentences as its summary.

  • Evaluation
python -m encsum_rtv.evaluate_summary \
    --corpus-dir path/to/corpus/dir \
    --predict-dir output/summary/dir

Reference

This code is an implementation of the following paper:

Tran, V., Le Nguyen, M., Tojo, S. et al. Encoded summarization: summarizing documents into continuous vector space for legal case retrieval. Artif Intell Law (2020). https://doi.org/10.1007/s10506-020-09262-4
