Capturing Word Order in Averaging Based Sentence Embeddings

Installation

Requirements

  • gitpython
  • humanize
  • matplotlib
  • nltk
  • ray[tune]
  • pytorch
  • tqdm
  • cupy
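
If the editable install in the next step does not pull these in automatically, they can also be installed directly with pip. A minimal sketch, assuming the PyPI package names (pytorch is published on PyPI as torch, and cupy usually needs a wheel matching your CUDA version, e.g. cupy-cuda11x):

# Install the dependencies directly (PyPI package names assumed)
pip install gitpython humanize matplotlib nltk "ray[tune]" torch tqdm cupy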

Installation

# From the repository root, install the package in editable mode
cd capturing-word-order
pip install -e .

Prepare Wikipedia Corpus

# Store the path of the current folder as cwd
cwd=$(pwd)
# Initialize and update the submodules
git submodule init
git submodule update

# Create a folder for raw data
mkdir -p $cwd/data/raw

# Download an archived Wikipedia dump (alternatively, you can download
# a recent dump from https://dumps.wikimedia.org/enwiki/) and save it
# as data/raw/wiki.bz2.
wget https://archive.org/download/enwiki-20190201/enwiki-20190201-pages-articles-multistream.xml.bz2 \
     -O $cwd/data/raw/wiki.bz2

# Create a folder for interim data
mkdir -p $cwd/data/interim

# Extract text from the Wikipedia dump into compressed JSON files under
# data/interim/wiki. Replace --process 32 with --process n, where n is
# the number of available CPU cores.
cd $cwd/src/data/wikiextractor
python WikiExtractor.py --process 32 --json -co ../../../data/interim/wiki ../../../data/raw/wiki.bz2

# Combine the extracted articles into a single file, data/interim/wiki.json.
cd $cwd/data/interim
find wiki -name '*bz2' -exec bunzip2 -k -c {} \; > wiki.json

Generate Tokenized Wikipedia Sentences

# Download the nltk 'punkt' and 'stopwords' packages, which are necessary for tokenization and for training the models.
python -c "import nltk; nltk.download('punkt')"
python -c "import nltk; nltk.download('stopwords')"

# Create a folder for processed data
mkdir -p $cwd/data/processed

# Generate 1 million tokenized Wikipedia sentences. The output is stored
# in the processed data folder.
cd $cwd/src/data/
python tokenize_wiki.py

# Generate 1 million tokenized Wikipedia sentences of length <= 25.
python tokenize_wiki.py --max_len 25
# Generate 2 million tokenized Wikipedia sentences of length <= 25.
python tokenize_wiki.py --n_sents 2000000 --max_len 25

Split the data into training, validation and test sets

cd $cwd/
python -m src.data.make_splits

Download the fastText word vectors and unzip the archive

wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip -P $cwd/data/raw/
cd $cwd/data/raw
unzip crawl-300d-2M.vec.zip

Train neural network models for bigram representations

# Train the neural network models for the bigram representations
cd $cwd/src
python tune_bigram_models.py

Generate Figure 1 and Table 1

Run notebooks/generate_figure_1_and_table_1.ipynb to generate Figure 1 and Table 1.
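
For example, the notebook can be executed non-interactively with nbconvert (assuming Jupyter is installed; opening it in the Jupyter UI works as well):

jupyter nbconvert --to notebook --execute --inplace notebooks/generate_figure_1_and_table_1.ipynb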

Generate Table 2

Run notebooks/generate_table_2.ipynb to generate Table 2.
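
Likewise, assuming Jupyter is available:

jupyter nbconvert --to notebook --execute --inplace notebooks/generate_table_2.ipynb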
