
limteng-rpi/mvp


Dependencies

  • Python 3.5+
  • Python packages:
      • PyTorch 0.3.x (the code may not work with PyTorch 0.4.x, as some APIs changed between the two versions)
      • Requests (to send entity linking requests)
      • NLTK (tokenization and POS tagging)
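
A minimal setup sketch, assuming pip and that a PyTorch 0.3.x build is available for your platform (on recent platforms you may need a legacy wheel from pytorch.org):

pip install requests nltk
python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')"

The second command downloads the NLTK models used for tokenization and POS tagging.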

Files

  • pipeline_baseline.py: Pipeline for the baseline model, without MFD (Moral Foundation Dictionary) or background knowledge features.
  • pipeline_mfd.py: Pipeline for the model with MFD features.
  • pipeline_mfd_bk.py: Pipeline for the model with both MFD and background knowledge features.
  • tagme.py: Runs entity linking on the data set using TagMe.
  • data_processing.py: Optional functions to clean and enrich the entity linking results.

File Format

The tweet data set should be encoded as 3-column TSV files (tweet ID, tweet text, and one or more comma-separated labels), as follows:

tag:search.twitter.com,2005:592839674642718722    God bless the clergy #BaltimoreRiots    CH
tag:search.twitter.com,2005:592840383434006529    Shame on these racist white women for trying to steal this poor black man's purse : #BaltimoreRiots     LB
tag:search.twitter.com,2005:592840285736050688    Nothing screams justice like destroying your own town and stealing innocent people's property #BaltimoreRiots   AS,LB
...
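
A minimal sketch of how such a file can be parsed in Python (read_tweet_tsv and the tuple layout are illustrative, not part of the repository):

def read_tweet_tsv(path):
    """Parse a 3-column TSV file: tweet ID, tweet text, comma-separated labels."""
    instances = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            # Columns are tab-separated; labels may contain several values (e.g. 'AS,LB').
            tweet_id, text, labels = line.split('\t')
            instances.append((tweet_id, text, labels.split(',')))
    return instances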

How to Run

Entity Linking

Command:

python tagme.py -i <INPUT_FILE> -o <OUTPUT_FILE> -t <TOKEN>
  • -i, --input: Path to the TSV-format tweet data set.
  • -o, --output: Path to the output file (TagMe results).
  • -t, --token: TagMe application token. To use the TagMe RESTful API, you need to register for an account and obtain a free token (see: https://services.d4science.org/web/tagme/tagme-help).
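
For reference, the requests that tagme.py sends look roughly like the following sketch of the TagMe REST API (the endpoint and parameters follow the TagMe help page linked above; the annotate helper is illustrative):

import requests

TAGME_ENDPOINT = 'https://tagme.d4science.org/tagme/tag'

def annotate(text, token):
    """Send one tweet to TagMe and return its entity annotations."""
    response = requests.post(TAGME_ENDPOINT, data={
        'text': text,          # the tweet to annotate
        'gcube-token': token,  # the application token passed via --token
        'lang': 'en',
    })
    response.raise_for_status()
    # TagMe returns a JSON object whose 'annotations' field lists the linked entities.
    return response.json().get('annotations', [])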

Model Training

Command (use pipeline_baseline.py, pipeline_mfd.py, or pipeline_mfd_bk.py, depending on the desired model):

python pipeline_baseline.py \
 --train <TRAIN_SET_FILE> \
 --dev <DEV_SET_FILE> \
 --test <TEST_SET_FILE> \
 --mode train \
 --el <TAGME_RESULT_FILE> \
 --mfd <MORAL_FOUNDATION_DICT_FILE> \
 --model <MODEL_OUTPUT_DIR> \
 --output <TEST_SET_RESULT_OUTPUT_FILE> \
 --gpu 1 \
 --embedding <PATH_TO_WORD_EMBEDDING_FILE_FOR_TWEETS> \
 --el_embedding <PATH_TO_WORD_EMBEDDING_FILE_FOR_BACKGROUND_KNOWLEDGE>
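
For example, a hypothetical run of the full model (every path below is a placeholder, not a file shipped with the repository):

python pipeline_mfd_bk.py \
 --train data/train.tsv \
 --dev data/dev.tsv \
 --test data/test.tsv \
 --mode train \
 --el data/tagme_output.json \
 --mfd data/mfd.txt \
 --model models/mvp \
 --output output/test_predictions.tsv \
 --gpu 1 \
 --embedding embeddings/tweet_w2v_100d.txt \
 --el_embedding embeddings/wiki_w2v_100d.txt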
  • --train, --dev, --test: Paths to the training/dev/test set files.

  • --el: Path to the TagMe result file.

  • --mfd: Path to the Moral Foundation Dictionary file.

  • --model: Path to the model output directory.

  • --output: Path to the output file. At the end of training, the model is applied to the test set and the results are written to this file.

  • -m, --mode: Mode; currently only 'train' is implemented. To predict moral values on an unlabeled data set, label all instances as 'NM' (non-moral); the prediction results will be written to the output file (see --output).

  • --labels: Label set (default: CH,FC,AS,LB,PD). A tweet may carry several comma-separated labels (see the encoding sketch after this list).

  • --learning_rate: Learning rate (default=0.005).

  • --batch_size: Batch size (default=20).

  • --max_epoch: Max training epoch number (default=30).

  • --max_seq_len: Max sequence length (tweets longer than this will be truncated).

  • --embedding: Path to the embedding file for tweets.

  • --el_embedding: Path to the embedding file for background knowledge. Note that we use different pre-trained word embeddings for tweets and for background knowledge.

  • --embedding_dim: Word embedding dimension for tweets (default=100). Should be consistent with the embeddings in the --embedding file.

  • --el_embedding_dim: Word embedding dimension for background knowledge (default=100). Should be consistent with the embeddings in the --el_embedding file.

  • --hidden_size: LSTM hidden state size (default=100).

  • --linear_size: Sizes of hidden layers that process the LSTM output (default=50).

  • --el_linear_sizes: Sizes of hidden layers that process the background knowledge feature vector (default=5).

  • --mfd_linear_sizes: Sizes of hidden layers that process the MFD feature vector (default=5).

  • --gpu: Whether to use a GPU (default=1).

  • --device: Select which GPU to use (if you have multiple GPUs on your machine).
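
Since tweets can carry multiple labels (e.g. AS,LB in the sample data above), here is a minimal sketch of turning the label column into a multi-hot vector over the default label set (encode_labels is illustrative, not the repository's internal representation):

LABELS = ['CH', 'FC', 'AS', 'LB', 'PD']

def encode_labels(label_field, label_set=LABELS):
    """Turn a comma-separated label string (e.g. 'AS,LB') into a multi-hot vector."""
    present = set(label_field.split(','))
    # 'NM' (non-moral) is not in the default label set, so such tweets encode to all zeros.
    return [1 if label in present else 0 for label in label_set]

# encode_labels('AS,LB') -> [0, 0, 1, 1, 0]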

About

Code for the ASONAM 2018 paper "Acquiring Background Knowledge to Improve Moral Value Prediction".
