Tweet Generation and Hashtag Prediction

Data used:

Papers referenced:

A Hybrid Convolutional Variational Autoencoder for Text Generation

Generating new tweets

A hybrid convolutional variational autoencoder (VAE), inspired by the 2017 paper by Semeniuta, Severyn, and Barth, encodes tweet character vectors into a low-dimensional hidden representation. New tweets are then generated from this hidden representation.
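As a rough illustration of what "tweet character vectors" could look like before they reach the encoder, the sketch below one-hot encodes a tweet character by character. The alphabet, the maximum length of 140, and the function name encode_tweet are assumptions made for this example and are not taken from the project's code.

```python
import numpy as np

# Hypothetical alphabet and maximum length; the project's actual choices may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 #@.,!?'"
MAX_LEN = 140
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_tweet(text, max_len=MAX_LEN):
    """Return a (max_len, len(ALPHABET)) one-hot matrix for a tweet.

    Characters outside the alphabet and padding positions stay all-zero.
    """
    matrix = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for position, ch in enumerate(text.lower()[:max_len]):
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:
            matrix[position, index] = 1.0
    return matrix


encoded = encode_tweet("Hello #world")
print(encoded.shape)
```

A fixed-size matrix like this is a common input format for convolutional encoders, since every tweet ends up with the same shape regardless of its length.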

Predicting hashtags from tweets

Twitter JSON is processed to extract text, hashtags, and data structures relating the two.
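A minimal sketch of this extraction step is shown below. It assumes each tweet is a dictionary with a "text" field and an "entities" field holding a "hashtags" list, as in the standard Twitter API tweet object; the function name extract_hashtags and the sample tweets are made up for illustration, and the layout of the project's own JSON cache may differ.

```python
from collections import defaultdict


def extract_hashtags(tweets):
    """Build tweet -> hashtags and hashtag -> tweet texts mappings."""
    tweet_to_hashtags = {}
    hashtag_to_tweets = defaultdict(list)
    for tweet in tweets:
        text = tweet.get("text", "")
        if not text:
            continue  # skip entries without any text
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        tags = [h["text"].lower() for h in hashtags]
        tweet_to_hashtags[text] = tags
        for tag in tags:
            hashtag_to_tweets[tag].append(text)
    return tweet_to_hashtags, dict(hashtag_to_tweets)


# Made-up sample tweets, for illustration only.
sample = [
    {"text": "Get out and vote #election2012",
     "entities": {"hashtags": [{"text": "Election2012"}]}},
    {"text": "No tags here", "entities": {"hashtags": []}},
]
tweet_to_hashtags, hashtag_to_tweets = extract_hashtags(sample)
print(hashtag_to_tweets)
```

Keeping both directions of the mapping makes it cheap to look up the hashtags of a tweet and, later, to collect all tweets that share a hashtag.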

GloVe word embeddings are averaged to form an embedding for each tweet, and the embeddings of all tweets that share a hashtag are averaged to form that hashtag's embedding. The tweet embeddings are used as inputs to a fully-connected neural network. The network output is fed through a softmax classifier, and Euclidean (L2) distance is used as the loss function. For hashtag prediction, sentence embeddings are calculated in the same way, by averaging word vectors. They are fed through the neural network, and the nearest neighbors of the predicted vectors are looked up among the calculated hashtag embeddings.
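The averaging and nearest-neighbor lookup can be sketched as follows. This is a simplified illustration, not the project's implementation: the word vectors are toy 3-dimensional values standing in for GloVe, the neural network step is skipped entirely, and all function names are hypothetical.

```python
import numpy as np

DIM = 3
# Toy word vectors standing in for GloVe; the values are made up.
word_vectors = {
    "vote": np.array([1.0, 0.0, 0.0]),
    "today": np.array([0.0, 1.0, 0.0]),
    "game": np.array([0.0, 0.0, 1.0]),
    "tonight": np.array([0.0, 1.0, 1.0]),
}


def embed_tweet(text):
    """Average the vectors of known words; return zeros if none are known."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return np.zeros(DIM)
    return np.mean(vectors, axis=0)


def embed_hashtags(hashtag_to_tweets):
    """Average the embeddings of all tweets that carry each hashtag."""
    return {
        tag: np.mean([embed_tweet(t) for t in tweets], axis=0)
        for tag, tweets in hashtag_to_tweets.items()
    }


def nearest_hashtag(vector, hashtag_embeddings):
    """Return the hashtag whose embedding has the smallest L2 distance."""
    return min(
        hashtag_embeddings,
        key=lambda tag: np.linalg.norm(hashtag_embeddings[tag] - vector),
    )


hashtag_embeddings = embed_hashtags({
    "election": ["vote today"],
    "sports": ["game tonight"],
})
predicted = nearest_hashtag(embed_tweet("vote"), hashtag_embeddings)
print(predicted)
```

In the full pipeline the vector passed to the lookup would be the network's prediction rather than the raw tweet embedding used here.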

Usage

This project targets Python 3.6.0 and TensorFlow 1.0.1.

Executing python main.py --help shows the available arguments. By default, 100,000 tweets are extracted from the JSON file, zero hashtags are printed out, and 50 are embedded. The default test file is assumed to be cache-0.json from the Twitter 2012 election dataset, located at ../data/twitter-nlp/json/cache-0.json. Other parameters, such as the embedding file and the GloVe embedding dimension for the baseline hashtag prediction model, are specified within main.py. Currently, tweet and hashtag embeddings are created with the Embedder and then used to train the Baseline Model. Afterwards, tweets are loaded into the hybrid VAE model.

Included Utilities

utils.twitter_download contains a client built on Tweepy. It downloads custom Twitter datasets given a username and a tweet count. The API returns at most 200 tweets per request; the client works around this by issuing requests recursively, up to the download limit of the Twitter API account in use.

If being used outside of this project, the authentication keys should be replaced with new ones from Twitter.
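The recursive workaround for the 200-tweet limit can be sketched generically as below. This is an assumption about the general approach, not a copy of the project's client: fetch_page is a hypothetical callable standing in for a single API request, and paging by the oldest tweet id is one common way to walk a timeline. A fake in-memory timeline replaces the network so the sketch runs on its own.

```python
def fetch_all(fetch_page, total, page_size=200, max_id=None):
    """Recursively collect up to `total` tweets, at most `page_size` per request.

    `fetch_page(count, max_id)` is a hypothetical callable that returns a list
    of tweet dicts, newest first, with ids <= max_id (or any id if max_id is None).
    """
    if total <= 0:
        return []
    page = fetch_page(min(page_size, total), max_id)
    if not page:
        return []  # nothing left to download
    oldest_id = page[-1]["id"]
    # Continue with tweets strictly older than the oldest one seen so far.
    return page + fetch_all(fetch_page, total - len(page), page_size, oldest_id - 1)


# Fake timeline of 450 tweets standing in for the Twitter API, for illustration only.
timeline = [{"id": i} for i in range(450, 0, -1)]


def fake_fetch(count, max_id):
    eligible = [t for t in timeline if max_id is None or t["id"] <= max_id]
    return eligible[:count]


tweets = fetch_all(fake_fetch, 500)
print(len(tweets))
```

The recursion stops either when the requested count has been collected or when a request comes back empty, so asking for more tweets than exist simply returns everything available.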

Example

from utils.twitter_download import Tweepy_Client

# Default arguments use the project's own keys (do not use outside of this project!)
client = Tweepy_Client()

# Custom authentication keys
client = Tweepy_Client(con_key=" ",
                       con_sec=" ",
                       acc_tok=" ",
                       acc_sec=" ")

# Download 500 tweets from the given username
client.retrieve("sundarpichai", 500)

# The raw JSON is contained in client.tweets
print(client.tweets)
