Problem statement : Given a sentence, decide if it is toxic (several variants)
-W2V GloVe https://nlp.stanford.edu/projects/glove/
- fastText
See https://arxiv.org/pdf/1508.06615.pdf
Tweettokenizer
Explore build our own w2v/character level model
http://karpathy.github.io/assets/rnn/diags.jpeg
Sequence classification with LSTM:
Idea - read each word one at a time into a neural network, put this out at the end
In Keras Code :
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np
data_dim = 16
timesteps = 8
num_classes = 10
# expected input data shape: (batch_size, timesteps, data_dim)
model = Sequential()
model.add(LSTM(32, return_sequences=True,
input_shape=(timesteps, data_dim))) # returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True)) # returns a sequence of vectors of dimension 32
model.add(LSTM(32)) # return a single vector of dimension 32
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
# Generate dummy training data
x_train = np.random.random((1000, timesteps, data_dim))
y_train = np.random.random((1000, num_classes))
# Generate dummy validation data
x_val = np.random.random((100, timesteps, data_dim))
y_val = np.random.random((100, num_classes))
model.fit(x_train, y_train,
batch_size=64, epochs=5,
validation_data=(x_val, y_val))
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
LDA(Latent Dirichlet Allocation) https://radimrehurek.com/gensim/models/ldamodel.html
LSI(Latent Sematic Indexing) https://radimrehurek.com/gensim/models/lsimodel.html
more info below
-
GloVe
-
fastText
With some words that can be easily spotted as vulgar or offensive, root out these from the get go.
(However can be used sarcastically or ironically as such a binary/ zero-sum stance should not be taken)
raw data at
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge#evaluation
reading material http://karpathy.github.io/2015/05/21/rnn-effectiveness/#recurrent-neural-networks
Sequence classification with LSTM: