psambit9791/NewsGenerator

NewsGenerator

Contents

About
Data Set
Data Preprocessing
Models Info
Vanilla RNN
GRU
Conclusion
References

About

This project aims to generate fake news headlines using recurrent neural networks, and to compare the power of a short-term RNN (vanilla RNN) against a long-term RNN (GRU). Both models are implemented from scratch, using only NumPy.

Data Set

The data set contains news titles as tweeted by the Times of India on Twitter. The data was collected specifically for this project; for any query regarding data collection, contact Breejesh Rathod.

Data Preprocessing

The data set contains around 3 lakh (300,000) news titles as tweeted by The Times of India; some of them are listed below:

  • Tax officials claim 'evidence' against Kolkata Knight Riders http://toi.in/aRGFUZ
  • Delhi's Lajpat Nagar blast: Three accused sentenced to death http://toi.in/9vkch5
  • Abandoned car triggers panic at airport

The number of data points the model was trained on was kept limited (around 40-50k).

The data preprocessing included:

  • Removing all links and replacing them with one common placeholder link (since the vocabulary has to be built).
  • Start and end tokens were appended to the sentences, and the sentences were tokenized.
  • A vocabulary was built, and each word in a sentence was replaced by its index in the vocabulary list.
  • One-hot vectors were built, and the preprocessed data was pickled.
  • The length of each sentence was equalized by padding with the end token as a filler word.

**Params**

  • vocab size: 2000
  • data points: 30000

This is the code for the preprocessing phase.
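The linked file holds the actual implementation; as a rough sketch of the steps above (token names `SENT_START`, `SENT_END`, and `COMMON_LINK` are placeholders of my own, not necessarily those used in the repo):

```python
from collections import Counter

SENT_START, SENT_END, LINK_TOKEN = "SENT_START", "SENT_END", "COMMON_LINK"
VOCAB_SIZE = 2000  # as in the Params above

def preprocess(titles, vocab_size=VOCAB_SIZE):
    # Replace every link with one common placeholder, add start/end tokens, tokenize.
    tokenized = []
    for title in titles:
        words = [LINK_TOKEN if w.startswith("http") else w for w in title.split()]
        tokenized.append([SENT_START] + words + [SENT_END])
    # Build the vocabulary from the most frequent words.
    counts = Counter(w for sent in tokenized for w in sent)
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    index = {w: i for i, w in enumerate(vocab)}
    # Replace words by their vocabulary index; drop out-of-vocabulary words.
    ids = [[index[w] for w in s if w in index] for s in tokenized]
    # Equalize sentence lengths by padding with the end token.
    max_len = max(len(s) for s in ids)
    end_idx = index[SENT_END]
    return vocab, [s + [end_idx] * (max_len - len(s)) for s in ids]
```

The indexed, padded sentences would then be one-hot encoded and pickled, as described above.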

Models Info

  • The models were trained on the Google Cloud Platform (50 GB RAM, 8 cores).
  • Different versions of the models were trained; I'll discuss the version I found best.
  • The task at hand is to build a many-to-many RNN.
  • Both versions of truncated backpropagation through time are implemented in the code (TensorFlow-style BPTT and true TBPTT; links in the resources).
  • The models were trained in a parallel fashion using Python's joblib module.
  • Mini-batch gradient descent was used for training.
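The mini-batch setup can be illustrated as follows (a minimal sketch; the actual training loop lives in the repo, and the function name here is my own):

```python
import numpy as np

def minibatches(data, batch_size=128, seed=0):
    # Shuffle the data once per epoch, then yield fixed-size batches;
    # the last batch may be smaller than batch_size.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    for start in range(0, len(order), batch_size):
        yield [data[i] for i in order[start:start + batch_size]]
```

Each batch would then get one forward/backward pass and one gradient update, which is what makes this mini-batch (rather than full-batch or purely stochastic) gradient descent.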

Vanilla RNN

[Figure: vanilla RNN architecture]

**Parameters**

This is the structure of the params file:

  • Hidden Nodes: 128
  • Loss Function: Softmax loss
  • Activation Function: tanh
  • Optimization: Adam
  • Layers: 1
  • Sequence length: 30
  • Batch Size: 128
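With these parameters, one forward step of a vanilla RNN cell can be sketched in NumPy as below, consistent with the tanh activation and softmax output listed above (the weight names and initialization are my own, not necessarily those in the repo's cell):

```python
import numpy as np

VOCAB, HIDDEN = 2000, 128  # vocab size and hidden nodes from the params above

rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (HIDDEN, VOCAB))   # input -> hidden
Whh = rng.normal(0, 0.01, (HIDDEN, HIDDEN))  # hidden -> hidden (recurrence)
Why = rng.normal(0, 0.01, (VOCAB, HIDDEN))   # hidden -> output

def step(x_onehot, h_prev):
    # tanh hidden-state update, then a softmax over the vocabulary
    # giving the distribution of the next word.
    h = np.tanh(Wxh @ x_onehot + Whh @ h_prev)
    logits = Why @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()
```

Unrolling this step over a sequence of up to 30 tokens (the sequence length above) and backpropagating through the unrolled graph is what the truncated BPTT variants mentioned earlier operate on.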

The graph below plots loss against iteration, recorded at every 3rd iteration.

[Figure: loss vs. iteration]

The model was trained for 45 epochs using the TensorFlow-style BPTT. Below are some of the headlines generated during training.

Epoch 0

  • Watch law shows leave century leave sign leave leave game Of This
  • firm Supreme Kyi attend changes activist spent ministry train ministry CRPF No

Epoch 5

Epoch 20

  • MS Dhoni the China says finance help Headley World Cup : alert
  • grow of Taliban https//examplearticle/exres/abcd.com

Epoch 25

  • Shah Rukh Khan shot dead
  • India Sri Lanka conspiracy , https//examplearticle/exres/abcd.com

Most of the sentences don't make sense, but the model at least learnt to complete names (e.g. Barack Obama, Shah Rukh Khan).

Finally, after training, below are some selected news titles generated by the model:

  • IPL 4 : India relief Deccan Chargers
  • Shah Rukh Khan 's used to be KKR's owner
  • Hazare in Pakistan https//examplearticle/exres/abcd.com
  • Nadal in jail

The major observation was that the vanilla RNN fails to generate longer sentences that actually make sense, as it fails to learn long-term dependencies.

This is the code file for the vanilla RNN cell.
This folder contains the generated titles.

GRU

The architecture diagram is given below.

[Figure: GRU architecture]

The GRU cell used:

[Figure: GRU cell]

The parameters used were the same as those of the vanilla RNN. In this case truncated backprop was not used; a full forward and backward pass was made. The model was trained for 100 iterations.
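For reference, the standard GRU update equations can be sketched as below; the repo's cell may differ in details such as biases or initialization, and the parameter names (`Wz`, `Uz`, etc.) are my own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    # Update gate z decides how much of the state to overwrite;
    # reset gate r decides how much past state feeds the candidate.
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)
    # Candidate hidden state, computed from the input and the reset past state.
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))
    # Interpolate between the previous and candidate hidden states.
    return (1 - z) * h_prev + z * h_tilde
```

The gating is what lets the GRU carry information across longer spans than the vanilla RNN's single tanh update.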

Below are some of the generated headlines during training

Epoch 20

  • Poll 16 for shut ! The Best 28 near
  • sends cause tough paise France feel https//examplearticle/exres/abcd.com Chidambaram old situation

Epoch 40

  • Odisha killer Arun thrash legal points reported by a local
  • 4 men die on may 8 https//examplearticle/exres/abcd.com areas

Epoch 60

  • unveils Who quits Abdullah world The legal rates eye Food On meeting https//examplearticle/exres/abcd.com
  • 9.20pm sends security Italian Newshour clear feel do T20

Epoch 80

  • Netherlands telecom hints novelty |impacts Hyderabad
  • visa hikes in noida https//examplearticle/exres/abcd.com

The model somewhat learnt long-term dependencies, but the results were below expectations.

Finally, after training, below are some selected news titles generated by the model:

  • Kalmadi fishermen rapes an European
  • Hazare says https//examplearticle/exres/abcd.com
  • PM daughter killed https//examplearticle/exres/abcd.com
  • IITs track Laxman's plane https//examplearticle/exres/abcd.com

This is the GRU code file.
This is the GRU cell.
This folder contains the generated sentences.

Conclusion

  • The data set contained a maximum of 40 words in a news title.
  • The vanilla RNN performed better than the GRU, as there were no longer sequences for the GRU to model.
  • Both models learnt to put names correctly in the news titles.
  • The models learnt punctuation.
  • The GRU is not useful for modelling shorter sequences.

This repo also contains the code for the LSTM cell.

References
