Data Mining Project Assignment 1

By Shashank Agarwal (agarwal.202@osu.edu) and Anurag Kalra (kalra.25@osu.edu)

All code is in the code folder. The report is in the report folder (PDF and DOCX formats are provided).

## Instructions

1. First we extract all documents from the sgm files and save them in a csv called "data.csv".
   - input = files/*.sgm files
   - file = cleanXML.jar
   - output = files/pre_processing.csv
2. We then remove all the stop words from this data and save the result to "out_file.csv".
   - input = data.csv
   - file = read.py
   - output = out_file.csv
3. Next we do stemming and create two files: a) the frequency of each word across all documents (word_out_2.csv), and b) the number of documents in which each word is present (word_out.csv).
   - input = out_file.csv
   - file = read_py.py
   - output = word_out.csv & word_out_2.csv
4. We then calculate the tf-idf of each word and save it to the file 'tdidf.csv'.
   - input = word_out.csv & word_out_2.csv
   - file = tdidf.py
   - output = tdidf.csv
5. We keep only the words with a tf-idf greater than 0.01, which results in 2823 words.
   - input = tdidf.csv
   - file = feature.py
   - output = final_tdidf.csv
6. We then create the feature vector using the list of words as one axis and the document bodies as the other. Final results are stored in 'final_tdidf.csv'.
   - input = final_tdidf.csv, data.csv
   - file = create_feature.py
   - output = feature_matrix.pytext
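
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in the code folder: the stop-word list, the function names, the sample documents, and the exact tf-idf formula (a word's share of all tokens times log(N / document frequency)) are all assumptions made for the example, and stemming is omitted to keep it dependency-free.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the project's actual list is not shown in this README.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (stemming omitted)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def count_terms(docs):
    """Return (total frequency per word, number of documents containing each word)."""
    term_freq, doc_freq = Counter(), Counter()
    for tokens in docs:
        term_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))   # each document counts once per word
    return term_freq, doc_freq


def tf_idf(term_freq, doc_freq, n_docs):
    """Assumed scoring: (share of all tokens) * log(n_docs / document frequency)."""
    total = sum(term_freq.values())
    return {
        w: (term_freq[w] / total) * math.log(n_docs / doc_freq[w])
        for w in term_freq
    }


def build_feature_matrix(docs, vocabulary):
    """One row per document, one column per kept word, holding raw counts."""
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        rows.append([counts[w] for w in vocabulary])
    return rows


# Made-up sample documents standing in for the extracted article bodies.
raw_docs = [
    "the price of wheat rose in march",
    "wheat exports and corn exports fell",
    "the bank raised interest rates",
]

docs = [tokenize(d) for d in raw_docs]
term_freq, doc_freq = count_terms(docs)
scores = tf_idf(term_freq, doc_freq, len(docs))

THRESHOLD = 0.01  # same cut-off as step 5
vocabulary = sorted(w for w, s in scores.items() if s > THRESHOLD)
matrix = build_feature_matrix(docs, vocabulary)
```

Under this assumed formula a word that occurs in every document gets an idf of log(1) = 0 and is dropped by the threshold, which is the usual motivation for filtering on tf-idf rather than on raw frequency.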

Using the Makefile:

To execute all steps, just run "make All".

To execute a specific step from the above list, use the following commands:

- For step 1 (create a csv file from sgm files): `make step1`
- For step 2 (remove stop words): `make step2`
- For step 3 (stemming and counting words): `make step3a`, then `make step3b`
- For step 4 (calculate tf-idf): `make step4`
- For step 5 (get top keywords with tf-idf > 0.1): `make step5`
- For step 6 (create feature vector): `make step6`
- To clean all csv files (please use this with caution, as data generation takes a lot of time): `make clean`
