2IMM15-Web-information-retrieval-and-data-mining

Instructions for the GUI-controlled components

Components: preprocessor, indexer, query system, clustering, and classification

The database files can be downloaded through the following link: https://melroy.pro/uploads/web_database_files.zip
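
The archive can also be fetched and unpacked with a short Python script. The sketch below assumes the files should end up in the data folder mentioned in step 3 of the instructions that follow.

    # Sketch: download the database archive and unpack it into the data folder
    # (the target folder is an assumption based on step 3 below).
    import os
    import urllib.request
    import zipfile

    URL = "https://melroy.pro/uploads/web_database_files.zip"
    TARGET_DIR = "data"

    os.makedirs(TARGET_DIR, exist_ok=True)
    archive_path = os.path.join(TARGET_DIR, "web_database_files.zip")

    # Download the archive, then extract the database files it contains.
    urllib.request.urlretrieve(URL, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(TARGET_DIR)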

The components above have been merged into the master branch. Please take the following steps to get these components working:

  1. Clone the master branch of the git repository into a directory of choice.
  2. Import the directory into PyCharm. Once imported, select the “src” folder, right-click it, and choose “Mark Directory as > Source Root”.
  3. Before proceeding, place the crawler_sqlite.db and database.sqlite files in the data folder. Here, database.sqlite is the database provided by the NIPS collection, and crawler_sqlite.db is a file provided by us.
  4. Check all Python files for missing packages and install any that are not yet available.
  5. Run the setup.py script to initialize the WordNet stemmer database (a sketch of what this step presumably does follows this list).
  6. Run main.py in the gui folder to start the GUI.
  7. Start by clicking the “Index papers” button. This automatically calls the preprocessing.
  8. Once indexing is done, the components can be accessed and interacted with through their corresponding tabs.
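
Step 5 above refers to a WordNet stemmer database. Assuming the project uses NLTK's WordNet data (an assumption; setup.py itself is authoritative), the initialization presumably amounts to something like this sketch, which also checks that the two database files from step 3 are in place:

    # Sketch of what setup.py presumably does, assuming NLTK's WordNet data is
    # what the "WordNet stemmer database" refers to (consult setup.py to be sure).
    import os
    import nltk

    # Fetch the WordNet corpus used by NLTK's WordNet-based lemmatizer.
    nltk.download("wordnet")

    # Sanity check for step 3: both database files should be in the data folder.
    for name in ("crawler_sqlite.db", "database.sqlite"):
        path = os.path.join("data", name)
        print(path, "found" if os.path.exists(path) else "MISSING")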

Instructions for the web crawler component

Python 3 and the Scrapy package are required to run the crawler.

Python 3 can be downloaded from https://www.python.org/downloads/.

Install Scrapy in a global or virtual environment from the command line: pip install scrapy. This will also download and install all of Scrapy's dependencies.
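
To verify that the installation succeeded, Scrapy can be imported from a Python shell:

    # Quick check that Scrapy is importable in the active environment.
    import scrapy
    print(scrapy.__version__)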

Get the crawler source code from the Jur branch of this repository.

Then run the crawler via the command line as follows:

  1. cd to the top-level DblpCrawler directory.
  2. Use the "scrapy crawl dblpcrawler" command to start the crawler.

The crawler will run until it is done. When it has finished, it will output some statistics and notify the user.

After the crawler has run, the crawler_sqlite.db file with the scraped data can be found in the data directory.
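
The schema of crawler_sqlite.db is not documented here, but the tables it contains can be listed with Python's built-in sqlite3 module:

    # Sketch: list the tables in the crawler's output database. Only the generic
    # sqlite_master catalog is queried, since the schema is not documented here.
    import sqlite3

    with sqlite3.connect("data/crawler_sqlite.db") as connection:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    print("Tables:", [name for (name,) in rows])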

Instructions for the Topic Modeling and Co-Author Graph components

Topic Modeling is placed in a new GUI folder named 'gui_topic_modeling'.

  1. For demonstration purposes, data from a pregenerated model is already placed in the working directory (the src/gui_topic_modeling folder).
  2. Run the GUI and enter 9 in the number-of-topics text field.
  3. You can also choose another number, but then you have to create the model first before doing any further analysis.
  4. The model loads the files dic.p, corpus.p and author2docModel.p. These can be generated from the raw data by running the utilities file in the TopicModeling folder. However, they are the same for any topic model, so they have been generated beforehand and placed in the working directory (see the loading sketch after this list).
  5. No GUI was made for the co-author graph, but its functionality can be explored through coauthorgraph.py in the TopicModeling folder.
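
The pickled files from step 4 can be inspected directly. The sketch below only assumes they were written with Python's pickle module; their concrete object types are not documented in this README, and unpickling may require the same libraries that created them to be installed.

    # Sketch: load the pregenerated topic-modeling artefacts from the working
    # directory (src/gui_topic_modeling) and report their types. Unpickling may
    # require the libraries that originally created these objects.
    import pickle

    for filename in ("dic.p", "corpus.p", "author2docModel.p"):
        with open(filename, "rb") as handle:
            obj = pickle.load(handle)
        print(filename, "->", type(obj))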

Important note: if you are cloning from the GitHub repository, the SQLite database is not included due to file size limitations. Please download it from https://www.kaggle.com/benhamner/nips-papers/data and place the database in the gui_topic_modeling folder (or wherever main.py is run from) to access the topic modeling features.

A sample demo can be found at https://www.youtube.com/watch?v=mdK02BrTQV4&feature=youtu.be
