A kid-friendly search engine that displays results to enhance children's knowledge. The search engine filters out all kinds of harmful content inappropriate for kids. We use a neural network for classification and rank the results using TF-IDF, tweaked with our own formula.
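As a rough illustration of the ranking step, the sketch below scores documents against a query with standard TF-IDF and optionally multiplies in a per-document kid-friendliness weight. The `kid_score` weight is a hypothetical stand-in for our own formula; the real tweak lives in the project code.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs, kid_score=None):
    """Score each document for the query with TF-IDF.

    kid_score: optional list of per-document weights in [0, 1]; a
    hypothetical stand-in for the project's own ranking tweak.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of docs containing each term.
    df = Counter()
    for tokens in tokenized:
        for term in set(tokens):
            df[term] += 1
    scores = []
    for i, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term] / len(tokens)
            idf = math.log((1 + n) / (1 + df[term])) + 1  # smoothed IDF
            score += tf * idf
        if kid_score is not None:
            score *= kid_score[i]  # illustrative kid-friendliness tweak
        scores.append(score)
    return scores

docs = ["cats are cute animals", "dogs are animals too", "stock market news"]
scores = tf_idf_scores("cute animals", docs)
weighted = tf_idf_scores("cute animals", docs, kid_score=[0.1, 1.0, 1.0])
```

Down-weighting the first document (`kid_score=0.1`) demotes it below the second, which is the effect the tweak is meant to have on unsuitable pages.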
The project consists of the following main steps:
- General instructions to run the project
- Scraping data from the web
- Assigning labels to the training data using pattern.en
- Filtering objectionable content
- Identifying topics
- Running Elasticsearch
Clone the repository to your local machine with the command
`git clone repository_url`
To run the project, you need a working installation of Python 3.6 (not 3.7) and pip.
To install all the required dependencies, execute
`pip install -r requirements.txt`
Scraping the data requires Selenium and BeautifulSoup to be installed. Both libraries are listed in the requirements.txt file.
For Data Scraping -
- Run Medium_Scrapper_using_selenium.py
- Run WebScraper.py
- Run Medium_Search_URL_Scrapper.py
- Run WebScraper.py
- Combine the datasets into a single file named final_data.csv
Or you can download the data from this link: https://drive.google.com/file/d/1BrAguUjU6yU4In8iWx4-i37MBcK_gmqi/view
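The scraping scripts rely on Selenium and BeautifulSoup; the standard-library-only sketch below illustrates the core idea of a scraper like WebScraper.py, namely extracting links and visible text from a fetched page. The page string and class name here are purely illustrative.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets and visible text from an HTML page,
    roughly the information the scraping scripts extract."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every anchor tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep non-empty visible text fragments.
        if data.strip():
            self.text_parts.append(data.strip())

# Illustrative page; the real scripts fetch live pages with Selenium.
page = "<html><body><p>Kids science</p><a href='https://example.com/a1'>Article</a></body></html>"
parser = LinkExtractor()
parser.feed(page)
```

In the actual scripts, BeautifulSoup plays the role of `LinkExtractor` and Selenium drives the browser to render JavaScript-heavy pages first.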
- Create a new virtual environment with the command
virtualenv -p python3 venv
- A new folder called venv is created.
- Activate the virtual environment with the command
source venv/bin/activate
- The prompt will now be prefixed with (venv).
- Navigate to the Project folder in the path - /A-Search-Engine-for-Kids/helper_scripts/class_labelling_using_pattern.en
- Run the command
python data_content_labelling.py
- This script was created to classify data as Positive, Strongly Positive, Negative, or Strongly Negative. The input CSV file here is a basic data set limited to 1280 rows.
- The output of this script is the same input data set with an additional sentiment-score column appended.
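The labelling can be pictured as mapping a polarity score (such as the one pattern.en's `sentiment()` returns, in [-1, 1]) onto the four class labels. The 0.5 cutoffs below are illustrative; the actual thresholds used in data_content_labelling.py may differ.

```python
def polarity_to_label(polarity):
    """Map a polarity score in [-1, 1] to one of the four class
    labels. The 0.5 cutoffs are illustrative assumptions, not the
    script's actual thresholds."""
    if polarity >= 0.5:
        return "Strongly Positive"
    if polarity >= 0.0:
        return "Positive"
    if polarity <= -0.5:
        return "Strongly Negative"
    return "Negative"
```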
- Once the final_data.csv file is retrieved, save it in the same directory as web_content_classification.ipynb. Launch the notebook by entering the command
jupyter notebook
- This will open the notebook; all cells can be executed with Shift+Enter or via Cell > Run All.
Note: Since the data set is large (149 MB), it will take a while to see the results.
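Conceptually, the filtering step keeps only rows the classifier marks as safe. The sketch below shows that idea on a CSV snippet; the column name `predicted_label` and the label values are hypothetical and may not match the notebook's actual schema.

```python
import csv
import io

def filter_safe_rows(csv_text, label_column="predicted_label",
                     blocked=frozenset({"objectionable"})):
    """Drop rows whose predicted label marks them as objectionable.
    Column and label names are illustrative assumptions."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[label_column] not in blocked]

# Tiny illustrative data set standing in for the classified CSV.
rows = filter_safe_rows(
    "title,predicted_label\n"
    "Solar system,safe\n"
    "Gore site,objectionable\n"
)
```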
- To execute this step, load classification3.ipynb and topic modelling.ipynb in the Jupyter notebook and run them using Shift+Enter or Cell > Run All.
- This step takes as input the output of the Filtering objectionable content step. The input file is "whole_data.csv", found in the same directory as the classification.ipynb file.
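As a crude stand-in for what topic identification produces, the sketch below pulls the most frequent content words out of a document after dropping stopwords. The real notebook's method is more sophisticated; this only illustrates the kind of per-document keywords the step yields.

```python
from collections import Counter

# Small illustrative stopword list; real pipelines use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for"}

def top_terms(text, k=3):
    """Return the k most frequent non-stopword terms in the text,
    a crude stand-in for the notebook's topic identification."""
    tokens = [t for t in text.lower().split()
              if t.isalpha() and t not in STOPWORDS]
    return [term for term, _ in Counter(tokens).most_common(k)]

terms = top_terms("the solar system and the planets of the solar system", 2)
```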
- Create virtual environment:
virtualenv -p python3 venv
- Install Elasticsearch (anywhere other than the project folder); on macOS with Homebrew:
brew install elasticsearch
- Set up a virtual environment inside the app/ folder
virtualenv -p python3 venv
source venv/bin/activate
After the last command, (venv) appears in the terminal prompt.
- Open a second terminal window and start the Elasticsearch process in the background
brew services start elasticsearch
- Go to your Elasticsearch bin directory (e.g. /usr/local/bin) and run
`./elasticsearch` (or `.\elasticsearch` on Windows)
- Once elasticsearch is up and running, go to app/index/ and run,
python elastic_search_helper.py
This will start the Flask app, which can be viewed in the browser at:
http://localhost:5000
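The helper script's core job is turning a user query into an Elasticsearch search request. The sketch below builds a standard match-query body; the field name `content` is an assumption — elastic_search_helper.py may index pages under different field names.

```python
import json

def build_search_body(query, size=10):
    """Build a standard Elasticsearch match-query body.
    The 'content' field name is an illustrative assumption."""
    return {
        "size": size,                           # max hits to return
        "query": {"match": {"content": query}}, # full-text match
    }

body = build_search_body("dinosaurs for kids")
payload = json.dumps(body)  # what gets POSTed to the _search endpoint
```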
If you see an error while starting Elasticsearch such as
- "failed to obtain node locks, tried [[/usr/local/var/lib/elasticsearch .."
find and kill the stale Java process:
`ps aux | grep 'java'`
`kill -9 <PID>`
- Unable to locate Python 3.7 in PyCharm
If Anaconda is installed, find its path with
`which anaconda`
Copy that path into PyCharm's interpreter settings and select Python 3.7 or a similar version.
- Girish Tiwale
- Richa Nahar
- Sabiha Barlaskar
- Supritha Amudhu