Get Google Keywords

Preparation

Install the Python packages lxml, html5lib, and nltk. (Linux users also need to install python-dev and python-lxml with apt-get.)
```
  $ pip install lxml html5lib nltk
```

In the nltk.download() window, select and install the stopwords corpus (listed under Corpora).

  $ python
  >>> import nltk
  >>> nltk.download()

Configure settings

vi settings.py

TARGET: Either 'news' or 'blog'
QUERYLIST: List of queries
HTMLPATH: Path to save HTML files (paths should end with a slash)
KEYWORDPATH: Path to save keyword files, e.g. 'data/keywords/' (paths should end with a slash)
NCRAWLPAGES: Number of search pages to crawl from Google
DELIMS: Delimiters for parsing words in HTML page
TODAY: Date for analysis
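
As a rough illustration of the options above, a settings.py might look like the sketch below. The setting names, the two TARGET choices, and the two path values come from this README; every other value (the page count, the delimiter characters, and the YYYYMMDD date format) is an assumption made for the example, not the project's actual default.

```python
# settings.py (illustrative sketch; values marked "assumed" are not from the project)
from datetime import date

TARGET = 'news'                   # either 'news' or 'blog'
QUERYLIST = ['data', 'mining']    # list of query words
HTMLPATH = 'data/html/'           # must end with a slash
KEYWORDPATH = 'data/keywords/'    # must end with a slash
NCRAWLPAGES = 3                   # assumed value: number of Google result pages to crawl
DELIMS = ' ,.;:!?()"\'\n\t'       # assumed value: characters that separate words
TODAY = date.today().strftime('%Y%m%d')  # assumed format, matching names like data_mining-20120907.json
```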

Run

To get search results for the query "data mining", run

python main.py data mining

or set QUERYLIST=['data', 'mining'] in settings.py, and run

python main.py
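
The two ways of passing a query can be pictured with the sketch below: words given on the command line take precedence, and QUERYLIST is the fallback. This only illustrates the behaviour described above and is not the actual code in main.py. The function names `resolve_query` and `query_slug` are hypothetical, and joining words with an underscore is inferred from the `data_mining` names in the Results section.

```python
def resolve_query(argv, default_querylist):
    """Return the query words: command-line words if given, else the default list."""
    words = argv[1:]  # argv[0] is the script name
    return words if words else list(default_querylist)


def query_slug(words):
    """Join query words with underscores, e.g. ['data', 'mining'] -> 'data_mining'."""
    return '_'.join(words)


QUERYLIST = ['data', 'mining']  # stands in for the value in settings.py

# Equivalent of: python main.py data mining
print(query_slug(resolve_query(['main.py', 'data', 'mining'], QUERYLIST)))
# Equivalent of: python main.py (falls back to QUERYLIST)
print(query_slug(resolve_query(['main.py'], QUERYLIST)))
```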

Results

If HTMLPATH='data/html/' and KEYWORDPATH='data/keywords/' in settings.py, the search results and keywords are stored in the 'data' folder as shown below.

data/
    ├── html/
    │   ├── data_mining/
    │   └── data_mining-20120907.json
    └── keywords/
        └── keywords-data_mining.json

data/html/data_mining/: This folder contains the raw HTML files. File names are marked with a timestamp.

data/html/data_mining-20120907.json: This file contains the url, desc (description), crawled_time, page_no, and title fields extracted from the raw HTML files. Below is an example.

  [
    {
      "url": "http://smartdatacollective.com/timoelliott/101486/analytics-world-news-big-data-cool-3d-analytics", 
      "desc": "Themos Kalafatis has worked as a consultant for , Text Mining, Information Extraction and Data Quality for over a decade. More \u00bb ", 
      "crawled_time": "20120907_192648",
      "page_no": 1,
      "title": "Scary Big Data, Cool 3D Analytics and More"
    },
    ...
  ]
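
Since the results file is a plain JSON list of records, it can be read back with the standard json module. The sketch below is one possible way to do that. The helper names `load_results` and `titles_by_page` are hypothetical, and the demo writes a one-record sample (with a placeholder URL and description) to a temporary folder so that it runs without any crawled data.

```python
import json
import os
import tempfile


def load_results(path):
    """Load the list of search-result records from a results JSON file."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def titles_by_page(records):
    """Group result titles by their page_no field."""
    pages = {}
    for rec in records:
        pages.setdefault(rec['page_no'], []).append(rec['title'])
    return pages


# Hypothetical sample record with the same fields as the example above.
sample = [{
    'url': 'http://example.com/article',
    'desc': 'Sample description',
    'crawled_time': '20120907_192648',
    'page_no': 1,
    'title': 'Scary Big Data, Cool 3D Analytics and More',
}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data_mining-20120907.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f)
    records = load_results(path)

print(titles_by_page(records))
```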

data/keywords/keywords-data_mining.json: This file contains the most frequent keywords. An example is shown below.

  ["data", 23],
  ["mining", 19],
  ["analytics", 3],
  ["app", 3],
  ["big", 2],
  ["mayo", 2],
  ["companies", 2],
  ["3d", 2],
  ["ehr", 2],
  ["datamining", 2],
  ["partner", 2],
  ["nlp", 1],
  ["desktops", 1],
  ["office", 1],
  ["advisory", 1]
  ...
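
This README does not spell out how the counts are computed, so the sketch below is only a guess at the general idea: split text on delimiter characters, drop stopwords, and count what remains. The `count_keywords` function, the DELIMS value, and the tiny STOPWORDS set (a stand-in for the nltk stopwords corpus) are all assumptions for illustration.

```python
import re
from collections import Counter

DELIMS = ' ,.;:!?()"\'\n\t'  # assumed delimiter characters
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'for', 'is', 'in', 'to'}  # stand-in for nltk stopwords


def count_keywords(texts, delims=DELIMS, stopwords=STOPWORDS):
    """Return [word, count] pairs sorted by descending frequency."""
    # Build a character class that matches one or more delimiter characters.
    pattern = '[' + re.escape(delims) + ']+'
    counts = Counter()
    for text in texts:
        for word in re.split(pattern, text.lower()):
            if word and word not in stopwords:
                counts[word] += 1
    return [[word, n] for word, n in counts.most_common()]


print(count_keywords(['Data mining and data analytics', 'Mining the data']))
```

The output has the same [word, count] shape as the example file above, which makes it straightforward to write out with json.dump.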

Authors

2012 LG-SNU Smart TV Project Team (Created Sep. 2012)
