Skip to content

ithanks/google-keywords

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Get Google Keywords

Preparation

  • Install python packages lxml, html5lib, nltk (Linux users need to apt-get python-dev, python-lxml as well.)

      $ pip install lxml html5lib nltk
    
  • From nltk.download() select and install corpus/stopwords

      $ python
      >>> import nltk
      >>> nltk.download()
    

Configure settings

vi settings.py
  • TARGET: Either 'news' or 'blog'
  • QUERYLIST: List of queries
  • HTMLPATH: Path to save html files. (Paths should end with a slash)
  • KEYWORDPATH = 'data/keywords/'
  • NCRAWLPAGES: Number of search pages to crawl from Google
  • DELIMS: Delimiters for parsing words in HTML page
  • TODAY: Date for analysis

Run

In order to get search results for data mining, run

python main.py data mining

or set QUERYLIST=['data', 'mining'] in settings.py, and run

python main.py

Results

If HTMLPATH='data/html/' and KEYWORDPATH='data/keywords/ in settings.py, the search results and keywords are stored in the 'data' folder as below.

data/
    ├── html/
    │   ├── data_mining/
    │   └── data_mining-20120907.json
    └── keywords/
        └── keywords-data_mining.json
  • data/html/data_mining/: This folder contains the raw HTML files. File names are marked with a timestamp.

  • data/html/data_mining-20120907.json: This file contains th url, desc(description), crawled_time, title extracted from the raw HTML files. Below is an example.

      [
        {
          "url": "http://smartdatacollective.com/timoelliott/101486/analytics-world-news-big-data-cool-3d-analytics", 
          "desc": "Themos Kalafatis has worked as a consultant for , Text Mining, Information Extraction and Data Quality for over a decade. More \u00bb ", 
          "crawled_time": "20120907_192648",
          "page_no": 1,
          "title": "Scary Big Data, Cool 3D Analytics and More"
        },
        ...
      ]
    
  • data/keywords/keywords-data_mining.json: This file contains the most frequent keywords. An example is shown below.

      ["data", 23],
      ["mining", 19],
      ["analytics", 3],
      ["app", 3],
      ["big", 2],
      ["mayo", 2],
      ["companies", 2],
      ["3d", 2],
      ["ehr", 2],
      ["datamining", 2],
      ["partner", 2],
      ["nlp", 1],
      ["desktops", 1],
      ["office", 1],
      ["advisory", 1]
      ...
    

Authors

2012 LG-SNU Smart TV Project Team (Created Sep. 2012)

About

Get keywords from Google blog, news search.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%