Patents Text Analysis

Flexible, configurable code to perform web-scraping and text-analysis on patents from Google Patents.

Introduction

The code has been developed for a project to study IBM's patents (see below), but has been made perfectly general-purpose. Cloning this repository, you can easily build and analyze your own dataset, choosing the assignee and the years of publication.

The IBM Patents dataset [link]

The code has been used to create the following dataset: IBM Patents. The dataset contains the metadata and text of ~60K patents assigned to IBM, through the years 2000-2019. The aim of this project was to find and study topics and trends in IBM technologies by performing text analysis. The plots shown below have been generated using just a small sample from this dataset, and serve as examples.

Installation

git clone https://github.com/FeLusiani/PatentsTextAnalysis.git
cd ./patents-nlp
pip install -r requirements.txt

The web-scraping is performed using Selenium. You will need to download the Chrome driver here.

Finally, make sure to configure the paths in conf.py.

Building the dataset

Configure the application paths and the scraping parameters in conf.py. In particular, the scraping parameters are:

Assignee's name
Years range
Language
Number of result pages to scrape per month
Number of patents per page

Then run build_dataset.py to build the dataset.

~/patents-nlp$ python build_dataset.py
usage: build_dataset.py [-h] [--delete] [--all] [--scrape] [--download] [--convert] [--clean] [--overwrite] [--multiproc]

scrape, download, convert and clean the data

optional arguments:
  -h, --help   show this help message and exit
  --delete     delete all data
  --all        execute whole pipeline
  --scrape     scrape metadata
  --merge      merge the scraped metadata
  --download   download pdfs using metadata
  --convert    convert pdfs to txts
  --clean      clean txts
  --overwrite  overwrite existing files
  --multiproc  use multiprocessing
  --verbose    show INFO logging

The pipeline for building the dataset is the following:

Using the search criteria listed above, metadata of the patents is scraped from Google Patents, saving each year in its own .csv file.
The metadata files are merged in a single .csv file.
The patents listed in the unified .csv file are downloaded (as .pdf files).
The .pdf files are converted to .txt by performing OCR with the tika python library.
The .txt files are cleaned using the nltk python library: stop-words and punctuation removal, lemmification of tokens.

Latent Semantic Analysis (LSA)

The code in lsa_example.ipynb shows how to perform a latent semantic analysis of the text corpus using the functions in the module LSA.

The code performs TFIDF on the text corpus, and then factorizes the words_documents matrix (using either SVD or NMF method), yielding the words- and documents- embedding of the latent topics.

The plot shows the top 15 topics found in the text corpus (in this case, a sample from IBM's patents published in the year 2019). For each topic, the top 4 words are shown.

Analyzing the trends of chosen topic

In the topics_trends.ipynb notebook, we perform an analysis of the trends of hand-made topics (sets of keywords) to study how their frequency changes through the years.

Medicine
 - Keywords: ['medical', 'patient', 'health', 'treatment']
ML
 - Keywords: ['neural', 'train', 'recognition', 'learn']
Aut. driving
 - Keywords: ['vehicle', 'autonomous', 'park']
Quantum comput.
 - Keywords: ['quantum']

Authors

Alessandro Marincioni - Maranc98
Federico Lusiani - FeLusiani

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
LSA		LSA
data_pipeline		data_pipeline
fasttext		fasttext
images		images
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build_dataset.py		build_dataset.py
conf.py		conf.py
lsa_example.ipynb		lsa_example.ipynb
requirements.txt		requirements.txt
topics_trends.ipynb		topics_trends.ipynb

License

FeLusiani/PatentsTextAnalysis

Folders and files

Latest commit

History

Repository files navigation

Patents Text Analysis

Introduction

The IBM Patents dataset [link]

Installation

Building the dataset

Latent Semantic Analysis (LSA)

Analyzing the trends of chosen topic

Authors

About

Resources

License

Stars

Watchers

Forks

Languages