NOSQL 4 USPTO

USPTO patent data pipeline for a NoSQL system

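Project focus

Work with USPTO big data, where more than 150,000 records are issued every year
Collect the files in parallel to speed up collection
Test and compare the queries most frequently run against USPTO data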
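The parallel collection mentioned above could be sketched as follows. This is a minimal sketch, not the code in collect_weekly_xml.py: the README does not say how the parallelism is implemented, so the thread pool, the `download_all` and `fetch_to_dir` names, and the `max_workers` default are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve


def fetch_to_dir(url, out_dir):
    """Download one URL into out_dir and return the local path (hypothetical helper)."""
    target = Path(out_dir) / url.rsplit("/", 1)[-1]
    urlretrieve(url, str(target))
    return target


def download_all(urls, out_dir, fetch=fetch_to_dir, max_workers=4):
    """Download every URL concurrently; results keep the order of urls.

    Assumption: threads are used because downloads are I/O bound.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch in worker threads and yields results in input order.
        return list(pool.map(lambda url: fetch(url, out_dir), urls))
```

The `fetch` parameter is injectable so the download step can be replaced in tests or swapped for another HTTP client.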

1. Data overview and collection

1.1. Data overview

USPTO data from January 2018 through June 18, 2018
162,238 records in total (about 17 GB)

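1.2. Data collection

Generating the file collection URLs

USPTO data created since 2002 can be downloaded as XML files
The Python script therefore collects the XML files by changing only the year (2002 or later) in the USPTO URL
As an experiment, this project collects only about six months of data, January through June 2018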
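The URL generation described above might look like the sketch below. The actual USPTO URL is not given in this README, so `URL_TEMPLATE` is a hypothetical placeholder, and the weekly Tuesday schedule is an assumption suggested only by the script name collect_weekly_xml.py.

```python
from datetime import date, timedelta

# Hypothetical template: the real USPTO path and file naming are not stated in this README.
URL_TEMPLATE = "https://example.org/uspto/{year}/weekly_{stamp}.zip"


def weekly_urls(start, end, weekday=1, template=URL_TEMPLATE):
    """Return one URL per week between start and end (inclusive).

    weekday follows date.weekday(): 0 is Monday, 1 is Tuesday (assumed release day).
    """
    if start.year < 2002:
        raise ValueError("XML files are only available from 2002 onward")
    # Move forward to the first requested weekday on or after start.
    current = start + timedelta(days=(weekday - start.weekday()) % 7)
    urls = []
    while current <= end:
        urls.append(template.format(year=current.year,
                                    stamp=current.strftime("%y%m%d")))
        current += timedelta(days=7)
    return urls


if __name__ == "__main__":
    # The period used in this project: January 1 to June 18, 2018.
    for url in weekly_urls(date(2018, 1, 1), date(2018, 6, 18)):
        print(url)
```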

Unzipping to obtain the final XML files

The files published by USPTO are zip archives
Each archive is therefore unzipped, and the data is finally collected as XML
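A minimal sketch of the unzip step, using only the standard library. The `extract_xml` name and the choice to skip non-XML archive members are assumptions, not the repository's actual behavior.

```python
import zipfile
from pathlib import Path


def extract_xml(zip_path, out_dir):
    """Extract only the .xml members of zip_path into out_dir and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Assumption: anything that is not an XML file is not needed downstream.
            if name.lower().endswith(".xml"):
                archive.extract(name, out_dir)
                extracted.append(out_dir / name)
    return extracted
```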

2. Data conversion and results

2.1. Data conversion: XML -> JSON

MongoDB stores documents in a JSON-like format (BSON), so the XML files must be converted to JSON
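One possible way to do this conversion with the standard library is sketched below. The mapping rules (attributes under `@attributes`, repeated tags collected into lists) and the tag names in the example are assumptions chosen for illustration; the README does not describe the mapping the project actually uses.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an XML element into JSON-compatible Python data."""
    node = {}
    if elem.attrib:
        node["@attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    children = list(elem)
    if not children:
        # Leaf element: plain text, unless attributes force a dict.
        if node:
            if text:
                node["#text"] = text
            return node
        return text
    # Note: text mixed in between child elements is ignored in this sketch.
    for child in children:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tag: collect the values into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


def xml_to_json(xml_text):
    """Parse one XML document and return it as a JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, ensure_ascii=False)


if __name__ == "__main__":
    sample = "<patent id='1'><title>Widget</title><claim>a</claim><claim>b</claim></patent>"
    print(xml_to_json(sample))
```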

2.2. Data results

Test query

title, assignee (=patent number), dates (priority, publication), legal status (patent application, granted patent), number of claims
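A test query over these fields could be built as a MongoDB filter document, as in the sketch below. The field names (`assignee`, `publication_date`, `number_of_claims`) are hypothetical, since the README does not show the stored document schema; `$gte` and `$lte` are standard MongoDB comparison operators.

```python
from datetime import datetime


# Hypothetical projection: return only the fields listed in the test query.
PROJECTION = {
    "_id": 0,
    "title": 1,
    "assignee": 1,
    "publication_date": 1,
    "number_of_claims": 1,
}


def build_patent_query(assignee=None, published_from=None,
                       published_to=None, min_claims=None):
    """Build a MongoDB filter document from optional search criteria."""
    query = {}
    if assignee is not None:
        query["assignee"] = assignee
    date_range = {}
    if published_from is not None:
        date_range["$gte"] = published_from
    if published_to is not None:
        date_range["$lte"] = published_to
    if date_range:
        query["publication_date"] = date_range
    if min_claims is not None:
        query["number_of_claims"] = {"$gte": min_claims}
    return query


if __name__ == "__main__":
    print(build_patent_query(published_from=datetime(2018, 1, 1),
                             published_to=datetime(2018, 6, 18),
                             min_claims=10))
```

With pymongo, the result would be passed as `collection.find(query, PROJECTION)`.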

Running steps

0. Environment

OS: Ubuntu 16.04.4 LTS
Script language: Python 3.6.4 :: Anaconda custom (64-bit)
Database: MongoDB shell version v3.6.5

1. Collect XML files from USPTO.

$ python collect_weekly_xml.py {$YEAR}

2. Insert JSON files into MongoDB.

$ python run_insert_mongo.py {$YEAR}
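The insert step could be sketched as batched writes, as below. This is an illustration under assumptions, not the contents of run_insert_mongo.py: the `insert_in_batches` name and the batch size are hypothetical, and `collection` is any object with an `insert_many(documents)` method, such as a pymongo collection.

```python
def insert_in_batches(collection, documents, batch_size=1000):
    """Insert documents in fixed-size batches and return the number inserted."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    inserted = 0
    for doc in documents:
        batch.append(doc)
        if len(batch) == batch_size:
            collection.insert_many(batch)
            inserted += len(batch)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted
```

With pymongo installed and a local MongoDB running, `collection` might be `MongoClient()["uspto"]["patents"]`, where the database and collection names are placeholders.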

5eo1ab/nosql4uspto
