mkln/remerge

REMERGE

Linking PATSTAT to Company databases

Michele Peruzzi, Georg Zachmann, Reinhilde Veugelers

Introduction

PATSTAT is a database published by the European Patent Office that contains information on millions of patents and patentees. Its usefulness is often limited by how little it records about the patentees, and linking it to company databases has historically been a manual task. This is due to:

  • its focus on the patent applications, not on the patent applicants or inventors,
  • missing classification of patentees into categories such as individuals, companies, other organizations,
  • missing basic information on patentees, such as their address or their name.

In addition, the large majority of companies in large company databases do not patent. Ultimately, just a few patentees should be matched to a relatively small number of companies. For these reasons, advanced matching algorithms have not been used, as they make comparisons using only the shared fields.

Remerge is a set of Python scripts for matching PATSTAT to company databases (in this case, Amadeus from Bureau van Dijk). It is not limited to comparisons between shared fields, and uses as much information as possible. A Lasso regression model is estimated on a hand-labeled training set and then applied to the whole data to obtain the estimated probability of each match.
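To make the modeling idea concrete, here is a minimal sketch of an L1-penalized (Lasso) logistic regression that turns pair-level features into match probabilities. It is not the code in this repository, which fits its model in R. The logistic link, the proximal-gradient solver, the two feature columns and all function names are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.01, lr=0.5, n_iter=2000):
    """Proximal gradient descent for L1-penalized logistic regression.

    The intercept b is not penalized; only the weights w are shrunk.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def predict_proba(X, w, b):
    # Estimated probability of a true match ("phat") for each candidate pair.
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


# Hypothetical training set: one row per hand-labeled candidate pair.
# Columns (assumed): name similarity in [0, 1], scaled geographic distance in [0, 1].
X_train = [[0.95, 0.1], [0.90, 0.2], [0.85, 0.1],
           [0.40, 0.8], [0.30, 0.9], [0.35, 0.7]]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = labeled as a true match

w, b = fit_l1_logistic(X_train, y_train)
phat = predict_proba(X_train, w, b)
print([round(p, 3) for p in phat])
```

The L1 penalty can set the weights of uninformative variables exactly to zero, which is useful when many candidate variables are generated and only some of them help to separate true matches from false ones.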

Procedure

Starting from cleaned and geocoded data:

  1. filter_companies.py For every PATSTAT name, computes Jaro-Winkler (JW) and Levenshtein (Lev) string distances, then outputs Union(top10jw, top10lev). It also includes the computation of geo-location and the separation of names and legal identifiers, and adds Amadeus variables for hand labeling later. This is the most resource-intensive part of the algorithm.

  2. extract_sample.py Loads the raw PATSTAT and Amadeus data and the candidate matches, then takes the dataset from the previous step and asks the user to find the true matches.

  3. remerge_sector_matrix.py (can be run before 2.) Calculates IPC-NAICS "similarity" from the unique exact matches. A unique exact match is, of all pairings between a PATSTAT name and a company, the only one in which the two names are the same. Most PATSTAT names have no exact match, and those with a unique one are even fewer.

  4. generate_vars.py and prepare_modelfit.py Generate some of the variables that are used by the Lasso regression.

  5. remerge_fitmodel_training.r Fits the Lasso regression model (calls some Python code). Loads R source code from regression_functions-modelmatrix.r.

  6. remerge_fitmodel_wholedata.py Applies the fitted model to the whole dataset and saves the results.

  7. remerge_persontable.py (optional) Takes the matching results and returns a table of patstat_id : phat : company_id, where patstat_id is the same as person_id in PATSTAT and phat is the estimated probability of a match. The resulting table can then be loaded into an SQL server.
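The candidate filter in step 1 can be sketched as follows. This is a simplified illustration, not the code in filter_companies.py: the function names, the toy company list and the prefix weight of 0.1 are assumptions. It ignores geo-location and legal identifiers, and it compares every pair of names, which would be too slow at real scale.

```python
def levenshtein(a, b):
    # Edit distance via dynamic programming, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def jaro(a, b):
    # Jaro similarity in [0, 1]; 1.0 means identical strings.
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, p=0.1):
    # Jaro similarity boosted for a shared prefix of up to 4 characters.
    j = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def candidate_matches(patstat_name, company_names, k=10):
    # Union of the k most similar names by JW and the k closest names by Lev.
    top_jw = sorted(company_names,
                    key=lambda c: -jaro_winkler(patstat_name, c))[:k]
    top_lev = sorted(company_names,
                     key=lambda c: levenshtein(patstat_name, c))[:k]
    return set(top_jw) | set(top_lev)


# Hypothetical company names, for illustration only.
companies = ["ACME ROBOTICS", "ACME ROBOTIC", "APEX ROBOTICS",
             "NORTHWIND TRADERS", "ACNE STUDIOS"]
print(candidate_matches("ACME ROBOTICS", companies, k=2))
```

Taking the union of the two rankings keeps a candidate that only one of the measures ranks highly, at the cost of returning up to 2k candidates per PATSTAT name.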

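The definition of a unique exact match in step 3 can be illustrated with a short sketch. It is not the code in remerge_sector_matrix.py: the tuple layout, the toy identifiers and the use of raw co-occurrence counts are assumptions, and the README does not say how the counts are turned into a "similarity".

```python
from collections import Counter, defaultdict


def unique_exact_matches(pairs):
    """Return {patstat_id: company_id} for PATSTAT names with exactly one
    candidate pairing in which the two names are identical.

    pairs: iterable of (patstat_id, patstat_name, company_id, company_name).
    """
    exact = defaultdict(list)
    for patstat_id, patstat_name, company_id, company_name in pairs:
        if patstat_name == company_name:
            exact[patstat_id].append(company_id)
    return {pid: cids[0] for pid, cids in exact.items() if len(cids) == 1}


def sector_cooccurrence(matches, ipc_of, naics_of):
    # Count how often each (IPC class, NAICS code) pair occurs among the matches.
    counts = Counter()
    for patstat_id, company_id in matches.items():
        counts[(ipc_of[patstat_id], naics_of[company_id])] += 1
    return counts


# Hypothetical candidate pairs, for illustration only.
pairs = [
    (1, "ACME ROBOTICS", "c1", "ACME ROBOTICS"),   # the only exact match for 1
    (1, "ACME ROBOTICS", "c2", "ACME ROBOTIC"),
    (2, "NORTHWIND", "c3", "NORTHWIND"),           # two exact matches: not unique
    (2, "NORTHWIND", "c4", "NORTHWIND"),
    (3, "APEX LABS", "c5", "APEX LABORATORIES"),   # no exact match
]
ipc_of = {1: "B25J", 2: "G06Q", 3: "A61K"}         # assumed IPC class per patentee
naics_of = {"c1": "333", "c2": "333", "c3": "425", "c4": "541", "c5": "325"}

matches = unique_exact_matches(pairs)
counts = sector_cooccurrence(matches, ipc_of, naics_of)
print(matches, dict(counts))
```

Only patentee 1 contributes to the counts here: patentee 2 has two exact matches, so neither is unique, and patentee 3 has none.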