iod-freebase-indexer

Indexes text datasets or custom entity extraction datasets from Freebase into IDOL OnDemand.

For more information and a list of all IDOL OnDemand APIs, create an account on idolondemand.com.

IDOL OnDemand offers many features, including the text indexing and analytics capabilities that this script focuses on.

###Context

This tool helps create Custom Entity Datasets from Freebase, as discussed in the following blog post.

It can also create custom search datasets with Wikipedia descriptions, which is useful, for example, for quickly recommending similar TV shows based on their descriptions.

###Prerequisites

You'll need an IDOL OnDemand API key, which you can get on idolondemand.com.

You'll also need a Freebase API key, which you can get through a Google developer account, here

###Install

Run pip install to install the dependencies (currently only one):

pip install -r requirements.txt

###Configuration

You'll need a config.json file (any other name works too) that holds your keys:

{
  "iodkey":"myapikey",
  "freebasekey":"yourfreebasekey"
}
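
As a rough illustration of how such a key file can be read and checked before any API call is made, here is a minimal sketch. The `load_keys` helper is hypothetical and is not taken from commandline.py or utils.py; only the two key names come from the example above.

```python
import json

# The two key names shown in the example config above.
REQUIRED_KEYS = ("iodkey", "freebasekey")


def load_keys(path):
    """Load the key file and fail early if a key is missing or empty.

    Hypothetical helper for illustration; the script may load its
    configuration differently.
    """
    with open(path) as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing keys in %s: %s" % (path, ", ".join(missing)))
    return config

# Example call, assuming the file exists:
#   keys = load_keys("configs/config.json")
```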

###Usage

python commandline.py --help

The configs directory in the repo contains some input configurations to get you started. For example, animalsindex.json will index animal data along with the corresponding Wikipedia descriptions:

python commandline.py --config configs/config.json --input configs/animalsindex.json

###Input Formats

{
  "iodindex":"animals",
  "freebasequery":[{
    "name": null,
    "mid":null,
    "/common/topic/alias": [],
    "type": "/biology/animal",
    "key": {
      "namespace": "/wikipedia/en_id",
      "value": null
    },
    "/biology/organism_classification/scientific_name": []
  }],
  "aliasfields":["alias","scientific_name"],
  "parametricfields":[],
  "type":"categories"
}

iodindex: the IDOL OnDemand index that you want to index into.
freebasequery: a Freebase MQL query. Create and test your own on http://www.freebase.com/query
type: categories or index. categories creates a "categorization" flavor index and a boolean rule for matching each element from Freebase. index creates a "standard" flavor index for search.
description: false by default. Set it to true and each document will be indexed along with its Wikipedia description. This is important when type is set to index.
aliasfields: for a categories type custom entity extraction, a list of extra fields to match against. Remember that these fields must be part of the freebasequery!
parametricfields: these fields are indexed as parametric type fields, meaning that the IDOL field_text query operator can be used to filter on their values.

Note: "mid":null is currently required in every freebasequery, as the mid is used as the unique id.
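
The rules above can be checked before a long indexing run starts. The following is a minimal sketch of such a check, not part of the script: `check_input` is a hypothetical helper, and it assumes that alias fields are named after the last path segment of the corresponding query key, as in the animals example.

```python
VALID_TYPES = ("categories", "index")


def check_input(spec):
    """Return a list of problems found in an input configuration dict.

    Hypothetical helper; it only encodes the rules listed above.
    """
    problems = []
    if not spec.get("iodindex"):
        problems.append("iodindex is missing")
    if spec.get("type") not in VALID_TYPES:
        problems.append("type must be 'categories' or 'index'")

    # The example wraps the MQL query in a one-element list.
    raw = spec.get("freebasequery")
    query = raw[0] if isinstance(raw, list) and raw else raw
    if not isinstance(query, dict):
        query = {}
    if "mid" not in query:
        problems.append('freebasequery must contain "mid": null')

    # Alias fields use short names, so compare against the last
    # path segment of each query key (an assumption, see above).
    available = {key.rsplit("/", 1)[-1] for key in query}
    for field in spec.get("aliasfields", []):
        if field not in available:
            problems.append("aliasfield %r is not in freebasequery" % field)
    return problems
```

An empty list means the configuration passed these checks; it says nothing about whether the MQL query itself is valid.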

###Example inputs

Categories/Entity Extraction datasets

animals.json : extract animal names based on their common aliases or scientific names
cars.json : extract car models, filter by brands etc.
legalcases.json : extract legal cases
drugs.json : extract drugs based on various denominations
celestialobjects.json : planets, asteroids, stars can be extracted

Many of these index several additional fields, which allows more refined extraction and returns useful information on a match.
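
To give an idea of what a boolean rule built from a name and its alias fields might look like, here is a small sketch. The `build_rule` function and the exact rule syntax (quoted phrases joined with OR) are assumptions made for illustration; the rules the script actually generates may differ.

```python
def build_rule(entity, aliasfields):
    """Build an OR rule from an entity's name and its alias field values.

    Hypothetical helper; the quoted-phrase OR syntax is an assumption.
    """
    terms = []
    name = entity.get("name")
    if name:
        terms.append(name)
    for field in aliasfields:
        value = entity.get(field) or []
        if isinstance(value, str):
            value = [value]
        terms.extend(value)

    # De-duplicate case-insensitively while preserving order,
    # then quote each term as a phrase.
    seen = set()
    phrases = []
    for term in terms:
        if term.lower() not in seen:
            seen.add(term.lower())
            phrases.append('"%s"' % term.replace('"', ""))
    return " OR ".join(phrases)


lion = {
    "name": "Lion",
    "alias": ["African lion"],
    "scientific_name": ["Panthera leo"],
}
print(build_rule(lion, ["alias", "scientific_name"]))
# "Lion" OR "African lion" OR "Panthera leo"
```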

Search Datasets

animalstext.json : index the descriptions for every animal on freebase for search and analytics

###Freebase field conversion

Full path fields keep only their final name:
"/biology/organism_classification/scientific_name": [] in the freebasequery results in a field called "scientific_name" being stored in each IDOL OnDemand document.

quickname:field will store its value as "quickname" only

    "defendant:parties": [{
      "optional": true,
      "role": "Defendant",
      "parties": []
    }],
    "plaintiff:parties": [{
      "optional": true,
      "role": "Plaintiff",
      "parties": []
    }],

The above will store the parties field for each role as defendant_parties and plaintiff_parties
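
The naming rules in this section can be summarised in a few lines. This is a sketch of the described behaviour, not the code in utils.py; `short_name` and `nested_name` are hypothetical names.

```python
def short_name(key):
    """Reduce a Freebase query key to the short field name described above."""
    if ":" in key:
        # "quickname:field" keeps only the prefix before the colon.
        return key.split(":", 1)[0]
    # "/a/b/c" keeps only the final path segment; plain names are unchanged.
    return key.rsplit("/", 1)[-1]


def nested_name(key, nested_key):
    """Name for a field nested under a prefixed key, joined with "_"."""
    return "%s_%s" % (short_name(key), short_name(nested_key))


print(short_name("/biology/organism_classification/scientific_name"))
# scientific_name
print(nested_name("defendant:parties", "parties"))
# defendant_parties
```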

###Other Functionalities

####Resuming

Some Freebase queries may return a lot of results. If the script can't run to completion, the --resume flag resumes the last run from its last indexing point (assuming --config and --input are set to the same values).
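
How the script records its progress is not described here. As a purely illustrative sketch of why --config and --input must match, a checkpoint could store both paths next to the last position and be ignored when either differs. The file layout and the function names below are assumptions.

```python
import json
import os


def save_checkpoint(path, config, input_file, position):
    """Record how far the run identified by (config, input_file) has got."""
    state = {"config": config, "input": input_file, "position": position}
    with open(path, "w") as handle:
        json.dump(state, handle)


def load_checkpoint(path, config, input_file):
    """Return the saved position, or 0 if no checkpoint matches this run."""
    if not os.path.exists(path):
        return 0
    with open(path) as handle:
        state = json.load(handle)
    # A checkpoint written for other arguments must not be reused.
    if state.get("config") != config or state.get("input") != input_file:
        return 0
    return state.get("position", 0)
```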

####Ensuring index freshness

The --delete flag will DELETE the target index and create a new one before data is indexed into it.
