GitHub - jamesdunham/cso-classifier: Script that classifies content from scientific papers with the topics of the Computer Science Ontology (CSO).

CSET demo

This demo fetches a set of CS paper titles, abstracts, and keywords from Web of Science BigQuery tables, and runs the CSO Classifier over them.

For documentation of the CSO Classifier by its authors, see the original repo.

Getting started

Clone this repo
pip install -r requirements.txt (Python >= 3.6)
python -m spacy download en_core_web_sm (for tokenization and stop words)
Download a keyfile for a service account with BigQuery access into the project root
Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of that keyfile, which will look like "GCP-CSET Projects-b314d1aa5d86.json"
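
Before running the pipeline, it can help to confirm that the credentials variable is set and points at a real file. The following is a minimal sketch, not part of this repo; the helper name check_credentials is hypothetical.

```python
import os


def check_credentials(env=None) -> str:
    """Return the keyfile path from GOOGLE_APPLICATION_CREDENTIALS, or raise.

    `check_credentials` is a hypothetical helper for illustration only;
    it is not defined anywhere in this repo.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"keyfile not found: {path}")
    return path
```

Calling check_credentials() with no arguments reads the current process environment; passing a dict makes the check easy to test.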

Use

Run 0-extract-inputs-from-bigquery.sh to copy Web of Science records in BigQuery to a Cloud Storage bucket
Run 1-copy-inputs-from-storage.sh to copy the bucket's contents to the disk (see "Input" section below)
Run python main.py (e.g., GOOGLE_APPLICATION_CREDENTIALS='gcp-cset-projects-b314d1aa5d86.json' python main.py) to write predictions to the disk (see "Output" section below)
Run 2-reshape-predictions.py to reshape the predictions for loading back into BigQuery (see "Reshaped output for BigQuery" below)
Run 3-copy-predictions-to-storage.sh to copy the reshaped predictions to the bucket
Run 4-create-bigquery-table.sh to create a BigQuery table for the predictions
Run 5-load-predictions-into-bigquery.sh to insert the predictions into the new table

Input

By default, the classifier operates on the input files .data/*.jsonl and, for each one, writes predictions to .data/cset-predictions-*.jsonl.

Lines in the input file look like this, where each field is a string:

{"id": "{unique id}", "title": "{title}", "keywords": "{keywords}", "abstract_text": "{abstract}" }
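
As an illustration of the input format above, here is a minimal sketch that parses one input line and checks that the four fields are present and are strings. The names REQUIRED_FIELDS and parse_input_line are hypothetical and do not come from this repo.

```python
import json

# Field names taken from the input format shown above.
REQUIRED_FIELDS = ("id", "title", "keywords", "abstract_text")


def parse_input_line(line: str) -> dict:
    """Parse one JSONL input line and verify each required field is a string.

    Hypothetical helper for illustration; main.py may validate differently
    or not at all.
    """
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"field {field!r} is missing or not a string")
    return record
```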

Output

A single line in the output file (reformatted for exposition) looks like this:

{
  "syntactic": [
    "flight test",
    "computer hardware"
  ],
  "semantic": [
    "flight test",
    "computer hardware",
    "hardware components",
    "hardware design",
    "hardware resources"
  ],
  "enhanced": {
    "field programmable gate arrays (fpga)": [
      {
        "matched": 2,
        "broader of": [
          "hardware design",
          "hardware resources"
        ]
      }
    ],
    "flight control systems": [
      {
        "matched": 1,
        "broader of": [
          "flight test"
        ]
      }
    ],
    "computer science": [
      {
        "matched": 4,
        "broader of": [
          "computer hardware",
          "hardware",
          "computer system",
          "computer systems"
        ]
      }
    ],
    "computer hardware": [
      {
        "matched": 8,
        "broader of": [
          "hardware components",
          "fpga",
          "field programmable gate arrays",
          "field-programmable gate arrays",
          "field-programmable gate array (fpga)",
          "field programmable gate array (fpga)",
          "field programmable gate arrays (fpga)",
          "field programmable gate array"
        ]
      }
    ],
    "embedded systems": [
      {
        "matched": 1,
        "broader of": [
          "hardware components"
        ]
      }
    ],
    "control systems": [
      {
        "matched": 2,
        "broader of": [
          "flight control systems",
          "flight control system"
        ]
      }
    ],
    "computer systems": [
      {
        "matched": 4,
        "broader of": [
          "embedded system",
          "embedded systems",
          "control systems",
          "control system"
        ]
      }
    ]
  },
  "id": "WOS:000272834900082"
}
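
To show how the enhanced object above is structured, the following sketch lists each enhanced term with its matched count, largest first. It assumes only the shape shown in the example (each term maps to a list of objects with matched and broader of keys); the function name enhanced_terms_by_support is hypothetical.

```python
def enhanced_terms_by_support(prediction: dict) -> list:
    """Return (term, matched) pairs from `enhanced`, highest count first.

    Hypothetical helper; assumes the output shape shown in the example above.
    """
    rows = []
    for term, entries in prediction.get("enhanced", {}).items():
        # Each term maps to a list of objects, one per match record.
        for entry in entries:
            rows.append((term, entry["matched"]))
    # sorted() is stable, so ties keep their original order.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```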

Reshaped output for BigQuery

BigQuery records can't have arbitrary keys, so before loading the predictions we flatten enhanced into an array. Each array element is an object whose term key holds the former dictionary key, and "broader of" is renamed to "broader". See prediction-schema.json for the BigQuery schema. The same example as above, reshaped for BigQuery (and indented for readability):

{
  "syntactic": [
    "flight test",
    "computer hardware"
  ],
  "semantic": [
    "flight test",
    "computer hardware",
    "hardware components",
    "hardware design",
    "hardware resources"
  ],
  "enhanced": [
    {
      "term": "field programmable gate arrays (fpga)",
      "matched": 2,
      "broader": [
        "hardware design",
        "hardware resources"
      ]
    },
    {
      "term": "flight control systems",
      "matched": 1,
      "broader": [
        "flight test"
      ]
    },
    {
      "term": "computer science",
      "matched": 4,
      "broader": [
        "computer hardware",
        "hardware",
        "computer system",
        "computer systems"
      ]
    },
    {
      "term": "computer hardware",
      "matched": 8,
      "broader": [
        "hardware components",
        "fpga",
        "field programmable gate arrays",
        "field-programmable gate arrays",
        "field-programmable gate array (fpga)",
        "field programmable gate array (fpga)",
        "field programmable gate arrays (fpga)",
        "field programmable gate array"
      ]
    },
    {
      "term": "embedded systems",
      "matched": 1,
      "broader": [
        "hardware components"
      ]
    },
    {
      "term": "control systems",
      "matched": 2,
      "broader": [
        "flight control systems",
        "flight control system"
      ]
    },
    {
      "term": "computer systems",
      "matched": 4,
      "broader": [
        "embedded system",
        "embedded systems",
        "control systems",
        "control system"
      ]
    }
  ],
  "id": "WOS:000272834900082"
}
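
The reshaping illustrated by the two examples can be sketched as follows. This is an illustration inferred from the before and after examples, not the code in 2-reshape-predictions.py; the function name reshape_prediction is hypothetical.

```python
import json


def reshape_prediction(prediction: dict) -> dict:
    """Flatten `enhanced` from a dict keyed by term into a list of objects.

    Hypothetical sketch inferred from the examples above; the actual
    2-reshape-predictions.py script may differ.
    """
    enhanced = []
    for term, entries in prediction.get("enhanced", {}).items():
        for entry in entries:
            enhanced.append({
                "term": term,                   # former dictionary key
                "matched": entry["matched"],
                "broader": entry["broader of"],  # renamed for BigQuery
            })
    reshaped = dict(prediction)  # shallow copy; other fields pass through
    reshaped["enhanced"] = enhanced
    return reshaped


if __name__ == "__main__":
    example = {
        "syntactic": ["flight test"],
        "semantic": ["flight test"],
        "enhanced": {
            "flight control systems": [
                {"matched": 1, "broader of": ["flight test"]}
            ]
        },
        "id": "WOS:000272834900082",
    }
    # One JSON object per line, as expected for newline-delimited JSON.
    print(json.dumps(reshape_prediction(example)))
```

The function returns a new dictionary and leaves its input unmodified, so the original predictions remain available for comparison.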
