Skip to content

rwaitman/naaccr-tumor-data

 
 

Repository files navigation

NAACCR Tumor Registry v18 Data in i2b2 etc.: Beta Testing

While various issues remain, KUMC put a version of this code in production in the Sep 2019 HERON Great Salt Lake release.


NAACCR File ETL: Getting Started

We suppose you have access to your NAACCR v18 file; for testing, you can use naaccr-xml-sample-v180-incidence-100.txt.

For integration testing, we use the H2 Database Engine, but when you are ready, you should configure access to Postgres, SQL Server, or Oracle as in section 3.4.2 Set Database Properties of the i2b2 installation guide.

db.username=SA
db.password=
db.driver=org.h2.Driver
db.url=jdbc:h2:file:/tmp/DB1;create=true

naaccr.flat-file=naaccr_xml_samples/naaccr-xml-sample-v180-incidence-100.txt
naaccr.tumor-table: TUMOR

i2b2.patient-mapping-query: select distinct MRN, patient_num from ...
# i2b2.schema: i2b2demodata
# i2b2.template-fact-table: i2b2demodata.observation_fact

Then, to create the TUMOR table (less PATID column) following draft PCORnet specifications:

$ java -jar build/libs/naaccr-tumor-data.jar tumor-table
INFO gpc.DBConfig - getting config from db.properties
...
INFO all_naaccr.DAT: layout NAACCR 18 Incidence
...
INFO inserted 100 records into TUMOR

Patient Mapping using custom SQL

The i2b2 patient_mapping table typically includes a crosswalk from MRN to patient_num, but the details seem to be somewhat idiosyncratic. For example, to use NAACCR item 2300 (medical record number) from v15, which is 11 characters starting at column 3606:

update naacr.tumor tr
set tr.patient_num = (
    select pm.patient_num
    from i2b2data.patient_mapping pm
    where pm.patient_ide_source = 'MRN'
    and pm.patient_ide = trim(leading ' ' from substr(observation_blob, 3606, 11))
    )
;
commit;

Given a script such as the above, use java -jar build/libs/naaccr-tumor-data.jar run patient_mapping.sql to update the patient_num column of the tumor table.

TODO: the tumor table should also have a PATID VARCHAR column (issue 48).

When building i2b2 facts, set --mrn-item=medicalRecordNumber correspondingly (using NAACCR XML ids) and specify in db.properties:

i2b2.patient-mapping-query=select distinct tr.MRN, pm.patient_num
  from i2b2data.patient_mapping pm
  join (select trim(leading ' ' from substr(observation_blob, 3606, 11)) as MRN from tumor) tr
  on tr.MRN = pm.patient_ide

naaccr-tumor-data.jar Usage Reference

The naaccr-tumor-data command is short for java -jar naaccr-tumor-data.jar.

Usage:
  naaccr-tumor-data tumor-table [--db=PF] [--task-id=ID]
  naaccr-tumor-data tumor-files [--db=PF] NAACCR_FILE...
  naaccr-tumor-data load-layouts [--db=PF] [--layout-table=T]
  naaccr-tumor-data facts  [--db=F] --upload-id=NNN [--obs-src=S] [--mrn-item=N] [--encounter-start=N]
  naaccr-tumor-data summary  [--db=F] [--task-id=ID]
  naaccr-tumor-data ontology [--db=F] [--table-name=N] [--version=V] [--task-hash=H] [--update-date=D] [--who-cache=D]
  naaccr-tumor-data import [--db=F] TABLE DATA META
  naaccr-tumor-data run SCRIPT [--db=F]
  naaccr-tumor-data query SQL [--db=F]

Options:
  tumor-table        load TUMOR table from flat file
  --task-id=ID       version / completion marker [default: task123]
  tumor-files        load NAACCR records into a (CLOB) column of a DB table
  load-layouts       load NAACCR layout data
  --layout-table=T   where to load layout data [default: LAYOUT]
  facts              build OBSERVATION_FACT_NNN table
  --upload-id=NNN    to fill in observation_fact.upload_id [default: 1]
  --obs-src=S        sourcesystem_cd to give to facts [default: tumor_registry@kumed.com]
  --mrn-item=N       NAACCR item to use for patient mapping [default: patientIdNumber]
  --encounter-start=N  encounter_num start [default: 2000000]
  summary            build NAACCR_EXTRACT_STATS table
  --db=PROPS         database properties file [default: db.properties]
  ontology           build NAACCR_ONTOLOGY table
  --table-name=T     ontology table name [default: NAACCR_ONTOLOGY]
  --version=NNN      ontology version [default: 180]
  --task-hash=H      ontology completion marker
  --update-date=D    ontology update_date in YYYY-MM-DD format
  --who-cache=DIR    where to find WHO oncology metadata
  import             import CSV
  TABLE              target table name
  DATA               CSV file
  META               W3C tabular data metadata (JSON)
  run                run SQL script
  query              run SQL query and write results to stdout in JSON

NAACCR Record Layout Version 18

The NAACCR_ETL process used at KUMC and other GPC sites to load tumor registry data into i2b2 is outdated by version 18 of the NAACCR standard.

ref:


Previous work: KUMC HERON i2b2 NAACCR ETL

please cite:


Previous work: GPC Breast Cancer Survey

cite:


Building on NAACCR XML from SEER

  • naaccr-xml - NAACCR XML reader in Java by F. Depry of IMS for SEER
    • first release: v0.5 (beta) Apr 20, 2015; v1.0 Feb 7, 2016
    • frequent release:
      • v6.6 Feb 6, 2020
      • ...
      • v5.4 Jun 13, 2019
      • v5.3 May 21, 2019
  • XML replaces flat file in 2020 (IOU citation)
  • also supports flat files
  • imsweb/layout has sections, codes, etc.
  • NAACCR XML WG meets alternate Fridays 11amET (e.g. Aug 2)

Coded concepts

Metadata for coded values is also work in progress.

  • HERON ETL was based on a NAACCR v12 MS Access DB that no longer seems to be maintained / published.
  • currently using a mix of:
    • LOINC answer lists (from v11 and v12)
    • well curated code-labels: naaccr NAACCR reader in R by N. Werth of PA Dept. of Health
  • hope to incorporate codes from imsweb/layout

Primary sites, morphologies from WHO

Maintained by the World Health Organization (WHO)

  • primary sites: e.g. C50 for Breast
    • i2b2 ontology support ported from HERON ETL
  • morphology: 9800/3 for Leukemia
    • morphology i2b2 ontology support TODO

SEER Site Recode

  • combines primary site and histology
  • e.g. 20010 for Lip
  • i2b2 ontology support ported from HERON ETL

site-specific factors

Obsolete in 2018, but to capture data from older cases...


Platform for v18: JVM, JDBC, H2 DB, tablesaw, groovy, (and luigi)

We are taking this opportunity to reconsider our platform and approach:

  • Portable to database engines other than Oracle
  • Explicit tracking of data flow to facilitate parallelism
  • Rich test data to facilitate development without access to private data
  • Separate repository from HERON EMR ETL to avoid information blocking friction

See CONTRIBUTING for details.

  • JDBC
    • lets us leverage the working knowledge of SQL in our community
      • HERON ETL: 30KLOC of SQL
    • portable: same JVM platform as i2b2
    • JDBC connectivity to datamarts
    • H2 for in-memory DB
  • groovy to fill in gaps where SQL is awkward, such as
    • iterating over columns or tables
    • tablesaw Dataframe library a la python pandas, Spark
    • difference from Java worthwhile? see CONTRIBUTING
  • luigi (optional)
    • luigi tasks preserve partial results

NAACCR Ontology for i2b2: Luigi Usage

The NAACCR_Ontology1 task creates a NAACCR_ONTOLOGY table:

$ luigi --module tumor_reg_tasks NAACCR_Ontology1
DEBUG: Checking if NAACCR_Ontology1(design_id=upper, naaccr_version=18, naaccr_ch10_bytes=3078052) is complete
15:48:09 INFO: ...status   PENDING
15:48:09 INFO: Running Worker with 1 processes
...
15:48:20 ===== Luigi Execution Summary =====
15:48:20
15:48:20 Scheduled 1 tasks of which:
15:48:20 * 1 ran successfully:
15:48:20     - 1 NAACCR_Ontology1(...)

If you're interested in luigi usage, see client.cfg for details. If not, see:

  • tumor_reg_ont.py
  • heron_load/tumor_item_value.csv, and
  • heron_load/naaccr_concepts_load.sql

Related work: OMOP / OHDSI Ongology Working Group


Wish list: missing data sentinels


NAACCR_PATIENTS, NAACCR_TUMORS, and NAACCR_OBSERVATIONS: Usage

Using tumor_reg_data.py, the NAACCR_Load task turns a NAACCR v18 flat file into tables for patients, tumors, and observations:

$ luigi --module tumor_reg_tasks NAACCR_Load
...
15:02:04 5890 INFO: Informed scheduler that task   NAACCR_Load_2019_08_20_tumor_registry_k_1306872023_225d26f0cd   has status   DONE

The tables are:

  • naaccr_patients, naaccr_tumors, naaccr_observations
  • observation_fact_NNNN, observation_fact_deid_NNNN
    • where NNNN is an upload_id

naaccr_patients and observation_fact_NNNN depend on an existing patient_mapping table. naaccr_tumors uses a reserved range of encounter_num.

TODO: publish generated notebook; smooth out the level of detail here vs. there.


Use case: GPC Breast Cancer survey

  • GENERATED_SQL for two i2b2 queries from GPC Breast cancer work

See test_data/bcNNN_generated.sql.


Toward Synthetic NAACCR Test Data

In test_data:

  • capture statistics of NAACCR data
  • synthesize test data with similar distributions

See test_data/data_char_sim.sql, test_data/tr_summary.py


Goal: data characterization, checks, charts

jupyter notebook of checks, charts

  • PCORNet CDM Emperical Data Characterization report
  • GPC Breast Cancer QA reports

Toward PCORNet CDM Integration

  • exploring FIELDS, VALUESETS a la PCORNET CDM for a TUMOR table
  • Another approach: i2b2 OBSERVATION_FACT -> PCORNet OBS_GEN.
    • crosswalk to LOINC (as of NAACCR v12): loinc-naaccr/loinc_naaccr.csv
      • on loinc-csvdb branch

See pcornet_cdm/ directory.


Goal: "data lake" ETL, multiple i2b2 fact tables

  • HERON ETL 2011: copy NAACCR flat file into DB with Oracle sqlldr
  • Spark approach:
    1. specify transformation from the NAACCR flat file to an i2b2 fact table
      • use PySpark to un-pivot / melt
    2. facts.write.jdbc(...oracle db...)
      • Spark it schedules work of transforming the flat file.
  • aim to use multiple fact tables in i2b2 1.7.09.
    • crc.properties set queryprocessor.multifacttable=true
    • in ontology table, set c_facttablecolumn=NAACCR_FACT.concept_cd

Related work: FHIR / HL7


About

using NAACCR tumor registry data in i2b2, PCORNet CDM

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 60.4%
  • Groovy 27.7%
  • PLSQL 7.6%
  • TSQL 1.9%
  • Makefile 1.3%
  • Shell 0.7%
  • Other 0.4%