Extraction of Clinical Trials Biomarkers
haukka helps curators search clinical trials at http://clinicaltrials.gov for cancer biomarkers (genes and alterations) and curate them for future reference. haukka started as a GSoC 2015 project for the OncoBlocks organization.
pyhaukka depends on:
This project uses Vagrant to manage a VirtualBox instance for development. Start by installing both tools, then bring up the ubuntu/trusty64 instance. Provisioning also installs the required packages (PostgreSQL 9.4, redis, Python 2.7, pip and virtualenv) and creates the required setup on the VirtualBox image.
$ vagrant up
$ vagrant ssh
...
Host: 127.0.0.1
Port: 2222
Username: vagrant
...
After the instance is up and running, SSH to the mentioned port and then:
$ cd /vagrant/
$ py.test tests
Note that the directory /vagrant/ inside the virtual box is synced with the project folder on the host (this repository).
- Provide a training corpus in the nltk_data/corpora/clinical_trials directory. It should look something like this:
Crizotinib in Pretreated Metastatic Non-small-cell Lung Cancer With [MET] Amplification.
within the same liver segment as long as the dose constraints to normal tissue can be met.
This is a Phase II, open-label, non-randomized, multi-center study of oral Dabrafenib in
combination with oral Trametinib in subjects with rare cancers including anaplastic thyroid
cancer, biliary tract cancer, gastrointestinal stromal tumor, non-seminomatous germ cell
tumor/non-geminomatous germ cell tumor, hairy cell leukemia, World Health Organization (WHO)
Grade 1 or 2 glioma, WHO Grade 3 or 4 (high-grade) glioma, multiple myeloma, and
adenocarcinoma of the small intestine, with [BRAF V600E] positive-mutations. This study is
designed to determine the overall response rate (ORR) of oral Dabrafenib in combination with
oral Trametinib in subjects with rare [BRAF V600E] mutated cancers. Subjects will need to have
a fresh or frozen tumor tissue sample provided to confirm the [BRAF V600E] mutation status.
Only subjects with histologically confirmed advanced disease and no available standard
treatment options will be eligible for enrollment. Subjects will undergo screening
assessments within 14 days (up to 35 days for ophthalmology exam, echocardiogram or disease
assessments) prior to the start of treatment to determine their eligibility for enrollment
in the study.
Note the brackets around the words that should be tagged as biomarkers.
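The bracket convention can be turned into per-word training labels. The following is a minimal sketch, not the project's actual corpus reader: the function name tag_bracketed and the decision to drop punctuation are assumptions made for illustration, and the BIO / O tag names are taken from the feature description later in this README.

```python
import re

# Matches either a bracketed span "[...]" or a single run of word characters.
TOKEN_RE = re.compile(r"\[([^\]]+)\]|(\w+)")


def tag_bracketed(text):
    """Turn bracket-annotated text into (word, tag) pairs.

    Words inside [...] get the tag "BIO"; all other words get "O".
    Punctuation is dropped, which is a simplification for this sketch.
    """
    tagged = []
    for match in TOKEN_RE.finditer(text):
        bracketed, plain = match.groups()
        if bracketed is not None:
            # A bracketed span may hold several words, e.g. "BRAF V600E".
            tagged.extend((word, "BIO") for word in bracketed.split())
        else:
            tagged.append((plain, "O"))
    return tagged


sample = "rare [BRAF V600E] mutated cancers"
print(tag_bracketed(sample))
```

Each word of a multi-word bracketed span is tagged separately, which matches the per-word classification described in the features section.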
- Train a classifier and store it
python manage.py train
- After this step a classifier binary should be stored in nltk_data/classifiers/biomarker_classifier.pickle and will be picked up by default by the RESTful webservice.
- Start Celery worker
Before the webservice can dispatch requests, a background worker needs to be started.
celery worker -A pyhaukka.worker
Note: on the provided Vagrant instance a redis-server is installed and started using the default configuration.
- Run a WSGI server
Either run the development WSGI server for debugging:
$ python wsgi.py
Or, using uWSGI:
$ uwsgi uwsgi.ini
Resource URL | HTTP Verb | Functionality |
---|---|---|
/trials | GET | Get list of stored trials |
/trials/nct_id | GET | Retrieve a single trial detail |
/tasks | POST | Post a new task to load and process a trial |
/tasks/task_id | GET | Retrieve task result and status |
- A request to process a single trial can be handed to the background worker by sending a POST request to the /tasks endpoint with the URL of the clinical trial, e.g.:
POST /tasks
{"url":"https://clinicaltrials.gov/show/NCT02034110?displayxml=true"}
This will enqueue the clinical trial to be fetched, and then run the classifier on it. To see the results, use the /trials endpoint.
- To query the result and status of a task, use the /tasks/<task_id> endpoint:
GET /tasks/9ea876aa-17c6-493d-8178-461bfd330a80
{
"result": "NCT02034110",
"state": "SUCCESS",
"status": "Trial processed!"
}
- GET requests to /trials list all the currently stored and processed trials.
- A GET request to a specific /trials/<NCT_ID> fetches the stored trial data as well as the results of running the classifier:
GET /trials/NCT02034110
{
"nct_id":"NCT02034110",
"ner_result":[["BRAF","BIO"],["V600E","BIO"],["BRAF","BIO"],["V600E","BIO"],["ORR","BIO"],["BRAF","BIO"],["V600E","BIO"],["BRAF","BIO"],["V600E","BIO"],["ECOG","BIO"],["BRAF","BIO"],["V600E","BIO"],["BRAF","BIO"],["CLIA","BIO"],["BRAF","BIO"],["BRAF","BIO"],["CLIA","BIO"],["BRAF","BIO"],["GSK","BIO"],["FFPE","BIO"],["GSK","BIO"],["BRAF","BIO"],["MEK","BIO"],["ASCT","BIO"],["ABMT","BIO"],["PBSCT","BIO"],["GSK","BIO"],["MRI","BIO"],["GSK","BIO"],["CNS","BIO"],["RVO","BIO"],["CSR","BIO"],["RVO","BIO"],["CSR","BIO"],["RVO","BIO"],["CSR","BIO"],["RVO","BIO"],["CSR","BIO"],["NYHA","BIO"],["LVEF","BIO"],["LLN","BIO"],["LLN","BIO"],["LVEF","BIO"],["INR","BIO"],["HBV","BIO"],["HCV","BIO"],["RNA","BIO"],["HIV","BIO"],["BRAF","BIO"],["V600E","BIO"]],
"processed_on":null,
"trial":
{"brief_summary":"\n This is a Phase II, open-label, non-randomized, multi-center study of oral Dabrafenib in\n combination with oral Trametinib in subjects with rare cancers including anaplastic thyroid\n cancer, biliary tract cancer, gastrointestinal stromal tumor, non-seminomatous germ cell\n tumor/non-geminomatous germ cell tumor, hairy cell leukemia, World Health Organization (WHO)\n Grade 1 or 2 glioma, WHO Grade 3 or 4 (high-grade) glioma, multiple myeloma, and\n adenocarcinoma of the small intestine, with BRAF V600E positive-mutations. This study is\n designed to determine the overall response rate (ORR) of oral Dabrafenib in combination with\n oral Trametinib in subjects with rare BRAF V600E mutated cancers. Subjects will need to have\n a fresh or frozen tumor tissue sample provided to confirm the BRAF V600E mutation status.\n Only subjects with histologically confirmed advanced disease and no available standard\n treatment options will be eligible for enrollment. Subjects will undergo screening\n assessments within 14 days (up to 35 days for ophthalmology exam, echocardiogram or disease\n assessments) prior to the start of treatment to determine their eligibility for enrollment\n in the study.\n ","condition":["Cancer"],"criteria":"\n Inclusion Criteria:\n\n - Signed, written informed consent.\n\n - Sex: male or female.\n\n - Age: >=18 years of age at the time of providing informed consent.\n\n - Eastern Cooperative Oncology Group (ECOG) performance status: 0, 1 or 2.\n\n - BRAF V600E mutation-positive tumor: Local testing - Local BRAF mutation test results\n obtained by a Clinical Laboratory Improvement Amendments (CLIA) approved local\n laboratory may be used to permit enrollment of subjects with positive results. 
Local\n BRAF mutation test results will be subject to central verification; Central testing -\n Local BRAF mutation test results will be confirmed by central testing in a CLIA\n approved, designated central reference laboratory by the THxID BRAF assay or an\n alternate GSK designated assay. NOTE: For central testing, Formalin-fixed\n paraffin-embedded (FFPE) core bone marrow (BM) biopsies are not acceptable from\n subjects in the Multiple myeloma (MM) cohort.\n\n - Able to swallow and retain orally administered medication. NOTE: Subject should not\n have any clinically significant gastrointestinal (GI) abnormalities that may alter\n absorption such as malabsorption syndrome or major resection of the stomach or\n bowels. For example, subjects should have no more than 50% of the large intestine\n removed and no sign of malabsorption (i.e., diarrhea).NOTE: If clarification is\n needed as to whether a condition will significantly affect the absorption of study\n treatments, contact the GSK Medical Monitor.\n\n - Female Subjects of Childbearing Potential: Subjects must have a negative serum\n pregnancy test within 7 days prior to the first dose of study treatment and agrees to\n use effective contraception, throughout the treatment period and for 4 months after\n the last dose of study treatment.\n\n - French subjects: In France, a subject will be eligible for inclusion in this study\n only if either affiliated to or a beneficiary of a social security category.\n\n Exclusion Criteria:\n\n - Prior treatment with: BRAF and/or MEK inhibitor(s); anti-cancer therapy (e.g.,\n chemotherapy with delayed toxicity, immunotherapy, biologic therapy or\n chemoradiation) within 21 days (or within 42 days if prior nitrosourea or mitomycin C\n containing therapy) prior to enrollment and/or daily or weekly chemotherapy without\n the potential for delayed toxicity within 14 days prior to enrolment; Investigational\n drug(s) within 30 days or 5 half-lives, whichever is longer, prior to 
enrollment\n\n - Previous major surgery within 21 days prior to enrollment.\n\n - Prior extensive radiotherapy treatment within 21 days prior to enrolment. NOTE:\n Limited radiotherapy for palliative care is permitted within 14 days prior to\n enrollment as long as any radiation-related toxicity has resolved prior to\n enrollment.\n\n - Prior solid organ transplantation or allogenic stem cell transplantation (ASCT).\n NOTE: Previous autologous bone marrow transplant (ABMT) or autologous peripheral\n blood stem cell transplant (PBSCT) is permitted.\n\n - History of: Interstitial lung disease or pneumonitis; Another malignancy. NOTE:\n Subjects with another malignancy are eligible if: (a) disease-free for 3 years, (b)\n had a history of completely resected non-melanoma skin cancer, and/or (c) have a\n indolent second malignancy(ies) defined as a slow growing second/concurrent\n malignancy which is characterized by slow growth, a high initial response rate and a\n relapsing , progressive disease course. For example, a previously untreated low grade\n and select intermediate-grade lymphoid malignancy would be allowed as per the\n available body of evidence. There are no available clinical alternatives to the\n proposed population. Consult a GSK Medical Monitor if unsure whether second\n malignancies meet requirements specified above.\n\n - Presence of: cerebral metastases (except for subjects in the WHO Grade 1 or 2 Glioma\n or WHO Grade 3 or 4 Glioma histology cohorts). 
NOTE: Subjects with brain metastases\n may be included if: All known lesions have been previously treated with surgery or\n stereotactic radiosurgery, and Any remaining cerebral lesion(s) are asymptomatic and\n confirmed stable disease (i.e., no increase in lesion size) for >=90 days prior to\n enrollment as documented by two consecutive magnetic resonance imaging (MRI) or\n computed tomography (CT) scans with contrast, and No treatment with corticosteroids\n or enzyme-inducing anticonvulsants required for >=30 days prior to enrolment.\n Approval received from GSK Medical Monitor.\n\n - Presence of symptomatic or untreated leptomeningeal or spinal cord compression. NOTE:\n Subjects who have been previously treated for these conditions and have stable\n central nervous system (CNS) disease (documented by consecutive imaging studies) for\n >60 days, are asymptomatic and currently not taking corticosteroids, or have been on\n a stable dose of corticosteroids for at least 30 days prior to enrollment, are\n permitted.\n\n - Presence of pre-existing >= Grade 2 peripheral neuropathy.\n\n - Presence of unresolved treatment-related toxicity of >= Grade 2 (except alopecia) or\n toxicities listed in the general and histology-specific adequate organ function\n tables at the time of enrolment.\n\n - Presence of any serious and/or unstable pre-existing medical disorder (aside from\n malignancy exception above), psychiatric disorder, or other conditions that could\n interfere with subject's safety, obtaining informed consent or compliance to the\n study procedures.\n\n - History or current evidence/risk of retinal vein occlusion (RVO) or central serous\n retinopathy (CSR): History of RVO or CSR, or predisposing factors to RVO or CSR\n (e.g., uncontrolled glaucoma or ocular hypertension, uncontrolled systemic disease\n such as hypertension or diabetes mellitus, or history of hyperviscosity or\n hypercoagulability syndromes); Visible retinal pathology as assessed by ophthalmic\n 
examination that is considered a risk factor for RVO or CSR such as evidence of new\n optic disc cupping, evidence of new visual field defects and intraocular pressure >21\n mmHg.\n\n - History or evidence of cardiovascular risk including any of the following: Acute\n coronary syndromes (including myocardial infarction and unstable angina), coronary\n angioplasty, or stenting within 6 months prior to enrolment; Clinically significant\n uncontrolled arrhythmias NOTE: Subjects with controlled atrial fibrillation for >30\n days prior to enrollment are eligible; Class II or higher congestive heart failure as\n defined by the New York Heart Association (NYHA) criteria; Left ventricular ejection\n fraction (LVEF) below the institutional lower limit of normal (LLN). NOTE: If a LLN\n does not exist at an institution, then use LVEF <50%.; Corrected QT (QTc) interval\n for heart rate using Bazett-corrected QT interval (QTcB) >=480 millisecond (msec);\n Intracardiac defibrillator and/or permanent pacemaker; Treatment-refractory\n hypertension defined as a blood pressure (BP) >140/90 millimeters of mercury (mmHg)\n which may not be controlled by anti-hypertensive medication(s) and/or lifestyle\n modifications; Known cardiac metastases.\n\n - Current use of prohibited medication(s) or requirement of prohibited medications\n during study. NOTE: Use of anticoagulants such as warfarin is permitted; however,\n international normalization ratio (INR) must be monitored according with local\n institutional practice.\n\n - Positive for: Hepatitis B surface antigen or Hepatitis C antibody. NOTE: Subjects\n with laboratory evidence of cleared hepatitis B virus (HBV) and hepatitis C virus\n (HCV) infection will be permitted. 
NOTE: False positive subjects may be cleared for\n enrollment based on RNA-based assays; Human immunodeficiency virus (HIV); testing not\n required.\n\n - Known immediate or delayed hypersensitivity reaction or idiosyncrasy to drugs\n chemically related to study treatment, or excipients, or to dimethyl sulfoxide and/or\n sulfonamides (structural component of dabrafenib).\n\n - Female subjects: Pregnant, lactating or actively breastfeeding.\n\n - Subjects enrolled in France: The French subject has participated in any study using\n an investigational product (IP) within 30 days prior to enrollment in this study.\n ",
"detailed_description":null,
"keywords":["trametinib","Dabrafenib","solid tumors","BRAF V600E mutation","efficacy","safety"],
"lastchanged_date":"July 23, 2015",
"location":["United States","Austria","Belgium","Canada","Denmark","France","Germany","Italy","Korea, Republic of","Netherlands","Norway","Sweden"],
"nct_id":"NCT02034110",
"overall_status":"Recruiting",
"title":"A Phase II, Open-label, Study in Subjects With BRAF V600E-Mutated Rare Cancers With Several Histologies to Investigate the Clinical Efficacy and Safety of the Combination Therapy of Dabrafenib and Trametinib"}
}
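The request flow above (POST a task, then poll its status) can be sketched with the Python standard library. This is an illustration under assumptions, not the project's client: BASE_URL is a hypothetical address for a locally running server, the helper names are invented here, and sending the body with a JSON content type is a guess based on the example payload. The sketch targets Python 3, while the Vagrant instance itself provides Python 2.7.

```python
import json
from urllib.request import Request

# Assumption: address of a locally running development server.
BASE_URL = "http://localhost:5000"


def build_task_request(trial_url, base_url=BASE_URL):
    """Build (but do not send) the POST /tasks request for one trial."""
    body = json.dumps({"url": trial_url}).encode("utf-8")
    return Request(
        base_url + "/tasks",
        data=body,
        # Assumption: the service accepts a JSON request body.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_status_request(task_id, base_url=BASE_URL):
    """Build the GET /tasks/<task_id> request used to poll a task."""
    return Request(base_url + "/tasks/" + task_id, method="GET")


def is_finished(status_payload):
    """True once the decoded status JSON reports the SUCCESS state."""
    return status_payload.get("state") == "SUCCESS"


request = build_task_request(
    "https://clinicaltrials.gov/show/NCT02034110?displayxml=true"
)
print(request.get_method(), request.full_url)
```

The requests are only constructed here; passing one to urllib.request.urlopen would actually send it to a running server.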
Trial data is stored in a PostgreSQL database for later retrieval by search queries.
Attribute | Description |
---|---|
nctid | e.g. NCT02034110 |
url | e.g. https://clinicaltrials.gov/show/NCT02034110?displayxml=true |
trial_data | JSON dictionary of clinical trial data (1) |
overall_status | Overall status as read from the clinical trial XML |
ner_result | JSON dictionary of extracted biomarkers (1) |
lastchanged_date | Last-changed date of the clinical trial, as found in the XML |
loaded_on | When the trial was loaded |

(1) The JSON for trial_data and ner_result is similar to the examples shown earlier.
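Because the same word can be tagged many times in one trial, a curator may want a per-trial summary of ner_result. The sketch below assumes the shape seen in the earlier example response, a list of [word, tag] pairs; the helper name count_biomarkers is hypothetical and not part of the project.

```python
from collections import Counter


def count_biomarkers(ner_result, tag="BIO"):
    """Count how often each word carrying `tag` occurs in a ner_result list."""
    return Counter(word for word, word_tag in ner_result if word_tag == tag)


# A shortened ner_result in the same shape as the example response.
ner_result = [["BRAF", "BIO"], ["V600E", "BIO"], ["BRAF", "BIO"], ["ORR", "BIO"]]
counts = count_biomarkers(ner_result)
print(counts.most_common())
```

Counting makes noisy tags such as ORR easy to spot next to frequently repeated ones such as BRAF.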
The following features are calculated for each word to classify it as being part of a biomarker or not.
- cbio: Binary feature, True if the word exists in the cbio_cancer_genes list.
- all_caps: Binary feature, True if the word is all CAPS.
- en_word: Binary feature, True if the word is an English word.
- stop_word: Binary feature, True if the word is among the NLTK list of stop words.
- has_digits: Binary feature, True if the word contains some digits.
- all_digits: Binary feature, True if the word consists entirely of digits.
- word_len: Length of the word to be tagged.
- tag: The word is tagged either BIO if it is part of a biomarker or O if not.
Examples:
Word | cbio | all_caps | en_word | stop_word | has_digits | all_digits | word_len | tag (training) |
---|---|---|---|---|---|---|---|---|
written | No | No | Yes | No | No | No | 7 | O |
documentation | No | No | Yes | No | No | No | 13 | O |
of | No | No | Yes | Yes | No | No | 2 | O |
BRAF | Yes | Yes | No | No | No | No | 4 | BIO |
V600 | No | Yes | No | No | Yes | No | 4 | BIO |
mutation | No | No | Yes | No | No | No | 8 | O |
This | No | No | Yes | Yes | No | No | 4 | O |
is | No | No | Yes | Yes | No | No | 2 | O |
a | No | No | Yes | Yes | No | No | 1 | O |
Phase | No | No | Yes | No | No | No | 5 | O |
II | No | No | No | No | No | No | 2 | O |
open | No | No | Yes | No | No | No | 4 | O |
label | No | No | Yes | No | No | No | 5 | O |
non | No | No | Yes | Yes | No | No | 3 | O |
randomized | No | No | Yes | Yes | No | No | 10 | O |
multi | No | No | Yes | No | No | No | 5 | O |
center | No | No | Yes | No | No | No | 6 | O |
study | No | No | Yes | No | No | No | 5 | O |
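The feature list above can be sketched as a single function that maps a word to a feature dictionary. This is an illustration only: the three word sets are tiny hypothetical stand-ins for the cbio_cancer_genes list, an English dictionary and the NLTK stop word list, and the project's real definitions may differ in edge cases (for instance, this sketch treats II as all caps, unlike the table row above).

```python
# Assumptions: small stand-ins for the real resources named in the feature list.
CBIO_CANCER_GENES = {"BRAF", "MET", "MEK"}
ENGLISH_WORDS = {"written", "documentation", "of", "mutation", "this", "is", "a"}
STOP_WORDS = {"of", "this", "is", "a"}


def word_features(word):
    """Compute the per-word features used to decide between BIO and O."""
    lowered = word.lower()
    return {
        "cbio": word in CBIO_CANCER_GENES,
        "all_caps": word.isupper(),
        "en_word": lowered in ENGLISH_WORDS,
        "stop_word": lowered in STOP_WORDS,
        "has_digits": any(ch.isdigit() for ch in word),
        "all_digits": word.isdigit(),
        "word_len": len(word),
    }


for word in ("BRAF", "V600", "of"):
    print(word, word_features(word))
```

A dictionary per word is the input shape that NLTK-style classifiers accept as a feature set, paired with the BIO or O tag during training.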