The OpenDeID corpus is the first Australian-based gold-standard corpus for patient de-identification. The corpus can be used to develop automated patient de-identification systems using rule-based or machine learning approaches. It comprises 2,100 pathology reports, with approximately 717 tokens per report, from 1,833 cancer patients, with 38,414 PHI entities annotated. The overall inter-annotator agreement and deviation scores for all three settings were 0.9464 and 0.9503, respectively. The corpus is manually annotated with surrogate information, and measures have been taken to ensure that it contains no identifiable information. For more information, please refer to https://www.sredhconsortium.org/sredh-datasets/opendeid-corpus-dataset
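To make the "rule-based" option mentioned above concrete, here is a minimal sketch of what such a system could look like. It is not the OpenDeID pipeline's method and does not reflect the corpus's annotation format. The patterns, the category labels `DATE` and `PHONE`, the function name `deidentify`, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for illustration only; they are NOT taken from the
# OpenDeID corpus or pipeline and cover only two simple PHI-like formats.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),   # e.g. 12/03/2015
    "PHONE": re.compile(r"\b\d{2} \d{4} \d{4}\b"),      # e.g. 02 5550 0123
}


def deidentify(text: str) -> str:
    """Replace every match of each pattern with a bracketed category tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Made-up sentence; it does not come from the corpus.
sample = "Specimen received 12/03/2015. Contact lab on 02 5550 0123."
print(deidentify(sample))
```

A real system would need far broader coverage (names, addresses, identifiers) and would typically be evaluated against gold-standard annotations such as those this corpus provides.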
The OpenDeID corpus is used to design and develop the OpenDeID pipeline: https://github.com/TCRNBioinformatics/OpenDeID-Pipeline
Please refer to https://www.sredhconsortium.org/sredh-datasets/opendeid-corpus-dataset
contact: z3339253 (at) unsw (dot) edu (dot) au
https://www.sredhconsortium.org/sredh-datasets/opendeid-corpus-dataset/faqs
https://github.com/SREDH-Consortium/OpenDeID-Corpus
https://github.com/SREDH-Consortium/OpenDeID-Pipeline
https://github.com/TCRNBioinformatics/OpenDeID-Corpus
https://github.com/TCRNBioinformatics/OpenDeID-Pipeline