
rc-ml-corpus

Repository to generate an agency/organization-specific corpus for building ML models.

This repository initially contains code to download PDF files of the documents hosted at the NOAA Institutional Repository. It reuses scripts from other RC repositories that are added as Git submodules.

After cloning this repo, initialize the local Git configuration for the required submodules and fetch their contents:

git submodule init
git submodule update

For more info about how to use Git submodules, see: https://git-scm.com/book/en/v2/Git-Tools-Submodules

Once you confirm the submodules are initialized correctly, you can run the script:

python create_noaa_ir_corpus.py 

The current script workflow is:

  1. Pull all document metadata from the NOAA Institutional Repository
  2. Reuse scripts from several Rich Context repositories to create a corpus.jsonld file (same structure as https://github.com/Coleridge-Initiative/rclc)
  3. Download all PDF files

You can skip any of these steps. To see how, run:

python create_noaa_ir_corpus.py -h
Example: python create_noaa_ir_corpus.py --skip_download

The NOAA API has 17 collections, each with its own endpoint. See the NOAA API documentation for details.

In step 1, the script pulls the document metadata from all the collections and then deduplicates it, since one document can be present in more than one collection/endpoint. It generates a CSV file with the document metadata (e.g., NOAA id, title, DOI) and a programmatically generated URL pointing to each document's PDF file in the NOAA IR.
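
To make the deduplication concrete, here is a minimal sketch of what step 1 does. The endpoint URL pattern and the JSON field names below are assumptions for illustration, not the real NOAA IR API schema:

    # Illustrative sketch of step 1: pull document metadata from every
    # collection endpoint and deduplicate by NOAA id, since one document
    # can appear in several collections.
    import csv
    import requests

    # hypothetical endpoint pattern; the real API exposes 17 collection endpoints
    COLLECTION_URL = "https://repository.library.noaa.gov/api/collection/{}"

    docs_by_id = {}
    for collection in range(1, 18):
        resp = requests.get(COLLECTION_URL.format(collection))
        resp.raise_for_status()
        for doc in resp.json()["documents"]:       # hypothetical field name
            noaa_id = doc["id"]                    # hypothetical field name
            docs_by_id.setdefault(noaa_id, {
                "noaa_id": noaa_id,
                "title": doc.get("title"),
                "doi": doc.get("doi"),
                # the PDF URL is built programmatically from the id
                "pdf_url": f"https://repository.library.noaa.gov/view/noaa/{noaa_id}",
            })

    with open("noaa_metadata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["noaa_id", "title", "doi", "pdf_url"])
        writer.writeheader()
        writer.writerows(docs_by_id.values())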

In step 2, the script reuses code from RCGraph to generate a corpus.jsonld file. The file might contain additional entities compared with the one originally published in RCLC. It initially contains a placeholder for the dataset links.
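
For orientation, a single publication entry in the generated corpus.jsonld looks roughly like the sketch below. The "@id" and "@type" values follow the RCLC vocabulary (see the filename example in the next section); the other field names are illustrative placeholders, and the empty "datasets" list stands in for the dataset-link placeholder mentioned above:

    {
      "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication-f23cafe87a6977199527",
      "@type": "ResearchPublication",
      "title": "Example NOAA publication title",
      "doi": "10.xxxx/example",
      "datasets": []
    }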

In step 3, the script tries to download all the PDF files listed in the corpus.jsonld file, saving them into the subdirectory corpus/pub/pdf. The download script makes 3 attempts per file by default; a few NOAA documents will still fail to download. If the download is interrupted, the script will pick up where it left off, skipping files already present in corpus/pub/pdf.
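
A minimal sketch of that retry-and-resume behavior, with illustrative names (this is not the script's actual code):

    # Download each PDF with up to 3 attempts, skipping files that already
    # exist so an interrupted run resumes where it left off.
    import os
    import requests

    PDF_DIR = "corpus/pub/pdf"
    MAX_ATTEMPTS = 3

    def download_pdf(url, filename):
        os.makedirs(PDF_DIR, exist_ok=True)
        path = os.path.join(PDF_DIR, filename)
        if os.path.exists(path):              # already downloaded: skip
            return True
        for _ in range(MAX_ATTEMPTS):
            try:
                resp = requests.get(url, timeout=60)
                resp.raise_for_status()
                with open(path, "wb") as f:
                    f.write(resp.content)
                return True
            except requests.RequestException:
                continue                      # retry on network/HTTP errors
        return False                          # a few NOAA PDFs fail every attempt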

Corpus available in S3 bucket

Since the download and pdf-to-text conversion take significant time for bigger corpora, we are making both PDF and TXT files available in an S3 bucket named richcontext.

First, you need to have your .aws directory configured with valid keys for S3 access before the following steps will work. Although the bucket is publicly readable, the boto3 library still requires keys for a valid AWS user account.

Then adapt the rclc/bin/download_s3.py script as example code to download the PDF files (open access publications) and TXT files (raw extracted text) from the public S3 bucket; you will need to modify that code for this corpus. For now we are not providing structured JSON files for these corpora.
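
As a starting point, a minimal boto3 sketch for fetching one publication's files from the richcontext bucket. The object key layout ("pub/pdf/...", "pub/txt/...") is an assumption; check rclc/bin/download_s3.py for the working key names:

    # Minimal sketch, assuming the keys mirror the local corpus/pub layout.
    import boto3

    s3 = boto3.client("s3")   # uses the credentials in your .aws directory
    BUCKET = "richcontext"
    pub_id = "f23cafe87a6977199527"

    # hypothetical key prefixes; verify against the bucket contents
    s3.download_file(BUCKET, f"pub/pdf/{pub_id}.pdf", f"{pub_id}.pdf")
    s3.download_file(BUCKET, f"pub/txt/{pub_id}.pdf.txt", f"{pub_id}.pdf.txt")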

Please note that the S3 bucket contains more files than just the current corpus, so you need to use the corpus.jsonld file as an index. Entries in that file of type ResearchPublication have an id that matches the corresponding files in the S3 bucket.

Example: for the entry in the corpus.jsonld file with "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication-f23cafe87a6977199527", the corresponding filenames will be f23cafe87a6977199527.pdf and f23cafe87a6977199527.pdf.txt.
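
A short sketch of using corpus.jsonld as that index, assuming the RCLC-style top-level "@graph" array (adjust if your copy nests entries differently):

    # Enumerate publication ids from corpus.jsonld and derive the S3 filenames.
    import json

    with open("corpus.jsonld") as f:
        corpus = json.load(f)

    for entry in corpus.get("@graph", []):        # "@graph" is an assumption
        if entry.get("@type") != "ResearchPublication":
            continue
        pub_id = entry["@id"].split("#publication-")[-1]
        print(f"{pub_id}.pdf  {pub_id}.pdf.txt")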

On the other hand, not all documents present in the corpus.jsonld file will be present in the S3 bucket. There are two reasons why a publication listed in corpus.jsonld might not have a corresponding .txt file in the S3 bucket:

  1. The PDF file of that publication failed to be retrieved from the NOAA Institutional Repository (a few files are missing there).
  2. The pdf-to-text tool could not process the PDF file.
