Self-contained ready-to-use Python scripts to help Data Citizens who work with Google Cloud Data Catalog.
- Quickstart: sample code for Data Catalog's API core features.
- Load Tag Templates from CSV files: loads a set of fields from CSV files and creates Tag Templates using them.
- Load Tag Templates from Google Sheets: loads a set of fields from Google Sheets and creates Tag Templates using them.
- Data Catalog hands-on guide: a mental model @ Google Cloud Community / Medium
- Data Catalog hands-on guide: search, get & lookup with Python @ Google Cloud Community / Medium
- Data Catalog hands-on guide: templates & tags with Python @ Google Cloud Community / Medium
git clone https://github.com/ricardolsmendes/gcp-datacatalog-python.git
cd gcp-datacatalog-python
- BigQuery Admin
- Data Catalog Admin
./credentials/datacatalog-samples.json
Using virtualenv is optional, but strongly recommended unless you use Docker.
pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate
pip install --upgrade -r requirements.txt
export GOOGLE_APPLICATION_CREDENTIALS=./credentials/datacatalog-samples.json
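A misconfigured credentials path is easier to diagnose before any script runs. The snippet below is a minimal sketch, not part of the repository: the helper name check_credentials is hypothetical, and it only verifies that the variable is set and points to an existing file, without validating the key itself.

```python
import os
from pathlib import Path


def check_credentials(env=None):
    """Return the credentials path if the variable is set and the file exists.

    Hypothetical helper for illustration; it does not validate the key content.
    """
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")

    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {path}")
    return path
```

Calling check_credentials() with no arguments reads the current environment and either returns the resolved path or raises an error that names the problem.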
Docker may be used as an alternative to run all the scripts. In this case please disregard the Virtualenv install instructions.
End-to-end tests help to make sure Google Cloud APIs and Service Accounts IAM Roles have been properly set up before running a script. They actually communicate with the APIs and create temporary resources that are deleted just after being used.
- pytest
export GOOGLE_CLOUD_TEST_ORGANIZATION_ID=ORGANIZATION_ID
export GOOGLE_CLOUD_TEST_PROJECT_ID=PROJECT_ID
pytest tests_e2e/quickstart_test.py
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--env GOOGLE_CLOUD_TEST_ORGANIZATION_ID=ORGANIZATION_ID \
--env GOOGLE_CLOUD_TEST_PROJECT_ID=PROJECT_ID \
gcp-datacatalog-python \
pytest tests_e2e/quickstart_test.py
- python
python quickstart.py --organization-id ORGANIZATION_ID --project-id PROJECT_ID
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty gcp-datacatalog-python \
python quickstart.py --organization-id ORGANIZATION_ID --project-id PROJECT_ID
- A master file named after the Template ID, e.g., template-abc.csv if your Template ID is template_abc. This file may contain as many lines as needed to represent the template. The first line is always discarded because it is expected to contain headers. Each field line must have 3 values: the first is the Field ID, the second is its Display Name, and the third is the Type. Currently, the types BOOL, DOUBLE, ENUM, STRING, TIMESTAMP, and MULTI are supported. MULTI is not a Data Catalog native type, but a flag that instructs the script to create a specific template to represent the field's predefined values (more on this below).
- If the template has ENUM fields, the script looks for a "display names file" for each of them. The files must be named after the fields, e.g., enum-field-xyz.csv if an ENUM Field ID is enum_field_xyz. Each file must have just one value per line, representing a display name.
- If the template has multivalued fields, the script looks for a "values file" for each of them. The files must be named after the fields, e.g., multivalued-field-xyz.csv if a multivalued Field ID is multivalued_field_xyz. Each file must have just one value per line, representing a short description for the value. The script generates the Field's ID and Display Name based on it.
- All Field IDs generated by the script are formatted to snake case (e.g., foo_bar_baz). The script does the formatting for you, so just provide the IDs as strings.
TIP: keep all template-related files in the same folder (see sample-input/load-template-csv for reference).
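The file conventions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, the sample field names in the usage note are invented, and the snake-case rule shown here simply splits on non-alphanumeric characters, which may differ from what the script actually does.

```python
import csv
import re

# Types listed as supported in the master file description.
SUPPORTED_TYPES = {"BOOL", "DOUBLE", "ENUM", "STRING", "TIMESTAMP", "MULTI"}


def file_name_for(resource_id):
    """Map an ID such as template_abc to its expected file name, template-abc.csv."""
    return resource_id.replace("_", "-") + ".csv"


def to_snake_case(text):
    """Lowercase the alphanumeric words in text and join them with underscores.

    Assumption: words are separated by spaces, hyphens, or other punctuation.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    return "_".join(word.lower() for word in words)


def parse_master_file(lines):
    """Parse master file lines into (field_id, display_name, type) tuples."""
    reader = csv.reader(lines)
    next(reader, None)  # The first line is discarded: it is expected to hold headers.

    fields = []
    for row in reader:
        if not row:
            continue  # Skip blank lines.
        if len(row) != 3:
            raise ValueError(f"Expected 3 values per line, got {len(row)}: {row}")
        field_id, display_name, field_type = (value.strip() for value in row)
        field_type = field_type.upper()
        if field_type not in SUPPORTED_TYPES:
            raise ValueError(f"Unsupported field type: {field_type}")
        fields.append((to_snake_case(field_id), display_name, field_type))
    return fields
```

For example, passing an open file handle for template-abc.csv (or any iterable of lines) to parse_master_file returns one tuple per field line, and file_name_for("enum_field_xyz") gives the name of the display names file the script would look for.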
- pytest
export GOOGLE_CLOUD_TEST_PROJECT_ID=PROJECT_ID
pytest tests_e2e/load_template_csv_test.py
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--env GOOGLE_CLOUD_TEST_PROJECT_ID=PROJECT_ID \
gcp-datacatalog-python \
pytest tests_e2e/load_template_csv_test.py
- python
python load_template_csv.py \
--template-id TEMPLATE_ID --display-name DISPLAY_NAME \
--project-id PROJECT_ID --files-folder FILES_FOLDER \
[--delete-existing]
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty gcp-datacatalog-python \
python load_template_csv.py \
--template-id TEMPLATE_ID --display-name DISPLAY_NAME \
--project-id PROJECT_ID --files-folder FILES_FOLDER \
[--delete-existing]
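For readers who want to wrap or adapt the command, the flags shown above could be modeled with argparse roughly as follows. This is a sketch only: treating the first four flags as required and --delete-existing as a boolean switch is an assumption drawn from the usage line, and the repository's script may parse its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags shown in the usage line (assumed semantics)."""
    parser = argparse.ArgumentParser(
        description="Load a Tag Template from CSV files (illustrative sketch)."
    )
    parser.add_argument("--template-id", required=True, help="ID of the Tag Template")
    parser.add_argument("--display-name", required=True, help="Display name of the template")
    parser.add_argument("--project-id", required=True, help="Google Cloud project ID")
    parser.add_argument("--files-folder", required=True, help="Folder with the CSV files")
    # Optional switch: shown in brackets in the usage line, so it defaults to False here.
    parser.add_argument("--delete-existing", action="store_true")
    return parser
```

Calling build_parser().parse_args() on the command line from the usage above yields a namespace with template_id, display_name, project_id, files_folder, and delete_existing attributes.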
https://console.developers.google.com/apis/library/sheets.googleapis.com?project=<PROJECT_ID>
- A master sheet named after the Template ID, e.g., template-abc if your Template ID is template_abc. This sheet may contain as many lines as needed to represent the template. The first line is always discarded because it is expected to contain headers. Each field line must have 3 values: column A is the Field ID, column B is its Display Name, and column C is the Type. Currently, the types BOOL, DOUBLE, ENUM, STRING, TIMESTAMP, and MULTI are supported. MULTI is not a Data Catalog native type, but a flag that instructs the script to create a specific template to represent the field's predefined values (more on this below).
- If the template has ENUM fields, the script looks for a "display names sheet" for each of them. The sheets must be named after the fields, e.g., enum-field-xyz if an ENUM Field ID is enum_field_xyz. Each sheet must have just one value per line (column A), representing a display name.
- If the template has multivalued fields, the script looks for a "values sheet" for each of them. The sheets must be named after the fields, e.g., multivalued-field-xyz if a multivalued Field ID is multivalued_field_xyz. Each sheet must have just one value per line (column A), representing a short description for the value. The script generates the Field's ID and Display Name based on it.
- All Field IDs generated by the script are formatted to snake case (e.g., foo_bar_baz). The script does the formatting for you, so just provide the IDs as strings.
TIP: keep all template-related sheets in the same document (Data Catalog Sample Tag Template for reference).
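The sheet layout above maps naturally onto A1-notation ranges, which is how cell ranges are commonly addressed in Google Sheets. The helpers below are a hypothetical sketch, not taken from the repository: the function names are invented, and reading the display names and values sheets from row 1 (no header) is an assumption, since only the master sheet is described as having a header line.

```python
def sheet_name_for(resource_id):
    """Map an ID such as template_abc to its expected sheet name, template-abc."""
    return resource_id.replace("_", "-")


def a1_range(sheet_name, first_col, last_col, first_row):
    """Build an open-ended A1 range, e.g. 'template-abc'!A2:C.

    The sheet name is wrapped in single quotes because it contains hyphens.
    Names derived from IDs hold only letters, digits, and hyphens, so no
    further escaping is attempted in this sketch.
    """
    return f"'{sheet_name}'!{first_col}{first_row}:{last_col}"


def master_range(template_id):
    """Columns A to C of the master sheet, skipping the header in row 1."""
    return a1_range(sheet_name_for(template_id), "A", "C", first_row=2)


def values_range(field_id):
    """Column A of a display names or values sheet (assumed to have no header)."""
    return a1_range(sheet_name_for(field_id), "A", "A", first_row=1)
```

For instance, master_range("template_abc") produces the range for the three master columns, and values_range("enum_field_xyz") produces the single-column range for an ENUM field's display names.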
- pytest
export GOOGLE_CLOUD_TEST_PROJECT_ID=PROJECT_ID
pytest tests_e2e/load_template_google_sheets_test.py
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty \
--env GOOGLE_CLOUD_TEST_PROJECT_ID=PROJECT_ID \
gcp-datacatalog-python \
pytest tests_e2e/load_template_google_sheets_test.py
- python
python load_template_google_sheets.py \
--template-id TEMPLATE_ID --display-name DISPLAY_NAME \
--project-id PROJECT_ID --spreadsheet-id SPREADSHEET_ID \
[--delete-existing]
- docker
docker build --rm --tag gcp-datacatalog-python .
docker run --rm --tty gcp-datacatalog-python \
python load_template_google_sheets.py \
--template-id TEMPLATE_ID --display-name DISPLAY_NAME \
--project-id PROJECT_ID --spreadsheet-id SPREADSHEET_ID \
[--delete-existing]