
Data Workspace - a PostgreSQL-based open source data analysis platform


This is the entry-point repository for Data Workspace, a PostgreSQL-based open source data analysis platform with features for users with a range of technical skills. It contains a brief catalogue of all Data Workspace repositories (below), the source for the Data Workspace developer documentation, and the Terraform code to deploy Data Workspace into AWS.

Tip

Looking for the Data Workspace Django application? It's now in the data-workspace-frontend repo.


Catalogue of Data Workspace repositories

The components of Data Workspace are stored across several Git repositories.

Core

  • data-workspace (this repository)

    Contains the Terraform code to deploy Data Workspace in AWS, and the public facing developer documentation for Data Workspace. See Contents of this repository for details of what goes where.

  • data-workspace-frontend

    Contains the core Django application that defines most of the user-facing components of Data Workspace. Also contains "the proxy" that sits in front of the Django application, integrates with SSO, and routes requests, for example to tools.

    Also contains the Dockerfiles for other components. However, it's planned to move these out to separate repositories.

Tools

Low level

Some of the components of Data Workspace are lower level and less Data Workspace-specific: they can, at least theoretically, be re-used outside of Data Workspace.

  • pg-sync-roles

    Used to synchronise permissions between the data-workspace-frontend metadata database and users in the main PostgreSQL database. A brief usage sketch follows this list.

  • mobius3

    Used in on-demand tools to sync users' files with S3.

  • dns-rewrite-proxy

    Used in tools to filter and rewrite DNS requests.

  • theia-postgres

    Used in Theia to give reasonably straightforward access to a PostgreSQL database.

  • mirror-git-to-s3
    git-lfs-http-mirror

    Used to mirror Git repositories that use Large File Storage (LFS) to S3, and then to access them from inside tools.

  • ecs-pipeline

    Used to deploy Data Workspace from Jenkins.

  • quicksight-bulk-update-datasets

    A CLI script to make bulk updates to Amazon QuickSight datasets.
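
As a rough illustration only, the sketch below shows how pg-sync-roles might be called from Python with a SQLAlchemy connection to grant a role permission to connect to a database. The function and grant class names (sync_roles, DatabaseConnect) and the exact signature are assumptions rather than a definitive reference; consult the pg-sync-roles README for the real API.

```python
# A rough sketch of keeping a PostgreSQL role's permissions in sync with
# pg-sync-roles. The grant class name and exact signature are assumed from
# memory; check the pg-sync-roles README for the authoritative API.
import sqlalchemy as sa
from pg_sync_roles import DatabaseConnect, sync_roles

engine = sa.create_engine('postgresql+psycopg://postgres@127.0.0.1:5432/postgres')

with engine.connect() as conn:
    # Bring the role's grants in line with the list passed in
    sync_roles(
        conn,
        'my_user_role',
        grants=(
            DatabaseConnect('my_database'),
        ),
    )
```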

Ingesting data

These components are usually used to ingest data into the PostgreSQL database that's the core of Data Workspace.

  • pg-bulk-ingest
    pg-force-execute

    Used to ingest large amounts of data into the PostgreSQL database.

  • to-file-like-obj

    Used in several ways to convert iterables of bytes to a file-like object for memory-efficient data ingestion, for example when parsing CSVs.

  • iterable-subprocess

    Used to extract data from archives in a format that requires running an external program.

  • stream-read-ods

    Used to extract data from Open Document Spreadsheet (ODS) files in a memory-efficient and disk-efficient way.

  • stream-unzip

    Used to extract data from ZIP files in a memory-efficient and disk-efficient way. A brief usage sketch follows this list.

  • stream-read-xbrl

    Used to ingest XBRL-format data from Companies House.

  • sqlite-s3vfs

    Used to generate large and complex SQLite files that are then ingested into the Data Workspace PostgreSQL database.

  • s3-dropbox

    Used to power a simple API that accepts incoming data files in any format and drops them in S3, to be subsequently ingested into Data Workspace.
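
As an illustration of the streaming approach these components share, the sketch below uses stream-unzip to extract members from a ZIP file as its bytes arrive, without holding the whole archive in memory or on disk. The download URL and the use of httpx are placeholders for any iterable of ZIP-file bytes.

```python
# A minimal sketch of extracting a ZIP file as it downloads, using stream-unzip.
# The URL is a placeholder; any iterable of the bytes of a ZIP file will do.
import httpx
from stream_unzip import stream_unzip

def zipped_chunks():
    # Stream the raw bytes of the ZIP file rather than downloading it all first
    with httpx.stream('GET', 'https://example.com/some.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    # unzipped_chunks must be iterated to completion before moving to the next file
    for chunk in unzipped_chunks:
        print(file_name, len(chunk))
```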

Publishing data

These components are used when publishing data from Data Workspace.

  • public-data-api

    Makes data available to the public.

  • stream-zip

    Creates ZIP files in a memory-efficient and disk-efficient way. A brief usage sketch follows this list.

  • stream-write-ods

    Creates Open Document Spreadsheet (ODS) files in a memory-efficient and disk-efficient way.

  • postgresql-proxy

    Part of the system that makes data available to other internal applications.
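
As a short example of how the publishing components avoid materialising whole files, the sketch below uses stream-zip to build a ZIP from an iterable of member files, with the member contents and the resulting ZIP both expressed as iterables of bytes. The member files and output path are invented for illustration; the exact constants (such as ZIP_64) should be checked against the stream-zip documentation.

```python
# A minimal sketch of building a ZIP on the fly with stream-zip.
# Each member file is (name, modified_at, mode, method, iterable_of_bytes);
# the members shown here are invented purely for illustration.
from datetime import datetime
from stream_zip import stream_zip, ZIP_64

member_files = (
    ('my-file-1.txt', datetime.now(), 0o600, ZIP_64, (b'Some bytes 1',)),
    ('my-file-2.txt', datetime.now(), 0o600, ZIP_64, (b'Some bytes 2',)),
)

# zipped_chunks is itself an iterable of bytes, so it can be streamed to a
# client or to S3 without ever holding the whole ZIP in memory
zipped_chunks = stream_zip(member_files)

with open('example.zip', 'wb') as f:
    for chunk in zipped_chunks:
        f.write(chunk)
```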


Contents of this repository