python_gcs_bigq

ETL Data Set Shards using Python with Google Cloud Storage and Google BigQuery

Due to the very large volumes of data captured within their business domain, Data Analysts and Data Scientists who use an RDBMS to support their business intelligence (BI) and machine learning (ML) activities are now seriously constrained by the inherent limitations of the b-tree data structure which the RDBMS uses to capture and manage their data.

Though ideally suited for inserts, updates, and deletes, i.e., ideally suited for OLTP transactions, b-tree data structures are not well suited for data searches. Searches over a b-tree are limited to log base 2 performance: for each individual record you want to read from a table containing 1 million records, you must (at minimum) read roughly 19 other records before you ever get to the 1 record you wanted. To compensate for this inherent limitation of the b-tree data structure, RDBMS vendors have vertically scaled their commodity data services. However, that approach has proven to be inadequate when searching very large volumes of data, whether structured or unstructured. Hence the creation of commercial column-oriented data services.
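To make that log base 2 bound concrete, here is a back-of-the-envelope illustration (not code from this repository, and following the binary-search model used above rather than the higher fan-out of a real b-tree page):

```python
import math

# Total reads per lookup under a binary-search model: ceil(log2(n)),
# i.e. the ~19 'other' records plus the 1 record you actually wanted.
for rows in (1_000, 1_000_000, 1_000_000_000):
    print(f"{rows:>13,} rows -> ~{math.ceil(math.log2(rows))} reads per lookup")
```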

In this business case, the Data Analysts and Data Scientists have hit that 'SQL query' wall, and are initiating a Proof-of-Concept (POC) to determine if a column-oriented data service, which supports (some flavor of) SQL, can provide timely access to their hundreds of terabytes of structured data. They have chosen to evaluate Google BigQuery to determine if it can support their use cases. And since Google Cloud Storage (GCS) must serve as the staging area for BigQuery, GCS is therefore part of the POC as well. Alternatively, Amazon Redshift could be chosen for this use case, which would lead to the use of Amazon S3 buckets as its staging area.

At present, the data sets which are input to their BI and ML activities are sourced from a single on-premise commodity RDBMS OLTP cluster. Each table in the OLTP RDBMS contains records created and changed over a 15-year period. According to this business use case, to provide the teams of Data Scientists and Data Analysts with 'timely' data, the old, new, and changed records found in that data source must be present, on an hourly basis, in their corresponding BigQuery table/entity and view(s).

Since this problem space does not involve the blending of data sets from a collection of data sources (whether sourced from within the same business domain, or from a variety of business domains from within the same enterprise, or from a variety of enterprises), this POC does not need to adopt complex enterprise integration patterns. Of the four main approaches to enterprise integration (file transfer; shared databases; remote procedure invocation; messaging), the file transfer pattern is best suited to this business use case POC.

A critical remaining design question is: what programming language will the solution be implemented in? When creating, as in this POC, a solution for a single department, for a single business domain, the demands placed upon the language by the file transfer pattern are considerably simpler than in the case of any of the other 3 enterprise integration approaches.

Whenever adopting the shared databases, remote procedure invocation, or messaging enterprise integration patterns, the programming language must be able to scale in response to business events. As such, the language must be multi-threaded, and preferably that threading can be accomplished at a high level of abstraction, as in the case of a JVM running a Scala program which uses the Akka framework. Moreover, since large volumes of data are ubiquitous, and are also distributed, these multi-threaded programs must be orchestrated across numerous virtual machines (VMs), each of which manages some portion of the big data.

Whenever using the file transfer pattern within a single department, and when, at any single step along the data flow, you are merely transferring a file from point A to point B, you do not need a multi-threaded programming language. This is the type of programming task that Python is very well suited to handle. And, given that data is distributed across VMs, and that the ETL Python scripts must therefore be run concurrently (without interfering with each other), it is a relatively straightforward matter to operate such distributed Python scripts.

When adopting the file transfer pattern you have to choose the appropriate type of file system, one which is a distributed file management system. But since extracting, transforming, and loading large data files/sets are all disk I/O intensive tasks, and since for this business use case all ETL scripts must finish within 60 minutes, each VM must use local solid-state drives (SSDs) as its persistent storage mechanism.

For each source table, a Python + SQL script will copy the table's records and output them to CSV files, which are persisted to on-premise disk storage. However, since a number of source tables are very large (~50 TB), and since all changes must be replicated into BigQuery within a 60-minute window, the source tables which contain very large numbers of records must be sharded. Unfortunately, having to shard source data sets really complicates the ETL tasks that implement the file transfer pattern. Fortunately, adopting and enforcing naming conventions provides a high-level abstraction which greatly reduces the complexity of the ETL programs.

In addition to the added complexity of sharding, there are 2 additional concerns: the maximum file size supported by Google Cloud Storage (GCS) is 4GB, and the file transfer pattern is always network intensive. To avoid having to shard the source data, it is important to discover, for each source table, whether all of the table's records can be extracted as a single CSV file whose size both stays under the 4GB limit and is small enough for all ETL tasks to complete within the allotted 60 minutes. As always, wherever possible we want to obey the KISS (keep-it-simple-stupid) design principle. Since GCS and Google BigQuery are cloud-based services, it will be necessary, after the POC, to build a virtual private network (VPN) which links the on-premise site with the Google Cloud Platform (GCP), assuming, of course, that the value proposition of the POC has been proven.
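As a rough feasibility check, the transfer time of a single extract file can be estimated from its size and the sustained network throughput. The numbers below are hypothetical placeholders, not measurements from the POC:

```python
# Hypothetical sizing check: can a single CSV extract move within the hourly window?
FILE_SIZE_GB = 4            # worst case allowed by the stated file-size limit
THROUGHPUT_MBPS = 200       # assumed sustained on-premise-to-GCS bandwidth (megabits/s)

transfer_seconds = (FILE_SIZE_GB * 8 * 1024) / THROUGHPUT_MBPS
print(f"~{transfer_seconds / 60:.1f} minutes to upload a {FILE_SIZE_GB} GB file "
      f"at {THROUGHPUT_MBPS} Mbps")   # leaves the rest of the 60 minutes for extract + load
```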

In light of the (on-premise + Internet) network bandwidth and the file size limitations of GCS, it has been determined that the shards can be modeled as a time series composed of one-month intervals. One adopted convention related to sharding is that each shard extract CSV file has the suffix '_CCYYMMDDHH.csv'. By convention, the CSV file prefix is the name of the source table as defined in the RDBMS. Another related convention is that each source table has a corresponding bucket in GCS, named after that source table. However, because all GCS bucket names must be globally unique (across all GCP consumers), a cryptic bucket name prefix standard is enforced. The last convention to be adopted is to assign the name of the source table to its corresponding BigQuery table. But since the individual shards are to be present in BigQuery, the sharded table name has the suffix '_CCYYMM'.
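A minimal sketch of how those conventions might be encoded is shown below; the function name, bucket prefix, and example table are hypothetical and are not taken from the repository's config.py:

```python
from datetime import datetime

BUCKET_PREFIX = "acme-poc-"   # hypothetical 'cryptic' prefix keeping bucket names globally unique

def shard_names(source_table: str, event_time: datetime) -> dict:
    """Derive the CSV file, GCS bucket, and BigQuery shard table names from one source table name."""
    return {
        "csv_file": f"{source_table}_{event_time:%Y%m%d%H}.csv",   # <table>_CCYYMMDDHH.csv
        "gcs_bucket": f"{BUCKET_PREFIX}{source_table}".lower(),    # one bucket per source table
        "bq_table": f"{source_table}_{event_time:%Y%m}",           # <table>_CCYYMM shard in BigQuery
    }

print(shard_names("customer_orders", datetime(2017, 6, 15, 13)))
```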

Implicit in this last convention is the understanding that the source schema is not being transformed in this POC. When records from a relational data model are stored within a column-oriented data model, it is surprising to observe just how much faster the search queries execute without having to transform the source data model. Since the Data Analysts and Data Scientists want to avoid having to re-write their existing SQL statements (so that they work with a transformed data model), the database schema is being recreated within BigQuery. It is understood that should an existing query over the existing schema prove not to be viable in BigQuery, then a transformation program to 'write the data the way you want to read the data' will have to be implemented.

Because the source database is not sharded but the BigQuery database will be, views must be created in BigQuery. Such a view wraps all of the shards derived from a source table and creates the illusion in BigQuery of a single table. By implementing views, the existing SQL statements will not have to be re-written to handle shards.
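To make the idea concrete, such a view can be expressed as a UNION ALL over the monthly shard tables. The sketch below is an illustration only, with hypothetical project, dataset, and table names; the author's actual view creation and maintenance code belongs to the second posting in this series:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Present the monthly shards of 'customer_orders' as a single logical table.
view = bigquery.Table("my_project.my_dataset.customer_orders")
view.view_query = """
    SELECT * FROM `my_project.my_dataset.customer_orders_201706`
    UNION ALL
    SELECT * FROM `my_project.my_dataset.customer_orders_201707`
"""
client.create_table(view, exists_ok=True)  # existing queries keep targeting 'customer_orders'
```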

The data flow that handles the hourly time-series shards is significantly more complex, and rightly so. Outside the context of a time-boxed POC, the change data capture (CDC) feature of an RDBMS would be used to record the metadata about new and changed records. With CDC operating, only these 'delta' records would find their way into the extract CSV files (unlike the POC approach of taking all records, changed or unchanged, since the last extract event). And when depending on CDC metadata to keep the BigQuery database in sync, it is also required that baseline extracts (containing all records) be created for each source table. CDC is therefore outside the scope of this POC, as the core of the POC is the evaluation of BigQuery by the Data Analysts and Data Scientists. Instead, when sharding is required, the extract SQL statement will exploit the time-stamp property of each source table and use it to retrieve just the new and changed records (since the prior extract event) from that source table, i.e., the poor man's CDC technique.
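A minimal sketch of that poor man's CDC extract is shown below. The table and column names are hypothetical, and this is not the repository's extract code (which is deliberately not published):

```python
# Hypothetical delta extract: rely on the source table's last-modified timestamp
# column instead of RDBMS change data capture.
def delta_extract_sql(source_table: str, ts_column: str, last_extract_ts: str) -> str:
    # In production, bind last_extract_ts as a query parameter rather than interpolating it.
    return (
        f"SELECT * FROM {source_table} "
        f"WHERE {ts_column} > '{last_extract_ts}'"   # new and changed rows since the prior run
    )

print(delta_extract_sql("customer_orders", "last_modified_ts", "2017-06-15 12:00:00"))
```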

Given the complexity of the sharded data flow, its discussion will be spread over 2 blog postings. In this posting the extract task will not be discussed further, however the load of the CSV file into GCS task, and its subsequent load into BigQuery, will be covered. In the second blog in this series, 'Use Python to Update BigQuery tables', the tasks of merging new and changed records with its shard CSV file, as well as the creation and maintenance of BigQuery views (over sharded tables), will be covered.

The interesting engineering bits of this POC are found around the challenges involved in sharding large volumes of changing data sets so that they can quickly be refreshed within Google BigQuery. Sharding time series data sets is the most challenging part of this use case, particularly given the fact that BigQuery does not support the UPDATE statement. Unlike the truly enterprise grade data services which support the UPDATE statement, such as Netezza, Teradata, Aster, Oracle Exadata 2, etc., updating BigQuery tables must be accomplished outside of that data service. This is not an easy problem to solve, let alone within a 60-minute window in which all ETL tasks have to finish, while running concurrently across hundreds of structurally different source tables.

The data flow that handles a single CSV extract file for a given source table is relatively simple. The CSV extract file is generated and written to a local SSD. From there it is copied into its corresponding GCS bucket. Next, the corresponding BigQuery table is programmatically truncated and the CSV file (in GCS) is bulk loaded into that same BigQuery table. To accomplish this last load task, a JSON representation of each BigQuery table's schema has been created.
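The sketch below shows how those steps might look with the current google-cloud-storage and google-cloud-bigquery client libraries. It is an illustration of the flow, not an excerpt from functions.py, and the bucket, dataset, and schema-file names are hypothetical; here the truncate-and-load is expressed as a single load job with WRITE_TRUNCATE, which replaces the table's contents in one step:

```python
from google.cloud import bigquery, storage

def load_csv_to_bigquery(local_csv: str, bucket_name: str, table_id: str, schema_json: str) -> None:
    """Copy a local CSV extract into GCS, then (re)load it into its BigQuery table."""
    # 1. Copy the CSV extract from local SSD into its corresponding GCS bucket.
    blob_name = local_csv.rsplit("/", 1)[-1]
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename(local_csv)

    # 2. Bulk load the GCS object into BigQuery, replacing the table's current rows.
    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,                                # assumes the extract has a header row
        schema=bq.schema_from_json(schema_json),            # the JSON representation of the table
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = bq.load_table_from_uri(
        f"gs://{bucket_name}/{blob_name}", table_id, job_config=job_config
    )
    load_job.result()  # block until the load job completes (raises on failure)

# Hypothetical usage:
# load_csv_to_bigquery("/data/customer_orders_2017061513.csv",
#                      "acme-poc-customer_orders",
#                      "my_project.my_dataset.customer_orders",
#                      "schemas/customer_orders.json")
```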

When reviewing the Python code, notice how the naming conventions are used to greatly simplify it. The goal is to have only 1 Python script which can handle any and all source tables with small data volumes, and, likewise, only 1 Python script which can handle any and all sharded source tables with very large data volumes. Attaining these 2 design goals is important if you want to avoid creating a big-ball-of-mud (bbom).

Since it is imperative that the steps the Python script progresses through are logged as they occur, you will see logging code in the Python code samples provided. In addition, since the Python script uses the credentials of a Google service account, you will also see in the code samples how to use those credentials.
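For readers who have not wired these up before, the pattern generally looks like the sketch below, which uses the standard logging module and the google-auth library; it is an illustration rather than an excerpt from config.py, and the key-file path is hypothetical:

```python
import logging

from google.cloud import bigquery
from google.oauth2 import service_account

# Log each step of the data flow as it happens.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("gcs_bigq_loader")

# Authenticate as the Google service account rather than a human user.
credentials = service_account.Credentials.from_service_account_file("keys/poc-loader.json")
bq_client = bigquery.Client(credentials=credentials, project=credentials.project_id)

log.info("BigQuery client ready for project %s", bq_client.project)
```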

There are 3 Python files which contain the code for handling the single CSV file load from an on-premise folder into a GCS bucket and on into a BigQuery table: config.py, functions.py and loader.py. You can find these Python code files at www.thesolusgroupllc.com/gcp.html. The Python and SQL code that creates the CSV extract files is not being shared - there are many examples of this type of programming already available on the web.

loader.py invokes a method named 'ComplexLoad', which handles shards and the updating of BigQuery tables. That method, and all other methods it calls, are contained in another Python file named 'complex_functions.py'. That algorithm will be covered in the second blog posting in this series.

If you have any questions about the code, just send an email message to charles@thesolusgroupllc.com

Written by Charles Clifford, M.S.
