Google Cloud Storage

Tree Schema integrates with Google Cloud Storage to automatically extract metadata from files that exist within your Cloud Storage buckets.

Connecting to Google Cloud Storage

Tree Schema requires a service account to connect to your GCS account. To set up this connection, provide two required fields:

  • Project ID: The unique ID of the GCP project where your GCS buckets reside

  • JSON Key File: The full contents of a service account's JSON key file

../../_images/gcs_connection.png

For details on creating a service account, see the GCP documentation.
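
Before pasting a key into the JSON Key File field, it can help to confirm that the file is a complete service account key. The following is a minimal sketch and not part of Tree Schema; the function name `check_service_account_key` is hypothetical, and the field names are the ones found in a standard GCP service account JSON key file.

```python
import json
from typing import List

# Fields present in a standard GCP service account JSON key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(key_json: str) -> List[str]:
    """Return a list of problems with the key; an empty list means it looks complete."""
    try:
        key = json.loads(key_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["key file must contain a JSON object"]

    # Report every required field that is absent or empty.
    problems = [f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)]

    # Other credential types (for example user credentials) are not service account keys.
    key_type = key.get("type")
    if key_type and key_type != "service_account":
        problems.append(f"unexpected key type: {key_type}")
    return problems
```

This only checks the shape of the file. It does not contact GCP, so it cannot tell whether the key is still active or which permissions it carries.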


IAM Permissions

The following permissions are required for Tree Schema to integrate with Google Cloud Storage.

  • storage.buckets.get: allows Tree Schema to read the metadata of a bucket

  • storage.buckets.list: allows Tree Schema to validate access to a given bucket

  • storage.objects.get: allows Tree Schema to pull content of a file and infer the schema

  • storage.objects.list: allows Tree Schema to find one or more files within a given directory

We suggest creating a custom role specifically for Tree Schema, as seen here:

../../_images/gcs_tree_schema_role.png
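
If you prefer the command line to the console, a custom role with the four permissions listed above can also be created with `gcloud iam roles create`. The sketch below only builds the command string and does not run it. The role ID `treeSchemaReader`, the title, and the project ID `my-project` are hypothetical placeholders, not names that Tree Schema requires.

```python
import shlex

# The four permissions Tree Schema needs, as listed above.
TREE_SCHEMA_PERMISSIONS = (
    "storage.buckets.get",
    "storage.buckets.list",
    "storage.objects.get",
    "storage.objects.list",
)


def create_role_command(project_id: str, role_id: str = "treeSchemaReader") -> str:
    """Build (but do not run) a gcloud command that creates the custom role."""
    args = [
        "gcloud", "iam", "roles", "create", role_id,
        f"--project={project_id}",
        "--title=Tree Schema Reader",  # hypothetical display title
        f"--permissions={','.join(TREE_SCHEMA_PERMISSIONS)}",
    ]
    # Quote each argument so the string can be pasted into a shell safely.
    return " ".join(shlex.quote(arg) for arg in args)


print(create_role_command("my-project"))
```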

The service account you created for Tree Schema must be granted access to each bucket that it needs to read. To do this, navigate to Google Cloud Storage, select the bucket, then go to the Permissions tab. At the bottom, select the +Add button.

../../_images/gcs_add_permissions.png

Now add the Tree Schema service account and assign it the role created above. If you created a new role, you can find it under Custom:

../../_images/gcs_add_role.png
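
In IAM terms, the console steps above add one binding to the bucket's IAM policy: the service account becomes a member of the chosen role. The sketch below illustrates that change on the JSON form of a policy. It is an illustration only and does not call GCP; the function name `grant_role`, the role ID `treeSchemaReader`, and the example email address are hypothetical.

```python
def grant_role(policy: dict, role: str, service_account_email: str) -> dict:
    """Return a copy of an IAM policy with the service account bound to the role."""
    member = f"serviceAccount:{service_account_email}"
    # Copy each binding so the caller's policy is left unchanged.
    bindings = [
        {**binding, "members": list(binding.get("members", []))}
        for binding in policy.get("bindings", [])
    ]
    for binding in bindings:
        # Reuse an existing unconditional binding for this role if there is one.
        if binding.get("role") == role and "condition" not in binding:
            if member not in binding["members"]:
                binding["members"].append(member)
            break
    else:
        bindings.append({"role": role, "members": [member]})
    return {**policy, "bindings": bindings}


# Custom roles are referenced as projects/<project-id>/roles/<role-id>.
updated = grant_role(
    {"bindings": []},
    "projects/my-project/roles/treeSchemaReader",
    "tree-schema@my-project.iam.gserviceaccount.com",
)
```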

Automatically Extracting Schemas

Within Google Cloud Storage you may have several directory structures that each contain a unique schema. For example, you might have a directory that contains all of your user information and you might have a separate directory that contains all of your account information in the same bucket:

- analytics_bucket
| -- users
|  | -- user_file1.csv
|  | -- user_file2.csv
| -- accounts
|  | -- customers
|     | -- year=2020
|        | -- customer_1.csv
|        | -- customer_2.csv
|     | -- year=2021
|        | -- customer_1.csv
|        | -- customer_2.csv

When you want Tree Schema to automatically infer the format of your data in Google Cloud Storage, you must specify the bucket and the directory of each unique schema, as seen below:

../../_images/gcs_schemas.png

In the example above, the customers schema is partitioned by year, but the users schema is not partitioned. When partitions exist in your directory structure, Tree Schema automatically adds the partition values to the schema, but only if you specify the directory immediately above where the partitions begin.

In this example, we have defined two schemas:

  1. bucket: analytics_bucket, directory: users

  2. bucket: analytics_bucket, directory: accounts/customers
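
To make the partition behaviour concrete, the sketch below shows one way that `key=value` folder names beneath a schema directory can be turned into partition values. It illustrates the naming convention only and is not Tree Schema's implementation; the function name `partition_values` is hypothetical.

```python
def partition_values(directory: str, object_name: str) -> dict:
    """Extract key=value partition folders found below the schema directory."""
    prefix = directory.strip("/") + "/"
    if not object_name.startswith(prefix):
        raise ValueError(f"{object_name!r} is not under {directory!r}")
    # Drop the schema directory and the file name, keeping only the folders in between.
    folders = object_name[len(prefix):].split("/")[:-1]
    partitions = {}
    for folder in folders:
        key, sep, value = folder.partition("=")
        if sep:  # only folders of the form key=value are treated as partitions
            partitions[key] = value
    return partitions


# Schema 2: the year=... folder sits directly below accounts/customers.
print(partition_values("accounts/customers", "accounts/customers/year=2020/customer_1.csv"))
# Schema 1: no partition folders below users.
print(partition_values("users", "users/user_file1.csv"))
```

The first call yields a `year` partition with the value `2020`, and the second yields no partitions, which matches the two schemas defined above.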


Google Cloud Storage Limitations

Given that the directory structure within GCS may be unique for each bucket, Tree Schema does not read through all the files in your GCS buckets to try to identify where unique schemas exist. You must provide the location of each schema.