Google Cloud Storage
Tree Schema integrates with Google Cloud Storage to automatically extract metadata from files that exist within your Cloud Storage buckets.
Connecting to Google Cloud Storage
Tree Schema requires a service account to connect to your GCS account. Setting up this connection requires two fields:
Project ID: The unique ID of the GCP project where your GCS buckets reside
JSON Key File: The full contents of the service account's JSON credentials file
For details on creating a service account visit the GCP Documentation.
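Before pasting the key file into the JSON Key File field, it can help to confirm that its content really is a service account key. The snippet below is a minimal sketch using only the Python standard library. The field names it checks (`type`, `project_id`, `private_key`, `client_email`) are the ones found in a standard GCP service account key file; the function name `check_key_file` is hypothetical and is not part of Tree Schema.

```python
import json

# Fields present in a standard GCP service account JSON key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(key_json: str) -> list:
    """Return a list of problems found in the key file content (empty if none)."""
    try:
        key = json.loads(key_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(key, dict):
        return ["key file must be a JSON object"]

    # Report every required field that is absent.
    problems = [f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if name not in key]

    # A user or other credential type will not work as a service account key.
    if "type" in key and key["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems
```

An empty list means the content has the expected shape; it does not prove that the key is still active or that the account has the required permissions.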
The following permissions are required for Tree Schema to integrate with Google Cloud Storage.
storage.buckets.get: allows access to get metadata and content within a bucket
storage.buckets.list: allows Tree Schema to validate access to a given bucket
storage.objects.get: allows Tree Schema to pull content of a file and infer the schema
storage.objects.list: allows Tree Schema to find one or more files within a given directory
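The four permissions above can be collected into a custom role definition. The following is a sketch, not an official Tree Schema artifact: it builds a dictionary using the field names of an IAM custom role definition (`title`, `description`, `stage`, `includedPermissions`) and prints it as JSON. The role title and description are placeholder assumptions, and `missing_permissions` is a hypothetical helper for comparing an existing role against the required list.

```python
import json

# The permissions listed above as required by Tree Schema.
REQUIRED_PERMISSIONS = [
    "storage.buckets.get",
    "storage.buckets.list",
    "storage.objects.get",
    "storage.objects.list",
]


def build_role_definition(title: str = "Tree Schema GCS Reader") -> dict:
    """Build a custom role definition (the title is a placeholder)."""
    return {
        "title": title,
        "description": "Read-only access used by Tree Schema to infer schemas",
        "stage": "GA",
        "includedPermissions": list(REQUIRED_PERMISSIONS),
    }


def missing_permissions(granted) -> list:
    """Return the required permissions that are absent from `granted`."""
    return sorted(set(REQUIRED_PERMISSIONS) - set(granted))


if __name__ == "__main__":
    print(json.dumps(build_role_definition(), indent=2))
```

Keeping the role limited to these four permissions means the service account can read bucket and object data but cannot modify or delete anything.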
We suggest creating a custom role specifically for Tree Schema that contains only these permissions.
The service account you created for Tree Schema must be granted access to each bucket that it needs to read. To do this, navigate to Google Cloud Storage, select the bucket you want to grant permissions on, then go to the Permissions tab. At the bottom, select the +Add button.
Now add the service account for Tree Schema that exists within your account and assign it the role created above. If you created a new role, you can find it under Custom.
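The console steps above amount to adding one binding to the bucket's IAM policy. As a rough illustration, the sketch below builds that binding as a plain dictionary. It assumes the usual IAM naming conventions (`projects/<project>/roles/<role id>` for a project-level custom role and a `serviceAccount:` prefix for the member); the function name and the example values are hypothetical.

```python
def bucket_binding(project_id: str, role_id: str, service_account_email: str) -> dict:
    """Build the IAM binding that grants a custom role to a service account."""
    return {
        # Project-level custom roles are addressed by project and role ID.
        "role": f"projects/{project_id}/roles/{role_id}",
        # Service accounts are identified by email with a type prefix.
        "members": [f"serviceAccount:{service_account_email}"],
    }


if __name__ == "__main__":
    # All three values below are placeholders.
    print(bucket_binding(
        "my-project",
        "treeSchemaGcsReader",
        "tree-schema@my-project.iam.gserviceaccount.com",
    ))
```

The same binding has to be added to every bucket that Tree Schema should read.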
Automatically Extracting Schemas
Within Google Cloud Storage you may have several directory structures that each contain a unique schema. For example, you might have a directory that contains all of your user information and you might have a separate directory that contains all of your account information in the same bucket:
- analytics_bucket
  | -- users
  |    | -- user_file1.csv
  |    | -- user_file2.csv
  | -- accounts
  |    | -- customers
  |    |    | -- year=2020
  |    |    |    | -- customer_1.csv
  |    |    |    | -- customer_2.csv
  |    |    | -- year=2021
  |    |    |    | -- customer_1.csv
  |    |    |    | -- customer_2.csv
When you want Tree Schema to automatically infer the format of your data from Google Cloud Storage, you must specify the bucket and the directory of each unique schema.
In the example above, the customers schema is partitioned by year, but the users schema is not partitioned. When partitions exist in your directory structure, Tree Schema automatically adds the partition values to the schema, but only if the directory you specify sits above the level where the partitions begin.
In this example, we have defined two schemas:
bucket: analytics_bucket, directory: users
bucket: analytics_bucket, directory: accounts/customers
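To make the relationship between schema locations and partitions concrete, here is a small sketch of how an object path could be matched to one of the two locations above and how `key=value` directory names below that location could be read as partition values. This illustrates the idea only and is not Tree Schema's actual implementation; `SCHEMA_LOCATIONS`, `match_schema`, and `partition_values` are hypothetical names.

```python
from typing import Optional, Tuple

# The two schema locations defined in the example above.
SCHEMA_LOCATIONS = [
    ("analytics_bucket", "users"),
    ("analytics_bucket", "accounts/customers"),
]


def match_schema(bucket: str, object_name: str) -> Optional[Tuple[str, str]]:
    """Return the (bucket, directory) location that contains the object, if any."""
    for loc_bucket, directory in SCHEMA_LOCATIONS:
        prefix = directory.strip("/") + "/"
        if bucket == loc_bucket and object_name.startswith(prefix):
            return (loc_bucket, directory)
    return None


def partition_values(directory: str, object_name: str) -> dict:
    """Extract key=value partition directories found below `directory`."""
    prefix = directory.strip("/") + "/"
    if not object_name.startswith(prefix):
        return {}
    # Drop the schema directory and the file name, keeping only sub-directories.
    sub_dirs = object_name[len(prefix):].split("/")[:-1]
    partitions = {}
    for part in sub_dirs:
        key, sep, value = part.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


if __name__ == "__main__":
    path = "accounts/customers/year=2020/customer_1.csv"
    print(match_schema("analytics_bucket", path))
    print(partition_values("accounts/customers", path))
```

Note that if the location were given as `accounts/customers/year=2020`, no partition directories would remain below it and the function would return an empty dictionary, which mirrors the requirement to specify the directory above where the partitions begin.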
Google Cloud Storage Limitations
Because the directory structure within GCS may be unique for each bucket, Tree Schema does not read through all the files in your GCS buckets to try to identify where unique schemas exist. You must provide the location of each schema.