Azure Blob Storage

Tree Schema integrates with Azure Blob Storage to automatically extract metadata from the files stored in your Blob Storage account.

Connecting to Azure Blob Storage

There is one required field when connecting to Azure Blob Storage:

  • Shared Access Connection String: This is the full connection string that is generated for a Shared Access Signature (SAS). SAS is used to connect Tree Schema to your account because it provides fine-grained access control and time-constrained access.

../../_images/azure_blob_connection.png

A shared access connection string is unique to each Azure storage account. For details on storage accounts, see the Azure documentation.

Note

When you connect Tree Schema to a storage account, Tree Schema can access any container within that account, as long as the credentials provided grant access to the container and the blobs within it.


Creating a Shared Access Signature and Permissions

The Azure documentation covers Shared Access Signatures in full detail.

As a quick overview, you can follow these steps to create one:

  1. Navigate to your storage account's page in the Azure portal.

../../_images/navigate_to_storage.png

  2. Select Shared Access Signature on the left.

../../_images/shared_access_signature.png

  3. Select the permissions. Tree Schema needs the following access:

  • Allowed services: Blob

  • Allowed resource types: Service, Container, and Object

  • Allowed permissions: Read and List

  • Blob versioning permissions: None

  • Set an end date as far in the future as you are comfortable with. You can update the Shared Access Signature that Tree Schema uses at any time.

  • Optionally, you can limit access to the IP address that Tree Schema will use to access your data. View the Jump Server documentation to learn how to find the Tree Schema IP assigned to your account.

../../_images/shared_access_signature_permissions.png

  4. The Shared Access Signature connection string will be generated at the bottom:

../../_images/shared_access_connection_string.png
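
Before pasting the connection string into Tree Schema, it can help to confirm that it grants the access listed above. The following is a minimal sketch, not part of Tree Schema. It assumes the connection string uses the usual `Key=Value;Key=Value` layout with a `BlobEndpoint` entry and a `SharedAccessSignature` entry, and that the token encodes services, resource types, and permissions in the `ss`, `srt`, and `sp` fields. The function names, the account name, and the placeholder signature are hypothetical.

```python
from urllib.parse import parse_qs


def parse_sas_connection_string(conn_str: str) -> dict:
    """Split 'Key=Value;Key=Value' pairs. Values may themselves contain '='."""
    parts = {}
    for segment in conn_str.strip().split(";"):
        if not segment:
            continue
        # Split on the first '=' only, so the SAS token stays intact.
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts


def missing_sas_settings(conn_str: str) -> list:
    """Return a list of problems found; an empty list means nothing is missing."""
    parts = parse_sas_connection_string(conn_str)
    problems = []
    if "BlobEndpoint" not in parts:
        problems.append("no BlobEndpoint")
    token = parse_qs(parts.get("SharedAccessSignature", "").lstrip("?"))
    # Assumed field meanings: ss = services, srt = resource types, sp = permissions.
    # b = Blob; s, c, o = Service, Container, Object; r, l = Read, List.
    required = {"ss": "b", "srt": "sco", "sp": "rl"}
    for field, needed in required.items():
        granted = token.get(field, [""])[0]
        for flag in needed:
            if flag not in granted:
                problems.append(f"{field} is missing '{flag}'")
    return problems


# Hypothetical connection string with a placeholder signature.
example = (
    "BlobEndpoint=https://exampleaccount.blob.core.windows.net/;"
    "SharedAccessSignature=sv=2020-08-04&ss=b&srt=sco&sp=rl"
    "&se=2030-01-01T00:00:00Z&spr=https&sig=PLACEHOLDER"
)
print(missing_sas_settings(example))
```

The check only inspects the text of the connection string. It does not contact Azure, so it cannot tell whether the signature is valid or has expired.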

Automatically Extracting Schemas

Within Azure Blob Storage you may have several directory structures that each contain a unique schema. For example, you might have a directory that contains all of your user information and you might have a separate directory that contains all of your account information in the same container:

- analytics_container
| -- users
|  | -- user_file1.csv
|  | -- user_file2.csv
| -- accounts
|  | -- customers
|  |  | -- year=2020
|  |  |  | -- customer_1.csv
|  |  |  | -- customer_2.csv
|  |  | -- year=2021
|  |  |  | -- customer_1.csv
|  |  |  | -- customer_2.csv

When you want Tree Schema to automatically infer the format of your data from Azure Blob Storage you must specify the container and the directory structure of each unique schema, as seen below:

../../_images/azure_blob_schemas.png

In the example above, the customers schema is partitioned by year, while the users schema is not partitioned. When partitions exist in your directory structure, Tree Schema automatically adds the partition values to the schema, but only if the directory you specify sits above the level where the partitions begin (here, accounts/customers rather than accounts/customers/year=2020).

In this example, we have defined two schemas:

  1. container: analytics_container, directory: users

  2. container: analytics_container, directory: accounts/customers
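
To illustrate why the schema directory must sit above the partition folders, here is a minimal sketch of how partition values could be read from a blob path. It is not Tree Schema's implementation. It assumes partitions are folders named `key=value`, as in the example tree, and the function name `partition_values` is hypothetical.

```python
def partition_values(blob_name: str, schema_directory: str) -> dict:
    """Return {partition_key: value} for 'key=value' folders below schema_directory."""
    prefix = schema_directory.strip("/") + "/"
    if not blob_name.startswith(prefix):
        raise ValueError(f"{blob_name!r} is not under {schema_directory!r}")
    # Folders between the schema directory and the file name.
    folders = blob_name[len(prefix):].split("/")[:-1]
    values = {}
    for folder in folders:
        key, sep, value = folder.partition("=")
        if sep:  # only folders of the form key=value count as partitions
            values[key] = value
    return values


# Schema directory above the partition folders: the year is picked up.
print(partition_values("accounts/customers/year=2020/customer_1.csv", "accounts/customers"))
# {'year': '2020'}

# The users schema has no partition folders.
print(partition_values("users/user_file1.csv", "users"))
# {}

# Pointing at the partition folder itself hides the partition value.
print(partition_values("accounts/customers/year=2020/customer_1.csv", "accounts/customers/year=2020"))
# {}
```

In the last call the `year=2020` folder is part of the schema directory rather than below it, so no partition value is found. This mirrors the requirement to specify the directory above where the partitions begin.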


Azure Blob Limitations

Because the structure of directories and containers within Azure Blob Storage may be unique to each storage account, Tree Schema does not read through all the files in your Blob Storage account to try to identify where unique schemas exist. You must provide the location of each schema.