Azure Blob Storage
Tree Schema integrates with Azure Blob Storage to automatically extract metadata from files that exist within your Blob Storage.
Connecting to Azure Blob Storage
There is one required field when connecting to Azure Blob Storage:
Shared Access Connection String: This is the full connection string that is generated for a Shared Access Signature (SAS). SAS is used to connect Tree Schema to your account because it provides fine-grained access control and time-constrained access.
A shared access connection string is specific to a single Azure storage account. For details on storage accounts, visit the Azure documentation.
When you connect Tree Schema to a storage account, you can access any container within that account, provided the SAS you supplied grants access to the container and to the blobs within it.
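Before pasting a connection string into Tree Schema, it can help to check what the SAS actually grants and when it expires. The sketch below parses a SAS connection string with the Python standard library only. The account name, token values, and the function names `parse_sas_connection_string` and `describe_sas` are illustrative assumptions, and it assumes the expiry uses the full `YYYY-MM-DDThh:mm:ssZ` form. It does not contact Azure and says nothing about which permissions Tree Schema itself requires.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def parse_sas_connection_string(conn_str: str) -> dict:
    """Split a SAS connection string into its top-level key/value parts."""
    parts = {}
    for segment in conn_str.strip().split(";"):
        if not segment:
            continue
        # Split on the first "=" only: the SAS token itself contains "=".
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts


def describe_sas(conn_str: str) -> dict:
    """Return the endpoint, permissions, resource types and expiry in the SAS."""
    parts = parse_sas_connection_string(conn_str)
    token = parse_qs(parts.get("SharedAccessSignature", "").lstrip("?"))
    expiry_raw = token.get("se", [None])[0]
    expiry = None
    if expiry_raw:
        # Assumes the full UTC timestamp form, e.g. 2030-01-01T00:00:00Z.
        expiry = datetime.strptime(expiry_raw, "%Y-%m-%dT%H:%M:%SZ")
        expiry = expiry.replace(tzinfo=timezone.utc)
    return {
        "blob_endpoint": parts.get("BlobEndpoint"),
        "permissions": token.get("sp", [""])[0],      # e.g. r = read, l = list
        "resource_types": token.get("srt", [""])[0],  # s, c, o
        "expires": expiry,
    }


# Placeholder values only; the signature below is not a real credential.
EXAMPLE = (
    "BlobEndpoint=https://myaccount.blob.core.windows.net/;"
    "SharedAccessSignature=sv=2020-08-04&ss=b&srt=sco&sp=rl"
    "&se=2030-01-01T00:00:00Z&sig=PLACEHOLDER"
)

info = describe_sas(EXAMPLE)
print(info["permissions"], info["resource_types"], info["expires"])
```

If `resource_types` lacks `c` or `o`, or `permissions` lacks read or list, the SAS is unlikely to give access to containers and the blobs inside them, which is the condition described above.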
Automatically Extracting Schemas
Within Azure Blob Storage you may have several directories, each holding data with its own schema. For example, a single container might have one directory that contains all of your user information and a separate directory that contains all of your account information:
- analytics_container
| -- users
|    | -- user_file1.csv
|    | -- user_file2.csv
| -- accounts
|    | -- customers
|    |    | -- year=2020
|    |    |    | -- customer_1.csv
|    |    |    | -- customer_2.csv
|    |    | -- year=2021
|    |    |    | -- customer_1.csv
|    |    |    | -- customer_2.csv
For Tree Schema to automatically infer the format of your data in Azure Blob Storage, you must specify the container and the directory of each unique schema.
In the example above, the customers schema is partitioned by year but the users schema is not partitioned. When partitions exist in your directory structure, Tree Schema will automatically add the values of your partitions to the schema, but only if the directory you specify is the one above where the partitions begin (here, accounts/customers rather than one of the year=... directories).
In this example, we have defined two schemas:
container: analytics_container, directory: users
container: analytics_container, directory: accounts/customers
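To make the partition rule concrete, here is a minimal sketch of how partition values could be read from blob paths relative to a schema directory. It is not Tree Schema's implementation; the function names `match_schema` and `partition_values` are hypothetical, and it assumes partitions are path segments of the form `key=value`.

```python
from typing import Dict, List, Optional, Tuple

# The two schema locations from the example above: (container, directory).
SCHEMAS: List[Tuple[str, str]] = [
    ("analytics_container", "users"),
    ("analytics_container", "accounts/customers"),
]


def match_schema(container: str, blob_name: str) -> Optional[Tuple[str, str]]:
    """Return the (container, directory) schema a blob falls under, if any."""
    for schema_container, directory in SCHEMAS:
        if container == schema_container and blob_name.startswith(directory + "/"):
            return schema_container, directory
    return None


def partition_values(directory: str, blob_name: str) -> Dict[str, str]:
    """Collect key=value path segments between the schema directory and the file."""
    relative = blob_name[len(directory) + 1:]
    segments = relative.split("/")[:-1]  # drop the file name
    values = {}
    for segment in segments:
        key, sep, value = segment.partition("=")
        if sep and key:  # only segments containing "=" count as partitions
            values[key] = value
    return values


blobs = [
    "users/user_file1.csv",
    "accounts/customers/year=2020/customer_1.csv",
    "accounts/customers/year=2021/customer_2.csv",
]
for name in blobs:
    schema = match_schema("analytics_container", name)
    if schema is not None:
        print(schema[1], partition_values(schema[1], name))
```

In this sketch, pointing the schema at `accounts/customers/year=2020` instead of `accounts/customers` leaves no `key=value` segment below the schema directory, so `partition_values` returns an empty dictionary and the `year` value is lost. That mirrors why the directory above the partitions is the one to specify.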
Azure Blob Limitations
Because the structure of directories and containers within Azure Blob Storage may be unique to each storage account, Tree Schema does not read through all the files in your Blob Storage account to try to identify where unique schemas exist. You must provide the location of each schema.
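Since the schema locations have to be supplied by hand, you may want to survey your own blob names first. The following sketch groups a list of blob names by directory, cutting each path at the first `key=value` segment, to suggest candidate directories. The helper `candidate_schema_directories` is hypothetical and not part of Tree Schema; how you obtain the list of blob names is up to you and is not shown.

```python
from collections import Counter
from typing import Iterable


def candidate_schema_directories(blob_names: Iterable[str]) -> Counter:
    """Count files per directory, cutting paths at the first key=value segment."""
    counts: Counter = Counter()
    for name in blob_names:
        directory_parts = []
        for segment in name.split("/")[:-1]:  # ignore the file name
            if "=" in segment:  # partition folder: the schema root is above it
                break
            directory_parts.append(segment)
        if directory_parts:  # skip blobs stored at the container root
            counts["/".join(directory_parts)] += 1
    return counts


names = [
    "users/user_file1.csv",
    "users/user_file2.csv",
    "accounts/customers/year=2020/customer_1.csv",
    "accounts/customers/year=2020/customer_2.csv",
    "accounts/customers/year=2021/customer_1.csv",
    "accounts/customers/year=2021/customer_2.csv",
]
for directory, file_count in sorted(candidate_schema_directories(names).items()):
    print(directory, file_count)
```

For the example container this yields `accounts/customers` and `users`, the same two directories registered as schemas earlier. The result is only a starting point: directories that mix files of different formats still need to be reviewed manually.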