Understanding the Schema Definition

The schema definition determines the shape of your schema and the structure of your data. The image below shows an empty schema definition:

../../_images/schema_definition.png

The first thing to note is that you can create the schema manually by typing in each field, or you can upload a sample file and let us identify the schema on your behalf. Later on we will also show you how to generate the schema directly from the Data Store.

Schema Layout

To define a schema layout, fill in the appropriate values for each row. Each row represents a field: for tables this is a column, and for JSON data this is a key.

  • Field Name: This field is required

  • Data Type: This field is required - it is used to validate data type consistency across all of your schemas and fields and to ensure that lineage between fields maintains valid data type transformations (e.g. you do not convert a boolean to an integer). We have chosen to use a common serialization standard, JSON, as our source for data types. More information can be found at the official JSON specification website.

  • Data Format: This field is required. JSON is a great way to standardize data types across your entire data ecosystem, but most Data Stores have more specific data types and formats. Those values are captured here (e.g. varchar(128)).

  • Null: Whether or not the field can be null

  • Sample Values: A comma-separated list of sample values for this field. Later on you will be able to assign definitions to each sample value.

../../_images/example_schema_definition.png

Embedded Object Fields

Data is not always flat and your Data Catalog should be able to handle any shape or format of data. In the example above we defined four fields:

  • boolean_field

  • object_holder

  • object_holder.int_value

  • object_holder.str_value

Notice that object_holder is given the data type object, and that every field that belongs inside the object uses dot notation to specify its place in the hierarchy. Once your schema is created it will have this format:

{
    "boolean_field": "boolean",
    "object_holder": {
        "int_value": "integer",
        "str_value": "string"
    }
}
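The mapping from dotted field names to a nested structure can be sketched in a few lines of Python. This is only an illustration of the idea, not Tree Schema's actual implementation: the function name build_schema and the (field name, data type) pair format are assumptions made for this example.

```python
def build_schema(fields):
    """Build a nested dict from (dotted_name, data_type) pairs.

    Hypothetical helper: assumes each parent object is listed
    before any of its children.
    """
    schema = {}
    for name, data_type in fields:
        *parents, leaf = name.split(".")
        node = schema
        for parent in parents:
            # Walk down into the object that should hold this field.
            node = node[parent]
        # An object field becomes an empty container; other fields store their type.
        node[leaf] = {} if data_type == "object" else data_type
    return schema


fields = [
    ("boolean_field", "boolean"),
    ("object_holder", "object"),
    ("object_holder.int_value", "integer"),
    ("object_holder.str_value", "string"),
]
schema = build_schema(fields)
```

Here schema holds the same nested structure as the format shown above, with int_value and str_value placed inside object_holder.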

The same principle applies to lists. Suppose we create the following:

../../_images/example_schema_definition_list.png

This schema has the fields:

  • array_holder

  • array_holder.int_value

This will give us the following schema:

{
    "array_holder": [
        {
            "int_value": "number"
        }
    ]
}

We can even create objects inside of lists! Suppose we create the following schema:

../../_images/ex_schema_def_inner_list.png

This schema has the fields:

  • array_holder

  • array_holder.object_holder

  • array_holder.object_holder.int_val

  • array_holder.object_holder.str_val

  • array_holder.object_holder.inner_array

  • array_holder.object_holder.inner_array.second_str_val

We end up with this schema:

{
    "array_holder": [
        {
            "object_holder": {
                "inner_array": [
                    {
                        "second_str_val": "string"
                    }
                ],
                "int_val": "number",
                "str_val": "string"
            }
        }
    ]
}
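To make the array handling concrete, here is a minimal Python sketch that expands dotted field names into a nested structure where each array holds a single item template. It is an illustration under stated assumptions, not Tree Schema's implementation: the name build_nested_schema, the (field name, data type) pairs, and the data type labels "object" and "array" are all hypothetical.

```python
def build_nested_schema(fields):
    """Expand (dotted_name, data_type) pairs into nested dicts and lists.

    Hypothetical helper: assumes parents are listed before their children
    and that an array is represented as a list with one item template.
    """
    schema = {}
    for name, data_type in fields:
        *parents, leaf = name.split(".")
        node = schema
        for parent in parents:
            child = node[parent]
            # Children of an array live inside the array's single item template.
            node = child[0] if isinstance(child, list) else child
        if data_type == "object":
            node[leaf] = {}
        elif data_type == "array":
            node[leaf] = [{}]
        else:
            node[leaf] = data_type
    return schema


fields = [
    ("array_holder", "array"),
    ("array_holder.object_holder", "object"),
    ("array_holder.object_holder.int_val", "number"),
    ("array_holder.object_holder.str_val", "string"),
    ("array_holder.object_holder.inner_array", "array"),
    ("array_holder.object_holder.inner_array.second_str_val", "string"),
]
nested = build_nested_schema(fields)
```

With these six fields, nested has the same shape as the schema above: an array that contains an object, which in turn contains another array.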

Schema Inference from a File

To automatically infer a schema from a file, select the Upload File to Infer Schema button at the top.

../../_images/infer_schema_btn.png

This opens a pop-up that lists the types of files we can infer schemas from. Once you select an eligible file type you will be able to choose the file to upload.

../../_images/infer_schema_options.png

Finally, hit Submit and your schema should be filled in on your behalf:

../../_images/auto_inferred_schema.png
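How the inference works internally is not described here, but the general idea can be sketched for a single JSON record. The example below is a rough approximation only: the function names json_type and infer_fields are hypothetical, every numeric value is reported as number, and an array is described by its first element alone.

```python
import json


def json_type(value):
    """Map a parsed JSON value to a JSON data type name."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "null"


def infer_fields(record, prefix=""):
    """Return (dotted_name, data_type) pairs for one parsed JSON object."""
    fields = []
    for key, value in record.items():
        name = f"{prefix}{key}"
        fields.append((name, json_type(value)))
        if isinstance(value, dict):
            fields.extend(infer_fields(value, prefix=f"{name}."))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            # Assumption: the first element is representative of the array.
            fields.extend(infer_fields(value[0], prefix=f"{name}."))
    return fields


sample = json.loads(
    '{"boolean_field": true, "object_holder": {"int_value": 1, "str_value": "a"}}'
)
inferred = infer_fields(sample)
```

Each pair in inferred corresponds to one row of the schema layout, with nested keys expressed in dot notation.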

Partitioned Fields

If you have partitioned fields that are in the file path for your schema but not in the content of the file itself, Tree Schema will automatically pick them up and add them to the schema, as long as:

  1. The schema location is above the partitioned directories, and

  2. The partitions are in the standard field_name=value structure

For example, if you are using S3 as a Data Store you may have the following location for your schema:

s3://your-bucket/path/to/schema

Within this directory you may have sub-directories that look like this:

s3://your-bucket/path/to/schema/year=2019/month=02/day=01
s3://your-bucket/path/to/schema/year=2019/month=02/day=02
...
s3://your-bucket/path/to/schema/year=2020/month=02/day=01

Since all of your partitioned directories are below /path/to/schema, when Tree Schema generates the schema for your file it will automatically add year, month and day to your schema.
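The two conditions can be illustrated with a short Python sketch that pulls field_name=value pairs out of a path below the schema location. This is not Tree Schema's code: the function name partition_fields is hypothetical, and the sketch treats any path segment containing "=" as a partition.

```python
def partition_fields(schema_location, path):
    """Return {field_name: value} for partition segments below schema_location."""
    base = schema_location.rstrip("/") + "/"
    if not path.startswith(base):
        # Condition 1: the path must sit below the schema location.
        return {}
    partitions = {}
    for segment in path[len(base):].split("/"):
        name, sep, value = segment.partition("=")
        # Condition 2: only field_name=value segments count as partitions.
        if sep and name:
            partitions[name] = value
    return partitions


found = partition_fields(
    "s3://your-bucket/path/to/schema",
    "s3://your-bucket/path/to/schema/year=2019/month=02/day=01",
)
```

For the path above, found contains year, month and day, which are the fields that get added to the schema. A path outside the schema location yields no partition fields.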

That’s it! Now submit the schema to create it.