Data Types and Formats

We understand that there are many different ways to describe a field. We have boiled this down to three metadata fields that are rigid enough for us to run data validation on your behalf, yet flexible enough for you to fully capture the specifics of your data store.

  1. Field type: This is primarily used internally within Tree Schema. There are only three valid values:

  • Object: any embedded set of values, such as a nested JSON object, a dictionary or a map

  • List (or array): any list of items; the items may be objects, other lists or scalars

  • Scalar: everything else, such as strings, integers, floats, decimals, other numeric values and booleans

  2. Data Type: This field holds a value that is compatible with JSON serialization standards, as defined in the JSON Schema definition.

  • We chose this because a large part of the value of a data catalog is understanding relationships between data, and JSON is one of the most common, simple and interoperable ways to serialize data.

  • We map every possible value from each data store to one of these common data types, and we require that a value be selected for this field on each schema you create manually.

  • This lets us run data validation checks for you, such as “did an integer field change to a boolean?”, and track how your data types move across data stores to ensure they remain valid.

  3. Data Format: This is a free-form field that should be used to give your users more context about the field.

  • When we automatically generate schemas on your behalf, we fill in this value directly from the data store; for example, a data type of string may have a data format of varchar(64).

  • This is also a good place to capture formatting for date / timestamp objects if needed.
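
To make the three fields concrete, here is a minimal Python sketch of how they might fit together. It is an illustration only, not Tree Schema's implementation: the names FieldMeta, field_type_of and data_type_changed are hypothetical, and the set of allowed data types assumes the primitive type names from JSON Schema.

```python
from dataclasses import dataclass
from typing import Any

# The three field types described above.
FIELD_TYPES = {"object", "list", "scalar"}

# Assumption: data types are the JSON Schema primitive type names.
JSON_DATA_TYPES = {"string", "number", "integer", "boolean", "object", "array", "null"}


@dataclass(frozen=True)
class FieldMeta:
    """Hypothetical container for the three metadata fields."""
    name: str
    field_type: str        # "object", "list" or "scalar"
    data_type: str         # one of JSON_DATA_TYPES
    data_format: str = ""  # free-form, e.g. "varchar(64)"

    def __post_init__(self) -> None:
        # Field type and data type are restricted; data format is not.
        if self.field_type not in FIELD_TYPES:
            raise ValueError(f"invalid field type: {self.field_type!r}")
        if self.data_type not in JSON_DATA_TYPES:
            raise ValueError(f"invalid data type: {self.data_type!r}")


def field_type_of(value: Any) -> str:
    """Classify a sample value as an object, a list or a scalar."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, (list, tuple)):
        return "list"
    return "scalar"


def data_type_changed(old: FieldMeta, new: FieldMeta) -> bool:
    """Answer questions like 'did an integer field change to a boolean?'."""
    return old.data_type != new.data_type


# The same field captured at two points in time.
before = FieldMeta("is_active", "scalar", "integer", "tinyint(1)")
after = FieldMeta("is_active", "scalar", "boolean", "bool")
changed = data_type_changed(before, after)
```

Because the data format is free-form, the sketch compares only the data type; a change from varchar(64) to varchar(128) would not be flagged by this check.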