Data Lineage

One of the most important capabilities of a Data Catalog is letting users explore how data flows from one system to the next, and both visualize and interact with the lineage of that data.

../../_images/lineage_growth_hd.gif

We have deeply integrated data lineage into all aspects of Tree Schema to provide a comprehensive view of your data at every level, from the connections between Data Stores at the highest level down to the most granular field-level transformations. This page shows you how to use the lineage tools within Tree Schema to get the most value out of your data.


Data Lineage & Transformations

Data Lineage and Transformations are very similar within Tree Schema. The main distinction is scope: a transformation includes only the fields, schemas and data stores explicitly defined within it, whereas data lineage is a full and comprehensive view of all of your data.

When you define a single transformation you may have, as an example, field “A” in schema “1” that corresponds to field “B” in schema “2”. This is the entire scope of the transformation. However, when you view this transformation from the data lineage perspective, you can see all connections from schema “1” or field “B”, not only those defined within that single transformation.

Consider this basic example, where a transformation maps only a single field from one schema to another. The transformation looks like this:

../../_images/lineage_transform_simple.png

The same transformation can be seen in the Data Lineage view, highlighted with the red line. However, there is much more context here about other schemas and fields as well.

../../_images/lineage_simple.png
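
The difference in scope can also be sketched in code. The snippet below is a minimal illustration, not Tree Schema's API: the `FieldRef` and `Link` types, the transformation names, and every field other than “A” and “B” are hypothetical. A transformation returns only its own links, while the lineage lookup gathers links from every transformation that touches a given schema or field.

```python
from typing import List, NamedTuple, Optional


class FieldRef(NamedTuple):
    schema: str
    field: str


class Link(NamedTuple):
    source: FieldRef
    target: FieldRef


# Hypothetical transformations; each one only knows about its own links.
transformations = {
    "simple_mapping": [Link(FieldRef("1", "A"), FieldRef("2", "B"))],
    "other_mapping": [
        Link(FieldRef("1", "C"), FieldRef("3", "D")),
        Link(FieldRef("4", "E"), FieldRef("2", "B")),
    ],
}


def transformation_scope(name: str) -> List[Link]:
    """Links explicitly defined within a single transformation."""
    return transformations[name]


def lineage_scope(
    schema: Optional[str] = None, field: Optional[FieldRef] = None
) -> List[Link]:
    """Links from every transformation that touch the given schema or field."""
    matches = []
    for links in transformations.values():
        for link in links:
            ends = (link.source, link.target)
            if schema is not None and any(end.schema == schema for end in ends):
                matches.append(link)
            elif field is not None and field in ends:
                matches.append(link)
    return matches
```

Here `transformation_scope("simple_mapping")` contains only the A-to-B link, while `lineage_scope(schema="1")` and `lineage_scope(field=FieldRef("2", "B"))` also pick up links defined in the other transformation.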

Understanding the Lineage Interface

All data lineage follows the same basic principles for data relationships as the rest of Tree Schema. Fields sit within schemas and schemas reside within Data Stores. All of your data lineage will be shown with objects similar to this:

../../_images/lineage_object.png

The grey, outermost box is the Data Store. The inner green boxes are the schemas. Data stores and schemas are prefaced with Data store or Schema, respectively. The innermost boxes are the fields. In the example above, the Data Store Device & Session Data has a schema device.session with the field device_id, which has a relationship to another field. In another example, a Data Store contains two schemas, with several fields within each schema:

../../_images/lineage_object2.png
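
The nesting of fields within schemas within Data Stores can be modelled with three small classes. This is only a sketch for illustration: the class names and the exact label format are assumptions, not Tree Schema's internal model. The example values come from the first image above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldNode:
    name: str


@dataclass
class SchemaNode:
    name: str
    fields: List[FieldNode] = field(default_factory=list)


@dataclass
class DataStoreNode:
    name: str
    schemas: List[SchemaNode] = field(default_factory=list)


def render_labels(store: DataStoreNode) -> List[str]:
    """Return indented labels, outermost box first (label format is assumed)."""
    lines = [f"Data store {store.name}"]
    for schema in store.schemas:
        lines.append(f"  Schema {schema.name}")
        for node in schema.fields:
            lines.append(f"    {node.name}")
    return lines


store = DataStoreNode(
    "Device & Session Data",
    [SchemaNode("device.session", [FieldNode("device_id")])],
)
```

Calling `render_labels(store)` walks the hierarchy in the same order the boxes are nested: Data Store, then schema, then field.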

Note

Not all fields within a schema are displayed in the data lineage view. Only fields that have at least one connection to another field are shown.
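
As a rough sketch of this rule, the filter below keeps only fields that appear at either end of at least one link. The function name, the `"schema.field"` key format, and all fields and schemas other than `device.session` and `device_id` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical schema contents; only device_id comes from the example above.
schema_fields: Dict[str, List[str]] = {
    "device.session": ["device_id", "session_id", "started_at"],
}

# Hypothetical field-level links, written as (source, target) "schema.field" keys.
links: List[Tuple[str, str]] = [
    ("device.session.device_id", "device.summary.device_id"),
]


def visible_fields(
    schema: str,
    schema_fields: Dict[str, List[str]],
    links: List[Tuple[str, str]],
) -> List[str]:
    """Fields of `schema` that take part in at least one link."""
    connected = {end for link in links for end in link}
    return [
        name for name in schema_fields[schema] if f"{schema}.{name}" in connected
    ]
```

With the data above, `visible_fields("device.session", schema_fields, links)` keeps `device_id` and drops the two unconnected fields.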


Lineage Relationships

Relationships between fields are shown as lines that connect the fields. By default, connections are not “active”, so they appear light grey in the lineage view. To “activate” and highlight a connection you can do one of three things:

  1. Hover over the field: this will highlight all connections into and from that given field

  2. Hover over the schema: this will highlight all connections into and from the given schema

  3. Hover over the data store: this will highlight all connections into and from the given data store

Examples of this are:

Hovering over a field

../../_images/lineage_hover_field.png

Hovering over a schema

../../_images/lineage_hover_schema.png

Hovering over a data store

../../_images/lineage_hover_data_store.png
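
One way to think about the three hover levels is as a prefix match on each field's position in the hierarchy. The sketch below assumes every field is identified by a hypothetical `(data store, schema, field)` tuple and that a hover target is a prefix of that tuple. It illustrates the behaviour described above and is not how Tree Schema implements it.

```python
from typing import List, Tuple

Path = Tuple[str, ...]    # (data store, schema, field) or a prefix of it
Link = Tuple[Path, Path]  # (source field, target field)

# Hypothetical links between fields in two data stores.
links: List[Link] = [
    (("store_a", "schema_1", "id"), ("store_b", "schema_3", "id")),
    (("store_a", "schema_1", "name"), ("store_b", "schema_3", "name")),
    (("store_a", "schema_2", "ts"), ("store_b", "schema_4", "ts")),
]


def highlighted(hover: Path, links: List[Link]) -> List[Link]:
    """Links into or out of the hovered field, schema or data store."""

    def within(path: Path) -> bool:
        # A field is inside the hovered entity when its path starts with it.
        return path[: len(hover)] == hover

    return [link for link in links if within(link[0]) or within(link[1])]
```

Hovering over the field `("store_a", "schema_1", "id")` activates one link, hovering over the schema `("store_a", "schema_1")` activates two, and hovering over the data store `("store_a",)` activates all three.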

Exploring Data Lineage

When you hover over a field, schema or data store you will see several icons appear. Two of them are the selectors for upstream and downstream lineage, which look like this:

Upstream Selector

../../_images/lineage_selector_upstream.png

Downstream Selector

../../_images/lineage_selector_downstream.png

Similar to the lineage relationships above, these also display relative to where you hover:

  1. Hover over the field: this will allow you to add new fields, schemas & data stores that connect upstream or downstream to this specific field

  2. Hover over the schema: this will allow you to add new fields, schemas & data stores that connect upstream or downstream to any field within this schema

  3. Hover over the data store: this will allow you to add new fields, schemas & data stores that connect upstream or downstream to any field within any schema within this data store

Examples of this are:

Hovering over a field

../../_images/lineage_hover_field_up_downstream.png

Hovering over a schema

../../_images/lineage_hover_schema_up_downstream.png

Hovering over a data store

../../_images/lineage_hover_data_store_up_downstream.png

Note

Once you click a selector and Tree Schema attempts to retrieve the upstream or downstream relationships for that specific entity, that selector disappears and is not available again.
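
The selector behaviour can be sketched as a small explorer object. Everything here is hypothetical (the `LineageExplorer` class, the tuple-based paths and the sample links). The sketch assumes that "upstream" means links whose target lies within the hovered entity, and "downstream" means links whose source lies within it. It also mirrors the note above: a selector is marked as used as soon as retrieval is attempted, whether or not anything new is found.

```python
from typing import List, Set, Tuple

Path = Tuple[str, ...]    # (data store, schema, field) or a prefix of it
Link = Tuple[Path, Path]  # (source field, target field)


class LineageExplorer:
    def __init__(self, all_links: List[Link]) -> None:
        self.all_links = all_links
        self.shown: List[Link] = []
        self.used_selectors: Set[Tuple[Path, str]] = set()

    def selector_available(self, entity: Path, direction: str) -> bool:
        return (entity, direction) not in self.used_selectors

    def expand(self, entity: Path, direction: str) -> List[Link]:
        """Add upstream or downstream links for `entity`; return the new ones."""
        if direction not in ("upstream", "downstream"):
            raise ValueError("direction must be 'upstream' or 'downstream'")
        if not self.selector_available(entity, direction):
            return []
        # The selector disappears once retrieval has been attempted.
        self.used_selectors.add((entity, direction))
        # Upstream links flow into the entity (match on target);
        # downstream links flow out of it (match on source).
        end = 1 if direction == "upstream" else 0
        new = [
            link
            for link in self.all_links
            if link[end][: len(entity)] == entity and link not in self.shown
        ]
        self.shown.extend(new)
        return new


# Hypothetical catalog of field-level links.
all_links: List[Link] = [
    (("raw", "events", "device_id"), ("core", "sessions", "device_id")),
    (("raw", "events", "user_id"), ("core", "sessions", "user_id")),
    (("core", "sessions", "device_id"), ("mart", "daily", "device_id")),
]
```

Expanding upstream from the field `("core", "sessions", "device_id")` adds one link, while expanding upstream from the schema `("core", "sessions")` would add every link into any field of that schema.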


Collapsing & Expanding Lineage Nodes

Sometimes lineage exploration is just noisy: there are too many schemas, too many data stores and too much information to digest anything useful. To help with this you can collapse and expand nodes. In the top left corner of your Data Store or Schema you will see either the collapse or the expand icon.

You can see a single collapsed schema on the right:

../../_images/lineage_collapse_single.png

Notice that there are still three connections to the schema? That is because three unique fields connect to this schema. To really get the benefit of collapsing your data stores and schemas, you can collapse more than one, which also collapses the connections, as seen here:

../../_images/lineage_collapse_multiple.png
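
The effect on the connection count can be reproduced with a short sketch. It assumes, purely for illustration, that a collapsed schema replaces its fields as the endpoint of each line and that identical lines are drawn once. The names and sample data are hypothetical, and collapsing data stores could be modelled the same way.

```python
from typing import List, Set, Tuple

FieldPath = Tuple[str, str, str]  # (data store, schema, field)
SchemaKey = Tuple[str, str]       # (data store, schema)

# Three fields in schema "a" each connect to a field in schema "b".
links: List[Tuple[FieldPath, FieldPath]] = [
    (("store_1", "a", "x"), ("store_2", "b", "x")),
    (("store_1", "a", "y"), ("store_2", "b", "y")),
    (("store_1", "a", "z"), ("store_2", "b", "z")),
]


def drawn_connections(
    links: List[Tuple[FieldPath, FieldPath]], collapsed: Set[SchemaKey]
) -> Set[Tuple[tuple, tuple]]:
    """Distinct lines to draw once the collapsed schemas hide their fields."""

    def endpoint(path: FieldPath) -> tuple:
        store, schema, _field = path
        # A collapsed schema stands in for all of its fields.
        return (store, schema) if (store, schema) in collapsed else path

    return {(endpoint(source), endpoint(target)) for source, target in links}
```

Collapsing only schema `b` still leaves three lines, one per source field in `a`. Collapsing both `a` and `b` merges them into a single line.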