A data lake is a central repository that holds an organisation’s structured, semi-structured and unstructured data. In most cases, the ingested data arrives from many scattered sources. As a data lake accumulates data over the years, this can lead to ‘data swamping’, where users no longer know where their data is stored or what transformations were applied to it after ingestion. Data then sits in isolation, which defeats the purpose of storing it in the first place.

This is where data lake governance comes into play. Data governance is a pre-defined data management process that an organisation implements to ensure that high-quality data is available throughout the entire data life cycle. However, current data lake solutions leave a gap in semantic consistency and metadata governance (Gartner, 2017).

There are a number of benefits to implementing data governance within a data lake:

  • Traceability – helps understand the entire life cycle of the data residing in the data lake (this also includes metadata and lineage visibility)
  • Ownership – helps organisations to identify data owners should there be questions about the validity of data
  • Visibility – helps data scientists quickly recognise and access the data they are looking for amidst large volumes of structured, semi-structured and unstructured data
  • Monitored health – helps ensure that data in the data lake adheres to pre-defined governance standards
  • Intuitive data search – helps users find and ‘shop’ for data in one central location, using familiar business terms and filters that narrow results to isolate the right data

Praedictio Data Lake

Praedictio, an Amazon Web Services-powered data lake solution developed by Mitra Innovation, offers all of the business benefits discussed above. One of the key attractions of the Praedictio Data Lake is its visualisation component, which provides a three-fold view of the data lake:

  • data lineage visualisation
  • source and destination visualisation
  • graph visualisation of data in the data lake

Furthermore, Praedictio Data Lake is equipped with a dashboard component that visualises the health of the data lake for users, along with an alerting mechanism that fires when pre-defined thresholds are met.
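To make the idea of threshold-based health alerting concrete, here is a minimal sketch in Python. It is not Praedictio's implementation: the `HealthRule` class, the `check_health` function, the metric names and the threshold values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthRule:
    """Hypothetical governance rule: alert when a metric reaches a threshold."""
    metric: str
    threshold: float


def check_health(metrics: dict[str, float], rules: list[HealthRule]) -> list[str]:
    """Return one alert message per rule whose threshold is met or exceeded."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # Metrics that were not reported are skipped rather than treated as failures.
        if value is not None and value >= rule.threshold:
            alerts.append(
                f"ALERT: {rule.metric}={value} reached threshold {rule.threshold}"
            )
    return alerts


# Illustrative values only; they are not taken from any real data lake.
metrics = {"null_ratio": 0.12, "days_since_last_update": 3}
rules = [
    HealthRule("null_ratio", 0.10),
    HealthRule("days_since_last_update", 7),
]
alerts = check_health(metrics, rules)
print(alerts)
```

With these sample values, only the `null_ratio` rule produces an alert, because the dataset was refreshed well within the assumed seven-day limit.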

Another key feature of a data lake is the ability to catalogue data, based on the metadata that describes the data residing in the data lake. The catalogue helps users search for the data they need and determine which data is fit to use and which should be discarded because it is incomplete or irrelevant to the analysis at hand. It also shows how the schema of the underlying data has changed over time.
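The sketch below illustrates, under stated assumptions, how a metadata catalogue could support both search and schema-change tracking. The `CatalogEntry` structure, the `search` and `schema_changes` helpers and the sample dataset are hypothetical; a real catalogue would hold far richer metadata.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalogue record describing one dataset in the lake."""
    name: str
    owner: str
    tags: set[str]
    # Each schema version maps a column name to its type, oldest version first.
    schema_versions: list[dict[str, str]] = field(default_factory=list)


def search(catalog: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Find entries whose name contains the term or that carry it as a tag."""
    term = term.lower()
    return [
        entry for entry in catalog
        if term in entry.name.lower() or term in {t.lower() for t in entry.tags}
    ]


def schema_changes(entry: CatalogEntry) -> list[dict[str, list[str]]]:
    """Compare each schema version with the one before it."""
    changes = []
    for old, new in zip(entry.schema_versions, entry.schema_versions[1:]):
        changes.append({
            "added": sorted(new.keys() - old.keys()),
            "removed": sorted(old.keys() - new.keys()),
            "retyped": sorted(c for c in old.keys() & new.keys() if old[c] != new[c]),
        })
    return changes


# Illustrative catalogue content only.
catalog = [
    CatalogEntry(
        name="orders",
        owner="finance-team",
        tags={"sales", "transactions"},
        schema_versions=[
            {"id": "int", "amount": "float"},
            {"id": "int", "amount": "decimal", "currency": "string"},
        ],
    ),
    CatalogEntry(name="web_logs", owner="platform-team", tags={"clickstream"}),
]

print([entry.name for entry in search(catalog, "sales")])
print(schema_changes(catalog[0]))
```

Storing the owner alongside each entry also reflects the ownership benefit described earlier: a user who finds a dataset through search can see who to ask about its validity.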

Takeaway

Data lakes store data in its native format. The data structure and requirements are not defined until the data is needed. Without context, such raw data is hard to interpret and cannot readily be used to derive the business insights that give a competitive edge. It is therefore important that an organisation adds policy-driven processes that give context to the underlying data, so that stakeholders can use it more efficiently and effectively.

Hence, data governance policies and data cataloguing are of great importance for generating higher value, producing actionable insights and making informed decisions, as well as for eliminating the data silos that currently hamper data lakes.

Follow us as we explore the newest frontiers in ICT innovation and apply such technologies to solving real-world problems faced by enterprises, organisations and individuals. Thank you so much for reading!

Kalani Samarawickrama

Senior Software Engineer | Mitra Innovation
