10. 05. 2021

Traceability and Compliance in the Data Lake

Digital transformation allows organizations to set themselves up for future innovation, and the banking and insurance industry is no exception. The use of AI opens up numerous new possibilities, for example in analyzing customer behavior. It is therefore crucial that AI and Data Analytics are properly integrated into the company, both to avoid risks and to benefit as much as possible from the digital transformation.

However, it is not only the amount of data available to companies that keeps growing: the associated Compliance regulations and principles are increasing as well. With the GDPR at the latest, traceability has become indispensable for meeting Compliance requirements. In this context, traceability means understanding where one's own data comes from and what has happened to it along the way, and this must hold at every single point of the pipeline.

For Data Lake platforms, this results in specific requirements for traceability, which can be met through the systematic setup of metadata. Establishing an automated Data Lineage as well as a Data Catalog plays a special role here.

What makes Data Lineage "provable"? If the lineage is automatically extracted from the code, it accurately represents the actual state, and is just a concise representation of the implementation. This is not the case with manually maintained metadata.

Markus Salomon, Business Development Manager Data Engineering & AI
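
The idea of deriving lineage directly from the implementation can be illustrated with a minimal sketch. It assumes a pipeline written in Python that reads and writes tables through two functions called `read_table` and `write_table`; these names, the table names, and the pipeline itself are hypothetical and only serve as an example. The lineage is obtained by walking the syntax tree of the code, so it cannot drift away from what the code actually does.

```python
import ast

# Hypothetical pipeline source. read_table, write_table and join are
# illustrative names, not functions of any real library.
PIPELINE_SOURCE = '''
customers = read_table("raw.customers")
contracts = read_table("raw.contracts")
joined = join(customers, contracts)
write_table(joined, "curated.customer_contracts")
'''


def extract_lineage(source: str) -> dict[str, list[str]]:
    """Collect input and output table names by walking the syntax tree."""
    inputs, outputs = [], []
    for node in ast.walk(ast.parse(source)):
        # Only plain function calls such as read_table(...) are of interest.
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        # String literals among the positional arguments are table names.
        literals = [
            arg.value
            for arg in node.args
            if isinstance(arg, ast.Constant) and isinstance(arg.value, str)
        ]
        if node.func.id == "read_table":
            inputs.extend(literals)
        elif node.func.id == "write_table":
            outputs.extend(literals)
    return {"inputs": sorted(inputs), "outputs": sorted(outputs)}


lineage = extract_lineage(PIPELINE_SOURCE)
print(lineage)
```

A real lineage tool would also have to handle table names built at runtime, SQL embedded in strings, and column-level dependencies; the sketch only covers literal table names.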

Platform frameworks also make a valuable contribution. They are mainly introduced so that the platform can be extended rapidly and in a standardized way and operated well. Because metadata is generated inside the platform, they also provide a good basis when neither a dedicated system for Data Lineage nor a Data Catalog is available. Furthermore, automated provisioning of infrastructure and configuration through CI/CD pipelines helps to ensure traceability at a deeper level, making it easier to meet Compliance requirements.
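
As a rough illustration of how a platform framework can generate such metadata as a by-product, the following sketch wraps each pipeline step in a decorator that records what the step read and wrote. The decorator `tracked_step`, the in-memory `RUN_LOG`, and the table names are assumptions made for this example; an actual framework would write to a persistent metadata store.

```python
import functools
from datetime import datetime, timezone

# Stand-in for a metadata store (assumption: a real platform would persist this).
RUN_LOG: list[dict] = []


def tracked_step(inputs, outputs):
    """Hypothetical framework decorator that logs metadata for every step run."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started_at = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            # The record is written only after the step has finished successfully.
            RUN_LOG.append({
                "step": func.__name__,
                "inputs": list(inputs),
                "outputs": list(outputs),
                "started_at": started_at,
            })
            return result
        return wrapper
    return decorator


@tracked_step(inputs=["raw.customers"], outputs=["curated.customers"])
def clean_customers(rows):
    # Drop records without an identifier.
    return [row for row in rows if row.get("id") is not None]


cleaned = clean_customers([{"id": 1}, {"id": None}])
print(RUN_LOG)
```

Because every step has to go through the same wrapper, the metadata is produced in a standardized way without relying on developers to maintain it by hand.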

On an organizational level, the overarching strengthening of DevOps and Data Governance is a key step, not only through tools and technology, but also in the corporate culture.

Modern methods such as Machine Learning bring their own challenges and opportunities when it comes to traceability and Compliance. On the one hand, Machine Learning can extract new information from data that benefits traceability. On the other hand, some of its results cannot be traced back to the source. Explainable AI is the subject of active research, but in practice many algorithms remain a black box, which leaves a gap in traceability. Furthermore, Machine Learning and AI can correlate data in ways that may enable improper analysis and thus affect Compliance.

Markus Salomon (Business Development Manager Data Engineering & AI) and Felix Unverzagt (Data Scientist) addressed the topic "Traceability & Compliance in the Data Lake" on the second day of a digital conference about AI & Data Analytics in Banking and Insurance. In their presentation, they embedded these concepts into development and operational processes. They also focused on metadata and provable Data Lineage and addressed the role of Machine Learning in this context. You can download the presentation slides here to learn more.

Download presentation slides (PDF, 1.3 MB)

What about your Compliance?

Together we can get to the bottom of your Data Lineage. Many years of experience and strong expertise in the banking environment make Adastra a world-class partner for your projects. Contact us for a free initial consultation.

Learn more today, get started tomorrow.
