DataHub

While data can offer valuable insights for your organization, metadata is the key to understanding and managing that data effectively.

By publishing a data catalog, a hospital makes its data accessible (i.e., facilitates data access requests) and discoverable. Data requesters can identify the hospital’s data assets that support their specific use cases. Metrics on data quality make existing datasets more meaningful to stakeholders.

By aligning from the outset with the new European metadata standard (DCAT-AP for Health), a hospital can exchange information on its health-related data sources and make it visible at a regional, national, or European level.
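
To make that concrete, a catalog entry for a hospital dataset can be expressed in RDF using the DCAT vocabulary, on which DCAT-AP (and DCAT-AP for Health) builds. The sketch below uses rdflib; the dataset URI, title, and keywords are invented, and a real DCAT-AP for Health record requires additional mandatory properties.

```python
# Minimal sketch: describing one hospital dataset with the DCAT vocabulary,
# the basis of DCAT-AP (for Health). All identifiers below are fictitious.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

dataset = URIRef("https://hospital.example.org/catalog/oncology-omop")  # hypothetical URI
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Oncology EHR extract (OMOP CDM)", lang="en")))
g.add((dataset, DCTERMS.description, Literal("De-identified oncology records, refreshed monthly.", lang="en")))
g.add((dataset, DCAT.keyword, Literal("oncology")))
g.add((dataset, DCAT.keyword, Literal("OMOP CDM")))
g.add((dataset, DCTERMS.publisher, URIRef("https://hospital.example.org")))

# Serialize as Turtle, ready to publish in a catalog or hand to a
# regional/national metadata aggregator.
print(g.serialize(format="turtle"))
```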

For data producers / software engineers:

  • have no idea how their data is being used downstream
  • technical documentation

For data consumers / data engineers (as metadata users):

  • technical documentation
  • also valuable for insights into pipeline executions, schema change history, and data quality
  • broader-picture features: lineage, data products & data contracts

For data consumers / data engineers (as metadata providers)

For data product users:

  • finding data
  • understanding data
  • insights into how data is used

Irrespective of whether it receives any actual data request, a data holder must provide its respective Health Data Access Body (HDAB) with a description of the data sets it holds that are covered by the EHDS.

DataHub is a 3rd-generation data catalog, built for the Modern Data Stack, that enables Data Discovery, Collaboration, Governance, and end-to-end Observability. [1]

DataHub employs a model-first philosophy, with a focus on unlocking interoperability between disparate tools & systems. [2]

DataHub provides robust support for both Data Products and Data Contracts, which are integral components of its data management and governance capabilities.

  1. Definition and Purpose: Data Products in DataHub are a way to organize and manage data assets such as tables, topics, views, pipelines, charts, and dashboards. They belong to a specific domain and can be accessed by various teams or stakeholders within an organization. This concept is a key part of data mesh architecture, where Data Products are independent units managed by a specific domain team.

  2. Benefits: Data Products help in curating a coherent set of logical entities, simplifying data discovery and governance. They allow stakeholders to easily discover and understand available data, supporting data governance efforts by managing and controlling access to Data Products.

  3. Creation and Management: Data Products can be created using the DataHub UI or via a YAML file managed using GitOps practices. Users need specific privileges to create and manage Data Products within a domain.
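
As a rough illustration of such a Git-managed definition, the sketch below builds the structure in Python and writes it out as YAML; the field names approximate DataHub's data product YAML and should be checked against the DataHub docs, and all URNs and identifiers are made up.

```python
# Illustrative sketch of a data product definition kept in Git and synced to the
# catalog by CI (GitOps). Field names approximate DataHub's data product YAML;
# verify the exact schema in the DataHub documentation. All URNs are fictitious.
import yaml

data_product = {
    "id": "oncology_outcomes",
    "domain": "urn:li:domain:oncology",
    "display_name": "Oncology Outcomes",
    "description": "Curated OMOP-based tables describing oncology treatment outcomes.",
    "assets": [
        # URNs of the tables, dashboards, and pipelines that make up the product.
        "urn:li:dataset:(urn:li:dataPlatform:postgres,omop.condition_occurrence,PROD)",
    ],
    "owners": ["urn:li:corpuser:data-platform-team"],
}

with open("oncology_outcomes.dataproduct.yaml", "w") as fh:
    yaml.safe_dump(data_product, fh, sort_keys=False)
```

Keeping the file in version control means that adding or removing an asset from the Data Product goes through the same review process as any other code change.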

Data contracts set data quality standards for data products. Use data contracts to enforce those standards as early and as often as possible in a data pipeline, so that poor-quality data never causes negative downstream impact. A data contract (e.g., in a YAML file) stipulates the quality standards to which any newly ingested or transformed data must adhere, such as schema and column data types, freshness, and missing-value or validity thresholds. Each time the pipeline accepts or produces new data, the checks in the contract are executed; a failing check indicates that the new data does not meet the contract's quality standards and warrants investigation or quarantining.

If you consider a data pipeline as a set of components (data transformations, ingestions, etc.), you can apply a data contract to verify the data interface between these components and measure data quality against the agreed standards. Doing so frequently and consistently lets you break a dense data pipeline into manageable parts in which data quality is verified before data moves from one component to the next. Apply the same strategy of frequent verification in a CI/CD workflow to ensure that newly committed code adheres to the stipulated data quality standards.

  1. Definition and Purpose: A Data Contract is an agreement between a data asset’s producer and consumer, serving as a promise about the quality of the data. It includes assertions about the data’s schema, freshness, and data quality. Data Contracts are verifiable and based on the actual physical data asset, not its metadata.

  2. Characteristics: Data Contracts are producer-oriented, meaning one contract per physical data asset, owned by the producer. They consist of a set of assertions that determine the contract’s status.

  3. Creation and Management: Data Contracts can be created via the DataHub CLI, API, or UI. They allow you to promote a selected group of assertions as a public promise, and if these assertions are not met, the Data Contract is considered failing.
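
The sketch below shows the underlying idea in a tool-agnostic way: a contract as a set of assertions (schema, freshness, missing values) evaluated against a batch of data before it is handed to the next pipeline component. Column names and thresholds are illustrative; DataHub and tools such as Soda express the same assertions declaratively.

```python
# Minimal, tool-agnostic sketch of verifying a data contract in a pipeline step.
# The contract, column names, and thresholds are illustrative only.
from datetime import datetime, timedelta, timezone

import pandas as pd

CONTRACT = {
    "schema": {
        "patient_id": "int64",
        "measurement": "float64",
        "loaded_at": "datetime64[ns, UTC]",
    },
    "freshness_max_age": timedelta(hours=6),          # newest row must be younger than this
    "max_missing_fraction": {"measurement": 0.05},    # at most 5% missing values
}


def check_contract(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable violations; an empty list means the contract holds."""
    violations = []
    # 1. Schema assertions: required columns with the agreed data types.
    for column, expected in CONTRACT["schema"].items():
        if column not in df.columns:
            violations.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected:
            violations.append(f"{column}: expected {expected}, got {df[column].dtype}")
    # 2. Freshness assertion: the newest record must be recent enough.
    if "loaded_at" in df.columns:
        age = datetime.now(timezone.utc) - df["loaded_at"].max()
        if age > CONTRACT["freshness_max_age"]:
            violations.append(f"stale data: newest row is {age} old")
    # 3. Missing-value assertions per column.
    for column, limit in CONTRACT["max_missing_fraction"].items():
        if column in df.columns and df[column].isna().mean() > limit:
            violations.append(f"{column}: more than {limit:.0%} missing values")
    return violations
```

A non-empty list of violations would then fail the pipeline step or quarantine the batch, and the same check can run in CI against test data before a code change is merged.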

Sources:

Standards like the OMOP Common Data Model (CDM) provide a standardized structure to harmonize disparate data sources for large-scale analytics. Yet ensuring that data conform to such standards and remain consistent over time requires a systematic approach to quality control, spanning the entire data value chain. EHRs often contain missing values, erroneous measurements, and non-standard data types, impeding effective research and clinical decision-making. We need a continuous data quality monitoring framework that evaluates datasets on key metrics: completeness, plausibility, conformance, and cross-hospital benchmarking. For example, Lynxcare builds on Kahn's data quality framework. [7][8] OHDSI has built a data quality dashboard based on this framework. [9]
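
A minimal sketch of what such continuous monitoring could compute, in the spirit of Kahn's completeness/plausibility/conformance categories, on an OMOP-like measurement table (column names follow the OMOP CDM; the plausibility range is an invented example, and OHDSI's DataQualityDashboard implements this far more completely):

```python
# Sketch of Kahn-style data quality metrics on an OMOP-like measurement table.
# The plausibility range is an invented example for a single measurement type.
import pandas as pd


def quality_metrics(measurement: pd.DataFrame) -> dict[str, float]:
    return {
        # Completeness: fraction of rows with a recorded numeric value.
        "completeness_value": float(measurement["value_as_number"].notna().mean()),
        # Plausibility: fraction of values inside a clinically plausible range
        # (e.g., body temperature in degrees Celsius).
        "plausibility_value": float(measurement["value_as_number"].between(30, 45).mean()),
        # Conformance: fraction of rows mapped to a non-zero concept_id
        # (concept_id 0 means "unmapped" in the OMOP CDM).
        "conformance_concept_id": float((measurement["measurement_concept_id"] > 0).mean()),
        "row_count": float(len(measurement)),
    }
```

Tracked per dataset and over time, metrics like these give the completeness, plausibility, and conformance view needed for trend monitoring and cross-hospital benchmarking.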

https://www.element61.be/en/competence/data-profiling-data-quality

Data Products - https://dataproducts.substack.com/ (must read!)

The Data Quality resolution process:

https://www.reddit.com/r/dataengineering/comments/1ain8i3/preferred_data_quality_tools/

https://dataroots.io/blog/orchestrating-data-quality

Ingestion layer:

While ensuring data quality in a relational database is challenging, achieving it in a data lake becomes a herculean effort for two main reasons: volume and variability.

Poor data quality is a people and process problem masquerading as a technical problem. The challenges of volume and variability underscore this point.

Variability - The strength of S3 is also its Achilles’ heel: you can store virtually anything in it. Structured data, unstructured data, incomplete data, image data, audio data, Avro, Parquet, JSON—you name it, and you can likely store it in S3. This flexibility is incredibly powerful for data workflows involving machine learning or other downstream workflows requiring diverse use cases. However, it also creates a long tail of potential data quality problems, with silent edge cases often causing the most significant issues. [10]

Volume - Another strength—but also an Achilles’ heel—of S3 is its volume capabilities. S3’s ease of loading data and its infinite scalability can turn it into a data dumping ground, especially when the data’s use case is unclear. This lack of constraints within S3 fosters an environment ripe for data quality disasters, as teams can load any data into the lake without limits. As the data volume grows, teams must navigate the long tails of variability while facing a “needle in a haystack” challenge to identify and resolve issues. [10]

The lack of technical limitations in S3 means that it’s up to the people and processes of technical teams to protect the data lake.
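
One way to give those people and processes some technical backing is a guardrail at the ingestion layer itself: files are only copied into the lake after their schema has been checked against the agreed layout. A minimal sketch, assuming incoming Parquet files and an invented expected schema:

```python
# Sketch of an ingestion-layer guardrail: reject a Parquet file whose schema
# deviates from the agreed layout before it lands in the lake.
# The expected schema and the notion of a "landing" step are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

EXPECTED_SCHEMA = pa.schema([
    ("patient_id", pa.int64()),
    ("measurement", pa.float64()),
    ("loaded_at", pa.timestamp("us", tz="UTC")),
])


def validate_before_landing(path: str) -> None:
    """Raise ValueError if the incoming file's schema deviates from the agreed one."""
    actual = pq.read_schema(path)
    if not actual.equals(EXPECTED_SCHEMA):
        raise ValueError(
            f"{path} rejected: schema drift detected.\n"
            f"expected:\n{EXPECTED_SCHEMA}\nactual:\n{actual}"
        )
    # Only files that pass this check are copied into the lake's raw zone.
```

Such a check does not solve the people-and-process problem, but it turns silent schema drift into a loud, early failure.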

Data can become untrustworthy.

In any data pipeline or system, a Sev1 (Severity 1) incident is a high-impact problem so urgent that it receives immediate attention—often because it causes clear and measurable damage to the business (e.g., an outage that stops key processes or makes entire datasets unavailable). Since these issues are obvious and urgent, the organization quickly mobilizes resources to fix them.

However, many lower-level errors or data quality issues remain less visible—perhaps a subtle mismatch in fields, an unexpected data type, or rows that go missing in certain edge cases. Because these errors don’t trigger an obvious, business-halting failure, they often slip under the radar. Over time, these “invisible errors” can erode trust in the data or lead to incorrect assumptions downstream (e.g., inaccurate analyses, misguided strategic decisions). They can cause long-term damage without the immediate red flags that a Sev1 incident brings.

There is often an imbalance between the attention given to urgent, highly visible issues (Sev1s) and the quieter problems that still have real consequences but remain undetected until they’ve already caused harm. [11]

How do we differentiate between data we need to operationalize and data we are still evaluating for value?




  1. https://www.linkedin.com/blog/engineering/data-management/datahub-popular-metadata-architectures-explained

  2. https://datahubproject.io/docs/architecture/architecture

  3. https://github.com/sodadata/soda-core

  4. https://medium.com/@tombaeyens/introducing-soda-data-contracts-ac752f38d406

  5. https://docs.soda.io/soda/data-contracts.html

  6. https://docs.soda.io/soda-library/install.html

  7. Kahn MG, Raebel MA, Glanz JM, Raghavan R, Jackson KL, et al. (2016). A Pragmatic Framework for Single-site and Multisite Data Quality Assessment in Electronic Health Record-based Clinical Research. eGEMs (Generating Evidence & Methods to improve patient outcomes), 4(1): 1277.

  8. https://www.lynx.care/knowledge-center/sentinel-video-data-quality-dashboard

  9. https://ohdsi.github.io/DataQualityDashboard/index.html

  10. https://dataproducts.substack.com/p/why-data-quality-is-so-hard-in-s3

  11. https://dataproducts.substack.com/p/the-rise-of-data-contracts#:~:text=the%20GIGO%20Cycle%3A-,The%20GIGO%20Cycle