Blackbar

For technical documentation, please visit the Blackbar website.

Data-driven technologies can lead to new applications for medicine and healthcare practice, fueled by the massive stores of unstructured data in hospital silos.¹ However, for those outside the field, it’s easy to overlook some of the hurdles involved in moving these innovations from theory into clinical use.

Chief among these hurdles is patient privacy. Free-form text in Electronic Health Records (EHR) provides invaluable insights into patient backgrounds, disease progression, and treatments. Yet this same unstructured nature, which frequently intertwines clinical details with personally identifiable information (PII), restricts researchers’ and innovators’ ability to access and leverage the data at scale.

Safeguards like pseudonymization are key to mitigating privacy concerns and strengthening trust in healthcare data practices among stakeholders, ensuring that these technologies can be used responsibly and effectively.

At the same time, hospitals must be able to leverage pseudonymized clinical data across multiple scenarios — whether in a research context, for proof-of-value initiatives, or through APIs that integrate AI-driven functionalities into clinical workflows and software.

Blackbar delivers a comprehensive, end-to-end solution that meets these requirements.

Pseudonymization

🛡️ Pseudonymisation can reduce the risks to the data subjects by preventing the attribution of personal data to natural persons in the course of the processing of the data, and in the event of unauthorised access or use.

🔔 Pseudonymised data, which could be attributed to a natural person by the use of additional information, is still to be considered personal data, even if the pseudonymous data and additional information are not in the hands of the same person.

📌 Controllers need to
🔹 modify or transform the data
🔹 keep additional information for attributing the personal data to a specific data subject separately
🔹 apply technical and organisational measures to ensure personal data are not being attributed

Pseudonymisation can help reduce risks to confidentiality, risks of function creep, and risks to accuracy. It can also facilitate data analysis and support data minimisation and transfers to third parties or third countries.²

Function creep occurs when data that was initially gathered for a specific, limited purpose begins to be used for additional, unintended purposes—often without the informed consent of those whom the data concerns. In the context of healthcare, this might happen if patient information collected for treatment or research is later repurposed in ways patients or regulators did not anticipate (for instance, for marketing or broader analytics unconnected to the original care objectives). Pseudonymization helps mitigate this risk because it reduces the possibility of re-identifying individuals or misusing their data for functions beyond the originally stated intent.

Function creep and accuracy refer to two separate but related risks when reusing personal data:

Function Creep
- Data gathered for one purpose might be quietly repurposed for another, unintended use (e.g., marketing or profiling) without new consent. This “creep” violates privacy expectations and can undermine trust.
Accuracy
- Repeatedly repurposing or combining data from multiple sources can introduce errors, incorrect linkages, or mismatched records. Over time, these inaccuracies can accumulate, reducing the overall reliability of the dataset.

By pseudonymizing data, organizations can mitigate both risks. They limit the ability to repurpose data for extraneous functions (function creep) and maintain more careful controls over how the data is combined or shared—thereby preserving accuracy.

Rationale for Narratives

Structured and coded data are indispensable for standardizing information and supporting interoperability in healthcare.

Nevertheless, clinicians still rely on narrative text across various sources because structured formats alone cannot capture the full nuance of patient care. The context, depth, and flexibility provided by free-form notes are critical for truly patient-centered medicine.³

As a result, even as healthcare players increasingly emphasize capturing healthcare data in structured or coded formats, free-form clinical narratives are unlikely to vanish. Instead, documentation will continue to benefit most from a complementary, hybrid approach.

According to widely cited estimates, between 60% and 80% of clinically relevant hospital data exists in narrative, unstructured text rather than in standardized, coded fields.⁴ Because most healthcare information systems cannot easily process this data, its potential for secondary use remains largely untapped, despite the promise of advanced analytics and AI-driven text processing to improve patient care and research. For example, physician notes could support readmission prediction, as well as prediction of disease onset or progression to enable timely and personalized intervention.

Many other use cases fall within the scope of recent developments in Natural Language Processing (NLP) and Large Language Models (LLMs) ⁵ ⁶ ⁷, such as automated translation of free-text clinical notes into structured, coded information. However, some significant hurdles remain for widely trusted and scalable reuse:

safeguarding patient privacy;
managing data for reuse, since unstructured data by definition does not fit into standardized models like OMOP-CDM;
translating these innovations in text processing into clinician-friendly applications that integrate seamlessly with existing EHR systems.

As a result, at a time when healthcare organizations face increasing economic pressures, they still use manual processes to extract needed information from unstructured data in the EHR, primarily for purposes such as statistical reporting (in ICD-10), registries, quality reporting, chronic disease management, and research support. Additional reporting requirements add to the administrative burden on frontline caregivers, who must spend even more time going through clinical notes.

Project Blackbar was created to help hospitals anonymize and/or pseudonymize their free-form textual sources of clinical information, and to maximize the secure sharing of clinically relevant information with researchers, external collaborators, or third parties, while ensuring GDPR compliance and without revealing sensitive personal data.

Approach

Blackbar provides an automated solution to redact any PII of patients and caregivers in free-form text. By employing advanced pseudonymization techniques, Blackbar effectively reduces the risk of exposing information that could directly or indirectly identify individuals, enabling secure and compliant data reuse.

Project Blackbar uses a hybrid approach that combines deep learning techniques with advanced lookup-based techniques to locate PII in clinical notes. Blackbar allows the models to be deployed offline on your own infrastructure.

Furthermore, hospitals generally deploy integrated EHR systems in which much information is already digitized. To redact the PII in health records, Blackbar can use the information already available in the hospital’s databases to improve PII removal.
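
As a rough illustration of such a database-backed lookup, the sketch below uses Smith-Waterman local alignment (the technique this README names for matching name and address variants) to locate a misspelled variant of a known patient name in a note. The function names, scoring values, and acceptance threshold are illustrative assumptions, not Blackbar's actual implementation.

```python
from typing import Optional, Tuple


def smith_waterman(query: str, text: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> Tuple[int, int, int]:
    """Best local alignment of `query` inside `text` (case-insensitive).

    Returns (score, start, end), where text[start:end] is the aligned region.
    Uses O(len(query) * len(text)) memory, which is acceptable for a sketch.
    Assumes lowercasing does not change the length of either string.
    """
    q, t = query.lower(), text.lower()
    rows, cols = len(q) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    # origin[i][j]: text offset where the alignment ending at cell (i, j) began.
    # A zero-score cell means a fresh alignment would start at offset j.
    origin = [list(range(cols)) for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if q[i - 1] == t[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # query character left unmatched
            left = score[i][j - 1] + gap    # text character left unmatched
            s = max(0, diag, up, left)
            if s == 0:
                continue
            score[i][j] = s
            if s == diag:
                origin[i][j] = origin[i - 1][j - 1]
            elif s == up:
                origin[i][j] = origin[i - 1][j]
            else:
                origin[i][j] = origin[i][j - 1]
            if s > best[0]:
                best = (s, origin[i][j], j)
    return best


def find_name_variant(name: str, text: str, min_ratio: float = 0.75,
                      match: int = 2) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of `name` or a close variant, else None.

    `min_ratio` is a hypothetical threshold: the fraction of the perfect
    score (match * len(name)) an alignment must reach to count as a hit.
    """
    score, start, end = smith_waterman(name, text, match=match)
    if name and score >= min_ratio * match * len(name):
        return start, end
    return None
```

With these assumed settings, a database name such as `Janssens` can still be located when the note spells it `Jansens`. The threshold trades missed variants against false matches and would need tuning on real notes.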

The setup makes it possible to:

Annotate clinical notes
- names and addresses of patients and health personnel
- general dates, birth dates, ages
- IDs, social security numbers, email addresses, professions
- names of locations and organisations
Detect personally identifiable information
- using BiLSTM/CNN/Transformer-based (BERT) named entity recognition deep learning models
- by looking up variants of names / addresses of the patient linked to the health records using the local alignment technique Smith-Waterman
Perform the pseudonymization
- replacing all detected PII with fake names / addresses / … per patient
- replacing the dates and timestamps with time-shifted dates
Store results and logs of the process
- stores the exact locations in the text of the personally identifiable information
- creates a new text that looks exactly like the original, with the personally identifiable information replaced by pseudonyms, and keeps the mapping between the two for traceability
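
The replacement and logging steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Blackbar's code: the `Span` type, the `pseudonymize` function, the log fields, and the `dd/mm/yyyy` date format are all hypothetical, and the PII spans are assumed to have been detected already and not to overlap.

```python
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int   # offset in the original text
    end: int     # exclusive end offset
    label: str   # e.g. "NAME" or "DATE" (hypothetical label set)


def pseudonymize(text: str, spans: List[Span], fake_values: Dict[str, str],
                 shift_days: int, date_format: str = "%d/%m/%Y"
                 ) -> Tuple[str, List[dict]]:
    """Replace detected PII spans and log their old and new positions.

    `fake_values` maps each original string to its per-patient replacement.
    DATE spans are moved by `shift_days` instead of being looked up.
    """
    pieces: List[str] = []
    log: List[dict] = []
    cursor = 0     # position in the original text
    new_pos = 0    # position in the text being built
    for span in sorted(spans, key=lambda s: s.start):
        original = text[span.start:span.end]
        if span.label == "DATE":
            shifted = datetime.strptime(original, date_format) + timedelta(days=shift_days)
            replacement = shifted.strftime(date_format)
        else:
            replacement = fake_values[original]
        unchanged = text[cursor:span.start]
        pieces.append(unchanged)
        new_pos += len(unchanged)
        # Keep the original/replacement mapping and both offsets for traceability.
        log.append({
            "label": span.label,
            "original": original,
            "replacement": replacement,
            "start": span.start,
            "end": span.end,
            "new_start": new_pos,
            "new_end": new_pos + len(replacement),
        })
        pieces.append(replacement)
        new_pos += len(replacement)
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces), log
```

Using one `shift_days` value and one `fake_values` table per patient keeps intervals between events and repeated mentions consistent across that patient's notes. Because the log contains the original values, it is exactly the kind of additional information that has to be kept separately from the pseudonymized text.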

Volume and Variety of Unstructured Clinical Information

Different organizations and studies provide varying percentages, but the general consensus is that the majority of clinical information is composed of notes, reports, and other unstructured documentation. Sources frequently cite or discuss the estimate that 60–80% of healthcare data is unstructured.⁸

Variety of Sources

Patient admission or triage notes
- providing context around the reason for hospital entry, initial symptoms, and triage priority
Narratives written by healthcare providers (doctors, nurses, specialists)
- documenting daily observations, changes in the patient’s condition, and any interventions or responses to treatment
Detailed, free-form summaries
- from specialist consultations or interdisciplinary discussions about complex cases
Discharge summaries
- including diagnoses, treatments provided, outcomes, and follow-up care instructions
- often tailored for the patient and their general practitioners
Care team communications
- informal messages or notes between care team members discussing patient updates, alerts, or nuanced care instructions
Patient history
- gathered through patient interviews, often including lifestyle, family history, or social factors relevant to their care
Reports from surgeries or procedures
- including findings, techniques used, and any complications encountered
Nursing assessments
- containing narratives from nurses documenting observations of the patient’s physical, emotional, and mental status
- may include unstructured assessments of pain, mood, and daily functioning
Interpretive reports
- from radiologists on scans, X-rays, or MRIs, often with detailed observations or considerations beyond standardized metrics
- from physicians or specialists about lab or diagnostic test results, highlighting abnormal findings or possible implications
Pharmacists’ notes
- describing patient medication plans, interactions, and specific considerations
- often tailored to individual patient needs
Follow-up and referral letters
- sent to other healthcare providers for continued patient care after discharge
- often with tailored observations and recommendations

The Rationale for Using Free-Form Clinical Text

There are several compelling reasons why physicians, researchers, and patients alike benefit from retaining a narrative element in healthcare documentation:

Determinants of Health

Well-established models suggest that clinical care accounts for 20% of health outcomes for patients.⁹ Other determinants of health include:
- Social and economic variables
- Physical environment (which, along with socio-economic variables, already makes up 50%)⁴
- Behavioral factors
- Genetics
Except for genetic data, which tends to be structured, the data that contribute most significantly to health outcomes are uncollected or unstructured, and are infrequently used in healthcare today.
These wider health determinants are often described in caregivers’ narratives and will not be found in structured data. The rise of advanced analytics and large language models presents a new opportunity to build solutions to act on ALL determinants to improve people’s health and the quality of their lives.

Nuanced Context

Medical encounters are highly individual, and a purely structured format cannot fully capture the complexity and uniqueness of each patient’s situation.
Free-form notes allow clinicians to describe subtle symptoms, social factors, and psychosocial elements that may not align with predetermined codes or dropdown menus.

Clinical Reasoning and Decision-Making

The process by which clinicians arrive at diagnoses or treatment decisions often involves a chain of reasoning that is best explained in narrative form.
Free-form notes allow clinicians to document thought processes and rationale—vital for clarity, continuity of care, and medico-legal reasons.

Flexibility and Adaptability

Medicine evolves rapidly, with new conditions, procedures, and findings constantly emerging. Strictly coded data sets can lag behind these developments.
Free-text options offer flexibility to adapt and record new information in real-time, without waiting for codes or structures to catch up.

Enhanced Communication

Healthcare is a team effort that frequently involves multiple specialties.
A descriptive narrative can facilitate better communication among providers who need a holistic understanding of the patient’s story, rather than just a list of codes or structured fields.

Patient-Centered Care

Many clinicians believe that capturing a patient’s story in free-form text is essential to truly “see” the person behind the condition.
This patient-centered approach helps ensure care plans are individualized and culturally sensitive.

Research Value

While structured data is invaluable for large-scale analytics, free-text data provides qualitative insights.
Researchers often turn to natural language processing (NLP) to uncover patterns and context that may be lost in purely coded datasets.

Use Case: Hospital Readmission Risk

When it comes to predicting a patient’s risk of returning to the hospital, unstructured data — specifically physician notes — can provide insights that may surpass those gleaned from standardized, coded information like ICD-10 diagnoses or medication prescriptions. Here’s why:

Nuanced Clinical Context
- Physician notes often capture subtle clues about a patient’s social situation, mental health, or other contributing factors that are not reflected in coded fields.
- For example, a clinician might note concerns about a patient’s mobility at home, the presence (or lack) of a supportive caregiver, or the patient’s mental state—factors that significantly affect the likelihood of readmission.
Emerging or Ambiguous Conditions
- Structured data (e.g., ICD-10 codes) is fundamentally limited by existing coding schemas, which may lag behind real-world clinical practice. Physicians frequently observe new or evolving conditions in their narrative notes—well before codes are assigned or even created.
- These “unofficial” observations could reveal emerging comorbidities or rare complications that heighten readmission risk.
Patient Behavior and Adherence
- Non-adherence to medication or lifestyle recommendations is a major contributor to hospital readmissions. While prescription data shows which medications were issued, it can’t confirm if patients are actually taking them as directed.
- Clinicians often record comments about suspected noncompliance, difficulties with follow-up appointments, or misunderstandings about medication—insights that are typically absent from standardized fields.
Physician Reasoning and Clinical Judgment
- ICD-10 codes or medication lists tell you what diagnoses or treatments exist, but they don’t convey the “why” behind a clinician’s decision. Narrative notes capture thought processes, differential diagnoses, and concerns about possible complications.
- This extra layer of reasoning can signal potential risks. For instance, if a note mentions that a doctor is “concerned about worsening renal function,” it may be a more sensitive indicator of an impending readmission than a one-time abnormal lab result coded in the chart.
Contextual Clues and Social Determinants of Health (SDoH)
- Social factors—like housing instability, financial hardship, or difficulty accessing transportation—are often buried in free-text notes but rarely systematically coded.
- These SDoH factors can strongly correlate with readmission rates, making them extremely valuable in risk prediction models.
Rich Source for Natural Language Processing (NLP)
- Modern NLP and Large Language Model (LLM) techniques can mine physician notes for patterns, sentiments, or risk indicators that are challenging to capture in structured data.
- These AI-powered methods can detect negation (e.g., “patient denies chest pain”), uncertainty (“possible pneumonia”), or temporal details (“symptoms worsening over the last two weeks”), further refining risk assessments.
Holistic Patient Perspective
- Ultimately, unstructured notes often read like an evolving narrative of the patient’s journey. They provide a holistic view that goes beyond single-point codes or clinical measurements, shedding light on patient behaviors, caregiver involvement, and psychosocial challenges that directly affect readmission risk.

In conclusion, while structured data like diagnoses and medications undoubtedly play an important role in clinical analytics, the depth and richness of unstructured physician notes can offer an even stronger signal when it comes to anticipating hospital readmissions. By harnessing advanced NLP and LLM tools to extract and interpret these valuable insights, healthcare organizations can improve risk stratification, refine care plans, and ultimately reduce unnecessary readmissions—all by tapping into the power of the clinician’s narrative.

Other Clinical Notes Use Cases

Beyond predicting readmissions, there are several other use cases where the narrative detail in clinical notes offers valuable insights that structured data alone often misses. Here are a few examples:

Adverse Drug Event Detection
- While structured data can show which medications are prescribed, unstructured notes often contain critical details about side effects or patient-reported symptoms (e.g., headaches, rash, fatigue).
- These free-text mentions can serve as early warning signs of adverse reactions or drug interactions that might not yet be reflected in coded diagnosis fields.
Patient Safety and Quality Improvement
- Near-misses or safety concerns are often informally documented in progress notes (e.g., “Patient almost received the wrong dose” or “Nurse caught a potential medication error before administration”).
- Structured data alone typically doesn’t capture these near-misses, but free-text documentation can reveal patterns requiring process improvements.
Social Determinants of Health (SDoH)
- Clinicians frequently note important social or lifestyle factors (e.g., housing instability, food insecurity, caregiver availability).
- These contextual details have major implications for a patient’s treatment adherence and follow-up care but are usually absent from typical EHR fields.
Complex Diagnoses and Rare Conditions
- Rare or newly emerging conditions may not have a well-established ICD-10 code or might be entered incorrectly due to limited code options.
- Physicians often document these nuances in free text, flagging conditions for future refinement in structured coding systems.
Patient-Centered Care Planning
- Unstructured notes can capture detailed patient preferences, emotional states, or psychosocial barriers (e.g., “Patient is anxious about the upcoming procedure,” “Wants to consult family before making a decision”).
- This information helps tailor more personalized care plans and fosters better shared decision-making.
Longitudinal Care for Chronic Conditions
- Chronic conditions like diabetes, hypertension, or COPD often require ongoing, iterative documentation of symptom changes, lifestyle adjustments, and mental health considerations.
- Narrative notes provide a continuous thread of how symptoms evolve over time and how patients respond to different treatments.
Discharge Planning and Follow-Up
- A discharge summary might include specifics about patient education, instructions given, and potential red flags to watch out for (e.g., “Needs wound check within 48 hours,” “Advised to monitor blood sugar closely”).
- These free-text details can be critical for care coordination, ensuring a smoother transition from hospital to home or another facility.
Research into Healthcare Delivery
- Studies on patient-provider communication, clinical decision-making, and care pathways often rely on the richness of narrative text.
- Unstructured notes can reveal bottlenecks, best practices, or unexplored phenomena that are invisible in purely coded datasets.

In all of these scenarios, unstructured clinical notes offer a more holistic understanding of the patient and the healthcare process. Combining these qualitative insights with quantitative, structured data can lead to better care quality, improved patient safety, and more meaningful research outcomes.

Anonymization and Pseudonymization

Small Cell Risk Assessment (SCRA)

https://www.google.com/search?q=scra+small+cell+risk+assessment (possibly not needed when producing small samples for the Data Catalog, but relevant for data release)

https://www.datenschutzbehorde.be/publications/aanbeveling-nr.-11-03-2011.pdf (KCE)

EU Guidelines on Pseudonymization

https://www.edpb.europa.eu/news/news/2025/edpb-adopts-pseudonymisation-guidelines-and-paves-way-improve-cooperation_en

The guidelines provide two important legal clarifications:

Pseudonymised data, which could be attributed to an individual by the use of additional information, remains information related to an identifiable natural person and is therefore still personal data. Indeed, if the data can be linked back to an individual by the data controller or someone else, it remains personal data.

Pseudonymisation can reduce risks and make it easier to use legitimate interests as a legal basis (Art. 6(1)(f) GDPR), as long as all other GDPR requirements are met. Likewise, pseudonymisation can aid in securing compatibility with the original purpose (Art. 6(4) GDPR). The guidelines also explain how pseudonymisation can help organisations meet their obligations relating to the implementation of data protection principles (Art. 5 GDPR), data protection by design and default (Art. 25 GDPR) and security (Art. 32 GDPR).

eHealth

https://www.ehealth.fgov.be/ehealthplatform/nl/service-pseudonimisering-anonimisering

Links

1. https://www.linkedin.com/posts/activity-7280130692548591616-5S_c
2. https://www.linkedin.com/posts/dvanroijen_edpb-guidelines-012025-on-pseudonymisation-activity-7285985711848108032-fM5m?utm_source=share&utm_medium=member_desktop
3. Ford, Elizabeth, John A Carroll, Helen E Smith, Donia Scott, and Jackie A Cassell. 2016. “Extracting Information from the Text of Electronic Medical Records to Improve Case Detection: A Systematic Review.” Journal of the American Medical Informatics Association 23 (5): 1007–15. https://doi.org/10.1093/jamia/ocv180.
4. Pak HS. 2018. “Healthcare Tech Outlook | Unstructured Data in Healthcare.” Healthcare Tech Outlook. Healthcare Tech. November 30, 2018. https://artificial-intelligence.healthcaretechoutlook.com/cxoinsights/unstructured-data-in-healthcare-nid-506.html.
5. Martin, Keith. 2024. “Are LLMs Coming for Coding? Yes, and Medical Coders Should Prepare.” Journal of AHIMA. June 28, 2024. https://journal.ahima.org/page/are-llms-coming-for-coding-yes-and-medical-coders-should-prepare.
6. Briganti, Giovanni. 2023. “A Clinician’s Guide to Large Language Models.” Future Medicine AI, August. https://doi.org/10.2217/fmai-2023-0003.
7. Park, Ye-Jean, Abhinav Pillai, Jiawen Deng, Eddie Guo, Mehul Gupta, Mike Paget, and Christopher Naugler. 2024. “Assessing the Research Landscape and Clinical Utility of Large Language Models: A Scoping Review.” BMC Medical Informatics and Decision Making 24 (1). https://doi.org/10.1186/s12911-024-02459-6.
8. Why the variability in estimates?
- Different Clinical Workflows: some specialty areas rely more heavily on free-text notes (e.g., psychiatry, complex chronic conditions) than others (e.g., highly protocol-driven services).
- Variations in EHR Design: institutions with more advanced, structured documentation templates may have a lower percentage, whereas less standardized systems may push the figure higher.
- Changing Technology: as newer EHR functionalities encourage structured input, the ratio may fluctuate, though free-text remains prevalent due to its flexibility and expressiveness.
9. Moriarty, Clare. 2023. “Acting on the Wider Determinants of Health Will Be Key to Reduced Demand.” England.nhs.uk. July 24, 2023. https://www.england.nhs.uk/blog/acting-on-the-wider-determinants-of-health-will-be-key-to-reduced-demand/.