
CSI


A pseudonymized corpus of natural-language clinical documents is a powerful catalyst for text-processing innovations such as Large Language Models (LLMs). It creates new pathways for smoother collaboration with stakeholders, bringing cutting-edge research and innovation to patient care and healthcare systems.

With pseudonymization as a safeguard for protecting patient privacy, how can hospitals effectively take first steps towards creating more value from clinical narratives?

A prime example is UZ Brussel’s NLP-driven web application that empowers clinicians to identify patient cohorts by searching for natural language concepts. Not only does this solution complement, or even replace, the traditional structured data queries performed by IT, it also makes it possible to go beyond ad hoc searches by grouping textual concepts under medical codes. This enables automated, reusable searches that enhance efficiency and precision in patient care and research.

With the abundance of healthcare data (and the high costs associated with gathering it), it must surely be possible to achieve a higher quality of care. We can develop predictive models that tell us whether people are at risk before they develop a disease or condition. We can gain more knowledge about managing chronic conditions and slowing their progression. To provide the right care for patients, caregivers need to stay up to date with new research in their medical field. Not all of these challenges can be solved by putting only structured data to work. For example, for the identification of drug abuse in the emergency department, structured data would only partially solve the problem: queries on coded fields turn up many false positives and miss the early indications that appear only in the narrative.

Converting unstructured healthcare data into a structured format is a challenge. Its complexity and heterogeneity make it difficult to fit neatly into “tabular data”: traditional databases and data tables such as Excel or CSV files. Inconsistent and cryptic medical terminologies further complicate the conversion, and clinical jargon, acronyms, misspellings, and abbreviations add to the difficulty.

Dirk Van Hyfte:

Even with templates used in the EHR, most physicians continue to dictate notes for their reports. In typical History and Physical Examination (H&P) notes, the History of Present Illness (HPI) is a narrative. For a discharge summary, the hospital course and follow-up plan sections are also mostly narrative. For diagnostic reports such as an echocardiogram, important values such as ejection fraction are usually buried in the free text.

Reviewing these notes individually may be feasible if there are no time constraints, but usually the provider simply doesn’t have the time. And clinical decision support rules cannot be triggered either, because the coded data is not available.

All of this negatively affects the usefulness of the information. Clearly, we need a breakthrough technology here.

Natural Language Processing (NLP) plays a critical role in complementing traditional queries that rely solely on structured and coded data.

  • Capturing Nuanced Information
    • Traditional queries on structured data rely on specific codes and fields, which may not fully reflect the intricacies of a patient’s condition.
    • Free-text notes often contain details such as social history, lifestyle factors, or subtle symptoms not captured in discrete fields.
    • NLP tools can extract this rich information, providing a more holistic view of the patient.
  • Uncovering Hidden Insights
    • Structured data can tell you how many patients have a particular diagnosis or which medications they are taking, but it may not reveal deeper context—such as how those medications are tolerated, or the severity of side effects as described by patients.
    • NLP can analyze free-text comments, physician notes, and other unstructured sources to uncover trends and patterns that might otherwise remain hidden.
  • Enhancing Data Completeness
    • A significant portion of clinical documentation remains in free-text form, whether it’s progress notes, discharge summaries, or patient messages.
    • By applying NLP, organizations can integrate and analyze these additional data streams, enhancing the completeness and usefulness of their data repositories.
  • Supporting Advanced Analytics
    • When combined with structured data, insights extracted through NLP enable more advanced analytics—such as predictive modeling or real-time alerts.
    • For instance, NLP can flag language in clinical notes that indicates a patient is at higher risk of complications, thereby triggering targeted interventions.
  • Facilitating Research and Quality Improvement
    • Researchers benefit from accessing a richer dataset that goes beyond standard coded fields.
    • Free-text data often includes detailed clinical observations, rationale for treatment decisions, and patient-reported outcomes.
    • By unlocking this information using NLP, healthcare organizations can drive quality improvement initiatives and generate evidence-based insights more effectively.

In sum, NLP is indispensable for translating the wealth of unstructured text data in healthcare into actionable insights, thereby augmenting and enhancing the value of traditional, code-based queries.
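As a toy illustration of how a free-text concept search can complement a code-based query, consider the sketch below. The patient records, concept list, and ICD-10 codes are all made up for illustration; a real system would query the EHR and run a proper NLP pipeline over the notes.

```python
# Sketch: combining a structured-code query with a free-text concept search
# to identify a patient cohort. All data here is illustrative.

DRUG_ABUSE_CONCEPTS = {"opioid misuse", "substance abuse", "withdrawal symptoms"}

patients = [
    {"id": 1, "icd10": ["F11.10"], "note": "History of opioid misuse reported."},
    {"id": 2, "icd10": ["J18.9"], "note": "Community-acquired pneumonia, improving."},
    {"id": 3, "icd10": [], "note": "Presents with withdrawal symptoms after stopping medication."},
]

def cohort_by_code(patients, code):
    """Traditional structured query: exact ICD-10 code match."""
    return {p["id"] for p in patients if code in p["icd10"]}

def cohort_by_concept(patients, concepts):
    """Free-text complement: match concepts mentioned in the clinical note."""
    return {p["id"] for p in patients
            if any(c in p["note"].lower() for c in concepts)}

coded = cohort_by_code(patients, "F11.10")                  # {1}
textual = cohort_by_concept(patients, DRUG_ABUSE_CONCEPTS)  # {1, 3}
print(sorted(coded | textual))  # patient 3 is found only via the narrative
```

Patient 3 has no relevant code yet, so a structured query alone would miss them; the concept search surfaces the early indication from the note.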

Structuring and coding of patient information


Free-form clinical notes contain a wealth of rich detail and nuance that can help improve the overall quality and completeness of structured healthcare data. With the right tools and processes, these unstructured narratives can be converted into coded or semi-structured information, ultimately enhancing data analytics, interoperability, and patient care. Here are some ways free-form clinical notes can support the structuring and coding of information:

  • Named Entity Recognition (NER)
    • Identifying Key Terms: NLP algorithms can detect relevant clinical concepts such as diseases, medications, procedures, and laboratory values mentioned in free-text.
    • Standardization: Once identified, these terms can be mapped to standardized vocabularies (e.g., SNOMED CT, ICD-10, LOINC) or internal hospital codes, improving data interoperability.
  • Sentence-Level and Document-Level Classification
    • Categorizing Text Segments: Machine learning models can categorize segments of text (e.g., diagnosis, medication, family history) based on linguistic patterns and context.
    • Enhanced Structured Fields: These labeled segments can be stored in structured fields, making it easier to run queries and generate reports on specific data points.
  • Contextual Extraction of Clinical Findings
    • Qualifiers and Negations: NLP systems can detect not only the presence of symptoms or conditions but also qualifiers such as severity or negation (e.g., “no evidence of disease”), ensuring more accurate coding.
    • Time Stamps and Timelines: Identifying when an event or condition occurred (e.g., “two weeks ago”) can help construct patient timelines, which are particularly useful for chronic disease management and longitudinal studies.
  • Auto-Population of EHR Fields
    • Reducing Clinician Burden: Automated or semi-automated extraction of structured elements from free-text reduces manual data entry and the likelihood of transcription errors.
    • Real-Time Feedback: As clinicians document patient encounters in real-time, NLP-enabled systems can suggest structured codes or categories, encouraging more consistent and complete documentation.
  • Pattern Recognition for Clinical Decision Support
    • Predictive Alerts: By extracting relevant clinical factors from notes (e.g., risk factors, medication changes), systems can generate alerts or reminders.
    • Quality Improvement: Aggregated data from multiple free-text records can help detect trends or gaps in care, guiding interventions.
  • Discovery and Evolution of Coding Schemas
    • Identifying New Codes: Free-text often captures emerging diagnoses or procedures not yet fully recognized or coded. NLP can surface these “new” concepts for potential inclusion in coding systems.
    • Keeping Vocabularies Current: Ongoing analysis of notes allows institutions to maintain and update internal vocabularies and align them with evolving medical terminology.
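Several of the steps above (entity recognition, negation detection, mapping to standard codes) can be sketched in miniature. The dictionary-based matcher and 20-character negation window below are deliberate simplifications; a production pipeline would use a trained clinical NER model and a full terminology service, but the overall shape is similar.

```python
import re

# Sketch of dictionary-based entity recognition with simple negation
# handling and mapping to (illustrative) ICD-10 codes.

CONCEPT_TO_CODE = {
    "pneumonia": "J18.9",
    "type 2 diabetes": "E11.9",
    "hypertension": "I10",
}
NEGATION_CUES = ("no evidence of", "denies", "no ")

def extract_coded_findings(note):
    """Return (concept, code, negated) triples found in a free-text note."""
    findings = []
    text = note.lower()
    for concept, code in CONCEPT_TO_CODE.items():
        for match in re.finditer(re.escape(concept), text):
            # Look in a short window before the mention for a negation cue.
            window = text[max(0, match.start() - 20):match.start()]
            negated = any(cue in window for cue in NEGATION_CUES)
            findings.append((concept, code, negated))
    return findings

note = "Known type 2 diabetes and hypertension. No evidence of pneumonia."
for concept, code, negated in extract_coded_findings(note):
    print(f"{concept} -> {code} (negated={negated})")
```

Note how “no evidence of disease” style qualifiers flip the `negated` flag, so the pneumonia mention is not miscoded as a present condition.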

By intelligently applying NLP and other AI-driven techniques to free-form clinical notes, healthcare organizations can integrate detailed, context-rich information into their structured data environments. This ultimately leads to better clinical decision-making, more efficient workflows, and improved patient outcomes.

Large Language Models (LLMs)—such as GPT variants and other transformer-based architectures—are a subset of NLP technologies that have brought about significant advancements in extracting and organizing information from free-form clinical notes. Below are several ways LLMs can aid in structuring and coding free-text healthcare data, expanding on the points already made:

  • Deep Context Understanding

    • Contextual Accuracy: Unlike rule-based or traditional NLP approaches, LLMs excel at understanding context, including clinical nuance, abbreviations, and domain-specific language. They can interpret a phrase in the broader context of the paragraph or the entire note, leading to more accurate concept identification.
    • Polysemy Resolution: Medical terms can have multiple meanings depending on context (e.g., “stroke” can indicate a cerebrovascular event or a physical gesture). LLMs leverage vast amounts of training data to disambiguate these terms accurately.
  • Advanced Summarization and Abstraction

    • Summarizing Complex Narratives: LLMs can generate concise, structured summaries from lengthy clinical free-text notes. This helps surface key diagnoses, treatments, and outcomes without losing critical details.
    • Extracting Key Variables: By summarizing or paraphrasing the narrative, LLMs can automatically populate predefined templates or forms with essential structured fields (e.g., diagnosis codes, medication lists).
  • Enhanced Named Entity Recognition (NER) and Entity Linking

    • Sophisticated Pattern Recognition: LLMs are highly effective at identifying complex entities (e.g., multi-word condition names, drug dosage instructions, or composite risk factors).
    • Mapping to Standard Ontologies: After recognizing entities, LLMs can link them to medical terminologies like SNOMED CT, ICD-10, or RxNorm. This “mapping” step significantly improves interoperability and data quality.
  • Relationship Extraction and Clinical Reasoning

    • Identifying Interactions: Beyond extracting individual entities, LLMs can recognize relationships (e.g., which medication is used to treat which condition, or how certain social factors relate to a patient’s compliance).
    • Inference and Reasoning: Some cutting-edge LLMs can infer causal or temporal relationships (e.g., “Symptoms started after the medication was introduced”), enabling more accurate and meaningful coding.
  • Adaptive and Real-Time Assistance

    • Dynamic Prompting: LLMs can be integrated into clinical documentation workflows, prompting clinicians in real time with suggestions or clarifications (e.g., “Did you mean Type 2 Diabetes? Would you like to add ICD-10 code E11.9?”).
    • Continuous Learning: Because LLMs can be fine-tuned and updated more readily than static rule-based systems, they adapt faster to new medical findings and coding standards.
  • De-identification and Data Privacy

    • Automatic Removal of PHI: One critical application of LLMs is detecting and removing protected health information (PHI) in free-text notes. They can identify personal identifiers (e.g., names, addresses, phone numbers) more accurately by leveraging contextual clues.
    • Secure Sharing of Structured Data: Once de-identified, relevant clinical details can be structured and coded for broader sharing (e.g., research datasets), all while preserving patient confidentiality.
  • Improved User Experience and Workflow Efficiency

    • Reducing Clinician Burden: LLM-based systems can transform lengthy narratives into partially structured templates, drastically cutting the time clinicians spend on manual data entry or coding.
    • Fewer Errors and Omissions: Automated checks and suggestions from an LLM reduce the likelihood of omission or inaccurate coding, increasing the reliability of EHR data.
  • Support for Research and Population Health

    • Scalable Data Insights: By converting large volumes of unstructured notes into analyzable, structured outputs, LLMs enable robust population-level studies, predictive analytics, and AI-driven interventions.
    • Discovery of Emerging Trends: LLMs can flag new conditions or unanticipated side effects mentioned in clinical text, contributing to updates in coding systems and clinical best practices.
  • Continuous Refinement and Personalization

    • Domain-Specific Fine-Tuning: Healthcare organizations can fine-tune general LLMs on local data (e.g., institution-specific documentation styles, local languages, or specialty-area jargon) for better performance.
    • Feedback Loops: Ongoing user feedback (e.g., clinicians accepting or rejecting a system’s suggestions) can further refine the model’s accuracy and relevance over time.

Free-form clinical notes remain a critical part of healthcare documentation because they capture the richness and complexity of patient care. LLMs — and NLP tools more broadly — serve as powerful bridges between the unstructured narrative and structured datasets. By automatically interpreting, summarizing, coding, and de-identifying information, LLMs help unlock the full potential of clinical notes for analytics, research, decision support, and improved patient outcomes. Rather than supplanting free-text documentation, LLMs augment it, making it more actionable, interoperable, and valuable across the healthcare ecosystem.
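As a minimal sketch of the de-identification idea, the snippet below replaces dates and phone numbers with placeholder tokens. The regexes are illustrative only; validated de-identification covers far more PHI categories (names, addresses, identifiers) and typically relies on a trained model rather than patterns.

```python
import re

# Minimal pseudonymization sketch: replace phone numbers and dates with
# stable placeholder tokens. Patterns are simplistic and for illustration.

PATTERNS = {
    "PHONE": re.compile(r"\b\d{2,3}(?: \d{2,3}){2,}\b"),   # e.g. 02 477 41 11
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),    # e.g. 12/03/2024
}

def pseudonymize(text):
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

note = "Seen on 12/03/2024, callback number 02 477 41 11."
print(pseudonymize(note))
```

The placeholders keep the sentence readable for downstream NLP while removing the direct identifiers.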

├── 1. Named Entity Recognition (NER) & Entity Linking
│   ├── Identify Medical Concepts (diseases, medications, procedures)
│   ├── Map Concepts to Standard Ontologies (e.g., SNOMED CT, ICD-10, RxNorm)
│   └── Handle Synonyms & Ambiguities (e.g., “stroke” vs “stroke the cat”)
├── 2. Summarization & Abstraction
│   ├── Generate Concise Summaries of Long Clinical Notes
│   ├── Extract Key Variables (e.g., diagnoses, vitals, social factors)
│   └── Populate Structured Fields (auto-fill EHR templates)
├── 3. Relationship Extraction & Clinical Reasoning
│   ├── Identify Medication-to-Condition Links
│   ├── Detect Risk Factors & Causal Connections (e.g., medication side effects)
│   └── Establish Timeline & Sequence of Events (e.g., symptom onset)
├── 4. Real-Time Assistance & Workflow Integration
│   ├── Auto-Suggestions for Coded Fields While Typing
│   ├── Intelligent Prompts & Alerts (e.g., missing documentation)
│   └── Continuous Learning from Clinician Feedback
├── 5. De-identification & Data Privacy
│   ├── Detect PHI (names, addresses, phone numbers)
│   ├── Replace or Remove Identifiers (anonymization/pseudonymization)
│   └── Facilitate Secure Data Sharing (research, collaboration)
├── 6. Quality Improvement & Decision Support
│   ├── Populate Dashboards with Structured Insights
│   ├── Trigger Predictive Alerts (e.g., risk of complications)
│   └── Monitor Compliance with Protocols & Guidelines
├── 7. Advanced Analytics & Research
│   ├── Support Large-Scale Epidemiological Studies
│   ├── Enable Retrospective & Prospective Data Mining
│   └── Discover Emerging Conditions or Treatment Patterns
└── 8. Domain-Specific Fine-Tuning & Customization
    ├── Adapt LLMs to Specific Specialties (oncology, cardiology, etc.)
    ├── Incorporate Local Documentation Style & Terminology
    └── Improve Accuracy with Ongoing Feedback Loops

Barriers to developing LLM-based applications for clinical practice

  • Even though we can expect the walls between open and closed AI to crumble in the future, there are still important considerations to make:
    • Security and Data Privacy
      • Open: Running an open model locally offers more control over data handling. Sensitive data never leaves your environment if you can deploy the model in-house.
      • Closed: Typically requires sending data to a remote server. Data handling policies and security are managed by the provider, which can raise privacy or compliance concerns.
    • Cost and Scalability
      • Open: May be “free” to obtain, but compute and storage costs shift to the user if they want to deploy at scale.
      • Closed: Often based on a usage or subscription fee. Users save on infrastructure costs but are beholden to the provider’s pricing and capacity.
  • If we can build a corpus of pseudonymized and verified clinical documents, we have a broader range of options for development and deployment.

UZ Brussel has been at the forefront of leveraging Natural Language Processing (NLP) to analyze free-form clinical documents.

Through an intuitive self-service web application called CSI, hospital users can efficiently access and explore the wealth of information contained within these documents. This tool has become an invaluable complement to traditional queries on structured data, significantly enhancing the depth and breadth of analytical capabilities. By exploring textual concepts, users can easily identify patient cohorts for clinical studies or trials. By selecting, grouping, and optionally coding concepts, new patients can be automatically discovered and reported. Clinicians or data managers can easily screen for false positives in the web application. Integration with other use cases is possible through the extensive REST API provided by CSI.
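To give a feel for what such an integration might look like, the sketch below constructs a concept-search request. The base URL, endpoint path, and parameter names are hypothetical placeholders, not the actual CSI API; consult the CSI API documentation for the real interface.

```python
from urllib.parse import urlencode

# Hypothetical sketch of querying a CSI-style REST API for a textual
# concept. The host, path, and parameters below are assumptions made
# for illustration only.

CSI_BASE_URL = "https://csi.example-hospital.be/api"  # placeholder host

def build_concept_search_url(concept, limit=50):
    """Construct a search URL for a free-text concept (hypothetical endpoint)."""
    query = urlencode({"q": concept, "limit": limit})
    return f"{CSI_BASE_URL}/concepts/search?{query}"

print(build_concept_search_url("opioid misuse"))
```

An actual call could then be made with any HTTP client against the real endpoint once authenticated.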

Manual and Automated Labeling

In SPECTRE-HD we use machine learning approaches (NLP, NER) to expedite large-scale labeling and then validate with smaller manual review sets. Clinicians and domain experts are engaged initially to create labels for, or annotate text in, clinical narratives, progress notes, and discharge summaries. With CSI there is no need to apply labels directly to individual documents. Instead, clinicians and domain experts identify, group, and label key concepts (diagnoses, symptoms, social determinants of health) that have already been derived from these documents with NLP. These key concepts are indicative of a condition or disease (markers). After grouping, the key concepts can be tied to coded ontologies.
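The marker workflow described above can be sketched as a small data structure: a named group of concepts tied to a code, which is then matched against documents to discover new patients. The concept names and the ICD-10 code are illustrative.

```python
# Sketch of a CSI-style "marker": a named group of NLP-derived key
# concepts tied to a coded ontology entry. All values are illustrative.

marker = {
    "name": "possible opioid use disorder",
    "concepts": ["opioid misuse", "withdrawal symptoms", "naloxone administered"],
    "codes": {"ICD-10": "F11.20"},
}

def documents_matching_marker(documents, marker):
    """Report documents mentioning any concept in the marker group."""
    return [doc_id for doc_id, text in documents.items()
            if any(c in text.lower() for c in marker["concepts"])]

documents = {
    "doc-1": "Patient admitted after naloxone administered in ambulance.",
    "doc-2": "Routine follow-up for hypertension.",
}
print(documents_matching_marker(documents, marker))
```

Because the marker is reusable, newly arriving documents can be screened against it automatically rather than through one-off ad hoc searches.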

Contextual Tagging

Beyond simple categorical labels, capture contextual details. For instance, label a radiology report of an X-ray not just with “pneumonia” but also with severity, location, or associated risk factors.
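One way to represent such a contextual tag is a small record type like the one below. The field names (`severity`, `location`, `risk_factors`) are illustrative, not a fixed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sketch of a contextual tag: a categorical label enriched with
# severity, location, and risk factors. Field names are illustrative.

@dataclass
class ContextualTag:
    label: str                       # the categorical label, e.g. "pneumonia"
    severity: Optional[str] = None   # e.g. "moderate"
    location: Optional[str] = None   # e.g. "right lower lobe"
    risk_factors: List[str] = field(default_factory=list)

tag = ContextualTag(
    label="pneumonia",
    severity="moderate",
    location="right lower lobe",
    risk_factors=["COPD", "smoking"],
)
print(tag)
```

A flat label would collapse all of this into the single string “pneumonia”; the structured tag keeps the clinically relevant context queryable.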

Named Entity Recognition (NER), topic modeling, sentiment analysis, and summarization can then be applied to extract structured information from clinical narratives, progress notes, and discharge summaries.