EHR Anonymization using AI

Blog: AI for Electronic Health Records
Target audience: Beginner
Estimated reading time: 8 minutes


This post reviews the various techniques to identify and remove sensitive patient information from medical documents and electronic health records.

Artificial intelligence for de-identification of medical documents


Note: This blog is a collection of non-technical topics related to the application of artificial intelligence, machine learning, and deep learning to health care. Please refer to Practical data science and engineering for data- and software-engineering-specific issues such as design and coding.


Background

The Health Insurance Portability and Accountability Act (HIPAA), enacted in 1996, is a national statute that safeguards confidential patient health data from being released without the patient's permission or awareness [ref 1]. The Privacy Rule within HIPAA sets forth regulations concerning the handling and sharing of an individual's health information. In practice, HIPAA encompasses 18 distinct regulations governing the management of Protected Health Information (PHI).


De-identification

The patient information deemed sensitive includes name, address, phone number, social security number, health insurance data, primary care provider, and place and date of service, as illustrated in the following snippet of a clinical note.
 
DOB: XX/XX/XXXX
Patient: XXXX XXXX
History Number: XXXXXXX
Exam: Guided right knee arthrogram. Procedure performed by XXXX XXXX MD with assistance by resident XXX XXX, M.D.
INDICATION: XX years old Female with  ACUTE RT KNEE SURG XX.XX.XXXX 
DAP: 22.1 uGy*m2
The patient was interviewed prior to exam ...
...

There are several methods to deal with patient data:

  • Removal: The words related to the patient's personal information are simply deleted
  • Substitution by indexing: Each word associated with private patient data is replaced by an index or key so the personal information can be retrieved later if needed. Hashing is the most common way to implement this approach
  • Perturbation: The patient data is obfuscated by altering or encrypting the original words so the overall structure of the note is preserved

This study is limited to the removal solution.
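To make the substitution-by-indexing idea concrete, here is a minimal sketch. It assumes the PHI terms have already been identified by some upstream step; the function name and the sample note are illustrative, not part of any real system.

```python
import hashlib

def substitute_by_indexing(text, phi_terms):
    """Replace each PHI term with a short hash key and keep a lookup
    table so the original values can be restored later if needed."""
    lookup = {}
    for term in phi_terms:
        # A truncated SHA-256 digest serves as the replacement key
        key = hashlib.sha256(term.encode()).hexdigest()[:8]
        lookup[key] = term
        text = text.replace(term, f"[{key}]")
    return text, lookup

note = "Patient: John Smith  DOB: 01.02.1960"
redacted, table = substitute_by_indexing(note, ["John Smith", "01.02.1960"])
```

The lookup table must itself be stored securely, since it reverses the anonymization; with removal (the approach studied here) no such table exists.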

 

Techniques

There are many techniques available to cleanse medical documents of private personal information. I restricted my analysis to three very different methods:
  • Regular expressions, using simple programming techniques
  • Term frequency, from information retrieval
  • Large language models, from machine (deep) learning

Regular expression

The most straightforward strategy is to pinpoint the format of each confidential term, for instance, a social security number like 123-45-6789, or a date of birth formatted as 01.01.2020. While this purely coding-based solution is easy to put into action, it's susceptible to mistakes.
Identifiers like health insurance numbers or addresses can differ significantly in their presentation across medical records.
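A minimal sketch of this approach follows. The patterns below are illustrative only; as noted above, real records vary too widely in format for a fixed set of expressions to be reliable.

```python
import re

# Illustrative patterns for a few well-formatted identifiers
PATTERNS = {
    "SSN":  re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # e.g. 123-45-6789
    "DATE": re.compile(r"\b\d{2}[./]\d{2}[./]\d{4}\b"),  # e.g. 01.01.2020
}

def redact(text):
    """Replace every pattern match with its tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

print(redact("SSN 123-45-6789, DOB 01.01.2020"))
# → SSN [SSN], DOB [DATE]
```

A free-text address or an insurance number with no fixed layout would slip through these patterns entirely, which is exactly the weakness described above.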
 

Term frequency

Term frequency is a metric used in information retrieval (IR) and machine learning to assess the significance of textual elements (words, phrases, or lemmas) within a single document that belongs to a broader corpus. It is essentially the tally of how often a particular word occurs in a document.

Given that identifiers pertaining to a patient are unique to an individual clinical note or Electronic Health Record (EHR) within a vast collection of documents, their occurrence rate is typically quite low.

Employing this method marks a substantial advancement over the rudimentary use of regular expressions. Nevertheless, its reliability may diminish when the same terms related to a patient, a healthcare provider, or an insurance company appear across multiple documents.

Consider this instance: a term frequency ranking derived from a set of 12,600 radiology reports.

with,60712
exam,50181
breast,37359
date,36043
doctor,36000
are,29177
findings,28063
screening,28003
imaging,27309
cancer,27155
....
Francis,3
Jessica,2
97812,1
zahir,1
....
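The low-frequency filter behind this ranking can be sketched as follows. The function name, threshold, and toy corpus are illustrative assumptions, not the actual pipeline used on the 12,600 reports.

```python
from collections import Counter

def rare_term_candidates(documents, max_count=3):
    """Flag terms whose corpus-wide frequency is at or below a
    threshold; in a large corpus, patient-specific identifiers
    such as names and record numbers tend to be rare."""
    counts = Counter(
        word.lower() for doc in documents for word in doc.split()
    )
    return {term for term, n in counts.items() if n <= max_count}

docs = [
    "screening exam findings normal",
    "screening exam findings abnormal",
    "screening exam for patient zahir id 97812",
]
print(rare_term_candidates(docs, max_count=1))
```

Note that rare but legitimate medical terms are flagged along with identifiers, which is the false-positive risk discussed in the evaluation section.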

Natural language processing with transformers

Named Entity Recognition (NER) involves detecting and classifying crucial elements in text, such as individual words or clusters of words, into predefined categories like Person, Location, Date. In the context we're discussing, each element of Protected Health Information (PHI)—such as names, addresses, dates, etc.—is assigned a specific tag. These tags can appear in disparate sections of medical documents and may vary in format and length.

The central concept is to deploy NER techniques to extract the different components of PHI from each document.
 
The following PHI tags are used in our evaluation: 
  • Date of service
  • Place of service
  • Medical record id or EHR id
  • Patient name
  • Patient id
  • Patient address
  • Patient city
  • Patient state
  • Patient ZIP 
  • Patient health plan
  • Patient health insurance group
  • Patient health insurance member
  • Age
  • Date of birth
  • Provider id
  • Provider name
  • Provider specialty
 
A transformer is an encoder-decoder architecture that uses self-attention mechanisms to pass an entire sequence of terms to the decoder at once. Bidirectional Encoder Representations from Transformers (BERT) is a commonly used transformer encoder that relies on predicting the order of sentences or segments in a document and predicting masked tokens to generate an embedding, or representation, of that document [ref 2].
Consider the age PHI tag: the sentence "Robert is 63 years with mild diabetes ..." is encoded into a vector of floating-point values, then decoded to "Robert is AA years with mild diabetes ..." [ref 3].
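Once a NER model has produced character spans and PHI tags, the redaction step itself is simple. The sketch below assumes the (start, end, tag) spans come from a fine-tuned BERT NER model, which is not shown; the span values here are hypothetical model output, hand-written for the example.

```python
def redact_with_tags(text, entities):
    """Replace each detected PHI span with its tag. `entities` is a
    list of (start, end, tag) tuples assumed to come from a NER model."""
    # Process spans right-to-left so earlier offsets stay valid
    for start, end, tag in sorted(entities, reverse=True):
        text = text[:start] + f"[{tag}]" + text[end:]
    return text

sentence = "Robert is 63 years with mild diabetes"
spans = [(0, 6, "PATIENT_NAME"), (10, 12, "AGE")]  # hypothetical NER output
print(redact_with_tags(sentence, spans))
# → [PATIENT_NAME] is [AGE] years with mild diabetes
```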

The BERT encoder can be effectively used in conjunction with the term frequency filter. As mentioned previously, PHI data has a lower relative frequency of occurrence than medical terms within a large corpus of clinical notes.

PHI data and medical terms can be pre-trained concurrently by grouping all PHI data into one sentence and breaking the other components of the document into separate sentences.


Evaluation

We need to define metrics to compare the various techniques for de-identifying patient information in a given medical document. The outcomes are counted as follows:

  • False positive (fp): The algorithm removed data it should not have (low risk)
  • False negative (fn): The algorithm failed to remove some of the personal data
  • True positive (tp): The algorithm succeeded in removing the appropriate patient-specific data

From these counts we derive the quality metrics commonly used in data science: precision = tp / (tp + fp), recall = tp / (tp + fn), and the F1 score, the harmonic mean of precision and recall.

The cumbersome task of annotating/labeling the medical documents for evaluation limits the analysis to 1,640 clinical notes.
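These metrics can be computed directly from the three counts. The counts below are made-up numbers for illustration, not results from the 1,640 annotated notes.

```python
def deid_metrics(tp, fp, fn):
    """Precision, recall, and F1 from de-identification counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = deid_metrics(tp=90, fp=5, fn=10)  # illustrative counts
```

For de-identification, recall matters most: a false negative leaks patient data, while a false positive merely deletes a harmless term.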

 Notes
  • The TF values have been normalized to the range [0, 1]
  • The BERT encoder used to extract PHI tags generated an embedding vector of size 768, with a maximum of 512 tokens per sentence
 

References

[1] Health Insurance Portability and Accountability Act of 1996
