EHR Anonymization using AI

Blog: AI for Electronic Health Records
Target audience: Beginner
Estimated reading time: 8 minutes


This post reviews the various techniques to identify and remove sensitive patient information from medical documents and electronic health records.

Artificial intelligence for de-identification of medical documents


Note: This blog is a collection of non-technical topics related to the application of artificial intelligence, machine learning, and deep learning to health care. Please refer to Practical data science and engineering for data- and software-engineering-specific issues such as design and coding.


Background

The Health Insurance Portability and Accountability Act (HIPAA), enacted in 1996, is a national statute that safeguards confidential patient health data from being released without the patient's permission or awareness [ref 1]. The Privacy Rule within HIPAA sets forth regulations concerning the handling and sharing of an individual's health information. In practice, HIPAA encompasses 18 distinct regulations governing the management of Protected Health Information (PHI).


De-identification

The patient information deemed sensitive includes name, address, phone number, social security number, health insurance data, primary care provider, and place and date of service, as illustrated in the following snippet of a clinical note.
 
DOB: XX/XX/XXXX
Patient: XXXX XXXX
History Number: XXXXXXX
Exam: Guided right knee arthrogram. Procedure performed by XXXX XXXX MD with assistance by resident XXX XXX, M.D.
INDICATION: XX years old Female with  ACUTE RT KNEE SURG XX.XX.XXXX 
DAP: 22.1 uGy*m2
The patient was interviewed prior to exam ...
...

There are several methods to deal with patient data:

  • Removal: The words related to the patient's personal information are simply deleted
  • Substitution by indexing: Each word associated with private patient data is replaced by an index or key so the personal information can be retrieved later if needed. Hashing is the most common way to implement this approach
  • Perturbation: The patient data is obfuscated by altering or encrypting the original words so the overall structure of the note is preserved

This study is limited to the removal solution.
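To make the substitution-by-indexing idea concrete, here is a minimal sketch. It assumes the PHI terms have already been identified by some upstream step; the function name and the sample note are illustrative, not part of any real system.

```python
import hashlib

def substitute_by_indexing(text, phi_terms):
    """Replace each PHI term with a short hash key and keep a lookup
    table so the original values can be restored later if needed."""
    lookup = {}
    for term in phi_terms:
        # A truncated SHA-256 digest serves as the replacement key
        key = hashlib.sha256(term.encode()).hexdigest()[:8]
        lookup[key] = term
        text = text.replace(term, f"[{key}]")
    return text, lookup

note = "Patient: John Smith  DOB: 01.02.1960"
redacted, table = substitute_by_indexing(note, ["John Smith", "01.02.1960"])
```

The lookup table must itself be stored securely, since it reverses the anonymization; with removal (the approach studied here) no such table exists.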

 

Techniques

There are many techniques available to cleanse medical documents of private personal information. I restricted my analysis to three very different methods:
  • Regular expressions, using simple programming techniques
  • Term frequency, from information retrieval
  • Large language models, from machine (deep) learning

Regular expression

The most straightforward strategy is to pinpoint the format of each confidential term, for instance, a social security number like 123-45-6789, or a date of birth formatted as 01.01.2020. While this purely coding-based solution is easy to put into action, it's susceptible to mistakes.
Identifiers like health insurance numbers or addresses can differ significantly in their presentation across medical records.
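A minimal sketch of this approach follows. The patterns below are illustrative only; as noted above, real records vary too widely in format for a fixed set of expressions to be reliable.

```python
import re

# Illustrative patterns for a few well-formatted identifiers
PATTERNS = {
    "SSN":  re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # e.g. 123-45-6789
    "DATE": re.compile(r"\b\d{2}[./]\d{2}[./]\d{4}\b"),  # e.g. 01.01.2020
}

def redact(text):
    """Replace every pattern match with its tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

print(redact("SSN 123-45-6789, DOB 01.01.2020"))
# → SSN [SSN], DOB [DATE]
```

A free-text address or an insurance number with no fixed layout would slip through these patterns entirely, which is exactly the weakness described above.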
 

Term frequency

Term frequency is a metric used in information retrieval (IR) and machine learning to assess the significance of textual elements (words, phrases, or lemmas) within a single document that belongs to a broader corpus. It is essentially the tally of how often a particular word occurs in a document.

Given that identifiers pertaining to a patient are unique to an individual clinical note or Electronic Health Record (EHR) within a vast collection of documents, their occurrence rate is typically quite low.

Employing this method marks a substantial advancement over the rudimentary use of regular expressions. Nevertheless, its reliability may diminish when the same terms related to a patient, a healthcare provider, or an insurance company appear across multiple documents.

Consider this instance: a term frequency ranking derived from a set of 12,600 radiology reports.

with,60712
exam,50181
breast,37359
date,36043
doctor,36000
are,29177
findings,28063
screening,28003
imaging,27309
cancer,27155
....
Francis,3
Jessica,2
97812,1
zahir,1
....
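The low-frequency filter behind this ranking can be sketched as follows. The function name, threshold, and toy corpus are illustrative assumptions, not the actual pipeline used on the 12,600 reports.

```python
from collections import Counter

def rare_term_candidates(documents, max_count=3):
    """Flag terms whose corpus-wide frequency is at or below a
    threshold; in a large corpus, patient-specific identifiers
    such as names and record numbers tend to be rare."""
    counts = Counter(
        word.lower() for doc in documents for word in doc.split()
    )
    return {term for term, n in counts.items() if n <= max_count}

docs = [
    "screening exam findings normal",
    "screening exam findings abnormal",
    "screening exam for patient zahir id 97812",
]
print(rare_term_candidates(docs, max_count=1))
```

Note that rare but legitimate medical terms are flagged along with identifiers, which is the false-positive risk discussed in the evaluation section.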

Natural language processing with transformers

Named Entity Recognition (NER) involves detecting and classifying crucial elements in text, such as individual words or clusters of words, into predefined categories like Person, Location, Date. In the context we're discussing, each element of Protected Health Information (PHI)—such as names, addresses, dates, etc.—is assigned a specific tag. These tags can appear in disparate sections of medical documents and may vary in format and length.

The central concept is to deploy NER techniques to extract the different components of PHI from each document.
 
The following PHI tags are used in our evaluation: 
  • Date of service
  • Place of service
  • Medical record id or EHR id
  • Patient name
  • Patient id
  • Patient address
  • Patient city
  • Patient state
  • Patient ZIP 
  • Patient health plan
  • Patient health insurance group
  • Patient health insurance member
  • Age
  • Date of birth
  • Provider id
  • Provider name
  • Provider specialty
 
A transformer is an encoder-decoder architecture that uses self-attention mechanisms to pass an entire sequence of terms to the decoder at once. Bidirectional Encoder Representations from Transformers (BERT) is a commonly used transformer encoder that relies on predicting the order of sentences or segments in a document and predicting masked tokens to generate an embedding, or representation, of that document [ref 2].
Consider the age PHI tag: the sentence "Robert is 63 years with mild diabetes ..." is encoded into a vector of floating-point values, then decoded to "Robert is AA years with mild diabetes ..." [ref 3].
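Once a NER model has produced character spans and PHI tags, the redaction step itself is simple. The sketch below assumes the (start, end, tag) spans come from a fine-tuned BERT NER model, which is not shown; the span values here are hypothetical model output, hand-written for the example.

```python
def redact_with_tags(text, entities):
    """Replace each detected PHI span with its tag. `entities` is a
    list of (start, end, tag) tuples assumed to come from a NER model."""
    # Process spans right-to-left so earlier offsets stay valid
    for start, end, tag in sorted(entities, reverse=True):
        text = text[:start] + f"[{tag}]" + text[end:]
    return text

sentence = "Robert is 63 years with mild diabetes"
spans = [(0, 6, "PATIENT_NAME"), (10, 12, "AGE")]  # hypothetical NER output
print(redact_with_tags(sentence, spans))
# → [PATIENT_NAME] is [AGE] years with mild diabetes
```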

The BERT encoder can be effectively used in conjunction with the term frequency filter. As mentioned previously, PHI data has a lower relative frequency of occurrence than medical terms within a large corpus of clinical notes.

PHI data and medical terms can be pre-trained concurrently by grouping all PHI data into one sentence and breaking the other components of the document into separate sentences.


Evaluation

We need to define metrics to compare the various techniques for de-identifying patient information in a given medical document. The outcomes are counted as follows:

  • False positive (fp): The algorithm removed data it should not have (low risk)
  • False negative (fn): The algorithm failed to remove some of the personal data
  • True positive (tp): The algorithm succeeded in removing the appropriate patient-specific data

From these counts we derive the quality metrics commonly used in data science: precision = tp / (tp + fp), recall = tp / (tp + fn), and the F1 score, the harmonic mean of precision and recall.

The cumbersome task of annotating/labeling the medical documents for evaluation limits the analysis to 1,640 clinical notes.
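These metrics can be computed directly from the three counts. The counts below are made-up numbers for illustration, not results from the 1,640 annotated notes.

```python
def deid_metrics(tp, fp, fn):
    """Precision, recall, and F1 from de-identification counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = deid_metrics(tp=90, fp=5, fn=10)  # illustrative counts
```

For de-identification, recall matters most: a false negative leaks patient data, while a false positive merely deletes a harmless term.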

 Notes
  • The TF values have been normalized to the range [0, 1]
  • The BERT encoder used to extract PHI tags generated an embedding vector of size 768, with a maximum of 512 tokens per sentence
 

References

[1] Health Insurance Portability and Accountability Act of 1996
