EHR Anonymization using AI
Note: This blog is a collection of non-technical topics related to the application of artificial intelligence, machine learning and deep learning to health care. Please refer to Practical data science and engineering for data and software engineering issues such as design and coding.
Background
De-identification
The patient information deemed sensitive includes name, address, phone number, social security id, health insurance data, primary care provider, place and date of service as illustrated with the following snippet of a clinical note.
There are several methods to deal with patient data:
- Removal: The words related to the patient's personal information are simply deleted
- Substitution by indexing: Each word associated with the patient's private information is replaced by an index or key so that the personal information can potentially be retrieved later. Hashing is the most common way to implement this technique
- Perturbation: The patient data is obfuscated by altering or encrypting the original words so the entire structure of the note is preserved
This study is limited to the removal solution.
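The three approaches listed above can be contrasted with a minimal Python sketch. It assumes the sensitive tokens are already known (the `phi` set below is a hypothetical input), and the function names, the salt and the same-length masking used for perturbation are illustrative choices, not the method used in this study.

```python
import hashlib

def remove(tokens, phi):
    """Removal: drop every token flagged as personal information."""
    return [t for t in tokens if t not in phi]

def substitute(tokens, phi, salt="demo-salt"):
    """Substitution by indexing: replace each flagged token by a hashed key.

    The lookup table maps each key back to the original word, so the
    information can be retrieved later by whoever holds the table.
    """
    lookup = {}
    output = []
    for t in tokens:
        if t in phi:
            key = hashlib.sha256((salt + t).encode("utf-8")).hexdigest()[:12]
            lookup[key] = t
            output.append(key)
        else:
            output.append(t)
    return output, lookup

def perturb(tokens, phi, mask="X"):
    """Perturbation (assumed here to be same-length masking): the number
    and position of the words in the note are preserved."""
    return [mask * len(t) if t in phi else t for t in tokens]

tokens = "Patient John Smith seen on 2021-03-04".split()
phi = {"John", "Smith", "2021-03-04"}

print(remove(tokens, phi))
print(substitute(tokens, phi)[0])
print(perturb(tokens, phi))
```

Removal is the simplest of the three because it needs neither a lookup table nor a masking scheme, at the cost of changing the length and structure of the note.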
Techniques
There are many techniques available to cleanse medical documents of private personal information. I restricted my analysis to three very different methods:
- Regular expressions, a simple programming technique
- Term frequency, from information retrieval
- Large language models, from machine (deep) learning
Regular expression
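As a minimal sketch of the regular expression approach, the snippet below deletes a few identifiers that follow a rigid format. The patterns (US-style social security number, phone number and numeric date) and the function name `remove_phi` are assumptions made for illustration; real clinical notes would need a far larger set of patterns, and free-form identifiers such as names cannot be captured this way.

```python
import re

# Hypothetical patterns for identifiers with a fixed layout.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # social security number
    re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),   # phone number
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),   # numeric date
]

def remove_phi(text):
    """Delete every match of the patterns, then tidy up repeated spaces."""
    for pattern in PHI_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()

note = "SSN 123-45-6789, phone 555-123-4567, seen 03/04/2021 for screening."
print(remove_phi(note))
```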
Term frequency
Term frequency is essentially the tally of how often a particular word occurs in a document.
Consider this instance: a term frequency ranking derived from a set of 12,600 radiology reports, listed as term,count pairs.
exam,50181
breast,37359
date,36043
doctor,36000
are,29177
findings,28063
screening,28003
imaging,27309
cancer,27155
Given that identifiers pertaining to a patient are distinct to an individual clinical note or Electronic Health Record (EHR) within a vast compilation or database of documents, their occurrence rate is typically quite low.
Employing this method marks a substantial advancement beyond the rudimentary use of regular expressions. Nevertheless, its reliability may diminish when the same terms related to a patient, a healthcare provider, or an insurance company are present across multiple documents.
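The idea of treating rare terms as candidate patient identifiers can be sketched as follows. The tokenizer, the normalization by the most frequent term, the 0.5 threshold and the three-note corpus are all assumptions chosen for illustration; they are not the settings used on the radiology reports.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequencies(corpus):
    """Count each term over the whole corpus, scaled to the range [0, 1]."""
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    top = max(counts.values(), default=1)
    return {term: count / top for term, count in counts.items()}

def remove_rare_terms(doc, tf, threshold=0.5):
    """Drop every word whose normalized frequency falls below the threshold."""
    kept = []
    for word in doc.split():
        if all(tf.get(tok, 0.0) >= threshold for tok in tokenize(word)):
            kept.append(word)
    return " ".join(kept)

# Toy corpus with made-up patient names.
corpus = [
    "screening exam for patient alice findings normal",
    "screening exam for patient bob findings normal",
    "screening exam for patient carol findings abnormal",
]
tf = term_frequencies(corpus)
for doc in corpus:
    print(remove_rare_terms(doc, tf))
```

In this toy corpus the rare medical term "abnormal" is removed along with the names, which is a false positive. A name shared by many notes would have the opposite problem and survive the filter.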
Natural Language Processing with transformer
- Date of service
- Place of service
- Medical record id or EHR id
- Patient name
- Patient id
- Patient address
- Patient city
- Patient state
- Patient ZIP
- Patient health plan
- Patient health insurance group
- Patient health insurance member
- Age
- Date of birth
- Provider id
- Provider name
- Provider specialty
The BERT encoder can be effectively used in conjunction with the term frequency filter. As mentioned previously, protected health information (PHI) has a lower relative frequency of occurrence than medical terms within a large corpus of clinical notes.
The model can be pre-trained on PHI data and medical terms concurrently by grouping all PHI data into one sentence and splitting the other components of the document into other sentences.
Evaluation
We need to define the metrics to compare the various techniques for de-identifying patient information for any given medical document. We use the following quality variables commonly used in data science:
- True positive (tp): The algorithm succeeded in removing the appropriate patient-specific data
- False positive (fp): The algorithm removed data it should not have (low risk)
- False negative (fn): The algorithm failed to remove some of the personal data
The TF values have been normalized to the range [0, 1].
The BERT encoder used to extract PHI tags generated an embedding vector of size 768 with a maximum of 512 tokens per sentence.
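The true positive, false positive and false negative counts defined in this section are usually combined into precision, recall and F1 score. The post does not state which of these it reports, so the sketch below is an assumption about the likely metrics, and the counts in the usage example are made up.

```python
def precision(tp, fp):
    """Share of removed items that were actually personal data."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of personal data items that were actually removed."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Made-up counts for illustration only.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn))
```

Since a false positive is low risk while a false negative leaves personal data in the note, recall is the figure to watch most closely when comparing de-identification techniques.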