
Collection and Analysis of Medical Text Data

At the time of writing, PubMed holds over 26 million citations of biomedical literature.

As discussed in the previous post, initial research has been directed towards Post-Finasteride Syndrome, so a starting point was required.

There are several theories as to what causes Post-Finasteride Syndrome, but I decided to start with the theory most often discussed in forums: that some sort of hormonal dysregulation has taken place and has not been corrected even after cessation of the drug.

Regarding Chronic Fatigue: again, many theories exist, ranging from HPA axis dysregulation and viral infection to psychiatric conditions.

So several medical topics had to be taken into consideration by collecting all relevant research from PubMed. To do this I used a Python package called Biopython. Since I had no access to the full text, only the abstracts were collected.
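A minimal sketch of how such a collection step might look with Biopython's `Entrez` and `Medline` modules. The helper names, the batch size and the example search term are assumptions made for illustration; this is not the exact script used for this post.

```python
from typing import Iterator, List, Tuple


def batched(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield successive chunks of PubMed IDs so each request stays small."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_abstracts(term: str, email: str, retmax: int = 100,
                    batch_size: int = 50) -> Iterator[Tuple[str, str, str]]:
    """Search PubMed for `term` and yield (pmid, title, abstract) tuples.

    Needs Biopython and network access. NCBI asks callers to supply
    a contact email address.
    """
    from Bio import Entrez, Medline  # lazy import: third-party dependency

    Entrez.email = email
    handle = Entrez.esearch(db="pubmed", term=term, retmax=retmax)
    search = Entrez.read(handle)
    handle.close()

    for chunk in batched(list(search["IdList"]), batch_size):
        handle = Entrez.efetch(db="pubmed", id=",".join(chunk),
                               rettype="medline", retmode="text")
        for record in Medline.parse(handle):
            # "AB" is absent for citations that have no abstract
            yield record.get("PMID", ""), record.get("TI", ""), record.get("AB", "")
        handle.close()
```

Iterating over `fetch_abstracts("cytochrome P450", "you@example.org", retmax=5)` would yield up to five records. For a collection on the scale of millions of citations, a bulk download of PubMed would probably be more practical than repeated queries.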

The following snapshot shows the results from collecting PubMed entries regarding the Cytochrome P450 :

However, this information needs further analysis so that we understand more about the associated context. Simply identifying that a specific PubMed entry mentions P450, specific CYPs (e.g. CYP1B1) or specific diseases such as Diabetes does not really help us.

With the help of Information Extraction we gain access to much finer detail on the relevant knowledge that exists within PubMed text.

Let's look at an example. We wish to identify PubMed texts regarding Endoplasmic Reticulum Stress, also known as ER Stress. However, simply identifying this piece of text is far less helpful than finding what induces ER Stress. Consider the following:

Here we see how an Information Extraction tool (in this example, GATE) is able to identify the parts of the text that mention induction of ER Stress. In this way, and with a bit more work, we can later use this information to automatically identify common inducers of ER Stress or compounds that ameliorate ER Stress.

Note that in our example shown above, ER Stress induction is found (among other Topics) with :


Note also that there exists an entry that discusses amelioration of ER Stress after an induction process.
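To give a feel for what "identifying induction of ER Stress" means in practice, here is a rough sketch using plain Python regular expressions. This is not GATE and not the grammar used for this post; the two patterns, the verb list and the example sentences are assumptions chosen only to illustrate the idea.

```python
import re
from typing import List

ER_STRESS = r"(?:ER|endoplasmic reticulum) stress"

# Hypothetical patterns, far simpler than a real extraction grammar
PATTERNS = [
    # "tunicamycin induced ER stress", "thapsigargin triggers severe ER stress"
    re.compile(
        rf"\b(?P<agent>[\w-]+)\s+(?:induces|induced|induce|triggers|triggered)"
        rf"\s+(?:\w+\s+){{0,2}}{ER_STRESS}",
        re.IGNORECASE,
    ),
    # "palmitate-induced ER stress"
    re.compile(rf"\b(?P<agent>\w+)-induced\s+{ER_STRESS}", re.IGNORECASE),
]


def find_inducers(text: str) -> List[str]:
    """Return candidate inducers of ER stress in order of appearance."""
    hits = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group("agent").lower()))
    return [agent for _, agent in sorted(hits)]


sample = (
    "Tunicamycin induced ER stress in hepatocytes. "
    "Palmitate-induced ER stress was ameliorated by treatment. "
    "Thapsigargin triggers severe endoplasmic reticulum stress."
)
print(find_inducers(sample))
```

A dedicated tool handles negation, passive voice and long-range references, which simple patterns like these will miss.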

Having knowledge of this kind of finer detail, we may now identify associations between Topics of Interest.

For example :


This means that, after analyzing thousands of PubMed entries, an association was found between mentions of induction of ER Stress and Type 2 Diabetes. The same process can be applied to any other kind of medical entity, such as genes. So we may also find that:


This means that the xbp1 gene was found to be associated with mentions of Alzheimer's Disease risk in PubMed texts. Note again the context: it is far more informative to identify qualitative information (in this case, the risk of Alzheimer's) than simply to know that there is an association between the xbp1 gene and Alzheimer's.
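One simple way to quantify an association between two Topics is to count how often they are mentioned in the same abstract and compare that with what chance alone would give (the "lift"). The sketch below assumes this measure purely for illustration; the post does not state which measure was actually used. The toy documents and every topic label other than ER_STRESS_INDUCTION are made up.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def cooccurrence_lift(docs: List[Set[str]]) -> Dict[Tuple[str, str], float]:
    """Return lift = P(a and b) / (P(a) * P(b)) for each co-mentioned topic pair.

    Keys are alphabetically sorted (a, b) tuples. A lift above 1 means the
    two topics appear together more often than independence would predict.
    """
    n = len(docs)
    single: Counter = Counter()
    pair: Counter = Counter()
    for topics in docs:
        single.update(topics)
        pair.update(combinations(sorted(topics), 2))
    return {
        (a, b): count * n / (single[a] * single[b])
        for (a, b), count in pair.items()
    }


# Toy data: each set holds the Topics extracted from one abstract
toy_docs = [
    {"ER_STRESS_INDUCTION", "TYPE_2_DIABETES"},
    {"ER_STRESS_INDUCTION", "TYPE_2_DIABETES"},
    {"ER_STRESS_INDUCTION"},
    {"TYPE_2_DIABETES", "OBESITY"},
    {"OBESITY"},
    {"CYP1B1"},
]
lift = cooccurrence_lift(toy_docs)
```

On real data one would also want a significance test and a minimum count threshold, since rare topics produce unstable ratios.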

Apart from Information Extraction, several other methods were used, including word2vec as implemented in the Gensim package, which was used to automatically identify, as an example, common inducers and inhibitors of P450.
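A hedged sketch of the word2vec step, assuming Gensim 4.x parameter names. The tokenizer, the hyperparameters and the function names are illustrative assumptions, not the settings used for this post.

```python
import re
from typing import List, Tuple


def tokenize(abstract: str) -> List[str]:
    """Lowercase and split on non-alphanumerics, keeping tokens like 'cyp3a4'."""
    return re.findall(r"[a-z0-9]+", abstract.lower())


def nearest_terms(abstracts: List[str], query: str,
                  topn: int = 5) -> List[Tuple[str, float]]:
    """Train a small word2vec model and return the terms closest to `query`.

    Needs Gensim. Raises KeyError if `query` never occurs in the abstracts.
    """
    from gensim.models import Word2Vec  # lazy import: third-party dependency

    sentences = [tokenize(a) for a in abstracts]
    model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                     min_count=1, workers=1, seed=0)
    return model.wv.most_similar(query, topn=topn)
```

With a large enough corpus, querying the neighbours of a term such as `"cyp3a4"` tends to surface words used in similar contexts, which is how candidate inducers and inhibitors can be suggested. A handful of abstracts is not enough for meaningful neighbours.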

At the time of writing, 592 Topics, including Genes, Diseases, Syndromes and qualitative information (such as the topic ER_STRESS_INDUCTION discussed above), have been extracted from 8,052,820 PubMed citations.

The next step was to analyze this data and represent it in such a way that it may be used as input to several Machine Learning algorithms, with the goal of forming a hypothesis as to what lies behind Chronic Fatigue, Post-Finasteride and several other syndromes of unknown origin.
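One common representation for this kind of input is a binary matrix with one row per abstract and one column per Topic. The sketch below assumes that layout; the actual representation is not described here, and the identifiers and labels in the example are placeholders.

```python
from typing import Dict, List, Set, Tuple


def topic_matrix(doc_topics: Dict[str, Set[str]]) -> Tuple[List[str], List[str], List[List[int]]]:
    """Turn {document id: topics} into (ids, topic columns, binary rows)."""
    ids = sorted(doc_topics)
    columns = sorted({t for topics in doc_topics.values() for t in topics})
    rows = [[1 if c in doc_topics[i] else 0 for c in columns] for i in ids]
    return ids, columns, rows


# Placeholder ids and labels, for illustration only
ids, columns, rows = topic_matrix({
    "doc_b": {"TYPE_2_DIABETES"},
    "doc_a": {"ER_STRESS_INDUCTION", "TYPE_2_DIABETES"},
})
```

The `rows` list can be passed directly to most Machine Learning libraries as a feature matrix.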

