
Collection and Analysis of Medical Text Data

At the time of writing, PubMed holds a wealth of information: over 26 million citations for biomedical literature.

As discussed in the previous post, my initial research was directed towards Post-Finasteride Syndrome, so a starting point was required.

There are several theories as to what causes Post-Finasteride Syndrome, but I decided to start with the theory most often discussed in forums: that some sort of hormonal dysregulation has taken place and has not been corrected even after cessation of the drug.

Regarding Chronic Fatigue: again, many theories exist, ranging from HPA axis dysregulation and viral infection to psychiatric conditions.

So several medical topics had to be taken into consideration, by collecting all the relevant research on them from PubMed. To do this I used a Python package called Biopython. Since I had no access to the full text, only the abstracts were collected.
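For readers who want to try something similar, here is a minimal sketch of how abstracts can be pulled from PubMed with Biopython's Entrez module. It is not the exact script used for this blog. The function names, the batch size and the returned fields are my own illustrative choices, and the query and e-mail address in the usage comment are placeholders.

```python
from typing import Iterator, List


def batched(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `size` items long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_abstracts(query: str, email: str, max_records: int = 200,
                    batch_size: int = 100) -> List[dict]:
    """Search PubMed for `query` and return PMID, title and abstract per hit."""
    # Imported here so that batched() stays usable without Biopython installed.
    from Bio import Entrez, Medline

    Entrez.email = email  # NCBI asks every client to identify itself

    # Step 1: ask PubMed for the IDs of the matching citations.
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_records)
    pmids = list(Entrez.read(handle)["IdList"])
    handle.close()

    # Step 2: download the records in batches, in MEDLINE text format.
    records = []
    for batch in batched(pmids, batch_size):
        handle = Entrez.efetch(db="pubmed", id=",".join(batch),
                               rettype="medline", retmode="text")
        for rec in Medline.parse(handle):
            records.append({
                "pmid": rec.get("PMID", ""),
                "title": rec.get("TI", ""),
                "abstract": rec.get("AB", ""),  # empty when no abstract exists
            })
        handle.close()
    return records


# Example call (placeholder query and address, needs network access):
# for rec in fetch_abstracts("cytochrome P450[Title/Abstract]", "you@example.org", 5):
#     print(rec["pmid"], rec["title"])
```

Fetching in batches keeps each request small. For collections in the millions of citations, NCBI's usage guidelines on request rates would also need to be respected.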

The following snapshot shows the results from collecting PubMed entries regarding the Cytochrome P450 :




However, this information needs further analysis so that we understand more about the associated context. Simply identifying that a specific PubMed entry mentions P450, specific CYPs (e.g. CYP1B1) or specific diseases such as diabetes does not really help us.

With the help of Information Extraction we can get at much finer detail in the relevant knowledge that exists within PubMed text.

Let's look at an example. We wish to identify PubMed texts regarding Endoplasmic Reticulum Stress, also known as ER Stress. However, simply finding mentions of this term is far less helpful than finding what induces ER Stress. Consider the following:



Here we see how an Information Extraction tool (in this example, GATE) is able to identify the parts of the text that mention induction of ER Stress. In this way, and with a bit more work, we can later use this information to automatically identify common inducers of ER Stress, or compounds that ameliorate it.

Note that in the example shown above, ER Stress induction is found (among other Topics) together with:

-Hyperlipidemia
-Hyperhomocysteinemia
-Hyperglycemia
-Alcohol

Note also that there is an entry that discusses amelioration of ER Stress after an induction process.
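To give a feel for what such an extraction rule does, here is a deliberately small Python sketch based on regular expressions. It is only a stand-in: GATE rules are written in its own rule language (JAPE) and work on richer annotations than raw strings. The trigger verbs, the three sentence patterns and the function name below are hypothetical and cover only a handful of phrasings.

```python
import re
from typing import List

# Hypothetical vocabulary; a real rule set would be far larger.
TARGET = r"\b(?:ER|endoplasmic reticulum) stress"
INDUCE = r"(?:induces|induced|induce|triggers|triggered|causes|caused)"
ENTITY = r"\b([A-Za-z]+)"  # single-word inducer, captured as group 1

PATTERNS = [
    # "Hyperglycemia induces ER stress"
    re.compile(rf"{ENTITY} {INDUCE} {TARGET}", re.IGNORECASE),
    # "ER stress was induced by alcohol"
    re.compile(rf"{TARGET} (?:is |was )?{INDUCE} by {ENTITY}", re.IGNORECASE),
    # "Alcohol-induced ER stress"
    re.compile(rf"{ENTITY}-induced {TARGET}", re.IGNORECASE),
]


def find_er_stress_inducers(text: str) -> List[str]:
    """Return the lowercased inducers mentioned in `text`, in order of appearance."""
    found = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                found.append(match.group(1).lower())
    return found
```

Counting how often each inducer is returned across thousands of abstracts is the kind of "bit more work" that turns single matches into a list of common inducers.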


Having knowledge of this kind of finer detail, we may now identify associations between Topics of Interest.

For example :


ER_STRESS_INDUCTION <=> TYPE2_DIABETES


This means that after analyzing thousands of PubMed entries, an association has been found between mentions of induction of ER Stress and Type 2 Diabetes. The same process can be applied to any other kind of medical entity, such as genes. So we may also find that:

XBP1_GENE <=> ALZHEIMERS_DISEASE_RISK


Meaning that the XBP1 gene was found to be associated with mentions of Alzheimer's Disease risk in PubMed texts. Note again the context here: it is far more informative to capture qualitative information (in this case, the risk of Alzheimer's) than simply to know that there is an association between the XBP1 gene and Alzheimer's.
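One simple way to score associations like the ones above is to count how often two Topics are annotated in the same abstract and compare that with what chance alone would give. The sketch below uses lift for this. That measure is my own assumption for illustration; the post does not say which measure was actually used. The function name and the `min_support` threshold are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def topic_lift(docs: Iterable[Set[str]],
               min_support: int = 2) -> Dict[Tuple[str, str], float]:
    """Score each topic pair by lift = P(a and b) / (P(a) * P(b)).

    `docs` holds one set of topic labels per abstract. A lift above 1 means
    the two topics appear together more often than independence predicts.
    Pairs seen together in fewer than `min_support` abstracts are dropped.
    """
    docs = list(docs)
    n = len(docs)
    single: Counter = Counter()
    pair: Counter = Counter()
    for topics in docs:
        single.update(topics)
        # Sort so that each pair is always stored in the same order.
        pair.update(combinations(sorted(topics), 2))
    return {
        (a, b): (count * n) / (single[a] * single[b])
        for (a, b), count in pair.items()
        if count >= min_support
    }
```

With real data, `docs` would be built from the annotations produced by the Information Extraction step, one set of Topic labels per PubMed abstract.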

Apart from Information Extraction, several other methods were used. One of them is word2vec, as implemented in the Gensim package, which was used to automatically identify (as an example) common inducers and inhibitors of P450.

At the time of writing, 592 Topics, which include Genes, Diseases, Syndromes and qualitative information (such as the topic ER_STRESS_INDUCTION discussed above), have been extracted from 8,052,820 PubMed citations.

The next step was to analyze this data and represent it in such a way that it can be used as input to several Machine Learning algorithms, with the goal of forming a hypothesis as to what lies behind Chronic Fatigue, Post-Finasteride and several other syndromes of unknown origin.
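As a rough illustration of the word2vec step, the sketch below trains Gensim's `Word2Vec` on tokenized abstracts and then asks for the words closest to a known compound. The tokenizer, the parameter values and the query term are assumptions for the example, not the settings used for this blog. Parameter names follow Gensim 4.x.

```python
import re
from typing import Iterable, List


def tokenize(abstract: str) -> List[str]:
    """Lowercase an abstract and split it into alphanumeric tokens."""
    # Keeps identifiers such as "cyp3a4" together as a single token.
    return re.findall(r"[a-z0-9]+", abstract.lower())


def train_embeddings(abstracts: Iterable[str]):
    """Train a word2vec model in which each abstract is one 'sentence'."""
    # Imported here so that tokenize() stays usable without Gensim installed.
    from gensim.models import Word2Vec

    sentences = [tokenize(a) for a in abstracts]
    # Illustrative values only; min_count drops words seen fewer than 5 times.
    return Word2Vec(sentences=sentences, vector_size=100, window=5,
                    min_count=5, workers=4)


# Example use, given a list of abstract strings called `abstracts`:
# model = train_embeddings(abstracts)
# print(model.wv.most_similar("ketoconazole", topn=10))
```

The idea is that compounds used in similar contexts end up with similar vectors, so the neighbours of one known P450 inhibitor are candidates for being inhibitors too. Those candidates still need to be checked against the source abstracts.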
