Exploratory Data Analysis (EDA) on Clinical Trials Related to COVID-19

Using Python to visualize coronavirus-related trials on ClinicalTrials.gov

Jun 28 ·5min read

Photo by Luke Chesser on Unsplash

ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world. It is maintained by the U.S. National Library of Medicine at the National Institutes of Health. All data is publicly available, and the site provides a direct download feature that makes it easy to use the relevant data for analysis. This article demonstrates a step-by-step exploratory data analysis of the COVID-19 clinical trials listed on the site.

Clinical Terminology

Before jumping into the dataset, let's look at some basic definitions of common clinical trial terms:

For more information on clinical trial terminology, refer to this and this.

Data Download

ClinicalTrials.gov provides a simple interface to search and download data. The search form sits directly on the home page, as shown below:

Source: ClinicalTrials.gov

Entering the keyword “COVID 19” in the “Other terms” field lists all studies in which “COVID 19” or one of its synonyms appears, so the search also picks up studies that use a related term. Our search included the following terms:

Source: ClinicalTrials.gov

Click “Download” in the upper-right corner and choose “XML Download” on the next screen.

Source: ClinicalTrials.gov

A zip file containing XML files for the search results will be downloaded. Each XML file corresponds to one study. The filename is the NCT number, which is the unique identifier of a study in the ClinicalTrials.gov repository.
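The unpacking step can be scripted as well. The following is a minimal sketch, not the author's code: the helper name `extract_studies` and the archive name `search_result.zip` are hypothetical, and the demo builds a tiny stand-in archive with invented NCT numbers so that it runs without a real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_studies(zip_path, out_dir):
    """Unpack the archive and return the NCT numbers taken from the file names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    # Each study is one XML file named after its NCT number
    return sorted(p.stem for p in Path(out_dir).glob('*.xml'))


# Self-contained demo: a stand-in archive instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / 'search_result.zip'  # hypothetical file name
    with zipfile.ZipFile(demo_zip, 'w') as archive:
        archive.writestr('NCT00000001.xml', '<clinical_study/>')  # invented IDs
        archive.writestr('NCT00000002.xml', '<clinical_study/>')
    nct_ids = extract_studies(demo_zip, Path(tmp) / 'COVID_19_Studies')

print(nct_ids)
```

For a real download, the same helper would be pointed at the downloaded zip file and at the `COVID_19_Studies` folder used in the loading code.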

Loading Data into a pandas DataFrame

import os
from xml.etree import ElementTree

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

data_dir = './COVID_19_Studies'
list_of_files = os.listdir(data_dir)
df_covid = pd.DataFrame()
df = pd.DataFrame()
list_keywords = []

# Read in data: one XML file per study
# Assumption: file_path is the file's location inside the download folder
# and i is its position in the listing (used as the row index)
for i, file in enumerate(list_of_files):
    file_path = os.path.join(data_dir, file)
    tree = ElementTree.parse(file_path)
    root = tree.getroot()

    trial = {}
    trial['id'] = root.find('id_info').find('nct_id').text
    trial['overall_status'] = root.find('overall_status').text
    trial['study_type'] = root.find('study_type').text
    # Optional fields fall back to an empty string when the element is absent
    if root.find('start_date') is not None:
        trial['start_date'] = root.find('start_date').text
    else:
        trial['start_date'] = ''
    if root.find('enrollment') is not None:
        trial['enrollment'] = root.find('enrollment').text
    else:
        trial['enrollment'] = ''
    trial['condition'] = root.find('condition').text
    if root.find('location_countries') is not None:
        trial['location_countries'] = root.find('location_countries').find('country').text
    else:
        trial['location_countries'] = ''
    if root.find('intervention') is not None:
        trial['intervention'] = root.find('intervention').find('intervention_name').text
    else:
        trial['intervention'] = ''
    # Collect the keywords of all studies for the word cloud
    for entry in root.findall('keyword'):
        list_keywords.append(entry.text)
    # Use the brief title when no official title is present
    if root.find('official_title') is None:
        trial['title'] = root.find('brief_title').text
    else:
        trial['title'] = root.find('official_title').text
    date_string = root.find('required_header').find('download_date').text
    trial['date_processed'] = date_string.replace('ClinicalTrials.gov processed this data on ', '')
    trial['sponsors'] = root.find('sponsors').find('lead_sponsor').find('agency').text
    df = pd.DataFrame(trial, index=[i])
    df_covid = pd.concat([df_covid, df])
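The repeated presence checks in the loop could be folded into a small helper. The sketch below is one possible variant rather than the original code: `text_or_default` and `parse_trial` are hypothetical names, and the XML fragment is hand-written with invented values, reusing only element names that appear in the loading code.

```python
from xml.etree import ElementTree

# Hand-written stand-in for one study file; the values are invented.
SAMPLE_XML = """
<clinical_study>
  <id_info><nct_id>NCT00000001</nct_id></id_info>
  <overall_status>Recruiting</overall_status>
  <study_type>Interventional</study_type>
  <enrollment>120</enrollment>
  <brief_title>Illustrative study</brief_title>
</clinical_study>
"""


def text_or_default(root, path, default=''):
    """Return the text of the first element matching path, or default if absent."""
    node = root.find(path)
    return node.text if node is not None else default


def parse_trial(root):
    """Collect a few fields of one study into a plain dict."""
    return {
        'id': text_or_default(root, 'id_info/nct_id'),
        'overall_status': text_or_default(root, 'overall_status'),
        'study_type': text_or_default(root, 'study_type'),
        'start_date': text_or_default(root, 'start_date'),
        'enrollment': text_or_default(root, 'enrollment'),
        # Fall back to the brief title when no official title is present
        'title': (text_or_default(root, 'official_title')
                  or text_or_default(root, 'brief_title')),
    }


rows = [parse_trial(ElementTree.fromstring(SAMPLE_XML))]
print(rows[0])
```

Collecting one dict per study in a list like `rows` and calling `pd.DataFrame(rows)` once after the loop would also avoid the repeated `pd.concat` calls.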

View Total Studies and Attribute Names for each Study

A total of 2,439 studies were downloaded.
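The study count and the attribute names can be read directly off the DataFrame. The sketch below uses a toy stand-in called `df_demo` (a hypothetical name, with invented values) so that it runs on its own; with the real data the same two calls would be applied to `df_covid`.

```python
import pandas as pd

# Toy stand-in for df_covid; the values are invented for illustration only.
df_demo = pd.DataFrame({
    'id': ['NCT00000001', 'NCT00000002', 'NCT00000003'],
    'overall_status': ['Recruiting', 'Completed', 'Recruiting'],
    'study_type': ['Interventional', 'Observational', 'Interventional'],
})

# shape is (rows, columns): one row per study, one column per attribute
n_studies, n_attributes = df_demo.shape
print(f'{n_studies} studies, {n_attributes} attributes')
print(list(df_demo.columns))
```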

Visualize the Status of Studies

ax = sns.countplot(y='overall_status', data=df_covid, orient='h')

Visualize the types of Studies

ax = sns.countplot(y='study_type', data=df_covid, orient='h')
ax.set_title('Study Types')

As shown above, the data set consists predominantly of interventional trials that are actively recruiting patients. Recruiting patients for a clinical trial is usually challenging, but because the pandemic is so widespread, patients are available in large numbers.
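The two count plots show status and study type separately. One way to check how the two attributes combine is a cross-tabulation. This is a sketch on invented toy data (`df_demo` is a hypothetical stand-in for `df_covid`), so the counts it prints say nothing about the real data set.

```python
import pandas as pd

# Toy stand-in for df_covid; the values are invented for illustration only.
df_demo = pd.DataFrame({
    'overall_status': ['Recruiting', 'Recruiting', 'Completed', 'Not yet recruiting'],
    'study_type': ['Interventional', 'Observational', 'Interventional', 'Interventional'],
})

# Rows are recruitment statuses, columns are study types, cells are study counts
status_by_type = pd.crosstab(df_demo['overall_status'], df_demo['study_type'])
print(status_by_type)
```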

Visualize the Interventions given

Intervention refers to the medicinal product (e.g., drug, device, vaccine, or placebo) given to the patients in a study. Let’s see the top five interventions given in these studies.

# get top five interventions
interventional_studies = df_covid[df_covid['study_type'] == 'Interventional']
top_interventions = interventional_studies['intervention'].value_counts().sort_values(ascending=True)[-5:]
top_interventions.plot(kind='barh', title='Interventions')

Which Countries are conducting Interventional Trials?

# Top 10 Countries
countries = interventional_studies[interventional_studies['location_countries'] != '']
country = countries['location_countries'].value_counts().sort_values(ascending=True)[-10:]
country.plot(kind='barh', title='Country')
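Because `find` returns only the first matching element, the loading loop records one intervention and one country per study. Assuming a study file can list several of each, `findall` would collect them all. The fragment below is hand-written with invented values and is only a sketch of that alternative.

```python
from collections import Counter
from xml.etree import ElementTree

# Hand-written stand-in for one study file; the values are invented.
SAMPLE_XML = """
<clinical_study>
  <intervention><intervention_name>Drug A</intervention_name></intervention>
  <intervention><intervention_name>Placebo</intervention_name></intervention>
  <location_countries>
    <country>France</country>
    <country>Italy</country>
  </location_countries>
</clinical_study>
"""

root = ElementTree.fromstring(SAMPLE_XML)

# find() stops at the first match, as in the loading loop
first_intervention = root.find('intervention').find('intervention_name').text

# findall() returns every match in document order
all_interventions = [n.text for n in root.findall('intervention/intervention_name')]
all_countries = [n.text for n in root.findall('location_countries/country')]

# Counting across studies would then extend one Counter per attribute
intervention_counts = Counter(all_interventions)
print(first_intervention, all_interventions, all_countries)
```

Whether to count every listed item or only the first is a design choice: counting all of them credits multi-country and multi-arm studies more than once.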

What Conditions are reported most frequently for Interventional Trials?

# Top 10 Conditions
condition = interventional_studies['condition'].value_counts().sort_values(ascending=True)[-10:]
condition.plot(kind='barh', title='Condition')

How large are Interventional Trials?

The size of a trial depends on the number of patients enrolled.

# Remove the trials with recruitment status withdrawn and terminated
enrollment = interventional_studies.loc[
    (interventional_studies['overall_status'] != 'Withdrawn')
    & (interventional_studies['overall_status'] != 'Terminated')
].copy()  # copy so that new columns are not written to a slice
# Convert to numeric; a missing enrollment (stored as '') becomes NaN
enrollment['enrollment'] = pd.to_numeric(enrollment['enrollment'], errors='coerce')
# The open-ended last edge keeps trials above 1000 patients in the '>600' group
bins = [-1, 20, 40, 60, 100, 200, 400, 600, np.inf]
group_names = ['<=20', '21-40', '41-60', '61-100', '101-200', '201-400', '401-600', '>600']
categories = pd.cut(enrollment['enrollment'], bins, labels=group_names)
# Add categories as column in dataframe
enrollment['Category'] = categories
# View value counts
enrollment_counts = enrollment['Category'].value_counts().sort_index(ascending=True)
enrollment_counts.plot(kind='bar', title='Size of Interventional Trials', alpha=0.6, colormap='Accent', rot=20)
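The binning deserves a closer look, because `pd.cut` uses right-inclusive intervals by default and turns values beyond the outermost edges into NaN. The sketch below uses invented enrollment numbers to compare an open-ended top edge with a finite top edge of 1000.

```python
import numpy as np
import pandas as pd

sizes = pd.Series([0, 20, 21, 100, 650, 5000])  # invented enrollment numbers
labels = ['<=20', '21-40', '41-60', '61-100', '101-200', '201-400', '401-600', '>600']

# Open-ended top edge: every trial above 600 lands in the last group
open_bins = [-1, 20, 40, 60, 100, 200, 400, 600, np.inf]
binned = pd.cut(sizes, open_bins, labels=labels)
print(binned.tolist())

# Finite top edge: a trial with more than 1000 patients falls outside all bins
capped_bins = [-1, 20, 40, 60, 100, 200, 400, 600, 1000]
capped = pd.cut(sizes, capped_bins, labels=labels)
print(int(capped.isna().sum()), 'value(s) left without a category')
```

Each interval includes its right edge, so a trial with exactly 20 patients belongs to the first group and one with 21 to the second.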

Visualizing keywords

The data for each clinical trial also contains a set of key words for that trial. It will be insightful to see which keywords have prominence in the overall data set. A quick and easy way to check it is to draw a word cloud.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

words = ''
stopwords = set(STOPWORDS)
# Lower-case every keyword and join them all into one string
for word in list_keywords:
    value = str(word)
    tokens = value.split()
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    words += " ".join(tokens) + " "

wordcloud = WordCloud(width=900, height=900,
                      background_color='black',
                      stopwords=stopwords,
                      min_font_size=10).generate(words)
# plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()


As the word cloud shows, terms related to respiration are prominent among the keywords of COVID-19 trials.
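A word cloud gives a visual impression only. The same tokens can also be counted to get exact frequencies. The following sketch uses `collections.Counter` on an invented stand-in for `list_keywords` (the name `demo_keywords` is hypothetical), so its counts are illustrative and not results from the real data.

```python
from collections import Counter

# Invented stand-in for list_keywords
demo_keywords = [
    'COVID-19',
    'Acute Respiratory Distress Syndrome',
    'covid-19',
    'Respiratory Failure',
    'SARS-CoV-2',
]

# Lower-case and split each keyword, then count every token
token_counts = Counter(
    token
    for keyword in demo_keywords
    for token in keyword.lower().split()
)

for token, count in token_counts.most_common(3):
    print(token, count)
```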

The complete code is available on GitHub.