Exploratory Data Analysis on Mobile App Behavior Data

Deep Dive into EDA with Large Dirty Raw Data Using Visualization and Correlation Analysis to Improve your Hands-on Skills

Jun 22 · 5 min read


In the previous article, we introduced how to perform EDA on a small app behavior dataset. Hopefully, you learned a lot there. This post aims to improve your EDA skills with a more complicated dataset and introduces new tricks. It is split into 6 parts.

1. Data review

2. Data cleaning

3. Numerical variable distribution

4. Binary variable distribution

5. Correlation analysis

6. Summary

Now, let’s begin the journey.

1. Data review

A quick look at the data shown in the video below reveals 31 columns and 27,000 rows. With so many features, it's better to create a view of each column's explanation, as shown in Figure 1, to enhance our understanding.

Video 1 A brief view of raw data

Fig.1 Variable explanation in table view

2. Data cleaning

Raw data usually contains missing values, so we need to know whether there are any NaNs in each column. Specifically,
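The check can be done with pandas' `isna()`. The exact snippet the article relies on isn't shown here, so this is a minimal sketch with a toy frame standing in for the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (assumed already loaded as `dataset`)
dataset = pd.DataFrame({
    "age": [25, np.nan, 40],
    "credit_score": [np.nan, 600, np.nan],
    "rewards_earned": [10, np.nan, np.nan],
    "churn": [0, 1, 0],
})

# Columns that contain at least one NaN
nan_cols = dataset.columns[dataset.isna().any()].tolist()
print(nan_cols)  # ['age', 'credit_score', 'rewards_earned']
```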


This returns the columns [‘age’, ‘credit_score’, ‘rewards_earned’]. But how many records are NaN in each of these columns? Specifically,
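Counting NaNs per column is one line with `isna().sum()`. A sketch on a toy frame (the real dataset has ~27,000 rows):

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the real dataset is much larger
dataset = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "credit_score": [np.nan, 600, np.nan, 550],
    "rewards_earned": [np.nan, np.nan, 12, 8],
    "churn": [0, 1, 0, 1],
})

# Number of NaN records per column, largest first
nan_counts = dataset.isna().sum().sort_values(ascending=False)
print(nan_counts)
```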


As shown in Figure 2, there are over 8,000 NaN records in the ‘credit_score’ column, and 3,227 in the ‘rewards_earned’ column. We will drop these two columns, and remove the 4 records where age is NaN.

Fig.2 NaN records in the column


dataset = dataset[pd.notnull(dataset.age)]
dataset = dataset.drop(columns=['credit_score', 'rewards_earned'])

3. Numerical variable distribution

To better understand the data distribution, visualization is the best method. Let’s try histograms. Specifically,

import numpy as np
import matplotlib.pyplot as plt

dataset2 = dataset.drop(columns=['user', 'churn'])
plt.figure(figsize=(15, 12))
for i in range(1, dataset2.shape[1] + 1):
    plt.subplot(6, 5, i)
    f = plt.gca()
    f.set_title(dataset2.columns.values[i - 1])
    # One bin per unique value keeps discrete variables readable
    vals = np.size(dataset2.iloc[:, i - 1].unique())
    plt.hist(dataset2.iloc[:, i - 1], bins=vals, color='#3F5D7D')

As shown in Figure 3, many variables are positively skewed. Some binary variables are evenly distributed while others are highly concentrated on one side. For highly concentrated variables, it is very important to review whether the dependent variable is imbalanced within them. For instance, in the ‘waiting_4_loan’ column, less than 10% of the values are 1. If most people who are waiting for loans unsubscribed from the product, the model will probably over-fit on this feature.

Fig.3 Histogram of numerical variables

4. Binary variable distribution

As mentioned above, let’s focus on the distribution of binary variables using a pie chart. Specifically,

dataset2 = dataset[['housing', 'is_referred', 'app_downloaded', 'web_user',
                    'app_web_user', 'ios_user', 'android_user',
                    'registered_phones', 'payment_type', 'waiting_4_loan',
                    'cancelled_loan', 'received_loan', 'rejected_loan',
                    'left_for_two_month_plus', 'left_for_one_month']]

plt.figure(figsize=(15, 12))
for i in range(1, dataset2.shape[1] + 1):
    plt.subplot(4, 4, i)
    f = plt.gca()
    f.set_title(dataset2.columns.values[i - 1])
    values = dataset2.iloc[:, i - 1].value_counts(normalize=True).values
    index = dataset2.iloc[:, i - 1].value_counts(normalize=True).index
    plt.pie(values, labels=index, autopct='%1.1f%%')

As shown in Figure 4, there are 5 columns that need further exploration because their distributions are highly concentrated: ‘waiting_4_loan’, ‘cancelled_loan’, ‘received_loan’, ‘rejected_loan’, ‘left_for_one_month’.

Fig.4 Pie chart of binary variables

For these 5 columns, let’s review dependent variable distribution in the minority category. Specifically,
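The article's snippet for Figure 5 isn't shown, so here is one way to sketch the idea with toy data; the real loop would run over all 5 concentrated columns:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for the sketch
import matplotlib.pyplot as plt

# Toy data: a highly concentrated binary column plus the churn label
dataset = pd.DataFrame({
    "waiting_4_loan": [0] * 18 + [1] * 2,
    "churn": [0, 1] * 10,
})

# Churn distribution within the minority category (value == 1)
minority = dataset[dataset["waiting_4_loan"] == 1]
dist = minority["churn"].value_counts(normalize=True)
dist.plot.bar(title="churn where waiting_4_loan == 1")
```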

Fig.5 Dependent variable distribution in concentrated variables

Figure 5 tells us the dependent variable is not strongly imbalanced in the minority category. Great. Nothing to worry about.

In summary, the whole purpose of visualization is to understand how even the distribution of each variable is, and how even the dependent variable distribution is in each binary variable. So we can identify variables that require over-sampling or down-sampling.
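Checking how balanced the dependent variable itself is takes one line; a quick sketch with made-up churn labels:

```python
import pandas as pd

# Hypothetical churn labels; the real ones come from the dataset's 'churn' column
churn = pd.Series([1, 0, 0, 1, 0, 0, 0, 1])

# Share of each class; a strong skew here would call for re-sampling
balance = churn.value_counts(normalize=True)
print(balance)  # 0 -> 0.625, 1 -> 0.375
```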

5. Correlation analysis

5.1 between independent and dependent variables

This step is to understand which features may have a strong impact on the dependent variable. Here we only analyze the numerical variables.


dataset.drop(columns=['user', 'churn', 'housing', 'payment_type',
                      'registered_phones', 'zodiac_sign']) \
       .corrwith(dataset.churn) \
       .plot.bar(figsize=(20, 10),
                 title='Correlation with Response variable',
                 fontsize=15, rot=45, grid=True)

Figure 6 shows some interesting findings. For instance, for the variable ‘cc_taken’, the more credit cards customers take, the more likely they are to churn. This may indicate that customers are not happy with the credit card product.

Fig.6 Correlation between independent and dependent variables

5.2 between independent variables

Ideally, we only use ‘independent’ variables as input. The correlation matrix tells us how strongly the variables are correlated with each other. Specifically,

corr = dataset.drop(columns=['user', 'churn']).corr()
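`corr` is just a table of numbers; to get the matrix view of Figure 7 it can be rendered as a heatmap. The article's plotting code isn't shown, so this is a sketch using plain matplotlib on toy columns (note how two exact-opposite flags show up as a -1 cell):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for the sketch
import matplotlib.pyplot as plt

# Toy columns: ios_user and android_user are exact opposites
rng = np.random.default_rng(0)
dataset = pd.DataFrame({"ios_user": rng.integers(0, 2, 100)})
dataset["android_user"] = 1 - dataset["ios_user"]
dataset["deposits"] = rng.integers(0, 10, 100)

corr = dataset.corr()
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
```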

As shown in Figure 7, there is a strong negative correlation between ‘android_user’ and ‘ios_user’. Another column is ‘app_web_user’, which represents users who use both the app and the web: it can only be 1 when both ‘app_downloaded’ and ‘web_user’ are 1. So ‘app_web_user’ is not an independent variable and needs to be removed. Specifically,

dataset = dataset.drop(columns=['app_web_user'])
dataset.to_csv('new_churn_data.csv', index=False)
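The redundancy of ‘app_web_user’ can also be verified directly: it should equal the logical AND of ‘app_downloaded’ and ‘web_user’. A sketch with toy flags (column names follow the article's dataset):

```python
import pandas as pd

# Toy flags; in the real dataset these come from the churn data
dataset = pd.DataFrame({
    "app_downloaded": [1, 1, 0, 1, 0],
    "web_user":       [1, 0, 1, 1, 0],
})
dataset["app_web_user"] = dataset["app_downloaded"] & dataset["web_user"]

# True only if app_web_user carries no information beyond the other two flags
redundant = dataset["app_web_user"].equals(
    dataset["app_downloaded"] & dataset["web_user"])
print(redundant)  # True
```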

Fig.7 Correlation matrix between independent variables

6. Summary

In many cases, you will probably be tasked with dirtier data than we processed here. So you need to clean the data first, review the data distribution, and understand whether any imbalance occurs. In addition, use correlation analysis to remove dependent features. Fortunately, the thought process will be more or less the same, regardless of the amount of data you have.

Great! Huge congratulations for making it to the end. Hopefully, this gives you a better sense of how to perform EDA on raw data. If you need the source code, feel free to visit my GitHub page. The next article will walk through data processing, model building, and optimization.