In the previous article, we introduced how to perform EDA on a small app behavior dataset. Hopefully, you learned a lot there. This post aims to improve your EDA skills with a more complicated dataset and introduce new tricks. It is split into 5 parts.
1. Data review
2. Data cleaning
3. Numerical variable distribution
4. Binary variable distribution
5. Correlation analysis
Now, let’s begin the journey.
A quick look at the data in the video below shows 31 columns and 27,000 rows. With so many features, it helps to build a table of each column’s explanation, as shown in Figure 1, to improve our understanding.
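A first look of this kind can be sketched in a few lines of pandas. The frame below is a toy stand-in with invented values; only the column names are borrowed from the dataset discussed in this post, and how the real file is loaded is not shown here.

```python
import pandas as pd

# Toy stand-in for the churn dataset (values are invented for illustration).
dataset = pd.DataFrame({
    "user": [101, 102, 103, 104],
    "churn": [0, 1, 0, 1],
    "age": [34.0, 27.0, None, 45.0],
    "housing": ["R", "O", "na", "R"],
})

n_rows, n_cols = dataset.shape      # (number of rows, number of columns)
print(f"{n_rows} rows x {n_cols} columns")
print(dataset.dtypes)               # data type of each column
print(dataset.head())               # first few records
print(dataset.describe())           # summary statistics for numeric columns
```

On the real data, `dataset.shape` is where the 31 columns and 27,000 rows would show up.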
Raw data usually contains missing values, so we need to know which columns, if any, contain NaN. Specifically,
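A minimal sketch of this check, run on a toy frame with invented values (the real dataset is not reproduced here), could look like this:

```python
import numpy as np
import pandas as pd

# Toy frame: 'age' and 'credit_score' contain NaN, the other columns do not.
dataset = pd.DataFrame({
    "user": [1, 2, 3, 4],
    "age": [34.0, np.nan, 29.0, 51.0],
    "credit_score": [np.nan, 610.0, np.nan, 580.0],
    "deposits": [0, 2, 5, 1],
})

# isna().any() is True for every column that holds at least one NaN.
nan_columns = dataset.columns[dataset.isna().any()].tolist()
print(nan_columns)
```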
This returns the columns [‘age’, ‘credit_score’, ‘rewards_earned’]. But how many records are NaN in each of these columns? Let’s count them.
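One way to get these counts, sketched on a toy frame with invented values, is to sum the NaN mask per column. The share of missing values (a hypothetical extra, not part of the original walkthrough) helps decide between dropping a column and dropping a few rows.

```python
import numpy as np
import pandas as pd

# Toy frame with a different number of NaN values in each column.
dataset = pd.DataFrame({
    "age": [34.0, np.nan, 29.0, 51.0, 40.0],
    "credit_score": [np.nan, 610.0, np.nan, np.nan, 580.0],
    "rewards_earned": [12.0, np.nan, np.nan, 8.0, 3.0],
    "deposits": [0, 2, 5, 1, 0],
})

nan_counts = dataset.isna().sum()                   # NaN count per column
nan_counts = nan_counts[nan_counts > 0]             # keep only affected columns
nan_share = (nan_counts / len(dataset)).round(3)    # fraction of rows missing
print(pd.DataFrame({"n_missing": nan_counts, "share": nan_share}))
```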
As shown in Figure 2, there are over 8,000 NaN values in the ‘credit_score’ column and 3,227 in the ‘rewards_earned’ column. We will drop these two columns and remove the 4 records where ‘age’ is NaN.
dataset = dataset[pd.notnull(dataset.age)]
dataset = dataset.drop(columns=['credit_score', 'rewards_earned'])
To better understand the data distribution, visualization is the best method. Let’s try a histogram. Specifically,
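import numpy as np
import matplotlib.pyplot as plt

dataset2 = dataset.drop(columns=['user', 'churn'])
for i in range(1, dataset2.shape[1] + 1):  # shape[1] is the number of columns
    plt.subplot(6, 5, i)
    f = plt.gca()
    # Use one bin per distinct value in the column
    vals = np.size(dataset2.iloc[:, i - 1].unique())
    plt.hist(dataset2.iloc[:, i - 1], bins=vals, color='#3F5D7D')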
As shown in Figure 3, many variables are positively skewed. Some binary variables are evenly distributed, while others are highly concentrated on one side. For highly concentrated variables, it is important to check whether the dependent variable is imbalanced within the minority category. For instance, in column ‘waiting_4_loan’, less than 10% of the values are 1. If most people who are waiting for loans unsubscribed from the product, the model will probably overfit on this feature.
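Skewness can also be read off numerically rather than by eye. The sketch below uses `DataFrame.skew()` on a toy frame with invented values; it is an optional complement to the histograms, not a step from the original analysis.

```python
import pandas as pd

# Toy numeric columns (values are invented for illustration).
dataset2 = pd.DataFrame({
    "deposits": [0, 0, 1, 1, 2, 30],   # long right tail
    "age": [22, 27, 31, 35, 40, 45],   # roughly symmetric
})

# Positive values indicate a right (positive) skew; values near 0 indicate symmetry.
skewness = dataset2.skew(numeric_only=True).sort_values(ascending=False)
print(skewness)
```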
As mentioned above, let’s focus on the distribution of binary variables using pie charts. Specifically,
dataset2 = dataset[['housing', 'is_referred', 'app_downloaded', 'web_user', 'app_web_user',
                    'ios_user', 'android_user', 'registered_phones', 'payment_type',
                    'waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan',
                    'left_for_two_month_plus', 'left_for_one_month']]
for i in range(1, dataset2.shape[1] + 1):
    # One panel per column; the 4 x 4 grid is an assumed layout that fits the 15 columns
    plt.subplot(4, 4, i)
    f = plt.gca()
    # Share of each category in the column
    counts = dataset2.iloc[:, i - 1].value_counts(normalize=True)
    plt.pie(counts.values, labels=counts.index, autopct='%1.1f%%')
As shown in Figure 4, 5 columns need further exploration because their distributions are highly concentrated: ‘waiting_4_loan’, ‘cancelled_loan’, ‘received_loan’, ‘rejected_loan’, ‘left_for_one_month’.
For these 5 columns, let’s review the distribution of the dependent variable in the minority category. Specifically,
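A sketch of this review, on a toy frame with invented values, is shown below. It assumes the minority category is the value 1, as stated earlier for ‘waiting_4_loan’; on the real data, `flag_columns` would list all 5 columns.

```python
import pandas as pd

# Toy frame (values are invented); 1 is assumed to be the minority category.
dataset = pd.DataFrame({
    "waiting_4_loan": [1, 1, 0, 0, 0, 0, 0, 0],
    "cancelled_loan": [0, 1, 1, 1, 0, 0, 0, 0],
    "churn":          [1, 0, 0, 1, 0, 1, 0, 0],
})
flag_columns = ["waiting_4_loan", "cancelled_loan"]

churn_in_minority = {}
for col in flag_columns:
    minority = dataset[dataset[col] == 1]   # rows in the minority category
    # Share of churn = 0 and churn = 1 among those rows
    churn_in_minority[col] = (
        minority["churn"].value_counts(normalize=True).sort_index()
    )
    print(col, churn_in_minority[col].to_dict())
```

If one churn class dominated almost completely within a minority category, that column would deserve a closer look before modeling.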
Figure 5 tells us the dependent variable is not strongly imbalanced in the minority category. Great. Nothing to worry about.
5.1 between independent and dependent variables
This step helps us understand which features may have a strong impact on the dependent variable. Here we analyze only numerical variables.
(dataset.drop(columns=['user', 'churn', 'housing', 'payment_type', 'registered_phones', 'zodiac_sign'])
        .corrwith(dataset.churn)
        .plot.bar(figsize=(20, 10), title='Correlation with Response variable',
                  fontsize=15, rot=45, grid=True))
Figure 6 shows some interesting findings. For instance, for the variable ‘cc_taken’, the more credits customers take, the more likely they are to churn. This may indicate that the customers are not happy with credit cards.
5.2 between independent variables
Ideally, we only use ‘independent’ variables as input. The correlation matrix tells us whether variables are independent of each other. Specifically,
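# numeric_only=True skips text columns, which recent pandas versions no longer drop silently
corr = dataset.drop(columns=['user', 'churn']).corr(numeric_only=True)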
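Besides plotting the matrix, the strongly correlated pairs can be listed directly. The sketch below works on a toy frame with invented values, and the 0.8 cut-off is an arbitrary assumption rather than a threshold from this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame: 'android_user' and 'ios_user' are exact opposites.
dataset = pd.DataFrame({
    "android_user": [1, 0, 1, 0, 1, 0],
    "ios_user":     [0, 1, 0, 1, 0, 1],
    "deposits":     [3, 0, 1, 7, 2, 2],
})
corr = dataset.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()        # (row, column) -> correlation
strong = pairs[pairs.abs() > 0.8]     # 0.8 is an arbitrary cut-off
print(strong)
```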
As shown in Figure 7, there is a strong negative correlation between ‘android_user’ and ‘ios_user’. Another column to note is ‘app_web_user’, which represents users who use both the app and the web. It can only be 1 when ‘app_downloaded’ is 1 and ‘web_user’ is 1. So ‘app_web_user’ is not an independent variable and needs to be removed. Specifically,
dataset = dataset.drop(columns=['app_web_user'])
dataset.to_csv('new_churn_data.csv', index=False)
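The redundancy of ‘app_web_user’ can be verified before the column is dropped. The sketch below assumes the column is exactly the logical AND of ‘app_downloaded’ and ‘web_user’, which is worth confirming on the real data; the toy values are invented.

```python
import pandas as pd

# Toy frame (values are invented for illustration).
dataset = pd.DataFrame({
    "app_downloaded": [1, 1, 0, 0, 1],
    "web_user":       [1, 0, 1, 0, 1],
    "app_web_user":   [1, 0, 0, 0, 1],
})

# Rebuild the flag from its two parent columns and compare row by row.
derived = dataset["app_downloaded"] & dataset["web_user"]
is_redundant = bool((derived == dataset["app_web_user"]).all())
print(is_redundant)
```

If the check holds, the column carries no information beyond its two parents, which supports removing it.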
In many cases, you will probably be handed dirtier data than we processed here. So you need to clean the data first, review the data distribution, and check whether any imbalance occurs. In addition, use correlation analysis to remove any dependence among your features. Fortunately, the thought process will be more or less the same, regardless of the amount of data you have.
Great! Huge congratulations on making it to the end. Hopefully, this gives a better sense of how to perform EDA on raw data. If you need the source code, feel free to visit my GitHub page. The next article will walk through data processing, model building, and optimization.