As a result of the warming climate, wildfires in the Amazon rainforest have been of increasing concern. Here we will explore and analyze the Fires in Brazil data set provided by the Brazilian government. The data is available here.

Exploring data is commonly the first step in building predictive models in data science. Exploratory data analysis involves summarizing the characteristics of the data set. This includes statistical summaries about features in the data such as mean, standard deviation, distribution, and number of records.

The first step is to import the pandas package in Python and read the data into a pandas data frame. You can think of a pandas data frame as a data table or an Excel spreadsheet.

import pandas as pd

df = pd.read_csv("amazon.csv", encoding="ISO-8859-1")
print(df.head())
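For the statistical summaries mentioned earlier (mean, standard deviation, record counts), pandas provides the `describe()` method. A minimal sketch, using a few made-up rows in the same shape as the real data set rather than the actual values:

```python
import pandas as pd

# Illustrative rows only -- these values are invented, not from amazon.csv
df = pd.DataFrame({
    'year': [1998, 1999, 2000, 2001],
    'state': ['Acre', 'Acre', 'Roraima', 'Roraima'],
    'month': ['Janeiro'] * 4,
    'number': [0.0, 12.0, 3.0, 45.0],
})

# describe() reports count, mean, std, min, quartiles, and max
# for each numeric column in one call
print(df.describe())
```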

The output of the first five rows is:

As you can see, the data includes the year, the Brazilian state, the month, the number of wildfires, and the date. We can start our analysis by calculating the mean number of fires for each state, using the "groupby" method in pandas:

state_mean = df.groupby('state')['number'].mean().reset_index()
print(state_mean)

We can also visualize this using the 'seaborn' package in Python:

import seaborn as sns

bar = sns.barplot(x='state', y='number', data=state_mean, color="red")
bar.set_xticklabels(state_mean['state'], rotation=45)

We can also look at the standard deviation of the number of fires for each state:

state_std = df.groupby('state')['number'].std().reset_index()
bar = sns.barplot(x='state', y='number', data=state_std, color="black")
bar.set_xticklabels(state_std['state'], rotation=45)

We can also look at the distribution of the number of fires across all states:
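A histogram of the 'number' column makes the shape of the distribution visible. A sketch, using made-up counts standing in for the real column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented counts standing in for the 'number' column of amazon.csv:
# mostly small values with a few very large ones
numbers = pd.Series([0, 1, 2, 2, 3, 5, 8, 120, 450])

# Plot a histogram to inspect the shape of the distribution
ax = numbers.plot(kind='hist', bins=20, color='red')
ax.set_xlabel('Number of wildfires')
ax.set_ylabel('Frequency')
plt.show()
```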

We can see that the distribution of the number of forest fires is long-tailed. This turns out to be the case for each Brazilian state individually as well.

If we want to look at the set of individual states we can do as follows:

print(set(df['state'].values))

which when executed gives:

This is especially useful when you have a large data set and would like to analyze specific segments or groups of data. In this case we only have 22 states; for now, let's call them categories. There are situations where you may have millions of rows but a categorical column with only, say, 5 distinct values. It would be impractical to scan through the Excel file by hand to count the categories. Further, you can look at how much data is available for each category:

from collections import Counter

print(Counter(df['state'].values))

Small caveat: for data sets much larger than 1 GB, pandas becomes very slow, and it is best to use other tools for analysis, such as Spark clusters, Hadoop, and Hive. Since these tools take a significant amount of time to learn and set up, alternatives include breaking the data into batches and processing each batch with pandas, or working with numpy arrays.
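One way to do the batching suggested above is the `chunksize` argument of `read_csv`, which yields the file piece by piece instead of loading it all at once. A minimal sketch (the file name, chunk size, and values are placeholders, not the real data set):

```python
import pandas as pd

# Write a tiny throwaway CSV so the sketch is self-contained
pd.DataFrame({'state': ['Acre', 'Acre', 'Roraima'],
              'number': [1.0, 2.0, 3.0]}).to_csv('sample.csv', index=False)

# Accumulate per-state totals one chunk at a time,
# so the whole file never needs to fit in memory
totals = {}
for chunk in pd.read_csv('sample.csv', chunksize=2):
    for state, total in chunk.groupby('state')['number'].sum().items():
        totals[state] = totals.get(state, 0.0) + total

print(totals)
```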

A good approach to building accurate models for prediction is to find ways to break the total data set into clusters with similar attributes. If we wish to look at data for an individual state, let’s say ‘Roraima’, we can filter the data frame:

df = df[df['state'] == 'Roraima']
df = df.reset_index(drop=True)
print(df.head())

We can also look for any regular patterns in number of wildfires for any given month and state by plotting the number of wildfires in Roraima in the month of January vs the year number:

import matplotlib.pyplot as plt

df = df[df['month'] == 'Janeiro']
plt.plot(df['year'], df['number'])
plt.xlabel('Year')
plt.ylabel('Number of Wildfires')
plt.title('Wildfires in Roraima in the month of January')
plt.show()

You can also perform this analysis for different states and months:
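Repeating the per-state, per-month plot can be sketched as a loop over filtered sub-frames. The rows and the state/month pairs below are invented examples, not values from the real data set:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows in the shape of amazon.csv
df = pd.DataFrame({
    'year': [1998, 1999, 1998, 1999],
    'state': ['Roraima', 'Roraima', 'Acre', 'Acre'],
    'month': ['Janeiro'] * 4,
    'number': [10.0, 20.0, 5.0, 7.0],
})

# One line per state, all for the same month
for state in ['Roraima', 'Acre']:
    subset = df[(df['state'] == state) & (df['month'] == 'Janeiro')]
    plt.plot(subset['year'], subset['number'], label=state)

plt.xlabel('Year')
plt.ylabel('Number of Wildfires')
plt.title('January wildfires by state')
plt.legend()
plt.show()
```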

This serves to inform your model building process. One interesting path forward would be to build a model per state and month. This isn't the only path forward, but the point of exploratory data analysis is to understand your data and subsequently be inspired for feature engineering and model building.

In the next post we will build a predictive model using the random forest algorithm in the Python machine learning package 'sklearn'.

Thank you for reading!
