How to use SQL to perform data analysis (House Property Sales Time Series)

Welcome to my new article. Data analysis is always a necessary skill if you want to stand out among other candidates during your job hunting. Maybe you know how to use Python or R or even SAS/Matlab to perform. But knowing SQL is often a necessary yet important skill that you should have. Many companies still use SQL a lot for data extraction and analysis. Simple but able to handing a sufficiently large amount of data help SQL still important after many years (I have coded SQL for over seven years).

Therefore, I am here to demonstrate a data analysis using SQL. The platform I used for SQL is MySQL Workbench. You can download for free from the below link:

And the dataset for this analysis is “House Property Sales Time Series” from Kaggle.

About how to use MySQL Workbench is out of this article’s scope. You can have more information from the documentation. One thing to mention is that I code SQL with MySQL. There may be some differences if you code with Oracle or others. No need to panic if you encounter errors. Just change back to what it should be in your language.

The schema is called “dataset ”and the data is stored as a table called “raw_sales”

The first step is always printing some records from the dataset.

SELECT * FROM dataset.raw_sales
LIMIT 10

So there are five columns in the dataset, datesold, postcode, price, propertyType, and bedrooms.

select min(datesold) as min_datesold, max(datesold) as max_datesold 
from dataset.raw_sales;select count(1) from dataset.raw_sales;

The dataset covers from 7th Feb 2007 to 27 Jul 2019 with 29580 records.

So which date has the most frequent sales?

select datesold, count(1) as sales_count, sum(price) as price_sum
from dataset.raw_sales
group by datesold
order by sales_count desc 
limit 10;

So the maximum number of sales for one day is 50 with a total price of 42M.

Which postcode has the highest average price per sales?

select postcode,avg(price) as price_avg , count(1) as sales_count
from dataset.raw_sales
group by postcode
order by price_avg desc
limit 10;

There is a total of 9 sales in postcode 2618 with an average price of 1.08M.

And which year has the lowest number of sales?

select year(datesold) as year , count(1) as sales_count, sum(price) as price_sum 
from dataset.raw_sales
group by year 
order by sales_count
limit 10;

Is it possible to know which top five postcodes by price in each year? Of course you can, but you need to know what window function is.

First I didn’t know what a window function was. But then I was asked again and again and again during interviews. So I looked it up and realized I should have known this earlier (at least helped me answer questions in an interview).

Basically, window function can perform an aggregation based on the partition and return the result back to a row. If you want to know more about the syntax and usage of a window function, I recommend you read the doc from Oracle as below:

So I first aggregate price by year and postcode

select year(datesold) as year ,postcode, sum(price) as price_sum 
from dataset.raw_sales
group by year, postcode

Then I use the window function to get the ranking of the total price by each year.

select year,postcode, price_sum, row_number() over (partition by year order by price_sum desc) as ranking
from (
select year(datesold) as year ,postcode, sum(price) as price_sum 
from dataset.raw_sales
group by year, postcode
)  a

Finally, select all records with ranking smaller than or equal to 5

select * from
(
select year,postcode, price_sum, row_number() over (partition by year order by price_sum desc) as ranking
from (
select year(datesold) as year ,postcode, sum(price) as price_sum 
from dataset.raw_sales
group by year, postcode
)  a
) b
where ranking <=5 
order by year,ranking

Next, I will move on to propertyType. There are two types, house, and unit.

So how many sales of houses and units are there for each year?

The easy way is to do a group by and then count how many records. But here I present another method by using case when. Case when function is another useful function to do aggregation with criteria. Here when I calculate the numbers of records, instead of using count function, I use sum function with a case when function inside.

select year(datesold) as year, 
sum(case when propertyType = "house" then 1 else 0 end) as house_sales_count,
sum(case when propertyType = "unit" then 1 else 0 end) as unit_sales_count 
from dataset.raw_sales
group by year;

So for house_sales_count, if the propertyType equals house then return 1. Otherwise, it returns 0. And then sum all the records. This will show how many sales with propertyType equal to house.

Another advantage is that the result looks like the Excel pivot table. You can do a comparison directly.

As you can see clearly, there are way more house sales than unit sales in each year. Then the follow-up question must be about the average price difference between house and unit. Sure we can use the case when function also. But one thing to be cautious is that the else part.

select year(datesold) as year,
avg(case when propertyType = "house" then price else null end) as house_price_avg,
avg(case when propertyType = "unit" then price else null end) as unit_price_avg
from dataset.raw_sales
group by year;

Instead of returning 0, it is necessary to return a null value so that that particular record will not be included when calculating the average.

Again, the average price for the house is more expansive than the unit.

How about further considering the number of bedrooms?

select year(datesold) as year,
avg(case when propertyType = "house" then price/bedrooms else null end) as house_price_bedroom_avg,
avg(case when propertyType = "unit" then price/bedrooms else null end) as unit_price_bedroom_avg
from dataset.raw_sales
group by year;

The picture is different this time. The average price for the unit is higher than the house in terms of the number of bedrooms.

我来评几句
登录后评论

已发表评论数()

相关站点

热门文章