Welcome to my new article. Data analysis is always a necessary skill if you want to stand out among other candidates during your job hunting. Maybe you know how to use Python or R or even SAS/Matlab to perform. But knowing SQL is often a necessary yet important skill that you should have. Many companies still use SQL a lot for data extraction and analysis. Simple but able to handing a sufficiently large amount of data help SQL still important after many years (I have coded SQL for over seven years).
Therefore, I am here to demonstrate a data analysis using SQL. The platform I used for SQL is MySQL Workbench. You can download for free from the below link:
MySQL :: Download MySQL Workbench
MySQL Workbench provides DBAs and developers an integrated tools environment for: Database Design & Modeling SQL…
And the dataset for this analysis is “House Property Sales Time Series” from Kaggle.
House Property Sales Time Series
Download Open Datasets on 1000s of Projects + Share Projects on One Platform. Explore Popular Topics Like Government…
About how to use MySQL Workbench is out of this article’s scope. You can have more information from the documentation. One thing to mention is that I code SQL with MySQL. There may be some differences if you code with Oracle or others. No need to panic if you encounter errors. Just change back to what it should be in your language.
This is the MySQL™ Workbench Reference Manual. It documents the MySQL Workbench Community and MySQL Workbench…
The schema is called “dataset ”and the data is stored as a table called “raw_sales”
The first step is always printing some records from the dataset.
SELECT * FROM dataset.raw_sales LIMIT 10
So there are five columns in the dataset, datesold, postcode, price, propertyType, and bedrooms.
select min(datesold) as min_datesold, max(datesold) as max_datesold from dataset.raw_sales;select count(1) from dataset.raw_sales;
The dataset covers from 7th Feb 2007 to 27 Jul 2019 with 29580 records.
So which date has the most frequent sales?
select datesold, count(1) as sales_count, sum(price) as price_sum from dataset.raw_sales group by datesold order by sales_count desc limit 10;
So the maximum number of sales for one day is 50 with a total price of 42M.
Which postcode has the highest average price per sales?
select postcode,avg(price) as price_avg , count(1) as sales_count from dataset.raw_sales group by postcode order by price_avg desc limit 10;
There is a total of 9 sales in postcode 2618 with an average price of 1.08M.
And which year has the lowest number of sales?
select year(datesold) as year , count(1) as sales_count, sum(price) as price_sum from dataset.raw_sales group by year order by sales_count limit 10;
Is it possible to know which top five postcodes by price in each year? Of course you can, but you need to know what window function is.
First I didn’t know what a window function was. But then I was asked again and again and again during interviews. So I looked it up and realized I should have known this earlier (at least helped me answer questions in an interview).
Basically, window function can perform an aggregation based on the partition and return the result back to a row. If you want to know more about the syntax and usage of a window function, I recommend you read the doc from Oracle as below:
MySQL :: MySQL 8.0 Reference Manual :: 12.21.1 Window Function Descriptions
In the following function descriptions, over_clause represents the OVER clause, described in Section 12.21.2, "Window…
So I first aggregate price by year and postcode
select year(datesold) as year ,postcode, sum(price) as price_sum from dataset.raw_sales group by year, postcode
Then I use the window function to get the ranking of the total price by each year.
select year,postcode, price_sum, row_number() over (partition by year order by price_sum desc) as ranking from ( select year(datesold) as year ,postcode, sum(price) as price_sum from dataset.raw_sales group by year, postcode ) a
Finally, select all records with ranking smaller than or equal to 5
select * from ( select year,postcode, price_sum, row_number() over (partition by year order by price_sum desc) as ranking from ( select year(datesold) as year ,postcode, sum(price) as price_sum from dataset.raw_sales group by year, postcode ) a ) b where ranking <=5 order by year,ranking
Next, I will move on to propertyType. There are two types, house, and unit.
So how many sales of houses and units are there for each year?
The easy way is to do a group by and then count how many records. But here I present another method by using case when. Case when function is another useful function to do aggregation with criteria. Here when I calculate the numbers of records, instead of using count function, I use sum function with a case when function inside.
select year(datesold) as year, sum(case when propertyType = "house" then 1 else 0 end) as house_sales_count, sum(case when propertyType = "unit" then 1 else 0 end) as unit_sales_count from dataset.raw_sales group by year;
So for house_sales_count, if the propertyType equals house then return 1. Otherwise, it returns 0. And then sum all the records. This will show how many sales with propertyType equal to house.
Another advantage is that the result looks like the Excel pivot table. You can do a comparison directly.
As you can see clearly, there are way more house sales than unit sales in each year. Then the follow-up question must be about the average price difference between house and unit. Sure we can use the case when function also. But one thing to be cautious is that the else part.
select year(datesold) as year, avg(case when propertyType = "house" then price else null end) as house_price_avg, avg(case when propertyType = "unit" then price else null end) as unit_price_avg from dataset.raw_sales group by year;
Instead of returning 0, it is necessary to return a null value so that that particular record will not be included when calculating the average.
Again, the average price for the house is more expansive than the unit.
How about further considering the number of bedrooms?
select year(datesold) as year, avg(case when propertyType = "house" then price/bedrooms else null end) as house_price_bedroom_avg, avg(case when propertyType = "unit" then price/bedrooms else null end) as unit_price_bedroom_avg from dataset.raw_sales group by year;
The picture is different this time. The average price for the unit is higher than the house in terms of the number of bedrooms.