Sorting data frames in pandas

How to sort data frames quickly and efficiently

Many beginner data scientists try to sort their data frames by writing complicated functions. This is neither the easiest nor the most efficient way to do it. Do not reinvent the wheel: use the sort_values() method that the pandas package provides. Let’s look at a real-life example and see how to use sort_values() in your code.

Load data set

We will use a Python dictionary to create some fake client data and load it into a pandas data frame. We will keep it simple, with just four columns: name, country, age and latest date active. The data set is small, but it will give us a good overview of how we can sort a data frame in several different ways.

import pandas as pd

# Fake client data: one list per column
client_dictionary = {'name': ['Michael', 'Ana', 'Sean'],
                     'country': ['UK', 'UK', 'USA'],
                     'age': [10, 51, 13],
                     'latest date active': ['07-05-2019', '23-12-2019', '03-04-2016']}
df = pd.DataFrame(client_dictionary)
df.head()

Just above, we have our client data frame. It has only three clients, but that is enough to showcase the different sorting possibilities.

Sort by alphabetical order

Let’s start by sorting the data frame by name in alphabetical order. We will use the pandas sort_values() method and specify the column to sort by with a parameter called ‘by’:

df.sort_values(by='name')

We can see that the data frame is now sorted by the name column in alphabetical order.

We can reverse the ordering by using ascending=False as our function parameter:

df.sort_values(by='name', ascending=False)
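
As a quick check of both calls, here is a minimal, self-contained sketch that rebuilds the same client data and prints the resulting name order. The variable names ascending_names and descending_names are introduced only for this illustration.

```python
import pandas as pd

# Same client data as in the article
df = pd.DataFrame({
    'name': ['Michael', 'Ana', 'Sean'],
    'country': ['UK', 'UK', 'USA'],
    'age': [10, 51, 13],
    'latest date active': ['07-05-2019', '23-12-2019', '03-04-2016'],
})

# sort_values() returns a new data frame; df itself is not modified
ascending_names = df.sort_values(by='name')['name'].tolist()
descending_names = df.sort_values(by='name', ascending=False)['name'].tolist()

print(ascending_names)   # ['Ana', 'Michael', 'Sean']
print(descending_names)  # ['Sean', 'Michael', 'Ana']

# The original row labels travel with the rows
print(df.sort_values(by='name').index.tolist())  # [1, 0, 2]
```

Note that the row labels (the index) stay attached to their rows, so the sorted frame shows them out of order.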

Sort by number

Let’s try to do the same but now sort by age. The code looks exactly the same except we change the column name that we will use for sorting:

df.sort_values(by='age')
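
The sketch below checks the age ordering on the same data. The second half is a hypothetical case that is not in the client data: ages stored as text sort character by character, so it may be worth confirming that a numeric column really has a numeric type before sorting.

```python
import pandas as pd

# Name and age columns from the article's client data
df = pd.DataFrame({'name': ['Michael', 'Ana', 'Sean'], 'age': [10, 51, 13]})

names_by_age = df.sort_values(by='age')['name'].tolist()
print(names_by_age)  # ['Michael', 'Sean', 'Ana']

# Hypothetical pitfall: ages stored as strings are compared as text
text_ages = pd.DataFrame({'age': ['9', '10', '51']})
sorted_as_text = text_ages.sort_values(by='age')['age'].tolist()
print(sorted_as_text)  # ['10', '51', '9']

# Converting to integers restores numeric order
text_ages['age'] = text_ages['age'].astype(int)
sorted_as_numbers = text_ages.sort_values(by='age')['age'].tolist()
print(sorted_as_numbers)  # [9, 10, 51]
```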

Sort by dates

Again, the same line of code will work for dates! The only thing we need to ensure is that the column is recognized as a date type and not as a string. Our dates are written day first (DD-MM-YYYY), so we convert them with pd.to_datetime() and an explicit format rather than relying on automatic parsing, which could read 07-05-2019 as July 5. Then we apply the sorting function:

df['latest date active'] = pd.to_datetime(df['latest date active'], format='%d-%m-%Y')
df.sort_values(by='latest date active')
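
To see why the column type matters, here is a small sketch with two hypothetical day-first dates (not taken from the client data) chosen so that text order and chronological order disagree. It assumes the DD-MM-YYYY layout used in this data set and passes that format to pd.to_datetime() explicitly.

```python
import pandas as pd

# Hypothetical day-month-year strings: 1 December 2019 and 2 January 2019
dates = pd.DataFrame({'latest date active': ['01-12-2019', '02-01-2019']})

# Sorted as strings: compared character by character
as_text = dates.sort_values(by='latest date active')['latest date active'].tolist()
print(as_text)  # ['01-12-2019', '02-01-2019']

# Sorted as real dates: chronological order
dates['latest date active'] = pd.to_datetime(dates['latest date active'],
                                             format='%d-%m-%Y')
sorted_dates = dates.sort_values(by='latest date active')['latest date active']
as_dates = sorted_dates.dt.strftime('%d-%m-%Y').tolist()
print(as_dates)  # ['02-01-2019', '01-12-2019']
```

As strings, 1 December lands before 2 January of the same year; once the column holds real dates, the order is chronological.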

Sort by multiple columns

The sort_values() method can work with multiple columns. It first sorts the data frame by the first column in the list. Where rows have equal values in that column, it uses the next column in the list to order them. Let’s look at an example where we sort first by country and then by name:

df.sort_values(by=['country','name'])

Here we can see that our entries are sorted by country: all UK entries are above the USA entries, and within each country the rows are further sorted by name.
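
The tie-breaking can be checked with a short sketch on the same data. It also shows that ascending accepts a list with one flag per column, which is one way to mix directions, for example countries A to Z but names Z to A within each country.

```python
import pandas as pd

# Name and country columns from the article's client data
df = pd.DataFrame({
    'name': ['Michael', 'Ana', 'Sean'],
    'country': ['UK', 'UK', 'USA'],
})

# Country first; name breaks the tie between the two UK rows
both_ascending = df.sort_values(by=['country', 'name'])['name'].tolist()
print(both_ascending)  # ['Ana', 'Michael', 'Sean']

# One direction per column: country ascending, name descending
mixed_order = df.sort_values(by=['country', 'name'],
                             ascending=[True, False])['name'].tolist()
print(mixed_order)  # ['Michael', 'Ana', 'Sean']
```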

Using inplace parameter

Last but not least, sort_values() returns a new sorted data frame and leaves the original unchanged. If you want the current data frame to keep the result of the sorting, remember to use the inplace parameter and set it to True:

df.sort_values(by=['country','name'], inplace=True)
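
One detail that is easy to miss: with inplace=True the call returns None, so its result should not be assigned back to a variable. The sketch below shows this next to the reassignment alternative. The names df_inplace and df_reassigned are made up for the illustration, and ignore_index is an optional parameter (available in pandas 1.0 and later) that renumbers the rows after sorting.

```python
import pandas as pd

client_dictionary = {'name': ['Michael', 'Ana', 'Sean'],
                     'country': ['UK', 'UK', 'USA'],
                     'age': [10, 51, 13]}

# Option 1: sort in place; the call itself returns None
df_inplace = pd.DataFrame(client_dictionary)
returned = df_inplace.sort_values(by=['country', 'name'], inplace=True)
print(returned)                     # None
print(df_inplace['name'].tolist())  # ['Ana', 'Michael', 'Sean']

# Option 2: reassign the sorted copy, here with a fresh 0..n-1 index
df_reassigned = pd.DataFrame(client_dictionary)
df_reassigned = df_reassigned.sort_values(by=['country', 'name'], ignore_index=True)
print(df_reassigned.index.tolist())  # [0, 1, 2]
```

Both options leave the data frame sorted; reassignment tends to be easier to chain with other pandas calls.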

Conclusion

You can sort your data frames efficiently with single lines of code. The magic is done by the sort_values() method from the pandas package. If you get to know its parameters as outlined in this overview, you will be able to sort any data frame according to your needs. I hope you find it useful. Happy sorting!
