Interactive maps with Python, pandas and Plotly: following bloggers through Sydney

Jan 23 · 7 min read

In this article and a few that follow, I will explore Python and Plotly to put together a few different, awesome-looking charts. Plotly.js is a JavaScript-based visualisation library built by a company (also) called Plotly, which provides wrappers for various languages, including a Python wrapper called… plotly.

Despite the unfortunate naming scheme, they’ve put together a very powerful, yet still very customisable, library, and I’m excited to explore what it can do.

You can follow along with the source code and the data that I use, both available from this GitLab repository.

Location, location, location

I often read articles, like food blogs or travel blogs, and think: where exactly are these places?

Over the last Christmas break, I had the opportunity to travel back to Sydney with my partner, who had never been to Australia, let alone Sydney. I am from Sydney, and wanted to help her see the “best” of Sydney, and appreciate it as many of us do.

So I read a few travel blogs to see what first-time visitors loved about Sydney. It struck me in doing so that a visitor could easily be confused about where everything is and which sights are worth seeing. Planning each day’s schedule would be a pain, as it was never clear how far one place was from another.

My solution to all this was to plot a map, with three goals in mind:

  • One, build an interactive map with each of these locations marked.
  • Two, give the map different markers to indicate what type of place each one is: a landmark, a location (like a suburb), or a transport hub.
  • Three, indicate how many of the blogs mention each location, so that we can filter for the more ‘important’ ones.

Plotly is my go-to visualisation library for anything custom. I had seen that it includes an amazing Mapbox integration, which I had not tried before. So I thought I would kill the proverbial two birds with one stone.

Before we get started

I assume you’re familiar with Python. Even if you’re relatively new, this tutorial shouldn’t be too tricky, though.

You’ll need pandas and plotly. Install them (in your virtual environment) with a simple pip install [PACKAGE_NAME].

If you do not have a Mapbox token, set one up with them; we are going to need it. They offer a free account with very reasonable access limits.

I save my key in a file, and load it in with:

with open('mapbox_tkn.txt', 'r') as f: 
    mapbox_key = f.read().strip()
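
If you would rather not depend on a file being present, here is a minimal sketch of a loader that checks an environment variable first and falls back to the text file. The variable name MAPBOX_TOKEN and the helper name load_mapbox_token are my own choices for illustration, not anything Mapbox or Plotly require.

```python
import os
from pathlib import Path


def load_mapbox_token(path="mapbox_tkn.txt", env_var="MAPBOX_TOKEN"):
    """Return a Mapbox token from an environment variable or a text file.

    MAPBOX_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    # Prefer the environment variable if it is set and non-empty.
    token = os.environ.get(env_var, "").strip()
    if token:
        return token
    # Otherwise fall back to the token file used in the article.
    token_file = Path(path)
    if token_file.is_file():
        return token_file.read_text().strip()
    raise FileNotFoundError(f"No token in {env_var} and no file at {token_file}")
```

With this in place, mapbox_key = load_mapbox_token() would stand in for the two lines above.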

Maps are fun

Introduction

It was slightly painful to collate this information, even with NLP tools, because people insist on misspelling names on their blogs, or simply call things by different names. (‘Harbour Bridge’ or ‘Sydney Harbour Bridge’? ‘Queen Victoria Building’ or ‘QVB’?)

This tutorial is focussed on Plotly and not scraping, so I will provide loc_data.csv (all data and scripts are available on my repo here). This file includes data for all the unique locations that we're going to look at. The data from each blog is included in the data_csvs subdirectory, named blog_file[N].csv.

Load the csv file into a dataframe, and take a look at its contents:

import pandas as pd

loc_df = pd.read_csv('mapping_blogs/loc_data.csv', index_col=0)
print(loc_df.head())

You’ll see that there are five columns: index, location, lat, lon and type.

They are the index number, location name string, latitude and longitude in decimals and the type of location.

Data cleaning

The ‘type’ column looks categorical upon first inspection. All unique values in a column can be listed with loc_df.type.unique(), which shows ['Area', nan, 'Food/Drinks', 'Transport', 'Lodging']. Indeed it is! I remember that the NaN values are those for which I couldn't come up with a category. Let's give them a name: Misc, for miscellaneous.

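The easiest way to do this is to use pandas’ .fillna method and assign the result back to the column. loc_df['type'] = loc_df['type'].fillna('Misc') will do the trick, and fill in any NaN values. (Calling .fillna with inplace=True on a selected column, as in loc_df.type.fillna('Misc', inplace=True), triggers warnings in recent pandas versions and may leave the dataframe unchanged.)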
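
To see what this step does without the real file, here is a minimal sketch on a few made-up rows (the locations and the toy_df name are purely illustrative). It assigns the filled column back rather than modifying it in place.

```python
import pandas as pd

# Toy rows standing in for loc_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "location": ["Bondi", "Central Station", "Somewhere"],
    "type": ["Area", "Transport", None],
})

# List the distinct categories, including the missing one.
print(toy_df["type"].unique())

# Replace missing categories with 'Misc' and assign the result back.
toy_df["type"] = toy_df["type"].fillna("Misc")

# Check how many rows fall into each category afterwards.
print(toy_df["type"].value_counts())
```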

Our first map

By this stage, we actually already have enough information to plot something! Through the magic of plotly, we just need these lines of code for our first map:

import plotly.express as px 
fig = px.scatter_mapbox(loc_df, lat="lat", lon="lon", color="type")
fig.update_layout(mapbox_style="open-street-map") 
fig.show()

Our first Plotly map!

Something like this should have opened in your browser (or in your Jupyter notebook). Isn’t that cool? It took just three lines of code to plot this. Plotly Express makes it much faster to create plots. The map is interactive, so try zooming, panning, hovering over the markers and isolating each category by clicking on the legend entries.

What we’re doing here is passing the entire dataframe to the scatter_mapbox function, and specifying the columns where the data resides. Then we specify the open-street-map style (available styles are listed in this guide) using update_layout.

If you’re wondering how we passed the mapbox_key variable that we loaded earlier to Plotly, the answer is that we haven't. Using the open-street-map style means that a Mapbox key is not needed, OpenStreetMap being a free, collaborative project.

Now that we know how to make a basic map, let’s get into the weeds, really looking at the data closely, and adding some bells and whistles.

Serious mapping

Mouseovers

While playing with the map, you probably noticed the mouseover tooltips. They’re great but not particularly informative. So let’s fix that. We want to see the name of the place, and don’t care much for the exact coordinates. Let’s specify the parameters hover_name='location' and hover_data=['type'] instead.

I also thought the colourful map was a little distracting from the overlays, so I changed the Mapbox style to light, which means we now need to provide the Mapbox key. Lastly, I thought the map was initially too zoomed out, when I am mostly interested in the Sydney metro area. So let's change the default mapped area by specifying the zoom parameter. The code and the resulting map are below:

fig = px.scatter_mapbox(loc_df, lat="lat", lon="lon", color="type", hover_name='location', hover_data=['type'], zoom=12) 
# Now using Mapbox 
fig.update_layout(mapbox_style="light", mapbox_accesstoken=mapbox_key) 
fig.show()

The mouseover popups are great, but not (yet) very informative
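
The zoom parameter controls how far in the map starts; px.scatter_mapbox also accepts a center argument (a dict with lat and lon keys) for where it starts. If a stray far-away point drags the initial view off the city, one option is a median-based centre. This is a minimal sketch on made-up coordinates; the map_centre helper is my own name, not part of Plotly.

```python
import pandas as pd

# Made-up coordinates: four points around Sydney plus one distant outlier.
toy_df = pd.DataFrame({
    "lat": [-33.86, -33.87, -33.85, -33.88, -37.81],
    "lon": [151.20, 151.21, 151.22, 151.19, 144.96],
})


def map_centre(df, lat_col="lat", lon_col="lon"):
    """Return the median latitude and longitude as a dict.

    The median is less sensitive to a few distant points than the mean.
    """
    return {
        "lat": float(df[lat_col].median()),
        "lon": float(df[lon_col].median()),
    }


centre = map_centre(toy_df)
print(centre)
```

The resulting dict could then be passed along as center=centre in the px.scatter_mapbox call.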

Okay, that’s a huge improvement. And we’ve now met two of our three objectives. For the last one, we’re going to have to compile counts of locations. So, back to the dataframe we go.

Finding the most popular destinations

I wanted to see which destinations, or locations, were the most popular among bloggers. So in this section, we will count how many times each location has been included.

The first step is to count up which of the locations in the master list are in each blog, and sort the resulting dataframe:

import os

data_dir = 'data_csvs'
data_files = [i for i in os.listdir(data_dir) if i.endswith('.csv')]
for csv_file in data_files:
    # Read the location names mentioned in this blog, one per line
    with open(os.path.join(data_dir, csv_file), 'r') as f:
        locs_txt = f.read()
    temp_locs = locs_txt.split('\n')
    # One True/False per master location: is it mentioned in this blog?
    locs_bool = [loc_in_list(i, temp_locs) for i in list(loc_df['location'])]
    loc_df = loc_df.assign(**{csv_file: locs_bool})
# Sum the per-blog boolean columns to count mentions per location
loc_df = loc_df.assign(counts=loc_df[data_files].sum(axis=1))
loc_df.sort_values(by='counts', inplace=True, ascending=False)

The loc_in_list function is something I wrote to compare location names to a list, taking into account various combinations of including/omitting the word 'the', and various apostrophe/quote symbols. It needs to be defined before the loop above is run.

def loc_in_list(loc, loc_list):
    # Lower-case, strip and de-duplicate the candidate names
    loc_list = list(set([i.strip().lower() for i in loc_list if len(i.strip().lower()) > 0]))
    # Add variants with and without a leading 'the '
    loc_list += ['the ' + i for i in loc_list if i[:4] != 'the ']
    loc_list += [i[4:] for i in loc_list if i[:4] == 'the ']
    # Add variants with apostrophes/hyphens removed or replaced by a space
    for t_char in ["'", "-"]:
        loc_list += [i.replace(t_char, "") for i in loc_list if t_char in i]
        loc_list += [i.replace(t_char, " ") for i in loc_list if t_char in i]
    loc = loc.replace("’", "'")  # normalise curly apostrophes to straight ones
    loc = loc.strip().lower()
    loc_in_list_bool = (loc in loc_list) or (loc.replace("'", "") in loc_list)
    return loc_in_list_bool

print(loc_df.head())
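
As a variation on the same idea, here is a simplified sketch that reduces both sides to a canonical key instead of expanding the list with variants. It is not equivalent to loc_in_list (for instance, it only turns hyphens into spaces), and the canonical_name and name_in_list helpers, along with the sample names, are made up for illustration.

```python
import re


def canonical_name(name):
    """Reduce a place name to a comparison key.

    Lower-cases, straightens curly apostrophes, drops apostrophes,
    turns hyphens into spaces, collapses whitespace and removes a
    leading 'the '.
    """
    key = name.strip().lower()
    key = key.replace("\u2019", "'").replace("\u2018", "'")
    key = key.replace("'", "").replace("-", " ")
    key = re.sub(r"\s+", " ", key).strip()
    if key.startswith("the "):
        key = key[4:]
    return key


def name_in_list(name, candidates):
    """Return True if name matches any non-empty candidate after canonicalising."""
    keys = {canonical_name(c) for c in candidates if c.strip()}
    return canonical_name(name) in keys


# Illustrative names as they might appear in one blog's list.
blog_names = ["The Rocks", "Mrs Macquarie\u2019s Chair", "Darling-Harbour", ""]
print(name_in_list("rocks", blog_names))
print(name_in_list("Mrs Macquarie's Chair", blog_names))
print(name_in_list("Manly", blog_names))
```

Comparing keys keeps the candidate list from growing with every variant, at the cost of having to decide up front which differences should be ignored.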

Looking at the dataframe again, it looks like we’ve got a dataframe sorted by counts of occurrences. ‘Harbour Bridge’ makes an appearance on every list, with a count of 6!
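
The counting itself relies on booleans summing as 0 and 1 across columns. A minimal sketch with made-up rows and hypothetical file names shows the mechanics in isolation:

```python
import pandas as pd

# Made-up mention flags for three hypothetical blog files.
toy_df = pd.DataFrame({
    "location": ["Newtown", "Harbour Bridge", "Bondi Beach"],
    "blog_a.csv": [False, True, True],
    "blog_b.csv": [False, True, False],
    "blog_c.csv": [True, True, True],
})
blog_cols = ["blog_a.csv", "blog_b.csv", "blog_c.csv"]

# Summing across the row (axis=1) counts the True values per location.
toy_df = toy_df.assign(counts=toy_df[blog_cols].sum(axis=1))
toy_df = toy_df.sort_values(by="counts", ascending=False)
print(toy_df[["location", "counts"]])
```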

We can add a size parameter (and size_max to control symbol sizes), and plot the map again:

fig = px.scatter_mapbox( loc_df, lat="lat", lon="lon", color="type", size="counts", hover_name='location', hover_data=['type'], zoom=12, size_max=15) 
fig.update_layout(mapbox_style="light", mapbox_accesstoken=mapbox_key) 
fig.show()

Overlapping locations

The eagle-eyed among you might have noticed these overlapping locations.

For places like these, I’m going to merge the overlapping entries and keep the location of the name with the higher count.

I simply loop over every row, and look for rows with distances less than a threshold:

loc_df = loc_df.assign(dup_row=0)
loc_thresh = 0.0001  # distance threshold, in degrees
for i in range(len(loc_df)):
    src_ind = loc_df.iloc[i].name
    for j in range(i+1, len(loc_df)):
        tgt_ind = loc_df.iloc[j].name
        lat_dist = loc_df.loc[src_ind]['lat'] - loc_df.loc[tgt_ind]['lat']
        lon_dist = loc_df.loc[src_ind]['lon'] - loc_df.loc[tgt_ind]['lon']
        tot_dist = (lat_dist ** 2 + lon_dist ** 2) ** 0.5
        if tot_dist < loc_thresh:
            print(f'Found duplicate item "{loc_df.loc[tgt_ind]["location"]}", index {tgt_ind}')
            # Carry over the blog mentions from the duplicate row
            for csv_file in data_files:
                if loc_df.loc[tgt_ind, csv_file]:
                    loc_df.loc[src_ind, csv_file] = True
            # Append the duplicate's name unless it is already included
            if loc_df.loc[tgt_ind, 'location'] not in loc_df.loc[src_ind, 'location']:
                loc_df.loc[src_ind, 'location'] = loc_df.loc[src_ind, 'location'] + ' | ' + loc_df.loc[tgt_ind, 'location']
            # Flag the duplicate row for removal
            loc_df.loc[tgt_ind, 'dup_row'] = 1
loc_df = loc_df[loc_df.dup_row == 0]
# Recount and re-sort now that rows have been merged
loc_df = loc_df.assign(counts=loc_df[data_files].sum(axis=1))
loc_df.sort_values(by='counts', inplace=True, ascending=False)
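
Note that the threshold above is in degrees, not metres. To get a feel for what 0.0001 degrees means on the ground, here is a small sketch using the haversine formula with a spherical Earth (an approximation). The coordinates are only roughly those of the Harbour Bridge, and the haversine_m helper is my own addition, not part of the original script.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Approximate coordinates near the Harbour Bridge, for illustration.
lat, lon = -33.8523, 151.2108
north_south = haversine_m(lat, lon, lat + 0.0001, lon)
east_west = haversine_m(lat, lon, lat, lon + 0.0001)
print(f"{north_south:.1f} m north-south, {east_west:.1f} m east-west")
```

At Sydney's latitude this works out to roughly 11 m north-south and about 9 m east-west, so the threshold only merges points that sit practically on top of each other.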

The data is now ready to be plotted! I also turn off the mode bar and disable edits, which I do on all my plots.

fig = px.scatter_mapbox(loc_df, lat="lat", lon="lon", color="type", size="counts", hover_name='location', hover_data=['type'], zoom=12, size_max=15) 
fig.update_layout(mapbox_style="light", mapbox_accesstoken=mapbox_key) 
fig.show(config={'displayModeBar': False, 'editable': False})

Look at that: the points on the bridge have been joined, the names added to each other, and the count incremented!

And so we have an interactive map, filterable by category, with tooltips!

That finishes this writeup on scatter plotting on a map. Hopefully that was interesting to you.

If you liked this, say hi / follow on Twitter, or follow for updates.
