Scheduled Web Scraping with Django and Heroku

Build a Django app that scrapes a job board daily


We often need a lot of training data for machine learning, and web scraping can be a way to acquire it.

But that’s not its only use. A while back, there was a company I really wanted to work for. They didn’t have a data science posting at the time, but as soon as they did, I wanted to apply.

The solution? Scrape their job board daily so I was notified anytime a new job was posted.

Let’s build a simple Django app, deploy it to Heroku, and scrape a job board daily.

Set up the app

Create the directory for the app and cd into it.

mkdir jobs && cd jobs

Open this in whatever code editor you prefer. I’m using Sublime.

Create and activate our virtual environment, then install the packages we’ll need.

python -m venv env
source env/bin/activate
pip3 install django psycopg2 django-heroku bs4 gunicorn

Create the project (Django’s version of a web application).

django-admin startproject jobs

cd into the project and create an app for scraping.

cd jobs
django-admin startapp scraping

Create a Job model

We’ll only need to define one model in this app, a Job model, which represents the jobs we’ll collect.

Overwrite /scraping/models.py with the following.

from django.db import models
from django.utils import timezone


class Job(models.Model):
    url = models.CharField(max_length=250, unique=True)
    title = models.CharField(max_length=250)
    location = models.CharField(max_length=250)
    created_date = models.DateTimeField(default=timezone.now)

    def __str__(self):
        return self.title

    class Meta:
        ordering = ['title']

    class Admin:
        pass

Register your model in /scraping/admin.py. This allows us to view records in Django’s default admin panel (we’ll get to this shortly).

from django.contrib import admin

from scraping.models import Job

admin.site.register(Job)

Add scraping to INSTALLED_APPS in /jobs/settings.py.

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'scraping'
]

Set up the database

I love Postgres, so we’ll use that.

At this point, ensure you’ve already installed Postgres (on a Mac, with Homebrew) and started it (this is beyond the scope of this article).

Create a database for this project from the command line. You can open the Postgres console with the below command.

psql -d template1

Create a user and a database, then exit.

create user django_user;
create database django_jobs owner django_user;
\q

In /jobs/settings.py, update DATABASES. In other frameworks you’d want to scope these specifically for the development environment, but here we won’t worry about it. It will work on Heroku anyway (we’ll get there shortly).

Note these are the user and database names we created above.

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'django_jobs',
        'USER': 'django_user',
        'HOST': '',
        'PORT': ''
    }
}
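As an aside, if you did want to scope these settings per environment, a minimal sketch might key off an environment variable (DJANGO_ENV and the production database name are illustrative assumptions, not something Django or Heroku sets for you):

import os

# assumed env var; swap the database name in production
if os.environ.get('DJANGO_ENV') == 'production':
    DATABASES['default']['NAME'] = 'django_jobs_prod'

We won’t need this here; django_heroku (added later) reconfigures the database on Heroku for us.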

Create migrations, and migrate the database from the command line.

python manage.py makemigrations
python manage.py migrate

This will create a table called scraping_job. The name is a Django convention: tables are namespaced by the app they belong to, and this model lives in the scraping app.
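You can confirm this by printing the SQL that Django generated for the migration; the output includes a CREATE TABLE statement for scraping_job.

python manage.py sqlmigrate scraping 0001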

Now create a superuser and give it a password in your command line.

python manage.py createsuperuser --email admin@example.com --username admin

Test current app

We’ve done some work so far but we have no idea if anything works. Let’s test it out before going further.

On the command line.

python manage.py runserver

Then navigate to http://127.0.0.1:8000/admin in your browser. Log in with the superuser you just created.

After logging in, click “jobs” under “scraping”, then “add job” in the top right. Now fill in some made up information and click “save”. If you can see the job you created, everything works up until this point!

Custom django-admin commands

We’re going to set up a custom django-admin command that scrapes the job board. This is the command we’ll schedule at the infrastructure level in order to automate the scraping.

Inside the /scraping module, create a directory called /management, and inside /management a directory called /commands. Then create 2 python files, _private.py and scrape.py, in /commands. Depending on your Python and Django versions, you may also need empty __init__.py files in /management and /commands for the command to be discovered, as shown in the commands below.

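Run from the directory containing manage.py, something like this creates the whole structure (the __init__.py lines are the hedge just mentioned):

mkdir -p scraping/management/commands
touch scraping/management/__init__.py
touch scraping/management/commands/__init__.py
touch scraping/management/commands/_private.py
touch scraping/management/commands/scrape.py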

Drop this code into scrape.py.

from django.core.management.base import BaseCommand
from django.db import IntegrityError
from urllib.request import urlopen

from bs4 import BeautifulSoup

from scraping.models import Job


class Command(BaseCommand):
    help = "collect jobs"

    # define logic of command
    def handle(self, *args, **options):
        # collect html
        html = urlopen('https://jobs.lever.co/opencare')
        # convert to soup
        soup = BeautifulSoup(html, 'html.parser')
        # grab all postings
        postings = soup.find_all("div", class_="posting")
        for p in postings:
            url = p.find('a', class_='posting-btn-submit')['href']
            title = p.find('h5').text
            location = p.find('span', class_='sort-by-location').text
            try:
                # save in db; the unique constraint on url rejects duplicates
                Job.objects.create(
                    url=url,
                    title=title,
                    location=location
                )
                print('%s added' % (title,))
            except IntegrityError:
                print('%s already exists' % (title,))
        self.stdout.write('job complete')

Putting the code in this directory structure and defining a Command class with a handle method is what tells Django this is a custom django-admin command.

Here we’re using Beautiful Soup to scrape a Lever job board, then saving the postings in our database. I have no affiliation, but the job board I’ve chosen is for a fantastic company called Opencare - you should apply if they post something!

You can now run this from your command line like below.

python manage.py scrape

And you’ll see a line of output like “<job title> added” for each posting collected, followed by “job complete”.

Run it again and you’ll see “<job title> already exists” for each posting instead.

That’s because we’ve prevented adding duplicate job records in the scrape.py code above.
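Relying on the unique url constraint works fine, but if you’d rather not catch the exception, a minimal sketch of the same loop body using Django’s get_or_create would be:

job, created = Job.objects.get_or_create(
    url=url,
    defaults={'title': title, 'location': location}
)
if created:
    print('%s added' % (title,))
else:
    print('%s already exists' % (title,))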

If you have a database administration program set up (like DBeaver), you can also inspect the rows in your database; we won’t get into that here.
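If you’d rather stay on the command line, you can also peek at the table from psql, using the database and table names from earlier:

psql -d django_jobs
select title, location, created_date from scraping_job;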

Deploying to production

Now let’s get this onto Heroku.

Freeze your requirements so Heroku knows what to install on deploy.

pip3 freeze > requirements.txt

On the command line, run nano .gitignore and add below.

.DS_Store
jobs/__pycache__
scraping/__pycache__

Then ctrl+x , y , enter to save and close (on mac). This prevents deploying unnecessary files.

Create a file named Procfile in your root, and paste the below inside. This tells Heroku to boot up a web dyno, and the release command migrates the database on every deploy.

web: gunicorn jobs.wsgi
release: python manage.py migrate

The file tree will now look roughly like the following (virtual environment and a few generated files omitted).
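jobs/
├── manage.py
├── Procfile
├── requirements.txt
├── .gitignore
├── jobs/
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
└── scraping/
    ├── admin.py
    ├── models.py
    ├── migrations/
    └── management/
        └── commands/
            ├── _private.py
            └── scrape.py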

Make sure you have a Heroku account, then log in with heroku login on the command line.

Create an app with any name you want, but it needs to be unique across all the apps on Heroku. I ran heroku create django-scraper-879, where the app name is django-scraper-879. You’ll need to pick your own.

Now add these lines to the very bottom of settings.py. django_heroku takes care of some settings configuration for you (the database, allowed hosts, and static files via WhiteNoise).

import django_heroku
django_heroku.settings(locals())

Update DEBUG in settings. We don’t want to deploy to production in debug mode.

DEBUG = False

Add your files to git with the below.

git init
git add -A
git commit -m 'first commit'

Now push our app to Heroku with this.

git push heroku master

Scheduling the job

You can run the job manually from your local command line with heroku run python manage.py scrape, but it would be annoying to have to do that every day.

Let’s automate this.

Log into your Heroku console and click on “Resources,” then “find more add-ons”.

Now find the Heroku Scheduler add-on and click on it. Try a “ctrl+f” for “scheduler” to help locate it; Heroku has a ton of potential add-ons.

Now add it to your app.

Click on it and create a job that runs every day at 12:00 AM UTC. It’s not nice to ping websites more than necessary!

Input the command python manage.py scrape and save it.

We’re done!

Now just wait until 12am UTC (or whatever time you picked) and your database will populate.
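If you want to spot-check the scheduled run afterwards, a quick sketch using a Django shell on Heroku and the model we defined:

heroku run python manage.py shell

>>> from scraping.models import Job
>>> Job.objects.count()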

Conclusion

We touched on a lot of things here: Django, Heroku, scheduling, web scraping, and Postgres.

While I used the example of wanting to know when a company posted a new job, there are lots of reasons you might want to scrape a website.

  • E-commerce companies want to monitor competitor prices.
  • Executive recruiters may want to see when a company posts a job opening.
  • You or I may want to see when a new competition is added on Kaggle.

This tutorial was just to illustrate what’s possible. Let me know if you’ve built anything cool with web scraping in the comments.
