Testing, profiling, and optimizing NLP models with Pytest, Cython, and spaCy

Unit test your machine learning models, profile your code, and take full advantage of C’s natural language processing speed.

Jan 23 · 8 min read

I read a blog post that claimed to have profiled spaCy and NLTK for natural-language data preprocessing, and to have found NLTK far faster.

uhh (via giphy, @TheLateShow)


NLTK is the Jeep Grand Cherokee of natural language processing toolkits: it’s huge, it’s been around for a long time, it’s got a lot of power, but it’s slow and takes a lot of gas.

SpaCy is the Lamborghini Aventador. It may not be a tank with all the bells and whistles, but it’s sleek, stripped down to bare metal. What it does, it does fast. And with its high horsepower comes the risk (and fun) of taking too hard a turn while speeding down a winding course and fishtailing right off the road.

this, but with vectors (via giphy, @thegrandtour)

So, this claim struck me as outlandish:

Spacy is way, way slower than NLTK. NLTK took barely 2 seconds to tokenize the 7 MB text sample while Spacy took whooping 3 minutes!

If you write your pipeline correctly, and your hardware has decent memory, there is no way this can happen.

Why should I care?

I primarily work in research engineering — my job is to make the models, not build software architecture. So why care about your code performance if you just want to build chatbots and search engines?

If you’re interested in building the best model for the data, you can’t consider the machine learning model apart from the code quality.

  • The large size of NLP corpora means you’ll be limited in the algorithms and the tools you can use unless you try to optimize your code for space and time.
  • Not writing tests means you won’t know what changes caused which errors — or worse, there might be problems in your substantive model that you’re not even aware of, and won’t ever find out. (Have you ever written an ML function or class only to realize it was never being called?)
  • Understanding the libraries you’re using is critical not only to code performance, but also to a deeper understanding of the substance of the models you’re building. You’ll accelerate your knowledge from merely importing scikit-learn and fitting an estimator to actually grokking the algorithms that underpin NLP.
  • Having good tests and thoughtful code is critical to reproducible research.
  • If you’re writing code for production, not solely academic purposes, you absolutely need to think about how it will run in real-world usage.

Why’s spaCy so fast?

SpaCy’s paint job is Python, but the engine is Cython. Cython is like a creole language: part Python, part C. It is a superset of Python (virtually all valid Python is valid Cython), but it adds features from C, such as static type declarations, that let your code compile down to fast C.

For certain use cases, Cython is a great speed-enhancing tool, for instance when computing lots of numerical results whose outcomes are contingent on a series of if-statements. Usually you’d use numpy to speed up computation, but numpy is best at vectorized code: doing the same thing to every element. When we want to do a different thing to each element, Cython lets us use that same C optimization, but with the added logical complexity.

SpaCy makes extensive use of cythonization out of the box, making it a very fast NLP framework in comparison to many other Python libraries.

The pitfalls of not understanding your algorithms

Let’s go back to the blog that thought spaCy was slower than NLTK.

When I ran my versions of NLTK and spaCy text cleaners on the same documents using line_profiler, employing an apples-to-apples comparison, NLTK took 5.5 times as long as spaCy to do substantially similar work.

TL;DR: NLTK is 5.5 times slower than spaCy when used in a comparable NLP pipeline on the same data.

What was different in my code? To begin with, I studied the architecture of both these systems inside and out, including:

  • What tasks they are doing, and when
  • What data types they expect and return
  • The control flow of the functions

To get a meaningful comparison, you need to know that spaCy is doing a lot more than NLTK is in the referenced code. By default, the spaCy pipeline includes tokenization, lemmatization, part of speech tagging, dependency parsing, and named entity recognition. It does all of that (and more) every time you call nlp(doc).

example of spaCy’s capabilities for part of speech tagging and dependency parsing

By contrast, NLTK (in the referenced code) is only tokenizing and lowercasing the corpus.

Comparing those two pipelines against each other is absurd.

We need a better framework:

  • I temporarily disabled the dependency parse and named entity recognition in spaCy’s pipeline, since NLTK isn’t doing those things.
  • I made NLTK do part of speech tagging and lemmatization, to bring its workload closer to spaCy’s.

This still isn’t a perfect one-to-one comparison, but it’s a lot closer.

A tale of two spaCys

I wrote two versions of the spaCy pipeline, to reflect how I see people using nlp.pipe in the wild. The first is list(nlp.pipe(docs)). The second, more efficient way, is to use nlp.pipe(docs) as the generator object that it is.

This distinction matters, because unlike Python lists, generator objects do not hold the entire corpus in memory at the same time. That means you can iteratively modify, extract content from, or write chunks of the corpus to a database. In NLP, this comes up a lot, because corpora are often very large — too large to hold in memory all at once.
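The difference between the two call patterns can be sketched without spaCy at all. In the minimal example below, fake_pipe is a hypothetical stand-in for nlp.pipe (it only lowercases and splits on whitespace) and docs is made-up data; the point is only how the results are consumed.

```python
from collections import Counter
from types import GeneratorType


def fake_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: yields one processed doc at a time."""
    for text in texts:
        yield text.lower().split()


docs = ["The cat sat", "The dog ran", "A cat ran"]

# Pattern 1: materialize everything. Every processed doc is in memory at once.
all_docs = list(fake_pipe(docs))
counts_from_list = Counter(tok for doc in all_docs for tok in doc)

# Pattern 2: stream. Each processed doc can be discarded once it is counted.
stream = fake_pipe(docs)
counts_from_stream = Counter()
for doc in stream:
    counts_from_stream.update(doc)
```

Both patterns produce the same counts. Only the peak memory differs, which is what matters once the corpus no longer fits in RAM.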

What’s the verdict?

cProfile (dramatization) (via giphy, @f1)

Testing my three pipelines (with NLTK, spaCy-as-list, and spaCy-as-generator) on the same Reddit corpus (n=15000), here are the results according to line_profiler:

  • spaCy-as-list: total time = 64.260 s
  • spaCy-as-generator: total time = 60.356 s
  • NLTK: total time = 334.677 s

Using spaCy, without even writing your own Cython, speeds up your code more than 5x.
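For anyone who wants to check the arithmetic, the ratios can be recomputed from the three totals above. This is just a sanity check on the reported numbers (the variable names are mine); it works out to roughly 5.2x against spaCy-as-list and 5.5x against spaCy-as-generator.

```python
# Total times reported by line_profiler above, in seconds.
timings = {
    "spaCy-as-list": 64.260,
    "spaCy-as-generator": 60.356,
    "NLTK": 334.677,
}

# How many times longer NLTK took than each spaCy variant.
slowdown = {
    name: timings["NLTK"] / seconds
    for name, seconds in timings.items()
    if name != "NLTK"
}

for name, ratio in slowdown.items():
    print(f"NLTK took {ratio:.1f}x as long as {name}")
```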

This makes sense — NLTK isn’t optimizing with Cython, and when it tags and lemmatizes tokens, it uses more time-expensive operations than spaCy’s.

What’s the best way to write a text preprocessing pipeline?

As with any code, spaCy will be slow if you don’t understand its data structures and how to wield them responsibly.

Read the source code. SpaCy relies on two main things to run fast: Cython’s C internals and Python generators. Because of this, it uses complex classes and data pipelines that mean the types and methods of its objects aren’t always immediately apparent.

For example, when you call nlp on a text to get a Doc object, spaCy runs compiled C code (generated from its Cython source) to process your data. While you can use ordinary syntax like bracket indexing to interact with the resulting Doc, it doesn’t work the same way as on base Python objects. Rather, the Doc class overloads Python operators to let you interact with a C struct as though it were a Python object.

a C function in the Doc class
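To make the idea of operator overloading concrete, here is a toy sketch in pure Python. TinyDoc is hypothetical and far simpler than spaCy’s Cython Doc class; it only illustrates how special methods such as __getitem__, __len__, and __iter__ can make compact typed storage feel like an ordinary Python list.

```python
from array import array


class TinyDoc:
    """Hypothetical toy, not spaCy's Doc: tokens are stored as character offsets."""

    def __init__(self, text):
        self.text = text
        self._starts = array("i")  # compact C-typed storage, not Python objects
        self._ends = array("i")
        position = 0
        for word in text.split():
            start = text.index(word, position)
            self._starts.append(start)
            self._ends.append(start + len(word))
            position = start + len(word)

    def __len__(self):
        return len(self._starts)

    def __getitem__(self, i):
        # Bracket indexing builds a fresh string from the stored offsets on demand.
        return self.text[self._starts[i]:self._ends[i]]

    def __iter__(self):
        return (self[i] for i in range(len(self)))


doc = TinyDoc("Read the source code")
second_token = doc[1]
```

From the outside, doc[1] and len(doc) look like list operations, but nothing resembling a list of token objects is ever stored.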

The only way to really know this is to read the source code. The documentation won’t spell it out for you. Apart from working through the Advanced NLP with spaCy tutorial, there’s no other way to really learn the ropes.

Getting comfortable reading source code will vastly improve your programming literacy and your data pipelines.

Have a plan. Make sure you have outlined the steps of your pipeline in doodles and pseudocode before you start writing it. Having a thought-out plan will help you think about how best to chain together different functions. You’ll see efficiencies in the data, ways to group functions and variables together to speed things up, and how to cast your data for ease of flow.

A workflow for NLP optimization

  1. Solve the problem without regard to optimality. Just get it working.
  2. Have working tests in place before you start changing your code. Having tests means that you can change significant portions of your architecture without worrying about introducing new mystery bugs.
    Tests aren’t something a lot of data scientists think about up front, and it’s not always easy to think of how to write them for NLP machine-learning models. But at least in data preprocessing, they can make a world of difference in helping you avoid the “garbage in, garbage out” phenomenon of poorly-preprocessed text. Some examples include:
  • Write unit tests for your regular expressions (e.g., the ones you use to tokenize sentences) to make sure they match only what you want them to
  • Check whether your functions return the correct data type. Should that return value be a numpy array, a dict, a string, a list, a list of lists? Test it.
  • Check that all your classes and functions are actually being called the way you intend. It’s never fun to find out that the vital embedding model or feature engineering component is getting the wrong input (or worse, isn’t being called at all).
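The three kinds of checks above can be sketched as pytest-style test functions. Everything here is an assumption made for illustration: the sentence-boundary regex, split_sentences, and preprocess are hypothetical stand-ins for your own preprocessing code, and the tests use only plain assert statements plus the standard library’s unittest.mock.

```python
import re
from unittest import mock

# Hypothetical pattern: split on whitespace that follows ., !, or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into a list of non-empty sentence strings."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def preprocess(texts, splitter=split_sentences):
    """Apply the splitter to every text and return a list of lists."""
    return [splitter(text) for text in texts]


def test_regex_splits_on_sentence_punctuation():
    assert split_sentences("It works. Does it? Yes!") == [
        "It works.", "Does it?", "Yes!",
    ]


def test_regex_does_not_split_inside_a_number():
    # No whitespace follows the "." in 3.14, so the sentence stays whole.
    assert split_sentences("Pi is 3.14 exactly.") == ["Pi is 3.14 exactly."]


def test_preprocess_returns_list_of_lists_of_str():
    result = preprocess(["One. Two.", "Three."])
    assert isinstance(result, list)
    assert all(isinstance(doc, list) for doc in result)
    assert all(isinstance(s, str) for doc in result for s in doc)


def test_preprocess_actually_calls_the_splitter():
    # A Mock that wraps the real function records every call made to it.
    spy = mock.Mock(wraps=split_sentences)
    preprocess(["One. Two.", "Three."], splitter=spy)
    assert spy.call_count == 2
```

Saved in a file whose name starts with test_, these functions are collected and run by the pytest command. Passing the splitter in as an argument is a design choice that makes the "is it being called?" check easy to write.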

Before you start cythonizing everything, though, make sure you’ve profiled your code and cleaned up your Python. Even libraries written mostly in Python, like gensim, can be extremely fast and memory-efficient, because they’ve scrupulously tightened up their data structures and pipelines.

  1. Measure the speed and performance of your code consistently and repeatably. You need a baseline to know whether you’re improving things (or making them worse). line_profiler, cProfile, and py-spy can all produce output documenting your script’s performance that you can save, read, and compare.
  2. Make sure you’re using the right data structures and function calls. Don’t bother with Cython until you’ve made sure your Python is as fast as it can be. Sometimes Cython is not the right tool for the job, and minor changes to your Python code will produce more dramatic improvements than cythonization. As we saw, swapping out lists for generators is an easy bottleneck fix.
  3. Check the control flow of your functions. A line-by-line profiler can really help here. In writing my comparison, I noticed that line_profiler said my spaCy function was spending time on the NLTK part of the text-cleaning function. I had written the code so that the inner function pertaining only to NLTK was defined before I checked whether the spacy argument was True or False. Moving that code inside the NLTK branch of the if-else statement stopped my spaCy model from running through that (irrelevant) block of code on every pass.
  4. Write good tests. Thinking about what output your functions need to produce to pass your tests will almost force you to do 1–3 automatically.
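Steps 1 and 3 can be sketched together with the standard library’s cProfile and pstats. The functions below are hypothetical and not my actual text cleaner: build_stopwords stands in for setup work that only one branch needs, and clean_before and clean_after differ only in where that setup happens.

```python
import cProfile
import io
import pstats


def build_stopwords():
    """Hypothetical stand-in for setup work that only the slow path needs."""
    return {f"word{i}" for i in range(5_000)}


def clean_before(text, fast=True):
    # The setup runs on every call, even when the fast path never uses it.
    stopwords = build_stopwords()
    if fast:
        return text.lower().split()
    return [tok for tok in text.lower().split() if tok not in stopwords]


def clean_after(text, fast=True):
    if fast:
        return text.lower().split()
    # The setup now runs only inside the branch that needs it.
    stopwords = build_stopwords()
    return [tok for tok in text.lower().split() if tok not in stopwords]


def profile_report(func, docs):
    """Run func over docs under cProfile and return the stats table as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for doc in docs:
        func(doc)
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats()
    return buffer.getvalue()


docs = ["Some Reddit-like comment text"] * 200
report_before = profile_report(clean_before, docs)
report_after = profile_report(clean_after, docs)
```

In the first report, build_stopwords shows up with one call per document; in the second it does not appear at all. Writing both reports to files gives you a repeatable baseline to compare against after each change.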

Better code, better science

Thinking this way takes a little more architecting of your model, but it’s worth it to avoid waiting half an hour for your docs to process. Take the time to think through the logical structure of your model, write tests, and reflect on the output of the profiler to avoid doing sloppy science in the name of instant gratification. Anyone can make a bag of words; the finesse and nuance of NLP lie in the detail work.