Unit test your machine learning models, profile your code, and take full advantage of C's natural language processing speed.
I read a blog post that claimed to have profiled spaCy and NLTK for natural-language data preprocessing, and to have found NLTK far faster.
NLTK is the Jeep Grand Cherokee of natural language processing toolkits: it’s huge, it’s been around for a long time, it’s got a lot of power, but it’s slow and takes a lot of gas.
SpaCy is the Lamborghini Aventador. It may not be a tank with all the bells and whistles, but it's sleek and stripped down to bare metal. What it does, it does fast. And with its high horsepower comes the risk (and fun) of taking too hard a turn while speeding down a winding course and fishtailing right off the road.
So, this claim struck me as outlandish:
"Spacy is way, way slower than NLTK. NLTK took barely 2 seconds to tokenize the 7 MB text sample while Spacy took whooping 3 minutes!"
If you write your pipeline correctly, and your hardware has decent memory, there is no way this can happen.
I primarily work in research engineering — my job is to make the models, not build software architecture. So why care about your code performance if you just want to build chatbots and search engines?
If you’re interested in building the best model for the data, you can’t consider the machine learning model apart from the code quality.
SpaCy's paint job is Python, but the engine is Cython. Cython is like a creole language: part Python, part C. It is a superset of Python (all valid Python is valid Cython), but it includes faster, lower-level features from C.
For certain use cases, Cython is a great speed-enhancing tool; for instance, computing lots of numerical results whose outcomes are contingent on a series of if-statements. Usually you'd use numpy to speed up computation, but numpy is best at vectorizing code, doing the same thing to every element. When you want to do a different thing to each element, Cython gives you that same C optimization while keeping the added logical complexity.
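As a toy illustration (pure Python here, not Cython), this is the kind of per-element branching that is awkward to vectorize; in Cython you would add C type declarations (e.g. cdef int, typed memoryviews) to the same loop to get C speed. The function name and thresholds are hypothetical:

```python
def bucket_scores(values):
    """Apply different logic to each element depending on its value --
    the branching pattern that resists a single numpy expression."""
    results = []
    for v in values:
        if v < 0:
            results.append(0)        # clamp negatives to zero
        elif v < 10:
            results.append(v * 2)    # double small values
        else:
            results.append(v + 10)   # offset large values
    return results

print(bucket_scores([-3, 4, 25]))  # [0, 8, 35]
```

Simple cases like this can still be handled with numpy's where; it is when the per-element logic deepens and nests that a typed Cython loop starts to pay off.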
SpaCy makes extensive use of cythonization out of the box, making it a very fast NLP framework in comparison to many other Python libraries.
Let’s go back to the blog that thought spaCy was slower than NLTK.
When I ran my versions of NLTK and spaCy text cleaners on the same documents using line_profiler, employing an apples-to-apples comparison, NLTK took 5.5 times as long as spaCy to do substantially similar work.
TL;DR: NLTK is 5.5 times slower than spaCy when used in a comparable NLP pipeline on the same data.
What was different in my code? To begin with, I studied the architecture of both these systems inside and out.
To get a meaningful comparison, you need to know that spaCy is doing a lot more than NLTK is in the referenced code. By default, the spaCy pipeline includes tokenization, lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition. It does all of that (and more) every time you call nlp on a text.
By contrast, NLTK (in the referenced code) is only tokenizing and lowercasing the corpus.
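To level the field, you can switch off the spaCy components you don't need. A minimal sketch, assuming spaCy v3 and that the en_core_web_sm model has been downloaded (component names are v3's; the doc strings are illustrative):

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
# Disable everything but tokenization, which is roughly all the
# NLTK code in the referenced post was doing.
nlp = spacy.load(
    "en_core_web_sm",
    disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"],
)

docs = ["This is a test.", "Another document."]
tokens = [[tok.text.lower() for tok in doc] for doc in nlp.pipe(docs)]
```

With the heavy components disabled, spaCy is doing work comparable to NLTK's tokenize-and-lowercase loop, and the timing comparison becomes meaningful.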
Comparing those two pipelines against each other is absurd.
We need a better framework: strip spaCy down to the work NLTK is actually doing, and run both on the same corpus. This still isn't a perfect one-to-one comparison, but it's a lot closer.
I wrote two versions of the spaCy pipeline, to reflect how I see people using nlp.pipe in the wild. The first is list(nlp.pipe(docs)). The second, more efficient way is to use nlp.pipe(docs) as the generator object that it is.
This distinction matters, because unlike Python lists, generator objects do not hold the entire corpus in memory at the same time. That means you can iteratively modify, extract content from, or write chunks of the corpus to a database. In NLP, this comes up a lot, because corpora are often very large — too large to hold in memory all at once.
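The same distinction shows up in plain Python. A stdlib-only sketch (process_doc stands in for whatever per-document work your pipeline does):

```python
import sys

def process_doc(doc):
    # Stand-in for real per-document NLP work
    return doc.lower().split()

corpus = ["Sample document number %d" % i for i in range(100_000)]

as_list = [process_doc(d) for d in corpus]   # everything held in memory at once
as_gen = (process_doc(d) for d in corpus)    # one document at a time, lazily

print(sys.getsizeof(as_gen) < sys.getsizeof(as_list))  # True: the generator is tiny

# You can still stream results -- e.g. write each processed doc
# to a database as you go, without materializing the whole corpus.
first = next(as_gen)
```

The list version pays the full memory cost up front; the generator version lets you process corpora far larger than RAM, chunk by chunk.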
Testing my three pipelines (with NLTK, spaCy-as-list, and spaCy-as-generator) on the same Reddit corpus (n=15000), here are the results according to line_profiler:
Using spaCy, without even writing your own Cython, speeds up your code more than 5x.
This makes sense: NLTK isn't optimized with Cython, and when it tags and lemmatizes tokens, it uses more expensive operations than spaCy does.
As with any code, spaCy will be slow if you don’t understand its data structures and how to wield them responsibly.
Read the source code. SpaCy relies on two main things to run fast: Cython's C internals and Python generators. Because of this, it uses complex classes and data pipelines, which means the types and methods of its objects aren't always immediately apparent.
For example, when you call nlp on a text to get a Doc object, spaCy runs compiled Cython (C) code to process your data. While you can use ordinary idioms like bracket notation to interact with the resulting Doc, they don't work the same way as base Python. Rather, the Doc class overloads Python operators to let you interact with an underlying C struct as though it were a Python object.
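A toy analogy in pure Python (this is not spaCy's implementation, just the overloading mechanism it uses): a class can define __getitem__ and __len__ so that bracket notation is routed through to internal storage the caller never touches directly.

```python
class TinyDoc:
    """Toy stand-in for spaCy's Doc: bracket notation is routed
    through __getitem__ to an internal token array."""

    def __init__(self, text):
        self._tokens = text.split()   # in spaCy, a C array of token structs

    def __getitem__(self, i):
        # spaCy constructs Token/Span objects on demand here
        return self._tokens[i]

    def __len__(self):
        return len(self._tokens)

doc = TinyDoc("spaCy overloads Python operators")
print(doc[0])      # "spaCy"
print(len(doc))    # 4
print(doc[1:3])    # ["overloads", "Python"]
```

The point is that doc[0] looks like a plain list lookup but can execute arbitrary code, which is exactly why a Doc's behavior isn't obvious until you read the source.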
The only way to really know this is to read the source code; the documentation won't spell it out for you. Apart from working through the Advanced NLP with spaCy tutorial, there's no better way to learn the ropes.
Getting comfortable reading source code will vastly improve your programming literacy and your data pipelines.
Have a plan. Make sure you have outlined the steps of your pipeline in doodles and pseudocode before you start writing it. Having a thought-out plan will help you think about how best to chain together different functions. You’ll see efficiencies in the data, ways to group functions and variables together to speed things up, and how to cast your data for ease of flow.
Before you start cythonizing everything, though, make sure you’ve profiled your code and cleaned up your Python. Even pure Python libraries like gensim can be extremely fast and memory-efficient, because they’ve scrupulously tightened up their data structures and pipelines.
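If line_profiler isn't installed, the standard library's cProfile gives a coarser but zero-install starting point. A minimal sketch (the function being profiled is hypothetical):

```python
import cProfile
import io
import pstats

def tokenize_corpus(docs):
    # Stand-in for your real preprocessing step
    return [d.lower().split() for d in docs]

docs = ["Sample doc"] * 10_000

profiler = cProfile.Profile()
profiler.enable()
tokenize_corpus(docs)
profiler.disable()

# Print the five most expensive calls by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

line_profiler then gives you the line-by-line breakdown within the hot functions cProfile identifies.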
For example, profiling revealed a block of NLTK-specific preprocessing that ran no matter whether my spacy=True argument was True or False. Moving that code to inside the NLTK part of the if-else statement stopped my spaCy model from running through that (irrelevant) block of code on every pass.
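A sketch of that refactor (all function names here are hypothetical stand-ins, not the original code):

```python
def clean_with_spacy(docs):
    # Stand-in for the spaCy pipeline
    return [d.lower().split() for d in docs]

def load_nltk_stopwords():
    # Stand-in for expensive NLTK-only setup, e.g. loading stopword
    # lists -- this used to run on EVERY call, even when spacy=True
    return {"the", "a", "an"}

def clean_with_nltk(docs, stopwords):
    return [[w for w in d.lower().split() if w not in stopwords] for d in docs]

def preprocess(docs, spacy=True):
    if spacy:
        return clean_with_spacy(docs)
    # NLTK-only setup now lives inside this branch, so the
    # spaCy path never pays for it
    stopwords = load_nltk_stopwords()
    return clean_with_nltk(docs, stopwords)

print(preprocess(["The quick fox"], spacy=False))  # [['quick', 'fox']]
```

The behavior is identical; the only change is that branch-specific setup cost is paid only on the branch that needs it.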
Thinking this way takes a little more architecting of your model, but it’s worth it to avoid waiting half an hour for your docs to process. Take the time to think through the logical structure of your model, write tests, and reflect on the output of the profiler to avoid doing sloppy science in the name of instant gratification. Anyone can make a bag of words; the finesse and nuance of NLP lie in the detail work.