You might think it is very common to remove stop words from text during preprocessing. Yes, I agree, but you should be careful about which stop words you remove.
The most common way to remove stop words is to use NLTK's stopwords list.
Let's look at the list of stop words from NLTK.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Now, look at words in this list such as 'no', 'nor', 'not', 'very' and the "n't" contractions.
So, the question is: what is wrong with removing them?
Let's imagine you are asked to create a model that performs sentiment analysis on product reviews. The dataset is small enough that you label it yourself. Consider a few reviews from the dataset.
1. The product is really very good. - POSITIVE
2. The products seems to be good. - POSITIVE
3. Good product. I really liked it. - POSITIVE
4. I didn't like the product. - NEGATIVE
5. The product is not good. - NEGATIVE
You preprocess the data and remove all stop words.
Now, let us look at what happens to the sample we selected above.
1. product really good. - POSITIVE
2. products seems good. - POSITIVE
3. Good product. really liked. - POSITIVE
4. like product. - NEGATIVE
5. product good. - NEGATIVE
Positive feedback doesn't seem to be affected, but look at the negative feedback. Its whole meaning has changed: reviews 4 and 5 now read like positive reviews while still carrying a NEGATIVE label. If we train our model on this data, it is likely to underperform.
This happens very often: after removing stop words, the whole meaning of a sentence changes.
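The effect described above can be reproduced with a short sketch. It is only an illustration: the stop word set is a small hand-copied subset of the NLTK list shown earlier (so the code runs without downloading the NLTK corpus), and `tokenize` and `remove_stop_words` are hypothetical helper names, not part of any library.

```python
import re

# Assumption: a small subset of the NLTK English stop word list,
# copied by hand so that no corpus download is needed.
STOP_WORDS = {"i", "it", "the", "is", "to", "be", "very", "not", "didn't"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    normalized = text.lower().replace("’", "'")
    return re.findall(r"[a-z']+", normalized)

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the text with every token found in `stop_words` dropped."""
    return " ".join(t for t in tokenize(text) if t not in stop_words)

reviews = [
    "The product is really very good.",
    "The products seems to be good.",
    "Good product. I really liked it.",
    "I didn't like the product.",
    "The product is not good.",
]

cleaned = [remove_stop_words(r) for r in reviews]
for original, result in zip(reviews, cleaned):
    print(f"{original!r} -> {result!r}")
```

The two negative reviews lose exactly the tokens that made them negative, because "not" and "didn't" are both on the list.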
If you are working with basic NLP techniques like bag of words (BOW), Count Vectorizer or TF-IDF (Term Frequency-Inverse Document Frequency), then removing stopwords is a good idea because stopwords act like noise for these methods. If you are working with LSTMs or other models that capture semantic meaning, where the meaning of a word depends on the context of the preceding text, then it becomes important not to remove stopwords.
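To make the "noise" point concrete, here is a minimal bag-of-words sketch built on the standard library's `collections.Counter` rather than scikit-learn's CountVectorizer. The `NOISE_WORDS` set and the `bag_of_words` helper are assumptions made for this illustration; the set contains only a few pronouns, articles and auxiliaries from the NLTK list and deliberately leaves negations alone.

```python
import re
from collections import Counter

# Assumption: a hand-picked set of low-information words from the NLTK list.
# Negations such as "not" are intentionally excluded from it.
NOISE_WORDS = {"i", "me", "my", "it", "the", "a", "an", "is", "to", "be"}

reviews = [
    "The product is really very good.",
    "The products seems to be good.",
    "Good product. I really liked it.",
    "I didn't like the product.",
    "The product is not good.",
]

def bag_of_words(text, stop_words=frozenset()):
    """Count lowercase tokens in `text`, skipping tokens in `stop_words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Corpus-level counts with and without the noise words.
raw_counts = sum((bag_of_words(r) for r in reviews), Counter())
filtered_counts = sum((bag_of_words(r, NOISE_WORDS) for r in reviews), Counter())

print("tokens before:", sum(raw_counts.values()))
print("tokens after: ", sum(filtered_counts.values()))
print("'not' kept:   ", filtered_counts["not"] > 0)
```

In this tiny corpus a large share of the counted tokens are words like "the" and "is", which carry no sentiment, while "not" survives the filtering because it was never put on the list.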
Now, coming back to my original question: does removing stopwords really improve model performance?
As I said earlier, it depends on which stopwords you are removing. The problem is that if you do not remove any stop words, noise in the dataset increases because of words like I, my, me, etc.
So, what's the solution? You could create a new list of appropriate stop words, but then the problem is reusing it across different projects.
This is why I've created a Python package, nlppreprocess, which removes only the stop words that are not necessary. It also has some additional functionality that can speed up text cleaning.
The best way to use its functionality is to connect it with pandas.
You can check its complete documentation on the page itself.
Now, if we use this package to preprocess the above samples, we'll get something like this:
1. product really very good. - POSITIVE
2. products seems good. - POSITIVE
3. Good product. really liked. - POSITIVE
4. not like product. - NEGATIVE
5. product not good. - NEGATIVE
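The general idea behind this result can be sketched in a few lines. This is not the nlppreprocess implementation or its API. It is a minimal illustration that assumes a hand-copied subset of the NLTK list, a hypothetical `KEEP_WORDS` set of words to protect, and a naive expansion of "n't" contractions (so "didn't" becomes "did not", but irregular forms such as "won't" are not handled correctly).

```python
import re

# Assumption: a small subset of the NLTK English stop word list.
NLTK_SUBSET = {"i", "it", "the", "is", "to", "be", "did", "very", "not", "no", "nor"}

# Hypothetical set of words that carry sentiment and should survive cleaning.
KEEP_WORDS = {"not", "no", "nor", "very"}

CUSTOM_STOP_WORDS = NLTK_SUBSET - KEEP_WORDS

def preprocess(text):
    """Expand "n't", tokenize, and drop only the custom stop words."""
    text = text.lower().replace("’", "'")
    # Naive contraction handling: "didn't" -> "did not".
    text = re.sub(r"n't\b", " not", text)
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(t for t in tokens if t not in CUSTOM_STOP_WORDS)

reviews = [
    "The product is really very good.",
    "The products seems to be good.",
    "Good product. I really liked it.",
    "I didn't like the product.",
    "The product is not good.",
]

for review in reviews:
    print(preprocess(review))
```

Expanding the contraction before filtering is what lets "did" be removed as a stop word while "not" stays in the negative reviews.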
Now, it seems reasonable to use this package for the removal of stopwords and other preprocessing.
Let me know your opinion on this in the comments section.