Analyzing Wine Descriptions using the Natural Language Toolkit in Python

Describing Wine for the Layman

A couple of months ago, I created a web app that lets users enter a query and returns wine recommendations based on semantic similarity. It was built using the TensorFlow lab Universal Sentence Encoder. When I put the tool into production, I added code that writes the user’s input to my database so I can analyze the words people use to find wine. Based on what has been recorded so far, most people seem to be like me: I have little-to-no experience reviewing wine, and I don’t know which words to use when searching for it. Most of the queries I’ve recorded are simple and only two or three words long, like “easy to drink.” To help myself and my users, I am diving into the wine descriptions to see what I can learn about the language used to describe wine. Scroll to the bottom of the article to see the completed code.
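For readers curious what that logging step might look like, here is a minimal sketch. It is not the app's actual code: the table name `user_queries`, the `log_query` helper, and the sample queries are all hypothetical, and it uses an in-memory SQLite database so it can run on its own.

```python
import sqlite3
from collections import Counter

# Hypothetical schema; the app's real table and column names are not shown in this article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_queries (query TEXT NOT NULL)")


def log_query(connection, text):
    """Store one raw user query using a parameterized insert."""
    connection.execute("INSERT INTO user_queries (query) VALUES (?)", (text,))
    connection.commit()


# Made-up sample inputs, for illustration only
for q in ["easy to drink", "sweet red", "dry white wine"]:
    log_query(conn, q)

# Count how many recorded queries have each word length
rows = conn.execute("SELECT query FROM user_queries").fetchall()
length_counts = Counter(len(query.split()) for (query,) in rows)
print(length_counts)
```

Splitting on whitespace is a rough way to count words, but it is enough to see whether short queries dominate.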

Data and Dependencies

The original data can be found on Kaggle; however, the examples in this article use my engineered data. I discuss some of the data engineering in my original article, for those who are interested. To analyze the text, I’m using the WordCloud and nltk (Natural Language Toolkit) Python packages. I start by loading dependencies and checking the data:

import pandas as pd
import sqlite3
from sqlite3 import Error
import re

from wordcloud import WordCloud
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer

#force output to display the full description (None disables truncation)
pd.set_option('display.max_colwidth', None)

#create connection to database (forward slash avoids the invalid '\w' escape)
conn = sqlite3.connect('db/wine_data.sqlite')
c = conn.cursor()

#create the pandas data frame
wine_df = pd.read_sql('Select title, description, rating, price, color from wine_data', conn)

#display the top 3 records from the data frame