A couple of months ago, I created a web app that lets users enter a query and get back wine recommendations based on semantic similarity. It was built using the TensorFlow lab Universal Sentence Encoder. When I put the tool into production, I added code that writes each user's input to my database so I can analyze the words people use to find wine. Based on what has been recorded so far, most people seem to be like me: they have little to no experience reviewing wine and don't know which words to use when searching for it. Most of the queries I've recorded are simple, two- or three-word phrases like "easy to drink." To help myself and my users, I am diving into the wine descriptions to see what I can learn about the language used to describe wine. Scroll to the bottom of the article to see the completed code.
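For readers who want to try something similar, here is a minimal sketch of how query logging and a word-count summary might look. It is not the code running in my app: the `user_queries` table, its columns, and the helper names `log_query` and `query_length_counts` are all assumptions for illustration, and apart from "easy to drink" the sample queries are made up.

```python
import sqlite3
from collections import Counter


def log_query(conn, query_text):
    """Store one raw user query (hypothetical table name and schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_queries ("
        "id INTEGER PRIMARY KEY, query_text TEXT NOT NULL)"
    )
    # Parameterized insert so user input is never spliced into the SQL string
    conn.execute("INSERT INTO user_queries (query_text) VALUES (?)", (query_text,))
    conn.commit()


def query_length_counts(conn):
    """Count how many stored queries contain 1 word, 2 words, and so on."""
    rows = conn.execute("SELECT query_text FROM user_queries").fetchall()
    return Counter(len(text.split()) for (text,) in rows)


# In-memory database so the sketch runs without touching any real file
demo_conn = sqlite3.connect(":memory:")
for sample in ["easy to drink", "dry red", "sweet white wine"]:  # illustrative queries
    log_query(demo_conn, sample)

lengths = query_length_counts(demo_conn)
print(lengths)
```

Splitting on whitespace is a rough way to count words, but it is enough to see whether queries cluster around two or three words.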
The original data can be found on Kaggle; however, the examples in this article use my engineered data. I discuss some of the data engineering in my original article, for those who are interested. To analyze the text, I'm using the WordCloud and nltk (Natural Language Toolkit) Python packages. I start by loading dependencies and checking the data:
#dependencies
import pandas as pd
import sqlite3
from sqlite3 import Error
import re

from wordcloud import WordCloud
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
#nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
#nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer

#force output to display the full description
#(None means no truncation; -1 is deprecated in newer pandas versions)
pd.set_option('display.max_colwidth', None)

#create connection to database
#(raw string so the backslash in the path is not treated as an escape)
conn = sqlite3.connect(r'db\wine_data.sqlite')
c = conn.cursor()

#create the pandas data frame
wine_df = pd.read_sql('Select title, description, rating, price, color from wine_data', conn)

#display the top 3 records from the data frame
wine_df.head(3)
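Beyond looking at the first few records, it can help to check how many rows there are and how many values are missing in each column. The sketch below does that with plain SQL. The helper name `column_summary` is my own invention for this example, and the two rows are placeholders in an in-memory table, not records from the wine data; only the table and column names come from the query above.

```python
import sqlite3


def column_summary(conn, table, columns):
    """Return the total row count and the number of NULLs per column.

    Table and column names are interpolated directly, so only pass
    trusted, hard-coded identifiers (never user input).
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    nulls = {}
    for col in columns:
        nulls[col] = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
        ).fetchone()[0]
    return total, nulls


# Placeholder table that mirrors the column names used in the query above
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE wine_data (title TEXT, description TEXT, "
    "rating INTEGER, price REAL, color TEXT)"
)
demo_conn.executemany(
    "INSERT INTO wine_data VALUES (?, ?, ?, ?, ?)",
    [
        ("Placeholder Wine A", "placeholder description", 90, 20.0, "red"),
        ("Placeholder Wine B", None, 88, None, "white"),
    ],
)

columns = ["title", "description", "rating", "price", "color"]
total_rows, null_counts = column_summary(demo_conn, "wine_data", columns)
print(total_rows, null_counts)
```

Pointing the same helper at the real connection would show whether any descriptions are missing before the text analysis starts.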