Automate a Data Science Workflow — Movie Reviewer Sentiment Analysis

Organize jobs that can be executed at the click of a button.

Dec 5 · 4 min read


Note: I was not sponsored by Knime or anyone else to write this article.

I’m very resistant to point-and-click solutions, and I think that resistance is in good faith and for good reason. We’ve waited a long time for drag-and-drop solutions for web apps, mobile apps, and a whole host of other things.

But fundamentally, I think a solution like Knime lets the user strike exactly the right balance between flexibility and simplification. For me, boxing up all the steps of a data science workflow has brought a whole new level of organization to my projects.

Here’s a really great article expanding on this point: To Code or not to Code.

Knime brings excellent top-level explainability to your workflow.

The Knime community has an excellent collection of custom workflows and nodes known as Knime Hub.

Drag and drop entire workflows or nodes from Knime Hub into your workspace…

For the purpose of this article I’ve taken the pre-built workflow from above. Let’s break it down step by step…

1.) File Reader

Use the File Reader node to start the flow from the CSV. Straightforward.
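In code, this node is roughly equivalent to loading the CSV into a table with pandas. This is only a sketch: the column names `Text` and `Sentiment` are assumptions for illustration, not taken from the actual workflow file.

```python
import io
import pandas as pd

# Stand-in for the reviews CSV; in the real workflow the File Reader node
# points at the IMDB reviews file on disk. Column names are hypothetical.
csv_data = io.StringIO(
    "Text,Sentiment\n"
    "A wonderful film,Positive\n"
    "Terribly boring,Negative\n"
)

df = pd.read_csv(csv_data)  # one row per review, label in the last column
```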

2.) Preprocessing

Simply put, these next nodes are the preprocessing necessary to make the data model-ready. Basically, we’re just producing word counts for every text section, with a label in the last column for “Positive” or “Negative” sentiment.

We convert our strings to documents, extract the word counts from each document to create the bit vectors, and finally encode our labels.
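The preprocessing nodes above can be sketched with scikit-learn, assuming that ecosystem as a stand-in for the Knime nodes: a binary bag-of-words (`CountVectorizer(binary=True)`) plays the role of the bit vectors, and `LabelEncoder` handles the label encoding. The toy texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the review texts and sentiment labels.
texts = ["a wonderful film", "terribly boring film"]
labels = ["Positive", "Negative"]

# Binary bag-of-words: each entry is 1 if the term occurs in the document,
# 0 otherwise -- analogous to the workflow's bit vectors.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)

# Encode "Positive"/"Negative" as integers for the learner.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)
```

Note that `CountVectorizer`’s default tokenizer drops single-character tokens such as “a”, so the vocabulary here is just the four longer words.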

3.) Model Initialization and Training.

The first step here is the partition. This is our typical training/testing split. We can right-click the node and define our train/test split percentage.

As you can see, we choose a 70% split and use random sampling.

If you pay attention, you can see that there are two lines coming out of the partition node. The top line represents the training data, flowing from the partition to the learner. The bottom line represents the test data that will flow to the predictor, which performs inference after the learner node has trained our model.

We can also view some information about our model, in this case the highly interpretable Decision Tree…
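The partition-and-learn stage can be sketched in scikit-learn as a 70/30 random split followed by fitting a decision tree. The random binary features below are purely hypothetical placeholders for the bit vectors from the preprocessing stage.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical bit vectors and labels standing in for the preprocessed reviews.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 10))
y = rng.integers(0, 2, size=100)

# 70/30 random split, mirroring the Partitioning node's settings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# The Decision Tree Learner node's rough equivalent.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
```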

4.) Test the model and review the results.

We run the model and branch off to both an ROC Curve and a generic scorer.

And from our confusion matrix you can see the model performs pretty well!
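The scorer’s output can be reproduced with scikit-learn’s metrics. The labels and predictions below are hypothetical stand-ins for the predictor node’s output, just to show the shape of the result.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions (1 = Positive, 0 = Negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted
acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
```

For the ROC curve you would feed the predictor’s class probabilities (rather than hard labels) into `sklearn.metrics.roc_curve`.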

5.) And what might we do with this outcome?

Obviously we can’t just stop there. Maybe if we’re in a lab we stop there. But as people looking to bring value to a business, there’s a deliverable we have to meet. Maybe we run the workflow daily and insert the results for new IMDB movie reviews into a database? Maybe we hit some endpoint with a POST request? Maybe we visualize the results and send them in an email? If you can imagine it, Knime probably has it…