Note: I was not sponsored by Knime or anyone else to write this article.
I’m very resistant to point and click solutions. And I think my resistance is in good faith and for good reasons. We’ve waited a long time for drag and drop solutions for web apps, mobile apps, and a whole host of other things.
But fundamentally, I think a solution like Knime is perfect for letting the user dial in exactly as much flexibility or simplification as a project calls for. For me, boxing up all the steps of a Data Science workflow has brought a whole new level of organization to my projects.
Here’s a really great article expanding on this point: To Code or not to Code.
The Knime community has an excellent collection of custom workflows and nodes known as Knime Hub.
Drag and drop entire workflows or nodes from Knime Hub into your workspace…
For the purpose of this article I’ve taken the pre-built workflow from above. Let’s break it down step by step…
Use the File Reader node to start the flow from the CSV. Straightforward.
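In plain Python terms, this node does little more than load the CSV into a table. A minimal sketch (the column names here are hypothetical stand-ins; the real IMDB review file differs):

```python
import csv
import io

# Tiny inline stand-in for the CSV the File Reader node would load.
raw = io.StringIO(
    "text,sentiment\n"
    "A wonderful film,Positive\n"
    "Dull and far too long,Negative\n"
)

# Each row becomes a dict keyed by the header columns.
rows = list(csv.DictReader(raw))
print(rows[0]["sentiment"])  # -> Positive
```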
Simply put, these next nodes are the preprocessing necessary to make the data model-ready. We’re producing word counts for every text section, with a label in the last column for “Positive” or “Negative” sentiment.
We take our strings to docs, extract the word counts from each doc and create the bit vectors, and then finally we encode our labels.
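To make the pipeline concrete, here is a rough pure-Python sketch of those three steps: count words per document, turn the counts into bit vectors over a shared vocabulary, and encode the string labels as numbers. This is only an illustration of the idea, not how the Knime nodes are implemented.

```python
from collections import Counter

docs = ["a wonderful wonderful film", "dull and far too long"]
labels = ["Positive", "Negative"]

# Strings to docs -> word counts: tally the words in each document.
counts = [Counter(doc.split()) for doc in docs]

# A shared vocabulary across all documents.
vocab = sorted({word for c in counts for word in c})

# Bit vectors: 1 if the word occurs in the document, else 0.
bit_vectors = [[1 if word in c else 0 for word in vocab] for c in counts]

# Encode the labels: e.g. Negative -> 0, Positive -> 1.
label_map = {"Negative": 0, "Positive": 1}
encoded = [label_map[label] for label in labels]
```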
The first step here is the partition. This is our typical training/testing split. We can right click our node and define our train/test split %.
As you can see, we choose a 70% split and use random sampling.
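What the Partitioning node does here can be sketched in a few lines: shuffle the row indices and cut at 70%. (The seed and the placeholder table are assumptions for the sketch.)

```python
import random

rows = list(range(100))  # stand-in for the preprocessed table

# Random sampling: shuffle indices with a fixed seed for reproducibility.
rng = random.Random(42)
indices = list(range(len(rows)))
rng.shuffle(indices)

# 70% of rows go to training, the remainder to testing.
cut = int(len(rows) * 0.70)
train = [rows[i] for i in indices[:cut]]
test = [rows[i] for i in indices[cut:]]
```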
If you pay attention, you can see that there are two lines coming out of the partition node. The top line represents the training data, flowing from the partition to the learner. The bottom line represents the test data that will flow into the predictor, which will perform inference after the learner node has trained our model.
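The learner/predictor split maps directly onto the familiar fit-then-predict pattern. As a toy stand-in for the Decision Tree Learner and Predictor, here is a one-level "decision stump" trained on bit vectors; the data and helper names are invented for illustration:

```python
def learn_stump(X, y):
    """Learner: pick the single 0/1 feature that best predicts the label."""
    best = None
    for f in range(len(X[0])):
        correct = sum(1 for row, label in zip(X, y) if row[f] == label)
        acc = max(correct, len(y) - correct) / len(y)
        invert = correct < len(y) - correct  # predict the opposite value
        if best is None or acc > best[0]:
            best = (acc, f, invert)
    _, feature, invert = best
    return feature, invert

def predict(model, X):
    """Predictor: apply the learned rule to unseen rows."""
    feature, invert = model
    return [(1 - row[feature]) if invert else row[feature] for row in X]

# Hypothetical training data: feature 0 happens to predict the label.
X_train = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_train = [1, 1, 0, 0]

model = learn_stump(X_train, y_train)
print(predict(model, [[1, 0], [0, 1]]))  # -> [1, 0]
```

A real decision tree repeats this kind of split recursively, which is what makes it so interpretable when you inspect it in Knime.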
We can also view some information about our model, in this case the highly interpretable Decision Tree…
We run the model and branch off to both an ROC Curve and a generic scorer.
And from our confusion matrix you can see the model performs pretty well!
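The scorer’s output boils down to tallying predictions against actual labels. A minimal sketch, with made-up example labels:

```python
# Build a 2x2 confusion matrix and an accuracy figure from
# predicted vs. actual sentiment labels (example data only).
actual    = ["Positive", "Positive", "Negative", "Negative", "Positive"]
predicted = ["Positive", "Negative", "Negative", "Negative", "Positive"]

classes = ["Positive", "Negative"]
matrix = {a: {p: 0 for p in classes} for a in classes}
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

# Accuracy is the diagonal of the matrix over the total count.
accuracy = sum(matrix[c][c] for c in classes) / len(actual)
print(accuracy)  # -> 0.8
```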
Obviously we can’t just stop there. Maybe if we’re in a lab we stop there. But as people looking to bring value to a business, there’s a deliverable we have to meet here. Maybe we run a workflow daily and insert the results for new IMDB movie reviews into a database? Maybe we hit some endpoint with a POST request? Maybe we visualize the results and send them in an email? If you can imagine it, Knime probably has it…
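For the POST-request idea, the payload step might look like the sketch below. The endpoint URL, field names, and payload shape are all hypothetical; in Knime this would be a request or database-writer node configured in the GUI rather than hand-written code.

```python
import json

# Hypothetical daily results to ship downstream.
results = [
    {"review_id": 101, "sentiment": "Positive"},
    {"review_id": 102, "sentiment": "Negative"},
]
payload = json.dumps({"model": "decision-tree", "predictions": results})

# The actual request (commented out; example.com is a placeholder):
# import urllib.request
# req = urllib.request.Request(
#     "https://example.com/api/sentiment",
#     data=payload.encode(),
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```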