Custom Machine Learning with ML.NET

In this post, we look broadly at the capabilities of ML.NET, Microsoft's open source machine learning framework, compared to Azure Cognitive Services.

ML.NET is Microsoft’s recently released open-source, cross-platform, code-first framework for Machine Learning. Although new to us, the framework has its roots in Microsoft Research, and has been used by many internal teams over the last decade, including those working on products you have almost certainly heard of — Microsoft Windows, Office and Bing, to name a few.

ML.NET makes it possible for .NET developers to easily integrate machine learning into their applications, whether console, desktop or web. It covers the full lifecycle of ML activity, from training and evaluation of models, to use and deployment. Many typical supervised and unsupervised machine learning tasks are supported, including Classification, Regression, Recommenders and Clustering. The framework also integrates with TensorFlow, giving .NET developers the ability to invoke deep learning models (suited for tasks like object detection or speech analysis) from a familiar environment.

Why ML.NET?

These days, we are spoiled for choice when it comes to options for adding machine learning or AI capabilities to our applications. With a NuGet package and just a few lines of code, we can harness the power of Azure Cognitive Services to perform complex tasks like sentiment analysis, object detection and OCR with high levels of accuracy and performance. Microsoft really has done an incredible job at making these tools accessible to developers of all levels of experience.

How then does ML.NET fit in? You can use ML.NET to perform many of the same kinds of machine learning tasks as you can on Azure. However, as a highly configurable and code-based framework, it will certainly take more than a few lines of code. In terms of differentiation, some of the key reasons you might consider ML.NET are:

  • Training a domain-specific model:

    Many Cognitive Service models are trained on broad datasets in order to provide a good experience for a wide range of use cases. This is great for pick-up-and-play use, as well as many real-world needs. However, if you are working on a specialized problem, a general-purpose model may not be as well suited. For example, Cognitive Services will have no trouble telling you whether an image contains a hat or an animal. If you want to detect and distinguish between different kinds of hats (for example, your own hat collection) and don’t care about recognizing animals or other objects, you might benefit from training your own domain-specific model, which ML.NET allows you to do easily.

  • Keeping data within your network or on a user’s machine:

    Many Cognitive Services do allow you to train custom models, or augment the built-in ones, by providing them with your own examples. In some cases your models can also be exported and downloaded, enabling offline usage. However, for regulation or privacy reasons you may not want, or be permitted, to upload training data or send prediction inputs to a cloud provider. ML.NET can be used end to end, for both training and prediction, in an offline manner. If you need training and/or prediction data to remain internal, ML.NET is an attractive option.

  • Dynamic generation of ML models:

    As a code-first framework, ML.NET makes it quite easy to generate machine learning models dynamically, based on information not known at compile time. If your application supports dynamic content (for example, user-defined schemas) and you want to integrate ML capabilities, ML.NET is an option.

  • Modifying or extending the framework:

    As an open-source project, the full source code for ML.NET is available on GitHub, allowing you to quickly investigate implementation details, fix bugs or even add functionality, as needed.

  • Avoiding consumption-based pricing:

    ML.NET is free to use, regardless of the number of operations you perform with it. Of course, running your own systems has a cost too!
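
To make the dynamic generation point a little more concrete, here is a minimal sketch that builds a loader schema and a feature pipeline from column names supplied at run time. It assumes the Microsoft.ML package is installed. The helper names, the "Label" column convention and the choice of the SDCA regression trainer are ours for illustration only.

```csharp
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class DynamicPipelineSketch
{
    // Builds a text-loader schema from names known only at run time.
    // Assumption: every feature column is numeric and the label is the last column.
    public static TextLoader.Column[] BuildColumns(string[] featureNames)
    {
        return featureNames
            .Select((name, index) => new TextLoader.Column(name, DataKind.Single, index))
            .Append(new TextLoader.Column("Label", DataKind.Single, featureNames.Length))
            .ToArray();
    }

    // Assembles an untrained regression pipeline over the given feature columns.
    public static IEstimator<ITransformer> BuildPipeline(MLContext mlContext, string[] featureNames)
    {
        return mlContext.Transforms.Concatenate("Features", featureNames)
            .Append(mlContext.Regression.Trainers.Sdca(
                labelColumnName: "Label", featureColumnName: "Features"));
    }
}
```

The array returned by BuildColumns can be handed to mlContext.Data.LoadFromTextFile together with a file path, and the pipeline from BuildPipeline can then be fitted on the resulting data.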

Probably the biggest barrier to accessing these differentiating features is that ML.NET demands more machine learning knowledge than Azure Cognitive Services. Using ML.NET requires you to think more about things like data pre-processing, data pipelines, algorithm selection, model validation and performance metrics. While understanding these concepts will give you a solid machine learning grounding, tackling them all at once can be a bit daunting. Fortunately, the ML.NET team has put something together that can help newcomers to get started.

Bridging the Gap — AutoML and Model Builder

If you want to use ML.NET but the idea of building pipelines, selecting trainers and evaluating models has you thinking twice, there is an option for you in the form of AutoML, a companion library for ML.NET. AutoML lowers the barrier to entry for new machine learning developers by automating parts of the lifecycle and attempting to produce an optimal machine learning model for your data. Specifically, it automatically:

  • Loads training data from an SQL or text-based source

  • Performs basic pre-processing of input data, including detection of categorical fields and removal of fields that are not useful for prediction

  • Explores potential algorithms and parameters, iteratively training models and evaluating the effectiveness of each against your input data

  • (When used via the CLI or Model Builder) Generates code to load the trained optimal model, ready to provide new predictions

AutoML can be invoked from code (Install-Package Microsoft.ML.AutoML), a command line interface (dotnet tool install -g mlnet) or via a GUI tool in the form of a Visual Studio extension, Model Builder.
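
As a rough illustration of the code-based route, the sketch below runs a time-boxed AutoML regression experiment and prints the best trainer it found. It assumes the Microsoft.ML and Microsoft.ML.AutoML packages are installed. The TripRecord class, the trips.csv path and the column layout are hypothetical placeholders rather than part of the walkthrough.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical input row; the column names and order are assumptions for illustration.
public class TripRecord
{
    [LoadColumn(0)] public float Distance { get; set; }
    [LoadColumn(1)] public float DurationSecs { get; set; }
    [LoadColumn(2)] public float Fare { get; set; }
}

public static class AutoMlSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Load a comma-separated file with a header row (the path is a placeholder).
        IDataView data = mlContext.Data.LoadFromTextFile<TripRecord>(
            "trips.csv", separatorChar: ',', hasHeader: true);

        // Let AutoML explore trainers and parameters for at most 10 seconds.
        var experiment = mlContext.Auto()
            .CreateRegressionExperiment(maxExperimentTimeInSeconds: 10);
        var result = experiment.Execute(data, labelColumnName: nameof(TripRecord.Fare));

        Console.WriteLine($"Best trainer: {result.BestRun.TrainerName}");
        Console.WriteLine($"R-squared: {result.BestRun.ValidationMetrics.RSquared:F3}");
    }
}
```

The trained model itself is available as result.BestRun.Model, so it can be saved or used for predictions in the same process.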

For the remainder of this post, we’ll run through an example of using Model Builder to automatically train a machine learning model and generate the code to use it.

Walkthrough — Using Model Builder to Automatically Train a Taxi Fare Prediction Model

In this walkthrough, we’ll build a model that predicts a New York taxi fare based on inputs such as time, distance, number of passengers and payment method. We’ll use data from the ML.NET samples repository as our input.

Prerequisites:

If you don’t have Visual Studio 2017 or 2019, install one of those before attempting to install the Model Builder extension.

Step 1: Create a New Project in Visual Studio

ML.NET runs in any x86 or x64 environment that .NET Core runs in, so we could start with many of the built-in templates. In this case, we’ll create a new .NET Core console app.

Once you’ve created your project, you should see the familiar empty console app project on screen.

Step 2: Add ‘Machine Learning’ to Your Project

With the extension installed, we can invoke Model Builder by right-clicking our project in the Solution Explorer, and selecting Add -> Machine Learning. After doing this, you’ll be greeted by the ML.NET Model Builder scenario screen.

Step 3: Configure Model Builder for Your Dataset

Select Scenario

Our interaction with Model Builder starts by picking from one of a few predefined scenarios. Essentially, these are templates tailored for specific machine learning tasks. In our case, we want to predict taxi fares, so the ‘Price Prediction’ scenario is a good choice.

Load Training Data

The next task is to specify the data we want to use for training. Price prediction is an example of a supervised learning task, in which a machine learning model is trained to make predictions by being shown examples of historical data. Examples include both the model inputs (in our case, things like time, distance and number of passengers) as well as the output value (the actual fare for a trip). Later, when we want to predict a fare, our model will take the details of our new trip and use them, in conjunction with the relationships it derived from the training data, to predict a fare.

To assess the quality of a machine learning model, we typically exclude part of our historical data from training. This ensures we have some known good input/output combinations (that our model hasn’t seen) against which we can compare our model’s outputs. AutoML withholds a portion of our data automatically for this purpose, so we can provide it with our full dataset. If you completed the optional prerequisite, you should choose your concatenated dataset in the Select a file dialog. Otherwise, you can paste in the URL for the training data. The benefit of using the concatenated dataset is that you will provide a larger body of training data to AutoML.
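
If you work against the API directly rather than through Model Builder, a holdout of this kind can be sketched with ML.NET's TrainTestSplit. The Trip class, the synthetic rows and the 20% test fraction are illustrative assumptions. How AutoML chooses its own validation split is not shown here.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type used only for this sketch.
public class Trip
{
    public float Distance { get; set; }
    public float Fare { get; set; }
}

public static class HoldoutSketch
{
    // Splits synthetic rows into train and test sets and returns the size of each.
    public static (long Train, long Test) SplitCounts(int rowCount, double testFraction)
    {
        var mlContext = new MLContext(seed: 1);

        // Synthetic rows standing in for historical trips.
        var rows = Enumerable.Range(1, rowCount)
            .Select(i => new Trip { Distance = i, Fare = 2.5f + 2f * i });
        var data = mlContext.Data.LoadFromEnumerable(rows);

        // Reserve a fraction of the rows for evaluation; training never sees them.
        var split = mlContext.Data.TrainTestSplit(data, testFraction: testFraction);

        long train = mlContext.Data
            .CreateEnumerable<Trip>(split.TrainSet, reuseRowObject: false).LongCount();
        long test = mlContext.Data
            .CreateEnumerable<Trip>(split.TestSet, reuseRowObject: false).LongCount();
        return (train, test);
    }

    public static void Main()
    {
        var (train, test) = SplitCounts(rowCount: 100, testFraction: 0.2);
        Console.WriteLine($"train rows: {train}, test rows: {test}");
    }
}
```

The split is random, so the test set holds roughly, not exactly, the requested fraction of rows.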

After loading the file, Model Builder will automatically detect columns and provide a preview of the data. We need to tell Model Builder which field we want to predict; in our case this is the ‘fare_amount’ field.

Step 4: Use Model Builder to Generate an Optimal Model

Train an Optimized Model

Model Builder uses AutoML to iteratively explore options and determine the optimal prediction algorithm and parameters for a given dataset. The upper bound on iteration time is up to us, and should primarily be influenced by the size of the training dataset.

The ML.NET team has some guidelines on iteration durations for various dataset sizes; for our dataset (between 2.5 MB and 5 MB, depending on whether you concatenated the test and train data), just ten seconds should be adequate. After clicking ‘Train’, Model Builder will begin to iterate on models and display a few details about its progress. Model Builder evaluates each model it trains and uses the model’s R-squared score as the mechanism for comparing them.
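
For reference, R-squared compares a model's squared prediction error with the error of always predicting the mean of the actual values: 1.0 is a perfect fit, and 0 is no better than predicting the mean. The helper below is a plain C# sketch of that standard definition, not Model Builder's internal code, and the sample fares are made up.

```csharp
using System;
using System.Linq;

public static class RSquaredSketch
{
    // R^2 = 1 - (sum of squared residuals) / (total sum of squares around the mean).
    public static double RSquared(double[] actual, double[] predicted)
    {
        if (actual.Length != predicted.Length || actual.Length == 0)
            throw new ArgumentException("Inputs must be non-empty and the same length.");

        double mean = actual.Average();
        double residual = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double total = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - residual / total;
    }

    public static void Main()
    {
        var actualFares = new[] { 10.0, 12.0, 14.0, 16.0 };    // made-up values
        var predictedFares = new[] { 11.0, 12.0, 13.0, 16.0 }; // made-up values
        Console.WriteLine(RSquared(actualFares, predictedFares));
    }
}
```

Note that the result is undefined when every actual value is identical, because the total sum of squares is then zero.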

Review Model Performance

After performing the optimization, Model Builder provides an overview of the process, including the evaluation metrics of the best five configurations it was able to produce within the iteration time.

Although Model Builder automatically selects the model with the best result, it is worth taking a moment to review the final metrics. If the metrics of the selected model are not good, it is unlikely to perform well on new inputs. In a situation like this, you may need to iterate on the model training process. Options might include:

  • Increasing the exploration time for AutoML (allow it to find a better algorithm or parameters)
  • Increasing the amount of training data (provide more examples that better represent the variability of your domain)
  • Preprocessing training data (expose new features that could increase predictability, or remove those that might not)

In our case, the best model was produced using the LightGbmRegression trainer and yielded an R-squared score of 0.94, which suggests it should perform well.

Step 5: Use the Model

After evaluation, Model Builder will automatically add two new projects to your solution. The first is a library containing the model and input classes that can be referenced by your existing project. The second is a sample console application with code that demonstrates how to load and use the model.

With these two projects generated, we’re ready to see the model in action. The sample application uses a hard-coded single input from your training dataset to demonstrate model usage. To make it more interactive, you can replace the contents of Program.cs with the code below, which lets you enter trip details and receive a predicted fare:

using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using PredictTaxiFareML.Model.DataModels;
using static System.Console;
using static System.Environment;

namespace PredictTaxiFareML.ConsoleApp
{
    class Program
    {
        private const string Quit = "quit";
        private const string ModelPath = @"MLModel.zip";
        
        static void Main(string[] args)
        {
            var context = new MLContext().Model;
            var model = context.Load(GetPath(ModelPath), out _);
            var engine = context.CreatePredictionEngine<ModelInput, ModelOutput>(model);

            WriteLine("== AutoML Interactive Taxi Fare Predictor == ");
            while (GetInput(out var input))
                WriteLine($"{NewLine}Predicted fare: " +
                 $"{engine.Predict(input).Score:C}{NewLine}");
        }

        private static bool GetInput(out ModelInput input)
        {
            WriteLine($"{NewLine}Enter trip details:{NewLine}");

            input = new ModelInput
            {
                Passenger_count = ReadF("Passenger count", 1),
                Trip_time_in_secs = ReadF("Trip time (mins)", 1) * 60,
                Trip_distance = ReadF("Distance (mi)", 0),
                Vendor_id = ReadCat("Vendor", "VTS", "CMT"),
                Rate_code = ReadF("Rate code (0 - 6)", 0, 6),
                Payment_type = ReadCat("Payment type", "CRD", "CSH"),
            };

            return true;
        }

        private static float ReadF(string title, 
            float min = float.MinValue, float max = float.MaxValue)
        {
            while (true)
            {
                try { return Clamp(float.Parse(Prompt(title)), min, max); } 
                catch (Exception ex) { WriteLine(ex.Message); }
            }
        }

        private static string ReadCat(string title, params string[] values)
        {
            title = $"{title} [{String.Join(", ", values)}]";

            var ret = "";
            while (!values.Contains(ret))
                ret = Prompt(title);

            return ret;
        }

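        private static string Prompt(string title)
        {
            Write($"  - {title} ('{Quit}' to exit): ");

            // ReadLine returns null at end of input; treat that like an explicit quit.
            var line = ReadLine()?.Trim();
            if (line == null || line.Equals(Quit, StringComparison.OrdinalIgnoreCase))
                Environment.Exit(0);

            return line.ToUpper();
        }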

        private static float Clamp(float input, float min, float max)
        {
            var ret = Math.Max(Math.Min(input, max), min);

            if (Math.Abs(ret - input) > 0.1)
                WriteLine($"Clamping to {ret}");

            return ret;
        }

        private static string GetPath(string relativePath)
        {
            var root = new FileInfo(typeof(Program).Assembly.Location);
            var asmPath = root.Directory.FullName;

            return Path.Combine(asmPath, relativePath);
        }
    }
}

Wrapping Up

And that’s it! We’ve successfully used Model Builder to automatically generate an optimized model for prediction from our taxi fare dataset. AutoML handled some of the thornier steps for us automatically, letting us benefit from some of the unique features of ML.NET without needing to be a machine learning expert. Hopefully this walkthrough helps to demystify ML.NET a little, and gives you the inspiration to try creating custom models on some of your own data too.
