Regressions with Azure Machine Learning

image

 

In today’s post, I’ll explore how you can use Azure Machine Learning to perform regression analysis on a dataset.

Regression analysis is a data mining process that identifies a model describing the relationship between an outcome and one or more predictors. To explore this idea, the first thing we need is data!

 

The Dataset

 

One compelling source of data is the NYC Open Data portal. The portal features datasets covering a wide range of topics and categories related to New York City. Information about schools, restaurant ratings, potholes, and many more topics is available via the portal.

While browsing the data, I came across the 311 Service Requests dataset. This dataset contains all the 311 service requests from 2010 to present. If you are not familiar with 311, think of it as a customer service line for all non-emergency requests to New York City’s government. So I decided to upload the dataset to Azure ML Studio and use the visualization tools to find the most frequent complaint among vocal New Yorkers.

 

Note: To upload a dataset in CSV format, you will need to create an ML workspace via the Azure Portal, sign in to ML Studio, and create a new Experiment. More information here.

For more information about how to visualize your data, see here.

 

To my surprise, I learned that the most frequent complaints in 2013 were heating related.

 

image
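
Outside ML Studio, the same frequency count could be sketched in a few lines of Python. This is only an illustration: the rows below are made up, and the column name "Complaint Type" and the helper `top_complaints` are assumptions, not something taken from the experiment.

```python
from collections import Counter

# Hypothetical sample rows; the real export has many more rows and columns.
# "Complaint Type" is an assumed column name.
rows = [
    {"Created Date": "01/03/2013", "Complaint Type": "HEATING"},
    {"Created Date": "01/03/2013", "Complaint Type": "Street Condition"},
    {"Created Date": "01/04/2013", "Complaint Type": "HEATING"},
    {"Created Date": "01/04/2013", "Complaint Type": "Noise"},
    {"Created Date": "01/05/2013", "Complaint Type": "HEATING"},
]

def top_complaints(records, column="Complaint Type", n=3):
    """Return the n most frequent values of `column` with their counts."""
    counts = Counter(record[column] for record in records)
    return counts.most_common(n)

top = top_complaints(rows)
print(top)
```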

 

The Predictor

I thought it’d be interesting to create a model to predict the number of heating complaints. I think you would agree that there must be a correlation between the temperature and the number of these complaints. So next, I needed to find the average temperature at the time each complaint was created. Fortunately, there’s a dataset for that: the NOAA National Climatic Data Center.

After creating a database on Azure, importing the data, and a bit of T-SQL aerobics (joining, data cleaning, etc.), I ended up with a dataset containing two columns: the number of heating complaints and the average temperature when they occurred. I added a Reader module (to read data from my Azure SQL table) to my experiment and removed the original dataset. The figure below shows the visualization of the output of the Reader module.

 

image
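
The join itself was done in T-SQL, which is not shown here. As a rough Python equivalent, the sketch below counts complaints per day and joins them to a daily temperature table. It assumes the grouping is by calendar day; the dates, temperatures, and the helper `build_dataset` are all made up for illustration.

```python
from collections import Counter

# Hypothetical inputs: one creation date per heating complaint, and one
# average temperature (degrees F) per day. All values are made up.
complaint_dates = ["2013-01-03", "2013-01-03", "2013-01-03", "2013-01-04", "2013-07-01"]
avg_temp_by_day = {"2013-01-03": 24.5, "2013-01-04": 30.0, "2013-07-01": 78.0}

def build_dataset(dates, temps):
    """Count complaints per day, then inner-join with the daily temperatures."""
    complaints_per_day = Counter(dates)
    return [
        {"AvgDailyTemp": temps[day], "HeatingComplaints": count}
        for day, count in sorted(complaints_per_day.items())
        if day in temps  # drop days with no temperature reading
    ]

dataset = build_dataset(complaint_dates, avg_temp_by_day)
for row in dataset:
    print(row)
```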

 

Linear Regression

Let’s start with the simplest regression: linear regression. In short, a linear model (identified through the process of regression) assumes that there’s a linear relationship between the predictor (AvgDailyTemp) and the outcome (HeatingComplaints): the predicted outcome is an intercept plus a constant multiple of the predictor. So let’s train a linear model using Azure ML and our dataset.

The modeling process in Azure ML, at a high level, consists of finding the parameters of your model from your dataset (Train Model), checking the model’s predictions against a dataset (Score Model), and then obtaining key quality metrics for the results or comparing against another model (Evaluate Model).

To do this, you need to drop the Train Model, Score Model, Evaluate Model and Linear Regression modules into the experiment and connect them as depicted below.

 

Note: You can fine-tune the linear regression process by changing the Solution Method (Ordinary Least Squares or Online Gradient Descent), whether to include an intercept, and other parameters. For more information, see here.

 

image
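
To give a feel for the two solution methods named in the note, here is a minimal sketch that fits the same line both ways: once with the closed-form least squares formulas, and once by updating the parameters one example at a time. It is a generic textbook version, not Azure ML’s implementation; the data, the learning rate, the number of epochs, and the function names are all assumptions chosen for illustration.

```python
# Made-up points that lie exactly on y = 100 - 2x, so both methods
# should recover roughly intercept 100 and slope -2.
xs = [10.0, 20.0, 30.0, 40.0, 50.0]
ys = [100.0 - 2.0 * x for x in xs]

def fit_ols(xs, ys):
    """Closed-form least squares for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_online_gd(xs, ys, learning_rate=0.05, epochs=500):
    """Gradient descent that updates the parameters after every example."""
    n = len(xs)
    mean_x = sum(xs) / n
    std_x = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
    intercept_z, slope_z = 0.0, 0.0  # parameters on the standardized scale
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = (x - mean_x) / std_x  # standardizing x keeps the steps stable
            error = (intercept_z + slope_z * z) - y
            intercept_z -= learning_rate * error
            slope_z -= learning_rate * error * z
    # Convert the parameters back to the original scale of x.
    slope = slope_z / std_x
    return intercept_z - slope * mean_x, slope

print("OLS:      ", fit_ols(xs, ys))
print("Online GD:", fit_online_gd(xs, ys))
```

On data this clean, both approaches land on essentially the same line. The iterative version mainly pays off when the dataset is too large to process in one pass.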

 

Next, you need to configure the Train Model module to select the HeatingComplaints column; this is how you tell Azure ML what the outcome of your model should be.

 

image
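
For readers who think in code, the train, score, and evaluate steps can be mimicked in plain Python. This is a sketch under stated assumptions: the data points are made up, the three function names simply echo the module names, and the metrics (mean absolute error, root mean squared error, and R squared) are common regression metrics chosen here for illustration. Like the experiment described in this post, it scores the same rows it was trained on.

```python
# Made-up (AvgDailyTemp, HeatingComplaints) pairs, for illustration only.
data = [(20.0, 900.0), (30.0, 650.0), (40.0, 420.0), (50.0, 230.0), (60.0, 90.0)]

def train_model(rows):
    """'Train Model': estimate intercept and slope by least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return {"intercept": mean_y - slope * mean_x, "slope": slope}

def score_model(model, rows):
    """'Score Model': attach a predicted value to every row."""
    return [(x, y, model["intercept"] + model["slope"] * x) for x, y in rows]

def evaluate_model(scored):
    """'Evaluate Model': mean absolute error, RMSE, and R squared."""
    n = len(scored)
    mean_y = sum(y for _, y, _ in scored) / n
    sse = sum((y - pred) ** 2 for _, y, pred in scored)
    sst = sum((y - mean_y) ** 2 for _, y, _ in scored)
    return {
        "mae": sum(abs(y - pred) for _, y, pred in scored) / n,
        "rmse": (sse / n) ** 0.5,
        "r2": 1.0 - sse / sst,
    }

model = train_model(data)
scored = score_model(model, data)
metrics = evaluate_model(scored)
print(model)
print(metrics)
```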

 

 

So let’s see how the model did.  The following image shows Azure ML’s visualization of the predicted values next to the scatter plot of the actual values.

 

image

 

Poisson Regression and Log Scale

 

From the scatter plot of the actual values, you can tell that the plot resembles an exponential function. However, having an exponential model means that we would lose the ease of fitting that comes with the linear model. Fortunately, we can have both. We do this by transforming a component of our model so that, instead of the number of heating complaints, the model predicts the logarithm of the number of heating complaints. Let’s plot that scenario.

 

image
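
The effect of the transformation can be checked numerically. In the sketch below, the counts are made up to follow an exact exponential curve, and `fit_line` is a hypothetical helper. Fitting a straight line to the log of the counts recovers the curve’s parameters, and exponentiating a prediction brings it back to the original scale.

```python
import math

# Made-up counts that decay exponentially with temperature:
# complaints = 2000 * exp(-0.05 * temp).
temps = [20.0, 30.0, 40.0, 50.0, 60.0]
complaints = [2000.0 * math.exp(-0.05 * t) for t in temps]

def fit_line(xs, ys):
    """Least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Fit the line on the log scale instead of the raw counts.
log_complaints = [math.log(c) for c in complaints]
intercept, slope = fit_line(temps, log_complaints)

def predict(temp):
    """Map a log-scale prediction back to a complaint count."""
    return math.exp(intercept + slope * temp)

print(intercept, slope)
print(predict(40.0), complaints[2])
```

One caveat: the log is only defined for positive counts, so any rows with zero complaints would need separate handling before applying this transformation.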

 

As you can see from the plot, once we apply the transformation, the trend is linear. So we can either go back to our data and apply the log function to the HeatingComplaints column, or we can use Poisson regression, which models the natural log of the expected outcome as a linear function of the predictor. Your experiment should look similar to the picture below.

 

 image
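
To make the idea behind Poisson regression concrete, here is a minimal sketch that fits log(expected complaints) = intercept + slope * temperature by maximizing the Poisson likelihood with Newton’s method. It is a generic textbook version, not the algorithm used inside the Azure ML module; the counts are made up, and `fit_poisson` and its iteration count are assumptions.

```python
import math

# Made-up data: counts roughly follow 2000 * exp(-0.05 * temp).
temps = [20.0, 30.0, 40.0, 50.0, 60.0]
counts = [736, 446, 271, 164, 100]

def fit_poisson(xs, ys, iterations=25):
    """Fit log(E[y]) = intercept + slope * x by Newton's method."""
    n = len(xs)
    mean_x = sum(xs) / n
    cx = [x - mean_x for x in xs]  # centering keeps the updates well behaved
    a, b = math.log(sum(ys) / n), 0.0  # start from a constant-rate model
    for _ in range(iterations):
        mu = [math.exp(a + b * x) for x in cx]
        # Gradient of the Poisson log-likelihood.
        g_a = sum(y - m for y, m in zip(ys, mu))
        g_b = sum((y - m) * x for y, m, x in zip(ys, mu, cx))
        # Entries of the negated Hessian.
        h_aa = sum(mu)
        h_ab = sum(m * x for m, x in zip(mu, cx))
        h_bb = sum(m * x * x for m, x in zip(mu, cx))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system for the parameter update.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a - b * mean_x, b  # undo the centering

intercept, slope = fit_poisson(temps, counts)
predicted = [math.exp(intercept + slope * t) for t in temps]
print(intercept, slope)
print([round(p, 1) for p in predicted])
```

Unlike a straight line fitted to the raw counts, the predictions here are always positive, which suits count data.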

Now if we plot the scored model vs. the actual data, we get the following.

 

image

 

The new graph seems to be a better reflection of the actual shape and scale of values.

 

Closing notes

 

In this blog post, I showed you how to visualize data and create linear and Poisson regressions using Azure ML.

Next, I’ll show you how you can use decision forests to get a model that fits the data better.


2 Comments

  •       Commented

    Thanks for the post. Great detailed breakdown. However, an important point to clarify: you are seeing a straight line fit in your model (the diagram of scored vs avg daily temp) because you haven't split your data into training and scoring subsets. One set, e.g. 70%, will be used to train the model, and the performance of the model gets scored against the remaining 30% that was not used for training. If you do not do this, you will be unable to tell whether your model has overfit and is not generalised enough to work on different data.