Build a Recommendation System for your Blog or Web Site using Azure Machine Learning and Azure Mobile Services

 

Maybe you have noticed the recommendations panel next to each blog post and wondered how it works.

In today’s post I’ll describe the implementation of this feature, which leverages some really cool Azure technology including Azure Machine Learning (Azure ML), Azure Mobile Services and Azure Storage.


But first, let me give you a quick overview of the approach to developing this recommendations mechanism and the rationale behind it. The goal is simply to provide a set of recommendations of similar content directly from a blog post. One way to achieve this goal is to treat this as a categorization problem. Once all of the blog posts have a category assigned to them, the solution provides recommendations from the categories that the blog posts share.


If we go down this path, first we would define these categories and create a set of data (a training dataset) categorized according to our expectations. From this data, we would train a machine learning model to handle future categorization. In my case, this means I’d need to categorize some of my data manually, then select and train a model. Once the model is trained, I could use it to categorize the rest of the data automatically. If we were dealing with tens of thousands of articles or blog posts, the effort would be justifiable. However, for my blog, which has significantly less content, I might as well spend the time tagging the content and doing the categorization myself.
 

Note: In the context of machine learning, the above-described approach is known as supervised learning.


Now, say we could categorize the data without having to provide a training data set or the categories up front. The only decision we would make would be the number of groups of data. The similarity of the text would define what would go into each group.

This approach would be analogous to giving a person several blog posts and a number of bins and tasking the person with placing blog posts in bins corresponding to their degree of similarity. As such, each bin would contain a group of the most similar blog posts.

 


 

One algorithm that performs this type of analysis is K-Means clustering. K represents the number of centroids (centers of the clusters) that the algorithm will try to find in the data. In my scenario, you can think of the centroids as the bins or groups of data. This is the approach I used, so let’s get to it!

 

Note: One of the trade-offs with K-Means clustering is that there is no deterministic way to know the optimal value of K; this is what is known as an NP-hard problem. Clustering is an example of unsupervised learning.
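To make the "bins" analogy concrete, here is a minimal, illustrative sketch of the assignment step at the heart of K-Means (this is my own simplified illustration, not the Azure ML implementation): each blog post, once represented as a numeric vector, is assigned to whichever centroid is closest by Euclidean distance.

// Illustrative sketch only (not the Azure ML implementation).
// Each document is a numeric vector; it goes into the "bin" (cluster)
// whose centroid is closest by Euclidean distance.
static int AssignToNearestCentroid(double[] document, double[][] centroids)
{
    var bestIndex = 0;
    var bestDistance = double.MaxValue;

    for (var k = 0; k < centroids.Length; k++)
    {
        // Squared Euclidean distance between the document and centroid k.
        double distance = 0;
        for (var d = 0; d < document.Length; d++)
        {
            var diff = document[d] - centroids[k][d];
            distance += diff * diff;
        }

        if (distance < bestDistance)
        {
            bestDistance = distance;
            bestIndex = k;
        }
    }

    return bestIndex;
}

The algorithm alternates this assignment step with recomputing each centroid as the mean of the vectors assigned to it, until the assignments stop changing.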

 

Data In

To get my blog’s data ready for K-Means using Azure ML, there are a few preparatory steps, including removing HTML, CSS, and stop words.

 

Note: “Stop words” are words such as “the” that are often removed before machine learning analysis because of their high frequency and limited semantic value.

 

Once the data is ready, I have to present it in a format that Azure ML can read. To implement all of these tasks, I leverage the job-scheduling capabilities of Azure Mobile Services.

The following code shows the relevant part of the scheduled job, which reads the blog’s data from the RSS feed, cleans it, and persists it in an Azure table (you can get the full code here).


public class RssMLPrep : ScheduledJob
{
    public async override Task ExecuteAsync()
    {
        using (var httpClient = new HttpClient())
        {
            var response = await httpClient.GetAsync(new Uri(CloudConfigurationManager.GetSetting("RssSource")));
            var data = await response.Content.ReadAsStringAsync();
            var xnodes = new XPathDocument(new MemoryStream(UTF8Encoding.UTF8.GetBytes(data)))
                .CreateNavigator()
                .Select("//item");

            var table = GetCloudTable();

            foreach (XPathNavigator node in xnodes)
            {
                var title = node.SelectSingleNode("title").ToString();

                // Include the title in the text and prepare text...
                var text = PrepareText(string.Format("{0} {1}", title, node.SelectSingleNode("description").ToString()));

                // Hash the title to remove invalid chars and use it as the key
                var rowKey = ComputeSH1Hash(title);

                var tableOp = TableOperation.InsertOrMerge(new BlogEntry()
                {
                    Text = text,
                    PubDate = DateTime.Now,
                    Link = node.SelectSingleNode("link").ToString(),
                    RowKey = rowKey,
                    PartitionKey = "BlogEntries",
                    Title = title
                });

                await table.ExecuteAsync(tableOp);
            }
        }
    }
    ...
}
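Note that PrepareText and the other helpers are part of the full code linked above. For context, PrepareText needs to strip the HTML markup and remove stop words; a minimal sketch of such a helper could look like the following (the regular expressions and stop-word list here are my own simplification, so check the linked code for the actual implementation):

// Simplified sketch of a text-cleaning helper (the linked full code may differ).
// Requires System, System.Collections.Generic, System.Linq, System.Net and
// System.Text.RegularExpressions.
static readonly HashSet<string> StopWords = new HashSet<string>(
    new[] { "the", "a", "an", "and", "or", "of", "to", "in", "is", "it" },
    StringComparer.OrdinalIgnoreCase);

static string PrepareText(string html)
{
    // Strip HTML tags and decode entities such as &amp;
    var text = Regex.Replace(html, "<[^>]+>", " ");
    text = WebUtility.HtmlDecode(text);

    // Keep only simple word tokens and drop the stop words.
    var words = Regex.Split(text.ToLowerInvariant(), @"[^a-z0-9]+")
                     .Where(w => w.Length > 1 && !StopWords.Contains(w));

    return string.Join(" ", words);
}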

 
Keep in mind that before you can run the experiment, you will need to run the scheduled job at least once to populate the data. For more information about deploying and testing scheduled jobs in Azure Mobile Services, see here.
 

The Experiment

 

Next, I will describe the implementation of the clustering process in Azure ML.


1.- Connecting the Modules and Configuring the Reader

 

First, we sign into Azure ML Studio and create a new experiment. From the list of modules in the left pane, locate the following modules and drop them into the experiment: Reader, K-Means Clustering, Train Clustering Model, Feature Hashing, Execute R Script, Metadata Editor, and Writer. Connect the modules according to the screenshot below.

 

[Image: the experiment canvas with the modules connected]


Click on the Reader module and configure the properties so that it can read from the Azure table.

 

[Image: Reader module properties configured to read from the Azure table]

 

2.- Feature Hashing and Principal Component Analysis (PCA)

 

Next, we need to convert the blog’s text data into a format that K-Means can use. K-Means expects a vector of n dimensions as input. Furthermore, since K-Means uses a Euclidean distance calculation to determine how close a vector is to a centroid, it is important that these dimensions contain numeric values and that these values are normalized.

 

Although this sounds a bit complex, a technique called feature hashing fortunately simplifies these tasks. Azure ML provides a Feature Hashing module that leverages the Vowpal Wabbit library to perform 32-bit MurmurHash v3 hashing.

To use this module, we must indicate the column with the text we want to hash, the hashing bit size, and the length of the N-grams. The figure below shows the properties of the module.

 

Note: An N-grams value of 2 indicates that we will hash terms consisting of two words each.

 

[Image: Feature Hashing module properties]
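To give an idea of what the module does under the hood (this sketch is only an illustration, not the actual Vowpal Wabbit/MurmurHash implementation), the hashing trick maps each term to one of 2^bits columns and counts how many terms land in each column, producing a fixed-length numeric vector regardless of vocabulary size:

// Illustration of the hashing trick (not the actual Vowpal Wabbit/MurmurHash code).
// Every term is hashed into one of 2^bits buckets; the counts become the feature vector.
static double[] HashFeatures(IEnumerable<string> terms, int bits)
{
    var buckets = new double[1 << bits];   // bits = 12 gives 4,096 columns
    foreach (var term in terms)
    {
        var index = (term.GetHashCode() & 0x7fffffff) % buckets.Length;
        buckets[index] += 1;               // collisions simply add to the same bucket
    }
    return buckets;
}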

 

The result of this module will be 4,096 new numeric columns for each row! This implies that K-Means will try to find K clusters in a space of 4,096 dimensions. This is the problem often referred to as the “curse of dimensionality”; for K-Means specifically, the concept of distance becomes less meaningful as the number of dimensions grows.

 

Note: The hashing bit size parameter determines the number of columns. Because we entered 12, this translates into 2^12 = 4,096 columns.


A technique commonly used to reduce the number of dimensions is Principal Component Analysis (PCA). To apply it in Azure ML, we implement the following R code using the Execute R Script module.

 

Note: The script selects the top 10 principal components. Depending on your data, however, you might find it beneficial to treat this filter as a tuning parameter and return more or fewer components.

dataset1 <- maml.mapInputPort(1) # class: data.frame

#rowkey, title, link, pubdate, partitionkey, timestamp
reference_data <- dataset1[,1:7]

#pca on the hashed cols
pca_results <- prcomp(dataset1[,8:4103])

#take only the top 10 principal components
top_components <- data.frame(pca_results$x[,1:10])

#bind the results to the reference data
data.set <- cbind(reference_data,top_components)

maml.mapOutputPort("data.set");

 
3.- K-Means and Model Training
 

At this point, we have the data ready for our model! As I mentioned earlier, there is no exact way to know the optimal value of K. However, one way to approach this is to think about the number of high-level topics or themes that your content covers. In my case, I thought that 5 would be a good starting point. The following image shows the configuration of the K-Means Clustering and Train Clustering Model modules.

 

[Image: K-Means Clustering and Train Clustering Model module configuration]

 

Note: The initialization in K-Means can be a key factor in the quality of the outcome. By default, the algorithm picks the first data points as the starting centroids. A better initialization, geared toward avoiding some of the shortcomings of the default approach, is K-Means++, and Azure ML provides this initialization option!

[Image: K-Means Clustering initialization properties]
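For reference, the intuition behind K-Means++ is to pick each additional starting centroid with probability proportional to its squared distance from the centroids already chosen, which spreads the initial centroids across the data. A simplified sketch of that seeding step (again, my own illustration rather than the Azure ML code) looks like this:

// Simplified sketch of K-Means++ seeding (not the Azure ML implementation).
// Requires System, System.Collections.Generic and System.Linq.
static List<double[]> SeedCentroidsPlusPlus(IList<double[]> points, int k, Random rng)
{
    // Pick the first centroid uniformly at random.
    var centroids = new List<double[]> { points[rng.Next(points.Count)] };

    while (centroids.Count < k)
    {
        // Squared distance from each point to its nearest chosen centroid.
        var weights = points
            .Select(p => centroids.Min(c => SquaredDistance(p, c)))
            .ToArray();

        // Sample the next centroid proportionally to those weights.
        var threshold = rng.NextDouble() * weights.Sum();
        double cumulative = 0;
        for (var i = 0; i < points.Count; i++)
        {
            cumulative += weights[i];
            if (cumulative >= threshold)
            {
                centroids.Add(points[i]);
                break;
            }
        }
    }

    return centroids;
}

static double SquaredDistance(double[] a, double[] b)
{
    double sum = 0;
    for (var i = 0; i < a.Length; i++)
    {
        var diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}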

Important: When configuring the selected columns in the Train Clustering Model module, make sure that you select the option Allow duplicates and preserve column order in selection.

 

4.- Writing Our Results and Running the Experiment

Before writing our results, we need to convert the Assignments column to String. For this, we will use the Metadata Editor module with the following properties.

[Image: Metadata Editor module properties]

 

Finally, we write our results to an Azure table using the Writer module. Let’s configure the Writer according to the figure below.


[Image: Writer module properties]


Ready, set, run the experiment by clicking on the Run button!

 

Data Out


We are almost there! What is left is to expose the results table via Azure Mobile Services and write the client code. The following code shows the relevant part of the implementation via a custom API. (You can get the full code here.)

public class BlogRecommendationsController : ApiController
{
    public ApiServices Services { get; set; }

    // GET api/BlogRecommendations
    [AuthorizeLevel(Microsoft.WindowsAzure.Mobile.Service.Security.AuthorizationLevel.Application)]
    public IEnumerable<BlogItem> Get(string title)
    {
        var resultsTable = GetCloudTable("BlogDataTableResults");

        // Find the cluster assignment for the requested post title.
        var cluster = resultsTable.CreateQuery<BlogEntryResult>()
            .Where<BlogEntryResult>(b => b.Title == title)
            .Select<BlogEntryResult, string>(r => r.Assignments)
            .FirstOrDefault<string>();

        if (cluster != null)
        {
            // Return the other posts assigned to the same cluster.
            return resultsTable.CreateQuery<BlogEntryResult>()
                .Where<BlogEntryResult>(b => b.Assignments == cluster && b.Title != title)
                .Select<BlogEntryResult, BlogItem>(r => new BlogItem() { Title = r.Title, Link = r.Link });
        }

        return null;
    }
    ...
}
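The controller relies on a few supporting types that live in the full code linked above. Their shape can be inferred from the queries, so here is a rough sketch for context (check the linked code for the exact definitions):

// Rough sketches of the supporting types used above (inferred from the queries;
// see the linked full code for the exact definitions).
public class BlogEntryResult : Microsoft.WindowsAzure.Storage.Table.TableEntity
{
    public string Title { get; set; }
    public string Link { get; set; }
    public string Assignments { get; set; }   // cluster label produced by the experiment
}

public class BlogItem
{
    public string Title { get; set; }
    public string Link { get; set; }
}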

For the client, I am using the Azure Mobile Services HTML SDK. The following code shows the client-side implementation:

 

$("#recommendation_panel").hide();

var client = new WindowsAzure.MobileServiceClient(
"https://<YOUR MOBILE SERVICE>.azure-mobile.net/",
"<YOUR_APP_KEY>"
);

//We are assuming the value in the HTML title tag is equal to the title of the RSS feed
var rawTitle = $(document).attr('title');

client.invokeApi("BlogRecommendations", {
body: null,
parameters: {title:rawTitle},
method: "get"
}).done(function (results) {
showRecommendations(results.result);
}, function(error) {
//error handling...
});

function showRecommendations(items){
if(items == null){
return;
}

if (items.length <= 0){
return;
}


$("#recommendation_panel").show();
var lst = $("#recommendation_body").append('<div class="list-group"/>');

items.forEach(function(r,i,a) {
lst.append('<a class="list-group-item" href="' + r.link + '">'+r.title+'</a>');
})

$("#recommendation_header").append("Recommended Posts");
$("#recommendation_footer").append("Powered by Azure Machine Learning");
}

 

Note: You will need to configure CORS for the Azure Mobile Service. CORS is supported in Azure Mobile Services with a .NET backend starting with version 1.0.348. For more information, see this blog.

 

Final Thoughts!


In today’s post, I showed you how to quickly implement a recommendations feature for your blog or web site using the K-Means clustering algorithm in Azure ML.

In future blog posts, I will show you some of the optimizations I made based on the results from the approach outlined here.

Stay tuned!

