I delivered a session on Azure ML at the recent NYC Microsoft BigData hackathon. I had a great time exploring exciting machine learning scenarios and ideas with the attendees. Later in the day, while the teams were heads down working on their projects, I had open time to do some Azure ML hacking using the New York City restaurant inspection results dataset.
The gist of the idea was to create a model that predicts the grade of a restaurant from the description of the violation. The results were remarkably accurate, so I am sharing the approach in this walkthrough.
Preparing the data
In Azure ML studio, uploading the file is straightforward. Click New -> Dataset -> From local file, select your local copy, and, from the dropdown box, choose the type—CSV with header.
In a new experiment, drop the restaurant ratings dataset, remove all columns except Grade and Violation Description, set the Grade column as categorical, and replace missing values.
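Outside of the studio, the same preparation step can be sketched in plain Python. This is only an illustration of the logic, not what the Azure ML modules execute; the column names and the "Not Yet Graded" placeholder are assumptions about the dataset.

```python
def prepare(rows):
    """Keep only Grade and Violation Description; replace missing values.
    Column names and the missing-value placeholder are assumed, not taken
    from the actual Azure ML experiment."""
    cleaned = []
    for row in rows:
        grade = row.get("GRADE") or "Not Yet Graded"        # replace missing grade
        violation = row.get("VIOLATION DESCRIPTION") or ""  # empty text for missing
        cleaned.append({"Grade": grade, "Violation Description": violation})
    return cleaned

# Illustrative rows with an extra column to drop and a missing grade.
rows = [
    {"DBA": "CAFE A", "GRADE": "A", "VIOLATION DESCRIPTION": "Evidence of mice"},
    {"DBA": "CAFE B", "GRADE": "", "VIOLATION DESCRIPTION": "Cold food item held above 41F"},
]
print(prepare(rows))
```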
Next, include the feature hashing and principal component analysis modules. We will use the feature hashing module to convert combinations of two words (N-grams = 2) to numeric features that can be used as inputs to train a model. A hashing bit size of 10 results in 1024 columns (2^10), which is too many, so we need to identify the most relevant ones. Reducing dimensionality is the purpose of the principal component analysis (PCA) module. Let’s apply PCA to all columns except Grade and reduce the number of dimensions to five.
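To make the hashing step concrete, here is a minimal sketch of the hashing trick applied to word bigrams, assuming a 10-bit hash as above. It is not the hash function the Azure ML module uses, and the PCA reduction that follows in the experiment is omitted here.

```python
import hashlib

BITS = 10
DIM = 1 << BITS  # 1024 columns, matching a hashing bit size of 10

def hash_bigrams(text, dim=DIM):
    """Map word bigrams (N-grams = 2) to a fixed-length count vector.
    Uses MD5 as an illustrative hash; Azure ML's internal hash differs."""
    vec = [0] * dim
    words = text.lower().split()
    for pair in zip(words, words[1:]):
        h = int(hashlib.md5(" ".join(pair).encode()).hexdigest(), 16)
        vec[h % dim] += 1
    return vec

v = hash_bigrams("evidence of mice or live mice")
print(len(v), sum(v))  # 1024 columns, five bigrams hashed into them
```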
Training the jungle
Decision jungles are an interesting algorithm. In contrast to a decision tree, a decision jungle can create multiple paths to the same leaf, so its structure resembles a directed acyclic graph rather than a hierarchy. A tree grows exponentially when more accuracy is needed, whereas a jungle stays more compact and performs better in comparison; geeky yet interesting details here.
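A back-of-envelope comparison shows why the compactness claim holds. The node-budget rule below is an illustration of capping each level's width in a DAG, not the actual jungle training objective:

```python
def tree_nodes(depth):
    """Nodes in a full binary decision tree: 2^(depth+1) - 1."""
    return 2 ** (depth + 1) - 1

def jungle_nodes(depth, width):
    """Rough node budget for a rooted DAG whose levels are capped at `width`.
    Purely illustrative: shows linear growth in depth once the cap is hit."""
    total, level = 1, 1
    for _ in range(depth):
        level = min(level * 2, width)
        total += level
    return total

for d in (5, 10, 20):
    print(d, tree_nodes(d), jungle_nodes(d, width=16))
```

At depth 20 the full tree needs over two million nodes, while the width-capped DAG stays in the hundreds.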
Needless to say, we will then train a decision jungle model. We will use the multiclass decision jungle module and a training set of 70% of the data. With the remaining 30%, we will score the trained model and evaluate the results.
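The 70/30 split the experiment relies on can be sketched as follows; the seed and the sample rows are assumptions for the sake of a runnable example.

```python
import random

def split(rows, train_fraction=0.7, seed=42):
    """Shuffle and split rows into a training set and a scoring set."""
    rows = rows[:]  # avoid mutating the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

data = [{"Grade": "A", "Violation Description": f"violation {i}"} for i in range(100)]
train, score = split(data)
print(len(train), len(score))  # 70 30
```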
And from the evaluation module, the results are…
In short, great accuracy!
Let’s step back and think about the results and the nature of the data. One important characteristic is that the text is consistent, with limited variation: 108 distinct values from over 400,000 records. This indicates that the text is standardized, which narrows the number of paths generated during training and improves generalization. In other words, the structure of the trained model is not too complex, and the input, assuming the same level of standardization remains, is a reliable predictor of the grade.
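Checking that kind of standardization is a one-liner over the column. The rows below are illustrative stand-ins for the 400,000-record dataset:

```python
# Count distinct violation descriptions to gauge how standardized the text is.
# Sample rows are illustrative; the real dataset yields 108 distinct values.
rows = [
    {"Violation Description": "Evidence of mice or live mice present"},
    {"Violation Description": "Cold food item held above 41F"},
    {"Violation Description": "Evidence of mice or live mice present"},
]
distinct = {r["Violation Description"] for r in rows}
print(len(distinct), "distinct values from", len(rows), "records")
```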