AWS Machine Learning - what ML algorithms can you easily and elastically run on AWS?




I'm curious to find out more about which ML algorithms are supported by AWS, or which ones can be easily and elastically deployed?

First of all there is Amazon Machine Learning.

This is a fully managed service for simple machine learning using a couple of canned algorithms.

Further docs here.

It supports the complete ML workflow, but only supports basic regression algorithms as follows:

Amazon ML uses the following learning algorithms:
  • For binary classification, Amazon ML uses logistic regression (logistic loss function + SGD).
  • For multiclass classification, Amazon ML uses multinomial logistic regression (multinomial logistic loss + SGD).
  • For regression, Amazon ML uses linear regression (squared loss function + SGD).

They should probably rename this service to "simple regression machine learning"?
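Amazon ML doesn't expose its training code, but "logistic loss function + SGD" is simple enough to sketch in pure Python. This is a toy single-machine illustration of the technique, not Amazon's implementation:

```python
import math

def train_logistic_sgd(data, labels, lr=0.1, epochs=200):
    """Binary logistic regression trained with plain stochastic gradient descent."""
    weights = [0.0] * len(data[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):  # SGD: update after every example
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            error = pred - y                     # gradient of the logistic loss
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

# Linearly separable toy data: label 1 when the first feature dominates.
data = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]]
labels = [0, 0, 1, 1]
w, b = train_logistic_sgd(data, labels)
print([predict(w, b, x) for x in data])  # [0, 0, 1, 1]
```

Multiclass classification (multinomial logistic loss) is the same idea with one weight vector per class and a softmax instead of the sigmoid.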
So I've actually learned something today (hurray). I've used linear and non-linear regression for years, and other methods (e.g. FOIL, decision trees, etc.), but hadn't realised there was a regression method for classification! (Actually I had, but I'd re-invented it from scratch — bother.) Logistic regression comes from statistics, which probably excuses my ignorance.

Some discussions of pros/cons of logistic regression.

Also not sure how well AWS ML scales? Here are the limits.


The other well known Amazon ML services are the ones I heard about recently at AWS Summit Sydney: Rekognition (images), Polly (speech) and Lex (chatbots) (Polly want a cracker?)

AI Services: At the highest level, for developers who want access to AI technologies without having to train or develop their own ML models, AWS provides a collection of highly scalable pre-trained and pre-tuned managed AI Services that do not require any previous artificial intelligence or deep learning knowledge in order to get started. Amazon Rekognition for image and facial analysis, Amazon Polly for text-to-speech, and Amazon Lex for building conversational chatbots with automatic speech recognition and natural language understanding (NLU) capabilities.

These are impressive but limited in focus.

What if I want to run some more traditional ML algorithms at scale, say on Apache Spark? ILP and first-order induction algorithms like FOIL, PROGOL and GOLEM, and decision tree algorithms like ID3 and C4.5, come to mind from the 1980s/1990s. How about clustering of first-order relational data? Bayesian networks? etc. Well-known libraries of ML algorithms like Weka? (From NZ?)
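As a reminder of what ID3-style decision tree learning actually does: at each node it splits on the attribute with the highest information gain. A minimal pure-Python sketch of that core calculation, on made-up toy data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting on one attribute (dict-keyed rows)."""
    base = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):  # partition labels by attribute value
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(
        (len(part) / len(labels)) * entropy(part) for part in partitions.values()
    )
    return base - remainder

# Toy data: "outlook" perfectly predicts the label, "windy" tells us nothing.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0
print(information_gain(rows, labels, "windy"))    # 0.0
```

ID3 recursively picks the highest-gain attribute and splits; C4.5 adds gain ratio, pruning and continuous attributes on top of this.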

Weka is supported on AWS.



Deep learning? Yes, this is supported:

AI Infrastructure: Deep learning frameworks, like Apache MXNet, use neural nets, which involve the process of multiplying a lot of matrices. Amazon EC2 P2 instances provide powerful Nvidia GPUs to substantially accelerate the time to complete these computations, so you can train your models in a fraction of the time required by traditional CPUs. After training, Amazon EC2 C4 compute-optimized and M4 general purpose instances are ideally suited for running inferences with the trained model. In addition, AWS Lambda lets you simplify your operations with serverless machine learning predictions, while AWS Greengrass lets you run AI IoT applications seamlessly across the AWS Cloud and local devices.

The AWS Deep Learning AMIs support a whole bunch of AI frameworks:

Apache MXNet, TensorFlow, the Microsoft Cognitive Toolkit (CNTK), Caffe, Caffe2, Theano, Torch and Keras

What's in them all? Not sure.

Also a CloudFormation option for Deep Learning.

Oh, the overall AI architecture helps to work out where everything fits (obviously).





I think Platforms is where I should have been looking:

AI Platforms: For customers with existing data who want to focus on building custom inference models, we provide a set of AI platforms which remove the undifferentiated heavy lifting associated with deploying and managing AI training and model hosting. The Amazon Machine Learning service allows you to train custom machine learning models using your own data, without requiring deep machine learning skills or expertise. In addition, Apache Spark on Amazon EMR includes MLlib for scalable machine learning algorithms.

This supports Spark MLlib.

What's that? The machine learning library for Spark, which includes:


MLlib contains many algorithms and utilities.
ML algorithms include:
  • Classification: logistic regression, naive Bayes,...
  • Regression: generalized linear regression, survival regression,...
  • Decision trees, random forests, and gradient-boosted trees
  • Recommendation: alternating least squares (ALS)
  • Clustering: K-means, Gaussian mixtures (GMMs),...
  • Topic modeling: latent Dirichlet allocation (LDA)
  • Frequent itemsets, association rules, and sequential pattern mining
ML workflow utilities include:
  • Feature transformations: standardization, normalization, hashing,...
  • ML Pipeline construction
  • Model evaluation and hyper-parameter tuning
  • ML persistence: saving and loading models and Pipelines
Other utilities include:
  • Distributed linear algebra: SVD, PCA,...
  • Statistics: summary statistics, hypothesis testing,...

These look more like the ML algorithms I was expecting.
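MLlib's KMeans runs distributed over a Spark cluster; the underlying algorithm (Lloyd's) is simple enough to sketch in pure Python. A toy single-machine version, for illustration only:

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster; repeat."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if its cluster emptied
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return centroids

# Two well-separated blobs, around (0, 0) and (10, 10).
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(sorted(kmeans(points, k=2)))  # one centroid near each blob
```

MLlib does the assignment step in parallel across partitions and adds a smarter initialisation (k-means||), but the iterate-assign-recompute loop is the same.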

The full list:

Classification and regression


This page covers algorithms for Classification and Regression. It also includes sections discussing specific classes of algorithms, such as linear methods, trees, and ensembles.

Clustering

This page describes clustering algorithms in MLlib. The guide for clustering in the RDD-based API also has relevant information about these algorithms.

Best practices for AWS EMR and Spark (has MLlib been replaced by DataFrames?)

DataFrames? Replacement for RDD?

Comparison of RDD, DataFrames and DataSets

This tutorial looks like fun (TODO): Building a Recommendation Engine with Spark ML on Amazon EMR using Zeppelin

And another tutorial (with Redshift data). TODO


It's a few years old now, I wonder if there's a more recent example?





In this part of Spark’s tutorial (part 3), we will introduce two important components of Spark’s Ecosystem: Spark Streaming and MLlib.

They say: if you want to go beyond these algorithms, you will need what are called ML Workflows.

ML Workflows from this book (TODO Read)

PS
Over the last 20+ years I've had a lot of experience with "big data" problems and models, mainly in scientific organisations (e.g. CSIRO, UCL eScience UK). I project-managed and architected a scientific software application which enabled non-expert end-users to experiment with different models/solutions to produce desired results from a variety of data sources. This was "clever" in that it would work forwards and backwards at once. Given a set of possibly partial input and output data (results), candidate models would be automatically highlighted and could be run to produce the results, or more input data could be supplied if it was missing and critical. Models could be run in a pipeline to produce missing input data for other models in some cases. This was in the late 1990s. Surely by now this sort of framework for data/models/results is common practice?

So for ML problems there are a couple of obvious ways of automating the pipeline.

1. Have an ML algorithm to "rule them all": i.e. given input data, the problem, and a set of ML algorithms, train an ML algorithm (or more than one) to pick the best algorithm for the job, run it automatically, and iterate (i.e. learn) and/or change algorithms. I.e. treat the pipeline like an ML problem itself.

2. Use the cloud: run all the ML algorithms (that can in theory be run for the given data/problem) all the time. Continuously compare results and use multiple models for the "production" environment. Gamify all the models to keep getting the best results?

And you need to do both, 1 learns from 2!
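Option 2 is easy to sketch: train every candidate model on the same data, score each on a held-out set, and keep the winner. A toy pure-Python illustration (the candidate models and scoring here are mine, invented for the example):

```python
def majority_model(train_X, train_y):
    """Baseline: always predict the most common training label."""
    label = max(set(train_y), key=train_y.count)
    return lambda x: label

def nearest_neighbour_model(train_X, train_y):
    """1-NN: predict the label of the closest training point."""
    def model(x):
        i = min(range(len(train_X)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(train_X[j], x)))
        return train_y[i]
    return model

def accuracy(model, X, y):
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)

def run_them_all(candidates, train, test):
    """Train every candidate, score on held-out data, return the best."""
    scores = {name: accuracy(fit(*train), *test) for name, fit in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

train = ([[0, 0], [1, 1], [9, 9], [10, 10]], [0, 0, 1, 1])
test = ([[0, 1], [9, 10]], [0, 1])
candidates = {"majority": majority_model, "1-nn": nearest_neighbour_model}
best, scores = run_them_all(candidates, train, test)
print(best, scores)  # 1-nn wins with accuracy 1.0
```

Option 1 then falls out naturally: log which candidate wins for which kind of data/problem, and train yet another model on those logs to predict the winner up front.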

This seems somewhat obvious, someone must be doing this already?

(2) is suggested in this book (not specifically about ML, but a similar problem), where there are sufficient parallel resources (cloud, anyone?).

A cheat sheet (decision tree?) for choosing ML on Azure.

On the other hand, the type of ML I was doing in the 1980s (autonomous, unsupervised, on relational/first-order data, producing first-order logic/Horn clauses) seems both medieval and modern. Essentially it was learning from "graph data", which has made the news again, both in terms of databases and ML for graph data. E.g.

https://research.googleblog.com/2016/10/graph-powered-machine-learning-at-google.html

http://www.idgconnect.com/abstract/18124/the-wave-disruption-graph-machine-learning

This article suggests graph ML is a type of clustering? Sort of, but clustering that uses relationships. This isn't new; I invented an algorithm in the 1980s that did this.

https://blog.insightdatascience.com/graph-based-machine-learning-6e2bd8926a0
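Not the same as my 1980s algorithm, but as a minimal illustration of "clustering that uses relationships": grouping vertices by connectivity rather than by feature distance. A toy pure-Python sketch using connected components:

```python
from collections import defaultdict

def connected_components(edges):
    """Cluster vertices purely by their relationships: two vertices end up
    in the same cluster iff a path of edges links them."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]  # depth-first flood fill
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            component.add(v)
            stack.extend(adjacency[v] - seen)
        clusters.append(component)
    return clusters

edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
print(connected_components(edges))
# two clusters: {alice, bob, carol} and {dave, erin}
```

Real graph ML (label propagation, community detection, graph embeddings) refines this idea with edge weights and soft memberships, but the point stands: the structure, not the features, drives the clustering.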

Building a graph db on AWS (from ML data but not doing any ML on it?)


P2S

I decided to try out the AWS Machine Learning tutorial as it looked as if I could do it in a few minutes.

Sure, setting it up and starting the process only takes 15 minutes; however, once it gets to the learning phase it seems to get stuck with no feedback. It's been running (maybe?) for 20 minutes now and says:


Creation time: May 25, 2017 11:32:42 AM
Completion time: Not available
Compute Time (Approximate): Not available
Status: In progress
Message: Current Step: TRAINING (1/1) Current Iteration: (10/10) 100%

This is NOT very friendly. I would have assumed that the tutorial example is designed to be quick and easy; it should take 10 minutes at most. The lack of feedback suggests something has gone wrong? There's not much online about what the Current Steps are (how many there are, and where I'm up to).

This article says it can take hours (a lot!) to process:

http://www.kdnuggets.com/2016/02/amazon-machine-learning-nice-easy-simple.html


Finally finished, wall clock time was 21 minutes:


Type: Binary classification
Creation time: May 25, 2017 11:32:42 AM
Completion time: 3 mins.
Compute Time (Approximate): 2 mins.
Status: Completed


Ok, it seemed to work eventually. I then tried batch predictions, and it wasn't able to tell me the cost estimate: "Amazon ML is unable to estimate the cost of generating the predictions you requested."

It's also unable to tell me how long it will take:

Creation time: May 25, 2017 12:17:31 PM
Completion time: Not available
Compute Time (Approximate): Not available
Status: Pending

Finished in a few minutes, but no results in S3. It looks like you have to actually give the destination S3 bucket a named object, which I didn't before (but there was no error). i.e.

  1. Choose Continue.
  2. For S3 destination, type the name of the Amazon S3 location where you uploaded the files in Step 1: Prepare Your Data. Amazon ML uploads the prediction results there.
ACTUALLY NEED TO ENTER A VALID BUCKET OBJECT NAME FOR THE DESTINATION
  1. For Batch prediction name, accept the default, Batch prediction: ML model: Banking Data 1. Amazon ML chooses the default name based on the model it will use to create predictions. In this tutorial, the model and the predictions are named after the training datasource, Banking Data 1.
  2. Choose Review.
  3. In the S3 permissions dialog box, choose Yes.

The estimate this time was 50 cents. The web page doesn't seem to update when the job finishes; you need to click refresh manually.


Conclusions? Well, I guess it works. It's pretty basic, and the run-time feedback is minimal and unreliable. For data of this size and complexity there's really no need to use ML as a Service either. This is a pretty basic machine learning algorithm (regression only), you can't actually export the model for reuse elsewhere, and the data is limited to column/value data. This is just propositional (zero-order) data. What happened to the ability to learn from first-order relational data that was developed in the 1980s? I guess this is graph data learning (or learning from tuples) now? Also, I'm not sure how well it copes with noisy and missing data.



Looks like there are some open source algorithms available:

Some graph data analytics languages, including Gremlin (which runs on the JVM).


Looks cool!

Enter Pixy. Pixy is a bridge from first-order logic to Gremlin. The first-order logic of Pixy operates on vertices and edges. We can ask questions like "Find vertices and edges that match some predicate" where the predicate is formed by
  • various comparisons on vertex and edge properties, 
  • logical operations "and" (∧), "or" (∨) and "not" (¬), and 
  • the universal "for every" (∀) and existential "there exists" quantifiers (∃) that operate on vertices and edges.
Pixy queries are expressed using Prolog rules, not SQL. Rules in Prolog are expressed as Horn clauses. Prolog like SQL has the full expressive power of first-order logic.
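Pixy uses Prolog rules for this, but the flavour of those first-order queries is easy to mimic in plain Python, with `any`/`all` playing the ∃/∀ quantifiers. A toy sketch over an in-memory property graph (not Pixy's API):

```python
# Toy property graph: vertices with properties, labelled directed edges.
vertices = {
    "v1": {"type": "person", "age": 34},
    "v2": {"type": "person", "age": 19},
    "v3": {"type": "company"},
}
edges = [("v1", "works_at", "v3"), ("v2", "knows", "v1")]

def exists_edge(source, label):
    """∃ e: e starts at `source` and carries the given label."""
    return any(s == source and lbl == label for s, lbl, t in edges)

# "Find vertices that match some predicate": persons over 21 who work somewhere.
matches = [
    v for v, props in vertices.items()
    if props.get("type") == "person"      # comparison on vertex properties
    and props.get("age", 0) > 21          # ... combined with "and" ...
    and exists_edge(v, "works_at")        # ∃ quantifier over edges
]
print(matches)  # ['v1']

# ∀ quantifier: does every person either know someone or work somewhere?
all_connected = all(
    exists_edge(v, "knows") or exists_edge(v, "works_at")
    for v, props in vertices.items() if props.get("type") == "person"
)
print(all_connected)  # True
```

What Prolog/Pixy adds over this is unification and backward chaining, so the same rule can run "backwards" to find bindings rather than just test them.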


Looks like Cypher is another graph database query language, supported by Neo4j so possibly better supported?

https://neo4j.com/developer/cypher-query-language/


And someone has already come up with the idea of running all possible algorithms at once: DataRobot.com
