Scaling Machine Learning from 0 to millions of users — part 2 Training: EC2, EMR, ECS, EKS or SageMaker?

In Part 1, we broke out of the laptop, and decided to deploy our prediction service on a virtual machine. Along the way, we discussed a few simple techniques that helped with initial scalability… and hopefully with reducing manual ops. Since then, despite a few production hiccups due to the lack of high availability, life has been pretty good.

However, traffic soon starts to increase, data piles up, more models need to be trained, etc. Technical and business stakes are getting higher, and let’s face it, the current architecture will go underwater soon. Time’s up: in this post, we’ll focus on scaling training to a large number of machines.

Two hundred… and fifty… six… GPUs!!! Ja!

This is an opinionated series, remember? It’s also based on what I hear when talking to real-life customers, not ideal ones. Reality is often ugly, and only tech hipsters (and top management) think that it looks exactly like that fancy article you read on <insert_name_here> 😉

Special thanks to rockin’ Evangelists Abby Fuller, Jerry Hargrove, Adrian Hornsby, Brent Langston and Ian Massingham for their tips and ideas.

Scaling up will fix it… right?

Yes and no. Yes, it can be a short-term solution to use a large server for training and prediction. Amazon EC2 has a ton of instance types to pick from, and all it takes is stopping your instance, changing the instance type, and starting it again.

You should only be using APIs from now on. When it comes to scaling tech & ops, manual work in any GUI is the Antichrist.
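As a rough sketch of what "stop, change type, start" looks like through the API rather than the console, here is a small function. It assumes that `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`) and that the instance is EBS-backed; the instance ID and target type are placeholders.

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop an instance, change its type, and start it again.

    Assumption: `ec2` is a boto3 EC2 client and the instance is EBS-backed.
    """
    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    # Bring the instance back and wait until it is running again.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

Calling `resize_instance(boto3.client("ec2"), "i-0123456789abcdef0", "p3.8xlarge")` would then do the whole dance in one go. Passing the client in keeps the function easy to exercise with a stub.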

No, because it’s only a temporary solution:

It’s fine to use bigger instances, until there is no bigger instance. Then what?
It’s also quite possible that your workload won’t scale nicely, and won’t make full use of the additional hardware (RAM, CPU cores, I/O, etc.). A marginal performance gain isn’t worth the extra spend.
Most of all, scaling up will simply delay the inevitable. Keep doing it, and the only thing you’ll end up with is a bigger problem to solve.

In the spirit of avoiding over-engineering, it’s OK to scale up a couple of times, but if monitoring keeps pointing at the Impassable Wall of Scalability Doom, I’d advise you to act a little too early rather than a little too late: things scale linearly until they don’t, and you don’t want to find out what happens when the exponential starts rising!

Scaling out

When it comes to training Machine Learning (ML) models, the top requirements are actually pretty simple:

Reliable, scalable storage for your data sets.
Elastic compute clusters, that can be started on-demand in lots of different configurations (hardware, frameworks, etc.).
As little ops as possible: ML is what you should focus on, because ML is what turns your raw data into revenue / profit / improved customer experience.

“That’s not the full list! We want total control, no lock-in, low bills, top performance… and everything else too, whatever it is”. Yes, yes, we’ll get there 🙂

Storage

Let’s get that one out of the way: your data goes to Amazon S3. Any other choice would need a bullet-proof justification (try me!). Throughput? Well, you now get up to 25 Gigabit per second between EC2 and S3. That should be enough for now. Scalability? High availability? Security? No ops? Cost? Check.
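To put that throughput figure in perspective, here is a back-of-the-envelope calculation. The 1 TB data set size is a made-up example, and since 25 Gigabit per second is an upper bound, the result is a best case, not a promise.

```python
def min_transfer_seconds(dataset_bytes, gigabits_per_second=25):
    """Best-case time to move a data set at the given link speed."""
    bits = dataset_bytes * 8
    return bits / (gigabits_per_second * 1e9)

one_terabyte = 10**12  # bytes; hypothetical data set size
print(f"{min_transfer_seconds(one_terabyte):.0f} seconds")  # 320 seconds
```

A bit more than five minutes for a terabyte, at best. Real numbers will depend on instance type, object sizes and how many requests you run in parallel.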

At this point, anyone in your team coming up with a “high performance storage cluster based on this super cool open source project” should be hit on the head with a heavy object until they stop moving. No mercy for wasting time, putting projects at risk, and over-engineering.

Because “the doc wasn’t clear” and “bugs happen to everyone”.

OK, now what about compute options? Amazon EC2? Amazon EMR? Container services? Amazon SageMaker?

Begun the Scaling War has.

Amazon EC2

Our journey started on EC2, so it can be quite tempting to continue there. Is that a good enough reason?

As discussed in Part 1, the Deep Learning AMI makes your life much simpler. It’s packed with open source tools and libraries optimized by AWS (11x speedup on TensorFlow 1.11, anyone? Or maybe linear scaling up to 256 GPUs?). Horovod, a popular library for distributed training, is also included. All of this will make it much easier to set up efficient, distributed training clusters.

Some constraints may prevent you from using that AWS-maintained AMI: need for a specific OS vendor, licensing restrictions, having to run identical builds across different providers, etc. Unfortunately, you’ll have to set everything up yourself. Don’t underestimate that task: you’ll have to do it again and again as new versions are released. Even with automation, that’ll never be a lights-out operation.

Whichever AMI you use, setting up a training cluster means:

1. Firing up a bunch of instances.
2. Picking one as the leader, and setting up distributed training. That usually involves listing hostnames / IP addresses of other machines in the cluster.
3. Starting training.
4. Once training is complete, grabbing the trained model and saving it in Amazon S3.
5. Shutting the training cluster down.
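The steps above can be sketched in a few lines. This is only an outline under stated assumptions: `ec2` is a boto3 EC2 client, while `train` and `save_model` are hypothetical callables standing in for your own distributed training setup and your own upload to S3. The part worth noticing is the `finally` clause, which shuts the cluster down even when training fails.

```python
def run_training_cluster(ec2, ami_id, instance_type, count, train, save_model):
    """Launch a cluster, train, save the model, and always terminate.

    Assumptions: `ec2` is a boto3 EC2 client; `train(leader_ip, worker_ips)`
    and `save_model(model)` are hypothetical callables supplied by you.
    """
    # Fire up the instances.
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
    )
    instance_ids = [i["InstanceId"] for i in response["Instances"]]
    try:
        ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

        # Pick a leader and list the other machines in the cluster.
        described = ec2.describe_instances(InstanceIds=instance_ids)
        ips = [
            instance["PrivateIpAddress"]
            for reservation in described["Reservations"]
            for instance in reservation["Instances"]
        ]
        leader, workers = ips[0], ips[1:]

        # Train, then save the trained model.
        model = train(leader, workers)
        save_model(model)
    finally:
        # Shut the cluster down, whether training succeeded or not.
        ec2.terminate_instances(InstanceIds=instance_ids)
```

Networking, security groups, IAM roles and spot requests are all left out here, which is exactly why real automation ends up much longer than this.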

Quite a bit of work, then, which is probably why most of you go through steps 1 and 2 once, run the cluster 24/7, and never make it to step 5… And this, ladies and gentlemen, is my main concern for using EC2 here. Unless you automate all of this (with AWS CloudFormation, Terraform, CLI scripts, etc.), you will waste a ton of money. Someone up above will quickly put a cap on your budget, meaning that you’ll probably end up with a fixed-size cluster that needs to be time-shared by multiple developers / teams… and of course, someone will develop a nice intranet page to book time slots on the cluster. Congratulations, you’ve reinvented the mainframe!

Resnet-50 training in 3… 2… 1…

Don’t laugh. I’ve met a very large — and otherwise brilliant — AI company managing hundreds of physical GPU servers just like this. Unfortunately, I’m sure some customers do the same on EC2… Get in touch if you’re stuck there, we can help!

Of course, maybe your DevOps team was kind enough to provide an all-singing, all-dancing cluster provisioning script that each developer can run to get their own training cluster (buy them lots of beer: that rarely ever happens). Would that be a good enough reason to stick to EC2 for training?

Maybe. Techniques for cost optimization on EC2 are well-known: reserved instances, spot instances, etc. Doing that right may offset the extra DevOps costs. Only you can find out, and you definitely should.

If you don’t have time or skills to get automation and cost optimization right, I’d think twice about running training jobs at scale on EC2.

Amazon EMR

Amazon EMR in an ML discussion? Well, yes: TensorFlow and Apache MXNet are part of the EMR distribution, and of course, Spark MLlib is also included. EMR supports both compute-optimized and GPU instances (c5 and p3), so it looks like we have everything we need.

Here are a few good reasons to run training jobs on EMR:

You already use EMR for other tasks, with solid automation (on-demand clusters, steps, etc.) and cost optimization (spot instances!).
Your data requires a lot of ETL, and Hive / Spark would work great for that. Why not run everything in one place?
Spark MLlib has the algos you need.
You read somewhere that there is a SageMaker SDK for Spark, so that could be another option for the future.
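For the first point, "solid automation" could look something like the sketch below: a request for a transient cluster that runs one Spark step and then goes away. It is meant to be passed to the `run_job_flow` call of a boto3 EMR client. The cluster name, instance types, release label and script location are placeholders, and the two roles are the EMR default role names, which may not match your account.

```python
def transient_cluster_request(script_s3_uri, release_label, core_count=4):
    """Build a run_job_flow request for a cluster that terminates after its step.

    Assumptions: names, instance types and roles below are placeholders.
    """
    return {
        "Name": "etl-and-training",
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Spot pricing for the worker nodes.
                {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
                 "InstanceType": "r5.4xlarge", "InstanceCount": core_count},
            ],
            # Terminate the cluster once all steps have run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

You would launch it with something like `boto3.client("emr").run_job_flow(**transient_cluster_request("s3://my-bucket/job.py", "emr-5.20.0"))`, where both arguments are examples only.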

And equally good reasons NOT to do it:

You don’t have time or skills to automate and optimize costs. GPU-based EMR clusters running 24/7 at on-demand price with zero load <rolls a D100 for sanity check… >
You’re using neither TensorFlow, Apache MXNet nor Spark MLlib. Yes, you could install additional packages to your clusters, but that’s extra work.
Your ETL and ML jobs have conflicting instance requirements. Let’s say that ETL would run best on 8 r5.4xlarge instances, and training on 2 p3.8xlarge instances. How do you compromise? That’s a hard call, and you may end up picking an instance type that’s suboptimal for both ETL and training… or creating a dedicated GPU cluster (another one to manage and worry about).

I have mixed feelings about this: I’d be fine with piling some amount of ML on top of an existing cluster, but unless it was 100% based on Spark MLlib, scaling it simply wouldn’t feel right.

Container services

In Part 1, I suggested early on that you containerize your code in order to solve deployment issues. Obviously, this would also pay dividends when deploying to Docker clusters, whether on Amazon ECS or Amazon EKS.

You can run any workload in a Docker container, and you can move it around without any restriction, from your laptop to your production environment (or so I’m told). When running on AWS, costs can be squeezed with auto scaling, reserved instances, spot instances, etc. Woohoo.

From a training perspective, containers give you full flexibility to use any open source library, or even your own custom code. All popular ML/DL libraries provide base images, which you can either run directly or customize, and these will save you a lot of time.

Thanks to auto scaling now supporting mixed instance types, different instance types can coexist within the same cluster. Thus, you can easily add compute-optimized instances to any cluster and schedule your ML/DL trainings there. Of course, you can also create a dedicated cluster for training if you think that makes more sense.

Last but not least, GPU instances are supported on both services, with GPU-optimized AMIs to boot (nvidia-docker, NVIDIA drivers, etc).

Training on Amazon ECS

To minimize training time and cost, you need to make sure that training jobs run on the most appropriate instance type (say c5 or p3). Amazon ECS lets you add placement constraints in task definitions. Here’s how we would ask ECS to schedule this task only on p3 instances.

"placementConstraints": [
    {
        "expression": "attribute:ecs.instance-type =~ p3.*",
        "type": "memberOf"
    }
]

We can also go one step further, thanks to a new feature that lets you pin a specific number of GPUs to a given task.
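Here is a minimal sketch of a task definition combining both ideas: the p3 placement constraint, plus a GPU requirement on the container. The result is meant to be passed to the `register_task_definition` call of a boto3 ECS client. The family name, image and memory size are placeholders.

```python
def gpu_task_definition(family, image, gpus=1, memory_mib=8192):
    """Build an ECS task definition that reserves GPUs on p3 instances.

    Assumptions: family, image and memory size are placeholders.
    """
    return {
        "family": family,
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [{
            "name": "training",
            "image": image,
            "memory": memory_mib,
            # Number of GPUs reserved for this container; the API expects a string.
            "resourceRequirements": [{"type": "GPU", "value": str(gpus)}],
        }],
        # Only schedule the task on p3 instances.
        "placementConstraints": [{
            "type": "memberOf",
            "expression": "attribute:ecs.instance-type =~ p3.*",
        }],
    }
```

Registering it would then be a single call such as `boto3.client("ecs").register_task_definition(**gpu_task_definition("tf-training", "my-registry/tf-training:latest", gpus=4))`, with a hypothetical image name.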

Training on Amazon EKS

You can pretty much do the same thing on EKS. This nice blog post will walk you through the whole process of adding p3 worker nodes to an existing cluster, defining a GPU-powered pod, and launching it on the cluster.

Running GPU-Accelerated Kubernetes Workloads on P3 and P2 EC2 Instances with Amazon EKS, by Scott Malkie, AWS Solutions Architect (aws.amazon.com)
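To give an idea of what a GPU-powered pod looks like, here is a minimal sketch that builds a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. It is not taken from the blog post above. It assumes that the NVIDIA device plugin is running on the GPU worker nodes, and the pod name, image and instance type are placeholders.

```python
import json

def gpu_pod_manifest(name, image, gpus=1, instance_type="p3.2xlarge"):
    """Build a pod manifest that requests GPUs on a given instance type.

    Assumptions: name, image and instance type are placeholders, and the
    NVIDIA device plugin exposes the nvidia.com/gpu resource on the nodes.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Well-known node label carrying the EC2 instance type.
            "nodeSelector": {"node.kubernetes.io/instance-type": instance_type},
            "containers": [{
                "name": "training",
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }

print(json.dumps(gpu_pod_manifest("tf-training", "my-registry/tf-training:latest"), indent=2))
```

Save the output to a file, apply it, and the scheduler should only place the pod on a node that has a free GPU.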

Container services for ML/DL training, yes or no?

If you belong to the Wild Hyperborean Horde who’d rather eat frozen yak poo than NOT use home-made containers for absolutely everything, you’ve already answered the question, haven’t you? 😉

Yes, for a very short while. And then you got your guts ripped out. Hmm.

If the need ever arose, I wouldn’t worry about scaling container services to a large number of nodes (ECS was designed to scale linearly to 1,000+ nodes). However, once again, you should very much worry about your ability to scale ops (containers, clusters, etc.) and manage cost.

If you work in a Docker shop where another team is managing clusters, providing you with agile, automated and cost-effective ways to provision them, then sure. It could be as easy as committing a TensorFlow script, and then letting a CI/CD pipeline deploy it automatically to a cluster. Not a lot of extra work for ML developers and data scientists.

Now, if you live in a world where you have to build and operate clusters on top of your actual ML job, that’s not such an exciting proposition any more. Plumbing, large bills, fire, brimstone… you know the story.

Let’s look at setting up and managing large, distributed training jobs with Horovod: here’s the documentation. Once you’ve got everything figured out (Docker, Kubeflow, MPI Operator, Helm Chart, and FfDL), here’s how to run a training job on 4 machines with 4 GPUs each:
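As a rough sketch of that launch, assuming Horovod's `horovodrun` launcher, the snippet below builds the command line for 4 hosts with 4 GPUs each. The host names and the `train.py` script are placeholders, and the hosts are assumed to be reachable from the machine you launch from.

```python
def horovodrun_command(hosts, gpus_per_host, script):
    """Build a horovodrun command line: one training process per GPU."""
    total_processes = len(hosts) * gpus_per_host
    host_list = ",".join(f"{host}:{gpus_per_host}" for host in hosts)
    return ["horovodrun", "-np", str(total_processes), "-H", host_list, "python", script]

# Placeholder host names and training script.
command = horovodrun_command(["server1", "server2", "server3", "server4"], 4, "train.py")
print(" ".join(command))
# horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
```

The command itself is short. Everything that has to be in place before it works is not.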

Don’t get me wrong: Docker, Kubernetes, Horovod and so on are impressive projects, but if you insist on building and maintaining everything yourself (or if Hyperborean Harald sneers that it’s “the only proper way to do it”), you should know what you’re stepping into, as you will be using this all day long.

Is this what you really need? Maybe, maybe not. Please make up your own mind.

Amazon SageMaker

One more option to go: Amazon SageMaker. I’ve discussed it at length in previous posts and talks (start here for a recent overview), and as the most recent service of the bunch, I’ve saved it for last to see where it improves on previous options with respect to training large jobs.

A quick reminder:

All activity in SageMaker is driven by a high-level Python SDK.
Training is based on on-demand, fully-managed instances. Zero infrastructure work. Spot instances are not available at the time of writing.
Models may be trained using AWS-maintained built-in algorithms and optimized frameworks (same ones as in the Deep Learning AMI), as well as custom algorithms.
Distributed training is built-in. Zero setup.
Plenty of examples are available here.

For instance, here’s how you would train a TensorFlow model. First, put your data in Amazon S3 (it’s hopefully already there). Then, configure your training job: pass your code, the instance type, the number of instances. Finally, call fit().
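Here is a minimal sketch of those three steps, assuming version 2 of the SageMaker Python SDK, where the parameters are named `instance_type` and `instance_count`. The script name, role ARN, S3 location and framework / Python versions are placeholders to check against what the SDK currently supports, and distribution settings are left out.

```python
def estimator_settings(role_arn, instance_type="ml.p3.8xlarge", instance_count=2):
    """Settings for the TensorFlow estimator; all values are placeholders."""
    return {
        "entry_point": "train.py",       # your training script
        "role": role_arn,                # IAM role assumed by SageMaker
        "instance_type": instance_type,
        "instance_count": instance_count,
        "framework_version": "2.11",     # assumption: check supported versions
        "py_version": "py39",            # assumption: must match the framework version
    }

def train(settings, s3_data_uri):
    """Configure the training job, then call fit() on data stored in S3."""
    # Imported here so that the settings above can be built without the SDK.
    from sagemaker.tensorflow import TensorFlow

    estimator = TensorFlow(**settings)
    estimator.fit(s3_data_uri)
    return estimator
```

A call could look like `train(estimator_settings("arn:aws:iam::123456789012:role/MySageMakerRole"), "s3://my-bucket/training-data")`, where both the role and the bucket are made up.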

That’s it. This is at least 10x (100x ?) less code than any automation you’d be using with EC2 or containers.

And lock-in? Well, none: you’re free to take your TensorFlow code and run it anywhere else.

EC2 or SageMaker?

Compared to EC2, SageMaker saves you from managing any infrastructure, and probably a lot of framework containers too. That might not be a big deal when you’re working with a couple of instances, but it sure is when you start scaling to tens or hundreds, running all kinds of different jobs.

SageMaker also terminates training clusters automatically once training is complete.

You will never overpay for training.

Yes, SageMaker instances are more expensive than EC2 instances. However, if you factor in less ops and automatic termination, I’d be really surprised if the gap wasn’t significantly reduced.

Total cost of ownership is what matters.
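A quick way to reason about this is to compare an always-on cluster with instances that are billed only while training runs. Every number below is hypothetical: the hourly rate, the price premium and the number of training hours are placeholders, not actual AWS prices, and ops costs are not even counted.

```python
def monthly_costs(hourly_rate, premium, training_hours, hours_in_month=730):
    """Return (always-on cost, pay-per-training cost) for one month.

    `premium` is the relative price difference of the managed instance,
    e.g. 0.25 for 25% more expensive. All inputs are hypothetical.
    """
    always_on = hourly_rate * hours_in_month
    pay_per_training = hourly_rate * (1 + premium) * training_hours
    return always_on, pay_per_training

def break_even_utilization(premium):
    """Fraction of the month spent training at which both options cost the same."""
    return 1 / (1 + premium)

# Hypothetical example: 10 dollars per hour, 25% premium, 200 hours of training.
print(monthly_costs(10.0, 0.25, 200))   # (7300.0, 2500.0)
print(break_even_utilization(0.25))     # 0.8
```

In this made-up example, the always-on cluster only wins if it is busy training more than 80% of the time, around the clock.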

EMR or SageMaker?

As mentioned earlier, I don’t see any compelling reason to use EMR at scale for training unless you stick to Spark MLlib. Still, if you’re asking yourself the question, you’re probably already using EMR… so how about both? As it happens, SageMaker also includes a Spark SDK. I’ve covered this topic before.

The short version is: separate concerns.

Spark for large-scale ETL, SageMaker for large-scale training.

Container services or SageMaker?

There is a myriad of technical details that separate these two approaches, but at the end of the day, I think the choice does come down to engineering culture and focus.

Some teams are convinced doing everything themselves creates value for the company (in some cases, it does), and some other teams would rather rely on managed services in order to iterate as fast as possible. Some teams feel better about putting all their eggs in a single basket (“we run everything on Docker clusters”), some other teams are happier with using different services for different things. No one but them can judge what’s best for their particular use case.

My personal choice would still go to SageMaker, because unlike ECS and EKS, SageMaker is built for Machine Learning only: the team is obsessed with simplifying and optimizing the service for ML users, and ML users only. No offence to the ECS and EKS teams, but their focus is different, as they have to accommodate literally every possible workload.

These services are all based on containers anyway, and if you’re able to run distributed TensorFlow with Horovod on Kubernetes, the SageMaker SDK will feel like a breeze! Give it a try and let me know what you think.

That’s the end of the second part. In the next post, we’ll talk about optimizing training from a framework perspective. Plenty more to come, we haven’t even talked about prediction yet!
