Build a modern data science platform: first steps - transition to cloud-based compute
Many data science teams start out using local desktop and on-premise tools to begin exploring data, building machine learning models and performing inference. At first, this is convenient: it offers a quick and simple way to get started. However, as data volumes and computation demands grow, local compute is no longer sufficient and on-premise infrastructure lacks flexibility.
On-premise data science resources can be under-utilized due to the high compute power required and the batch nature of model training
Data science workloads need large amounts of memory, CPU and possibly GPU power. The infrastructure to support this can be expensive and difficult to provision, and resources quickly become under-utilized if model development and training are not consistently saturating the available compute.
Hence, a modern, cloud-based platform with built-in elasticity is ideal for AI/ML workloads.
Comprehensive cloud machine learning services
There are many cloud-based service offerings, such as Amazon SageMaker or Azure Machine Learning, that provide a comprehensive set of tools along with flexible compute power for the entire, end-to-end model pipeline.
These services are useful if you want to adopt a feature-rich set of capabilities covering the end-to-end model lifecycle and related MLOps functions.
However, many teams want to get started by simply taking their current tooling and processes and applying them to cloud-based compute.
For example, switching from a desktop version of RStudio to RStudio Server, using a server-based instance of JupyterHub, or simply running your preferred IDE on a server that has sufficient compute power and can be scaled up or turned on and off as needed.
The advantage of this approach is that you can create a cloud-portable solution while keeping your tooling the same (assuming it can run without licensing issues, which is likely given that most such tooling is open source) and retain control over the underlying infrastructure and configuration. It is also less of an architectural jump to shift from a local process to something very similar that leverages cloud compute. This favours a more conservative, iterative, lift-and-shift style approach to cloud migration.
First-steps data science platform
In this initial, cloud-based data science platform, data scientists are given development boxes, much as software developers might be given access to virtual instances. However, the data science environments can be spec’d to accommodate machine learning workloads and pre-configured with relevant tools and software.
It is critical to establish sound FinOps procedures; at the most basic level, these data science boxes should be switched off when not in use. Automation rules can also be applied to control costs, for instance turning compute off outside working hours and switching under-utilized instances to lower-spec infrastructure, as in the sketch below.
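To illustrate, here is a minimal sketch of such a rule using boto3, assuming your instances carry a hypothetical team=data-science tag and that the script runs on a schedule, for example via cron or an Amazon EventBridge-triggered job:

```python
"""Stop tagged data science instances outside working hours.

A minimal sketch: assumes EC2 instances are tagged (here with the
hypothetical tag team=data-science) and that this script runs on a
schedule, e.g. via cron or an Amazon EventBridge rule.
"""
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # region is an assumption


def stop_data_science_boxes() -> None:
    # Find running instances carrying the (hypothetical) data science tag
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:team", "Values": ["data-science"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        # Stop (not terminate): the boxes keep their EBS volumes and config
        ec2.stop_instances(InstanceIds=instance_ids)


if __name__ == "__main__":
    stop_data_science_boxes()
```

Stopping rather than terminating means each box keeps its volumes and configuration, so work resumes where it left off when the instance is started again.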
Each data science environment can be pre-configured with your chosen tools, for example Jupyter Notebooks or RStudio Server. In the context of Amazon EC2 instances, this could be done using pre-built Amazon Machine Images (AMIs) from the AWS Marketplace or AMI catalog. Alternatively, you could build your own custom AMIs or provision a base instance using configuration management software such as Ansible, as described in the blog post Automate provisioning and configuration of Amazon EC2 instances using Infrastructure as Code.
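As a sketch of what launching such an environment might look like with boto3, assuming a suitable AMI already exists (the AMI ID, key pair name, instance type and bootstrap below are placeholders, not prescribed values):

```python
"""Launch a data science development box from a pre-built AMI.

A minimal sketch: the AMI ID, key pair name and instance type are
placeholders you would replace with your own values.
"""
import boto3

ec2 = boto3.resource("ec2", region_name="eu-west-1")  # region is an assumption

# Optional first-boot bootstrap: one possible way to add tooling not baked
# into the AMI (here, JupyterLab)
USER_DATA = """#!/bin/bash
pip3 install jupyterlab
"""

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: your custom or Marketplace AMI
    InstanceType="r5.2xlarge",        # memory-optimized; size to your workloads
    KeyName="data-science-key",       # placeholder key pair
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,
    TagSpecifications=[{
        "ResourceType": "instance",
        # Tag so cost-control automation (see earlier sketch) can find the box
        "Tags": [{"Key": "team", "Value": "data-science"}],
    }],
)
print("Launched:", instances[0].id)
```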
Once set up, data scientists can access their environments to start exploring data and experimenting with model building. In this context, the data still has to fit within the box, so we rely on vertical scaling to accommodate larger data or more complex workloads, as sketched below.
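On EC2, vertical scaling amounts to stopping the instance, changing its type and starting it again. A minimal sketch, with a placeholder instance ID and target type:

```python
"""Vertically scale an EC2-based data science box.

A minimal sketch: stop the instance, change its type, start it again.
The instance ID and target type below are placeholders.
"""
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # region is an assumption


def resize_instance(instance_id: str, new_type: str) -> None:
    # An instance must be stopped before its type can be changed
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


resize_instance("i-0123456789abcdef0", "r5.4xlarge")  # placeholder values
```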
In this early phase, you may choose to train your model in this same environment after experimentation, and then also score data (conduct inference) using the same approach. This setup allows you to conduct batch scoring and re-training manually when required. You could also set up some orchestration to conduct scheduled inference or re-training, as in the sketch below.
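As an illustration of this manual batch pattern, here is a hedged sketch of a scoring script; the model file, input and output paths, and the joblib/pandas approach are all assumptions rather than a prescribed method, and scheduling it via cron is one simple orchestration option:

```python
"""Manual batch scoring on a data science box.

A minimal sketch: the model file, input/output paths and the
joblib/pandas approach are assumptions, not a prescribed method.
Run by hand, or schedule via cron, e.g.:
  0 6 * * 1  python batch_score.py   # hypothetical weekly run
"""
import joblib
import pandas as pd

MODEL_PATH = "model.joblib"      # placeholder: a previously trained model
INPUT_PATH = "new_records.csv"   # placeholder: data to score
OUTPUT_PATH = "scored_records.csv"


def batch_score() -> None:
    model = joblib.load(MODEL_PATH)        # load the trained model
    data = pd.read_csv(INPUT_PATH)         # load the batch to score
    data["score"] = model.predict(data)    # assumes columns match the
                                           # model's training features
    data.to_csv(OUTPUT_PATH, index=False)  # write the scored output


if __name__ == "__main__":
    batch_score()
```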
Preferably though, once this level of maturity is reached, compute pipelines should be established that allow you to conduct regular training and batch inference as part of a well-defined, independent workload. We will cover this, along with real-time inference, in future posts.