Kalani Samarawickrama

Classifying data using Amazon SageMaker in-built libraries

Amazon SageMaker is one of the newest additions to Amazon’s ever growing portfolio of compute services. This Tech Guide from Mitra Innovation shows how it can also be used to classify data.


It was introduced as a fully managed service that allows developers and data scientists to build, train, and deploy machine learning models without managing the underlying infrastructure. In this guide we walk through using its in-built libraries to classify the MNIST handwritten-digit dataset.

Implementing a machine learning model involves the processing of big data that needs to be analysed in order to provide intelligence and assist in rational decision-making. It can be a complex and expensive process that requires a number of new and advanced programming techniques, as well as a huge amount of computational resource. The ability to process large data sets also requires the use of normalisation techniques and the integration of various data sources towards a single data repository.

Additionally, to fulfill data analysis and forecasting, computational infrastructures need to be set up to process the large data sets that generate the inferences.

Machine learning development teams are also required to set up the management of distributed clusters containing the learning models. (Scaling up and creating the distributed machine learning algorithms to process terabytes of data is a cumbersome process and is not recommended for the weak of will.)

Finally, the learning models need to be deployed on production servers to facilitate the feeding of data and extraction of inferences. This includes setting up a separate set of clusters that will test, version and monitor the data and results.

All in all, the deployment of machine learning models using existing processes and patterns, can be frustrating and difficult to manage effectively.
Fortunately, Amazon SageMaker can make the process easier, as it allows for distributed training (a single API call can spool up a cluster and carry out distributed trainings on large volumes of data). It also hosts and maintains numerous machine learning models.

Amazon SageMaker can be used with a number of machine learning frameworks and approaches, including:

  1. AWS In-built Training Models
  2. Apache Spark
  3. Deep Learning Frameworks such as Gluon, TensorFlow and MXNet
  4. Custom Docker Images.

In this Tech Guide we illustrate an example using a Jupyter Notebook instance and Python 3, as we attempt to ‘classify using SageMaker In-built Libraries’.

Follow the steps below to see what we did:

1. Create an S3 bucket

Create an S3 bucket to hold the following:
  • The model training data
  • Model artifacts (which Amazon SageMaker generates during model training).

2. Create a Notebook instance
Create a Notebook instance by logging onto: https://console.aws.amazon.com/sagemaker/

(Fig 1: Amazon SageMaker Dashboard)


3. Create a new conda_python3 notebook
Once the instance has been created, open it and you will be directed to the Jupyter server. At this point, create a new conda_python3 notebook.

4. Specify the role
Specify the role and the name of the S3 bucket created in step 1 as follows (substitute your own bucket name; the bucket variable is used by the later snippets):

from sagemaker import get_execution_role

role = get_execution_role()
bucket = 'your-s3-bucket-name'

(Snippet 1 : Specify the role and S3 bucket name)

5. Download the MNIST dataset
Download the MNIST dataset to the notebook’s memory.
The MNIST database of handwritten digits has a training set of 60,000 examples.

import pickle, gzip, numpy, urllib.request, json

# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

(Snippet 2 : Download the MNIST dataset)
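
Each of train_set, valid_set and test_set is an (images, labels) pair: the images are 28×28-pixel digits flattened to 784-value float vectors. The small stand-in below (shapes chosen to mirror the real dataset, but with random values and only 100 rows) illustrates the indexing used in the validation step later:

```python
import numpy as np

# Stand-in for the real MNIST training set: an (images, labels) pair.
# The real train_set[0] has shape (60000, 784); we use 100 rows here.
images = np.random.rand(100, 784).astype(np.float32)
labels = np.random.randint(0, 10, size=100)
train_set = (images, labels)

# train_set[0][30:31] keeps the batch dimension: a (1, 784) array
# holding the image at index 30, ready to send to a predictor.
sample = train_set[0][30:31]
print(sample.shape)  # (1, 784)
```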

6. Convert to RecordIO Format
For this example, the data needs to be converted to RecordIO format, a file format for storing a sequence of records. Each record is stored as an unsigned integer specifying the length of the data, followed by the data itself as a binary blob.
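
As a minimal illustration of that length-prefixed framing (a sketch only — SageMaker's actual protobuf recordIO format also adds a magic number and padding), records can be written and read back like this:

```python
import struct
from io import BytesIO

def write_record(stream, payload: bytes) -> None:
    # Unsigned 32-bit length prefix, then the payload as a binary blob.
    stream.write(struct.pack('<I', len(payload)))
    stream.write(payload)

def read_records(stream):
    # Yield payloads until the stream is exhausted.
    while True:
        header = stream.read(4)
        if not header:
            return
        (length,) = struct.unpack('<I', header)
        yield stream.read(length)

buf = BytesIO()
for blob in (b'first record', b'second record'):
    write_record(buf, blob)
buf.seek(0)
print(list(read_records(buf)))  # [b'first record', b'second record']
```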

Algorithms can accept input data from one or more channels. For example, an algorithm might have two channels of input data, training_data and validation_data. The configuration for each channel provides the S3 location where the input data is stored. It also provides information about the stored data: the MIME type, compression method, and whether the data is wrapped in RecordIO format.

Depending on the input mode that the algorithm supports, Amazon SageMaker either copies the input data files from an S3 bucket to a local directory in the Docker container, or makes them available as input streams.

Manual transformation is not needed in this example, because the high-level library's fit method performs the conversion for us.

7. Create a training job
In this example we will use the Amazon SageMaker KMeans module.
From SageMaker, import the KMeans class and configure the estimator as follows (the instance count, instance type and k shown here are example values):

from sagemaker import KMeans

data_location = 's3://{}/kmeans_highlevel_example/data'.format(bucket)
output_location = 's3://{}/kmeans_example/output'.format(bucket)

print('training data will be uploaded to: {}'.format(data_location))
print('training artifacts will be uploaded to: {}'.format(output_location))

kmeans = KMeans(role=role,
                train_instance_count=2,
                train_instance_type='ml.c4.8xlarge',
                output_path=output_location,
                k=10,
                data_location=data_location)

(Snippet 3 : Creating a training job)

  • role — The IAM role that Amazon SageMaker can assume to perform tasks on your behalf (for example, reading training results, called model artifacts, from the S3 bucket and writing training results to Amazon S3).
  • output_path — The S3 location where Amazon SageMaker stores the training results.
  • train_instance_count and train_instance_type — The number and type of ML EC2 compute instances to use for model training.
  • k — The number of clusters to create. For more information, see K-Means Hyperparameters.
  • data_location — The S3 location where the high-level library uploads the transformed training data.


8. Start model training

Call the high-level library's fit method to start the training job. The record_set method converts the numpy training array into the RecordIO-wrapped format described above:

kmeans.fit(kmeans.record_set(train_set[0]))

(Snippet 4 : commence model training)

9. Deploy a model
Deploying a model is a three step process.

(Fig 2: Deploying a model in three steps)


  • Create a Model – CreateModel request is used to provide information such as the location of the S3 bucket that contains your model artifacts and the registry path of the image that contains inference code.
  • Create an Endpoint Configuration – CreateEndpointConfig request is used to provide the resource configuration for hosting. This includes the type and number of ML compute instances to launch for deploying the model.
  • Create an Endpoint – CreateEndpoint request is used to create an endpoint. Amazon SageMaker launches the ML compute instances and deploys the model.
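
Under the hood, the three requests carry payloads along these lines (a sketch with hypothetical names, placeholder ARNs and paths; the corresponding boto3 sagemaker client calls are shown as comments):

```python
# Hypothetical names for illustration only.
model_name = 'kmeans-mnist-model'
endpoint_config_name = 'kmeans-mnist-config'
endpoint_name = 'kmeans-mnist-endpoint'

# 1. CreateModel: where the model artifacts and inference image live.
create_model_request = {
    'ModelName': model_name,
    'PrimaryContainer': {
        'Image': '<inference-image-registry-path>',
        'ModelDataUrl': 's3://<bucket>/kmeans_example/output/model.tar.gz',
    },
    'ExecutionRoleArn': '<role-arn>',
}

# 2. CreateEndpointConfig: what hardware to host the model on.
create_endpoint_config_request = {
    'EndpointConfigName': endpoint_config_name,
    'ProductionVariants': [{
        'VariantName': 'AllTraffic',
        'ModelName': model_name,
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.m4.xlarge',
    }],
}

# 3. CreateEndpoint: launch the instances and deploy the model.
create_endpoint_request = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}

# With boto3, these would be sent as:
#   sm = boto3.client('sagemaker')
#   sm.create_model(**create_model_request)
#   sm.create_endpoint_config(**create_endpoint_config_request)
#   sm.create_endpoint(**create_endpoint_request)
```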

The high-level Python library's deploy method performs all three of these tasks.


kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

(Snippet 5 : Deploy a Training Model using the KMeans class)

The sagemaker.amazon.kmeans.KMeans instance knows the registry path of the image that contains the k-means inference code, so you don’t need to provide it.

This is a synchronous operation. The method waits until the deployment completes before returning. It returns a kmeans_predictor.

10. Validate the model
Here we get an inference for the handwritten digit at index 30 of the train_set dataset.

result = kmeans_predictor.predict(train_set[0][30:31])

(Snippet 6 : Getting an inference using inbuilt K-Means Algorithm)


The result would show the closest cluster and the distance from that cluster.
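
Conceptually, the inference is a nearest-centroid lookup: the model returns the index of the closest of the k learned centroids and the distance to it. A sketch with made-up 2-D centroids (the real model works on 784-dimensional MNIST vectors with k=10):

```python
import numpy as np

def closest_cluster(sample, centroids):
    # Euclidean distance from the sample to every centroid.
    distances = np.linalg.norm(centroids - sample, axis=1)
    label = int(np.argmin(distances))
    return label, float(distances[label])

# Made-up centroids and sample for illustration only.
centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
sample = np.array([9.0, 9.5])

label, distance = closest_cluster(sample, centroids)
print(label, round(distance, 3))  # 1 1.118
```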

The below shows sample images clustered together:


Cluster 0

(Fig 3: Cluster 0)

Cluster 1

(Fig 4: Cluster 1)


Cluster 2

(Fig 5: Cluster 2)

Cluster 3

(Fig 6: Cluster 3)

11. Clean up
Last but not least, once you have completed this tutorial, delete the resources you created in order to avoid further charges: the endpoint, the endpoint configuration, the model, the notebook instance and the S3 bucket.

Thank you for reading our latest Mitra Innovation Tech Guide. We hope you will also read our next tutorial so that we can help you solve some more interesting problems.

(Watch Amazon’s video to find out more about Amazon SageMaker – here.)

Amazon Web Services (AWS) is one of the most progressive vendors in the Cloud-based Infrastructure as a Service (IaaS) market. They regularly assess professional services firms to identify potential Consulting Partners who can help customers design, architect, build, migrate and manage their workloads and applications on AWS.

As Mitra Innovation is skilled in using the AWS platform, we have been chosen as a Standard Consulting Partner for AWS. This means that AWS officially recognises us as a trusted Consulting Partner for enterprises and organisations utilising – and needing help with – the AWS Platform.

Kalani Samarawickrama

Senior Software Engineer | Mitra Innovation