Kalani Samarawickrama

Using AWS SageMaker linear regression to predict store transactions

The ability to accurately forecast retail sales, housing and property sales, weather patterns and customer behaviour is of growing importance to businesses. This is especially so in today’s competitive landscape, where organisations want to anticipate opportunities and capitalise on them as soon as they arise. As a result, large and medium-sized corporations are becoming more interested in machine learning time series forecasting techniques to help them gain a competitive edge.

To serve the growing demand for machine learning forecasting techniques, Mitra Innovation is investigating new technologies that are easy to use by developers with a broad range of skills and experience.

One of the newest additions to the growing list of machine learning tools is Amazon SageMaker, and as a trusted consulting partner of AWS, we were keen to start experimenting with the tool. Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale, removing many of the barriers that typically slow them down.

In this latest Mitra Innovation Tech Guide, we illustrate how to use the Amazon SageMaker built-in linear regression algorithm for forecasting. For demonstration purposes, we’ll be using data from a grocery chain to predict sales transactions for a grocery store.

Sales transaction predictions using Amazon SageMaker linear regression

Corporación Favorita Grocery is based in Ecuador, and was established in Quito in 1952. We have used their data as part of this demonstration. The data that we have (in a transactions.csv file) contains the transactions of all the branches from 2013 to 2017. As a starting point, we chose to use the data from store #47 and plot the corresponding transactional time series.

Let’s get started
We have used a Jupyter Notebook instance provided by Amazon SageMaker, Python 3 and an Amazon S3 bucket for storage. Here are the steps to follow:

1. Specify the role

Specify the role and S3 bucket as follows:


bucket = '<your_s3_bucket_name_here>'
prefix = 'sagemaker/linear_time_series_forecast'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

(Snippet 1: Specifying the role and S3 bucket)


2. Import files

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import time
import json
import sagemaker.amazon.common as smac
import sagemaker
from sagemaker.predictor import csv_serializer, json_deserializer

(Snippet 2: import files)


3. Load data and transform datetime


df_train = pd.read_csv('transactions.csv')
# convert date to datetime
df_train["date"] = pd.to_datetime(df_train["date"])

(Snippet 3: Loading the data and converting the date column to datetime)


4. Select the dataset

df_train_2016 = df_train[df_train["date"].dt.year == 2016]
df_train_2016_store47 = df_train_2016[df_train_2016["store_nbr"] == 47]

(Snippet 4: Selecting the dataset)


5. Drop unnecessary columns from dataframe

transactions_store47 = df_train_2016_store47.drop('date', axis=1)
transactions_store47 = transactions_store47.drop('store_nbr', axis=1)

(Snippet 5: Eliminate unnecessary columns from the dataframe)


6. Plot the transaction fluctuations of 2016


(Snippet 6: Plotting the transaction fluctuations of the given timeframe)


(Fig 1: Transaction fluctuations visualised)


7. Transform the dataset

Our target variable is transactions. We will create explanatory features, such as:

  • Transactions for each of the four preceding observations (the lags shift by rows, so with one row per day these are the four preceding days)
  • Trends – The chart above suggests the trend is simply linear, but we’ll create log and quadratic trends just in case
  • Indicator variables {0 or 1} that will help capture seasonality and key holiday periods.


transactions_store47['transactions_lag1'] = transactions_store47['transactions'].shift(1)
transactions_store47['transactions_lag2'] = transactions_store47['transactions'].shift(2)
transactions_store47['transactions_lag3'] = transactions_store47['transactions'].shift(3)
transactions_store47['transactions_lag4'] = transactions_store47['transactions'].shift(4)
transactions_store47['trend'] = np.arange(len(transactions_store47))
transactions_store47['log_trend'] = np.log1p(np.arange(len(transactions_store47)))
transactions_store47['sq_trend'] = np.arange(len(transactions_store47)) ** 2
weeks = pd.get_dummies(np.array(list(range(52)) * 15)[:len(transactions_store47)], prefix='week')
# align the indicator rows with the filtered dataframe's index before concatenating
weeks.index = transactions_store47.index
transactions_store47 = pd.concat([transactions_store47, weeks], axis=1)

(Snippet 7: Transforming the data)
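To make the feature layout concrete, here is a minimal, dependency-free sketch of the same idea on a toy list. The helper name build_features and the toy numbers are illustrative assumptions and are not part of the original notebook; the sketch only mirrors the lag, trend and indicator columns described above.

```python
import math

def build_features(series, n_lags=4, period=52):
    """Build lag, trend and seasonal-indicator features for a list of numbers.

    Hypothetical helper for illustration; rows without a full lag history are skipped.
    """
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        lags = [series[t - k] for k in range(1, n_lags + 1)]           # lag1 .. lag4
        trends = [t, math.log1p(t), t ** 2]                            # linear, log, quadratic
        season = [1 if t % period == w else 0 for w in range(period)]  # one-hot slot
        rows.append(lags + trends + season)
        targets.append(series[t])
    return rows, targets

toy = [10, 20, 30, 40, 50, 60]
rows, targets = build_features(toy)
print(len(rows), len(rows[0]))  # usable rows, features per row
print(rows[0][:4], targets[0])  # the four lags and the value they help predict
```

Each row ends up with 4 lags, 3 trends and 52 indicators, which is 59 features in total. That count matters later because the training job needs to be told the feature dimension.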


8. Format dataset

  • Clear out the first four rows where we don’t have lagged information
  • Split the target off from the explanatory features
  • Split the data into training, validation, and test groups so that we can tune our model and then evaluate its accuracy on data it hasn’t seen yet. Since this is time-series data, we’ll use the first 60% for training, the second 20% for validation, and the final 20% for final test evaluation.

transactions_store47 = transactions_store47.iloc[4:, ]
split_train = int(len(transactions_store47) * 0.6)
split_test = int(len(transactions_store47) * 0.8)

# .values replaces DataFrame.as_matrix(), which newer pandas versions no longer provide
train_y = transactions_store47['transactions'][:split_train]
train_X = transactions_store47.drop('transactions', axis=1).iloc[:split_train, ].values
validation_y = transactions_store47['transactions'][split_train:split_test]
validation_X = transactions_store47.drop('transactions', axis=1).iloc[split_train:split_test, ].values
test_y = transactions_store47['transactions'][split_test:]
test_X = transactions_store47.drop('transactions', axis=1).iloc[split_test:, ].values

(Snippet 8: Formatting the dataset)
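The key point of this step is that the split is chronological rather than random. As a minimal sketch in plain Python (the function name chronological_split and the ten-item stand-in series are assumptions for illustration), the same 60/20/20 cut looks like this:

```python
def chronological_split(rows, train_end=0.6, test_start=0.8):
    """Split an ordered sequence into train / validation / test without shuffling."""
    split_train = int(len(rows) * train_end)
    split_test = int(len(rows) * test_start)
    return rows[:split_train], rows[split_train:split_test], rows[split_test:]

days = list(range(10))  # stand-in for ten consecutive observations
train, validation, test = chronological_split(days)
print(train, validation, test)
```

Every training observation comes before every validation observation, which in turn comes before every test observation, so the model is always evaluated on data from its own future.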


9. Convert to RecordIO format as required by Amazon SageMaker

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, np.array(train_X).astype('float32'), np.array(train_y).astype('float32'))
buf.seek(0)  # rewind so the upload in the next step reads from the start of the buffer

(Snippet 9: Converting to RecordIO format as required by Amazon SageMaker)


10. Upload data to S3 Bucket

key = 'linear_train.data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

(Snippet 10: Uploading the data to the Amazon S3 Bucket)
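One detail worth checking when uploading an in-memory buffer: after writing, the buffer’s position sits at the end, and as far as we understand upload_fileobj reads from the current position. The sketch below uses only the standard library (the byte string is a made-up placeholder) to show why rewinding with seek(0) before the upload matters.

```python
import io

buf = io.BytesIO()
buf.write(b"record-1,record-2")  # writing leaves the position at the end

at_end = buf.read()              # nothing is left to read from here
buf.seek(0)                      # rewind to the start
from_start = buf.read()          # now the full payload is available

print(len(at_end), len(from_start))
```

If the uploaded object in S3 turns out to be empty, a missing rewind is a likely cause.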


11. Convert the validation dataset

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, np.array(validation_X).astype('float32'), np.array(validation_y).astype('float32'))
buf.seek(0)  # rewind so the upload in the next step reads from the start of the buffer

(Snippet 11: Converting the validation dataset)


12. Upload the validation dataset

key = 'linear_validation.data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', key)).upload_fileobj(buf)
s3_validation_data = 's3://{}/{}/validation/{}'.format(bucket, prefix, key)
print('uploaded validation data location: {}'.format(s3_validation_data))

(Snippet 12: Uploading the validation dataset)


13. Specify the containers

containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest'}

(Snippet 13: Specifying the containers)


14. Train the model using Amazon Linear Learner Algorithm

Amazon SageMaker’s Linear Learner actually fits many models in parallel, each with slightly different hyperparameters, and then returns the one with the best fit. This functionality is automatically enabled. We can influence this using parameters like:

  • num_models to increase the total number of models run. The specified parameters will always be one of those models, but the algorithm also chooses models with nearby parameter values in order to find a nearby solution that may be more optimal. In this case, we’re going to use the max of 32.
  • loss, which controls how we penalise mistakes in our model estimates. For this case, let’s use absolute loss as we haven’t spent much time cleaning the data, and absolute loss will adjust less to accommodate outliers.
  • wd or l1, which control regularisation. Regularisation can prevent model overfitting by preventing our estimates from becoming too finely tuned to the training data, which can actually hurt generalisability. In this case, we’ll leave these parameters at their default “auto”.
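To see why absolute loss adjusts less to outliers than squared loss, consider fitting a single constant to a handful of numbers. The constant that minimises squared error is the mean, and the constant that minimises absolute error is the median. The sketch below uses made-up numbers and illustrates the general idea only; it says nothing about Linear Learner’s internals.

```python
from statistics import mean, median

clean = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 100]  # one bad reading replaces the last value

# How far does the best constant move when the outlier appears?
shift_squared = mean(with_outlier) - mean(clean)       # squared-loss fit (mean)
shift_absolute = median(with_outlier) - median(clean)  # absolute-loss fit (median)

print(shift_squared, shift_absolute)
```

The squared-loss fit is dragged a long way towards the bad reading, while the absolute-loss fit does not move at all in this toy case.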

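sess = sagemaker.Session()

# NOTE: the source listing is truncated after output_path; the other arguments and
# set_hyperparameters (SDK v1 style) are assumptions, except num_models and loss from the text.
linear = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                       role, train_instance_count=1, train_instance_type='ml.c4.xlarge',
                                       output_path='s3://{}/{}/output'.format(bucket, prefix),
                                       sagemaker_session=sess)
linear.set_hyperparameters(feature_dim=59,  # 4 lags + 3 trends + 52 week indicators
                           predictor_type='regressor', mini_batch_size=100,
                           num_models=32, loss='absolute_loss')

linear.fit({'train': s3_train_data, 'validation': s3_validation_data})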

(Snippet 14: Training the model using Amazon linear learner algorithm)


15. Deploy the model to an Endpoint

linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')  # assumed: the source listing is truncated here

(Snippet 15: Deploying the model to an endpoint)


16. Forecast using Naive Forecast

transactions_store47['transactions_lag52'] = transactions_store47['transactions'].shift(52)
transactions_store47['transactions_lag104'] = transactions_store47['transactions'].shift(104)
transactions_store47['transactions_naive_forecast'] = transactions_store47['transactions_lag52'] ** 2 / transactions_store47['transactions_lag104']
naive = transactions_store47[split_test:]['transactions_naive_forecast'].values

(Snippet 16: forecasting using Naive Forecast)
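The naive forecast above takes the value from 52 observations ago and scales it by the growth between 104 and 52 observations ago, since lag52 ** 2 / lag104 equals lag52 × (lag52 / lag104). A minimal sketch in plain Python follows; the function name naive_forecast and the toy history are assumptions for illustration.

```python
def naive_forecast(history, period=52):
    """Forecast the next value as last period's value scaled by the prior period's growth."""
    if len(history) < 2 * period:
        raise ValueError("need at least two full periods of history")
    last = history[-period]        # value one period before the point being forecast
    before = history[-2 * period]  # value two periods before it
    # same as last * (last / before): repeat the growth seen from `before` to `last`
    return last ** 2 / before

# Toy example with period=1 to keep the history short: 100 -> 110 is 10% growth
forecast = naive_forecast([100, 110], period=1)
print(forecast)
```

With 10% growth from 100 to 110, the forecast repeats that growth and lands on 121.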


17. Verify forecasting accuracy using Naive Forecast

print('Naive MdAPE =', np.median(np.abs(test_y - naive) / test_y))
plt.plot(np.array(test_y), label='actual')
plt.plot(naive, label='naive')

(Snippet 17: Verifying accuracy using Naive Forecast)


There are many metrics for measuring forecast error; in this example we use the Median Absolute Percent Error (MdAPE).
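As a minimal sketch of the metric in plain Python (the function name mdape and the numbers are made up for illustration), MdAPE is the median of |actual - forecast| / actual across all forecast points:

```python
from statistics import median

def mdape(actual, forecast):
    """Median of |actual - forecast| / actual over paired observations."""
    return median(abs(a - f) / a for a, f in zip(actual, forecast))

actual = [100, 200, 400, 50]
forecast = [110, 180, 400, 100]  # the last point is off by 100%
score = mdape(actual, forecast)
print(score)
```

Because it uses the median rather than the mean, a single badly missed point (the 100% error here) does not dominate the score.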


Naive MdAPE = 0.381823053331


(Fig 2: Naive Median Absolute Percent Error)


18. One Step Ahead Forecast

Next, we set up the conversion of our numpy arrays into a format that can be handled by the HTTP POST request we pass to the inference container. In this case it’s a simple CSV string. The results will be returned as JSON. For these common formats we can use the Amazon SageMaker Python SDK’s built-in csv_serializer and json_deserializer functions.

linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer

(Snippet 18: One step ahead forecast)
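To illustrate what travels over the wire, here is a rough sketch using only the standard library. The two helpers are hypothetical stand-ins written for this guide and are not the SDK’s csv_serializer or json_deserializer. The response values are made up, and the response shape is assumed from the 'predictions' and 'score' keys read in the next step.

```python
import json

def to_csv_payload(rows):
    """Encode rows of numbers as CSV text, one row per line (illustrative stand-in)."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)

def scores_from_response(body):
    """Extract predicted values from a JSON body shaped like {'predictions': [{'score': ...}]}."""
    return [p["score"] for p in json.loads(body)["predictions"]]

payload = to_csv_payload([[1.5, 2.0, 3], [4, 5.25, 6]])
fake_response = '{"predictions": [{"score": 1234.5}, {"score": 987.0}]}'  # made-up values
scores = scores_from_response(fake_response)
print(payload)
print(scores)
```

Each feature row becomes one comma-separated line in the request, and each row gets one score back in the response.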


19. Call the endpoint to get the predictions

result = linear_predictor.predict(test_X)
one_step = np.array([r['score'] for r in result['predictions']])

(Snippet 19: Calling the endpoint to get predictions)


20. Verify forecasting accuracy using One Step Ahead

print('One-step-ahead MdAPE = ', np.median(np.abs(test_y - one_step) / test_y))
plt.plot(np.array(test_y), label='actual')
plt.plot(one_step, label='forecast')

(Snippet 20: Verifying forecast accuracy using One Step Ahead)


One step ahead MdAPE = 0.154073139758

(Fig 3: One Step Ahead MdAPE)


21. Multi step forecast

Loop over, invoking the endpoint one row at a time, and ensure the lags in the model are updated appropriately.


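multi_step = []
lags = test_X[0, 0:4]
for row in test_X:
    row[0:4] = lags  # overwrite the observed lags with our running forecasts
    result = linear_predictor.predict(row)
    prediction = result['predictions'][0]['score']
    multi_step.append(prediction)  # keep each forecast
    lags[1:4] = lags[0:3]  # shift the lags back by one step
    lags[0] = prediction

multi_step = np.array(multi_step)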

(Snippet 21: Multi step forecasting)
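The loop above can be tried out without an endpoint. The sketch below is a simplified, self-contained version in plain Python; multi_step_forecast and toy_model are hypothetical names, and the stub model simply stands in for the call to the deployed predictor.

```python
def multi_step_forecast(rows, predict, n_lags=4):
    """Forecast row by row, feeding each prediction back in as the newest lag.

    rows: feature rows whose first n_lags entries are lag1..lagN.
    predict: any callable mapping a feature list to a number (stands in for the endpoint).
    """
    lags = list(rows[0][:n_lags])        # only the first row's lags are real observations
    forecasts = []
    for row in rows:
        features = lags + list(row[n_lags:])
        prediction = predict(features)
        forecasts.append(prediction)
        lags = [prediction] + lags[:-1]  # shift: the newest prediction becomes lag1
    return forecasts

def toy_model(features):
    """Stub model: the previous value plus one."""
    return features[0] + 1.0

rows = [
    [4.0, 4.0, 4.0, 4.0, 0],      # real lags for the first step
    [99.0, 99.0, 99.0, 99.0, 1],  # observed lags here are ignored and replaced
    [99.0, 99.0, 99.0, 99.0, 2],
]
forecasts = multi_step_forecast(rows, toy_model)
print(forecasts)
```

After the first row, the observed lag values (99.0 in this toy data) never reach the model. Every later forecast is built on earlier forecasts, which is why errors can accumulate.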


22. Verify forecasting accuracy using Multi Step Ahead

print('Multi-step-ahead MdAPE =', np.median(np.abs(test_y - multi_step) / test_y))
plt.plot(np.array(test_y), label='actual')
plt.plot(multi_step, label='forecast')

(Snippet 22: Verifying forecast accuracy using Multi Step Ahead)


(Fig 4: Multi-step-ahead MdAPE = 0.413045222697)



Comparing the Median Absolute Percent Error (MdAPE) of the three approaches shows that the ‘One Step Ahead’ method gives the best predictions with the lowest error, followed by the Naive forecast and then the Multi Step Ahead forecast. The gap between the one-step-ahead forecast and the second-best Naive forecast is substantial (roughly 0.15 versus 0.38), and a difference of that size in forecast error could translate into greater losses for the store.

The multi step forecasting method feeds its own earlier predictions back in as lag values, so past errors can compound into future errors. This is one reason for the high error rate of multi step forecasts.

On the other hand, the ‘One Step Ahead’ method updates its history with the known actual value at each step and so produces the least error. The Naive forecast sits in between, thanks to its general ability to forecast stable time series data and adjust to seasonal variations and trends.


One Step < Naive < Multi Step
0.154073139758 < 0.381823053331 < 0.413045222697

(Snippet 23: Rate of error with respect to the forecasting method)

Hope you enjoyed reading. This example was based on https://github.com/awslabs/amazon-sagemaker-examples so please check it out for more cool examples.


Thank you for reading our latest Mitra Innovation Tech Guide. We hope you will also read our next tutorial so that we can help you solve some more interesting problems.

Amazon Web Services (AWS) is one of the most progressive vendors in the Cloud-based Infrastructure as a Service (IaaS) market. They regularly assess professional services firms to identify potential Consulting Partners who can help customers design, architect, build, migrate and manage their workloads and applications on AWS.

As Mitra Innovation is skilled in using the AWS platform, we have been chosen as a Standard Consulting Partner for AWS. This means that AWS officially recognises us as a trusted Consulting Partner for enterprises and organisations utilising – and needing help with – the AWS Platform.


Kalani Samarawickrama

Senior Software Engineer | Mitra Innovation