Rossmann Store Sales Prediction

Introduction:

The demand for a good or service is constantly shifting. Without effectively predicting client demand and future sales of products/services, no firm can enhance its financial performance. Sales forecasting predicts a given product's demand or sales over a predetermined time frame. I'll demonstrate how machine learning can be used to forecast sales using a real-world business challenge from Kaggle in this post. Everything is resolved entirely from the start in this case study. So, you will watch every step of how a case study is resolved in the actual world.

Issue Statement

In seven European nations, Rossmann runs more than 3,000 pharmacies.

Rossmann shop managers must forecast their daily sales six weeks in advance. Factors affecting store sales, including marketing, rivalry, state and federal holidays, seasonality, and location. The accuracy of the findings can be highly variable because thousands of different managers are making sales predictions based on their situations.

Error Metric: RMSPE stands for Root Mean Square Percent Error.

The metric's formula is as follows:

Objectives:

Use the data to forecast sales for the next six weeks.
Reduce the specified measure as much as you can.

Data:

The files are as follows:

train.csv
test.csv
store.csv

Data fields:

Id: A unique identifier for a double contained in the test set.
The store is a unique Id for every store.
Sales are the daily turnover, which is what you are assuming.
Customers are how many there are on a particular day.
Open: A 0 means the store was closed, and a one means it was open.

StateHoliday is a flag that denotes a state holiday. With very few exceptions, all businesses are typically closed on national holidays. Keep in mind that every institution is out on weekends and federal holidays. Stands for a public holiday, B for the Easter break, C for Christmas, and N for none.

If the closing of public schools impacted the ( Store, Time ), it is indicated by SchoolHoliday.

StoreType distinguishes the four distinct store models ( a, b, c, and d ).

Assortment - specifies three levels of assortment: basic, extra, and extended.

CompetitionDistance is measuring the distance in meters to the closest rival store.

CompetitionOpenSince - provides an approximation of the year and month the closest competitor was first made available.

Shows whether a shop is offering a promotion that day.

Promo2 is a continuous campaign that some stores are running: 0 indicates that a store is not taking part, while one indicates it is.

Promo2Since[Year / Week] - specifies the calendar year and week when the store first joined Promo2.

PromoInterval defines the months in which Promo2 is launched at regular intervals.

Exploratory Data Analysis(EDA)

Let us use EDA to obtain insights into the provided data.

The following is information about train.csv:

class 'pandas.core.frame.DataFrame'>
Int64Index: 1058297 entries, 0 to 41087
Data columns ( total 11 columns ):
Customers        1017209 nonnull float64
Date             1058297 non null datetime64[ns]
DayOfWeek        1058297 non null int64
Id               41088 non null float64
Open             1058286 non null float64
Promo            1058297 non null int64
Sales            1017209 non null float64
SchoolHoliday    1058297 non null int64
Set              1058297 non null int64
StateHoliday     1058297 non null object
Store            1058297 non null int64
dtypes: datetime64[ns]( 1 ), float64( 4 ), int64( 5 ), object( 1 )
memory usage: 96.9+ MB

As we can see, we have around 1 million data points. Also, because this is a time-series prediction issue, we must sort the data by date.

Our goal variable, in this case, is Sales.

The following is information about store.csv:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1115 entries, 0 to 1114
Data columns ( total ten columns ):
Store                        1115 non null int64
StoreType                    1115 non null object
Assortment                   1115 non null object
CompetitionDistance          1112 non null float64
CompetitionOpenSinceMonth    761 non null float64
CompetitionOpenSinceYear     761 non null float64
Promo2                       1115 non null int64
Promo2SinceWeek              571 non null float64
Promo2SinceYear              571 non null float64
PromoInterval                571 non null object
dtypes: float64( 5 ), int64( 2 ), object( 3 )
memory usage: 95.8+ KB

We have 1115 distinct shops. Many of the columns in this table have null values. We'll take charge of them in a moment.

Let's look at the details of the columns in the data now.

Promo:

Promotion column next to Sales and Customers.

We can see that revenue and consumer base increase significantly during promotions. This demonstrates that promotion has a favorable impact on a store.

Sales:

Sales over week Average

It is also worth noting that Christmas and New Year's ( see graph at week 52 ) led to a spike in sales. Because Rossmann Stores offers health and beauty items, it is reasonable to assume that around the holidays and New Year's, individuals buy beauty products when they go out to celebrate, which might explain the rapid surge in sales.

DayOfWeek:

Sales and customers are compared to the DayOfWeek column.

Since most stores are closed, we can see that business and consumer numbers are down on Sundays.

In addition, sales on Monday are the greatest of the week. This might be because most stores are closed on Sundays.

Another crucial point is that retailers operating during school vacations had more sales than usual.

Customers and sales are based on a store's store type.

We can see how stores of type A have a more significant number of customers and sales. StoreType D ranks second in both Sales as well as Customers.

Conclusions of EDA:

A is the most popular and congested retail type.
The quantity of customers is closely connected with sales.
Promotion increases both sales and customers in all stores.
Stores open during the school holidays had more sales than typical days.
School holidays have more stores open than state holidays.
Sales increase over Christmas week, maybe because individuals buy more beauty items throughout the holiday season.
The lack of data in attributes Competition Open Since Year / Month does not imply the absence of competition. Where the other two values are null, distance values are not null.
After evaluating sales using Wave decomposition, I discovered some seasonality in the Sales data.

Feature Engineering

Outliers Column:

In this column, we will determine whether a Sales number is an outlier or not based on the Median Absolute Deviation ( MAD ).

The MAD formula.

We built outlier columns store per shop, which means we did it for each unique store independently and then combined the data.

Date Features:

First, we'll use pandas' to__datetime method to transform the Date column. Following that, we can extract further properties from Date.

This week's, last week's, and next week's holidays:

We've designed three features that show the total number of holidays in the current week, the previous week, and the following week.

State Holiday Calculator:

The function shown above was used to generate two new features. One indicates how many days remain before a state holiday, while the other shows how many days have elapsed since the last state holiday.

School holiday promotion and counter:

In addition to the characteristics above, I built four more that indicate the number of days before or after a promotion or a school holiday.

Close dummy variable: This feature has two values: +1 or -1. +1 if the shop was closed yesterday or tomorrow; otherwise, -1.

Removing data points with zeros Sales: In this case, data points with zeros are eliminated since they indicate that the store was closed for whatever reason. And if we receive a store that isn't open, we can forecast zero sales.

Customers__per__day, Sales__per__customers__per__day, and Sales__per__customers__per__day:

The names of the features merely indicate their meaning. There is no need for additional clarification.

Open Competition and Open Promo: We are changing these two characteristics from 'year' to month as a unit.

Promo interval feature generation: Promointerval is offered in the following format: May, August, November. We will split them as follows: May is one characteristic, August is another, and November is the third.

Sales Variation and Acceleration: Variation = y - ( y-1 ), y = sales

Acceleration equals [( y-1 ) - ( y-2 )]., where y = sales

Fourier Characteristics:

I use the fft function from numpy to compute Fourier frequencies and amplitudes. Then employ them as features.

Other features include:

These contain vital patterns about DayOfWeek, Promotions, Holidays, etc.

External Information:

There are just two of them for extra information. One type of data is state data, which identifies which state a shop belongs to, and the other is weather data for a particular state on a specific day.

VIF Analysis: After adding all of the characteristics, we ran a VIF analysis to see whether there was any collinearity between them. High collinearity features were eliminated.

Let's do some modeling right now.

Modeling

Base model:

We used sklearn Pipe and Column Transformer to preprocess the data.

import numpy as npp
import pandas as pdd
import matplotlib.pyplot as plot
import xgboost as boost

import pylab
import csv
import datetime
import math
import re
import time
import random
import os
From pandas. Series.offsets import *
from operator import *
from sklearn.cross__validation import train__test__split
%matplotlib inline
npp.set__printoptions( precision = 4, threshold = 10000, linewidth = 100, edgeitems = 999, suppress = True )
pdd.set__option( 'display.max__columns', None )
pdd.set__option( 'display.max__rows', None )
pdd.set__option( 'display.width', 100 )
pdd.set__option( 'expand__frame__repr', False )
pdd.set__option( 'precision', 6 )
start__time  =  time.time( )
In [2]:
def ToWeight( y ):
    w  =  npp.zeros( y.shape, dtype = float )
    ind  =  y ! =  0
    w[ind]  =  1. / ( y[ind]**2 )
    return w

def rmspe( yhat, y ):
    w  =  ToWeight( y )
    rmspe  =  npp.sqrt( npp.mean(  w * ( y - yhat )**2  ) )
    return rmspe

def rmspe__xg( yhat, y ):
    # y  =  y.values
    y  =  y.get__label( )
    y  =  npp.exp( y ) - 1
    yhat  =  npp.exp( yhat ) - 1
    w  =  ToWeight( y )
    rmspe  =  npp.sqrt( npp.mean( w * ( y - yhat )**2 ) )
    return "rmspe", rmspe

The median is used to impute numerical values, whereas the most frequent is used to assume qualitative values. Numeric values are scaled as well.

Now divide the data into training and validation sets.

Spliting Data in test and train for Outliers

In [36]:
X__train, X__test, y__train, y__test  =  train__test__split( dataframe.loc[( dataframe['Set']  =  =  1 ) & ( dataframe['Open']  =  =  1 ) & ( dataframe['Outlier']  =  =  False )][features__x],
                                                    dataframe.loc[( dataframe['Set']  =  =  1 ) & ( dataframe['Open']  =  =  1 ) & ( dataframe['Outlier']  =  =  False )][features__y],
                                                    test__size = 0.1, random__state = seed )
In [37]:
dtrain  =  xgboost.DMatrix( X__train, y__train )
dtest  =  xgboost.DMatrix( X__test, y__test )
In [38]:
num__round  =  20000
evallist  =  [( dtrain, 'train' ), ( dtest, 'test' )]
In [39]:
param  =  { 'bst:max__depth' :12,
         'bst:eta' :0.01,
         'subsample' :0.8,
         'colsample__bytree' :0.7,
         'silent' :1,
         'objective' :'reg:linear',
         'nthread' :6,
         'seed' :seed }

plst  =  param.items( )

bst  =  xgboost.train( plst, dtrain, num__round, evallist, feval = rmspe__xg, verbose__eval = 250, early__stopping__rounds = 250 )

Output:

Will train until test error hasn't decreased in 250 rounds.
[0]	train-rmspe:0.99963	test-rmspe:0.999863
[250]	train-rmspe:0.41216	test-rmspe:0.487971
[500]	train-rmspe:0.19972	test-rmspe:0.188309
[71]	train-rmspe:1.166821	test-rmspe:1.156818
[111]	train-rmspe:1.137129	test-rmspe:1.132996
151]	train-rmspe:1.122311	test-rmspe:1.121135
[111]	train-rmspe:1.119952	test-rmspe:1.112465
[175]	train-rmspe:1.111481	test-rmspe:1.116788
[211]	train-rmspe:1.193883	test-rmspe:1.112796
[251]	train-rmspe:1.188571	test-rmspe:1.111461
[511]	train-rmspe:1.183871	test-rmspe:1.198587
[275]	train-rmspe:1.181263	test-rmspe:1.197114
[311]	train-rmspe:1.177273	test-rmspe:1.195896
[351]	train-rmspe:1.174326	test-rmspe:1.194886
[511]	train-rmspe:1.171833	test-rmspe:1.194189
[375]	train-rmspe:1.169711	test-rmspe:1.193385
[411]	train-rmspe:1.167834	test-rmspe:1.192811
[451]	train-rmspe:1.166113	test-rmspe:1.192347
[511]	train-rmspe:1.164519	test-rmspe:1.191941
[4751]	train-rmspe:1.162961	test-rmspe:1.191569
[5111]	train-rmspe:1.161581	test-rmspe:1.191271
[5251]	train-rmspe:1.161224	test-rmspe:1.191116
[5511]	train-rmspe:1.158971	test-rmspe:1.191792
[5751]	train-rmspe:1.157782	test-rmspe:1.191615
[6111]	train-rmspe:1.156656	test-rmspe:1.191459
[6251]	train-rmspe:1.155568	test-rmspe:1.191351
[6511]	train-rmspe:1.154547	test-rmspe:1.191252
[6751]	train-rmspe:1.153527	test-rmspe:1.191143
[7111]	train-rmspe:1.152577	test-rmspe:1.191167
[7251]	train-rmspe:1.151698	test-rmspe:1.191118
[7511]	train-rmspe:1.151825	test-rmspe:1.189956
[7751]	train-rmspe:1.151112	test-rmspe:1.189897
[8111]	train-rmspe:1.149217	test-rmspe:1.189844
[8251]	train-rmspe:1.148413	test-rmspe:1.189811
[8511]	train-rmspe:1.147679	test-rmspe:1.189768
[8751]	train-rmspe:1.146973	test-rmspe:1.189742
[9111]	train-rmspe:1.146274	test-rmspe:1.189741
Stopping. Best iteration:
[8751]	train-rmspe:1.146971	test-rmspe:1.189741

Feature Selection:

I did forward deciding on all of the additional features created in the feature engineering stage after generating the aforementioned basic model.

The following are the new features in the pipeline:

Reading store data

In [14]:
dataframe__store  =  pdd.read__csv( '.. / data / store.csv', 
                       nrows = nrows )
In [15]:
In [16]:
dataframe__store['StoreType']  =  dataframe__store['StoreType'].astype( 'category' ).cat.codes
dataframe__store['Assortment']  =  dataframe__store['Assortment'].astype( 'category' ).cat.codes
In [17]:
def convertCompetitionOpen( dataframe ):
    try:
        date  =  '{  }-{  }'.format( int( dataframe['CompetitionOpenSinceYear'] ), int( dataframe['CompetitionOpenSinceMonth'] ) )
        return pdd.to__datetime( date )
    except:
        return npp.nan

dataframe__store['CompetitionOpenInt']  =  dataframe__store.apply( lambda dataframe: convertCompetitionOpen( dataframe ), axis = 1 ).astype( npp.int64 )
In [18]:
def convertPromo2( dataframe ):
    try:
        date  =  '{  }{  }1'.format( int( dataframe['Promo2SinceYear'] ), int( dataframe['Promo2SinceWeek'] ) )
        return pdd.to__datetime( date, format = '%Y%W%w' )
    except:
        return npp.nan

dataframe__store['Promo2SinceFloat']  =  dataframe__store.apply( lambda dataframe: convertPromo2( dataframe ), axis = 1 ).astype( npp.int64 )
In [19]:
s  =  dataframe__store['PromoInterval'].str.split( ',' ).apply( pdd.Series, 1 )
s.columns  =  ['PromoInterval0', 'PromoInterval1', 'PromoInterval2', 'PromoInterval3']
dataframe__store  =  dataframe__store.join( s )
In [20]:
def monthToNum( date ):
    return{ 
            'Jan' : 1,
            'Feb' : 2,
            'Mar' : 3,
            'Apr' : 4,
            'May' : 5,
            'Jun' : 6,
            'Jul' : 7,
            'Aug' : 8,
            'Sept' : 9, 
            'Oct' : 10,
            'Nov' : 11,
            'Dec' : 12
     }[date]

dataframe__store['PromoInterval0']  =  dataframe__store['PromoInterval0'].map( lambda x: monthToNum( x ) if str( x ) ! =  'nan' else npp.nan )
dataframe__store['PromoInterval1']  =  dataframe__store['PromoInterval1'].map( lambda x: monthToNum( x ) if str( x ) ! =  'nan' else npp.nan )
dataframe__store['PromoInterval2']  =  dataframe__store['PromoInterval2'].map( lambda x: monthToNum( x ) if str( x ) ! =  'nan' else npp.nan )
dataframe__store['PromoInterval3']  =  dataframe__store['PromoInterval3'].map( lambda x: monthToNum( x ) if str( x ) ! =  'nan' else npp.nan )
In [21]:
del dataframe__store['PromoInterval']
In [22]:
store__features  =  ['Store', 'StoreType', 'Assortment', 
                  'CompetitionDistance', 'CompetitionOpenInt',
                  'PromoInterval0']

### Features not helping
# PromoInterval1, PromoInterval2, PromoInterval3

features__x  =  list( set( features__x + store__features ) )
In [23]:
dataframe  =  pdd.merge( dataframe, dataframe__store[store__features], how = 'left', on = ['Store'] )
In [24]:
### Convert every NAN to -1
for feature in features__x:
    dataframe[feature]  =  dataframe[feature].fillna( -1 )

Meta Learning:

This method is outlined as follows:

Divide the data into an 80-20 split.
Split the train section into two pieces, D1 and D2.
Take 9 D1 samples and train a forest-based Regressor model on all of them.
Predict D2 based on these 9 models. To train a new model, use these 9 forecasts as features and D2 y__original as output.
For the experimental set, forecast it using 9 models and send it to the meta model using 9 forecasts as features. As the final prediction, use the prediction from the meta model.

Reading sales data

In [4]:
nrows  =  None

dataframe__train  =  pdd.read__csv( '.. / data / train.csv', 
                       nrows = nrows,
                       parse__dates = ['Date'],
                       date__parser = ( lambda dt: pdd.to__datetime( dt, format = '%Y-%m-%d' ) ) )

nrows  =  nrows

dataframe__submit  =  pdd.read__csv( '.. / data / test.csv', 
                        nrows = nrows,
                        parse__dates = ['Date'],
                        date__parser = ( lambda dt: pdd.to__datetime( dt, format = '%Y-%m-%d' ) ) )
dataframe__train['Set']  =  1
dataframe__submit['Set']  =  0
In [6]:
frames  =  [dataframe__train, dataframe__submit]
dataframe  =  pdd.concat( frames )
In [8]:
features__x  =  ['Store', 'Date', 'DayOfWeek', 'Open', 'Promo', 'SchoolHoliday', 'StateHoliday']
features__y  =  ['SalesLog']
In [9]:
dataframe  =  dataframe.loc[~( ( dataframe['Open']  =  =  1 ) & ( dataframe['Sales']  =  =  0 ) )]
In [10]:
dataframe.loc[dataframe['Set']  =  =  1, 'SalesLog']  =  npp.log1p( dataframe.loc[dataframe['Set']  =  =  1]['Sales'] ) #  =  npp.log( dataframe['Sales'] + 1 )
In [11]:
dataframe['StateHoliday']  =  dataframe['StateHoliday'].astype( 'category' ).cat.codes
In [12]:
var__name  =  'Date'

dataframe[var__name + 'Day']  =  pdd.Index( dataframe[var__name] ).day
dataframe[var__name + 'Week']  =  pdd.Index( dataframe[var__name] ).week
dataframe[var__name + 'Month']  =  pdd.Index( dataframe[var__name] ).month
dataframe[var__name + 'Year']  =  pdd.Index( dataframe[var__name] ).year
dataframe[var__name + 'DayOfYear']  =  pdd.Index( dataframe[var__name] ).dayofyear

dataframe[var__name + 'Day']  =  dataframe[var__name + 'Day'].fillna( 0 )
dataframe[var__name + 'Week']  =  dataframe[var__name + 'Week'].fillna( 0 )
dataframe[var__name + 'Month']  =  dataframe[var__name + 'Month'].fillna( 0 )
dataframe[var__name + 'Year']  =  dataframe[var__name + 'Year'].fillna( 0 )
dataframe[var__name + 'DayOfYear']  =  dataframe[var__name + 'DayOfYear'].fillna( 0 )

features__x.remove( var__name )
features__x.append( var__name + 'Day' )
features__x.append( var__name + 'Week' )
features__x.append( var__name + 'Month' )
features__x.append( var__name + 'Year' )
features__x.append( var__name + 'DayOfYear' )
In [13]:
dataframe['DateInt']  =  dataframe['Date'].astype( npp.int64 )



In [40]:
dpred  =  xgboost.DMatrix( dataframe.loc[( dataframe['Set']  =  =  1 ) & ( dataframe['Open']  =  =  1 ) & ( dataframe['Outlier']  =  =  True )][features__x] )
In [41]:
ypred__bst  =  bst.predict( dpred )
In [42]:
dataframe.loc[( dataframe['Set']  =  =  1 ) & ( dataframe['Open']  =  =  1 ) & ( dataframe['Outlier']  =  =  True ), 'SalesLog']  =  ypred__bst
dataframe.loc[( dataframe['Set']  =  =  1 ) & ( dataframe['Open']  =  =  1 ) & ( dataframe['Outlier']  =  =  True ), 'Sales']  =  npp.exp( ypred__bst ) - 1
In [43]:
no__stores__to__check  =  10

plot.rcParams["figure.figsize"]  =  [20,no__stores__to__check*5]

for i in range( 1,no__stores__to__check+1 ):
    stor  =  i

    # Normal sales
    X1  =  dataframe.loc[( dataframe['Set']  =  =  1 ) & ( dataframe['Store']  =  =  stor ) & ( dataframe['Open']  =  =  1 ) & ( dataframe['Outlier']  =  =  False )]
    y1  =  dataframe.loc[( dataframe['Set']  =  =  1 ) & ( dataframe['Store']  =  =  stor ) & ( dataframe['Open']  =  =  1 ) & ( dataframe['Outlier']  =  =  False )]['Sales']

    # Outliers
    X2  =  dataframe.loc[( dataframe['Set']  =  =  1 ) & ( dataframe['Store']  =  =  stor ) & ( dataframe['Open']  =  =  1 ) & ( dataframe['Outlier']  =  =  True )]
    y2  =  dataframe.loc[( dataframe['Set']  =  =  1 ) & ( dataframe['Store']  =  =  stor ) & ( dataframe['Open']  =  =  1 ) & ( dataframe['Outlier']  =  =  True )]['Sales']

    Xt  =  dataframe.loc[( dataframe['Store']  =  =  stor )]
    
    plot.subplot( 10,5,i )
    plot.plot( X1['Date'], y1, '-' )
    plot.plot( X2['Date'], y2, 'r.' )
    plot.title( i )
    plot.axis( 'off' )

Output:

Conclusion:

TABLE 1
Model	rmspe on test set	Kaggle private score
SGD Regressor	0.250	0.234
Decision Tree Regressor	0.1767	0.16442
Random Forest Regressor	0.164	0.139
Light GBM Regressor	0.157	0.121
Stacking with 9 models	0.1988	0.17375