The quintessential dimension

 

Time; Analysing for stationarity
The Why
In an earlier dispatch, we touched upon a dimension that we developers handle almost unconsciously: time,
and how it is treated in programming languages, database management systems, functions and so on. Today we
expand on that with a branch of predictive analytics that deals with time: time series analysis.
Given the volume of content available on the web on classification, regression, neural networks, and NLP,
it is not easy to spot and learn what others are doing in time series analysis. Yet many questions
that feel intuitive can be answered by time series forecasting.
1. What will be the yield of rice in the next Kharif Marketing Season?
2. What should be the temperature of the building in the evening hours?
3. How many food orders will be placed during lunch hours?
The list could go on. In case you got stuck thinking that these can be answered by regression: yes, to some
extent, but look closely at the questions. Each of them has a temporal context to it.
1. The Kharif Marketing Season in India is a period of close to a year that starts on 1st October.
2. Assuming the building is an office building, the evening hours will span from 1600-1900 Hrs.
3. Lunch hours could typically span a few hours starting from 1200 to 1500 Hrs.
These questions are time-boxed, and the outcome in a specific time box could influence the next hour, or
could itself be influenced by the previous window of time or by the same period of time across days or
years. It is apt now that we equip ourselves with the vocabulary involved in time series forecasting
and analysis.
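The idea of one time box influencing another can be sketched by pairing each observation with an earlier
one. The snippet below is a minimal illustration, not a forecasting method: the hourly order counts are
made up, and the names make_day, lagged_pairs and HOURS_PER_DAY are hypothetical helpers introduced only
for this example.

```python
# Hypothetical hourly food-order counts for three days (24 values per day).
# The numbers are made up purely for illustration.
HOURS_PER_DAY = 24


def make_day(lunch_peak):
    """Return 24 hourly counts with a bump between 12:00 and 15:00."""
    return [lunch_peak if 12 <= hour < 15 else 5 for hour in range(HOURS_PER_DAY)]


orders = make_day(40) + make_day(42) + make_day(45)


def lagged_pairs(series, lag):
    """Pair each observation with the one `lag` steps before it."""
    return [(series[i - lag], series[i]) for i in range(lag, len(series))]


# Previous window of time: each hour paired with the hour before it.
previous_hour = lagged_pairs(orders, lag=1)
# Same period across days: each hour paired with the same hour one day earlier.
same_hour_previous_day = lagged_pairs(orders, lag=HOURS_PER_DAY)

# 13:00 on day 1 paired with 13:00 on day 2.
print(same_hour_previous_day[13])
```

A lag of 1 looks at the previous hour, while a lag of 24 looks at the same hour on the previous day. Both
are ways of encoding the temporal context that a plain regression on unordered rows would ignore.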
Seasonality
The word is so common in our parlance that it hardly needs an introduction. It is the repetition of a
pattern over a fixed period of time, thus forming a cycle. It is also loosely referred to as cyclic behaviour.
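One rough way to see a repeating pattern in numbers is to correlate a series with a shifted copy of itself.
The sketch below assumes a made-up sine wave with a period of 12 steps; the function autocorrelation and
the constant PERIOD are hypothetical names introduced for this illustration.

```python
import math


def autocorrelation(values, lag):
    """Correlation between a series and itself shifted by `lag` steps."""
    n = len(values)
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    numerator = sum(
        (values[i] - mean) * (values[i - lag] - mean) for i in range(lag, n)
    )
    return numerator / denominator


PERIOD = 12  # assumed length of one seasonal cycle
series = [math.sin(2 * math.pi * t / PERIOD) for t in range(10 * PERIOD)]

# Strongly positive at a full period, strongly negative at half a period.
print(round(autocorrelation(series, PERIOD), 2))
print(round(autocorrelation(series, PERIOD // 2), 2))
```

A value that is strongly positive at a lag of one full period suggests that the pattern repeats with that
period.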
Trend
It is best we resolve this with a clear definition: it is such a common word that we cannot assume its
everyday meaning syncs with the way it is interpreted in time series analysis. It is often a linearly
increasing or decreasing behaviour of the time series over time.
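A linear trend can be estimated as the slope of a straight line fitted through the observations. The
following is a minimal least-squares sketch on made-up numbers; linear_trend and yields are hypothetical
names used only here.

```python
def linear_trend(values):
    """Least-squares slope and intercept of values against their index 0..n-1."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up yearly yields that rise by exactly two units per step.
yields = [100, 102, 104, 106, 108, 110]
slope, intercept = linear_trend(yields)
print(slope, intercept)
```

A slope close to zero would suggest that there is no linear trend, while a clearly positive or negative
slope points to an increasing or decreasing one.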
Noise
The unexplainable and unpredictable component of a time series. It is seldom a large component of the
time series, because a large noise component would make the data close to useless and, by extension,
the analysis too.
Level
The baseline value of the time series, around which the other components vary. It is often taken to be
the average value of the series.
Armed with these words, we can talk about an important concept in time series analysis: stationarity.
There is one last thing that we want to set right before going forward. The term time series analysis is
used for describing a time series. Time series forecasting, on the other hand, works on predicting a value
in the future. We will be cautious not to use these terms interchangeably in the rest of the article. The
descriptive part, which is the analysis, is performed to describe heaps of time series data in terms of
the components above, somewhat like this:
Y = Seasonality + Trend + Noise + Level
Y is the observation that varies with time and is of importance for the modelling. Time series analysis is
often explained as the Why for any point of interest. Now to stationarity.
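Before that, the additive form of Y can be made concrete with a small sketch that builds a series from its
four components. Everything in it is an assumption made for illustration: the period of 12, the slope of
0.5, the level of 100 and the Gaussian noise are made-up values, not taken from any dataset in this article.

```python
import math
import random

random.seed(0)  # fixed seed so the made-up noise is reproducible

N_POINTS = 48  # e.g. four years of monthly observations (assumed)
PERIOD = 12    # assumed seasonal period

level = 100.0
trend = [0.5 * t for t in range(N_POINTS)]
seasonality = [10.0 * math.sin(2 * math.pi * t / PERIOD) for t in range(N_POINTS)]
noise = [random.gauss(0.0, 1.0) for _ in range(N_POINTS)]

# Y = Seasonality + Trend + Noise + Level, one value per time step.
y = [s + tr + e + level for s, tr, e in zip(seasonality, trend, noise)]

# Removing the known trend and level leaves seasonality plus noise,
# which repeats approximately every PERIOD steps.
detrended = [y_t - tr - level for y_t, tr in zip(y, trend)]
print(round(detrended[3], 1), round(detrended[3 + PERIOD], 1))
```

In real data the components are not known in advance, so analysis works in the opposite direction: it
starts from Y and tries to estimate the parts.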
Stationarity
The dimension of time introduces a constraint that plain classification or regression algorithms do not
model: the order in which observations occur carries meaning. Alongside this constraint, treating time
series data has a foundational aspect that must be checked before proceeding with any analysis. This
check is an apples-to-apples kind of check, and it is called the check of stationarity of the time series
data. No, it does not mean that the observable never changes. It is rather the statistical summary of the
time series data that must stay the same across the windows of time that you analyse.

Using the words from our vocabulary above, a time series is stationary if it is devoid of trend and
seasonality. Statistically, it translates to the mean and variance staying constant across different
windows of time. It is better to show in a plot what we mean:
from pandas import read_csv
from matplotlib import pyplot
ts = read_csv('daily-total-female-births.csv', header=0, index_col=0)
ts.plot()
pyplot.show()

The data used in this example is from here. Similarly, let us look at time series data that looks
non-stationary:

This time series represents the number of passengers travelling by aeroplane in the USA over a 10-year
window. There is a clear increasing trend, and seasonality as well towards the end of the year.
Probably the holiday travellers, isn't it? Words and visuals like the ones above might fool us, so let us
lean on the always trustworthy numbers.
from pandas import read_csv

# Load the series and split it into two equal halves (windows).
ts = read_csv('daily-total-female-births.csv', header=0, index_col=0)
X = ts.values
cut = len(X) // 2
X1, X2 = X[:cut], X[cut:]
# Compare the summary statistics of the two windows.
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('Mean for window 1=%.0f, and for window 2=%.0f' % (mean1, mean2))
print('Variance for window 1=%.2f, and for window 2=%.2f' % (var1, var2))

The result will be:
Mean for window 1=40, and for window 2=44
Variance for window 1=49.21, and for window 2=48.70
If we repeat the same process for the airline passengers dataset, we get:
Mean for window 1=183, and for window 2=378
Variance for window 1=2244.09, and for window 2=7367.96
In absolute terms, we see deviations across windows in both datasets. However, the deviations are
far stronger in the airline dataset than in the female births dataset, and most visibly so in the
variance. In case you wonder whether this is the right approach, we will say it is good for a start. As one
progresses to modelling, the approach must include statistical tests if there is even a slight doubt based
on the context of the application. One such statistical test is the Augmented Dickey-Fuller (ADF) test.
This test helps the analyst decide whether the null hypothesis can be rejected, thus offering a sound
mathematical basis instead of the analyst's feeling that a difference in a numerical value is, or is not,
sufficient for the time series data to be called stationary. We will use a library to run the statistical test:
! pip install statsmodels

Running the test then looks like this:
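from pandas import read_csv
from statsmodels.tsa.stattools import adfuller

# read_csv no longer accepts squeeze=True in pandas 2.x; squeeze the single column instead.
ts = read_csv('daily-total-female-births.csv', header=0, index_col=0).squeeze('columns')
X = ts.values
# adfuller returns a tuple; the first two entries are the test statistic and the p-value.
testOutcome = adfuller(X)
print('ADF value: %f' % testOutcome[0])
print('p-value of series: %f' % testOutcome[1])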
Remember that we are running a statistical test, where the p-value decides whether we reject the null
hypothesis or fail to reject it. For the ADF test, the null hypothesis is that the time series has a unit
root, that is, that it is non-stationary. The p-value for the female births data comes out substantially
lower than 0.05 (5%), the threshold commonly used for this decision, so we reject the null hypothesis and
treat the series as stationary. The same does not hold for the airline data, where the null hypothesis
cannot be rejected.
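The decision rule can be written down as a tiny helper. This is a sketch under stated assumptions:
interpret_adf is a hypothetical function, not part of statsmodels, and it relies on the ADF test's null
hypothesis being that the series has a unit root (is non-stationary). The p-values passed in are
illustrative and are not results from the datasets above.

```python
def interpret_adf(p_value, alpha=0.05):
    """Turn an ADF p-value into a plain-language verdict.

    The ADF null hypothesis is that the series has a unit root,
    that is, that it is non-stationary.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < alpha:
        return "reject null: evidence that the series is stationary"
    return "fail to reject null: series may be non-stationary"


# Illustrative p-values only.
print(interpret_adf(0.001))
print(interpret_adf(0.99))
```

In practice the p-value would come from the second entry of the tuple returned by adfuller, and the
threshold alpha can be tightened (for example to 0.01) when the context of the application demands more
certainty.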