❓ Problem: where should a drinks company run promotions?¶

The company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.

📈 The data¶

The marketing team has sourced with historical sales volumes per capita for several different drinks types.

  • "year" - year (1998-2016)
  • "region" - name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city and a single autonomous oblast
  • "wine" - sale of wine in litres by year per capita
  • "beer" - sale of beer in litres by year per capita
  • "vodka" - sale of vodka in litres by year per capita
  • "champagne" - sale of champagne in litres by year per capita
  • "brandy" - sale of brandy in litres by year per capita

✅ Solution proposed by this notebook¶

This notebook includes an interactive data exploration where you can easily compare alcohol consumption patterns between different regions of Russia. As a solution to the business problem, the Dynamic Time Warping (DTW) technique was used to calculate the distance between time series of the same drink from different regions. In this way, using Principal Component Analysis (PCA), it was possible to project the distances between all regions on a plane, where it is easy to visualize the regions most similar to Saint Petersburg in terms of alcohol consumption pattern.

✨ Table of Contents¶

  1. Loading and processing data
  2. Exploring the data
  3. A number to compare: Dynamic Time Warping
  4. Final list!

💾 Loading and processing data¶

First, let's load the data and take a look at some of its rows:

In [1]:
import pandas as pd

data = pd.read_csv("russian_alcohol_consumption.csv")
data.head(10)
Out[1]:
year region wine beer vodka champagne brandy
0 1998 Republic of Adygea 1.9 8.8 3.4 0.3 0.1
1 1998 Altai Krai 3.3 19.2 11.3 1.1 0.1
2 1998 Amur Oblast 2.1 21.2 17.3 0.7 0.4
3 1998 Arkhangelsk Oblast 4.3 10.6 11.7 0.4 0.3
4 1998 Astrakhan Oblast 2.9 18.0 9.5 0.8 0.2
5 1998 Republic of Bashkortostan 1.8 17.5 10.7 0.9 0.2
6 1998 Belgorod Oblast 3.4 23.0 10.8 0.9 0.1
7 1998 Bryansk Oblast 3.4 32.4 9.7 0.5 0.1
8 1998 Republic of Buryatia 1.1 8.8 15.8 0.9 0.1
9 1998 Vladimir Oblast 1.5 16.6 16.8 0.5 0.1

As we can see, the data appears to be well behaved, with no apparent problems. Let's take a look at the total number of rows:

In [2]:
data.shape
Out[2]:
(1615, 7)

There are 1615 rows in total in the dataset. We can now take a look at the range of values in the "year" column:

In [3]:
print(data.year.unique())
print(len(data.year.unique()))
[1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
 2012 2013 2014 2015 2016]
19

It's all okay: 1988 to 2016, totalling 19 values. But how many federal subjects are included in the "region" column?

In [4]:
print(data.region.unique())
print(len(data.region.unique()))
['Republic of Adygea' 'Altai Krai' 'Amur Oblast' 'Arkhangelsk Oblast'
 'Astrakhan Oblast' 'Republic of Bashkortostan' 'Belgorod Oblast'
 'Bryansk Oblast' 'Republic of Buryatia' 'Vladimir Oblast'
 'Volgograd Oblast' 'Vologda Oblast' 'Voronezh Oblast'
 'Republic of Dagestan' 'Jewish Autonomous Oblast' 'Zabaykalsky Krai'
 'Ivanovo Oblast' 'Republic of Ingushetia' 'Irkutsk Oblast'
 'Kabardino-Balkar Republic' 'Kaliningrad Oblast' 'Republic of Kalmykia'
 'Kaluga Oblast' 'Kamchatka Krai' 'Karachay-Cherkess Republic'
 'Republic of Karelia' 'Kemerovo Oblast' 'Kirov Oblast' 'Kostroma Oblast'
 'Krasnodar Krai' 'Krasnoyarsk Krai' 'Republic of Crimea' 'Kurgan Oblast'
 'Kursk Oblast' 'Leningrad Oblast' 'Lipetsk Oblast' 'Magadan Oblast'
 'Mari El Republic' 'Republic of Mordovia' 'Moscow' 'Moscow Oblast'
 'Murmansk Oblast' 'Nenets Autonomous Okrug' 'Nizhny Novgorod Oblast'
 'Novgorod Oblast' 'Novosibirsk Oblast' 'Omsk Oblast' 'Orenburg Oblast'
 'Oryol Oblast' 'Penza Oblast' 'Perm Krai' 'Primorsky Krai' 'Pskov Oblast'
 'Altai Republic' 'Komi Republic' 'Tuva Republic' 'Rostov Oblast'
 'Ryazan Oblast' 'Samara Oblast' 'Saint Petersburg' 'Saratov Oblast'
 'Sakhalin Oblast' 'Sverdlovsk Oblast' 'Sevastopol'
 'Republic of North Ossetia-Alania' 'Smolensk Oblast' 'Stavropol Krai'
 'Tambov Oblast' 'Republic of Tatarstan' 'Tver Oblast' 'Tomsk Oblast'
 'Tula Oblast' 'Tyumen Oblast' 'Udmurt Republic' 'Ulyanovsk Oblast'
 'Khabarovsk Krai' 'Republic of Khakassia'
 'Khanty–Mansi Autonomous Okrug – Yugra' 'Chelyabinsk Oblast'
 'Chechen Republic' 'Chuvash Republic' 'Chukotka Autonomous Okrug'
 'Sakha (Yakutia) Republic' 'Yamalo-Nenets Autonomous Okrug'
 'Yaroslavl Oblast']
85

85 federal subjects in total. It's a considerable amount! Regarding the other columns, let's just make sure that the dtypes are as we expect (float64):

In [5]:
data.dtypes
Out[5]:
year           int64
region        object
wine         float64
beer         float64
vodka        float64
champagne    float64
brandy       float64
dtype: object

Everything is fine! But what about missing values?

In [6]:
data.isna().sum()
Out[6]:
year          0
region        0
wine         63
beer         58
vodka        61
champagne    63
brandy       66
dtype: int64

Hm... There are some missing values for the time series. But, of the regions included, which ones have missing values? How many missing values are there in these regions?

In [7]:
for region in data["region"].unique():
    temp = data[data["region"] == region]
    
    if temp.isna().sum().sum() > 0:
        print("\n" + "-"*30)
        print(f"For region {region}:")
        print(temp.isna().sum())
------------------------------
For region Republic of Ingushetia:
year          0
region        0
wine         12
beer         10
vodka        10
champagne    12
brandy       15
dtype: int64

------------------------------
For region Republic of Crimea:
year          0
region        0
wine         16
beer         16
vodka        16
champagne    16
brandy       16
dtype: int64

------------------------------
For region Sevastopol:
year          0
region        0
wine         16
beer         16
vodka        16
champagne    16
brandy       16
dtype: int64

------------------------------
For region Chechen Republic:
year          0
region        0
wine         19
beer         16
vodka        19
champagne    19
brandy       19
dtype: int64

Given that we have 19 time points for the time series, it is more sensible to simply exclude the regions shown above. It wouldn't even be convenient to apply some interpolation technique!

In [8]:
regions_to_drop = ["Republic of Ingushetia", "Republic of Crimea", "Sevastopol", "Chechen Republic"]

data = data[~data.region.isin(regions_to_drop)]
print(data.isna().sum())
print("New shape of the DataFrame:", data.shape)
year         0
region       0
wine         0
beer         0
vodka        0
champagne    0
brandy       0
dtype: int64
New shape of the DataFrame: (1539, 7)

Ready! Now only regions with all time points are included. It's time to explore time series!


🕵️‍♂️ Exploring the data¶

Due to the way the time series are stored in the pd.DataFrame, we will need to make some adjustments using the melt method, as follows:

In [9]:
data_melted = data.melt(["year", "region"], var_name="drink", value_name="liters/capita")
data_melted.head()
Out[9]:
year region drink liters/capita
0 1998 Republic of Adygea wine 1.9
1 1998 Altai Krai wine 3.3
2 1998 Amur Oblast wine 2.1
3 1998 Arkhangelsk Oblast wine 4.3
4 1998 Astrakhan Oblast wine 2.9

1️⃣ Comparing regions using the non-normalized time series for drinks¶

For a more interactive data exploration and to avoid overloading the notebook with zillions of different plots, we will use the plotly package! Feel free to compare any regions you want! But remember: our main objective is to compare the Saint Petersburg region with the others present in our data!

In [10]:
from utils import plot_time_series_by_category

plot_time_series_by_category(data_melted, "year", "liters/capita",
                            "region", "drink", "simple_white")

plot_time_series_by_category(data_melted, "year", "liters/capita",
                            "region", "drink", "simple_white")

Have you noticed how difficult it is to compare regions with time series all at different scales? We can alleviate this problem by normalizing the time series. But how can we do it without breaking the relation between the different time series of each region?

2️⃣ Comparing regions from normalized time series¶

A reasonable solution to centralize and standardize the scale of different time series is to use specific statistics on the distribution of its values. But note: using the mean, standard deviation, minimum or maximum are very bad options in the context of time series, as these statistics are very sensitive to outliers. So what we will do is center the time series at its median and divide its values by the difference between the 90% and 10% quantiles. In this way, we are ensuring robust standardization for the different time series, making them more comparable to each other.

In [11]:
plot_time_series_by_category(data_melted, "year", "liters/capita",
                            "region", "drink", "simple_white", True)

plot_time_series_by_category(data_melted, "year", "liters/capita",
                            "region", "drink", "simple_white", True)

Hasn't it become easier to compare time series? But understand: comparing all pairs of regions using the interactive chart is not the best way to answer the business problem. We need a number that ranks the regions most similar to Saint Petersburg, and we haven't done that so far.

📏 A number to compare: Dynamic Time Warping¶

A smart alternative to reducing the cognitive load in the task of comparing time series of drinks consumption between the regions is to use Dynamic Time Warping (DTW). This time series analysis technique allows the calculation of distance between different time series considering a time warping to force a match between them. This is a more robust method than the naive euclidean distance because it considers that there may be a "shift", or a "stretching", between the time series. A disadvantage of this method is that intuitive distance conditions, such as the triangle inequality, are not always satisfied by DTW. See the dotted lines that are considered in the distance calculation by DTW in the figure below:

dtw-fig

Caption: Dynamic time warping between two piecewise linear functions. The dotted line illustrates the time-warp relation. Notice that several points in the lower function are mapped to one point in the upper function, and vice versa. Source here.

In order to calculate the DTW distance between the corresponding time series from different regions, we will use the dtaidistance package. We will calculate the distance between ALL time series from ALL regions. We will have a lot of numbers now!

In [12]:
from dtaidistance.dtw import distance as dtw_distance
import numpy as np

# function to standardize the time series
def normalize_time_series(time_series):
    return (time_series - np.quantile(time_series, q=.5))/\
            (np.quantile(time_series, q=.9) - np.quantile(time_series, q=.1))

# the dtw distance is calculated between the correspondent time series of all the regions
distance_data = []
for region1 in data.region.unique():
    row_data = {}
    row_data["region"] = region1
    for drink in ["wine", "beer", "vodka", "champagne", "brandy"]:
        for region2 in data.region.unique():
            row_data[f"{drink} dist {region2}"] = dtw_distance(data[data["region"] == region1][drink].values,
                                                                           data[data["region"] == region2][drink].values)
    distance_data.append(row_data)

# dataframe with all the distances
distance_data = pd.DataFrame(distance_data)
distance_data.head()
Out[12]:
region wine dist Republic of Adygea wine dist Altai Krai wine dist Amur Oblast wine dist Arkhangelsk Oblast wine dist Astrakhan Oblast wine dist Republic of Bashkortostan wine dist Belgorod Oblast wine dist Bryansk Oblast wine dist Republic of Buryatia ... brandy dist Ulyanovsk Oblast brandy dist Khabarovsk Krai brandy dist Republic of Khakassia brandy dist Khanty–Mansi Autonomous Okrug – Yugra brandy dist Chelyabinsk Oblast brandy dist Chuvash Republic brandy dist Chukotka Autonomous Okrug brandy dist Sakha (Yakutia) Republic brandy dist Yamalo-Nenets Autonomous Okrug brandy dist Yaroslavl Oblast
0 Republic of Adygea 0.000000 3.900026 7.269264 17.626883 3.513759 4.414941 3.125396 11.353400 3.984997 ... 0.850529 0.938403 0.616523 1.895811 0.612372 0.843623 1.482194 0.696419 2.014671 0.919239
1 Altai Krai 3.900026 0.000000 6.587427 20.660663 3.291580 3.381080 4.382933 13.245214 6.676541 ... 0.476340 2.031280 1.236770 3.261533 1.196871 0.448553 3.351477 0.855862 3.457514 1.715372
2 Amur Oblast 7.269264 6.587427 0.000000 11.679940 5.299858 9.647368 8.995115 5.654529 3.784455 ... 0.335410 1.074104 0.676166 2.377646 0.464866 0.494368 2.456827 0.253180 2.580310 0.981886
3 Arkhangelsk Oblast 17.626883 20.660663 11.679940 0.000000 16.997803 24.876103 21.993792 4.748737 13.172813 ... 1.780590 0.754785 1.340597 1.405276 1.363121 1.901894 0.947206 1.492180 1.548548 0.887187
4 Astrakhan Oblast 3.513759 3.291580 5.299858 16.997803 0.000000 5.969590 4.575194 10.618663 4.609610 ... 0.180278 1.133005 0.615467 2.309372 0.382230 0.406940 2.337092 0.210000 2.513563 0.856796

5 rows × 406 columns

For each region, we have 405 features, as we calculated the distance between all time series for all regions (81 regions x 5 drink time series = 405 features). We will apply the describe method just to look at some of the statistics relating to these distances.

In [13]:
distance_data.describe()
Out[13]:
wine dist Republic of Adygea wine dist Altai Krai wine dist Amur Oblast wine dist Arkhangelsk Oblast wine dist Astrakhan Oblast wine dist Republic of Bashkortostan wine dist Belgorod Oblast wine dist Bryansk Oblast wine dist Republic of Buryatia wine dist Vladimir Oblast ... brandy dist Ulyanovsk Oblast brandy dist Khabarovsk Krai brandy dist Republic of Khakassia brandy dist Khanty–Mansi Autonomous Okrug – Yugra brandy dist Chelyabinsk Oblast brandy dist Chuvash Republic brandy dist Chukotka Autonomous Okrug brandy dist Sakha (Yakutia) Republic brandy dist Yamalo-Nenets Autonomous Okrug brandy dist Yaroslavl Oblast
count 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000 ... 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000 81.000000
mean 9.327511 10.586655 7.514745 12.398218 8.580767 13.191111 11.370777 8.459765 8.090933 9.360812 ... 0.984503 1.343979 1.177580 2.149788 1.087095 1.079536 2.187614 0.972232 2.309936 1.198920
std 5.716611 6.793815 4.665270 6.655064 5.830522 7.711322 6.899607 4.923744 4.500338 4.587454 ... 1.005795 0.515729 0.723902 0.883784 0.800197 1.034073 0.866551 0.881617 0.941177 0.493985
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 4.955855 5.620374 4.022549 7.528307 4.178576 7.445435 5.882933 4.844915 4.847731 6.481929 ... 0.300167 0.967316 0.687895 1.454510 0.559106 0.376298 1.482194 0.380263 1.656744 0.844156
50% 7.395066 8.878187 6.193626 11.333270 6.592056 11.002363 9.318846 7.400142 7.176078 8.628546 ... 0.602080 1.374809 0.900222 2.348617 0.846995 0.657343 2.337092 0.605062 2.553586 1.169017
75% 12.367417 14.192428 9.647368 15.267279 11.452829 17.672705 15.537635 11.066526 10.607907 11.870594 ... 1.089679 1.702645 1.338096 2.724702 1.285457 1.316055 2.753343 1.106616 2.923696 1.430035
max 29.256215 31.695591 22.943690 31.808007 28.718001 35.413811 33.275733 24.180529 23.772650 23.217969 ... 3.704929 2.497138 3.187350 3.749827 3.358884 3.846505 3.827532 3.415567 3.939797 2.308679

8 rows × 405 columns

Just above, it is possible to see that some distances have a considerably higher average than others. This is to be expected if this distance refers to a region with very specific patterns, so to speak, so it is not surprising. But this excess number is annoying, and we can reduce it using Principal Component Analysis (PCA). This dimensionality reduction method will generate linear combinations between the most correlated distances in order to return components that are uncorrelated with each other and that "spread" the data better. With this method, we can even project the regions and their distances on a two-dimensional plot. This is extremely useful! Below, we will apply PCA.

In [14]:
from sklearn.decomposition import PCA

pca = PCA(n_components=.95, # we want to explain at least 95% of the total variance
          random_state=669)
X = pca.fit_transform(distance_data.drop(columns="region"))

print(f"Total explained variance ratio by the first 2 components:", pca.explained_variance_ratio_[:2].sum()*100, "%")

X = pd.DataFrame(X, columns=[f"PC {i}" for i in range(1, pca.n_components_ + 1)])
X["region"] = distance_data["region"]
X.head()
Total explained variance ratio by the first 2 components: 91.11352441115945 %
Out[14]:
PC 1 PC 2 PC 3 PC 4 region
0 602.317963 -200.959146 -33.206956 -19.736541 Republic of Adygea
1 -97.656202 -250.621705 -63.924742 8.505152 Altai Krai
2 -256.190093 -97.716054 18.604340 18.419457 Amur Oblast
3 -25.303670 -18.988241 98.487830 172.392736 Arkhangelsk Oblast
4 -170.742741 -96.718758 -96.177554 12.022089 Astrakhan Oblast

Note above that, to explain at least 95% of the total variance of DTW distances between regions, only 4 principal components were needed. More specifically, 2 principal components explain more than 90% of the total variance! Let's visualize the distances of the points on a two-dimensional plot!

In [15]:
import plotly.express as px
from sklearn.metrics.pairwise import euclidean_distances

# using the 4 principal components, calculate the euclidean distance
X["Distance to Saint Petersburg"] = euclidean_distances(X.drop(columns=["region"]),
                                                        X[X["region"] == "Saint Petersburg"].drop(columns=["region"]))

# visualizing the position of the regions
fig = px.scatter(X, x="PC 1", y="PC 2", hover_data = "region",
                 color="Distance to Saint Petersburg", color_continuous_scale='Turbo')
fig.update_layout(xaxis_title="Principal Component 1",
                  yaxis_title="Principal Component 2",
                  title="Alcohol consumption pattern: bidimensional projection of all regions",
                  template="simple_white")
fig.update_traces(marker=dict(size=9))
fig.show()

Wow! We have an amazing chart! Note that the colors represent the euclidean distance considering the regions of Russia projected onto the 4 principal components extracted by PCA. From the plot, we see that Saint Petersburg is relatively isolated in the upper left corner. But despite this we have some regions similar to Saint Petersburg, such as (1) Omsk Oblast, (2) Chelyabinsk Oblast and (3) Moscow Oblast.

Another interesting fact is that the Zabaykalsky Kray region is very different from all other regions!

To make it easier to construct our list, let's plot an ordered bar plot showing the regions most similar to Saint Petersburg in terms of alcohol consumption pattern.

💎 Final list!¶

In [16]:
fig = px.bar(X.sort_values("Distance to Saint Petersburg", ascending=False),
             x="Distance to Saint Petersburg", y="region", orientation='h', template="simple_white",
             title="Alcohol consumption pattern: regions most similar to Saint Petersburg",
             color="Distance to Saint Petersburg",
             color_continuous_scale='Turbo',
             width=800, height=1400)
fig.show()

We did it! Using DTW distance method and the dimensionality reduction method PCA, we were able to create the incredible plot above and realize that the regions most similar to Saint Petersburg in terms of alcohol consumption pattern are:

  1. Omsk Oblast
  2. Chelyabinsk Oblast
  3. Moscow Oblast
  4. Yamalo-Nenets Autonomous Okrug
  5. Moscow
  6. Tyumen Oblast
  7. Khabarovski Krai
  8. Khanty-Mansi Autonomous Okrug - Yugra
  9. Ivanovo Oblast
  10. Sverdlovsk Oblast

Feel free to compare the time series of these regions with those of Saint Petersburg using our interactive plots!

Thank you!