Question asked: Which promotion was the most effective?
Scenario:
A fast food chain plans to add a new item to its menu. However, they are still undecided between three possible marketing campaigns for promoting the new product. In order to determine which promotion has the greatest impact on the sales of the new item, the new item is introduced at locations in several randomly selected markets. A different promotion is used at each location, and the weekly sales of the new item are recorded for the first four weeks.
The description of the data set: Our data set consists of 548 entries including:
import pandas as pd
import matplotlib.pyplot as plt
# Uncomment this line if using this notebook locally
#df = pd.read_csv('./data/marketing/WA_Fn-UseC_-Marketing-Campaign-Eff-UseC_-FastF.csv')
file_name = "https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/WA_Fn-UseC_-Marketing-Campaign-Eff-UseC_-FastF.csv"
df = pd.read_csv(file_name)
df.head(10)
print ("Rows : " , df.shape[0])
print ("Columns : " , df.shape[1])
print ("\nFeatures : \n", df.columns.tolist())
print ("\nMissing values : ", df.isnull().sum().values.sum())
print ("\nUnique values : \n", df.nunique())
df.describe()
# Visualisation of our sales and marketing data
# Using ggplot's style
plt.style.use('ggplot')
ax = df.groupby('Promotion').sum()['SalesInThousands'].plot.pie(figsize=(8,8),
autopct='%1.0f%%',
shadow=True,
explode = (0, 0.1, 0))
ax.set_ylabel('')
ax.set_title('Sales Distribution Across the 3 Different Promotions')
plt.show()
# Now let's view the promotions for each market size
df.groupby(['Promotion', 'MarketSize']).count()['MarketID']
# Using unstack
df.groupby(['Promotion', 'MarketSize']).count()['MarketID'].unstack('MarketSize')
# Put this into a plot
ax = df.groupby(['Promotion', 'MarketSize']).count()['MarketID'].unstack('MarketSize').plot(
kind='bar',
figsize=(12,10),
grid=True)
ax.set_ylabel('count')
ax.set_title('breakdowns of market sizes across different promotions')
plt.show()
# Put this into a different plot
ax = df.groupby(['Promotion', 'MarketSize']).count()['MarketID'].unstack('MarketSize').plot(
kind='bar',
figsize=(12,10),
grid=True,
stacked=True)
ax.set_ylabel('count')
ax.set_title('breakdowns of market sizes across different promotions')
plt.show()
#Plot visualisations to view the distribution of store ages
ax = df.groupby('AgeOfStore').count()['MarketID'].plot(
kind='bar',
figsize=(12,7),
grid=True)
ax.set_xlabel('age')
ax.set_ylabel('count')
ax.set_title('Overall Distributions Store Ages')
plt.show()
# Group by Age of Store and Promotion to get counts
df.groupby(['AgeOfStore', 'Promotion']).count()['MarketID']
# Visaulize this summary
ax = df.groupby(['AgeOfStore', 'Promotion']).count()['MarketID'].unstack('Promotion').iloc[::-1].plot(
kind='barh',
figsize=(14,18),
grid=True)
ax.set_ylabel('age')
ax.set_xlabel('count')
ax.set_title('overall distributions of age of store')
plt.show()
df.groupby('Promotion').describe()['AgeOfStore']
This table makes it easy to understand the overall store age distribution from our summary stats.
All test groups have similar age profiles and the average store ages is ~8 to 9 years old for theese 3 groups.
The majority of the stores are 10–12 years old or even younger.
We can see that the store profiles are similar to each other.
This indicates that our sample groups are well controlled and the A/B testing results will be meaningful and trustworthy.
means = df.groupby('Promotion').mean()['SalesInThousands']
stds = df.groupby('Promotion').std()['SalesInThousands']
ns = df.groupby('Promotion').count()['SalesInThousands']
print(means)
print(stds)
print(ns)
T-Value
The t-value measures the degree of difference relative to the variation in our data groups. Large t-values indicate a higher degree of difference between the grups.
P-Value
P-value measures the probability that the results would occur by random chance. Therefore the smaller the p-value is, the more statistically significant difference there will be between the two groups
# Computing the t and p values using scipy
from scipy import stats
t, p = stats.ttest_ind(df.loc[df['Promotion'] == 1, 'SalesInThousands'].values,
df.loc[df['Promotion'] == 2, 'SalesInThousands'].values,
equal_var=False)
print("t-value = " +str(t))
print("p-value = " +str(p))
Our P-Value is close to 0 which suggests that there is good evidence to REJECT the Null Hypothesis. Meaning the there is a statistical difference between the two groups. Our threshold rejectings the Null is usually less than 0.05.
Our t-test shows that the marketing performances for these two groups (1 and 2) are significantly different and that promotion group 1 outperforms promotion group 2.
However, if we run a t-test between the promotion group 1 and promotion group 3, we see different results:
t, p = stats.ttest_ind(
df.loc[df['Promotion'] == 1, 'SalesInThousands'].values,
df.loc[df['Promotion'] == 3, 'SalesInThousands'].values,
equal_var=False)
print("t-value = " +str(t))
print("p-value = " +str(p))
We note that the average sales from promotion group 1 (58.1) is higher than those from promotion group 3 (55.36).
But, running a t-test between these two groups, gives us a t-value of 1.556 and a p-value of 0.121.
The computed p-value is a lot higher than 0.05, past the threshold for statistical significance.
Our t-test shows that the marketing performances for these two groups (1 and 3) are not significantly different.