A/B Testing with Python


My most re-used Python code for A/B testing

Import the things you need

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import norm, binom, mannwhitneyu

There are a few ways the data could be set up. Step 1 is viewing the data you have, either by calling the dataframe name (e.g. test_data) or by using the .head() method. test_data.head() will give you something like the below:
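
(Here’s a mock frame with the layout this post assumes; the numbers are made up purely for illustration, and the outputs later on come from the real dataset.)

# Hypothetical test_data layout with made-up numbers
test_data = pd.DataFrame({
    'campaign_version': ['0', '1'],  # '0' = control, '1' = test
    'visits': [10000, 10150],
    'ctr': [2500, 2600],
    'bounces': [4000, 3900],
    'exits': [3000, 2950],
    'top_funnel': [2400, 2455],      # add to cart events
    'orders': [600, 640],
})
test_data.head()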

Once you have the test data, you want to make sure the split was what you intended. To do that, run a sample ratio mismatch (SRM) test on the users/visitors/sessions data. The snippet below uses .iloc[] and .sum() to pull in the visits data.

# note: newer SciPy versions replace binom_test with binomtest; use stats.binomtest(k, n, p).pvalue there
stats.binom_test(test_data.iloc[1,1], test_data['visits'].sum(), 0.5)
0.16416588441261487 

Another method that I use for other calculations just pulls the appropriate cell. That is shown below:

stats.binom_test(int(test_data[test_data.campaign_version == '1'].visits), test_data['visits'].sum(), 0.5) 
0.16416588441261487 

The difference between .iloc[] and test_data[test_data.campaign_version == '1'].visits is that the second version says: in test_data, where campaign_version equals '1', return that visits number. To me this is easier than .iloc[] because I don’t have to worry about the dataframe location being wrong or changing. And you can see both approaches return the same number.
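
To make that concrete, a quick check confirms both lookups pull the same cell:

visits_iloc = test_data.iloc[1, 1]                                      # positional lookup
visits_mask = int(test_data[test_data.campaign_version == '1'].visits)  # filter-based lookup
assert visits_iloc == visits_mask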

I typically flag a p-value below 0.05 (or 0.01 if I want to be stricter). If you find SRM, you need to look into why before going any further, since the test results are not valid. Here’s a helpful link of probable causes and some things to look into.
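
If you have more than two variants, binom_test won’t work; a chi-squared goodness-of-fit test is the usual generalization. Here’s a sketch, assuming the split is supposed to be even:

# Chi-squared SRM check: compare observed visits to an even expected split
observed = test_data['visits'].values
expected = [observed.sum() / len(observed)] * len(observed)
chisq, p_value = stats.chisquare(observed, f_exp=expected)
print(f"SRM p-value: {p_value:0.3f}")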

If there’s no SRM, you can go on to comparing the metrics.

To look at rates instead of raw numbers to get a better sense of the data I use the following approach:

dataframe['new_metric'] = dataframe.column1 / dataframe.column2  # or *, +, - as needed

test_data['ctr_rate'] = test_data.ctr/test_data.visits
test_data['bounce_rate'] = test_data.bounces/test_data.visits
test_data['exit_rate'] = test_data.exits/test_data.visits
test_data['atc_rate'] = test_data.top_funnel/test_data.visits
test_data['cvr'] = test_data.orders/test_data.visits
test_data['bof'] = test_data.orders/test_data.top_funnel

test_data 

The table now has six additional columns of rates based on the data.

While not strictly necessary, the rates are helpful for a high-level view of the data. They also help set up the graphs. I’m going to store the rates I want to examine as variables to help with the calculations going forward.

cont_visits, test_visits = float(test_data[test_data.campaign_version == '0'].visits), float(test_data[test_data.campaign_version == '1'].visits)

cont_atc, test_atc = float(test_data[test_data.campaign_version == '0'].top_funnel), float(test_data[test_data.campaign_version == '1'].top_funnel)

cont_rate, test_rate = cont_atc / cont_visits, test_atc / test_visits 

I’m starting by looking at the top_funnel, or add to cart (ATC), metric; you can do the same for any other metric. The calculations and graphs need our visit number, ATC number, and ATC rate for each variation. (Wrapping the lookups in float() saves them as plain numbers we can use in calculations, rather than single-value Series.)
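
Since this pattern repeats for every metric, a small helper keeps it tidy (a sketch that assumes the same test_data layout as above):

def variant_counts(df, metric):
    # assumes campaign_version '0' is control and '1' is test
    cont = float(df.loc[df.campaign_version == '0', metric].iloc[0])
    test = float(df.loc[df.campaign_version == '1', metric].iloc[0])
    return cont, test

# e.g. cont_orders, test_orders = variant_counts(test_data, 'orders')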

# Calculate the standard error of each rate and plot the sampling distributions.
std_cont = np.sqrt(cont_rate * (1 - cont_rate) / cont_visits)
std_test = np.sqrt(test_rate * (1 - test_rate) / test_visits)

click_rate = np.linspace(0.22, 0.27, 200)
prob_a = norm(cont_rate, std_cont).pdf(click_rate)
prob_b = norm(test_rate, std_test).pdf(click_rate)

# Plot the two distributions.
plt.plot(click_rate, prob_a, label="Control")
plt.plot(click_rate, prob_b, label="Test")
plt.legend(frameon=False)
plt.xlabel("ATC rate"); 
plt.ylabel("Probability"); 
# z-score and the distribution of the difference in rates (test - control)
z_score = (test_rate - cont_rate) / np.sqrt(std_cont**2 + std_test**2)
p = norm(test_rate - cont_rate, np.sqrt(std_cont**2 + std_test**2))

x = np.linspace(-0.03, 0.04, 1000)
y = p.pdf(x)
area_under_curve = p.sf(0)

plt.plot(x, y, label="PDF")
plt.fill_between(x, 0, y, where=x>0, label="Prob(Test>Control)", alpha=0.3)
plt.annotate(f"Area={area_under_curve:0.3f}", (0.02, 5))

plt.legend()
plt.xlabel("Difference in ATC rate"); plt.ylabel("Prob");

diff = round(test_rate - cont_rate,4)

print("Change in ATC is", diff*100,"%")
print(f"zscore is {z_score:0.3f}, with p-value {norm().sf(z_score):0.3f}")

Change in ATC is 0.18 % 
zscore is 0.279, with p-value 0.390  

Just checking my work and looking at the confidence intervals:

import math as mt

def get_z_score(alpha):
    return norm.ppf(alpha)

alpha = 0.05

pool = (test_data.top_funnel.sum()/test_data.visits.sum())
p_sd = mt.sqrt(pool*(1-pool)*(1/cont_visits + 1/test_visits))
z = (test_rate - cont_rate)/p_sd
me = round(get_z_score(1-alpha/2)*p_sd,4)
diff = (test_rate - cont_rate)

print("Change in ATC is", diff*100,"%")
print("Confidence interval: [", diff-me, ",", diff+me, "]") 

Change in ATC is 0.17863760153874197 %
Confidence interval: [ -0.01081362398461258 , 0.01438637601538742 ] 

The difference in the ATC rate matches, and the confidence interval runs from roughly -1.1% to +1.4%. It crosses zero, which is consistent with the p-value above.

At this time, no statistically significant difference is detected from the changes we made.