A Page Speed A/B Test: How It’s Done Right!
Sep 8, 2023 • 7 min read
Website speed matters for several reasons:
Users hate waiting. Attention spans are short, and if a website takes too long to load, users bounce.
If a website provides a good user experience without long waiting times, users are more likely to click through the entire customer journey and convert in the end.
Google also factors page speed into its rankings through the Core Web Vitals, so faster websites rank higher.
So there are good reasons to optimize page speed.
While working on different approaches to optimize page speed, it's often a challenge to properly monitor page speed of a website. If that's done poorly, important considerations might be forgotten and opportunities for better web performance might be wasted.
Therefore, this article covers the basics of how to conduct a proper A/B test to examine if a change in the website infrastructure was successful or not.
Set up an A/B test for page speed:
An A/B test is useful to validate whether a treatment, in this case a change to a website's infrastructure, leads to a statistically significant change in a chosen performance metric. In page speed measurement, these metrics are commonly called web performance metrics.
As an exemplary performance metric, we will use the Largest Contentful Paint (LCP). It measures the time a website takes to show the user the largest piece of content (video, image, or text). It is a very user-experience-focused measurement for page speed: once the largest visual content piece is loaded, the website appears mostly finished to the user.
Then we need our groups A and B. We will build our groups from the total number of users who visit our website during our observation period. Every user has a 50% chance of being assigned to either group A or group B. Users of group A will receive a version of the website with the new infrastructure. Users of group B will receive the website as if no changes had been made. Then we measure the LCP of each user for each individual page load and accumulate the LCP performance for each group.
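The 50/50 assignment described above can be sketched in a few lines. This is a minimal illustration, not production code; the `user_id` format and the seeding scheme are hypothetical:

```python
import random

def assign_group(user_id: str, seed: str = "lcp-test-2023") -> str:
    # Seed a private RNG with the user id so the same user always
    # lands in the same group, even across repeated page loads.
    rng = random.Random(f"{seed}:{user_id}")
    return "A" if rng.random() < 0.5 else "B"

# Group A gets the new infrastructure, group B the unchanged site.
group = assign_group("user-123")
```

Seeding on the user id keeps assignments sticky without having to store them, which is important so that one user's LCP samples all land in the same group.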
According to Google, a good LCP means that for more than 75% of website users, the LCP is below 2.5 seconds.
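Google's criterion can be checked per group by looking at the 75th percentile of the collected LCP samples. A minimal sketch (the sample values are made up; the nearest-rank percentile method is assumed):

```python
import math

def passes_lcp_benchmark(lcp_samples, threshold=2.5, percentile=75):
    # Good LCP per Google: the 75th percentile of all page-load
    # samples must be below the 2.5 s threshold.
    s = sorted(lcp_samples)
    rank = math.ceil(len(s) * percentile / 100)  # nearest-rank method
    return s[rank - 1] < threshold

print(passes_lcp_benchmark([1.2, 1.8, 2.1, 2.3, 4.0]))  # 75th percentile is 2.3 s -> True
```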
Now let's say after an observation period of one month we come to the following result.
[Table: one-month LCP results for Group A (new infrastructure) vs. Group B (old infrastructure)]
We see that group A passes the Google benchmark for LCP, while group B doesn't. Now we have to find out if this difference is statistically significant.
But what does statistical significance mean?
It means that the difference between the two groups is large enough that it is very unlikely to have occurred by chance.
To check if this is the case, we will calculate the statistical significance of the A/B test in the next step.
Conducting statistical significance analysis for an A/B test for page speed:
Today, there is rarely a need to calculate the statistical significance of an A/B test manually. Several tools, such as A/B test Guide or ChatGPT's GPT-4 Code Interpreter, can determine the statistical significance of an A/B test in seconds. Yet it's still crucial to be able to interpret the output of those tests.
A standard A/B test always follows these steps:
1. State the Hypothesis:
Before starting the analysis, we need to clearly state the hypothesis. The hypothesis we want to disprove is called the null hypothesis, or H₀. If the test is significant, we reject H₀ and accept the opposing hypothesis H₁.
H₀: There's no significant difference in LCP values between Group A and Group B.
H₁: Group A has a significantly better LCP value than Group B.
2. Choose a Significance Level:
The significance level α is the probability of rejecting the H₀ hypothesis, although it is true. Therefore, this probability should be rather small. A common choice for α is 0.05, which means there's a 5% chance of incorrectly rejecting the null hypothesis.
3. Decide on a statistical Test:
There are several different statistical tests that can be used here. Which one depends on how the test is set up, what the metric is, and how many observations/users are available.
We can assume that the outcome is binary (the LCP was either good or not), and we have proportions from two independent samples, since the users were assigned randomly. Therefore, the following statistical tests are suitable examples:
Two-Proportion Z-Test:
This test is used to compare two proportions. In our case, these are the proportions of users with a good LCP in Groups A and B. The Z-test assumes that the sample size is sufficiently large (which is true with 50,000 users in each group). This test will help determine whether the proportion of good LCP values in Group A is statistically different from that in Group B.
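A pooled two-proportion Z-test can be computed with nothing but the standard library. The counts below are hypothetical, not the article's data:

```python
import math

def two_proportion_ztest(good_a, n_a, good_b, n_b):
    # Pooled two-proportion Z-test; one-sided p-value for
    # H1: group A has a higher "good LCP" rate than group B.
    p_a, p_b = good_a / n_a, good_b / n_b
    p_pool = (good_a + good_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # survival function of N(0, 1)
    return z, p_value

# Hypothetical counts: 40,000 of 50,000 good LCPs in A vs. 36,000 of 50,000 in B.
z, p = two_proportion_ztest(40_000, 50_000, 36_000, 50_000)
```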
Chi-Squared Test for Independence:
The Chi-squared test is used to test relationships between categorical variables. In our context, the two categories are "LCP is good" and "LCP is not good", and the two groups are A and B. If the proportions differ enough between the two groups, the Chi-squared test will be significant. This test will tell us whether the observed number of users with a good LCP in each group differs from what we would expect if there were no difference between the groups.
Fisher's Exact Test:
While the Chi-squared test is suitable for large samples, Fisher's Exact Test is typically used when the sample sizes are small, although it can also be applied to large ones. As an exact test, it doesn't rely on the Chi-squared test's approximations. Since our sample size is large, the Chi-squared test is the more typical choice, but Fisher's test offers an exact alternative.
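For completeness, Fisher's Exact Test is a one-liner with SciPy (assumed to be installed). The small-sample 2×2 table below is hypothetical:

```python
from scipy.stats import fisher_exact  # assumes SciPy is installed

# Hypothetical small-sample 2x2 table:
# rows = groups A/B, columns = good LCP / not good LCP
table = [[18, 2],
         [11, 9]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
```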
Each of these tests has its assumptions, benefits, and drawbacks. And of course, there are other tests that are also possible, but the ones above are the most common.
It's essential to understand the underlying assumptions of each test and ensure that the data meets these assumptions before conducting the test.
In our case, we will conduct a chi-square test, for the following reasons:
Type of Data: The chi-square test is suitable for categorical data, and in our scenario the outcome in each group is categorical: the LCP was either good or not.
Independence: The observations should be independent, meaning the outcome of one observation should not affect the outcome of another. In A/B testing, the users in Group A and Group B are typically randomly assigned, ensuring independence.
Sample Size: The chi-square test is a good test for larger sample sizes. Given our sample size and proportions, this test is a suitable option.
Objective: The chi-square test determines if there's a significant association between the two categorical variables. In our case, we want to know if there's a significant association between the group (A or B) and the outcome (better LCP or not).
In our case, a Two-Proportion Z-test would also be suitable, but for this article we decided on the chi-square test since its concept and inner workings are more intuitive.
4. Conduct the Test:
For this A/B test using the chi-square method, we set up a 2×2 contingency table of observed counts: each group (A and B) is split into "Success (better LCP)" and "Failure (not better LCP)".
[Contingency table: observed success/failure counts for Groups A and B]
The chi-square statistic is then calculated based on the differences between the observed and expected frequencies. If the observed and expected frequencies are close, the chi-square statistic will be small. If they're different, the chi-square statistic will be large.
At this point, the chi-square statistic (χ²) and the p-value would normally have to be calculated manually to determine whether the change is significant. Luckily, there are several free online calculators for this kind of test; as stated before, GPT-4 can also run it. For this article, we used this Online calculator. For a deep dive into the math behind the test, click here.
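With SciPy (assumed to be installed), the whole test is a few lines. The article's raw counts aren't listed above, so the table below uses hypothetical counts for 50,000 users per group:

```python
from scipy.stats import chi2_contingency  # assumes SciPy is installed

# Hypothetical 2x2 contingency table:
# rows = groups A/B, columns = success (better LCP) / failure
observed = [[40_000, 10_000],
            [36_000, 14_000]]
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the difference between the groups is significant.")
```

`chi2_contingency` also returns the expected frequencies, which is handy for sanity-checking the test's assumptions (e.g., no expected cell count near zero).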
Our results are the following:
Chi-square statistic (χ²): Approximately 793.65
P-value: Approximately 1.58×10⁻¹⁷⁴
5. Interpretation of results:
The p-value is extremely close to zero, which is significantly less than our chosen significance level (α=0.05). This means we reject H₀.
In simpler terms, the change of infrastructure in group A has led to a statistically significant improvement in LCP values when compared to group B, as indicated by the chi-square test. So the infrastructure change was successful.
Of course, not only missing monitoring but also flawed monitoring leads to wrong decisions. Therefore, we have also listed some of the most common A/B testing mistakes that should be avoided.
Choosing the wrong test: Every statistical test has different requirements. Choosing the wrong one causes invalid results.
Not Checking Data: Before conducting the test, it's crucial to ensure that the data doesn't contain errors or anomalies that could skew the results.
Misinterpreting Results: A significant Chi-squared test only states that there's a difference between the observed and expected frequencies; it doesn't tell how substantial that difference is. Post-hoc tests or further analysis might be needed to understand the results better.
Over-simplification: If the data doesn't fit a certain test, don't try to force it. Search for a fitting test to avoid invalid results or losing information.
Misinterpreting the significance level: The significance level only tells how likely it is that H₀ is falsely rejected. It doesn't tell how likely it is that Groups A and B will show exactly the observed difference.
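To address the "Misinterpreting Results" point above, it helps to report an effect size alongside the p-value. A minimal sketch computing the absolute lift in the good-LCP rate with a Wald confidence interval (the counts are hypothetical):

```python
import math

def proportion_diff_ci(good_a, n_a, good_b, n_b, z=1.96):
    # Effect size: absolute lift in the "good LCP" rate of group A
    # over group B, with a ~95% Wald confidence interval.
    p_a, p_b = good_a / n_a, good_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, (diff - z * se, diff + z * se)

# Hypothetical counts: 80% vs. 72% good-LCP rates.
diff, (low, high) = proportion_diff_ci(40_000, 50_000, 36_000, 50_000)
```

Reporting "an 8-percentage-point lift, CI roughly 7.5 to 8.5 points" tells stakeholders far more than "p < 0.05" alone.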
To optimize web performance, the effects of new deployments constantly have to be monitored. Understanding how to conduct significance analysis of an A/B test and avoid common mistakes provides a valuable tool to examine the effect of changes to the web infrastructure.
One of the great advantages of Speed Kit is that it's possible to A/B test its effects. Speed Kit can be activated and deactivated with one click, and therefore it's possible to create two different groups for an A/B test. That way we can prove the effect of Speed Kit for all our customers.
For more Information about Speed Kit click here.