# CRO Statistics: How to Avoid Reporting Bad Data

Posted by CraigBradford

Without a basic understanding of statistics, it's easy to present misleading results to your clients or superiors. This can lead to underwhelming results when you roll out new versions of a page that, on paper, look like they should perform much better. In this post I want to cover the main aspects of planning, monitoring and interpreting CRO results so that when you do roll out new versions of pages, the results are much closer to what you would expect. I've also got a free tool to give away at the end, which does most of this for you.

## Planning

A large part of running a successful conversion optimisation campaign happens before a single visitor reaches the site. Before starting a CRO test it's important to have:

- A hypothesis of what you expect to happen
- An estimate of how long the test should take
- Analytics set up correctly so that you can measure the effect of the change accurately

Assuming you have a hypothesis, let's look at predicting how long a test should take.

### How long will it take?

As a general rule, the less traffic your site gets and/or the lower the existing conversion rate, the longer it will take to get statistically significant results. There's a great tool by Evan Miller that I recommend using before starting any CRO project. By entering the baseline conversion rate and the minimum detectable effect (i.e. the minimum percentage change in conversion rate that you care about: 2%? 5%? 20%?), you can get an estimate of how much traffic you'll need to send to each version. Working backwards from the traffic your site normally gets, you can estimate how long your test is likely to take. When you arrive on the site, you'll see the following defaults:

Notice the setting that allows you to swap between "absolute" and "relative". Toggling between them will help you understand the difference, but as a general rule, people tend to speak about conversion rate increases in relative terms. For example:

Using a baseline conversion rate of 20%

- With a 5% absolute improvement – the new conversion rate would be 25%
- With a 5% relative improvement – the new conversion rate would be 21%
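To make the distinction concrete, here is a minimal Python sketch. The function names are illustrative only and do not come from any of the tools mentioned in this post.

```python
def apply_absolute_uplift(baseline_rate, uplift):
    """Add the uplift in percentage points: 20% + 5 points = 25%."""
    return baseline_rate + uplift


def apply_relative_uplift(baseline_rate, uplift):
    """Scale the baseline by the uplift: 20% * 1.05 = 21%."""
    return baseline_rate * (1 + uplift)


baseline = 0.20
print(f"absolute: {apply_absolute_uplift(baseline, 0.05):.0%}")  # absolute: 25%
print(f"relative: {apply_relative_uplift(baseline, 0.05):.0%}")  # relative: 21%
```

The same "5%" describes a five-point jump in one case and a one-point jump in the other, which is why it matters to say which one you mean when reporting results.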

There's a huge difference in the sample size needed to detect each kind of change as well. In the absolute example above, 1,030 visits are needed for each branch. If you're running two test versions against the original, that looks like this:

- Original – 1,030
- Version A – 1,030
- Version B – 1,030

Total 3,090 visits needed.

If you change that to relative, the picture changes drastically: 25,255 visits are needed for each version, a total of 75,765 visits.

If your site only gets 1,000 visits per month and you have a baseline conversion rate of 20%, it's going to take you around 6 years to detect a significant **relative** increase in conversion rate of 5%, compared to only around 3 months for an **absolute** change of the same size.

This is why the question of whether or not small sites can do CRO often comes up. The answer is yes, they can, but you'll want to aim higher than a 5% relative increase in conversions. For example, if you aim for a 35% relative increase (with a 20% baseline conversion rate), you'll only need 530 visits to each version. In summary, go big if you're a small site. Don't test small changes like button changes; test completely new landing pages. Otherwise it's going to take you a very long time to get significantly better results.
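The figures in this section can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the tool's actual code: it uses a common sample-size approximation for comparing two proportions, and it assumes a two-sided 5% significance level and 80% power, neither of which is stated in this post. With those assumptions the results should land close to the 1,030, 25,255 and 530 figures quoted above.

```python
from math import sqrt
from statistics import NormalDist


def visits_per_version(baseline, uplift, relative=True, alpha=0.05, power=0.80):
    """Approximate visits needed per version to detect the given uplift.

    alpha (two-sided significance level) and power are assumed values;
    the post does not state which settings produced its figures.
    """
    target = baseline * (1 + uplift) if relative else baseline + uplift
    delta = abs(target - baseline)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    spread = (z_alpha * sqrt(2 * baseline * (1 - baseline))
              + z_power * sqrt(baseline * (1 - baseline) + target * (1 - target)))
    return round((spread / delta) ** 2)


def months_to_finish(per_version, versions, monthly_visits):
    """Months needed when traffic is split evenly across all versions."""
    return per_version * versions / monthly_visits


abs_visits = visits_per_version(0.20, 0.05, relative=False)
rel_visits = visits_per_version(0.20, 0.05, relative=True)
print(f"absolute 5%: {abs_visits:,} per version, "
      f"{months_to_finish(abs_visits, 3, 1000):.1f} months")
print(f"relative 5%: {rel_visits:,} per version, "
      f"{months_to_finish(rel_visits, 3, 1000):.1f} months")
print(f"relative 35%: {visits_per_version(0.20, 0.35):,} per version")
```

Changing `monthly_visits` or the number of versions shows quickly whether a planned test is realistic for your traffic before you build anything.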

### Analytics

A critical part of understanding your test results is having appropriate tracking in place. At Distilled we use Optimizely, so that's what I'll cover today; fortunately, Optimizely makes testing and tracking really easy. All you need is a Google Analytics account that has a custom variable slot free (a custom dimension in Universal Analytics). For either Classic or Universal Analytics, begin by going to the Optimizely Editor, then clicking Options > Analytics Integration. Select enable and enter the custom variable slot that you want to use. That's it. For more details, see the help section on the Optimizely website.

With Google Analytics tracking enabled, when you go to the appropriate custom variable slot in Google Analytics, you should see a custom variable named after the experiment. In the example below the client was using custom variable slot 5:

This is a crucial step. While you can get by using only Optimizely goals, such as setting a thank-you page as a conversion, that doesn't give you the full picture. As well as measuring conversions, you'll also want to measure behavioral metrics. Using analytics allows you to measure not only conversions, but other metrics like average order value, bounce rate, time on site and secondary conversions.

### Measuring interaction

Another thing that's easy to measure with Optimizely is interaction on the page, such as button clicks. Even if you don't have event tracking set up in Google Analytics, you can still measure changes in how people interact with the site. It's not as simple as it looks, though. If you try to track an element in the new version of a page, you'll get an error message saying that no items are being tracked. See the example from Optimizely below:

Ignore this message. As long as you've highlighted the correct button before selecting track clicks, the tracking should work just fine. See the help section on Optimizely for more details.

## Interpreting results

Once you have a test up and running, you should start to see results in Google Analytics as well as Optimizely. At this point, there are a few things to understand before you get too disappointed or excited.

### Understanding statistical significance

If you're using Google Analytics for conversion rates, you'll need something to tell you whether or not your results are statistically significant. I like this tool by Kiss Metrics, which looks like this:

It's easy to look at the above and celebrate your 18% increase in conversions, but you'd be wrong to. It's easier to explain what this means with an example. Let's imagine you have a pair of dice that we know are exactly the same. If you were to roll each die 100 times, you would expect to see each of the numbers 1-6 about the same number of times on both dice (which works out at around 17 times per side). Let's say on this occasion, though, that we are trying to see how good each die is at rolling a 6. Look at the results below:

- Die A – 17/100 = 0.17 conversion rate
- Die B – 30/100 = 0.30 conversion rate

A simplistic way to think about statistical significance is that it tells you how likely it is that getting more 6s on the second die was just a fluke, rather than the die having been optimised in some way to roll 6s.

This makes sense when we think about it. Given that out of 100 rolls we expect to roll a 6 around 17 times, if the second die rolled a 6 19 times out of 100, we could believe that we just got lucky. But if it rolled a 6 30 times out of 100 (76% more), we would find it hard to believe that we just got lucky and that the second die wasn't actually loaded. If you were to put these numbers into a statistical significance tool (two-sided t-test), it would say that B performed better than A by 76% with 97% significance.

In statistics, statistical significance is the complement of the p-value. The p-value in this case is 3%, so the complement is 97% (100 - 3 = 97). This means there's a 3% chance that we'd see results this extreme if the dice were identical.
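As a rough check on those numbers, here is a minimal Python sketch. Note the swap: the post refers to a two-sided t-test, while this sketch uses a pooled two-proportion z-test, which is a standard large-sample approximation and is my assumption about a reasonable stand-in, not a description of what any particular tool does. For the dice data it gives a p-value of roughly 3%.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pool the data, as if both versions shared one true conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Die A rolled 17 sixes in 100 rolls, die B rolled 30 in 100.
p_value = two_proportion_p_value(17, 100, 30, 100)
uplift = (30 / 100) / (17 / 100) - 1
print(f"relative uplift: {uplift:.0%}")
print(f"p-value: {p_value:.3f}, significance: {1 - p_value:.0%}")
```

Swapping in your own conversions and visits for the four arguments gives the same kind of readout for a real test.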

When we see statistical significance in tools like Optimizely, they have just taken the complement of the p-value (100 - 3 = 97%) and displayed it as the chance to beat baseline. In the example above, we would see a chance to beat baseline of 97%. Notice that I didn't say there's a 97% chance of B being 76% better; it's just that on this occasion the difference was 76%.

This means that if we were to throw each die 100 times again, we're 97% sure we would see a noticeable difference again, which may or may not be as much as 76%. So, with that in mind, here is what we can accurately say about the dice experiment:

- There's a 97% chance that die B is different to die A

Here's what we cannot say:

- There's a 97% chance that die B will perform 76% better than die A

This still leaves us with the question of what we can expect to happen if we roll version B out. To do this we need to use confidence intervals.

### Confidence intervals

Confidence intervals give us an estimate of how likely it is that the change falls within a certain range. To continue with the dice example, we saw an increase in conversions of 76%. Calculating confidence intervals allows us to say things like:

- We're 90% sure B will increase the number of 6s you roll by between 19% and 133%
- We're 99% sure B will increase the number of 6s you roll by between -13% and 166%

**Note: These are relative ranges, that is, from 13% lower than 17% to 166% higher than 17%.**
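One way to arrive at ranges like these is sketched below. It is an assumption on my part rather than a documented method: it builds a normal-approximation interval for the absolute difference in rates, then divides by A's observed rate to express it in relative terms (treating A's rate as fixed, which is a simplification). For the dice data it gives ranges very close to the ones listed above.

```python
from math import sqrt
from statistics import NormalDist


def relative_uplift_interval(conv_a, n_a, conv_b, n_b, confidence):
    """Confidence interval for B's uplift, relative to A's conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the absolute difference in rates.
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    # Dividing by A's rate turns percentage points into relative change.
    return (diff - z * std_err) / rate_a, (diff + z * std_err) / rate_a


for level in (0.90, 0.99):
    low, high = relative_uplift_interval(17, 100, 30, 100, level)
    print(f"{level:.0%} interval: {low:+.0%} to {high:+.0%}")
```

Asking for more confidence widens the interval, which is why the 99% range dips below zero while the 90% range does not.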

The three questions you might be asking at this point are:

- Why is the range so large?
- Why is there a chance it could go negative?
- How likely is the difference to be on the negative side of the range?

The only way we can reduce the range of the confidence intervals is by collecting more data. To decrease the chance of the difference being less than 0 (we don't want to roll out a version that performs worse than the original), we need to roll the dice more times. Assuming the same conversion rates for A (17%) and B (30%), look at the difference that increasing the sample size makes to the range of the confidence intervals.

As you can see, with a sample size of 100 we have a 99% confidence range of -13% to 166%. If we kept rolling the dice until we had a sample size of 10,000, the 99% confidence range looks much better: it's now between 67% and 85% better.
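The narrowing effect can be reproduced with a short sketch. As before, this assumes a normal-approximation interval on the absolute difference, expressed relative to A's rate; the function name and the intermediate sample size of 1,000 are mine, added for illustration.

```python
from math import sqrt
from statistics import NormalDist


def relative_interval(rate_a, rate_b, n_per_version, confidence=0.99):
    """Relative-uplift interval when both versions get n_per_version visits."""
    std_err = sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n_per_version)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return (diff - z * std_err) / rate_a, (diff + z * std_err) / rate_a


# Same observed rates (17% and 30%), increasingly large samples.
intervals = {n: relative_interval(0.17, 0.30, n) for n in (100, 1_000, 10_000)}
for n, (low, high) in intervals.items():
    print(f"n = {n:>6,}: {low:+.0%} to {high:+.0%} (width {high - low:.0%})")
```

Because the standard error shrinks with the square root of the sample size, you need roughly 100 times the data to make the interval 10 times narrower.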

The point of this is to show that even if you have a statistically significant result, it's often wise to keep the test running until you have tighter confidence intervals. At the very least, I don't like to present results until the lower limit of the 90% interval is greater than or equal to 0.

### Calculating average order value

Sometimes conversion rate on its own doesn't matter. If you make a change that makes 10% fewer people buy, but those that do buy spend 10x more money, then the net effect is still positive.
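The arithmetic behind that claim is easiest to see as revenue per visitor, which is conversion rate multiplied by average order value. The figures in this sketch are hypothetical, chosen only to mirror the example.

```python
def revenue_per_visitor(conversion_rate, average_order_value):
    """Expected revenue from one visitor: chance of buying times spend."""
    return conversion_rate * average_order_value


# Hypothetical figures chosen to mirror the example in the text.
control = revenue_per_visitor(0.020, 50.0)
# 10% fewer buyers (2.0% -> 1.8%), each spending 10x more (50 -> 500).
variant = revenue_per_visitor(0.020 * 0.9, 50.0 * 10)
print(f"control: {control:.2f} per visitor")
print(f"variant: {variant:.2f} per visitor ({variant / control:.1f}x)")
```

A version that loses on conversion rate can still win comfortably on revenue, so it is worth checking both before declaring a result.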

To track this we need to be able to see the average order value of the control compared to the test version. If you've set up Google Analytics integration like I showed previously, this is very easy to do.

If you go into Google Analytics, select the custom variable tab, then select the e-commerce view, you'll see something like:

- Version A 1000 visits – 10 conversions – Average order value
- Version B 1000 visits – 10 conversions – Average order value 0

It's great that people who saw version B appear to spend twice as much, but how do we know if we just got lucky? To find out, we need to do some more work. Luckily, there's a tool that makes this very easy, again made by Evan Miller: the two-sample t-test tool.

To find out if the change in average order value is significant, we need a list of all the transaction amounts for version A and version B. The steps to do that are below:

1 - Create an advanced segment for version A and version B using the custom variable values.

2 - Individually apply the two segments youâve just created, go to the transactions report under e-commerce and download all transaction data to a CSV.

3 - Dump data into the two-sample t-test tool

The tool doesn't accept special characters like $ or £, so remember to remove those before pasting into the tool. As you can see in the image below, I have version A data in the sample 1 area and the transaction values for version B in the sample 2 area. The output can be seen in the image below:

Whether or not the difference is significant is shown below the graphs. In this case the verdict was that sample 1 was in fact significantly different. To find the size of the difference, look at the "d" value where it says "difference of means". In the example above the transactions of those people that saw the test version were on average more than those that saw the original.
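If you'd rather script these steps than paste into a web form, the clean-up and the core of the calculation can be sketched in Python. Several things here are assumptions: the transaction values are made up, `parse_amount` and `welch_t` are illustrative names, and the unequal-variance (Welch) form of the t-test is my choice, which may not match the tool's settings. The sketch stops at the t statistic and degrees of freedom, because turning those into a p-value needs the t-distribution, which Python's standard library doesn't provide.

```python
from math import sqrt
from statistics import mean, variance


def parse_amount(text):
    """Strip currency symbols and thousands separators: '$1,250.00' -> 1250.0."""
    return float(text.replace("$", "").replace("£", "").replace(",", "").strip())


def welch_t(sample_1, sample_2):
    """Return (difference of means, t statistic, degrees of freedom)."""
    var_1 = variance(sample_1) / len(sample_1)
    var_2 = variance(sample_2) / len(sample_2)
    diff = mean(sample_2) - mean(sample_1)
    t_stat = diff / sqrt(var_1 + var_2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (var_1 + var_2) ** 2 / (
        var_1 ** 2 / (len(sample_1) - 1) + var_2 ** 2 / (len(sample_2) - 1)
    )
    return diff, t_stat, dof


# Made-up transaction values, purely for illustration.
version_a = [parse_amount(v) for v in ["$20", "$25", "$30", "$35", "$40"]]
version_b = [parse_amount(v) for v in ["$50", "$55", "$60", "$65", "$70"]]
diff, t_stat, dof = welch_t(version_a, version_b)
print(f"difference of means: {diff:.2f}, t = {t_stat:.2f}, df = {dof:.1f}")
```

A large absolute t value relative to the degrees of freedom points to a real difference; for the verdict itself, a t-distribution table or the tool above will give you the p-value.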

## A free tool for reading this far

If you run a lot of CRO tests you'll find yourself using the above tools a lot. While they are all great tools, I like to have these in one place. One of my colleagues, Tom Capper, built a spreadsheet which does all of the above very quickly. There are two sheets: conversion rate and average order value. The only data you need to enter in the conversion rate sheet is conversions and sessions, and in the AOV sheet you just paste in the transaction values for both data sets. The conversion rate sheet calculates:

- Conversion rate
- Percentage change
- Statistical significance (one sided and two sided)
- 90%, 95% and 99% confidence intervals (relative and absolute)

There's an extra field that I've found really helpful (working agency side) called **"Chance of <=0 uplift"**.

If, like the example above, you present results that have a potential negative lower range of a confidence interval:

- We're 90% sure B will increase the number of 6s you roll by between 19% and 133%
- We're 99% sure B will increase the number of 6s you roll by between -13% and 166%

The logical question a client is going to ask is: **"What chance is there of the result being negative?"**

That's what this extra field calculates. It gives us the chance of rolling out the new version of a test and the difference being less than or equal to 0%. For the data above, the 99% confidence interval was -13% to +166%. The fact that the lower limit of the range is negative doesn't look great, but using this calculation, the chance of the difference being <=0% is only 1.41%. Given the potential upside, most clients would agree that this is a chance worth taking.
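I can't show the spreadsheet's exact formula here, so treat the following as one plausible way to get such a figure rather than a description of the sheet. The sketch models the true difference in rates as normally distributed around the observed difference and measures how much of that distribution sits at or below zero. For the dice data it comes out at about 1.4%, in line with the figure above.

```python
from math import sqrt
from statistics import NormalDist


def chance_of_no_uplift(conv_a, n_a, conv_b, n_b):
    """Normal-approximation probability that B's true uplift is <= 0."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    # Model the true difference as Normal(observed difference, std_err)
    # and measure the share of that distribution at or below zero.
    return NormalDist(mu=rate_b - rate_a, sigma=std_err).cdf(0.0)


print(f"{chance_of_no_uplift(17, 100, 30, 100):.2%}")
```

A single number like this is often easier for a client to weigh than an interval that happens to cross zero.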

You can download the spreadsheet here: **Statistical Significance.xls**

Feel free to say thanks to Tom on Twitter.

This is an internal tool, so if it breaks, please don't send Tom (or me) requests to fix, upgrade or change it.

If you want to speed this process up even more, I recommend transferring this spreadsheet into Google Docs and using the Google Analytics API to do it automatically. Here's a good post on how you can do that.

I hope you've found this useful, and if you have any questions or suggestions please leave a comment.

If you want to learn more about the numbers behind this spreadsheet and statistics in general, some blog posts I'd recommend reading are:

Scientific method: Statistical errors


August 4, 2014
