Conversion Wednesday – Have Confidence in Your Testing Data



In order to understand how confident you can be in your test results, we first need to define three terms.

The first term, confidence, is sometimes called “significance.” It shows how likely it is that your test results reflect a real difference rather than random variation.

The second term is Z-score. Z-score comes from the bell curve: it measures how many standard deviations a result sits from the mean. To keep your eyes from crossing and glazing over, all you need to know is: the higher a Z-score is, the less likely your test results are due to chance (with 3.0 being the goal).
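For readers who want to see where a Z-score can come from, here is a minimal sketch using a pooled two-proportion z-test, one common way to compare two conversion rates. This is an assumption for illustration: your testing tool may calculate its Z-score differently, and the function name and visitor numbers below are hypothetical.

```python
import math

def conversion_z_score(conversions_a, visitors_a, conversions_b, visitors_b):
    """Z-score for the difference between two conversion rates.

    Assumes a pooled two-proportion z-test; a testing tool may use
    a different formula.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate: treat both versions as one group to estimate the spread.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_error == 0:
        raise ValueError("Cannot compute a Z-score with no variation in the data")
    return (rate_b - rate_a) / std_error

# Hypothetical numbers: control converts 200 of 10,000, variation 260 of 10,000.
z = conversion_z_score(200, 10_000, 260, 10_000)
print(f"Z-score: {z:.2f}")
```

With these made-up numbers the Z-score lands a little under 3.0, so the result is close to the goal but not quite there.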

The third term, chance of error, is just that: how likely it is that your results are due to chance.

The trouble is that the confidence metric is misleading. If we only look at the confidence metric, we would assume that 90% confidence is pretty confident, right? I mean, after all, 90% is 9 out of 10.

The problem is that there’s a big difference in Z-score between 90%, 95%, and 99% confidence.

The table below shows confidence next to its corresponding Z-score and chance of error. As you can see, the chance of error drops dramatically from 95.4% to 99.7% confidence.

A Z-score of 2.0 (95.4% confidence) means there is roughly a 5% probability that the results are due to chance. Put another way, about 50 out of 1,000 customers had an experience that didn’t follow what your test result data is telling you.

A Z-score of 3.0 (99.7% confidence) reduces that to about 0.3%, or roughly 3 varying experiences out of 1,000.
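The link between Z-score, confidence, and chance of error described above can be checked with a few lines of Python. This is a minimal sketch that assumes a two-sided reading of the bell curve (the one that matches the 95.4% and 99.7% figures); the function names are made up for illustration.

```python
import math

def confidence_from_z(z):
    """Two-sided confidence for a Z-score under a normal (bell) curve."""
    return math.erf(abs(z) / math.sqrt(2))

def chance_of_error(z):
    """Chance of error is whatever confidence leaves over."""
    return 1.0 - confidence_from_z(z)

# Z-scores for roughly 90%, 95%, 95.4%, 99%, and 99.7% confidence.
for z in (1.645, 1.96, 2.0, 2.576, 3.0):
    conf = confidence_from_z(z)
    err = chance_of_error(z)
    print(f"Z = {z:<5}  confidence = {conf:6.1%}  "
          f"chance of error = {err:5.1%}  (~{err * 1000:.0f} in 1,000)")
```

Running it shows a Z-score of 2.0 landing near 95.4% confidence and 3.0 near 99.7%, which is why a small-looking gain in confidence translates into a large drop in chance of error.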

So, to be confident in your test results, try to focus on the Z-score. A Z-score of 2.0 means the data is trending in that direction, and a Z-score of 3.0 means you can be confident that your test results are valid.

There are a few other factors to consider, but if your test has run long enough to cover weekly trends (such as newsletter mailing schedules or weekend performance trends) and long enough to collect sufficient traffic, your Z-score should give you confident results.

Considering a test complete and making a change to your site based on 95% confidence could leave you scratching your head later about why the change didn’t perform the way it did in your test.

