Oh, it’s a scary world out there! Around a campfire in a dark forest, late at night, we researchers shock each other by telling stories about what happens when statistics run amok. Let’s ensure *you* don’t fall prey to some treacherous errors.

Measurement is imprecise, so we use statistical testing to tell us how likely it is that two numbers are really different from each other. We use stats to answer questions like these:

- Is my product really better than my competitor’s, or not?
- Is my reformulated product as good as my current product, or not?

Here’s an explanation of two types of “error” that impact marketing research, in layman’s terms. Warning to statisticians: please look away now, otherwise you may be appalled, since I am not using our professional, precise jargon or covering all the relevant concepts.

- Type 1 error: Saying products are different when really, they are not.
- Type 2 error: Saying products are *not* different, when really they *are* different.

**Type 1 error, saying 2 products are different when they are really not.** The most common type of stat testing you’ll see is done to avoid this error. When we look at the means of a Liking scale and say Our Product is significantly better than Competitor’s Product, we want to be pretty sure it really, really is.

Let’s say the statistical criterion we use is 90%, and the mean Liking score for Our Product is significantly higher than the mean Liking score for Competitor’s Product. The 90% means there’s about a 1 in 10 chance we’ve committed a Type 1 error: these 2 numbers aren’t really different after all, and Our Product is no better than Competitor’s Product.

If we use the 95% criterion, and the stat test says the numbers are different, then we are pretty sure we truly did beat the Competition – hooray! At 95%, there is only about 1 in 20 chance that the 2 products aren’t really different on Liking.
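To make the 95% criterion concrete, here is a minimal sketch in Python using only the standard library. The Liking means, standard deviations, and sample sizes are all hypothetical, and the test is a simple large-sample z-approximation, not the exact test your stats package would run.

```python
import math

def two_sided_p(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Two-sided p-value for a difference in means (large-sample z-approximation)."""
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_a - mean_b) / se
    # standard normal CDF built from math.erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical 9-point Liking means, n = 150 respondents per product
p = two_sided_p(mean_a=6.8, mean_b=6.4, sd_a=1.5, sd_b=1.5, n_a=150, n_b=150)
print(f"p = {p:.3f}")
print("different at 95%" if p < 0.05 else "not different at 95%")
```

When p falls below 0.05, the test says the products differ at the 95% level, with roughly a 1-in-20 chance that this is a Type 1 error.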

This is the most common stat testing because usually we are sleuthing to find the good stuff. We are exploring to see which of x items (concepts, products, etc.) stand out as especially strong. So we want to be sure not to make a Type 1 error.

**Type 2 error, saying products aren’t different, when they are different.** This comes into play when we have a product reformulation, perhaps to reduce product costs, or because an ingredient or manufacturing process has to change. We are hoping that the new formula is just as good as the current one; we are trying to mimic the current product. We want the 2 products to be “at parity”, or *not* significantly different from each other. The relevant risk here is Type 2: concluding that the 2 formulas are the same, when really they are not, and new is really worse than current.

Reduce Type 2 error risk by lowering your criterion for statistical significance. Common practice is to set it at 80%. If the score for the new product is lower than the score for the current product, and the difference is significant at the 80% confidence level, you’ll conclude that the new formula isn’t as good as the current one. You’ve got a 1 in 5 chance of being wrong: you said the product scores are different and they are not. But in this case, it is important not to go ahead with a potentially inferior reformulation.

It would be easy (and wrong) to set an Action Standard that new has to be *not* significantly different from current at the 95% confidence level. You’ve stacked the deck in favor of saying new = current (when maybe it is not). Want to really stack the deck? Set your criterion at 99%. Or cut the sample size way down, or do both. Now you’ve got to have a really big difference between the two numbers in order for the stats test to say, “these numbers are statistically significantly different from each other”. Voila! You conclude the reformulation is just as good as the current product, and off you go to disaster. So, so sad.
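The deck-stacking can be shown numerically. In this hypothetical sketch (same simple large-sample z-approximation, made-up scores), a new formula that truly scores 0.3 Liking points below current is flagged at the 80% criterion with n = 150 per product, but looks “at parity” once you demand 99% confidence and cut the sample to 25:

```python
import math

def two_sided_p(diff, sd, n):
    """Two-sided p-value for a mean difference (equal n and sd per group)."""
    z = diff / math.sqrt(2 * sd**2 / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

deficit = 0.3  # hypothetical: new formula scores 0.3 points below current
sd = 1.5       # hypothetical scale standard deviation

p_fair = two_sided_p(deficit, sd, n=150)
print(p_fair < 0.20)    # 80% criterion, n = 150: the deficit is detected -> True

p_stacked = two_sided_p(deficit, sd, n=25)
print(p_stacked < 0.01)  # 99% criterion, n = 25: the same deficit "vanishes" -> False
```

Same real-world deficit both times; only the criterion and sample size changed.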

Type 2 error is a hard concept to grasp and communicate, so here is a path I’ve seen some companies take for testing reformulations.

The reformulated new product has to score at least as well as the current formula on % positive purchase interest. There is no stat testing; you just look at the actual numbers. If current has 78% positive purchase interest, then the reformulated product has to get at least 78% or it fails that Action Standard. This is a tougher standard than saying “not statistically significantly worse than current at the 80% confidence level”.

Other companies set hurdle rates based on past experience or category databases: a new or reformulated product must hit certain hurdles (regardless of stats) in order to proceed.

**A note on sample size:** It is pretty much true that the bigger your base of respondents, the more sensitive your stat test will be. That means, at a very large sample size (say, 500), even small differences in numbers may be statistically significant. If your sample size is very small (say, 25 people), only huge differences will be statistically significant. That’s why we so often see studies with sample sizes of around 150 people: it is a sort of “sweet spot” balancing study costs with test sensitivity.
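Here is a hypothetical illustration of that sensitivity (same simple large-sample z-approximation; the 0.2-point difference and standard deviation are invented), showing how the identical difference fares at the 95% level across sample sizes:

```python
import math

def two_sided_p(diff, sd, n):
    """Two-sided p-value for a mean difference (equal n and sd per group)."""
    z = diff / math.sqrt(2 * sd**2 / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# The same 0.2-point difference, tested at three sample sizes
for n in (25, 150, 500):
    p = two_sided_p(diff=0.2, sd=1.5, n=n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n = {n:3d}: p = {p:.3f} -> {verdict} at 95%")
```

With these assumed numbers, the difference is only declared significant at the largest sample size; the difference itself never changed.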

**These are tricky concepts – I’d be glad to chat with you about how to use stats well.**

An approach used in health sciences is to consider not only statistical differences, but also whether a difference is clinically meaningful. It usually helps to determine in advance what is a meaningful difference. For the Type 2 error, one might set up the hypothesis test to say that the new product must be x points higher than the benchmark to be considered actionable. And if you are only interested in a higher score, the test can be powered to be a one-sided test of whether the new product exceeds the benchmark by xx points on the scale.
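That one-sided, “meaningful margin” setup can be sketched as follows (stdlib Python; the margin of 0.3 scale points, the means, the standard deviation, and the sample size are all assumptions for illustration): the new product only counts as actionable if it exceeds the benchmark by at least the margin.

```python
import math

def one_sided_p(mean_new, mean_bench, margin, sd, n):
    """One-sided p-value: does new exceed benchmark by at least `margin`?
    Large-sample z-approximation, equal n and sd per group."""
    z = (mean_new - mean_bench - margin) / math.sqrt(2 * sd**2 / n)
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical: new must beat the benchmark by 0.3 points to be actionable
p = one_sided_p(mean_new=7.5, mean_bench=6.8, margin=0.3, sd=1.5, n=150)
print("actionable" if p < 0.05 else "not actionable")
```

A bare statistical win over the benchmark is not enough here; the test asks whether the advantage clears the pre-set meaningful margin.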