Ipsos Encyclopedia - Statistical Significance
Definition
What is statistical significance?
In statistics, when we say something is 'significant' we mean it is probably genuine and not due to chance. We might talk about whether two groups of people have significantly different levels of alcohol consumption, or whether a net promoter score has significantly increased year on year. In both these instances 'significant' means the difference is large enough for us to believe it to be genuine and not caused by random chance.
Significance and p-values
So how do we decide whether something is actually significant? We do this by looking at p-values. P-values are simply probabilities; statistics is all about probability. The p-value is the probability that we would find a difference at least as large as the one we observed if no difference existed between the two groups. The second part of this sentence is important. Our default for testing is to assume there are no differences between the two groups. This is known as the 'null hypothesis'. We then conduct the test to see whether the data are consistent with that assumption. (Statistical tests are always done this way round. This is because it is easier to show there is a difference when we don't expect there to be one, than the reverse.) The smaller the p-value is, the harder it is to explain the difference by chance alone.
This is best explained with an example.
We ran a study for a regulatory body that looked at farmers' attitudes towards safety. We asked them how much they agreed with a number of statements about risky behaviours. This included the statement 'I sometimes do things that I know could potentially get me seriously injured or even killed'. Agreement was measured on a five-point scale where 1 = strongly agree and 5 = strongly disagree. We wished to compare risky behaviour across two groups of farmers: those aged 18-54 and those aged 55+. The farmers aged 18-54 had a mean score of 4.1, whereas farmers aged 55+ had a mean score of 4.3 (see below).
Statement | Age group | N | Mean |
I sometimes do things that I know could potentially get me seriously injured or even killed | 18-54 | 587 | 4.1 |
I sometimes do things that I know could potentially get me seriously injured or even killed | 55+ | 1045 | 4.3 |
This suggested younger farmers were more likely to agree with the statement (on this scale a lower score means stronger agreement). However, the difference did not seem very large. We wanted to test whether it reflected genuine differences of opinion in the population (i.e. there are genuine age differences in the attitudes of farmers towards risk) or whether it was due to random chance (i.e. random error that occurred when drawing the sample).
We ran a statistical test on the results. There are different tests available for continuous data and categorical data. In the example above, we used a t-test as we were comparing means from scale questions. A chi-squared test would have been more appropriate for categorical data (i.e. if we had been comparing proportions).
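For the categorical case, a chi-squared test on two proportions can be sketched as follows. This is a minimal illustration rather than the procedure used in the study: the function name `chi_squared_2x2` and the counts are hypothetical, no continuity correction is applied, and the p-value relies on the fact that a chi-squared statistic with one degree of freedom is the square of a standard normal variable.

```python
from statistics import NormalDist


def chi_squared_2x2(a_yes, a_no, b_yes, b_no):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Returns (chi2, p_value). With 1 degree of freedom the statistic is the
    square of a standard normal variable, so the p-value can be read from
    the normal distribution.
    """
    observed = [[a_yes, a_no], [b_yes, b_no]]
    row_totals = [sum(row) for row in observed]
    col_totals = [a_yes + b_yes, a_no + b_no]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in each cell if the two groups did not differ.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    p_value = 2 * (1 - NormalDist().cdf(chi2 ** 0.5))
    return chi2, p_value


# Hypothetical counts: 120 of 400 in group A agree, 150 of 400 in group B.
chi2, p = chi_squared_2x2(120, 280, 150, 250)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

In practice a statistical package would normally be used; the sketch is only meant to show what the test is comparing.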
The testing can be run by DP or using specialist statistical packages, such as SPSS, Stata, R or SAS. The table below shows the output for the test. The test compares the difference between the two means with its standard error (the amount of variation we would expect from sampling alone) and uses that to evaluate how large the difference is. The test statistic and degrees of freedom for the test are given in columns 't' and 'df'. However, the bit we are interested in here is the p-value. This indicates whether the test result is significant or not. The p-value for this test was 0.001.
t-test for Equality of Means
t | df | P-value | Mean Difference | Std. Error Difference |
-3.414 | 1630 | 0.001 | -0.22 | 0.06 |
The difference between the two groups is said to be statistically significant if the p-value is smaller than a pre-set value. We are usually testing at the 95% level, which means this value is usually set at 0.05. The p-value in our test is 0.001, which is far lower than 0.05. This means we can say farmers aged 18-54 are significantly more likely to agree with the statement than older farmers, and that this difference is significant at the 95% level.
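The decision rule can be checked with a few lines of code. The sketch below takes the t statistic from the output table and compares the resulting p-value with 0.05. It assumes a two-sided test and, because there are 1630 degrees of freedom, uses the normal distribution as an approximation to the t distribution; the function name `two_sided_p` is ours, not part of any package.

```python
from statistics import NormalDist


def two_sided_p(t_stat):
    """Two-sided p-value for a test statistic (normal approximation)."""
    return 2 * NormalDist().cdf(-abs(t_stat))


ALPHA = 0.05  # threshold for testing at the 95% level

p = two_sided_p(-3.414)  # t statistic from the output table
print(f"p = {p:.4f}, significant at the 95% level: {p < ALPHA}")
```

Rounded to three decimal places this agrees with the 0.001 reported in the table.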
Different significance levels
Generally, the threshold for significance is determined in advance. A threshold of 0.05 means we are testing things at a 95% confidence level. This is the threshold most often used, although it is also relatively common to test at the 80% level (a threshold of 0.2), the 90% level (a threshold of 0.1) or the 99% level (a threshold of 0.01). You can set the threshold at any value, as long as you do it in advance. The choice depends on your sample size (discussed later) and how large you think the difference is likely to be. However, please note: it's very poor practice to move the threshold around after the event just to make your results significant!
It's important to remember (again) that statistics is all about probability. So even when something is significant it doesn't entirely rule out that the difference you are finding might simply be due to chance. It just means it is less likely. A threshold of 0.05 means that, when there is no real difference, 5% of the time (i.e. once in every twenty tests) we will still find a significant difference between groups purely by chance. A threshold of 0.01 means 1% of the time we will find a significant difference purely by chance, and so on. Nothing is ever 100% certain.
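A small simulation makes the 'once in every twenty tests' point concrete. The sketch below repeatedly draws two samples from the same population, so any difference between them is pure chance, and counts how often a test at the 0.05 threshold calls the difference significant. The sample size, number of trials, seed and function name are arbitrary choices for illustration, and the test uses a normal approximation.

```python
import random
from statistics import NormalDist, mean, variance


def false_positive_rate(trials=2000, n=50, alpha=0.05, seed=1):
    """Share of tests that come out 'significant' when no real difference exists."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(trials):
        # Both groups come from the same population (mean 0, sd 1).
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        se = (variance(a) / n + variance(b) / n) ** 0.5
        z = (mean(a) - mean(b)) / se
        p = 2 * std_normal.cdf(-abs(z))
        if p < alpha:
            hits += 1
    return hits / trials


print(f"Share of chance 'significant' results: {false_positive_rate():.3f}")
```

The share should come out close to 0.05, and close to 0.01 if `alpha` is set to 0.01.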
Impact of sample size and design effects on statistical significance
There are a few aspects of your survey design that will affect the statistical tests and the size of the p-values. It is important to be aware of these. The size of the sample, the design of the sample and any weights will all have an impact on p-values and significance.
Small samples tend to contain more sampling error. Sampling error is random error that exists in every sample. It is there because you have taken a random selection of cases from the population, rather than the full population, to create your sample. On an intuitive level this makes sense – you want your sample to be a representative mini version of your target population, with cases included from all areas of the population. This would be difficult with a large number of cases but even harder if you only have a small number of cases. Larger samples are therefore better for making inferences about the population.
If we took two random samples of 100 from a population and compared the estimates, then took two random samples of 1000 from the same population, we would expect there to be larger differences between the two samples of 100 than between the two samples of 1000. This is purely due to random chance.
We therefore need larger differences between results from two groups with small samples before we can be confident they are statistically significant and not due to random chance. This is because the greater variability in small samples means the results from the two groups overlap more.
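The effect of sample size can be put into numbers. The sketch below works out roughly how large a difference in means has to be before a two-sided test at the 95% level flags it, for two samples of 100 and for two samples of 1000. It assumes simple random samples, the same standard deviation (set to 1 purely for illustration) in both groups, and a normal approximation; the function name is hypothetical.

```python
from statistics import NormalDist


def min_significant_difference(n_a, n_b, sd=1.0, alpha=0.05):
    """Smallest difference in means a two-sided test would flag at `alpha`."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    se = sd * (1 / n_a + 1 / n_b) ** 0.5           # standard error of the difference
    return z_crit * se


for n in (100, 1000):
    diff = min_significant_difference(n, n)
    print(f"n = {n} per group: difference must exceed about {diff:.3f}")
```

Under these assumptions the required difference shrinks with the square root of the sample size, so ten times the sample needs a difference only about a third as large.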
The same thing happens when samples are weighted. This is because weights increase the amount of variance in the estimates. This means, again, the difference between estimates needs to be larger before we can be confident that the difference is genuine and not caused by random chance. The impact of your survey design and weights on the sample can be measured using design effects.
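One common way to approximate the design effect due to weighting is Kish's formula, which converts a set of weights into an 'effective sample size'. The sketch below is an illustration under that assumption only: it captures the effect of unequal weights, not of clustering or stratification, and the function names and example weights are hypothetical.

```python
def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights) squared / sum of squared weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def weighting_design_effect(weights):
    """Design effect due to weighting: actual sample size / effective sample size."""
    return len(weights) / kish_effective_sample_size(weights)


equal = [1.0, 1.0, 1.0, 1.0]
unequal = [0.5, 0.5, 1.5, 1.5]   # hypothetical weights

print(weighting_design_effect(equal))     # equal weights leave the sample unchanged
print(weighting_design_effect(unequal))   # unequal weights shrink the effective sample
```

A design effect above 1 means the weighted sample behaves like a smaller unweighted one, which is why larger differences are needed before they count as significant.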
One and two-sided tests
Many statistical tests can be run as one-sided or two-sided tests (also referred to as one-tailed or two-tailed). A two-sided test should be used if you are testing whether the result from group A is larger or smaller than the result from group B. A one-sided test can be used if there is a natural direction in your data (i.e. you are very confident that the results from group A will not be smaller than those from group B), and you are only interested in testing how much bigger the results are (if at all).
When in doubt, use a two-sided test. The one-sided test should only be used if you have very good reasons to believe the results from one group will not be lower than those from the other.
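The practical difference between the two kinds of test shows up in the p-value. The sketch below computes both for the same test statistic, assuming a normal approximation and a one-sided test of whether group A is larger than group B. The statistic of 1.8 is a made-up value chosen because it falls between the two decisions.

```python
from statistics import NormalDist


def p_values(z):
    """Return (one_sided, two_sided) p-values for a test statistic z.

    The one-sided test here asks only whether group A is larger than group B.
    """
    std_normal = NormalDist()
    one_sided = 1 - std_normal.cdf(z)
    two_sided = 2 * (1 - std_normal.cdf(abs(z)))
    return one_sided, two_sided


one_sided, two_sided = p_values(1.8)   # hypothetical test statistic
print(f"one-sided p = {one_sided:.3f}, two-sided p = {two_sided:.3f}")
```

When the difference lies in the expected direction the one-sided p-value is half the two-sided one, so a result can pass the 0.05 threshold on a one-sided test and fail it on a two-sided test. This is why the one-sided test needs to be justified in advance.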
Random probability sampling
Statistical theory (the science behind the statistical tests) is all based on random probability sampling (I think I may have mentioned that statistics is all about probability!). A sample is random if each member of the population has a known and non-zero probability of being selected. If we know the probability of selection, then we have a link between the sample and the population that allows us to apply statistical theory and make assertions about the population using results from the sample.
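One way to see the link between sample and population is through design weights: if a respondent had a known selection probability, they can be treated as standing in for one divided by that probability people in the population. The sketch below illustrates this with hypothetical scores and selection probabilities. The function names are ours, and real weighting schemes usually involve further adjustments.

```python
def design_weight(selection_probability):
    """Design weight: the number of population members a respondent represents."""
    if not 0 < selection_probability <= 1:
        raise ValueError("selection probability must be known and non-zero")
    return 1 / selection_probability


def weighted_mean(values, probabilities):
    """Estimate a population mean from sample values and selection probabilities."""
    weights = [design_weight(p) for p in probabilities]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical scores; the third respondent was far less likely to be selected,
# so they represent many more people in the population.
scores = [4.0, 5.0, 3.0]
probabilities = [0.1, 0.1, 0.01]
print(weighted_mean(scores, probabilities))
```

A selection probability of zero, or one that is unknown, leaves no weight to calculate, which is the situation with quota samples.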
This means statistical theory does not apply to quota samples, or any samples where a selected individual is allowed to be replaced. Strictly speaking, we should not be running statistical tests on quota samples, although they often get used as a rule of thumb to indicate where differences are likely to lie. We need to be clear to clients when this is the case. Statistical tests may overstate the accuracy of the results with samples that are not truly random. This is because statistical tests only consider random error and ignore biases resulting from non-random error (such as a badly selected sample).