Intuitively, you would expect the size of the sample required to measure a population to depend on the size of that population. Statistically, however, this is not the case in the majority of household surveys, because the population is large enough to be treated as infinite. A population of, say, 100,000 therefore calls for the same overall sample size as a population of 500,000. In practice the overall population size still needs some consideration, as larger populations are usually divided into a greater number of subsamples, for example geographical zones.
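To illustrate why population size has so little effect, the sketch below compares the 95% margin of error for 1,000 interviews drawn from populations of 100,000 and 500,000 and from an effectively infinite one. It assumes simple random sampling and the standard finite population correction; the function name and the choice of a 50% response are illustrative only.

```python
import math

def margin_of_error(p, n, population=None, z=1.96):
    """Margin of error (as a proportion) for a sample proportion p from n interviews.

    When population is given, the standard finite population correction
    sqrt((N - n) / (N - 1)) is applied; otherwise the population is
    treated as infinite.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

for population in (100_000, 500_000, None):
    moe = margin_of_error(0.5, 1000, population)
    label = "infinite" if population is None else f"{population:,}"
    print(f"population {label:>8}: +/- {moe * 100:.2f} percentage points")
```

All three cases come out at roughly 3.1 percentage points, which is why the correction is normally ignored for surveys of this kind.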

To help ensure that data is representative across a survey area, the overall sample needs to be carefully allocated across zones. There are two approaches to this, both of which are equally valid.
The sample can be split in proportion to the population or number of households in each zone.

Alternatively interviews can be evenly spread across zones, then the data weighted at the analysis stage in proportion to population or number of households. This latter approach can be particularly useful if one or more zones are unusually small in terms of population or number of households.
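The two allocation approaches can be sketched as follows. The zone names and household counts are hypothetical, and the weighting shown is one simple scheme (population share divided by sample share), not necessarily the one used in any particular survey.

```python
def proportional_allocation(households, total_interviews):
    """Split interviews across zones in proportion to household counts.

    Rounding means the allocations may not always sum exactly to the total.
    """
    total = sum(households.values())
    return {
        zone: round(total_interviews * count / total)
        for zone, count in households.items()
    }

def even_allocation_weights(households):
    """Weights to apply when every zone receives the same number of interviews.

    Each weight is the zone's share of households divided by its share of
    the sample (1 / number of zones).
    """
    total = sum(households.values())
    zones = len(households)
    return {zone: (count / total) * zones for zone, count in households.items()}

# Hypothetical household counts per zone.
zone_households = {"Zone 1": 20_000, "Zone 2": 15_000, "Zone 3": 10_000, "Zone 4": 5_000}

print(proportional_allocation(zone_households, 1000))
print(even_allocation_weights(zone_households))
```

With proportional allocation the smallest zone here receives only 100 interviews, whereas the even split gives it 250 and then weights those responses down at the analysis stage.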

As with any data collection where a sample is drawn to represent a population, there is potentially a difference between the response from the sample and the true situation in the population as a whole. Many steps are taken to help minimise this difference (e.g. random sample selection, questionnaire construction etc.), but some difference between the sample and the population is always possible. This is known as sampling error, and its likely size is measured by the standard error. The standard error can be estimated using statistical calculations based on the sample size, the population size and the level of response measured (as you would expect, you can potentially get a larger error on a 50% response than on, say, a 10% response, simply because of the magnitude of the numbers). While population size is theoretically an issue, in practice for most consumer type surveys the population in statistical terms is large enough to be considered infinite. This leaves the sample size as the primary factor determining the standard error. To help understand the significance of this error, it is normally expressed as a confidence interval around the results. Clearly, 100% accuracy would require you to sample the entire population. The confidence level usually used is 95%, which means you can be confident that in 19 out of 20 instances the actual population behaviour will fall within the confidence interval range.

Below is a table that shows the 95% confidence interval for 500 and 1000 sample sizes at varying response levels:

Sample size | % response | 95% confidence interval (± percentage points) |
500 | 10 | 2.6 |
500 | 20 | 3.5 |
500 | 30 | 4.0 |
500 | 40 | 4.3 |
500 | 50 | 4.4 |
1000 | 10 | 1.9 |
1000 | 20 | 2.5 |
1000 | 30 | 2.8 |
1000 | 40 | 3.0 |
1000 | 50 | 3.1 |
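The figures in the table appear consistent with the usual normal-approximation formula for a proportion, 1.96 × sqrt(p(1 − p) / n), with the population treated as infinite. The sketch below recomputes them on that assumption; the function name is illustrative.

```python
import math

def margin_of_error_pct(response_pct, sample_size, z=1.96):
    """95% margin of error in percentage points (normal approximation,
    infinite population assumed)."""
    p = response_pct / 100
    return z * math.sqrt(p * (1 - p) / sample_size) * 100

for n in (500, 1000):
    for pct in (10, 20, 30, 40, 50):
        print(f"{n:>5} | {pct:>2} | {margin_of_error_pct(pct, n):.1f}")
```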

So, for example, if the results show a 10% response from a sample of 500, we can say with 95% confidence that the percentage in the population as a whole will be between 7.4% and 12.6%. If the sample size is increased to 1000, the range narrows: at 95% confidence the population figure would lie between 8.1% and 11.9%. In practice the survey results will frequently be even closer to the population than the confidence interval suggests. For example, in almost 7 out of 10 cases the difference between the survey result and the population figure would be no more than half the figures detailed above.
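The ranges quoted above, and the "almost 7 out of 10" figure, can be checked with a short sketch. It assumes the normal approximation throughout; halving the 95% margin corresponds to a z value of 0.98, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def confidence_interval(response_pct, sample_size, z=1.96):
    """Lower and upper bounds, in percent, around a survey response."""
    p = response_pct / 100
    margin = z * math.sqrt(p * (1 - p) / sample_size) * 100
    return response_pct - margin, response_pct + margin

def coverage(z):
    """Share of cases falling within +/- z standard errors (normal distribution)."""
    return 2 * NormalDist().cdf(z) - 1

for n in (500, 1000):
    low, high = confidence_interval(10, n)
    print(f"sample {n}: {low:.1f}% to {high:.1f}%")

print(f"within the full 95% margin (z = 1.96): {coverage(1.96):.1%}")
print(f"within half that margin (z = 0.98): {coverage(0.98):.1%}")
```

Half the 95% margin covers about 67% of cases, in line with the figure quoted in the text.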

As you can see, doubling the sample size does not double the accuracy of the results. The confidence interval narrows in proportion to the square root of the sample size, so moving from 500 to 1000 interviews reduces it by only about 30% (from 4.4% to 3.1% on a 50% response). Larger sample sizes are usually employed where the overall sample is broken down into a large number of subsamples that require a high degree of accuracy to allow comparison. If detailed data were required for a small zone to compare with neighbouring zones, the sample size in that zone could become quite small. For instance, with a sample of 100 in a zone, the 95% confidence interval on a 10% response would be 5.9% and on a 50% response would be 9.8%. However, on most household surveys, zonal level data is used to provide an overall picture of shopping patterns and expenditure distribution across the entire catchment, rather than to compare x% who shop at store A in zone 1 with y% who shop at store A in zone 2. We therefore give greatest consideration to the overall sample size, while being mindful of zonal subsamples.
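A minimal sketch of the effect of sample size on the confidence interval, again assuming the normal approximation and an effectively infinite population (the function name is illustrative):

```python
import math

def margin_of_error_pct(response_pct, sample_size, z=1.96):
    """95% margin of error in percentage points."""
    p = response_pct / 100
    return z * math.sqrt(p * (1 - p) / sample_size) * 100

# Doubling the sample shrinks the margin by a factor of 1 / sqrt(2), not 1 / 2.
ratio = margin_of_error_pct(50, 1000) / margin_of_error_pct(50, 500)
print(f"margin at 1000 interviews relative to 500: {ratio:.3f}")

# A zonal subsample of 100 interviews.
for pct in (10, 50):
    print(f"zone sample of 100, {pct}% response: +/- {margin_of_error_pct(pct, 100):.1f}")
```

For a zonal subsample of 100 the margins work out at roughly 5.9 and 9.8 percentage points, which is why comparisons between individual zones need to be treated with care.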