Lesson 2: Populations and Samples

This is short but very important, so it gets its own lesson.
Image from Boston University

A population of data is the group of all relevant data points that could be measured. Examples include the number of cats living in each home in the US, the color of every car sold in 2013, the height of every adult male (age 20-80) living in China, and the finish time of every woman who finished the 2014 Boston Marathon. Populations are usually large, but not always. If you have the value for every data point in the population (every race finish time), you can calculate summary parameters to describe the population. Common summary parameters include the average and standard deviation. I'll get to those later.
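Here is a minimal sketch of calculating population parameters in Python. The finish times are made up for illustration (a tiny "population" of five runners), and the variable names are my own. Because we have every value, the standard deviation divides by N, which is what statistics.pstdev does.

```python
import statistics

# Hypothetical finish times in minutes for a complete population of five runners.
# These values are invented for illustration only.
finish_times = [195.0, 210.0, 225.0, 240.0, 255.0]

# We have every data point, so these are population parameters, not estimates.
population_mean = statistics.mean(finish_times)
population_sd = statistics.pstdev(finish_times)  # population SD (divides by N)

print("Population mean:", population_mean)
print("Population standard deviation:", round(population_sd, 2))
```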

Often the population of values is too large to measure. How long would it take to visit every home in the US and count all the cats? Longer than most researchers have. In this case researchers will look at a sample. A sample is a subset of the population. Ideally a sample should be drawn randomly from the population, so that it is most likely to be representative. Continuing with the cats per house example, you could mail letters to 1,000 randomly chosen houses and ask how many cats each one has. When you have a sample you can still calculate summary statistics such as average and standard deviation. These values will be estimates of the population summary parameters. If the researcher wants to know how close the sample statistics are to the population parameters, they may repeat the sampling (i.e., send out another 1,000 letters) and see how similar the results are.
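To make the idea concrete, here is a rough simulation in Python. Everything in it is an assumption for illustration: the "population" is 100,000 simulated homes with an invented distribution of cat counts, and the helper name summarize_sample is my own. It draws two random samples of 1,000 homes and compares their statistics to the population mean.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: cats per home for 100,000 homes.
# The weights are invented, not real survey data.
population = rng.choices([0, 1, 2, 3, 4], weights=[60, 25, 10, 4, 1], k=100_000)
population_mean = statistics.mean(population)

def summarize_sample(population, n, rng):
    """Draw a simple random sample of n homes and return (mean, sample SD)."""
    sample = rng.sample(population, n)  # sampling without replacement
    return statistics.mean(sample), statistics.stdev(sample)

# "Send out" 1,000 letters, then repeat with another 1,000.
first_mean, first_sd = summarize_sample(population, 1000, rng)
second_mean, second_sd = summarize_sample(population, 1000, rng)

print("Population mean:   ", round(population_mean, 3))
print("First sample mean: ", round(first_mean, 3))
print("Second sample mean:", round(second_mean, 3))
```

The two sample means should land close to each other and to the population mean, but they will rarely match exactly. That gap is why a sample statistic is called an estimate.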

Key point: anything we learn using a sample is an estimate of what we would find in the whole population, but we use samples because populations are often too large to study.

On to lesson 3...
