1.1: What is Statistics?

In a nutshell, statistics is about converting data into useful information. It is therefore a process in which we

  • Collect data,

  • Summarize data, and

  • Interpret data.

To really understand how this process works, we need to put it in a context. We will do that by introducing one of the central ideas of this course—the Big Picture of Statistics. We will introduce the Big Picture by building it gradually and explaining each step. At the end of the introductory explanation, once you have the full Big Picture in front of you, we will show it again using a concrete example.

The process of statistics starts when we identify the group we want to study or learn something about. We call this group the population. Note that the word population here (and throughout the course) is used in the broader statistical sense: it can refer not only to people, but also to animals, objects, and so on. For example, we might be interested in

  • The opinions of the population of U.S. adults about the death penalty

  • How the population of mice reacts to a certain chemical

  • The average price of the population of all one-bedroom apartments in a certain city

Population, then, is the entire group that is the target of our interest:

Pictorial representation of a population

In most cases, the population is so large that, as much as we want to, there is absolutely no way we can study all of it (imagine trying to get the opinions of all U.S. adults about the death penalty). A more practical approach would be to examine and collect data only from a subgroup of the population, which we call a sample. We call this first step, which involves choosing a sample and collecting data from it, producing data.

Producing data is visualized as taking a subset of the population, and creating a new set of points.

Since, for practical reasons, we must compromise and examine only a subgroup of the population rather than the whole population, we should choose the sample so that it represents the population well. For example, if we choose a sample from the population of U.S. adults and ask their opinions about the death penalty, we do not want our sample to consist of only Republicans or only Democrats.
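One common way to aim for a representative sample is a simple random sample, in which every member of the population is equally likely to be chosen. The sketch below illustrates the idea under stated assumptions: the population of 100,000 adults, the party labels, and the 30/30/40 split are all made up for illustration, and `label_shares` is a hypothetical helper.

```python
import random

# Hypothetical population: 100,000 adults, each tagged with a party label.
# The 30/30/40 split is invented purely for illustration.
population = (
    ["Republican"] * 30_000 + ["Democrat"] * 30_000 + ["Other"] * 40_000
)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Simple random sample: every member is equally likely to be chosen.
sample = rng.sample(population, k=1_000)

def label_shares(group):
    """Return the fraction of each label in the group."""
    return {label: group.count(label) / len(group) for label in sorted(set(group))}

population_shares = label_shares(population)
sample_shares = label_shares(sample)

print("population:", population_shares)
print("sample:    ", sample_shares)
```

With a random sample of this size, the shares in the sample should land close to the shares in the population, which is what "representing the population well" means here. A sample drawn only from one party's mailing list would not have this property.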

Once the data have been collected, what we have is a long list of answers to questions, or numbers, and in order to explore and make sense of the data, we need to summarize that list in a meaningful way. This second step, which consists of summarizing the collected data, is called exploratory data analysis.

Exploratory data analysis is performed on the data collected from the sample, which is a subset of the population.
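As a small illustration of what summarizing "a long list of answers" can look like, the sketch below turns raw responses into counts and percentages. The 20 responses are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter

# Hypothetical raw answers from a tiny sample of 20 respondents.
answers = ["favor"] * 13 + ["oppose"] * 6 + ["no opinion"] * 1

# Count how often each answer occurs, then convert counts to percentages.
counts = Counter(answers)
percentages = {answer: 100 * n / len(answers) for answer, n in counts.items()}

for answer, pct in percentages.items():
    print(f"{answer}: {counts[answer]} responses ({pct:.0f}%)")
```

The list of 20 individual answers is reduced to three numbers, which is far easier to read and compare.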

Now we’ve obtained the sample results and summarized them, but we are not done. Remember that our goal is to study the population, so what we want is to be able to draw conclusions about the population based on the sample results. Before we can do so, we need to look at how the sample we’re using may differ from the population as a whole, so that we can factor that into our analysis. To examine this difference, we use probability.

In essence, probability is the “machinery” that allows us to draw conclusions about the population based on the data collected about the sample.

The data and the summaries produced by exploratory data analysis are examined using probability, which is the first step toward drawing conclusions about the population based on the data.
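To get a feel for how a sample may differ from the population, it can help to simulate many samples from a population whose true value is known. The sketch below assumes, purely for illustration, a very large population in which 65% favor some proposal, and repeatedly draws samples of 1,082 respondents. The constants and the `sample_share` helper are hypothetical.

```python
import random

rng = random.Random(42)   # fixed seed for reproducibility
TRUE_SHARE = 0.65         # assumed population share, for illustration only
SAMPLE_SIZE = 1_082
NUM_SAMPLES = 1_000

def sample_share():
    """Draw one sample and return the fraction answering 'favor'."""
    favors = sum(rng.random() < TRUE_SHARE for _ in range(SAMPLE_SIZE))
    return favors / SAMPLE_SIZE

shares = [sample_share() for _ in range(NUM_SAMPLES)]

# How often does a sample land within 3 percentage points of the truth?
within_3_points = sum(abs(s - TRUE_SHARE) <= 0.03 for s in shares) / NUM_SAMPLES

print(f"smallest sample share: {min(shares):.3f}")
print(f"largest sample share:  {max(shares):.3f}")
print(f"fraction of samples within 3 points of the truth: {within_3_points:.2f}")
```

Each sample gives a slightly different answer, but the answers cluster around the true value. Probability describes how tight that clustering is, and this is what lets us reason from a single sample back to the population.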

Finally, we can use what we’ve discovered about our sample to draw conclusions about our population. We call this final step in the process inference.

First, a set of data is produced from a subset of the population. Then, we perform exploratory data analysis on the data. To these results we apply probability, which is our first step in drawing conclusions about the population from the data. After we have applied probability, we can draw conclusions about the population. This is called inference, the second step in drawing conclusions.

This is the Big Picture of statistics.

Example

At the end of April 2005, a poll was conducted (by ABC News and the Washington Post) for the purpose of learning the opinions of U.S. adults about the death penalty.

1. Producing Data: A (representative) sample of 1,082 U.S. adults was chosen, and each adult was asked whether he or she favored or opposed the death penalty.

2. Exploratory Data Analysis (EDA): The collected data were summarized, and it was found that 65% of the sampled adults favor the death penalty for persons convicted of murder.

3 and 4. Probability and Inference: Based on the sample result (of 65% favoring the death penalty) and our knowledge of probability, it was concluded (with 95% confidence) that the percentage of those who favor the death penalty in the population is within 3 percentage points of what was obtained in the sample (i.e., between 62% and 68%). The following figure summarizes the example:

A visual representation of the poll about the opinions of U.S. adults on the death penalty. The large population represents U.S. adults, and data were produced from 1,082 of these adults by asking them about the death penalty. In the data set, we have 1,082 responses, and exploratory data analysis tells us that 65% are in favor of the death penalty. Using both probability and inference, we can conclude with 95% confidence that the population percentage is within 3 percentage points of 65% (i.e., between 62% and 68%). This brings us back to where we started, the population.
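The source does not say how the pollsters arrived at the 3-point figure. As a rough check, the sketch below applies the standard normal-approximation formula for a proportion, which is an assumption on our part and not necessarily the method the poll used. The variable names are illustrative.

```python
import math

n = 1_082      # sample size from the poll
p_hat = 0.65   # sample proportion favoring the death penalty
z = 1.96       # standard normal critical value for 95% confidence

# Normal-approximation margin of error for a sample proportion.
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error
lower, upper = p_hat - margin, p_hat + margin

print(f"margin of error: {margin:.3f}")
print(f"95% interval: {lower:.3f} to {upper:.3f}")
```

Under this approximation the margin comes out a little under 0.03, which is consistent with the reported interval of 62% to 68% after rounding.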
