1.2: Data Types and Levels of Measurement

Variables

Recall the Big Picture, the four-step process that encompasses statistics (as it is presented in this course):

1. Producing Data—Choosing a sample from the population of interest and collecting data.

2. Exploratory Data Analysis (EDA)—Summarizing the data we’ve collected.

3. and 4. Probability and Inference—Drawing conclusions about the entire population based on the data collected from the sample.

Even though in practice it is the second step in the process, we look at exploratory data analysis (EDA) first.

The big picture of statistics. First, a set of data is produced from a subset of the population. Then we perform step 2, exploratory data analysis, on the data. This is the step we are working on in this unit. With the analysis results, we apply the third step, probability, to help us draw conclusions about the population from the data. Finally, we perform inference and draw conclusions. This is the fourth step.

Before we jump into exploratory data analysis and really appreciate its importance in the process of statistical analysis, let’s step back for a minute and ask:

What do we really mean by data?

Data are pieces of information about individuals organized into variables. By an individual, we mean a particular person or object. By a variable, we mean a particular characteristic of the individual.

A dataset is a set of data identified with particular circumstances. Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables.

Example

Medical Records

The following dataset shows medical records from a particular survey:

A table in which the rows represent patients and each column represents a variable. For example, the third row is for Patient #3, and each cell in the row is in a particular column. The first column is Gender, and Patient #3's gender is female, so there is an F in the first column of the third row.

In this example, the individuals are patients, and the variables are Gender, Age, Weight, Height, Smoking, and Race. Each row, then, gives us all the information about a particular individual (in this case, patient), and each column gives us information about a particular characteristic of all the patients.
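The row-and-column layout described above can be mirrored in code. The following is a minimal Python sketch, not the course's own material: the variable names come from the example, but every value is invented for illustration, and the list-of-dictionaries representation is just one assumed way to hold a small table.

```python
# Hypothetical records: all values below are invented for illustration only.
patients = [
    {"Gender": "M", "Age": 59, "Weight": 175, "Height": 69, "Smoking": 1, "Race": "White"},
    {"Gender": "F", "Age": 67, "Weight": 140, "Height": 62, "Smoking": 2, "Race": "Black"},
    {"Gender": "F", "Age": 31, "Weight": 155, "Height": 66, "Smoking": 1, "Race": "Asian"},
]

# A row (one dictionary) holds all the information about one individual.
third_patient = patients[2]

# A column (one key) holds one characteristic across all individuals.
ages = [patient["Age"] for patient in patients]

# The variables are the column names.
variables = list(patients[0].keys())

print(third_patient["Gender"])
print(ages)
print(variables)
```

Reading across a dictionary gives everything known about one patient; collecting one key across all dictionaries gives one variable for every patient.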

Variables can be classified into one of two types: categorical or quantitative.

  • Categorical variables take category or label values and place an individual into one of several groups. Each observation can be placed in only one category, and the categories are mutually exclusive.

    In our example of medical records, Smoking is a categorical variable, with two groups, since each participant can be categorized only as either a nonsmoker or a smoker. Gender and Race are the two other categorical variables in our medical records example. Notice that the values of the categorical variable Smoking have been coded as the numbers 1 or 2. It is common to code the values of a categorical variable as numbers, but you should remember that these are just codes. They have no arithmetic meaning (i.e., it does not make sense to add, subtract, multiply, divide, or compare the magnitude of such values).

  • Quantitative variables take numerical values and represent some kind of measurement.

    In our medical example, Age is an example of a quantitative variable because it can take on multiple numerical values. It also makes sense to think about it in numerical form; that is, a person can be 18 years old or 80 years old. Weight and Height are also examples of quantitative variables.

Categorical variables are sometimes called qualitative variables, but in this course we use the term categorical.
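The difference between the two types shows up in which summaries make sense. The sketch below uses Python's standard library and invented values (the lists are assumptions, not data from the medical records table): counting is meaningful for the coded categorical variable Smoking, while arithmetic such as an average is meaningful for the quantitative variable Age.

```python
from collections import Counter
from statistics import mean

# Hypothetical values, invented for illustration only.
smoking_codes = [1, 2, 1, 1, 2]  # categorical: 1 and 2 are only labels for groups
ages = [18, 80, 45, 33, 52]      # quantitative: numbers that measure something

# Meaningful summary for a categorical variable: how many individuals fall in each group.
smoking_counts = Counter(smoking_codes)

# Meaningful summary for a quantitative variable: arithmetic such as the average.
average_age = mean(ages)

print(smoking_counts[1], smoking_counts[2])
print(average_age)

# Python would happily compute mean(smoking_codes) too, but the result would
# describe the arbitrary codes, not the patients, so it should not be interpreted.
```

The software cannot tell a code from a measurement; keeping track of each variable's type is up to the analyst.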

Exercise

We took a random sample from the 2000 U.S. Census. Here is part of the dataset:

Activity

Let’s Explore a Dataset

In this activity, we:

  • Learn how to open and examine a dataset.

  • Practice classifying variables by their type: quantitative or categorical.

  • Learn how to handle categorical variables whose values are numerically coded.

Background to Dataset

Clinical depression is the most common mental illness in the United States, affecting 19 million adults each year (Source: NIMH, 1999). Nearly 50% of individuals who experience a major episode will have a recurrence within 2 to 3 years. Researchers are interested in comparing therapeutic solutions that could delay or reduce the incidence of recurrence.

In a study conducted by the National Institutes of Health, 109 clinically depressed patients were separated into three groups, and each group was given one of two active drugs (imipramine or lithium) or no drug at all. For each patient, the dataset contains the treatment used, the outcome of the treatment, and several other interesting characteristics.

Here is a summary of the variables in our dataset:

  • Hospt: The patient’s hospital, represented by a code for each of the 5 hospitals (1, 2, 3, 5, or 6)

  • Treat: The treatment received by the patient (Lithium, Imipramine, or Placebo)

  • Outcome: Whether or not a recurrence occurred during the patient’s treatment (Recurrence or No Recurrence)

  • Time: Either the time in days until the first recurrence or, if a recurrence did not occur, the length in days of the patient’s participation in the study

  • AcuteT: The time in days that the patient was depressed prior to the study

  • Age: The age of the patient in years, when the patient entered the study

  • Gender: The patient’s gender (1 = Female, 2 = Male)

To open the dataset, click here to download the file to your computer. Then find the downloaded file and double-click it to open it in Excel. When Excel opens you may have to enable editing.

It is often easier to work with categorical variables when their labels are as close as possible to the meanings of the categories. We will now recode the variable Gender with the labels “Male” and “Female.” To do that in Excel:

  • Click on the column header above the Gender column. In this case, Gender is in column G, so click on the column header labeled G. This will select the entire column of gender values.

  • In the Home tab under the Editing group, choose Find & Select → Replace to bring up the Find and Replace window.

  • In the first textbox labeled Find what:, enter “1”.

  • In the second textbox labeled Replace with:, enter “Female”.

  • Now click the button labeled Replace All. This will replace all of the “1” values in our selected column with the word “Female”. Press OK to close the dialog box that was launched.

  • Now do the same thing for males:

  • In the first textbox labeled Find what:, enter “2”.

  • In the second textbox labeled Replace with:, enter “Male”.

  • Click the Replace All button and press OK to close the dialog box. Then click the Close button.

Notice that the column Gender now contains the meaningful labels “Female” and “Male” where before it contained “1” and “2” codes.
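The same recoding can be expressed outside Excel. The following is a minimal Python sketch under stated assumptions: the list `gender_codes` is a made-up stand-in for the Gender column (it is not read from the downloaded file), and the names `gender_labels` and `gender` are hypothetical. Only the coding itself (1 = Female, 2 = Male) comes from the variable summary above.

```python
# Hypothetical stand-in for the Gender column; the real file is not read here.
gender_codes = [1, 2, 2, 1, 1]

# Coding taken from the variable summary: 1 = Female, 2 = Male.
gender_labels = {1: "Female", 2: "Male"}

# Replace each numeric code with its meaningful label.
gender = [gender_labels[code] for code in gender_codes]

print(gender)
```

Looking each code up in a dictionary has one practical side effect: an unexpected code (for example, a 3 entered by mistake) raises a `KeyError` instead of passing through silently.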


Scales of Measurement

In the previous section, we made a simple distinction between quantitative and categorical variables. A more precise way of classifying variables is by their scale of measurement. The four scales of measurement, from least to most precise, are:

  • Nominal

  • Ordinal

  • Interval

  • Ratio

In the following sections, each of these scales is described, along with a comparison of their properties.

Nominal Scale of Measurement

The nominal scale of measurement is a qualitative measure that uses discrete categories to describe a characteristic of the research participants. For each participant, the researcher determines the presence, absence, or type of the attribute. A nominal scale may have two categories, such as citizenship status (citizen/non-citizen), or more than two categories, such as religious affiliation (e.g., Agnostic, Buddhist, Jewish, Muslim) or marital status (e.g., divorced, married, single). The categories usually have names, as in these examples, but researchers often code them with numbers for use in statistical analyses. The categories are not ordered or ranked in any way, so the numeric codes are arbitrary labels.
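Because nominal categories have no order, the numbers used to code them are interchangeable labels. The sketch below illustrates this with invented responses for marital status; the two coding schemes and all variable names are assumptions made for the example. Counting how often each category occurs gives the same answer whichever codes are chosen.

```python
from collections import Counter

# Hypothetical responses, invented for illustration only.
marital_status = ["married", "single", "divorced", "married", "single", "married"]

# Two arbitrary coding schemes for the same nominal categories.
codes_a = {"divorced": 1, "married": 2, "single": 3}
codes_b = {"divorced": 3, "married": 1, "single": 2}

coded_a = [codes_a[status] for status in marital_status]
coded_b = [codes_b[status] for status in marital_status]

# Frequencies are meaningful for nominal data, and they do not depend on the coding.
counts = Counter(marital_status)
married_a = Counter(coded_a)[codes_a["married"]]
married_b = Counter(coded_b)[codes_b["married"]]

# The most frequent category (the mode) is also a meaningful summary.
most_common_status, frequency = counts.most_common(1)[0]

print(most_common_status, frequency)
print(married_a == married_b)
```

By contrast, a sum or an average of `coded_a` would change if `codes_b` were used instead, which is one way to see that such arithmetic carries no information about nominal data.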

