1.2: Data Types and Levels of Measurement
Variables
Recall the Big Picture, the four-step process that encompasses statistics (as it is presented in this course):
1. Producing Data—Choosing a sample from the population of interest and collecting data.
2. Exploratory Data Analysis (EDA)—Summarizing the data we’ve collected.
3. and 4. Probability and Inference—Drawing conclusions about the entire population based on the data collected from the sample.
Even though in practice it is the second step in the process, we look at exploratory data analysis (EDA) first.
Before we jump into exploratory data analysis and really appreciate its importance in the process of statistical analysis, let’s step back for a minute and ask:
What do we really mean by data?
Data are pieces of information about individuals organized into variables. By an individual, we mean a particular person or object. By a variable, we mean a particular characteristic of the individual.
A dataset is a set of data identified with particular circumstances. Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables.
Example
The following dataset shows medical records from a particular survey:
In this example, the individuals are patients, and the variables are Gender, Age, Weight, Height, Smoking, and Race. Each row, then, gives us all the information about a particular individual (in this case, patient), and each column gives us information about a particular characteristic of all the patients.
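The row-and-column layout described above can be sketched in code. This is a minimal illustration, assuming each individual is stored as a Python dictionary keyed by variable name. The helper `column` and all of the values are hypothetical and invented for this sketch; they are not the survey's actual records.

```python
# Hypothetical records: the values below are invented for illustration
# and are NOT the survey's actual data.
patients = [
    {"Gender": "Female", "Age": 34, "Weight": 140, "Height": 65, "Smoking": 1, "Race": "Asian"},
    {"Gender": "Male", "Age": 52, "Weight": 190, "Height": 71, "Smoking": 2, "Race": "White"},
    {"Gender": "Female", "Age": 27, "Weight": 128, "Height": 62, "Smoking": 1, "Race": "Black"},
]


def column(dataset, variable):
    """Return one variable's values across all individuals (one table column)."""
    return [record[variable] for record in dataset]


variables = list(patients[0])    # the variable names (column headers)
first_patient = patients[0]      # one row: everything known about one individual
ages = column(patients, "Age")   # one column: one characteristic of all individuals
```

Reading across a dictionary gives all the information about one patient, while `column` collects a single characteristic for every patient.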
Variables can be classified into one of two types: categorical or quantitative.
- Categorical variables take category or label values and place an individual into one of several groups. Each observation can be placed in only one category, and the categories are mutually exclusive.
  In our example of medical records, Smoking is a categorical variable with two groups, since each participant can be categorized only as either a nonsmoker or a smoker. Gender and Race are the two other categorical variables in our medical records example. Notice that the values of the categorical variable Smoking have been coded as the numbers 1 or 2. It is common to code the values of a categorical variable as numbers, but you should remember that these are just codes. They have no arithmetic meaning (i.e., it does not make sense to add, subtract, multiply, divide, or compare the magnitude of such values).
- Quantitative variables take numerical values and represent some kind of measurement.
  In our medical example, Age is a quantitative variable because its values are numerical measurements for which arithmetic and comparison make sense. For example, a person can be 18 years old or 80 years old, and the second person is 62 years older than the first. Weight and Height are also examples of quantitative variables.
Categorical variables are sometimes called qualitative variables, but in this course we use the term categorical.
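The difference between a numeric code and a numeric measurement can be seen in a short sketch. The values below are hypothetical, and the choice of counting for categorical data and averaging for quantitative data is one reasonable illustration rather than the only possible summary.

```python
from collections import Counter
from statistics import mean

# Hypothetical values, invented for illustration.
smoking_codes = [1, 2, 2, 1, 2]   # categorical: 1 and 2 are only group labels
ages = [18, 80, 35, 52, 41]       # quantitative: measurements in years

# For a categorical variable, counting how many individuals fall in each group is meaningful.
smoking_counts = Counter(smoking_codes)

# For a quantitative variable, arithmetic summaries such as the mean are meaningful.
mean_age = mean(ages)

# Python will happily average the codes too, but the result describes no group at all.
meaningless_average = mean(smoking_codes)
```

Here `smoking_counts` reports two individuals with code 1 and three with code 2, and `mean_age` is 45.2 years. The value of `meaningless_average` is 1.6, which corresponds to neither smoking group, showing why codes should not be treated as quantities.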
Exercise
We took a random sample from the 2000 U.S. Census. Here is part of the dataset:
Activity
Let’s Explore a Dataset
In this activity, we will:
- Learn how to open and examine a dataset.
- Practice classifying variables by their type: quantitative or categorical.
- Learn how to handle categorical variables whose values are numerically coded.
Background to Dataset
Clinical depression is the most common mental illness in the United States, affecting 19 million adults each year (Source: NIMH, 1999). Nearly 50% of individuals who experience a major episode will have a recurrence within 2 to 3 years. Researchers are interested in comparing therapeutic solutions that could delay or reduce the incidence of recurrence.
In a study conducted by the National Institutes of Health, 109 clinically depressed patients were separated into three groups, and each group was given one of two active drugs (imipramine or lithium) or no drug at all. For each patient, the dataset contains the treatment used, the outcome of the treatment, and several other interesting characteristics.
Here is a summary of the variables in our dataset:
- Hospt: The patient’s hospital, represented by a code for each of the 5 hospitals (1, 2, 3, 5, or 6)
- Treat: The treatment received by the patient (Lithium, Imipramine, or Placebo)
- Outcome: Whether or not a recurrence occurred during the patient’s treatment (Recurrence or No Recurrence)
- Time: Either the time in days till the first recurrence, or if a recurrence did not occur, the length in days of the patient’s participation in the study
- AcuteT: The time in days that the patient was depressed prior to the study
- Age: The age of the patient in years, when the patient entered the study
- Gender: The patient’s gender (1 = Female, 2 = Male)
Labels for categorical variables are often easier to work with when they match the meanings of the categories as closely as possible. We will now recode the variable Gender with the labels “Female” and “Male.” To do that in Excel:
- Click on the column header above the Gender column. In this case, Gender is in column G, so click on the column header labeled G. This will select the entire column of gender values.
- In the Home tab under the Editing group, choose Find & Select → Replace to bring up the Find and Replace window.
- In the first textbox, labeled Find what:, enter “1”.
- In the second textbox, labeled Replace with:, enter “Female”.
- Now click the button labeled Replace All. This will replace all of the “1” values in the selected column with the word “Female”. Press OK to close the dialog box that appears.
- Now do the same thing for males:
  - In the first textbox, labeled Find what:, enter “2”.
  - In the second textbox, labeled Replace with:, enter “Male”.
  - Click the Replace All button and press OK to close the dialog box. Then click the Close button.
Notice that the column Gender now contains the meaningful labels “Female” and “Male” where before it contained “1” and “2” codes.
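The same recoding can also be done programmatically. The following is a minimal sketch, not part of the Excel activity: the mapping `GENDER_LABELS` follows the coding given in the variable summary (1 = Female, 2 = Male), while the helper `recode` and the sample codes are hypothetical.

```python
# Coding taken from the variable summary: 1 = Female, 2 = Male.
GENDER_LABELS = {1: "Female", 2: "Male"}


def recode(values, labels):
    """Replace each numeric code with its label.

    Raises KeyError if a value has no label, so unexpected codes are not
    silently passed through.
    """
    return [labels[value] for value in values]


# Hypothetical column of coded values, invented for illustration.
gender_codes = [1, 2, 2, 1]
gender_labels = recode(gender_codes, GENDER_LABELS)
```

After this runs, `gender_labels` holds "Female", "Male", "Male", "Female". Because the lookup matches whole values rather than individual characters, a code is replaced only when the entire value equals 1 or 2.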
Scales of Measurement
In the previous section, a simple distinction was made between quantitative and categorical variables. However, variables can be classified more precisely by their scale of measurement. The four scales of measurement, from least to most precise, are:
- Nominal
- Ordinal
- Interval
- Ratio
The following sections describe each of these scales and compare their properties.
Nominal Scale of Measurement
The nominal scale of measurement is a qualitative measure that uses discrete categories to describe a characteristic of the research participants. For each participant, the researcher determines the presence, absence, or type of the attribute. Nominal scales of measurement may have two categories, such as citizenship status (citizen/non-citizen), or they can have more than two categories, like religious affiliation (e.g., Agnostic, Buddhist, Jewish, Muslim) or marital status (e.g., divorced, married, single). The categories usually have names, as in these examples, but researchers often code them with numbers for use in statistical analyses. These categories are not ordered or ranked in any way, so the numeric codes serve only as labels.
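Because nominal categories are unordered, the particular numbers chosen as codes are arbitrary. The sketch below illustrates this under stated assumptions: the responses, the two codings (`coding_a`, `coding_b`), and the helper `frequency_by_label` are all hypothetical and invented for this example.

```python
from collections import Counter

# Hypothetical marital-status responses, invented for illustration.
statuses = ["married", "single", "married", "divorced", "single", "married"]

# Two equally valid codings: the numbers are labels, so any assignment works.
coding_a = {"divorced": 1, "married": 2, "single": 3}
coding_b = {"divorced": 3, "married": 1, "single": 2}


def frequency_by_label(labels, coding):
    """Encode labels as numbers, count the codes, then report counts by label."""
    codes = [coding[label] for label in labels]
    decode = {code: label for label, code in coding.items()}
    return {decode[code]: count for code, count in Counter(codes).items()}


freq_a = frequency_by_label(statuses, coding_a)
freq_b = frequency_by_label(statuses, coding_b)

# The most frequent category (the mode) is a meaningful summary for nominal data.
most_common_status = max(freq_a, key=freq_a.get)
```

Both codings produce the same frequency table, and `most_common_status` is "married" either way. A summary that changed when the codes were reassigned, such as an average of the codes, would not be meaningful on a nominal scale.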