{"id":489,"date":"2024-10-18T02:02:04","date_gmt":"2024-10-18T02:02:04","guid":{"rendered":"https:\/\/pressbooks.ccconline.org\/mat1260\/?post_type=chapter&#038;p=489"},"modified":"2024-12-11T21:22:49","modified_gmt":"2024-12-11T21:22:49","slug":"3-4-linear-regression-equation","status":"publish","type":"chapter","link":"https:\/\/pressbooks.ccconline.org\/mat1260\/chapter\/3-4-linear-regression-equation\/","title":{"raw":"3.4: Linear Regression Equation","rendered":"3.4: Linear Regression Equation"},"content":{"raw":"<h2><span title=\"Quick scroll up\">Linear Regression: Summarizing the Pattern of the Data with a Line<\/span><\/h2>\r\nSo far we\u2019ve used the scatterplot to describe the relationship between two quantitative variables, and in the special case of a linear relationship, we have supplemented the scatterplot with the correlation (r). The correlation, however, doesn\u2019t fully characterize the linear relationship between two quantitative variables\u2014it only measures the strength and direction. We often want to describe more precisely how one variable changes with the other (by \u201cmore precisely,\u201d we mean more than just the direction), or\u00a0<em>predict\u00a0<\/em>the value of the response variable for a given value of the explanatory variable. In order to be able to do that, we need to summarize the linear relationship with a line that best fits the linear pattern of the data. In the remainder of this section, we will introduce a way to find such a line, learn how to interpret it, and use it (cautiously) to make predictions.\r\n\r\nAgain, let\u2019s start with a motivating example:\r\n\r\nEarlier, we examined the linear relationship between the age of a driver and the maximum distance at which a highway sign was legible, using both a scatterplot and the correlation coefficient. 
Suppose a government agency wanted to predict the maximum distance at which the sign would be legible for 60-year-old drivers, and thus make sure that the sign could be used safely and effectively.\r\n<p id=\"N10B12\">How would we make this prediction?<\/p>\r\nhttps:\/\/youtube.com\/watch?v=8hf3dMf59cI\r\n<div class=\"figurewrap\">\r\n<div class=\"figure clearfix\">\r\n<div class=\"youtube\"><span style=\"orphans: 1; text-align: initial; font-size: 1em;\">How and why did we pick this particular line (the one shown in red in the above walkthrough) to describe the dependence of the maximum distance at which a sign is legible upon the age of a driver? What line exactly did we choose? We will return to this example once we can answer that question with a bit more precision.<\/span><\/div>\r\n<\/div>\r\n<\/div>\r\nThe technique that specifies the dependence of the response variable on the explanatory variable is called\u00a0<em>regression<\/em>. When that dependence is linear (which is the case in our examples in this section), the technique is called\u00a0<em>linear regression<\/em>. Linear regression is therefore the technique of finding the line that best fits the pattern of the linear relationship (or in other words, the line that best describes how the response variable linearly depends on the explanatory variable).\r\n<p id=\"N10B09\">To understand how such a line is chosen, consider the following very simplified version of the age-distance example (we left just 6 of the drivers on the scatterplot):<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"The scatterplot of Sign Legibility vs. Driver Age with only 6 data points. 
The data points chosen to be shown roughly make a parallelogram, whose top and bottom sides represent negative relationships.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear9.gif\" alt=\"The scatterplot of Sign Legibility vs. Driver Age with only 6 data points. The data points chosen to be shown roughly make a parallelogram, whose top and bottom sides represent negative relationships.\" \/><\/span><\/span><\/p>\r\nThere are many lines that look like they would be good candidates to be the line that best fits the data:<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"The same scatterplot with the 6 data points. Five different lines have been drawn from the upper left region of the plot to the lower right. They all intersect the parallelogram created by the 6 data points in a way such that each line is above 3 points and below 3 points. These lines are potential candidates. There are many other lines which could be used to fit the data.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear10.gif\" alt=\"The same scatterplot with the 6 data points. Five different lines have been drawn from the upper left region of the plot to the lower right. They all intersect the parallelogram created by the 6 data points in a way such that each line is above 3 points and below 3 points. These lines are potential candidates. There are many other lines which could be used to fit the data.\" \/><\/span><\/span>\r\n\r\nIt is doubtful that everyone would select the same line in the plot above. We need to agree on what we mean by \u201cbest fits the data\u201d; in other words, we need to agree on a criterion by which we would select this line. We want the line we choose to be close to the data points. 
In other words, whatever criterion we choose, it should take into account the vertical deviations of the data points from the line, which are marked with blue arrows in the plot below:<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"The same scatterplot with 6 points. A potential line has been drawn, and a vertical line from each data point to the line has also been drawn. The lengths of these vertical lines have to be taken into account when choosing a best fit line.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear11.gif\" alt=\"The same scatterplot with 6 points. A potential line has been drawn, and a vertical line from each data point to the line has also been drawn. The lengths of these vertical lines have to be taken into account when choosing a best fit line.\" \/><\/span><\/span>\r\n<p id=\"N10B24\">The most commonly used criterion is called the\u00a0<em>least squares<\/em>\u00a0criterion. This criterion says: Among all possible lines, choose the one that has the smallest sum of squared vertical deviations. Visually, each squared deviation is represented by the area of one of the squares in the plot below. Therefore, we are looking for the line that will have the smallest total yellow area.<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"The same scatterplot with 6 data points. A line has been chosen, and for each of the 6 data points, a vertical line is drawn from the data point to the line. A square is then drawn, one side using this line, so that all 4 sides are the same length as the vertical line. For all 6 data points we have 6 different vertical lines and thus 6 different squares. 
The least squares criterion looks to reduce the total area of these squares.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear12.gif\" alt=\"The same scatterplot with 6 data points. A line has been chosen, and for each of the 6 data points, a vertical line is drawn from the data point to the line. A square is then drawn, one side using this line, so that all 4 sides are the same length as the vertical line. For all 6 data points we have 6 different vertical lines and thus 6 different squares. The least squares criterion looks to reduce the total area of these squares.\" \/><\/span><\/span>This line is called the\u00a0<em>least-squares regression line<\/em>, and, as we\u2019ll see, it fits the linear pattern of the data very well.<\/p>\r\n<p id=\"N10B37\">For the remainder of this lesson, you\u2019ll need to feel comfortable with the algebra of a straight line. In particular you\u2019ll need to be familiar with the\u00a0<em>slope\u00a0<\/em>and the\u00a0<em>intercept\u00a0<\/em>in the equation of a line, and their interpretation.<\/p>\r\n<p id=\"c028ff36fa7942d88c382dfc2721704a\">Like any other line, the equation of the least-squares regression line for summarizing the linear relationship between the response variable (Y) and the explanatory variable (X) has the form:\u00a0[latex]Y=a+bX[\/latex]<\/p>\r\n<p id=\"d1b3ef3d9f8b48c2a09e1aa1fe88da46\">All we need to do is calculate the intercept\u00a0<em class=\"italic\">a<\/em>, and the slope\u00a0<em class=\"italic\">b<\/em>, which is easily done if we know:<\/p>\r\n\r\n<ul id=\"b579dde0330a492f9986670011fb0635\">\r\n \t<li>\r\n<p id=\"a3dbf83574db455daca9779659353ef3\">[latex]\\bar{X}[\/latex]\u2014the mean of the explanatory variable\u2019s values<\/p>\r\n<\/li>\r\n \t<li>\r\n<p id=\"bad6f2b749c6474bb0d90c7764eeb7ec\">S<sub>X<\/sub>\u2014the standard deviation of the explanatory variable\u2019s values<\/p>\r\n<\/li>\r\n \t<li>\r\n<p id=\"c5a51a2a23c74f3396256757ed746d50\">[latex]\\bar{Y}[\/latex]\u2014the mean of the response variable\u2019s values<\/p>\r\n<\/li>\r\n \t<li>\r\n<p id=\"d6cb5347802e46a18a6793df6aec145b\">S<sub>Y<\/sub>\u2014the standard deviation of the response variable\u2019s values<\/p>\r\n<\/li>\r\n \t<li>\r\n<p id=\"d0b502bdd9034080a0702bca8de42d8a\">r\u2014the correlation coefficient<\/p>\r\n<\/li>\r\n<\/ul>\r\n<p id=\"e589cb0d6eca43c198fef8969f3d8788\">Given the five quantities above, the slope and intercept of the least squares regression line are found using the following formulas:<\/p>\r\n[latex]b=r\\left(\\frac{S_Y}{S_X}\\right)\\\\\r\na=\\bar{Y}-b\\bar{X}[\/latex]\r\n<div id=\"e1c575ccf0da4755b49f40375f057c53\" class=\"section\">\r\n<div class=\"sectionContain\">\r\n<h2><span title=\"Quick scroll up\">Comments<\/span><\/h2>\r\n<ol id=\"e2855b52651047879bfdbef498b6a412\">\r\n \t<li>\r\n<p id=\"f597bcc25d944bbfa514faa5dea200e3\">Note that since the formula for the intercept\u00a0<em class=\"italic\">a<\/em>\u00a0depends on the value of the slope,\u00a0<em class=\"italic\">b<\/em>, you need to find\u00a0<em class=\"italic\">b<\/em>\u00a0first.<\/p>\r\n<\/li>\r\n \t<li>\r\n<p id=\"ae5c771ea7e194ab1aee39ddcf0bc1369\">The slope of the least squares regression line can be interpreted as the average change in the response variable when the explanatory variable increases by 1 unit.<\/p>\r\n<\/li>\r\n<\/ol>\r\n<\/div>\r\n<\/div>\r\n<div id=\"b381b4db93c24ee4a9b8da6739aa0216\" class=\"examplewrap\">\r\n<div class=\"example clearfix\">\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<h3 class=\"textbox__title\">Example<\/h3>\r\n<\/header>\r\n<div 
class=\"textbox__content\">\r\n<h4>Age-Distance<\/h4>\r\n<div>\r\n<p id=\"d93a3e775803447f9667eb3f1a133ac3\">Let\u2019s revisit our age-distance example, and find the\u00a0<em class=\"italic\">least-squares regression line<\/em>. The following output will be helpful in getting the 5 values we need:<\/p>\r\n\r\n<div class=\"Excel2019PC altContentOn\">\r\n<div class=\"alternative\">\r\n\r\n<span class=\"imagewrap\"><span class=\"image\"><img id=\"b114ce1718d44144a8f3ad4fcf7e1da8\" class=\"img-responsive popimg aligncenter\" title=\"Output from Excel 2007\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear13excel.gif\" alt=\"Output from Excel 2007\" \/><\/span><\/span>\r\n<ul id=\"b0e4d665c4b84c0095be89272b378e41\">\r\n \t<li>\r\n<p id=\"fb10138e9bcf41799d61123859774cff\">The\u00a0<em class=\"bold\">slope\u00a0<\/em>of the line is [latex]b=\\left(-0.793\\right)\\left(\\frac{82.8}{21.78}\\right)\\approx-3[\/latex]. This means that for every 1-unit increase of the explanatory variable, there is, on average, about a 3-unit decrease in the response variable. The interpretation\u00a0<em class=\"italic\">in context<\/em>\u00a0of the slope being -3 is, therefore: For every year a driver gets older, the maximum distance at which the driver can read a sign decreases,\u00a0<em class=\"italic\">on average<\/em>, by about 3 feet.<\/p>\r\n<\/li>\r\n \t<li>\r\n<p id=\"de1f9103357f4371a9de5a168d315d30\">The\u00a0<em class=\"bold\">intercept<\/em>\u00a0of the line is\u00a0[latex]a=423-\\left(-3\\times 51\\right)=576[\/latex]\u00a0and therefore the\u00a0<em class=\"bold\">least-squares regression line<\/em>\u00a0for this example is<\/p>\r\n\r\n<table 
class=\"formula\">\r\n<tbody>\r\n<tr>\r\n<td>Distance = 576 + (\u22123 * Age)<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<\/li>\r\n<\/ul>\r\n<\/div>\r\n<\/div>\r\n<p id=\"f49e6141036f426491adb4e7b7447b11\">Here is the regression line plotted on the scatterplot:<span class=\"imagewrap\"><span class=\"image\"><img id=\"af03a2aff0f14f409363fe047be4e064\" class=\"img-responsive popimg aligncenter\" title=\"The scatterplot for Driver Age and Sign Legibility Distance. The least squares regression line has been drawn. It is a negative relationship line.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear14.gif\" alt=\"The scatterplot for Driver Age and Sign Legibility Distance. The least squares regression line has been drawn. It is a negative relationship line.\" \/><\/span><\/span>As we can see, the regression line fits the linear pattern of the data quite well.<\/p>\r\n\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div id=\"ec08df282d9f4598bac4a27d745d9acb\" class=\"section\">\r\n<div class=\"sectionContain\">\r\n<h2><span title=\"Quick scroll up\">Comment<\/span><\/h2>\r\n<p id=\"beab96cc803549ef97c79c1787068ce2\">As we mentioned before, hand-calculation is not the focus of this course. We wanted you to see one example in which the least squares regression line is calculated by hand, but in general we\u2019ll let a statistics package do that for us.<\/p>\r\n\r\n\r\n<hr \/>\r\n<p id=\"N10B28\">Let\u2019s go back now to our motivating example, in which we wanted to predict the maximum distance at which a sign is legible for a 60-year-old. Now that we have found the least squares regression line, this prediction becomes quite easy:<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"The scatterplot for Driver Age and Sign Legibility Distance. 
Now that we have a regression line, finding out the maximum distance at which a sign is legible for a 60-year-old person is easy. We simply check at what y coordinate does the regression line cross a vertical line at x = 60. This happens to be at y = 396.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear15.gif\" alt=\"The scatterplot for Driver Age and Sign Legibility Distance. Now that we have a regression line, finding out the maximum distance at which a sign is legible for a 60-year-old person is easy. We simply check at what y coordinate does the regression line cross a vertical line at x = 60. This happens to be at y = 396.\" \/><\/span><\/span><\/p>\r\nPractically, what the figure tells us is that in order to find the predicted legibility distance for a 60-year-old, we plug Age = 60 into the regression line equation, to find that:\r\n<table class=\"formula\">\r\n<tbody>\r\n<tr>\r\n<td>Predicted distance = 576 + (- 3 * 60) = 396<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n396 feet is our best prediction for the maximum distance at which a sign is legible for a 60-year-old.\r\n<div class=\"textbox textbox--exercises\"><header class=\"textbox__header\">\r\n<h3 class=\"textbox__title\">Did I get this?<\/h3>\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n<p id=\"N10B43\"><em>Background:\u00a0<\/em>A statistics department is interested in tracking the progress of its students from entry until graduation. As part of the study, the department tabulates the performance of 10 students in an introductory course and in an upper-level course required for graduation. 
The scatterplot below includes the least squares line (the line that best explains the upper-level course average based on the lower-level course average), and its equation:<\/p>\r\n\r\n<div class=\"image shouldbeleft\"><img id=\"_i_1\" class=\"img-responsive popimg aligncenter\" title=\"The scatterplot for Introductory Course Average vs. Upper Level Course Average. In addition to the data plotted on the scatterplot, we have a least squares regression line. The line's equation is Y = -1.4 + X.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear21.gif\" alt=\"The scatterplot for Introductory Course Average vs. Upper Level Course Average. In addition to the data plotted on the scatterplot, we have a least squares regression line. The line's equation is Y = -1.4 + X.\" \/><\/div>\r\n<div>[h5p id=\"49\"]<\/div>\r\n<\/div>\r\n<\/div>\r\n<div id=\"N10BB7\" class=\"section\">\r\n<div class=\"sectionContain\">\r\n<h2><span title=\"Quick scroll up\">Comment About Predictions<\/span><\/h2>\r\n<p id=\"N10BBE\">Suppose a government agency wanted to design a sign appropriate for an even wider range of drivers than were present in the original study. They want to predict the maximum distance at which the sign would be legible for a 90-year-old. Using the least squares regression line again as our summary of the linear dependence of the distances upon the drivers' ages, the agency predicts that 90-year-old drivers can see the sign at no more than 576 + (- 3 * 90) = 306 feet:<span class=\"imagewrap\"><span class=\"image\"><img id=\"_i_2\" class=\"img-responsive popimg aligncenter\" style=\"box-sizing: border-box; border: none; vertical-align: middle; margin: auto; padding: 0px; outline: 0px; cursor: pointer; display: block; max-width: 100%; height: auto;\" title=\" The scatterplot for Driver Age vs. Sign Legibility Distance. 
The scales of both axes have been enlarged so that the regression line has room on the right to be extended past where data exists. The regression line is negative, so it grows from the upper left to the lower right of the plot. Where the regression line is creating an estimate in between existing data, it is red. Beyond that, where there are no data points, the line is green. This area is x&gt;82. The equation of the regression line is Distance = 576 - 3 * Age\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear16.gif\" alt=\" The scatterplot for Driver Age vs. Sign Legibility Distance. The scales of both axes have been enlarged so that the regression line has room on the right to be extended past where data exists. The regression line is negative, so it grows from the upper left to the lower right of the plot. Where the regression line is creating an estimate in between existing data, it is red. Beyond that, where there are no data points, the line is green. This area is x&gt;82. 
The equation of the regression line is Distance = 576 - 3 * Age\" \/><\/span><\/span><\/p>\r\n<p id=\"N10BC7\">(The green segment of the line is the region of ages beyond 82, the age of the oldest individual in the study.)<\/p>\r\n\r\n<div class=\"inquiry\">\r\n<div><em>Question:\u00a0<\/em>Is our prediction for 90-year-old drivers reliable?<\/div>\r\n<div><\/div>\r\n<div class=\"answer\"><em>Answer:\u00a0<\/em>Our original age data ranged from 18 (youngest driver) to 82 (oldest driver), and our regression line is therefore a summary of the linear relationship\u00a0<em>in that age range only.\u00a0<\/em>When we plug the value 90 into the regression line equation, we are assuming that the same linear relationship extends beyond the range of our age data (18-82) into the green segment.\u00a0<em>There is no justification for such an assumption.<\/em>\u00a0It might be the case that the vision of drivers older than 82 falls off more rapidly than it does for younger drivers (i.e., the slope changes from -3 to something more negative). Our prediction for age = 90 is therefore\u00a0<em>not reliable.<\/em><\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div id=\"N10BE0\" class=\"section purposewrap\">\r\n<div class=\"sectionContain\">\r\n<h2><span title=\"Quick scroll up\">In General<\/span><\/h2>\r\n<p id=\"N10BE7\">Prediction for ranges of the explanatory variable that are not in the data is called\u00a0<em>extrapolation<\/em>. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided. In our example, as in most others, extrapolation can lead to very poor or illogical predictions.<\/p>\r\n\r\n<h2><span title=\"Quick scroll up\">Let\u2019s Summarize<\/span><\/h2>\r\n<ul>\r\n \t<li>A special case of the relationship between two quantitative variables is the\u00a0<em>linear\u00a0<\/em>relationship. 
In this case, a straight line simply and adequately summarizes the relationship.<\/li>\r\n \t<li>When the scatterplot displays a linear relationship, we supplement it with the\u00a0<em>correlation coefficient (r)<\/em>, which measures the\u00a0<em>strength<\/em>\u00a0and direction of a linear relationship between two quantitative variables. The correlation ranges between -1 and 1. Values near -1 indicate a strong negative linear relationship, values near 0 indicate a weak linear relationship, and values near 1 indicate a strong positive linear relationship.<\/li>\r\n \t<li>The correlation is only an appropriate numerical measure for linear relationships, and is sensitive to outliers. Therefore, the correlation should only be used as a supplement to a scatterplot (after we look at the data).<\/li>\r\n \t<li>The most commonly used criterion for finding a line that summarizes the pattern of a linear relationship is \u201cleast squares.\u201d The\u00a0<em>least squares regression line\u00a0<\/em>has the smallest sum of squared vertical deviations of the data points from the line.<\/li>\r\n \t<li>The slope of the least squares regression line can be interpreted as the average change in the response variable when the explanatory variable increases by 1 unit.<\/li>\r\n \t<li>The least squares regression line predicts the value of the response variable for a given value of the explanatory variable.\u00a0<em>Extrapolation<\/em>\u00a0is prediction of the response variable for values of the explanatory variable that fall outside the range of the data. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided.<\/li>\r\n<\/ul>\r\n<h2>Causation<\/h2>\r\nSo far we have discussed different ways in which data can be used to explore the relationship (or association) between two variables. 
To frame our discussion we followed the role-type classification table:<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"It is possible for any type of explanatory variable to be paired with any type of response variable. The possible pairings are: Categorical Explanatory \u2192 Categorical Response (C\u2192C), Categorical Explanatory \u2192 Quantitative Response (C\u2192Q), Quantitative Explanatory \u2192 Categorical Response (Q\u2192C), and Quantitative Explanatory \u2192 Quantitative Response (Q\u2192Q).\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation1.gif\" alt=\"It is possible for any type of explanatory variable to be paired with any type of response variable. The possible pairings are: Categorical Explanatory \u2192 Categorical Response (C\u2192C), Categorical Explanatory \u2192 Quantitative Response (C\u2192Q), Quantitative Explanatory \u2192 Categorical Response (Q\u2192C), and Quantitative Explanatory \u2192 Quantitative Response (Q\u2192Q).\" width=\"700\" height=\"500\" \/><\/span><\/span>\r\n\r\nand we have now completed learning how to explore the relationship in cases C\u2192Q, C\u2192C, and Q\u2192Q. (As noted before, case Q\u2192C will not be discussed in this course.) When we explore the relationship between two variables, there is often a temptation to conclude from the observed relationship that changes in the explanatory variable\u00a0<em>cause<\/em>\u00a0changes in the response variable. In other words, you might be tempted to interpret the observed association as causation. 
The purpose of this part of the course is to convince you that this kind of interpretation is often\u00a0<em class=\"italic\">wrong!<\/em>\u00a0The motto of this section is one of the most fundamental principles of this course:\r\n<table id=\"N10B0A_bx\" class=\"theorem labeled\">\r\n<thead>\r\n<tr>\r\n<th>Principle<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<td>\r\n<div class=\"theorem\">\r\n<div class=\"statement\">\r\n\r\nAssociation\u00a0<em>does not<\/em>\u00a0imply causation!\r\n\r\n<\/div>\r\n<\/div><\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<div>Let\u2019s start by looking at the following example:<\/div>\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<h3 class=\"textbox__title\">Example<\/h3>\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n<div class=\"examplewrap\">\r\n<h4 class=\"exHead\">Fire Damage<\/h4>\r\n<div class=\"example clearfix\">\r\n<div>\r\n<p id=\"N10AF9\">The scatterplot below illustrates how the number of firefighters sent to fires (X) is related to the amount of damage caused by fires (Y) in a certain city.<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"A scatterplot in which the horizontal axis is labeled &quot;# Of Firefighters&quot;, and the vertical axis is labeled &quot;Damage ($)&quot;. The vertical axis ranges from $0 to $2500000 and the horizontal axis ranges from 0 to 40.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation2.gif\" alt=\"A scatterplot in which the horizontal axis is labeled &quot;# Of Firefighters&quot;, and the vertical axis is labeled &quot;Damage ($)&quot;. 
The vertical axis ranges from $0 to $2500000 and the horizontal axis ranges from 0 to 40.\" \/><\/span><\/span><\/p>\r\n<p id=\"N10B02\">The scatterplot clearly displays a fairly strong (slightly curved)\u00a0<em>positive<\/em>\u00a0relationship between the two variables. Would it, then, be reasonable to conclude that sending more firefighters to a fire causes more damage, or that the city should send fewer firefighters to a fire, in order to decrease the amount of damage done by the fire? Of course not! So what is going on here?<\/p>\r\nThere is a\u00a0<em>third variable in the background<\/em>\u2014the seriousness of the fire\u2014that is responsible for the observed relationship. More serious fires require more firefighters, and also cause more damage.\r\n\r\nThe following figure will help you visualize this situation:<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"A flowchart. The &quot; Seriousness of the fire&quot; is a &quot;lurking variable.&quot; This is a cause of both &quot;Number of firefighters (X)&quot; and &quot;amount of damage (Y)&quot; We have falsely observed a &quot;observed association&quot; between &quot;Number of firefighters (X) &quot; and &quot;Amount of damage (Y)&quot;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation3.gif\" alt=\"A flowchart. 
The &quot; Seriousness of the fire&quot; is a &quot;lurking variable.&quot; This is a cause of both &quot;Number of firefighters (X)&quot; and &quot;amount of damage (Y)&quot; We have falsely observed a &quot;observed association&quot; between &quot;Number of firefighters (X) &quot; and &quot;Amount of damage (Y)&quot;\" \/><\/span><\/span>\r\n<p id=\"N10B17\">Here, the seriousness of the fire is a\u00a0<em>lurking variable.\u00a0<\/em>A\u00a0<em>lurking variable<\/em>\u00a0is a variable that is not among the explanatory or response variables in a study, but could substantially affect your interpretation of the relationship among those variables.<\/p>\r\n\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<p id=\"N10B21\">In particular, as in our example, the lurking variable might have an effect on\u00a0<em class=\"italic\">both<\/em>\u00a0the explanatory and the response variables. This common effect creates the observed association between the explanatory and response variables, even though there is no causal link between them. 
This possibility, that there might be a lurking variable (which we might not be thinking about) that is responsible for the observed relationship, leads to our principle:<\/p>\r\n\r\n<table id=\"N10B28_bx\" class=\"theorem labeled\">\r\n<thead>\r\n<tr>\r\n<th>Principle<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<td>\r\n<div class=\"theorem\">\r\n<div class=\"statement\">\r\n\r\nAssociation\u00a0<em>does not<\/em>\u00a0imply causation!\r\n\r\n<\/div>\r\n<\/div><\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nThe next example will illustrate another way in which a lurking variable might interfere and prevent us from reaching any causal conclusions.\r\n<div class=\"examplewrap\">\r\n<div class=\"example clearfix\">\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<h3 class=\"textbox__title\">Example<\/h3>\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n<h4>SAT Test<\/h4>\r\n<div>\r\n<p id=\"N10B11\">For U.S. colleges and universities, a standard entrance examination is the SAT. The side-by-side boxplots below provide evidence of a relationship between the student\u2019s country of origin (the United States or another country) and the student\u2019s SAT Math score.<\/p>\r\n<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"A side-by-side boxplot. The vertical axis is labeled &quot;SAT Math Score&quot;, and it ranges from 450 to 800. The horizontal axis is labeled &quot;Country&quot; and has two categories, &quot;Other&quot; and &quot;US&quot;.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation4.gif\" alt=\"A side-by-side boxplot. The vertical axis is labeled &quot;SAT Math Score&quot;, and it ranges from 450 to 800. 
The horizontal axis is labeled &quot;Country&quot; and has two categories, &quot;Other&quot; and &quot;US&quot;.\" \/><\/span><\/span>\r\n<p id=\"N10B1A\">The distribution of international students\u2019 scores is higher than that of U.S. students. The international students\u2019 median score (about 700) exceeds the third quartile of U.S. students\u2019 scores. Can we conclude that the country of origin is the\u00a0<em>cause<\/em>\u00a0of the difference in SAT Math scores, and that students in the United States are weaker at math than students in other countries?<\/p>\r\nNo, not necessarily. While it\u00a0<em class=\"italic\">might<\/em>\u00a0be true that U.S. students differ in math ability from other students\u2014e.g., due to differences in educational systems\u2014we can\u2019t conclude that a student\u2019s country of origin is the cause of the disparity. One important\u00a0<em>lurking variable<\/em>\u00a0that might explain the observed relationship is the educational level of the two populations taking the SAT Math test. In the United States, the SAT is a standard test, and therefore a broad cross-section of all U.S. students (in terms of educational level) takes this test. Among all international students, on the other hand, only those who plan on coming to the U.S. to study, which is usually a more selected subgroup, take the test.\r\n\r\nThe following figure will help you visualize this explanation:\r\n\r\n<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"A flowchart. We have two causes, one of which is &quot;Education level of SAT Takers&quot;. This is a &quot;Lurking variable &quot; The other cause is &quot;Nationality (X)&quot;. Both of these might be causes of &quot; SAT-Math score (Y)&quot;. We have observed an association between &quot;Nationality (X)&quot; and &quot;SAT-Math Score (Y)&quot;. 
Notice that between these two variables is also a suspected cause relationship.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation5.gif\" alt=\"A flowchart. We have two causes, one of which is &quot;Education level of SAT Takers&quot;. This is a &quot;Lurking variable &quot; The other cause is &quot;Nationality (X)&quot;. Both of these might be causes of &quot; SAT-Math score (Y)&quot;. We have observed an association between &quot;Nationality (X)&quot; and &quot;SAT-Math Score (Y)&quot;. Notice that between these two variables is also a suspected cause relationship.\" \/><\/span><\/span>\r\n<p id=\"N10B33\">Here, the explanatory variable (X)\u00a0<em>may<\/em>\u00a0have a causal relationship with the response variable (Y), but the lurking variable might be a contributing factor as well, which makes it very hard to isolate the effect of the explanatory variable and prove that it has a causal link with the response variable. In this case, we say that the lurking variable is\u00a0<em>confounded<\/em> with the explanatory variable, since their effects on the response variable cannot be distinguished from each other.<\/p>\r\n\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\nNote that in each of the above two examples, the lurking variable interacts differently with the variables studied. In example 1, the lurking variable has an effect on both the explanatory and the response variables, creating the illusion that there is a causal link between them. In example 2, the lurking variable is confounded with the explanatory variable, making it hard to assess the isolated effect of the explanatory variable on the response variable.\r\n\r\nThe distinction between these two types of interactions is not as important as the fact that in either case, the observed association can be at least partially explained by the lurking variable. 
The most important message from these two examples is therefore:\u00a0<em>An observed association between two variables is not enough evidence that there is a <\/em><em>causal relationship between them.<\/em>\r\n<p id=\"N10B4C\">In other words \u2026<\/p>\r\n\r\n<table id=\"N10B4F_bx\" class=\"theorem labeled\">\r\n<thead>\r\n<tr>\r\n<th>Principle<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<td>\r\n<div class=\"theorem\">\r\n<div class=\"statement\">\r\n<p id=\"N10B54\">Association\u00a0<em>does not<\/em>\u00a0imply causation!<\/p>\r\n\r\n<\/div>\r\n<\/div><\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<div>\r\n<div class=\"textbox textbox--exercises\"><header class=\"textbox__header\">\r\n<h3 class=\"textbox__title\">Did I get this?<\/h3>\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\n[h5p id=\"50\"]\r\n\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<p id=\"N10AF5\">So far, we have:<\/p>\r\n\r\n<ul>\r\n \t<li>discussed what lurking variables are,<\/li>\r\n \t<li>demonstrated different ways in which the lurking variables can interact with the two studied variables, and<\/li>\r\n \t<li>understood that the existence of a possible lurking variable is the main reason why we say that association does not imply causation.<\/li>\r\n<\/ul>\r\nAs you recall, a lurking variable, by definition, is a variable that was not included in the study, but could have a substantial effect on our understanding of the relationship between the two studied variables.\r\n\r\nWhat if we\u00a0<em class=\"italic\">did<\/em>\u00a0include a lurking variable in our study? What kind of effect could that have on our understanding of the relationship? 
These are the questions we are going to discuss next.\r\n\r\nLet\u2019s start with an example:\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<h3 class=\"textbox__title\">Example<\/h3>\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n<div class=\"examplewrap\">\r\n<div class=\"example clearfix\">\r\n<h4>Hospital Death Rates<\/h4>\r\n<div>\r\n<p id=\"N10B16\"><em>Background:<\/em>\u00a0A government study collected data on the death rates in nearly 6,000 hospitals in the United States. These results were then challenged by researchers, who said that the federal analyses failed to take into account the variation among hospitals in the severity of patients\u2019 illnesses when they were hospitalized. As a result, said the researchers, some hospitals were treated unfairly in the findings, which named hospitals with higher-than-expected death rates. What the researchers meant is that when the federal government explored the relationship between the two variables\u2014hospital and death rate\u2014<em>it also should have included in the study (or taken into account) the lurking variable\u2014severity of illness.<\/em><\/p>\r\nWe will use a simplified version of this study to illustrate the researchers\u2019 claim, and see what the possible effect could be of including a lurking variable in a study. (Reference: Moore and McCabe (2003).\u00a0<em class=\"italic\">Introduction to the Practice of Statistics<\/em>.)\r\n<p id=\"N10B26\">Consider the following two-way table, which summarizes the data about the status of patients who were admitted to two hospitals in a certain city (Hospital A and Hospital B). 
Note that since the purpose of the study is to examine whether there is a \u201chospital effect\u201d on patients\u2019 status, \u201cHospital\u201d is the explanatory variable, and \u201cPatient\u2019s Status\u201d is the response variable.<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"A two-way table. The columns are the categories within the variable &quot;Patient's Status&quot;. These categories are &quot;Died&quot; and &quot;Survived.&quot; In addition, there is a &quot;Total&quot; column. The rows are categories for the variable &quot;Hospital&quot;. These categories are &quot;Hospital A&quot; and &quot;Hospital B&quot;. Like usual there is also a &quot;Total&quot; Row. Here is the data in &quot;Row,Column: Value &quot; format: Hospital A, Died: 63; Hospital A, Survived: 2037; Hospital A, Total: 2100; Hospital B, Died: 16; Hospital B, Survived: 784; Hospital B, Total: 800; Total, Died: 79; Total, Survived: 2821; Total, Total: 2900;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation6.gif\" alt=\"A two-way table. The columns are the categories within the variable &quot;Patient's Status&quot;. These categories are &quot;Died&quot; and &quot;Survived.&quot; In addition, there is a &quot;Total&quot; column. The rows are categories for the variable &quot;Hospital&quot;. These categories are &quot;Hospital A&quot; and &quot;Hospital B&quot;. Like usual there is also a &quot;Total&quot; Row. 
Here is the data in &quot;Row,Column: Value &quot; format: Hospital A, Died: 63; Hospital A, Survived: 2037; Hospital A, Total: 2100; Hospital B, Died: 16; Hospital B, Survived: 784; Hospital B, Total: 800; Total, Died: 79; Total, Survived: 2821; Total, Total: 2900;\" \/><\/span><\/span><\/p>\r\n<p id=\"N10B2F\">When we supplement the two-way table with the conditional percents within each hospital:<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"A two-way table with the same rows and columns as the previous two-way table, except the Total row has been removed. Here is the data in the same format: Hospital A, Died: 3%; Hospital A, Survived: 97%; Hospital A, Total: 100%; Hospital B, Died: 2%; Hospital B, Survived: 98%; Hospital B, Total: 100%;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation7.gif\" alt=\"A two-way table with the same rows and columns as the previous two-way table, except the Total row has been removed. Here is the data in the same format: Hospital A, Died: 3%; Hospital A, Survived: 97%; Hospital A, Total: 100%; Hospital B, Died: 2%; Hospital B, Survived: 98%; Hospital B, Total: 100%;\" \/><\/span><\/span><\/p>\r\nwe find that Hospital A has a higher death rate (3%) than Hospital B (2%). Should we jump to the conclusion that a sick patient admitted to Hospital A is 50% more likely to die than if he\/she were admitted to Hospital B?\u00a0<em>Not so fast \u2026<\/em>\r\n<p id=\"N10B3E\">Maybe Hospital A gets most of the severe cases, and that explains why it has a higher death rate. 
In order to explore this, we need to\u00a0<em>include (or account for) the lurking variable \u201cseverity of illness\u201d in our analysis.<\/em>\u00a0To do this, we go back to the two-way table and split it up to look separately at patients who are severely ill, and patients who are not.<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"The original two-way table has been split into two two-way tables, one for &quot;Patients severely ill&quot; and one for &quot;Patients not severely ill.&quot; Once again, here are the columns, for the variable &quot;Patient's Status&quot;: &quot;Died&quot;, &quot;Survived&quot;, &quot;Total&quot;. The rows, for the variable &quot;Hospital&quot;: &quot;Hospital A&quot;, &quot;Hospital B&quot;, &quot; Total&quot;. Data will be given in &quot;Row,Column: Value&quot; format. Table for &quot;Patients severely ill:&quot; Hospital A, Died: 57; Hospital A, Survived: 1443; Hospital A, Total: 1500; Hospital B, Died: 8; Hospital B, Survived: 192; Hospital B, Total: 200; Total, Died: 65; Total, Survived: 1635; Total, Total: 1700; Table for &quot;Patients not severely ill:&quot; Hospital A, Died: 6; Hospital A, Survived: 594; Hospital A, Total: 600; Hospital B, Died: 8; Hospital B, Survived: 592; Hospital B, Total: 600; Total, Died: 14; Total, Survived: 1186; Total, Total: 1200;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation10.gif\" alt=\"The original two-way table has been split into two two-way tables, one for &quot;Patients severely ill&quot; and one for &quot;Patients not severely ill.&quot; Once again, here are the columns, for the variable &quot;Patient's Status&quot;: &quot;Died&quot;, &quot;Survived&quot;, &quot;Total&quot;. The rows, for the variable &quot;Hospital&quot;: &quot;Hospital A&quot;, &quot;Hospital B&quot;, &quot; Total&quot;. 
Data will be given in &quot;Row,Column: Value&quot; format. Table for &quot;Patients severely ill:&quot; Hospital A, Died: 57; Hospital A, Survived: 1443; Hospital A, Total: 1500; Hospital B, Died: 8; Hospital B, Survived: 192; Hospital B, Total: 200; Total, Died: 65; Total, Survived: 1635; Total, Total: 1700; Table for &quot;Patients not severely ill:&quot; Hospital A, Died: 6; Hospital A, Survived: 594; Hospital A, Total: 600; Hospital B, Died: 8; Hospital B, Survived: 592; Hospital B, Total: 600; Total, Died: 14; Total, Survived: 1186; Total, Total: 1200;\" \/><\/span><\/span><\/p>\r\n<p id=\"N10B4A\">As we can see, Hospital A\u00a0<em>did<\/em>\u00a0admit many more severely ill patients than Hospital B (1,500 vs. 200). In fact, from the way the totals were split, we see that in Hospital A, severely ill patients were a much higher proportion of the patients\u20141,500 out of a total of 2,100 patients. In contrast, only 200 out of 800 patients at Hospital B were severely ill. To better see the effect of including the lurking variable, we need to supplement each of the two new two-way tables with its conditional percentages:<span class=\"imagewrap\"><span class=\"image\"><img class=\"img-responsive popimg aligncenter\" title=\"Two two-way tables with the same rows and columns as the previous two-way table, except the Total row has been removed. 
Here is the data in the same format: Table for &quot;Patients severely ill:&quot; Hospital A, Died: 3.8%; Hospital A, Survived: 96.2%; Hospital A, Total: 100%; Hospital B, Died: 4.0%; Hospital B, Survived: 96.0%; Hospital B, Total: 100%; Table for &quot;Patients not severely ill:&quot; Hospital A, Died: 1.0%; Hospital A, Survived: 99.0%; Hospital A, Total: 100%; Hospital B, Died: 1.3%; Hospital B, Survived: 98.7%; Hospital B, Total: 100%;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation11.gif\" alt=\"Two two-way tables with the same rows and columns as the previous two-way table, except the Total row has been removed. Here is the data in the same format: Table for &quot;Patients severely ill:&quot; Hospital A, Died: 3.8%; Hospital A, Survived: 96.2%; Hospital A, Total: 100%; Hospital B, Died: 4.0%; Hospital B, Survived: 96.0%; Hospital B, Total: 100%; Table for &quot;Patients not severely ill:&quot; Hospital A, Died: 1.0%; Hospital A, Survived: 99.0%; Hospital A, Total: 100%; Hospital B, Died: 1.3%; Hospital B, Survived: 98.7%; Hospital B, Total: 100%;\" \/><\/span><\/span><\/p>\r\n<p id=\"N10B56\">Note that despite our earlier finding that overall Hospital A has a higher death rate (3% vs. 2%), when we take into account the lurking variable, we find that actually it is Hospital B that has the higher death rate both among the severely ill patients (4% vs. 3.8%) and among the not severely ill patients (1.3% vs. 
1%).\u00a0<em>Thus, we see that adding a lurking variable can change the direction of an association.<\/em><\/p>\r\n<p id=\"N10B5C\">Whenever including a lurking variable causes us to rethink the direction of an association, this is called\u00a0<em>Simpson\u2019s paradox.<\/em><\/p>\r\n<p id=\"N10B62\">The possibility that a lurking variable can have such a dramatic effect is another reason we must adhere to the principle:<\/p>\r\n\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<table id=\"N10B66_bx\" class=\"theorem labeled\">\r\n<thead>\r\n<tr>\r\n<th>Principle<\/th>\r\n<\/tr>\r\n<\/thead>\r\n<tbody>\r\n<tr>\r\n<td>\r\n<div class=\"theorem\">\r\n<div class=\"statement\">\r\n<p id=\"N10B6B\">Association\u00a0<em>does not<\/em>\u00a0imply causation!<\/p>\r\n\r\n<\/div>\r\n<\/div><\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<\/div>\r\n<\/div>\r\nIt is\u00a0<em class=\"italic\">not<\/em> always the case that including a lurking variable makes us rethink the direction of the association. In the next example we will see how including a lurking variable just helps us gain a deeper understanding of the observed relationship.\r\n<div class=\"examplewrap\">\r\n<div class=\"example clearfix\">\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<h3 class=\"textbox__title\">Example<\/h3>\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n<h4>College Entrance Exams<\/h4>\r\n<div>\r\n\r\nAs discussed earlier, in the United States, the SAT is a widely used college entrance examination, required by the most prestigious schools. 
In some states, a different college entrance examination, the ACT, is prevalent.\r\n\r\nhttps:\/\/youtube.com\/watch?v=Nnj1YlqzkX4\r\n<div class=\"figurewrap\">\r\n<div class=\"figure clearfix\">\r\n<div class=\"youtube\"><\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\nThe last two examples showed us that including a lurking variable in our exploration may:\r\n<ul>\r\n \t<li>Lead us to rethink the direction of an association (as in the Hospital\/Death Rate example).<\/li>\r\n \t<li>Help us to gain a deeper understanding of the relationship between variables (as in the SAT\/ACT example).<\/li>\r\n<\/ul>\r\n<div class=\"textbox textbox--exercises\"><header class=\"textbox__header\">\r\n<h3 class=\"textbox__title\">Did I get this?<\/h3>\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n\r\n[h5p id=\"51\"]\r\n\r\n<\/div>\r\n<\/div>\r\n<h2><span title=\"Quick scroll up\">Let\u2019s Summarize<\/span><\/h2>\r\n<ul>\r\n \t<li>A\u00a0<em>lurking variable<\/em>\u00a0is a variable that was not included in your analysis, but that could substantially change your interpretation of the data if it were included.<\/li>\r\n \t<li>Because of the possibility of lurking variables, we adhere to the principle that\u00a0<em>association does not imply causation<\/em>.<\/li>\r\n \t<li>Including a lurking variable in our exploration may:\r\n<ul>\r\n \t<li>Help us to gain a deeper understanding of the relationship between variables.<\/li>\r\n \t<li>Lead us to rethink the direction of an association.<\/li>\r\n<\/ul>\r\n<\/li>\r\n \t<li>Whenever including a lurking variable causes us to rethink the direction of an association, this is an instance of\u00a0<em>Simpson\u2019s paradox<\/em>.<\/li>\r\n<\/ul>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>","rendered":"<h2><span title=\"Quick scroll up\">Linear Regression: Summarizing the Pattern of the Data with a Line<\/span><\/h2>\n<p>So far we\u2019ve used the scatterplot to describe the relationship between 
two quantitative variables, and in the special case of a linear relationship, we have supplemented the scatterplot with the correlation (r). The correlation, however, doesn\u2019t fully characterize the linear relationship between two quantitative variables\u2014it only measures the strength and direction. We often want to describe more precisely how one variable changes with the other (by \u201cmore precisely,\u201d we mean more than just the direction), or\u00a0<em>predict\u00a0<\/em>the value of the response variable for a given value of the explanatory variable. In order to be able to do that, we need to summarize the linear relationship with a line that best fits the linear pattern of the data. In the remainder of this section, we will introduce a way to find such a line, learn how to interpret it, and use it (cautiously) to make predictions.<\/p>\n<p>Again, let\u2019s start with a motivating example:<\/p>\n<p>Earlier, we examined the linear relationship between the age of a driver and the maximum distance at which a highway sign was legible, using both a scatterplot and the correlation coefficient. Suppose a government agency wanted to predict the maximum distance at which the sign would be legible for 60-year-old drivers, and thus make sure that the sign could be used safely and effectively.<\/p>\n<p id=\"N10B12\">How would we make this prediction?<\/p>\n<p><iframe loading=\"lazy\" id=\"oembed-1\" title=\"Making Predictions\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube.com\/embed\/8hf3dMf59cI?feature=oembed&#38;rel=0&#38;rel=0\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n<div class=\"figurewrap\">\n<div class=\"figure clearfix\">\n<div class=\"youtube\"><span style=\"orphans: 1; text-align: initial; font-size: 1em;\">How and why did we pick this particular line (the one shown in red in the above walkthrough) to describe the dependence of the maximum distance at which a sign is legible upon the age of a driver? 
What line exactly did we choose? We will return to this example once we can answer that question with a bit more precision.<\/span><\/div>\n<\/div>\n<\/div>\n<p>The technique that specifies the dependence of the response variable on the explanatory variable is called\u00a0<em>regression<\/em>. When that dependence is linear (which is the case in our examples in this section), the technique is called\u00a0<em>linear regression<\/em>. Linear regression is therefore the technique of finding the line that best fits the pattern of the linear relationship (or in other words, the line that best describes how the response variable linearly depends on the explanatory variable).<\/p>\n<p id=\"N10B09\">To understand how such a line is chosen, consider the following very simplified version of the age-distance example (we left just 6 of the drivers on the scatterplot):<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"The scatterplot of Sign Legibility vs. Driver Age with only 6 data points. The data points chosen to be shown roughly make a parallelogram, whose top and bottom sides represent negative relationships.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear9.gif\" alt=\"The scatterplot of Sign Legibility vs. Driver Age with only 6 data points. The data points chosen to be shown roughly make a parallelogram, whose top and bottom sides represent negative relationships.\" \/><\/span><\/span><\/p>\n<p>There are many lines that look like they would be good candidates to be the line that best fits the data:<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"The same scatterplot the 6 data points. Five different lines have been drawn from the upper left region of the plot to the lower right. 
They all intersect the parallelogram created by the 6 data points in a way such that each line is above 3 points and below 3 points. These lines are potential candidates. There are many other lines which could be used to fit the data.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear10.gif\" alt=\"The same scatterplot with the 6 data points. Five different lines have been drawn from the upper left region of the plot to the lower right. They all intersect the parallelogram created by the 6 data points in a way such that each line is above 3 points and below 3 points. These lines are potential candidates. There are many other lines which could be used to fit the data.\" \/><\/span><\/span><\/p>\n<p>It is doubtful that everyone would select the same line in the plot above. We need to agree on what we mean by \u201cbest fits the data\u201d; in other words, we need to agree on a criterion by which we would select this line. We want the line we choose to be close to the data points. In other words, whatever criterion we choose, it had better somehow take into account the vertical deviations of the data points from the line, which are marked with blue arrows in the plot below:<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"The same scatterplot with 6 points. A potential line has been drawn, and a vertical line from each data point to the line has also been drawn. The lengths of these vertical lines have to be taken into account when choosing a best fit line.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear11.gif\" alt=\"The same scatterplot with 6 points. A potential line has been drawn, and a vertical line from each data point to the line has also been drawn. 
The lengths of these vertical lines have to be taken into account when choosing a best fit line.\" \/><\/span><\/span><\/p>\n<p id=\"N10B24\">The most commonly used criterion is called the\u00a0<em>least squares<\/em>\u00a0criterion. This criterion says: Among all the lines that look good on your data, choose the one that has the smallest sum of squared vertical deviations. Visually, each squared deviation is represented by the area of one of the squares in the plot below. Therefore, we are looking for the line that will have the smallest total yellow area.<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"The same scatterplot with 6 data points. A line has been chosen, and for each of the 6 data points, a vertical line is drawn from the data point to the line. A square is then drawn, one side using this line, so that all 4 sides are the same length as the vertical line. For all 6 data points we have 6 different vertical lines and thus 6 different squares. The least squares criterion looks to reduce the total area of these squares.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear12.gif\" alt=\"The same scatterplot with 6 data points. A line has been chosen, and for each of the 6 data points, a vertical line is drawn from the data point to the line. A square is then drawn, one side using this line, so that all 4 sides are the same length as the vertical line. For all 6 data points we have 6 different vertical lines and thus 6 different squares. 
The least squares criterion looks to reduce the total area of these squares.\" \/><\/span><\/span>This line is called the\u00a0<em>least-squares regression line<\/em>, and, as we\u2019ll see, it fits the linear pattern of the data very well.<\/p>\n<p id=\"N10B37\">For the remainder of this lesson, you\u2019ll need to feel comfortable with the algebra of a straight line. In particular you\u2019ll need to be familiar with the\u00a0<em>slope\u00a0<\/em>and the\u00a0<em>intercept\u00a0<\/em>in the equation of a line, and their interpretation.<\/p>\n<p id=\"c028ff36fa7942d88c382dfc2721704a\">Like any other line, the equation of the least-squares regression line for summarizing the linear relationship between the response variable(Y) and the explanatory variable (X) has the form:\u00a0<span class=\"mjx-chtml MathJax_CHTML\"><span class=\"mjx-math\"><span class=\"mjx-mrow\"><span class=\"mjx-mi\"><span class=\"mjx-char MJXc-TeX-math-I\">Y<\/span><\/span><span class=\"mjx-mtext\"><span class=\"mjx-char MJXc-TeX-main-R\">\u00a0<\/span><\/span><span class=\"mjx-mo MJXc-space3\"><span class=\"mjx-char MJXc-TeX-main-R\">=<\/span><\/span><span class=\"mjx-mtext MJXc-space3\"><span class=\"mjx-char MJXc-TeX-main-R\">\u00a0<\/span><\/span><span class=\"mjx-mi\"><span class=\"mjx-char MJXc-TeX-math-I\">a<\/span><\/span><span class=\"mjx-mtext\"><span class=\"mjx-char MJXc-TeX-main-R\">\u00a0<\/span><\/span><span class=\"mjx-mo MJXc-space2\"><span class=\"mjx-char MJXc-TeX-main-R\">+<\/span><\/span><span class=\"mjx-mtext MJXc-space2\"><span class=\"mjx-char MJXc-TeX-main-R\">\u00a0<\/span><\/span><span class=\"mjx-mi\"><span class=\"mjx-char MJXc-TeX-math-I\">b<\/span><\/span><span class=\"mjx-mi\"><span class=\"mjx-char MJXc-TeX-math-I\">X<\/span><\/span><\/span><\/span><\/span><\/p>\n<p id=\"d1b3ef3d9f8b48c2a09e1aa1fe88da46\">All we need to do is calculate the intercept\u00a0<em class=\"italic\">a<\/em>, and the slope\u00a0<em class=\"italic\">b<\/em>, which is easily done if we 
know:<\/p>\n<ul id=\"b579dde0330a492f9986670011fb0635\">\n<li>\n<p id=\"a3dbf83574db455daca9779659353ef3\"><span id=\"MathJax-Element-2-Frame\" class=\"mjx-chtml MathJax_CHTML\"><span class=\"mjx-math\"><span class=\"mjx-mrow\"><span class=\"mjx-mover\"><span class=\"mjx-stack\"><span class=\"mjx-over\"><span class=\"mjx-mo\"><span class=\"mjx-delim-h\"><span class=\"mjx-char MJXc-TeX-main-R\">\u00af<\/span><span class=\"mjx-char MJXc-TeX-main-R\">\u00af<\/span><span class=\"mjx-char MJXc-TeX-main-R\">\u00af<\/span><\/span><\/span><\/span><span class=\"mjx-op\"><span class=\"mjx-mi\"><span class=\"mjx-char MJXc-TeX-math-I\">X<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>\u2014the mean of the explanatory variable\u2019s values<\/p>\n<\/li>\n<li>\n<p id=\"bad6f2b749c6474bb0d90c7764eeb7ec\">S<sub>X<\/sub>\u2014the standard deviation of the explanatory variable\u2019s values<\/p>\n<\/li>\n<li>\n<p id=\"c5a51a2a23c74f3396256757ed746d50\"><span id=\"MathJax-Element-3-Frame\" class=\"mjx-chtml MathJax_CHTML\"><span class=\"mjx-math\"><span class=\"mjx-mrow\"><span class=\"mjx-mover\"><span class=\"mjx-stack\"><span class=\"mjx-over\"><span class=\"mjx-mo\"><span class=\"mjx-delim-h\"><span class=\"mjx-char MJXc-TeX-main-R\">\u00af<\/span><span class=\"mjx-char MJXc-TeX-main-R\">\u00af<\/span><span class=\"mjx-char MJXc-TeX-main-R\">\u00af<\/span><\/span><\/span><\/span><span class=\"mjx-op\"><span class=\"mjx-mi\"><span class=\"mjx-char MJXc-TeX-math-I\">Y<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>\u2014the mean of the response variable\u2019s values<\/p>\n<\/li>\n<li>\n<p id=\"d6cb5347802e46a18a6793df6aec145b\">S<sub>Y<\/sub>\u2014the standard deviation of the response variable\u2019s values<\/p>\n<\/li>\n<li>\n<p id=\"d0b502bdd9034080a0702bca8de42d8a\">r\u2014the correlation coefficient<\/p>\n<\/li>\n<\/ul>\n<p id=\"e589cb0d6eca43c198fef8969f3d8788\">Given the five quantities above, the slope and intercept of the least squares 
regression line are found using the following formulas:<\/p>\n<p>[latex]\\mathcal{b}=\\mathcal{r}\\left(\\frac{\\mathcal{S}_\\mathcal{y}}{\\mathcal{S}_\\mathcal{x}}\\right)\\\\  \\mathcal{a}=\\bar{\\mathcal{Y}}-\\mathcal{b}\\bar{\\mathcal{X}}[\/latex]<\/p>\n<div id=\"e1c575ccf0da4755b49f40375f057c53\" class=\"section\">\n<div class=\"sectionContain\">\n<h2><span title=\"Quick scroll up\">Comments<\/span><\/h2>\n<ol id=\"e2855b52651047879bfdbef498b6a412\">\n<li>\n<p id=\"f597bcc25d944bbfa514faa5dea200e3\">Note that since the formula for the intercept\u00a0<em class=\"italic\">a<\/em>\u00a0depends on the value of the slope,\u00a0<em class=\"italic\">b<\/em>, you need to find\u00a0<em class=\"italic\">b<\/em>\u00a0first.<\/p>\n<\/li>\n<li>\n<p id=\"ae5c771ea7e194ab1aee39ddcf0bc1369\">The slope of the least squares regression line can be interpreted as the average change in the response variable when the explanatory variable increases by 1 unit.<\/p>\n<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<div id=\"b381b4db93c24ee4a9b8da6739aa0216\" class=\"examplewrap\">\n<div class=\"example clearfix\">\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<h3 class=\"textbox__title\">Example<\/h3>\n<\/header>\n<div class=\"textbox__content\">\n<h4>Age-Distance<\/h4>\n<div>\n<p id=\"d93a3e775803447f9667eb3f1a133ac3\">Let\u2019s revisit our age-distance example, and find the\u00a0<em class=\"italic\">least-squares regression line<\/em>. 
The following output will be helpful in getting the 5 values we need:<\/p>\n<div class=\"Excel2019PC altContentOn\">\n<div class=\"alternative\">\n<p><span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" id=\"b114ce1718d44144a8f3ad4fcf7e1da8\" class=\"img-responsive popimg aligncenter\" title=\"Output from Excel 2007\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear13excel.gif\" alt=\"Output from Excel 2007\" \/><\/span><\/span><\/p>\n<ul id=\"b0e4d665c4b84c0095be89272b378e41\">\n<li>\n<p id=\"fb10138e9bcf41799d61123859774cff\">The\u00a0<em class=\"bold\">slope\u00a0<\/em>of the line is [latex]\\mathcal{b}=\\left(-0.793\\right)*\\left(\\frac{82.8}{21.78}\\right)=-3[\/latex]. This means that for every 1-unit increase of the explanatory variable, there is, on average, a 3-unit decrease in the response variable. The interpretation\u00a0<em class=\"italic\">in context<\/em>\u00a0of the slope being -3 is, therefore: For every year a driver gets older, the maximum distance at which he\/she can read a sign decreases,\u00a0<em class=\"italic\">on average<\/em>, by 3 feet.<\/p>\n<\/li>\n<li>\n<p id=\"de1f9103357f4371a9de5a168d315d30\">The\u00a0<em class=\"bold\">intercept<\/em>\u00a0of the line is\u00a0<span id=\"MathJax-Element-12-Frame\" class=\"mjx-chtml MathJax_CHTML\"><span id=\"MJXc-Node-271\" class=\"mjx-math\"><span id=\"MJXc-Node-272\" class=\"mjx-mrow\"><span id=\"MJXc-Node-273\" class=\"mjx-mrow\"><span id=\"MJXc-Node-274\" class=\"mjx-mi\"><span class=\"mjx-char MJXc-TeX-math-I\">a<\/span><\/span><span id=\"MJXc-Node-275\" class=\"mjx-mo MJXc-space3\"><span class=\"mjx-char MJXc-TeX-main-R\">=<\/span><\/span><span id=\"MJXc-Node-276\" class=\"mjx-mn MJXc-space3\"><span class=\"mjx-char MJXc-TeX-main-R\">423<\/span><\/span><span id=\"MJXc-Node-277\" class=\"mjx-mo MJXc-space2\"><span class=\"mjx-char 
MJXc-TeX-main-R\">\u2212<\/span><\/span><span id=\"MJXc-Node-278\" class=\"mjx-mo MJXc-space2\"><span class=\"mjx-char MJXc-TeX-main-R\">(<\/span><\/span><span id=\"MJXc-Node-279\" class=\"mjx-mn\"><span class=\"mjx-char MJXc-TeX-main-R\">\u22123<\/span><\/span><span id=\"MJXc-Node-280\" class=\"mjx-mo\"><span class=\"mjx-char MJXc-TeX-main-R\">\u2217<\/span><\/span><span id=\"MJXc-Node-281\" class=\"mjx-mn\"><span class=\"mjx-char MJXc-TeX-main-R\">51<\/span><\/span><span id=\"MJXc-Node-282\" class=\"mjx-mo\"><span class=\"mjx-char MJXc-TeX-main-R\">)<\/span><\/span><span id=\"MJXc-Node-283\" class=\"mjx-mo MJXc-space3\"><span class=\"mjx-char MJXc-TeX-main-R\">=<\/span><\/span><span id=\"MJXc-Node-284\" class=\"mjx-mn MJXc-space3\"><span class=\"mjx-char MJXc-TeX-main-R\">576<\/span><\/span><\/span><\/span><\/span><\/span>\u00a0and therefore the\u00a0<em class=\"bold\">least-squares regression line<\/em>\u00a0for this example is<\/p>\n<table class=\"formula\">\n<tbody>\n<tr>\n<td>Distance = 576 + (\u22123 * Age)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<p id=\"f49e6141036f426491adb4e7b7447b11\">Here is the regression line plotted on the scatterplot:<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" id=\"af03a2aff0f14f409363fe047be4e064\" class=\"img-responsive popimg aligncenter\" title=\"The scatterplot for Driver Age and Sign Legibility Distance. The least squares regression line has been drawn. It is a negative relationship line.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear14.gif\" alt=\"The scatterplot for Driver Age and Sign Legibility Distance. The least squares regression line has been drawn. 
It is a negative relationship line.\" \/><\/span><\/span>As we can see, the regression line fits the linear pattern of the data quite well.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"ec08df282d9f4598bac4a27d745d9acb\" class=\"section\">\n<div class=\"sectionContain\">\n<h2><span title=\"Quick scroll up\">Comment<\/span><\/h2>\n<p id=\"beab96cc803549ef97c79c1787068ce2\">As we mentioned before, hand-calculation is not the focus of this course. We wanted you to see one example in which the least squares regression line is calculated by hand, but in general we\u2019ll let a statistics package do that for us.<\/p>\n<hr \/>\n<p id=\"N10B28\">Let\u2019s go back now to our motivating example, in which we wanted to predict the maximum distance at which a sign is legible for a 60-year-old. Now that we have found the least squares regression line, this prediction becomes quite easy:<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"The scatterplot for Driver Age and Sign Legibility Distance. Now that we have a regression line, finding out the maximum distance at which a sign is legible for a 60-year-old person is easy. We simply check at what y coordinate does the regression line cross a vertical line at x = 60. This happens to be at y = 396.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear15.gif\" alt=\"The scatterplot for Driver Age and Sign Legibility Distance. Now that we have a regression line, finding out the maximum distance at which a sign is legible for a 60-year-old person is easy. We simply check at what y coordinate does the regression line cross a vertical line at x = 60. 
This happens to be at y = 396.\" \/><\/span><\/span><\/p>\n<p>Practically, what the figure tells us is that in order to find the predicted legibility distance for a 60-year-old, we plug Age = 60 into the regression line equation, to find that:<\/p>\n<table class=\"formula\">\n<tbody>\n<tr>\n<td>Predicted distance = 576 + (- 3 * 60) = 396<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>396 feet is our best prediction for the maximum distance at which a sign is legible for a 60-year-old.<\/p>\n<div class=\"textbox textbox--exercises\">\n<header class=\"textbox__header\">\n<h3 class=\"textbox__title\">Did I get this?<\/h3>\n<\/header>\n<div class=\"textbox__content\">\n<p id=\"N10B43\"><em>Background:\u00a0<\/em>A statistics department is interested in tracking the progress of its students from entry until graduation. As part of the study, the department tabulates the performance of 10 students in an introductory course and in an upper-level course required for graduation. The scatterplot below includes the least squares line (the line that best explains the upper-level course average based on the lower-level course average), and its equation:<\/p>\n<div class=\"image shouldbeleft\"><img decoding=\"async\" id=\"_i_1\" class=\"img-responsive popimg aligncenter\" title=\"The scatterplot for Introductory Course Average vs. Upper Level Course Average. In addition to the data plotted on the scatterplot, we have a least squares regression line. The line's equation is Y = -1.4 + X.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear21.gif\" alt=\"The scatterplot for Introductory Course Average vs. Upper Level Course Average. In addition to the data plotted on the scatterplot, we have a least squares regression line. 
The line's equation is Y = -1.4 + X.\" \/><\/div>\n<div>\n<div id=\"h5p-49\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-49\" class=\"h5p-iframe\" data-content-id=\"49\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"3.4 Did I get this? 1\"><\/iframe><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"N10BB7\" class=\"section\">\n<div class=\"sectionContain\">\n<h2><span title=\"Quick scroll up\">Comment About Predictions<\/span><\/h2>\n<p id=\"N10BBE\">Suppose a government agency wanted to design a sign appropriate for an even wider range of drivers than were present in the original study. They want to predict the maximum distance at which the sign would be legible for a 90-year-old. Using the least squares regression line again as our summary of the linear dependence of the distances upon the drivers&#8217; ages, the agency predicts that 90-year-old drivers can see the sign at no more than 576 + (- 3 * 90) = 306 feet:<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" id=\"_i_2\" class=\"img-responsive popimg aligncenter\" style=\"box-sizing: border-box; border: none; vertical-align: middle; margin: auto; padding: 0px; outline: 0px; cursor: pointer; display: block; max-width: 100%; height: auto;\" title=\"The scatterplot for Driver Age vs. Sign Legibility Distance. The scales of both axes have been enlarged so that the regression line has room on the right to be extended past where data exists. The regression line is negative, so it grows from the upper left to the lower right of the plot. Where the regression line is creating an estimate in between existing data, it is red. Beyond that, where there are no data points, the line is green. This area is x&gt;82. 
The equation of the regression line is Distance = 576 - 3 * Age\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/linear16.gif\" alt=\"The scatterplot for Driver Age vs. Sign Legibility Distance. The scales of both axes have been enlarged so that the regression line has room on the right to be extended past where data exists. The regression line is negative, so it grows from the upper left to the lower right of the plot. Where the regression line is creating an estimate in between existing data, it is red. Beyond that, where there are no data points, the line is green. This area is x&gt;82. The equation of the regression line is Distance = 576 - 3 * Age\" \/><\/span><\/span><\/p>\n<p id=\"N10BC7\">(The green segment of the line is the region of ages beyond 82, the age of the oldest individual in the study.)<\/p>\n<div class=\"inquiry\">\n<div><em>Question:\u00a0<\/em>Is our prediction for 90-year-old drivers reliable?<\/div>\n<div><\/div>\n<div class=\"answer\"><em>Answer:\u00a0<\/em>Our original age data ranged from 18 (youngest driver) to 82 (oldest driver), and our regression line is therefore a summary of the linear relationship\u00a0<em>in that age range only.\u00a0<\/em>When we plug the value 90 into the regression line equation, we are assuming that the same linear relationship extends beyond the range of our age data (18-82) into the green segment.\u00a0<em>There is no justification for such an assumption.<\/em>\u00a0It might be the case that the vision of drivers older than 82 falls off more rapidly than it does for younger drivers (i.e., the slope changes from -3 to something more negative). 
Our prediction for age = 90 is therefore\u00a0<em>not reliable.<\/em><\/div>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"N10BE0\" class=\"section purposewrap\">\n<div class=\"sectionContain\">\n<h2><span title=\"Quick scroll up\">In General<\/span><\/h2>\n<p id=\"N10BE7\">Prediction for ranges of the explanatory variable that are not in the data is called\u00a0<em>extrapolation<\/em>. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided. In our example, like most others, extrapolation can lead to very poor or illogical predictions.<\/p>\n<h2><span title=\"Quick scroll up\">Let\u2019s Summarize<\/span><\/h2>\n<ul>\n<li>A special case of the relationship between two quantitative variables is the\u00a0<em>linear\u00a0<\/em>relationship. In this case, a straight line simply and adequately summarizes the relationship.<\/li>\n<li>When the scatterplot displays a linear relationship, we supplement it with the\u00a0<em>correlation coefficient (r)<\/em>, which measures the\u00a0<em>strength<\/em>\u00a0and direction of a linear relationship between two quantitative variables. The correlation ranges between -1 and 1. Values near -1 indicate a strong negative linear relationship, values near 0 indicate a weak linear relationship, and values near 1 indicate a strong positive linear relationship.<\/li>\n<li>The correlation is only an appropriate numerical measure for linear relationships, and is sensitive to outliers. 
Therefore, the correlation should only be used as a supplement to a scatterplot (after we look at the data).<\/li>\n<li>The most commonly used criterion for finding a line that summarizes the pattern of a linear relationship is \u201cleast squares.\u201d The\u00a0<em>least squares regression line\u00a0<\/em>has the smallest sum of squared vertical deviations of the data points from the line.<\/li>\n<li>The slope of the least squares regression line can be interpreted as the average change in the response variable when the explanatory variable increases by 1 unit.<\/li>\n<li>The least squares regression line predicts the value of the response variable for a given value of the explanatory variable.\u00a0<em>Extrapolation<\/em>\u00a0is prediction of the response variable for values of the explanatory variable that fall outside the range of the data. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided.<\/li>\n<\/ul>\n<h2>Causation<\/h2>\n<p>So far we have discussed different ways in which data can be used to explore the relationship (or association) between two variables. To frame our discussion we followed the role-type classification table:<span class=\"imagewrap\"><span class=\"image\"><img loading=\"lazy\" decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"It is possible for any type of explanatory variable to be paired with any type of response variable. 
The possible pairings are: Categorical Explanatory \u2192 Categorical Response (C\u2192C), Categorical Explanatory \u2192 Quantitative Response (C\u2192Q), Quantitative Explanatory \u2192 Categorical Response (Q\u2192C), and Quantitative Explanatory \u2192 Quantitative Response (Q\u2192Q).\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation1.gif\" alt=\"It is possible for any type of explanatory variable to be paired with any type of response variable. The possible pairings are: Categorical Explanatory \u2192 Categorical Response (C\u2192C), Categorical Explanatory \u2192 Quantitative Response (C\u2192Q), Quantitative Explanatory \u2192 Categorical Response (Q\u2192C), and Quantitative Explanatory \u2192 Quantitative Response (Q\u2192Q).\" width=\"700\" height=\"500\" \/><\/span><\/span><\/p>\n<p>and we have now completed learning how to explore the relationship in cases C\u2192Q, C\u2192C, and Q\u2192Q. (As noted before, case Q\u2192C will not be discussed in this course.) When we explore the relationship between two variables, there is often a temptation to conclude from the observed relationship that changes in the explanatory variable\u00a0<em>cause<\/em>\u00a0changes in the response variable. In other words, you might be tempted to interpret the observed association as causation. 
The purpose of this part of the course is to convince you that this kind of interpretation is often\u00a0<em class=\"italic\">wrong!<\/em>\u00a0The motto of this section is one of the most fundamental principles of this course:<\/p>\n<table id=\"N10B0A_bx\" class=\"theorem labeled\">\n<thead>\n<tr>\n<th>Principle<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\n<div class=\"theorem\">\n<div class=\"statement\">\n<p>Association\u00a0<em>does not<\/em>\u00a0imply causation!<\/p>\n<\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div>Let\u2019s start by looking at the following example:<\/div>\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<h3 class=\"textbox__title\">Example<\/h3>\n<\/header>\n<div class=\"textbox__content\">\n<div class=\"examplewrap\">\n<h4 class=\"exHead\">Fire Damage<\/h4>\n<div class=\"example clearfix\">\n<div>\n<p id=\"N10AF9\">The scatterplot below illustrates how the number of firefighters sent to fires (X) is related to the amount of damage caused by fires (Y) in a certain city.<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"A scatterplot in which the horizontal axis is labeled &quot;# Of Firefighters&quot;, and the vertical axis is labeled &quot;Damage ($)&quot;. The vertical axis ranges from $0 to $2500000 and the horizontal axis ranges from 0 to 40.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation2.gif\" alt=\"A scatterplot in which the horizontal axis is labeled &quot;# Of Firefighters&quot;, and the vertical axis is labeled &quot;Damage ($)&quot;. The vertical axis ranges from $0 to $2500000 and the horizontal axis ranges from 0 to 40.\" \/><\/span><\/span><\/p>\n<p id=\"N10B02\">The scatterplot clearly displays a fairly strong (slightly curved)\u00a0<em>positive<\/em>\u00a0relationship between the two variables. 
Would it, then, be reasonable to conclude that sending more firefighters to a fire causes more damage, or that the city should send fewer firefighters to a fire, in order to decrease the amount of damage done by the fire? Of course not! So what is going on here?<\/p>\n<p>There is a\u00a0<em>third variable in the background<\/em>\u2014the seriousness of the fire\u2014that is responsible for the observed relationship. More serious fires require more firefighters, and also cause more damage.<\/p>\n<p>The following figure will help you visualize this situation:<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"A flowchart. The &quot; Seriousness of the fire&quot; is a &quot;lurking variable.&quot; This is a cause of both &quot;Number of firefighters (X)&quot; and &quot;amount of damage (Y)&quot; We have falsely observed a &quot;observed association&quot; between &quot;Number of firefighters (X) &quot; and &quot;Amount of damage (Y)&quot;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation3.gif\" alt=\"A flowchart. 
The &quot; Seriousness of the fire&quot; is a &quot;lurking variable.&quot; This is a cause of both &quot;Number of firefighters (X)&quot; and &quot;amount of damage (Y)&quot; We have falsely observed a &quot;observed association&quot; between &quot;Number of firefighters (X) &quot; and &quot;Amount of damage (Y)&quot;\" \/><\/span><\/span><\/p>\n<p id=\"N10B17\">Here, the seriousness of the fire is a\u00a0<em>lurking variable.\u00a0<\/em>A\u00a0<em>lurking variable<\/em>\u00a0is a variable that is not among the explanatory or response variables in a study, but could substantially affect your interpretation of the relationship among those variables.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p id=\"N10B21\">In particular, as in our example, the lurking variable might have an effect on\u00a0<em class=\"italic\">both<\/em>\u00a0the explanatory and the response variables. This common effect creates the observed association between the explanatory and response variables, even though there is no causal link between them. This possibility, that there might be a lurking variable (which we might not be thinking about) that is responsible for the observed relationship leads to our principle:<\/p>\n<table id=\"N10B28_bx\" class=\"theorem labeled\">\n<thead>\n<tr>\n<th>Principle<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\n<div class=\"theorem\">\n<div class=\"statement\">\n<p>Association\u00a0<em>does not<\/em>\u00a0imply causation!<\/p>\n<\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The next example will illustrate another way in which a lurking variable might interfere and prevent us from reaching any causal conclusions.<\/p>\n<div class=\"examplewrap\">\n<div class=\"example clearfix\">\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<h3 class=\"textbox__title\">Example<\/h3>\n<\/header>\n<div class=\"textbox__content\">\n<h4>SAT Test<\/h4>\n<div>\n<p id=\"N10B11\">For U.S. 
colleges and universities, a standard entrance examination is the SAT test. The side-by-side boxplots below provide evidence of a relationship between the student\u2019s country of origin (the United States or another country) and the student\u2019s SAT Math score.<\/p>\n<p><span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"A side-by-side boxplot. The vertical axis is labeled &quot;SAT Math Score&quot;, and it ranges from 450 to 800. The horizontal axis is labeled &quot;Country&quot; and has two categories, &quot;Other&quot; and &quot;US&quot;.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation4.gif\" alt=\"A side-by-side boxplot. The vertical axis is labeled &quot;SAT Math Score&quot;, and it ranges from 450 to 800. The horizontal axis is labeled &quot;Country&quot; and has two categories, &quot;Other&quot; and &quot;US&quot;.\" \/><\/span><\/span><\/p>\n<p id=\"N10B1A\">The distribution of international students\u2019 scores is higher than that of U.S. students. The international students\u2019 median score (about 700) exceeds the third quartile of U.S. students\u2019 scores. Can we conclude that the country of origin is the\u00a0<em>cause<\/em>\u00a0of the difference in SAT Math scores, and that students in the United States are weaker at math than students in other countries?<\/p>\n<p>No, not necessarily. While it\u00a0<em class=\"italic\">might<\/em>\u00a0be true that U.S. students differ in math ability from other students\u2014e.g., due to differences in educational systems\u2014we can\u2019t conclude that a student\u2019s country of origin is the cause of the disparity. One important\u00a0<em>lurking variable<\/em>\u00a0that might explain the observed relationship is the educational level of the two populations taking the SAT Math test. 
In the United States, the SAT is a standard test, and therefore a broad cross-section of all U.S. students (in terms of educational level) take this test. Among all international students, on the other hand, only those who plan on coming to the U.S. to study, which is usually a more selected subgroup, take the test.<\/p>\n<p>The following figure will help you visualize this explanation:<\/p>\n<p><span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"A flowchart. We have two causes, one of which is &quot;Education level of SAT Takers&quot;. This is a &quot;Lurking variable &quot; The other cause is &quot;Nationality (X)&quot;. Both of these might be causes of &quot; SAT-Math score (Y)&quot;. We have observed an association between &quot;Nationality (X)&quot; and &quot;SAT-Math Score (Y)&quot;. Notice that between these two variables is also a suspected cause relationship.\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation5.gif\" alt=\"A flowchart. We have two causes, one of which is &quot;Education level of SAT Takers&quot;. This is a &quot;Lurking variable &quot; The other cause is &quot;Nationality (X)&quot;. Both of these might be causes of &quot; SAT-Math score (Y)&quot;. We have observed an association between &quot;Nationality (X)&quot; and &quot;SAT-Math Score (Y)&quot;. Notice that between these two variables is also a suspected cause relationship.\" \/><\/span><\/span><\/p>\n<p id=\"N10B33\">Here, the explanatory variable (X)\u00a0<em>may<\/em>\u00a0have a causal relationship with the response variable (Y), but the lurking variable might be a contributing factor as well, which makes it very hard to isolate the effect of the explanatory variable and prove that it has a causal link with the response variable. 
In this case, we say that the lurking variable is\u00a0<em>confounded<\/em> with the explanatory variable, since their effects on the response variable cannot be distinguished from each other.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>Note that in each of the above two examples, the lurking variable interacts differently with the variables studied. In the first example, the lurking variable has an effect on both the explanatory and the response variables, creating the illusion that there is a causal link between them. In the second example, the lurking variable is confounded with the explanatory variable, making it hard to assess the isolated effect of the explanatory variable on the response variable.<\/p>\n<p>The distinction between these two types of interactions is not as important as the fact that in either case, the observed association can be at least partially explained by the lurking variable. The most important message from these two examples is therefore:\u00a0<em>An observed association between two variables is not enough evidence that there is a <\/em><em>causal relationship between them.<\/em><\/p>\n<p id=\"N10B4C\">In other words \u2026<\/p>\n<table id=\"N10B4F_bx\" class=\"theorem labeled\">\n<thead>\n<tr>\n<th>Principle<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\n<div class=\"theorem\">\n<div class=\"statement\">\n<p id=\"N10B54\">Association\u00a0<em>does not<\/em>\u00a0imply causation!<\/p>\n<\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div>\n<div class=\"textbox textbox--exercises\">\n<header class=\"textbox__header\">\n<h3 class=\"textbox__title\">Did I get this?<\/h3>\n<\/header>\n<div class=\"textbox__content\">\n<div id=\"h5p-50\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-50\" class=\"h5p-iframe\" data-content-id=\"50\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"3.4 Learn by doing 3\"><\/iframe><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p id=\"N10AF5\">So far, we 
have:<\/p>\n<ul>\n<li>discussed what lurking variables are,<\/li>\n<li>demonstrated different ways in which the lurking variables can interact with the two studied variables, and<\/li>\n<li>understood that the existence of a possible lurking variable is the main reason why we say that association does not imply causation.<\/li>\n<\/ul>\n<p>As you recall, a lurking variable, by definition, is a variable that was not included in the study, but could have a substantial effect on our understanding of the relationship between the two studied variables.<\/p>\n<p>What if we\u00a0<em class=\"italic\">did<\/em>\u00a0include a lurking variable in our study? What kind of effect could that have on our understanding of the relationship? These are the questions we are going to discuss next.<\/p>\n<p>Let\u2019s start with an example:<\/p>\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<h3 class=\"textbox__title\">Example<\/h3>\n<\/header>\n<div class=\"textbox__content\">\n<div class=\"examplewrap\">\n<div class=\"example clearfix\">\n<h4>Hospital Death Rates<\/h4>\n<div>\n<p id=\"N10B16\"><em>Background:<\/em>\u00a0A government study collected data on the death rates in nearly 6,000 hospitals in the United States. These results were then challenged by researchers, who said that the federal analyses failed to take into account the variation among hospitals in the severity of patients\u2019 illnesses when they were hospitalized. As a result, said the researchers, some hospitals were treated unfairly in the findings, which named hospitals with higher-than-expected death rates. 
What the researchers meant is that when the federal government explored the relationship between the two variables\u2014hospital and death rate\u2014<em>it also should have included in the study (or taken into account) the lurking variable\u2014severity of illness.<\/em><\/p>\n<p>We will use a simplified version of this study to illustrate the researchers\u2019 claim, and see what the possible effect could be of including a lurking variable in a study. (Reference: Moore and McCabe (2003).\u00a0<em class=\"italic\">Introduction to the Practice of Statistics<\/em>.)<\/p>\n<p id=\"N10B26\">Consider the following two-way table, which summarizes the data about the status of patients who were admitted to two hospitals in a certain city (Hospital A and Hospital B). Note that since the purpose of the study is to examine whether there is a \u201chospital effect\u201d on patients\u2019 status, \u201cHospital\u201d is the explanatory variable, and \u201cPatient\u2019s Status\u201d is the response variable.<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"A two-way table. The columns are the categories within the variable &quot;Patient's Status&quot;. These categories are &quot;Died&quot; and &quot;Survived.&quot; In addition, there is a &quot;Total&quot; column. The rows are categories for the variable &quot;Hospital&quot;. These categories are &quot;Hospital A&quot; and &quot;Hospital B&quot;. Like usual there is also a &quot;Total&quot; Row. Here is the data in &quot;Row,Column: Value &quot; format: Hospital A, Died: 63; Hospital A, Survived: 2037; Hospital A, Total: 2100; Hospital B, Died: 16; Hospital B, Survived: 784; Hospital B, Total: 800; Total, Died: 79; Total, Survived: 2821; Total, Total: 2900;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation6.gif\" alt=\"A two-way table. 
The columns are the categories within the variable &quot;Patient's Status&quot;. These categories are &quot;Died&quot; and &quot;Survived.&quot; In addition, there is a &quot;Total&quot; column. The rows are categories for the variable &quot;Hospital&quot;. These categories are &quot;Hospital A&quot; and &quot;Hospital B&quot;. Like usual there is also a &quot;Total&quot; Row. Here is the data in &quot;Row,Column: Value &quot; format: Hospital A, Died: 63; Hospital A, Survived: 2037; Hospital A, Total: 2100; Hospital B, Died: 16; Hospital B, Survived: 784; Hospital B, Total: 800; Total, Died: 79; Total, Survived: 2821; Total, Total: 2900;\" \/><\/span><\/span><\/p>\n<p id=\"N10B2F\">When we supplement the two-way table with the conditional percents within each hospital:<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"A two-way table with the same rows and columns as the previous two-way table, except the Total row has been removed. Here is the data in the same format: Hospital A, Died: 3%; Hospital A, Survived: 97%; Hospital A, Total: 100%; Hospital B, Died: 2%; Hospital B, Survived: 98%; Hospital B, Total: 100%;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation7.gif\" alt=\"A two-way table with the same rows and columns as the previous two-way table, except the Total row has been removed. Here is the data in the same format: Hospital A, Died: 3%; Hospital A, Survived: 97%; Hospital A, Total: 100%; Hospital B, Died: 2%; Hospital B, Survived: 98%; Hospital B, Total: 100%;\" \/><\/span><\/span><\/p>\n<p>we find that Hospital A has a higher death rate (3%) than Hospital B (2%). 
Should we jump to the conclusion that a sick patient admitted to Hospital A is 50% more likely to die than if he\/she were admitted to Hospital B?\u00a0<em>Not so fast \u2026<\/em><\/p>\n<p id=\"N10B3E\">Maybe Hospital A gets most of the severe cases, and that explains why it has a higher death rate. In order to explore this, we need to\u00a0<em>include (or account for) the lurking variable \u201cseverity of illness\u201d in our analysis.<\/em>\u00a0To do this, we go back to the two-way table and split it up to look separately at patients who are severely ill, and patients who are not.<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"The original two-way table has been split into two two-way tables, one for &quot;Patients severely ill&quot; and one for &quot;Patients not severely ill.&quot; Once again, here are the columns, for the variable &quot;Patient's Status&quot;: &quot;Died&quot;, &quot;Survived&quot;, &quot;Total&quot;. The rows, for the variable &quot;Hospital&quot;: &quot;Hospital A&quot;, &quot;Hospital B&quot;, &quot; Total&quot;. Data will be given in &quot;Row,Column: Value&quot; format. 
Table for &quot;Patients severely ill:&quot; Hospital A, Died: 57; Hospital A, Survived: 1443; Hospital A, Total: 1500; Hospital B, Died: 8; Hospital B, Survived: 192; Hospital B, Total: 200; Total, Died: 65; Total, Survived: 1635; Total, Total: 1700; Table for &quot;Patients not severely ill:&quot; Hospital A, Died: 6; Hospital A, Survived: 594; Hospital A, Total: 600; Hospital B, Died: 8; Hospital B, Survived: 592; Hospital B, Total: 600; Total, Died: 14; Total, Survived: 1186; Total, Total: 1200;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation10.gif\" alt=\"The original two-way table has been split into two two-way tables, one for &quot;Patients severely ill&quot; and one for &quot;Patients not severely ill.&quot; Once again, here are the columns, for the variable &quot;Patient's Status&quot;: &quot;Died&quot;, &quot;Survived&quot;, &quot;Total&quot;. The rows, for the variable &quot;Hospital&quot;: &quot;Hospital A&quot;, &quot;Hospital B&quot;, &quot; Total&quot;. Data will be given in &quot;Row,Column: Value&quot; format. Table for &quot;Patients severely ill:&quot; Hospital A, Died: 57; Hospital A, Survived: 1443; Hospital A, Total: 1500; Hospital B, Died: 8; Hospital B, Survived: 192; Hospital B, Total: 200; Total, Died: 65; Total, Survived: 1635; Total, Total: 1700; Table for &quot;Patients not severely ill:&quot; Hospital A, Died: 6; Hospital A, Survived: 594; Hospital A, Total: 600; Hospital B, Died: 8; Hospital B, Survived: 592; Hospital B, Total: 600; Total, Died: 14; Total, Survived: 1186; Total, Total: 1200;\" \/><\/span><\/span><\/p>\n<p id=\"N10B4A\">As we can see, Hospital A\u00a0<em>did<\/em>\u00a0admit many more severely ill patients than Hospital B (1,500 vs. 200). 
In fact, from the way the totals were split, we see that in Hospital A, severely ill patients were a much higher proportion of the patients\u20141,500 out of a total of 2,100 patients. In contrast, only 200 out of 800 patients at Hospital B were severely ill. To better see the effect of including the lurking variable, we need to supplement each of the two new two-way tables with its conditional percentages:<span class=\"imagewrap\"><span class=\"image\"><img decoding=\"async\" class=\"img-responsive popimg aligncenter\" title=\"Two two-way tables with the same rows and columns as the previous two-way table, except the Total row has been removed. Here is the data in the same format: Table for &quot;Patients severely ill:&quot; Hospital A, Died: 3.8%; Hospital A, Survived: 96.2%; Hospital A, Total: 100%; Hospital B, Died: 4.0%; Hospital B, Survived: 96.0%; Hospital B, Total: 100%; Table for &quot;Patients not severely ill:&quot; Hospital A, Died: 1.0%; Hospital A, Survived: 99.0%; Hospital A, Total: 100%; Hospital B, Died: 1.3%; Hospital B, Survived: 98.7%; Hospital B, Total: 100%;\" src=\"https:\/\/oli.cmu.edu\/repository\/webcontent\/72712ec00a0001dc418a87e73e8ebb77\/_u2_summarizing_data\/_m2_examining_relationships\/webcontent\/causation11.gif\" alt=\"Two two-way tables with the same rows and columns as the previous two-way table, except the Total row has been removed. Here is the data in the same format: Table for &quot;Patients severely ill:&quot; Hospital A, Died: 3.8%; Hospital A, Survived: 96.2%; Hospital A, Total: 100%; Hospital B, Died: 4.0%; Hospital B, Survived: 96.0%; Hospital B, Total: 100%; Table for &quot;Patients not severely ill:&quot; Hospital A, Died: 1.0%; Hospital A, Survived: 99.0%; Hospital A, Total: 100%; Hospital B, Died: 1.3%; Hospital B, Survived: 98.7%; Hospital B, Total: 100%;\" \/><\/span><\/span><\/p>\n<p id=\"N10B56\">Note that despite our earlier finding that overall Hospital A has a higher death rate (3% vs. 
2%), when we take into account the lurking variable, we find that actually it is Hospital B that has the higher death rate both among the severely ill patients (4% vs. 3.8%) and among the not severely ill patients (1.3% vs. 1%).\u00a0<em>Thus, we see that adding a lurking variable can change the direction of an association.<\/em><\/p>\n<p id=\"N10B5C\">Whenever including a lurking variable causes us to rethink the direction of an association, this is called\u00a0<em>Simpson\u2019s paradox.<\/em><\/p>\n<p id=\"N10B62\">The possibility that a lurking variable can have such a dramatic effect is another reason we must adhere to the principle:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<table id=\"N10B66_bx\" class=\"theorem labeled\">\n<thead>\n<tr>\n<th>Principle<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\n<div class=\"theorem\">\n<div class=\"statement\">\n<p id=\"N10B6B\">Association\u00a0<em>does not<\/em>\u00a0imply causation!<\/p>\n<\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>It is\u00a0<em class=\"italic\">not<\/em> always the case that including a lurking variable makes us rethink the direction of the association. In the next example we will see how including a lurking variable just helps us gain a deeper understanding of the observed relationship.<\/p>\n<div class=\"examplewrap\">\n<div class=\"example clearfix\">\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<h3 class=\"textbox__title\">Example<\/h3>\n<\/header>\n<div class=\"textbox__content\">\n<h4>College Entrance Exams<\/h4>\n<div>\n<p>As discussed earlier, in the United States, the SAT is a widely used college entrance examination, required by the most prestigious schools. 
In some states, a different college entrance examination is prevalent, the ACT.<\/p>\n<p><iframe loading=\"lazy\" id=\"oembed-2\" title=\"Including a Lurking Variable\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube.com\/embed\/Nnj1YlqzkX4?feature=oembed&#38;rel=0&#38;rel=0\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n<div class=\"figurewrap\">\n<div class=\"figure clearfix\">\n<div class=\"youtube\"><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>The last two examples showed us that including a lurking variable in our exploration may<\/p>\n<ul>\n<li>Lead us to rethink the direction of an association (as in the Hospital\/Death Rate example).<\/li>\n<li>Help us to gain a deeper understanding of the relationship between variables (as in the SAT\/ACT example)<\/li>\n<\/ul>\n<div class=\"textbox textbox--exercises\">\n<header class=\"textbox__header\">\n<h3 class=\"textbox__title\">Did I get this?<\/h3>\n<\/header>\n<div class=\"textbox__content\">\n<div id=\"h5p-51\">\n<div class=\"h5p-iframe-wrapper\"><iframe id=\"h5p-iframe-51\" class=\"h5p-iframe\" data-content-id=\"51\" style=\"height:1px\" src=\"about:blank\" frameBorder=\"0\" scrolling=\"no\" title=\"3.4 Did I get this? 
2\"><\/iframe><\/div>\n<\/div>\n<\/div>\n<\/div>\n<h2><span title=\"Quick scroll up\">Let\u2019s Summarize<\/span><\/h2>\n<ul>\n<li>A\u00a0<em>lurking variable<\/em>\u00a0is a variable that was not included in your analysis, but that could substantially change your interpretation of the data if it were included.<\/li>\n<li>Because of the possibility of lurking variables, we adhere to the principle that\u00a0<em>association does not imply causation<\/em>.<\/li>\n<li>Including a lurking variable in our exploration may:\n<ul>\n<li>Help us to gain a deeper understanding of the relationship between variables.<\/li>\n<li>Lead us to rethink the direction of an association.<\/li>\n<\/ul>\n<\/li>\n<li>Whenever including a lurking variable causes us to rethink the direction of an association, this is an instance of\u00a0<em>Simpson\u2019s paradox<\/em>.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"author":150,"menu_order":5,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[48],"contributor":[],"license":[],"class_list":["post-489","chapter","type-chapter","status-publish","hentry","chapter-type-numberless"],"part":417,"_links":{"self":[{"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/pressbooks\/v2\/chapters\/489","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/wp\/v2\/users\/150"}],"version-history":[{"count":14,"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/pressbooks\/v2\/chapters\/489\/revisions"}],"predecessor-version":[{"id":853,"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/pressbooks\/v2\/chapters\/489\/revisions\/853"}],"part":[{"href":
"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/pressbooks\/v2\/parts\/417"}],"metadata":[{"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/pressbooks\/v2\/chapters\/489\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/wp\/v2\/media?parent=489"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/pressbooks\/v2\/chapter-type?post=489"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/wp\/v2\/contributor?post=489"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/pressbooks.ccconline.org\/mat1260\/wp-json\/wp\/v2\/license?post=489"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}