Guide to Multiple Regression

Description

Multiple regression analysis is a statistical tool for understanding the relationship between two or more variables. Multiple regression involves a variable to be explained, called the dependent variable, and additional explanatory variables that are thought to produce or be associated with changes in the dependent variable. For example, a multiple regression analysis might estimate the effect of the number of years of work experience on salary. Salary would be the dependent variable to be explained; years of experience would be the explanatory variable.

Multiple regression analysis is sometimes well suited to the analysis of data about competing theories in which there are several possible explanations for the relationship among a number of explanatory variables. Multiple regression typically uses a single dependent variable and several explanatory variables to assess the statistical data pertinent to these theories. In a case alleging sex discrimination in salaries, for example, a multiple regression analysis would examine not only sex but also other explanatory variables of interest, such as education and experience. The employer-defendant might argue that salary is a function of the employee’s education and experience, and the employee-plaintiff might argue that salary is also a function of the individual’s sex, with both using multiple regression to evaluate which explanation is more nearly correct.

Multiple regression also may be useful (1) in determining whether a particular effect is present, (2) in measuring the magnitude of a particular effect, and (3) in forecasting what a particular effect would have been, but for an intervening event. In a patent infringement case, for example, a multiple regression analysis could be used to determine (1) whether the behavior of the alleged infringer affected the price of the patented product, (2) the size of the effect, and (3) what the price of the product would have been had the alleged infringement not occurred.
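The salary example above can be sketched numerically. The following is a minimal illustration of fitting a multiple regression of salary on experience, education, and a sex indicator by ordinary least squares; the employee data and variable names are hypothetical, invented purely for demonstration.

```python
import numpy as np

# Hypothetical employee data (purely illustrative, not from any real case):
# columns: constant, years of experience, years of education, sex (1 = female)
X = np.array([
    [1.0,  2.0, 12.0, 0.0],
    [1.0,  5.0, 16.0, 0.0],
    [1.0,  3.0, 14.0, 1.0],
    [1.0,  8.0, 16.0, 1.0],
    [1.0, 10.0, 18.0, 0.0],
    [1.0,  6.0, 12.0, 1.0],
])  # the leading column of ones lets the fit estimate an intercept
y = np.array([40.0, 55.0, 45.0, 58.0, 70.0, 48.0])  # salary, in $1,000s

# Ordinary least squares: choose b to minimize ||y - X b||^2.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b_exp, b_edu, b_sex = b

# Each coefficient estimates the change in salary associated with a one-unit
# change in that explanatory variable, holding the other variables constant.
print(f"intercept={intercept:.2f}, experience={b_exp:.2f}, "
      f"education={b_edu:.2f}, sex={b_sex:.2f}")
```

In a discrimination dispute of the kind described above, the parties would argue over whether the coefficient on the sex indicator is statistically and practically significant once education and experience are controlled for.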
Over the past several decades the use of regression analysis in court has grown widely. Although multiple regression analysis has been used most frequently in cases alleging sex and race discrimination and antitrust violations, other applications include census undercounts, voting rights, the study of the deterrent effect of the death penalty, and intellectual property. Multiple regression analysis can be a source of valuable scientific testimony in litigation. However, when inappropriately used, regression analysis can confuse important issues while having little, if any, probative value.

In EEOC v. Sears, Roebuck & Company, in which Sears was charged with discrimination against women in hiring practices, the Seventh Circuit acknowledged that “[m]ultiple regression analyses, designed to determine the effect of several independent variables on a dependent variable, which in this case is hiring, are an accepted and common method of proving disparate treatment claims.” However, the court affirmed the district court’s finding that the “E.E.O.C.’s regression analyses did not ‘accurately reflect Sears’ complex, nondiscriminatory decision-making processes’” and that the “‘E.E.O.C.’s statistical analyses [were] so flawed that they lack[ed] any persuasive value.’” Serious questions also have been raised about the use of multiple regression analysis in census undercount cases and in death penalty cases.

Moreover, in interpreting the results of a multiple regression analysis, it is important to distinguish between correlation and causality. Two variables are correlated when the events associated with the variables occur more frequently together than one would expect by chance. For example, if higher salaries are associated with a greater number of years of work experience, and lower salaries are associated with fewer years of experience, there is a positive correlation between the two variables.
However, if higher salaries are associated with less experience, and lower salaries are associated with more experience, there is a negative correlation between the two variables. A correlation between two variables does not imply that one event causes the second. Therefore, in making causal inferences, it is important to avoid spurious correlation. Spurious correlation arises when two variables are closely related but bear no causal relationship because they are both caused by a third, unexamined variable.

For example, there might be a negative correlation between the age of certain skilled employees of a computer company and their salaries. One should not conclude from this correlation that the employer has necessarily discriminated against the employees on the basis of their age. A third, unexamined variable, the level of the employees’ technological skills, could explain differences in productivity and, consequently, differences in salary. Or, consider a patent infringement damages case in which increased sales of an allegedly infringing product are associated with a lower price of the patented product. This correlation would be spurious if the two products have their own noncompetitive market niches and the lower price is due to a decline in the production costs of the patented product.

Causality cannot be inferred by data analysis alone; rather, one must infer that a causal relationship exists on the basis of an underlying causal theory that explains the relationship between the two variables. Even when an appropriate theory has been identified, causality can never be inferred directly; one must also look for empirical evidence that there is a causal relationship. Conversely, the presence of a nonzero correlation between two variables does not guarantee the existence of a causal relationship; it could be that the model does not reflect the correct interplay among the explanatory variables.
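The computer-company example can be made concrete with simulated data. The sketch below (all numbers and the causal mechanism are invented for illustration) generates data in which skill level drives both age and salary; age and salary are then strongly negatively correlated even though neither causes the other, and the correlation essentially vanishes once the common cause is controlled for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical mechanism: more recently trained (younger) workers have newer
# technological skills, and skill -- not age -- determines salary.
skill = rng.normal(0.0, 1.0, n)
age = 45.0 - 5.0 * skill + rng.normal(0.0, 2.0, n)    # younger workers more skilled
salary = 50.0 + 8.0 * skill + rng.normal(0.0, 2.0, n) # salary depends only on skill

# The raw correlation between age and salary is strongly negative...
raw = np.corrcoef(age, salary)[0, 1]

# ...but the partial correlation, holding skill constant, is near zero:
# regress each variable on skill and correlate the residuals.
def residualize(v, z):
    z1 = np.column_stack([np.ones(len(z)), z])
    coef, *_ = np.linalg.lstsq(z1, v, rcond=None)
    return v - z1 @ coef

partial = np.corrcoef(residualize(age, skill), residualize(salary, skill))[0, 1]
print(f"raw r = {raw:.2f}, partial r given skill = {partial:.2f}")
```

The spurious negative correlation disappears once the third variable is brought into the analysis, which is exactly the role the omitted skill variable plays in the text's example.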
In fact, the absence of correlation does not guarantee that a causal relationship does not exist. Rather, lack of correlation could occur if (1) there are insufficient data; (2) the data are measured inaccurately; (3) the data do not allow multiple causal relationships to be sorted out; or (4) the model is specified wrongly. There is a tension between any attempt to reach conclusions with near certainty and the inherently probabilistic nature of multiple regression analysis. In general, statistical analysis involves the formal expression of uncertainty in terms of probabilities. The fact that statistical analysis yields probabilities that relationships exist, rather than certainties, should not be seen in itself as an argument against the use of statistical evidence. The only alternative might be to use less reliable anecdotal evidence.

This chapter addresses a number of procedural and methodological issues that are relevant in considering the admissibility of, and weight to be accorded to, the findings of multiple regression analyses. It also suggests some standards of reporting and analysis that an expert presenting multiple regression analyses might be expected to meet. Section 2 discusses research design: how the multiple regression framework can be used to sort out alternative theories about a case. Section 3 concentrates on the interpretation of the multiple regression results, from both a statistical and a practical point of view. Section 4 briefly discusses the qualifications of experts. In section 5 the emphasis turns to procedural aspects associated with the use of the data underlying regression analyses. Finally, the Appendix delves into the multiple regression framework in further detail; it also contains a number of specific examples that illustrate the application of the technique.
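The first of the four points above, that insufficient data can mask a genuine relationship, is easy to demonstrate with simulated data (the effect size and sample sizes below are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_r(n):
    # A real but modest relationship: y depends weakly on x amid noise.
    x = rng.normal(0.0, 1.0, n)
    y = 0.3 * x + rng.normal(0.0, 1.0, n)
    return np.corrcoef(x, y)[0, 1]

small = [sample_r(10) for _ in range(5)]  # tiny samples: estimates scatter widely,
                                          # and may be near zero or even negative
large = sample_r(10_000)                  # large sample: estimate stabilizes
print([f"{r:+.2f}" for r in small], f"large-sample r = {large:+.2f}")
```

With only a handful of observations, the sample correlation can easily fail to reveal the true relationship, which is one reason the absence of observed correlation in a small data set is weak evidence against causation.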

Source Publication

Modern Scientific Evidence: The Law and Science of Expert Testimony

Source Editors/Authors

David L. Faigman, David H. Kaye, Michael J. Saks, Joseph Sanders

Publication Date

1997

Edition

1

Volume Number

1
