
At some point, we've all likely heard the cautionary assertion that correlation is not causation. It sounds reasonable, so we tend to accept it, but what does it really mean? And is it always true?

To answer these questions, we first need to understand what the terms mean and how they are distinguished from one another. Correlation is a mathematical representation that summarizes the measured association between variables. In simpler terms, it's a number between -1 and 1 that describes what happens to one variable (let's call this variable y) when another variable changes (let's call this one x). Causation takes correlation a bit further by demanding more from our variables than a basic association. Causation requires that at least part of the change we see in variable y is actually due to changes in variable x. In other words, a change in one variable has actually caused a change in the other, hence the term causal.

First, let's look at how correlations between variables can be misleading. The scatterplot in Fig. 1 shows simulated data from a sample of 50 elementary students, grades 1-6. The plot shows two variables for each student: a measure of shoe size along the x-axis (var.x) and performance on a common math test along the y-axis (var.y). Each point in the plot represents one student, placed where that student's shoe size and math score intersect. The association between these variables is clear: as shoe size (x) increases, so do math scores (y). There is a rather wide range in math scores at each shoe size, but this spread doesn't throw off the overall association, which is demonstrated by the linear increase of the blue line of best fit. To further reinforce this association, we can look at the calculated correlation statistic between shoe size and math performance [r(xy)=.74]. [If this statistic is unfamiliar, see Linear Association and Correlation.] This is a strong correlation, certainly something to take notice of, and it provides further evidence for the association between shoe size and math performance within our sample.


Correlation is a fundamental concept within statistics that, once understood, provides insight into more complex statistical models and ideas. From a conceptual standpoint, correlation summarizes the measured association between variables, meaning the extent to which the variables change together. Put another way, correlation is simply a measure of association.

The term measured association carries a lot of meaning here, so let's unpack it. To calculate the correlation between variables, we first have to measure those variables. Saying measured association rather than simply association is a hedge against the possibility that those measures are inaccurate and do not truly reflect the thing we intend to measure. Science is cautious, and the terms we use reflect that caution. The word association refers to how the data points of the variables trend relative to each other. If one goes up, does the other go up as well? Or does it go down? Or is a change in one not systematically related to a change in the other? Of course, this means that we need multiple data points across variables to determine correlation, but more on that in a minute.

Association as a concept is a single thing, but correlation as a measurement can take several forms. There are a variety of ways to calculate correlation, and the appropriate one depends on two important characteristics of the data. The first characteristic is the type of data being analyzed. Not all data is created equal. It comes in levels of measurement that are categorized from least to most detailed as nominal, ordinal, interval, and ratio. Nominal data is often something like a discrete category (e.g., Democrat, Republican, Libertarian, Independent), and ratio data is a continuous measurement where zero represents an absence of the variable (e.g., height, age). Ordinal and interval are somewhere between. The second characteristic is the trend within the data. Data comes in different types of distributions. Imagine having a list of test scores, placing them in order from lowest to highest, plotting them on a graph, and fitting a line that summarizes the trend of the data. That line may be straight (i.e., linear) or curved (i.e., non-linear). Correlation is calculated differently based on this trend within the data being analyzed.

When the data we have is at the interval or ratio level of measurement, and we expect the trend of the data to be essentially linear, correlation is measured with a statistic known as Pearson's r, or sometimes simply r. It is measured on a continuous decimal scale from -1 to 1 (Figure 1). The number we arrive at, called a correlation coefficient, tells us the magnitude (i.e., strength) of the association between the measured variables and the direction (i.e., positive or negative) of that association.
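For readers who want to see where the coefficient comes from, the following is a minimal sketch of Pearson's r computed from its standard definition: the summed products of the paired deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the small data lists are illustrative only and do not come from the figure.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of interval/ratio data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the paired deviations from the means move together.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return co_dev / math.sqrt(ss_x * ss_y)


# Illustrative data: the sign gives the direction, the size gives the magnitude.
perfect_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
perfect_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
strong_positive = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(perfect_positive, perfect_negative, strong_positive)
```

Values near 1 or -1 indicate that the points fall close to a straight line, while values near 0 indicate little linear association. The function raises an error when either variable is constant, because the denominator is then zero and r is undefined.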