# Understanding correlation analysis

Note from the editor: This is the third article in our “Spotlight on Statistics” series, which aims to clarify statistical practices used in research articles.

As a nurse, you’re expected to make clinical decisions based on the evidence. Doing so requires an understanding of basic statistical procedures, so you can critically evaluate the statistical analyses used in research and draw conclusions about the accuracy and meaning of the results. This article discusses correlation: what it is, how to calculate it, and how to interpret the results of the calculations.

## What is correlation?

A statistical tool, correlation can be used with dichotomous (yes/no) outcomes. But its main use is in evaluating associations among ranked or continuous outcomes, which are unwieldy for relative-risk or odds-ratio analyses. In many situations, outcomes have values ranging from small to large. These outcomes may be ordered (as in “Rate your level of pain from 1 to 5”), or they may be numeric and continuous (as with age).

In general, correlation estimates the degree to which two variables relate to one another. The analysis can’t be used with nominal variables (those with number codes that aren’t in meaningful order) with three or more groups; for instance, type of delivery, coded as vaginal = 1, cesarean = 2, forceps = 3.

Much of the clinical research relevant to nursing explores whether a relationship exists between two patient characteristics. Understanding potentially related characteristics helps nurses better identify which physical, psychological, or demographic factors are associated with reason for concern. Here’s an example: During postpartum hospitalization, is maternal anxiety related to maternal age?

## Understanding correlation coefficients

Determining if the relationship between two numeric variables, such as maternal age and anxiety, is statistically significant can improve outcomes by helping clinicians understand who needs the most attention to avoid negative outcomes. A correlation coefficient results from a statistical test that demonstrates how two variables are connected, identifying the strength and direction of this relationship. If you plot the relationship between the scores, you can see how closely the two values are connected and what the direction is.

The scatter plot (available in the PDF version of this article) shows simulated values for the variables of maternal age (X) and anxiety level (Y). As you can see, as X increases, Y decreases. Maternal age ranges from 20 to 40 and anxiety scores range from 15 to 55. The five plotted scores fall approximately along a line on the graph; this reflects a strong linear relationship.

Results from calculating a correlation coefficient show whether the relationship is positive or negative. A positive relationship means that as one value increases, so does the other. A negative relationship means that as one value decreases, the other increases. In our example, as maternal age increases, the anxiety level decreases.

## Strength of the relationship

The strength of the relationship also can be interpreted from the value of the correlation coefficient. For example, a perfect relationship would have a value of +1.0 or –1.0 (a perfect positive or a perfect negative relationship). A perfect relationship is rare, but the closer the value is to +1.0 or –1.0, the stronger the relationship. When two variables are unrelated, the correlation coefficient is zero.

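Some authors use the following markers for strength of correlation (applied to the size of the coefficient, regardless of its sign):

• 0.1 to 0.3: small linear relationship (weak)
• 0.31 to 0.5: moderate linear relationship
• greater than 0.5: large linear relationship (strong)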
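As a rough illustration of how these markers might be applied, the sketch below labels a coefficient by its direction and size. The function name `describe_correlation` is hypothetical, and the labels used outside the 0.1 to 0.5 range ("negligible" below 0.1, "strong" above 0.5) are assumptions of this sketch rather than part of the markers quoted above.

```python
def describe_correlation(r):
    """Return (direction, strength) labels for a correlation coefficient r.

    Cut-offs follow the markers in the text for 0.1 to 0.5. The labels
    outside that range ("negligible", "strong") are assumptions.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")

    size = abs(r)  # strength depends on size only, not on the sign
    if size < 0.1:
        strength = "negligible"
    elif size <= 0.3:
        strength = "small (weak)"
    elif size <= 0.5:
        strength = "moderate"
    else:
        strength = "strong"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return direction, strength


# The simulated example in this article reports r = -.97.
print(describe_correlation(-0.97))
```

Cut-offs like these are conventions, so a label is best read alongside the coefficient itself and the sample size.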

A correlation coefficient (usually Pearson’s correlation, denoted as Pearson’s r) can be computed by hand or with a statistical package. The scatter plot of simulated data illustrates a strong linear relationship, and the hand calculation shown in the table (available in the PDF version of this article) confirms that the relationship is strongly negative (correlation coefficient r = –.97). The equation is shown in narrative form below the table.

For each subject, the distance from the mean is identified on each of the two variables being tested. The paired distances are multiplied together and summed, which captures whether the two variables move in the same or opposite directions. The distances are also squared (so that negatives and positives don’t cancel) and summed, and these sums scale the result to a number (r) between –1.0 and +1.0 that reflects the strength and direction of the relationship. In our simulated example, as maternal age increased, anxiety decreased. Correlation indicates a relationship, not causality, but the variables used can make a temporal ordering apparent: age is the independent variable and anxiety is the outcome (dependent variable), not the other way around.
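The hand calculation described above can be sketched in a few lines of Python. The five age and anxiety values below are hypothetical, chosen only to fall within the ranges mentioned for the scatter plot; they are not the values from the article's table, so the result will not match r = –.97 exactly. The function name `pearson_r` is likewise an assumption of this sketch.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from distances to the mean, mirroring a hand calculation."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Distance of each subject's score from the mean on each variable.
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]

    # Sum of paired products carries the direction of the relationship.
    sum_products = sum(a * b for a, b in zip(dx, dy))

    # Sums of squared distances scale the result to the range -1 to +1.
    sum_sq_x = sum(a * a for a in dx)
    sum_sq_y = sum(b * b for b in dy)
    if sum_sq_x == 0 or sum_sq_y == 0:
        raise ValueError("r is undefined when a variable has no variation")

    return sum_products / sqrt(sum_sq_x * sum_sq_y)


# Hypothetical data: NOT the values from the article's table.
maternal_age = [20, 25, 30, 35, 40]
anxiety_score = [55, 42, 38, 22, 15]

r = pearson_r(maternal_age, anxiety_score)
print(f"r = {r:.2f}")
```

A negative result close to –1.0 corresponds to the pattern seen in the scatter plot: older mothers in the simulated data have lower anxiety scores.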

## Real-world example

In an example from our own research, a randomized trial of 328 breastfeeding mothers (shown in the correlation matrix below), we set out to determine the relationship of maternal age and anxiety in breastfeeding women during the immediate postpartum period. In this sample of 325 (n = 3 missing either age or anxiety score), the relationship is similar to but less robust than in our simulated example. A weak but significant negative relationship exists between age and anxiety; as age increases, anxiety decreases.
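A weak correlation can still be statistically significant when the sample is large. The sketch below shows why, using the standard t statistic for a correlation coefficient, t = r × √((n − 2) / (1 − r²)). The value r = –0.15 is purely hypothetical (the coefficient from the study isn't restated here), and the p value uses a normal approximation to the t distribution, which is an assumption that is reasonable only for large samples.

```python
from math import erfc, sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    if n < 3:
        raise ValueError("at least 3 paired observations are needed")
    if abs(r) >= 1:
        raise ValueError("r must be strictly between -1 and +1")
    return r * sqrt((n - 2) / (1 - r * r))


def approx_two_sided_p(t):
    """Two-sided p value from a normal approximation (large samples only)."""
    return erfc(abs(t) / sqrt(2))


hypothetical_r = -0.15  # illustrative value, not the study's result

for n in (30, 325):
    t = correlation_t_statistic(hypothetical_r, n)
    p = approx_two_sided_p(t)
    print(f"n = {n}: t = {t:.2f}, approximate p = {p:.3f}")
```

With the same hypothetical coefficient, the larger sample yields a much larger t statistic, which is why significance alone says little about the strength of a relationship.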

To use this finding as evidence for practice, you need to determine its relevance and risk. For instance, it may warrant more concern about and efforts to prevent postpartum depression in younger women.

The examples in this article deal with two numeric continuous variables, and we’ve discussed the most commonly used parametric statistic, Pearson’s r. To study two ordinal variables (such as age-group categories and perceived anxiety ranked on a scale of 1 to 5), we’d use the nonparametric alternative, Spearman’s rho. Another reason to focus on correlation is that the basic idea of correlation underlies multiple regression analysis.
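For ordinal data, Spearman’s rho can be thought of as Pearson’s r computed on ranks rather than raw scores. The sketch below follows that idea, giving tied values the average of the ranks they share. The age-group codes and anxiety ratings are hypothetical, and the function names are assumptions of this sketch; a statistical package would normally do this for you.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson's r from distances to the mean."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical ordinal data: age group (1 = youngest) and anxiety rated 1 to 5.
age_group = [1, 1, 2, 2, 3, 3]
anxiety_rating = [5, 4, 4, 3, 2, 1]

print(f"rho = {spearman_rho(age_group, anxiety_rating):.2f}")
```

Because only the ordering of the values matters, rho is unchanged if the ratings are relabeled in any way that preserves their order.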