Comparing continuous data across groups is paramount in research and hospital operations. Naively assuming that any observed difference between 2 groups implies that the groups are truly different ignores 2 important considerations: (1) natural variation occurs in almost every process; and (2) we must gauge how confident we can be that the differences are real and not just due to chance. In the present article, we describe statistical procedures used to perform a comparison across 2 groups.
At some point in your career, you will likely want to compare data across ≥2 groups. In fact, this function is paramount to the research you read in this journal (eg, as shown in the first table given in almost any article) and to the day-to-day operations within hospitals. You may, for example, be asked to compare factors such as length of stay, mortality rates, productivity, or utilization of a specific drug between your hospital and another hospital. You may also be asked to compare some measure within your own institution at different time points, perhaps before and after a specific intervention. Naively assuming that any difference in outcome implies that the groups are truly different is risky. Doing so ignores 2 important considerations: (1) natural variation occurs in almost every process; and (2) we must gauge how confident we can be that the differences are real and not just due to chance. This is where statistics comes in.
In this article, we focus our attention on comparing the center of continuous data across 2 groups, but the ideas are generalizable to >2 groups. We outline here a series of important steps that you can take to identify the type of test you need and how to interpret the results. Throughout these steps, we consider a sample data set to compare the birth weight and charges for 187 female children and 227 male children with a principal diagnosis of septicemia of the newborn (International Classification of Diseases, Ninth Revision, code 771.81) from the 2009 Kids' Inpatient Database (KID; Healthcare Cost and Utilization Project, Agency for Healthcare Research and Quality).1 KID contains an unweighted sample of 3.4 million hospital discharges for children ages 0 to 20 years from all community, nonrehabilitation hospitals in 44 states, regardless of payer. Data quality and reliability are jointly assured through the Healthcare Cost and Utilization Project and participating states and health care institutions. Discharges in which the birth weight was unavailable were excluded.
Because we are comparing the birth weights of female and male subjects in this population, the underlying hypothesis that we are testing is the null hypothesis of no difference (H0: µFemales = µMales) versus any difference (H1: µFemales ≠ µMales). The goal is to determine if we have enough evidence in the data to reject the null hypothesis in favor of the alternative.
Step 1: Know What Kind of Data You Have
The first step in determining if observed differences in outcomes are real or due to random chance is to know what kind of data you have. Although there are several different types of data (and more sophisticated definitions), our focus is on 2 of the most common types in the hospital setting: continuous and categorical. Broadly speaking, continuous data are data that can typically take on any numerical value within a range. Some examples include cost, age, and hours worked per day. These data are, in many ways, the more challenging type to analyze and the focus of the present article. In our next article, we will focus on categorical data, which are data that can be put into nonoverlapping categories or groups. Examples of categorical data include gender, payer, disposition, and receipt of a specific drug.
Step 2: How Many Groups Are You Comparing?
If you are doing a comparison, you clearly have at least 2 groups. Two groups may be all you ever need to compare: “pre” versus “post,” drug A versus drug B, or us versus them. However, there may be times when you have >2 groups. Suppose, for example, that you want to compare the average length of stay for patients with asthma across 3 different attending physicians at your hospital. In this situation, you would need a different type of statistical test to determine if there are differences across the 3 groups; this form of analysis is beyond the scope of the present article, however.
Step 3: If the Data Are Continuous, Know How Your Data Look
All statistical tests have assumptions underlying them to make them valid. One of the assumptions for continuous data (that is often overlooked) is how the data are distributed: normal (ie, bell shaped) or nonnormal. Although there are tests to determine if your data are normal (eg, the Shapiro-Wilk test2), 1 method is just to “eyeball” the data from a histogram. If you have a statistical program, you can even request that a normal distribution be overlaid on the histogram to assist your visual determination. Fig 1 presents some sample histograms. It appears that birth weight (Fig 1A) is fairly normal, but total charges (Fig 1B) are highly skewed right (ie, a few observations are much higher than the rest). You should be cautious of using any statistical test on the total charges data that relies on the data being from a normal distribution.
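For readers without a statistical package, the "eyeball" check can be roughly approximated with a few lines of code. The following is a minimal sketch in Python that uses only the standard library; the function names (`sample_skewness`, `text_histogram`) are our own illustrations, and the skewness statistic is a simple numerical complement to the histogram, not a substitute for the Shapiro-Wilk test.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: near 0 for symmetric data, > 0 for a long right tail.

    Assumes the values are not all identical (otherwise the variance is 0).
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def text_histogram(values, bins=8, width=40):
    """Return a crude text histogram with one line per equal-width bin."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0  # avoid a zero bin width if all values match
    counts = [0] * bins
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / step), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        bar = "#" * round(width * count / peak)
        lines.append(f"{lo + i * step:>12.1f} | {bar} {count}")
    return "\n".join(lines)
```

Calling `print(text_histogram(birth_weights))` on a list of values gives a quick picture of the shape. A skewness near 0 is consistent with a roughly symmetric distribution such as the one in Fig 1A, whereas a clearly positive value flags a long right tail such as the one in Fig 1B.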
One method of trying to fix up the charge data is to transform the data by taking the natural log (Fig 2). The distribution of the transformed data looks much more normal, but trying to explain differences between 2 groups on a natural log-transformed scale can become confusing. When in doubt, it is always safer to assume that your data are not normally distributed. Most of the time in health care, they are not. You won’t lose much statistical power, and you won’t lose any sleep at night because you made the wrong decision.
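One way to make a log-scale comparison easier to explain is to note that a difference in the means of log-transformed values back-transforms to a ratio of geometric means (GM, the exponentiated mean of the natural logs). The short derivation below uses hypothetical notation of our own ($x_{A,i}$ and $x_{B,j}$ for the charges in groups A and B), not symbols taken from the figures or tables.

```latex
\begin{align*}
\overline{\ln x}_A - \overline{\ln x}_B
  &= \frac{1}{n_A}\sum_{i=1}^{n_A} \ln x_{A,i}
   - \frac{1}{n_B}\sum_{j=1}^{n_B} \ln x_{B,j} \\
  &= \ln \mathrm{GM}_A - \ln \mathrm{GM}_B
   = \ln\!\left(\frac{\mathrm{GM}_A}{\mathrm{GM}_B}\right),
\qquad
\mathrm{GM} = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right).
\end{align*}
```

For example, a difference of 0.10 on the natural log scale corresponds to a ratio of $e^{0.10} \approx 1.105$; that is, the geometric mean in group A is about 10.5% higher than in group B.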
Step 4: Summarize Your Data
Although summarizing the data is not really a necessary step in the process for comparing the center of 2 distributions, it is an important thing to remember when you are presenting your results. If you feel comfortable from Step 3 that your data are normal, you can summarize the data within the groups by using means with SDs or 95% confidence intervals. However, if you have nonnormal data, it is always best to use medians and quartiles (ie, the 25th and 75th percentiles) to avoid the influence of outliers in the data. There are other options if you really prefer means, with the added advantage of being more sensitive to changes than the median. The more common robust (ie, performs well with nonnormal distributions) means include the following: (1) the geometric mean, which can be calculated by taking the natural log of your data, calculating the mean, and then exponentiating the result; (2) a trimmed mean,3 which excludes any value beyond a specified threshold (eg, below the fifth or higher than the 95th percentiles [ie, 10% trimmed mean]); or (3) the Winsorized mean, which does not exclude values below the fifth or higher than the 95th percentiles but replaces them with the fifth or 95th percentile (ie, the 10% Winsorized mean). Table 1 describes these various measures of the center for the charge data. It is noteworthy that the mean is much larger than the other measures because it is influenced by the high outliers in our data.
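The 3 robust means described above can be sketched in a few lines of Python by using only the standard library. This is an illustration under stated assumptions rather than the procedure used to produce Table 1: the function names are our own, and the trimming and Winsorizing are done by counting observations in each tail (5% per tail by default), which is a common convention but may differ slightly from percentile definitions used by a given statistical package.

```python
import math
import statistics


def geometric_mean(values):
    """Exponentiated mean of the natural logs; all values must be > 0."""
    return math.exp(statistics.fmean([math.log(v) for v in values]))


def trimmed_mean(values, percent_per_tail=5):
    """Mean after dropping the lowest and highest percent_per_tail% of values."""
    ordered = sorted(values)
    n = len(ordered)
    k = n * percent_per_tail // 100  # number of observations cut from each tail
    return statistics.fmean(ordered[k:n - k])


def winsorized_mean(values, percent_per_tail=5):
    """Mean after replacing each tail with the nearest retained value."""
    ordered = sorted(values)
    n = len(ordered)
    k = n * percent_per_tail // 100
    low, high = ordered[k], ordered[n - k - 1]  # boundary values that are kept
    clipped = [min(max(v, low), high) for v in ordered]
    return statistics.fmean(clipped)
```

With the made-up values 1 through 19 plus a single outlier of 1000, the ordinary mean is 59.5, whereas the 10% trimmed and 10% Winsorized means are both 10.5, which illustrates how strongly one extreme charge can pull the mean.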
Step 5: Pick Your Test
Finally, you are ready to think about a statistical test. The Supplemental Figure displays a flowchart to help you select the appropriate test based on your earlier answers. Suppose we want to know if differences in birth weight exist based on gender. Using the histogram in Fig 3, it looks like male infants generally weigh a little more, but we should confirm it with a t test because we believe we have normally distributed data and 2 groups. (As a side note, the t test was introduced in a 1908 article in Biometrika by a chemist at the Guinness brewery named William Gosset.4 Because he was not permitted to publish under his own name, Gosset published his work under the pseudonym “Student.”)
Step 6: Interpreting the Result
Once a statistical test is performed, it generally yields a value of the test statistic and an associated P value. The P value provides a measure of how much evidence we have to reject the null hypothesis. For the tests discussed here, the null hypothesis is that the groups have the same central tendency (ie, mean or median). The alternative hypothesis would be that the groups are different. Generally, if P < .05, we reject the null hypothesis in favor of the alternative. Figure 4 presents the SAS version 9.4 (SAS Institute, Inc, Cary, NC) output for the birth weight example comparing female and male infants, but the t test can also be performed in Excel by using the T.TEST function (Microsoft Corporation, Redmond, WA).
In Fig 4A, we note that female subjects have a mean weight of 3033.4 g compared with 3248.9 g for male subjects. The version of the t test used to compare the means depends on whether the 2 groups have a similar spread (ie, variance). This question is tested in Fig 4B. The P value for this test is large (P = .447), so we have no evidence that the variances differ, and we can use the pooled version of the t test in Fig 4C as opposed to the Satterthwaite version, which was designed for unequal variance. For the pooled version, P = .007 indicates strong evidence against the null hypothesis of equal means in birth weight between the genders. In other words, we are very confident that true differences in birth weight exist between male and female newborns and that these observed differences are more than just random chance.
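To make the pooled and Satterthwaite versions concrete, the sketch below computes both t statistics, their degrees of freedom, and the ratio of the larger to the smaller sample variance by hand in Python (standard library only). It is not a reproduction of the SAS output in Fig 4. The function names are our own, and the P value helper is an assumption on our part: it uses a normal approximation that is reasonable only when the degrees of freedom are large (as with 187 and 227 infants), whereas statistical software uses the t distribution itself.

```python
import math
import statistics
from statistics import NormalDist


def two_sample_t(group_a, group_b):
    """Return pooled and Satterthwaite t statistics with their degrees of freedom."""
    n1, n2 = len(group_a), len(group_b)
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)

    # Pooled version: one common variance estimated from both groups.
    df_pooled = n1 + n2 - 2
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / df_pooled
    t_pooled = diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Satterthwaite version: each group keeps its own variance.
    se_sq = v1 / n1 + v2 / n2
    t_satt = diff / math.sqrt(se_sq)
    df_satt = se_sq ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

    return {
        "difference": diff,
        "variance_ratio": max(v1, v2) / min(v1, v2),  # near 1 suggests similar spread
        "t_pooled": t_pooled,
        "df_pooled": df_pooled,
        "t_satterthwaite": t_satt,
        "df_satterthwaite": df_satt,
    }


def approx_two_sided_p(t_stat):
    """Large-sample normal approximation to the two-sided P value (assumption)."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))
```

When the 2 groups have similar variances and similar sizes, the 2 versions give nearly identical t statistics, which is why the choice between them rarely changes the conclusion in such cases.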
There is a lot of other interesting information in the output from SAS (including the estimated difference of –215.5 g), but we have done our duty to compare the means. You may be wondering what would have happened if we had not assumed that the data were normally distributed and used a Wilcoxon rank-sum5 test instead. In this example, we would have come to the same conclusion because the result of that test was P = .002.
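The idea behind the Wilcoxon rank-sum test is to replace the observed values with their ranks in the combined sample and then ask whether 1 group holds more than its share of the high (or low) ranks. The sketch below is a minimal Python illustration using only the standard library. The function names are our own, and it relies on the large-sample normal approximation with a correction for ties but without a continuity correction, so its P values may differ slightly from those reported by statistical software.

```python
import math
from statistics import NormalDist


def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return (rank sum of group_a, z statistic, approximate two-sided P value)."""
    n1, n2 = len(group_a), len(group_b)
    n = n1 + n2
    combined = list(group_a) + list(group_b)
    ranks = midranks(combined)
    w = sum(ranks[:n1])           # observed rank sum for the first group
    expected = n1 * (n + 1) / 2   # rank sum expected under the null hypothesis

    # Ties reduce the variance of the rank sum.
    tie_counts = {}
    for v in combined:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))

    z = (w - expected) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p_value
```

Because the test uses only ranks, a single extreme value (eg, 1 very large charge) has no more influence than any other observation that happens to be the largest, which is what makes the test attractive for skewed data.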
Comparing continuous data from 2 groups is common in research, and the comparisons are typically presented in the first table of the article. The results of the statistical tests allow the reader to assess baseline differences between the groups and guide the multivariable modeling that needs to occur. These statistical tests can easily be performed with the few steps outlined here and some basic software.
FINANCIAL DISCLOSURE: The authors have indicated they have no financial relationships relevant to this article to disclose.
FUNDING: No external funding.
POTENTIAL CONFLICT OF INTEREST: The authors have indicated they have no potential conflicts of interest to disclose.
- Introduction to the HCUP KIDS' Inpatient Database (KID) 2009. Healthcare Cost and Utilization Project. 2011. Available at: www.hcup-us.ahrq.gov/db/nation/kid/kid_2009_introduction.jsp. Accessed November 13, 2015
- Shapiro SS, Wilk MB
- Wilcox RR
- Raju TN
Copyright © 2016 by the American Academy of Pediatrics