# Glossary of Statistical Terms

Chapter:
Study Design and Statistics

PRINTED FROM AMA MANUAL OF STYLE ONLINE (www.amamanualofstyle.com). © American Medical Association, 2009. All Rights Reserved. Under the terms of the license agreement, an individual user may print out a PDF of a single chapter of a title in AMA Manual of Style Online for personal use (for details see Privacy Policy).

UPDATE: The terms multivariable and multivariate are not synonymous, as the entries in the Glossary of Statistical Terms suggest (Chapter 20.9, page 881 in the print). To be accurate, multivariable refers to multiple predictors (independent variables) for a single outcome (dependent variable). Multivariate refers to 1 or more independent variables for multiple outcomes. This update was implemented June 1, 2014.
UPDATE: We will discontinue using quotation marks to identify parts of an article, but retain the capitalization; eg, This is discussed in the Methods section (not the “Methods” section). This change was made February 14, 2013.
UPDATE: Although our style manual recommends (Section 20.9, page 888 in the print) that "[expressing] P to more than 3 significant digits does not add useful information to P<.001," in certain types of studies (particularly GWAS [genome-wide association studies] and other studies in which there are adjustments for multiple comparisons, such as Bonferroni correction, and the definition of level of significance is substantially less than P<.05) it may be important to express P values to more significant digits. For example, if the threshold of significance is P<.0004, then by definition the P value must be expressed to at least 4 digits to indicate whether a result is statistically significant. GWAS express P values to very small numbers, using scientific notation. If a manuscript you are editing defines statistical significance as a P value substantially less than .05, possibly even using scientific notation to express P values to very small numbers, it is best to retain the values as the author presents them. This change was made August 16, 2011.

In the glossary that follows, terms defined elsewhere in the glossary are printed in this font. An arrowhead (→) indicates points to consider in addition to the definition. For detailed discussion of these terms, the referenced texts and the resource list at the end of the chapter are useful sources.

Eponymous names for statistical procedures often differ from one text to another (eg, the Newman-Keuls and Student-Newman-Keuls test). The names provided in this glossary follow the Dictionary of Statistical Terms38 published for the International Statistical Institute. Although statistical texts use the possessive form for most eponyms, the possessive form for eponyms is not used in JAMA and the Archives Journals (see 16.0, Eponyms).

Most statistical tests are applicable only under specific circumstances, which are generally dictated by the scale properties of both the independent variable and the dependent variable. Table 3 presents a guide to selection of commonly used statistical techniques. This table is not meant to be exhaustive but rather to indicate the appropriate applications of commonly used statistical techniques.

Table 3. Selection of Commonly Used Statistical Techniques [a]

| | Interval scale [b] | Ordinal scale | Nominal scale [c] |
| --- | --- | --- | --- |
| 2 Treatment groups | Unpaired t test | Mann-Whitney rank sum test | χ2 Analysis-of-contingency table; Fisher exact test if ≤6 in any cell |
| ≥3 Treatment groups | Analysis of variance | Kruskal-Wallis statistic | χ2 Analysis-of-contingency table; Fisher exact test if ≤6 in any cell |
| Before and after 1 treatment in same individual | Paired t test | Wilcoxon signed rank test | McNemar test |
| Multiple treatments in same individual | Repeated-measures analysis of variance | Friedman statistic | Cochran Q |
| Association between 2 variables | Linear regression and Pearson product moment correlation | Spearman rank correlation | Contingency coefficients |

[a] Adapted with permission from Glantz, Primer of Biostatistics.39 © The McGraw-Hill Companies, Inc.

[b] Assumes normally distributed data. If data are not normally distributed, then rank the observations and use the methods for data measured on an ordinal scale.

[c] For a nominal dependent variable that is time dependent (such as mortality over time), use life-table analysis for nominal independent variables and Cox regression for continuous and/or nominal independent variables.

• abscissa: horizontal or x-axis of a graph.

• absolute risk: probability of an event occurring during a specified period. The absolute risk equals the relative risk times the average probability of the event during the same time, if the risk factor is absent.40(p327) See absolute risk reduction.

• absolute risk reduction: proportion in the control group experiencing an event minus the proportion in the intervention group experiencing an event. The inverse of the absolute risk reduction is the number needed to treat. See absolute risk.
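The arithmetic behind the absolute risk reduction and the number needed to treat can be sketched as follows. This is a minimal illustration only; the function names and the trial counts are hypothetical and do not come from any cited study.

```python
def absolute_risk_reduction(control_events, control_n, treated_events, treated_n):
    """Proportion with the event in the control group minus the
    proportion with the event in the intervention group."""
    return control_events / control_n - treated_events / treated_n


def number_needed_to_treat(arr):
    """Reciprocal of the absolute risk reduction."""
    if arr == 0:
        raise ValueError("NNT is undefined when the absolute risk reduction is 0")
    return 1 / arr


# Hypothetical trial: 20 of 100 controls and 10 of 100 treated
# participants experienced the event.
arr = absolute_risk_reduction(20, 100, 10, 100)
nnt = number_needed_to_treat(arr)
print(f"ARR = {arr:.2f}, NNT = {nnt:.0f}")
```

In this hypothetical trial, about 10 patients would need to be treated to prevent 1 event.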

• accuracy: ability of a test to produce results that are close to the true measure of the phenomenon.40(p327) Generally, assessing accuracy of a test requires that there be a criterion standard with which to compare the test results. Accuracy encompasses a number of measures including reliability, validity, and lack of bias.

• actuarial life-table method: see life table, Cutler-Ederer method.

• adjustment: statistical techniques used after the collection of data to adjust for the effect of known or potential confounding variables.40(p327) A typical example is adjusting a result for the independent effect of age of the participants (age is the independent variable).

• aggregate data: data accumulated from disparate sources.

• agreement: statistical test performed to determine the equivalence of the results obtained by 2 tests when one test is compared with another (one of which is usually but not always a criterion standard).

→ Agreement should not be confused with correlation. Correlation is used to test the degree to which changes in a variable are related to changes in another, whereas agreement tests whether 2 variables are equivalent. For example, an investigator compares results obtained by 2 methods of measuring hematocrit. Method A gives a result that is always exactly twice that of method B. The correlation between A and B is perfect since A is always twice B, but the agreement is very poor; method A is not equivalent to method B (written communication, George W. Brown, MD, September 1993). One appropriate way to assess agreement has been described by Bland and Altman.41
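The hematocrit example can be reproduced numerically. The sketch below uses made-up values in which method A is always exactly twice method B, computes the Pearson correlation, and then computes the mean difference with limits of agreement (mean difference ± 1.96 SD of the differences), which is the summary commonly associated with the Bland-Altman approach. The function names and data are illustrative assumptions.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def limits_of_agreement(x, y):
    """Mean difference (bias) and the 95% limits of agreement."""
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


# Hypothetical hematocrit readings (%): method A is always twice method B.
method_b = [20.0, 21.0, 22.5, 24.0, 25.0]
method_a = [2 * v for v in method_b]

r = pearson_r(method_a, method_b)
bias, lower, upper = limits_of_agreement(method_a, method_b)
print(f"r = {r:.3f}; mean difference = {bias:.1f} (limits {lower:.1f} to {upper:.1f})")
```

The correlation is perfect, yet the mean difference of more than 20 percentage points shows that the 2 methods are far from equivalent.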

• algorithm: systematic process carried out in an ordered, typically branching sequence of steps; each step depends on the outcome of the previous step.42(p6) An algorithm may be used clinically to guide treatment decisions for an individual patient on the basis of the patient’s clinical outcome or result.

• α (alpha), α level: size of the likelihood acceptable to the investigators that a relationship observed between 2 variables is due to chance (the probability of a type I error); usually α = .05. If α = .05, P < .05 will be considered significant.

• analysis: process of mathematically summarizing and comparing data to confirm or refute a hypothesis. Analysis serves 3 functions: (1) to test hypotheses regarding differences in large populations based on samples of the populations, (2) to control for confounding variables, and (3) to measure the size of differences between groups or the strength of the relationship between variables in the study.40(p25)

• analysis of covariance (ANCOVA): statistical test used to examine data that include both continuous and nominal independent variables and a continuous dependent variable. It is basically a hybrid of multiple regression (used for continuous independent variables) and analysis of variance (used for nominal independent variables).40(p299)

• analysis of residuals: see linear regression.

• analysis of variance (ANOVA): statistical method used to compare a continuous dependent variable and more than 1 nominal independent variable. The null hypothesis in ANOVA is tested by means of the F test.

In 1-way ANOVA there is a single nominal independent variable with 2 or more levels (eg, age categorized into strata of 20 to 39 years, 40 to 59 years, and 60 years and older). When there are only 2 mutually exclusive categories for the nominal independent variable (eg, male or female), the 1-way ANOVA is equivalent to the t test.

A 2-way ANOVA is used if there are 2 independent variables (eg, age strata and sex), a 3-way ANOVA if there are 3 independent variables, etc. If more than 1 nonexclusive independent variable is analyzed, the process is called factorial ANOVA, which assesses the main effects of the independent variables as well as their interactions. An analysis of main effects in the 2-way ANOVA above would assess the independent effects of age group or sex; an association between female sex and systolic blood pressure that exists in one age group but not another would mean that an interaction between age and sex exists. In a factorial 3-way ANOVA with independent variables A, B, and C, there is one 3-way interaction term (A × B × C), 3 different 2-way interaction terms (A × B, A × C, and B × C), and 3 main effect terms (A, B, and C). A separate F test must be computed for each different main effect and interaction term.

If repeated measures are made on an individual (such as measuring blood pressure over time) so that a matched form of analysis is appropriate, but potentially confounding factors (such as age) are to be controlled for simultaneously, repeated-measures ANOVA is used. Randomized-block ANOVA is used if treatments are assigned by means of block randomization.40(pp291-295)

→ An ANOVA can establish only whether a significant difference exists among groups, not which groups are significantly different from each other. To determine which groups differ significantly, a pairwise analysis of a continuous dependent variable and more than 1 nominal variable is performed by a procedure such as the Newman-Keuls test or Tukey test, as well as many others. These multiple comparison procedures avoid the potential of a type I error that might occur if the t test were applied at this stage. Such comparisons may also be computed through the use of orthogonal contrasts.

→ The F ratio is the statistical result of ANOVA and is a number between 0 and infinity. The F ratio is compared with tables of the F distribution, taking into account the α level and degrees of freedom (df) for the numerator and denominator, to determine the P value.

Example: The difference was found to be significant by 1-way ANOVA ($F_{2,63}$ = 61.07; P < .001).43

The dfs are provided along with the F statistic. The first subscript (2) is the df for the numerator; the second subscript (63) is the df for the denominator. The P value can be obtained from an F statistic table that provides the P value that corresponds to a given F and df. In practice, however, the P value is generally calculated by a computerized algorithm. Because ANOVA does not determine which groups are significantly different from each other, this example would normally be accompanied by the results of the multiple comparisons procedure.43 Other models such as Latin square may also be used.
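To show where the F ratio and its 2 df values come from, the following sketch computes a 1-way ANOVA by hand for 3 small hypothetical groups. It stops at the F ratio and df; obtaining the P value would require the F distribution (from a table or statistical software), which is not implemented here. The function name and data are assumptions for illustration.

```python
import statistics


def one_way_anova(groups):
    """Return (F ratio, numerator df, denominator df) for a 1-way ANOVA."""
    k = len(groups)                        # number of groups
    n_total = sum(len(g) for g in groups)  # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)

    df_num = k - 1        # df for the numerator
    df_den = n_total - k  # df for the denominator
    f_ratio = (ss_between / df_num) / (ss_within / df_den)
    return f_ratio, df_num, df_den


# Hypothetical measurements in 3 groups of 3 participants each.
f_ratio, df_num, df_den = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F({df_num},{df_den}) = {f_ratio:.2f}")
```

By the same rule, df values of 2 and 63 would arise from 3 groups with a total of 66 observations.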

• ANCOVA: see analysis of covariance.

• ANOVA: see analysis of variance.

• Ansari-Bradley dispersion test: rank test to determine whether 2 distributions known to be of identical shape (but not necessarily of normal distribution) have equal parameters of scale.35(p6)

• area under the curve (AUC): technique used to measure the performance of a test plotted on a receiver operating characteristic (ROC) curve or to measure drug clearance in pharmacokinetic studies.42(p12) When measuring test performance, the larger the AUC, the better the test performance. When measuring drug clearance, the AUC assesses the total exposure of the individual, as measured by levels of the drug in blood or urine, to a drug over time. The curve of drug clearance used to calculate the AUC is also used to calculate the drug half-life.

→ The method used to determine the AUC should be specified (eg, the trapezoidal rule).
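As an illustration of the trapezoidal rule mentioned above, the sketch below sums the areas of the trapezoids formed by successive measurements. The sampling times and drug concentrations are hypothetical, and the function name is an assumption.

```python
def trapezoidal_auc(times, values):
    """Area under the curve by the trapezoidal rule.

    `times` must be strictly increasing and the same length as `values`.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least 2 paired observations")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        if width <= 0:
            raise ValueError("times must be strictly increasing")
        # Each segment contributes width x average height.
        area += width * (values[i] + values[i - 1]) / 2
    return area


# Hypothetical drug concentrations (mg/L) at 0, 1, 2, 4, and 8 hours.
times_h = [0, 1, 2, 4, 8]
conc_mg_l = [0.0, 4.0, 3.0, 1.5, 0.5]

auc = trapezoidal_auc(times_h, conc_mg_l)
print(f"AUC(0-8 h) = {auc:.1f} mg·h/L")
```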

• artifact: difference or change in measure of occurrence of a condition that results from the way the disease or condition is measured, sought, or defined.40(p327)

Example: An artifactual increase in the incidence of AIDS was expected because the definition of AIDS was changed to include a larger number of AIDS-defining illnesses.

• assessment: in the statistical sense, evaluating the outcome(s) of the study and control groups.

• assignment: process of distributing individuals to study and control groups. See also randomization.

• association: statistically significant relationship between 2 variables in which one does not necessarily cause the other. When 2 variables are measured simultaneously, association rather than causation generally is all that can be assessed.

Example: After confounding factors were controlled for by means of multivariate regression, a significant association remained between age and disease prevalence.

• attributable risk: disease that can be attributed to a given risk factor; conversely, if the risk factor were eliminated entirely, the amount of the disease that could be eliminated.40(pp327-328) Attributable risk assumes a causal relationship (ie, the factor to be eliminated is a cause of the disease and not merely associated with the disease). See attributable risk percentage and attributable risk reduction.

• attributable risk percentage: the percentage of risk associated with a given factor among those with the risk factor.40(pp327-328) For example, risk of stroke in an older person who smokes and has hypertension and no other risk factors can be divided among the risks attributable to smoking, hypertension, and age. Attributable risk percentage is often determined for a population and is the percentage of the disease related to the risk factor. See population attributable risk percentage.

• attributable risk reduction: the number of events that can be prevented by eliminating a particular risk factor from the population. Attributable risk reduction is a function of 2 factors: the strength of the association between the risk factor and the disease (ie, how often the risk factor causes the disease) and the frequency of the risk factor in the population (ie, a common risk factor may have a lower attributable risk in an individual than a less common risk factor, but could have a higher attributable risk reduction because of the risk factor’s high prevalence in the population). Attributable risk reduction is a useful concept for public health decisions. See also attributable risk.

• average: the central tendency of a number of measurements. This is often used synonymously with mean, but can also imply the median, mode, or some other statistic. Thus, the word should generally be avoided in favor of a more precise term.

• Bayesian analysis: theory of statistics involving the concept of prior probability, conditional probability or likelihood, and posterior probability.38(p16) For interpreting studies, the prior probability is based on previous studies and may be informative, or, if no studies exist or those that exist are not useful, one may assume a uniform prior. The study results are then incorporated with the prior probability to obtain a posterior probability. Bayesian analysis can be used to interpret how likely it is that a positive result indicates presence of a disease, by incorporating the prevalence of the disease in the population under study and the sensitivity and specificity of the test in the calculation.

→ Bayesian analysis has been criticized because the weight that a particular study is given when prior probability is calculated can be a subjective decision. Nonetheless, the process most closely approximates how studies are considered when they are incorporated into clinical practice. When Bayesian analysis is used to assess posterior probability for an individual patient in a clinic population, the process may be less subjective than usual practice because the prior probability, equal to the prevalence of the disease in the clinic population, is more accurate than if the prevalence for the population at large were used.32
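The diagnostic-test use of Bayesian analysis described above can be sketched with Bayes theorem: the prior probability is the prevalence, and the posterior probability is the chance of disease given a positive result. The sensitivity, specificity, and prevalence values below are hypothetical, as is the function name.

```python
def posterior_probability(prevalence, sensitivity, specificity):
    """Probability of disease given a positive test result (Bayes theorem)."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


# Same hypothetical test (sensitivity 90%, specificity 95%) applied with
# 2 different priors: the general population and a specialty clinic.
general = posterior_probability(prevalence=0.01, sensitivity=0.90, specificity=0.95)
clinic = posterior_probability(prevalence=0.20, sensitivity=0.90, specificity=0.95)
print(f"General population: {general:.1%}; clinic population: {clinic:.1%}")
```

The same positive result is far more informative in the clinic population, which is why the choice of prior probability matters.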

• β (beta), β level: probability of showing no significant difference when a true difference exists; a false acceptance of the null hypothesis.42(p57) One minus β is the statistical power of the test to detect a true difference; the smaller the β, the greater the power. A value of .20 for β is equal to .80 or 80% power. A β of .1 or .2 is most frequently used in power calculations. The β error is synonymous with type II error.43

• bias: a systematic situation or condition that causes a result to depart from the true value in a consistent direction. Bias refers to defects in study design (often selection bias) or measurement.40(p328) One method to reduce measurement bias is to ensure that the investigator measuring outcomes for a participant is unaware of the group to which the participant belongs (ie, blinded assessment).

• bimodal distribution: nonnormal distribution with 2 peaks, or modes. The mean and median may be equivalent, but neither will describe the data accurately. A population composed entirely of schoolchildren and their grandparents might have a mean age of 35 years, although everyone in the population would in fact be either much younger or much older.

• binary variable: variable that has 2 mutually exclusive subgroups, such as male/female or pregnant/not pregnant; synonym for dichotomous variable.44(p75)

• binomial distribution: probability with 2 possible mutually exclusive outcomes; used for modeling cumulative incidence and prevalence rates42(p17) (for example, the probability of a person having a stroke in a given population over a given period; the outcome must be stroke or no stroke). In a binomial sample with a probability p of the event and n number of participants, the predicted mean is n × p and the predicted variance is n × p × (1 − p).
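As a check on the standard binomial formulas (mean n × p, variance n × p × (1 − p)), the sketch below builds the full probability distribution for a hypothetical sample and computes the mean and variance directly from it. The values of n and p are illustrative assumptions.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k events among n participants."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical sample: 50 participants, each with a 10% probability of stroke.
n, p = 50, 0.1
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"mean = {mean:.2f} (n*p = {n * p:.2f})")
print(f"variance = {variance:.2f} (n*p*(1-p) = {n * p * (1 - p):.2f})")
```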

• biological plausibility: evidence that an independent variable can be expected to exert a biological effect on a dependent variable with which it is associated. For example, studies in animals were used to establish the biological plausibility of adverse effects of passive smoking.

• bivariable analysis: see bivariate analysis.

• bivariate analysis: used when 1 dependent and 1 independent variable are to be assessed.40(p263) Common examples include the t test for 1 continuous variable and 1 binary variable and the χ2 test for 2 binary variables. Bivariate analyses can be used for hypothesis testing in which only 1 independent variable is taken into account, to compare baseline characteristics of 2 groups, or to develop a model for multivariate regression. See also univariate and multivariate analysis.

→ Bivariate analysis is the simplest form of hypothesis testing but is often used incorrectly, either because it is used too frequently, resulting in an increased likelihood of a type I error, or because tests that assume a normal distribution (eg, the t test) are applied to nonnormally distributed data.

• Bland-Altman plot: a method to assess agreement (eg, between 2 tests) developed by Bland and Altman.41

• blinded (masked) assessment: evaluation or categorization of an outcome in which the person assessing the outcome is unaware of the treatment assignment. Masked assessment is the term preferred by some investigators and journals, particularly those in ophthalmology.

→ Blinded assessment is important to prevent bias on the part of the investigator performing the assessment, who may be influenced by the study question and consciously or unconsciously expect a certain test result.

• blinded (masked) assignment: assignment of individuals participating in a prospective study (usually random) to a study group and a control group without the investigator or the participants being aware of the group to which they are assigned. Studies may be single-blind, in which either the participant or the person administering the intervention does not know the treatment assignment, or double-blind, in which neither knows the treatment assignment. The term triple-blinded is sometimes used to indicate that the persons who analyze or interpret the data are similarly unaware of treatment assignment. Authors should indicate who exactly was blinded. The term masked assignment is preferred by some investigators and journals, particularly those in ophthalmology.

• block randomization: type of randomization in which the unit of randomization is not the individual but a larger group, sometimes stratified on particular variables such as age or severity of illness to ensure even distribution of the variable between randomized groups.

• Bonferroni adjustment: one of several statistical adjustments to the P value that may be applied when multiple comparisons are made. The α level (usually .05) is divided by the number of comparisons to determine the α level that will be considered statistically significant. Thus, if 10 comparisons are made, an α of .05 would become α = .005 for the study. Alternatively, the P value may be multiplied by the number of comparisons, while retaining the α of .05.44(pp31-32) For example, a P value of .02 obtained for 1 of 10 comparisons would be multiplied by 10 to get the final result of P = .20, a nonsignificant result.

→ The Bonferroni test is a conservative adjustment for large numbers of comparisons (ie, less likely than other methods to give a significant result) but is simple and used frequently.
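Both forms of the Bonferroni adjustment (dividing α, or multiplying each P value) can be sketched in a few lines. The function names and the list of P values are hypothetical; capping adjusted P values at 1 is a common convention assumed here, not something stated in the entry.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison alpha level after Bonferroni adjustment."""
    return alpha / n_comparisons


def bonferroni_adjust(p_values):
    """Multiply each P value by the number of comparisons (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical P values from 10 comparisons.
p_values = [0.02, 0.001, 0.30, 0.04, 0.15, 0.60, 0.008, 0.25, 0.45, 0.09]

adjusted_alpha = bonferroni_alpha(0.05, len(p_values))
adjusted_p = bonferroni_adjust(p_values)

for raw, adj in zip(p_values, adjusted_p):
    verdict = "significant" if adj < 0.05 else "not significant"
    print(f"P = {raw:.3f} -> adjusted P = {adj:.3f} ({verdict})")
```

The 2 forms lead to the same decisions: comparing a raw P value with α divided by the number of comparisons is equivalent to comparing the multiplied P value with the original α.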

• bootstrap method: statistical method for validating a new diagnostic parameter in the same group from which the parameter was derived. Thus, the validation of the method is based on a simulated sample, rather than a new sample. The parameter is first derived from the entire group, then applied sequentially to subsegments of the group to see whether the parameter performs as well for the subgroups as it does for the entire group (derived from “pulling oneself up by one’s own bootstraps”).42(p32)

For example, a number of prognostic indicators are measured in a cohort of hospitalized patients to predict mortality. To determine whether the model using the indicators is equally predictive of mortality for subsegments of the group, the bootstrap method is applied to the subsegments and confidence intervals are calculated to determine the predictive ability of the model. The jackknife dispersion test also uses the same sample for both derivation and validation.

→ Although the preferable means for validating a model is to apply the model to a new sample (eg, a new cohort of hospitalized patients in the previous example), the bootstrap method can be used to reduce the time, effort, and expense necessary to complete the study. However, the bootstrap method provides less assurance than validation in a new sample that the model is generalizable to another population.

• Brown-Mood procedure: test used with a regression model that does not assume a normal distribution or common variance of the errors.38(p26) It is an extension of the median test.

• C statistic: a measure of the area under a receiver operating characteristic curve.

• case: in a study, an individual with the outcome or disease of interest.

• case-control study: retrospective study in which individuals with the disease (cases) are compared with those who do not have the disease (controls). Cases and controls are identified without knowledge of exposure to the risk factors under study. Cases and controls are matched on certain important variables, such as age, sex, and year in which the individual was treated or identified. A case-control study on individuals already enrolled in a cohort study is referred to as a nested case-control study.42(p111) This type of case-control study may be an especially strong study design if characteristics of the cohort have been carefully ascertained. See also 20.3.2, Observational Studies, Case-Control Studies.

→ Cases and controls should be selected from the same population to minimize confounding by factors other than those under study. Matching cases and controls on too many characteristics may obscure the association of interest, because if cases and controls are too similar, their exposures may be too similar to detect a difference (see overmatching).

• case-fatality rate: probability of death among people diagnosed as having a disease. The rate is calculated as the number of deaths during a specific period divided by the number of persons with the disease at the beginning of the period.44(p38)

• case series: retrospective descriptive study in which clinical experience with a number of patients is described. See 20.3.3, Observational Studies, Case Series.

• categorical data: counts of members of a category or class; for the analysis each member or item should fit into only 1 category or class38(p29) (eg, sex or race/ethnicity). The categories have no numerical significance. Categorical data are summarized by proportions, percentages, fractions, or simple counts. Categorical data is synonymous with nominal data.

• cause, causation: something that brings about an effect or result; to be distinguished from association, especially in cohort studies. To establish something as a cause it must be known to precede the effect. The concept of causation includes the contributory cause, the direct cause, and the indirect cause.

• censored data: censoring has 2 different statistical connotations: (1) data in which extreme values are reassigned to some predefined, more moderate value; (2) data in which values have been assigned to individuals for whom the actual value is not known, such as in survival analyses for individuals who have not experienced the outcome (usually death) at the time the data collection was terminated.

The term left-censored data means that data were censored from the low end or left of the distribution; right-censored data come from the high end or right of the distribution42(p26) (eg, in survival analyses). For example, if data for falls are categorized as individuals who have 0, 1, or 2 or more falls, falls exceeding 2 have been right-censored.

• central limit theorem: theorem that states that the mean of a number of samples with variances that are not large relative to the entire sample will increasingly approximate a normal distribution as the sample size increases. This is the basis for the importance of the normal distribution in statistical testing.38(p30)

• central tendency: property of the distribution of data, usually measured by mean, median, or mode.42(p41)

• χ2 test (chi-square test): a test of significance based on the χ2 statistic, usually used for categorical data. The observed values are compared with the expected values under the assumption of no association. The χ2 goodness-of-fit test compares the observed with expected frequencies. The χ2 test can also compare an observed variance with hypothetical variance in normally distributed samples.38(p33) In the case of a continuous independent variable and a nominal dependent variable, the χ2 test for trend can be used to determine whether a linear relationship exists (for example, the relationship between systolic blood pressure and stroke).40(pp284-285)

→ The P value is determined from χ2 tables with the use of the specified α level and the df calculated from the number of cells in the χ2 table. The χ2 statistic should be reported to no more than 1 decimal place; if the Yates correction was used, that should be specified. See also contingency table.

Example: The exercise intervention group was least likely to have experienced a fall in the previous month ($\chi^2_3$ = 17.7, P = .02).

Note that the df for $\chi^2_3$ is specified using a subscript 3; it is derived from the numbers of rows and columns in the χ2 table, as (rows − 1) × (columns − 1) (for example, a 2 × 4 table has 3 df, whereas a 2 × 2 table has 1 df). The value 17.7 is the χ2 value. The P value is determined from the χ2 value and df.

Results of the χ2 test may be biased if there are too few observations (generally 5 or fewer) per cell. In this case, the Fisher exact test is preferred.
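The χ2 statistic for a contingency table can be computed by comparing each observed count with the count expected under no association (row total × column total ÷ grand total), with df equal to (rows − 1) × (columns − 1). The sketch below is a minimal illustration with hypothetical counts; it does not compute the P value, and the small-cell check uses observed counts, following the wording above (many texts apply the rule to expected counts instead).

```python
def chi_square(table):
    """Return (chi-square statistic, df) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the assumption of no association.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


def has_sparse_cell(table, threshold=5):
    """True if any observed cell count is at or below the threshold."""
    return any(cell <= threshold for row in table for cell in row)


# Hypothetical 2 x 2 table: rows are study groups, columns are fall / no fall.
table = [[10, 20], [30, 40]]
statistic, df = chi_square(table)
print(f"chi-square = {statistic:.1f}, df = {df}, sparse cell: {has_sparse_cell(table)}")
```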

• choropleth map: map of a region or country that uses shading to display quantitative data.42(p28) See also 4.2.3, Visual Presentation of Data, Figures, Maps.

• chunk sample: subset of a population selected for convenience without regard to whether the sample is random or representative of the population.38(p32) A synonym is convenience sample.

• Cochran Q test: method used to compare percentage results in matched samples (see matching), often used to test whether the observations made by 2 observers vary in a systematic manner. The analysis results in a Q statistic, which, with the df, determines the P value; if significant, the variation between the 2 observers cannot be explained by chance alone.38(p25) See also interobserver bias.

• coefficient of determination: square of the correlation coefficient, used in linear or multiple regression analysis. This statistic indicates the proportion of the variation of the dependent variable that can be predicted from the independent variable.40(p328) If the analysis is bivariate, the correlation coefficient is indicated as r and the coefficient of determination is r2. If the correlation coefficient is derived from multivariate analysis, the correlation coefficient is indicated as R and the coefficient of determination is R2. See also correlation coefficient.

Example: The sum of the R2 values for age and body mass index was 0.23. [Twenty-three percent of the variance could be explained by those 2 variables.]

→ When R2 values of the same dependent variable total more than 1.0 or 100%, then the independent variables have an interactive effect on the dependent variable.

• coefficient of variation: ratio of the standard deviation (SD) to the mean. The coefficient of variation is expressed as a percentage and is used to compare dispersions of different samples. The smaller the coefficient of variation, the greater the precision.43 The coefficient of variation is also used when the SD is dependent on the mean (eg, the increase in height with age is accompanied by an increasing SD of height in the population).
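A minimal sketch of the calculation, using the sample SD and hypothetical repeated measurements (the function name is an assumption):

```python
import statistics


def coefficient_of_variation(values):
    """Ratio of the sample SD to the mean, expressed as a percentage."""
    return statistics.stdev(values) / statistics.fmean(values) * 100


# Hypothetical repeated measurements of the same quantity by 2 assays.
assay_1 = [10.0, 12.0, 14.0]
assay_2 = [11.5, 12.0, 12.5]

cv_1 = coefficient_of_variation(assay_1)
cv_2 = coefficient_of_variation(assay_2)
print(f"Assay 1 CV = {cv_1:.1f}%; assay 2 CV = {cv_2:.1f}%")
```

Both assays have the same mean, but the second has the smaller coefficient of variation and therefore the greater precision.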

• cohort: a group of individuals who share a common exposure, experience, or characteristic, or a group of individuals followed up or traced over time in a cohort study.38(p31)

• cohort effect: change in rates that can be explained by the common experience or characteristic of a group or cohort of individuals. A cohort effect implies that a current pattern of variables may not be generalizable to a different cohort.38(p328)

Example: The decline in socioeconomic status with age was a cohort effect explained by fewer years of education among the older individuals.

• cohort study: study of a group of individuals, some of whom are exposed to a variable of interest (eg, a drug treatment or environmental exposure), in which participants are followed up over time to determine who develops the outcome of interest and whether the outcome is associated with the exposure. Cohort studies may be concurrent (prospective) or nonconcurrent (retrospective).40(pp328-329) See also 20.3.1, Observational Studies, Cohort Studies.

→ Whenever possible, a participant’s outcome should be assessed by individuals who do not know whether the participant was exposed (see blinded assessment).

• concordant pair: pair in which both individuals have the same trait or outcome (as opposed to discordant pair). Used frequently in twin studies.42(p35)

• conditional probability: probability that an event E will occur given the occurrence of F, called the conditional probability of E given F. The converse is not necessarily true: the probability of E given F may not be equal to the probability of F given E.44(p55)
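
The asymmetry can be seen with simple counts. The sketch below is illustrative only: the counts and the helper name `conditional_probability` are hypothetical.

```python
from fractions import Fraction

def conditional_probability(n_both, n_given):
    """P(A given B) = (number with both A and B) / (number with B)."""
    return Fraction(n_both, n_given)

# Hypothetical counts: E = has the disease, F = has a positive test result
n_e = 40       # individuals with E
n_f = 120      # individuals with F
n_both = 30    # individuals with both E and F

p_e_given_f = conditional_probability(n_both, n_f)  # 30/120
p_f_given_e = conditional_probability(n_both, n_e)  # 30/40
print(p_e_given_f, p_f_given_e)
```

The same 30 individuals appear in both numerators, but the denominators differ, so the 2 conditional probabilities are not equal.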

• confidence interval (CI): range of numerical expressions within which one can be confident (usually 95% confident, to correspond to an α level of .05) that the population value the study is intended to estimate lies.40(p329) The CI is an indication of the precision of an estimated population value.

→ Confidence intervals used to estimate a population value usually are symmetric or nearly symmetric around a value, but CIs used for relative risks and odds ratios may not be. Confidence intervals are preferable to P values because they convey information about precision as well as statistical significance of point estimates.

→ Confidence intervals are expressed with a hyphen separating the 2 values. To avoid confusion, the word to replaces hyphens if one of the values is a negative number. Units that are closed up with the numeral are repeated for each CI; those not closed up are repeated only with the last numeral. See also 20.8, Significant Digits and Rounding Numbers, and 19.4, Numbers and Percentages, Use of Digit Spans and Hyphens.

Example: The odds ratio was 3.1 (95% CI, 2.2-4.8). The prevalence of disease in the population was 1.2% (95% CI, 0.8%-1.6%).
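
A minimal sketch of how such an interval might be computed and formatted follows. It assumes the normal-approximation (Wald) interval for a proportion, which is only one of several available methods; the function names `wald_ci` and `format_ci` and the counts are hypothetical. The formatter applies the convention described above: a hyphen between the values, the word to when either value is negative, and closed-up units repeated with each value.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(events, n, level=0.95):
    """Normal-approximation CI for a proportion (an assumed, simple method)."""
    p = events / n
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% CI
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def format_ci(lower, upper, unit="", level=95, digits=1):
    """Format a CI; use 'to' instead of a hyphen when a value is negative."""
    separator = " to " if lower < 0 or upper < 0 else "-"
    return f"{level}% CI, {lower:.{digits}f}{unit}{separator}{upper:.{digits}f}{unit}"

# Hypothetical data: 120 cases observed among 10 000 individuals
low, high = wald_ci(120, 10_000)
reported = format_ci(low * 100, high * 100, unit="%")
print(f"1.2% ({reported})")
```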

• confidence limits (CLs): upper and lower boundaries of the confidence interval, expressed with a comma separating the 2 values.42(p35)

Example: The mean (95% confidence limits) was 30% (28%, 32%).

• confounding: (1) a situation in which the apparent effect of an exposure on risk is caused by an association with other factors that can influence the outcome; (2) a situation in which the effects of 2 or more causal factors as shown by a set of data cannot be separated to identify the unique effects of any of them; (3) a situation in which the measure of the effect of an exposure on risk is distorted because of the association of exposure with another factor(s) that influences the outcome under study.42(p35) See also confounding variable.

• confounding variable: variable that can cause or prevent the outcome of interest, is not an intermediate variable, and is associated with the factor under investigation. Unless it is possible to adjust for confounding variables, their effects cannot be distinguished from those of the factors being studied. Bias can occur when adjustment is made for any factor that is caused in part by the exposure and also is correlated with the outcome.25(p35) Multivariable analysis is used to control the effects of confounding variables that have been measured.

• contingency coefficient: the coefficient C (Note: not to be confused with the C statistic), used to measure the strength of association between 2 characteristics in a contingency table.44(pp56-57)

• contingency table: table created when categorical variables are used to calculate expected frequencies in an analysis and to present data, especially for a χ² test (2-dimensional data) or log-linear models (data with at least 3 dimensions). A 2 × 3 contingency table has 2 rows and 3 columns. The df are calculated as (number of rows − 1)(number of columns − 1). Thus, a 2 × 3 contingency table has 6 cells and 2 df.
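
The df rule and the expected frequencies can be sketched as follows. The table values and function names are hypothetical; the expected frequency of each cell is taken, as is conventional for a χ² test, to be (row total × column total) / grand total.

```python
def contingency_df(table):
    """df = (number of rows - 1)(number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

def expected_frequencies(table):
    """Expected count per cell: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical 2 x 3 table of observed counts
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

df = contingency_df(observed)
expected = expected_frequencies(observed)
print(df, expected)
```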

• continuous data: data with an unlimited number of equally spaced values.40(p329) There are 2 kinds of continuous data: ratio data and interval data. Ratio-level data have a true 0, and thus numbers can meaningfully be divided by one another (eg, weight, systolic blood pressure, cholesterol level). For instance, 75 kg is half as heavy as 150 kg. Interval data may be measured with a similar precision but lack a true 0 point. Thus, 32°C is not half as warm as 64°C, although temperature may be measured on a precise continuous scale. Continuous data include more information than categorical, nominal, or dichotomous data. Use of parametric statistics requires that continuous data have a normal distribution, or that the data can be transformed to a normal distribution (eg, by computing logarithms of the data).

• contributory cause: independent variable (cause) that is thought to contribute to the occurrence of the dependent variable (effect). That a cause is contributory should not be assumed unless all of the following have been established: (1) an association exists between the putative cause and effect, (2) the cause precedes the effect in time, and (3) altering the cause alters the probability of occurrence of the effect.40(p329) Other factors that may contribute to establishing a contributory cause include the concept of biological plausibility, the existence of a dose-response relationship, and consistency of the relationship when evaluated in different settings.

• control: in a case-control study, the designation for an individual without the disease or outcome of interest; in a cohort study, the individuals not exposed to the independent variable of interest; in a randomized controlled trial, the group receiving a placebo or standard treatment rather than the intervention under study.

• controlled clinical trial: study in which a group receiving an experimental treatment is compared with a control group receiving a placebo or an active treatment. See also 20.2.1, Randomized Controlled Trials, Parallel-Design Double-blind Trials.

• convenience sample: sample of participants selected because they were available for the researchers to study, not because they are necessarily representative of a particular population.

→ Use of a convenience sample limits generalizability and can confound the analysis depending on the source of the sample. For instance, in a study comparing cardiac auscultation, echocardiography, and cardiac catheterization, the patients studied, simply by virtue of their having undergone cardiac catheterization and echocardiography, likely are not comparable to an unselected population.

• correlation: description of the strength of an association among 2 or more variables, each of which has been sampled by means of a representative or naturalistic method from a population of interest.40(p329) The strength of the association is described by the correlation coefficient. See also agreement. There are many reasons why 2 variables may be correlated, and thus correlation alone does not prove causation.

→ The Kendall τ rank correlation test is used when testing 2 ordinal variables, the Pearson product moment correlation is used when testing 2 normally distributed continuous variables, and the Spearman rank correlation is used when testing 2 nonnormally distributed continuous variables.43

→ Correlation is often depicted graphically by means of a scatterplot of the data (see Example F4 in 4.2.1, Visual Presentation of Data, Figures, Statistical Graphs). The more circular a scatterplot, the smaller the correlation; the more linear a scatterplot, the greater the correlation.

• correlation coefficient: measure of the association between 2 variables. The coefficient falls between -1 and 1; the sign indicates the direction of the relationship and the number the magnitude of the relationship. A positive sign indicates that the 2 variables increase or decrease together; a negative sign indicates that increases in one are associated with decreases in the other. A value of 1 or -1 indicates that the sample values fall in a straight line, while a value of 0 indicates no relationship. The correlation coefficient should be followed by a measure of the significance of the correlation, and the statistical test used to measure correlation should be specified.

Example: Body mass index increased with age (Pearson r = 0.61; P < .001); years of education decreased with age (Pearson r = -0.48; P = .01).

→ When 2 variables are compared, the correlation coefficient is expressed by r; when more than 2 variables are compared by multivariable analysis, the correlation coefficient is expressed by R. The symbol r² or R² is termed the coefficient of determination and indicates the amount of variation in the dependent variable that can be explained by knowledge of the independent variable.
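
For illustration, a Pearson r and the corresponding coefficient of determination can be computed directly from the definition. This is a sketch with hypothetical data and a hypothetical function name; it does not include the accompanying significance test, which should also be reported.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product moment correlation between 2 equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2   # coefficient of determination
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```

Here r is about 0.77, so about 60% of the variation in y is explained by x. Values that fall exactly on a descending straight line would give r = −1.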

• cost-benefit analysis: economic analysis that compares the costs accruing to an individual for some treatment, process, or procedure and the ensuing medical consequences, with the benefits of reduced loss of earnings resulting from prevention of death or premature disability. The cost-benefit ratio is the ratio of marginal benefit (financial benefit of preventing 1 case) to marginal cost (cost of preventing 1 case).42(p38) See also 20.5, Cost-effectiveness Analysis, Cost-Benefit Analysis.

• cost-effectiveness analysis: comparison of strategies to determine which provides the most clinical value for the cost.43 A preferred intervention is the one that will cost the least for a given result or be the most effective for a given cost.30(pp38-39) Outcomes are expressed by the cost-effectiveness ratio, such as cost per year of life saved. See also 20.5, Cost-effectiveness Analysis, Cost-Benefit Analysis.

• cost-utility analysis: form of economic evaluation in which the outcomes of alternative procedures are expressed in terms of a single utility-based measurement, most often the quality-adjusted life-year (QALY).42(p39)

• covariates: variables that may mediate or confound the relationship between the independent and dependent variables. Because patterns of covariates may differ systematically between groups in a trial or observational study, their effect should be accounted for during the analysis. This can be accomplished in a number of ways, including analysis of covariance, multiple regression, stratification, or propensity matching.

• Cox-Mantel test: method for comparing 2 survival curves that does not assume a particular distribution of data,44(p63) similar to the log-rank test.45(p113)

• Cox proportional hazards regression model (Cox proportional hazards model): in survival analysis, a procedure used to determine relationships between survival time and treatment and prognostic independent variables such as age.37(p290) The hazard function is modeled on the set of independent variables, and the model assumes that the effect of those variables on the hazard is independent of time. Estimates depend only on the order in which events occur, not on the times they occur.44(p64) Authors should generally indicate that they have tested the proportionality assumption of the Cox model, which assumes that the ratio of the hazards between groups is similar at all points in time. The proportionality assumption would not be met, for instance, if one group experienced an early surge in mortality while the other group did not. In this case, the ratio of the hazards would be different early vs late during the time of follow-up.

• criterion standard: test considered to be the diagnostic standard for a particular disease or condition, used as a basis of comparison for other (usually noninvasive) tests. Ideally, the sensitivity and specificity of the criterion standard for the disease should be 100%. (A commonly used synonym, gold standard, is considered jargon by some.42(p70)) See also diagnostic discrimination.

• Cronbach α: index of the internal consistency of a test,44(p65) which assesses the correlation between the total score across a series of items and the comparable score that would have been obtained had a different series of items been used.42(p39) The Cronbach α is often used for psychological tests.
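
As a hedged illustration, one common computational form is α = [k/(k − 1)] × (1 − sum of the item variances / variance of the total score), where k is the number of items. This formula is a standard one but is not given in the entry above; the function name and the item scores below are hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order."""
    k = len(items)
    if k < 2:
        raise ValueError("at least 2 items are required")
    totals = [sum(scores) for scores in zip(*items)]   # total score per respondent
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical scores of 4 respondents on 3 items
items = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [2, 1, 4, 3],
]

alpha = cronbach_alpha(items)
print(f"Cronbach alpha = {alpha:.2f}")
```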

• cross-design synthesis: method for evaluating outcomes of medical interventions, developed by the US General Accounting Office, which pools results from databases of randomized controlled trials and other study designs. It is a form of meta-analysis (see 20.4, Meta-analysis).42(p39)

• crossover design: method of comparing 2 or more treatments or interventions. Individuals initially are randomized to one treatment or the other; after completing the first treatment they are crossed over to 1 or more other randomization groups and undergo other courses of treatment being tested in the experiment. Advantages are that a smaller sample size is needed to detect a difference between treatments, since a paired analysis is used to compare the treatments in each individual, but the disadvantage is that an adequate washout period is needed after the initial course of treatment to avoid carryover effect from the first to the second treatment. Order of treatments should be randomized to avoid potential bias.44(pp65-66) See 20.2.2, Randomized Controlled Trials, Crossover Trials.

• cross-sectional study: study that identifies participants with and without the condition or disease under study and the characteristic or exposure of interest at the same point in time.40(p329)

→ Causality is difficult to establish in a cross-sectional study because the outcome of interest and associated factors are assessed simultaneously.

• crude death rate: total deaths during a year divided by the midyear population. Deaths are usually expressed per 100 000 persons.44(p66)

• cumulative incidence: number of people who experience onset of a disease or outcome of interest during a specified period; may also be expressed as a rate or ratio.42(p40)

• Cutler-Ederer method: form of life-table analysis that uses actuarial techniques. The method assumes that the times at which follow-up ended (because of death or the outcome of interest) are uniformly distributed during the time period, as opposed to the Kaplan-Meier method, which assumes that termination of follow-up occurs at the end of the time block. Therefore, Cutler-Ederer estimates of risk tend to be slightly higher than Kaplan-Meier estimates.40(p308) Often an intervention and control group are depicted on 1 graph and the curves are compared by means of a log-rank test. This is also known as the actuarial life-table method.

• cut point: in testing, the arbitrary level at which “normal” values are separated from “abnormal” values, often selected at the point 2 SDs from the mean. See also receiver operating characteristic curve.42(p40)

• data: collection of items of information.42(p42) (Datum, the singular form of this word, is rarely used.)

• data dredging (aka “fishing expedition”): jargon meaning post hoc analysis, with no a priori hypothesis, of several variables collected in a study to identify variables that have a statistically significant association for purposes of publication.

→ Although post hoc analyses occasionally can be useful to generate hypotheses, data dredging increases the likelihood of a type I error and should be avoided. If post hoc analyses are performed, they should be declared as such and the number of post hoc comparisons performed specified.
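
The inflation of type I error can be sketched numerically. The calculation below assumes that the comparisons are independent and that each is tested at the same α level, which is a simplification; the function name is hypothetical.

```python
def familywise_error_rate(n_comparisons, alpha=0.05):
    """Probability of at least 1 false-positive finding among independent tests."""
    return 1 - (1 - alpha) ** n_comparisons

for k in (1, 5, 20):
    print(k, round(familywise_error_rate(k), 3))

rate_20 = familywise_error_rate(20)
```

Under these assumptions, 20 post hoc comparisons at α = .05 carry roughly a 64% chance of at least 1 spurious "significant" result, which is why the number of comparisons performed should be reported.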

• decision analysis: process of identifying all possible choices and outcomes for a particular set of decisions to be made regarding patient care. Decision analysis generally uses preexisting data to estimate the likelihood of occurrence of each outcome. The process is displayed as a decision tree, with each node depicting a branch point representing a decision in treatment or intervention to be made (usually represented by a square at the branch point), or possible outcomes or chance events (usually represented by a circle at the branch point). The relative worth of each outcome may be expressed as a utility, such as the quality-adjusted life-year.42(p44) See Figure 2.

• degrees of freedom (df): see df.

• dependent variable: outcome variable of interest in any study; the outcome that one intends to explain or estimate40(p329) (eg, death, myocardial infarction, or reduction in blood pressure). Multivariable analysis controls for independent variables or covariates that might modify the occurrence of the dependent variable (eg, age, sex, and other medical diseases or risk factors).

• descriptive statistics: method used to summarize or describe data with the use of the mean, median, SD, SE, or range, or to convey in graphic form (eg, by using a histogram, shown in Example F5 in 4.2.1, Visual Presentation of Data, Figures, Statistical Graphs) for purposes of data presentation and analysis.44(p73)

• df (degrees of freedom) (df is not expanded at first mention): the number of arithmetically independent comparisons that can be made among members of a sample. In a contingency table, df is calculated as (number of rows − 1)(number of columns − 1).

→ The df should be reported as a subscript after the related statistic, such as the t test, analysis of variance, and χ² test (eg, $\chi^2_3$ = 17.7, P = .02; in this example, the subscript 3 is the number of df).

• diagnostic discrimination: statistical assessment of how the performance of a clinical diagnostic test compares with the criterion standard. To assess a test’s ability to distinguish an individual with a particular condition from one without the condition, the researcher must (1) determine the variability of the test, (2) define a population free of the disease or condition and determine the normal range of values for that population for the test (usually the central 95% of values, but in tests that are quantitative rather than qualitative, a receiver operating characteristic curve may be created to determine the optimal cut point for defining normal and abnormal), and (3) determine the criterion standard for a disease (by definition, the criterion standard should have 100% sensitivity and specificity for the disease) with which to compare the test. Diagnostic discrimination is reported with the performance measures sensitivity, specificity, positive predictive value, and negative predictive value; false-positive rate; and the likelihood ratio.40(pp151-163) See Table 4.

→ Because the values used to report diagnostic discrimination are ratios, they can be expressed either as the ratio, using the decimal form, or as the percentage, by multiplying the ratio by 100.

Example: The test had a sensitivity of 0.80 and a specificity of 0.95; the false-positive rate was 0.05.

Or: The test had a sensitivity of 80% and a specificity of 95%; the false-positive rate was 5%.

→ When the diagnostic discrimination of a test is defined, the individuals tested should represent the full spectrum of the disease and reflect the population on whom the test will be used. For example, if a test is proposed as a screening tool, it should be assessed in the general population.
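
The performance measures can be derived from a 2 × 2 table of test results against the criterion standard, as in the sketch below. The counts are hypothetical and were chosen to reproduce the sensitivity, specificity, and false-positive rate in the example above; the function name is an assumption.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Performance measures from counts of true/false positives and negatives."""
    sensitivity = tp / (tp + fn)            # positive results among the diseased
    specificity = tn / (tn + fp)            # negative results among the disease free
    false_positive_rate = fp / (fp + tn)    # equals 1 - specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": false_positive_rate,
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (tn + fn),              # negative predictive value
        "lr_positive": sensitivity / false_positive_rate,  # positive likelihood ratio
    }

# Hypothetical counts: 100 individuals with the disease, 200 without
measures = diagnostic_measures(tp=80, fp=10, fn=20, tn=190)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Multiplying each ratio other than the likelihood ratio by 100 gives the percentage form. The predictive values depend on how common the disease is in the tested sample.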

• dichotomous variable: a variable with only 2 possible categories (eg, male/female, alive/dead); synonym for binary variable.44(p75)

→ A variable may have a continuous distribution during data collection but is made dichotomous for purposes of analysis (eg, age <65 years/age ≥ 65 years). This is done most often for nonnormally distributed data. Note that the use of a cut point generally converts a continuous variable to a dichotomous one (eg, normal vs abnormal).

• direct cause: contributory cause that is believed to be the most immediate cause of a disease. The direct cause is dependent on the current state of knowledge and may change as more immediate mechanisms are discovered.40(p330)

Example: Although several other causes were suggested when the disease was first described, the human immunodeficiency virus is the direct cause of AIDS.

• disability-adjusted life-years (DALY): a quantitative indicator of burden of disease that reflects the years lost due to premature mortality and years lived with disability, adjusted for severity.45

• discordant pair: pair in which the individuals have different outcomes. In twin studies, only the discordant pairs are informative about the association between exposure and disease.42(pp47-48) Antonym is concordant pair.

• discrete variable: variable that is counted as an integer; no fractions are possible.44(p77) Examples are counts of pregnancies or surgical procedures, or responses to a Likert scale.

• discriminant analysis: analytic technique used to classify participants according to their characteristics (eg, the independent variables, signs, symptoms, and diagnostic test results) to the appropriate outcome or dependent variable,44(pp77-78) also referred to as discriminatory analysis.37(pp59-60) This analysis tests the ability of the independent variable model to correctly classify an individual in terms of outcome. Conceptually, this may be thought of as the opposite of analysis of variance, in that the predictor variables are continuous, while the dependent variables are categorical.

• dispersion: degree of scatter shown by observations; may be measured by SD, various percentiles (eg, tertiles, quartiles, quintiles), or range.38(p60)

• distribution: group of ordered values; the frequencies or relative frequencies of all possible values of a characteristic.40(p330) Distributions may have a normal distribution (bell-shaped curve) or a nonnormal distribution (eg, binomial or Poisson distribution).

• dose-response relationship: relationship in which changes in levels of exposure are associated with changes in the frequency of an outcome in a consistent direction. This supports the idea that the agent of exposure (most often a drug) is responsible for the effect seen.40(p330) May be tested statistically by using a χ² test for trend.

• Duncan multiple range test: modified form of the Newman-Keuls test for multiple comparisons.44(p82)

• Dunnett test: multiple comparisons procedure intended for comparing each of a number of treatments with a single control.44(p82)

• Dunn test: multiple comparisons procedure based on the Bonferroni adjustment.44(p84)

• Durbin-Watson test: test to determine whether the residuals from linear regression or multiple regression are independent or, alternatively, are serially correlated.44(p84)

• ecological fallacy: error that occurs when the existence of a group association is used to imply, incorrectly, the existence of a relationship at the individual level.40(p330)

• effectiveness: extent to which an intervention is beneficial when implemented under the usual conditions of clinical care for a group of patients,40(p330) as distinguished from efficacy (the degree of beneficial effect seen in a clinical trial) and efficiency (the intervention effect achieved relative to the effort expended in time, money, and resources).

• effect of observation: bias that results when the process of observation alters the outcome of the study.40(p330) See also Hawthorne effect.

• effect size: observed or expected change in outcome as a result of an intervention. Expected effect size is used during the process of estimating the sample size necessary to achieve a given power. Given a similar amount of variability between individuals, a large effect size will require a smaller sample size to detect a difference than will a smaller effect size.
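
The inverse relationship between effect size and sample size can be sketched with a standard normal-approximation formula for comparing 2 means: n per group = 2(z for 1 − α/2 + z for power)² × SD² / (effect size)². The formula assumes 2 equal-sized groups, a continuous outcome with a common SD, and a 2-sided α; it is an illustration with a hypothetical function name, not a substitute for a formal power calculation.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, sd, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a mean difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 2-sided significance level
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / effect_size ** 2)

# Hypothetical outcome with SD = 10 units
n_small_effect = n_per_group(effect_size=5, sd=10)
n_large_effect = n_per_group(effect_size=10, sd=10)
print(n_small_effect, n_large_effect)
```

With the variability held constant, doubling the expected effect size reduces the required sample size per group to about one-fourth.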

• efficacy: degree to which an intervention produces a beneficial result under the ideal conditions of an investigation,40(p330) usually in a randomized controlled trial; it is usually greater than the intervention’s effectiveness.

• efficiency: effects achieved in relation to the effort expended in money, time, and resources. Statistically, the precision with which a study design will estimate a parameter of interest.42(pp52-53)

• effort-to-yield measures: amount of resources needed to produce a unit change in outcome, such as number needed to treat43; used in cost-effectiveness and cost-benefit analyses. See 20.5, Cost-effectiveness Analysis, Cost-Benefit Analysis.

• error: difference between a measured or estimated value and the true value. Three types are seen in scientific research: a false or mistaken result obtained in a study; measurement error, a random form of error; and systematic error that skews results in a particular direction.42(pp56-57)

• estimate: value or values calculated from sample observations that are used to approximate the corresponding value for the population.40(p330)

• event: end point or outcome of a study; usually the dependent variable. The event should be defined before the study is conducted and assessed by an individual blinded to the intervention or exposure category of the study participant.

• exclusion criteria: characteristics of potential study participants or other data that will exclude them from the study sample (such as being younger than 65 years, history of cardiovascular disease, expected to move within 6 months of the beginning of the study). Like inclusion criteria, exclusion criteria should be defined before any individuals are enrolled.

• explanatory variable: synonymous with independent variable, but preferred by some because “independent” in this context does not refer to statistical independence.38(p98)

• extrapolation: conclusions drawn about the meaning of a study for a target population that includes types of individuals or data not represented in the study sample.40(p330)

• factor analysis: procedure used to group related variables to reduce the number of variables needed to represent the data. This analysis reduces complex correlations between a large number of variables to a smaller number of independent theoretical factors. The researcher must then interpret the factors by looking at the pattern of “loadings” of the various variables on each factor.43 In theory, there can be as many factors as there are variables, and thus the authors should explain how they decided on the number of factors in their solution. The decision about the number of factors is a compromise between the need to simplify the data and the need to explain as much of the variability as possible. There is no single criterion on which to make this decision, and thus authors may consider a number of indexes of goodness of fit. There are a number of algorithms for rotation of the factors, which may make them more straightforward to interpret. Factor analysis is commonly used for developing scoring systems for rating scales and questionnaires.

• false negative: negative test result in an individual who has the disease or condition as determined by the criterion standard.40(p330) See also diagnostic discrimination.

• false-negative rate: proportion of test results found or expected to yield a false-negative result; equal to 1 − sensitivity.40 See also diagnostic discrimination.

• false positive: positive test result in an individual who does not have the disease or condition as determined by the criterion standard.40(p330) See also diagnostic discrimination.

• false-positive rate: proportion of test results found or expected to yield a false-positive result; equal to 1 − specificity.40 See also diagnostic discrimination.

• F distribution: distribution of the ratio of 2 independent estimates of variance from normally distributed data; synonymous with variance ratio distribution.42(p61)

• Fisher exact test: assesses the independence of 2 variables by means of a 2 × 2 contingency table, used when the frequency in at least 1 cell is small44(p96) (usually <6). This test is also known as the Fisher-Yates test and the Fisher-Irwin test.38(p77)
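
A minimal sketch of the calculation for a 2 × 2 table follows. It assumes the usual hypergeometric probabilities with fixed row and column totals and one common 2-sided convention (summing the probabilities of all tables no more likely than the one observed); other 2-sided conventions exist. The function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """2-sided Fisher exact P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def table_probability(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over every possible table that is no more likely than the observed one
    return sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical small-sample table
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"P = {p_value:.2f}")
```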

• fixed-effects model: model used in meta-analysis that assumes that differences in treatment effect in each study all estimate the same true difference. This is not often the case, but the model assumes that it is close enough to the truth that the results will not be misleading.46(p349) Antonym is random-effects model.

• Friedman test: a nonparametric test for a design with 2 factors that uses the ranks rather than the values of the observations.38(p80) Nonparametric analog to analysis of variance.

• F test (score): alternative name for the variance ratio test (or F ratio),42(p74) which results in the F score. Often encountered in analysis of variance.44(p101)

Example: There were differences by academic status in perceptions of the quality of both primary care training ($F_{1,682}$ = 6.71, P = .01) and specialty training ($F_{1,682}$ = 6.71, P = .01). [The numbers set as subscripts for the F test are the df for the numerator and denominator, respectively.]

• funnel plot: in meta-analysis, a graph of the sample size or standard error of each study plotted against its effect size. Estimates of effect size from small studies should have more variability than estimates from larger studies, thus producing a funnel-shaped plot. Departures from a funnel pattern suggest publication bias.

• gaussian distribution: see normal distribution.

• gold standard: see criterion standard.

• goodness of fit: agreement between an observed set of values and a second set that is derived wholly or partly on a hypothetical basis.38(p86) The Kolmogorov-Smirnov test is one example.

• group association: situation in which a characteristic and a disease both occur more frequently in one group of individuals than another. The association does not mean that all individuals with the characteristic necessarily have the disease.40(p331)

• group matching: process of matching during assignment in a study to ensure that the groups have a nearly equal distribution of particular variables; also known as frequency matching.40(p331)

• Hartley test: test for the equality of variances of a number of populations that are normally distributed, based on the ratio between the largest and smallest sample variances.38(p90)

• Hawthorne effect: effect produced in a study because of the participants' awareness that they are participating in a study. The term usually refers to an effect on the control group that changes the group in the direction of the outcome, resulting in a smaller effect size.44(p115) A related concept is effect of observation. The Hawthorne effect is different from the placebo effect, which relates to participants' expectations that an intervention will have specific effects.

• hazard rate, hazard function: theoretical measure of the likelihood that an individual will experience an event within a given period.42(p73) A number of hazard rates for specific intervals of time can be combined to create a hazard function.

• hazard ratio: the ratio of the hazard rate in one group to the hazard rate in another. It is calculated from the Cox proportional hazards model. The interpretation of the hazard ratio is similar to that of the relative risk.

• heterogeneity: inequality of a quantity of interest (such as variance) in a number of groups or populations. Antonym is homogeneity.

• histogram: graphical representation of data in which the frequency (quantity) within each class or category is represented by the area of a rectangle centered on the class interval. The heights of the rectangles are proportional to the observed frequencies. See also Example F5 in 4.2.1, Visual Presentation of Data, Figures, Statistical Graphs.

• Hoeffding independence test: bivariate test of nonnormally distributed continuous data to determine whether the elements of the 2 groups are independent of each other.42(p93)

• Hollander parallelism test: determines whether 2 regression lines for 2 independent variables plotted against a dependent variable are parallel. The test does not require a normal distribution, but there must be an equal and even number of observations corresponding to each line. If the lines are parallel, then both independent variables predict the dependent variable equally well. The Hollander parallelism test is a special case of the signed rank test.38(p94)

• homogeneity: equality of a quantity of interest (such as variance) specifically in a number of groups or populations.38(p94) Antonym is heterogeneity.

• homoscedasticity: statistical determination that the variance of the different variables under study is equal.42(p78) See also heterogeneity.

• Hosmer-Lemeshow goodness-of-fit test: a series of statistical steps used to assess goodness of fit; approximates the χ² statistic.47

• Hotelling T statistic: generalization of the t test for use with multivariate data; results in a T statistic. Significance can be tested with the variance ratio distribution.38(p94)

• hypothesis: supposition that leads to a prediction that can be tested to be either supported or refuted.42(p80) The null hypothesis is generally that there is no difference between groups or relationships among variables and that any such difference or relationship, if found, would occur strictly by chance. Hypothesis testing includes (1) generating the study hypothesis and defining the null hypothesis, (2) determining the level below which results are considered statistically significant, or α level (usually α = .05), and (3) identifying and applying the appropriate statistical test to accept or reject the null hypothesis.

• imputation: a group of techniques for replacing missing data with values that would have been likely to have been observed. Among the simplest methods of imputation is last-observation-carried-forward, in which missing values are replaced by the last observed value. This provides a conservative estimate in cases in which the condition is expected to improve on its own, but may be overly optimistic in conditions that are known to worsen over time. Missing values may also be imputed based on the patterns of other variables. In multiple imputation, repeated random samples are simulated, each of which produces a set of values to replace the missing values. This provides not only an estimate of the missing values but also an estimate of the uncertainty with which they can be predicted.
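
As an illustration of the simplest method named above, last-observation-carried-forward can be sketched as follows. The function name `locf`, the use of `None` to mark a missing value, and the blood pressure readings are assumptions made for this sketch, not part of the manual.

```python
from typing import List, Optional, Sequence


def locf(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value (None) with the most recent observed value."""
    filled: List[Optional[float]] = []
    last: Optional[float] = None  # nothing has been observed yet
    for value in values:
        if value is not None:
            last = value
        filled.append(last)  # stays None until a first observation exists
    return filled


# Hypothetical systolic blood pressure readings at 6 visits; None = missed visit
visits = [142.0, 138.0, None, 135.0, None, None]
filled_visits = locf(visits)
print(filled_visits)
```

A value missing before any observation has been made is left missing, because there is nothing to carry forward.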

• incidence: number of new cases of disease among persons at risk that occur over time,42(p82) as contrasted with prevalence, which is the total number of persons with the disease at any given time. Incidence is usually expressed as a percentage of individuals affected during an interval (eg, year) or as a rate calculated as the number of individuals who develop the disease during a period divided by the number of person-years at risk.

Example: The incidence rate for the disease was 1.2 cases per 100 000 per year.
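
The rate form of incidence described above (new cases divided by person-years at risk) might be computed as in the following sketch. The function name, the scaling argument `per`, and the counts are hypothetical and chosen only to reproduce a rate of 1.2 per 100 000 per year.

```python
def incidence_rate(new_cases: int, person_years: float, per: int = 100_000) -> float:
    """New cases per `per` person-years at risk."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return new_cases * per / person_years


# Hypothetical cohort: 6 new cases observed over 500 000 person-years at risk
rate = incidence_rate(new_cases=6, person_years=500_000)
print(f"{rate:.1f} cases per 100 000 per year")
```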

• inclusion criteria: characteristics a study participant must possess to be included in the study population (such as age 65 years or older at the time of study enrollment and willing and able to provide informed consent). Like exclusion criteria, inclusion criteria should be defined before any participants are enrolled.

• independence, assumption of: assumption that the occurrence of one event is in no way linked to another event. Many statistical tests depend on the assumption that each outcome is independent.42(p83) This may not be a valid assumption if repeated tests are performed on the same individuals (eg, blood pressure is measured sequentially over time), if more than 1 outcome is measured for a given individual (eg, myocardial infarction and death or all hospital admissions), or if more than 1 intervention is made on the same individual (eg, blood pressure is measured during 3 different drug treatments). Tests for repeated measures may be used in those circumstances.

• independent variable: variable postulated to influence the dependent variable within the defined area of relationships under study.42(p83) The term does not refer to statistical independence, so some use the term explanatory variable instead.38(p98)

Example: Age, sex, systolic blood pressure, and cholesterol level were the independent variables entered into the multiple logistic regression.

• indirect cause: contributory cause that acts through the biological mechanism that is the direct cause.40(p331)

Example: Overcrowding in the cities facilitated transmission of the tubercle bacillus and precipitated the tuberculosis epidemic. [Overcrowding is an indirect cause; the tubercle bacillus is the direct cause.]

• inference: process of passing from observations to generalizations, usually with calculated degrees of uncertainty.42(p85)

Example: Intake of a high-fat diet was significantly associated with cardiovascular mortality; therefore, we infer that eating a high-fat diet increases the risk of cardiovascular death.

• instrument error: error introduced in a study when the testing instrument is not appropriate for the conditions of the study or is not accurate enough to measure the study outcome40(p331) (may be due to deficiencies in such factors as calibration, accuracy, and precision).

• intention-to-treat analysis, intent-to-treat analysis: analysis of outcomes for individuals based on the treatment group to which they were randomized, rather than on which treatment they actually received and whether they completed the study. The intention-to-treat analysis generally avoids biases associated with the reasons that participants may not complete the study and should be the main analysis of a randomized trial.44(p125) See 20.2, Randomized Controlled Trials.

→ Although other analyses, such as evaluable patient analysis or per-protocol analyses, are often performed to evaluate outcomes based on treatment actually received, the intention-to-treat analysis should be presented regardless of other analyses because the intervention may influence whether treatment was changed and whether participants dropped out. Intention-to-treat analyses may bias the results of equivalence and noninferiority trials; for those trials, additional analyses should be presented. See 20.2.3, Randomized Controlled Trials, Equivalence and Noninferiority Trials.

• interaction: see interactive effect.

• interaction term: variable used in analysis of variance or analysis of covariance in which 2 independent variables interact with each other (eg, when assessing the effect of energy expenditure on cardiac output, the increase in cardiac output per unit increase in energy expenditure might differ between men and women; the interaction term would enable the analysis to take this difference into account).40(p301)

• interactive effect: effect of 2 or more independent variables on a dependent variable in which the effect of an independent variable is influenced by the presence of another.38(p101) The interactive effect may be additive (ie, equal to the sum of the 2 effects present separately), synergistic (ie, the 2 effects together have a greater effect than the sum of the effects present separately), or antagonistic (ie, the 2 effects together have a smaller effect than the sum of the effects present separately).

• interim analysis: data analysis carried out during a clinical trial to monitor treatment effects. Interim analyses should be specified in the study protocol before patient enrollment, and the protocol should specify the stopping rules that apply if a particular treatment effect is reached.7(p130)

• interobserver bias: tendency of one observer to give a particular response more often than another observer because of factors unique to the observer or instrument. For example, one physician may be more likely than another to identify a particular set of signs and symptoms as indicative of religious preoccupation on the basis of his or her beliefs, or a physician may be less likely than another physician to diagnose alcoholism in a patient because of the physician’s expectations.44(p25) The Cochran Q test is used to assess interobserver bias.44(p25)

• interobserver reliability: test used to measure agreement among observers about a particular measure or outcome.

→ Although the proportion of times that 2 observers agree can be reported, this does not take into account the number of times they would have agreed by chance alone. For example, if 2 observers must decide whether a factor is present or absent, they should agree 50% of the time according to chance. The κ statistic assesses agreement while taking chance into account and is described by the equation [(observed agreement) − (agreement expected by chance)]/(1 − agreement expected by chance). The value of κ may range from 0 (poor agreement) to 1 (perfect agreement) and may be classified by various descriptive terms, such as slight (0-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), and near perfect (0.81-0.99).48(pp27-29)

→ In cases in which disagreement may have especially grave consequences, such as one pathologist rating a slide “negative” and another rating a slide “invasive carcinoma,” a weighted κ may be used to grade disagreement according to the severity of the consequences.48(p29) See also Pearson product moment correlation.
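
The unweighted κ calculation described above, (observed agreement minus agreement expected by chance) divided by (1 minus agreement expected by chance), can be sketched for 2 observers as follows. This sketch assumes that chance agreement is estimated from each observer's own rating frequencies (the form usually called Cohen κ); the function name and the ratings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted kappa for 2 raters who each rated the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must rate the same nonempty set of items")
    n = len(rater_a)
    # Proportion of items on which the 2 raters gave the same rating
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, estimated from each rater's marginal frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of 10 slides by 2 pathologists
a = ["+", "+", "+", "+", "-", "-", "-", "-", "+", "-"]
b = ["+", "+", "+", "-", "-", "-", "-", "+", "+", "-"]
kappa = cohen_kappa(a, b)
print(round(kappa, 2))
```

In this example the observers agree on 8 of 10 slides, and each rates half the slides positive, so chance agreement is 50%, which matches the present-or-absent case in the text.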

• interobserver variation: see interobserver reliability.

• interquartile range: the distance between the 25th and 75th percentiles, which is used to describe the dispersion of values. Like other quantiles (eg, tertiles, quintiles), such a range more accurately describes nonnormally distributed data than does the SD. The interquartile range describes the inner 50% of values; the interquintile range describes the inner 60% of values; the interdecile range describes the inner 80% of values.38(pp102-103)
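
A minimal sketch of the calculation, using the `statistics.quantiles` function from the Python standard library. Several conventions exist for interpolating percentiles, so other software may report slightly different quartiles for the same data; the data here are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def interquartile_range(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return the 25th percentile, the 75th percentile, and their difference."""
    # n=4 yields the 3 cut points that divide the data into quartiles
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1, q3, q3 - q1


# Hypothetical lengths of stay, in days
days = list(range(1, 12))  # 1, 2, ..., 11
q1, q3, iqr = interquartile_range(days)
print(q1, q3, iqr)
```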

• interrater reliability: reproducibility among raters or observers; synonymous with interobserver reliability.

• interval estimate: see confidence interval.40(p331)

• intraobserver reliability (or variation): reliability (or, conversely, variation) in measurements by the same person at different times.40(p331) Similar to interobserver reliability, intraobserver reliability is the agreement between measurements by one individual beyond that expected by chance and can be measured by means of the κ statistic or the Pearson product moment correlation.

• intrarater reliability: synonym for intraobserver reliability.

• jackknife dispersion test: technique for estimating the variance and bias of an estimator, applied to a predictive model derived from a study sample to determine whether the model fits subsamples from the model equally well. The estimator or model is applied to subsamples of the whole, and the differences in the results obtained from the subsample compared with the whole are analyzed as a jackknife estimate of variance. This method uses a single data set to derive and validate the model.48(p131)

→ Although validating a model in a new sample is preferable, investigators often use techniques such as jackknife dispersion or the bootstrap method to validate a model to save the time and expense of obtaining an entirely new sample for purposes of validation.

• Kaplan-Meier method: nonparametric method of compiling life tables. Unlike the Cutler-Ederer method, the Kaplan-Meier method assumes that termination of follow-up occurs at the end of the time block. Therefore, Kaplan-Meier estimates of risk tend to be slightly lower than Cutler-Ederer estimates.40(p308) Often an intervention and control group are depicted on one graph and the groups are compared by a log-rank test. Because the method is nonparametric, there is no attempt to fit the data to a theoretical curve. Thus, Kaplan-Meier plots have a jagged appearance, with discrete drops at the end of each time interval in which an event occurs. This method is also known as the product-limit method.
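
The product-limit calculation can be sketched as follows. This is a minimal illustration, not a full survival analysis: the function name and the data are hypothetical, and the sketch assumes that individuals censored at a given time are still counted as at risk for events occurring at that same time.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, estimated survival) at each time where an event occurred.

    events[i] is 1 if individual i had the event at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the proportion of those at risk who survive time t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up times in months; 0 marks a censored observation
months = [2, 3, 3, 5, 8, 9]
had_event = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(months, had_event)
for t, s in curve:
    print(t, round(s, 3))
```

The estimate changes only at times when an event occurs, which is why the plotted curve has the stepped appearance described above.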

• κ (kappa) statistic: statistic used to measure nonrandom agreement between observers or measurements.42(p94) See interobserver and intraobserver reliability.

• Kendall τ (tau) rank correlation: rank correlation coefficient for ordinal data.48(p134)

• Kolmogorov-Smirnov test: comparison of 2 independent samples of continuous data without requiring that the data be normally distributed44(p136); may be used to test goodness of fit.43

• Kruskal-Wallis test: comparison of 3 or more groups of nonnormally distributed data to determine whether they differ significantly.44(p137) The Kruskal-Wallis test is a nonparametric analog of analysis of variance and generalizes the 2-sample Wilcoxon rank sum test to the multiple-sample case.38(p111)

• kurtosis: the way in which a unimodal curve deviates from a normal distribution; may be more peaked (leptokurtic) or more flat (platykurtic) than a normal distribution.44(p137)

• Latin square: form of complete treatment crossover design used for crossover drug trials that eliminates the effect of treatment order. Each patient receives each drug, but each drug is followed by another drug only once in the array. For example, in the following 4 × 4 array, letters A through D correspond to each of 4 drugs, each row corresponds to a patient, and each column corresponds to the order in which the drugs are given.8(p142)

| | First Drug | Second Drug | Third Drug | Fourth Drug |
| --- | --- | --- | --- | --- |
| Patient 1 | C | D | A | B |
| Patient 2 | A | C | B | D |
| Patient 3 | D | B | C | A |
| Patient 4 | B | A | D | C |
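
The 2 properties stated for the 4 × 4 array above (each patient receives each drug, and each drug is immediately followed by any other given drug only once) can be checked with a short sketch. The function names are hypothetical; the array is the one given in this entry, with each string holding one patient's drug sequence.

```python
from typing import Sequence


def is_latin_square(rows: Sequence[str]) -> bool:
    """True if every symbol appears exactly once in each row and each column."""
    n = len(rows)
    symbols = set(rows[0])
    if len(symbols) != n:
        return False
    rows_ok = all(len(row) == n and set(row) == symbols for row in rows)
    cols_ok = all(set(col) == symbols for col in zip(*rows))
    return rows_ok and cols_ok


def is_order_balanced(rows: Sequence[str]) -> bool:
    """True if no drug is immediately followed by the same other drug twice."""
    pairs = [(row[i], row[i + 1]) for row in rows for i in range(len(row) - 1)]
    return len(pairs) == len(set(pairs))


# Rows are patients 1 through 4; positions are first through fourth drug
square = ["CDAB", "ACBD", "DBCA", "BADC"]
print(is_latin_square(square), is_order_balanced(square))
```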

• lead-time bias: artifactual increase in survival time that results from earlier detection of a disease, usually cancer, during a time when the disease is asymptomatic. Lead-time bias produces longer survival from the time of diagnosis but not longer survival from the time of onset of the disease.40(p331) See also length-time bias.

→ Lead-time bias may give the appearance of a survival benefit from screening, when in fact the increased survival is only artifactual. Lead-time bias is used more generally to indicate a systematic error arising when follow-up of groups does not begin at comparable stages in the natural course of the condition.

• least significant difference test: test for comparing mean values arising in analysis of variance. An extension of the t test.40(p115)

• least squares method: method of estimation, particularly in regression analysis, that minimizes the sum of the squared differences between the observed responses and the values predicted by the model.44(p140) The regression line is created so that the sum of the squares of the residuals is as small as possible.
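
For a single independent variable, the line that minimizes the sum of squared residuals has a closed-form solution, sketched below. The function names and the data are hypothetical.

```python
from typing import Sequence, Tuple


def least_squares_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope) -> float:
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical paired observations
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = least_squares_line(xs, ys)
best = residual_sum_of_squares(xs, ys, intercept, slope)
print(round(intercept, 2), round(slope, 2))
```

Any other line through the same data, such as one with a slightly different slope, gives a larger residual sum of squares than `best`.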

• left-censored data: see censored data.

• length-time bias: bias that arises when a sampling scheme is based on patient visits, because patients with more frequent clinic visits are more likely to be selected than those with less frequent visits. In a screening study of cancer, for example, screening patients with frequent visits is more likely to detect slow-growing tumors than would sampling patients who visit a physician only when symptoms arise.44(p140) See also lead-time bias.

• life table: method of organizing data that allows examination of the experience of 1 or more groups of individuals over time with varying periods of follow-up. For each increment of the follow-up period, the number entering, the number leaving, and the number dying of disease or developing disease can be calculated. An assumption of the life-table method is that an individual not completing follow-up is exposed for half the incremental follow-up period.44(p143) (The Kaplan-Meier method and the Cutler-Ederer method are also forms of life-table analysis but make different assumptions about the length of exposure.) See Figure 3.

→ The clinical life table describes the outcomes of a cohort of individuals classified according to their exposure or treatment history. The cohort life table is used for a cohort of individuals born at approximately the same time and followed up until death. The current life table is a summary of mortality of the population over a brief (1- to 3-year) period, classified by age, often used to estimate life expectancy for the population at a given age.42(p97)

• likelihood ratio: probability of getting a certain test result if the patient has the condition relative to the probability of getting the result if the patient does not have the condition. For dichotomous variables, this is calculated as sensitivity/(1 − specificity). The greater the likelihood ratio, the more likely that a positive test result will occur in a patient who has the disease. A ratio of 2 means a person with the disease is twice as likely to have a positive test result as a person without the disease.43 The likelihood ratio test is based on the ratio of 2 likelihood functions.38(p118) See also diagnostic discrimination.
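
The calculation for a dichotomous test, sensitivity/(1 − specificity), is sketched below. The function name and the sensitivity and specificity values are hypothetical.

```python
def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """Likelihood ratio for a positive result: sensitivity / (1 - specificity)."""
    if not (0 <= sensitivity <= 1 and 0 <= specificity < 1):
        raise ValueError("sensitivity must be in [0, 1] and specificity in [0, 1)")
    return sensitivity / (1 - specificity)


# Hypothetical test with 90% sensitivity and 80% specificity
lr = positive_likelihood_ratio(0.90, 0.80)
print(round(lr, 1))
```

With these values, a positive result is about 4.5 times as likely in a person with the disease as in a person without it.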

• Likert scale: scale often used to assess opinion or attitude, ranked by attaching a number to each response such as 1, strongly agree; 2, agree; 3, undecided or neutral; 4, disagree; 5, strongly disagree. The score is a sum of the numerical responses to each question.44(p144)

• Lilliefors test: test of normality (using the Kolmogorov-Smirnov test statistic) in which mean and variance are estimated from the data.38(p118)

• linear regression: statistical method used to compare continuous dependent and independent variables. When the data are depicted on a graph as a regression line, the independent variable is plotted on the x-axis and the dependent variable on the y-axis. The residual is the vertical distance from the data point to the regression line43(p110); analysis of residuals is a commonly used procedure for linear regression. (See Example F4 in 4.2.1, Visual Presentation of Data, Figures, Statistical Graphs.) This method is frequently performed using least squares regression.37(pp202-203)

→ The description of a linear regression model should include the equation of the fitted line with the slope and 95% confidence interval if possible, the fraction of variation in y explained by the x variables (correlation), and the variances of the fitted coefficients a and b (and their SDs).37(p227)

Example: The regression model identified a significant positive relationship between the dependent variable weight and height (slope = 0.25; 95% CI, 0.19-0.31; y = 12.6 + 0.25x; t₄₅₁ = 8.3, P < .001; r² = 0.67).43

(In this example, the slope is positive, indicating that as one variable increases the other increases; the t test with 451 df is significant; the regression line is described by the equation and includes the slope 0.25 and the constant 12.6. The coefficient of determination r² demonstrates that 67% of the variance in weight is explained by height.)43

→ Four important assumptions are made when linear regression is conducted: the dependent variable is sampled randomly from the population; the spread or dispersion of the dependent variable is the same regardless of the value of the independent variable (this equality is referred to as homogeneity of variances or homoscedasticity); the relationship between the 2 variables is linear; and the independent variable is measured with complete precision.40(pp273-274)

• location: central tendency of a normal distribution, as distinguished from dispersion. The location of 2 curves may be identical (means are the same), but the kurtosis may vary (one may be peaked and the other flat, producing small and large SDs, respectively).49(p28)

• logistic regression: type of regression model used to analyze the relationship between a binary dependent variable (expressed as a natural log after a logit transformation) and 1 or more independent variables. Often used to determine the independent effect of each of several explanatory variables by controlling for several factors simultaneously in a multiple logistic regression analysis. Results are usually expressed by odds ratios or relative risks and 95% confidence intervals.40(pp311-312) (The multiple logistic regression equation may also be provided but, because these involve exponents, they are substantially more complicated than linear regression equations. Therefore, in JAMA and the Archives Journals, the equation is generally not published but can be made available on request from authors. Alternatively, it may be placed on the Web.)

→ To be valid, a multiple regression model must have an adequate sample size for the number of variables examined. A rough rule of thumb is to have at least 25 individuals in the study for each explanatory variable examined.
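
The logit transformation and the conversion of a fitted coefficient to an odds ratio can be sketched as follows. This is an illustration under the usual formulation of logistic regression, in which the exponent of a coefficient is the odds ratio for a 1-unit increase in that variable; the function names are hypothetical, and no model is fitted here.

```python
import math


def logit(p: float) -> float:
    """Natural log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(z: float) -> float:
    """Convert a log odds value back to a probability."""
    return 1 / (1 + math.exp(-z))


def odds_ratio(coefficient: float) -> float:
    """Odds ratio associated with a 1-unit increase in an explanatory variable."""
    return math.exp(coefficient)


# A hypothetical coefficient of ln 2 corresponds to a doubling of the odds
print(round(odds_ratio(math.log(2)), 2))
print(logit(0.5))
```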

• log-linear model: linear model used in the analysis of categorical data.38(p122)

• log-rank test: method of using the relative death rates in subgroups to compare overall differences between survival curves for different treatments; same as the Mantel-Haenszel test.38(pp122,124)

• main effect: estimate of the independent effect of an explanatory (independent) variable on a dependent variable in analysis of variance or analysis of covariance.44(p153)

• Mann-Whitney test: nonparametric equivalent of the t test, used to compare ordinal dependent variables with either nominal independent variables or continuous independent variables converted to an ordinal scale.42(p100) Similar to the Wilcoxon rank sum test.

• MANOVA: multivariate analysis of variance. This involves examining the overall significance of all dependent variables considered simultaneously and thus has less risk of type I error than would a series of univariate analysis of variance procedures on several dependent variables.

• Mantel-Haenszel test: another name for the log-rank test.

• Markov process: process of modeling possible events or conditions over time that assumes that the probability that a given state or condition will be present depends only on the state or condition immediately preceding it and that no additional information about previous states or conditions would create a more accurate estimate.44(p155)

• masked assessment: synonymous with blinded assessment, preferred by some investigators and journals to the term blinded, especially in ophthalmology.

• masked assignment: synonymous with blinded assignment, preferred by some investigators and journals to the term blinded, especially in ophthalmology.

• matching: process of making study and control groups comparable with respect to factors other than the factors under study, generally as part of a case-control study. Matching can be done in several ways, including frequency matching (matching on frequency distributions of the matched variable[s]), category (matching in broad groups such as young and old), individual (matching on individual rather than group characteristics), and pair matching (matching each study individual with a control individual).42(p101)

• McNemar test: form of the χ² test for binary responses in comparisons of matched pairs.42(p103) The ratio of discordant to concordant pairs is determined; the greater the number of discordant pairs with the better outcome being associated with the treatment intervention, the greater the effect of the intervention.44(p158)

• mean: sum of values measured for a given variable divided by the number of values; a measure of central tendency appropriate for normally distributed data.49(p29)

→ If the data are not normally distributed, the median is preferred. See also average.

• measurement error: estimate of the variability of a measurement. Variability of a given parameter (eg, weight) is the sum of the true variability of what is measured (eg, day-to-day weight fluctuations) plus the variability of the instrument or observer measurement, or variability caused by measurement error (error variability, eg, the scale used for weighing). The intraclass correlation coefficient R measures the relationship of these 2 types of variability: as the error variability declines with respect to true variability, R increases, up to 1 when error variance is 0. If all variability is a result of error variability, then R = 0.46(p30)
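
The behavior of R described above is consistent with defining it as true variability divided by total (true plus error) variability. The sketch below assumes that definition and that the 2 variance components have already been estimated; the function name and values are hypothetical.

```python
def reliability_coefficient(true_variance: float, error_variance: float) -> float:
    """Share of total variance that is true variance (assumed form of R)."""
    total = true_variance + error_variance
    if true_variance < 0 or error_variance < 0 or total == 0:
        raise ValueError("variances must be nonnegative and not both 0")
    return true_variance / total


print(reliability_coefficient(4.0, 0.0))  # no error variance
print(reliability_coefficient(0.0, 3.0))  # all variability is error
print(reliability_coefficient(3.0, 1.0))  # mixed case
```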

• median: midpoint of a distribution chosen so that half the values for a given variable appear above and half occur below.40(p332) For data that do not have a normal distribution, the median provides a better measure of central tendency than does the mean, since it is less influenced by outliers.47(p29)
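
The effect of an outlier on the mean and the median can be seen in a short sketch using hypothetical data.

```python
import statistics

# Hypothetical lengths of stay in days; the last value is an outlier
lengths_of_stay = [2, 3, 3, 4, 5, 6, 40]

mean_stay = statistics.mean(lengths_of_stay)
median_stay = statistics.median(lengths_of_stay)

# The same data with the outlier removed
mean_without = statistics.mean(lengths_of_stay[:-1])
median_without = statistics.median(lengths_of_stay[:-1])

print(mean_stay, median_stay)
print(round(mean_without, 2), median_without)
```

Removing the single outlier shifts the mean by more than 5 days but shifts the median by only half a day.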

• median test: nonparametric rank-order test for 2 groups.38(p128)

• meta-analysis: See 20.4, Meta-analysis.

• missing data: incomplete information on individuals resulting from any of a number of causes, including loss to follow-up, refusal to participate, and inability to complete the study. Although the simplest approach would be to remove such participants from the analysis, this would violate the intention-to-treat principle. Furthermore, certain health conditions may be systematically associated with the risk of having missing data, and thus removal of these individuals could bias the analysis. It is generally better to attempt imputation of these missing values, which are then included in the analysis.

• mode: in a series of values of a given variable, the number that occurs most frequently; used most often when a distribution has 2 peaks (bimodal distribution).49(p29) This is also appropriate as a measure of central tendency for categorical data.

• Monte Carlo simulation: a family of techniques for modeling complex systems for which it would otherwise be difficult to obtain sufficient data. In general, Monte Carlo simulations use a computer algorithm to generate a large number of random “observations.” The patterns of these numbers are then assessed for underlying regularities.
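
As a small illustration of the approach, the sketch below simulates many hypothetical studies, each performing 10 independent tests when no true effect exists, and counts how often at least 1 test falls below α = .05. The scenario, function name, and parameter values are assumptions chosen for illustration; the simulated proportion can be compared with the exact value, 1 − 0.95¹⁰.

```python
import random


def simulated_false_positive_rate(
    n_tests: int, alpha: float = 0.05, n_trials: int = 20_000, seed: int = 1
) -> float:
    """Proportion of simulated studies with at least 1 P value below alpha."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(n_trials):
        # With no true effect, each P value is uniformly distributed on [0, 1)
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_trials


estimate = simulated_false_positive_rate(10)
exact = 1 - 0.95 ** 10
print(round(estimate, 3), round(exact, 3))
```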

• mortality rate: death rate described by the following equation: [(number of deaths during period) × (period of observation)]/(number of individuals observed). For values such as the crude mortality rate, the denominator is the number of individuals observed at the midpoint of observation. See also crude death rate.44(p66)

→ Mortality rate is often expressed in terms of a standard ratio, such as deaths per 100 000 persons per year.

• Moses ranklike dispersion test: rank test of the equality of scale of 2 identically shaped populations, applicable when the population medians are not known.38(p134)

• multiple analyses problem: problem that occurs when several statistical tests are performed on one group of data because of the potential to introduce a type I error. The problem is particularly an issue when the analyses were not specified as primary outcome measures. Multiple analyses can be appropriately adjusted for by means of a Bonferroni adjustment or any of several multiple comparisons procedures.
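
A Bonferroni adjustment, in its simplest form, compares each P value with α divided by the number of tests performed. The sketch below assumes that form; the function name and the P values are hypothetical.

```python
from typing import List, Sequence


def bonferroni_significant(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag each P value that falls below alpha divided by the number of tests."""
    if not p_values:
        raise ValueError("at least 1 P value is required")
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical P values from 4 tests on the same data; threshold is .05/4 = .0125
flags = bonferroni_significant([0.001, 0.012, 0.03, 0.2])
print(flags)
```

In this example, a P value of .03 would be considered significant for a single test at α = .05 but not after adjustment for 4 tests.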

• multiple comparisons procedures: any of several tests used to determine which groups differ significantly after another more general test has identified that a significant difference exists but not between which groups. These tests are intended to avoid the problem of a type I error caused by sequentially applying tests, such as the t test, not intended for repeated use. Authors should specify whether these tests were planned a priori, or whether the decision to perform them was post hoc.

→ Some tests result in more conservative estimates (less likely to be significant) than others. More conservative tests include the Tukey test and the Bonferroni adjustment; the Duncan multiple range test is less conservative. Other tests include the Scheffé test, the Newman-Keuls test, and the Gabriel test,38(p137) as well as many others. There is ongoing debate among statisticians about when it is appropriate to use these tests.

• multiple regression: general term for analysis procedures used to estimate values of the dependent variable for all measured independent variables that are found to be associated. The procedure used depends on whether the variables are continuous or nominal. When all variables are continuous variables, multiple linear regression is used and the mean of the dependent variable is expressed using the equation Y = α + β₁x₁ + β₂x₂ + ··· + βₖxₖ, where Y is the dependent variable and k is the total number of independent variables. When independent variables may be either nominal or continuous and the dependent variable is continuous, analysis of covariance is used. (Analysis of covariance often requires an interaction term to account for differences in the relationship between the independent and dependent variables.) When all variables are nominal and the dependent variable is time-dependent, life-table methods are used. When the independent variables may be either continuous or nominal and the dependent variable is nominal and time-dependent (such as incidence of death), the Cox proportional hazards model may be used. Nominal dependent variables that are not time-dependent are analyzed by means of logistic regression or discriminant analysis.37(pp296-312)

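• multivariable analysis: analysis involving multiple predictors (independent variables) for a single outcome (dependent variable). Although the terms are often used interchangeably, multivariable is not synonymous with multivariate, which refers to 1 or more independent variables for multiple outcomes.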

• multivariate analysis: strictly, any statistical analysis that deals with multiple dependent variables (outcomes) and 1 or more independent variables. The term is often used, less accurately, for multivariable analysis, ie, any statistical test that deals with 1 dependent variable and at least 2 independent variables; the remainder of this entry describes that usage. It may include nominal or continuous variables, but ordinal data must be converted to a nominal scale for analysis. The approach has 3 advantages over bivariate analysis: (1) it allows for investigation of the relationship between the dependent and independent variables while controlling for the effects of other independent variables; (2) it allows several comparisons to be made statistically without increasing the likelihood of a type I error; and (3) it can be used to compare how well several independent variables individually can estimate values of the dependent variable.40(pp289-291) Examples include analysis of variance, multiple (logistic or linear) regression, analysis of covariance, Kruskal-Wallis test, Friedman test, life table, and Cox proportional hazards model.

• N: total number of units (eg, patients, households) in the sample under study.

Example: We assessed the admission diagnoses of all patients admitted from the emergency department during a 1-month period (N = 127).

• n: number of units in a subgroup of the sample under study.

Example: Of the patients admitted from the emergency department (N = 127), the most frequent admission diagnosis was unstable angina (n = 38).

• natural experiment: investigation in which a change in a risk factor or exposure occurs in one group of individuals but not in another. The distribution of individuals into a particular group is nonrandom and, as opposed to controlled clinical trials, the change is not brought about by the investigator.40(p332) The natural experiment is often used to study effects that cannot be studied in a controlled trial, such as the incidence of medical illness immediately after an earthquake. This is also referred to as a “found” experiment.

• naturalistic sample: set of observations obtained from a sample of the population in such a way that the distribution of independent variables in the sample is representative of the distribution in the population.40(p332)

• necessary cause: characteristic whose presence is required to bring about or cause the disease or outcome under study.50(p332) A necessary cause may not be a sufficient cause.

• negative predictive value: the probability that an individual does not have the disease (as determined by the criterion standard) if the test result is negative.40(p334) This measure takes into account the prevalence of the condition or the disease. A more general term is posttest probability. See diagnostic discrimination.
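
The dependence of negative predictive value on prevalence can be sketched as follows. The formula is the standard one derived from sensitivity, specificity, and prevalence; the function name and values are hypothetical.

```python
def negative_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Probability of no disease given a negative test result."""
    true_negatives = specificity * (1 - prevalence)   # no disease, negative test
    false_negatives = (1 - sensitivity) * prevalence  # disease, negative test
    if true_negatives + false_negatives == 0:
        raise ValueError("no negative results are possible with these inputs")
    return true_negatives / (true_negatives + false_negatives)


# Same hypothetical test (90% sensitivity, 80% specificity) at 2 prevalences
npv_low = negative_predictive_value(0.90, 0.80, prevalence=0.10)
npv_high = negative_predictive_value(0.90, 0.80, prevalence=0.50)
print(round(npv_low, 3), round(npv_high, 3))
```

The same test yields a lower negative predictive value when the condition is more common.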

• nested case-control study: case-control study in which cases and controls are drawn from a cohort study. The advantages of a nested case-control study over a case-control study are that the controls are selected from participants at risk at the time of occurrence of each case that arises in a cohort, thus avoiding the confounding effect of time in the analysis, and that cases and controls are by definition drawn from the same population.40(p111) See also 20.3.1, Observational Studies, Cohort Studies, and 20.3.2, Observational Studies, Case-Control Studies.

• Newman-Keuls test: a type of multiple comparisons procedure, used to compare more than 2 groups. It first compares the 2 groups that have the highest and lowest means, then sequentially compares the next most extreme groups, and stops when a comparison is not significant.39(p92)

• n-of-1 trial: randomized controlled trial that uses a single patient and an outcome measure agreed on by the patient and physician. The n-of-1 trial may be used by clinicians to assess which of 2 or more possible treatment options is better for the individual patient.50

• nominal variable: also called categorical variable. There is no arithmetic relationship among the categories, and thus there is no intrinsic ranking or order between them (for example, sex, gene alleles, race, eye color). The nominal or discrete variable usually is assessed to determine its frequency within a population.40(p332) The variable can have either a binomial or Poisson distribution (if the nominal event is extremely rare, eg, a genetic mutation).

• nomogram: a visual means of representing a mathematical equation.

• nonconcurrent cohort study: cohort study in which an individual’s group assignment is determined by information that exists at the time a study begins. The extreme of a nonconcurrent cohort study is one in which the outcome is determined retrospectively from existing records.40(p332)

• nonnormal distribution: data that do not have a normal (bell-shaped curve) distribution; includes binomial, Poisson, and exponential distributions, as well as many others.

→ Nonnormally distributed continuous data must be either transformed to a normal distribution to use parametric methods or, more commonly, analyzed by nonparametric methods.

• nonparametric statistics: statistical procedures that do not assume that the data conform to any theoretical distribution. Nonparametric tests are most often used for ordinal or nominal data, or for nonnormally distributed continuous data converted to an ordinal scale40(p332) (for example, weight classified by tertile).

• normal distribution: continuous data distributed in a symmetrical, bell-shaped curve with the mean value corresponding to the highest point of the curve. This distribution of data is assumed in many statistical procedures.40(p330) This is also called a gaussian distribution.

→ Descriptive statistics such as mean and SD can be used to accurately describe data only if the values are normally distributed or can be transformed into a normal distribution.

• normal range: measure of the range of values on a particular test among those without the disease. Cut points for abnormal tests are arbitrary and are often defined as the central 95% of values, or the mean of values ± 2 SDs.

• null hypothesis: the assertion that no true association or difference in the study outcome or comparison of interest between comparison groups exists in the larger population from which the study samples are obtained.40(p332) In general, statistical tests cannot be used to prove the null hypothesis. Rather, the results of statistical testing can reject the null hypothesis at the stated α likelihood of a type I error.

• number needed to harm: computed similarly to number needed to treat, but number of patients who, after being treated for a specific period of time, would be expected to experience 1 bad outcome or not experience 1 good outcome.

• number needed to treat (NNT): number of patients who must be treated with an intervention for a specific period to prevent 1 bad outcome or result in 1 good outcome.40(pp332-333) The NNT is the reciprocal of the absolute risk reduction, the difference between event rates in the intervention and placebo groups in a clinical trial. See also number needed to harm.

→ The study patients from whom the NNT is calculated should be representative of the population to whom the numbers will be applied. The NNT does not take into account adverse effects of the intervention.
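The reciprocal relationship between the NNT and the absolute risk reduction can be illustrated with a short calculation. The following is a minimal sketch; the function name and the event rates are hypothetical and are not taken from any study.

```python
def number_needed_to_treat(control_event_rate: float, treated_event_rate: float) -> float:
    """Return the NNT as the reciprocal of the absolute risk reduction (ARR)."""
    absolute_risk_reduction = control_event_rate - treated_event_rate
    if absolute_risk_reduction <= 0:
        raise ValueError("No absolute risk reduction; the NNT is not defined here.")
    return 1.0 / absolute_risk_reduction


# Hypothetical trial: 20% of placebo patients and 15% of treated patients
# experience the bad outcome, so the ARR is 0.05.
nnt = number_needed_to_treat(0.20, 0.15)
print(f"NNT is approximately {round(nnt)}")
```

In this hypothetical case about 20 patients would need to be treated for the specified period to prevent 1 bad outcome.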

• odds ratio (OR): ratio of 2 odds. Odds ratio may have different definitions depending on the study and therefore should be defined. For example, it may be the odds of having the disease if a particular risk factor is present to the odds of not having the disease if the risk factor is not present, or the odds of having a risk factor present if the person has the disease to the odds of the risk factor being absent if the person does not have the disease.

→ The odds ratio typically is used for a case-control or cohort study. For a study of incident cases with an infrequent disease (for example, <2% incidence), the odds ratio approximates the relative risk.42(p118) When the incidence is relatively frequent the odds ratio may be arithmetically corrected to better approximate the relative risk.51

→ The odds ratio is usually expressed by a point estimate and 95% confidence interval (CI). An odds ratio for which the CI includes 1 indicates no statistically significant effect on risk; if the point estimate and CI are both less than 1, there is a statistically significant reduction in risk; if the point estimate and CI are both greater than 1, there is a statistically significant increase in risk.
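As an illustration of the point estimate and confidence interval described above, the following sketch computes an odds ratio from a 2 × 2 table. It assumes the common log-based (Woolf) approximation for the 95% CI, which is one of several available methods and is not prescribed by this glossary; the function name and the counts are hypothetical.

```python
import math


def odds_ratio_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and approximate CI from a 2 x 2 table.

    a = exposed with disease,   b = exposed without disease,
    c = unexposed with disease, d = unexposed without disease.
    The CI uses the log (Woolf) approximation, an assumption of this sketch.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
or_value, lower, upper = odds_ratio_with_ci(30, 70, 15, 85)
includes_one = lower <= 1 <= upper
print(f"OR = {or_value:.2f} (95% CI, {lower:.2f}-{upper:.2f}); CI includes 1: {includes_one}")
```

If the interval includes 1, the result would be read as showing no statistically significant effect on risk, as noted above.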

• 1-tailed test: test of statistical significance in which deviations from the null hypothesis in only 1 direction are considered.40(p333) Most commonly used for the t test.

→ One-tailed tests are more likely to produce a statistically significant result than are 2-tailed tests. Since the use of a 1-tailed test implies that the intervention could have only 1 direction of effect, ie, beneficial or harmful, the use of a 1-tailed test must be justified.

• ordinal data: type of data with a limited number of categories with an inherent ordering of the category from lowest to highest, but without fixed or equal spacing between increments.40(p333) Examples are Apgar scores, heart murmur rating, and cancer stage and grade. Ordinal data can be summarized by means of the median and quantiles or range.

→ Because increments between the numbers for ordinal data generally are not fixed (eg, the difference between a grade 1 and a grade 2 heart murmur is not quantitatively the same as the difference between a grade 3 and a grade 4 heart murmur), ordinal data should be analyzed by nonparametric statistics.

• ordinate: vertical or y-axis of a graph.

• outcome: dependent variable or end point of an investigation. In retrospective studies such as case-control studies, the outcomes have already occurred before the study is begun; in prospective studies such as cohort studies and controlled trials, the outcomes occur during the time of the study.40(p333)

• outliers (outlying values): values at the extremes of a distribution. Because the median is far less sensitive to outliers than is the mean, it is preferable to use the median to describe the central tendency of data that have extreme outliers.

→ If outliers are excluded from an analysis, the rationale for their exclusion should be explained in the text. A number of tests are available to determine whether an outlier is so extreme that it should be excluded from the analysis.
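The different sensitivity of the mean and the median to an outlying value can be seen in a small example. The data below are hypothetical lengths of stay in days, chosen only for illustration.

```python
import statistics

# Hypothetical lengths of stay (days); the last value added is an extreme outlier.
typical_stays = [4.0, 5.0, 5.0, 6.0, 7.0, 7.0, 8.0]
stays_with_outlier = typical_stays + [60.0]

for label, values in (("without outlier", typical_stays),
                      ("with outlier", stays_with_outlier)):
    mean_value = statistics.mean(values)
    median_value = statistics.median(values)
    print(f"{label}: mean = {mean_value:.2f}, median = {median_value:.2f}")
```

A single extreme value roughly doubles the mean in this example, whereas the median changes only slightly.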

• overmatching: the phenomenon of obscuring by the matching process of a case-control study a true causal relationship between the independent and dependent variables because the variable used for matching is strongly related to the mechanism by which the independent variable exerts its effect.40(pp119-120) For example, matching cases and controls on residence within a certain area could obscure an environmental cause of a disease. Overmatching may also be used to refer to matching on variables that have no effect on the dependent variable, and therefore are unnecessary, or the use of so many variables for matching that no suitable controls can be found.42(p120)

• oversampling: in survey research, a technique that selectively increases the likelihood of including certain groups or units that would otherwise produce too few responses to provide reliable estimates.

• paired samples: form of matching that can include self-pairing, where each participant serves as his or her own control, or artificial pairing, where 2 participants are matched on prognostic variables.42(p186) Twins may be studied as pairs to attempt to separate the effects of environment and genetics. Paired analyses provide greater power to detect a difference for a given sample size than do nonpaired analyses, since interindividual differences are minimized or eliminated. Pairing may also be used to match participants in case-control or cohort studies. See Table 3.

• paired t test: t test for paired data.

• parameter: measurable characteristic of a population. One purpose of statistical analysis is to estimate population parameters from sample observations.40(p333) The statistic is the numerical characteristic of the sample; the parameter is the numerical characteristic of the population. Parameter is also used to refer to aspects of a model (eg, a regression model).

• parametric statistics: tests used for continuous data and that require the assumption that the data being tested are normally distributed, either as collected initially or after transformation to the ln or log of the value or other mathematical conversion.40(p121) The t test is a parametric statistic. See Table 3.

• Pearson product moment correlation: test of correlation between 2 groups of normally distributed data. See diagnostic discrimination.

• percentile: see quantile.

• placebo: a biologically inactive substance administered to some participants in a clinical trial. A placebo should ideally appear similar in every other way to the experimental treatment under investigation. Assignment, allocation, and assessment should be blinded.

• placebo effect: refers to specific expectations that participants may have of the intervention. These can make the intervention appear more effective than it actually is. Comparison of a group receiving placebo vs those receiving the active intervention allows researchers to identify effects of the intervention itself, as the placebo effect should affect both groups equally.

• point estimate: single value calculated from sample observations that is used as the estimate of the population value, or parameter40(p333); in most circumstances accompanied by an interval estimate (eg, 95% confidence interval).

• Poisson distribution: distribution that occurs when a nominal event (often disease or death) occurs rarely.42(p125) The Poisson distribution is used instead of a binomial distribution when sample size is calculated for a study of events that occur rarely.

• population: any finite or infinite collection of individuals from which a sample is drawn for a study to obtain estimates to approximate the values that would be obtained if the entire population were sampled.44(p197) A population may be defined narrowly (eg, all individuals exposed to a specific traumatic event) or widely (eg, all individuals at risk for coronary artery disease).

• population attributable risk percentage: percentage of risk within a population that is associated with exposure to the risk factor. Population attributable risk takes into account the frequency with which a particular event occurs and the frequency with which a given risk factor occurs in the population. Population attributable risk does not necessarily imply a cause-and-effect relationship. It is also called attributable fraction, attributable proportion, and etiologic fraction.40(p333)

• positive predictive value: proportion of those participants or individuals with a positive test result who have the condition or disease as measured by the criterion standard. This measure takes into account the prevalence of the condition or the disease. Clinically, it is the probability that an individual has the disease if the test result is positive.40(p334) See Table 4 and diagnostic discrimination.

• posterior probability: in Bayesian analysis, the probability obtained after the prior probability is combined with the probability from the study of interest.42(p128) If one assumes a uniform prior (no useful information for estimating probability exists before the study), the posterior probability is the same as the probability from the study of interest alone.

• post hoc analysis: analysis performed after completion of a study and not based on a hypothesis considered before the study. Such analyses should be performed without prior knowledge of the relationship between the dependent and independent variables. A potential hazard of post hoc analysis is the type I error.

→ While post hoc analyses may be used to explore intriguing results and generate new hypotheses for future testing, they should not be used to test hypotheses, because the comparison is not hypothesis-driven. See also data dredging.

• posttest probability: the probability that an individual has the disease if the test result is positive (positive predictive value) or that the individual does not have the disease if the test result is negative (negative predictive value).40(p158)
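Because the predictive values depend on prevalence, it may help to see how the same test performs in populations with different pretest probabilities. The sketch below applies Bayes theorem to a sensitivity, a specificity, and a prevalence; the function name and all numeric values are hypothetical.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (positive predictive value, negative predictive value)."""
    true_positive = sensitivity * prevalence
    false_negative = (1 - sensitivity) * prevalence
    true_negative = specificity * (1 - prevalence)
    false_positive = (1 - specificity) * (1 - prevalence)
    ppv = true_positive / (true_positive + false_positive)
    npv = true_negative / (true_negative + false_negative)
    return ppv, npv


# Hypothetical test with 90% sensitivity and 90% specificity,
# applied at a low and at a higher prevalence.
for prevalence in (0.01, 0.30):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

With these assumed values, the positive predictive value is far lower in the low-prevalence population even though the test characteristics are unchanged.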

• power: ability to detect a significant difference with the use of a given sample size and variance; determined by frequency of the condition under study, magnitude of the effect, study design, and sample size.40(p128) Power should be calculated before a study is begun. If the sample is too small to have a reasonable chance (usually 80% or 90%) of rejecting the null hypothesis if a true difference exists, then a negative result may indicate a type II error rather than a true failure to reject the null hypothesis.

→ Power calculations should be performed as part of the study design. A statement providing the power of the study should be included in the Methods section of all randomized controlled trials (see Table 1) and is appropriate for many other types of studies. A power statement is especially important if the study results are negative, to demonstrate that a type II error was unlikely to have been the reason for the negative result. Performing a post hoc power analysis is controversial, especially if it is based on the study results. Nonetheless, if such calculations were performed, they should be described in the Discussion section and their post hoc nature clearly stated.

Example: We determined that a sample size of 800 patients would have 80% power to detect the clinically important difference of 10% at α = .05.
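A power statement of this kind can be checked approximately with a normal approximation for the comparison of 2 proportions. The sketch below is one simple approach under stated assumptions, not the method used for any particular study: the event rates (50% vs 40%) and the equal allocation of 400 patients per group are hypothetical, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control: float, p_treated: float,
                          n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a 2-tailed test comparing 2 proportions.

    Uses the normal approximation with unpooled variance and ignores the
    negligible probability of rejecting in the wrong direction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sqrt(p_control * (1 - p_control) / n_per_group
                          + p_treated * (1 - p_treated) / n_per_group)
    z_effect = abs(p_control - p_treated) / standard_error
    return NormalDist().cdf(z_effect - z_alpha)


# Hypothetical design: 800 patients in total, 400 per group,
# with a 10-percentage-point difference in event rates.
power = power_two_proportions(0.50, 0.40, n_per_group=400)
print(f"Approximate power: {power:.0%}")
```

Under these assumptions the approximate power is slightly above 80%; different baseline rates or allocation ratios would give different results.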

• precision: inverse of the variance in measurement (see measurement error)42(p129); the degree of reproducibility that an instrument produces when measuring the same event. Note that precision and accuracy are independent concepts; if a blood pressure cuff is poorly calibrated against a standard, it may produce measurements that are precise but inaccurate.

• pretest probability: see prevalence.

• prevalence: proportion of persons with a particular disease at a given point in time. Prevalence can also be interpreted to mean the likelihood that a person selected at random from the population will have the disease (synonym: pretest probability).40(p334) See also incidence.

• principal components analysis: procedure used to group related variables to help describe data. The variables are grouped so that the original set of correlated variables is transformed into a smaller set of uncorrelated variables called the principal components.42(p131) Variables are not grouped according to dependent and independent variables, unlike many forms of statistical analysis. Principal components analysis is similar to factor analysis.

• prior probability: in Bayesian analysis, the probability of an event based on previous information before the study of interest is considered. The prior probability may be informative, based on previous studies or clinical information, or not, in which case the analysis uses a uniform prior (no information is known before the study of interest). A reference prior is one with minimal information, a clinical prior is based on expert opinion, and a skeptical prior is used when large treatment differences are not expected.44(p201) When Bayesian analysis is used to determine the posterior probability of a disease after a patient has undergone a diagnostic test, the prior probability may be estimated as the prevalence of the disease in the population from which the patient is drawn (usually the clinic or hospital population).

• probability: in clinical studies, the number of times an event occurs in a study group divided by the number of individuals being studied.40(p334)

• product-limit method: see Kaplan-Meier method.

• propensity analysis: in observational studies, a way of minimizing bias by selecting controls who have similar statistical likelihoods of having the outcome or intervention under investigation. In general, this involves examining a potentially large number of variables for their multivariate relationship with the outcome. The resulting model is then used to predict cases’ individual propensities to the outcome or intervention. Each case can then be matched to a control participant with a similar propensity. Propensity analysis is thus a way of correcting for underlying sources of bias when computing relative risk.

• proportionate mortality ratio: number of individuals who die of a particular disease during a span of time, divided by the number of individuals who die of all diseases during the same period.40(p334) This ratio may also be expressed as a rate, ie, a ratio per unit of time (eg, cardiovascular deaths per total deaths per year).

• prospective study: study in which participants with and without an exposure are identified and then followed up over time; the outcomes of interest have not occurred at the time the study commences.44(p205) Antonym is retrospective study.

• pseudorandomization: assigning of individuals to groups in a nonrandom manner, eg, selecting every other individual for an intervention or assigning participants by Social Security number or birth date.

• publication bias: tendency of articles reporting positive and/or “new” results to be submitted and published, and studies with negative or confirmatory results not to be submitted or published; especially important in meta-analysis, but also in other systematic reviews. Substantial publication bias has been demonstrated from the “file-drawer” problem.52 See funnel plot.

• purposive sample: set of observations obtained from a population in such a way that the sample distribution of independent variable values is determined by the researcher and is not necessarily representative of distribution of the values in the population.40(p334)

• P value: probability of obtaining the observed data (or data that are more extreme) if the null hypothesis were exactly true.44(p206)

→ While hypothesis testing often results in the P value, P values themselves can only provide information about whether the null hypothesis is rejected. Confidence intervals (CIs) are much more informative since they provide a plausible range of values for an unknown parameter, as well as some indication of the power of the study as indicated by the width of the CI.37(pp186-187) (For example, an odds ratio of 0.5 with a 95% CI of 0.05 to 4.5 indicates to the reader the [im]precision of the estimate, whereas P = .63 does not provide such information.) Confidence intervals are preferred whenever possible. Including both the CI and the P value provides more information than either alone.37(p187) This is especially true if the CI is used to provide an interval estimate and the P value to provide the results of hypothesis testing.

→ When any P value is expressed, it should be clear to the reader what parameters and groups were compared, what statistical test was performed, and the degrees of freedom (df) and whether the test was 1-tailed or 2-tailed (if these distinctions are relevant for the statistical test).

→ For expressing P values in manuscripts and articles, the actual value for P should be expressed to 2 digits for P ≥ .01, whether or not P is significant. (When rounding a P value expressed to 3 digits would make the P value nonsignificant, such as P = .049 rounded to .05, the P value can be left as 3 digits.) If P < .01, it should be expressed to 3 digits. The actual P value should be expressed (P = .04), rather than expressing a statement of inequality (P < .05), unless P < .001. Expressing P to more than 3 significant digits does not add useful information to P < .001, since precise P values with extreme results are sensitive to biases or departures from the statistical model.37(p198)

P values should not be listed simply as not significant or NS, since for meta-analysis the actual values are important and not providing exact P values is a form of incomplete reporting.37(p195) Because the P value represents the result of a statistical test and not the strength of the association or the clinical importance of the result, P values should be referred to simply as statistically significant or not significant; terms such as highly significant and very highly significant should be avoided.

JAMA and the Archives Journals do not use a zero to the left of the decimal point, since statistically it is not possible to prove or disprove the null hypothesis completely when only a sample of the population is tested (P cannot equal 0 or 1, except by rounding). If P < .00001, P should be expressed as P < .001 as discussed. If P > .999, P should be expressed as P > .99.
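The reporting rules above can be summarized as a small formatting routine. This is a minimal sketch, not an official implementation: the function name is hypothetical, the significance threshold of .05 is an assumed default, and values that would round to 1.00 are reported as P > .99 as an assumption of the sketch. It does not handle studies, such as genome-wide association studies, in which smaller P values are retained as the author presents them.

```python
def format_p_value(p: float, alpha: float = 0.05) -> str:
    """Format a P value following the rounding rules described above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("A P value must lie between 0 and 1.")
    if p < 0.001:
        return "P < .001"
    if p >= 0.995:
        # Assumption of this sketch: avoid reporting a rounded value of 1.00.
        return "P > .99"
    if p < 0.01:
        text = f"{p:.3f}"  # 3 digits when P < .01
    else:
        text = f"{p:.2f}"  # 2 digits when P >= .01
        if p < alpha and float(text) >= alpha:
            # Rounding would make a significant value appear nonsignificant,
            # so keep 3 digits (eg, .049 rather than .05).
            text = f"{p:.3f}"
    return "P = " + text.lstrip("0")  # no zero to the left of the decimal point


for value in (0.63, 0.04, 0.049, 0.0032, 0.00001, 0.9995):
    print(format_p_value(value))
```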

• qualitative data: data that fit into discrete categories according to their attributes, such as nominal or ordinal data, as opposed to quantitative data.42(p136)

• qualitative study: form of study based on observation and interview with individuals that uses inductive reasoning and a theoretical sampling model, with emphasis on validity rather than reliability of results. Qualitative research is used traditionally in sociology, psychology, and group theory but also occasionally in clinical medicine to explore beliefs and motivations of patients and physicians.53

• quality-adjusted life-year (QALY): method used in economic analyses to reflect the existence of chronic conditions that cause impairment, disability, and loss of independence. Numerical weights representing severity of residual disability are based on assessments of disability by study participants, parents, physicians, or other researchers made as part of utility analysis.42(p136)

• quantile: method used for grouping and describing dispersion of data. Commonly used quantiles are the tertile (3 equal divisions of data into lower, middle, and upper ranges), quartile (4 equal divisions of data), quintile (5 divisions), and decile (10 divisions). Quantiles are also referred to as percentiles.38(p165)

→ Data may be expressed as median (quantile range), eg, length of stay was 7.5 days (interquartile range, 4.3-9.7 days). See also interquartile range.
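The median and interquartile range can be obtained from the quartile cut points of the data. The sketch below uses hypothetical lengths of stay and the default interpolation of the Python standard library; statistical packages use several different quantile definitions, so reported quartiles may differ slightly between programs.

```python
import statistics


def median_and_iqr(values):
    """Return (median, lower quartile, upper quartile) for a list of values."""
    lower_quartile, median, upper_quartile = statistics.quantiles(values, n=4)
    return median, lower_quartile, upper_quartile


# Hypothetical lengths of stay in days, for illustration only.
lengths_of_stay = [2.0, 3.5, 4.0, 5.5, 7.0, 8.0, 9.0, 9.5, 11.0, 14.0]
median, q1, q3 = median_and_iqr(lengths_of_stay)
print(f"Length of stay was {median} days (interquartile range, {q1}-{q3} days).")
```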

• quantitative data: data in numerical quantities such as continuous data or counts42(p137) (as opposed to qualitative data). Nominal and ordinal data may be treated either qualitatively or quantitatively.

• quasi-experiment: experimental design in which variables are specified and participants assigned to groups, but interventions cannot be controlled by the experimenter. One type of quasi-experiment is the natural experiment.42(p137)

• r: correlation coefficient for bivariate analysis.

• R: correlation coefficient for multivariable analysis.

• r²: coefficient of determination for bivariate analysis. See also correlation coefficient.

• R²: coefficient of determination for multivariable analysis. See also correlation coefficient.

• random-effects model: model used in meta-analysis that assumes that there is a universe of conditions and that the effects observed in the studies are only a sample, ideally a random sample, of the possible effects.34(p349) Antonym is fixed-effects model.

• randomization: method of assignment in which all individuals have the same chances of being assigned to the conditions in a study. Individuals may be randomly assigned at a 2:1 or 3:1 frequency, in addition to the usual 1:1 frequency. Participants may or may not be representative of a larger population.37(p334) Simple methods of randomization include coin flip or use of a random numbers table. See also block randomization.

• randomized controlled trial: see 20.2.1, Randomized Controlled Trials, Parallel-Design Double-blind Trials.

• random sample: method of obtaining a sample that ensures that every individual in the population has a known (but not necessarily equal, for example, in weighted sampling techniques) chance of being selected for the sample.40(p335)

• range: the highest and lowest values of a variable measured in a sample.

Example: The mean age of the participants was 45.6 years (range, 20-64 years).

• rank sum test: see Mann-Whitney test or Wilcoxon rank sum test.

• rate: measure of the occurrence of a disease or outcome per unit of time, usually expressed as a decimal if the denominator is 100 (eg, the surgical mortality rate was 0.02). See also 19.7.3, Numbers and Percentages, Forms of Numbers, Reporting Proportions and Percentages.

• ratio: fraction in which the numerator is not necessarily a subset of the denominator, unlike a proportion40(p335) (eg, the assignment ratio was 1:2:1 for each drug dose [twice as many individuals were assigned to the second group as to the first and third groups]).

• recall bias: systematic error resulting from individuals in one group being more likely than individuals in the other group to remember past events.42(p141)

→ Recall bias is especially common in case-control studies that assess risk factors for serious illness in which individuals are asked about past exposures or behaviors, such as environmental exposure in an individual who has cancer.40(p335)

• receiver operating characteristic curve (ROC curve): graphic means of assessing the extent to which a test can be used to discriminate between persons with and without disease,42(p142) and to select an appropriate cut point for defining normal vs abnormal results. The ROC curve is created by plotting sensitivity vs (1 − specificity). The area under the curve provides some measure of how well the test performs; the larger the area, the better the test. See Figure 4. The C statistic is a measure of the area under the ROC curve.

→ The appropriate cut point is a function of the test. A screening test would require high sensitivity, whereas a diagnostic or confirmatory test would require high specificity. See Table 4 and diagnostic discrimination.
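The construction of an ROC curve can be sketched by treating each observed test value as a candidate cut point, computing sensitivity and (1 − specificity) at that cut point, and estimating the area under the curve with the trapezoidal rule. The function names, test scores, and disease labels below are hypothetical.

```python
def roc_points(scores, has_disease):
    """Return (1 - specificity, sensitivity) pairs, one per candidate cut point."""
    n_diseased = sum(has_disease)
    n_healthy = len(has_disease) - n_diseased
    points = [(0.0, 0.0)]
    for cut_point in sorted(set(scores), reverse=True):
        # A result is called positive when the score is at or above the cut point.
        true_pos = sum(1 for s, d in zip(scores, has_disease) if s >= cut_point and d)
        false_pos = sum(1 for s, d in zip(scores, has_disease) if s >= cut_point and not d)
        points.append((false_pos / n_healthy, true_pos / n_diseased))
    return points


def area_under_curve(points):
    """Trapezoidal estimate of the area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical test scores and criterion-standard disease status (1 = diseased).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
has_disease = [1, 1, 0, 1, 1, 0, 0, 0]
auc = area_under_curve(roc_points(scores, has_disease))
print(f"Area under the ROC curve: {auc:.3f}")
```

An area of 0.5 corresponds to a test that discriminates no better than chance, and an area of 1.0 to perfect discrimination.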

• reference group: group of presumably disease-free individuals from which a sample of individuals is drawn and tested to establish a range of normal values for a test.40(p335)

• regression analysis: statistical techniques used to describe a dependent variable as a function of 1 or more independent variables; often used to control for confounding variables.40(p335) See also linear regression, logistic regression.

• regression line: diagrammatic presentation of a linear regression equation, with the independent variable plotted on the x-axis and the dependent variable plotted on the y-axis. As many as 3 variables may be depicted on the same graph.42(p145)

• regression to the mean: the principle that extreme values are unlikely to recur. If a test that produced an extreme value is repeated, it is likely that the second result will be closer to the mean. Thus, after repeated observations results tend to “regress to the mean.” A common example is blood pressure measurement; on repeated measurements, individuals who are initially hypertensive often will have a blood pressure reading closer to the population mean than the initial measurement was.40(p335)

• relative risk (RR): probability of developing an outcome within a specified period if a risk factor is present, divided by the probability of developing the outcome in that same period if the risk factor is absent. The relative risk is applicable to randomized clinical trials and cohort studies40(p335); for case-control studies the odds ratio can be used to approximate the relative risk if the outcome is infrequent.

→ The relative risk should be accompanied by confidence intervals.

Example: The individuals with untreated mild hypertension had a relative risk of 2.4 (95% confidence interval, 1.9-3.0) for stroke or transient ischemic attack. [In this example, individuals with untreated mild hypertension were 2.4 times more likely than were individuals in the comparison group to have a stroke or transient ischemic attack.]
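A relative risk and its confidence interval can be computed from cohort counts as in the sketch below. The counts are hypothetical and were chosen only so that the point estimate equals 2.4; they are not the data behind the example above. The interval uses the common log-based approximation, which is an assumption of this sketch, and the function name is illustrative.

```python
import math


def relative_risk_with_ci(events_exposed: int, n_exposed: int,
                          events_unexposed: int, n_unexposed: int,
                          z: float = 1.96):
    """Relative risk with an approximate CI based on the log transformation."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    relative_risk = risk_exposed / risk_unexposed
    se_log_rr = math.sqrt(1 / events_exposed - 1 / n_exposed
                          + 1 / events_unexposed - 1 / n_unexposed)
    lower = math.exp(math.log(relative_risk) - z * se_log_rr)
    upper = math.exp(math.log(relative_risk) + z * se_log_rr)
    return relative_risk, lower, upper


# Hypothetical cohort: 48 of 200 exposed and 20 of 200 unexposed develop the outcome.
rr, lower, upper = relative_risk_with_ci(48, 200, 20, 200)
print(f"RR = {rr:.1f} (95% confidence interval, {lower:.1f}-{upper:.1f})")
```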

• relative risk reduction (RRR): proportion of the control group experiencing a given outcome minus the proportion of the treatment group experiencing the outcome, divided by the proportion of the control group experiencing the outcome.
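The definition of the relative risk reduction translates directly into a calculation. In this minimal sketch the function name and the event proportions are hypothetical.

```python
def relative_risk_reduction(control_proportion: float, treated_proportion: float) -> float:
    """(control - treated) / control, per the definition above."""
    if control_proportion <= 0:
        raise ValueError("The control group must have a nonzero event proportion.")
    return (control_proportion - treated_proportion) / control_proportion


# Hypothetical trial: the outcome occurs in 20% of controls and 15% of treated patients.
rrr = relative_risk_reduction(0.20, 0.15)
print(f"RRR = {rrr:.0%}")
```

Note that the same hypothetical trial has an absolute risk reduction of only 5 percentage points, which is why relative and absolute measures are best reported together.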

• reliability: ability of a test to replicate a result given the same measurement conditions, as distinguished from validity, which is the ability of a test to measure what it is intended to measure.42(p145)

• repeated measures: analysis designed to take into account the lack of independence of events when measures are repeated in each participant over time (eg, blood pressure, weight, or test scores). This type of analysis emphasizes the change measured for a participant over time, rather than the differences between participants over time.

• repeated-measures ANOVA: see analysis of variance.

• reporting bias: a bias in assessment that can occur when individuals in one group are more likely than individuals in another group to report past events. Reporting bias is especially likely to occur when different groups have different reasons to report or not report information.40(pp335-336) For example, when examining behaviors, adolescent girls may be less likely than adolescent boys to report being sexually active. See also recall bias.

• reproducibility: ability of a test to produce consistent results when repeated under the same conditions and interpreted without knowledge of the prior results obtained with the same test40(p336); same as reliability.

• residual: measure of the discrepancy between observed and predicted values. The residual SD is a measure of the goodness of fit of the regression line to the data and gives the uncertainty of estimating a point y from a point x.38(p176)

• residual confounding: in observational studies, the possibility that differences in outcome may be caused by unmeasured or unmeasurable factors.

• response rate: number of complete interviews with reporting units divided by the number of eligible units in the sample.36 See 20.7, Survey Studies.

• retrospective study: study performed after the outcomes of interest have already occurred42(p147); most commonly a case-control study, but also may be a retrospective cohort study or case series. Antonym is prospective study.

• right-censored data: see censored data.

• risk: probability that an event will occur during a specified period. Risk is equal to the number of individuals who develop the disease during the period divided by the number of disease-free persons at the beginning of the period.40(p336)

• risk factor: characteristic or factor that is associated with an increased probability of developing a condition or disease. Also called a risk marker, a risk factor does not necessarily imply a causal relationship. A modifiable risk factor is one that can be modified through an intervention42(p148) (eg, stopping smoking or treating an elevated cholesterol level, as opposed to a genetically linked characteristic for which there is no effective treatment).

• risk ratio: the ratio of 2 risks. See also relative risk.

• robustness: term used to indicate that a statistical procedure’s assumptions (most commonly, normal distribution of data) can be violated without a substantial effect on its conclusions.42(p149)

• root-mean-square: see standard deviation.

• rule of 3: method used to estimate the number of observations required to have a 95% chance of observing at least 1 episode of a serious adverse effect. For example, to observe at least 1 case of penicillin anaphylaxis that occurs in about 1 in 10 000 cases treated, 30 000 treated cases must be observed. If an adverse event occurs 1 in 15 000 times, 45 000 cases need to be treated and observed.40(p114)
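The rule of 3 can be compared with the exact number of observations required for a 95% chance of seeing at least 1 event, which follows from solving 1 − (1 − p)^n ≥ 0.95 for n. The sketch below assumes independent observations with a constant event probability; the function names are hypothetical.

```python
import math


def rule_of_three(one_in: int) -> int:
    """Approximate observations needed when an event occurs once per `one_in` cases."""
    return 3 * one_in


def exact_observations(one_in: int, chance: float = 0.95) -> int:
    """Smallest n with at least `chance` probability of observing 1 or more events."""
    event_probability = 1 / one_in
    return math.ceil(math.log(1 - chance) / math.log(1 - event_probability))


for one_in in (10_000, 15_000):
    print(f"1 in {one_in}: rule of 3 gives {rule_of_three(one_in)}, "
          f"exact calculation gives {exact_observations(one_in)}")
```

The rule of 3 slightly overestimates the exact figure because −ln(0.05) is about 2.996 rather than 3.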

• run-in period: a period at the start of a trial when no treatment is administered (although a placebo may be administered). This can help to ensure that patients are stable and will adhere to treatment. This period may also be used to allow patients to discontinue any previous treatments, and so is sometimes also called a washout period.

• sample: subset of a larger population, selected for investigation to draw conclusions or make estimates about the larger population.52(p336)

• sampling error: error introduced by chance differences between the estimate obtained from the sample and the true value in the population from which the sample was drawn. Sampling error is inherent in the use of sampling methods and is measured by the standard error.40(p336)

• Scheffé test: see multiple comparisons procedures.

• SD: see standard deviation.

• SE: see standard error.

• SEE: see standard error of the estimate.

• selection bias: bias in assignment that occurs when the way the study and control groups are chosen causes them to differ from each other by at least 1 factor that affects the outcome of the study.40(p336)

→ A common type of selection bias occurs when individuals from the study group are drawn from one population (eg, patients seen in an emergency department or admitted to a hospital) and the control participants are drawn from another (eg, clinic patients). Regardless of the disease under study, the clinic patients will be healthier overall than the patients seen in the emergency department or hospital and will not be comparable controls. A similar example is the “healthy worker effect”: people who hold jobs are likely to have fewer health problems than those who do not, and thus comparisons between these groups may be biased.

• SEM: see standard error of the mean.

• sensitivity: proportion of individuals with the disease or condition as measured by the criterion standard who have a positive test result (true positives divided by all those with the disease).40(p336) See Table 4 and diagnostic discrimination.

• sensitivity analysis: method to determine the robustness of an assessment by examining the extent to which results are changed by differences in methods, values of variables, or assumptions40(p154); applied in decision analysis to test the robustness of the conclusion to changes in the assumptions.

• signed rank test: see Wilcoxon signed rank test.

• significance: statistically, the testing of the null hypothesis of no difference between groups. A significant result rejects the null hypothesis. Statistical significance is highly dependent on sample size and provides no information about the clinical significance of the result. Clinical significance, on the other hand, involves a judgment as to whether the risk factor or intervention studied would affect a patient’s outcome enough to make a difference for the patient. The level of clinical significance considered important is sometimes defined prospectively (often by consensus of a group of physicians) as the minimal clinically important difference, but the cutoff is arbitrary.

• sign test: a nonparametric test of significance that depends on the signs (positive or negative) of variables and not on their magnitude; used when combining the results of several studies, as in meta-analysis.42(p156) See also Cox-Stuart trend test.

• skewness: the degree to which the data are asymmetric on either side of the central tendency. Data for a variable with a longer tail on the right of the distribution curve are referred to as positively skewed; data with a longer left tail are negatively skewed.44(pp238-239)

• snowball sampling: a sampling method in which survey respondents are asked to recommend other respondents who might be eligible to participate in the survey. This may be used when the researcher is not entirely familiar with demographic or cultural patterns in the population under investigation.

• Spearman rank correlation (ρ): statistical test used to determine the covariance between 2 nominal or ordinal variables.44(p243) The nonparametric equivalent to the Pearson product moment correlation, it can also be used to calculate the coefficient of determination.
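Because the Spearman coefficient is the nonparametric counterpart of the Pearson correlation, one common way to compute it is to replace each observation with its rank and then apply the Pearson formula to the ranks. The sketch below does this with average ranks for ties; the helper names are illustrative and the data are made up.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product moment correlation of 2 equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ordinal scores for 4 participants.
print(spearman_rho([1, 2, 3, 4], [1, 1, 2, 3]))
```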

• specificity: proportion of those without the disease or condition as measured by the criterion standard who have negative results by the test being studied40(p326) (true negatives divided by all those without the disease). See Table 4 and diagnostic discrimination.

• standard deviation (SD): commonly used descriptive measure of the spread or dispersion of data; the positive square root of the variance.40(p336) The mean ± 2 SDs represents the middle 95% of values obtained.

→ Describing data by means of SD implies that the data are normally distributed; if they are not, then the interquartile range or a similar measure involving quantiles is more appropriate to describe the data, particularly if the mean ± 2 SDs would be nonsensical (eg, mean [SD] length of stay = 9 [15] days, or mean [SD] age at evaluation = 4 [5.3] days). Note that the format mean (SD) should be used, rather than the ± construction.

• standard error (SE): positive square root of the variance of the sampling distribution of the statistic.38(p195) Thus, the SE provides an estimate of the precision with which a parameter can be estimated. There are several types of SE; the type intended should be clear.

In text and tables that provide descriptive statistics, SD rather than SE is usually appropriate; by contrast, parameter estimates (eg, regression coefficients) should be accompanied by SEs. In figures where error bars are used, the 95% confidence interval is preferred54 (see Example F10 in 4.2.1, Visual Presentation of Data, Figures, Statistical Graphs).

• standard error of the difference: measure of the dispersion of the differences between samples of 2 populations, usually the differences between the means of 2 samples; used in the t test.

• standard error of the estimate: SD of the observed values about the regression line.38(p195)

• standard error of the mean (SEM): An inferential statistic, which describes the certainty with which the mean computed from a random sample estimates the true mean of the population from which the sample was drawn.39(p21) If multiple samples of a population were taken, then 95% of the samples would have means that fall within ± 2 SEMs of the mean of all the sample means. Larger sample sizes will be accompanied by smaller SEMs, because larger samples provide a more precise estimate of the population mean than do smaller samples.

→ The SEM is not interchangeable with SD. The SD generally describes the observed dispersion of data around the mean of a sample. By contrast, the SEM provides an estimate of the precision with which the true population mean can be inferred from the sample mean. The mean itself can thus be understood as either a descriptive or an inferential statistic; it is this intended interpretation that governs whether it should be accompanied by the SD or SEM. In the former case the mean simply describes the average value in the sample and should be accompanied by the SD, while in the latter it provides an estimate of the population mean and should be accompanied by the SEM. The interpretation of the mean is often clear from the text, but authors may need to be queried to discern their intent in presenting this statistic.
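The distinction between SD and SEM can be seen numerically. The sketch below uses the usual estimate SEM = SD/√n (a standard formula, not one stated in this glossary) on made-up measurements; the function name is illustrative.

```python
import math
import statistics


def mean_sd_sem(values):
    """Return (mean, SD, SEM) for a sample.

    SD describes the spread of the observed values; SEM (= SD / sqrt(n))
    estimates how precisely the sample mean estimates the population mean.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, with n - 1 in the denominator
    sem = sd / math.sqrt(n)
    return mean, sd, sem


# Hypothetical measurements from 8 participants.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mean, sd, sem = mean_sd_sem(sample)
print(f"mean (SD) = {mean:.1f} ({sd:.2f}); SEM = {sem:.2f}")
```

Quadrupling the sample size with the same SD would halve the SEM, which is why larger samples give smaller SEMs.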

• standard error of the proportion: SD of the population of all possible values of the proportion computed from samples of a given size.39(p109)

• standardization (of a rate): adjustment of a rate to account for factors such as age or sex.40(pp336-350)

• standardized mortality ratio: ratio in which the numerator contains the observed number of deaths and the denominator contains the number of deaths that would be expected in a comparison population. This ratio implies that confounding factors have been controlled for by means of indirect standardization. It is distinguished from proportionate mortality ratio, which is the mortality rate for a specific disease.40(p337)
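With indirect standardization, the expected number of deaths is usually obtained by applying the comparison population's stratum-specific death rates to the study population's stratum sizes. The sketch below follows that approach; the function names, age strata, and rates are all hypothetical.

```python
def expected_deaths(stratum_sizes, reference_rates):
    """Expected deaths if the study population had the comparison
    population's death rate within each stratum (eg, age group)."""
    if len(stratum_sizes) != len(reference_rates):
        raise ValueError("need one reference rate per stratum")
    return sum(size * rate for size, rate in zip(stratum_sizes, reference_rates))


def standardized_mortality_ratio(observed_deaths, expected):
    """Observed number of deaths divided by the expected number of deaths."""
    return observed_deaths / expected


# Hypothetical study population with 2 age strata.
sizes = [1000, 500]   # persons in each stratum
rates = [0.01, 0.04]  # comparison-population death rates per stratum
expected = expected_deaths(sizes, rates)
print(expected, standardized_mortality_ratio(45, expected))
```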

• standard normal distribution: a normal distribution in which the raw scores have been recomputed to have a mean of 0 and an SD of 1.44(p245) Such recomputed values are referred to as z scores or standard scores. The mean, median, and mode are all equal to zero.

• standard score: see z score.38(p196)

• statistic: value calculated from sample data that is used to estimate a value or parameter in the larger population from which the sample was obtained,40(p337) as distinguished from data, which refers to the actual values obtained via direct observation (eg, measurement, chart review, patient interview).

• stochastic: type of measure that implies the presence of a random variable.38(p197)

• stopping rule: rule, based on a test statistic or other function, specified as part of the design of the trial and established before patient enrollment, that specifies a limit for the observed treatment difference for the primary outcome measure, which, if exceeded, will lead to the termination of the trial or one of the study groups.7(p258) The stopping rules are designed to ensure that a study does not continue to enroll patients after a significant treatment difference has been demonstrated that would still exist regardless of the treatment results of subsequently enrolled patients.

• stratification: division into groups. Stratification may be used to compare groups separated according to similar confounding characteristics. Stratified sampling may be used to increase the number of individuals sampled in rare categories of independent variables, or to obtain an adequate sample size to examine differences among individuals with certain characteristics of interest.29(p337)

• Student-Newman-Keuls test: see Newman-Keuls test.

• Student t test: see t test. W. S. Gosset, who originated the test, wrote under the name Student because his employment precluded individual publication.42(p166) Simply using the term t test is preferred.

• study group: in a controlled clinical trial, the group of individuals who undergo an intervention; in a cohort study, the group of individuals with the exposure or characteristic of interest; and in a case-control study, the group of cases.40(p337)

• sufficient cause: characteristic that will bring about or cause the disease.40(p337)

• supportive criteria: substantiation of the existence of a contributory cause. Potential supportive criteria include the strength and consistency of the relationship, the presence of a dose-response relationship, and biological plausibility.40(p337)

• surrogate end points: in a clinical trial, outcomes that are not of direct clinical importance but that are believed to be related to those that are. Such variables are often physiological measurements (eg, blood pressure) or biochemical (eg, cholesterol level). Such end points can usually be collected more quickly and economically than clinical end points, such as myocardial infarction or death, but their clinical relevance may be less certain.

• survival analysis: statistical procedures for estimating the survival function and for making inferences about how it is affected by treatment and prognostic factors.42(p163) See life table.

• target population: group of individuals to whom one wishes to apply or extrapolate the results of an investigation, not necessarily the population studied.40(p337) If the target population is different from the population studied, whether the study results can be extrapolated to the target population should be discussed.

• τ (tau): see Kendall τ rank correlation.

• trend, test for: see χ² test.

• trial: controlled experiment with an uncertain outcome38(p208); used most commonly to refer to a randomized study.

• triangulation: in qualitative research, the simultaneous use of several different techniques to study the same phenomenon, thus revealing and avoiding biases that may occur if only a single method were used.

• true negative: negative test result in an individual who does not have the disease or condition as determined by the criterion standard.40(p338) See also Table 4.

• true-negative rate: number of individuals who have a negative test result and do not have the disease by the criterion standard divided by the total number of individuals who do not have the disease as determined by the criterion standard; usually expressed as a decimal (eg, the true-negative rate was 0.85). See also Table 4.

• true positive: positive test result in an individual who has the disease or condition as determined by the criterion standard.40(p338) See also Table 4.

• true-positive rate: number of individuals who have a positive test result and have the disease as determined by the criterion standard divided by the total number of individuals who have the disease as measured by the criterion standard; usually expressed as a decimal (eg, the true-positive rate was 0.92). See also Table 4.

• t test: statistical test used when the independent variable is binary and the dependent variable is continuous. Use of the t test assumes that the dependent variable has a normal distribution; if not, nonparametric statistics must be used.40(p266)

→ Usually the t test is unpaired, unless the data have been measured in the same individual over time. A paired t test is appropriate to assess the change of the parameter in the individual from baseline to final measurement; in this case, the dependent variable is the change from one measurement to the next. These changes are usually compared against 0, on the null hypothesis that there is no change from time 1 to time 2.

→ Presentation of the t statistic should include the degrees of freedom (df), whether the t test was paired or unpaired, and whether a 1-tailed or 2-tailed test was used. Since a 1-tailed test assumes that the study effect can have only 1 possible direction (ie, only beneficial or only harmful), justification for use of the 1-tailed test must be provided. (The 1-tailed test at α = .05 is similar to testing at α = .10 for a 2-tailed test and therefore is more likely to give a significant result.)

Example: The difference was significant by a 2-tailed test for paired samples ($t_{15}$ = 2.78, P = .05).

→ The t test can also be used to compare different coefficients of variation.
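For the paired t test described above, the dependent variable is the within-individual change, and the mean change is compared against 0. The sketch below computes the t statistic and its degrees of freedom from made-up paired measurements; the function name is illustrative, and obtaining the P value (which requires the t distribution) is left to a statistics package.

```python
import math
import statistics


def paired_t(baseline, final):
    """Return (t, df) for a paired t test of the null hypothesis that the
    mean change from baseline to final measurement is 0."""
    if len(baseline) != len(final) or len(baseline) < 2:
        raise ValueError("need at least 2 paired measurements")
    changes = [after - before for before, after in zip(baseline, final)]
    n = len(changes)
    mean_change = statistics.mean(changes)
    # SE of the mean change; this is 0 (and t undefined) if all changes are equal.
    se_change = statistics.stdev(changes) / math.sqrt(n)
    return mean_change / se_change, n - 1


# Hypothetical measurements in the same 4 individuals at time 1 and time 2.
t, df = paired_t([10, 12, 9, 11], [12, 15, 10, 14])
print(f"t = {t:.2f}, df = {df}")
```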

• Tukey test: a type of multiple comparisons procedure.

• 2-tailed test: test of statistical significance in which deviations from the null hypothesis in either direction are considered.40(p338) For most outcomes, the 2-tailed test is appropriate unless there is a plausible reason why only 1 direction of effect is considered and a 1-tailed test is appropriate. Commonly used for the t test, but can also be used in other statistical tests.

• 2-way analysis of variance: see analysis of variance.

• type I error: a result in which the sample data lead to a rejection of the null hypothesis despite the fact that the null hypothesis is actually true in the population. The α level is the probability of a type I error that will be permitted, usually .05.

→ A frequent cause of a type I error is performing multiple comparisons, which increase the likelihood that a significant result will be found by chance. To avoid a type I error, one of several multiple comparisons procedures can be used.
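The effect of multiple comparisons on the type I error rate can be illustrated with a short calculation. The sketch below assumes the comparisons are independent and uses the Bonferroni correction as one example of a multiple comparisons procedure; the function names are illustrative.

```python
def familywise_error_rate(alpha: float, n_comparisons: int) -> float:
    """Probability of at least 1 type I error across n independent
    comparisons, each tested at the given alpha level."""
    return 1 - (1 - alpha) ** n_comparisons


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison alpha under the Bonferroni correction."""
    return alpha / n_comparisons


# With 10 independent comparisons at alpha = .05, the chance of at least
# 1 false-positive result is far higher than .05.
print(familywise_error_rate(0.05, 10))
print(bonferroni_alpha(0.05, 10))
```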

• type II error: the situation where the sample data lead to a failure to reject the null hypothesis despite the fact that the null hypothesis is actually false in the population.

→ A frequent cause of a type II error is insufficient sample size. Therefore, a power calculation should be performed when a study is planned to determine the sample size needed to avoid a type II error.

• uncensored data: continuous data reported as collected, without adjustment, as opposed to censored data.

• uniform prior: assumption that no useful information regarding the outcome of interest is available prior to the study, and thus that all individuals have an equal prior probability of the outcome. See Bayesian analysis.

• unity: synonymous with the number 1; a relative risk of 1 is a relative risk of unity, and a regression line with a slope of 1 is said to have a slope of unity.

• univariable analysis: another name for univariate analysis.

• univariate analysis: statistical tests involving only 1 dependent variable; uses measures of central tendency (mean or median) and location or dispersion. The term may also apply to an analysis in which there are no independent variables. In this case, the purpose of the analysis is to describe the sample, determine how the sample compares with the population, and determine whether chance has resulted in a skewed distribution of 1 or more of the variables in the study. If the characteristics of the sample do not reflect those of the population from which the sample was drawn, the results may not be generalizable to that population.40(pp245-246)

• unpaired analysis: method that compares 2 treatment groups when the 2 treatments are not given to the same individual. Most case-control studies also use unpaired analysis.

• unpaired t test: see t test.

• U test: see Wilcoxon rank sum test.

• utility: in decision theory and clinical decision analysis, a scale used to judge the preference of achieving a particular outcome (used in studies to quantify the value of an outcome vs the discomfort of the intervention to a patient) or the discomfort experienced by the patient with a disease.42(p170) Commonly used methods are the time trade-off and the standard gamble. The result is expressed as a single number along a continuum from death (0) to full health or absence of disease (1.0). This quality number can then be multiplied by the number of years a patient is in the health state produced by a particular treatment to obtain the quality-adjusted life-year. See also 20.5, Cost-effectiveness Analysis, Cost-Benefit Analysis.
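The quality-adjusted life-year calculation described above is a single multiplication. A minimal sketch, with a hypothetical utility value and duration:

```python
def quality_adjusted_life_years(utility: float, years: float) -> float:
    """Utility (0 = death, 1.0 = full health) multiplied by the number of
    years spent in the corresponding health state."""
    if not 0.0 <= utility <= 1.0:
        raise ValueError("utility must lie between 0 and 1.0")
    if years < 0:
        raise ValueError("years must be nonnegative")
    return utility * years


# Hypothetical: 10 years in a health state with utility 0.5.
print(quality_adjusted_life_years(0.5, 10))
```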

• validity (of a measurement): degree to which a measurement is appropriate for the question being addressed or measures what it is intended to measure. For example, a test may be highly consistent and reproducible over time, but unless it is compared with a criterion standard or other validation method, the test cannot be considered valid (see also diagnostic discrimination). Construct validity refers to the extent to which the measurement corresponds to theoretical concepts. Because there are no criterion standards for constructs, construct validity is generally established by comparing the results of one method of measurement with those of other methods. Content validity is the extent to which the measurement samples the entire domain under study (eg, a measurement to assess delirium must evaluate cognition). Criterion validity is the extent to which the measurement is correlated with some quantifiable external criterion (eg, a test that predicts reaction time). Validity can be concurrent (assessed simultaneously) or predictive (eg, ability of a standardized test to predict school performance).42(p171)

→ Validity of a test is sometimes mistakenly used as a synonym of reliability; the two are distinct statistical concepts and should not be used interchangeably. Validity is related to the idea of accuracy, while reliability is related to the idea of precision.

• validity (of a study): internal validity means that the observed differences between the control and comparison groups may, apart from sampling error, be attributed to the effect under study; external validity or generalizability means that a study can produce unbiased inferences regarding the target population, beyond the participants in the study.42(p171)

• Van der Waerden test: nonparametric test that is sensitive to differences in location for 2 samples from otherwise identical populations.38(p216)

• variable: characteristic measured as part of a study. Variables may be dependent (usually the outcome of interest) or independent (characteristics of individuals that may affect the dependent variable).

• variance: variation measured in a set of data for one variable, defined as the sum of the squared deviations of each data point from the mean of the variable, divided by the df (number of observations in the sample − 1).44(p266) The SD is the square root of the variance.
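The definition of variance can be followed step by step in code. The sketch below computes the sum of squared deviations, divides by the df, and checks the result against Python's standard library; the function name and data are illustrative.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by the df (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations")
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / (n - 1)


# Hypothetical data for one variable.
data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(data)
print(variance, math.sqrt(variance))  # variance and SD
print(statistics.variance(data))      # library value for comparison
```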

• variance components analysis: process of isolating the sources of variability in the outcome variable for the purpose of analysis.

• variance ratio distribution: synonym for F distribution.42(p61)

• visual analog scale: scale used to quantify subjective factors such as pain, satisfaction, or values that individuals attach to possible outcomes. Participants are asked to indicate where their current feelings fall by marking a straight line with 1 extreme, such as “worst pain ever experienced,” at one end of the scale and the other extreme, such as “pain-free,” at the other end. The feeling (eg, degree of pain) is quantified by measuring the distance from the mark on the scale to the end of the scale.42(p268)

• washout period: see 20.2.2, Randomized Controlled Trials, Crossover Trials.

• Wilcoxon rank sum test: a nonparametric test that ranks and sums observations from combined samples and compares the result with the sum of ranks from 1 sample.38(p220) U is the statistic that results from the test. Alternative name for the Mann-Whitney test.

• Wilcoxon signed rank test: nonparametric test in which 2 treatments that have been evaluated by means of matched samples are compared. Each observation is ranked according to size and given the sign of the treatment difference (ie, positive if the treatment effect was positive and vice versa) and the ranks are summed.38(p220)

• Wilks Λ (lambda): a test used in multivariate analysis of variance (MANOVA) that tests the effect size for all the dependent variables considered simultaneously. It thus adjusts significance levels for multiple comparisons.

• x-axis: horizontal axis of a graph. By convention, the independent variable is plotted on the x-axis. Synonym is abscissa.

• Yates correction: continuity correction used to bring a distribution based on discontinuous frequencies closer to the continuous χ² distribution from which χ² tables are derived.42(p176)

• y-axis: vertical axis of a graph. By convention, the dependent variable is plotted on the y-axis. Synonym is ordinate.

• z-axis: third axis of a 3-dimensional graph, generally placed so that it appears to project out toward the reader. The z-axis and x-axis are both used to plot independent variables and are often used to demonstrate that the 2 independent variables each contribute independently to the dependent variable. See x-axis and y-axis.

• z score: score used to analyze continuous variables that represents the deviation of a value from the mean value, expressed as the number of SDs from the mean. The z score is frequently used to compare children’s height and weight measurements, as well as behavioral scores.42(p176) It is sometimes referred to as the standard score.
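A z score is the deviation of a value from the mean divided by the SD. A minimal sketch, using a hypothetical reference mean and SD:

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """Number of SDs by which a value lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("SD must be positive")
    return (value - mean) / sd


# Hypothetical behavioral score of 130 against a reference mean (SD) of 100 (15).
print(z_score(130, 100, 15))
```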

Figure 2. Decision tree showing decision nodes (squares) and chance outcomes (circles). End branches are labeled with outcome states. The subtrees to which the decision tree refers are depicted in a separate figure for simplicity. Adapted from Mason JJ, Owens DK, Harris RA, Cooke JP, Hlatky MA. The role of coronary angiography and coronary revascularization before noncardiac vascular surgery. JAMA. 1995;273(24):1919–1925.

Table 4. Diagnostic Discrimination

| Test Result | Disease by Criterion Standard | Disease Free by Criterion Standard |
| --- | --- | --- |
| Positive | a (true positives) | b (false positives) |
| Negative | c (false negatives) | d (true negatives) |
| | a + c = total number of persons with disease | b + d = total number of persons without disease |

Sensitivity = $\frac{a}{a+c}$

Specificity = $\frac{d}{b+d}$

Positive predictive value = $\frac{a}{a+b}$

Negative predictive value = $\frac{d}{c+d}$
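The 4 measures in Table 4 can be computed directly from the cell counts a, b, c, and d. The sketch below uses the table's cell labels; the class name and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts of Table 4, classified by the criterion standard."""
    a: int  # true positives
    b: int  # false positives
    c: int  # false negatives
    d: int  # true negatives

    def sensitivity(self) -> float:
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        return self.d / (self.b + self.d)

    def positive_predictive_value(self) -> float:
        return self.a / (self.a + self.b)

    def negative_predictive_value(self) -> float:
        return self.d / (self.c + self.d)


# Hypothetical study: 100 persons with disease, 100 without.
table = TwoByTwo(a=90, b=20, c=10, d=80)
print(table.sensitivity(), table.specificity())
print(table.positive_predictive_value(), table.negative_predictive_value())
```

Sensitivity and specificity here correspond to the true-positive rate and true-negative rate defined in the glossary.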

Figure 3. Survival curve showing outcomes for 2 treatment groups with number at risk at each time point. While numbers at risk are not essential to include in a survival analysis figure, this presentation conveys more information than the curve alone would. Adapted from Rotman M, Pajak TF, Choi K, et al. Prophylactic extended-field irradiation of para-aortic lymph nodes in stages IIB and bulky IB and IIA cervical carcinomas: ten-year treatment results of RTOG 79–20. JAMA. 1995;274(5):387–393.

Figure 4. Receiver operating characteristic curve. The 45° line represents the point at which the test is no better than chance. The area under the curve measures the performance of the test; the larger the area under the curve, the better the test performance. Adapted from Grover SA, Coupal L, Hu X-P. Identifying adults at increased risk of coronary disease: how well do the current cholesterol guidelines work? JAMA. 1995;274(10):801–806.