EVALUATION METHODS AND PRACTICE
Correspondence: Requests for reprints should be sent to Kevin L. Delucchi, PhD, Department of Psychiatry, University of California, San Francisco, Box 0984-TRC, 401 Parnassus Ave, San Francisco, CA 94143-0984 (e-mail: kdelucc{at}itsa.ucsf.edu).
| ABSTRACT |
|---|
I reviewed sample size estimation methods for research designs involving nonindependent data and a dichotomous response variable to examine the importance of proper sample size estimation and the need to align methods of sample size estimation with planned methods of statistical analysis. Examples and references to published literature are provided in this article.
When the method of sample size estimation is not in concert with the method of planned analysis, poor estimates may result. The effects of multiple measures over time also need to be considered.
Proper sample size estimation is often overlooked. Alignment of the sample size estimation method with the planned analysis method, especially in studies involving nonindependent data, will produce appropriate estimates.
| INTRODUCTION |
|---|
This discussion is framed primarily in terms of longitudinal study designs, which are more common and probably more familiar to many researchers than cluster-randomized designs. The broader points, however, apply to all research settings in which sample size is important. The more specific issues and methods apply to any design in which the data are nonindependent, such as studies of members of a household, comparisons of entire communities, and multiple measures of the same person.
This topic can be framed from 2 separate perspectives: testing hypotheses and estimating parameters. When testing a hypothesis, one is concerned with estimating the number of study participants required to ensure a minimal probability (power) of detecting an effect if it exists. With many public health applications, the goal is not to test a hypothesis but rather to estimate the size of an effect, such as an odds ratio, a correlation coefficient, or a proportion. The focus is on the variation of the estimate, which is expressed by the size of the confidence interval after one asks the question, "If I have a sample of a given size, how large will the confidence interval around my estimate be?" Proper sample size estimation is equally important in both perspectives.
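The estimation perspective can be made concrete with a small sketch. It assumes the simple Wald (normal approximation) interval for a single proportion, which is an illustrative choice and not a method named in this article; the function name `ci_half_width` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(p, n, confidence=0.95):
    """Half-width of the Wald confidence interval for a proportion.

    p is the anticipated proportion and n the planned sample size.
    The Wald interval is an assumption made for this illustration.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


# How wide would the interval be for several candidate sample sizes?
for n in (50, 100, 200, 400, 800):
    print(n, round(ci_half_width(0.30, n), 3))
```

Because the half-width shrinks with the square root of the sample size, quadrupling the sample only halves the width of the interval.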
| The Importance of Good Estimation |
|---|
When I reviewed the literature, I found surprisingly little evidence of improvement in applying sample size estimation to design studies despite the publication of numerous articles that have pointed to this problem.1,5 In 1988, Freiman et al. replicated a study they had first published in 1978.6 In this follow-up study (published in 1992; reference 7), they concluded, as they had in the original work, that inadequate attention was being paid to the issue of statistical power in randomized clinical trials. Reviews within specialties have consistently found many studies to be underpowered.8–12
Although most of the literature on this topic is written from the experimental or clinical trials perspective, a few publications have addressed the estimation of sample size for confidence intervals.13–15 Volatier et al.16 discussed sample size estimation principles for a dietary survey, and Brogger et al.,17 Bennett et al.,18 and Panagiotakos et al.19 have provided recent examples of study design for effect size estimation. Additionally, several articles have addressed sample size estimation in the context of estimating gene–environment interactions.20–22
| Estimating Required Sample Size |
|---|
[…] and minimal required power (1 − β, usually 80%); (4) compute the required number of study participants, or sets of study participants, for each estimated effect size and each tested hypothesis; and (5) if necessary, revise study parameters to accommodate a smaller number of study participants while retaining adequate power.28–30 It should be noted that in actual practice, the sample size estimation process is often more interactive and adaptive (a slightly different version of the process outlined here is provided by Castelloe and O'Brien,25 Maxwell,26 and Cohen27).
With step 4, it is important to use an estimation method that closely matches the planned analysis method.31 Consider a study designed to compare 2 groups of participants on a dichotomous outcome with a logistic regression model to statistically control for a set of covariates. For instance, when one compares smoking rates, 1 group may have slightly higher levels of depression symptoms and a greater average age. To estimate the required sample size for a logistic regression, one requires an estimate of the expected outcome proportions of the 2 conditions (the effect size) plus the level of correlation (ρ, the population correlation coefficient) between group membership and the set of covariates that will be used in the logistic regression.32
If, however, one is unable to estimate that correlation, it may be tempting to use a simple comparison of the 2 proportions as a test for the basis of estimating the sample size. For the sake of the example, if the proportions of the 2 groups are expected to be 0.20 and 0.35 (α = .05; β = .20 [80% power]), a sample size of approximately 275 participants is needed (in accordance with PASS 2000 software33). In effect, this example assumes that ρ is equal to 0.0. If, however, ρ is greater than 0.0, the study will be underpowered when the data are collected. To reach the targeted power level, the required sample size must increase in conjunction with the value of ρ²: 306 if ρ² is equal to 0.10 and 344 if ρ² is equal to 0.20. The specific factor is 1/(1 − ρ²); this effect is illustrated in Figure 1.
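The inflation factor 1/(1 − ρ²) can be applied directly to the unadjusted estimate. The following is a minimal sketch of that arithmetic; the function name `inflate_for_covariates` is hypothetical, and rounding up to the next whole participant is an assumption.

```python
from math import ceil


def inflate_for_covariates(n_unadjusted, rho_squared):
    """Scale an unadjusted sample size by 1 / (1 - rho^2).

    rho_squared is the squared correlation between group membership
    and the covariates; the result is rounded up (an assumption).
    """
    if not 0 <= rho_squared < 1:
        raise ValueError("rho_squared must be in [0, 1)")
    return ceil(n_unadjusted / (1 - rho_squared))


for rho_squared in (0.0, 0.10, 0.20):
    print(rho_squared, inflate_for_covariates(275, rho_squared))
```

Starting from the 275 participants of the simple 2-proportion comparison, this gives the 306 and 344 quoted in the text for ρ² of 0.10 and 0.20.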
| Methods of Sample Size Estimation for Longitudinal Designs |
|---|
To illustrate sample size estimation for a dichotomous longitudinal outcome, consider estimating the sample size for a proposed study of smoking rates in 2 groups measured at 3 time points. The analysis plan is to conduct tests for the 3 main effects: a comparison of the rates between the 2 groups, the change over time, and the interaction of group by time. Set the α at .05 and the power at 80% (i.e., type II rate = .20), and assume the expected smoking rates will be 30%, 40%, and 60% for 1 group and 20%, 25%, and 30% for the other.
Use cross-sectional methods to approximate the sample size.
A simple approximation ignores the time factor and either collapses across time or computes separate estimates for each assessment and adjusts the α level for the multiple tests. In this example, the average proportion is 0.25 for 1 sample and 0.43 for the other. A comparison of these 2 proportions requires approximately 108 study participants per group33 to ensure at least 80% power.
Estimates may vary slightly for even this simple comparison. For example, 107 study participants per group are required if an arcsine transformation is applied to the proportions first, 118 study participants per group are required if a correction for continuity is used, and 117 study participants per group are required if both are used. Rochon's SAS macro43 estimates 111 study participants per group, and O'Brien's UnifyPow38 estimates 109 study participants per group when the Pearson χ² is used and 111 when the Wald χ² is used. The PASS manual33 states that use of the continuity correction, but not of the arcsine transformation, yields results close to those obtained with the Fisher exact test, but when it is used with the data analysis, the continuity correction may be overly conservative.44
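To show how the choice of formula moves the estimate, here is a hedged sketch of two textbook approximations for comparing 2 independent proportions: the pooled-variance normal approximation and the arcsine (variance-stabilizing) version. Whether PASS 2000 uses exactly these formulas is an assumption; the function names are hypothetical.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def n_per_group_pooled(p1, p2, alpha=0.05, power=0.80):
    """Per-group n from the pooled-variance normal approximation (two-sided)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((numerator / (p1 - p2)) ** 2)


def n_per_group_arcsine(p1, p2, alpha=0.05, power=0.80):
    """Per-group n after an arcsine square-root transformation."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    # Each transformed proportion has variance of about 1 / (4n).
    delta = asin(sqrt(p1)) - asin(sqrt(p2))
    return ceil((z_a + z_b) ** 2 / (2 * delta ** 2))


print(n_per_group_pooled(0.25, 0.43))
print(n_per_group_arcsine(0.25, 0.43))
```

With the averaged proportions of 0.25 and 0.43, these two approximations differ by about one participant per group, which is the kind of small variation described above.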
However, the analysis plan calls for testing for changes across time, and a better approximation may be to compare the proportions of study participants who smoked at each time point. This comparison requires a multiple-testing control, such as a Bonferroni-type correction that sets the testwise α at .05/3 = .0167 for the type I error across the 3 tests. The per-group size estimates were 392 participants at the first time point, 203 at the second time point, and 57 at the third time point. Because the comparison at the first time point requires the largest sample size, a total sample of 784 study participants is required, a 263% increase over the estimate of 216 study participants after the proportions are averaged across time. These estimates, however, do not include direct tests for change across time or group-by-time interaction, and they fail to take into account the assessment-to-assessment correlation that results from the repeated measurements.
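The per-time-point calculation with a Bonferroni-adjusted α can be sketched as follows. The pooled-variance normal approximation used here is an assumption about the underlying formula, and the names `n_per_group` and `rates` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha, power=0.80):
    """Per-group n for a two-sided comparison of 2 proportions (pooled variance)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((numerator / (p1 - p2)) ** 2)


# Expected smoking rates at the 3 time points for the 2 groups.
rates = [(0.30, 0.20), (0.40, 0.25), (0.60, 0.30)]
testwise_alpha = 0.05 / len(rates)  # Bonferroni-type correction

per_time = [n_per_group(p1, p2, testwise_alpha) for p1, p2 in rates]
total = 2 * max(per_time)  # the most demanding comparison drives the total

print(per_time, total)
```

The smallest difference (30% vs 20% at the first time point) dominates the requirement, which is why the total is so much larger than the estimate based on averaged proportions.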
Incorporate the across-assessment correlation.
To improve the approximation, one can apply methods used to analyze data from related designs: stratified contingency tables, cluster-randomized studies, and survey methods.
The data in the example represent three 2 × 2 matrices of proportions of smokers by group and by time. The hypothesis of a common odds ratio can be tested with the Cochran–Mantel–Haenszel test45 for comparing binary outcomes between 2 groups while controlling for 1 or more stratifying variables, such as site in a multisite clinical trial. Zhang and Boos46 extended the Cochran–Mantel–Haenszel test to a case in which the outcomes were correlated, and they derived 2 related tests. They also provided power calculations on the basis of Wittes and Wallenstein's research47 by incorporating the population correlation coefficient (the intraclass correlation) into their formula number 3. This incorporation can be applied directly to the example data, which yield estimates (depending on the correlation, assumed to range from ρ = 0.2 to 0.8) of 47 to 91 participants per group.
Another version of a method incorporating the nonindependence among study participants in a power analysis comes from the research that used the cluster-randomized design, which was discussed by Donner48 and Donner and Klar49 for the continuous case, while methods of power analysis for clustered binary data are discussed by Lee and Durbin,50 Jung et al.,51 and Pan.52 One can conceptualize a repeated-measures design as a cluster-randomized design by thinking of the set of assessments for each participant as the cluster that will be randomized to a group. In this case, the cluster size is fixed; hence, one should use the average assessment-to-assessment correlation as the estimate of the population correlation coefficient, which is known as the variance inflation factor in this context. In the example, if one examines the same range of intraclass correlations (.20 to .80) and uses the formula provided by Donner and Klar,49 one obtains the same sample size estimates of 47 to 91 per group. (If one uses Rochon's program43 and assumes the same proportions across time, the estimates are 53 and 98.)
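The cluster analogy can be illustrated with the widely used design effect, 1 + (m − 1)ρ, for clusters of fixed size m. This is a sketch under that assumption only; it is not claimed to be the exact Donner and Klar formula used above and is not expected to reproduce the 47 to 91 estimates. The function names are hypothetical.

```python
from math import ceil


def design_effect(m, icc):
    """Variance inflation for clusters of fixed size m with intraclass correlation icc."""
    return 1 + (m - 1) * icc


def participants_per_group(n_independent, m, icc):
    """Participants needed per group when each contributes m correlated assessments.

    n_independent is the number of observations per group that would be
    needed if every observation were independent (an input assumption).
    """
    return ceil(n_independent * design_effect(m, icc) / m)


# 3 assessments per participant, starting from 108 independent observations.
for icc in (0.2, 0.5, 0.8):
    print(icc, design_effect(3, icc), participants_per_group(108, 3, icc))
```

The pattern matches the text: the higher the correlation among a participant's assessments, the less new information each repeated measure adds and the more participants are needed for the between-group comparison.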
Although such methods allow the investigator to take into account correlation across time, I have had to assume that the correlations are equal from time to time (i.e., compound symmetric) and that the test is a simple comparison of 2 proportions. These methods for calculating across-assessment correlation still do not provide estimates for either the test of change over time or the test of group-by-time interaction. As these estimates and Muller et al.31 demonstrate, such approximations can be risky.
Use a fully aligned method.
To completely align the sample size estimation with the analysis plan rather than merely approximating the plan, one can use the methods provided by Rochon,43 Pan,52 and Liu and Liang.53 Pan's formulas are limited to 2 conditions, do not allow for dropout, and do not require software implementation; Liu and Liang's method is limited to categorical covariates. Rochon's research is applicable to the more general case.
Rochon's method43 is based on the Wald χ² test and is implemented in a SAS macro under Proc IML (SAS Institute Inc, Cary, NC); it requires estimates of effect, such as those in the example, and the specification of type I and II error rates. The method also requires an estimate of the correlation of the outcome between the first 2 assessments (the first-order autocorrelation) and an estimate of the shape of the correlation matrix.
With the generalized estimating equation (GEE) approach, the correlation of error terms in a model is assumed to be a nuisance in the sense that error terms must be accounted for if one is to obtain robust estimates of the standard errors in the model, but these error terms are not of direct interest. (Lindsey and Lambert54 have argued that such marginal models are not optimal for this analysis and that a mixed model should be used instead.) While the correct specification of the correlational structure will improve efficiency, the estimates of the mean structure will not be biased if the specification is incorrect.
Table 1 shows 3 correlation matrices, each a different shape, from a 4-assessment design in which the first-order autocorrelation is ρ = .5. Table 1a is compound symmetric or exchangeable in shape; the correlation between any 2 time points is the same (i.e., .5). Table 1c shows a case in which the level of correlation declines as the assessment points become farther apart in time. Specifically, in an autoregressive step 1 (AR[1]) shape, each correlation is defined as the value of the first-order autocorrelation, ρ, raised to a power equal to the difference between the time points (e.g., ρ₁₃ = ρ^|1 − 3| = ρ²). Between them, Table 1b shows an example of an autoregressive shape in which the rate of decline in the correlation is slower than the rate in the full AR(1). The alteration in the rate of decline in correlation level is accomplished by placing an exponent, θ, on the exponent of ρ. Thus, ρ² would be ρ^(2^0.5) if θ were set to 0.50. The effect is to slow the rate of decline in the correlation over time if 0 < θ < 1 and to increase the decline if θ > 1. A θ value of 0 produces the exchangeable matrix of 1a, a value of .5 produces 1b, and a value of 1.0 produces 1c. This method of raising an exponent to a further power to change the rate of decay is implemented in Rochon's approach and is based on the approach of Muñoz et al.55 (It is possible for the correlation between time points to be negative and to increase as the time span increases, but this is not common.)
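The 3 correlation shapes can be generated from a single rule in which the correlation between assessments i and j is ρ raised to the power |i − j|^θ. The following is a minimal sketch assuming equally spaced assessments; the function name `correlation_matrix` and the symbol θ for the dampening exponent are labels chosen for this illustration.

```python
def correlation_matrix(n_times, rho, theta):
    """Damped-exponential correlation matrix for equally spaced assessments.

    theta = 0 gives the exchangeable (compound symmetric) shape,
    theta = 1 gives AR(1), and values in between slow the decay.
    The diagonal is set to 1 explicitly because 0 ** 0 equals 1 in Python.
    """
    return [
        [1.0 if i == j else rho ** (abs(i - j) ** theta) for j in range(n_times)]
        for i in range(n_times)
    ]


for theta in (0.0, 0.5, 1.0):
    print("theta =", theta)
    for row in correlation_matrix(4, 0.5, theta):
        print("  ", [round(value, 3) for value in row])
```

Printing the 3 matrices for a 4-assessment design with ρ = .5 shows how the correlation between the first and last assessments falls from .5 (exchangeable) to .125 (AR[1]) as θ increases.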
Before considering the effects of these parameters on the sample size estimates, compare the estimates from the fully aligned analysis with the approximations on the basis of the effects provided in the example data, which are summarized in Table 2. Use of a method aligned with the planned analysis provides estimates not only for the comparison of groups but also for the 2 effects that involve time. When used to test the group-by-time interaction, Rochon's approach indicates that a per-group number of study participants of 262 (524 total) is required if one assumes that the first-order correlation equals .20 and the shape of the correlation matrix is compound symmetric. This is the largest of the required sample sizes for the 3 hypotheses we wish to test (under those assumptions) and would be the final estimate for this example. If we used an estimate on the basis of averaged treatment group comparisons only (108 per group), our ability to detect the interaction effect would be greatly underpowered unless we had chosen the estimate of 392 on the basis of 3 nested comparisons at 0.01667, in which case the study would have too many participants.
[…] (ρ = 0) existed when in fact such a correlation did exist. A study with too many participants is not desirable, because it is unethical and a waste of limited resources to expose more participants to research than necessary.
The relationship of correlational structure to the number of study participants can be seen in greater detail in Figure 2. Each of the 3 panels displays the required sample size, 1 effect per panel, as a function of the level of the correlation under 3 correlational structures: compound symmetric, step 1 autoregressive, and a structure midway between the other 2 that uses a dampening parameter set to .50, which translates to a slowing of the decay in the correlation (Table 1b). Note that the y-axis scales vary from panel to panel, and as the population correlation coefficient increases, more study participants are needed to test the difference between conditions, whereas fewer are needed to test the effects that involve time.
[…] ρ is equal to .50, 164 study participants per group are required under a compound-symmetric assumption, while 226 study participants are necessary under an autoregressive structure. This approach can be applied to both continuous and categorical data, and it allows for more variations than are discussed in this article, including unequally spaced assessments, differential attrition among samples, and unequal number of subjects per group.43
Use simulations.
There is 1 other option that requires substantially more work but is quite accurate: run a series of computer-based simulations that sample a given number of study participants from a population with known parameters. In this example, that would mean sampling from 1 theoretical population with X percent "abstinent" and sampling from another with Y percent "abstinent" at each time point with a given variance/covariance structure. For each sample, one would test the primary hypotheses, repeat this set of steps (sample a known population and test the hypothesis), and count how often the resultant P value was greater than 0.05 (i.e., how often the test failed to reject the hypothesis). Do this repeatedly with different sample sizes until the sample size is large enough that you can reject the hypothesis under consideration at least 80% of the time, if the test hypothesis is false.
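The simulation loop can be sketched in a deliberately simplified form. The sketch below assumes a single assessment and a pooled 2-proportion z test, so it ignores the variance/covariance structure across time that a full simulation of this design would need; all function names are hypothetical. It counts rejections (P less than α), which is the complement of counting P values greater than .05.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(x1, x2, n):
    """Two-sided P value from a pooled z test with n participants per group."""
    p_pool = (x1 + x2) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:  # all outcomes identical: no evidence of a difference
        return 1.0
    z = (x1 / n - x2 / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_power(p1, p2, n, alpha=0.05, reps=2000, seed=1):
    """Share of simulated studies in which the hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x1 = sum(rng.random() < p1 for _ in range(n))
        x2 = sum(rng.random() < p2 for _ in range(n))
        if two_proportion_p_value(x1, x2, n) < alpha:
            rejections += 1
    return rejections / reps


# Increase n until the simulated power reaches the 80% target.
for n in (60, 80, 100, 120):
    print(n, simulated_power(0.25, 0.43, n))
```

The number of replications controls the precision of the power estimate; with 2000 replications, the standard error of a power estimate near 80% is roughly 1 percentage point.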
| Summary |
|---|
The 2 most important considerations when estimating the required number of participants are to align the sample size estimation with the data analysis and to verify the sensitivity of the resultant estimates. Although modern methods for data analysis seem to be expanding at a rapid rate, methods of sample size estimation are not far behind, and user-friendly software for conducting sample size estimation is increasingly available. The impact of aligning sample size estimation methods with data analytic methods is often overlooked; the closer the methods of estimating sample size are to the methods of analysis, the better the chances are that the actual power achieved will match the level of planned power.
Part of the cost of planning a more complex design and analysis derives from the additional information that must be acquired or approximated to accurately estimate how many participants will be required. The effort expended in gathering those pieces of information will necessarily be in proportion to the size of the study and the maturity of the research field in which the study is set.
Once the methods are aligned, efforts should be focused on estimating the required parameters, while at the same time one must realize that it is uncommon to be able to base sample size estimates on a single, well-established effect size. It is equally important to recognize that the effect size and some of the other parameters, such as attrition rates, are themselves estimates. The more the estimates of these parameters vary, the more the sample size estimates will vary. Whereas the scientifically conservative decision in the face of such variation would be to select the largest estimated sample size, that decision may be impractical and may be far in excess of the true requirement. Even well-established estimates of the parameters should be subjected to a sensitivity analysis to determine the extent to which the estimated sample size varies as the parameters vary.
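A sensitivity analysis of this kind can be as simple as recomputing the estimate over a grid of plausible parameter values. The sketch below assumes the unpooled normal approximation for 2 proportions, chosen only for brevity; the grid values and function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n from the unpooled normal approximation (two-sided)."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_sum ** 2 * variance / (p1 - p2) ** 2)


def sensitivity(p1, p2_values, powers):
    """Sample size for every combination of comparison rate and target power."""
    return {(p2, power): n_per_group(p1, p2, power=power)
            for p2 in p2_values for power in powers}


grid = sensitivity(0.25, p2_values=(0.40, 0.43, 0.46), powers=(0.80, 0.90))
for (p2, power), n in sorted(grid.items()):
    print(f"p2={p2:.2f} power={power:.2f} n per group={n}")
```

Even a shift of 3 percentage points in the assumed rate changes the required sample size noticeably, which is why a single point estimate of the effect size is rarely a safe basis for planning.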
Following these recommendations means more work for the investigators planning a study and for the reviewers of proposals and manuscripts, but it is work that pays off in the long run, both for the investigators themselves and for the scientific community as a whole.
| Acknowledgments |
|---|
Drs David Wasserman, Alan Bostrom, Roger Vaughan, and 3 anonymous reviewers provided many very helpful comments and suggestions.
Human Participant Protection
No protocol approval was needed for this study.
| Footnotes |
|---|
Accepted for publication July 14, 2003.
| References |
|---|
2. Lenth RV. Some practical guidelines for effective sample size determination. Am Statistician. 2001;55:187–193.
3. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Statistician. 2001;55:19–24.
4. Halpern SD, Karlawish JHT, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288:358–367.
5. Sedlmeier P, Gigerenzer G. Do studies of statistical power have an effect on the power of studies? Psychol Bull. 1989;105:309–316.
6. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized controlled trial. Survey of 71 "negative" trials. N Engl J Med. 1978;299:690–694.
7. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial. In: Bailar JC III, Mosteller F, eds. Medical Uses of Statistics. 2nd ed. Boston, Mass: NEJM Books; 1992:357–373.
8. Sloan NL, Jordan E, Winikoff B. Effects of iron supplementation on maternal hematologic status in pregnancy. Am J Public Health. 2002;92:288–293.
9. Thornley B, Adams C. Content and quality of 2000 controlled trials in schizophrenia over 50 years. BMJ. 1998;317:1181–1184.
10. Bezeau S, Graves R. Statistical power and effect sizes of clinical neuropsychology research. J Clin Exp Neuropsychol. 2001;23:399–406.
11. Freedman KB, Bernstein J. Sample size and statistical power in clinical orthopaedic research. J Bone Joint Surg. 1999;81:1454–1460.
12. Dickison K, Bunn F, Wentz R, Edwards P, Roberts I. Size and quality of randomized controlled trials in head injury: review of published studies. BMJ. 2000;320:1308–1311.
13. Beal SL. Sample size determination for confidence intervals on the population mean and on the difference between two populations means. Biometrics. 1989;45:969–977.
14. Daly LE. Confidence intervals and sample sizes: don't throw out all your old sample size tables. BMJ. 1991;302:333–336.
15. Satten GA, Kupper LL. Sample size requirements for interval estimation of the odds ratio. Am J Epidemiol. 1990;131:177–184.
16. Volatier JL, Turrini A, Welten D; EFCOSUM Group. Some statistical aspects of food intake assessment. Eur J Clin Nutr. 2002;56(suppl 2):S46–S52.
17. Brogger J, Bakke P, Eide GE, Gulsvik A. Comparison of telephone and postal survey modes on respiratory symptoms and risk factors. Am J Epidemiol. 2002;155:572–576.
18. Bennett S, Lienhardt C, Bah-Wow O, et al. Investigation of environmental and host-related risk factors for tuberculosis in Africa, II: investigation of host genetic factors. Am J Epidemiol. 2002;155:1074–1079.
19. Panagiotakos DB, Chrysohoou C, Pitsavos C, et al. The association between secondhand smoke and the risk of developing acute coronary syndromes, among non-smokers, under the presence of several cardiovascular risk factors: the CARDIO2000 case–control study. BMC Public Health. 2002;2(1):9.
20. Sturmer T, Brenner H. Flexible matching strategies to increase power and efficiency to detect and estimate gene-environment interactions in case–control studies. Am J Epidemiol. 2002;155:593–602.
21. Yang Q, Khoury MJ, Friedman JM, Flanders DW. On the use of population attributable fraction to determine sample size for case–control studies of gene-environment interaction. Epidemiology. 2003;14:161–167.
22. Umbach DM. On the determination of sample size. Epidemiology. 2003;14:137–138.
23. Streiner DL. Sample size and power in psychiatric research. Can J Psychiatry. 1990;35:616–620.
24. Clark V. Sample size determination. Plast Reconstr Surg. 1991;87:569–573.
25. Castelloe JM, O'Brien RG. Power and Sample Size Determination for Linear Models. Proceedings of the Twenty-Sixth Annual SAS Users Group International Conference, Long Beach, Calif, 22–25 April 2001. Cary, NC: SAS Institute Inc; 2001.
26. Maxwell SE. Sample size and multiple regression analysis. Psychol Methods. 2000;5:434–458.
27. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum; 1988.
28. Kraemer HC. To increase power in randomized clinical trials without increasing sample size. Psychopharmacol Bull. 1991;27:217–224.
29. McAweeney MJ, Klockars AJ. Maximizing power in skewed distributions: analysis and assignment. Psychol Methods. 1998;3:117–122.
30. McClelland GH. Optimal design in psychological research. Psychol Methods. 1997;2:3–19.
31. Muller KE, LaVange LM, Landesman-Ramey S, Ramey CT. Power calculations for general linear multivariate models including repeated measures applications. J Am Stat Assoc. 1992;87:1209–1226.
32. Hsieh FY, Block DA, Larson MD. A simple method for sample size calculation for linear and logistic regression. Stat Med. 1998;17:1623–1634.
33. Hintze J. PASS 2000 [computer software]. Kaysville, Utah: Number Cruncher Statistical Software; 2000.
34. Muller KE, Barton CN. Approximate power for repeated measures ANOVA lacking sphericity. J Am Stat Assoc. 1989;84:549–555.
35. Overall JE, Doyle SR. Estimating sample sizes for repeated measurement designs. Control Clin Trials. 1994;15:100–123.
36. Overall JE, Atlas RS. Power of univariate and multivariate analyses of repeated measurements in controlled clinical trials. J Clin Psychol. 1999;55:465–485.
37. Rochon J. Sample size calculations for two-group repeated-measures experiments. Biometrics. 1991;47:1383–1398.
38. O'Brien RG. A Tour of UnifyPow, A SAS Module/Macro for Sample-Size Analysis. Proceedings of the Twenty-Third Annual SAS Users Group International Conference, Nashville, Tenn, 22–25 March 1998. Cary, NC: SAS Institute Inc; 1998.
39. Elashoff JD. nQuery Advisor [computer software]. Version 4.0. Sagus, Mass: Statistical Solutions; 2000.
40. Ahn C, Overall JE, Tonidandel S. Sample size and power calculations in repeated measurement analysis. Comput Methods Programs Biomed. 2001;64:121–124.
41. EgretSIZ [computer program]. Cytel Software Inc: Cambridge, Mass; 1994.
42. Hedeker D, Gibbons RD, Waternaux C. Sample size estimation for longitudinal designs with attrition: comparing time-related contrasts between two groups. J Educ Behav Stat. 1999;24:70–93.
43. Rochon J. Application of GEE procedures for sample size calculations in repeated measures experiments. Stat Med. 1998;17:1643–1658.
44. Delucchi KL. The use and misuse of chi-square: Lewis and Burke revisited. Psychol Bull. 1983;94:166–176.
45. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst. 1959;22:719–748.
46. Zhang J, Boos DD. Mantel-Haenszel test statistics for correlated binary data. Biometrics. 1997;53:1185–1198.
47. Wittes J, Wallenstein S. The power of the Mantel-Haenszel test. J Am Stat Assoc. 1987;82:1104–1109.
48. Donner A. Sample size requirements for stratified cluster randomized designs. Stat Med. 1992;11:743–750.
49. Donner A, Klar N. Design and Analysis of Cluster Randomization Trials in Health Research. London, England: Arnold; 2000.
50. Lee EW, Durbin N. Estimation and sample size considerations for clustered binary responses. Stat Med. 1994;13:1241–1252.
51. Jung S-H, Kang S-H, Ahn C. Sample size calculations for clustered binary data. Stat Med. 2001;20:1971–1982.
52. Pan W. Sample size and power calculations with correlated binary data. Control Clin Trials. 2001;22:211–227.
53. Liu G, Liang K-Y. Sample size calculations for studies with correlated observations. Biometrics. 1997;53:937–947.
54. Lindsey JK, Lambert P. On the appropriateness of marginal models for repeated measurements in clinical trials. Stat Med. 1998;17:447–469.
55. Muñoz A, Carey V, Shouten JP, Segal M, Rosner B. A parametric family of correlation structures for the analysis of longitudinal data. Biometrics. 1992;48:733–742.