Journal of Clinical Epidemiology 54 (2001) 343–349
Adjusting for multiple testing—when and how?
R. Bender a,*, S. Lange b
a Institute of Epidemiology and Medical Statistics, School of Public Health, University of Bielefeld, P.O. Box 100131, D-33501 Bielefeld, Germany
b Department of Medical Informatics, Biometry and Epidemiology, Ruhr-University of Bochum, D-44780 Bochum, Germany
Received 9 September 1999; received in revised form 31 July 2000; accepted 2 August 2000
Abstract
Multiplicity of data, hypotheses, and analyses is a common problem in biomedical and epidemiological research. Multiple testing theory provides a framework for defining and controlling appropriate error rates in order to protect against wrong conclusions. However, the corresponding multiple test procedures are underutilized in biomedical and epidemiological research. In this article, the existing multiple test procedures are summarized for the most important multiplicity situations. It is emphasized that adjustments for multiple testing are required in confirmatory studies whenever results from multiple tests have to be combined in one final conclusion and decision. In the case of multiple significance tests, a note on the error rate that is controlled is desirable. © 2001 Elsevier Science Inc. All rights reserved.

Keywords: Multiple hypotheses testing; P value; Error rates; Bonferroni method; Adjustment for multiple testing; UKPDS
1. Introduction

Many trials in biomedical research generate a multiplicity of data, hypotheses, and analyses, leading to the performance of multiple statistical tests. At least in the setting of confirmatory clinical trials the need for multiple test adjustments is generally accepted [1,2] and incorporated in corresponding biostatistical guidelines [3]. However, there seems to be a lack of knowledge about statistical procedures for multiple testing. Recently, some authors tried to establish that the statistical approach of adjusting for multiple testing is unnecessary or even inadequate [4–7]. However, the main arguments against multiplicity adjustments are based upon fundamental errors in understanding of simultaneous statistical inference [8,9]. For instance, multiple test adjustments have been equated with the Bonferroni procedure [7], which is the simplest, but frequently also an inefficient, method to adjust for multiple testing.

The purpose of this article is to describe the main concept of multiple testing, several kinds of significance levels, and the various situations in which multiple test problems in biomedical research may occur. A nontechnical overview is given to summarize in which cases and how adjustments for multiple hypotheses tests should be made.

2. Significance tests, multiplicity, and error rates

If one significance test at level α is performed, the probability of the type 1 error (i.e., rejecting the individual null hypothesis although it is in fact true) is the comparisonwise error rate (CER) α, also called individual level or individual error rate. Hence, the probability of not rejecting the true null hypothesis is 1 − α. If k independent tests are performed, the probability of not rejecting all k null hypotheses when in fact all are true is (1 − α)^k. Hence, the probability of rejecting at least one of the k independent null hypotheses when in fact all are true is the experimentwise error rate (EER) under the complete null hypothesis, EER = 1 − (1 − α)^k, also called global level or familywise error rate (considering the family of k tests as one experiment). If the number k of tests increases, the EER also increases. For α = 0.05 and k = 100 tests the EER amounts to 0.994. Hence, in testing 100 independent true null hypotheses one can almost be sure to get at least one false significant result. The expected number of false significant tests in this case is 100 × 0.05 = 5. Note that these calculations only hold if the k tests are independent. If the k tests are correlated, no simple formula for the EER exists, because the EER depends on the correlation structure of the test statistics.
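The error-rate calculations above can be reproduced in a few lines. A minimal sketch (plain Python; the function name is ours, chosen for illustration):

```python
# Experimentwise error rate (EER) under the complete null hypothesis:
# the probability of at least one false significant result among
# k independent tests, each performed at comparisonwise level alpha.

def experimentwise_error_rate(alpha: float, k: int) -> float:
    return 1.0 - (1.0 - alpha) ** k

alpha = 0.05
for k in (1, 5, 10, 100):
    print(f"k = {k:3d}: EER = {experimentwise_error_rate(alpha, k):.3f}, "
          f"expected number of false significant tests = {alpha * k:.2f}")

# For k = 100 this prints EER = 0.994 and an expected count of 5.00,
# matching the figures in the text.
```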
* Corresponding author. Tel.: +49 521 106-3803; fax: +49 521 106-
E-mail address: [email protected] (R. Bender)
0895-4356/01/$ – see front matter © 2001 Elsevier Science Inc. All rights reserved. PII: S0895-4356(00)00314-0

Frequently, the global null hypothesis, that all individual null hypotheses are true simultaneously, is of limited interest to the researcher. Therefore, procedures for simultaneous statistical inference have been developed that control the maximum experimentwise error rate (MEER) under any complete or partial null hypothesis, also called multiple level or familywise error rate in a strong sense.
The MEER is the probability of rejecting falsely at least one true individual null hypothesis, irrespective of which and how many of the other individual null hypotheses are true. A multiple test procedure that controls the MEER also controls the EER, but not vice versa [10]. Thus, the control of the MEER is the best protection against wrong conclusions and leads to the strongest statistical inference.

The application of multiple test procedures enables one to conclude which tests are significant and which are not, but with control of the appropriate error rate. For example, when three hypotheses A, B, C are tested and the unadjusted P values are PA = 0.01, PB = 0.02, and PC = 0.05, the Bonferroni correction would lead to the adjusted P values 0.03, 0.06, and 0.15, respectively. Hence, one can conclude that test A is significant and tests B and C are not significant, with control of the MEER at 0.05.
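The Bonferroni arithmetic behind such a conclusion is a one-liner per test: multiply each unadjusted P value by the number of tests and truncate at 1. A small sketch with three illustrative unadjusted P values:

```python
# Bonferroni adjustment: multiply each unadjusted P value by the number
# of tests k and truncate at 1; a test is significant at MEER 0.05 if
# its adjusted P value is at most 0.05. The input values are illustrative.

p_unadjusted = {"A": 0.01, "B": 0.02, "C": 0.05}
k = len(p_unadjusted)

# round() only suppresses binary floating-point noise such as 0.030000000000000002
p_adjusted = {name: min(1.0, round(k * p, 10)) for name, p in p_unadjusted.items()}
significant = {name: p <= 0.05 for name, p in p_adjusted.items()}

print(p_adjusted)   # {'A': 0.03, 'B': 0.06, 'C': 0.15}
print(significant)  # {'A': True, 'B': False, 'C': False}
```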
3. When are adjustments for multiple tests necessary?

A simple answer to this question is: If the investigator only wants to control the CER, an adjustment for multiple tests is unnecessary; if the investigator wants to control the EER or MEER, an adjustment for multiple tests is strictly required. Unfortunately, there is no simple and unique answer to the question when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions [11,12]. In addition to the problem of deciding which error rate should be under control, it has to be defined first which tests of a study belong to one experiment. For example, consider a study in which three different new treatments (T1, T2, T3) are compared with a standard treatment or control (C). All six possible pairwise comparisons (T1 vs. C, T1 vs. T2, T1 vs. T3, T2 vs. C, T2 vs. T3, T3 vs. C) can be regarded as one experiment or family of comparisons. However, by defining the comparisons of the new treatments with the control (T1 vs. C, T2 vs. C, T3 vs. C) as the main goal of the trial and the comparisons of the new treatments among each other (T1 vs. T2, T1 vs. T3, T2 vs. T3) as secondary analyses, this study consists of two experiments of connected comparisons. In this case it may be appropriate to perform separate multiplicity adjustments in each experiment. In general, we think it is logical that the MEER should be under control when the results of a well-defined family of multiple tests are summarized in one conclusion for the whole experiment. For example, if each new treatment is significantly different from the standard treatment, the conclusion that all three treatments differ from the standard treatment should be based upon an adequate control of the MEER. Otherwise the type 1 error of the final conclusion is not under control, which means that the aim of significance testing is not achieved.

Such a rigorous proceeding is strictly required in confirmatory studies. A study is considered as confirmatory if the goal of the trial is the definitive proof of a predefined key hypothesis for final decision making. For such studies a good predefined statistical analysis plan is required. A clear prespecification of the multiple hypotheses and their priorities is quite important. If it is possible to specify one clear primary hypothesis, there is no multiplicity problem. If, however, the key hypothesis is proved by means of multiple significance tests, the use of multiple test procedures is mandatory.

On the other hand, in exploratory studies, in which data are collected with an objective but not with a prespecified key hypothesis, multiple test adjustments are not strictly required. Other investigators hold the opposite position that multiplicity corrections should be performed in exploratory studies [7]. We agree that the multiplicity problem in exploratory studies is huge. However, the use of multiple test procedures does not solve the problem of making valid statistical inference for hypotheses that were generated by the data. Exploratory studies frequently require a flexible approach for design and analysis. The choice and the number of tested hypotheses may be data dependent, which means that multiple significance tests can be used only for descriptive purposes but not for decision making, regardless of whether multiplicity corrections are performed or not. As the number of tests in such studies is frequently large and usually a clear structure in the multiple tests is missing, an appropriate multiple test adjustment is difficult or even impossible. Hence, we prefer that data of exploratory studies be analyzed without multiplicity adjustment. "Significant" results based upon exploratory analyses should clearly be labeled as exploratory results. To confirm these results, the corresponding hypotheses have to be tested in further confirmatory studies.

Between the two extreme cases of strictly confirmatory and strictly exploratory studies there is a wide range of investigations representing a mixture of both types. The decision whether an analysis should be made with or without multiplicity adjustments depends on "the questions posed by the investigator and his purpose in undertaking the study" [13]. Whatever the decision is, it should clearly be stated why and how the chosen analyses are performed and how the results should be interpreted.

In the following, we consider the case of a confirmatory study with a clear prespecified key question consisting of several hypotheses analyzed by multiple significance tests. These tests represent one experiment consisting of a family of connected significance tests. For a valid final conclusion an appropriate multiplicity adjustment should be made. We present a short nontechnical overview of statistical procedures for multiple test adjustment. More technical and comprehensive overviews can be found elsewhere [10,14–16].

4. General procedures for multiple test adjustments

4.1. General procedures based upon P values

The simplest multiple test procedure is the well-known Bonferroni method [17]. Of k significance tests, those accepted as statistically significant have P values smaller than
α/k, where α is the MEER. Adjusted P values are calculated by multiplying the individual unadjusted P values Pi, i = 1, . . . , k, with the number of tests k; adjusted values above 1 are set to 1. In the same manner Bonferroni adjusted confidence intervals can be constructed by dividing the multiple confidence level by the number of confidence intervals. The Bonferroni method is simple and applicable in essentially any multiple test situation. However, the price for this simplicity and universality is low power. In fact, the Bonferroni method is frequently not appropriate, especially if the number of tests is large. Bonferroni corrections should only be used in cases where the number of tests is quite small (say, less than 5) and the correlations among the test statistics are low.

Fortunately, there are a number of improvements of the Bonferroni method [2,16,18], such as the well-known Holm procedure [19,20]. Some of these modified Bonferroni methods represent stepwise procedures based upon the closed testing procedure introduced by Marcus et al. [21], which is a general principle leading to multiple tests controlling the multiple level [10]. A general algorithm for obtaining adjusted P values for any closed test procedure is outlined by Wright [16]. While some of these methods are quite complex, the Holm method is just as simple and generally applicable as the Bonferroni method, but much more powerful.

Despite being more powerful than the simple Bonferroni method, the modified Bonferroni methods still tend to be conservative. They make use of the mathematical properties of the hypotheses structure, but they do not take the correlation structure of the test statistics into account. One approach that uses the information of dependencies and distributional characteristics of the test statistics to obtain adjusted P values is given by resampling procedures [22]. For highly correlated tests, this approach is considerably more powerful than the procedures discussed above. However, the price for the gain of power is that the resampling-based procedures are computer intensive. PROC MULTTEST of SAS offers resampling-based adjusted P values for some frequently used significance tests [22,23].
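The Holm procedure itself is short enough to sketch. The step-down form below produces monotone adjusted P values in the manner described by Wright [16]; the implementation and the input values are ours, for illustration only:

```python
def holm_adjusted(p_values):
    """Holm step-down adjusted P values; controls the MEER under any
    dependence structure of the tests."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running_max = 0.0
    for step, i in enumerate(order):
        # The j-th smallest P value is multiplied by (k - j + 1), not by k;
        # round() only suppresses binary floating-point noise.
        candidate = min(1.0, round((k - step) * p_values[i], 10))
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

print(holm_adjusted([0.01, 0.04, 0.03, 0.005]))
# [0.03, 0.06, 0.06, 0.02] -- the plain Bonferroni adjustment would give
# the uniformly larger values [0.04, 0.16, 0.12, 0.02].
```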
5. Special procedures for multiple test adjustments

One main advantage of the general multiple test procedures based upon P values is that they are universally applicable to different types of data (continuous, categorical, censored) and different test statistics (e.g., t, χ2, Fisher, logrank). Naturally, these procedures are unspecific, and special adjustment procedures have been developed for certain questions in specific multiplicity situations.

5.1. Comparison of several groups

One area in which multiplicity adjustment has a long history is the comparison of the means of several groups in analysis of variance (ANOVA) [24]. For this application a number of procedures exist. The most well-known methods, which are frequently implemented in the ANOVA procedures of statistical software packages, are the following. The simultaneous test procedures of Scheffé and Tukey can also be used to calculate simultaneous confidence intervals for all pairwise differences between means. The method of Dunnett can be used to compare several groups with a single control. In contrast to these single-step procedures, multiple stage tests are in general more powerful but give only homogeneous sets of treatment means and no simultaneous confidence intervals. The most well-known multiple stage tests are the procedures of Duncan, Student–Newman–Keuls (SNK), and Ryan–Einot–Gabriel–Welsch (REGW). These procedures, with the exception of Duncan's, preserve the MEER, at least in balanced designs. Which of these tests is appropriate depends on the investigator's needs and the study design. In short, if the MEER should be under control, with no confidence intervals needed and a balanced design, then the REGW procedure can be recommended. If confidence intervals are desirable or the design is unbalanced, then the Tukey procedure is appropriate. In the case of ordered groups (e.g., dose finding studies), procedures for specific ordered alternatives can be used with a substantial gain in power [10]. More detailed overviews of multiple test procedures for the comparison of several groups are given elsewhere [16,25–27]. Multiple comparison procedures for some nonparametric tests are also available [28].

In the frequent case of three groups the principle of closed testing leads to the following simple procedure that keeps the multiple level [10]. At first, test the global null hypothesis that all three groups are equal by a suitable level α test (e.g., an F test or the Kruskal–Wallis test). If the global null hypothesis is rejected, proceed with level α tests for the three pairwise comparisons (e.g., t tests or Wilcoxon rank sum tests).
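The three-group shortcut can be written down directly. The sketch below operates on P values assumed to come from the global and pairwise tests named in the text; the function name and the numbers are illustrative:

```python
def closed_test_three_groups(p_global, p_pairwise, alpha=0.05):
    """Closed testing for three groups: the pairwise comparisons are
    tested at the unadjusted level alpha, but only if the global null
    hypothesis is rejected first. This keeps the multiple level alpha."""
    if p_global > alpha:
        return {pair: False for pair in p_pairwise}  # stop after step 1
    return {pair: p <= alpha for pair, p in p_pairwise.items()}

# Illustrative P values, e.g. from a Kruskal-Wallis test (global) and
# Wilcoxon rank sum tests (pairwise):
p_global = 0.012
p_pairwise = {("1", "2"): 0.020, ("1", "3"): 0.300, ("2", "3"): 0.048}
print(closed_test_three_groups(p_global, p_pairwise))
# {('1', '2'): True, ('1', '3'): False, ('2', '3'): True}
```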
5.2. Multiple endpoints

The case of multiple endpoints is one of the most common multiplicity problems in clinical trials [29,30]. There are several possible strategies to deal with multiple endpoints. The simplest approach, which should always be considered first, is to specify a single primary endpoint. This approach makes adjustments for multiple endpoints unnecessary. However, all other endpoints are then subsidiary, and results concerning secondary endpoints can only have an exploratory rather than a confirmatory interpretation. The second possibility is to combine the outcomes in one aggregated endpoint (e.g., a summary score for quality of life data or the time to the first event in the case of survival data). This approach is adequate only if one is not interested in the results of the individual endpoints. Thirdly, for significance testing, multivariate methods [e.g., multivariate analysis of variance (MANOVA) or Hotelling's T2 test] and global test statistics developed by O'Brien [31] and extended by Pocock et al. [32] can be used. Exact tests suitable for a large number of endpoints and small sample size have been developed by Läuter [33]. All these methods provide an overall assessment of effects in terms of statistical significance but offer no estimate of the magnitude of the effects. Again, information about the effects concerning the individual endpoints is lacking. In addition, Hotelling's T2 test lacks power since it tests for unstructured alternative hypotheses, when in fact one is really interested in evidence from several outcomes pointing in the same direction [34]. Hence, in the case of several equally important endpoints for which individual results are of interest, multiple test adjustments are required, either alone or in combination with the previously mentioned approaches. Possible methods to adjust for multiple testing in the case of multiple endpoints are given by the general adjustment methods based upon P values [35] and the resampling methods [22] introduced above. It is also possible to allocate different type 1 error rates to several not equally important endpoints [36,37].
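The last option, allocating unequal type 1 error rates to endpoints of unequal importance [36,37], can be sketched as a Bonferroni-type split of the overall level; the endpoint names and numbers below are invented for illustration:

```python
# Split the overall MEER of 0.05 unequally across two endpoints: because
# the endpoint-specific levels sum to 0.05, the familywise type 1 error
# stays at most 0.05 (Bonferroni argument), while the more important
# endpoint is tested at a less stringent level.

alpha_total = 0.05
alpha_allocation = {"mortality": 0.04, "quality_of_life": 0.01}
assert abs(sum(alpha_allocation.values()) - alpha_total) < 1e-12

p_values = {"mortality": 0.03, "quality_of_life": 0.02}
decisions = {ep: p_values[ep] <= level for ep, level in alpha_allocation.items()}
print(decisions)  # {'mortality': True, 'quality_of_life': False}
```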
5.3. Repeated measurements

Methods to adjust for multiple testing in studies collecting repeated measurements are rare. Despite much recent work on mixed models [38,39] with random subject effects to allow for the correlation of the data, there are only a few multiple comparison procedures for special situations. It is difficult to develop a general adjustment method for multiple comparisons in the case of repeated measurements, since these comparisons occur for between-subject factors (e.g., groups), within-subject factors (e.g., time), or both. The specific correlation structure has to be taken into account, involving many difficulties. If only comparisons for between-subject factors are of interest, one possibility is to consider the repeated measurements as multiple endpoints and use one of the methods mentioned in the previous section. However, if the repeated measurements are ordered, this information is lost by using such an approach.

If repeated measurements are collected serially over time, the use of summary measures (e.g., area under the curve) to describe the response curves should be considered [40,41]. The analysis takes the form of a two-stage method where, in the first step, suitable summary measures for each response curve are calculated, and in the second step, these summary measures are analyzed by using the approaches discussed above. The choice of an adequate approach in the second stage depends on the number of groups to be compared and the number of summary measures to be analyzed. Only in the case of two groups and one summary measure as single primary endpoint does no multiplicity problem arise. To compare response curves between groups, Zerbe and Murphy have developed an extension of the Scheffé method and a stepwise procedure to adjust for multiple testing [42]. There are also multiple comparison procedures for some nonparametric tests suitable for repeated measurements.

5.4. Subgroup analyses

The extent to which subgroup analyses should be undertaken and reported is highly controversial [43,44]. We will not discuss the full range of problems and issues related to subgroup analyses but focus on the multiplicity problem. If one is interested in demonstrating a difference in the magnitude of the effect size between subgroups, a statistical test of interaction is appropriate, although such tests generally have low power [45]. If it is the aim to show an effect in all (or in some) of a priori defined subgroups on the basis of existing hypotheses, an adjustment for multiple testing should be performed by using one of the general procedures based upon P values. If there are few nonoverlapping subgroups, a test within one subgroup is independent of a test within another subgroup. In this case, the use of the simple Bonferroni method is possible. Frequently, however, subgroup analyses are performed concerning subgroups that are defined a posteriori after data examination. In this case, the results have an exploratory character regardless of whether multiplicity adjustments are performed or not. For the interpretation of such analyses one should keep in mind that the overall trial result is usually a better guide to the effect in subgroups than the estimated effect in the subgroups [46].

5.5. Interim analyses

Interim analyses of accumulating data are used in long-term clinical trials with the objective to terminate the trial when one treatment is significantly superior to the other(s). Since repeated analyses of the data increase the type 1 error, multiplicity adjustments are required for the development of adequate stopping rules. A simple rule that may be sufficient in many trials is: if no more than 10 interim analyses are planned and there is one primary endpoint, then P < .01 can be used as the criterion for stopping the trial, because the global level will not exceed .05 [47]. The disadvantage of this approach is that the final analysis also has to be undertaken at a significance level considerably smaller than .05 (namely .01). Another simple possibility is to be extremely cautious in stopping the trial early by using P < .001 for the interim analyses [48]. This approach covers any number of interim analyses and is so conservative that the final analysis can be conducted at the usual .05 level. A compromise between these approaches is to use the procedure developed by O'Brien and Fleming [49] with varying nominal significance levels for stopping the trial. Early interim analyses have more stringent significance levels, while the final analysis is undertaken as close to the .05 level as possible. Overviews about recent developments in the field of interim monitoring of clinical trials are given elsewhere [50–56].
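The two simple stopping rules above amount to comparing each interim P value with a fixed nominal threshold. A sketch (the interim P values are invented; proper O'Brien–Fleming boundaries [49] require dedicated group-sequential software):

```python
def first_stop(interim_p, threshold):
    """Index (1-based) of the first interim analysis with P below the
    nominal threshold, or None if the trial is never stopped early."""
    for i, p in enumerate(interim_p, start=1):
        if p < threshold:
            return i
    return None

interim_p = [0.200, 0.015, 0.008, 0.004]

# Rule 1 [47]: up to 10 planned interim analyses, stop at P < .01
# (the final analysis must then also use the .01 level).
print(first_stop(interim_p, 0.01))    # 3
# Rule 2 [48]: stop only at P < .001; so conservative that the final
# analysis keeps the usual .05 level. Here the trial runs to the end.
print(first_stop(interim_p, 0.001))   # None
```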
6. Discussion

The problem of multiple hypotheses testing in biomedical research is quite complex and involves several difficulties. Firstly, it is required to define which significance tests belong to one experiment; that means, which tests should be used to make one final conclusion. Secondly, the particular error rate to be under control must be chosen. Thirdly, an appropriate method for multiple test adjustment has to be found that is applicable and feasible in the considered situation. Many multiple test procedures for standard situations have been developed, but in the practice of clinical and epidemiological trials there are a lot of situations in which an adequate control of the type 1 error is quite complex, especially if there are several levels of multiplicity (e.g., more than two groups and more than one endpoint and repeated measurements of each endpoint). Unfortunately, the level of complexity can be so high that it is impossible to make an adequate adjustment for multiple testing. For example, the UK Prospective Diabetes Study (UKPDS) [57] contains an enormous complexity regarding multiplicity. Considering only the four main UKPDS publications of the year 1998 (UKPDS 33, 34, 38, 39) [58–61] (i.e., neglecting the interim analyses and multiple significance tests published in earlier and future articles), there are 2, 4, or 5 main treatment groups (dependent on the question), additional comparisons between specific medications (e.g., captopril vs. atenolol), approximately 50 endpoints (7 aggregated, 21 single, 8 surrogate, and 12 compliance endpoints), and subgroup analyses (e.g., regarding overweight patients).

Of course, for such a specific and complex design no adequate and powerful multiple test procedure exists. Although Bonferroni adjustments would be principally possible, they would allow only comparisons with P values below .00017 (.05/298) to be significant, as we counted 298 different P values in the four articles. Naturally, with nearly 300 tests the Bonferroni procedure has not enough power to detect any true effect and cannot be recommended here. The UKPDS group tried to account for multiplicity by calculating 99% confidence intervals for single endpoints [60]. This approach slightly reduces the risk of type 1 errors, but for a confirmatory study this procedure is not an adequate solution, since the main goal of a significance test, namely the control of the type 1 error at a given level, is not achieved. Moreover, although 99% confidence intervals were calculated, unadjusted P values were presented, with the effect that they are interpreted at the usual 5% level of significance [62]. Hence, in the UKPDS no firm conclusions can be drawn from the significance tests, as the actual global significance level exceeds 5% by a large and unknown amount.

To avoid such difficulties a careful planning of the study design is required, taking multiplicity into account. The easiest and best interpretable approach is to avoid multiplicity as far as possible. A good predefined statistical analysis plan and a prespecification of the hypotheses and their priorities will in general reduce the multiplicity problem. If multiplicity cannot be avoided at all (e.g., because there are several equally important endpoints), the investigators should clearly define which hypotheses belong to one experiment and then adjust for multiple testing to achieve a valid conclusion with control of the type 1 error. In the UKPDS one could have defined the intensive versus the conventional treatment as the primary comparison, with the consequence that confirmatory statements concerning the different intensive treatments are impossible. Furthermore, one could have defined the aggregated endpoint "any diabetes-related endpoint" as the primary outcome, with the consequence that all other aggregated endpoints are subsidiary. By means of a closed testing procedure it would have been possible to perform tests concerning the single endpoints forming the primary aggregated outcome (e.g., blindness, death from hypoglycemia, myocardial infarction, etc.) by preserving the MEER. The number of confirmatory analyses would be drastically reduced, but the results would be much more convincing.

A further problem we did not mention in detail concerns the type of research in which estimates of association can be obtained for a broad range of possible predictor variables. In such studies, authors may focus on the most significant of several analyses, a selection process that may bias the magnitude of the observed associations (both point estimates and confidence intervals). One way to deal with this type of multiplicity problem is to demand reproduction of the observed associations and their magnitude in further independent trials. However, this 'solution' does not address the adjustment of significance levels. A data-driven analysis and presentation, also called 'data dredging' or 'data fishing,' can only produce exploratory results. It can be used to generate hypotheses but not to test and confirm them, regardless of whether multiplicity corrections are performed or not. Hence, the use of multiple test procedures cannot protect against the bias caused by data fishing.

In principle, there is an alternative approach to significance testing for the analysis of data. Bayes methods differ from all the methods discussed above in minimizing the Bayes risk under additive loss rather than controlling type 1 error rates. From a Bayesian perspective, control of the type 1 error is not necessary to make valid inferences. Thus, the use of Bayes methods avoids some of the conceptual and practical difficulties involved with the control of the type 1 error, especially in the case of multiplicity. Hence, Bayes methods are useful for some of the multiplicity situations discussed above. Examples are the monitoring of clinical trials [63] and the use of empirical Bayes methods for the analysis of a large number of related endpoints [64,65]. However, in this article we concentrate on classical statistical methods based upon significance tests. We started from the assumption that an investigator has decided to use significance tests for data analysis. For this case we tried to summarize the available corresponding procedures to adjust for multiple testing. Bayes methods, which do not provide adjustments of P values as they do not give P values at all, are therefore not considered further.

In summary, methods to adjust for multiple testing are valuable tools to ensure valid statistical inference. They should be used in all confirmatory studies in which, on the basis of a clearly defined family of tests, one final conclusion
and decision will be drawn. In such cases the maximum experimentwise error rate under any complete or partial null hypothesis should be under control. While the simple Bonferroni method is frequently not appropriate due to low power, there are a number of more powerful approaches applicable in various multiplicity situations. These methods deserve wider knowledge and application in biomedical and epidemiological research.

Acknowledgment

We thank Dr. Gernot Wassmer (Cologne, Germany) for his careful reading of the manuscript and his valuable comments.

References

[1] Koch GG, Gansky SA. Statistical considerations for multiplicity in confirmatory protocols. Drug Inf J 1996;30:523–34.
[2] Sankoh AJ, Huque MF, Dubin N. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Stat Med
[3] The CPMP Working Party on Efficacy of Medical Products. Biostatistical methodology in clinical trials in applications for marketing authorizations for medical products. Stat Med 1995;14:1659–82.
[4] Rothman KJ. No adjustments are needed for multiple comparisons.
[5] Savitz DA, Olshan AF. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol 1995;142:
[6] Savitz DA, Olshan AF. Describing data requires no adjustment for multiple comparisons: a reply from Savitz and Olshan. Am J Epidemiol
[7] Perneger TV. What's wrong with Bonferroni adjustments. BMJ 1998;
[8] Aickin M. Other method for adjustment of multiple testing exists [Letter].
[9] Bender R, Lange S. Multiple test procedures other than Bonferroni's deserve wider use [Letter]. BMJ 1999;318:600–1.
[10] Bauer P. Multiple testing in clinical trials. Stat Med 1991;10:871–90.
[11] Thompson JR. Invited commentary: Re: "Multiple comparisons and related issues in the interpretation of epidemiologic data." Am J Epidemiol
[12] Goodman SN. Multiple comparisons, explained. Am J Epidemiol
[13] O'Brien PC. The appropriateness of analysis of variance and multiple comparison procedures. Biometrics 1983;39:787–94.
[14] Miller RG. Simultaneous statistical inference. New York: McGraw-Hill,
[15] Hochberg Y, Tamhane AC. Multiple comparison procedures. New York: Wiley, 1987.
[21] Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976;
[22] Westfall PH, Young SS. Resampling-based multiple testing. New York: Wiley, 1993.
[23] Westfall PH, Young SS. Reader reaction: on adjusting P-values for multiplicity. Biometrics 1993;49:941–5.
[24] Altman DG, Bland JM. Comparing several groups using analysis of variance.
[25] Godfrey K. Comparing means of several groups. N Engl J Med 1985;
[26] Jaccard J, Becker MA, Wood G. Pairwise multiple comparison procedures: a review. Psychol Bull 1984;96:589–96.
[27] Seaman MA, Levin JR, Serlin RC. New developments in pairwise multiple comparisons: some powerful and practicable procedures.
[28] Conover WJ. Practical nonparametric statistics. New York: Wiley,
[29] Pocock SJ. Clinical trials with multiple outcomes: a statistical perspective on their design, analysis, and interpretation. Contr Clin Trials
[30] Zhang J, Quan H, Ng J. Some statistical methods for multiple endpoints in clinical trials. Contr Clin Trials 1997;18:204–21.
[31] O'Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics 1984;40:1079–87.
[32] Pocock SJ, Geller NL, Tsiatis AA. The analysis of multiple endpoints in clinical trials. Biometrics 1987;43:487–98.
[33] Läuter J. Exact t and F tests for analyzing studies with multiple endpoints. Biometrics 1996;52:964–70.
[34] Follmann D. Multivariate tests for multiple endpoints in clinical trials.
[35] Lehmacher W, Wassmer G, Reitmeir P. Procedures for two-sample comparisons with multiple endpoints controlling the experimentwise error rate. Biometrics 1991;47:511–21.
[36] Moyé LA. P-value interpretation and alpha allocation in clinical trials.
[37] Moyé LA. Alpha calculus in clinical trials: considerations and commentary for the new millennium. Stat Med 2000;19:767–79.
[38] Cnaan A, Laird NM, Slasor P. Tutorial in biostatistics: using the general linear mixed model to analyse unbalanced repeated measures and longitudinal data. Stat Med 1997;16:2349–80.
[39] Burton P, Gurrin L, Sly P. Tutorial in biostatistics: extending the simple linear regression model to account for correlated responses: an introduction to generalized estimating equations and multi-level mixed modelling. Stat Med 1998;17:1261–91.
[40] Matthews JNS, Altman DG, Campbell MJ, Royston P. Analysis of serial measurements in medical research. BMJ 1990;300:230–5.
[41] Senn S, Stevens L, Chaturvedi N. Tutorial in biostatistics: repeated measures in clinical trials: simple strategies for analysis using summary measures. Stat Med 2000;19:861–77.
[42] Zerbe GO, Murphy JR. On multiple comparisons in the randomization analysis of growth and response curves. Biometrics 1986;42:
[43] Oxman AD, Guyatt GH. A consumer's guide to subgroup analysis.
[44] Feinstein AR. The problem of cogent subgroups: a clinicostatistical
[16] Wright SP. Adjusted p-values for simultaneous inference. Biometrics
tragedy. J Clin Epidemiol 1998;51:297–9.
[45] Buyse ME. Analysis of clinical trial outcomes: some comments on
[17] Bland JM, Altman DG. Multiple significance tests: the Bonferroni
subgroup analyses. Contr Clin Trials. 1989;10:187S–94S.
[46] Yusuf S, Wittes J, Probstfield J, Tyroler A. Analysis and interpreta-
[18] Levin B. Annotation: on the Holm, Simes, and Hochberg multiple
tion of treatment effects in subgroups of patients in randomised clini-
test procedures. Am J Public Health 1996;86:628–9.
[19] Holm S. A simple sequentially rejective multiple test procedure.
[47] Pocock SJ. Clinical trials: a practical approach. Chichester: Wiley,
[20] Aickin M, Gensler H. Adjusting for multiple testing when reporting
[48] Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV,
research results: the Bonferroni vs Holm methods. Am J Public
Mantel N, McPherson K, Peto J, Smith PG. Design and analysis of
randomised clinical trials requiring prolonged observation of each pa-
[21] Marcus R, Peritz E, Gabriel KR. On closed testing procedures with
tient. I. Introduction and design. Br J Cancer 1976;34:585–612. R. Bender, S. Lange / Journal of Clinical Epidemiology 54 (2001) 343–349
[49] O’Brien PC, Fleming TR. A multiple testing procedure for clinical
blood-glucose control with sulphonylureas or insulin compared with
trials. Biometrics 1979;35:549–56.
conventional treatment and risk of complications in patients with type
[50] DeMets DL, Lan KKG. Overview of sequential methods and their ap-
2 diabetes (UKPDS 33). Lancet 1998;352:837–53.
plications in clinical trials. Commun Stat A 1984;13:2315–38.
[59] The UK Prospective Diabetes Study (UKPDS) Group. Effect of in-
[51] Geller NL, Pocock SJ. Interim analyses in randomized clinical trials:
tensive blood-glucose control with metformin on complications in
ramifications and guidelines for practioners. Biometrics 1987;43:
overweight patients with type 2 diabetes (UKPDS 34). Lancet 1998;
[52] Jennison C, Turnbull BW. Statistical approaches to interim monitor-
[60] The UK Prospective Diabetes Study (UKPDS) Group. Tight blood
ing of medical trials: a review and commentary. Stat Sci 1990;5:299–
pressure control and risk of macrovascular and microvascular compli-
cations in type 2 diabetes: UKPDS 38. BMJ 1998;317:703–13.
[53] Pocock SJ. Statistical and ethical issues in monitoring clinical trials.
[61] The UK Prospective Diabetes Study (UKPDS) Group. Efficacy of
atenolol and captopril in reducing risk of macrovascular and microvas-
[54] Lee JW. Group sequential testing in clinical trials with multivariate
cular complications in type 2 diabetes: UKPDS 39. BMJ 1998;317:
observations: a review. Stat Med 1994;13:101–11.
[55] Facey KM, Lewis JA. The management of interim analyses in drug
[62] de Fine Olivarius N, Andreasen AH. The UK Prospective Diabetes
development. Stat Med 1998;17:1801–9.
Study [Letter]. Lancet 1998;352:1933.
[56] Skovlund E. Repeated significance tests on accumulating survival
[63] Fayers PM, Ashby D, Parmar MKB. Tutorial in biostatistics: Baye-
data. J Clin Epidemiol 1999;52:1083–8.
sian data monitoring in clinical trials. Stat Med 1997;16:1413–30.
[57] The UK Prospective Diabetes Study (UKPDS) Group. U.K. Prospec-
[64] Clayton D, Kaldor J. Empirical Bayes estimates of age-standardized
tive Diabetes Study (UKPDS): VIII. Study design, progress and per-
relative risks for use in disease mapping. Biometrics 1987;43:671–81.
formance. Diabetologia 1991;34:877–90.
[65] Greenland S, Robins JM. Empirical-Bayes adjustments for multiple
[58] The UK Prospective Diabetes Study (UKPDS) Group. Intensive
comparisons are sometimes useful. Epidemiology 1991;2:244–51.