Journal of Clinical Epidemiology 54 (2001) 343–349
Adjusting for multiple testing—when and how?
R. Bender a,*, S. Lange b
a Institute of Epidemiology and Medical Statistics, School of Public Health, University of Bielefeld, P.O. Box 100131, D-33501 Bielefeld, Germany
b Department of Medical Informatics, Biometry and Epidemiology, Ruhr-University of Bochum, D-44780 Bochum, Germany
Received 9 September 1999; received in revised form 31 July 2000; accepted 2 August 2000
Abstract
Multiplicity of data, hypotheses, and analyses is a common problem in biomedical and epidemiological research. Multiple testing theory provides a framework for defining and controlling appropriate error rates in order to protect against wrong conclusions. However, the corresponding multiple test procedures are underutilized in biomedical and epidemiological research. In this article, the existing multiple test procedures are summarized for the most important multiplicity situations. It is emphasized that adjustments for multiple testing are required in confirmatory studies whenever results from multiple tests have to be combined in one final conclusion and decision. In the case of multiple significance tests, a note on the error rate that is controlled is desirable. © 2001 Elsevier Science Inc. All rights reserved.

Keywords: Multiple hypotheses testing; P value; Error rates; Bonferroni method; Adjustment for multiple testing; UKPDS
1. Introduction

Many trials in biomedical research generate a multiplicity of data, hypotheses, and analyses, leading to the performance of multiple statistical tests. At least in the setting of confirmatory clinical trials the need for multiple test adjustments is generally accepted [1,2] and incorporated in corresponding biostatistical guidelines [3]. However, there seems to be a lack of knowledge about statistical procedures for multiple testing. Recently, some authors tried to establish that the statistical approach of adjusting for multiple testing is unnecessary or even inadequate [4–7]. However, the main arguments against multiplicity adjustments are based upon fundamental errors in understanding of simultaneous statistical inference [8,9]. For instance, multiple test adjustments have been equated with the Bonferroni procedure [7], which is the simplest, but frequently also an inefficient, method to adjust for multiple testing.

The purpose of this article is to describe the main concept of multiple testing, several kinds of significance levels, and the various situations in which multiple test problems in biomedical research may occur. A nontechnical overview is given to summarize in which cases and how adjustments for multiple hypotheses tests should be made.

2. Significance tests, multiplicity, and error rates

If one significance test at level α is performed, the probability of the type 1 error (i.e., rejecting the individual null hypothesis although it is in fact true) is the comparisonwise error rate (CER) α, also called individual level or individual error rate. Hence, the probability of not rejecting the true null hypothesis is 1 − α. If k independent tests are performed, the probability of not rejecting all k null hypotheses when in fact all are true is (1 − α)^k. Hence, the probability of rejecting at least one of the k independent null hypotheses when in fact all are true is the experimentwise error rate (EER) under the complete null hypothesis, EER = 1 − (1 − α)^k, also called global level or familywise error rate (considering the family of k tests as one experiment). If the number k of tests increases, the EER also increases. For α = 0.05 and k = 100 tests the EER amounts to 0.994. Hence, in testing 100 independent true null hypotheses one can almost be sure to get at least one false significant result. The expected number of false significant tests in this case is 100 × 0.05 = 5. Note that these calculations only hold if the k tests are independent. If the k tests are correlated, no simple formula for the EER exists, because the EER depends on the correlation structure of the test statistics.
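The error-rate calculations above can be reproduced in a few lines. A minimal sketch (plain Python; the function name is ours, chosen for illustration):

```python
# Experimentwise error rate (EER) under the complete null hypothesis:
# the probability of at least one false significant result among
# k independent tests, each performed at comparisonwise level alpha.

def experimentwise_error_rate(alpha: float, k: int) -> float:
    return 1.0 - (1.0 - alpha) ** k

alpha = 0.05
for k in (1, 5, 10, 100):
    print(f"k = {k:3d}: EER = {experimentwise_error_rate(alpha, k):.3f}, "
          f"expected number of false significant tests = {alpha * k:.2f}")

# For k = 100 this prints EER = 0.994 and an expected count of 5.00,
# matching the figures in the text.
```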
* Corresponding author. Tel.: +49 521 106-3803; fax: +49 521 106-
E-mail address: [email protected] (R. Bender)
0895-4356/01/$ – see front matter © 2001 Elsevier Science Inc. All rights reserved. PII: S0895-4356(00)00314-0

Frequently, the global null hypothesis, that all individual null hypotheses are true simultaneously, is of limited interest to the researcher. Therefore, procedures for simultaneous statistical inference have been developed that control the maximum experimentwise error rate (MEER) under any complete or partial null hypothesis, also called multiple level or familywise error rate in a strong sense.
The MEER is the probability of rejecting falsely at least one true individual null hypothesis, irrespective of which and how many of the other individual null hypotheses are true. A multiple test procedure that controls the MEER also controls the EER, but not vice versa [10]. Thus, the control of the MEER is the best protection against wrong conclusions and leads to the strongest statistical inference.

The application of multiple test procedures enables one to conclude which tests are significant and which are not, but with control of the appropriate error rate. For example, when three hypotheses A, B, C are tested and the unadjusted P values are PA = 0.01, PB = 0.02, and PC = 0.05, the Bonferroni correction would lead to the adjusted P values 0.03, 0.06, and 0.15, respectively. Hence, one can conclude that test A is significant and tests B and C are not significant, with control of the MEER at 0.05.
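The Bonferroni arithmetic behind such a conclusion is a one-liner per test: multiply each unadjusted P value by the number of tests and truncate at 1. A small sketch with three illustrative unadjusted P values:

```python
# Bonferroni adjustment: multiply each unadjusted P value by the number
# of tests k and truncate at 1; a test is significant at MEER 0.05 if
# its adjusted P value is at most 0.05. The input values are illustrative.

p_unadjusted = {"A": 0.01, "B": 0.02, "C": 0.05}
k = len(p_unadjusted)

# round() only suppresses binary floating-point noise such as 0.030000000000000002
p_adjusted = {name: min(1.0, round(k * p, 10)) for name, p in p_unadjusted.items()}
significant = {name: p <= 0.05 for name, p in p_adjusted.items()}

print(p_adjusted)   # {'A': 0.03, 'B': 0.06, 'C': 0.15}
print(significant)  # {'A': True, 'B': False, 'C': False}
```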
3. When are adjustments for multiple tests necessary?

A simple answer to this question is: If the investigator only wants to control the CER, an adjustment for multiple tests is unnecessary; if the investigator wants to control the EER or MEER, an adjustment for multiple tests is strictly required. Unfortunately, there is no simple and unique answer to the question when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions [11,12]. In addition to the problem of deciding which error rate should be under control, it has to be defined first which tests of a study belong to one experiment. For example, consider a study in which three different new treatments (T1, T2, T3) are compared with a standard treatment or control (C). All six possible pairwise comparisons (T1 vs. C, T1 vs. T2, T1 vs. T3, T2 vs. C, T2 vs. T3, T3 vs. C) can be regarded as one experiment or family of comparisons. However, by defining the comparisons of the new treatments with the control (T1 vs. C, T2 vs. C, T3 vs. C) as the main goal of the trial and the comparisons of the new treatments among each other (T1 vs. T2, T1 vs. T3, T2 vs. T3) as secondary analyses, this study consists of two experiments of connected comparisons. In this case it may be appropriate to perform separate multiplicity adjustments in each experiment. In general, we think it is logical that the MEER should be under control when the results of a well-defined family of multiple tests are summarized in one conclusion for the whole experiment. For example, if each new treatment is significantly different from the standard treatment, the conclusion that all three treatments differ from the standard treatment should be based upon an adequate control of the MEER. Otherwise the type 1 error of the final conclusion is not under control, which means that the aim of significance testing is not achieved.

Such a rigorous proceeding is strictly required in confirmatory studies. A study is considered as confirmatory if the goal of the trial is the definitive proof of a predefined key hypothesis for final decision making. For such studies a good predefined statistical analysis plan is required. A clear prespecification of the multiple hypotheses and their priorities is quite important. If it is possible to specify one clear primary hypothesis, there is no multiplicity problem. If, however, the key hypothesis is proved by means of multiple significance tests, the use of multiple test procedures is mandatory.

On the other hand, in exploratory studies, in which data are collected with an objective but not with a prespecified key hypothesis, multiple test adjustments are not strictly required. Other investigators hold the opposite position that multiplicity corrections should be performed in exploratory studies [7]. We agree that the multiplicity problem in exploratory studies is huge. However, the use of multiple test procedures does not solve the problem of making valid statistical inference for hypotheses that were generated by the data. Exploratory studies frequently require a flexible approach for design and analysis. The choice and the number of tested hypotheses may be data dependent, which means that multiple significance tests can be used only for descriptive purposes but not for decision making, regardless of whether multiplicity corrections are performed or not. As the number of tests in such studies is frequently large and usually a clear structure in the multiple tests is missing, an appropriate multiple test adjustment is difficult or even impossible. Hence, we prefer that data of exploratory studies be analyzed without multiplicity adjustment. "Significant" results based upon exploratory analyses should clearly be labeled as exploratory results. To confirm these results, the corresponding hypotheses have to be tested in further confirmatory studies.

Between the two extreme cases of strictly confirmatory and strictly exploratory studies there is a wide range of investigations representing a mixture of both types. The decision whether an analysis should be made with or without multiplicity adjustments depends on "the questions posed by the investigator and his purpose in undertaking the study" [13]. Whatever the decision is, it should clearly be stated why and how the chosen analyses are performed and how the results should be interpreted.

In the following, we consider the case of a confirmatory study with a clear prespecified key question consisting of several hypotheses analyzed by multiple significance tests. These tests represent one experiment consisting of a family of connected significance tests. For a valid final conclusion an appropriate multiplicity adjustment should be made. We present a short nontechnical overview of statistical procedures for multiple test adjustment. More technical and comprehensive overviews can be found elsewhere [10,14–16].

4. General procedures for multiple test adjustments

4.1. General procedures based upon P values

The simplest multiple test procedure is the well-known Bonferroni method [17]. Of k significance tests, those accepted as statistically significant have P values smaller than
α/k, where α is the MEER. Adjusted P values are calculated by multiplying the individual unadjusted P values Pi, i = 1, . . . , k, with the number of tests k; adjusted values above 1 are set to 1. In the same manner Bonferroni adjusted confidence intervals can be constructed by dividing the multiple confidence level by the number of confidence intervals. The Bonferroni method is simple and applicable in essentially any multiple test situation. However, the price for this simplicity and universality is low power. In fact, the Bonferroni method is frequently not appropriate, especially if the number of tests is large. Bonferroni corrections should only be used in cases where the number of tests is quite small (say, less than 5) and the correlations among the test statistics are low.

Fortunately, there are a number of improvements of the Bonferroni method [2,16,18], such as the well-known Holm procedure [19,20]. Some of these modified Bonferroni methods represent stepwise procedures based upon the closed testing procedure introduced by Marcus et al. [21], which is a general principle leading to multiple tests controlling the multiple level [10]. A general algorithm for obtaining adjusted P values for any closed test procedure is outlined by Wright [16]. While some of these methods are quite complex, the Holm method is just as simple and generally applicable as the Bonferroni method, but much more powerful.

Despite being more powerful than the simple Bonferroni method, the modified Bonferroni methods still tend to be conservative. They make use of the mathematical properties of the hypotheses structure, but they do not take the correlation structure of the test statistics into account. One approach that uses the information of dependencies and distributional characteristics of the test statistics to obtain adjusted P values is given by resampling procedures [22]. For highly correlated tests, this approach is considerably more powerful than the procedures discussed above. However, the price for the gain of power is that the resampling-based procedures are computer intensive. PROC MULTTEST of SAS offers resampling-based adjusted P values for some frequently used significance tests [22,23].
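The Holm procedure itself is short enough to sketch. The step-down form below produces monotone adjusted P values in the manner described by Wright [16]; the implementation and the input values are ours, for illustration only:

```python
def holm_adjusted(p_values):
    """Holm step-down adjusted P values; controls the MEER under any
    dependence structure of the tests."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running_max = 0.0
    for step, i in enumerate(order):
        # The j-th smallest P value is multiplied by (k - j + 1), not by k;
        # round() only suppresses binary floating-point noise.
        candidate = min(1.0, round((k - step) * p_values[i], 10))
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

print(holm_adjusted([0.01, 0.04, 0.03, 0.005]))
# [0.03, 0.06, 0.06, 0.02] -- the plain Bonferroni adjustment would give
# the uniformly larger values [0.04, 0.16, 0.12, 0.02].
```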
5. Special procedures for multiple test adjustments

One main advantage of the general multiple test procedures based upon P values is that they are universally applicable to different types of data (continuous, categorical, censored) and different test statistics (e.g., t, χ2, Fisher, logrank). Naturally, these procedures are unspecific, and special adjustment procedures have been developed for certain questions in specific multiplicity situations.

5.1. Comparison of several groups

One area in which multiplicity adjustment has a long history is the comparison of the means of several groups in analysis of variance (ANOVA) [24]. For this application a number of procedures exist. The most well-known methods, which are frequently implemented in the ANOVA procedures of statistical software packages, are the following. The simultaneous test procedures of Scheffé and Tukey can also be used to calculate simultaneous confidence intervals for all pairwise differences between means. The method of Dunnett can be used to compare several groups with a single control. In contrast to these single-step procedures, multiple stage tests are in general more powerful but give only homogeneous sets of treatment means and no simultaneous confidence intervals. The most well-known multiple stage tests are the procedures of Duncan, Student–Newman–Keuls (SNK), and Ryan–Einot–Gabriel–Welsch (REGW). These procedures, with the exception of Duncan's, preserve the MEER, at least in balanced designs. Which of these tests is appropriate depends on the investigator's needs and the study design. In short, if the MEER should be under control, with no confidence intervals needed and a balanced design, then the REGW procedure can be recommended. If confidence intervals are desirable or the design is unbalanced, then the Tukey procedure is appropriate. In the case of ordered groups (e.g., dose finding studies), procedures for specific ordered alternatives can be used with a substantial gain in power [10]. More detailed overviews of multiple test procedures for the comparison of several groups are given elsewhere [16,25–27]. Multiple comparison procedures for some nonparametric tests are also available [28].

In the frequent case of three groups the principle of closed testing leads to the following simple procedure that keeps the multiple level [10]. At first, test the global null hypothesis that all three groups are equal by a suitable level α test (e.g., an F test or the Kruskal–Wallis test). If the global null hypothesis is rejected, proceed with level α tests for the three pairwise comparisons (e.g., t tests or Wilcoxon rank sum tests).
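The three-group shortcut can be written down directly. The sketch below operates on P values assumed to come from the global and pairwise tests named in the text; the function name and the numbers are illustrative:

```python
def closed_test_three_groups(p_global, p_pairwise, alpha=0.05):
    """Closed testing for three groups: the pairwise comparisons are
    tested at the unadjusted level alpha, but only if the global null
    hypothesis is rejected first. This keeps the multiple level alpha."""
    if p_global > alpha:
        return {pair: False for pair in p_pairwise}  # stop after step 1
    return {pair: p <= alpha for pair, p in p_pairwise.items()}

# Illustrative P values, e.g. from a Kruskal-Wallis test (global) and
# Wilcoxon rank sum tests (pairwise):
p_global = 0.012
p_pairwise = {("1", "2"): 0.020, ("1", "3"): 0.300, ("2", "3"): 0.048}
print(closed_test_three_groups(p_global, p_pairwise))
# {('1', '2'): True, ('1', '3'): False, ('2', '3'): True}
```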
5.2. Multiple endpoints

The case of multiple endpoints is one of the most common multiplicity problems in clinical trials [29,30]. There are several possible strategies to deal with multiple endpoints. The simplest approach, which should always be considered first, is to specify a single primary endpoint. This approach makes adjustments for multiple endpoints unnecessary. However, all other endpoints are then subsidiary, and results concerning secondary endpoints can only have an exploratory rather than a confirmatory interpretation. The second possibility is to combine the outcomes in one aggregated endpoint (e.g., a summary score for quality of life data or the time to the first event in the case of survival data). This approach is adequate only if one is not interested in the results of the individual endpoints. Thirdly, for significance testing, multivariate methods [e.g., multivariate analysis of variance (MANOVA) or Hotelling's T2 test] and global test statistics developed by O'Brien [31] and extended by Pocock et al. [32] can be used. Exact tests suitable for a large number of endpoints and small sample size have been developed by Läuter [33]. All these methods provide an overall assessment of effects in terms of statistical significance but offer no estimate of the magnitude of the effects. Again, information about the effects concerning the individual endpoints is lacking. In addition, Hotelling's T2 test lacks power since it tests for unstructured alternative hypotheses, when in fact one is really interested in evidence from several outcomes pointing in the same direction [34]. Hence, in the case of several equally important endpoints for which individual results are of interest, multiple test adjustments are required, either alone or in combination with the previously mentioned approaches. Possible methods to adjust for multiple testing in the case of multiple endpoints are given by the general adjustment methods based upon P values [35] and the resampling methods [22] introduced above. It is also possible to allocate different type 1 error rates to several not equally important endpoints [36,37].
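The last option, allocating unequal type 1 error rates to endpoints of unequal importance [36,37], can be sketched as a Bonferroni-type split of the overall level; the endpoint names and numbers below are invented for illustration:

```python
# Split the overall MEER of 0.05 unequally across two endpoints: because
# the endpoint-specific levels sum to 0.05, the familywise type 1 error
# stays at most 0.05 (Bonferroni argument), while the more important
# endpoint is tested at a less stringent level.

alpha_total = 0.05
alpha_allocation = {"mortality": 0.04, "quality_of_life": 0.01}
assert abs(sum(alpha_allocation.values()) - alpha_total) < 1e-12

p_values = {"mortality": 0.03, "quality_of_life": 0.02}
decisions = {ep: p_values[ep] <= level for ep, level in alpha_allocation.items()}
print(decisions)  # {'mortality': True, 'quality_of_life': False}
```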
5.3. Repeated measurements

Methods to adjust for multiple testing in studies collecting repeated measurements are rare. Despite much recent work on mixed models [38,39] with random subject effects to allow for the correlation of the data, there are only a few multiple comparison procedures for special situations. It is difficult to develop a general adjustment method for multiple comparisons in the case of repeated measurements, since these comparisons occur for between-subject factors (e.g., groups), within-subject factors (e.g., time), or both. The specific correlation structure has to be taken into account, involving many difficulties. If only comparisons for between-subject factors are of interest, one possibility is to consider the repeated measurements as multiple endpoints and use one of the methods mentioned in the previous section. However, if the repeated measurements are ordered, this information is lost by using such an approach.

If repeated measurements are collected serially over time, the use of summary measures (e.g., area under the curve) to describe the response curves should be considered [40,41]. The analysis takes the form of a two-stage method where, in the first step, suitable summary measures for each response curve are calculated, and in the second step, these summary measures are analyzed by using the approaches discussed above. The choice of an adequate approach in the second stage depends on the number of groups to be compared and the number of summary measures to be analyzed. Only in the case of two groups and one summary measure as single primary endpoint does no multiplicity problem arise. To compare response curves between groups, Zerbe and Murphy have developed an extension of the Scheffé method and a stepwise procedure to adjust for multiple testing [42]. There are also multiple comparison procedures for some nonparametric tests suitable for repeated measurements.

5.4. Subgroup analyses

The extent to which subgroup analyses should be undertaken and reported is highly controversial [43,44]. We will not discuss the full range of problems and issues related to subgroup analyses but focus on the multiplicity problem. If one is interested in demonstrating a difference in the magnitude of the effect size between subgroups, a statistical test of interaction is appropriate, although such tests generally have low power [45]. If it is the aim to show an effect in all (or in some) of a priori defined subgroups on the basis of existing hypotheses, an adjustment for multiple testing should be performed by using one of the general procedures based upon P values. If there are few nonoverlapping subgroups, a test within one subgroup is independent of a test within another subgroup. In this case, the use of the simple Bonferroni method is possible. Frequently, however, subgroup analyses are performed concerning subgroups that are defined a posteriori after data examination. In this case, the results have an exploratory character regardless of whether multiplicity adjustments are performed or not. For the interpretation of such analyses one should keep in mind that the overall trial result is usually a better guide to the effect in subgroups than the estimated effect in the subgroups [46].

5.5. Interim analyses

Interim analyses of accumulating data are used in long-term clinical trials with the objective to terminate the trial when one treatment is significantly superior to the other(s). Since repeated analyses of the data increase the type 1 error, multiplicity adjustments are required for the development of adequate stopping rules. A simple rule that may be sufficient in many trials is: if no more than 10 interim analyses are planned and there is one primary endpoint, then P < .01 can be used as the criterion for stopping the trial, because the global level will not exceed .05 [47]. The disadvantage of this approach is that the final analysis also has to be undertaken at a significance level considerably smaller than .05 (namely .01). Another simple possibility is to be extremely cautious in stopping the trial early by using P < .001 for the interim analyses [48]. This approach covers any number of interim analyses and is so conservative that the final analysis can be conducted at the usual .05 level. A compromise between these approaches is to use the procedure developed by O'Brien and Fleming [49] with varying nominal significance levels for stopping the trial. Early interim analyses have more stringent significance levels, while the final analysis is undertaken as close to the .05 level as possible. Overviews about recent developments in the field of interim monitoring of clinical trials are given elsewhere [50–56].
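The two simple stopping rules above amount to comparing each interim P value with a fixed nominal threshold. A sketch (the interim P values are invented; proper O'Brien–Fleming boundaries [49] require dedicated group-sequential software):

```python
def first_stop(interim_p, threshold):
    """Index (1-based) of the first interim analysis with P below the
    nominal threshold, or None if the trial is never stopped early."""
    for i, p in enumerate(interim_p, start=1):
        if p < threshold:
            return i
    return None

interim_p = [0.200, 0.015, 0.008, 0.004]

# Rule 1 [47]: up to 10 planned interim analyses, stop at P < .01
# (the final analysis must then also use the .01 level).
print(first_stop(interim_p, 0.01))    # 3
# Rule 2 [48]: stop only at P < .001; so conservative that the final
# analysis keeps the usual .05 level. Here the trial runs to the end.
print(first_stop(interim_p, 0.001))   # None
```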
6. Discussion

The problem of multiple hypotheses testing in biomedical research is quite complex and involves several difficulties. Firstly, it is required to define which significance tests belong to one experiment; that means, which tests should be used to make one final conclusion. Secondly, the particular error rate to be under control must be chosen. Thirdly, an appropriate method for multiple test adjustment has to be found that is applicable and feasible in the considered situation. Many multiple test procedures for standard situations have been developed, but in the practice of clinical and epidemiological trials there are a lot of situations in which an adequate control of the type 1 error is quite complex, especially if there are several levels of multiplicity (e.g., more than two groups and more than one endpoint and repeated measurements of each endpoint). Unfortunately, the level of complexity can be so high that it is impossible to make an adequate adjustment for multiple testing. For example, the UK Prospective Diabetes Study (UKPDS) [57] contains an enormous complexity regarding multiplicity. Considering only the four main UKPDS publications of the year 1998 (UKPDS 33, 34, 38, 39) [58–61] (i.e., neglecting the interim analyses and multiple significance tests published in earlier and future articles), there are 2, 4, or 5 main treatment groups (dependent on the question), additional comparisons between specific medications (e.g., captopril vs. atenolol), approximately 50 endpoints (7 aggregated, 21 single, 8 surrogate, and 12 compliance endpoints), and subgroup analyses (e.g., regarding overweight patients).

Of course, for such a specific and complex design no adequate and powerful multiple test procedure exists. Although Bonferroni adjustments would be principally possible, they would allow only comparisons with P values below .00017 (.05/298) to be significant, as we counted 298 different P values in the four articles. Naturally, with nearly 300 tests the Bonferroni procedure has not enough power to detect any true effect and cannot be recommended here. The UKPDS group tried to account for multiplicity by calculating 99% confidence intervals for single endpoints [60]. This approach slightly reduces the risk of type 1 errors, but for a confirmatory study this procedure is not an adequate solution, since the main goal of a significance test, namely the control of the type 1 error at a given level, is not achieved. Moreover, although 99% confidence intervals were calculated, unadjusted P values were presented, with the effect that they are interpreted at the usual 5% level of significance [62]. Hence, in the UKPDS no firm conclusions can be drawn from the significance tests, as the actual global significance level exceeds 5% by a large and unknown amount.

To avoid such difficulties a careful planning of the study design is required, taking multiplicity into account. The easiest and best interpretable approach is to avoid multiplicity as far as possible. A good predefined statistical analysis plan and a prespecification of the hypotheses and their priorities will in general reduce the multiplicity problem. If multiplicity cannot be avoided at all (e.g., because there are several equally important endpoints), the investigators should clearly define which hypotheses belong to one experiment and then adjust for multiple testing to achieve a valid conclusion with control of the type 1 error. In the UKPDS one could have defined the intensive versus the conventional treatment as the primary comparison, with the consequence that confirmatory statements concerning the different intensive treatments are impossible. Furthermore, one could have defined the aggregated endpoint "any diabetes-related endpoint" as the primary outcome, with the consequence that all other aggregated endpoints are subsidiary. By means of a closed testing procedure it would have been possible to perform tests concerning the single endpoints forming the primary aggregated outcome (e.g., blindness, death from hypoglycemia, myocardial infarction, etc.) by preserving the MEER. The number of confirmatory analyses would be drastically reduced, but the results would be much more convincing.

A further problem we did not mention in detail concerns the type of research in which estimates of association can be obtained for a broad range of possible predictor variables. In such studies, authors may focus on the most significant of several analyses, a selection process that may bias the magnitude of the observed associations (both point estimates and confidence intervals). One way to deal with this type of multiplicity problem is to demand reproduction of the observed associations and their magnitude in further independent trials. However, this 'solution' does not address the adjustment of significance levels. A data-driven analysis and presentation, also called 'data dredging' or 'data fishing,' can only produce exploratory results. It can be used to generate hypotheses but not to test and confirm them, regardless of whether multiplicity corrections are performed or not. Hence, the use of multiple test procedures cannot protect against the bias caused by data fishing.

In principle, there is an alternative approach to significance testing for the analysis of data. Bayes methods differ from all the methods discussed above in minimizing the Bayes risk under additive loss rather than controlling type 1 error rates. From a Bayesian perspective, control of the type 1 error is not necessary to make valid inferences. Thus, the use of Bayes methods avoids some of the conceptual and practical difficulties involved with the control of the type 1 error, especially in the case of multiplicity. Hence, Bayes methods are useful for some of the multiplicity situations discussed above. Examples are the monitoring of clinical trials [63] and the use of empirical Bayes methods for the analysis of a large number of related endpoints [64,65]. However, in this article we concentrate on classical statistical methods based upon significance tests. We started from the assumption that an investigator has decided to use significance tests for data analysis. For this case we tried to summarize the available corresponding procedures to adjust for multiple testing. Bayes methods, which do not provide adjustments of P values as they do not give P values at all, are therefore not considered further.

In summary, methods to adjust for multiple testing are valuable tools to ensure valid statistical inference. They should be used in all confirmatory studies in which, on the basis of a clearly defined family of tests, one final conclusion
and decision will be drawn. In such cases the maximum experimentwise error rate under any complete or partial null hypothesis should be under control. While the simple Bonferroni method is frequently not appropriate due to low power, there are a number of more powerful approaches applicable in various multiplicity situations. These methods deserve wider knowledge and application in biomedical and epidemiological research.

Acknowledgment

We thank Dr. Gernot Wassmer (Cologne, Germany) for his careful reading of the manuscript and his valuable comments.

References

[1] Koch GG, Gansky SA. Statistical considerations for multiplicity in confirmatory protocols. Drug Inf J 1996;30:523–34.
[2] Sankoh AJ, Huque MF, Dubin N. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Stat Med
[3] The CPMP Working Party on Efficacy of Medical Products. Biostatistical methodology in clinical trials in applications for marketing authorizations for medical products. Stat Med 1995;14:1659–82.
[4] Rothman KJ. No adjustments are needed for multiple comparisons.
[5] Savitz DA, Olshan AF. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol 1995;142:
[6] Savitz DA, Olshan AF. Describing data requires no adjustment for multiple comparisons: a reply from Savitz and Olshan. Am J Epidemiol
[7] Perneger TV. What's wrong with Bonferroni adjustments. BMJ 1998;
[8] Aickin M. Other method for adjustment of multiple testing exists [Letter].
[9] Bender R, Lange S. Multiple test procedures other than Bonferroni's deserve wider use [Letter]. BMJ 1999;318:600–1.
[10] Bauer P. Multiple testing in clinical trials. Stat Med 1991;10:871–90.
[11] Thompson JR. Invited commentary: Re: "Multiple comparisons and related issues in the interpretation of epidemiologic data." Am J Epidemiol
[12] Goodman SN. Multiple comparisons, explained. Am J Epidemiol
[13] O'Brien PC. The appropriateness of analysis of variance and multiple comparison procedures. Biometrics 1983;39:787–94.
[14] Miller RG. Simultaneous statistical inference. New York: McGraw-Hill,
[15] Hochberg Y, Tamhane AC. Multiple comparison procedures. New York: Wiley, 1987.
[21] Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976;
[22] Westfall PH, Young SS. Resampling-based multiple testing. New York: Wiley, 1993.
[23] Westfall PH, Young SS. Reader reaction: on adjusting P-values for multiplicity. Biometrics 1993;49:941–5.
[24] Altman DG, Bland JM. Comparing several groups using analysis of variance.
[25] Godfrey K. Comparing means of several groups. N Engl J Med 1985;
[26] Jaccard J, Becker MA, Wood G. Pairwise multiple comparison procedures: a review. Psychol Bull 1984;96:589–96.
[27] Seaman MA, Levin JR, Serlin RC. New developments in pairwise multiple comparisons: some powerful and practicable procedures.
[28] Conover WJ. Practical nonparametric statistics. New York: Wiley,
[29] Pocock SJ. Clinical trials with multiple outcomes: a statistical perspective on their design, analysis, and interpretation. Contr Clin Trials
[30] Zhang J, Quan H, Ng J. Some statistical methods for multiple endpoints in clinical trials. Contr Clin Trials 1997;18:204–21.
[31] O'Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics 1984;40:1079–87.
[32] Pocock SJ, Geller NL, Tsiatis AA. The analysis of multiple endpoints in clinical trials. Biometrics 1987;43:487–98.
[33] Läuter J. Exact t and F tests for analyzing studies with multiple endpoints. Biometrics 1996;52:964–70.
[34] Follmann D. Multivariate tests for multiple endpoints in clinical trials.
[35] Lehmacher W, Wassmer G, Reitmeir P. Procedures for two-sample comparisons with multiple endpoints controlling the experimentwise error rate. Biometrics 1991;47:511–21.
[36] Moyé LA. P-value interpretation and alpha allocation in clinical trials.
[37] Moyé LA. Alpha calculus in clinical trials: considerations and commentary for the new millennium. Stat Med 2000;19:767–79.
[38] Cnaan A, Laird NM, Slasor P. Tutorial in biostatistics: using the general linear mixed model to analyse unbalanced repeated measures and longitudinal data. Stat Med 1997;16:2349–80.
[39] Burton P, Gurrin L, Sly P. Tutorial in biostatistics: extending the simple linear regression model to account for correlated responses: an introduction to generalized estimating equations and multi-level mixed modelling. Stat Med 1998;17:1261–91.
[40] Matthews JNS, Altman DG, Campbell MJ, Royston P. Analysis of serial measurements in medical research. BMJ 1990;300:230–5.
[41] Senn S, Stevens L, Chaturvedi N. Tutorial in biostatistics: repeated measures in clinical trials: simple strategies for analysis using summary measures. Stat Med 2000;19:861–77.
[42] Zerbe GO, Murphy JR. On multiple comparisons in the randomization analysis of growth and response curves. Biometrics 1986;42:
[43] Oxman AD, Guyatt GH. A consumer's guide to subgroup analysis.
[44] Feinstein AR. The problem of cogent subgroups: a clinicostatistical
[16] Wright SP. Adjusted p-values for simultaneous inference. Biometrics
tragedy. J Clin Epidemiol 1998;51:297–9.
[45] Buyse ME. Analysis of clinical trial outcomes: some comments on
[17] Bland JM, Altman DG. Multiple significance tests: the Bonferroni
subgroup analyses. Contr Clin Trials. 1989;10:187S–94S.
[46] Yusuf S, Wittes J, Probstfield J, Tyroler A. Analysis and interpreta-
[18] Levin B. Annotation: on the Holm, Simes, and Hochberg multiple
tion of treatment effects in subgroups of patients in randomised clini-
test procedures. Am J Public Health 1996;86:628–9.
[19] Holm S. A simple sequentially rejective multiple test procedure.
[47] Pocock SJ. Clinical trials: a practical approach. Chichester: Wiley,
[20] Aickin M, Gensler H. Adjusting for multiple testing when reporting
[48] Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV,
research results: the Bonferroni vs Holm methods. Am J Public
Mantel N, McPherson K, Peto J, Smith PG. Design and analysis of
randomised clinical trials requiring prolonged observation of each pa-
[21] Marcus R, Peritz E, Gabriel KR. On closed testing procedures with
tient. I. Introduction and design. Br J Cancer 1976;34:585–612. R. Bender, S. Lange / Journal of Clinical Epidemiology 54 (2001) 343–349
[49] O’Brien PC, Fleming TR. A multiple testing procedure for clinical
blood-glucose control with sulphonylureas or insulin compared with
trials. Biometrics 1979;35:549–56.
conventional treatment and risk of complications in patients with type
[50] DeMets DL, Lan KKG. Overview of sequential methods and their ap-
2 diabetes (UKPDS 33). Lancet 1998;352:837–53.
plications in clinical trials. Commun Stat A 1984;13:2315–38.
[59] The UK Prospective Diabetes Study (UKPDS) Group. Effect of in-
[51] Geller NL, Pocock SJ. Interim analyses in randomized clinical trials:
tensive blood-glucose control with metformin on complications in
ramifications and guidelines for practioners. Biometrics 1987;43:
overweight patients with type 2 diabetes (UKPDS 34). Lancet 1998;
[52] Jennison C, Turnbull BW. Statistical approaches to interim monitor-
[60] The UK Prospective Diabetes Study (UKPDS) Group. Tight blood
ing of medical trials: a review and commentary. Stat Sci 1990;5:299–
pressure control and risk of macrovascular and microvascular compli-
cations in type 2 diabetes: UKPDS 38. BMJ 1998;317:703–13.
[53] Pocock SJ. Statistical and ethical issues in monitoring clinical trials.
[61] The UK Prospective Diabetes Study (UKPDS) Group. Efficacy of
atenolol and captopril in reducing risk of macrovascular and microvas-
[54] Lee JW. Group sequential testing in clinical trials with multivariate
cular complications in type 2 diabetes: UKPDS 39. BMJ 1998;317:
observations: a review. Stat Med 1994;13:101–11.
[55] Facey KM, Lewis JA. The management of interim analyses in drug
[62] de Fine Olivarius N, Andreasen AH. The UK Prospective Diabetes
development. Stat Med 1998;17:1801–9.
Study [Letter]. Lancet 1998;352:1933.
[56] Skovlund E. Repeated significance tests on accumulating survival
[63] Fayers PM, Ashby D, Parmar MKB. Tutorial in biostatistics: Baye-
data. J Clin Epidemiol 1999;52:1083–8.
sian data monitoring in clinical trials. Stat Med 1997;16:1413–30.
[57] The UK Prospective Diabetes Study (UKPDS) Group. U.K. Prospec-
[64] Clayton D, Kaldor J. Empirical Bayes estimates of age-standardized
tive Diabetes Study (UKPDS): VIII. Study design, progress and per-
relative risks for use in disease mapping. Biometrics 1987;43:671–81.
formance. Diabetologia 1991;34:877–90.
[65] Greenland S, Robins JM. Empirical-Bayes adjustments for multiple
[58] The UK Prospective Diabetes Study (UKPDS) Group. Intensive
comparisons are sometimes useful. Epidemiology 1991;2:244–51.