# Model Uncertainty and Model Selection

In the previous sections we have described a variety of commonly used dose-response models and outlined several ways of deriving initial parameter estimates for constructing model contrasts. We have not yet discussed the problem of selecting a model for the final analysis. This is a particularly important issue in the regulated environment of pharmaceutical drug development, since the statistical analysis (and thus also the dose-response model used for the final analysis) must be completely prespecified before the start of a study.

The application of a model-based approach to analyze a given dose-response study may raise questions about the validity of the statistical results, since the true underlying dose-response relationship is typically unknown. This model uncertainty persists throughout drug development, up to the late Phase III clinical trials. Thus, the justification of a particular dose for release on the market rests on the assumption of a particular dose-response model, and the supporting statistical analyses are mainly conducted to obtain the best parameter estimates for a suitable description of the collected data. Even if the model specification is based on supporting information from prior trials, there always remains a more or less remote possibility of model misspecification. Current statistical practice, however, mostly does not take this additional source of uncertainty into account. Instead, current approaches assume the underlying model to be true and thus do not reflect the complete statistical decision process. This is of particular concern in standard confirmatory clinical trials, since in late development stages the statistical methods, inherently including the specification of the unknown dose-response model, need to be defined up front.

In fact, model uncertainty is one of the major pitfalls in statistical modeling. The intrinsic problem is that selecting a particular model M (say) at any stage prior to the final analysis introduces a new source of variability. Standard statistical analysis neglects this fact and reports the final results without accounting for this extra variability. It is common practice, for example, to report the variance of a parameter estimate as var(μ̂), when in fact only the conditional variance var(μ̂ | M), given the selected model, has been computed. In addition, the model selection process can introduce substantial bias into the estimates of the parameters of interest. While it is admittedly difficult to compute unbiased estimates conditional on the selected model, ignoring model uncertainty completely can lead to very undesirable effects (Chatfield, 1995; Draper, 1995; Hjorth, 1994).
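The extra variability induced by model selection can be made concrete in a small simulation. The sketch below (a hypothetical setup, assuming numpy is available) selects between a linear and a quadratic fit by AIC and compares the empirical spread of the post-selection slope estimate with the average standard error computed conditionally on the selected model; the empirical spread will typically exceed the reported conditional standard errors.

```python
import numpy as np

# Hypothetical simulation: the true dose-response is linear, but the
# analysis selects between a linear and a quadratic fit via AIC.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
n = x.size
slopes, model_se = [], []

for _ in range(2000):
    y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
    best = None
    for deg in (1, 2):
        X = np.vander(x, deg + 1, increasing=True)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        rss = float(((y - X @ beta) ** 2).sum())
        # Gaussian AIC up to a constant; deg + 2 = coefficients + error variance
        aic = n * np.log(rss / n) + 2 * (deg + 2)
        # model-based standard error of the linear coefficient,
        # computed conditionally on the fitted model
        sigma2 = rss / (n - deg - 1)
        se = float(np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1]))
        if best is None or aic < best[0]:
            best = (aic, beta[1], se)
    slopes.append(best[1])
    model_se.append(best[2])

print("empirical SD of post-selection slope:", round(float(np.std(slopes)), 3))
print("average conditional (model-based) SE:", round(float(np.mean(model_se)), 3))
```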

A common approach when one has to select the model that will ultimately be used to fit the data is to use information criteria based on a reasonable discrepancy measure for the lack of fit. Many model selection criteria are available, and the discussion on the best method is still ongoing (Zucchini, 2000; Kadane and Lazar, 2004). One common measure is the ratio R² of the sum of squares for regression to the total sum of squares, which is reported by most software packages in a standard regression analysis. The problem with R² is that the sum of squares for regression, and hence by construction R² itself, increases with the number of parameters and thus leads to over-fitting. If solely the unadjusted R² were used, the selected model would typically be among the most complex ones in the given set of candidate models. A variety of alternative measures have therefore been proposed which include a penalty term: as the model gets more complex, i.e., as the number of parameters p grows, the penalty increases. The inclusion of a new variable is then supported only if the resulting information gain (in the form of better prediction) is substantial, as measured against the penalty correction. A well-known information criterion is the penalized log-likelihood AIC = −2 log(L) + 2p (Akaike, 1973), where L denotes the likelihood under the fitted model and p the number of its parameters. A second information criterion of the same general form is the Bayesian information criterion BIC = −2 log(L) + p log(n) (Schwarz, 1978), which also depends on the sample size n. Although the two criteria are derived in completely different ways, the BIC differs from the AIC only in the penalty term, favoring simpler models than the AIC as n increases.
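As a small illustration of these criteria (not tied to any particular study), the following sketch fits two candidate dose-response shapes to made-up data by least squares and compares their AIC and BIC values; numpy and scipy are assumed to be available, and all doses, responses, and starting values are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

# hypothetical dose-response data (doses, mean responses)
dose = np.array([0.0, 0.05, 0.2, 0.6, 1.0])
resp = np.array([0.10, 0.25, 0.48, 0.62, 0.60])
n = dose.size

def aic_bic(rss, n, p):
    # Gaussian -2 log-likelihood up to an additive constant,
    # with the error variance profiled out: n * log(rss / n)
    m2ll = n * np.log(rss / n)
    return m2ll + 2 * p, m2ll + p * np.log(n)

# candidate 1: straight line (2 coefficients + error variance -> p = 3)
X = np.column_stack([np.ones(n), dose])
beta = np.linalg.lstsq(X, resp, rcond=None)[0]
rss_lin = float(((resp - X @ beta) ** 2).sum())
aic_lin, bic_lin = aic_bic(rss_lin, n, p=3)

# candidate 2: Emax model E0 + Emax * d / (ED50 + d) (3 + 1 -> p = 4)
def emax(d, e0, emax_, ed50):
    return e0 + emax_ * d / (ed50 + d)

theta, _ = curve_fit(emax, dose, resp, p0=[0.1, 0.6, 0.2])
rss_emax = float(((resp - emax(dose, *theta)) ** 2).sum())
aic_emax, bic_emax = aic_bic(rss_emax, n, p=4)

print(f"linear: AIC={aic_lin:.2f} BIC={bic_lin:.2f}")
print(f"Emax  : AIC={aic_emax:.2f} BIC={bic_emax:.2f}")
```

Note that for n = 5 the BIC penalty log(n) ≈ 1.61 is smaller than the AIC penalty 2; the BIC only penalizes more heavily than the AIC once n > e² ≈ 7.4.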

Note that these and other information criteria are generally not suitable for dose-response analyses, as they do not incorporate potential parameter constraints, such as the simple order restriction μ1 ≤ ... ≤ μk on the mean responses. This parallels a result from the theory of order restricted inference: the maximum likelihood estimates of the mean responses subject to a given order restriction differ from the unrestricted maximum likelihood estimates (Robertson et al., 1988). Anraku (1999) therefore proposed an order restricted information criterion (ORIC) based on isotonic regression theory. Simulation results suggest that the ORIC indeed behaves better than the AIC, in that it more often selects the correct target dose across varying model specifications and parameter configurations.
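The order-restricted estimates underlying the ORIC can be illustrated with the pool-adjacent-violators algorithm (PAVA) from isotonic regression. The plain-Python sketch below (equal weights by default, toy input values) shows how unrestricted group means that violate a nondecreasing order are replaced by pooled, restricted estimates; this covers only the isotonic-fit step, not the ORIC itself.

```python
def pava(y, w=None):
    """Pool-adjacent-violators: least-squares fit of the values y
    subject to a nondecreasing order restriction on the fitted values."""
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    blocks = []  # each block: [pooled mean, total weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge while the last two blocks violate the ordering
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

# toy unrestricted dose-group means with a violation at the third dose
print(pava([0.2, 0.5, 0.4, 0.7]))  # → [0.2, 0.45, 0.45, 0.7]
```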

However, any measure of fit, be it the AIC, the ORIC, or any other criterion, bears the inherent drawback of lacking error control. If we simply select the model with the best ORIC, say, we can draw no conclusion on the validity of that decision. In fact, once a candidate set of models is given, applying the ORIC will always lead to the selection of one single model, irrespective of the goodness of fit to the observed data. Instead, model selection uncertainty should be incorporated into the statistical inference whenever the estimation process is sensitive to the ultimate model choice (Shen et al., 2004). To circumvent, at least in part, the problem of inference conditional on a selected model, weighting methods and computer-intensive, simulation-based inference procedures have been proposed.

We first consider the weighting methods, which incorporate rather than ignore model uncertainty by computing parameter estimates as a weighted average across the models. A straightforward solution in line with the information criterion approaches above was introduced by Buckland, Burnham, and Augustin (1997). Let M = {M1, ..., ML} denote a set of L candidate models indexed by ℓ. Buckland et al. (1997) proposed to use the weighted estimate μ̂ = Σℓ wℓ μ̂ℓ, where μ̂ℓ is the estimate of μ under model Mℓ for given weights wℓ. The idea is thus to base the final data analysis on estimates averaged across all L models. Buckland et al. (1997) proposed the use of the weights

wℓ = exp(−ICℓ / 2) / Σk exp(−ICk / 2),    ℓ = 1, ..., L,    (10.15)

which are defined in terms of a common information criterion IC applied to each of the L models. Alternatively, Buckland et al. (1997) proposed to set the weights wℓ equal to the proportions of bootstrap samples for which model Mℓ was selected. Augustin et al. (2002) extended this resampling approach and proposed bootstrap aggregation methods instead. Note that although these approaches provide a simple and intuitive way to overcome some of the model uncertainty problems, one is still left with the open question of how to ultimately choose the final model for further inference. Similar in spirit, though based on completely different theoretical reasoning, are Bayesian model averaging techniques. Having observed the data X, the posterior distributions P(μ | X, Mℓ) under each of the investigated models are weighted according to their posterior model probabilities P(Mℓ | X), so that

P(μ | X) = Σℓ P(μ | X, Mℓ) P(Mℓ | X).

We refer to Hoeting et al. (1999) and Clyde and George (2004) for discussions of the related methods and several examples. Note that choosing the BIC for the calculation of the weights in Eq. (10.15) leads to results that are closely related to Bayesian model averaging (Augustin et al., 2002).

Cross-validation techniques are an example of computer-intensive, simulation-based approaches. A data set consisting of n observations is split into two subsets. The first subset, of size n1, is used for fitting the model (learning sample). The second subset, of size n2 = n − n1, is used for validating the fitted model, i.e., for assessing its predictive ability (validation sample). Clearly, using n1 observations instead of the complete sample of size n may substantially reduce the quality of the fit. One possibility is to repeat the above procedure for all (or some) learning samples of size n1 and to assess the average predictive ability. A common approach is thus to choose n2 = 1 (leave-one-out method) and to repeat a large number of times the two steps of (1) fitting the model to the learning sample of size n1 = n − 1 and (2) validating the fitted model on the single remaining observation. Cross-validation then selects the model with the best predictive ability across the replications. We refer to Hjorth (1994) and Hastie et al. (2001) for further details on cross-validation and related techniques.
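The leave-one-out scheme just described can be sketched in a few lines. The code below (hypothetical data, numpy assumed) compares polynomial fits of increasing degree by their leave-one-out squared prediction error; an overparameterized fit typically shows a larger error even though it fits the full sample more closely.

```python
import numpy as np

def loo_press(x, y, degree):
    """Leave-one-out mean squared prediction error for a polynomial fit."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        # step 1: fit the model on the n - 1 remaining observations
        coef = np.polyfit(x[mask], y[mask], degree)
        # step 2: validate on the single left-out observation
        pred = np.polyval(coef, x[i])
        errs.append((y[i] - pred) ** 2)
    return float(np.mean(errs))

# hypothetical data: the truth is linear with moderate noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 1 + 2 * x + rng.normal(0, 0.3, x.size)

for deg in (1, 2, 5):
    print(f"degree {deg}: LOO error = {loo_press(x, y, deg):.4f}")
```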

A different philosophy is to consider model selection as a multiple hypothesis testing problem, where a specific model is selected while controlling the familywise error rate (FWER) at a prespecified level α (Shimodaira, 1998; Junquera et al., 2002). In this context, the FWER may be interpreted as 1 − probability of correct model selection. In addition, a reference set of good models is constructed rather than a single model being chosen. Shimodaira (1998), for example, considered testing the set of hypotheses

Hℓ : E(AIC_Mℓ) ≤ min{E(AIC_Mk) : Mk ∈ M \ {Mℓ}},    ℓ = 1, ..., L,

where E(AIC_Mℓ) is the expected AIC value for model Mℓ. The proposed multiple test procedure uses the standardized differences of any two AIC values within a variant of Gupta's subset selection procedure, relying on bootstrap techniques to assess the joint distribution of the test statistics. A final reference set T of good models is obtained as

T = {Mℓ ∈ M : P_Mℓ > α},

where P_Mℓ is the p-value associated with the ℓth model. By construction, if P_Mℓ ≤ α, the AIC value for model Mℓ is significantly larger than the minimum AIC over the remaining set M \ {Mℓ}. Thus, this approach starts with all candidate models and removes only those shown to perform worse than the others. The reference set T is therefore never empty and may contain more than one model at the end.
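A crude, purely illustrative stand-in for such a bootstrap-based reference set is sketched below: each model's "p-value" is taken as the bootstrap proportion of replications in which it attains the smallest AIC, and models whose proportion exceeds α are retained. This is a simplification rather than a faithful implementation of Shimodaira's procedure, and all numbers are hypothetical.

```python
import numpy as np

def reference_set(boot_aic, model_names, alpha=0.05):
    """boot_aic: (B x L) array of bootstrap AIC replicates for L models.
    A model's crude 'p-value' is the proportion of replications in which
    it attains the minimum AIC; models above alpha stay in the set."""
    winners = boot_aic.argmin(axis=1)
    pvals = np.bincount(winners, minlength=boot_aic.shape[1]) / boot_aic.shape[0]
    return {m for m, p in zip(model_names, pvals) if p > alpha}, pvals

# hypothetical bootstrap AIC replicates for three candidate models
rng = np.random.default_rng(7)
B = 1000
boot = np.column_stack([
    rng.normal(110, 2, B),   # model 1: competitive
    rng.normal(111, 2, B),   # model 2: competitive
    rng.normal(125, 2, B),   # model 3: clearly worse
])
T, pvals = reference_set(boot, ["M1", "M2", "M3"], alpha=0.05)
print(sorted(T), pvals)
```

Both competitive models end up in the reference set, while the clearly inferior one is removed, mirroring the behavior described above.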