Juliet Popper Shaffer
University of California
Yung-Pin Chen
Smith College
Journal of Statistics Education v.7, n.3 (1999)
Copyright (c) 1999 by Juliet Popper Shaffer and Yung-Pin Chen, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Best linear unbiased estimation; Least squares estimation; Unbiased estimation; Variance estimation.
A useful way of approaching a statistical problem is to consider whether the addition of some missing information would transform the problem into a standard form with a known solution. The EM algorithm (Dempster, Laird, and Rubin 1977), for example, makes use of this approach to simplify computation. Occasionally it turns out that knowledge of the missing values is not necessary to apply the standard approach. In such cases the following simple logical argument shows that any optimality properties of the standard approach in the full-information situation generalize immediately to the approach in the original limited-information situation: If any better estimate were available in the limited-information situation, it would also be available in the full-information situation, which would contradict the optimality of the original estimator. This approach then provides a simple proof of optimality, and often leads directly to a simple derivation of other properties of the solution. The approach can be taught to graduate students and theoretically-inclined undergraduates. Its application to the elementary proof of a result in linear regression, and some extensions, are described in this paper. The resulting derivations provide more insight into some equivalences among models as well as proofs simpler than the standard ones.
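1 Assume the linear regression model
$$Y = X\beta + \varepsilon, \qquad (1)$$
where $Y$ is an $n \times 1$ random vector of observations on the dependent variable in the $n$ experimental units, $X$ is a fixed, known $n \times (p+1)$ matrix with all elements of the first column equal to one and of rank $p+1$, $\beta$ is a fixed, unknown $(p+1) \times 1$ vector of coefficients, and $\varepsilon$ is a random $n \times 1$ vector with expected value zero and covariance matrix $\sigma^2 I$. Under model (1), the best linear unbiased estimator (BLUE) of the vector $\beta$ is
$$\hat{\beta} = (X'X)^{-1}X'Y. \qquad (2)$$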
2 Let $\beta^*$ be the vector of coefficients excluding the constant additive term. A well-known fact, stated in most regression textbooks, is that the expression for estimating $\beta^*$ can be obtained from the inversion of a $p \times p$ rather than a $(p+1) \times (p+1)$ matrix by expressing $X$ and $Y$ in deviation form (i.e., subtracting column means), and using an expression of the same form as (2).
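3 Let $\bar{Y}$ be the mean of the $n$ components of $Y$, and let $\bar{x}$ be the $1 \times (p+1)$ row vector with elements $\bar{x}_i$, where $\bar{x}_i$ is the mean of the $n$ components of column $i$ of the matrix $X$. Let $\mathbf{1}$ be an $n \times 1$ vector of ones. Then $\hat{\beta}^*$, with elements 2 to $(p+1)$ of $\hat{\beta}$, can be expressed as
$$\hat{\beta}^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^*, \qquad (3)$$
where $X^*$ is the $n \times p$ matrix consisting of columns 2 to $(p+1)$ of $X - \mathbf{1}\bar{x}$ (the first column of $X - \mathbf{1}\bar{x}$ is identically zero and is dropped) and $Y^* = Y - \bar{Y}\mathbf{1}$.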
The proof of this result is not immediately obvious because the covariance matrix of $Y^*$, given $X^*$, is no longer $\sigma^2 I$ but is
$$\sigma^2\left(I - \frac{1}{n}J\right), \qquad (4)$$
where $J$ is an $n \times n$ matrix with all elements equal to 1. Draper and Smith (1998, pp. 27-28) and Daniel and Wood (1980, pp. 13-14) state the result without proof. Sen and Srivastava (1990, pp. 42 and 146) ask for the proof in a problem, and suggest using a generalized inverse (since the covariance matrix (4) is singular). Arnold (1981), Graybill (1976), Searle (1971), and Seber (1977) have relatively long proofs involving partitioning of matrices.
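The equivalence of the raw-form and deviation-form solutions can be checked numerically before it is proved. The following is a minimal sketch, assuming NumPy is available; the sample size, the coefficient values, and the variable names are hypothetical choices made only for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3  # hypothetical sample size and number of predictors

# Design matrix with a leading column of ones, as in the model with intercept.
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])
beta = np.array([2.0, 1.0, -0.5, 0.25])  # illustrative true coefficients
Y = X @ beta + rng.normal(scale=0.3, size=n)

# Raw form: solve the (p+1) x (p+1) normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Deviation form: centre predictors and response, then solve a p x p system.
X_star = predictors - predictors.mean(axis=0)
Y_star = Y - Y.mean()
beta_star_hat = np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)

# The slopes from the two computations agree up to rounding error.
slopes_agree = np.allclose(beta_hat[1:], beta_star_hat)
print(slopes_agree)
```

A check of this kind illustrates the claim for one data set only; it is not a substitute for the proof.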
4 This paper presents a proof that involves only elementary facts that are usually presented in a first introduction to regression. The method of proof involves consideration of missing information that, if known, would put the equation to be solved into a standard regression form.
2. Proof of the Equivalence of Raw Form and Deviation Form Solutions
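5 Under Model (1), note that
$$\bar{Y} = \bar{x}\beta + \bar{\varepsilon}, \qquad (5)$$
where $\bar{\varepsilon}$ is the mean of the $n$ components of $\varepsilon$. Subtracting (5) from each row of (1) yields
$$Y^* = X^*\beta^* + \varepsilon - \bar{\varepsilon}\mathbf{1}, \qquad (6)$$
where $X^*$ is the $n \times p$ matrix defined below Equation (3). Suppose $\bar{\varepsilon}$ were known. Adding $\bar{\varepsilon}\mathbf{1}$ to both sides of (6) yields the equation
$$Y^{**} = X^*\beta^* + \varepsilon, \qquad (7)$$
where $Y^{**} = Y^* + \bar{\varepsilon}\mathbf{1}$. Note that $E(\varepsilon) = 0$ and $\mathrm{Cov}(\varepsilon) = \sigma^2 I$, so (7) is in the form (1), but with $Y$, $X$, and $\beta$ replaced by $Y^{**}$, $X^*$, and $\beta^*$, with $X^*$ in deviation form. Thus, the BLUE of $\beta^*$ is
$$(X^{*\prime}X^*)^{-1}X^{*\prime}Y^{**} = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^*. \qquad (8)$$
The last equality follows because, although $\bar{\varepsilon}$, which appears in (8), is unknown, each element $i$ of the vector $X^{*\prime}\bar{\varepsilon}\mathbf{1}$ is in the form $\bar{\varepsilon}\sum_{j=1}^{n}(x_{ji} - \bar{x}_i)$, which equals zero, since $X^*$ is in deviation form. Then $(X^{*\prime}X^*)^{-1}X^{*\prime}Y^*$ is the BLUE of $\beta^*$, as is $\hat{\beta}$ omitting the first element. Therefore, $\hat{\beta}$, without the intercept element, and $(X^{*\prime}X^*)^{-1}X^{*\prime}Y^*$ must be equivalent.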
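The step that removes the unknown mean error can be written out in one line. The sketch below uses assumed notation: $X^*$ for the centred predictor matrix, $Y^*$ for the centred response, $\bar{\varepsilon}$ for the mean of the errors, $\mathbf{1}$ for the vector of ones, and $x_{ji}$ for the entry of the uncentred predictor column $i$ in row $j$.

```latex
\begin{align*}
X^{*\prime}\left(Y^* + \bar{\varepsilon}\mathbf{1}\right)
  &= X^{*\prime}Y^* + \bar{\varepsilon}\,X^{*\prime}\mathbf{1}, \\
\left(X^{*\prime}\mathbf{1}\right)_i
  &= \sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)
   = n\bar{x}_i - n\bar{x}_i = 0 .
\end{align*}
```

Because every centred column sums to zero, the estimator computed from the full-information response coincides with the one computed from the observable centred response, whatever the value of $\bar{\varepsilon}$.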
6 Note that, instead of adding $\bar{\varepsilon}\mathbf{1}$ to both sides of (6) in Section 2, the realization of any random variable independent of $\varepsilon$ with mean zero and variance $\sigma^2/n$ could be added to each observation, and the proof would proceed in the same way. By combining the method of Section 2 with the addition of an external variable independent of $\varepsilon$, a further equivalence can be demonstrated.
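The variance required of the added variable can be verified directly. In the sketch below, $u$ is a hypothetical name for the external random variable, $\varepsilon$ is the error vector with covariance $\sigma^2 I$, $\bar{\varepsilon}$ is its mean, and $J = \mathbf{1}\mathbf{1}'$ is the $n \times n$ matrix of ones; this notation is assumed for illustration.

```latex
\begin{align*}
\operatorname{Cov}\left(\varepsilon - \bar{\varepsilon}\mathbf{1}\right)
  &= \sigma^2\left(I - \tfrac{1}{n}J\right), \\
\operatorname{Cov}\left(\varepsilon - \bar{\varepsilon}\mathbf{1} + u\mathbf{1}\right)
  &= \sigma^2\left(I - \tfrac{1}{n}J\right) + \operatorname{Var}(u)\,J
   = \sigma^2 I
   \quad\text{when } \operatorname{Var}(u) = \tfrac{\sigma^2}{n} .
\end{align*}
```

The second line uses the independence of $u$ and $\varepsilon$, so that the two covariance matrices add.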
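Theorem: Consider a model of the form (1) but in which the covariance matrix of $\varepsilon$ is
$$\sigma^2\left[(1-\rho)I + \rho J\right], \qquad -\frac{1}{n-1} \le \rho \le 1. \qquad (9)$$
The BLUE of $\beta^*$ in the model (1) but with the covariance matrix of $\varepsilon$ given by Equation (9) is the same as the BLUE of $\beta^*$ in the model (1) with the covariance matrix of $\varepsilon$ equal to $\sigma^2 I$ (note that $\beta^*$ is $\beta$ without the intercept term). Furthermore, the covariance matrix of the BLUE of $\beta^*$ with error covariance matrix (9) is $(1-\rho)$ times as great as the covariance matrix of the BLUE of $\beta^*$ with error covariance matrix $\sigma^2 I$.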
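Both claims of the theorem can be illustrated numerically for a nonsingular case. The following is a minimal sketch, assuming NumPy and assuming that the error covariance matrix has the equicorrelated form $\sigma^2[(1-\rho)I + \rho J]$ with $J$ the matrix of ones; the values of `n`, `p`, `sigma2`, and `rho` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 2
sigma2, rho = 1.5, 0.4  # illustrative values; rho chosen so V is nonsingular

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
I, J = np.eye(n), np.ones((n, n))
V = sigma2 * ((1 - rho) * I + rho * J)  # assumed equicorrelated covariance

# Generalized least squares weights under V: (X' V^-1 X)^-1 X' V^-1.
V_inv = np.linalg.inv(V)
gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv)

# Ordinary least squares weights: (X'X)^-1 X'.
ols = np.linalg.solve(X.T @ X, X.T)

# Claim 1: the slope rows of the two estimators coincide.
same_estimator = np.allclose(gls[1:], ols[1:])

# Claim 2: the slope covariance under V is (1 - rho) times the slope
# covariance under the standard covariance sigma2 * I.
cov_equicorr = (ols @ V @ ols.T)[1:, 1:]
cov_standard = (sigma2 * np.linalg.inv(X.T @ X))[1:, 1:]
ratio_ok = np.allclose(cov_equicorr, (1 - rho) * cov_standard)

print(same_estimator, ratio_ok)
```

The singular endpoints of the range of $\rho$ cannot be handled by inverting `V` as done here, which is one reason an argument that avoids the inverse is attractive.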
Proof: Three cases will be considered separately.
7 The proof for $\rho > 0$ requires a further comment. Up to now, there has been no reference to the distribution of the errors, except for expected values and variances. As is well known, the results above are distribution-free, requiring only that $\sigma^2$ be finite. However, the proof above for $\rho > 0$, in which a random vector is expressed as the sum of two others, requires some distributional assumption, since not all random vectors can be expressed in that form.
8 Assume two models A and B with the same regression coefficients but different error covariance matrices. Because the BLUE, given any model, has a fixed form, depending only on the covariance matrix of the errors and not on other features of the error distribution (of course with means of errors equal to zero), it follows that if A and B (with the specified error covariance matrices above) have the same BLUEs under any one error distribution, they have the same BLUEs for all error distributions. Therefore, it can be assumed, without loss of generality, that the error distributions are normal. In that case, the proof of the theorem for $\rho > 0$ can be carried out, since a normal random vector with the given covariance matrix can be decomposed as required in the proof.
9 Note that another proof, in the same spirit, that combines negative and positive values of $\rho$ can be obtained by transforming the model into the deviation form (6), and then adding the realization of a single independent random variable with variance $\sigma^2(1-\rho)/n$ to each observation. The proof using this alternative approach appears to be more compact but involves somewhat more matrix manipulation than the previous one. Both proofs start with the deviation form (6). In order to determine the variance of the single independent variable value to be added to each observation using this alternative proof, it is necessary to derive the covariance matrix of $(\varepsilon - \bar{\varepsilon}\mathbf{1})$, which requires a fair amount of calculation because of the nonzero covariances in (9). In the original proof given, by starting with (6) and adding $(\bar{\varepsilon}\mathbf{1})$ to both sides, it is immediately clear that the equation is in the form (7) with the covariance matrix unchanged from the original form (9). The single further step, required in both proofs, involves adding (or subtracting) the realization of a random variable, independent of $\varepsilon$, to each observation. The only knowledge needed for this latter step is that the variance of that random variable is added to every term in the covariance matrix; this is easy to show. However, the alternative proof, although somewhat more complex than the first one given, is nonetheless simpler than proofs presently available in the literature.
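The calculation needed by the alternative proof can be organized around the centring matrix. The sketch below assumes that the error covariance matrix is $\sigma^2[(1-\rho)I + \rho J]$, with $J$ the $n \times n$ matrix of ones; the symbols $C$ for the centring matrix and $u$ for the added random variable are hypothetical names introduced here.

```latex
\begin{align*}
C &= I - \tfrac{1}{n}J, \qquad J^2 = nJ
   \;\Rightarrow\; CJ = J - \tfrac{1}{n}J^2 = 0, \qquad C^2 = C, \\
\operatorname{Cov}\left(\varepsilon - \bar{\varepsilon}\mathbf{1}\right)
  &= \operatorname{Cov}(C\varepsilon)
   = \sigma^2\,C\left[(1-\rho)I + \rho J\right]C
   = \sigma^2(1-\rho)\,C, \\
\operatorname{Cov}\left(C\varepsilon + u\mathbf{1}\right)
  &= \sigma^2(1-\rho)\left(I - \tfrac{1}{n}J\right) + \operatorname{Var}(u)\,J
   = \sigma^2(1-\rho)\,I
   \quad\text{when } \operatorname{Var}(u) = \tfrac{\sigma^2(1-\rho)}{n} .
\end{align*}
```

Under these assumptions the transformed model has uncorrelated errors with common variance $\sigma^2(1-\rho)$, which is consistent with the variance ratio stated in the theorem.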
10 McElroy (1967) proved that the error covariance matrix (9) is a necessary and sufficient condition for (2) to be the BLUE of $\beta$ in Model (1), for $0 \le \rho < 1$. McElroy's result is more general than the result in this theorem in that it includes the estimate of the intercept and includes necessity as well as sufficiency of (9). On the other hand, the result in this theorem is more general than McElroy's result in that the singular matrices resulting when $\rho = -1/(n-1)$ and when $\rho = 1$ are included in the range of $\rho$.
11 An interesting insight follows from considering the case $\rho = 1$. In this case, there is (with probability one) a single realization of a random variable added to the expected value for each observation. Then, in the form (6), each element of $Y^*$ is exactly equal to its expected value, so the covariance matrix of $\hat{\beta}^*$ is the null matrix, and $\beta^*$ can be calculated exactly.
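The degenerate case of perfectly correlated errors is easy to simulate. The following is a minimal sketch, assuming NumPy and assuming that every observation receives the same error realization; the names `u`, `intercept`, and `beta_star` and their values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 2
predictors = rng.normal(size=(n, p))
beta_star = np.array([1.5, -2.0])  # illustrative slope coefficients
intercept = 0.7

# One realization of a random variable, shared by every observation.
u = rng.normal()
Y = intercept + predictors @ beta_star + u

# Centring removes both the intercept and the shared error.
X_star = predictors - predictors.mean(axis=0)
Y_star = Y - Y.mean()

# The deviation-form solution recovers the slopes up to rounding error.
recovered = np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)
exact = np.allclose(recovered, beta_star)
print(exact)
```

The intercept, by contrast, is confounded with the shared error in this case and cannot be separated from it.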
12 The development in Section 2 provides a simple proof that estimators of the coefficient vector in a standard linear model, excluding the intercept term, can be obtained by expressing the sample values of the predictors and predicted variable in deviation form and using the standard equation for the estimator of $\beta$ in a model without an intercept. From a geometric point of view, note that the regression plane goes through the point $(\bar{x}_2, \ldots, \bar{x}_{p+1}, \bar{Y})$, and centering shifts the plane to go through the origin but doesn't change the slopes. (Of course the solution in this paper is for a model with an intercept, and should not be confused with the solution for a model that assumes an intercept of zero, for which the estimators and their properties are different.) The proof proceeds by noting that the addition of an unknown value $\bar{\varepsilon}$ would put the relevant expressions into standard form. The same approach, with the addition of an external random vector, leads to a simple proof that the BLUE of $\beta^*$ when the error covariance matrix is of the form $\sigma^2[(1-\rho)I + \rho J]$ is the same as when it is of the standard form $\sigma^2 I$, and yields a simple expression for the comparative variances of the estimators. It follows also from the proof that the results hold even if the covariance matrix of $\varepsilon$ is singular. Neither extensive matrix manipulation nor knowledge of generalized inverses is necessary. The results generalize under appropriate conditions to the correlation model (Arnold 1981, Shaffer 1991), in which the elements of $X$ are realizations of random variables.
We would like to thank Katherine Halvorsen for her valuable comments which helped to improve the presentation of this paper.
Arnold, S. F. (1981), The Theory of Linear Models and Multivariate Analysis, New York: Wiley-Interscience.
Daniel, C. and Wood, F. S. (1980), Fitting Equations to Data: Computer Analysis of Multifactor Data for Scientists and Engineers (2nd ed.), New York: Wiley-Interscience.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood From Incomplete Data Via the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 39, 1-22.
Draper, N. R., and Smith, H. (1998), Applied Regression Analysis (3rd ed.), New York: Wiley.
Graybill, F. A. (1976), Theory and Application of the Linear Model, North Scituate, MA: Duxbury Press.
McElroy, F. W. (1967), "A Necessary and Sufficient Condition That Ordinary Least Squares Estimators Be Best Linear Unbiased," Journal of the American Statistical Association, 62, 1302-1304.
Searle, S. R. (1971), Linear Models, New York: Wiley.
Seber, G. A. F. (1977), Linear Regression Analysis, New York: Wiley.
Sen, A., and Srivastava, M. (1990), Regression Analysis: Theory, Methods, and Applications, New York: Springer-Verlag.
Shaffer, J. P. (1981), "The Analysis of Variance Mixed Model With Allocated Observations: Application to Repeated Measurement Designs," Journal of the American Statistical Association, 76, 607-611.
----- (1991), "The Gauss-Markov Theorem and Random Regressors," The American Statistician, 45, 269-273.
Juliet Popper Shaffer
University of California
Department of Statistics
367 Evans Hall #3860
Berkeley, CA 94720-3860
Yung-Pin Chen
Department of Mathematics
Smith College
Northampton, MA 01063