Terence C. Mills

Loughborough University

Journal of Statistics Education Volume 13, Number 2 (2005), www.amstat.org/publications/jse/v13n2/datasets.mills.html

Copyright © 2005 by Terence C. Mills, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.

**Key Words:** Body Mass Index; Functional form; Prediction; Regression.

A scatterplot of all 252 pairs of observations is shown in Figure 1. A traditional starting point for students analysing the relationship between percent body fat and the BMI is to fit a linear regression. Two such regressions are shown superimposed on Figure 1: a fit to all 252 cases and a fit with case 39 omitted from the calculations. This case has a BMI of 48.9 associated with a body fat percentage of 33.8 and is seen to be an outlier, both pulling the fitted line towards it and giving an impression of a distinct curvilinear relationship between the two variables. It will therefore be omitted from further modelling, but students could be encouraged to discuss whether this is the best course of action for dealing with an aberrant observation and whether alternative solutions could be considered.

Figure 1. Scatterplot of body fat percentage against BMI with linear regressions fitted to all observations (bodyfat% = –20.4 + 1.55BMI) and with the outlier removed (bodyfat% = –24.9 + 1.73BMI) superimposed.
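To see why case 39 is flagged as an outlier, the two fitted lines can be evaluated at its BMI. The sketch below uses only the coefficients quoted in the Figure 1 caption and the case 39 values quoted in the text; the function and variable names are illustrative.

```python
# Coefficients (intercept, slope) as quoted in the Figure 1 caption.
FIT_ALL = (-20.4, 1.55)          # all 252 cases
FIT_NO_OUTLIER = (-24.9, 1.73)   # case 39 omitted

def predict(fit, bmi):
    """Fitted body fat percentage from a linear fit (intercept, slope)."""
    intercept, slope = fit
    return intercept + slope * bmi

# Case 39 as described in the text: BMI 48.9, body fat 33.8%.
case39_bmi, case39_fat = 48.9, 33.8

for label, fit in [("all cases", FIT_ALL), ("case 39 omitted", FIT_NO_OUTLIER)]:
    fitted = predict(fit, case39_bmi)
    print(f"{label}: fitted {fitted:.1f}%, residual {case39_fat - fitted:.1f}")
```

Under either line the observed body fat lies more than 20 percentage points below the fitted value, which is consistent with the case pulling the full-sample line towards itself.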

An inverse functional form, bodyfat% = *a*(BMI – *b*)/BMI, may be more appropriate on theoretical grounds. Here *b* represents the theoretical BMI associated with 0% body fat,
and *a* represents the percentage of excess body weight which is fat (see
Gray and Fujioka, 1991, pages 548-9, for more detailed interpretation of
this relationship). With *a* and *b* both positive, the first and second derivatives of this function with
respect to BMI are positive and negative respectively. Thus the model implies that percent body fat increases, but at a
decreasing rate, with increasing BMI and is bounded by the value *a*. An alternative model which would also capture
this type of nonlinearity is the semi-logarithmic relationship bodyfat% = *c* + *d*ln(BMI) with *d* positive.
Students can fit these two nonlinear functions to the data and assess their fits relative to that of the linear model.
Figure 2 presents the fits of the three functions graphically. Students
can note that the estimates of the inverse function parameters are
*a* = 64.3 ± 2.6% and
*b* = 17.6 ± 0.3, where we use the notation
$\hat{\theta} \pm \mathrm{se}(\hat{\theta})$, where $\hat{\theta}$ and $\mathrm{se}(\hat{\theta})$
are the ordinary least squares (OLS) estimate and associated
standard error of a parameter $\theta$, i.e., that a BMI of 17.6 is associated
with 0% body fat and that 64.3% of excess body weight is fat. It should be noted that, while estimates of the parameters
of this model can be obtained by regressing percent body fat on the inverse of the BMI and algebraically calculating the
estimates from the regression coefficients, parameter standard errors can only be obtained using a dedicated nonlinear
regression routine. The *R*^{2} statistics from the three models are: linear 0.560, inverse 0.557, and
semi-log 0.563, with residual error variances 5.123^{2}, 5.141^{2} and 5.105^{2}, respectively.
Thus, on goodness of fit grounds, the semi-logarithmic model is to be preferred. The fits, however, are very similar
over the central range of BMI values, only differing substantially for the very highest BMI values, so that a more formal
method of model selection could be considered.
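The three fits can be sketched with ordinary least squares, using the fact that the inverse model can be written as bodyfat% = *a* – *ab*/BMI, so that *a* is the intercept and *b* is minus the slope divided by the intercept in a regression on 1/BMI. The data below are synthetic stand-ins generated for illustration only (they are not the Johnson data), and all function names are assumptions of this sketch.

```python
import numpy as np

def ols(x, y):
    """Simple OLS of y on x: returns (intercept, slope, R^2)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef[0], coef[1], r2

def fit_inverse(bmi, fat):
    """Fit fat = a(BMI - b)/BMI = a - a*b/BMI by regressing fat on 1/BMI."""
    c0, c1, r2 = ols(1.0 / bmi, fat)
    a = c0            # intercept estimates a
    b = -c1 / c0      # slope estimates -a*b
    return a, b, r2

# Synthetic stand-in data (NOT the real body fat data set).
rng = np.random.default_rng(0)
bmi = rng.uniform(18.0, 40.0, size=250)
fat = 64.3 * (bmi - 17.6) / bmi + rng.normal(0.0, 5.0, size=250)

lin = ols(bmi, fat)                 # linear model
semilog = ols(np.log(bmi), fat)     # semi-logarithmic model
a_hat, b_hat, inv_r2 = fit_inverse(bmi, fat)

print(f"linear R^2 {lin[2]:.3f}, semi-log R^2 {semilog[2]:.3f}, inverse R^2 {inv_r2:.3f}")
print(f"inverse model: a = {a_hat:.1f}, b = {b_hat:.1f}")
```

As the text notes, this route gives point estimates of *a* and *b* only; standard errors for them require a nonlinear regression routine.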

Figure 2. Fitted linear, semi-logarithmic (bodyfat% = –126.2 + 45.0ln(BMI)), and inverse (bodyfat% = 64.3(BMI – 17.6)/BMI) functions.

One approach is to note that all three functional forms may be “nested” within a general functional form by using the
Box and Cox (1964) family of power transformations, which are defined
for the generic variable *Z* as

$$Z^{(\lambda)} = \begin{cases} (Z^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \ln Z, & \lambda = 0 \end{cases}$$

If we denote the *i*^{th} observations, *i* = 1, 2, ..., *n*, on percent body fat and the BMI as
*y_{i}* and *x_{i}*, respectively, then the general model is

$$y_i^{(\lambda_1)} = \beta_0 + \beta_1 x_i^{(\lambda_2)} + u_i \qquad (1)$$

where *u_{i}* is an error term, assumed to be normally and independently distributed with zero mean
and constant variance $\sigma^2$. The linear model is obtained when
$\lambda_1 = \lambda_2 = 1$, setting
$\lambda_1 = 1$ and $\lambda_2 = 0$ defines
the semi-logarithmic model, while setting $\lambda_1 = 1$ and
$\lambda_2 = -1$ defines the inverse model (in each case with a suitably redefined
intercept). However, arbitrary values of the power transformation
parameters need not be imposed upon (1): rather, they may be estimated along
with the other parameters of the model, and tests of the hypotheses implied by the alternative functional forms may then
be performed in order to discriminate between them.
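A minimal sketch of the Box-Cox power transformation follows, showing how the transforms that generate the linear, semi-logarithmic and inverse models arise as special cases. The function name `box_cox` and the tolerance used to detect a zero exponent are assumptions of this sketch.

```python
import numpy as np

def box_cox(z, lam, tol=1e-8):
    """Box-Cox power transform of strictly positive data.

    Returns ln(z) when lam is (numerically) zero, else (z**lam - 1)/lam.
    """
    z = np.asarray(z, dtype=float)
    if np.any(z <= 0):
        raise ValueError("Box-Cox transform requires strictly positive data")
    if abs(lam) < tol:
        return np.log(z)
    return (z ** lam - 1.0) / lam

bmi = np.array([18.5, 25.0, 30.0])
print(box_cox(bmi, 1.0))    # linear case: BMI - 1
print(box_cox(bmi, 0.0))    # semi-logarithmic case: ln(BMI)
print(box_cox(bmi, -1.0))   # inverse case: 1 - 1/BMI
```

Because each special case differs from the raw regressor only by a shift and a scale factor, the fitted values are unchanged and only the intercept and slope are relabelled.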

The procedure for doing this is to recognise that, for any given values of $\lambda_1$ and $\lambda_2$, estimates of $\beta_0$ and $\beta_1$ conditional upon these values may be obtained from the regression of equation (1). Maximum likelihood (ML) estimates of the power transformation parameters, denoted $\hat{\lambda}_1$ and $\hat{\lambda}_2$, and hence of the $\beta$’s, are found by maximizing the concentrated log-likelihood, defined as

$$L(\lambda_1, \lambda_2) = -\frac{n}{2}\ln \hat{\sigma}^2(\lambda_1, \lambda_2) + (\lambda_1 - 1)\sum_{i=1}^{n} \ln y_i \qquad (2)$$

where

$$\hat{\sigma}^2(\lambda_1, \lambda_2) = n^{-1}\sum_{i=1}^{n} \hat{u}_i^2$$

is an estimate of the “conditional” error variance, the $\hat{u}_i$ being the residuals from the conditional regression (1). The term “concentrated” is used because the maximization is, in fact, a step-wise procedure in that $\hat{\sigma}^2(\lambda_1, \lambda_2)$ is first obtained from a linear regression with fixed values of $\lambda_1$ and $\lambda_2$, with the second step being to maximize $L(\lambda_1, \lambda_2)$ over all values of $(\lambda_1, \lambda_2)$. For an extended discussion of concentrated likelihood methods, see Seber and Wild (2003, pages 37-42), and for a review of the Box-Cox transformation, see Sakia (1992).

This maximization may conveniently be computed by searching over a grid of $\lambda_1$, $\lambda_2$ values: advanced students can be encouraged to develop routines for carrying out this two-dimensional grid search, which involves calculating the power transformations, saving regression output to compute the log-likelihood, and writing looping procedures. Students may also experiment with plotting the contours of the likelihood function so constructed, which will provide a graphical perspective on the likely precision with which the transformation parameters, $\lambda_1$ and $\lambda_2$, are estimated. This precision may also be examined by calculating the confidence region obtained by using the result that
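The two-dimensional grid search might be organised as in the sketch below. It assumes the concentrated log-likelihood takes the usual Box-Cox form, $-(n/2)\ln\hat{\sigma}^2 + (\lambda_1 - 1)\sum \ln y_i$, with additive constants dropped; the data are synthetic stand-ins rather than the body fat data, and every name is illustrative.

```python
import numpy as np

def box_cox(z, lam, tol=1e-8):
    """Box-Cox transform; log when lam is numerically zero."""
    return np.log(z) if abs(lam) < tol else (z ** lam - 1.0) / lam

def concentrated_loglik(y, x, lam1, lam2):
    """-(n/2) ln(sigma2_hat) + (lam1 - 1) sum(ln y), constants omitted."""
    n = len(y)
    yt, xt = box_cox(y, lam1), box_cox(x, lam2)
    X = np.column_stack([np.ones(n), xt])
    coef, *_ = np.linalg.lstsq(X, yt, rcond=None)   # conditional OLS step
    resid = yt - X @ coef
    sigma2 = resid @ resid / n                      # ML variance estimate
    return -0.5 * n * np.log(sigma2) + (lam1 - 1.0) * np.sum(np.log(y))

def grid_search(y, x, grid1, grid2):
    """Evaluate the log-likelihood on a grid and return the maximizer."""
    surface = np.array([[concentrated_loglik(y, x, l1, l2) for l2 in grid2]
                        for l1 in grid1])
    i, j = np.unravel_index(np.argmax(surface), surface.shape)
    return grid1[i], grid2[j], surface

# Synthetic stand-in data; both variables must be strictly positive.
rng = np.random.default_rng(1)
x = rng.uniform(20.0, 40.0, size=200)
y = -126.2 + 45.0 * np.log(x) + rng.normal(0.0, 2.0, size=200)
keep = y > 0
x, y = x[keep], y[keep]

grid = np.round(np.linspace(-1.0, 2.0, 31), 2)   # step 0.1, exact 0 and 1
lam1_hat, lam2_hat, surface = grid_search(y, x, grid, grid)
print(f"grid maximum at lambda1 = {lam1_hat}, lambda2 = {lam2_hat}")
```

The returned `surface` array can be passed to a contour-plotting routine to visualise how flat the likelihood is around its maximum.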

$$2\left[L(\hat{\lambda}_1, \hat{\lambda}_2) - L(\lambda_1, \lambda_2)\right] \qquad (3)$$

is approximately distributed as chi-squared with two degrees of freedom (Box
and Cox, 1964). Thus, for example, 95% and 75% confidence regions are defined by
$2[L(\hat{\lambda}_1, \hat{\lambda}_2) - L(\lambda_1, \lambda_2)] < 5.99$ and $2[L(\hat{\lambda}_1, \hat{\lambda}_2) - L(\lambda_1, \lambda_2)] < 2.77$ respectively.
The ML estimates are obtained as $\hat{\lambda}_1$ = 0.92 and
$\hat{\lambda}_2$ = 0.01, with *L*(0.92, 0.01) = –407.83.
Since *L*(1, 0) = –408.17, it is clear that the semi-logarithmic model is contained within any conventional confidence
region. The other functional forms are quite “close” in terms of fit, however. The linear model
($\lambda_1 = \lambda_2 = 1$ in equation
(1)) has *L*(1, 1) = –409.04, and so is contained within the 75%
confidence region, while the inverse model ($\lambda_1$ = 1,
$\lambda_2$ = –1) has *L*(1, –1) = –409.95 and is thus contained within a 95%
confidence region. The double-logarithmic model ($\lambda_1 = \lambda_2 = 0$
in equation (1) and a
functional form that is often used in regression analysis), however, has *L*(0, 0) = –501.59 and so is excluded from
all conventional confidence regions.
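The region membership of each functional form can be checked directly from the reported log-likelihood values. The sketch below uses the closed-form quantile of the chi-squared distribution with two degrees of freedom, $-2\ln(1 - p)$, and the values of *L* quoted above for the linear, inverse and double-logarithmic models; the helper names are illustrative.

```python
import math

def chi2_2df_quantile(p):
    """Quantile of chi-squared(2): its CDF is 1 - exp(-q/2)."""
    return -2.0 * math.log(1.0 - p)

L_MAX = -407.83   # L(0.92, 0.01) as reported in the text
REPORTED = {"linear": -409.04, "inverse": -409.95, "double-log": -501.59}

def lr_statistic(l_value, l_max=L_MAX):
    """Likelihood-ratio statistic 2[L_max - L] for a restricted model."""
    return 2.0 * (l_max - l_value)

def smallest_region(l_value, levels=(0.75, 0.95)):
    """Smallest listed confidence level whose region contains the model."""
    stat = lr_statistic(l_value)
    for level in levels:
        if stat < chi2_2df_quantile(level):
            return level
    return None   # outside every listed region

for name, value in REPORTED.items():
    print(name, round(lr_statistic(value), 2), smallest_region(value))
```

The critical values 2.77 and 5.99 used in the text are the 75% and 95% quantiles returned by `chi2_2df_quantile`.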

The fitted regressions for the linear, semi-logarithmic and inverse functional forms are

(4) |

(5) |

and

(6) |

where parameter standard errors are shown in parentheses. Diagnostic checks on the residuals of each of these regressions found no evidence of heteroskedasticity or non-normality in any of the regressions, as was also true for the residuals from the regression using the ML estimates.

At this point students could be asked to consider alternative non-linear specifications. A plausible competitor would be a
polynomial in *x _{i}*, so that students may fit the quadratic regression:

(7) |

Figure 3 shows the implied semi-logarithmic, quadratic and linear functions
for 10 ≤ BMI ≤ 50, bearing in
mind that the range of BMI values in the sample used for estimation is 18.1 to 39.1. The three functions are almost
identical over the central region of the BMI range (20 to 30), but the linear model is a poor approximation to the
semi-logarithmic outside of this interval. The quadratic, on the other hand, provides a good approximation to the
semi-logarithmic over the entire observed range of BMI values, even though, as students can check, the quadratic term in
(7) is insignificant (its t-ratio is just –1.36). Since the semi-logarithmic
function also produces a superior fit to the quadratic (the error variance of the latter is 5.114^{2}) and contains
one fewer parameter, we prefer the former as the better representation of the relationship between percent body fat and BMI.

Figure 3. Semi-logarithmic (bodyfat% = –126.2 + 45.0ln(BMI)), quadratic (bodyfat% = –42.5 + 3.08BMI – 0.025BMI^{2}), and linear (bodyfat% = –24.9 + 1.73BMI) functions.
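The comparison in Figure 3 can be reproduced numerically by evaluating the three fitted functions, with coefficients taken from the caption, at a few BMI values inside and outside the sample range. The function names are illustrative.

```python
import math

# Fitted functions, coefficients as quoted in the Figure 3 caption.
def semilog(bmi):
    return -126.2 + 45.0 * math.log(bmi)

def quadratic(bmi):
    return -42.5 + 3.08 * bmi - 0.025 * bmi ** 2

def linear(bmi):
    return -24.9 + 1.73 * bmi

print(" BMI  semi-log  quadratic  linear")
for bmi in (10, 20, 25, 30, 40, 50):
    print(f"{bmi:4d}  {semilog(bmi):8.1f}  {quadratic(bmi):9.1f}  {linear(bmi):6.1f}")
```

The table makes the point of the figure concrete: the three functions agree closely between BMI values of 20 and 30, while at a BMI of 50 the linear prediction exceeds the semi-logarithmic one by more than ten percentage points.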

(8) |

where

The one-standard error bounds for *y_{f}* are then calculated as
$\hat{y}_f \pm \mathrm{se}(\hat{y}_f)$. Students may be encouraged to provide interpretations of these
bounds. For example, the current U.S. Dietary guidelines define the range 18.5 < BMI < 25 to be “healthy,”
25 < BMI < 30 to be “overweight,” and higher values of BMI to be “obese” (see
Kuczmarski and Flegal, 2000, table 2). Using the linear model, these
cut-offs predict body fat percentages of

Figure 4. Scatterplot of body fat and BMI with predicted values and one standard error bounds from the fitted linear equation.
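Point predictions at the guideline cut-offs follow directly from the fitted linear equation quoted in the Figure 1 caption. The sketch below also includes the textbook standard error of a prediction in simple regression; that this matches the formula used for the bounds in Figure 4 is an assumption, and its inputs (`s`, `n`, `x_bar`, `sxx`) are hypothetical arguments that would have to be computed from the data.

```python
import math

def predict_linear(bmi, intercept=-24.9, slope=1.73):
    """Predicted body fat % from the linear fit with case 39 omitted."""
    return intercept + slope * bmi

def prediction_se(x_f, s, n, x_bar, sxx):
    """Textbook prediction standard error for simple regression.

    s: residual standard error, n: sample size, x_bar: regressor mean,
    sxx: sum of squared deviations of the regressor about its mean.
    """
    return s * math.sqrt(1.0 + 1.0 / n + (x_f - x_bar) ** 2 / sxx)

# BMI cut-offs from the guidelines cited in the text.
CUTOFFS = (18.5, 25.0, 30.0)
for bmi in CUTOFFS:
    print(f"BMI {bmi:4.1f}: predicted body fat {predict_linear(bmi):.1f}%")
```

One-standard-error bounds at any cut-off are then the point prediction plus and minus the value returned by `prediction_se`.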

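A related question to ask students is: if we are given a value *y_{f}*, what is the value of *x_{f}* that is associated with it? The natural estimate is obtained by inverting the fitted regression:

$$\hat{x}_f = \frac{y_f - \hat{\beta}_0}{\hat{\beta}_1}$$

Although this is the ML estimate of *x_{f}*, it is biased because, in general,

$$E\left(\frac{y_f - \hat{\beta}_0}{\hat{\beta}_1}\right) \neq \frac{E(y_f - \hat{\beta}_0)}{E(\hat{\beta}_1)}$$

(see, for example, Seber and Lee, 2003, section 6.1.5). An
alternative estimate is that obtained from estimating the reverse, or inverse, regression of *x* on *y*:

$$\tilde{x}_f = \hat{\gamma}_0 + \hat{\gamma}_1 y_f$$

where $\hat{\gamma}_0$ and $\hat{\gamma}_1$ are
the OLS estimates of the inverse regression. Predicting from *x_{f}* is often referred to as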

Error bounds (prediction intervals) for *x _{f}* using the estimate
are straightforwardly calculated by using the formulae in
(8) with

the last equality being obtained using . For large *n*, and
defining , the ratio

will be distributed as standard normal. Thus

where $\chi^2_{1,\alpha}$ is the $\alpha$ percentage point of the chi-square distribution with one degree of freedom. The set of all values of $d = x_f - \bar{x}$ satisfying the inequality

will then provide a (1 – $\alpha$) confidence interval for the unknown
*x_{f}*, with lower and upper bounds defined as
$\bar{x} + d_1$ and
$\bar{x} + d_2$, where $d_1$ and
$d_2$ are the solutions (roots) of the quadratic equation

i.e.,

(9) |

It is possible for these roots to be complex if $\hat{\beta}_1$ is not significantly
different from zero, in which case the regression line is close to being horizontal and any value of the regressor is
acceptable. As can be seen from the fitted regression (4), however,
$\hat{\beta}_1$ is highly significant as the 95% confidence interval for
$\beta_1$ is 1.73 ± 0.20. The resultant interval is
often called a *discrimination interval* rather than a prediction interval
(Seber and Wild, 2003, page 146).

Johnson (1996) reports a suggestion that 15% body fat is a maximum for
good health for men, so it is interesting to calculate BMI prediction and discrimination intervals for this value of
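*y_{f}*. The estimated coefficients of the inverse regression are
$\hat{\gamma}_0$ = 19.22 and
$\hat{\gamma}_1$ = 0.32, so that the inverse-regression estimate is $\tilde{x}_f$ = 24.1,
while the value of $\hat{x}_f$, obtained by inverting the fitted linear regression, is (15 + 24.9)/1.73 = 23.1.

For the calculation of the intervals we require the prediction standard error, which is 2.225. A 95% prediction interval using $\tilde{x}_f$ = 24.1 is thus (19.7, 28.5). Using $\hat{x}_f$ = 23.1, for a 95% discrimination interval, with $\chi^2_{1,0.05}$ = 3.84, the quadratic (9) simplifies to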

The two roots of this equation are $d_1$ = –8.1 and
$d_2$ = 3.6. The 95% discrimination limits for
*x_{f}* are thus 17.2 and 28.9, which show that the interval is asymmetric about
$\hat{x}_f$. Thus, if $\hat{x}_f$ is used,
then at the 95% level of confidence, a 15% body fat is consistent with a BMI ranging from 17.2 to 28.9, i.e., from a BMI
below the current lower healthy BMI cut-off to a value close to the upper end of the overweight range. If
$\tilde{x}_f$ is used, this range is a little narrower, running from 19.7 to 28.5.
A similar calculation with the semi-logarithmic function obtains 95% limits of 20.2 and 28.3 (using
$\tilde{x}_f$) and 18.3 and 28.9 (using
$\hat{x}_f$). Students may be encouraged to discuss the implications of the width
of these intervals for the efficiency of the BMI as an indicator of percent body fat.
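The two point estimates of BMI for a body fat of 15%, and the prediction interval built around the inverse-regression estimate, can be checked with a few lines of arithmetic. The sketch uses the rounded coefficients quoted in the text, so the inverse-regression estimate comes out slightly below the reported 24.1, which was presumably computed from unrounded coefficients; the function names are illustrative.

```python
def classical_estimate(y_f, intercept=-24.9, slope=1.73):
    """Invert the fitted regression of body fat on BMI."""
    return (y_f - intercept) / slope

def inverse_regression_estimate(y_f, intercept=19.22, slope=0.32):
    """Predict BMI from the fitted regression of BMI on body fat."""
    return intercept + slope * y_f

def normal_interval(center, se, z=1.96):
    """Symmetric interval center +/- z * se."""
    return center - z * se, center + z * se

y_f = 15.0
print(f"classical estimate:          {classical_estimate(y_f):.2f}")
print(f"inverse-regression estimate: {inverse_regression_estimate(y_f):.2f}")

# Interval around the reported estimate 24.1 with standard error 2.225.
low, high = normal_interval(24.1, 2.225)
print(f"95% prediction interval: ({low:.1f}, {high:.1f})")
```

The discrimination interval is not reproduced here because it requires the roots of quadratic (9), which depend on sample quantities not restated in the text.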

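where *w_{i}* and *h_{i}* denote the weight and height of the *i*^{th} individual. If the regression

$$y_i = \beta_0 + \beta_1 \ln w_i + \beta_2 \ln h_i + u_i \qquad (10)$$

is considered, equation (5) is obtained if the restriction $2\beta_1 + \beta_2$ = 0 is imposed. Students may then be asked to consider how they might test such a linear restriction. One approach would be to construct a confidence interval for $2\beta_1 + \beta_2$: a 100(1 – $\alpha$)% interval is given by

$$2\hat{\beta}_1 + \hat{\beta}_2 \pm z_{\alpha/2}\left[V(2\hat{\beta}_1 + \hat{\beta}_2)\right]^{1/2}$$

where $z_{\alpha/2}$ is the $\alpha$/2 percentage point of the standard normal distribution and

$$V(2\hat{\beta}_1 + \hat{\beta}_2) = 4V(\hat{\beta}_1) + V(\hat{\beta}_2) + 4C(\hat{\beta}_1, \hat{\beta}_2)$$

(see, for example, Maddala, 1977, chapter 10.3). The Johnson (1996) data set contains data on weight and height, although measured in pounds and inches. After rescaling to kilograms and metres (multiplying by the factors 0.4536 and 0.0254 respectively), the following multiple regression was obtained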

The variances and covariance of the estimated coefficients are estimated as
$V(\hat{\beta}_1)$ = 6.24, $V(\hat{\beta}_2)$ =
103.63 and $C(\hat{\beta}_1, \hat{\beta}_2)$ = –13.23, respectively, so that
$V(2\hat{\beta}_1 + \hat{\beta}_2)$ = 75.67. Thus an approximate 95% confidence interval (using
*z_{0.025}* = 1.96) for $2\beta_1 + \beta_2$ is calculated to be
–11.7 ± 17.0, which includes 0 and thus provides evidence in favour of using
the BMI as a “composite” weight-height index. Note, however, that an approximate 68% interval (using
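The interval for the restriction can be verified from the quoted estimates. The sketch below computes a confidence interval for a general linear combination of two coefficients; it takes 44.8 and –101.3 as the coefficients on log weight and log height, an assignment inferred from the reported point estimate of –11.7, and the function name is illustrative.

```python
import math

def linear_combination_ci(b1, b2, var_b1, var_b2, cov_b12,
                          weights=(2.0, 1.0), z=1.96):
    """Normal-approximation interval for c1*b1 + c2*b2."""
    c1, c2 = weights
    estimate = c1 * b1 + c2 * b2
    variance = c1 ** 2 * var_b1 + c2 ** 2 * var_b2 + 2.0 * c1 * c2 * cov_b12
    half_width = z * math.sqrt(variance)
    return estimate, variance, (estimate - half_width, estimate + half_width)

# Estimates and (co)variances as quoted in the text.
est, var, (low, high) = linear_combination_ci(
    b1=44.8, b2=-101.3, var_b1=6.24, var_b2=103.63, cov_b12=-13.23)
print(f"2*b1 + b2 = {est:.1f}, variance = {var:.2f}")
print(f"approximate 95% interval: ({low:.1f}, {high:.1f})")
```

Passing `z=1.0` instead gives the one-standard-error interval, which is narrow enough to exclude zero.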

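Earlier research on “weight for height” indices by Benn (1971) considered
power indices of the form weight/height^{p} (see also
Flegal, 1990). The testing approach outlined above may be utilised to
construct a range of values for the exponent *p* that are consistent with the data. The power index implies the
linear restriction $p\beta_1 + \beta_2 = 0$ between the slope coefficients of equation
(10), which can then be written as

$$y_i = \beta_0 + \beta_1\left(\ln w_i - p \ln h_i\right) + u_i$$

The exponent can be estimated as $\hat{p} = -\hat{\beta}_2/\hat{\beta}_1$. The variance of $\hat{p}$ is given, to a first-order approximation, by

$$V(\hat{p}) = \left(\frac{\partial p}{\partial \beta_1}\right)^2 V(\hat{\beta}_1) + \left(\frac{\partial p}{\partial \beta_2}\right)^2 V(\hat{\beta}_2) + 2\,\frac{\partial p}{\partial \beta_1}\frac{\partial p}{\partial \beta_2}\, C(\hat{\beta}_1, \hat{\beta}_2)$$

Since $\partial p/\partial \beta_1 = \beta_2/\beta_1^2 = -p/\beta_1$ and $\partial p/\partial \beta_2 = -1/\beta_1$,

$$V(\hat{p}) = \frac{1}{\hat{\beta}_1^2}\left[\hat{p}^2 V(\hat{\beta}_1) + V(\hat{\beta}_2) + 2\hat{p}\, C(\hat{\beta}_1, \hat{\beta}_2)\right]$$

Using the estimates $\hat{\beta}_1$ = 44.8 and
$\hat{\beta}_2$ = –101.3, we can calculate
$\hat{p}$ = 2.26 and $V(\hat{p})$ = 0.038.
Thus an approximate 95% confidence interval for *p* is 2.26 ±
1.96(0.038)^{1/2} = 2.26 ± 0.38. This interval contains *p* = 2,
thus again confirming the choice of the BMI as an appropriate “Benn power-index”, but excludes *p* = 1.5, a value of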
the exponent suggested by early studies of the weight-height relationship but not currently recommended (see
Kuczmarski and Flegal, 2000).
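The exponent and its approximate variance can be recomputed from the quoted coefficient estimates. The sketch below assumes the first-order (delta method) approximation for the variance of the ratio $-\hat{\beta}_2/\hat{\beta}_1$, which reproduces the reported values to the precision given; the function name is illustrative.

```python
import math

def power_exponent(b1, b2, var_b1, var_b2, cov_b12):
    """Estimate p = -b2/b1 and its delta-method variance.

    With dp/db1 = -p/b1 and dp/db2 = -1/b1, the approximate variance is
    [p^2 V(b1) + V(b2) + 2 p C(b1, b2)] / b1^2.
    """
    p = -b2 / b1
    var_p = (p ** 2 * var_b1 + var_b2 + 2.0 * p * cov_b12) / b1 ** 2
    return p, var_p

# Estimates and (co)variances as quoted in the text.
p_hat, var_p = power_exponent(44.8, -101.3, 6.24, 103.63, -13.23)
half_width = 1.96 * math.sqrt(var_p)
print(f"p = {p_hat:.2f}, variance = {var_p:.3f}")
print(f"approximate 95% interval: ({p_hat - half_width:.2f}, {p_hat + half_width:.2f})")
```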

Benn, R.J. (1971), “Some mathematical properties of weight for height indices as measures of adiposity,”
*British Journal of Preventive and Social Medicine*, 25, 42-50.

Box, G.E.P. and Cox, D.R. (1964), “An analysis of transformations,” *Journal of the Royal Statistical Society, Series B*,
26, 211-243.

Brown, P.J. (1993), *Measurement, Regression, and Calibration*, Oxford, U.K.: Oxford University Press.

Flegal, K.M. (1990), “Ratio of actual to predicted weight as an alternative to a power-type weight-height index
(Benn index),” *American Journal of Clinical Nutrition*, 51, 540-547.

Gray, D.S. and Fujioka, K. (1991), “Use of relative weight and body mass index for the determination of adiposity,”
*Journal of Clinical Epidemiology*, 44, 545-550.

Johnson, R.W. (1996), “Fitting percentage of body fat to simple body measurements,”
*Journal of Statistics Education* [Online], 4(1).
www.amstat.org/publications/jse/v4n1/datasets.johnson.html

Kuczmarski, R.J. and Flegal, K.M. (2000), “Criteria for definition of overweight in transition: background and
recommendations for the United States,” *American Journal of Clinical Nutrition*, 72, 1074-1081.

Maddala, G.S. (1977), *Econometrics*, McGraw-Hill.

Osborne, C. (1991), “Statistical calibration: a review,” *International Statistical Review*, 59, 309-336.

Sakia, R.M. (1992), “The Box-Cox transformation technique,” *The Statistician*, 41, 169-178.

Seber, G.A.F. and Lee, A.J. (2003), *Linear Regression Analysis*, 2nd Edition, New York: John Wiley & Sons.

Seber, G.A.F. and Wild, C.J. (2003), *Nonlinear Regression*, New York: John Wiley & Sons.

Webster, J.D., Hesp, R. and Garrow, J.S. (1984), “The composition of excess weight in obese women estimated by body
density, total body water and total body potassium,” *Human Nutrition: Clinical Nutrition*, 38C, 299-306.

Terence C. Mills

Department of Economics

Loughborough University

Leicestershire

United Kingdom
*T.C.Mills@lboro.ac.uk*
