Non-constant variance in the residuals of your DoE might not be a problem but an opportunity. While the standard approach for analysing DoE data is OLS regression, which assumes constant residual variance, other approaches are more flexible. In this blog post we will discuss loglinear variance models. These allow us to model dependencies between the residual variance and the factors. This way we not only get valid models in the presence of non-constant residual variance, but even better: we can learn about the underlying mechanisms driving the variance of the process.

Introduction and Example

Let’s have a look at the following data set:

Factors: [A] Rods, [B] Drying, [C] Material, [D] Thickness, [E] Angle, [F] Opening, [G] Current, [H] Method, [I] Preheating. Response: [Y] Strength.

 A   B   C   D   E   F   G   H   I  Strength
-1  -1  -1  -1  -1  -1  -1  -1  -1      43.7
-1  -1   1   1   1   1  -1  -1   1      40.2
-1   1   1  -1  -1  -1  -1   1  -1      42.4
-1   1  -1   1   1   1  -1   1   1      44.7
-1   1   1  -1  -1   1   1  -1   1      42.4
-1   1  -1   1   1  -1   1  -1  -1      45.9
-1  -1  -1  -1  -1   1   1   1   1      42.2
-1  -1   1   1   1  -1   1   1  -1      40.6
 1   1   1  -1   1  -1  -1  -1   1      42.4
 1   1  -1   1  -1   1  -1  -1  -1      45.5
 1  -1  -1  -1   1  -1  -1   1   1      43.6
 1  -1   1   1  -1   1  -1   1  -1      40.6
 1  -1  -1  -1   1   1   1  -1  -1      44.0
 1  -1   1   1  -1  -1   1  -1   1      40.2
 1   1   1  -1   1   1   1   1  -1      42.5
 1   1  -1   1  -1  -1   1   1   1      46.5


The data comes from an experiment performed by the National Railway Corporation of Japan [see Smyth 2002; original source: Taguchi & Wu 1980]. It is a fractional factorial design: we are screening for the important factors impacting the tensile strength of welds.
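For readers who want to follow along outside of JMP, here is the data set as a pandas DataFrame (a sketch; the single-letter column names are my own shorthand for the bracket labels above):

```python
import pandas as pd

# Welding experiment from the table above: nine two-level factors
# [A]..[I] plus the response [Y] Strength.
rows = [
    (-1, -1, -1, -1, -1, -1, -1, -1, -1, 43.7),
    (-1, -1,  1,  1,  1,  1, -1, -1,  1, 40.2),
    (-1,  1,  1, -1, -1, -1, -1,  1, -1, 42.4),
    (-1,  1, -1,  1,  1,  1, -1,  1,  1, 44.7),
    (-1,  1,  1, -1, -1,  1,  1, -1,  1, 42.4),
    (-1,  1, -1,  1,  1, -1,  1, -1, -1, 45.9),
    (-1, -1, -1, -1, -1,  1,  1,  1,  1, 42.2),
    (-1, -1,  1,  1,  1, -1,  1,  1, -1, 40.6),
    ( 1,  1,  1, -1,  1, -1, -1, -1,  1, 42.4),
    ( 1,  1, -1,  1, -1,  1, -1, -1, -1, 45.5),
    ( 1, -1, -1, -1,  1, -1, -1,  1,  1, 43.6),
    ( 1, -1,  1,  1, -1,  1, -1,  1, -1, 40.6),
    ( 1, -1, -1, -1,  1,  1,  1, -1, -1, 44.0),
    ( 1, -1,  1,  1, -1, -1,  1, -1,  1, 40.2),
    ( 1,  1,  1, -1,  1,  1,  1,  1, -1, 42.5),
    ( 1,  1, -1,  1, -1, -1,  1,  1,  1, 46.5),
]
df = pd.DataFrame(rows, columns=list("ABCDEFGHI") + ["Strength"])
```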

Analysing the data using JMP's screening platform gives a strong indication that [C] Material and [B] Drying are important factors. [F] Opening and [A] Rods may play a role as well.

Running a regression model with those factors confirms the relationship of [C] Material and [B] Drying with tensile strength. With p-values above 0.1 I would not consider [F] Opening and [A] Rods relevant (let alone significant), but model building is of course a highly subject-matter-driven exercise, and I might be wrong there.
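A minimal sketch of this regression with statsmodels, reusing df from the snippet above (the formula is my guess at the four factors mentioned; adapt it to your own screening result):

```python
import statsmodels.formula.api as smf

# Main-effects model with the four candidate factors from the screening;
# df is the DataFrame built in the first snippet.
ols_fit = smf.ols("Strength ~ A + B + C + F", data=df).fit()
print(ols_fit.summary())  # coefficients, p-values, R^2
```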

Are these results reliable? Our OLS-regression model assumes:

  1. Independent residuals
  2. Normally distributed residuals
  3. Constant variance for the residuals

While normality of the residuals seems plausible for this example (see the normal quantile plot below), the assumption of constant variance is problematic: especially for [B], [C], [G] and [I], the residual variance appears to differ between the two settings of the factor.

First of all this violates assumption (3), but more importantly, this behaviour tells us that the process is more reproducible at one factor setting than at the other. Apparently the results of running the process with [I] at its low level are closer together than the results with [I] at its high level. This can be very relevant information for the optimization of a process.
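A quick, informal way to see this pattern without plots is to compare the residual spread between the two levels of each factor (a sketch, continuing with df and ols_fit from the snippets above):

```python
# Standard deviation of the OLS residuals at the low and high level of
# each factor; a large ratio between the two hints at non-constant variance.
resid = ols_fit.resid
for factor in "ABCDEFGHI":
    spread = resid.groupby(df[factor]).std()
    print(factor, "sd(low) = %.2f  sd(high) = %.2f" % (spread.loc[-1], spread.loc[1]))
```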

Possible Solutions

There are a couple of standard tools that are often recommended when we see non-constant variance in a regression model.

  • The most popular solution might be to transform the response variable. Tools like the Box-Cox power transformation can help to find a good transformation for the given problem. In general, log and square-root transformations often help to stabilize the residual variance.
  • Depending on the software you are using, you might try different models. Several variants of GLMs (Generalized Linear Models) allow non-constant variance as part of their model specification.
  • Another alternative is to use White standard errors (also known as heteroscedasticity-consistent (HC) standard errors); a small example follows after this list.
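To illustrate the last option: statsmodels can refit the mean model from above with heteroscedasticity-consistent standard errors (a sketch; note that with only 16 runs such corrections are rather shaky):

```python
import statsmodels.formula.api as smf

# Same mean model as before, but with HC3 standard errors
# instead of the classical OLS ones.
robust_fit = smf.ols("Strength ~ A + B + C + F", data=df).fit(cov_type="HC3")
print(robust_fit.summary())
```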

All those methods have one thing in common: they approach the problem from a mathematical perspective. Either we try to get rid of the heteroscedasticity (transformation) or we use models that are able to calculate valid p-values in the presence of heteroscedasticity.

However, these approaches do not reveal where the heteroscedasticity comes from. To get a better understanding of it, and ideally to learn where your process is more reproducible, you need different techniques.

The Loglinear Variance Model

One way is the use of so-called loglinear variance models. (This is what JMP calls them; there is no standard name in the literature for this class of models yet.) The basic idea is to describe both the mean and the variance of the response variable in one model. While the classical OLS model consists of the single equation

$$y_i = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),$$

the loglinear variance model works with the following pair of equations:

$$y_i = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma_i^2), \qquad \sigma_i^2 = \exp(z_i^\top \lambda).$$

For the moment the reader may want to ignore the exp in the variance equation; it just makes sure that the estimated variances are positive. As you can see, the mean model is the same as in standard OLS: we model the relationship between the factors $x$ and the response $y$. The loglinear variance model adds a second equation modelling the relationship between the variance and some factors $z$. These factors $z$ may well be the same factors as in $x$.

You might already see where the name loglinear variance model comes from: just rewrite the variance equation as

$$\log(\sigma_i^2) = z_i^\top \lambda$$

and you are modelling the log of the variance with a linear predictor.

This more complex model comes at a price: instead of using ordinary least squares we estimate the model with a REML approach. We will not go further into the estimation process in this blog post, but it is considerably more complex than OLS regression.
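To at least sketch what the estimation is working with: in the simpler maximum-likelihood variant (REML additionally corrects for the degrees of freedom spent on the mean model), the fit maximizes

$$\ell(\beta, \lambda) = -\frac{1}{2} \sum_{i=1}^{n} \left[ \log(2\pi) + z_i^\top \lambda + \frac{(y_i - x_i^\top \beta)^2}{\exp(z_i^\top \lambda)} \right],$$

which is just the normal log-likelihood with $\log(\sigma_i^2) = z_i^\top \lambda$ plugged in.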

Example in JMP

One thing to notice is that estimating variances requires more data than estimating means alone. Therefore we cannot estimate a model with all factors ([A] - [I]) in both the mean and the variance part.

For demonstration purposes we will stick to the model proposed in [Smyth 2002]:

  • Mean model (this is the x-matrix): factors [B], [C] and [I]
  • Variance model (this is the z-matrix): factors [C] and [I] (a hand-rolled sketch of this fit follows below)
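JMP fits this by REML; purely to demystify the mechanics, here is a sketch of the simpler maximum-likelihood fit of exactly this model with numpy/scipy (df from the first snippet; all variable names are my own):

```python
import numpy as np
from scipy.optimize import minimize

# Design matrices: mean model with B, C, I; variance model with C, I
# (each including an intercept). df is the DataFrame from the first snippet.
X = np.column_stack([np.ones(len(df)), df["B"], df["C"], df["I"]])
Z = np.column_stack([np.ones(len(df)), df["C"], df["I"]])
y = df["Strength"].to_numpy()

def neg_loglik(params):
    # ML (not REML) objective of the loglinear variance model.
    beta, lam = params[:X.shape[1]], params[X.shape[1]:]
    log_var = Z @ lam                       # log(sigma_i^2) = z_i' lambda
    resid2 = (y - X @ beta) ** 2
    return 0.5 * np.sum(np.log(2 * np.pi) + log_var + resid2 / np.exp(log_var))

start = np.zeros(X.shape[1] + Z.shape[1])
start[0] = y.mean()                         # start the mean model at the grand mean
res = minimize(neg_loglik, start, method="BFGS")
beta_hat, lam_hat = res.x[:X.shape[1]], res.x[X.shape[1]:]
print("mean coefficients (intercept, B, C, I):     ", beta_hat)
print("log-variance coefficients (intercept, C, I):", lam_hat)
```

The signs of the log-variance coefficients tell you directly which level of [C] and [I] makes the process more reproducible.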

In JMP 12 you will find the loglinear variance model as part of the Fit Model platform.

The resulting model tells us that [B] and [C] have a significant effect on the mean strength (with [I] very close to statistical significance). Additionally we learn that [C] and [I] also impact the expected variance of the process (i.e. its reproducibility).

Especially the fact that [I] is potentially relevant for both the mean and the variance of the process is really interesting, as we did not see that in the original OLS model. Further interpretation can be done using the prediction profiler:

We can achieve maximum strength with high settings of [B] and low settings of [C]. At the same time we learn that choosing low levels of [C] increases the variance of the process, so we might want to choose a higher setting of [C] to gain process stability. As [I]'s impact on the mean is very small, we might want to set [I] to its low level to minimize the variance.
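With beta_hat and lam_hat from the sketch above (assuming it has been run), the same trade-off can be checked numerically for any candidate setting:

```python
# Predicted mean strength and standard deviation at B = +1, C = +1, I = -1
# (np, beta_hat and lam_hat come from the previous sketch).
x_new = np.array([1.0, 1.0, 1.0, -1.0])   # intercept, B, C, I
z_new = np.array([1.0, 1.0, -1.0])        # intercept, C, I
print("predicted mean strength: ", x_new @ beta_hat)
print("predicted std. deviation:", np.sqrt(np.exp(z_new @ lam_hat)))
```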

Conclusion

  1. Loglinear variance models are a neat tool to model the mean and the variance of a response variable in a single model.
  2. Of course most DoEs are not designed to model the variance, so you might not have sufficient data to estimate the whole model. Finding designs that guarantee the estimability of a complete loglinear variance model definitely needs further research.
  3. I could not find any literature on the model-building process for loglinear variance models. Especially the previous point - not having enough data for the variance estimation - is problematic there. Subject-matter knowledge will of course help here.

Literature and Useful Resources

  1. M. Aitkin: Modelling Variance Heterogeneity in Normal Regression Using GLIM (Journal of the Royal Statistical Society, Series C (Applied Statistics), 1987)
  2. H. Goldstein: Heteroscedasticity and Complex Variation (Encyclopedia of Statistics in Behavioral Science)
  3. A.C. Harvey: Estimating Regression Models with Multiplicative Heteroscedasticity (Econometrica, 1976)
  4. G. Smyth: An Efficient Algorithm for REML in Heteroscedastic Regression (Journal of Computational and Graphical Statistics, 2002)
  5. JMP 12 Online Documentation

Special thanks to my colleague Jonas who did all the HTML stuff in here (including the interactive HTML from JMP, which is so nice!).

Author: Sebastian Hoffmeister