G02 Class

This chapter is concerned with two techniques – correlation analysis and regression modelling – both of which are concerned with determining the inter-relationships among two or more variables.
Other chapters of the NAG Library which cover similar problems are E02 class and E04 class. E02 class methods may be used to fit linear models by criteria other than least-squares, and also for polynomial regression; E04 class methods may be used to fit nonlinear models and linearly constrained linear models.

Syntax

C#
public static class G02
Visual Basic (Declaration)
Public NotInheritable Class G02
Visual C++
public ref class G02 abstract sealed
F#
[<AbstractClassAttribute>]
[<SealedAttribute>]
type G02 =  class end

Background to the Problems

Correlation

Aims of correlation analysis

Correlation coefficients

The (Pearson) product-moment correlation coefficients measure a linear relationship, while Kendall's tau and Spearman's rank order correlation coefficients measure monotonicity only. All three coefficients range from -1.0 to +1.0. A coefficient of zero always indicates that no linear relationship exists; a +1.0 coefficient implies a ‘perfect’ positive relationship (i.e., an increase in one variable is always associated with a corresponding increase in the other variable); and a coefficient of -1.0 indicates a ‘perfect’ negative relationship (i.e., an increase in one variable is always associated with a corresponding decrease in the other variable).
Consider the bivariate scattergrams in Figure 1: (a) and (b) show strictly linear functions for which the values of the product-moment correlation coefficient, and (since a linear function is also monotonic) both Kendall's tau and Spearman's rank order coefficients, would be +1.0 and -1.0 respectively. However, though the relationships in figures (c) and (d) are respectively monotonically increasing and monotonically decreasing, for which both Kendall's and Spearman's nonparametric coefficients would be +1.0 (in (c)) and -1.0 (in (d)), the functions are nonlinear so that the product-moment coefficients would not take such ‘perfect’ extreme values. There is no obvious relationship between the variables in figure (e), so all three coefficients would assume values close to zero, while in figure (f) though there is an obvious parabolic relationship between the two variables, it would not be detected by any of the correlation coefficients which would again take values near to zero; it is important therefore to examine scattergrams as well as the correlation coefficients.
In order to decide which type of correlation is the most appropriate, it is necessary to appreciate the different groups into which variables may be classified. Variables are generally divided into four types of scales: the nominal scale, the ordinal scale, the interval scale, and the ratio scale. The nominal scale is used only to categorise data; for each category a name, perhaps numeric, is assigned so that two different categories will be identified by distinct names. The ordinal scale, as well as categorising the observations, orders the categories. Each category is assigned a distinct identifying symbol, in such a way that the order of the symbols corresponds to the order of the categories. (The most common system for ordinal variables is to assign numerical identifiers to the categories, though if they have previously been assigned alphabetic characters, these may be transformed to a numerical system by any convenient method which preserves the ordering of the categories.) The interval scale not only categorises and orders the observations, but also quantifies the comparison between categories; this necessitates a common unit of measurement and an arbitrary zero-point. Finally, the ratio scale is similar to the interval scale, except that it has an absolute (as opposed to arbitrary) zero-point.
For a more complete discussion of these four types of scales, and some examples, you are referred to Churchman and Ratoosh (1959) and Hays (1970).
Figure 1

Partial correlation

Robust estimation of correlation coefficients

Missing values

When there are missing values in the data these may be handled in one of two ways. Firstly, if a case contains a missing observation for any variable, then that case is omitted in its entirety from all calculations; this may be termed casewise treatment of missing data. Secondly, if a case contains a missing observation for any variable, then the case is omitted from only those calculations involving the variable for which the value is missing; this may be called pairwise treatment of missing data. Pairwise deletion of missing data has the advantage of using as much of the data as possible in the computation of each coefficient. In extreme circumstances, however, it can have the disadvantage of producing coefficients which are based on a different number of cases, and even on different selections of cases or samples; furthermore, the ‘correlation’ matrices formed in this way need not necessarily be positive-semidefinite, a requirement for a correlation matrix. Casewise deletion of missing data generally causes fewer cases to be used in the calculation of the coefficients than does pairwise deletion. How great this difference is will obviously depend on the distribution of the missing data, both among cases and among variables.
Pairwise treatment does therefore use more information from the sample, but should not be used without careful consideration of the location of the missing observations in the data matrix, and the consequent effect of processing the missing data in that fashion.
Consider a matrix with elements given by the product-moment correlations of pairs of variables, with any missing values treated in the pairwise sense. Such a matrix may not be positive semidefinite, and is therefore not a valid correlation matrix. However, a valid correlation matrix can be calculated that is in some sense ‘close’ to the original. One measure of closeness is the Frobenius norm. This valid correlation matrix is the solution to the nearest correlation matrix problem.
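One standard way to state this problem (with G denoting the approximate input matrix and X the correlation matrix sought) is
\min_{X} \; \lVert G - X \rVert_F \quad \text{subject to} \quad X = X^{\mathrm{T}}, \;\; X \succeq 0, \;\; X_{ii} = 1, \; i = 1, \ldots, n .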

Regression

Aims of regression modelling

Linear regression models

A regression model which is linear in the parameters (but not necessarily in the independent variables) is said to be linear; otherwise the model is classified as nonlinear.

Fitting the regression model – least-squares estimation

Regression models and designed experiments

One application of regression models is in the analysis of experiments. In this case the model relates the dependent variable to qualitative independent variables known as factors. Factors may take a number of different values known as levels. For example, in an experiment in which one of four different treatments is applied, the model will have one factor with four levels. Each level of the factor can be represented by a dummy variable taking the values 0 or 1. So in the example there are four dummy variables xj, for j=1,2,3,4 such that:
x_{ij} = \begin{cases} 1 & \text{if the } i\text{th observation received the } j\text{th treatment} \\ 0 & \text{otherwise,} \end{cases}
along with a variable for the mean x0:
x_{i0} = 1 \quad \text{for all } i.
If there were 7 observations the data would be:
Treatment   Y    x0   x1   x2   x3   x4
    1       y1    1    1    0    0    0
    2       y2    1    0    1    0    0
    2       y3    1    0    1    0    0
    3       y4    1    0    0    1    0
    3       y5    1    0    0    1    0
    4       y6    1    0    0    0    1
    4       y7    1    0    0    0    1
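As an illustration only (a minimal C# sketch, not part of the library; the array names are hypothetical), the dummy-variable columns in the table above can be generated directly from the vector of treatment labels:

// Build the dummy-variable design matrix (x0 to x4) for a one-factor
// experiment with four treatment levels, matching the table above.
int[] treatment = { 1, 2, 2, 3, 3, 4, 4 };     // treatment received by each observation
int n = treatment.Length;
const int levels = 4;
double[,] x = new double[n, levels + 1];       // column 0 holds the mean term x0
for (int i = 0; i < n; i++)
{
    x[i, 0] = 1.0;                             // xi0 = 1 for all i
    x[i, treatment[i]] = 1.0;                  // xij = 1 if observation i received treatment j
}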
Models which include factors are sometimes known as General Linear (Regression) Models. When dummy variables are used it is common for the model not to be of full rank. In the case above, the model would not be of full rank because
x_{i4} = x_{i0} - x_{i1} - x_{i2} - x_{i3} , \qquad i = 1, 2, \ldots, 7 .
This means that the effect of x4 cannot be distinguished from the combined effect of x0, x1, x2 and x3. This is known as aliasing. When the aliasing can be deduced from the experimental design, and hence from the model to be fitted, it is known as intrinsic aliasing. In the example above, no matter how many times each treatment is replicated (other than 0), the aliasing will still be present. If the aliasing is due to a particular dataset to which the model is to be fitted then it is known as extrinsic aliasing. If in the example above observation 1 was missing then the x1 term would also be aliased. In general intrinsic aliasing may be overcome by changing the model, e.g., by removing x0 or x1 from the model, or by introducing constraints on the parameters, e.g., β1+β2+β3+β4=0.
If aliasing is present then there will no longer be a unique set of least-squares estimates for the parameters of the model but the fitted values will still have a unique estimate. Some linear functions of the parameters will also have unique estimates; these are known as estimable functions. In the example given above the functions (β0+β1) and (β2-β3) are both estimable.

Selecting the regression model

In many situations there are several possible independent variables, not all of which may be needed in the model. In order to select a suitable set of independent variables, two basic approaches can be used.
(a) All possible regressions
In this case all the possible combinations of independent variables are fitted and the one considered the best selected. To choose the best, two conflicting criteria have to be balanced. One is the fit of the model as measured by the residual sum of squares. This will decrease as more variables are added to the model. The second criterion is the desire to have a model with a small number of significant terms. To aid in the choice of model, statistics such as R2, which gives the proportion of variation explained by the model, and Cp, which tries to balance the size of the residual sum of squares against the number of terms in the model, can be used.
(b) Stepwise model building
In stepwise model building the regression model is constructed recursively, adding or deleting the independent variables one at a time. When the model is built up the procedure is known as forward selection. The first step is to choose the single variable which is the best predictor. The second independent variable to be added to the regression equation is that which provides the best fit in conjunction with the first variable. Further variables are then added in this recursive fashion, adding at each step the optimum variable, given the other variables already in the equation. Alternatively, backward elimination can be used. This is when all variables are added and then the variables dropped one at a time, the variable dropped being the one which has the least effect on the fit of the model at that stage. There are also hybrid techniques which combine forward selection with backward elimination.
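A minimal sketch of the forward selection loop is given below (illustrative C# only; ResidualSumOfSquares stands for a hypothetical helper that fits the model containing the listed variables, for example via g02da/g02de, and returns its residual sum of squares):

using System;
using System.Collections.Generic;

// Forward selection: starting from the empty model, repeatedly add the variable
// giving the largest reduction in the residual sum of squares, stopping when the
// improvement falls below a chosen threshold.
int numVariables = 6;                               // illustrative value
const double minImprovement = 1e-3;
Func<List<int>, double> ResidualSumOfSquares =
    subset => 0.0;                                  // placeholder: fit the model and return its RSS

var selected = new List<int>();
var remaining = new List<int>();
for (int v = 0; v < numVariables; v++) remaining.Add(v);
double currentRss = ResidualSumOfSquares(selected);

while (remaining.Count > 0)
{
    int best = -1;
    double bestRss = currentRss;
    foreach (int v in remaining)
    {
        var trial = new List<int>(selected) { v };
        double rss = ResidualSumOfSquares(trial);
        if (rss < bestRss) { bestRss = rss; best = v; }
    }
    if (best < 0 || currentRss - bestRss < minImprovement) break;   // no worthwhile improvement
    selected.Add(best);
    remaining.Remove(best);
    currentRss = bestRss;
}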

Examining the fit of the model

Having fitted a model two questions need to be asked: first, ‘are all the terms in the model needed?’ and second, ‘is there some systematic lack of fit?’. To answer the first question either confidence intervals can be computed for the parameters or t-tests can be calculated to test hypotheses about the regression parameters – for example, whether the value of the parameter, βk, is significantly different from a specified value, bk (often zero). If the estimate of βk is β̂k and its standard error is se(β̂k) then the t-statistic is
\frac{\hat{\beta}_k - b_k}{se\left(\hat{\beta}_k\right)} .
It should be noted that both the tests and the confidence intervals may not be independent. Alternatively F-tests based on the residual sums of squares for different models can also be used to test the significance of terms in the model. If model 1, giving residual sum of squares RSS1 with degrees of freedom ν1, is a sub-model of model 2, giving residual sum of squares RSS2 with degrees of freedom ν2, i.e., all terms in model 1 are also in model 2, then to test if the extra terms in model 2 are needed the F-statistic
F = \frac{\left(RSS_1 - RSS_2\right) / \left(\nu_1 - \nu_2\right)}{RSS_2 / \nu_2}
may be used. These tests and confidence intervals require the additional assumption that the errors, ei, are Normally distributed.
To check for systematic lack of fit the residuals, ri = yi − ŷi, where ŷi is the fitted value, should be examined. If the model is correct then they should be random with no discernible pattern. Due to the way they are calculated the residuals do not have constant variance. The vector of fitted values can be written as a linear combination of the vector of observations of the dependent variable y, ŷ = Hy. The variance-covariance matrix of the residuals is then (I − H)σ², I being the identity matrix. The diagonal elements of H, hii, can therefore be used to standardize the residuals. The hii are a measure of the effect of the ith observation on the fitted model and are sometimes known as leverages.
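For reference, for a full-rank model fitted by least-squares the hat matrix H and the corresponding standardized residual take the standard forms
H = X\left(X^{\mathrm{T}}X\right)^{-1}X^{\mathrm{T}} , \qquad \operatorname{Var}\left(r_i\right) = \left(1 - h_{ii}\right)\sigma^{2} , \qquad \tilde{r}_i = \frac{r_i}{\hat{\sigma}\sqrt{1 - h_{ii}}} ,
where σ̂ is the estimate of σ.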
If the observations were taken serially the residuals may also be used to test the assumption of the independence of the ei and hence the independence of the observations.

Computational methods

Robust estimation

There are two ways in which an observation for a regression model can be considered atypical. The values of the independent variables for the observation may be atypical or the residual from the model may be large.
The first problem of atypical values of the independent variables can be tackled by calculating weights for each observation which reflect how atypical it is, i.e., a strongly atypical observation would have a low weight. There are several ways of finding suitable weights; some are discussed in Hampel et al. (1986).
The second problem is tackled by bounding the contribution of the individual ei to the criterion to be minimized. When minimizing (7) a set of linear equations is formed, the solution of which gives the least-squares estimates. The equations are
\sum_{i=1}^{n} e_i x_{ij} = 0 , \qquad j = 0, 1, \ldots, k .
These equations are replaced by
\sum_{i=1}^{n} \psi\left(e_i/\sigma\right) x_{ij} = 0 , \qquad j = 0, 1, \ldots, k ,
for a suitable function ψ, which down-weights large values of the standardized residuals ei/σ, where σ is a parameter of scale.
Robust regressions using least absolute deviations can be computed using methods in E02 class.

Generalized linear models

Generalized linear models are an extension of the general linear regression model discussed above. They allow a wide range of models to be fitted. These include certain nonlinear regression models, logistic and probit regression models for binary data, and log-linear models for contingency tables. A generalized linear model consists of three basic components:
(a) A suitable distribution for the dependent variable Y. The following distributions are common:
(i) Normal
(ii) binomial
(iii) Poisson
(iv) gamma
In addition to the obvious uses of models with these distributions it should be noted that the Poisson distribution can be used in the analysis of contingency tables while the gamma distribution can be used to model variance components. The effect of the choice of the distribution is to define the relationship between the expected value of Y, E(Y) = μ, and its variance, and so a generalized linear model with one of the above distributions may be used in a wider context when that relationship holds.
(b) A linear model η = ∑ βj xj; η is known as the linear predictor.
(c) A link function g(·) between the expected value of Y and the linear predictor, g(μ) = η. The following link functions are available:
For the binomial distribution, observing y out of t:
(i) logistic link: η = log(μ/(t−μ));
(ii) probit link: η = Φ⁻¹(μ/t);
(iii) complementary log-log link: η = log(−log(1−μ/t)).
For the Normal, Poisson, and gamma distributions:
(i) exponent link: η = μ^a, for a constant a;
(ii) identity link: η = μ;
(iii) log link: η = log(μ);
(iv) square root link: η = √μ;
(v) reciprocal link: η = 1/μ.
For each distribution there is a canonical link. For the canonical link there exist sufficient statistics for the parameters. The canonical links are:
(i) Normal – identity;
(ii) binomial – logistic;
(iii) Poisson – logarithmic;
(iv) gamma – reciprocal.
For the general linear regression model described above the three components are:
(i) Distribution – Normal;
(ii) Linear model – ∑ βj xj;
(iii) Link – identity.
The model is fitted by maximum likelihood; this is equivalent to least-squares in the case of the Normal distribution. The residual sum of squares used in regression models is generalized to the concept of deviance. The deviance is the logarithm of the ratio of the likelihood of the model to the full model in which μ̂i = yi, where μ̂i is the estimated value of μi. For the Normal distribution the deviance is the residual sum of squares. Except for the case of the Normal distribution with the identity link, the χ² and F-tests based on the deviance are only approximate; also the estimates of the parameters will only be approximately Normally distributed. Thus only approximate z- or t-tests may be performed on the parameter values and approximate confidence intervals computed.
The estimates are found by using an iterative weighted least-squares procedure. This is equivalent to the Fisher scoring method in which the Hessian matrix used in the Newton–Raphson method is replaced by its expected value. In the case of canonical links the Fisher scoring method and the Newton–Raphson method are identical. Starting values for the iterative procedure are obtained by replacing the μi by yi in the appropriate equations.
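The iteration can be illustrated with a small self-contained C# sketch for a Poisson model with a log link and a single explanatory variable (illustrative only, with made-up data; in practice a method such as g02gc performs this fit):

using System;

// IRLS for log(mu_i) = b0 + b1*x_i with Poisson errors.
double[] x = { 1, 2, 3, 4, 5, 6 };
double[] y = { 2, 3, 6, 7, 12, 18 };
int n = x.Length;

double b0 = 0.0, b1 = 0.0;
double[] eta = new double[n];
for (int i = 0; i < n; i++) eta[i] = Math.Log(Math.Max(y[i], 0.5));   // start by replacing mu_i with y_i

for (int iter = 0; iter < 25; iter++)
{
    double sw = 0, swx = 0, swxx = 0, swz = 0, swxz = 0;
    for (int i = 0; i < n; i++)
    {
        double mu = Math.Exp(eta[i]);
        double w = mu;                            // IRLS weight for Poisson with log link
        double z = eta[i] + (y[i] - mu) / mu;     // working response
        sw += w; swx += w * x[i]; swxx += w * x[i] * x[i];
        swz += w * z; swxz += w * x[i] * z;
    }
    double det = sw * swxx - swx * swx;           // solve the 2x2 weighted normal equations
    double newB0 = (swxx * swz - swx * swxz) / det;
    double newB1 = (sw * swxz - swx * swz) / det;
    bool converged = Math.Abs(newB0 - b0) + Math.Abs(newB1 - b1) < 1e-8;
    b0 = newB0; b1 = newB1;
    for (int i = 0; i < n; i++) eta[i] = b0 + b1 * x[i];
    if (converged) break;
}
Console.WriteLine($"b0 = {b0:F4}, b1 = {b1:F4}");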

Linear mixed effects regression

In a standard linear model the independent (or explanatory) variables are assumed to take the same set of values for all units in the population of interest. This type of variable is called fixed. In contrast, an independent variable that fluctuates over the different units is said to be random. Modelling a variable as fixed allows conclusions to be drawn only about the particular set of values observed. Modelling a variable as random allows the results to be generalised to the different levels that may have been observed. In general, if the effects of the levels of a variable are thought of as being drawn from a probability distribution of such effects then the variable is random. If the levels are not a sample of possible levels then the variable is fixed. In practice many qualitative variables can be considered as having fixed effects and most blocking, sampling design, control and repeated measures as having random effects.
In a general linear regression model, defined by
y=Xβ+ε
where y is a vector of n observations on the dependent variable,
X is an n by p design matrix of independent variables,
β is a vector of p unknown parameters,
and ε is a vector of n independent and identically distributed unknown errors, with ε ~ N(0, σ²),
there are p fixed effects (the β) and a single random effect (the error term ε).
An extension to the general linear regression model that allows for additional random effects is the linear mixed effects regression model (sometimes called the variance components model). One parameterisation of a linear mixed effects model is
y=Xβ+Zν+ε
where y is a vector of n observations on the dependent variable,
X is an n by p design matrix of fixed independent variables,
β is a vector of p unknown fixed effects,
Z is an n by q design matrix of random independent variables,
ν is a vector of length q of unknown random effects,
ε is a vector of length n of unknown random errors,
and ν and ε are normally distributed with expectation zero and variance-covariance matrix defined by
\mathrm{Var}\begin{pmatrix}\nu \\ \varepsilon\end{pmatrix} = \begin{pmatrix}G & 0 \\ 0 & R\end{pmatrix} .
The methods currently available in this chapter are restricted to cases where R = σ_R^2 I, with I the n×n identity matrix, and G a diagonal matrix. Given this restriction the random variables, Z, can be subdivided into g ≤ q groups containing one or more variables. The variables in the ith group are identically distributed with expectation zero and variance σ_i^2. The model therefore contains three sets of unknowns: the fixed effects β, the random effects ν, and a vector of g+1 variance components γ, with γ = (σ_1^2, σ_2^2, …, σ_{g-1}^2, σ_g^2, σ_R^2). Rather than working directly with γ and the full likelihood function, γ is replaced by γ* = (σ_1^2/σ_R^2, σ_2^2/σ_R^2, …, σ_{g-1}^2/σ_R^2, σ_g^2/σ_R^2, 1) and the profiled likelihood function is used instead.
The model parameters are estimated using an iterative method based on maximizing either the restricted (profiled) log-likelihood or the (profiled) log-likelihood. Fitting the model via restricted maximum likelihood involves maximizing l_R, where
-2 l_R = \log\lvert V\rvert + (n-p)\log\left(r^{\mathrm{T}} V^{-1} r\right) + \log\lvert X^{\mathrm{T}} V^{-1} X\rvert + (n-p)\log\left(2\pi/(n-p)\right) + (n-p) ,
whereas fitting the model via maximum likelihood involves maximizing l, where
-2 l = \log\lvert V\rvert + n\log\left(r^{\mathrm{T}} V^{-1} r\right) + n\log\left(2\pi/n\right) + n .
In both cases
V = ZGZ^{\mathrm{T}} + R , \qquad r = y - Xb \qquad \text{and} \qquad b = \left(X^{\mathrm{T}} V^{-1} X\right)^{-1} X^{\mathrm{T}} V^{-1} y .
Once the final estimates for γ* have been obtained, the value of σ_R^2 is given by
\sigma_R^2 = \left(r^{\mathrm{T}} V^{-1} r\right) / \left(n - p\right) .
Case weights, Wc, can be incorporated into the model by replacing X^T X and Z^T Z with X^T Wc X and Z^T Wc Z respectively, for a diagonal weight matrix Wc.

Ridge regression

Latent variable methods

Recommendations on Choice and Use of Available Methods

Correlation

Product-moment correlation

Let SSx be the sum of squares of deviations from the mean, x̄, for the variable x for a sample of size n, i.e.,
SS_x = \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2 ,
and let SCxy be the cross-products of deviations from the means, x̄ and ȳ, for the variables x and y for a sample of size n, i.e.,
SC_{xy} = \sum_{i=1}^{n} \left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right) .
Then the sample covariance of x and y is
\mathrm{cov}(x,y) = \frac{SC_{xy}}{n-1}
and the product-moment correlation coefficient is
r = \frac{\mathrm{cov}(x,y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{SC_{xy}}{\sqrt{SS_x \, SS_y}} .
g02bu computes the sample sums of squares and cross-products of deviations from the means (optionally weighted).
g02bt updates the sample sums of squares and cross-products and deviations from the means by the addition/deletion of a (weighted) observation.
g02bw computes the product-moment correlation coefficients from the sample sums of squares and cross-products of deviations from the means.
The three methods compute only the upper triangle of the correlation matrix which is stored in a one-dimensional array in packed form.
g02bx computes both the (optionally weighted) covariance matrix and the (optionally weighted) correlation matrix. These are returned in two-dimensional arrays. (Note that g02bt and g02bu can be used to compute the sums of squares from zero.)
g02bg can be used to calculate the correlation coefficients for a subset of variables in the data matrix.
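The quantities defined above can be illustrated with a short C# sketch (illustrative only, with made-up data; g02bu, g02bw or g02bx perform the corresponding calculations in the library):

using System;

// Sums of squares and cross-products of deviations from the means, sample
// covariance and the product-moment correlation coefficient.
double[] x = { 1.2, 2.3, 3.1, 4.8, 5.0 };
double[] y = { 2.1, 2.9, 3.6, 5.1, 5.4 };
int n = x.Length;

double xBar = 0, yBar = 0;
for (int i = 0; i < n; i++) { xBar += x[i]; yBar += y[i]; }
xBar /= n; yBar /= n;

double ssx = 0, ssy = 0, scxy = 0;
for (int i = 0; i < n; i++)
{
    ssx  += (x[i] - xBar) * (x[i] - xBar);
    ssy  += (y[i] - yBar) * (y[i] - yBar);
    scxy += (x[i] - xBar) * (y[i] - yBar);
}

double cov = scxy / (n - 1);                 // sample covariance
double r   = scxy / Math.Sqrt(ssx * ssy);    // product-moment correlation coefficient
Console.WriteLine($"cov = {cov:F4}, r = {r:F4}");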

Product-moment correlation with missing values

If there are missing values then g02bu and g02bx, as described above, will allow casewise deletion if you give the observation zero weight (compared with unit weight for an otherwise unweighted computation).
Other methods also handle missing values in the calculation of unweighted product-moment correlation coefficients. Casewise exclusion of missing values is provided by g02bb while pairwise omission of missing values is carried out by g02bc. These two methods calculate a correlation matrix for all the variables in the data matrix; similar output but for only a selected subset of variables is provided by methods g02bh and g02bj respectively. As well as providing the Pearson product-moment correlation coefficients, these methods also calculate the means and standard deviations of the variables, and the matrix of sums of squares and cross-products of deviations from the means. For all four methods you are free to select appropriate values for consideration as missing values, bearing in mind the nature of the data and the possible range of valid values. The missing values for each variable may be either different or alike and it is not necessary to specify missing values for all the variables.

Nonparametric correlation

There are five methods which perform nonparametric correlations, each of which is capable of producing both Spearman's rank-order and Kendall's tau correlation coefficients. The basic underlying concept of both these coefficients is to replace each observation by its corresponding rank or order within the observations on that variable, and the correlations are then calculated using these ranks.
It is obviously more convenient to order the observations and calculate the ranks for a particular variable just once, and to store these ranks for subsequent use in calculating all coefficients involving that variable; this does however require an amount of store of the same size as the original data matrix, which in some cases might be excessive. Accordingly, some methods calculate the ranks only once, and replace the input data matrix by the matrix of ranks, which are then also made available to you on exit from the method, while others preserve the data matrix and calculate the ranks a number of times within the method; the ranks of the observations are not provided as output by methods which work in the latter way. If it is possible to arrange the program in such a way that the first technique can be used, then efficiency of timing is achieved with no additional storage, whereas in the second case, it is necessary to have a second matrix of the same size as the data matrix, which may not be acceptable in certain circumstances; in this case it is necessary to reach a compromise between efficiency of time and of storage, and this may well be dependent upon local conditions.
Methods g02bn and g02bq both calculate Kendall's tau and/or Spearman's rank-order correlation coefficients taking no account of missing values; g02bn does so by calculating the ranks of each variable only once, and replacing the data matrix by the matrix of ranks, whereas g02bq calculates the ranks of each variable several times. Methods g02bp and g02br provide the same output, but treat missing values in a ‘casewise’ manner (see above); g02bp calculates the ranks of each variable only once, replacing the data matrix by the matrix of ranks, while g02br determines the ranks of each variable several times. For ‘pairwise’ omission of missing data (see above), the method g02bs provides Kendall and/or Spearman coefficients.
Since g02bn and g02bp order the observations and calculate the ranks of each variable only once, then if there are M variables involved, there are only M separate ‘ranking’ operations; this should be contrasted with the method used by methods g02bq and g02br which perform M(M−1)/2 + 1 similar ranking operations. These ranking operations are by far the most time-consuming parts of these nonparametric methods, so for a matrix of as few as five variables, the time taken by one of the slower methods can be expected to be at least a factor of two slower than the corresponding efficient method; as the number of variables increases, so this relative efficiency factor increases. Only one method, g02bs, is provided for pairwise missing values, and this method carries out M(M−1) separate rankings; since by the very nature of the pairwise method it is necessary to treat each pair of variables separately and rank them individually, it is impossible to reduce this number of operations, and so no alternative method is provided.
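The ranking idea underlying these methods can be sketched as follows (illustrative C# only, assuming no tied observations; tied values are normally assigned averaged ranks):

using System;
using System.Linq;

// Spearman's rank-order correlation: replace each observation by its rank
// within its own variable, then compute the product-moment correlation of the ranks.
double[] Ranks(double[] v)
{
    int[] order = Enumerable.Range(0, v.Length).OrderBy(i => v[i]).ToArray();
    double[] ranks = new double[v.Length];
    for (int k = 0; k < order.Length; k++) ranks[order[k]] = k + 1;   // ranks 1..n
    return ranks;
}

double Pearson(double[] a, double[] b)
{
    double am = a.Average(), bm = b.Average();
    double sab = 0, saa = 0, sbb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        sab += (a[i] - am) * (b[i] - bm);
        saa += (a[i] - am) * (a[i] - am);
        sbb += (b[i] - bm) * (b[i] - bm);
    }
    return sab / Math.Sqrt(saa * sbb);
}

double[] x = { 10.0, 20.0, 35.0, 41.0, 60.0 };
double[] y = { 1.0, 8.0, 27.0, 64.0, 125.0 };      // monotone but nonlinear
Console.WriteLine(Pearson(Ranks(x), Ranks(y)));     // Spearman coefficient: 1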

Partial correlation

g02by computes a matrix of partial correlation coefficients from the correlation coefficients or variance-covariance matrix returned by g02bx.

Robust correlation

g02hl and g02hm compute robust estimates of the variance-covariance matrix by solving the equations
\frac{1}{n}\sum_{i=1}^{n} w\left(\lVert z_i \rVert_2\right) z_i = 0
and
\frac{1}{n}\sum_{i=1}^{n} u\left(\lVert z_i \rVert_2\right) z_i z_i^{\mathrm{T}} - v\left(\lVert z_i \rVert_2\right) I = 0 ,
as described in [Robust estimation of correlation coefficients] for user-supplied functions w and u. Two options are available for v: either v(t) = 1 for all t or v(t) = u(t).
g02hm requires only the functions w and u to be supplied, while g02hl also requires their derivatives.
In general g02hl will be considerably faster than g02hm and should be used if derivatives are available.
g02hk computes a robust variance-covariance matrix for the following functions:
u(t) = \begin{cases} a_u/t^2 & \text{if } t < a_u^2 \\ 1 & \text{if } a_u^2 \le t \le b_u^2 \\ b_u/t^2 & \text{if } t > b_u^2 \end{cases}
and
w(t) = \begin{cases} 1 & \text{if } t \le c_w \\ c_w/t & \text{if } t > c_w \end{cases}
for constants a_u, b_u and c_w.
These functions solve a minimax problem considered by Huber (1981). The values of a_u, b_u and c_w are calculated from the fraction of gross errors; see Hampel et al. (1986) and Huber (1981).
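Written out directly in C#, these piecewise functions look as follows (illustrative only; the constants used here are example values, not the ones g02hk would derive from the fraction of gross errors):

using System;

// The piecewise functions u and w used by g02hk.
double au = 1.5, bu = 3.0, cw = 2.0;           // example values only

double U(double t)
{
    if (t < au * au) return au / (t * t);
    if (t <= bu * bu) return 1.0;
    return bu / (t * t);
}

double W(double t) => t <= cw ? 1.0 : cw / t;

Console.WriteLine($"u(1.0) = {U(1.0):F3}, w(4.0) = {W(4.0):F3}");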
To compute a correlation matrix from the variance-covariance matrix g02bw may be used.

Nearest correlation matrix

g02aa calculates the nearest correlation matrix to a real, square matrix.

Regression

Simple linear regression

Four methods are provided for simple linear regressions: g02ca and g02cc perform the simple linear regression with a constant term (equation (1) above), while g02cb and g02cd fit the simple linear regression with no constant term (equation (2) above). Two of these methods, g02cc and g02cd, take account of missing values, which the others do not. In these two methods, an observation is omitted if it contains a missing value for either the dependent or the independent variable; this is equivalent to both the casewise and pairwise methods, since both are identical when there are only two variables involved. Input to these methods consists of the raw data, and output includes the coefficients, their standard errors and t values for testing the significance of the coefficients; the F value for testing the overall significance of the regression is also given.
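Both fits can be sketched in a few lines of C# (illustrative only, with made-up data; the library methods additionally return standard errors and t and F statistics, and handle missing values where appropriate):

using System;

// Least-squares fits for the two model forms: with a constant term,
// y = b0 + b1*x, and through the origin, y = b*x.
double[] x = { 1, 2, 3, 4, 5 };
double[] y = { 2.0, 4.1, 5.9, 8.2, 9.9 };
int n = x.Length;

double xBar = 0, yBar = 0, sxx = 0, sxy = 0, sxxRaw = 0, sxyRaw = 0;
for (int i = 0; i < n; i++) { xBar += x[i]; yBar += y[i]; }
xBar /= n; yBar /= n;
for (int i = 0; i < n; i++)
{
    sxx    += (x[i] - xBar) * (x[i] - xBar);
    sxy    += (x[i] - xBar) * (y[i] - yBar);
    sxxRaw += x[i] * x[i];
    sxyRaw += x[i] * y[i];
}

double b1 = sxy / sxx;                       // slope, model with constant term
double b0 = yBar - b1 * xBar;                // intercept
double bThroughOrigin = sxyRaw / sxxRaw;     // slope, model with no constant term
Console.WriteLine($"b0 = {b0:F3}, b1 = {b1:F3}, b (origin) = {bThroughOrigin:F3}");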

Multiple linear regression – general linear model

g02da fits a general linear regression model using the QR method and an SVD if the model is not of full rank. The results returned include: residual sum of squares, parameter estimates, their standard errors and variance-covariance matrix, residuals and leverages. There are also several methods to modify the model fitted by g02da and to aid in the interpretation of the model.
g02dc adds or deletes an observation from the model.
g02dd computes the parameter estimates, and their standard errors and variance-covariance matrix for a model that is modified by g02dc, g02de or g02df.
g02de adds a new variable to a model.
g02df drops a variable from a model.
g02dg fits the regression to a new dependent variable, i.e., keeping the same independent variables.
g02dk calculates the estimates of the parameters for a given set of constraints (e.g., the parameters for the levels of a factor sum to zero), for a model which is not of full rank and for which the SVD has been used.
g02dn calculates the estimate of an estimable function and its standard error.
Note:  g02de also allows you to initialize a model building process and then to build up the model by adding variables one at a time.
If you wish to use methods based on forming the cross-products/correlation matrix (i.e., the X^T X matrix) rather than the recommended use of g02da then the following methods should be used.
For regression through the origin (i.e., no constant) g02ch preceded by:
  • g02bd (no missing values, all variables)
  • g02bk (no missing values, subset of variables)
  • g02be (casewise missing values, all variables)
  • g02bl (casewise missing values, subset of variables)
  • g02bf* (pairwise missing values, all variables)
  • g02bm* (pairwise missing values, subset of variables)
For regression with intercept (i.e., with constant) g02cg preceded by:
  • g02ba (no missing values, all variables)
  • g02bg (no missing values, subset of variables)
  • g02bb (casewise missing values, all variables)
  • g02bh (casewise missing values, subset of variables)
  • g02bc* (pairwise missing values, all variables)
  • g02bj* (pairwise missing values, subset of variables)
Note that the four methods using pairwise deletion of missing values (marked with *) should be used with great caution as the use of this method can lead to misleading results, particularly if a significant proportion of values are missing.
Both g02cg and g02ch require that the correlations/sums of squares involving the dependent variable must appear as the last row/column. Because the layout of the variables in your data array may not be arranged in this way, two methods, g02ce and g02cf, are provided for re-arranging the rows and columns of vectors and matrices. g02cf simply reorders the rows and columns while g02ce forms smaller vectors and matrices from larger ones.
Output from g02cg and g02ch consists of the coefficients, their standard errors, R2-values, t and F statistics.

Selecting regression models

Residuals

g02fa computes the following standardized residuals and measures of influence for the residuals and leverages produced by g02da:
(i) Internally studentized residual;
(ii) Externally studentized residual;
(iii) Cook's D statistic;
(iv) Atkinson's T statistic.
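For reference, a commonly used form of Cook's D statistic is
D_i = \frac{\tilde{r}_i^{\,2}}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} ,
where r̃i is the internally studentized residual and p is the number of parameters in the model; the exact definitions used by g02fa are given in its documentation.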
g02fc computes the Durbin–Watson test statistic and bounds for its significance to test for serial correlation in the errors, ei.

Robust regression

For robust regression using M-estimates instead of least-squares the method g02ha will generally be suitable. g02ha provides a choice of four ψ-functions (Huber's, Hampel's, Andrews' and Tukey's) plus two different weighting methods and the option not to use weights. If other weights or different ψ-functions are needed the method g02hd may be used. g02hd requires you to supply weights, if required, and also methods to calculate the ψ-function and, optionally, the χ-function. g02hb can be used in calculating suitable weights. The method g02hf can be used after a call to g02hd in order to calculate the variance-covariance estimate of the estimated regression coefficients.
For robust regression using least absolute deviations, (e02ga not in this release) can be used.

Generalized linear models

While g02ga can be used to fit linear regression models (i.e., by using an identity link) this is not recommended as g02da will fit these models more efficiently. g02gc can be used to fit log-linear models to contingency tables.
In addition to the methods to fit the models there is one method to predict from the fitted model and two methods to aid interpretation when the fitted model is not of full rank, i.e., aliasing is present.
  • g02gp computes a predicted value and its associated standard error based on a previously fitted generalized linear model.
  • g02gk computes parameter estimates for a set of constraints (e.g., the sum of effects for a factor is zero), from the SVD solution provided by the fitting method.
  • g02gn calculates an estimate of an estimable function along with its standard error.

Linear mixed effects regression

There are two methods for fitting linear mixed effects regression. The methods are:
  • g02ja uses restricted maximum likelihood (REML) to fit the model.
  • g02jb uses maximum likelihood to fit the model.
For both methods the output includes either the maximum likelihood or restricted maximum likelihood value and the fixed and random parameter estimates, along with their standard errors.
As the estimates of the variance components are found using an iterative procedure initial values must be supplied for each σ. In both methods you can either specify these initial values, or allow the method to calculate them from the data using minimum variance quadratic unbiased estimation (MIVQUE0). Setting the maximum number of iterations to zero in either method will return the corresponding likelihood, parameter estimates and standard errors based on these initial values.

Ridge regression

g02ka calculates a ridge regression, optimizing the ridge parameter according to one of four prediction error criteria.
g02kb calculates ridge regressions for a given set of ridge parameters.
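For a given ridge parameter k, the standard form of the ridge estimate (with the data suitably centred and scaled) is
\hat{\beta}(k) = \left(X^{\mathrm{T}}X + kI\right)^{-1} X^{\mathrm{T}} y .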

Partial Least-squares (PLS)

g02la calculates a nonlinear, iterative PLS by using singular value decomposition.
g02lb calculates a nonlinear, iterative PLS by using Wold's method.
g02lc calculates parameter estimates for a given number of PLS factors.
g02ld calculates predictions given a PLS model.

Polynomial regression and nonlinear regression

No methods are currently provided in this chapter for polynomial regression. If you wish to perform polynomial regressions you have three alternatives: you can use the multiple linear regression method g02da with a set of independent variables which are in fact simply the same single variable raised to different powers, or you can use the method (g04ea not in this release) to compute orthogonal polynomials which can then be used with g02da, or you can use the methods in E02 class (Curve and Surface Fitting) which fit polynomials to sets of data points using the techniques of orthogonal polynomials. This latter course is to be preferred, since it is more efficient and liable to be more accurate, but in some cases more statistical information may be required than is provided by those methods, and it may be necessary to use the methods of this chapter.
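The first alternative can be sketched as follows (illustrative C# only; raw powers become ill-conditioned for high degrees, which is why the orthogonal-polynomial routes are preferred):

using System;

// Set up polynomial regression as a multiple linear regression: the columns of
// the design matrix passed to g02da are simply powers of the single variable t.
double[] t = { 0.1, 0.4, 0.9, 1.3, 2.0, 2.7 };
int degree = 3;
int n = t.Length;

double[,] x = new double[n, degree];           // columns t, t^2, t^3
for (int i = 0; i < n; i++)
    for (int j = 0; j < degree; j++)
        x[i, j] = Math.Pow(t[i], j + 1);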
More general nonlinear regression models may be fitted using the optimization methods in E04 class, which contains methods to minimize the function
\sum_{i=1}^{n} e_i^2
where the regression parameters are the variables of the minimization problem.

References

Inheritance Hierarchy

See Also