SCOPE 34 - Practitioner's Handbook on the Modelling of Dynamic Change in Ecosystems

7. Multivariate Models

 

7.1 ORIGINS AND DEVELOPMENT OF MULTIVARIATE MODELS

7.2 ATTRIBUTES, VARIABLES AND VARIATES

7.3 NOTATION

7.4 SOME ESSENTIAL STAGES IN THE ANALYSIS OF MULTIVARIATE DATA

7.4.1 Choice of the attributes to be included

7.4.2 A priori weighting of attributes

7.4.3 Choice of the individuals to be included

7.4.4 Construction of the basic data matrix

7.4.5 Missing values

7.4.6 Scaling, standardization and transformation of the data

7.4.6.1 Transformation to make the variables approximately normally distributed

7.4.6.2 Transformation to make the sample variances and covariances more homogeneous

7.4.6.3 Transformation to make the relationships between the variables more nearly linear

7.4.6.4 Transformation to simplify the interpretation of the analysis

7.4.7 Data exploration

7.5 CHOICE OF METHOD OF ANALYSIS

7.5.1 Ordination

7.5.2 Discrimination

7.5.3 Classification

7.5.4 Relationships between variables

7.6 METHODS OF MULTIVARIATE ANALYSIS: INTRODUCTION

7.7 ORDINATION

7.7.1 Principal component analysis

7.7.1.1 Covariances or correlations?

7.7.1.2 Eigenvalues and eigenvectors

7.7.1.3 Calculation of eigenvalues and eigenvectors

7.7.1.4 Dimensionality of the data matrix

7.7.1.5 Sphericity

7.7.1.6 Calculation of the transformed values

7.7.1.7 Analysis of transformed values

7.7.1.8 Plotting of transformed values

7.7.1.9 Reification

7.7.1.10 Rotation

7.7.1.11 Interpretation of results

7.7.1.12 Advantages and disadvantages of principal component analysis

7.7.1.13 Case studies in the application of principal component analysis

7.7.2 Factor analysis

7.7.2.1 Mathematical basis

7.7.2.2 Practical considerations

7.7.2.3 Differences between factor analysis and principal component analysis

7.7.2.4 Case studies

7.7.3 Principal co-ordinate analysis

7.7.3.1 Background

7.7.3.2 Mathematical basis

7.7.3.3 Computational procedures

7.7.3.4 Interpretation

7.7.3.5 Alternative distance measures

7.7.3.6 Duality with principal component analysis

7.7.3.7 Case studies

7.7.4 Reciprocal averaging

7.8 DISCRIMINATION

7.8.1 Mathematical basis

7.8.2 Computational aspects

7.8.3 A distribution-free method

7.8.4 Generalized distance

7.8.5 Case studies

7.9 CLUSTER ANALYSIS

7.9.1 Minimum spanning tree

7.9.2 Single linkage cluster analysis

7.9.3 k-means clustering

7.9.4 The use of components in cluster analysis

7.9.5 Other uses of components

7.9.6 Association analysis

7.9.7 Indicator species analysis

7.9.8 Case studies

7.10 FITTING RELATIONSHIPS BETWEEN GROUPS OF VARIABLES

7.10.1 Mathematical basis

7.10.2 Multiple regression

7.10.3 The use of components in regression analysis

7.10.4 Canonical correlations

7.10.5 Case studies

Acknowledgement

References


7. Multivariate Models

Strictly speaking, multivariate models are not uniquely dynamic. However, they enable the ecologist to explore several, and often many, dimensions simultaneously, and one of these dimensions may well be time. Alternatively, the several dimensions modelled in a multivariate analysis may be related to time as an explanatory variable, and so provide an explanation, and in some cases prediction, of the changes that take place with time. In this chapter, a review of multivariate models will be given, drawing on the very extensive literature that now exists on this topic. Some hints are given about the computational aspects of multivariate models, but the reader will often be referred to texts where very extensive advice is found. Similarly, the theory of multivariate models is not discussed in detail in this chapter, because there are a great many texts where this theory can be found to any desired level of detail and mathematical complexity.

7.1 ORIGINS AND DEVELOPMENT OF MULTIVARIATE MODELS

Much of the mathematics involved in the construction of multivariate models is not new. For example, the fundamental probability distributions related to the multivariate normal distribution were derived in the 1930s, and the methods developed then are the basis for most of the multivariate methods used today. However, the calculations involved in multivariate analysis, and in the construction of multivariate models, become very tedious when the number of variates is large, and most of these computations were virtually impossible to carry out for more than four or five variates, even on electrically driven calculating machines. Until the electronic computer became generally available, therefore, only a very limited number of extensive multivariate analyses had been attempted, and the same examples were quoted in almost all of the earlier textbooks.

The development of computers has completely changed the situation, with the result that multivariate models have now become an important addition to the range of models which may be used by the ecologist. Many of the computations can be done on quite simple microprocessors, so that the application of multivariate models is not restricted to ecologists with access to large mainframe computers. Indeed, there can often be distinct advantages in modelling multivariate data in several stages, with careful examination of the results at each stage, before proceeding to the next stage of the analysis. In consequence, a rapidly increasing collection of examples has appeared in the scientific literature, almost too many to review in one chapter of this Handbook. Nevertheless, these multivariate methods represent a much neglected class of models, especially in dynamic or time related applications.

Many of the multivariate techniques available today have been developed in association with taxonomy and other branches of the more descriptive sciences. This association has had several interesting and unexpected effects. For example, what used to be largely intuitive has become increasingly more formalized and more quantitative. The sudden explosion of computers and computer science has enabled scientists to explore a wide range of techniques, and this new field of activity has attracted the attention of statisticians and mathematicians, with the consequent development of an even wider variety of methods and of applications to problems in many different scientific fields. This rapid development has not been an unmixed blessing, as there now exists a bewildering variety of numerical techniques, the properties of many of which are not adequately known. Furthermore, while numerical methods can often greatly help the ecologist to investigate the structure of his data, the results of the analyses provided by these methods still need to be interpreted, and there are still very few ecologists with experience in such interpretation.

It is relatively easy to perceive the structure of multivariate data when discontinuities are obvious, but such a situation is not typical of many ecological applications. Much of what we observe in nature changes more or less continuously in one or more properties, but not necessarily by the same degree in each of the properties. It is these more gradual changes which provide the greatest problems for the scientist to decide where or how to draw boundaries, or, indeed, whether to draw any boundaries at all. The aim of this chapter, therefore, is to clarify some of the issues involved in the use of numerical methods to search for structure and dynamic change in ecological data. Ecological applications are less well developed than those in plant and animal taxonomy, partly because of the diversity of interests of ecologists, and partly because of the nature of the ecological data themselves. The methods appropriate to the taxonomy of organisms are not, therefore, necessarily the most appropriate to dynamic ecological problems (Clifford and Stephenson, 1975).

By virtue of the speed with which they can process large quantities of data and perform complex calculations, computers have revolutionized the handling of scientific data. However, there is a danger that the power of computers sometimes leads those unfamiliar with the mathematical theory of the methods to assume that what is produced by the computer necessarily contains some objective 'truth' which should not be questioned. Nowhere is this more the case than in multivariate analysis. Anderberg (1973) has commented that

cluster analysis methods involve a mixture of imposing a structure on the data and revealing the structure which actually exists in the data. The notion of finding natural groups tends to imply that the algorithm should passively conform like a wet tee-shirt. Unfortunately, practical procedures involve fixed sequences of operations which systematically ignore some aspects of the structure while intensively dwelling on others.

The use of a computer does not release the ecologist from thinking; quite the reverse. Multivariate models, in particular, lay traps for the unwary, and it is the purpose of this chapter to spring many of these traps.

7.2 ATTRIBUTES, VARIABLES AND VARIATES

Multivariate data in ecology, as in other applied sciences, consist of sets of attributes or scores for each of a number of variables, this number being greater than two and sometimes large. Conventionally, the term attribute is used by statisticians to represent the qualities possessed or not possessed by an individual. The term variable is used for qualities which are measured quantitatively along some continuous scale. The strict definition is of a quantity that may take any one of a specified set of values, where these values may be continuous, as in measurements of height or weight, or discontinuous, as in counts of individuals. By extension, the term is sometimes also used to denote non-measurable characteristics. For example, sex may be regarded as a variable in a sense, as any individual may take one of two values - male or female.

A variate is a quantity which may take any one of the values of a specified set, with a specified relative frequency or probability. Such variates are also known as random variables, and they are regarded as being defined not merely by a set of permissible values, like any ordinary mathematical function, but by an associated frequency or probability function expressing how often these values appear in the application under discussion. There are many situations in ecology where models have to capture the behaviour of more than one variate, and such models are therefore known collectively as multivariate, an expression which is used rather loosely to denote the analysis of data which are multivariate in the sense that each individual bears the values of several variates. By extension, some of these values may be variables rather than variates.

This rather general description of attributes in ecology, as elsewhere, is convenient, as it avoids the need to differentiate between continuous and discrete data in general discussion where the nature of the data is not in question. Nevertheless, formal discussion of multivariate methods often focuses on the different kinds of attributes (see, for example, Sneath and Sokal, 1973; Clifford and Stephenson, 1975; Williams, 1976). The most common distinctions are made between: (i) binary attributes, e.g. presence or absence; (ii) disordered multi-state or nominal attributes, such as colour or soil type; (iii) ordered multi-state or ordinal attributes, e.g. rare, common, abundant, etc.; (iv) meristic or discontinuous attributes, e.g. number of petals; and (v) continuous or numeric attributes. These categories are not, however, distinct, because they depend to some extent on the sampling procedures used, and data in one form can often be transformed into another.

There are marked differences of opinion about the value of binary data in ecology, but the consensus of opinion seems to be that other forms of data are preferable (e.g. Clifford and Stephenson, 1975). In most branches of ecology, some measure of dominance is often regarded as important in describing vegetation, so that the results of analyses of numerical data are regarded as being more informative than those of binary data. For example, Williams et al. (1973) found that while the presence or absence of plant species was adequate in a simple study involving only 8 sites, there was 'some advantage' in using the numbers of each species in a study of 10 sites, and quantitative data of the relative dominance of species were distinctly preferable for a study of 80 sites. Similarly, Barkham (1968) found quantitative data to be more informative than presence-absence in a study of a Cotswold beechwood.

While binary or presence-absence data appear to be adequate when there are major differences in species distributions between sites, they have not proved to be very useful in detailed studies of vegetation dynamics where there are relatively few species and less clear-cut differences between sites. Indeed, the use of binary data for the study of dynamic change in ecology can usually only be justified if it is difficult to obtain anything else, or if there is a declared lack of interest in the information which will be lost by the use of binary data instead of, say, continuous data. Similarly, the conversion of either meristic or continuous data to binary data is usually unsatisfactory. Division of a variable which is approximately normally distributed into two parts, for example, leads to an attribute with all the values on either side of the dividing line having identical binary scores. However, there may, in some circumstances, be instances where continuous data have properties which make conversion from one form to another logical. One such example is an attribute which can take a wide range of values, but which has a concentration at one value (usually zero), as in the counts of the number of parasites in a host species. It may then be possible to regard such an attribute as discrete, and to score it as if it were composed of a few groups, e.g. zero, low, medium, high. Nevertheless, if the aim of the analysis is to find a useful or meaningful grouping of the data, a coarsely grouped attribute may exert a disproportionate influence on the result (Marriott, 1974).

Figure 21. Basic data matrix for multivariate analysis

7.3 NOTATION

Whether the basic attributes to be used in the study are binary, multi-state, discrete or continuous variables, the same notation can be employed to represent the data which will be used in any multivariate analysis. Figure 21 displays this notation as a matrix, X, of the values of each of p attributes for each of n individuals. Thus, x11 represents the value of the first attribute for the first individual, x32 the value of the second attribute for the third individual, and, more generally, xrs the value of the sth attribute for the rth individual.

There may, of course, be a priori groupings or classifications of the p variables or the n individuals. Indeed, it is the presence of such additional information which may influence the choice of an appropriate multivariate technique. The existence of such information does not, however, necessitate any change in the notation.
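As a concrete illustration of this notation, the short sketch below builds a small basic data matrix in Python with numpy; the language and library are an assumed choice for illustration only, and the values are invented. It shows how xrs in the 1-based notation of Figure 21 maps on to zero-based array indexing.

```python
import numpy as np

# Basic data matrix X: n = 4 individuals (rows), p = 3 attributes (columns).
# All values are invented for illustration.
X = np.array([
    [5.1, 3.5, 1.4],   # individual 1
    [4.9, 3.0, 1.3],   # individual 2
    [6.2, 2.9, 4.3],   # individual 3
    [5.8, 2.7, 5.1],   # individual 4
])

n, p = X.shape
# numpy indexes from zero, so x_rs in the 1-based notation is X[r-1, s-1]:
x_11 = X[0, 0]   # value of the first attribute for the first individual
x_32 = X[2, 1]   # value of the second attribute for the third individual
print(n, p, x_11, x_32)
```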

7.4 SOME ESSENTIAL STAGES IN THE ANALYSIS OF MULTIVARIATE DATA

7.4.1 Choice of the attributes to be included

The first stage in any analysis is the choice of attributes to be included. It perhaps needs to be stressed that this is the most important of all the stages described. Any choice of attributes represents an hypothesis about the relevance of the variables to the solution of the problem under consideration. Much time and effort have been wasted by the application of multivariate analysis to sets of data containing a 'rag-bag' of variables included because they were easy to measure, or because they happened to be available, without any apparent consideration of the logical design of the investigation. Indeed, some scientists appear to regard multivariate analysis as a substitute for thought, expecting the methods to compensate for failures of logic which they would not permit in the analysis of univariate data (Jardine and Sibson, 1971; Sneath and Sokal, 1973; Williams, 1976).

As the selection of attributes to be included will be heavily dependent on the objectives of the research, some guidance can frequently be obtained by considering the following questions.

  1. What broad dimensions of the variability of the individuals are relevant to the problem? In seeking to describe the growth of a particular species of plant, for example, we may only be concerned with the activity of the primary meristems, and our variables will be confined to measures of shoot length, height, leaf extension, etc. We may, alternatively, be concerned with growth of both primary and secondary meristems, so that measures of diameter, sectional area or volume of various parts of the plant may need to be included.

  2. Do the variables cover the total range of the variability of the individuals in a fairly uniform manner? Inclusion of a single variable measuring vigour or health in a larger group of variables measuring size may lead to a relatively inefficient analysis.

  3. Are some of the variables logically dependent on the others? It is not usually particularly helpful to include all of the percentages which must add up to 100, for example. Similarly, inclusion of a variable which is derived arithmetically from one or more of the others seldom adds much information to the analysis.

  4. Is it possible to identify any structure among the variables? For example, the total set of variables may consist of two groups, one of which can be regarded as a response to the other group in a complex investigation.

Nevertheless, having stressed the importance of the choice of the variables to be included in the analysis, there is a good deal of flexibility resulting from the use of the computers essential to the application of multivariate analysis to practical problems. Further variables may be added, if they become available or can be obtained, and variables may be deleted after a trial analysis, if it is found that they contribute little to the interpretation of the analysis. Indeed, one of the main advantages of computer-based multivariate analysis is that it enables the research worker to keep many variables under constant review, modifying the selection of variables as understanding of the problem increases, and eliminating those variables which contribute little to the analysis.

The number of variables that can be included in one analysis is limited only by practical difficulties of computation at critical stages of the analysis, and by the size of the computer that is available. Generally speaking, there is seldom any difficulty in dealing with up to 60 variables. For more than 60 variables, large computer systems are usually necessary, but, except for certain kinds of analysis of data consisting entirely of binary attributes, there is relatively little to be gained from including very large numbers of variables in the same analysis. It will frequently be preferable to divide large sets of variables into groups and to analyse the groups separately, investigating the relationships between the groups subsequently. Alternatively, it is sometimes possible to analyse the data by adding and discarding variables in sequential steps; such methods are not only usually more economical of computing resources, but are often more easily interpreted than the analysis of all variables simultaneously.

The number of variables is not likely to be a serious limitation in practical work. There are very few problems for which large numbers of independent dimensions of variability can be established, quite apart from the difficulty of collecting the data in the first place.

7.4.2 A priori weighting of attributes

The question of whether, or how, to weight data a priori is an important problem in taxonomy, and specialists in different groups of organisms have their own ideas about the importance of different attributes in this specific application. One solution to the problem is to state the basis of weighting, so that the reader may judge the merits of each case (Clifford and Stephenson, 1975). Sneath and Sokal (1973) consider that equal weighting is desirable; their  approach can be defended on several independent grounds, and is probably the only practical solution. Jardine and Sibson (1971) conclude that certain kinds of weighting which taxonomists use intuitively are, in fact, incorporated in the calculation of many multivariate techniques as part of the analysis.

It is certainly likely that some of the attributes in any given set of data might reflect some single underlying feature, so that to use all the attributes is an implicit weighting of that feature. One way of avoiding this effect is to use principal component analysis to identify the number of independent dimensions contained in the data set, as described in Section 7.7.1.

7.4.3 Choice of the individuals to be included

Having decided on the attributes to be included in the analysis, it is necessary to consider carefully the individuals. Ideally, of course, the logic of this selection should have been decided long before any data were collected at all, as part of the design of the experiment or survey which preceded the analysis. However, as with the choice of the attributes, it frequently happens that data are presented for multivariate analysis without much thought about the individuals that have been measured or assessed.

It will commonly be assumed that the individuals are samples from some  population. Only rarely will it occur that the individuals are the whole population, i.e. represent all the possible individuals for which a description is required. If the individuals are a sample from a population, has that population been defined, and can the individuals be regarded as a representative sample of the population? This may seem a somewhat pedantic question, but it is vitally important for the interpretation of the analysis. At some stage of the argument, it is almost certain that, explicitly or implicitly, someone, if not the analyst, will want to make the inference that, because the sample behaves in a particular way, the population behaves in the same way. Unless the individuals are taken from a defined population by some objective method of sampling, there may be very little value in undertaking multivariate analysis of the data.

Unfortunately, the apparent complexity of multivariate analysis seems to bemuse scientists into logical traps that they would readily avoid if they were analysing only one variable at a time. Most scientists are well aware of the dangers of attempting to draw inferences from subjective samples, and the use of random or systematic sampling as a protection against subjective selection is well understood. It seems odd, therefore, that it should be necessary to assert that multivariate analysis is a statistical tool, and is subject to exactly the same logical constraints as any other statistical technique.

If, as is usually essential, the individuals to be included in the analysis are a representative sample from some defined population, there will be a 'structure', i.e. a sampling design or experimental design associated with the individuals, which will represent independent samples from the population. In more complex cases, the population may have been stratified before sampling, so that the individuals come from several strata, and it will be necessary to identify the stratum from which each individual came. More rarely, the 'individuals' may represent 'plots' of a designed experiment, and, in such cases, it will be desirable to distinguish between 'blocks', or some other device to control experimental variability, and 'treatments', which may have a factorial structure.

The limit on the number of individuals that can be included in an analysis will depend on the forms of data storage available on the computer used for the analysis, but it is generally possible to analyse large numbers of individuals once the data have been summarised in some convenient way. However, the number of individuals will limit the use of particular methods of analysis in some instances; for example, cluster analysis is feasible only for relatively small sets of individuals, unless the analyst is prepared to spend a considerable amount of time (and money) on computing. Where data are available from very large numbers of individuals, some method of deriving a sample of individuals will often be desirable.

7.4.4 Construction of the basic data matrix

When the attributes and the individuals to be included in the study have been defined, the next stage in the analysis is to construct the basic data matrix. This matrix will have as many rows as there are individuals, and as many columns as there are attributes, with the values for each attribute given in a standard order for each individual.

Many analysts are tempted, in constructing the basic data matrix, to add to the measured variables additional sets of simple ratios of the measurements, or even to replace the variables by ratios constructed from them. This temptation should, however, be resisted. As Barraclough and Blackith (1962) have pointed out, 'the use of ratios implies certain prior knowledge about the nature of the systematic variation which, if set out explicitly, would almost certainly be denied by any experienced worker'. If a ratio hypothesis is preferred, there are simple ways of introducing the hypothesis at a later stage in the analysis.

Once the basic data matrix has been constructed, it will then be necessary to transfer the data to some machine-readable format. This format will depend on the input medium available on the computer to be used for the calculations. Increasingly, however, now that interactive computer systems are becoming more readily available, data are keyed directly to disk, magnetic tape or random access memory. This form of direct entry of data, under program control, is more efficient, as input errors can be detected rapidly, and the use of programmed data editing and sorting enables error-free data to be obtained more quickly and more economically. The construction of the basic data matrix is therefore the stage at which the transfer should be made to a computer-readable medium of data storage. The point is emphasized because much valuable time can be wasted by attempting to do some of the early stages of the computations by hand or on programmable calculators. Not only can serious errors be introduced into the subsequent stages of the calculations, but it may be difficult or impossible to check the early calculations in any satisfactory way, so that a feeling of uncertainty is introduced into the later stages of the interpretation of the analysis.

7.4.5 Missing values

When the basic data have been assembled, it will frequently happen that there are missing elements in the two-way matrix of values, and the analyst will need to decide how to overcome the absence of these elements. There are no simple solutions to the problem. Most methods of multivariate analysis require complete sets of data for the calculations to be feasible, and some way will need to be found to eliminate the missing values or replace them in the basic data matrix.

The simplest solution is often to eliminate either rows or columns (or both) of the matrix so as to remove the missing values. This elimination needs to be done with care, however, as it is very easy to restructure the analysis so that it becomes irrelevant to the objectives of the research. The feasibility of eliminating rows and columns will also depend on the number of missing values and the way in which they are distributed in the matrix.

If, for example, all of the missing values are concentrated in one column, representing a variable which was difficult to measure or which was not present on some individuals, consideration might be given to eliminating that variable from the subsequent calculations. Before doing so, however, it is important to ask 'Why was this variable included in the first place?' If it was included as one of a set of related variables thought to measure some particular dimension of variability, there may be little value in retaining it in the analysis: indeed, the fact that it has not been possible to obtain values of that variable for every individual has already indicated that it is not a very convenient measure of the variability that is to be investigated. On the other hand, the variable may be the only representative of an important dimension that must be included in the analysis if it is to meet the research objectives, and, in this case, one would be extremely reluctant to eliminate the variable if some other way of avoiding the missing values can be found. Where two or more variables have substantial numbers of missing values, any attempt to avoid the missing values by eliminating the variables may substantially restructure the whole analysis.

Similarly, if all, or most, of the missing values are concentrated in a few individuals, these individuals could possibly be eliminated from the analysis. Again, eliminating these individuals may not have any serious implications for the interpretation of the results, but might have the effect of changing the population from which the individuals are drawn. It is relatively easy to think of situations in which the difficulty of obtaining particular measurements is characteristic of certain parts of the total population, and elimination of the individual from that part of the population effectively changes the definition of the population from which the sample individuals are taken. The effects of eliminating individuals may be even more serious when there is an a priori structure imposed upon the individuals, as, for example, when the individuals represent 'plots' of a designed experiment. When variables and individuals have to be eliminated in order to provide a basic data matrix which is free of missing values, difficulties arising from the change in the research objectives may be particularly serious.

Difficult though the decision to eliminate variables, individuals, or both, from the analysis may be, reducing the basic matrix to a matrix without missing values may be the easiest solution that is available. Various attempts have been made to find a technique for replacing missing values in multivariate data, but none of the solutions so far devised is completely satisfactory. Where there is a definite structure that can be imposed upon the individuals, any of the missing-value techniques appropriate to the experimental design may be used to fit small numbers of missing values for each variable separately. Alternatively, multiple regression analysis may be used to fit the missing values, the prediction equations for the missing values being derived from those individuals for which complete data are available. Any missing values fitted in these ways must be examined very critically: it is better to reject an individual or variable than to include a 'wild-looking' fitted value. In many situations, there may be little to be gained from anything but the replacement of the missing value by the mean of the variable for the other individuals.
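As a minimal sketch of two of the options discussed above, namely elimination of incomplete individuals and replacement of a missing value by the attribute mean, the following illustration assumes numpy and uses NaN to mark missing elements; the data are invented.

```python
import numpy as np

# Basic data matrix with missing elements marked as NaN (values invented).
X = np.array([
    [2.0, 5.0,    1.0],
    [3.0, np.nan, 2.0],
    [4.0, 6.0,    np.nan],
    [5.0, 7.0,    3.0],
])

# Option 1: eliminate every individual (row) containing a missing value.
complete = X[~np.isnan(X).any(axis=1)]

# Option 2: replace each missing value by the mean of that variable
# over the individuals for which it was observed.
column_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), column_means, X)
```

Regression-based fitting of missing values, as mentioned above, is a further refinement of the same idea, but any fitted values should be inspected critically before the analysis proceeds.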

Perhaps the most important question to ask, however, is the following: 'With these missing values, is any kind of multivariate analysis worth attempting?' The right answer may well be 'No'.

7.4.6 Scaling, standardization and transformation of the data

There are three ways in which multivariate data may be modified: (i) scaling; (ii) standardization; (iii) transformation. Scaling may be done in a variety of ways, the simplest being to add or subtract a constant from all values of a given attribute. Another method is to multiply or divide by a constant. Standardization implies that the value of each attribute for each individual is expressed as a deviation from the mean of that attribute, and then divided by the standard deviation. This procedure has the effect of reducing all attributes to unit standard deviation, and of reducing the magnitude of each attribute. Other methods include ranging (Gower, 1971) and rankits (Sokal and Rohlf, 1962). Standardization of attributes makes them dimensionless, and so renders them additive. It also reduces the range, so that single attributes, or a small number of attributes, do not automatically dominate the analysis. The choice of the method of standardization is effectively a form of weighting.
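The following sketch, again assuming numpy and using invented values, illustrates simple scaling and the standardization just described, i.e. deviation from the attribute mean divided by the standard deviation.

```python
import numpy as np

X = np.array([[2.0, 50.0],
              [3.0, 60.0],
              [4.0, 55.0],
              [5.0, 75.0]])   # invented values, two attributes

# Scaling: add or subtract a constant, or multiply or divide by one.
X_shifted = X - 1.0
X_rescaled = X / 10.0

# Standardization: deviation from the attribute mean, divided by the
# standard deviation, giving each attribute zero mean and unit s.d.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```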

The term transformation is used for methods which seek to change the shape of the frequency distribution of the data, usually in the hope of obtaining an approximately normal distribution. In univariate statistics, for example, transformations may be used to satisfy the theoretical requirements of the analysis of variance. Classical multivariate theory has been based largely on the multivariate normal distribution, and, although multivariate normality is not essential for some multivariate methods except in their sampling theory (e.g. in canonical correlation), not much is known about the robustness of the methods and the effects of large departures from normality. Some methods are sensitive to non-normality, e.g. cluster methods based on the assumption of a mixture of multivariate normal distributions. In numerical classification, dissimilarity measures may be sensitive to certain types of data. Thus, measures of Euclidean distance, including variance measures, are particularly sensitive to data in which there are occasional very large values (Clifford and Stephenson, 1975).

Transformation of the basic data may be introduced for the following purposes:

  1. to make the variables approximately normally distributed;

  2. to make the sample variances and covariances of separate groups more homogeneous;

  3. to make the relationships between the variables more nearly linear;

  4. to simplify the interpretation of the analysis, e.g. to provide ratios of variables rather than linear combinations.

Although in practice it is difficult to separate these purposes, we may consider them in turn.

7.4.6.1 Transformation to make the variables approximately normally distributed

It may seem desirable to transform each of the variables included in the analysis so that they are all approximately normally distributed. So much of statistical theory is dependent on an appeal to the normal distribution that such transformation will sometimes seem to be almost a standard response to the analysis of data. Almost certainly, variables will occur in data matrices which cannot, by any stretch of the imagination, be regarded as even approximately normally distributed. The emphasis on the multivariate normal distribution in most texts dealing with multivariate analysis has the effect of misleading many users of multivariate techniques into thinking that the assumption of a normal distribution is essential for any practical application.

In fact, very little appeal to the multivariate normal distribution is necessary in the application of most forms of multivariate analysis. The essential mathematics of multivariate analysis does not depend on the assumption of the multivariate normal distribution, unless tests of significance are invoked. Such tests of significance are not usually required for the interpretation of the data.

Certainly, however, there may be little harm in introducing various transformations in order to make the distributions of the variables more nearly normal. The most frequent transformation for this purpose will usually be the logarithmic transformation, but the square root transformation may sometimes be suggested for variables which follow an approximate Poisson distribution, especially where all the values are less than 25, and the √(x + 3/8) transformation may be used when all the values of one variable are less than 10. The arcsine transformation (y = sin⁻¹ √p) may be useful when a variable consists of percentages of a fixed number.
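A sketch of these transformations in numpy is given below; the data are invented, and the logarithmic transformation is shown with an added constant of 1 so that zero counts can be handled, which is a common practical variant rather than anything prescribed here.

```python
import numpy as np

counts = np.array([0, 2, 5, 9, 3])              # small counts (invented)
percentages = np.array([10.0, 25.0, 60.0, 90.0])

log_y     = np.log(counts + 1)                  # logarithmic; +1 admits zeros
sqrt_y    = np.sqrt(counts)                     # square root, near-Poisson data
sqrt38_y  = np.sqrt(counts + 3.0 / 8.0)         # sqrt(x + 3/8) for small counts
arcsine_y = np.arcsin(np.sqrt(percentages / 100.0))  # y = arcsin(sqrt(p))
```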

However, the value of transformations for the purpose of making the variables approximately normally distributed may be severely limited, quite apart from the fact that the assumption of a multivariate normal distribution is seldom of practical value in the interpretation of the results. The multivariate normal distribution places constraints on the correlation between the variables, and these will usually be affected in unknown ways by the transformations.

7.4.6.2 Transformation to make the sample variances and covariances more homogeneous

It is frequently desirable to reduce the heterogeneity of the sample variances and covariances, especially in canonical variate analysis, where it is necessary to pool the 'within-group' variances and covariances. Again, the logarithmic transformation will be the one most frequently used for this purpose, although other transformations are also used regularly, including the square root, arcsine, and inverse transformations. It is worth checking that the proposed transformation will have the desired effect, as it is relatively easy to introduce some distortion into the results by the unwise use of transformations for one purpose.

7.4.6.3 Transformation to make the relationships between the variables more nearly linear

Most methods of multivariate analysis assume linear combinations of variables, and are hence more appropriate for use in situations in which the relationships between the variables are linear rather than curvilinear.

There are occasions on which the use of a transformation for one or more variables will greatly improve the linearity of the relationships between the whole set of variables. Again, logarithmic transformations are the most frequently invoked, and are often valuable when the variables show exponential or hyperbolic relationships. Note, however, that the use of transformations to make the variables more nearly linearly related also has effects on the sample variances and covariances, and these effects may be undesirable. Furthermore, while it may be easy to find a transformation which will linearize the relationship between one pair of variables, it may not be easy to find one that will linearize the relationships between all pairs (Kendall, 1975).

7.4.6.4 Transformation to simplify the interpretation of the analysis

For some applications, it may be desirable to introduce a transformation of the variables of the basic data matrix to simplify the interpretation of the results. One commonly occurring example of this kind of transformation has already been touched upon, i.e. when it is felt desirable to express the components of variation as ratios of the basic variables rather than as linear combinations of the variables. By transforming each of the variables of the basic matrix to its logarithm, the linear combinations of the transformed variables which are provided by the eigenvectors become complex ratios of the original variables. Such ratios may be easier to interpret than the linear combinations of the original variables, although, again, the logarithmic transformation has profound effects, which may or may not be beneficial, on the variability of the variables and the linearity of the relationships between them.
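A small numerical check of this point, with purely hypothetical loadings, shows that a linear combination of logged variables is equivalent to a power of a ratio of the original variables.

```python
import numpy as np

# Hypothetical loadings: a component weighting (ln x1, ln x2) by (0.7, -0.7)
# is equivalent to the ratio x1/x2 raised to the power 0.7.
x1, x2 = 12.0, 3.0
score = 0.7 * np.log(x1) - 0.7 * np.log(x2)
assert np.isclose(np.exp(score), (x1 / x2) ** 0.7)
```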

The choice of whether or not to transform some or all of the variables of the basic data matrix is essentially subjective. In a sense, an analysis based on transformed variables represents an alternative hypothesis to the analysis based on untransformed variables, just as the choice of variables to be included in the analysis represents an hypothesis of the relevance of the variables.

Various types of transformation have been used in practical applications (Sneath and Sokal, 1973; Clifford and Stephenson, 1975). Andrews et al. (1971) have discussed the problem of transformations in which the transformed variables are functions of the original variables, collectively rather than separately, and have suggested some techniques which are occasionally useful. Clifford and Stephenson (1975) suggest that it remains uncertain whether the transformation required to produce normality of data is also the transformation which will produce optimal ecological 'sense', and that optimal ecological classificatory 'sense' is generally obtained by using a weaker transformation than that required to transform data to normality. In any multivariate analysis, careful thought needs to be given to the nature of the data and how they relate to the methods to be used. Despite any constraints, attributes will not, in general, contribute equally to the final classification. Any attribute which varies little over the population has little or no discriminating power.

7.4.7 Data exploration

Once the data have been assembled and stored on some medium accessible to the computer, they should be examined carefully as a preliminary step to any formal multivariate analysis. Methods for such data exploration have been presented by Tukey (1977), Mosteller and Tukey (1977), and McNeil (1977), among others. They include the calculation of medians, quartiles, means and standard errors, the identification of outliers and extreme values, the plotting of histograms and boxplot diagrams to reveal skewness of distributions, and the trial of transformations to correct various deficiencies. Plotting of two variables at a time on scatter diagrams helps to reveal non-linearity of the relationship between variables, and, again, transformations can be tried to correct for the non-linearity and heteroscedasticity of the relationships.
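As an illustration of the kind of preliminary screening described above, the sketch below (numpy assumed, data invented) computes medians, quartiles, means and standard errors, and applies the usual boxplot convention for flagging potential outliers.

```python
import numpy as np

x = np.array([1.2, 1.9, 2.1, 2.4, 2.7, 3.1, 3.3, 9.8])  # invented sample

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
mean = x.mean()
standard_error = x.std(ddof=1) / np.sqrt(len(x))

# The usual boxplot convention: flag values beyond 1.5 interquartile
# ranges outside the quartiles as potential outliers.
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(median, (q1, q3), mean, standard_error, outliers)  # flags 9.8
```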

All of these methods are now readily available on modern interactive computer systems, and are invaluable in revealing previously undetected errors in the basic data, as well as in increasing the general understanding and 'feeling' for the data. The present tendency for the use of computers to reduce the degree of contact between scientists and their data is regrettable and totally unnecessary. The modern, time-sharing, interactive computer or microprocessor offers, as never before, opportunities to explore data before embarking on any major analysis. In this way, it is possible to check that the assumptions in the use of particular methods of analysis are justified.

7.5 CHOICE OF METHOD OF ANALYSIS

There are now so many different methods of multivariate analysis described in the literature that it would be impossible to compile a complete catalogue of methods. Fortunately, all of the methods fall into a few major categories, so that it is sufficient to describe these categories and some of the more important techniques within each category. Anyone embarking on the use of multivariate techniques to model dynamic change in ecosystems would be well advised to begin with one of the well-established techniques before moving to the many less well-tried methods.

The multivariate models presented in this chapter will be summarized in four main categories, namely ordination, discrimination, classification and the fitting of relationships.

7.5.1 Ordination

The basic n x p data matrix defines the position of the n individuals in a p-dimensional attribute space. Ordination procedures in multivariate analysis have as their aim the arrangement of the individuals in as small a number of dimensions as possible, while retaining as much as possible of the information contained by the data. The reduction in dimensionality from the total number of attributes to (hopefully) some smaller number makes the data easier to handle mathematically by: (i) replacing attributes which are linearly related, or nearly so, by a single composite attribute; (ii) making graphical representation easier by referring the data points to axes which are orthogonal, i.e. at right angles to each other; and (iii) increasing the possibility of reification, i.e. the interpretation of the mathematical expression of the data in terms of the original problem, by providing a useful insight into their structure.

Ordination procedures are necessarily related to the amounts of information which are available about either the attributes or the individuals. Any cluster of attributes or individuals will need to be related to this information. But the analysis begins by assuming that the basic data matrix is unpartitioned, that there are no a priori divisions of this matrix which must be incorporated into the assumptions underlying the multivariate model, and that the external information will be determined after these parameters have been estimated. Figure 22 emphasizes this lack of a priori divisions of the matrix, by contrast with the figures which follow.

Figure 22. Undivided matrix for ordination models

7.5.2 Discrimination

The purpose of discriminant analysis is to examine how far it is possible to distinguish between members of various groups on the basis of observations made about them (Marriott, 1974). Thus, the basic data matrix is divided, a priori, into two or more groups. The multivariate model for discrimination provides for:

  1. tests of significance for differences in the values of the attributes between the groups;

  2. allocation rules for identifying further individuals as belonging to one of the groups by some kind of discriminant function based on the measured attributes;

  3. estimates of the probability of correct allocation by the rules that are derived from the model.

As shown in Figure 23, the basic data matrix now has divisions of the individuals into two or more groups, based on information which is external to the analysis itself. It is not necessary that there should be the same number of individuals in each group.

Figure 23. Division of basic matrix into r a priori groups for discriminant analysis

7.5.3 Classification

Multivariate models for classification or cluster analysis fall into four sub-categories, which overlap to a considerable extent.

  1. What is the best way of dividing the individuals into a given number of groups, a procedure called 'dissection' by Kendall and Stuart (1979)? The resulting dissection need only be a convenient way of dividing the given set of individuals into groups, as opposed to some 'natural' division of the data.

  2. What is the best way of dividing the given set of individuals into more or less homogeneous and distinct categories? In ecology, we often expect a set of individuals to reflect biological characteristics which enable organisms or communities to be classifiable into groups which are in some way 'natural'.

  3. Are there discontinuities in the multivariate attribute space in which the individuals are described which will help us to classify those individuals? Here, we are content to be guided by the analysis in the identification of distinct groups in the space defined by the attributes.

  4. What is the best way of constructing a strictly hierarchical classification and plotting a dendrogram for these individuals? Expressing the relationship between the individuals in the form of a tree implies a special kind of structure, and is frequently confused with a search for natural groupings.

In practice, these four sub-categories are mixed up in any single application of multivariate models. A very large number of possible methods have been reviewed by Cormack (1971). The choice of an appropriate method therefore becomes quite difficult, and only a few basic approaches will be described in this Handbook.

The division of the basic data matrix is the same as in Figure 23, but the individuals must now be re-ordered. The divisions of the individuals into groups are not made on the basis of external information, but solely on the basis of information contained in the data matrix. The ordering of the individuals within the matrix will, however, usually need to be changed to reflect the groupings or dissections which are identified by the analysis.

7.5.4 Relationships between variables

The fitting of defined relationships between a single variate, y, and a set of variables, x1, x2, ..., xp, which may or may not be random variables, is a well-known problem usually solved by the technique of multiple regression. Strictly speaking, multiple regression is not a multivariate model, but it is a technique so widely used, and often misused, that a brief account of it will be given later in this chapter. The multivariate extension of the concept to the relationship of a set of variates with another set of variables is of particular importance in the modelling of dynamic change in ecosystems, but it is an extension which has been somewhat neglected in ecological research. The technique defines the relationship between any two sets of attributes by pairs of newly defined variables. It is then possible to test whether there is evidence for any relationship, whether the relationship is accounted for entirely by the first pair, or the first few pairs, of variables, and whether some of the original variables can be left out of consideration without significantly affecting the conclusions (Marriott, 1974).

As shown in Figure 24, the divisions of the basic data matrix now occur between sets of variables. In a complex case, there may be several groups of variables, but the appropriate model for this case is an extension of the simple model case for only two groups of variables. Ideally, both sets of attributes should consist of sets of continuous variables, but modifications of the technique exist for other types of variables. Again, it is not necessary that the two groups of attributes should be of the same size.

The choice between these four main categories necessarily depends on the purpose of the investigation. In part, the choice also depends on how much is known about either the attributes or the individuals or both. In other words, the multivariate model is embedded in a whole series of hypotheses, assumptions and objectives. Unfortunately, this fact is seldom made clear in any of the texts on multivariate analysis, with the result that the basic data matrix is usually regarded as being self-sufficient in representing all that is known about the problem the multivariate model is being used to solve. This section of the Handbook on the choice between methods is, therefore, intended to emphasize the importance of the information which is external to the data matrix itself. Descriptions of some of the more important techniques within each of the main categories are given in the following sections. 

Figure 24. Division of basic matrix into two groups of variables for fitting of relationships between the two groups

7.6 METHODS OF MULTIVARIATE ANALYSIS: INTRODUCTION

The description of the underlying mathematics of multivariate techniques in this Handbook is restricted to the information which the more mathematically inclined reader may find useful in relating the methods applied to the practical analysis of data to algebraic theory. There are plenty of alternative presentations of the mathematical theory of multivariate analysis, and anyone who wants to drink deeply from this particular source of inspiration would do better to read these other texts, and notably Anderson (1958), Hope (1968), Morrison (1967), Seal (1964), and Kendall and Stuart (1979). For the more limited purposes attempted here, the excellent paper on the algebraic basis of the classical multivariate methods by Krzanowski (1971) has been used as both a source of more or less direct reference and a consistent notation. An understanding of matrix algebra is essential for any compact presentation of the mathematics of multivariate analysis, and readers requiring comfort and instruction on this subject should consult Searle (1966) or Bellman (1960).

The essential nature of multivariate data has already been illustrated in Figure 21. We suppose that p attributes are observed on each individual in a sample of n such individuals. Generally, the observed attributes will be variates, i.e. they will be values of a specified set with a specified relative frequency or probability. In matrix terms, the values given to the variates for the ith individual are the scalars xij, where j can take all the values from 1 to p. The whole series of values for the ith individual is given by the vector:

xi' = (xi1, xi2, xi3, ..., xip)

The complete set of vectors represents the data matrix X with n rows and p columns:

X' = (x1, x2, x3, ..., xn)

and is a compact notation for the whole sample. The ith row of the matrix gives the values of the p variables for the ith sample: the jth column of the matrix gives the values of the jth variable for each of the individuals in the sample. Note that, at this point, no assumptions are made about the extent to which the sample can be regarded as representative of some defined population.

This compact notation for the data matrix has some convenient computational properties. Without loss of generality, we may assume that the variates are measured about their means for the sample, so that all the column means are zero. Then, the sample variance-covariance matrix may be calculated as:

[1/(n − 1)] X'X

where X' is the transpose of the original data matrix. The sample sums of squares and products, X' X, and hence the variance-covariance matrix, has the mathematical property of being real, symmetric and positive semi-definite.
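As a check on these properties, here is a minimal numpy sketch (an assumed tooling choice; the Handbook itself prescribes no software) which computes the variance-covariance matrix from a column-centred data matrix and verifies that it is symmetric and positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))            # n = 20 individuals, p = 4 variates

Xc = X - X.mean(axis=0)                 # measure variates about their means
S = (Xc.T @ Xc) / (Xc.shape[0] - 1)     # [1/(n - 1)] X'X

assert np.allclose(S, S.T)                       # real and symmetric
assert np.all(np.linalg.eigvalsh(S) >= -1e-12)   # positive semi-definite
```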

Geometrically, the sample data of Figure 21 can be represented as n points in p dimensions, with the values of the jth variate (j = 1, 2, 3, ..., p) for each unit referred to the jth of p rectangular co-ordinate axes. When the number of variates is large, the resulting geometric representation is in many dimensions and cannot easily be visualized. For this reason, many of the techniques of multivariate analysis seek to simplify this representation by reducing its dimensionality, while keeping the essential features of the data as measured by their total variance.

7.7 ORDINATION

In this section, four methods of ordination will be described, depending on the main focus of interest in the undivided basic data matrix. Where the main focus of interest, initially at least, is on the correlations between the variables, principal component analysis may be the most appropriate method of deriving an ordination and reification of the data. Factor analysis is regarded by some analysts as an alternative. Where, however, the basic data matrix is used to define the distances between individuals in the attribute space, principal co-ordinate analysis finds the n points relative to principal axes which will give rise to these distances. Finally, an iterative approach may be used to derive an ordination simultaneously from the attributes and the individuals by reciprocal averaging, a technique which is related to correspondence analysis.

7.7.1 Principal component analysis

Where no a priori structure is imposed on the data matrix, so that it represents a scatter of n points in p dimensions, we seek a rotation of the axes of the multivariate space such that the total variance of the projections of the points on the first axis is a maximum. We then seek a second axis orthogonal to the first, which accounts for as much as possible of the remaining variance, and so on.

If x' = (x1, x2, ..., xp) represents a point in the p-dimensional space, the linear combination l'x of its co-ordinates represents the length of an orthogonal projection on to a line with direction cosines l, where l' = (l1, l2, ..., lp) and l'l = 1.

The sample variance of the n projected points is then proportional to:

V = l'X'Xl

and to maximize V subject to the constraint l'l = 1, the following criterion is maximized:

V* = l'X'Xl − λ(l'l − 1)
   = l'Wl − λ(l'l − 1)

where W = X'X and λ is a Lagrange multiplier.

It can be shown that the p equations in the p unknowns l1, l2, ..., lp have consistent solutions if and only if |W − λI| = 0, where I is the identity matrix. This in turn leads to an equation of degree p in λ, with p solutions λ1, λ2, ..., λp. These solutions are variously designated as the latent roots, eigenvalues or characteristic roots of W.

Substitution of each solution λ1, λ2, ..., λp in

(W − λI)l = 0

gives corresponding solutions of l which are uniquely defined if the λ's are all distinct, and these are designated as the latent vectors, eigenvectors, or characteristic vectors of W.

The extraction of the eigenvalues and eigenvectors of the variance-covariance matrix of our original data matrix, representing n points in p dimensions, neatly defines the linear combinations of the original variables which account for the maximum variance while remaining mutually orthogonal. The elements of the eigenvectors provide the appropriate linear weightings for the components, and each eigenvalue, expressed as a proportion of the sum of all the eigenvalues (a sum which equals p when the correlation matrix is analysed), gives the proportion of the total variance accounted for by the corresponding component.
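To make the computation concrete, the following sketch (in Python with numpy, an assumed choice; all data and variable names are illustrative) extracts the eigenvalues and eigenvectors of W = X'X for a centred data matrix, and forms the component scores and the proportions of variance accounted for.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))            # invented sample: n = 50, p = 3
Xc = X - X.mean(axis=0)

W = Xc.T @ Xc                           # sums of squares and products, X'X
eigvals, eigvecs = np.linalg.eigh(W)    # eigh, since W is real and symmetric
order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                   # projections on the component axes
explained = eigvals / eigvals.sum()     # proportion of total variance
```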

Note that, in the argument used so far, no assumptions about the normality of the multivariate distribution from which the samples have been drawn have been involved; such assumptions would be required only if tests of the significance of the components were needed.
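The computation is easily sketched in a modern numerical language. The following fragment (a minimal illustration in Python with numpy, not drawn from the original text; the simulated data and all names are ours) extracts the eigenvalues and eigenvectors of the sample variance-covariance matrix and expresses each eigenvalue as a proportion of the total variance:

import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated sample data

Xc = X - X.mean(axis=0)                  # centre each variate on its mean
W = Xc.T @ Xc / (n - 1)                  # sample variance-covariance matrix
d, A = np.linalg.eigh(W)                 # eigh is for real symmetric matrices
order = np.argsort(d)[::-1]              # eigh returns ascending eigenvalues
d, A = d[order], A[:, order]

# Each eigenvalue, as a proportion of the trace of W, is the share of the
# total variance accounted for by the corresponding component.
print(d / d.sum())

# The columns of A are the direction cosines l, with l'l = 1 and A'A = I.
print(np.allclose(A.T @ A, np.eye(p)))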

7.7.1.1 Covariances or correlations?

The above theory has been developed for the covariance matrix, and is most suitable if all the original variates have been measured in the same units. Even if the variates are all on comparable scales, it is necessary to be careful in working with unstandardized data: if the range of values obtained for one variate is greatly different from that of another, it clearly makes a good deal of difference whether the data are standardized or not. On the other hand, if the original variates are not in the same units, their linear compound would have little meaning, and the rationale of maximizing l'Wl relative to l'l is questionable; in fact, the analysis will depend on the different units of measurement (Anderson, 1958).

If the variates are not all expressed in the same units, it is customary to standardize the data so that:

zij = xij / sj

where xij is an element of matrix X expressed as a deviation from the mean, and sj is the standard deviation of the jth variate. In this case, the transformation is:

U = ZB

and the covariance matrix of U, say L, is:

L = B'RB

where R is the covariance matrix of Z, which is also the correlation matrix of X.

The theory for principal component analysis based on the correlation matrix is essentially the same as that for the covariance matrix (see Cooley and Lohnes, 1971). The eigenvalues lj and eigenvectors bj of the correlation matrix are computed. The sum of the eigenvalues will be the trace of R, and will be equal to p. In analysing the correlation matrix, the data are regarded as a set of correlated variates, and the aim of the linear transformation is to obtain an equivalent set of uncorrelated components with maximized variances. Hence, the pattern of the eigenvector elements depends upon the correlation structure of the observations. In an analysis of the covariance matrix, the pattern of the eigenvector elements will depend on the variance-covariance structure.

7.7.1.2 Eigenvalues and eigenvectors

The eigenvalues and eigenvectors of a covariance or correlation matrix determine the lengths and directions of the component axes. The orthogonal eigenvectors used in computing the component values are the direction cosines of the component axes. The columns of an orthonormal eigenvector matrix are the eigenvectors, and the rows represent variables. As the sum of the squared elements of any column is unity, the square of any element gives the proportion of the variance of the component which is accounted for by the corresponding attribute. The contribution of a component to an attribute is not so obvious, and, by concentrating only on the orthonormal matrix, the possibility may be overlooked that a particular attribute is largely accounted for by one of the later components. In order to get more information, other normalizations of the eigenvectors are necessary.

The eigenvectors of a covariance matrix may be normalized so that the sum of the squared elements of a column eigenvector is equal to the corresponding eigenvalue. Then, provided that all the eigenvectors are included in the matrix, the sum of the squared elements in any row will equal the variance of the corresponding attribute. The elements of this matrix are aij√dj. For convenience, we can call this a type A eigenvector matrix. The best-fitting q-dimensional subspace to a scatter of points in a higher (p-) dimensional space passes through the centre of gravity of the points, and the sum of squares of the perpendiculars from the points to the subspace is a minimum. Such a space is defined by the centre of gravity (i.e. the mean vector) and the orthonormal eigenvectors. Any point in the subspace can be referred to the p original axes by the relation:

xi' = x̄' + ki'C'

where xi' is a p-dimensional row vector of the raw data matrix X, x̄' is the mean vector, ki' is a row vector of an (n × q) matrix of coefficients, and C' is the (q × p) transpose of:

C = AG

where G is a diagonal matrix of square roots of the eigenvalues, i.e. matrix C is a type A eigenvector matrix. The elements of the eigenvectors:

cj = aj√dj

are 'typical points', summarizing the deviation of the original points from the centre of gravity (Rao, 1964). In some types of study, the typical points prove to be useful (e.g. Simonds, 1963). In other types of study, they may not admit any useful interpretation. Taylor (1977) has plotted these values as 'variance profiles'. Each observed profile, easier to interpret if the data are standardized, is a composite of several such profiles.

If each element of a row of a type A eigenvector matrix is divided by the square root of the variance of the corresponding variable, the rows will have unit sums of squares. Squaring the appropriate element gives the proportion of the total variance of an attribute which is accounted for by a particular component. The elements will be aij√dj / si, i.e. each is the product-moment correlation of the ith attribute and the jth component. Each element is the cosine of the angle between an attribute and a component, i.e. a direction cosine of the attribute. It is also the projection of the attribute on to a component (Hope, 1968).

An eigenvector matrix obtained from a correlation matrix bears no simple relation to that derived from the corresponding covariance matrix. Here, too, the orthonormal eigenvectors bj are used to compute the component values, in this case applied to the standardized data. The eigenvectors can also be normalized so that the sum of the squared elements of a vector equals the associated eigenvalue, the new vector elements being bij√dj. The sum of squares of each row is then unity, and squaring the appropriate element gives the proportion of the total variance of an attribute which is accounted for by a particular component. Each of these elements is the correlation of the ith attribute with the jth component, and they are direction cosines of the attributes. For convenience, we can call this a type D eigenvector matrix.

Finally, for both a covariance and a correlation matrix, the elements of an orthonormal eigenvector matrix can be divided by the square root of the corresponding eigenvalue (Hope, 1968; Massey, 1965). If such a matrix is applied to the data, the resulting components will have unit variance. We can call this a type C matrix (covariance) or a type E matrix (correlation).
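These rescalings are elementary once the orthonormal eigenvectors are available; the following sketch (the type labels follow the text above, everything else is illustrative) produces them for a covariance or correlation matrix S:

import numpy as np

def eigenvector_scalings(S):
    # S is a symmetric covariance or correlation matrix.
    d, A = np.linalg.eigh(S)
    d, A = d[::-1], A[:, ::-1]          # largest eigenvalue first
    type_A = A * np.sqrt(d)             # column sums of squares = eigenvalues
    type_C = A / np.sqrt(d)             # components scaled to unit variance
    return A, type_A, type_C

For a correlation matrix, the same two rescalings give the type D and type E matrices, and the rows of the type D matrix then contain the correlations between each attribute and each component, with unit sum of squares in each row when all components are retained.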

7.7.1.3 Calculation of eigenvalues and eigenvectors

The main computational problem of principal component analysis is to find the eigenvalues and eigenvectors of a symmetric matrix. Since the use of digital computers became widespread, significant advances have been made in the development of fast and accurate algorithms. Modern methods first reduce the matrix to tridiagonal form by Householder's method. The eigenvalues of the tridiagonal matrix are the same as those of the original matrix; the eigenvectors are not the same, but are readily found by an appropriate transformation. In the algorithm given by Ortega (1967), the eigenvalues are calculated by the method of Sturm sequences. As many eigenvalues as required may be calculated, but it is usual to compute all the eigenvalues, if not all the eigenvectors, because the number of non-zero eigenvalues gives the rank of the data matrix. Finally, the eigenvectors of the tridiagonal matrix are calculated and back-transformed to eigenvectors of the original matrix by a procedure due to Wilkinson (1965).

In the above method, if the matrix is pathologically close to being singular, or if multiple eigenvalues exist, the eigenvalues will be accurate but the eigenvectors will not be orthogonal. If accurately orthogonal eigenvectors of close or multiple roots are required, it is better to use the TRED2 routine of Martin et al. (1971) to obtain the tridiagonal matrix and the QL algorithm (TQL2 routine of Bowdler et al., 1971) to calculate the eigenvalues and eigenvectors. This combination produces eigenvectors which are always accurately orthogonal, even for multiple roots. The order in which the eigenvalues are found is, to some extent, arbitrary. The accuracy of individual eigenvectors is, of course, dependent on their inherent sensitivity to changes in the original data.

When the complete set of eigenvalues and eigenvectors is required, Ortega's program is more efficient than TRED2 and TQL2 for matrices of orders exceeding about 48, but TRED2 and TQL2 are more efficient for orders less than about 35 (Sparks and Todd, 1973). TRED2 and TQL2 also have the advantage of requiring less storage space.
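These Algol routines have long since been absorbed into standard libraries such as LAPACK, which likewise reduce the matrix to tridiagonal form by Householder's method before finding the roots. A sketch of the modern equivalent, assuming a recent version of scipy is available:

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
S = np.cov(rng.normal(size=(200, 10)), rowvar=False)

# All eigenvalues and eigenvectors; the tridiagonalization and
# back-transformation described above happen inside the library call.
vals, vecs = eigh(S)

# Only the three largest eigenpairs, useful when p is large and q is small.
p = S.shape[0]
top_vals, top_vecs = eigh(S, subset_by_index=[p - 3, p - 1])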

7.7.1.4 Dimensionality of the data matrix

Principal component analysis normally provides r ≤ p non-zero eigenvalues from p attributes; if n < p, a maximum of n - 1 components is obtained. For many purposes, it is convenient to disregard components which have small variances, treating them as constants, so that the main components of variation may be studied in a subspace of dimension q < p. Although some information is lost in this transformation, the q components may account for enough of the original variance to make the reduced dimensionality useful. Indeed, much of the variability in the lower components may be regarded as random noise. However, before components are rejected, the possibility should be considered that one of the lower components may be important to a particular attribute or to a particular subgroup of individuals. In choosing to ignore a component, we are not doubting its reality. Any dimension which exists in a sample must exist in the population, and, however small the latent root, it could not have arisen from a population in which the corresponding value is zero (Kendall, 1975). We are merely suggesting that that part of the variance is not relevant to the particular study in hand (Jolliffe, 1982).

For some data, the first few eigenvalues are large, but successive eigenvalues decrease sharply; it is then clear how many components are likely to be of practical importance. In many cases, however, the result is less clear: the first component may account for half, or less, of the total variance, and successive eigenvalues decrease gradually. This result may be due to one of two causes. First, the variances of the original measurements may be approximately equal (or may have been made so by standardization). Second, the relationships between the original measurements may be strongly curvilinear (Williams, 1976).

Morrison (1976) suggests that, from his experience, there is usually little point in extracting further vectors if the first four or five components do not account for about 75% of the variance. Even if the later eigenvalues were sufficiently distinct to allow easy computation of the components, the interpretation of those components may be difficult or impossible. He suggests that it is frequently better to summarize the data in terms of the first components with large, markedly distinct variances, and to include as highly specific and unique variates those responses represented by high loadings in the later components, although the latter are likely to be associated with considerable noise from the other, unrelated variates. Marriott (1974) suggests that, if the attributes have been standardized, it may be reasonable to discard all components with a smaller variance than that of a single attribute, but that, on the whole, it is better to retain unimportant components than to discard information of value.

Jeffers (1965) proposed a 'rule of thumb' test based on the fact that, if the correlation matrix is a unit matrix, i.e. no attribute correlates with any other attribute, the eigenvalues will all be unity. If there are correlations, some of the eigenvalues will be greater than unity and some will be less. Hence, a component with an eigenvalue less than unity represents a component which accounts for a smaller proportion of the variance than would be represented by each of the basic attributes separately. In practice, therefore, he has found it useful to consider only those components having eigenvalues equal to or greater than unity, although he might also look at the next one or two components with eigenvalues less than unity, provided that they are greater than about 0.75. Kaiser (1960) has also expressed a variety of arguments for a 'rule of thumb' test based on eigenvalues greater than unity, and this rule seems to work well with small or moderate-sized samples (Cooley and Lohnes, 1971).
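These rules of thumb are easily applied once the eigenvalues are available; a minimal sketch, assuming an analysis of a correlation matrix so that the eigenvalues sum to p (the threshold names and illustrative roots are ours):

import numpy as np

def components_to_retain(eigvals, kaiser=1.0, near_miss=0.75):
    # Jeffers's and Kaiser's eigenvalue-greater-than-unity rule, with
    # near misses above about 0.75 flagged for inspection.
    eigvals = np.sort(eigvals)[::-1]
    keep = int(np.sum(eigvals >= kaiser))
    borderline = int(np.sum((eigvals < kaiser) & (eigvals >= near_miss)))
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return keep, borderline, cumulative

vals = np.array([3.2, 1.9, 1.1, 0.8, 0.5, 0.3, 0.2])   # illustrative roots
keep, borderline, cum = components_to_retain(vals)
print(keep, borderline, cum[keep - 1])   # 3 retained, 1 near miss, 77.5% of variance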

7.7.1.5 Sphericity

A different question is that of sphericity. The well-known chi-square test of Bartlett, with its various modifications (Kendall, 1975), tests whether some or all of the eigenvalues are equal. Applied to all the eigenvalues, the test has different interpretations according to whether it is applied to the correlation matrix or the covariance matrix. Applied to the correlation matrix, the test asks whether the eigenvalues show any tendency to deviate from a uniform value of unity; if they do not, the correlations among the variates cannot be assumed to differ from zero, and there is no point in transforming the data to orthogonal components. If the test is applied to the covariance matrix, it is a test of the independence of the variates and of the equality of their variances (Hope, 1968).

Such statistical tests require a large sample drawn from a p-dimensional multivariate normal population with a covariance matrix having p distinct, non-zero eigenvalues. The asymptotic distribution theory of the eigenvalues and eigenvectors of correlation matrices is considerably more complicated than that of covariance matrices (Anderson, 1963). Furthermore, it does not follow that all the components which reach statistical significance in a large sample necessarily remove a large proportion of the variance, and so some of them may be comparatively unimportant in practice (Bartlett, 1950). Krishnaiah and Lee (1977) have discussed the problem of the testing of roots of covariance matrices. James (1977) described tests of the hypothesis that a prescribed subspace is spanned by principal components. The value of such tests in the analysis of real data is doubtful, and many workers prefer to use the 'rule of thumb' approach given above. Non-parametric tests might prove useful, but do not seem to have received much attention.
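For reference, Bartlett's sphericity test for a correlation matrix has a simple chi-square approximation; the sketch below follows the standard formulation found in the general literature rather than any formula given in this chapter:

import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R, n):
    # H0: the population correlation matrix is the identity, i.e. the
    # eigenvalues show no tendency to deviate from a uniform value of unity.
    p = R.shape[0]
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, df)

# A nearly diagonal correlation matrix should not reject H0.
R = np.eye(3) + 0.05 * (np.ones((3, 3)) - np.eye(3))
print(bartlett_sphericity(R, n=100))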

7.7.1.6 Calculation of the transformed values

As a further aid to the interpretation of the analysis, and as an input for further analysis and interpretation, it is usually desirable to compute the value of each of the components regarded as meaningful for each of the individuals in the basic data matrix. Most usually, the components will be computed from the standardized variables of the basic matrix, i.e. from the deviation of each individual from the average for the matrix, divided by the standard deviation of the values; the product of the weighted eigenvectors and the standardized variables for the individuals then gives the transformed values of the original data.

If the analysis has succeeded in giving any effective reduction of the essential dimensions of the problem, there will be fewer transformed variables than there were variables in the basic matrix, and these transformed variables have the desirable property of being orthogonal, and therefore uncorrelated.
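In matrix terms, the transformed values are simply the standardized data post-multiplied by the retained eigenvectors; a sketch in the same illustrative notation as before:

import numpy as np

def component_scores(X, q):
    # Scores on the first q principal components of the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize
    d, B = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    B = B[:, np.argsort(d)[::-1][:q]]                  # q leading eigenvectors
    return Z @ B                                       # n x q transformed values

X = np.random.default_rng(7).normal(size=(50, 6))
U = component_scores(X, q=2)
print(np.corrcoef(U, rowvar=False).round(10))          # scores are uncorrelated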

7.7.1.7 Analysis of transformed values

Many types of analysis can be carried out on the transformed values represented by the calculated components for each individual, and are limited only by the ingenuity of the analyst, and the purpose of the analysis. Much will depend on the structure that has been imposed on the variables and individuals of the basic data matrix. Where no such structure exists, some kind of cluster analysis (see below) of the individuals will usually be the only kind of further analysis which will seem appropriate, but, even here, the choice of methods is considerable. However, it will be evident that a preliminary component analysis of the data will frequently greatly reduce the dimensions over which the clustering needs to be performed. It will also remove much of the 'noise' which makes some forms of cluster analysis difficult to interpret.

Structure imposed on the variables will frequently lead to comparisons of the correlations between sets of components, or to the calculation of the regression of a component of one set on all or some of the components of another set. Clearly, there is no point in looking at the correlations between components of the same set, as these are, by definition, orthogonal. Regression calculations can be greatly facilitated by the technique of orthogonalized regression described by Kendall (1975). Structure imposed on individuals will usually lead to analysis of variance of the individual components, the analysis being related to the experiment or survey design implied by the structure. The orthogonality of the components greatly increases the ease of interpretation of this method of analysis.

7.7.1.8 Plotting of transformed values

Plotting of the transformed values will frequently be helpful in making clear the relationships and distinctions between the variables, components and individuals. The most usual form of plotting will be as a projection of the n-dimensional space on to the plane formed by the axes of two of the components. The orthogonality of the components ensures that the axes of any pair of components are at right angles to one another, and greatly simplifies the plotting. However, it must always be remembered that projection of more than two dimensions on to a two-dimensional plane almost inevitably results in some distortion of the relationships between individuals. The techniques of minimum spanning tree and nearest neighbour analysis may be used to alert the analyst to the degree of distortion present.

Where plotting of three dimensions is required, several techniques are available, including the use of various kinds of symbols to indicate deviations from the two-dimensional surface in some third dimension, and the use of stereo pairs of plots giving the illusion of three dimensions when viewed through a simple stereoscope. For larger numbers of dimensions, Andrews' transformation (Andrews et al., 1971) will usually be preferable. Alternatively, biplot graphical displays (Gabriel, 1971) may be used to obtain a visual appraisal of the structure of large data matrices, and to show the inter-unit distances and the clustering of the individuals in multivariate space. This is one of the areas of application of computer graphics which promises to greatly improve interpretation and understanding of complex data.

7.7.1.9 Reification

Reification is the interpretation of the results of a mathematical analysis in terms of the original problem and observations. Some of the components can often be interpreted in terms of features of the original observations, and components may correspond to features which have already been appreciated in an intuitive way. It is important to realize the limitations of this type of reification. The order in which the components appear, the proportion of the variance associated with each, and, hence, the conclusions drawn, depend on the objects chosen and the observations included in the analysis, as well as on sampling variations. The stability of the components of a sample can be examined by making similar sets of measurements on several samples, and doing a separate analysis on each set. Kendall (1975) emphasizes that the direction cosines are sensitive to fluctuations such as one would get from one sample to another. It is therefore unwise to place too much emphasis on the numerical value of any particular coefficient in an eigenvector.

In some types of analysis, the first principal component has the character of a size vector, while others are shape vectors. Jolicoeur and Mosimann (1960), in studies on the painted turtle, emphasized that, for the first component to be a size vector, all coefficients must be of the same sign. Rao (1964) gave a mathematical justification for this conclusion. Examples of other studies of this type are given by Blackith and Reyment (1971). However, the first component should not be regarded as a size component unless the structure of the observations clearly suggests such an interpretation. For example, in studying heath vegetation (species presence-absence), Ivimey-Cook and Proctor (1967) found that the first component reflected species abundance.

In some applications, the calculated components do not suggest a simple reification, i.e. the pattern of eigenvector elements on a particular component is not susceptible to a simple biological or ecological interpretation. Holland (1969) has pointed out that, while the principal components themselves may be of little biological significance, the space defined by their vectors is significant, and it is only a matter of geometrical manipulation to determine the extent to which the vectors of other components corresponding to biological hypotheses, or derived from other bodies of data, lie within such a space. Holland further suggests that, if certain hypothetical or consistent components can be found to fit the observations, it may be preferable to define the space occupied by the observations in terms of these new components. If this is done, two problems arise. First, the variation associated with each new component will be unknown. Second, the number of such components may not be sufficient to define the space. He suggests ways of overcoming these problems.

7.7.1.10 Rotation

Various methods have been proposed for rotating the component axes to give a new set of axes spanning the same space. The proportion of the total variance accounted for by the first q components will be unchanged, but the variances of the individual new components will be different from the originals, and they will not be subject to maximum-variance constraints, i.e. they will not be the principal axes of the hyper-ellipsoid. Furthermore, the rotated axes may not be orthogonal, so that the variances of the new components do not constitute a partition of the total variance attributable to the component space. Hence, great care is needed in interpreting such components. If axes are to be rotated, the aims of the rotation must be clearly stated and formulated in mathematical terms, so that the angle of rotation can be determined. In many studies, ecologists apply one or more of the standard rotation techniques used in factor analysis (see Section 7.7.2 below).

For example, Ivimey-Cook and Proctor (1967), studying heath vegetation, found that, in principal component analysis of the correlation matrix, the first three components reflected species abundance, soil moisture and base status respectively. The fourth and fifth components were not readily interpretable. The first five components accounted for nearly 88% of the total variance. Examination of the ordination graphs suggested that a rotation of the axes might bring them into line with configurations of points, and the components were then rotated according to Kaiser's varimax criterion to give five new axes which were readily interpretable as corresponding to five recognizable vegetation types.
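The varimax criterion itself is easy to implement; the following is a sketch of the classical iterative algorithm as given in the general factor-analysis literature, not a procedure quoted from this chapter (the loadings matrix L is p attributes by q components):

import numpy as np

def varimax(L, tol=1e-8, max_iter=500):
    # Orthogonal rotation of a loadings matrix maximizing the varimax
    # criterion; returns the rotated loadings.
    p, q = L.shape
    T = np.eye(q)
    criterion_old = 0.0
    for _ in range(max_iter):
        LT = L @ T
        col_ss = (LT ** 2).sum(axis=0)
        u, s, vt = np.linalg.svd(L.T @ (LT ** 3 - LT * col_ss / p))
        T = u @ vt
        if s.sum() < criterion_old * (1 + tol):
            break
        criterion_old = s.sum()
    return L @ T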

Another method of rotation exploits the fact that, in a vegetation association which is virtually a pure stand, the entropy of mixing of the constituent elements is low relative to that of an association where many elements are approximately equally represented (Pelto, 1954). Zones of inter-gradation between two associations or communities have high entropy (Howard and Murray, 1969) and, if components are rotated to positions such that the entropy of the system is minimized, the resulting new axes might be expected to link the centres of associations (McCammon, 1966, 1968).

7.7.1.11 Interpretation of results

Having reviewed the main stages of principal component analysis, it will be evident that the interpretation of the results of the analysis depends on careful oversight of all these stages. It is perhaps for this reason that the interpretation of multivariate data appears to present so many problems to the inexperienced analyst. Used to less rigorous applications of statistical methods, the analyst may be in the habit of delegating some stages of the analysis to others, with a subsequent loss of control of the analysis and of the understanding of the results. Worse still, the analyst may be tempted to surrender the decisions that he should make for himself to the 'control' of a computer program or package; he may not even know what decisions have been programmed into the package, with disastrous results. It is the author's view that the correct interpretation of multivariate analysis can only be achieved if the data, and all the stages of the analysis, are properly embedded in the logic and background of the investigation of which the data form part.

7.7.1.12 Advantages and disadvantages of principal component analysis

The main advantage of principal component analysis lies in the robustness of the least squares approach to approximating the covariance or correlation matrix. It is, therefore, not important for the data to be multivariately normal, unless significance testing is required. Other advantages lie in the relative simplicity of the technique. For example, it is easy to see the contributions made by the attributes to each component.

The main disadvantage lies in the assumption that any relationships among the original attributes are essentially linear, or at least that any non-linear contribution is small. Problems occur if the data are markedly non-linear, for example along an environmental gradient. The orthogonality of the components implies functional independence only if the objects are normally distributed. Norris (1971) used a simple example with artificial data to show that, if one attribute has a linear response to an environmental gradient while a second has a sinusoidal response, the resulting plot of the components shows the environmental gradient to be curved. Situations in which some of the attributes do not increase or decrease linearly along an environmental gradient proposed as a reification should be examined for functional relationships between components (e.g. Norris and Barkham, 1970).
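Norris's artefact is easy to reproduce with artificial data; in the sketch below (the particular response shapes are illustrative, not taken from his paper) a single gradient generates one linear and one humped attribute, and the plot of the first two components traces an arch rather than a straight line:

import numpy as np

g = np.linspace(0, 1, 60)                     # one environmental gradient
X = np.column_stack([g, np.sin(np.pi * g)])   # linear and humped responses
X = X - X.mean(axis=0)

d, A = np.linalg.eigh(X.T @ X)
scores = X @ A[:, ::-1]                       # components, largest variance first

# Successive rows trace a curve in the component plane: the one-dimensional
# gradient appears as an arch, not as a single straight axis.
print(scores[::15].round(3))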

Problems can occur with plant species, which may be thought of as being distributed along environmental gradients with frequency curves which are S-shaped if the peak value is at one end of the gradient, or bell-shaped if the gradient encompasses the whole range of the species, as described by Whittaker (1967). If only a small part of the gradient is examined, the relationships may be approximately linear for a number of species, but, as more of the gradient is included, the number of zero occurrences increases and the bell-shaped curves overlap. Noy-Meir and Austin (1970) subjected a set of simulated data of this type to analysis. The resulting ordination diagram showed that the linear gradient became a complex three-dimensional curve. However, they concluded that, provided no assumption of a one-to-one relationship between axes and gradients is made, principal component ordination could still give a useful representation of the data.

Austin and Noy-Meir (1971) examined the problem of non-linearity using artificial data for two-gradient models. They distinguished two types of distortion which can arise: (i) involution, where extreme stands occur closer to the centre of the environmental plane than less extreme stands, and (ii) spurious axes which appear in the ordination although they do not represent independent environmental gradients.

A related problem concerns the nature of the matrix from which the eigenvalues and eigenvectors are calculated. Principal component analysis is a method for partitioning variances, and can be applied to sums of squares and products matrices, although it is more commonly applied to covariance and correlation matrices. Care is necessary when applying the method to correlation matrices which are not calculated by ordinary product-moment methods (Kendall, 1975). Binary data present special problems; the frequency of occurrence of an attribute constrains the range of correlation possible. The type of species distribution described above means that a species may contribute to the calculation of ecological distance in only a limited part of the data set, because it is represented elsewhere by zeros.

While the linearity constraint of principal component analysis has received much attention, there has been less concern with the problems raised by the use of mixed data, and by the additive derivation of component values. Some workers think that multiplicative models might be more appropriate. Various problems in, and approaches to, the application of principal component analysis were discussed by Dale (1975).

A problem which often occurs in multivariate studies is how to reduce the number of original attributes which need to be measured. Jolliffe (1972, 1973) discussed eight methods, four of which use principal components, and applied five of the eight techniques to real data, including a multiple linear regression analysis. Mansfield et al. (1977) presented a method for reducing the number of interdependent attributes. The procedure first deletes components associated with small latent roots of X'X and then incorporates an analogue of the backward elimination procedure to eliminate attributes, the deletion being based on minimal increases in the residual sums of squares.

7.7.1.13 Case studies in the application of principal component analysis 

Principal component analysis is probably the best known and most widely used ordination technique. The positions of the individuals can be plotted on pairs of rectangular Cartesian component axes. Such plots will show discontinuities if they exist in the data (e.g. see Blackith and Reyment, 1971), but it must be remembered that any such two-dimensional representation is distorted in that the other dimensions are not included. Gower and Ross (1969) showed that such distortions can be illustrated by superimposing the minimum spanning tree (see Section 7.9.1) of the points in the total dimensionality on to their representation in the reduced space. There has been much discussion about the use of principal component analysis in plant ecology, because of problems caused by the nature of plant species distributions along environmental gradients. Too many papers have been written for detailed discussion here, but see, for example, Bray and Curtis (1957), Austin and Orlocci (1966), Beals (1973), Noy-Meir (1973), Whittaker (1973), and Orlocci (1975).

Apart from Blackith and Reyment (1971), few collected examples of case studies in the application of principal component analysis exist, but examples are given in more general books on multivariate analysis as a whole. One of the earliest of these books is that by Seal (1964), which is still worth reading as an introduction to multivariate techniques. Orlocci (1975) presents a wide range of multivariate techniques, including principal components, in vegetation research, and Mather (1976) does the same for physical geography. More recently, Gauch (1982) gives a range of applications in community ecology.

7.7.2 Factor analysis

7.7.2.1 Mathematical basis

In the principal component model, components were represented as linear additive functions of the original variables. In the model for factor analysis, however, it is assumed that each variate is represented as a linear additive function of k 'factors' common to all variates, together with a residual specific to each variate, i.e.:

xi = li1F1 + li2F2 + ... + likFk + Ei    (i = 1, 2, ..., p)

or, in matrix form,

X = LF + E

where X and E are p-vectors, F is a k-vector, and L is a p × k matrix. The elements of F are called the common factors, while those of E are called specific factors. The elements of L are designated as the loadings of the factors.

Without loss of generality, it may be assumed that the common factors are uncorrelated with each other and with the specific factors, so that, with further assumptions of zero means and unit variances,

D = LL' + V

where D is the covariance matrix of the population of the p variates and V is the covariance matrix of the specific factors. Methods of estimating the loadings L are given by Lawley and Maxwell (1971), of which the maximum likelihood solution is of particular interest.

The model

D = LL' + V

is assumed for the population of which the data matrix is a random sample, providing an estimate C of the population covariance matrix D. If it can be assumed that the population has a multivariate normal distribution, then the sample covariance matrix C has a Wishart distribution with a known likelihood function, whose maximum can be shown to occur when:

L - CV^-1 L(L'V^-1 L + I)^-1 = 0

As any orthogonal transformation LH of L will also satisfy the original model and

(LH)(LH)' = LHH' L' = LL'

where H is orthogonal, a unique solution is obtained by solving for the values of L which make

L'V^-1 L = J (say)

diagonal, and so

L(I + J) - CV^-1 L = 0

is the basic equation for estimating L.

Consequently,

LJ = CV^-1 L - L

so that the diagonal elements of J are the latent roots of:

V^-1/2 C V^-1/2 - I

with eigenvectors:

M = V^-1/2 L

and scaling given by:

M'M = L'V^-1 L = J

The estimates of the loadings L are then given by V^1/2 M.

In contrast to principal component analysis, factor analysis requires an assumption that the observations are taken from a population following a multivariate normal distribution, a feature which limits the usefulness of the model in data analysis.

It is also worth noting that the matrix V^-1/2 C V^-1/2 is invariant to changes in scale of measurement, and is, in this respect, closely related to the correlation matrix S^-1/2 C S^-1/2, where S = diag(C), but with the specific variances playing the standardizing role that is played by the variances in the correlation matrix. In effect, there is a close relationship between factor analysis and principal component analysis of a correlation matrix, and it can be shown that the projection of the scaled co-ordinates V^-1/2 x on to k dimensions after a principal component analysis of V^-1/2 C V^-1/2 differs from the estimates of the factor scores only by a scale factor for each of the k axes. As a result, if the factor scores of an individual observation are regarded as its co-ordinates, similar configurations are obtained by both methods. If V is proportional to S, a principal component analysis can be expected to give results similar to those of a factor analysis.

7.7.2.2 Practical considerations

The basic principle of factor analysis is that examination of the correlation matrix may show that a few of the attributes are highly correlated, while the remainder are not significantly correlated with them or with each other. If the purpose is to seek a basic pattern, we might consider retaining the attributes which show high inter-correlations and discarding the remainder. If the values of unity in the principal diagonal of the correlation matrix are replaced by the communalities (i.e. the common factor variances), some of the eigenvalues, those associated chiefly with the unique variances, become zero or nearly so. The estimation of the communalities is a major problem. It is necessary to decide in advance how many eigenvalues and eigenvectors are of interest, and there is no way to be sure that the correct number has been chosen; it seems to be a matter of luck (Williams, 1976). The next step is to calculate the chosen number of eigenvalues and eigenvectors of the modified correlation matrix, and to normalize the eigenvectors so that the sum of squares of each vector equals the corresponding eigenvalue. The eigenvectors can then be used to provide an improved estimate of the communalities, and the process is iterated until the communality estimates are reasonably constant. In practice, this method is no longer used, more efficient methods being available.
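Maximum likelihood factor analysis is now fitted directly by standard software; a sketch assuming scikit-learn is available (the simulated data and all names are illustrative):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
# One common factor loading on the first three attributes only.
X[:, :3] += np.outer(rng.normal(size=200), [1.0, 0.9, 0.8])

fa = FactorAnalysis(n_components=2).fit(X)
loadings = fa.components_.T        # the p x k loadings matrix L
specific = fa.noise_variance_      # the diagonal of V
scores = fa.transform(X)           # estimated factor scores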

7.7.2.3 Differences between factor analysis and principal component analysis

It is useful, at this stage, to emphasize the important differences between principal component analysis and factor analysis. Both methods have in common the aim of finding a small number of hypothetical attributes (components or factors) which contain the essential information expressed in a larger number of observed attributes. Hence, the dimensionality of the data is reduced by exploiting their interdependence. Factor analysis was originally developed for the analysis of scores obtained by individuals on batteries of psychological tests, and it still has many applications in that field, although its use has spread to other fields, notably geography (e.g. see Jöreskog et al., 1976). Originally, the term factor analysis included principal component analysis, but it is important to distinguish between the two methods, which have similar mathematical models but rest on different assumptions about how the data are represented; the differences between the techniques are largely connected with differences in these assumptions.

The fundamental differences between principal component analysis and factor analysis lie in the ways in which the factors are defined, and in the assumptions about the nature of the residuals. In principal component analysis, the factors (components) are determined with a maximum variance constraint. In factor analysis, the factors are defined to account maximally for the inter-correlation of the attributes. In principal component analysis, the residual terms are assumed to be small, and a large part of the total variance of an attribute is assumed to be important. In factor analysis, there is no such assumption, and only that part of an attribute is used that participates in the correlation with other attributes. In both methods, the residuals are assumed to be uncorrelated with the factors. In principal component analysis, there is no assumption about correlations between the residuals, whereas the residuals in factor analysis are assumed to be uncorrelated.

7.7.2.4 Case studies

Blackith and Reyment (1971) suggest that it is very hard to discuss factor analysis without generating more heat than light; it is the most controversial of the multivariate methods. Factor analysis was proposed originally as a model for a well-defined problem in educational psychology, but it acquired a bad reputation among mathematicians and was largely ignored outside the field of psychology, where it still finds most of its applications. The method and criticisms were discussed by Cattell (1965), Blackith and Reyment (1971), and Marriott (1974) and several books have been written on this topic (e.g. Lawley and Maxwell, 1971).

Factor axes (and component axes) may be rotated to determinable positions in which they are not necessarily, or generally, orthogonal. Sneath and Sokal (1973) considered that this rotation makes scientific sense in that the factors underlying the covariation pattern of the characters in nature are themselves undoubtedly correlated, but they pointed out that there are practical problems. Clifford and Stephenson (1975) went so far as to state that 'It is likely that in the future factor analysis will play an increasingly important role in ecological studies.' On the other hand, Gower (1967b) considers it doubtful if factor analysis really is a helpful way of viewing biological data, and Blackith and Reyment (1971) ask: 'Could it not be that factor analysis has persisted precisely because, to a considerable extent, it allows the experimenter to impose his preconceived ideas on the raw data?'

7.7.3 Principal co-ordinate analysis

7.7.3.1 Background

Principal co-ordinate analysis is a computationally simple, but powerful, procedure which has wide applications. It arose from dissatisfaction with many reported applications of factor analysis and principal component analysis in classification studies, particularly in the biological literature. If many, or all, of the attributes are qualitative, product-moment correlations between attributes may be inappropriate. In such a case, the analysis requires the use of a matrix of association coefficients, or of some function of an association coefficient which is regarded as the distance between two objects. Some analysts applied the techniques of principal component analysis and factor analysis despite the fact that the standard underlying assumptions are not even approximately satisfied in such an analysis, and obtained results which were successful, in that the expected relative magnitudes of the inter-object distances were recovered.

The problem was solved in a classic paper by Gower (1966); a simplified account is given by Gower (1967b). Suppose we have a data matrix X (n x p). Any row vector xi' gives the co-ordinates of a point pi which represents the position of the ith object in p-dimensional space. From X we can derive a Q matrix, of order n, of coefficients of association between objects. The eigenvectors of the Q matrix give the co-ordinates of the points qi which are the projections of the points pi on the new axes. The problem is to find the correct scaling for each eigenvector and the inter-object distance r(pi, pj) when this scaling is used. Gower (1966) showed that, if the eigenvectors are normalized so that the sum of squares of the elements of a vector equals the corresponding eigenvalue, then these vectors define a Euclidean space (a 'Gower space'). The eigenvectors can be used as orthogonal axes, the vector elements being the co-ordinates of the objects in the new space. Given certain conditions, the Gower space is Euclidean even if the original measures were not.

7.7.3.2 Mathematical basis

Suppose we have a matrix A, which is symmetric and of order n, the elements of which are some form of coefficient of association or distance between individuals. Matrix A has n eigenvalues and associated eigenvectors cj, which are the columns of a matrix C. The elements of the ith row of C are taken as the co-ordinates of a point qi in Euclidean n-space. The Pythagorean distance between two points qi and qj in this space is:

d²ij = Σk (cik - cjk)²

Gower (1966) showed that, if the eigenvectors are normalized so that the sum of squares of a vector is equal to the corresponding eigenvalue, then:

d²ij = aii + ajj - 2aij

That is, the inter-object distances r(pi, pj) map into the Euclidean Gower space, regardless of whether or not the original distance properties are Euclidean, so long as they are suitable for representing the interrelationships between the individuals as expressed in the equation above. For example, if A is a similarity matrix or a formal product-moment correlation matrix between the individuals, r(pi, pj) will be zero for complete identity and will attain its maximum value for complete opposites. In both these cases, the diagonal elements of A are unity, so that:

[r(pi, pj)]² = 2(1 - aij)

It also follows from the equation d²ij = aii + ajj - 2aij that, if we put aij = -½d²ij and aii = 0, then r(pi, pj) = dij, and this gives a direct method of finding the co-ordinates of a set of points given their inter-point distances dij.

The validity of the above theory depends upon matrix A fulfilling two conditions. First, it must be symmetric; otherwise, the relationship between the equations does not hold. Fortunately, most of the commonly used association or similarity measures (described in detail by Sneath and Sokal, 1973) are symmetrical, although Williams et al. (1971) gave an example which failed in this respect. The second requirement is for A to be positive semi-definite; otherwise, one or more of the eigenvalues will be negative and part of the space will be imaginary, and, therefore, non-Euclidean. Fortunately, most commonly used measures appear to define positive semi-definite matrices, although missing values in the original data matrix can destroy this property.

Having set out the above theory for mapping the original measures into a Euclidean space, Gower (1966) then considered the use of principal component analysis on the matrix of co-ordinates of the points qi, to find the best fit in fewer dimensions. He showed that this result can be achieved by transforming matrix A to matrix B such that:

bij = aij - ri - cj + g

where ri is the mean of the ith row, cj is the mean of the jth column, and g is the grand mean. The rows and columns of B sum to zero, and consequently B has a zero root. Matrix B must be positive semi-definite, as was required for A.

7.7.3.3 Computational procedures

The steps in the computation of the principal co-ordinate analysis are as follows.

  1. Form the association matrix A. Perhaps the most important aspect of this type of analysis is the choice of a suitable measure; this choice is discussed in more detail below.

  2. Transform A to B by the equation bij = aij - ri - cj + g.

  3. Calculate the eigenvalues and eigenvectors of B and normalize each eigenvector so that its sum of squares is equal to its corresponding eigenvalue.

It is clear that it is very easy to run into practical computing problems in calculating the eigenvalues and eigenvectors of a matrix of order n. When n is large, both time and storage space can become a problem even on modern computers. One way of overcoming this problem is to tridiagonalize the matrix using an algorithm which needs only the lower half of the matrix to be stored (e.g. the TRED3 algorithm of Martin et al., 1971). As only a few of the largest eigenvalues will be required, time can be saved by using an algorithm written especially for this purpose (e.g. the RATQR algorithm of Reinsch and Bauer, 1971). The corresponding eigenvectors of the tridiagonal matrix can then be calculated (e.g. by using the inverse iteration procedure in algorithm TRISTURM of Peters and Wilkinson, 1971) and back-transformed to those of the original matrix (e.g. by algorithm TRBAK3 of Martin et al., 1971). Another approach has been suggested by Lefkovitch (1976), and is based on the fact that, if Y is a transformation of the data matrix X such that YY' is a similarity or distance matrix, the principal co-ordinates of the n objects can be obtained from the eigenvalues and eigenvectors of Y'Y. This solution is also possible with mixed data types.
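For matrices of the order met in most ecological studies, the whole procedure now fits comfortably into a few lines of library code; a sketch (assuming numpy, and Euclidean input distances for the check at the end):

import numpy as np

def principal_coordinates(D, q=2):
    # Gower (1966): set aij = -0.5 * dij**2, double-centre to obtain B,
    # then scale each eigenvector so its sum of squares is its eigenvalue.
    A = -0.5 * D ** 2
    n = A.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = J @ A @ J                       # bij = aij - ri - cj + g
    d, C = np.linalg.eigh(B)
    d, C = d[::-1], C[:, ::-1]          # largest eigenvalue first
    return C[:, :q] * np.sqrt(d[:q].clip(min=0))   # co-ordinates of the qi

# The fitted co-ordinates reproduce D exactly when D is Euclidean.
X = np.random.default_rng(5).normal(size=(8, 3))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = principal_coordinates(D, q=3)
print(np.allclose(D, np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)))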

7.7.3.4 Interpretation

Principal co-ordinate analysis has more limited uses than principal component and factor analysis methods. Its main aim is the graphical treatment of the data. As n points can always be fitted into n - 1 dimensions, principal co-ordinate analysis must yield at least one zero eigenvalue. If the requirements mentioned above are satisfied, the r non-zero roots will be positive. The eigenvectors corresponding to the q largest positive eigenvalues give the co-ordinates of the points qi, which are the projections of the points pi on to the best-fitting subspace of dimensionality q. If an eigenvalue is small, then the contribution (cik - cjk)² to the distance between qi and qj will also be small. Thus, the only co-ordinates which contribute much to the distance are those with large eigenvalues which have wide variation in their vector elements. In many applications, it is found that the distances can be expressed adequately in terms of two or three such vectors. As in principal component analysis, the sum of squares of the residuals will be the difference between the trace of the matrix B and the sum of its q largest roots.

The first principal co-ordinate maximizes the total squared distance between the objects, the second maximizes the total squared distance in the space orthogonal to the first, and so on. The positions of the points can be plotted on pairs of principal co-ordinate axes, and the resulting ordination diagrams can be examined for pattern. The points are centred at the origin, because the columns of C sum to zero. In order to interpret each axis, we could calculate the correlation coefficient between each column of C and each of the original attributes. Because B = CC', any diagonal element bii is the squared distance of the point qi from the origin O, and an off-diagonal element bij is the cosine of the angle between the vectors from the centroid O to qi and qj, multiplied by the product of the distances Oqi and Oqj. The better the approximation given by the first q vectors, the closer qi will be to pi.

If matrix A consists of measures which are not real Euclidean distances, B may not be positive semi-definite, and may therefore have negative eigenvalues. Provided that these negative values are sufficiently close to zero, they can, in practice, be ignored.

Williams (1976) drew attention to an interesting application of principal co-ordinates in the analysis of factorial experiments. If the original treatments are orthogonal, the main effects tend to dispose themselves on separate eigenvectors. Furthermore, if the system has a very large error contribution, this may itself be patterned and may separate out as a distinct vector. Consequently, the true main effects can be tested against an error term from which an unwanted pattern has been partitioned out. Although this procedure appears to be both effective and powerful, it arouses misgivings in statisticians because the underlying rationale is not fully understood.

Another application also described by Williams (1976) is canonical co-ordinate analysis. This analysis can be applied to data in which two batteries of attributes have been measured on the same individuals. The two batteries of data are submitted to principal co-ordinate analysis separately, and a canonical correlation analysis is carried out on the two sets of principal co-ordinate vectors. Because of the orthogonality of the latter, the canonical correlation characteristic equation is considerably simplified: R11 and R22, and hence R11^-1 and R22^-1, become identity matrices. The equation thus reduces to:

| R12R21 - λ²I | = 0

If only the principal co-ordinate vectors corresponding to a few of the larger eigenvalues have been used, the 'noise' itself will have been discarded.

7.7.3.5 Alternative distance measures

Any transformation of A will result in a transformation in the distance between qi and qj, with a consequent distortion in the configuration of the objects. This distortion may result in curvature effects on the ordination graphs, particularly if A has been formed from presence-absence data (Jöreskog et al., 1976).

Pythagorean distance calculated from values of variates has nonsensical physical dimensions if the variates are measured on different scales. To overcome this difficulty, the variates can be normalized, usually by the sample standard deviation, although other normalizations could be used, for example the variate mean (when zero is not arbitrarily located), the range, or even the cube root of the sample third moment (Gower, 1966). The formula:

d²ij = Σk (xik - xjk)² / s²k

makes no attempt to allow for correlations.

Boratynski and Davies (1971) investigated the taxonomic value of male coccids using 12 different numerical methods and 3 principal co-ordinate analyses, with 101 characters. They concluded, tentatively, that principal co-ordinate methods using non-parametric measures of association are, perhaps, best suited to the analysis of coded multistate data. They admitted, however, that principal co-ordinate analysis is unlikely to yield seriously misleading results.

7.7.3.6 Duality with principal component analysis

Two techniques are defined as being dual to one another when they both lead to a set of n points with the same inter-point distances. Gower (1966) showed that principal co-ordinate analysis, if applied to a matrix whose elements are -½d²ij, where dij is the Pythagorean distance between objects i and j calculated from the original variates, is dual to a principal component analysis of the sums of squares and products matrix of the variates. He gave two important special cases of this duality: (i) Sokal's measure of taxonomic distance, and (ii) the simple matching coefficient.

If the principal components are derived from the correlation matrix, the inter-object distances are given by:

d²ij = Σk (xik - xjk)² / s²k

where sk is the standard deviation of the kth variate. This distance is the measure of taxonomic distance proposed by Sokal and others (see Sneath and Sokal, 1973). A principal co-ordinate analysis of the matrix whose elements are -½d²ij will lead to a reduced dimensional configuration identical to that obtained from a principal component analysis of the correlation matrix.

If S is the matrix of simple matching coefficients, then principal co-ordinate analysis of such a matrix will give points [2(1 - Sij)]^½ apart. The same points would be obtained by a principal component analysis of a matrix of corrected sums of squares and products. In fact, principal component analysis of (0, 1) data is exactly equivalent to assuming that the individuals are represented by points whose distances apart are proportional to (1 - Sij)^½.
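This duality is easily verified numerically; a sketch re-using the principal_coordinates function given in Section 7.7.3.3 above (the agreement is up to the arbitrary signs of the axes):

import numpy as np
# principal_coordinates() as defined in the sketch in Section 7.7.3.3.

X = np.random.default_rng(9).normal(size=(20, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Component scores from the correlation matrix.
d, B = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
pca_scores = Z @ B[:, ::-1][:, :2]

# Principal co-ordinates of the matrix of taxonomic (standardized
# Pythagorean) distances between the same individuals.
D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
pco_scores = principal_coordinates(D, q=2)

print(np.allclose(np.abs(pca_scores), np.abs(pco_scores)))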

7.7.3.7 Case studies

Factor or component scores from factor or principal component analysis can be plotted on pairs of rectangular Cartesian axes, and these scores provide a means of describing inter-object relationships. However, covariances or correlations are not the only bases, and may not necessarily be the best, on which to examine such relationships. The basis of principal co-ordinates is in a definition of inter-object similarity, or association, and the calculation of a similarity matrix, of order n x n. Given such a matrix, the steps in the procedures are analogous to those of component analysis, but the interpretation of the results is different.

Analysis of inter-object similarity was first applied in geology by Imbrie and Purdy (1962). In this study, similarities were defined with respect to properties of constituents, for example mineral species, which make up a rock or sediment. Gower's method of principal co-ordinate analysis is based on the use of a distance or dissimilarity matrix. The ability to ordinate a set of objects given only their dissimilarities can be useful in ecological studies, and there are some circumstances in which a particular dissimilarity measure might be preferred. For example, one might wish to emphasize dominance and thus use the Bray-Curtis measure, or, perhaps, be more concerned with relative properties and use the Canberra metric (Clifford and Stephenson, 1975). Principal co-ordinate analysis is particularly useful when there are missing values or missing variates. In such a case, a correlation type of similarity measure is reasonably robust and reliable, whereas replacing the missing values by estimates or guesses is not usually satisfactory (Marriott, 1974).

Again, there are numerous descriptions of the use of the technique in ecology and the environmental sciences, but no texts devoted solely to case studies of its application.

7.7.4 Reciprocal averaging

Where some of the attributes to be used in a multivariate ordination model are qualitative, the use of principal component analysis, while apparently robust, has little theoretical justification. Principal co-ordinate analysis, on the other hand, can still often be used, even on wholly qualitative data, by devising indices of similarity which are then translated into distances by formulae such as:

d²ij = 1 - Sij

where the similarity Sij between individuals is the ratio of 'matches' to the number of attributes compared, a 'match' occurring when the attribute is either present or absent in both individuals.

However, one of the most valuable methods devised for the ordination of multivariate data which are essentially qualitative is that of reciprocal averaging, described by Hill (1973). The model is especially appropriate for the analysis of data recording the presence or absence of species on sample quadrats. Geometrically, these data may be regarded as a set of points at the vertices of a hypercube, for which the ordination does not depend on the explicit use of the distances between the vertices.

Reciprocal averaging depends on a series of successive approximations during which the individuals are given an arbitrarily chosen set of starting scores, ideally chosen to represent some gradient suspected a priori as being reflected by the data. Average scores are then computed for the attributes, from which new, rescaled averages are calculated for the individuals. After a sufficient number of iterations, successive sets of attribute scores converge to the same vector, and the eigenvalue of the first axis is a measure of the extent to which the range of the scores contracts in one iteration.

When the first axis has been obtained, the second axis is considered, and a good starting point for the scores of this second axis may be obtained by using a set of scores which are close to the final scores for the first axis. Before iteration, however, these scores have to be adjusted by subtracting a multiple of the final first axis. A simple example, together with the appropriate computer algorithm, is given by Hill (1973).

The process is essentially a repeated cross-calibration which derives a unique, one-dimensional ordination of both the attributes and the individuals. It is called reciprocal averaging precisely because the attribute scores are averages of the individual scores, and, reciprocally, the individual scores are averages of the attribute scores. The final scores do not depend on the initial trial scores, but a good choice of initial trial scores considerably reduces the number of iterations required. The whole procedure is otherwise mathematically very similar to both principal component and principal co-ordinate analysis, and can be extended to cover quantitative as well as qualitative data.
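
The iteration itself is easily sketched. The following Python fragment (the quadrat-by-species matrix is hypothetical, and the code follows the verbal description above rather than reproducing Hill's published algorithm) extracts the first axis and its eigenvalue:

    import numpy as np

    def reciprocal_averaging(N, tol=1e-10, max_iter=1000):
        # first reciprocal averaging axis for a (quadrats x species) matrix N;
        # assumes every row and column of N has at least one occurrence
        row_tot = N.sum(axis=1).astype(float)
        col_tot = N.sum(axis=0).astype(float)
        x = np.linspace(0.0, 1.0, N.shape[0])    # arbitrary trial scores
        for _ in range(max_iter):
            u = (N.T @ x) / col_tot              # species score = average of its quadrat scores
            x_new = (N @ u) / row_tot            # quadrat score = average of its species scores
            eig = x_new.max() - x_new.min()      # contraction of the range estimates the eigenvalue
            x_new = (x_new - x_new.min()) / eig  # rescale to the range (0, 1)
            if np.max(np.abs(x_new - x)) < tol:
                break
            x = x_new
        return x_new, u, eig

    N = np.array([[1, 1, 0, 0],                  # 5 quadrats x 4 species, presence-absence
                  [1, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 1, 1, 1],
                  [0, 0, 1, 1]])
    quadrat_scores, species_scores, eigenvalue = reciprocal_averaging(N)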

Benzécri's (1973) method of correspondence analysis, based on the use of a contingency table for assessing similarities, shares many of the properties of reciprocal averaging, in that the similarity must be defined jointly and symmetrically. The method permits the simultaneous presentation of objects and attributes as points on the same pair of co-ordinate axes, so that mutual dependence can be interpreted. An extensive school of application of this method exists in the French ecological literature. An excellent introduction to the theory and application of correspondence analysis is given by Greenacre (1984).

7.8 DISCRIMINATION

The discrimination between two groups of individuals on the basis of measures of several attributes is a classical problem of which Fisher (1936) provided the earliest solution. He postulated a linear function of the measurements on each variable such that an individual can be assigned to one or other of the two groups with the least chance of being misclassified. Such a discriminant may be written as:

Z = a1x1 + a2x2 + ... + apxp

where a is the vector of discriminant coefficients and x the vector of observations or measurements made on an individual which is to be assigned to one or other of the two groups. In this model, however, we are only considering the possibility of two groups, and, while we may decide that some individual cannot be assigned with any confidence to either of the two groups, we are not considering the formation of other groups. Thus, the discrimination assumes the imposition of an a priori grouping of the individuals, information which is external to the analysis itself. The groups may be of very different sizes, i.e. with markedly different numbers of individuals. Again, we will seek to represent the p variables in as few dimensions as possible, but we will wish to emphasize the between-group variability at the expense of the within-group variability.
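
For two groups, Fisher's coefficients can be computed as a = W-1(xA - xB), where W is the pooled within-group covariance matrix discussed in the next section. A minimal sketch, with simulated data standing in for real measurements:

    import numpy as np

    def fisher_discriminant(XA, XB):
        # two-group linear discriminant: a = W^-1 (mean_A - mean_B)
        nA, nB = len(XA), len(XB)
        W = ((nA - 1) * np.cov(XA, rowvar=False) +
             (nB - 1) * np.cov(XB, rowvar=False)) / (nA + nB - 2)
        a = np.linalg.solve(W, XA.mean(axis=0) - XB.mean(axis=0))
        # midpoint of the two mean scores, used as the allocation threshold
        mid = 0.5 * (a @ XA.mean(axis=0) + a @ XB.mean(axis=0))
        return a, mid

    rng = np.random.default_rng(1)
    XA = rng.normal([0.0, 0.0], 1.0, size=(30, 2))
    XB = rng.normal([2.0, 1.0], 1.0, size=(30, 2))
    a, mid = fisher_discriminant(XA, XB)
    z = a @ np.array([1.8, 0.9])       # allocate to A if z > mid, otherwise to B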

7.8.1 Mathematical basis

For any effective analysis of the problem, we must assume that the within-group variances and covariances are homogeneous, with W representing the pooled within-groups covariance matrix, and B the between-groups covariance matrix. In order to find linear combinations of the variables that are orthogonal and that successively maximize the between-group variance by comparison with the within-group variance, we maximize the criterion:

V = l'Bl / l'Wl

and this may be shown to be equivalent to solving:

(B - λW)l = 0

where λ is the maximum of V.

For consistency, we require that:

|B - λW| = 0

leading once again to an equation of degree p having p solutions, λ1, λ2, ..., λp, with a vector lj corresponding to each λj (j = 1, 2, ..., p) and giving the required linear combinations. Because the equation can be written as:

(W-1B - λI)l = 0

the λj are effectively the eigenvalues of W-1B with corresponding eigenvectors lj, and, because the λj are the stationary values of the ratio l'Bl / l'Wl, the vector corresponding to the largest eigenvalue gives the direction along which the between-group variance is maximum relative to the within-group variance. The vector corresponding to the next largest eigenvalue gives the direction maximizing the remaining between-group variance relative to within-group variance, and so on.

The transformation of any vector x is thus given by L'x, and the space described by all such vectors is called the canonical variate space. Further, as L'WL is a diagonal matrix, the disadvantage of the arbitrary scaling of the lj may be overcome by normalizing as follows:

L'WL = I

The analysis represents the simultaneous reduction of B and W to diagonal form, in which the mean of the kth population in the original space is xk, and this becomes L'xk in the canonical variate space. The distance between the kth and jth means in the canonical variate space is:

(xk - xj)'LL'(xk - xj)

Canonical variate analysis is an extension of the idea of the discriminant function between two populations.
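
Computationally, it is convenient to solve the symmetric generalized eigenproblem Bl = λWl rather than to form W-1B explicitly. A sketch in Python (scipy is assumed to be available; the simulated groups are hypothetical):

    import numpy as np
    from scipy.linalg import eigh

    def canonical_variates(groups):
        # eigenvalues and vectors of W^-1 B via the generalized problem B l = lambda W l
        all_x = np.vstack(groups)
        grand = all_x.mean(axis=0)
        p = all_x.shape[1]
        W = np.zeros((p, p))
        B = np.zeros((p, p))
        for g in groups:
            m = g.mean(axis=0)
            W += (g - m).T @ (g - m)
            B += len(g) * np.outer(m - grand, m - grand)
        W /= len(all_x) - len(groups)        # pooled within-groups covariance matrix
        B /= len(groups) - 1                 # between-groups covariance matrix
        lam, L = eigh(B, W)                  # eigh normalizes the vectors so that L'WL = I
        order = np.argsort(lam)[::-1]
        return lam[order], L[:, order]

    rng = np.random.default_rng(0)
    groups = [rng.normal(m, 1.0, size=(25, 3))
              for m in ([0, 0, 0], [2, 0, 1], [0, 3, 1])]
    lam, L = canonical_variates(groups)
    scores = (np.vstack(groups) - np.vstack(groups).mean(axis=0)) @ L   # variate scores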

7.8.2 Computational aspects

Discrimination between two groups is characteristically simpler than discrimination between several groups, in that we need only seek a single discriminant function. Effective algorithms for the calculation are given by Blackith and Reyment (1971), and Davies (1971).

The extension of discriminant theory to more than two groups presents no great difficulties. All of the procedures for the two-group case generalize in a fairly obvious way. There are two possible ways of proceeding (Marriott, 1974). One possibility is to consider the groups in pairs, and estimate the discriminant function between each pair. An alternative is to consider the functions that maximize the variance ratio, or the variability between the groups considered together. These are the canonical variates, although usually they are standardized on the matrix W, so that they have unit variance within groups. Both methods can be considered as generalizations of the discriminant function of two groups. The two approaches lead, in general, to slightly different answers, but only because the elimination of non-significant factors is carried out at different stages. The choice between the two is largely a matter of convenience.

In the first approach, the discriminant function between any two groups is calculated from the means and the pooled dispersion matrix W. If the distances between all pairs are significant, there are ½g(g - 1) discriminant functions. The alternative method is to calculate the latent roots of W-1B by canonical analysis. If the first k are significant, and the value of Λ* = (1 - λk+1) ... (1 - λg-1) is not significant, only the first k canonical variates are used. Original variates which are irrelevant may be discarded from these canonical variates, which are then used to derive allocation rules as in the first method.

The two methods are not inconsistent. It is possible that a distance judged 'just significant' by the first will be judged 'not significant' by the second, and the discriminant functions will have slightly different coefficients unless all the canonical variates are used. Which approach is chosen depends essentially on whether the canonical variates give a worthwhile simplification. If k = 1, a single discriminant function is sufficient for all the groups, and the allocation rule is simply based on a dissection of the range of this function. If k = 2, a scatter diagram, with the two canonical variates as axes, can be drawn to show all the groups and the lines dividing them, based on discriminant functions which are linear functions of the two canonical variates. If k = 3, a three-dimensional model is needed to present the data. When k > 3, the advantage of using canonical variates is slight. If k is large, there may be some reduction of dimensionality, but it may still be better to consider doing several separate analyses.

Algorithms for canonical variate analysis are, again, given in Blackith and Reyment (1971) and Davies (1971).

7.8.3 A distribution-free method

When the basic assumptions of the discriminant analysis are not justified for a particular set of data, the following technique has been suggested by Kendall and Stuart (1979).

  1. Divide the range of each attribute into three non-overlapping parts:

     (a) one end, containing members of group A only;

     (b) the opposite end, containing members of group B only;

     (c) a middle, containing members of both A and B.

     If the extreme individuals at both ends belong to the same group, only (c) exists.

  2. Select as x1 the attribute with the largest number of individuals in (a1) and (b1).

  3. Assign all observations in (a1) to A, and all in (b1) to B. Record the innermost values of (a1) and (b1); the first instruction in the allocation rule is to allot to A or B all individuals having values outside these limits.

  4. Select as x2 the attribute with the largest number of the individuals of (c1) in (a2) and (b2). The second instruction in the allocation rule then allots individuals unallocated by x1 in accordance with their values of x2.

  5. This process can be continued until either all individuals are disposed of, or until all the attributes have an individual of the same group at both ends of their range. Richards (1972) has pointed out that there is no reason why an attribute should not be used more than once in different steps of the rule.

This method depends only on the ranking of the separate attributes. It is, however, very inefficient when there are many attributes and none of them discriminates efficiently. The method could be modified by considering combinations of attributes, but such a modification would destroy some of the generality of the method. The real difficulty, however, is to know when to stop. It is necessary to decide whether the discrimination afforded by a particular attribute is real, or merely a chance effect. Some attributes will show differences that are obviously significant, but a decision on whether the ranking associated with the best of several measures (which are not, of course, independent) is or is not a chance effect requires the use of bootstrap or jackknife techniques (Efron, 1982).
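
The first two steps of the rule are easily expressed in code. The sketch below (the function names and the reading of the 'pure ends' as values outside the other group's range are assumptions of this illustration) counts, for each attribute, the individuals lying outside the overlap of the two groups:

    import numpy as np

    def pure_end_counts(xa, xb):
        # individuals of A below the minimum or above the maximum of B, and vice versa
        a_pure = np.sum(xa < xb.min()) + np.sum(xa > xb.max())
        b_pure = np.sum(xb < xa.min()) + np.sum(xb > xa.max())
        return a_pure + b_pure

    def choose_first_attribute(XA, XB):
        # select as x1 the attribute whose pure ends capture the most individuals
        counts = [pure_end_counts(XA[:, j], XB[:, j]) for j in range(XA.shape[1])]
        return int(np.argmax(counts)), counts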

7.8.4 Generalized distance

The concept of the generalized distance between two populations was suggested by Mahalanobis (1936), and its properties have been investigated chiefly by Indian statisticians. The following account again follows closely that of Marriott (1974). Suppose d is the vector of differences between the x means in groups A and B (di = xiA - xiB; i = 1, ..., p). The statistic D2 = d'V-1d is the estimate of a corresponding parameter, dependent on the means of the two groups and the dispersion matrix within groups. This parameter is known as the squared generalized distance between the groups.

The generalized distance and its estimate have the following properties.

  1. They are scale-independent. The value of D2 is unaltered if the x's are multiplied by arbitrary constants, or replaced by a set of linear combinations which are not linearly dependent (e.g. any set of principal components).

  2. They take account of correlations between the variates.

  3. The value of D is the difference between the mean values of the discriminant function, regarded as a linear function of the x's for the two groups, divided by its standard deviation.

The distribution of D2 is known, and a significance test for the difference between the two groups is given by:

F = n1n2(n - p - 1)D2 / (n(n - 2)p)

with p and n - p - 1 degrees of freedom, where n is the total number of observations.

The test is a generalization to the p-variate case of the ordinary t-test. It is sometimes expressed in terms of Hotelling's T2. If:

T2 = n1n2D2 / n

a significance test is given by:

F = (n - p - 1)T2 / ((n - 2)p)

When p = 1, this expression reduces to the ordinary t-test for the difference between the means of two groups.

The distribution of D2 when the generalized distance in the population is not zero is based in the same way on the non-central F distribution (see Rao, 1965). The distribution has been used to compare distances based on different numbers of variates (Rao, 1950) or between corresponding groups in different populations.
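
A sketch of the two-group calculation, using the F form of Hotelling's test given above (scipy is assumed; the data are simulated):

    import numpy as np
    from scipy.stats import f as f_dist

    def mahalanobis_test(XA, XB):
        # D2 = d' V^-1 d with pooled covariance V, then the F test via Hotelling's T2
        nA, nB = len(XA), len(XB)
        p = XA.shape[1]
        d = XA.mean(axis=0) - XB.mean(axis=0)
        V = ((nA - 1) * np.cov(XA, rowvar=False) +
             (nB - 1) * np.cov(XB, rowvar=False)) / (nA + nB - 2)
        D2 = d @ np.linalg.solve(V, d)
        T2 = nA * nB / (nA + nB) * D2
        F = (nA + nB - p - 1) / (p * (nA + nB - 2)) * T2
        return D2, F, f_dist.sf(F, p, nA + nB - p - 1)

    rng = np.random.default_rng(1)
    XA = rng.normal(0.0, 1.0, size=(20, 3))
    XB = rng.normal(0.5, 1.0, size=(25, 3))
    D2, F, pval = mahalanobis_test(XA, XB)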

7.8.5 Case studies

As before, while discriminant function analysis figures largely in most texts on multivariate analysis, there are relatively few case studies of the application of the technique in ecology, as opposed to taxonomy. Jeffers (1978a) gives one case study of the discrimination between sites on Signy Island, in the South Orkney Islands, with and without vascular plants. Norris and Barkham (1970) made a comparison of some Cotswold beechwoods by multiple discriminant analysis.

Hill (1977) has reported on the use of simple discriminant functions to classify quantitative phytosociological data, and Nielsen et al. (1973) have made a statistical evaluation of geobotanical and biogeochemical data by discriminant analysis. Discriminant analysis in tree nutrition research has also been reported by White and Mead (1971), and Valentine and Houston (1979) have used a discriminant function to identify mixed oak stand susceptibility to gypsy moth defoliation.


7.9 CLUSTER ANALYSIS

Classification involves the recognition of similarities between, and the grouping of, the individuals of the basic data matrix. A classification may have more than one purpose, but the paramount purpose is to describe the relationships of objects to each other, and to simplify the relationships so that general statements can be made about classes of individuals. An important distinction is between monothetic and polythetic classifications. Monothetic classifications are those in which the classes established differ by at least one property which is uniform within the members of each class. In polythetic classifications, the classes are groups of individuals or objects that share a large proportion of their attributes, but do not necessarily agree in any one attribute. A corollary of polythetic classification is the requirement that many attributes can be used to classify the individuals. However, once a classification has been established, only a few attributes are generally necessary to allocate individuals to the proper group. Classifications based on many attributes will be general. They are unlikely to be optimal for any single purpose, but might be useful for a great variety of purposes. By contrast, a classification based on few attributes might be optimal with respect to those attributes, but would be unlikely to be of general use (Sokal, 1974).

Hence, classification of a data set results in a reduction of the amount of information that is necessary to describe the data, but, if the classification is efficient, there is little or no reduction in the amount of information contained in the data. Furthermore, classifications that describe relationships between individuals from a defined population should generate hypotheses, possibly the main scientific justification for the exercise.

If dissection is to be carried out, the basis of the dissection must be clearly defined. For example, an ecologist may regard the vegetation as essentially continuously changing, but changing more rapidly in some regions than others. He will therefore wish to treat these zones of maximum gradient as if they were discontinuities, and so sharpen them by an appropriate technique (Williams, 1971). Some cluster methods have properties which make them useful for different types of dissection, for example the various minimum variance methods and methods of the association analysis type. The flexible clustering strategy of Lance and Williams (1967) may also be useful for this purpose.

Cluster analysis of data representing the distribution of points in multidimensional space, where the distances between pairs of points are defined as some function of the observed sample values, has become a popular method of data analysis. The usual purpose of the analysis is to group the points in multi-dimensional space into (usually) disjoint sets which it is hoped will correspond to marked features of the sample. The grouped sets of points may themselves be grouped into larger sets, so that all the points are eventually classified hierarchically. This hierarchical classification can be represented diagrammatically in the form of a dendrogram showing the degree of relationship between individuals, and, ideally, a scale indicating the level of similarity between the suggested groupings. There are many forms of cluster analysis and classification, and the critical review of Cormack (1971) should be read by anyone intending to embark on the use of multivariate models for this purpose. In this Handbook, only a few of these forms will be considered.

7.9.1 Minimum spanning tree

Suppose that n points are given in two or more dimensions. A tree is defined as any set of straight-line segments joining pairs of points such that:

  1. no closed loops occur;

  2. each point is visited by at least one line;

  3. the tree is connected.

The length of the tree is the sum of the lengths of its segments, and, for any set of n points for which the lengths of all possible segments are known, it is possible to define a tree of minimum length which spans the points. Efficient algorithms for computing this minimum spanning tree are given by Gower and Ross (1969). The concept of the minimum spanning tree is of particular value in helping with the interpretation of diagrammatic representations of multivariate data, and also as a first stage in the single linkage cluster analysis considered below.
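
In practice the tree need not be coded from scratch; scipy, for example, provides an implementation (the points below are hypothetical):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree

    pts = np.array([[0.0, 0.0], [1.0, 0.2], [0.9, 1.1], [3.0, 3.0], [3.2, 2.7]])
    D = squareform(pdist(pts))              # lengths of all possible segments
    mst = minimum_spanning_tree(D)          # sparse matrix of the retained segments
    edges = np.transpose(mst.nonzero())     # pairs of points joined by the tree
    total_length = mst.sum()                # the length of the tree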

  
7.9.2 Single linkage cluster analysis

This method of cluster analysis was proposed by Sneath (1957) as a helpful way of summarizing taxonomic relationships in the form of dendrograms, where the relationships are expressed in terms of taxonomic distances between every pair of samples, measured on some convenient scale. The method entails clustering the individual samples by comparing their distances with a series of increasing threshold distances, these threshold distances usually being increased by small constant steps rather than continuously. It can be shown that the method of clustering is closely related to the minimum spanning tree, and the clusters at any level can be derived from the minimum spanning tree by deleting all segments of length greater than the defined level. Because some detail on the exact distances between samples may be lost when several links join between two threshold levels, the dendrogram derived from single linkage cluster analysis may not be exactly the same as the dendrogram derived from the minimum spanning tree.

Single linkage cluster analysis has the disadvantage of producing long clusters of 'chained' samples under certain conditions, and these elongated clusters are generally regarded as being undesirable. On the other hand, evidence of a continuous sequence of intermediate samples can be informative. In addition, unlike most other methods of cluster analysis, single linkage gives exactly the same results by aggregating small clusters into larger clusters as by dividing larger clusters into smaller ones. This property enables the method to be used for much larger numbers of samples than many other techniques of cluster analysis, and makes it especially convenient for preliminary analysis of very large sets of multivariate data.
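
A minimal sketch using scipy's hierarchical clustering routines (the data are simulated for illustration):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    pts = np.random.default_rng(0).normal(size=(20, 4))
    Z = linkage(pdist(pts), method='single')             # single linkage dendrogram
    groups = fcluster(Z, t=1.5, criterion='distance')    # clusters below threshold 1.5
    # scipy.cluster.hierarchy.dendrogram(Z) draws the tree itself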

7.9.3 k means clustering

Wishart (1968) proposed an entirely different approach to numerical classification, and the following account of the method is given by Marriott (1974).

If the criterion for the separation of groups is that there are several nodes in multivariate space, an obvious line of attack is to look for the nodes directly. Wishart did this by defining 'dense points' as the centres of hyperspheres of minimum radius containing a given number of points, and then expanding the hyperspheres to associate the other points with these centres. During this expansion, there is a continuous revision, the dense points moving to give minimum radius to the sphere associated with the number of points contained, and new dense points being defined as the radius increases. The process continues until all points have been classified, and, at this stage, they are divided into several groups giving the final classification. During the expansion, new group centres may emerge, and the groups already formed may be combined. If the distribution has only a single node, there may be only a single group with no discontinuity in the multivariate space.

The process depends on one parameter, i.e. the number of points defining the original dense points or nodes. It is an agglomerative hierarchical procedure, and if this parameter k = 1, it reduces to ordinary single-link clustering. Nevertheless, when k > 1, it differs from other hierarchical techniques in its aims and conclusions. The purpose of the analysis is to find a natural grouping, and the intermediate steps are of no importance. It is theoretically possible to construct a dendrogram to represent the steps leading to the final grouping, but the intermediate stages consist of one or more groups and additional isolated points.

The method is suitable for clustering both continuous variables and binary attributes. It is not satisfactory for a mixture of the two types of attributes, and discrete or coarsely grouped attributes are apt to be troublesome. The definition of dense points avoids any assumption of an underlying distribution, and there is no sampling theory or significance test associated with the method. The aim is to detect clustering of the observations. If, in fact, the data are a sample from a distribution of some sort, it is not clear how effective the method is at rejecting spurious clustering due to sampling fluctuations, which will depend on the value chosen for k. Provided that k is not very small, it will not suggest a grouping when the data themselves give no indication of heterogeneity.

One defect of the method is that the definition of dense points in terms of spheres makes it less effective when variates are highly correlated and the contours surrounding the modes are elongated ellipses. The difficulty could be overcome in the case of continuous variables by working with principal components or principal co-ordinates. Though it is technically a variation of agglomerative hierarchical clustering, the possibility of varying k gives Wishart's method far greater flexibility than other methods in this class. If the aim is to find a natural grouping, rather than to construct a dendrogram, it is effective and unlikely to give misleading results. On the whole, it is probably the best practical classification technique at present available. A crude sketch of the 'dense point' idea is given after the summary lists below.

In summary, Wishart's method has the following advantages:

  1. it is a direct approach designed to identify the nodes of the underlying distribution, or the clustering of the results;

  2. it is unlikely to suggest a completely spurious grouping;

  3. no sampling theory is invoked, though when the observations really are a sample, the properties of the method are not known.

Its weaknesses are:

  1. it is suitable only for continuous variables or binary attributes, but not both together;

  2. for continuous variates, rather large samples may be needed;

  3. it is insensitive in detecting elongated modes;

  4. the choice of the value of k may affect the conclusions.
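
The notion of a 'dense point' can be illustrated crudely by ranking each observation by the radius of the smallest hypersphere about it that contains k other points; the sketch below is an illustration of that idea only, not Wishart's full procedure:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def dense_points(X, k=5, n_dense=3):
        # radius of the smallest hypersphere about each point containing k neighbours
        D = squareform(pdist(X))
        radii = np.sort(D, axis=1)[:, k]    # distance to the k-th nearest neighbour
        return np.argsort(radii)[:n_dense], radii

    X = np.random.default_rng(2).normal(size=(50, 3))
    centres, radii = dense_points(X)        # indices of the tightest candidate centres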

7.9.4 The use of components in cluster analysis

In some problems, we may wish to examine whether the n objects can be classified into groups, or clusters, so that the points within a cluster are close together, but the clusters themselves are, ideally, far apart. The problems involved in cluster analysis are discussed above, but here we can discuss briefly how principal component analysis can be of use. The discriminatory power of principal components can serve as a clustering technique of great generality (see examples in Blackith and Reyment, 1971). Plotting points on pairs of orthogonal component axes can help in several ways. First, it may suggest the suitability (or otherwise) of a particular form of analysis, for example if there are clearly defined and separate groups and whether these are spherical or elongated. Second, it may show why a particular technique has not given satisfactory results, and it may suggest alternatives. Finally, it may confirm that a suggested clustering looks reasonable and fits the observations realistically.

The orthogonality of the principal component transformation means that it is distance- and angle-preserving in an r-dimensional space. How well plots of points on pairs of orthogonal component axes represent the real configuration of the points depends on how well this configuration is preserved in the reduced space. One way of looking for distortions in such plots is to superimpose on them the minimum spanning tree described above. It can also be useful to plot histograms of the frequency distributions of points along selected components. The procedure will show whether or not there are several nodes along a component (e.g. Webster and Burrough, 1972).

The investigator may have decided to accept the first q dimensions as preserving sufficient of the total variance for practical purposes. The configuration of the points in this q-dimensional space can be studied by calculating distances between pairs of points. It also might be worth computing the distances in r-dimensional space, and examining the differences between the distances in the two spaces to see if they are uniformly small (Rao, 1964).

Because the components are orthogonal, the distances can be simple Euclidean distances. As Euclidean distance depends on the scale of the attributes, it is unlikely to have much meaning if some attributes have a much greater range of values than others. Hence, it is generally used only when all the measurements have been standardized in some way (Marriott, 1974). Principal components are dimensionless, but, as calculated using the orthogonal eigenvectors, they have different variances, equal to the eigenvalues. Hence, if component values are to be used for the calculation of Pythagorean distances, it is preferable to normalize the components to unit sums of squares (Hope, 1968). In effect, this is a question of the weighting given to each component in its contribution to the distance. By normalizing each component, it is effectively being weighted according to its variance.

There will be occasions, however, where this simple procedure of plotting multivariate data as a series of two-dimensional projections, with the minimum spanning tree to indicate the closeness of the points in multivariate space, will not be sufficient. When these occur, a simple transformation described by Andrews (1972) may be used to obtain easily interpreted plots of high-dimensional data. This transformation embeds the data in a higher-dimensional, but easily visualized, space of functions, and then plots the functions.

For each point, expressed either as principal components or canonical variates, the function:

fx(t) = x1/√2 + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ...

(where x1, x2, x3, ... are the component values and t is a variable in the range -π < t < π) is plotted over the range -π < t < π. This function transforms the set of points into a set of lines drawn across the page in such a way that the mean of the functions of n observations corresponds to the mean of the observations themselves. The function representation also preserves distances, so that the distance between two functions is proportional to the Euclidean distance between the corresponding points. Even more important, the representation preserves variances, so that, if the components of the data are uncorrelated (as will usually be the case in our applications of the function) with a common variance σ2, then the function value at t, fx(t), has a variance which is given by:

var[fx(t)] = σ2(½ + sin2 t + cos2 t + sin2 2t + cos2 2t + ...)

The variability of the plotted function is almost constant across the graph, a fact which considerably simplifies the interpretation.
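
A short matplotlib sketch of such a plot (the component scores are simulated for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    def andrews_curve(x, t):
        # f(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ...
        f = np.full_like(t, x[0] / np.sqrt(2.0))
        for j, xj in enumerate(x[1:], start=1):
            k = (j + 1) // 2                # harmonics 1, 1, 2, 2, 3, ...
            f += xj * (np.sin(k * t) if j % 2 == 1 else np.cos(k * t))
        return f

    t = np.linspace(-np.pi, np.pi, 400)
    scores = np.random.default_rng(3).normal(size=(10, 5))   # e.g. component values
    for row in scores:
        plt.plot(t, andrews_curve(row, t))
    plt.xlabel('t')
    plt.ylabel('f(t)')
    plt.show()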

7.9.5 Other uses of components

If there is any reason to hypothesize that an object may belong to one of two or more groups which can be discriminated by a particular component, then the fit between such groups and the component values can be tested by analysis of variance (Hope, 1968). If the component values have been calculated using the orthonormal eigenvectors of a covariance or correlation matrix as described above, the total sum of squares of a component is n - 1 times its eigenvalue. The between-groups sum of squares can be calculated by squaring the mean component value of each group, multiplying the result by the number in the group, and then summing over all the groups. The within-groups sum of squares can be obtained by subtraction. For exploratory purposes, the more complicated designs in the analysis of variance may prove useful. If the hypothesis is that the groups may be discriminated by reference to two or more components, the multivariate analysis of variance may be useful (Rao, 1964), bearing in mind the need to satisfy certain requirements. It is worth bearing in mind the fact that, in practice, it may turn out that some of the components with smaller eigenvalues are better discriminators between groups with respect to the differences between means than are the first few components with the larger eigenvalues. Other uses are possible if the data can be referred to some sort of geographical grid, for example in stratified sampling (Patterson et al., 1978).
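
A sketch of the sums-of-squares calculation for a single component (the group labels are hypothetical):

    import numpy as np

    def component_anova(scores, labels):
        # one-way analysis of variance for the values of one component
        grand = scores.mean()
        total_ss = np.sum((scores - grand) ** 2)
        between_ss = sum(np.sum(labels == g) * (scores[labels == g].mean() - grand) ** 2
                         for g in np.unique(labels))
        return between_ss, total_ss - between_ss   # between- and within-groups SS

    rng = np.random.default_rng(4)
    scores = rng.normal(size=60)                   # values of one component
    labels = np.repeat([0, 1, 2], 20)              # hypothesized grouping
    b_ss, w_ss = component_anova(scores, labels)
    # F = (b_ss / (g - 1)) / (w_ss / (n - g)) for g groups and n observations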

Lefkovitch (1976) described a computationally simple divisive method for hierarchical clustering. He showed that the vectors of principal co-ordinates, in decreasing order of their eigenvalues, indicate the successive levels of the hierarchy of a dendrogram, and the signs of the vector elements indicate the group membership. Consider the following matrix C:

Object          Co-ordinate vectors

A            1.49     0.09     1      0
B            1.49     0.90    -1      0
C           -0.97    -1.13     0      0
D           -1.26     0.61     0   -0.5
E           -1.26     0.61     0    0.5

Eigenvalue   8.56     2.48     2    0.5

Starting from the left, the first division is into (A, B) and (C, D, E). At the next level, (C, D, E) divides into (C) and (D, E). At the following level, (A, B) divides, and at the final level (D, E) divides. Small differences in the numerical values of the co-ordinates do not alter the hierarchical relationships, which depend only on the signs.

There are three assumptions in this method: (i) XX' is an appropriate description of the inter-object relationships; (ii) the objects do, in fact, belong to disjoint, rather than overlapping, groups; (iii) a group formed by co-ordinates i and j is either unchanged by including co-ordinate j + 1 or it is divided into two. The between-group squared distance is then maximized by this method. Lefkovitch noted that computation can be simplified by transforming the data matrix X into Y and finding the eigenvalues and eigenvectors of Y'Y; the eigenvectors can then be converted to principal co-ordinates. This method can be used with a variety of attribute types.
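
The sign rule is easily mechanized; the sketch below (the function name is hypothetical) reproduces the divisions of the example above:

    import numpy as np

    def divisive_hierarchy(coords):
        # split groups on the signs of successive co-ordinate vectors,
        # taken in decreasing order of their eigenvalues
        groups = [tuple(range(coords.shape[0]))]
        levels = [list(groups)]
        for axis in range(coords.shape[1]):
            new = []
            for g in groups:
                neg = tuple(i for i in g if coords[i, axis] < 0)
                pos = tuple(i for i in g if coords[i, axis] >= 0)
                # a group divides only if both signs occur within it
                new.extend([neg, pos] if neg and pos else [g])
            groups = new
            levels.append(list(groups))
        return levels

    C = np.array([[ 1.49,  0.09,  1,  0.0],       # objects A to E
                  [ 1.49,  0.90, -1,  0.0],
                  [-0.97, -1.13,  0,  0.0],
                  [-1.26,  0.61,  0, -0.5],
                  [-1.26,  0.61,  0,  0.5]])
    for level in divisive_hierarchy(C):
        print(level)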

7.9.6 Association analysis

Association analysis is a technique for obtaining a monothetic, divisive hierarchy for a set of data representing the recorded presences or absences of a number of attributes. The advantages of the technique (Williams, 1971) are as follows.

  1. Group definitions are simple and unambiguous. The groups are defined in terms of the presence or absence of the chosen attributes.

  2. The majority of groups remain stable as additional entities are added to the data matrix. Alterations will occur, however, if a sufficient number of new entities affects the priorities in the choice of attributes.

  3. Computation is relatively fast, because there is usually more interest in the upper than the lower levels of the hierarchy, and the process can be halted at the required level.

  4. At least in theory, divisive strategies begin classification when the total information available is a maximum.

Divisive methods are particularly suitable for handling large data matrices. They can also be employed in data elimination to reduce large matrices to a size where agglomerative methods can be employed. Two types of method have been employed in the determination of the division attributes: those depending on information theory and those depending on χ2. Although the χ2 methods have now largely been superseded by information theory measures, some interest still remains in their use where it is not necessary to handle mixed data, i.e. presence or absence data mixed with quantitative variables (Lance and Williams, 1968a, b).

Given a set of binary attributes for a set of entities, χ2 is calculated for all attributes taken in pairs. These values are then summed for each attribute over all the other attributes, and the attribute with the largest Σχ2 is used as the basis for dividing the set of entities into two subsets: those possessing the attribute and those lacking it. Each subset is further subdivided in the same manner as the original array of data until the required number of subsets is obtained, or until none of the χ2 values exceeds a pre-set probability level.

Algorithms for divisive χ2 strategies based on binary data were produced by Williams and Lambert (1960) for vegetation analysis. Programs capable of handling mixed data and allowing for missing entries have been written by Lance and Williams (1968b). The technique is now regarded as having been superseded by indicator species analysis (see below), although it continues to have some useful properties.
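
A sketch of the selection of the first division attribute from a binary data matrix (the 2 x 2 χ2 is computed directly; the function names are hypothetical):

    import numpy as np

    def chi2_pair(x, y):
        # chi-squared for the 2 x 2 contingency table of two binary attributes
        n = len(x)
        a = np.sum((x == 1) & (y == 1))
        b = np.sum((x == 1) & (y == 0))
        c = np.sum((x == 0) & (y == 1))
        d = np.sum((x == 0) & (y == 0))
        den = (a + b) * (c + d) * (a + c) * (b + d)
        return n * (a * d - b * c) ** 2 / den if den > 0 else 0.0

    def division_attribute(X):
        # the attribute with the largest sum of chi-squared over all other attributes
        p = X.shape[1]
        totals = [sum(chi2_pair(X[:, j], X[:, k]) for k in range(p) if k != j)
                  for j in range(p)]
        return int(np.argmax(totals)), totals

    X = np.random.default_rng(5).integers(0, 2, size=(30, 6))
    j, totals = division_attribute(X)
    # the entities divide into X[X[:, j] == 1] and X[X[:, j] == 0]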

7.9.7 Indicator species analysis

In ecological applications, the individuals of the basic data matrix are ordered by the first axis of a reciprocal averaging ordination, and the individuals are then divided into two groups at the centre of gravity of the ordination. Five 'indicator species' are chosen by the function:

Ij = | m1/M1 - m2/M2 |

where Ij is the indicator value of species j (taking the value 1 if the species is a perfect indicator and 0 if it has no indicator value);
m1 is the number of individuals in which species j occurs on the negative side of the dichotomy;
m2 is the number of individuals in which species j occurs on the positive side of the dichotomy;
M1 is the total number of individuals on the negative side of the dichotomy;
M2 is the total number of individuals on the positive side of the dichotomy.

The five species with the highest indicator value are then used to construct an 'indicator score' for the whole set of individuals and to define an 'indicator threshold' which corresponds with the dichotomy.

The whole process may be repeated for the second and subsequent reciprocal averaging axes, so that the individuals are again divided by the same method, continuing as far as possible. No satisfactory rule for stopping the subdivision has so far been devised, so that there is a degree of arbitrariness both in the selection of thresholds and in the number of subdivisions. The indicator scores can be regarded as providing an ordination of the individuals on a six-point scale, an admittedly crude ordination, but one which can be done quickly. The aim is to mirror the original reciprocal averaging ordination sufficiently closely for the analysis to be used as a criterion for classifying the individuals.

An algorithm for the calculations of indicator species analysis is given by Hill (1973).
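
The indicator values themselves are a one-line calculation; a sketch (the data matrix and the dichotomy are hypothetical):

    import numpy as np

    def indicator_values(X, positive_side):
        # I_j = | m1/M1 - m2/M2 | for each species (column) of a
        # quadrats x species presence-absence matrix X
        s = np.asarray(positive_side, dtype=bool)
        M1, M2 = np.sum(~s), np.sum(s)
        m1 = X[~s].sum(axis=0)             # occurrences on the negative side
        m2 = X[s].sum(axis=0)              # occurrences on the positive side
        return np.abs(m1 / M1 - m2 / M2)

    X = np.random.default_rng(6).integers(0, 2, size=(12, 8))
    side = np.arange(12) >= 6              # dichotomy from the first ordination axis
    ind = indicator_values(X, side)
    best_five = np.argsort(ind)[-5:]       # the five species with the highest values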

7.9.8 Case studies

The literature on classification and cluster analysis in ecology is vast. For an introduction, it is probably best to start from the general texts of Orlocci (1975), Greig-Smith (1983), Gauch (1982), Williams (1976), and Blackith and Reyment (1971).

7.10 FITTING RELATIONSHIPS BETWEEN GROUPS OF VARIABLES

7.10.1 Mathematical basis

As an alternative to the assumption of an a priori structure imposed upon the individuals of the basic matrix, it may be assumed that the attributes of the matrix are divided into two sets, with p and q attributes respectively. This is equivalent to writing the data matrix:

X = [X1 X2]

where X1 has n rows and p columns and X2 has n rows and q columns.

The covariance matrix computed from the basic matrix may be partitioned as:

| A   C |
| C'  B |

where A is the p x p covariance matrix of the first set of attributes, B the q x q covariance matrix of the second set, and C the p x q matrix of covariances between the two sets. From this matrix, it will frequently be of interest to find the linear combinations:

ui = l'iX1
vi = m'iX2

where i = 1, 2, 3, ..., s, with the property that the correlation of u1 and v1 is greatest, the correlation of u2 and v2 is greatest among all linear combinations uncorrelated with u1 and v1, and so on for all possible pairs.

The correlation between any two linear combinations:

u = l'X1
v = m'X2

is given by:

r = l'Cm / √(l'Al · m'Bm)

and this is the criterion which we seek to maximize. It may be shown that this maximum is equivalent to the solution of the equations:

(C'A-1C - ρ2B)m = 0
(CB-1C' - ρ2A)l = 0

where ρ2 is the stationary value.

The two determinantal equations:

|C'A-1C - ρ2B| = 0
|CB-1C' - ρ2A| = 0

have identical non-zero roots, and the eigenvectors corresponding to these roots give the coefficients for the linear combinations. Just as with canonical variates:

L'AL = D1
M'BM = D2
L'CM = D3

where Di is a diagonal matrix. If q < p and all the vectors of type m are in the matrix M (q x q), then:

L = A-1CM

and L is (p x q), where the remaining p - q vectors of type l correspond to zero canonical correlations.
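
A numerical sketch of the whole calculation from the partitioned covariance matrix (scipy is assumed; the data are simulated):

    import numpy as np
    from scipy.linalg import eigh

    def canonical_correlations(X1, X2):
        # solve (C B^-1 C' - rho2 A) l = 0 as a generalized symmetric eigenproblem
        X1 = X1 - X1.mean(axis=0)
        X2 = X2 - X2.mean(axis=0)
        n = len(X1)
        A = X1.T @ X1 / (n - 1)            # covariances within the first set
        B = X2.T @ X2 / (n - 1)            # covariances within the second set
        C = X1.T @ X2 / (n - 1)            # covariances between the sets
        rho2, L = eigh(C @ np.linalg.solve(B, C.T), A)
        order = np.argsort(rho2)[::-1]     # with q < p, the last p - q roots are zero
        return np.sqrt(np.clip(rho2[order], 0.0, 1.0)), L[:, order]

    rng = np.random.default_rng(7)
    X1 = rng.normal(size=(50, 3))
    X2 = X1 @ rng.normal(size=(3, 2)) + rng.normal(size=(50, 2))
    rho, L = canonical_correlations(X1, X2)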

7.10.2 Multiple regression

The theory behind multiple regression is so well known that it is hardly necessary to give an account of the technique in this Handbook. Excellent texts already exist, notably Snedecor and Cochran (1967), Williams (1959) and Sprent (1969). Just about every computer installation has a multiple regression package, although many of them are computationally suspect.

7.10.3 The use of components in regression analysis

A serious problem in regression analysis is collinearity. The estimators of the coefficients depend on the inverse of the covariance matrix of the regressors. If one variable is a linear function of another, the coefficients in a regression equation which includes them are indeterminate, for the determinant of the covariance matrix vanishes. If some of the eigenvalues of the covariance matrix are small, its determinant is also small, and the coefficients will be ill-determined (Kendall, 1975).

Because the principal components of the matrix of regressor variables are orthogonal, it is natural to consider using them as regressors. Massey (1965) discussed principal component regression analysis in some detail. In an empirical study, he found that the principal component regression technique gave larger values for R2 while employing fewer regressors than did their classical counterparts in three out of four cases examined. He concluded that the methods involved are useful because: (i) they permit rapid calculation of the correlations between a dependent variable and each of the components, and (ii) they refer the regression results back to the projections of the original independent variables in the space spanned by the components included in a given regression. This partially overcomes the problem of identifying the components in order to give meaning to the regression coefficients. Jolliffe's (1982) warning about the need to include components with small eigenvalues in such calculations should, however, be heeded carefully.
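
A sketch of principal component regression itself (the number of components retained is the user's choice; the data are simulated):

    import numpy as np

    def pc_regression(X, y, n_components):
        # regress y on the leading components of X, then map the
        # coefficients back to the scale of the original x's
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        V = vecs[:, np.argsort(vals)[::-1][:n_components]]   # retained eigenvectors
        Z = Xc @ V                                           # component scores, orthogonal
        gamma = (Z * yc[:, None]).sum(axis=0) / (Z ** 2).sum(axis=0)
        return V @ gamma                                     # coefficients on the x's

    rng = np.random.default_rng(8)
    X = rng.normal(size=(40, 5))
    y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=40)
    beta = pc_regression(X, y, n_components=2)
    yhat = (X - X.mean(axis=0)) @ beta + y.mean()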

Daling and Tamura (1970) used components, not as new variables, but as the reference frame to identify a near-orthogonal subset of explanatory variables. Selection of such variables minimizes overlapping of information supplied by explanatory variables in the regression. They suggested applying varimax rotation to a type D eigenvector matrix. The varimax criterion produces a matrix of vectors in each of which a few variables tend to have high loadings, while the rest have small or zero loadings. The regression of the dependent variable on each varimax factor is then calculated to identify the explanatory variables which, having minimum interdependence among themselves, appear to make a significant contribution to the variance of the dependent variable. Hawkins (1973) described an interactive method which has the advantage of enabling good alternative subsets of predictors to be found easily. This method is based on a varimax rotation of a type D eigenvector matrix, the rotated matrix being used to suggest possible variables.

More recently, Page and Fabian (1978) have used principal component regression to examine the relationships between disease and air pollution. Jolliffe (1972, 1973) discussed eight methods, four of which use principal components. He applied five of the eight techniques to real data, including a multiple linear regression analysis. Mansfield et al. (1977) presented a method for reducing the number of inter-dependent attributes. The procedure first deletes components associated with small latent roots of X'X and then incorporates an analogue of the backward elimination procedure to eliminate the independent attributes, this deletion being based on minimal increases in residual sums of squares.

7.10.4 Canonical correlations

The generalization of which multiple regression is a special case assumes that there are two sets of attributes x1, ..., xp and y1, ..., yq, one set at least being random variables. If significance tests are to be used, it will also be assumed that one set is jointly normally distributed about a mean dependent on the other set.

Given the dispersion matrix of the complete set of p + q attributes, the correlation of any given linear combination of the x's with a given linear combination of the y's can readily be calculated. One of the possible pairs of linear combinations must have the maximum correlation, and this is called the first canonical correlation. The corresponding pair of linear combinations of the x's and y's are called the first canonical variables. Because they have an arbitrary scale factor and an arbitrary mean, the constants may be chosen so that each canonical variable has zero mean and unit variance.

The second canonical correlation and variable may then be defined by the second pair of variables, uncorrelated with the first pair, that have the maximum correlation. Similarly, the process may be continued until there are p pairs of canonical variables and p canonical correlations, assuming that p < q.

The resulting p pairs of variables have the following properties:

  1. all the correlations between them are zero, except those between the corresponding pairs;

  2. the correlations between the corresponding pairs form a decreasing sequence;

  3. each variable is standardized, i.e. has zero mean and unit variance.

Any interpretation of the relationships between the two sets of attributes is now based on these canonical variables. Under certain rather strict assumptions, it is possible to test whether there is evidence of any relationship, whether the relationship is accounted for by the first pair, or first few pairs, of variables, and whether some of the original attributes can be omitted without significantly altering the conclusions. If reification of the canonical variables is possible, the nature of the relationship between the two sets of attributes may be clarified.

An algorithm for canonical correlation analysis is given by Anderson (1958). The only difficulty with the computation is the need for an efficient procedure for finding the eigenvalues and eigenvectors of a non-symmetric matrix (Martin et al., 1971; Bowdler et al., 1971).

7.10.5 Case studies

There are very few examples of canonical correlation analysis being applied in ecology. Blackith and Reyment (1971) describe one or two such applications. Gittins (1979) also describes ecological applications of canonical analysis.

Acknowledgement

The help of Peter Howard in the writing of this chapter is gratefully acknowledged.

References

Anderberg, M. R. (1973). Cluster analysis for applications. New York, London: Academic Press.

Anderson, T. W. (1958). An introduction to multivariate statistical analysis. New York, Chichester: John Wiley.

Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34, 122-128.

Anderson, T. W., and Goodman, L. A. (1957). Statistical inference about Markov chains. Annals of Mathematical Statistics, 28, 89-110.

Andrews, D. F. (1972). Plots of high dimensional data. Biometrics, 28, 125-136.

Anthony, T. F., Taylor, B. W. (1977). Analyzing the predictive capabilities of  Markovian analysis for air pollution level variations. J. Environ. Manage., 2, 139-149.

Austin, M. P., and Noy-Meir, I. (1971). The problem of non-linearity in ordination: experiments with two-gradient models. J. Ecol., 59, 763-773.

Austin, M. P., and Orlocci, L. (1966). Geometric models in ecology II. An evaluation of some ordination techniques. J. Ecol., 54, 217-227.

Barkham, J. P. (1968). The ecology of the ground flora of some Cotswold beechwoods. University of Birmingham, PhD thesis.

Barraclough, R. M., and Blackith, R. E. (1962). Morphometric relationships in the genus Ditylenchus. Nematologica, 8, 51-58.

Bartholomew, D. J., and Forbes, A. F. (1979). Statistical techniques for manpower planning. New York, Chichester: John Wiley.

Bartlett, M. S. (1950). Tests of significance in factor analysis. British J. Psych. (Stat. Sect.), 3, 77-85.

Beals, E. W. (1973). Ordination: mathematical elegance and ecological naivete. J. Ecol., 61, 23-35.

Bellman, R. (1960). Dynamic programming. Princeton: Princeton University Press. 

Benzécri, J.-P. (1973). L'Analyse des données. Paris: Dunod.

Blackith, R. E., and Reyment, R. A. (1971). Multivariate morphometrics. London, New York: Academic Press.

Bhattacharya, R. N., et al. (1976). A Markovian stochastic basis for the transport of water through unsaturated soil. J. Amer. Soc. Soil Science, 40, 465-467. 

Binkley, C. S. (1980). Is succession in hardwood forests a stationary Markov process? Forest Science, 26, 566-570.

Boratynski, K., and Davies, R. G. (1971). The taxonomic value of male Coccoidea (Homoptera) with an evaluation of some numerical techniques. Bot. J. Linn. Soc., 3, 57-102.

 Botkin, D. B. (1977). Life and death in a forest: the computer as an aid to understanding. In: Ecosystem modelling in theory and practice (edited by C. A. S. Hall and J. W. Day), pp. 213-233. New York, Chichester: John Wiley.

Botkin, D. B., Janak, J. F., and Wallis, J. R. (1972a). Rationale, limitations and assumptions of a northeastern forest growth simulation. IBM J. Res. Dev., 16, 101-116.

Botkin, D. B., Janak, J. F., and Wallis, J. R. (1972b). Some ecological consequences of a computer model of forest growth. J. Ecol., 60, 849-872.

Bowdler, H., Martin, R. S., Reinsch, C., and Wilkinson, J. H. (1971). The QR and QL algorithms for symmetric matrices. In: Handbook for automatic computation, Vol. II, Linear algebra (edited by J. H. Wilkinson and C. Reinsch), pp. 227-250. Berlin, Heidelberg, New York: Springer-Verlag.

Bray, J. R., and Curtis, J. T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecol. Monogr., 27, 325-349.

Brockington, N. R. (1979). Computer modelling in agriculture. Oxford: Oxford University Press.

Buongiorno, J. and Michie, B. R., (1980). A matrix model of uneven-aged forest management. Forest Science, 26, 609-625.

Cassell, R. F. and Moser, J. W., (1974). A programmed Markov model for predicting diameter distribution and species composition in uneven-aged forests. Purdue Univ. Agric. Exp. Stn. Res. Bull. 915.

Cattell, R. B. (1965). Factor analysis: an introduction to essentials. I. The purpose and underlying models. Biometrics, 21, 190-215.

Clifford, H. T., and Stephenson, W. (1975). An introduction to numerical classification. London: Academic Press.

Colinvaux, P. A. (1973): Introduction to ecology. New York, Chichester: John Wiley. 

Connell, J., and Slatyer, R. O. (1977). Mechanisms of succession in natural communities and their role in community stability and organization. Am. Nat. , 111, 1119-1144.

Conway, G. R., and Murdie, G. (1972). Population models as a basis for pest control. In: Mathematical models in ecology (edited by J. N. R. Jeffers), pp. 195-214. Oxford: Blackwell Scientific.

Cooke, D. (In press). The description of plant succession data using a Markov chain model of plant-by-plant replacement.

Cooley, W. W., and Lohnes, P. R. (1971). Multivariate data analysis. New York, Chichester: John Wiley.

Cormack, R. M. (1971). A review of classification. J. R. Statist. Soc., A, 134, 321-367.

Cunia, T., and Chevrou, R. B. (1969). Sampling with partial replacement on three or  more occasions. Forest Science, 15, 205-224.

Dale, M. B. (1975). On objectives of methods of ordination. Vegetatio, 30, 15-32.

Daling, J. R., and Tamura, H. (1970). Use of orthogonal factors for selection of variables in a regression equation -an illustration. Appl. Statist., 19, 260-268.

Dempster, J. P. (1975). Animal population ecology. London, New York, San Francisco: Academic Press.

Davies, R. G. (1971). Computer programming in quantitative biology. London, New York: Academic Press.

Debussche, M., Godron, M., Lepart, J. and Romane, F. (1977). An account of the use of a transition matrix. Agro-Ecosystems, 3, 81-92.

Dent, J. B., and Blackie, M. J. (1979). Systems simulation in agriculture. London: Applied Science Publishers.

De Wit, C. T., and Goudriaan, J. (1974). Simulation of ecological processes. Wageningen, The Netherlands: Center for Agricultural Publishing and Documentation.

Drury, W. H., and Nisbet, I. C. T. (1973). Succession. J. Arnold Arbor., 54, 331-368.

Dubois, D. M. (1979). State-of-the-art on predator - prey systems modelling. In: State-of-the-art in ecological modelling (edited by S. E. Jorgensen), pp. 163-217. Oxford: Pergamon.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia, Pa.: Society for Industrial and Applied Mathematics.

Egler, F. E. (1954). Vegetation science concepts. I. Initial floristic composition, a factor in old-field vegetation development. Vegetatio, 4, 412-417.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen., 7, 179-188.

Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58, 453-467.

Gauch, H. G. (1982). Multivariate analysis in community ecology. Cambridge: Cambridge University Press.

Glaz, J. (1979). Probabilities and moments for absorption in finite homogeneous birth-death processes. Biometrics, 35, 813-816.

Goodall, D. W. (1972). Building and testing ecosystem models. In: Mathematical models in ecology, (edited by J. N. R. Jeffers), pp 173-194. Oxford: Blackwell Scientific.

Gower, J. C. (1967). Multivariate analysis and multidimensional geometry. Statistician, 17, 13-28.

Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857-871.

Gower, J. C., and Ross, G. J. S. (1969). Minimum spanning trees and single linkage cluster analysis. Applied Statistics, 18, 54-64.

Greenacre, M. J. (1984). Theory and applications of correspondence analysis. London, New York: Academic Press.

Greig-Smith, P. (1983). Quantitative plant ecology. Oxford: Blackwell Scientific. 

Hahn, H. H. and Eppler, B. (1979). Models of rivers. In: State-of-the-art in ecological  modelling, (edited by S. E. Jorgensen), pp. 13-58. Oxford: Pergamon Press.

Hall, C. A. S., and Day, J. W. (1977). Systems and models: terms and basic principles. In: Ecosystem modelling in theory and practice (edited by C. A. S. Hall and J. W. Day), pp. 6-36. New York, Chichester: Wiley.

Hall, C. A. S., Day, J. W., and Odum, H. T. (1977). A circuit language for energy and matter. In: Ecosystem modelling in theory and practice (edited by C. A. S. Hall and J. W. Day), pp. 37-48. New York, Chichester: Wiley.

Harbaugh, J. W. and Bonham-Carter, G. (1970). Computer simulation in geology. London, Chichester. John Wiley & Sons.

Hawkins, D. M. (1973). On the investigation of alternative regressions by principal component analysis. Appl. Statist., 22, 275-286.

Hill, M. O. (1973). Reciprocal averaging: an eigenvector method of ordination. J. Ecol., 61, 237-251.

Hill, M. O. (1977). Use of simple discriminant functions to classify quantitative phytosociological data. Proc. 1st Int. Symp. Data Analysis and Informatics, Versailles, September, Vol. 1, pp. 181-196.

Holland, D. A. (1969). Component analysis: an aid to the interpretation of data. Exp. Agric., 4, 151-164.

Hope, K. (1968). Methods of multivariate analysis. London: University of London Press.

Horn, H. S. (1974). The ecology of secondary succession. Ann. Rev. Ecol. Syst., 5, 25-37.

Horn, H. S. (1975). Markovian properties of forest succession. In: Ecology and  evolution of communities (edited by M. L. Cody and J. M. Diamond), pp. 126-211. Cambridge, Mass.: Harvard University Press.

Howarth, R. J., and Murray, J. W. (1969). The foraminiferida of Christchurch Harbour, England: A reappraisal using multivariate techniques. J. Paleont. , 43, 660-675.

Imbrie, J., and Purdy, E. G. (1962). Classification of modern Bahamian carbonate sediments. Amer. Ass. Petrol. Geol. Mem., 7, pp. 253-272.

Innis, G. S., and O'Neill, R. V. (eds) (1979). Systems analysis of ecosystems. Maryland: International Co-operative Publishing House.

Ivimey-Cook, R. B., and Proctor, M. C. F. (1967). Factor analysis of data from an east Devon heath: a comparison of principal component and rotated solutions. J. Ecol., 55, 405-415.

James, A. T. (1977). Tests for a prescribed subspace of principal components. In: Multivariate analysis IV (edited by P. R. Krishnaiah), pp. 73-77. Amsterdam: North-Holland Publishing Co.

Jardine, N., and Sibson, R. (1971). Mathematical taxonomy. New York, Chichester: John Wiley.

Jeffers, J. N. R. (1965). Principal component analysis in taxonomic research. For. Comm. Statist. Section paper no.83, 21 pp.

Jeffers, J. N. R. (Ed.) (1972). Mathematical models in ecology. Oxford: Blackwell Scientific.

Jeffers, J. N. R. (1978a). An introduction to systems analysis; with ecological applications. London: Edward Arnold.

Jeffers, J. N. R. (1978b). Design of experiments. Stat. Checkl., Inst. Terr. Ecol., no. 1.

 Jeffers, J. N. R. (1979). Sampling. Stat. Checkl., Inst. Terr. Ecol., no.2.

Jolicoeur, P., and Mosimann, J. E. (1960). Size and shape variation in the painted turtle, a principal component analysis. Growth, 24, 339-354.

Jolliffe, I. T. (1972). Discarding variables in a principal component analysis. I. Artificial data. Appl. Stat., 21, 160-173.

Jolliffe, I. T. (1973). Discarding variables in a principal component analysis. II. Real data. Appl. Stat., 22, 21-31.

Jolliffe, I. T. (1982). A note on the use of principal components in regression. Appl. Stat., 31, 300-303.

Jöreskog, K. G., Klovan, J. E., and Reyment, R. A. (1976). Geological factor analysis. Amsterdam: Elsevier.

Jorgensen, S. E. (Ed.) (1979a). State-of-the-art in ecological modelling. Oxford: Pergamon Press.

Jorgensen, S. E. (1979b). State-of-the-art in eutrophication models. In: State-of-the-art in ecological modelling (edited by S. E. Jorgensen). Oxford: Pergamon Press.

Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educ. Psychol. Meas., 20, 141-151.

Kendall, M. G. (1975). Multivariate analysis. London: Griffin.

Kendall, M. G., and Stuart, A. (1979). The advanced theory of statistics, Vol. 2 (4th edn). London: Griffin.

Kleijnen, J. P. C. (1975). Statistical techniques in simulation. New York: Dekker. 

Krishnaiah, P. R. , and Lee, J. C. (1977). Inference on the eigenvalues of the covariance matrices of real and complex multivariate normal populations. In: Multivariate analysis IV (edited by P. R. Krishnaiah), pp. 95-103. Amsterdam: North-Holland Publishing Co.

Krumbein, W. C. (1967). FORTRAN IV computer programs for Markov chain experiments in geology. Computer Contribution, 13, Kansas Geological Survey.

Krzanowski, W. J. (1971). The algebraic basis of classical multivariate methods. The Statistician, 20, 51-61.

Lance, G. N., and Williams, W. T. (1967). A general theory of classificatory sorting strategies. I. Hierarchical systems. Comp. J., 9, 373-380.

Lance, G. N., and Williams, W. T. (1968a). Mixed-data classificatory programs. II. Divisive systems. Aust. Comput. J., 1, 82-85.

Lance, G. N., and Williams, W. T. (1968b). Note on a new information-statistic classificatory program. Comp. J. , 11, 195.

Lassiter, R. R. (1979). Microcosms as ecosystems for testing ecological models. In: State-of-the-art in ecological modelling (edited by S. E. Jorgensen), pp. 127-161. Oxford: Pergamon.

Lassiter, R. R., Baughman, G. L., and Burns, L. A. (1979). Fate of toxic substances in the aquatic environment. In: State-of-the-art in ecological modelling (edited by S. E. Jorgensen), pp. 219-246. Oxford: Pergamon Press.

Lawley, D. N. and Maxwell, A. E. (1971). Factor analysis as a statistical method. London: Butterworth.

Lefkovitch, L. P. (1976). Hierarchical clustering from principal co-ordinates. Math. Biosci., 31, 157-174.

Lembersky, M. R. and Johnson, K. N. (1975). Optimal policies for managed stands: an infinite horizon Markov decision process approach. Forest Science, 21, 109-122.

Leslie, P. H. (1945). On the use of matrices in certain population mathematics. Biometrika, 33, 183-212.

Leslie, P. H. (1948). Some further notes on the use of matrices in population mathematics. Biometrika, 35, 213-245.

Lloyd, E. H. (1977). Reservoirs with seasonally varying Markovian inflows and their first passage times. International Institute for Applied Systems Analysis, RR-77-4.

McCammon, R. B. (1966). Principal component analysis and its application in large-scale correlation studies. J. Geol., 74, 721-733.

McCammon, R. B. (1968). Multiple component analysis and its application in classification of environments. Bull. Amer. Ass. Petrol. Geol., 52, 2178-2196.

McNeil, D. R. (1977). Interactive data analysis. Chichester, New York: John Wiley. 

Maguire, B. J. (1979). Modelling of ecological process and ecosystems with particular response structures: a review and a new paradigm for diagnosis of emergent ecosystem dynamics and patterns. In: State-of-the-art in ecological modelling (edited by S. E. Jorgensen), pp. 59-126. Oxford: Pergamon.

Mahalanobis, P. C. (1936). On the generalised distance in statistics. Proc. natn. Inst. Sci. India, 2, 49-55.

Mansfield, E. R., Webster, J. T., and Gunst, R. T. (1977). An analytic variable selection technique for principal component regression. Appl. Stat., 26, 34-40.

Marriott, F. H. C. (1974). The interpretation of multiple observations. London, New York, San Francisco: Academic Press.

Martin, R. S., Reinsch, C., and Wilkinson, J. H. (1971). Householder's tridiagonalization of a symmetric matrix. In: Handbook for automatic computation, Vol. II, Linear algebra (edited by J. H. Wilkinson and C. Reinsch), pp. 212-226. Berlin, Heidelberg, New York: Springer-Verlag.

Massey, W. F. (1965). Principal components regression in exploratory statistical research. J. Amer. Stat. Assn., 60, 234-256.

Mather, P. M. (1976). Computational methods of multivariate analysis in physical geography. London, Chichester: John Wiley.

Maynard-Smith, J. (1974). Models in ecology. Cambridge: Cambridge University Press.

Milner, C., and Hughes, R. E. (1968). Methods for the measurement of primary productivity of grassland. IBP Handbook No. 6. Oxford: Blackwell Scientific.

Morrison, D. F. (1967). Multivariate statistical methods. New York, London, Sydney: McGraw-Hill.

Mosteller, F., and Tukey, J. W. (1977). Data analysis and regression. London: Addison-Wesley Publishing Co.

Munn, R. E. (Ed.) (1975). Environmental impact assessment: principles and procedures. SCOPE Report No. 5.

Newbould, P. J. (1967). Methods for estimating the primary production of forests. IBP Handbook No. 2. Oxford: Blackwell Scientific.

Nielsen, J. S., Brooks, R. R., Boswell, C. R., and Marshall, N. J. (1973). Statistical evaluation of geobotanical and biogeochemical data by discriminant analysis. J. Appl. Ecol., 10, 251-258.

Norris, J. M. (1971). Functional relationships in the interpretation of principal component analysis. Area, 3, 217-220.

Norris, J. M., and Barkham, J. P. (1970). A comparison of some Cotswold beechwoods using multiple discriminant analysis. J. Ecol., 58, 603-619.

Noy-Meir, I. (1973). Data transformations in ecological ordination: I. Some advantages of non-centering. J. Ecol., 61, 329-341.

Noy-Meir, I., and Austin, M. P. (1970). Principal component ordination and simulated vegetational data. Ecology, 51, 551-552.

O'Neill, R. V. (1971). Systems approaches to the study of forest floor arthropods. In: Systems analysis and simulation in ecology, Vol. I (edited by B. C. Patten), pp. 441-477. New York: Academic Press.

O'Neill, R. V., Ferguson, N., and Watts, J. A. (Eds) (1977). A bibliography of mathematical modelling in ecology. EDFB/IBP-75/5. Washington, D. C.: US Govt Printing Office.

Orlóci, L. (1975). Multivariate analysis in vegetation research. The Hague: Dr W. Junk.

Ortega, J. (1967). The Givens-Householder method for symmetric matrices. In: Mathematical methods for digital computers, Vol. II (edited by A. Ralston and H. Wilf), pp. 84-115. Chichester, New York, Sydney: John Wiley.

Ott, W. R. (Ed.) (1976). Environmental modelling and simulation. Virginia: National Technical Information Service.

Overton, W. S. (1977). A strategy of model construction. In: Ecosystem modelling in theory and practice (edited by C. A. S. Hall and J. W. Day), pp. 49-73. New York, Chichester: John Wiley.

Page, W. P., and Fabian, R. G. (1978). Factor analysis: an exploratory methodology and management technique for the economies of air pollution. J. environ. Management, 6, 185-192.

Patten, B. C. (Ed.) (1971). Systems analysis and simulation in ecology, Vol. I. New York: Academic Press.

Patten, B. C. (Ed.) (1972). Systems analysis and simulation in ecology, Vol. II. New York: Academic Press.

Patten, B. C. (Ed.) (1975). Systems analysis and simulation in ecology, Vol. III. New York: Academic Press.

Patten, B. C. (Ed.) (1976). Systems analysis and simulation in ecology, Vol. IV. New York: Academic Press.

Patterson, J. G., Goodchild, N. A., and Boyd, W. J. R. (1978). Classifying environments for sampling purposes using a principal component analysis of climatic data. Agric. Meteorol., 19, 349-362.

Peden, L. M., Williams, J. S. and Frayer, W. E. (1973). A Markov model for stand projection. Forest Science, 19, 303-314.

Pelto, C. R. (1954). Mapping of multicomponent systems. J. Geol., 62, 501-511.

Peters, G., and Wilkinson, J. H. (1971). The calculation of specified eigenvectors by inverse iteration. In: Handbook for automatic computation, Vol. II, Linear algebra (edited by J. H. Wilkinson and C. Reinsch), pp. 418-439. Heidelberg: Springer-Verlag.

Phillipson, J. B. (1966). Ecological energetics. London: Edward Arnold.

Pielou, E. C. (1969). An introduction to mathematical ecology. New York, Chichester: John Wiley.

Plinston, D. T. (1972). Parameter sensitivity and interdependence in hydrological models. In: Mathematical models in ecology (edited by J. N. R. Jeffers), pp. 237-248. Oxford: Blackwell Scientific.

Radford, P. J. (1972). The simulation language as an aid to ecological modelling. In: Mathematical models in ecology (edited by J. N. R. Jeffers), pp. 277-296. Oxford: Blackwell Scientific.

Raiffa, H. (1968). Decision analysis: introductory lectures on choices under uncertainty. Reading, Mass.: Addison-Wesley.

Rao, C. R. (1950). Statistical inference applied to classificatory problems. Sankhya, 10, 229-256.

Rao, C. R. (1964). The use and interpretation of principal component analysis in applied research. Sankhya, Ser. A, 26, 329-358.

Rao, C. R. (1965). Linear statistical inference and its applications. Chichester, New York: John Wiley.

Rao, C. R. and Kshirsagar, A. M. (1978). A semi-Markovian model for predator-prey interactions. Biometrics, 34, 611-620.

Reinsch, C., and Bauer, F. L. (1971). Rational QR transformation with Newton shift for symmetric tridiagonal matrices. In: Handbook for automatic computation, Vol. II, Linear algebra (edited by J. H. Wilkinson and C. Reinsch), pp. 257-265. Heidelberg: Springer-Verlag.

Richards, L. E. (1972). Refinement and extension of distribution-free discriminant analysis. Appl. Stat., 21, 174-176.

Roberts, N. et al. (1983). Computer simulation. London: Addison-Wesley Publishing Company.

Ross, G. J. S. (1972). Stochastic model fitting by evolutionary operation. In: Mathematical models in ecology (edited by J. N. R. Jeffers), pp. 297-308. Oxford: Blackwell Scientific.

Rowen, H. S. (1976). Policy analysis as heuristic aid: the design of means, ends and institutions. In: When values conflict (edited by L. H. Tribe, C. S. Schelling and J. Voss). Cambridge, Mass.: Ballinger.

Schatzoff, M., and Tillman, C. C. (1975). Design of experiments in simulator validation. IBM J. Res. Dev., 19, 252-262.

Seal, H. (1964). Multivariate statistical analysis for biologists. London: Methuen and Co. 

Searle, S. R. (1966). Matrix algebra for the biological sciences. New York, Chichester: John Wiley.

Shoemaker, C. A. (1977a). Pest management models of crop ecosystems. In: Ecosystem modelling in theory and practice (edited by C. A. S. Hall and J. W. Day), pp. 545-574. New York, Chichester: John Wiley.

Shoemaker, C. A. (1977b). Mathematical construction of ecological models. In: Ecosystem modelling in theory and practice (edited by C. A. S. Hall and J. W. Day), pp. 75-114. New York, Chichester: John Wiley.

Shugart, H. H., and O'Neill, R. V. (1979). Systems ecology. Stroudsburg, Pa.: Dowden, Hutchinson Ross.

Shugart, H. H., Crow, T. R., and Hett, J. M. (1973). Forest succession models: a rationale and methodology for modelling forest succession over large regions. Forest Sci., 19, 203-212.

Simonds, J. L. (1963). Application of characteristic vector analysis to photographic and optical response data. J. Optical Soc. Amer., 53, 968-974.

Skellam, J. G. (1972). Some philosophical aspects of mathematical modelling in empirical science with special reference to ecology. In: Mathematical models in ecology (edited by J. N. R. Jeffers), pp. 13-28. Oxford: Blackwell Scientific.

Skogerboe, G. V., Walker, W. R., and Evans, R. G. (1979). Modeling process for assessing water quality problems and developing appropriate solutions in irrigated agriculture. In: State-of-the-art in ecological modelling (edited by S. E. Jorgensen), pp. 269-292. Oxford: Pergamon Press.

Slatyer, R. O. (1977). Dynamic changes in terrestrial ecosystems: patterns of change, techniques for study and application to management. UNESCO MAB Technical Notes 4.

Smeach, S. C. and Jernigan, R. W. (1977). Further aspects of a Markovian sampling policy for water quality monitoring. Biometrics, 33, 41-46.

Smith, F. E. (1970). Analysis of ecosystems. In: Analysis of temperate forest ecosystems (edited by D. Reichle), pp. 7-18. New York: Springer-Verlag.

Sneath, P. H. A. (1957). Computers in taxonomy. J. gen. Microbiol., 17, 201-226.

Sneath, P. H. A., and Sokal, R. R. (1973). Numerical taxonomy, the principles and practice of numerical classification. San Francisco: W. H. Freeman.

Snedecor, G. W., and Cochran, W. G. (1967). Statistical methods (6th edn). Ames: Iowa State Univ. Press.

Sokal, R. R. (1974). Classification: purposes, principles, progress, prospects. Science, 185, 1115-1123.

Sokal, R. R. and Rohlf, F. J. (1962). The comparison of dendrograms by objective methods. Taxon, 11, 33-40.

Sparks, D. N., and Todd, A. D. (1973). A comparison of FORTRAN subroutines for calculating latent roots and vectors. Appl. Stat., 22, 220-225.

Sprent, P. (1969). Models in regression and related topics. London: Methuen.

Stephens, G. R., and Waggoner, P. E. (1970). The forests anticipated from 40 years of natural transitions in mixed hardwoods. Bull. Conn. Agric. Exp. Stn, 77, 1-58.

Tansley, A. G. (1935). The use and abuse of vegetational concepts and terms. Ecology, 16, 284-307.

Tavare, S. (1979). A note on finite homogeneous continuous-time Markov chains. Biometrics, 35, 831-834.

Taylor, C. C. (1977). Principal component and factor analysis. In: The analysis of survey data, Vol. 1, Exploring data structures (edited by C. A. O'Muircheartaigh and C. Payne), pp. 89-124. New York, Chichester: John Wiley.

Tukey, J. W. (1977). Exploratory data analysis. London: Addison-Wesley Publishing Co.

Udvardy, M. D. F. (1975). A classification of the biogeographical provinces of the world. IUCN Occas. Pap. No. 18.

UNESCO (1972). Expert panel on the role of systems analysis and modelling approaches in the programme on Man and the Biosphere (MAB). UNESCO (MAB) Rep. Ser. No. 2.

Usher, M. B. (1972). Developments in the Leslie matrix model. In: Mathematical models in ecology (edited by J. N. R. Jeffers), pp. 29-60. Oxford: Blackwell Scientific.

Usher, M. B. (1979). Markovian approaches to ecological succession. Journal of Animal Ecology, 48, 413-426.

Usher, M. B. and Parr, T. W. (1977). Are there successional changes in arthropod decomposer communities? Journal of Environmental Management, 5, 151-160.

Valentine, H. T., and Houston, D. R. (1979). A discriminant function for identifying mixed-oak stand susceptibility to gypsy moth defoliation. Forest Sci., 25, 468-474.

Vandeveer, L. R., and Drummond, H. E. (1978). The use of Markov processes in estimating land use change. Tech. Bull. No. 148. Oklahoma: Agricultural Experimental Station.

Volterra, V. (1926). Variazioni e fluttuazioni del numero d'individui in specie animali conviventi. Mem. Accad. Nazionale Lincei (6), 2, 31-113.

Waggoner, P. E., and Stephens, G. R. (1970). Transition probabilities for a forest. Nature, 225, 1160-1161.

Ware, K. D., and Cunia, T. (1962). Continuous forest inventory with partial replacement of samples. Forest Sci. Monogr. No. 3.

Webster, R., and Burrough, P. A. (1972). Computer-based soil mapping of small areas from sample data. I. Multivariate classification and ordination. J. Soil Sci., 23, 210-221.

White, E. H., and Mead, D. J. (1971). Discriminant analysis in tree nutrition research. Forest Sci., 17, 425-427.

Whittaker, R. H. (Ed.) (1973). Ordination and classification of communities. The Hague: Dr W. Junk.

Whittaker, R. H. (1967). Gradient analysis of vegetation. Biol. Rev., 42, 207-264.

Wilkinson, J. H. (1965). The algebraic eigenvalue problem. London: Oxford Univ. Press.

Williams, E. J. (1959). Regression analysis. London, New York, Chichester: John Wiley.

Williams, W. T. (1971). Principles of clustering. Ann. Rev. Ecol. Syst., 2, 303-326.

Williams, W. T. (Ed.) (1976). Pattern analysis in agricultural science. Amsterdam, Oxford, New York: Elsevier.

Williams, W. T., and Lambert, J. M. (1960). Multivariate methods in plant ecology. II. The use of an electronic digital computer for association analysis. J. Ecol., 48, 689-710.

Williams, W. T., Lance, G. N., Tracey, J. G., and Connell, J. H. (1969). Studies in the numerical analysis of complex rain-forest communities. IV. A method for the elucidation of small scale forest pattern. J. Ecol., 57, 635-654.

Williams, W. T., Clifford, H. T., and Lance, G. N. (1971). Group-size dependence: rationale for choice between numerical classifications. Comp. J., 14, 157-162.

Williams, W. T., Lance, G. N., Webb, L. J., and Tracey, J. G. (1973). Studies in the numerical analysis of complex rain-forest communities. VI. Models for the classification of quantitative data. J. Ecol., 61, 47-70.

Williamson, M. H. (1972). The analysis of biological populations. London: Edward Arnold.

Wishart, D. (1968). A FORTRAN II program for numerical classification. St. Andrews University.

 

 
