work in progress

Basic one-way analysis of variance (ANOVA) is a statistical technique for judging whether several samples (treatment groups) could all have been drawn from the same population. It does this by asking whether the differences between the sample means are larger than the variation within each sample would explain.

ANOVA assumptions

  • Equal variance for each treatment group
  • Normal distribution around the mean for each treatment group.

To perform an ANOVA you have to calculate a number of different figures.

There is also a simple worked example further down the page.

Residual Sum of Squares (RSS)

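$$RSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$

Here $x_{ij}$ is the j-th value in treatment group i, $\bar{x}_i$ is the mean of group i, $n_i$ is the number of data points in group i, and k is the total number of treatment groups.

In other words, RSS is the sum, over all the treatment groups, of each group's sum of squared residuals.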

Here is the basic technique for finding RSS

  1. Find the residual for each value in each group (the value minus its group mean) and square it
  2. Add all the squared residuals in each treatment together (so you get a sum of squares of the residuals, or SSR, figure for each group)
  3. Add all the group SSR values together and this gives you your RSS
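
The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `residual_sum_of_squares` and the dictionary layout are my own choices, and the data are the X, Y and Z treatments from the worked example further down the page.

```python
# Treatment groups taken from the worked example on this page.
groups = {
    "X": [10, 11, 11, 13, 12, 15],
    "Y": [17, 17, 19, 16, 19.5, 21],
    "Z": [15.5, 14, 18, 16.5],
}


def residual_sum_of_squares(groups):
    """Return (per-group SSR figures, RSS) for a dict of treatment groups."""
    ssr = {}
    for name, values in groups.items():
        group_mean = sum(values) / len(values)
        # Steps 1 and 2: square each residual and add them up within the group.
        ssr[name] = sum((x - group_mean) ** 2 for x in values)
    # Step 3: add the group SSR figures together.
    return ssr, sum(ssr.values())


ssr, rss = residual_sum_of_squares(groups)
print(ssr)  # {'X': 16.0, 'Y': 17.875, 'Z': 8.5}
print(rss)  # 42.375
```

The total agrees with the RSS figure of 42.375 quoted in the worked example.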

So it is a measure of the total amount of variation within the treatment groups, i.e. the variation that is left over once each group's own mean has been taken into account.

Explained Sum of Squares (ESS)

ESS describes the 'ideal' situation in which none of the k treatments (3 in our case) has any variability within its samples, i.e. each treatment has n values and all n values in the treatment are the same. If the treatment means were then different, there would be no doubt that the treatments really were different, as the values do not vary (normally or otherwise) around the treatment's average.

So ESS is calculated by finding the mean for each treatment, subtracting the overall mean (for the entire set) from it and squaring this difference. We then multiply each treatment's squared difference by the number of samples in that treatment group and add the results together:

$$ESS = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$$

where $\bar{x}_i$ is the mean of treatment group i, $\bar{x}$ is the overall mean and $n_i$ is the number of samples in group i.

Put simply, this gives an indication of how much of the variation is explained by the fact that the data is grouped into treatments.
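
As a rough sketch of the ESS calculation described above, again using the X, Y and Z treatments from the worked example, something like the following would do. The function name `explained_sum_of_squares` is just an illustrative choice.

```python
# Treatment groups taken from the worked example on this page.
groups = {
    "X": [10, 11, 11, 13, 12, 15],
    "Y": [17, 17, 19, 16, 19.5, 21],
    "Z": [15.5, 14, 18, 16.5],
}


def explained_sum_of_squares(groups):
    """Sum over groups of (group size) * (group mean - overall mean)^2."""
    all_values = [x for values in groups.values() for x in values]
    overall_mean = sum(all_values) / len(all_values)
    ess = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # Weight each squared difference by the number of samples in the group.
        ess += len(values) * (group_mean - overall_mean) ** 2
    return ess


ess = explained_sum_of_squares(groups)
print(ess)  # 119.484375
```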

Total Sum of Squares

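Now we need to generate our last value, the Total Sum of Squares (TSS). For this we need the overall mean ($\bar{x}$) of all the values regardless of the treatment. TSS is the sum of the squared differences between every value and that overall mean:

$$TSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$$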

If the groups had NO EFFECT WHATEVER on the position of the mean for the samples (i.e. all the samples were really from the same overall population under the same conditions) then RSS and TSS should be roughly equal.
It is the difference between these two figures that forms the basis of our test statistic for whether or not there is an effect from the treatments.
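
To make the comparison between TSS and RSS concrete, here is a minimal sketch that computes TSS for the worked example data by ignoring the grouping altogether. The function name `total_sum_of_squares` is my own, for illustration only.

```python
# Treatment groups taken from the worked example on this page.
groups = {
    "X": [10, 11, 11, 13, 12, 15],
    "Y": [17, 17, 19, 16, 19.5, 21],
    "Z": [15.5, 14, 18, 16.5],
}


def total_sum_of_squares(groups):
    """Sum of squared differences between every value and the overall mean."""
    # Pool all the values together, regardless of treatment.
    all_values = [x for values in groups.values() for x in values]
    overall_mean = sum(all_values) / len(all_values)
    return sum((x - overall_mean) ** 2 for x in all_values)


tss = total_sum_of_squares(groups)
print(tss)  # 161.859375
```

For this dataset TSS is far larger than the RSS of 42.375 quoted in the worked example, which is what we would expect to see if the treatments do have an effect.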

However we are not quite there yet.

Interestingly, if you take ESS and add RSS then you get TSS:

TSS = ESS + RSS
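
For anyone who wants to see why this holds, here is a short derivation. The notation is my own assumption: $x_{ij}$ is the j-th value in group i, $\bar{x}_i$ is the mean of group i, $\bar{x}$ is the overall mean and $n_i$ is the size of group i.

```latex
% Split each deviation from the overall mean into a within-group part
% and a between-group part.
x_{ij} - \bar{x} = (x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x})

% Square both sides and sum over the n_i values in group i.
\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2
  = \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
  + n_i (\bar{x}_i - \bar{x})^2
  + 2 (\bar{x}_i - \bar{x}) \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)

% The last sum is zero, because the residuals about a group's own mean cancel.
\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) = 0
```

Summing what is left over the k groups gives TSS on the left-hand side and RSS plus ESS on the right.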

or alternatively:

ESS = TSS - RSS

Anyway, let's leave that for now...

Degrees of Freedom (df)

The df shows the number of values in the calculation of your statistic that are free to vary. In the calculation of the mean this value is 'n' (the number of samples) because each value is free to change.

However, in more complex statistics such as the standard deviation, the calculation uses the average value. This means that there are fewer degrees of freedom: once you know the mean and n-1 of the values, the nth value is no longer free, as it can only be the one remaining value consistent with that mean. As a result the system does not have n degrees of freedom. It has n-1 degrees of freedom.
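
A tiny sketch may make this more tangible. It takes treatment X from the worked example, pretends the last value is unknown, and recovers it from the mean and the other n-1 values. The variable names are illustrative only.

```python
# Treatment X from the worked example on this page.
values = [10, 11, 11, 13, 12, 15]
n = len(values)
mean = sum(values) / n

# Pretend we only know the mean and the first n-1 values.
known = values[:-1]

# All n values must add up to n * mean, so the last value is forced.
last_value = n * mean - sum(known)
print(last_value)  # 15.0
```

Because the last value can always be reconstructed this way, only n-1 of the values are really free to vary once the mean is fixed.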

Now to work out the values actually used in the ANOVA statistical test we need to generate a couple of extra statistics.

They are the mean square values for RSS and ESS (each sum of squares divided by its degrees of freedom), and to calculate them we need to know the degrees of freedom for RSS and ESS.

For ESS the number of degrees of freedom is k-1. Bear in mind the definition of ESS above: because we know the overall mean ($\bar{x}$), we can use k-1 of the treatment means together with the overall mean to find the kth treatment mean. We therefore only have k-1 degrees of freedom for ESS.

For RSS the number of degrees of freedom is n-k, where n here is the total number of data points across all the treatment groups. This is because k group means are used in calculating the statistic, so it loses k degrees of freedom.

Wikipedia has an article on degrees of freedom, which can only do a better job than I have done here.

Anyway let's forge ahead...

Mean Square Values

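We now calculate the Mean Square for Residuals and the Mean Square for Estimates. Each one is a sum of squares divided by its degrees of freedom:

$$\text{Mean Square for Residuals} = \frac{RSS}{n-k}$$

$$\text{Mean Square for Estimates} = \frac{ESS}{k-1}$$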

F-Statistic

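The two mean squares can be used to generate the random variable F:

$$F = \frac{ESS / (k-1)}{RSS / (n-k)}$$

where the numerator is the Mean Square for Estimates, the denominator is the Mean Square for Residuals, k is the number of treatment groups and n is the total number of data points.

Under the null hypothesis H0 (that all the treatment means are equal), F follows the F distribution with k-1 and n-k degrees of freedom. The value of F calculated from the data is often called the F statistic.

The F statistic is the one used to weigh the evidence for or against the null hypothesis: the larger it is, the less plausible it is that the treatments share a single mean. In all likelihood you will need a computer program, or a table of the F distribution, to convert the value of F into a probability.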
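
Here is a minimal sketch of the F calculation, assuming the sums of squares are already known. The function name `f_statistic` is hypothetical. The RSS of 42.375 is the figure from the worked example; the ESS of 119.484375 is not quoted in the text but is what the worked example data give when the definition of ESS is applied to them.

```python
def f_statistic(ess, rss, k, n):
    """F = (ESS / (k - 1)) / (RSS / (n - k)).

    k is the number of treatment groups, n the total number of data points.
    """
    mean_square_estimates = ess / (k - 1)
    mean_square_residuals = rss / (n - k)
    return mean_square_estimates / mean_square_residuals


# Worked example: 3 treatments (X, Y, Z) and 16 data points in total.
f_value = f_statistic(ess=119.484375, rss=42.375, k=3, n=16)
print(round(f_value, 3))  # 18.328
```

If SciPy happens to be installed, `scipy.stats.f.sf(f_value, 2, 13)` should give the corresponding upper-tail probability, and `scipy.stats.f_oneway` can run the whole test from the raw data. Neither is needed for the sketch above.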

 


Worked Example

Let's work through a simple example

X, Y and Z are our treatments. They could be anything: fish breeds, training shoe manufacturers, etc. Similarly the data points could be anything. They are just selected to make the average values convenient and the variances comparable.

X      Y      Z
10     17     15.5
11     17     14
11     19     18
13     16     16.5
12     19.5
15     21

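The mean values are:

$$\bar{x}_X = \frac{72}{6} = 12, \quad \bar{x}_Y = \frac{109.5}{6} = 18.25, \quad \bar{x}_Z = \frac{64}{4} = 16$$

Other key figures are:

$$k = 3, \quad n_X = 6, \quad n_Y = 6, \quad n_Z = 4, \quad n = 16, \quad \bar{x} = \frac{245.5}{16} = 15.34375$$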


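For RSS 'doing the math' would look something like this:

$$SSR_X = (10-12)^2 + (11-12)^2 + (11-12)^2 + (13-12)^2 + (12-12)^2 + (15-12)^2 = 16$$

$$SSR_Y = (17-18.25)^2 + (17-18.25)^2 + (19-18.25)^2 + (16-18.25)^2 + (19.5-18.25)^2 + (21-18.25)^2 = 17.875$$

$$SSR_Z = (15.5-16)^2 + (14-16)^2 + (18-16)^2 + (16.5-16)^2 = 8.5$$

and...

$$RSS = 16 + 17.875 + 8.5 = 42.375$$

So our residual sum of squares (RSS) value is 42.375.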


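Next in line is ESS. First we need the average for the whole set:

$$\bar{x} = \frac{72 + 109.5 + 64}{16} = \frac{245.5}{16} = 15.34375$$

With the average for the whole set we can use the definition of ESS:

$$ESS = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$$

Here is what it looks like for our example:

$$ESS = 6(12 - 15.34375)^2 + 6(18.25 - 15.34375)^2 + 4(16 - 15.34375)^2 \approx 67.0840 + 50.6777 + 1.7227 \approx 119.4844$$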


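There are a couple of ways of calculating TSS. The quick way is to add the two figures we already have:

$$TSS = ESS + RSS = 119.484375 + 42.375 = 161.859375$$

The long way is to go to each of the X, Y and Z values, find the difference between the value and the overall mean (15.34375), square it and add up all the individual numbers:

$$TSS = (10 - 15.34375)^2 + (11 - 15.34375)^2 + \dots + (16.5 - 15.34375)^2$$

which comes to:

$$TSS = 161.859375$$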

TSS is a measure of the total variation across the whole dataset, whereas RSS is a measure of the variation within the groups.