Just what is regression analysis? In a nutshell, it uses data to define a 'model' or equation. From that model you can either predict the likely outcome of more of the same sort of data (and tell when the underlying situation changes) or try to determine the underlying dynamics of a situation.

I am taking an OU course called M346 Linear Statistical Analysis, and I have found that the course textbook appears to obscure rather than illuminate. It takes a broadly simple concept and makes it unintelligible. Having said that, I have a profound weakness when it comes to reading summation notation, so it might just be me at fault. I hope to give a few illustrations that explain some of the simple concepts behind regression analysis.

The Simple Linear Model

This is a concept which I like to think of by attacking it from the other way around. Take a linear relationship and add to each point on the line a random amount of error drawn from a normal distribution. And this is what it looks like. Here's the linear bit...
Adding a bit of normally distributed error to the line gives:
This is exactly the same data but with the extra added ingredient of error. So the simple linear model takes something like the above plot and assumes that really it is a straight line plus the extra error. Each data point can be thought of as a draw from a normal distribution centred on the value of the underlying line at that point.
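To make the "line plus error" idea concrete, here is a minimal sketch in plain Python. The intercept, slope and error spread are values I have made up purely for illustration; they are not the figures behind the plots.

```python
import random

random.seed(0)  # fixed seed so the example is repeatable

# Hypothetical values, chosen only for illustration.
INTERCEPT, SLOPE, SIGMA = 10.0, 2.0, 3.0

x = [float(i) for i in range(18)]               # explanatory variable
line = [INTERCEPT + SLOPE * xi for xi in x]      # the linear bit
error = [random.gauss(0.0, SIGMA) for _ in x]    # normal error, same spread everywhere
y = [li + ei for li, ei in zip(line, error)]     # the line with error added

# Show the first few points: the true line value and what we would observe.
for xi, li, yi in list(zip(x, line, y))[:3]:
    print(f"x={xi:4.1f}  line={li:6.2f}  observed={yi:6.2f}")
```

The same `SIGMA` is used for every point, so the spread of the error does not depend on x.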
One key assumption is that the variance remains constant throughout the model, for all values of the explanatory variable.

Least Squares Estimate

M346 seems to run on the currency of least squares, and knowing what all these numbers mean is key to understanding the concepts involved. Just what is least squares? It comes about because when you draw a line through a set of points (Fig. 1), unless it is a special case where all the points are exactly on the line, there will be a difference between the 'fitted' line and the points.

Fig. 1

It just so happens that the average of the dependent variable (the Y) is 292.7. If we take this to be our model, then plotting the model on the above dot plot gives Figure 2.

Fig. 2
Here the black line represents the average value (292.7) and the red dots represent the actual data points. The thin red lines represent what are called the residuals. As we can see above, taking the average value as the model fits the data less well than a model that incorporates the explanatory variable. The way we do this is to find a line on the graph that minimises the sum of the squared lengths of the residual lines. By calculating the residual sum of squares (RSS) for all the possible lines through the data, we can find the line (model) that minimises the RSS value.

Fig. 3
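As a rough sketch of what the RSS measures, the snippet below compares a flat line at the average of Y with one sloped candidate line. The five data points and the function name `rss` are invented for illustration and are not the data in the figures.

```python
def rss(x, y, intercept, slope):
    """Residual sum of squares for the line y = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_y = sum(y) / len(y)
rss_mean_only = rss(x, y, mean_y, 0.0)  # flat line at the average of Y
rss_sloped = rss(x, y, 0.0, 2.0)        # one candidate line, y = 2x

print(f"mean-only model RSS: {rss_mean_only:.3f}")
print(f"sloped line RSS:     {rss_sloped:.3f}")
```

The sloped line leaves far shorter residuals than the flat one, which is the sense in which it is the better model for this data.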
You might think that there is a lot of trying out different lines to get this! But no. There is an equation, or rather two equations: one for the gradient and one for the intercept.

Gradient:
Here n is the number of values and i is an index running from 1 to n. Using the gradient from the above equation we can find the intercept:
We then simply apply these to the line, and out pops our least squares approximation to the data, as you can see in Figure 1. If we want to calculate the variance of the data we use:
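As a rough sketch of these calculations in Python, the code below uses one standard textbook form of the closed-form estimates: the gradient is (n Σxy - Σx Σy) / (n Σx² - (Σx)²) and the intercept is the mean of y minus the gradient times the mean of x. For the variance I have assumed the usual estimate of the error variance, RSS / (n - 2). The function name and the data are my own inventions for illustration.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = least_squares_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Assumed variance estimate: residual sum of squares over n - 2.
residual_variance = sum(r ** 2 for r in residuals) / (len(x) - 2)

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"estimated error variance = {residual_variance:.4f}")
```

No trial and error is involved: the two sums-based formulas go straight to the line with the smallest RSS.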
Okay, we have fitted a line to our data, but how do we know that the line is an accurate representation of the underlying reality? The least squares approximation is the best line for the data, BUT the data is a sample from an underlying population and we don't necessarily know whether it's representative. So we calculate some further figures and perform a statistical test to see whether the relationship found by the regression is likely to be real. In reality we could take many samples and fit a regression line to each of them. This means that the slope and intercept of the regression lines will have a distribution.
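One way to see that distribution is to simulate it. The sketch below draws many samples from a population I have invented (the intercept, slope and error spread are hypothetical), fits a line to each sample, and looks at how the fitted slopes scatter.

```python
import random
import statistics

random.seed(1)  # fixed seed so the example is repeatable

def fit_slope(x, y):
    """Least squares slope for the points (x, y)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical population line and error spread.
INTERCEPT, SLOPE, SIGMA = 10.0, 2.0, 3.0
x = [float(i) for i in range(18)]

slopes = []
for _ in range(2000):
    # A fresh sample each time: same line, new random error.
    y = [INTERCEPT + SLOPE * xi + random.gauss(0.0, SIGMA) for xi in x]
    slopes.append(fit_slope(x, y))

print(f"mean of fitted slopes: {statistics.fmean(slopes):.3f}")
print(f"spread (std dev):      {statistics.stdev(slopes):.3f}")
```

The fitted slopes cluster around the true slope, and their spread is what the later test uses to judge a single sample.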
and more complex:
In the above case the distribution is normal, therefore we can use a test to check whether, given the sample, the slope of the underlying population line could plausibly be zero.
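A minimal sketch of one such test, assuming the usual t statistic for a regression slope: the fitted slope divided by its standard error, on n - 2 degrees of freedom. The data are simulated from a hypothetical line, and the critical value 2.120 is the two-sided 5% point for 16 degrees of freedom taken from standard t tables.

```python
import math
import random

def slope_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for testing that the true slope is zero."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_ms = rss / (n - 2)                    # estimate of the error variance
    standard_error = math.sqrt(residual_ms / sxx)  # standard error of the slope
    return slope / standard_error, n - 2

# Simulated data from a hypothetical line, 18 points as in the example.
random.seed(2)
x = [float(i) for i in range(18)]
y = [10.0 + 2.0 * xi + random.gauss(0.0, 3.0) for xi in x]

t, df = slope_t_statistic(x, y)
T_CRITICAL_16_DF = 2.120  # two-sided 5% point, 16 d.f., from t tables

print(f"t = {t:.2f} on {df} d.f.")
print("slope differs from zero at the 5% level:", abs(t) > T_CRITICAL_16_DF)
```

A large t means a slope this far from zero would be very unlikely to turn up by chance if the population slope really were zero.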
If for the purpose of the test we assume that
here the total number of points is 18 (n = 18).

Total:
The total sum of squares is the sum of the squares of all the residual lines measured from the average of Y. There are 17 degrees of freedom for this value, as only 1 is 'taken up' by the average. The total m.s. is the s.s. divided by the d.f. and represents an approximation of the variance of the data.

Residual:
This is the sum of the squared residuals from the fitted line. The better the model is, the smaller this value will be (relative to TSS). The degrees of freedom are n - 2, which is for the chi-squared distribution under normality. The m.s. here is the approximation of the variance about the fitted line and is s.s./d.f.

The regression s.s. is the amount of the sum of squares soaked up from the data by the fitted line. The d.f. for this is 1, which reflects the fact that we have used the data to fit an extra parameter, the slope.

Variance Ratio

This is the ratio of the regression m.s. to the residual m.s.
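The whole table can be sketched in a few lines of Python. I have assumed the standard analysis of variance layout for a simple regression, with the variance ratio taken as the regression m.s. divided by the residual m.s. The function name `anova_table` and the five data points are made up for illustration.

```python
def anova_table(x, y):
    """Return (table, variance_ratio) for a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    total_ss = sum((yi - mean_y) ** 2 for yi in y)  # residuals from the average
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    regression_ss = total_ss - residual_ss          # soaked up by the fitted line

    rows = {
        "regression": (regression_ss, 1),
        "residual": (residual_ss, n - 2),
        "total": (total_ss, n - 1),
    }
    # Mean square is sum of squares divided by degrees of freedom.
    table = {name: {"ss": ss, "df": df, "ms": ss / df} for name, (ss, df) in rows.items()}
    variance_ratio = table["regression"]["ms"] / table["residual"]["ms"]
    return table, variance_ratio

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

table, variance_ratio = anova_table(x, y)
for name, row in table.items():
    print(f"{name:<10}  s.s. = {row['ss']:8.3f}  d.f. = {row['df']}  m.s. = {row['ms']:8.3f}")
print(f"variance ratio = {variance_ratio:.1f}")
```

A large variance ratio says the fitted line has soaked up far more variation per degree of freedom than is left over in the residuals.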