What is the study of statistics? That's a very good question and, like most good questions, it has lots of answers. To me (at the moment) statistics is the science of pulling signal from noise. The signal is something (usually a number) that you hope has meaning in the real world; the noise is the rest of the world getting in the way. The power of the science of statistics is that it can identify patterns and also warn you if you might be making a mistake in your assumptions. It is a huge subject and potentially very complex.

A statement couched in statistical terms has a very precise meaning. If the context is left out of a statistical result then the person quoting it is probably trying to lie to you. I put it that boldly. They do it to hide the 'truth' rather than spread it. News media are very, very good at doing it.

Regression analysis can be seen as the plotting of relationships between a response variable (or outcome) and its predictors, e.g. the Sun's output and global temperatures; or calorific intake, height and the number of hours of exercise per day against waistline measurements. These might all be considered continuous measurements that can be plotted on a graph so that we can try to fit a line to them. This is regression.

ANOVA is similar to regression in that it takes predictor values and sees if they have an effect on the response variable, with the difference that ANOVA deals with categorical groups rather than continuous predictors. Another important difference between the two methods is that ANOVA does not give you a line with which you can then run off and make a prediction for another category. ANOVA deals more with statistical tests of whether the categorical variable has a statistically significant effect on the response variable. This sub page gives some simple ideas behind ANOVA.
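To make the difference between regression and ANOVA a bit more concrete, here is a minimal sketch in Python (standard library only). The numbers are invented purely for illustration, and the helper names `fit_line` and `one_way_f` are my own, not taken from any package. The first function fits a straight line by ordinary least squares; the second computes the F statistic for a one-way ANOVA, i.e. the ratio of between-group to within-group variance.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def one_way_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Regression: a continuous predictor (invented data).
hours_exercise = [0.0, 0.5, 1.0, 1.5, 2.0]
waist_cm = [100.0, 97.0, 95.0, 92.0, 90.0]
intercept, slope = fit_line(hours_exercise, waist_cm)
predicted = intercept + slope * 1.25  # the line lets us predict for a new x
print(f"waist = {intercept:.1f} + ({slope:.1f}) * hours; at 1.25 h: {predicted:.2f}")

# ANOVA: a categorical predictor (three invented diet groups).
diet_groups = [[98.0, 101.0, 99.0], [94.0, 95.0, 96.0], [90.0, 89.0, 91.0]]
f_stat = one_way_f(diet_groups)
print(f"F statistic = {f_stat:.1f}")
```

The regression gives a line that can be used for prediction. The ANOVA gives only a test statistic: a large F suggests that the group means differ by more than the scatter within the groups would explain. Turning F into a p-value needs the F distribution, which this sketch leaves out.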
The above chart depicts the standard normal distribution sitting over the top of a contrived data set (contrived to be a normal distribution). The probability density function gives the relative likelihood of obtaining a value of x (say 3) given the parameters of the distribution, which for the Normal are the mean and the standard deviation. If we set the mean to 5 and the standard deviation to 0.5 then we get the following plot of the probability density function:
We can use the above plot to estimate how likely we are to obtain a 3. We just need to read off the value of f(3). (Strictly speaking f(3) is a probability density rather than a probability, because x is continuous, but it does tell us how likely values close to 3 are.) Unfortunately the value is too low to get a good estimate. It is almost certainly less than 0.01. We can see from the plot that most values will be between 4 and 6, with the mode (the most frequent value) at 5. This should not come as a surprise as we set the mean value to be 5 when we plotted the function. If we keep the same mean but use a larger standard deviation (say 1.5) then we get:
Here the chances of getting a value near 3 are much better: f(3) is approximately 0.109.
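Rather than reading f(3) off a plot, the value can be calculated directly. The sketch below assumes the usual formula for the Normal density, f(x) = exp(-(x - mean)^2 / (2 sd^2)) / (sd * sqrt(2 pi)); the function name `normal_pdf` is just my own label for it.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the Normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd  # distance from the mean in standard deviations
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


narrow = normal_pdf(3, mean=5, sd=0.5)  # the first plot: 3 is four sd below the mean
wide = normal_pdf(3, mean=5, sd=1.5)    # the second plot: 3 is 1.33 sd below the mean

print(f"f(3) with sd 0.5: {narrow:.6f}")  # prints 0.000268
print(f"f(3) with sd 1.5: {wide:.3f}")    # prints 0.109
```

This agrees with the two plots: with a standard deviation of 0.5 the value at 3 is far below 0.01, and with 1.5 it is about 0.109.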
Here the grey area corresponds to 95% of the area under the Normal PDF curve. 95% of the area lies within just under 2 standard deviations (about 1.96) either side of the mean (in the standard curve the mean is zero and the standard deviation is 1, as was mentioned above).

My biggest problem with statistics is the notation. I struggle to remember what it means exactly. So I have compiled some notes here for reference.
This means that where
Because the overall distribution is Normal, we can treat randomly selected sub-sets of samples as little approximate versions of X that will (on average) share the same parameters as the master set. This (I think I've got this right) is the basic concept behind the Central Limit Theorem. More generally: if you take many random samples from a set and work out the mean of each one, those sample means will be approximately normally distributed around the mean of the initial set, provided the samples are reasonably large. This holds regardless of the distribution of the initial set of data (as long as its variance is finite).

The chi-squared distribution is quite useful and can be used to estimate the statistical significance of figures in tables, or to test the hypothesis that the data you have is from a given distribution, and it has countless other really great uses...

This is a common concept which is used frequently in the field of statistics. It's used to identify the most likely parameters to slot into a probability distribution.

A valuable tool for examining the distribution of data.
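The Central Limit Theorem idea mentioned above can be checked with a small simulation. This is only a sketch under my own assumptions: the "master set" is made of exponentially distributed values (so it is clearly not Normal), and the sample size of 50 and the 1,000 repeats are arbitrary choices. If the theorem holds, the sample means should centre on the master set's mean, their spread should be close to the master set's standard deviation divided by the square root of the sample size, and roughly 95% of them should fall within 1.96 of those standard errors of the true mean.

```python
import math
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so that the run is repeatable

# A deliberately non-Normal master set: exponential values with mean 2.0
population = [random.expovariate(0.5) for _ in range(20_000)]


def sample_means(data, sample_size, n_samples):
    """Mean of each of n_samples random sub-sets drawn from data."""
    return [mean(random.sample(data, sample_size)) for _ in range(n_samples)]


sample_size = 50
means = sample_means(population, sample_size, n_samples=1_000)

pop_mean = mean(population)
std_error = stdev(population) / math.sqrt(sample_size)  # predicted spread of the means

# Fraction of sample means within 1.96 standard errors of the master mean
inside = sum(abs(m - pop_mean) <= 1.96 * std_error for m in means) / len(means)

print(f"master mean          : {pop_mean:.3f}")
print(f"mean of sample means : {mean(means):.3f}")
print(f"predicted spread     : {std_error:.3f}")
print(f"observed spread      : {stdev(means):.3f}")
print(f"within 1.96 std errs : {inside:.1%}")
```

A histogram of `means` would look roughly bell-shaped even though a histogram of `population` is heavily skewed.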
While being forced to undergo "six sigma" training, studying with the OU or working for DenLHB, I have had cause to use two different statistical packages. Number one (my current favourite) is Minitab. I like Minitab because it allows for easy copying and pasting into and out of MS Office apps. This makes it ideal for 'quick and dirty' analysis. The second application is Genstat. This is the package that M346 uses. It seems nice and complicated but you have to remember an awful lot of commands to use it. Both of the above packages are expensive++. A free statistical package is R.