What is the study of statistics?

That's a very good question and, like most good questions, it has lots of answers.

To me (at the moment) statistics is the science of pulling signal from noise. The signal is something (usually a number) that you hope has meaning in the real world; the noise is the rest of the world getting in the way.
It is a bit like doing the opposite of what we normally do to find an answer. Rather than stepping up to a problem and looking very intently at a single instance, we step away and hope that, if we look at enough events, the underlying pattern emerges. Statistics helps us to gauge when the inferences we are making about a situation are likely to be correct.

The power of the science of statistics is that it can identify patterns and also warn you if you might be making a mistake in your assumptions. It is a huge subject and potentially very complex. A statement couched in statistical terms has a very precise meaning. If the context is left out of a statistical result then the person quoting it is probably trying to lie to you. I put it that boldly. They do it to hide the 'truth' rather than spread it. News media are very, very good at doing it.

Regression analysis can be seen as the plotting of relationships between a response variable (or outcome) and its predictors, e.g. the Sun's output and global temperatures; or calorific intake, height and the number of hours of exercise per day against waistline measurements. These might all be considered continuous measurements that can be plotted on a graph, to which we then try to fit a line. This is regression.
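As a rough sketch of what "fitting a line" means, the Python below fits a straight line to some points by ordinary least squares. The data values and the function name fit_line are made up for illustration; a real analysis would normally use a statistics package.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (sum of cross-deviations) / (sum of squared x-deviations).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours of exercise per day against waistline (cm).
hours = [0.0, 0.5, 1.0, 1.5, 2.0]
waist = [100.0, 97.0, 94.0, 91.0, 88.0]

slope, intercept = fit_line(hours, waist)
print(f"waist = {intercept:.1f} + ({slope:.1f} * hours)")
```

Once you have the slope and intercept you can plug in a new predictor value to get a predicted response, which is the part ANOVA does not offer.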

ANOVA is similar to regression in that it takes predictive values and sees if they have an effect on the response variable, with the slight exception that ANOVA can be said to deal more with categorical groups rather than continuous predictors. Another important difference between the two methods is that ANOVA does not give you a line with which you can then run off and make a prediction based on another category. ANOVA deals more with statistical tests of whether the categorical value has a statistically significant effect on the response variable. This sub page gives some simple ideas behind ANOVA.
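To give a feel for the test behind a one-way ANOVA, here is a minimal sketch that computes the F statistic: the variation between the group means divided by the variation within the groups. The groups and the function name one_way_anova_f are invented for illustration.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)   # total number of observations
    k = len(groups)       # number of categories
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical response values for three categories.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F value suggests the category has an effect on the response; how large is "large" is read off the F distribution for (k - 1) and (n - k) degrees of freedom.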

The above chart depicts the standard normal distribution sitting over the top of a contrived data set (contrived to be a normal distribution).
The standard Normal distribution has the following parameters. (Parameters are the values that appear in a distribution's notation and in its functions, and which define that particular distribution.)
The mean value is 0, and the standard deviation and the variance are both equal to 1. Note: the variance is the square of the standard deviation.

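The probability density function, i.e. the function whose height at a value of x (say 3) shows how likely values near x are, given the parameters $\mu$ (the mean) and $\sigma$ (the standard deviation), is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$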

If we set the mean to 5 and the standard deviation to 0.5 then we get the following plot for the probability density function:

We can use the above plot to estimate how likely we are to obtain a value close to 3. We just need to read off the value of f(3). (Strictly, for a continuous distribution f(3) is a density rather than a probability; probabilities come from areas under the curve.) Unfortunately the value is too low to get a good estimate from the plot. It is almost certainly less than 0.01. We can see from the plot that most values will be between 4 and 6, with a modal (highest) frequency at 5. This should not come as a surprise, as we set the mean value to be 5 when we plotted the function.
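Rather than squinting at the plot, f(3) can be worked out directly. This is a small sketch using the usual Normal density formula; the function name normal_pdf is my own.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of the Normal curve at x for the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


f3 = normal_pdf(3, mean=5, sd=0.5)
peak = normal_pdf(5, mean=5, sd=0.5)
print(f"f(3) = {f3:.6f}")   # far below 0.01, as the plot suggests
print(f"f(5) = {peak:.4f}")  # the highest point of the curve
```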

If we have the same average but use a larger standard deviation (say 1.5) then we get:

Here the chances of getting a value near 3 are much better: f(3) is approximately 0.109.
The grey shaded area corresponds to the chances of having a number that is 3 or less. This is the Cumulative Distribution Function, which is the integral of the function f(x). In this case the integral would be evaluated between -infinity and 3.
For the normal distribution the CDF has no simple closed form (it is usually written in terms of the error function) and you can find it on the Normal Distribution page of Wikipedia. The CDF values are usually read off from a table of given values that are printed in most stats books. They are also available online.
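As an alternative to looking the values up in a table, the grey shaded area can be computed with the error function from Python's math module. This is only a sketch; the function name normal_cdf is my own, and NormalDist from the statistics module is used to get the curve height at 3 for comparison.

```python
import math
from statistics import NormalDist


def normal_cdf(x, mean, sd):
    """Area under the Normal curve from -infinity up to x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


area_up_to_3 = normal_cdf(3, mean=5, sd=1.5)
height_at_3 = NormalDist(5, 1.5).pdf(3)

print(f"P(X <= 3) = {area_up_to_3:.4f}")  # the shaded area
print(f"f(3)      = {height_at_3:.4f}")   # the curve height, about 0.109
```

Note that the shaded area and the height of the curve at 3 are two different numbers.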

Here the grey area corresponds to 95% of the area under the Normal PDF curve. 95% is just shy of 2 standard deviations either side of the mean (about 1.96; in the standard curve the mean is zero and the standard deviation is 1, as was mentioned above).
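The "just shy of 2" figure can be checked with the standard library. A minimal sketch, assuming Python 3.8 or later for statistics.NormalDist:

```python
from statistics import NormalDist

standard = NormalDist(0, 1)  # mean 0, standard deviation 1

# 95% in the middle leaves 2.5% in each tail, so look up the 97.5% point.
z = standard.inv_cdf(0.975)
middle_area = standard.cdf(z) - standard.cdf(-z)

print(f"z = {z:.3f}")
print(f"area between -z and +z = {middle_area:.3f}")
```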

My biggest problem with statistics is the notation. I struggle to remember what it means exactly. So I have compiled some notes here for reference.


I recently experimented with Excel and Minitab, exploring Student's t-distribution; you can find it on this other page.

This means that where $X \sim N(\mu, \sigma^2)$ (i.e. where X is a set of values and x belongs to X) then we expect the whole set to look like a normal distribution with a mean of $\mu$ and a variance of $\sigma^2$.
To test if this is true you would have to take lots of samples from X and plot the frequency with which each value appears in your sample against the value (this type of plot is called a histogram or sometimes a frequency plot) and see if the result truly approximates the Normal PDF plot for the same parameters.
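The histogram check can be sketched in a few lines. Here the "set" is simulated with random.gauss, so the mean of 5, the standard deviation of 1.5, the sample count and the bin widths are all assumptions chosen for the example.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

mean, sd = 5.0, 1.5
samples = [random.gauss(mean, sd) for _ in range(10_000)]

# Bins one standard deviation wide, from mean - 3 sd up to mean + 3 sd.
edges = [mean + k * sd for k in range(-3, 4)]
dist = NormalDist(mean, sd)

rows = []
for lo, hi in zip(edges, edges[1:]):
    # Fraction of samples that landed in this bin.
    observed = sum(lo <= s < hi for s in samples) / len(samples)
    # Fraction the Normal curve says should land in this bin.
    expected = dist.cdf(hi) - dist.cdf(lo)
    rows.append((lo, hi, observed, expected))
    print(f"{lo:5.2f} to {hi:5.2f}: observed {observed:.3f}, expected {expected:.3f}")
```

If the observed and expected columns are close, the data look Normal; a big mismatch in any bin is a sign that they do not.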

Because the overall distribution is Normal, we can treat randomly selected sub-sets of samples as little approximate versions of X that will (on average) share the same parameters as the master set.

This is the basic idea of sampling, and it leads on to the Central Limit Theorem.

More generally: if you take many random sample sub-sets of a set and work out the mean of each one, those sample means will be approximately normally distributed around the mean of the initial set, provided the sub-sets are reasonably large and the initial set has a finite variance. This holds regardless of the distribution of the initial set of data.
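A quick simulation makes the Central Limit Theorem easier to believe. In this sketch the initial data come from an exponential distribution (mean 1, standard deviation 1), which looks nothing like a Normal curve; the sample size of 30 and the 5,000 repeats are arbitrary choices of mine.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30     # size of each sample sub-set (assumed)
NUM_SAMPLES = 5_000  # how many sub-sets to draw (assumed)

# Draw many sub-sets from a skewed distribution and keep each one's mean.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
# Theory: the sample means have standard deviation sigma / sqrt(n).
predicted_sd = 1.0 / SAMPLE_SIZE ** 0.5

# For a Normal curve, about 95% of values lie within 1.96 sd of the mean.
within = sum(
    abs(m - 1.0) <= 1.96 * predicted_sd for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f} (predicted {predicted_sd:.3f})")
print(f"fraction within 1.96 sd: {within:.3f}")
```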

The chi-squared distribution is quite useful: it can be used to estimate the statistical significance of figures in tables, or to test the hypothesis that the data you have come from a given distribution, and it has countless other really great uses.
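As a small worked example of the "is my data from a given distribution" use, the sketch below computes the chi-squared statistic for 60 rolls of a die against the counts a fair die would give. The counts are invented, and the critical value of 11.07 (5 degrees of freedom, 5% level) is taken from standard chi-squared tables.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for faces 1 to 6 from 60 rolls of a die.
observed = [8, 12, 9, 11, 10, 10]
expected = [10.0] * 6  # a fair die gives 10 of each on average

stat = chi_squared_statistic(observed, expected)

# Tabulated critical value: 6 categories - 1 = 5 degrees of freedom, 5% level.
CRITICAL_5_PERCENT_DF5 = 11.07
consistent_with_fair = stat < CRITICAL_5_PERCENT_DF5

print(f"chi-squared = {stat:.2f}, consistent with a fair die: {consistent_with_fair}")
```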

Maximum likelihood estimation is a common concept that is used frequently in the field of statistics. It's used to identify the most likely parameters to slot into a probability distribution.
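One common way of finding the most likely parameters is maximum likelihood estimation. For a Normal distribution the answer has a simple form: the sample mean, and the standard deviation calculated by dividing by n rather than n - 1. The sketch below uses made-up data and checks that nudging either parameter makes the data less likely; the function name normal_log_likelihood is my own.

```python
import math
import statistics


def normal_log_likelihood(data, mean, sd):
    """Log of the joint Normal density of the data for the given parameters."""
    return sum(
        -math.log(sd * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * sd ** 2)
        for x in data
    )


data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6]  # made-up measurements

# Maximum likelihood estimates for a Normal distribution.
mean_hat = statistics.fmean(data)
sd_hat = math.sqrt(statistics.pvariance(data, mean_hat))  # divides by n

best = normal_log_likelihood(data, mean_hat, sd_hat)
shifted_mean = normal_log_likelihood(data, mean_hat + 0.1, sd_hat)
wider_sd = normal_log_likelihood(data, mean_hat, sd_hat * 1.1)

print(f"mean = {mean_hat:.3f}, sd = {sd_hat:.3f}, log-likelihood = {best:.3f}")
print(f"with shifted mean: {shifted_mean:.3f}, with wider sd: {wider_sd:.3f}")
```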

A valuable tool for examining the distribution of data.

While being forced to undergo "six sigma" training, studying with the OU or working for DenLHB, I have had cause to use two different statistical packages. Number one (my current favourite) is Minitab. I like Minitab because it allows for easy 'copy and pasting' into and out of MS Office apps. This makes it ideal for quick and 'dirty' analysis. The second application is Genstat. This is the package that M346 uses. It seems nice and complicated, but you have to remember an awful lot of commands to use it.

Both of the above packages are expensive++.

A free statistical package can be found here: R.