Standard Deviation and Degrees of Freedom
In order to brush up on some elementary statistics, I decided to read more about standard deviation as a measure of variability. What I couldn’t understand initially was the difference between

$$ s_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} $$

known as the ‘standard deviation of a sample’, and

$$ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} $$

which is the ‘sample standard deviation’. The difference here lies in the denominator: n versus n-1.
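To make the two denominators concrete, here is a minimal Python sketch of both calculations. The function names and the five example scores are my own choices for illustration; Python's standard `statistics` module offers `pstdev` (divides by n) and `stdev` (divides by n-1), which are used here only as a cross-check.

```python
import math
import statistics


def sd_divide_by_n(scores):
    """Standard deviation with n in the denominator."""
    mean = sum(scores) / len(scores)
    squared_deviations = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(squared_deviations / len(scores))


def sd_divide_by_n_minus_1(scores):
    """Standard deviation with n-1 in the denominator."""
    mean = sum(scores) / len(scores)
    squared_deviations = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(squared_deviations / (len(scores) - 1))


scores = [3, 4, 5, 7, 6]  # illustrative scores, mean is 5

print(sd_divide_by_n(scores))          # same value as statistics.pstdev(scores)
print(sd_divide_by_n_minus_1(scores))  # same value as statistics.stdev(scores)
```

The n-1 version always comes out slightly larger, because the same sum of squared deviations is divided by a smaller number.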
The Wikipedia article on SD calls it Bessel’s correction, and the wiki entry on that is equally impenetrable unless you are a qualified mathematician. Here’s a more approachable explanation (in my own words) of the reason behind reducing n to n-1, which I read in the book ‘How to think about statistics’:
Enter Degrees of Freedom.
Degrees of freedom are essentially the number of values of data, or scores, that are free to vary. Consider a set of 5 numbers. If you were asked to guess those numbers, you would be free to think of any number at all. Let’s say that the first number is 3. What’s the second number? Again, you could think of anything. Let’s say you come up with 4 (let’s keep it simple), and so on, until you have four numbers: 3, 4, 5 and 7. Now, assuming you know the mean of all five scores (5 for our example), what could the missing value be? There’s only one possible number: 6.
So, if you know the mean of a set of scores with a single value missing, you are no longer free to select its value. The 5th score is completely determined by the mean and the other 4 scores. That’s the n-1 right there.
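The arithmetic behind that last step can be sketched in a few lines of Python. The variable names are just mine; the numbers are the ones from the example above.

```python
known_scores = [3, 4, 5, 7]  # the four scores we were free to choose
n = 5                        # total number of scores in the set
mean = 5                     # the mean we were told

# The total of all n scores must equal mean * n, so the last score
# is whatever is left over once the known scores are subtracted.
missing_score = mean * n - sum(known_scores)

print(missing_score)  # 6
```

Change any of the four free scores and the fifth changes with it; it never gets a choice of its own.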
When you calculate the mean of a set of scores, all the scores are free to take whatever values they have. So you divide their sum by n and get the mean for that set of scores. Remember, when you calculate the mean, your degrees of freedom remain n.
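When you calculate the standard deviation, you are using not only those independent values but the mean as well! The deviations from the mean always add up to zero, so once you know n-1 of them, the last one is fixed, just like the missing score in the example. In order to ensure our estimate isn’t biased and is based on ‘truly independent’ scores, you have to correct for the loss of a single degree of freedom. (Strictly speaking, it is the variance, the square of the standard deviation, that the correction makes unbiased.)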
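To see the bias for yourself, here is a rough simulation sketch in Python. Everything about the setup is my own assumption for illustration: a normally distributed population with a true variance of 1, samples of 5 scores, 20,000 repetitions and a fixed random seed. It compares variances rather than standard deviations, since the squared quantity is the easiest place to see the effect.

```python
import random


def variance(scores, denominator):
    """Sum of squared deviations from the sample mean, over the given denominator."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / denominator


def average_variances(n=5, trials=20000, seed=42):
    """Average the n and n-1 variance estimates over many random samples."""
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        # Draw n scores from a population whose true variance is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total_n += variance(sample, n)
        total_n_minus_1 += variance(sample, n - 1)
    return total_n / trials, total_n_minus_1 / trials


avg_n, avg_n_minus_1 = average_variances()
print("dividing by n:  ", avg_n)          # lands near 0.8, too small
print("dividing by n-1:", avg_n_minus_1)  # lands near 1.0, the true variance
```

With samples of 5, dividing by n underestimates the true variance by about a fifth on average, and dividing by n-1 removes that shortfall.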
Which is why an ‘accurate’ formula for standard deviation will have n-1 and not n. The reason this is usually ignored is that, for a large set of scores, the standard deviation doesn’t really change that much. For a sample of 100 scores, the denominator changes from 100 to 99, a difference of only 1%, and the standard deviation itself changes by about 0.5%.
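A quick way to check how fast the gap closes: for the same set of scores, the n-1 version is always the n version multiplied by the square root of n/(n-1). The short Python sketch below prints that factor for a few sample sizes. The function name and the sizes are arbitrary choices of mine.

```python
import math


def correction_factor(n):
    """Ratio of the n-1 standard deviation to the n standard deviation."""
    return math.sqrt(n / (n - 1))


for n in (5, 10, 100, 1000):
    extra_percent = (correction_factor(n) - 1) * 100
    print(f"n = {n:>4}: n-1 version is larger by about {extra_percent:.2f}%")
```

For 5 scores the correction is worth almost 12%, for 100 scores about 0.5%, and by 1000 scores it is down to roughly 0.05%.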
I know I sound like a complete novice at this – I am. Knowing mathematics and actually writing about it are two completely different things. Let’s see if I can learn both (wink).