Let's assume that we are given a "Data Set" - for us that will simply be a list (a column) of numbers. The tools/ideas developed below help one to understand and interpret this data set. Let's use the data set {2, 4, 4, 4, 6} to illustrate the first four of the concepts below. We'll use a different data set for the last two concepts.
AVERAGE (also called the Mean)
A Theorem about Standard Deviation
AVERAGE (or Mean):
BY HAND:
The AVERAGE should be very familiar to you. The AVERAGE of a collection of numbers is their sum, divided by how many numbers there are. So, in our case, the average is:
(2+4+4+4+6)/5 = 20/5 = 4.0
USING EXCEL:
We need to enter the numbers in Excel as a column, one number in each cell. Below, the data has been entered in Column B, Rows 3 through 7. This range of data (called, sometimes, the Data_Range or Data_Array or the Number_Range) in Excel is denoted by B3:B7. [Note, carefully, the "colon."] To compute the AVERAGE of these numbers, enter =AVERAGE(B3:B7) at the bottom of the column of numbers. When you press return, instead of seeing =AVERAGE(B3:B7), you'll see 4.0.
A
B
1
2
3
2
4
4
5
4
6
4
7 6 8 =AVERAGE(B3:B7) 9
BY HAND:
Another descriptor is called the MEDIAN. This is the middle value. That is, you arrange all the data in increasing (or decreasing) order and find the middle number. In our data set {2, 4, 4, 4, 6}, the middle value is 4
Note, that this concept is actually ambiguous when there is an even number of values in the data set. For example, what number is the middle of the list 2, 4, 6, 9? Often one says, by convention, that the median in this case is just the average of the TWO middle scores, so that in this case, the median would be 5. Note that when the data set is HUGE, say several 1000 numbers, it is very time consuming to arrange the data set in order. This is not an issue when the median is computed with Excel
USING EXCEL:
The EXCEL function that computes the MEDIAN is = MEDIAN(Data_Range), so that in our case, we'd write =MEDIAN(B3:B7) in the cell where we want the median to appear.
MODE:
BY HAND:
A third descriptor is called the MODE. This is, by definition, the most common score. In our data set {2, 4, 4, 4, 6} the mode is clearly 4.
Again, there is some ambiguity here. What is the most common score in the list 1, 2, 2, 3, 4, 4, 5, 8? One would say, in this case, that the data set has two modes: 2 and 4.
USING EXCEL:
The MODE formula is =MODE(Data_Range), so that to find the mode of our data set, we'd write =MODE(B3:B7). When there is ambiguity, EXCEL computes the smallest of the most common values.
BY HAND:
The STANDARD DEVIATION of a data set is a number which measures the how spread out the data are. So, for example, the standard deviation of the data set {3, 3, 3, 3, 3} will turn out to be zero, because the numbers are clearly not spread out at all.
The standard deviation of a bunch of numbers is defined to be [it's a mouthful] the square root of the average of the squares of the differences between each number and the mean of the bunch of numbers. A formula says it much better: Given a set of n numbers, x1, ....xn, the standard deviation of these n numbers
where
is the AVERAGE of all the numbers in the data set.
Let's compute the standard deviation for our data set {2, 4, 4, 4, 6}
We already know
; that's 4. Now, for each of the 5 numbers, 2, 4, 4, 4, 6, we need to compute the deviation from the mean and also the squares of these deviations
4 - 2 = 2, whose square is 4
4 - 4 = 0, whose square is 0
4 - 4 = 0, whose square is 0
4 - 4 = 0, whose square is 0
6 - 4 = -2, whose square is 4
And now we need to find the average of the squares of the deviations: (4+0+0+0+4)/5=1.6.
And then we need to find the square root of this last number: (you'll need a calculator for this). The square root of 1.6 = approx 1.26. That is, the STANDARD DEVIATION of the numbers 2, 4, 4, 4, 6 equals 1.26.
USING EXCEL:
Well, this is CONSIDERABLY easier in Excel, for one just enters a simple formula. The formula is =STDEVP(Data_Range) [What the "P" stands for is the word "Population." There are actually two different standard deviations. How they are different and why we are using this particular one is a bit of a long story. You'll need to take a full blown stat course to get to that story.]
So, in our case, we'd write: =STDEVP(B3:B7).
A Theorem on Standard Deviation:
There are lots of ways one could measure the spread of some data. We've described here the most common, the Standard Deviation. One reason that this measure is so valuable is the following theorem. A careful statement of this theorem takes more time than we have and its proof takes a rather advanced course in probability and statistics. So what follows is only a hint. Before the theorem can be stated, we need to define what it means for data to be NORMALLY DISTRIBUTED. The problem is, this, too, takes a good deal of work. Let me just say that data is normally distributed if when you graph the data (as we'll do when we discuss, below, the notion of a HISTOGRAM) that the graph is a special "bell-shaped" curve, called a NORMAL CURVE (which you may have heard of). This is a TERRIBLE definition, for I really have said hardly anything. Again, one needs a more advanced course to do this properly. With this said:
THEOREM: Assume we have data set that is normally distributed.
Then:
(a) Approximately 68.3% of all the data lie within 1 standard deviation of the mean.
(b) Approximately 95.4% of all the data lie within 2 standard deviations of the mean.
(c) Approximately 99.7% of all the data lie within 3 standard deviations of the mean.
Just to be clear: If the scores on an exam taken by a large class were normally distributed with mean 85 and standard deviation 5, then approximately 95.4% of all the scores would lie in the range from 85- 2(5) to 85 + 2(5), that is, from 75 to 95.
BY HAND:
Often one finds that the simple measures of central tendency (average, median, mode) and even the measure of dispersion (the standard deviation) don't describe the data well enough. One very useful tool in this case is the FREQUENCY TABLE.
A FREQUENCY TABLE is just that: a table that tells the frequency (how often) values occur in a data set. Usually data are grouped together in so-called BINS.
This concept is best illustrated with a large data set, so let's assume that the data set has 300 data points. They are the scores on an exam in a large lecture. I've put them in the following spreadsheet:
In this example, the scores range from a high of 99 to a low of 18. Here is the FREQUENCY TABLE for this data with BINS of size 5. [We could have chosen BINS of any reasonable size.] Just to be clear: the first "7" in right hand column means that there were 7 scores BIGGER than 15 but LESS THAN OR EQUAL TO 20. That is, the bin INCLUDES the 20 but NOT the 15. The common notation for such a range is (15,20]. The "(" indicates that the 15 is NOT included and the "]" indicates that the 20 IS included.
I would guess it would take one a good 20 minutes to create this table BY HAND.
Bin Boundaries
Number of Items in Bin
[Frequency]Bigger Than
Less than or Equal to
- infinity 0 0 0 5 0 5 10 0 10 15 0 15 20 7 20 25 8 25 30 5 30 35 11 35 40 76 40 45 20 45 50 11 50 55 6 55 60 5 60 65 0 65 70 0 70 75 0 75 80 4 80 85 9 85 90 101 90 95 28 95 100 9 100 infinity 0
USING EXCEL:
The FREQUENCY TABLE that Excel produces is a bit different from the one done by hand. It is the following, in this case:
Bins Frequency 0 0 5 0 10 0 15 0 20 7 25 8 30 5 35 11 40 76 45 20 50 11 55 6 60 5 65 0 70 0 75 0 80 4 85 9 90 101 95 28 100 9 0
Can you see the differences and the relationships? For example, there is only one column to represent the BINS. And, our number "7" is next to the 20 - so that the 20 stands for the top value of the bin. It's still the case that the bin INCLUDES the 20 but NOT the 15.
Said again:
The frequency for the bin labeled "0" equals the number of scores of 0 or less.
The frequency for the bin labeled "5" equals the number of scores 5 or less but greater than 0.
The frequency for the bin labeled "10" equals the number of scores 10 or less but greater than 5.
.
.
The frequency for the bin labeled "95" equals the number of scores 95 or less but greater than 90.
The frequency for the bin labeled "100" equals the number of scores 100 or less but greater than 95.
AND, one extra row is created, at the bottom, whose frequency value represents the number of scores greater than 100.
Now, here is how to use EXCEL to make a frequency table. [It's sort of complicated.]
It's a bit of a trick to enter the FREQUENCY function, as you have to
Put the data that your are analyzing in a column. This column is called the Data_array.
Create column of numbers that has the bin boundaries in it.
Highlight (with the mouse) the column NEXT to that column but of length ONE LONGER, so that the highlight starts level with the top of the bin list but extends one more row down.
Go to the menus on top of the Excel page and choose: Insert, Function and pick Statistical from the left section of the dialog box. Then pick FREQUENCY from the right dialog box and choose ok.
A new dialog box opens up: type the range for your data in the "Data_array" spot and type the range for your bins in the "Bins_array" spot. Remember, the format for a range is A4:A23, if, for example, the data starts in cell A4 and extends through cell A23.
Here is the tricky part: You DON'T choose OK next.
Instead you use the keys "Control-Shift-Enter." [Press, first, the Control and Shift keys and with them held down, press Enter. When you've done that, you'll get the frequencies entered right next to their appropriate bins, just where you highlighted.]
BY HAND:
A HISTOGRAM is simply a bar graph of the frequency table. Along the base ("x" axis) go the bin boundaries and in the vertical direction go the frequencies. By hand it's easy to label the bin boundaries; when Excel makes this graph, it's harder to do this. Once you see what Excel does, the BY HAND version will be clear. So, please look at the Excel version, below.
USING EXCEL
Highlight the data you want to graph (this is the frequency column) with the mouse.
Choose Insert Chart from the menus, make sure the chart type "column" is chosen, and then ALMOST choose "finish."
Actually, to get the labels correct along the bottom of the graph, there are some other steps. Instead of choosing "finish" right away, choose "next" and then the "series" tab at the top and then click the white box to the right of the phrase "Category (X) axis labels." Then use your mouse to highlight the bins_array in your spreadsheet. [You may have to move the dialog box out of the way, first.] NOW, press "finish."
Here is a trick to get the bars all next to each other: Click on one bar; then right click, then choose Format Data Series, then choose Options, and then set the gap width to zero.
This is the HISTOGRAM that I got: