It is nice to have a number specifying where data lies (e.g., mean, median), but it is also nice to know how representative of the data that number is (i.e., how far from that number the data lies). The data sets {10, 30, 50, 70, 90} and {40, 45, 50, 55, 60} both have the mean=median=midrange=50, but they differ in how much the data is spread out. To distinguish these data sets, it is useful to complement a measure of location with a measure of spread or variation. There are several statistics which characterize the amount of spread, of which the range and the standard deviation are most frequently used.

## Range

The range is the extent of the data set; it is defined as the maximum minus the minimum. For data sets A and B the range is 7 and 8, respectively. For the weights of students the range is 235-105=130. Note that the range is a single number, the difference between the maximum and minimum, *not* an ordered pair specifying the minimum and maximum. Note also that the range is defined from the extreme individuals, hence does not measur how far a typical data value is from the middle.

If you know the midrange and range, you can calculate the maximum and minimum.

## Inter-quartile range (IQR)

(quartiles are defined in the next section)

The inter-quartile range is defined as Q3-Q1. For data sets A and B the interquartile range is 8-3 = 5 and 7-2 = 5, respectively. Note that the inter-quartile range is a single number and not the ordered pair consisting of the quartiles. For the weights of students the inter-quartile range is 175-130 = 45. [Different definitions for the quartiles will produce different inter-quartile ranges.] Since Q3 is the middle of the data above the median, and Q1 is the middle of the data below the median; Q3-Q1=(Q3-Q2)+(Q2-Q1) is twice the average distance of a datum from the median.

Note that knowing the median and the inter-quartile range does not let you calculate the minimum and maximum.

Rather than the maximum extent, one might want a measure of the average distance of data from the center

## (Variance and) standard deviation

Rather than the maximum extent, one might want a measure of the average distance of data from the center. There are several approaches to measuring the average distance from the mean. A first notion might be (1/n)*sum*(x(i) - x-bar), where there are n data points. However, it is readily verified that this quantity is always 0, hence it is not of much use. Negative distances cancelling out positive distances can be avoided by employing the absolute value: (1/n)*sum*|(x(i) - x-bar)|; this quantity is called the mean deviation (or the mean absolute deviation). It is a nice concept, but is not suitable for many mathematical manipulations, hence is not widely employed.

Another way to avoid negative summands is to square them. (1/(n-1))*sum*((x(i) - x-bar)^2) is called the variance, which is denoted by s^2. [The reason for dividing by n-1 rather than n, is that this is the estimate for the variance of a population based on a sample; if we had divided by n we would have still called it the variance, but denoted it with *sigma*^2 where *sigma* is lower case sigma.] Evaluating this expression for data set A yields ((2-5.4)^2 + (3-5.4)^2 + (5-5.4)^2 + (8-5.4)^2 + (9-5.4)^2)/4 = 9.3. This is not a good measure of the average distance from the mean, but its square root 3.05 is (taking the square root essentially undoes the previous squaring). The square root of the variance is called the standard deviation, and denoted by s. For the weights of students the variance is 881.77, and the standard deviation is 29.69.

Exercise: In what sense is the standard deviation an average distance? How many (what percentage) of the weights of students lie within one standard deviation unit of the mean? Within two standard deviation units of the mean? Within three? How many standard deviation units is the range?

## Rule of thumb

To what extent do the inter-quartile range and standard deviation specify how far data is from the center (meadian or mean)? Half the data lies between the first and third quartile. Because data distributions are in general not symmetric, Q1 is not equal to Q2-.5(Q3-Q1) (and Q3 is not equal to Q2+.5(Q3-Q1)), hence we cannot say precisely what fraction of the data will lie within one-half interqartile range of the median. But approximately half the data will lie within one-half interquartile range of the median. [It is easy to show that at least one-quarter of the data and at most three-quarters of the data lies within one-half inter-quartile range of the median.]

For the standard deviation, there is no exact result either, but many data distributions are approximately normal (for our purposes, approximately normal means symmetric and unimodal, mound shaped). Hence we can use the exact result for normal distributions as a rule of thumb for most data distributions:
Approximately .68 of the data (approximately 2/3) lies within one standard deviation unit of the mean.
Approximately .95 of the data (approximately 19/20) lies within two standard deviation units of the mean.
Approximately .9974 of the data (approximately 399/400) lies within three standard deviation units of the mean.
[Bounds on on how much data lies within one, two, or three standrd deviation units from the mean are available from Tchebychev's theorem, but we shall not discuss that in this course.]

## Skewness

The range and standard deviation measure how spread out data is, but do not give any information as to how that spread is distributed among the data. We would really like to know where all the data lies, be able to visualize a picture like a histogram. In particular, we might want to know whether the data is symmetric about the mean, or there is a longer tail on one side. A distribution is said to be skewed in the direction of the longer tail. We have seen previously that a long tail on the right side (i.e., extreme individuals on the right not balanced by extreme individuals on the left) causes the mean to be greater than the median. We shall employ the mean being greater than the median as the definition of skewed to the right (although there is a more technical defintion of skewness which does not always agree with this definition). Similarly, if the mean is less than the median a distribution is skewed to the left.. (Note that if a data set is symmetric the mean is equal to the median, but the mean being equal to the median does not require that a data set be symmetric.)

## Bimodality

Most data distributions have a single peak, which occurs near the mean and median; such a distribution is called unimodal. However, there are also distributions where there are two peaks, and the mean and median lie between them, being near neither. An example would be the weights of birds where juveniles would be much lighter than adults resulting in some data points near the average weight for juveniles, and other data points near the average weight for adults. Although the weights of juveniles would be unimodal and the weights of adults would be unimodal, the weights for the entire species would be bimodal.

Competencies: For the data set {2 5 9 4 6 7 6 8 8}, calculate the variance, standard deviation, and range.

Reflection: For the above data set, which of the above statistics best describes the spread of the data?

Challenge: Is the variance always greater than the standard deviation? When will the variance, standard deviation, and range be equal?

May 2003