Interrelations of summary statistics

The median is the second quartile, so it is natural to use the median and the interquartile range together. We have also seen the three quartiles used with the maximum and minimum in the five-number summary. The mean is used in the definition of the standard deviation, so the mean and standard deviation are often used together. Both appear in the rule of thumb (empirical rule) for roughly bell-shaped data: about 68% (roughly 2/3) of the data lies within one standard deviation of the mean, about 95% lies within two standard deviations, and about 99.7% lies within three. The midrange may be used with the range, in which case the maximum and minimum can be calculated. There are some basic properties of these statistics one should know.

Obvious relations
Skewness
Shape
Relative position
Subtle relations
Location versus spread

Obvious relations

minimum <= Q1 <= Q2 <= Q3 <= maximum
Note that all the inequalities are weak; every one of them becomes an equality if all the data have the same value.
minimum <= midrange <= maximum
minimum <= mean <= maximum
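
The inequalities above can be checked numerically. The following is a minimal sketch, not a definitive implementation: the helper name five_number_summary is made up for illustration, and the "inclusive" quartile method of Python's statistics module is only one of several quartile conventions found in textbooks, so Q1 and Q3 may differ slightly from hand calculations.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) for a list with at least two numbers."""
    # Assumption: the "inclusive" method is used; other quartile conventions exist.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

data = [1, 3, 4, 5, 9]
lo, q1, q2, q3, hi = five_number_summary(data)
midrange = (lo + hi) / 2
mean = statistics.mean(data)

# The three chains of weak inequalities from the text.
assert lo <= q1 <= q2 <= q3 <= hi
assert lo <= midrange <= hi
assert lo <= mean <= hi

# When every value is the same, all the inequalities become equalities.
constant = [7, 7, 7, 7]
assert len(set(five_number_summary(constant))) == 1
```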

Skewness

A data set is symmetric if it is a mirror image about its middle. An example of a symmetric data set is {1, 2, 4, 5, 6, 8, 9}. If a data set is symmetric, its mean, median, and midrange are all equal. If there are more extreme individuals on one side of the middle than the other, the data set is called skewed in that direction. For example, the data set {1, 3, 4, 5, 9} is skewed to the right. The midrange depends only on the maximum and minimum, the mean is calculated using all the data, and the median is not affected by the values of the maximum and minimum. Consequently, if a data set is skewed to the right, the midrange will generally be larger than the mean, which will generally be larger than the median. (In fact, many introductory texts use the mean being greater than the median as the definition of skewed to the right. Although this is not always consistent with the standard definition of skewness, you may use it as the definition of skewed to the right.) For the data set {1, 3, 4, 5, 9} the midrange is 5, the mean is 4.4, and the median is 4. If a data set is skewed to the left, the inequalities will be reversed.
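
The two example data sets can be run through a short script to confirm the values quoted above. This is only a sketch; location_summary is a hypothetical helper name, not something from the course materials.

```python
import math
import statistics

def location_summary(data):
    """Return (midrange, mean, median) of a non-empty list of numbers."""
    midrange = (min(data) + max(data)) / 2
    return midrange, statistics.mean(data), statistics.median(data)

symmetric = [1, 2, 4, 5, 6, 8, 9]
right_skewed = [1, 3, 4, 5, 9]

# Symmetric data: midrange, mean, and median coincide.
mid_s, mean_s, med_s = location_summary(symmetric)
assert mid_s == mean_s == med_s == 5

# Right-skewed data: midrange > mean > median.
mid_r, mean_r, med_r = location_summary(right_skewed)
assert mid_r == 5 and math.isclose(mean_r, 4.4) and med_r == 4
assert mid_r > mean_r > med_r
```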

Skewness is manifested in stem-and-leaf plots, histograms, and box-and-whisker plots. In histograms, the data slowly trail off in the direction of skewness and end more abruptly in the other direction; this produces a tail in the direction of skewness. Stem-and-leaf plots are essentially histograms, so a similar tail can be seen. In box-and-whisker plots, the whisker is longer in the direction of skewness. Most non-negative data is skewed to the right, because its tail cannot extend into negative values; the weights of students are an example. Members of a crew provide an example of a distribution skewed to the left.

Shape

It is often of interest whether the data distribution trails off in both directions from a single high point, in which case it is called unimodal, or has two high points with a valley in between, in which case it is called bimodal. One text required that both high points be the same height for bimodality, but that is not generally required. Multimodality, with more than two high points, can also occur. Bimodality often occurs when two distinct populations are combined, such as the heights of men and women.

Relative position

When using percentiles or z-scores, one should remember that "average" is the median (50th percentile) when percentiles are used, but "average" is the mean when z-scores are used.
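
A small sketch can make the two notions of "average" concrete: the mean has a z-score of 0, while the median sits at the 50th percentile. The function names here are hypothetical, and the percentile rank below uses one common convention (values equal to x count as half below); other conventions give slightly different numbers.

```python
import statistics

def z_score(x, data):
    """Number of sample standard deviations x lies above the mean of data."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

def percentile_rank(x, data):
    """Percent of values below x, counting values equal to x as half (an assumed convention)."""
    below = sum(1 for v in data if v < x)
    equal = sum(1 for v in data if v == x)
    return 100 * (below + 0.5 * equal) / len(data)

data = [1, 3, 4, 5, 9]
mean = statistics.mean(data)
median = statistics.median(data)

assert z_score(mean, data) == 0            # "average" in z-score terms
assert percentile_rank(median, data) == 50  # "average" in percentile terms
assert z_score(median, data) < 0           # here the median lies below the mean
```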

Subtle relations

The maximum is equal to the midrange plus half the range.
The minimum is equal to the midrange minus half the range.
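
These two relations follow from midrange = (maximum + minimum)/2 and range = maximum - minimum. As an illustration, the sketch below recovers the extremes from the midrange and range; the function name is made up for this example.

```python
def min_max_from_midrange_and_range(midrange, data_range):
    """Recover (minimum, maximum) from the midrange and the range."""
    half = data_range / 2
    return midrange - half, midrange + half

data = [1, 3, 4, 5, 9]
midrange = (min(data) + max(data)) / 2
data_range = max(data) - min(data)

# The recovered extremes match the actual minimum and maximum.
recovered = min_max_from_midrange_and_range(midrange, data_range)
assert recovered == (min(data), max(data))
```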

Location versus spread

It is important to recognize the different roles of measures of location (mean, median, minimum, etc.) and measures of spread (range, standard deviation, etc.). If a constant is added to all the data values, it changes the location, but not the spread. If all the data values are multiplied by a positive constant, both the location and the spread are multiplied by that constant. (For a negative constant, the spread is multiplied by the absolute value of the constant, and the maximum and minimum trade places.) Caveat: the variance is multiplied by the square of that constant because the variance measures the square of the distance from the mean.
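
The effect of shifting and scaling can be verified directly. The sketch below assumes the sample standard deviation and sample variance from Python's statistics module; the shift of 10 and the factor of 3 are arbitrary values chosen for illustration.

```python
import math
import statistics

data = [1, 3, 4, 5, 9]
shifted = [x + 10 for x in data]  # add a constant to every value
scaled = [x * 3 for x in data]    # multiply every value by a positive constant

# Adding a constant moves the location but leaves the spread alone.
assert math.isclose(statistics.mean(shifted), statistics.mean(data) + 10)
assert statistics.median(shifted) == statistics.median(data) + 10
assert math.isclose(statistics.stdev(shifted), statistics.stdev(data))
assert max(shifted) - min(shifted) == max(data) - min(data)

# Multiplying by a constant scales both location and spread by that constant.
assert math.isclose(statistics.mean(scaled), 3 * statistics.mean(data))
assert math.isclose(statistics.stdev(scaled), 3 * statistics.stdev(data))
assert max(scaled) - min(scaled) == 3 * (max(data) - min(data))

# Caveat: the variance scales by the square of the constant.
assert math.isclose(statistics.variance(scaled), 9 * statistics.variance(data))
```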

Applets: An interactive histogram illustrates how the summary statistics are related to a histogram. Histogram explorer provides another way to shape a histogram and look at the summary statistics.

Competencies: Is the data set {2, 5, 9, 4, 6, 7, 6, 8, 8} symmetric, skewed to the right, or skewed to the left?

