Types of Data; Stem-and-Leaf Plots and Histograms

John Graunt lists more than 30 causes of death in his Bills of Mortality. That is too many categories for anyone to comprehend in tables, bar charts, or pie charts. Therefore he groups them into categories in order to discuss their import. For example: "The fourth Observation is; That of the said 229250. not 4000. died of outward Griefs, as of Cancers, Fistulaes, Sores, Ulcers, broken and bruised Limbs, Impostumes, Itch, King's-evil, Leprosie, Scald-head, Swine-Pox, Wens, &c. viz. not one in 60." Other categories include epidemical and chronical diseases. From this we can get a sense of why (how) people died in the 1600s.

Similarly, if we had mileage or safety information on all the models of cars, we would be overwhelmed by the information, but if we grouped the cars into a few categories such as manufacturers (Ford, Chrysler, General Motors, etc.) or type of vehicle (subcompact, sedan, station wagon, van, etc.) we might be able to comprehend the essence of the information. Although the usefulness of grouping into categories is clear, it is often difficult to determine the appropriate way to group categories. If you grouped cars by manufacturer, you would miss safety information that depends on the type of vehicle. However, there are sometimes natural groupings dependent on the nature of the data.

Types of data

Categorical (also called qualitative, and sometimes further specified as nominal or ordinal) data is data where what is being recorded cannot be readily identified with the real numbers. Examples include colors of cars (red, green, black), size of eggs (small, medium, large), sex (male, female). One can count the number of cars of various colors, and display that information in a bar chart or pie chart, but one cannot combine the various cars and conclude that the average car is brown. We have illustrated some circumstances in which categorical data can be grouped, but there is no general rule as to how to group it.

Quantitative (also further specified as interval and ratio, the distinction between which is not of interest for our purposes) data is data where what is being recorded can be identified with the real numbers. Examples include age, I.Q., weight, height. Identification with the real numbers facilitates organizing, comprehending, and communicating this data. We can always group quantitative data by putting together numbers which are close to one another. We will later combine the data using algebraic operations to describe where it lies.

N.B.: We can count all data, whether categorical or quantitative; the terms categorical and quantitative refer to the essence of the individual items which we are counting.
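The distinction can be illustrated with a minimal Python sketch. The observations below are invented for illustration and the variable names are hypothetical; the point is only that both kinds of data can be counted, while only the quantitative data can be combined arithmetically.

```python
from collections import Counter

# Hypothetical observations, invented for illustration only.
car_colors = ["red", "green", "black", "red", "black", "red"]  # categorical
ages = [19, 22, 20, 35, 22, 21]                                # quantitative

# Both kinds of data can be counted.
color_counts = Counter(car_colors)
age_counts = Counter(ages)

# Quantitative data can be combined with algebraic operations.
average_age = sum(ages) / len(ages)

# Categorical data cannot: adding colors is not a meaningful operation,
# and Python refuses it with a TypeError.
try:
    sum(car_colors) / len(car_colors)
    colors_can_be_averaged = True
except TypeError:
    colors_can_be_averaged = False

print(color_counts)
print(average_age, colors_can_be_averaged)
```

The counts in color_counts are exactly what a bar chart or pie chart of the colors would display.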

Exercise: What characteristics of people are qualitative? quantitative? What characteristics of cars are qualitative? quantitative?

Stem-and-leaf plots

A natural way to organize (group) quantitative data is with the order property of the real numbers, i.e., arrange the data from least to greatest. For example, the 30 weights: 185, 160, 235, 165, 125, 175, 185, 132, 168, 112, 170, 155, 105, 158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170, 113, 155, 175, 145, 130 are more easily comprehended in order: 105, 110, 112, 113, 120, 125, 125, 130, 132, 135, 140, 145, 145, 155, 155, 155, 158, 160, 165, 168, 170, 170, 175, 175, 180, 185, 185, 185, 190, 235. Note that each weight has been listed as many times as it occurs. This information can be visually presented with a stem-and-leaf plot. A position is chosen at which to break each number into a stem and a leaf (here, between the tens digit and the units digit). The leaf will always be one digit. The stems are listed on the left, and the corresponding leaves (if any) on the right. Visually a stem-and-leaf plot looks like a bar chart; the categories are defined by the decimal structure of the numbers. A stem-and-leaf plot for the above data is presented below:
10 | 5
11 | 023
12 | 055
13 | 025
14 | 055
15 | 5558
16 | 058
17 | 0055
18 | 0555
19 | 0
20 |
21 |
22 |
23 | 5
In practice, stem-and-leaf plots are formed before the data has been ordered (or in order to order the data). Thus it is a three-step process: 1) choose the stems based on scanning the data, 2) add the leaves in the order encountered, 3) reorder the leaves on each stem from smallest to largest.
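The three steps can be sketched in Python as follows. This is a minimal sketch, assuming non-negative integer data broken between the tens and units digits; the function names stem_and_leaf and format_plot are hypothetical, chosen only for this illustration.

```python
# The 30 weights from the example, in the order originally recorded.
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112,
           170, 155, 105, 158, 120, 190, 140, 185, 125, 180,
           145, 110, 155, 135, 170, 113, 155, 175, 145, 130]


def stem_and_leaf(data):
    """Return {stem: [leaves]} where the leaf is the units digit."""
    # Step 1: choose the stems by scanning the data
    # (every stem from the smallest to the largest, even if empty).
    low, high = min(data) // 10, max(data) // 10
    stems = {stem: [] for stem in range(low, high + 1)}
    # Step 2: add the leaves in the order encountered.
    for value in data:
        stems[value // 10].append(value % 10)
    # Step 3: reorder the leaves on each stem from smallest to largest.
    for leaves in stems.values():
        leaves.sort()
    return stems


def format_plot(stems):
    """Render one 'stem | leaves' line per stem."""
    lines = []
    for stem, leaves in stems.items():
        lines.append(f"{stem} | {''.join(map(str, leaves))}".rstrip())
    return "\n".join(lines)


plot = stem_and_leaf(weights)
print(format_plot(plot))
```

Keeping the empty stems (20, 21, 22) is a deliberate choice: they are what makes the gap before 235 visible in the plot.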

Sometimes, to enhance the visual presentation of data, stems will be split (e.g., repeat each stem on the left, once for the digits 0-4, once for the digits 5-9). Sometimes data are truncated (rightmost digits dropped) in order to have an informative plot with single-digit leaves.
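Both variations can be sketched in Python. This is a minimal sketch under the same assumptions as before (non-negative integers, one-digit leaves); the function name split_stem_plot and its parameter drop_digits are hypothetical, and the data are the 30 weights from the example above.

```python
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112,
           170, 155, 105, 158, 120, 190, 140, 185, 125, 180,
           145, 110, 155, 135, 170, 113, 155, 175, 145, 130]


def split_stem_plot(data, drop_digits=0):
    """Return {(stem, half): [leaves]} with every stem split in two.

    half is 0 for the leaves 0-4 and 1 for the leaves 5-9.
    The drop_digits rightmost digits are truncated (dropped, not
    rounded) before the numbers are broken into stem and leaf.
    """
    truncated = sorted(value // 10 ** drop_digits for value in data)
    first, last = truncated[0] // 10, truncated[-1] // 10
    rows = {(stem, half): []
            for stem in range(first, last + 1)
            for half in (0, 1)}
    for value in truncated:
        stem, leaf = divmod(value, 10)
        rows[(stem, leaf // 5)].append(leaf)
    return rows


# Split stems only: stem 15 appears twice, with leaves 0-4 and 5-9.
split_rows = split_stem_plot(weights)

# Truncate the units digit as well: 235 becomes 23, i.e. stem 2, leaf 3.
truncated_rows = split_stem_plot(weights, drop_digits=1)

for (stem, half), leaves in truncated_rows.items():
    print(f"{stem} | {''.join(map(str, leaves))}".rstrip())
```

For these weights, truncating the units digit leaves only the stems 1 and 2, so splitting the stems is what keeps the plot informative.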

Histograms

The use of the decimal structure of the numbers sometimes constrains the ability of a stem-and-leaf plot to visually display where the data lies. Even the techniques of splitting stems or truncating (dropping) digits may not be satisfactory. Histograms allow arbitrary sizes for the categories, but the categories (classes) must be contiguous and must all be the same size. N.B.: stem-and-leaf plots are a good preliminary way to organize data prior to representing it with a histogram.

The essence of a histogram is best illustrated by the method of its construction.

  1. Choose the number of classes; this will be an aesthetic judgement based on the data. Generally you will want between 5 and 20 classes: your goal is to communicate where the data lies. The choice of the number of classes is important, although its effect is subtle.
  2. Choose the class size. Divide the range by the target number of classes above, then round off aesthetically. If you do not end up with the number of classes above, it will not matter since that number was a rough aesthetic guess.
  3. Choose the class marks (or class boundaries). Again, do this aesthetically. Since classes are all the same size, if you know the class marks (which are at the center of the classes), you know the class boundaries, and vice-versa. It is important that each datum lies in exactly one class.
  4. Draw the histogram. This requires that you count the number of data which lie in each class, and make the heights (hence areas) of the bars proportional to the number of data in each class. Do not forget to label the histogram, since it will convey no information if it is not labelled.
For the above data set, I would want about 5 classes, since with only 30 data points there would be too few data per class to accurately portray where the data lies with more classes. My range is 235-105=130; 130/5=26. Hence I want approximately 26 pounds per class. Therefore, I will choose 25 for my class size. I prefer class marks to class boundaries, hence I will choose 100, 125, 150, 175, 200, and 225 as my class marks. This gives me six classes: 87.5-112.5, 112.5-137.5, 137.5-162.5, 162.5-187.5, 187.5-212.5, 212.5-237.5. Note that by choosing an odd class size and integer class marks my class boundaries are fractional, hence no datum lies on a class boundary. (General usage, when data do lie on the class boundaries, is to include the left endpoint in a class and exclude the right endpoint.)


10_|                           _______
   |                          |       |
   |                   _______|       |
   |           _______|       |       |
   |          |       |       |       |
 5_|          |       |       |       |
   |          |       |       |       |
   |   _______|       |       |       |
   |  |       |       |       |       |
   |  |       |       |       |       |_______ _______
 __|__|_______|_______|_______|_______|_______|_______|____
          |       |       |       |       |       |
         100     125     150     175     200     225

                weights of students in pounds



     Weights of Students in Statistics Course
Note that all the original data can be recovered from a stem-and-leaf plot, but you know only the approximate values of the data when they are presented in a histogram.
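The class counts for the histogram of the 30 weights can be checked with a short Python sketch. It assumes the class size of 25 and the class marks 100 through 225 chosen above, and follows the general usage of including the left endpoint and excluding the right endpoint of each class; the function name class_counts is hypothetical.

```python
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112,
           170, 155, 105, 158, 120, 190, 140, 185, 125, 180,
           145, 110, 155, 135, 170, 113, 155, 175, 145, 130]

# Steps 1 and 2: about 5 classes, so divide the range by 5 and round.
data_range = max(weights) - min(weights)  # 235 - 105 = 130
rough_size = data_range / 5               # 26.0, rounded aesthetically to 25
class_size = 25

# Step 3: class marks at the centers of the classes.
class_marks = [100, 125, 150, 175, 200, 225]


def class_counts(data, marks, size):
    """Count the data in each class [mark - size/2, mark + size/2)."""
    counts = []
    for mark in marks:
        low, high = mark - size / 2, mark + size / 2
        counts.append(sum(1 for value in data if low <= value < high))
    return counts


# Step 4: the bar heights are proportional to these counts.
counts = class_counts(weights, class_marks, class_size)

# Each datum must lie in exactly one class.
assert sum(counts) == len(weights)

for mark, count in zip(class_marks, counts):
    print(f"{mark:>3} | {'#' * count}")
# counts is [3, 7, 8, 10, 1, 1]
```

Only the six counts survive in the histogram; the individual weights cannot be recovered from them, which is the loss of information noted above.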

Competencies: Give examples of categorical (qualitative) data and quantitative data.
Present the following weights: {132, 180, 200, 150, 165, 144, 194, 125, 160, 130, 140, 140, 160, 170, 150, 155, 135, 165, 120, 185, 141, 210, 105, 115, 125, 162, 215, 235, 170, 200, 125, 125, 225, 170, 140, 135, 185, 230, 269, 130, 220, 198, 285, 140, 173, 180, 210, 148, 115, 205, 130} as a stem-and-leaf plot.
Present the above weights as a histogram.

Reflection: When would you violate the above rules for making a histogram, and how would you do it?

Challenge:

May 2003


campbell@math.uni.edu