Types of Data; Stem-and-Leaf Plots and Histograms
John Graunt lists more than 30 causes of death in his Bills of Mortality. That
is too many categories for anyone to comprehend in either tables, bar charts,
or pie charts. Therefore he groups them into categories in order to discuss
their import. For example: "The fourth Observation is; That of the said
229250. not 4000. died of outward Griefs, as of Cancers, Fistulaes, Sores,
Ulcers, broken and bruised Limbs, Impostumes, Itch, King's-evil, Leprosie,
Scald-head, Swine-Pox, Wens, &c. viz. not one in 60." Other categories include
epidemical and chronical diseases. From this we can get a sense of why (how)
people died in the 1600s.
Similarly, if we had mileage or safety information on all the models of cars, we
would be overwhelmed by the information, but if we grouped the cars into a few
categories such as manufacturers (Ford, Chrysler, General Motors, etc.) or type
of vehicle (subcompact, sedan, station wagon, van, etc.) we might be able to
comprehend the essence of the information. Although the usefulness of grouping
into categories is clear, it is often difficult to determine the appropriate
way to group categories. If you grouped cars by manufacturer, you would miss
safety information which depended on the type of vehicle. However, there are
sometimes natural groupings dependent on the nature of the data.
Categorical (also called qualitative, and sometimes further specified as
nominal or ordinal) data is data where what is being recorded cannot be
readily identified with the real numbers. Examples include colors of cars
(red, green, black), size of eggs (small, medium, large), and sex (male,
female). One can count the number of cars of various colors, and display that
information in a bar chart or pie chart, but one cannot combine the various
cars and conclude that the average car is brown. We have illustrated some
circumstances in which categorical data can be grouped, but there is no
general rule as to how to group it.
Quantitative (also further specified as interval and ratio, the
distinction between which is not of interest for our purposes) data is data
where what is being recorded can be identified with the real numbers.
Examples include age, I.Q., weight, and height. Identification with the real
numbers facilitates organizing, comprehending, and communicating this data.
We can always group quantitative data by grouping numbers which are close
together. We will later combine it using algebraic operations to describe
where the data lies.
N.B.: We can count all data, whether categorical or quantitative; the terms
categorical and quantitative refer to the essence of the individual items
which we are counting.
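The distinction can be illustrated with a minimal Python sketch. The sample values below are hypothetical, invented only for illustration: counting works for both kinds of data, but an average is meaningful only for the quantitative values.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample values, invented for illustration only.
car_colors = ["red", "green", "black", "red", "black", "red"]  # categorical
ages = [19, 22, 20, 19, 35, 22]                                # quantitative

# Counting works for both kinds of data.
color_counts = Counter(car_colors)
age_counts = Counter(ages)

# Arithmetic is meaningful only for the quantitative values;
# there is no sensible "average" of the color names.
average_age = mean(ages)

print(color_counts)
print(age_counts)
print(average_age)
```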
Exercise: What characteristics of people are qualitative? quantitative? What
characteristics of cars are qualitative? quantitative?
A natural way to organize (group) quantitative data is with the order property
of the real numbers, i.e., arrange the data from least to greatest. For
example, the 30 weights: 185, 160, 235, 165, 125, 175, 185, 132, 168, 112,
170, 155, 105, 158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170,
113, 155, 175, 145, 130 are more easily comprehended in order: 105, 110, 112,
113, 120, 125, 125, 130, 132, 135, 140, 145, 145, 155, 155, 155, 158, 160,
165, 168, 170, 170, 175, 175, 180, 185, 185, 185, 190, 235. Note that each
weight has been listed as many times as it occurs. This information can be
visually presented with a stem-and-leaf plot. A position is chosen to break
the numbers into a stem and a leaf (here, between the tens and units digits).
The leaf will always be one digit. The stems are listed on the left, and the
corresponding leaves (if any) on the right. Visually a stem-and-leaf plot
looks like a bar chart; the categories are defined by the decimal structure
of the numbers. A stem-and-leaf plot for the above data is presented below:
10 | 5
11 | 023
12 | 055
13 | 025
14 | 055
15 | 5558
16 | 058
17 | 0055
18 | 0555
19 | 0
20 |
21 |
22 |
23 | 5
In practice, stem-and-leaf plots are formed before the data has been ordered (or
in order to order the data). Thus it is a three-step process: 1) choose the
stems based on scanning the data, 2) add the leaves in the order encountered,
3) reorder the leaves on each stem from smallest to largest.
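The three-step process can be sketched in Python as follows. This is only an illustration, assuming non-negative whole-number data with the break between the tens and units digits; the function name stem_and_leaf is my own.

```python
# The 30 weights from the text, in the order originally given.
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112, 170, 155, 105,
           158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170, 113,
           155, 175, 145, 130]

def stem_and_leaf(data):
    """Return (stem, leaves) pairs; leaves is a string of sorted digits."""
    # Step 1: choose the stems by scanning the data. Every stem from the
    # smallest to the largest is kept, so empty stems still appear.
    stems = range(min(data) // 10, max(data) // 10 + 1)
    leaves = {stem: [] for stem in stems}
    # Step 2: add the leaves in the order encountered.
    for value in data:
        leaves[value // 10].append(value % 10)
    # Step 3: reorder the leaves on each stem from smallest to largest.
    return [(stem, "".join(str(d) for d in sorted(leaves[stem])))
            for stem in stems]

plot = stem_and_leaf(weights)
for stem, leaf_digits in plot:
    print(f"{stem} | {leaf_digits}")
```

Run on the 30 weights, this should print the same rows as the plot above, including the empty stems 20 through 22.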
Sometimes to enhance visual presentation of data, stems will be split (e.g.,
repeat each stem on the left, once for the digits 0-4, once for the digits
5-9).
Sometimes data are truncated (rightmost digits dropped) in order to have
an informative plot with single-digit leaves.
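Both refinements can be sketched in Python. This is a rough illustration under the same assumptions as before (non-negative whole numbers, break between tens and units); the name split_stem_and_leaf and the four-digit values used for truncation are hypothetical.

```python
# The 30 weights from the text.
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112, 170, 155, 105,
           158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170, 113,
           155, 175, 145, 130]

def split_stem_and_leaf(data):
    """Return (stem, half, leaves) rows; half 0 holds leaves 0-4, half 1 holds 5-9."""
    rows = {}
    for stem in range(min(data) // 10, max(data) // 10 + 1):
        rows[(stem, 0)] = []  # row for the digits 0-4
        rows[(stem, 1)] = []  # row for the digits 5-9
    for value in data:
        stem, leaf = divmod(value, 10)
        rows[(stem, leaf // 5)].append(leaf)
    return [(stem, half, "".join(str(d) for d in sorted(leaves)))
            for (stem, half), leaves in rows.items()]

split_plot = split_stem_and_leaf(weights)
for stem, half, leaf_digits in split_plot:
    print(f"{stem} | {leaf_digits}")

# Truncation: drop the rightmost digit so that the leaves stay one digit long.
# These four-digit values are hypothetical, not taken from the data above.
truncated = [value // 10 for value in [1853, 1607, 2359]]
print(truncated)
```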
The use of the decimal structure of the numbers sometimes constrains the
ability of a stem-and-leaf plot to visually display where the data lies. Even
the techniques of splitting stems or truncating (dropping) digits may not be
satisfactory. Histograms allow arbitrary sizes for the categories, but the
categories (classes) must be contiguous and all the same size.
N.B.: stem-and-leaf plots are a good preliminary way to organize data
prior to representing it with a histogram.
The essence of a histogram is best illustrated by the method of its
construction.
- Choose the number of classes; this will be an aesthetic judgement based
on the data. Generally you will want between 5 and 20 classes: your goal is
to communicate where the data lies. The number of classes is important,
although subtle.
- Choose the class size. Divide the range by the target number of classes
above, then round off aesthetically. If you do not end up with the number of
classes above, it will not matter since that number was a rough aesthetic
guess.
- Choose the class marks (or class boundaries). Again, do this
aesthetically. Since classes are all the same size, if you know the class
marks (which are at the center of the classes), you know the class
boundaries, and vice-versa. It is important that each datum lies in exactly
one class.
- Draw the histogram. This requires that you count the number of data which
lie in each class, and make the heights (hence areas) of the bars proportional
to the number of data in each class. Do not forget to label the histogram,
since it will convey no information if it is not labelled.
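The bookkeeping behind the third and fourth steps can be sketched in Python. This is an illustration only: the function names and the small data set are hypothetical, and the sketch assumes the usual convention that a class includes its left endpoint and excludes its right endpoint.

```python
def class_boundaries(class_marks, class_size):
    """Return (left, right) boundaries for equal-size classes centered on the marks."""
    half = class_size / 2
    return [(mark - half, mark + half) for mark in class_marks]

def class_counts(data, boundaries):
    """Count the data in each class (left endpoint included, right excluded)."""
    counts = [0] * len(boundaries)
    for value in data:
        matches = [i for i, (left, right) in enumerate(boundaries)
                   if left <= value < right]
        # Each datum must lie in exactly one class.
        if len(matches) != 1:
            raise ValueError(f"{value} does not lie in exactly one class")
        counts[matches[0]] += 1
    return counts

# Hypothetical data and class choices, for illustration only.
sample = [1, 2, 2, 3, 5, 8]
boundaries = class_boundaries([2, 5, 8], 3)
counts = class_counts(sample, boundaries)
print(boundaries)
print(counts)
```

The check inside class_counts raises an error if the chosen classes fail to cover some datum, which is one way of catching a poor choice of class marks before drawing anything.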
For the above data set, I would want about 5 classes, since with only 30 data
points there would be too few data per class to accurately portray where the
data lies with more classes. My range is 235-105=130; 130/5=26. Hence I want
approximately 26 pounds per class. Therefore, I will choose 25 for my class
size. I prefer class marks to class boundaries, hence I will choose 100, 125,
150, 175, 200, and 225 as my class marks. This gives me six classes:
87.5-112.5, 112.5-137.5, 137.5-162.5, 162.5-187.5, 187.5-212.5, 212.5-237.5.
Note that by choosing an odd class size and integer class marks my class
boundaries are fractional, hence no datum lies on a class boundary. (General
usage is to include the left endpoint and exclude the right endpoint of a
class when data lie on the class boundaries.)
10_|                           _______
   |                          |       |
   |                   _______|       |
   |           _______|       |       |
   |          |       |       |       |
 5_|          |       |       |       |
   |          |       |       |       |
   |   _______|       |       |       |
   |  |       |       |       |       |
   |  |       |       |       |       |_______ _______
___|__|_______|_______|_______|_______|_______|_______|____
          |       |       |       |       |       |
         100     125     150     175     200     225
                weights of students in pounds
          Weights of Students in Statistics Course
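The arithmetic of this worked example can be checked with a short Python sketch. The class size of 25 and the class marks are the choices made in the text; the variable names and the text bars are my own, and the tally assumes the left endpoint of each class is included.

```python
from collections import Counter

# The 30 weights from the text.
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112, 170, 155, 105,
           158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170, 113,
           155, 175, 145, 130]

target_classes = 5
data_range = max(weights) - min(weights)           # 235 - 105
rough_size = data_range / target_classes           # pounds per class, before rounding
class_size = 25                                    # rounded by eye, as in the text
class_marks = [100, 125, 150, 175, 200, 225]
lowest_boundary = class_marks[0] - class_size / 2  # left edge of the first class

# Class index of each weight: how many whole class widths it lies above
# the lowest boundary.
tally = Counter(int((w - lowest_boundary) // class_size) for w in weights)
counts = [tally[i] for i in range(len(class_marks))]

for mark, count in zip(class_marks, counts):
    print(f"{mark:>3} | {'#' * count} ({count})")
```

The resulting counts should match the heights of the bars in the histogram above.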
Note that all the original data can be recovered from a stem-and-leaf plot
but you only know the approximate value of the data when it is presented in a
histogram.
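The first half of this remark can be made concrete with a Python sketch that reads the stem-and-leaf plot for the 30 weights back into numbers. The function name recover is hypothetical, and the sketch assumes one-digit leaves with the stem counting tens.

```python
# The stem-and-leaf plot for the 30 weights, as plain text.
plot_text = """\
10 | 5
11 | 023
12 | 055
13 | 025
14 | 055
15 | 5558
16 | 058
17 | 0055
18 | 0555
19 | 0
20 |
21 |
22 |
23 | 5"""

def recover(text):
    """Rebuild the data values from the rows of a stem-and-leaf plot."""
    values = []
    for line in text.splitlines():
        stem, _, leaves = line.partition("|")
        for digit in leaves.strip():
            values.append(int(stem) * 10 + int(digit))
    return values

recovered = recover(plot_text)
print(recovered)
```

No such function can be written for the histogram, since a bar records only how many data fall in its class.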
Competencies: Give examples of categorical (qualitative) data and
quantitative data.
Present the following weights: {132, 180, 200, 150, 165, 144, 194, 125, 160,
130, 140, 140, 160, 170, 150, 155, 135, 165, 120, 185, 141, 210, 105, 115,
125, 162, 215, 235, 170, 200, 125, 125, 225, 170, 140, 135, 185, 230, 269,
130, 220, 198, 285, 140, 173, 180, 210, 148, 115, 205, 130} as a stem-and-leaf
plot.
Present the above weights as a histogram.
Reflection: When would you violate the above rules for making a
histogram, and how would you do it?
Challenge:
May 2003
campbell@math.uni.edu