# Organizing data

## What is statistics?

The Concise Oxford Dictionary defines statistics as 'Numerical facts systematically collected' and statistic as 'Statistical fact or item'. Data (singular: datum) is facts or information. Statistics entails all aspects of information: collecting, organizing, comprehending, communicating, and interpreting. This course will begin with descriptive statistics (organizing, comprehending, and communicating data), then spend a few weeks on probability, which lays a foundation for inferential statistics (interpreting, extrapolating). The manner of collection of data (experimental design) is important and nontrivial, but will not be a focus of this course.

## Types of data

Categorical (also called qualitative, and sometimes further specified as nominal or ordinal) data is data where what is being recorded cannot be readily identified with the real numbers. Examples include colors of cars (red, green, black), sizes of eggs (small, medium, large), and sex (male, female). One can count the number of cars of various colors, and display that information in a bar chart or pie chart, but one cannot combine the various cars and conclude that the average car is brown. We shall not devote much attention to categorical data because we cannot manipulate it algebraically, but we shall return to it in the context of the binomial and multinomial distributions.

Quantitative (also further specified as interval and ratio, the distinction between which is not of interest for our purposes) data is data where what is being recorded can be identified with the real numbers. Examples include age, I.Q., weight, height. Identification with the real numbers facilitates organizing, comprehending, and communicating this data. In particular we can combine it using algebraic operations.

N.B.: We can count all data, whether categorical or quantitative; the terms categorical and quantitative refer to the nature of the individual items which we are counting. The statement 'there are 3 red-haired men' entails categorical data (red-haired), while the statement 'there are 3 men who are 74 inches tall' entails quantitative data (74).
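
The distinction can be made concrete with a small sketch. The two lists below are hypothetical samples invented for illustration (they are not course data); the point is only that counting works for both kinds of data, while averaging makes sense only for the quantitative list.

```python
from collections import Counter
from statistics import mean

# Hypothetical samples, invented for illustration only.
car_colors = ["red", "green", "black", "red", "black", "red"]  # categorical
heights_in = [74, 70, 68, 74, 74]                              # quantitative (inches)

# Counting works for both kinds of data.
color_counts = Counter(car_colors)
height_counts = Counter(heights_in)

# Algebraic operations such as averaging only make sense for quantitative data.
mean_height = mean(heights_in)

print(color_counts["red"], height_counts[74], mean_height)
```

Calling `mean(car_colors)` would raise an error, which mirrors the remark above that there is no such thing as the average car color.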

## Stem-and-leaf plots

A natural way to organize quantitative data is with the order property of the real numbers, i.e., arrange the data from least to greatest. For example, the 30 weights: 185, 160, 235, 165, 125, 175, 185, 132, 168, 112, 170, 155, 105, 158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170, 113, 155, 175, 145, 130 are more easily comprehended in order: 105, 110, 112, 113, 120, 125, 125, 130, 132, 135, 140, 145, 145, 155, 155, 155, 158, 160, 165, 168, 170, 170, 175, 175, 180, 185, 185, 185, 190, 235. Note that each weight has been listed as many times as it occurs.

This information can be visually presented with a stem-and-leaf plot. A position (e.g., between the tens and units places) is chosen to break each number into a stem and a leaf. The leaf will always be one digit. The stems are listed on the left, and the corresponding leaves (if any) on the right. Visually, a stem-and-leaf plot looks like a bar chart; the categories are defined by the decimal structure of the numbers. A stem-and-leaf plot for the above data is presented below:
```
10 | 5
11 | 023
12 | 055
13 | 025
14 | 055
15 | 5558
16 | 058
17 | 0055
18 | 0555
19 | 0
20 |
21 |
22 |
23 | 5
```
Sometimes, to enhance the visual presentation of the data, stems are split (e.g., each stem is repeated on the left, once for the digits 0-4 and once for the digits 5-9). Sometimes data are truncated (rightmost digits dropped) in order to have an informative plot with single-digit leaves.
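
As a rough illustration, the basic construction (no split stems, no truncation) can be sketched in Python. The function name `stem_and_leaf` is my own, and the sketch assumes non-negative whole numbers broken between the tens and units places.

```python
from collections import defaultdict

# The 30 weights from the example above.
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112,
           170, 155, 105, 158, 120, 190, 140, 185, 125, 180,
           145, 110, 155, 135, 170, 113, 155, 175, 145, 130]

def stem_and_leaf(data):
    """Return the rows of a stem-and-leaf plot, breaking between tens and units.

    Assumes non-negative whole numbers (an assumption of this sketch).
    """
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # e.g. 158 -> stem 15, leaf 8
        leaves[stem].append(leaf)
    rows = []
    # List every stem from smallest to largest, including stems with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {digits}".rstrip())
    return rows

print("\n".join(stem_and_leaf(weights)))
```

Sorting first guarantees that the leaves on each stem appear in increasing order, and looping over the full range of stems keeps the empty stems 20 through 22 visible, so the gap before 235 is not hidden.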

## Histograms

The use of the decimal structure of the numbers sometimes constrains the ability of a stem-and-leaf plot to visually display where the data lies. Even the techniques of splitting stems or truncating (dropping) digits may not be satisfactory. Histograms allow arbitrary sizes for the categories, but the categories (classes) must be contiguous and must all be the same size. N.B.: stem-and-leaf plots are a good preliminary way to organize data prior to representing it with a histogram.

The essence of a histogram is best illustrated by the method of its construction.

1. Choose the number of classes; this will be an aesthetic judgement based on the data. Generally you will want between 5 and 20 classes: your goal is to communicate where the data lies. The choice matters, although its effect is subtle: too few classes hide the shape of the data, while too many leave too few data in each class.
2. Choose the class size. Divide the range by the target number of classes above, then round off aesthetically. If you do not end up with the number of classes above, it will not matter since that number was a rough aesthetic guess.
3. Choose the class marks (centers of the classes) or class boundaries (endpoints of the classes). Again, do this aesthetically. Since classes are all the same size, if you know the class marks (which are at the center of the classes), you know the class boundaries, and vice-versa. It is important that each datum lies in exactly one class.
4. Draw the histogram. This requires that you count the number of data which lie in each class, and make the heights (hence areas) of the bars proportional to the number of data in each class. Do not forget to label the histogram, since it will convey no information if it is not labelled.

For the above data set, I would want about 5 classes: with only 30 data points, more classes would leave too few data per class to portray accurately where the data lies. My range is 235 - 105 = 130, and 130/5 = 26, so I want approximately 26 pounds per class. Therefore, I will choose 25 for my class size. I prefer class marks to class boundaries, hence I will choose 100, 125, 150, 175, 200, and 225 as my class marks. This gives me six classes: 87.5-112.5, 112.5-137.5, 137.5-162.5, 162.5-187.5, 187.5-212.5, 212.5-237.5. Note that by choosing an odd class size and integer class marks, my class boundaries are fractional, so no datum (each being a whole number) lies on a class boundary. If class boundaries such as 100-120, 120-140, etc. are chosen, left endpoint inclusion is used, so that a 120 pound individual would be counted in the 120-140 class, not the 100-120 class.

```
10_|                           _______
   |                          |       |
   |                   _______|       |
   |           _______|       |       |
   |          |       |       |       |
 5_|          |       |       |       |
   |          |       |       |       |
   |   _______|       |       |       |
   |  |       |       |       |       |
   |  |       |       |       |       |_______ _______
 __|__|_______|_______|_______|_______|_______|_______|____
          |       |       |       |       |       |
         100     125     150     175     200     225

                weights of students in pounds

          Weights of Students in Statistics Course
```
Note that all the original data can be recovered from a stem-and-leaf plot, but you only know the approximate values of the data when they are presented in a histogram.
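
The arithmetic and the counting in the worked example can be checked with a short sketch. The helper `class_counts` is a name of my own; it assumes equal-sized classes centered on the class marks and uses left endpoint inclusion, as described above.

```python
import math

# The 30 weights from the stem-and-leaf example.
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112,
           170, 155, 105, 158, 120, 190, 140, 185, 125, 180,
           145, 110, 155, 135, 170, 113, 155, 175, 145, 130]

class_size = 25                               # rounded from 130 / 5 = 26
class_marks = [100, 125, 150, 175, 200, 225]  # centers of the classes

def class_counts(data, marks, size):
    """Count the data in each class, keyed by class mark.

    Classes are contiguous and equal-sized; a datum on a boundary goes to
    the class on its right (left endpoint inclusion).
    """
    counts = {mark: 0 for mark in marks}
    lowest_boundary = marks[0] - size / 2
    for value in data:
        index = math.floor((value - lowest_boundary) / size)
        if not 0 <= index < len(marks):
            raise ValueError(f"{value} lies outside every class")
        counts[marks[index]] += 1
    return counts

data_range = max(weights) - min(weights)
counts = class_counts(weights, class_marks, class_size)

print(f"range = {data_range}, range / 5 = {data_range / 5}")
for mark, count in counts.items():
    low, high = mark - class_size / 2, mark + class_size / 2
    print(f"{low:5.1f}-{high:5.1f} | {'*' * count} ({count})")
```

For this data set the six counts are 3, 7, 8, 10, 1, and 1, which sum to 30 and match the bar heights in the histogram above.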

Competencies:

1. Give examples of categorical (qualitative) data and quantitative data.
2. Present the following weights: {132, 180, 200, 150, 165, 144, 194, 125, 160, 130, 140, 140, 160, 170, 150, 155, 135, 165, 120, 185, 141, 210, 105, 115, 125, 162, 215, 235, 170, 200, 125, 125, 225, 170, 140, 135, 185, 230, 269, 130, 220, 198, 285, 140, 173, 180, 210, 148, 115, 205, 130} as a stem-and-leaf plot.
3. Present the above weights as a histogram.

Reflection: When would you violate the above rules for making a histogram, and how would you do it?

Challenge:

July 2007