Stacked versus unstacked data

Example: The individuals in this dataset are trees. There were three different species, and we measured each tree's height in feet.

This is the data in unstacked format:

Species A   Species B   Species C
12          18          9
15          13          15
10                      8
19

This is the same data in stacked format:

Species Height
A 12
A 15
B 18
C 9
A 10
C 15
B 13
A 19
C 8
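
As a rough illustration of how the two layouts relate, here is a minimal Python sketch that turns unstacked columns into stacked rows. It uses the heights from the stacked table above; the names `unstacked` and `stack` are made up for this example and do not come from any particular statistics package.

```python
# Unstacked: one list of heights (in feet) per species.
# The lists do not need to be the same length.
unstacked = {
    "A": [12, 15, 10, 19],
    "B": [18, 13],
    "C": [9, 15, 8],
}


def stack(columns):
    """Return one (species, height) row per tree."""
    return [
        (species, height)
        for species, heights in columns.items()
        for height in heights
    ]


stacked = stack(unstacked)
for species, height in stacked:
    print(species, height)
```

The rows come out grouped by species here, while the table above lists them in a mixed order. In stacked format the order of the rows does not matter, because each row carries its own species label.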


Question: Why do people use this "stacked" format? It seems unnecessarily complicated.

If all we want are the heights of the 9 trees on one vacant lot in a city, we might measure all the oaks first, then the hackberry trees, and then the pecan trees. In that case we think of the data as three sets of one-variable data that we want to compare. That's the "unstacked" way of looking at the data.

But data like these are often part of a larger dataset.

If we collected data like this on 200 trees in a forest, we'd probably just go around to the trees individually and write down what species the tree was and what the height was, and then go on to the next tree. So our original data would look more like the "stacked" format.

Statisticians almost always use stacked data because they are often interested in datasets with a large number of individuals and possibly several different variables (for example, to investigate the relationships between the variables), so it is easiest to keep track of everything if each row represents an individual.

For example, the tree dataset might also have included trunk circumference in inches. And, just to illustrate, the people who originally collected the data would probably have recorded an ID number for each tree.

Tree Species Height Trunk
ID101 A 12 8
ID102 A 15 9
ID103 A 10 5
ID104 A 19 12
ID105 B 18 8
ID106 B 13 7
ID107 C 9 5
ID108 C 15 7
ID109 C 8 5
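
To show why one row per individual is convenient, here is a small Python sketch that starts from the table above and pulls out any one variable grouped by species. The names `trees` and `unstack` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict

# One row per tree: (ID, species, height in feet, trunk circumference in inches).
trees = [
    ("ID101", "A", 12, 8),
    ("ID102", "A", 15, 9),
    ("ID103", "A", 10, 5),
    ("ID104", "A", 19, 12),
    ("ID105", "B", 18, 8),
    ("ID106", "B", 13, 7),
    ("ID107", "C", 9, 5),
    ("ID108", "C", 15, 7),
    ("ID109", "C", 8, 5),
]


def unstack(rows, column):
    """Group one measurement by species; `column` is its position in each row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row[column])
    return dict(groups)


heights = unstack(trees, 2)  # heights by species, as in the unstacked table
trunks = unstack(trees, 3)   # the same grouping for trunk circumference

# Mean height per species, computed from the grouped heights.
mean_height = {species: sum(v) / len(v) for species, v in heights.items()}
print(mean_height)
```

Because each row keeps a tree's height and trunk circumference together, the same stacked table can be regrouped for either variable, or used to compare the two variables tree by tree.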
