Charted Territory
Data visualization’s most recognized form is the chart. You’ve seen charts all over the place, from PowerPoint presentations and stock tickers to public polling results and election predictions. Even the humble and oft-misused pie chart, though derided by visualization critics for its perceptual shortcomings, is still useful for comic effect:
As a society we’ve practically dismissed many of the popular forms of charting as useless because most of the charts that we see are just ugly at best or, at worst, fail to communicate any actionable information. But while charts are often deemed failures unless they illustrate dramatic changes or unseen trends, their increasing abundance in popular media has also led to an increase in literacy that makes our job of communicating visual information a lot easier than it has been historically. (We don’t have to explain to our audience what a time series is anymore!) Note, though, that it’s important to keep common assumptions in mind when you’re creating graphs. For instance, since most people expect time to be represented left to right on the x-axis, presenting it vertically or from right to left may confuse your audience no matter how clearly your axes are marked.
Before we get into the perceptual (and even cultural) qualities of various charting forms, though, let’s step back and wrap our heads around what it is, exactly, that makes a chart a chart.
Anatomy of a Chart
With respect to Bertin’s variables, charts deal primarily with the position and size of visual elements. For our purposes, charts have at least one axis (timelines are an example of a chart with only one real axis) along which elements are placed to distinguish varying values from one another. I’m also intentionally excluding the genre of “big infographics” that lack any perceptual component whatsoever, because that’s what essentially distinguishes a “chart” from a “diagram”.
In most charts, Cartesian coordinates describe the position of an element relative to one or more linear axes, commonly called x on the horizontal and y on the vertical, and written as (x, y). In computer screen coordinate systems (specifically, on web pages and in most visual programming environments) the upper left-hand corner serves as the origin, or (0, 0). As x values increase, an element moves toward the right edge of the screen, and as y values increase, it moves toward the bottom. On paper we may choose to think of the origin as the lower left-hand corner, and position positive y values above it.
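To make the difference between the two origins concrete, here is a minimal sketch of converting a point from paper-style coordinates (origin at the lower left, y pointing up) to screen coordinates (origin at the upper left, y pointing down). The function name `to_screen` and the 100-pixel chart height are assumptions made for illustration.

```python
def to_screen(x, y, chart_height):
    """Convert paper-style coordinates (y grows upward from the bottom edge)
    to screen coordinates (y grows downward from the top edge)."""
    # x is unchanged; y is measured from the opposite edge, so we flip it.
    return (x, chart_height - y)


# A point 30 units above the bottom of a 100-pixel-tall chart
# ends up 70 pixels below the top of the drawing area.
point = to_screen(10, 30, 100)
print(point)
```

Only the y value changes, which is why drawing code that forgets this flip tends to produce upside-down charts.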
It’s important to note that axes can be made for both quantitative (numeric, or continuous) and qualitative (categorical, or discrete) variables. The humble bar chart’s quantitative axis (in this case, y) determines the height of each bar, and the other (x) evenly spaces out each bar so that its height can be easily compared to the others:
Often, as is the case in the above graph, the elements are sorted on the discrete axis according to their value on the other so that you can easily see the distribution of values in the set. The histogram, a cousin to the bar chart in some respects, replaces the qualitative axis with a quantitative one. The time series plots continuous values of a quantitative variable over time, usually on the horizontal axis. For some other examples, check out Nathan Yau’s guide to visualizing changes over time.
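As a rough sketch of what replacing the qualitative axis with a quantitative one means in practice, the histogram groups values into equal-width bins along that axis and counts how many fall into each. The function name `histogram` and the sample values below are placeholders, not data from this post, and the sketch assumes the values are not all identical.

```python
def histogram(values, bin_count):
    """Count how many values fall into each of bin_count equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count  # assumes hi > lo
    counts = [0] * bin_count
    for v in values:
        # The maximum value would land one bin past the end, so clamp it.
        index = min(int((v - lo) / width), bin_count - 1)
        counts[index] += 1
    return counts


# Placeholder values, chosen only to show the mechanics.
sample = [1, 2, 2, 3, 5, 8, 9, 10]
counts = histogram(sample, 3)
print(counts)
```

Each count then becomes the height of one bar, drawn over the range of values its bin covers.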
The more generalized scatter plot is particularly useful for illustrating the relationship between two quantitative variables. This one, also from Wikipedia, plots eruptions of the Old Faithful geyser in Yellowstone National Park using two variables: the duration of each eruption on the horizontal, and the time since the previous eruption on the vertical:
Polar coordinates are used to plot points in circular arrangements, such as pie and radar charts. In this system, coordinates are expressed not as x and y, but as angle and radius. Polar charts are best suited for plotting cyclical values, such as wind direction, time of day (i.e., a clock), or categorical values that, when displayed as small multiples, can reveal similarities in shape:
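To actually draw a point given as angle and radius, we still have to convert it to x and y. Here is a small sketch of that conversion, assuming angles in radians measured counterclockwise from the positive x axis and paper-style coordinates with y pointing up. The `clock_position` helper is a hypothetical example of a cyclical value, with 12 o’clock at the top and hours running clockwise.

```python
import math


def polar_to_cartesian(angle, radius):
    """Convert (angle, radius) to (x, y), with the angle in radians
    measured counterclockwise from the positive x axis."""
    return (radius * math.cos(angle), radius * math.sin(angle))


def clock_position(hour, radius=1.0):
    """Place an hour on a clock face: 12 at the top, hours running clockwise."""
    # Each hour is one twelfth of a full turn; subtracting makes it clockwise.
    angle = math.pi / 2 - (hour % 12) * (2 * math.pi / 12)
    return polar_to_cartesian(angle, radius)


for hour in (12, 3, 6, 9):
    x, y = clock_position(hour)
    print(hour, round(x, 3), round(y, 3))
```

On a screen, the y value would also need to be flipped, since screen coordinates grow downward.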
For more examples, check out A Tour through the Visualization Zoo by Stanford Vis Group’s Jeff Heer, Michael Bostock and Vadim Ogievetsky, which profiles a variety of common visualization forms made with their Protovis library. And if you’re going to plot more than two variables against one another using only position, you might consider the ternary plot, 3D, animation, or even an interactive interface that allows the user to adjust one of the variables in real time.
Scales
Rarely will you find a data set expressed in terms of the same coordinates used to display it. In order to convert data values into display coordinates we apply one or more scales. A scale is the means by which we plot a variable on a given axis. Each scale has a minimum and a maximum (usually built from the calculated minima and maxima, but sometimes chosen specifically to over- or under-emphasize distributions), and defines a method for interpolating values between them. The linear scale on this NOAA chart shows the reader how to convert measurements on a map into distances in real life:
Let’s take another look at the example table from my introductory blog post:
If we wanted to create a bar chart of the subjects’ incomes, we would need to devise a scale for the y axis. The natural minimum for this scale would be 100, and the maximum 30,000. This example is easy because there are only three elements to plot: Jane goes at the bottom of the scale, and Alex at the top. In order to figure out where Joe goes, though, we have to do a little bit of math. Here I’m using y as a relative measurement of how far along the scale the value n should be positioned, where 0 would be the bottom and 1 the top. This is generally referred to as normalization:
y = (n - min) / (max - min)
y = (20,000 - 100) / (30,000 - 100)
y = 19,900 / 29,900
y ≈ 0.6656
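The same arithmetic can be sketched as a small function. This is only an illustration of the formula above: the name `normalize` is my own, and the incomes are the three values used in this example.

```python
def normalize(n, lo, hi):
    """Return how far n sits between lo and hi, as a fraction from 0 to 1."""
    return (n - lo) / (hi - lo)


incomes = {"Jane": 100, "Joe": 20_000, "Alex": 30_000}
lo = min(incomes.values())
hi = max(incomes.values())

# Position of each person along the scale: 0 is the bottom, 1 is the top.
positions = {name: normalize(value, lo, hi) for name, value in incomes.items()}
for name, y in positions.items():
    print(f"{name}: {y:.4f}")
```

Multiplying each of these fractions by the height of the chart in pixels gives the height of the corresponding bar.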
So, if our chart were 100 pixels tall, Joe’s bar would have a height of 66 pixels (or 67, if we round up):
One problem with this, though, is that Jane’s bar essentially has zero height because her low income corresponds to the bottom of the scale: (100 - 100) / (30,000 - 100) = 0. We can’t really “fix” that, but we can make it clearer (and avoid having to use a calculator!) by thinking of the y axis as 100-dollar increments (the greatest common divisor of this particular collection) and setting the minimum of the scale to zero. This way, you simply divide each number by 100 to get the height in pixels; so Jane’s bar is exactly 1 pixel tall, Joe’s is 200, and Alex’s is 300. It also simplifies the labeling of the vertical axis, because you can split it into nice, round numbers:
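The zero-based scale described above can be sketched as a general linear scale: a function that maps a range of data values (the domain) onto a range of display values. The name `linear_scale` and its parameter names are assumptions for this sketch, not the API of any particular library.

```python
def linear_scale(domain_min, domain_max, range_min, range_max):
    """Build a function that maps data values in [domain_min, domain_max]
    onto display values in [range_min, range_max] by linear interpolation."""
    def scale(n):
        t = (n - domain_min) / (domain_max - domain_min)  # normalize to 0..1
        return range_min + t * (range_max - range_min)
    return scale


incomes = {"Jane": 100, "Joe": 20_000, "Alex": 30_000}

# Zero-based scale: 0 to 30,000 dollars maps to 0 to 300 pixels,
# which works out to one pixel per 100 dollars.
to_pixels = linear_scale(0, 30_000, 0, 300)
heights = {name: round(to_pixels(value)) for name, value in incomes.items()}
print(heights)
```

Passing 100 instead of 0 as the domain minimum reproduces the earlier scale, where Jane’s bar has no height at all.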
Obviously this is an over-simplified example, but I hope that it illustrates why your choice of scale is important. We can emphasize or de-emphasize variances by making our bar charts short or tall, or we can intentionally set the scale minimum or maximum to a value outside the range of the data, as in “Miracles in Nature and Science”, from the Words and Years exhibit by Toril Johannessen, which plots the number of occurrences of the word “miracle” over time in the two eponymous periodicals:
Of course, it’s worth mentioning that unscaled values in their original unit of measurement might better suit some contexts for visualization than scaled values. This energy saving campaign depicts greenhouse gases produced by energy use as black balloons, each containing the volumetric equivalent of 50 grams. Imagine seeing Chris Jordan’s field of plastic bottles in real life. Most data sets probably aren’t worth expressing natively like this, but you should certainly consider displays that emphasize the physical dimensions of a particular data set as a useful way of drawing attention or raising awareness.
Other Visual Aspects
Once we’ve exhausted the physical dimensions of our chart as a means to communicate information, we may need to resort to modifying some other visual aspects of our elements:
Color
Color, with respect to Bertin’s variables, is expressed in two ways:
Hue: the color itself—red, blue, green, orange, purple, etc.
Value: the brightness, or intensity of a color. You can think of this as some combination of the value and saturation components in the HSV color space.
We’ll go a bit more in depth on color in the next couple of weeks. For now, though, let’s see how far we can get without having to use it. Feel free to experiment with varying color for categorical variables, but be warned that creating color scales for continuous variables is fraught with peril.
Shape
Varying the shapes of visual elements is a great way of encoding categorical variables. We’ll touch on a couple examples of this with your data sets tonight if it’s applicable.
Size
Size is well suited for positional arrangements on multiple axes, such as scatter plots. Gapminder, for instance, tends to encode a country’s population in its dot size. Note that, in many cases, research has revealed that circles of varying sizes are difficult for people to compare because we tend to interpret the area of a circle more easily than its radius. For that reason, it is the area of each circle, rather than its radius, that should be proportional to the value. You can calculate a proportional radius by taking the square root of the desired area divided by pi:
r = sqrt(area / π)
And vice-versa, the area from a radius:
area = π • r²
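Here is a small sketch of both formulas, plus one way to apply them when sizing dots so that each circle’s area, not its radius, is proportional to its value. The function names, the sample values and the 30-pixel maximum radius are all assumptions for illustration.

```python
import math


def radius_for_area(area):
    """Radius of a circle with the given area."""
    return math.sqrt(area / math.pi)


def area_for_radius(r):
    """Area of a circle with the given radius."""
    return math.pi * r ** 2


def radii_for_values(values, max_radius):
    """Choose radii so that each circle's area is proportional to its value,
    with the largest value drawn at max_radius."""
    largest = max(values)
    # Area grows with the square of the radius, so we scale the radius
    # by the square root of each value's share of the largest value.
    return [max_radius * math.sqrt(v / largest) for v in values]


# Placeholder values: the third is nine times the first, so its circle
# gets three times the radius and nine times the area.
radii = radii_for_values([1, 4, 9], 30)
print([round(r, 2) for r in radii])
```

Scaling the radius directly by the value would instead make the largest circle here 81 times the area of the smallest, which exaggerates the difference.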
Texture
Texture is often useful in visualization forms like bar and area charts, in which you may wish to encode a categorical variable of each element. It’s also particularly useful in maps to denote different types of area or foliage.
Visual Perception
Rigorous scientific research of visual perception is not a particularly recent development. As noted previously, figures like Willard Cope Brinton and Jacques Bertin illuminated many of the problems common to the statistical graphics of the 20th century and attempted to codify rules for designing representations that people could better understand. Statistical analyst John Tukey contributed a significant body of work not only to the practice of statistical analysis itself, but also to the modern-day understanding of how people “read” visual representations of data. More recently, William S. Cleveland and Robert McGill unveiled the findings of their research on the perception of visual cues in their paper Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods [PDF], published in the Journal of the American Statistical Association. They found the following aspects of visual elements to be most successful (this ranked list comes from Nathan Yau’s blog post on graphical perception):
Some researchers are even studying the aesthetic qualities of visualization in an attempt to learn which forms people find most beautiful. A less formal, but no less actionable, form of visual perception research and analysis is taking place in books by Edward Tufte, and sites like chartjunk (and junkcharts) exist to critique charts in the media (and sometimes correct them quite handily). Some business journals regularly feature articles that suggest graphing strategies for particular types of data. The Extreme Presentation blog published this guide that suggests specific types of charts for certain types of data, or aspects of it to be visualized (or jump straight to the PDF):
Visualization as a Process
As we create our visualizations, it’s important to consider that process as a way to learn something new about the data, and to derive new information from it. Try out as many of the forms as you can (within reason of course, and keeping in mind which ones are appropriate for different types and aspects of data), and see if you can draw any interesting conclusions from the distribution of particular values (remember to sort your values first!), or find potential correlations between two variables (by matching up two different sources of data with a common variable, or by using a scatter plot). Perhaps most importantly of all, save your work often (whether that means keeping a paper sketch or saving multiple versions of a file on your hard drive) and create artifacts along the way. Even experiments gone “wrong” can produce clues for how to visualize particular aspects of your data differently.
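As a rough sketch of the two preparation steps mentioned above, sorting values and matching up two sources on a common variable, something like the following would do. The helper name `join_on_key` and the two tiny data sets are placeholders with no real meaning.

```python
def join_on_key(source_a, source_b):
    """Pair up values from two sources for every key that appears in both."""
    return {key: (source_a[key], source_b[key])
            for key in source_a if key in source_b}


# Placeholder data: two sources that share some, but not all, of their keys.
source_a = {"a": 3, "b": 1, "c": 2}
source_b = {"b": 20, "c": 30, "d": 40}

# Sorting first makes the distribution of values easier to see.
sorted_a = sorted(source_a.items(), key=lambda item: item[1])
print(sorted_a)

# Each joined pair can be plotted as one (x, y) point on a scatter plot.
pairs = join_on_key(source_a, source_b)
print(pairs)
```

Keys that appear in only one source are dropped, so it is worth checking how many records survive the join before reading anything into the result.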
Homework UPDATED!
I’ll be posting a new entry with some specifics about your updated homework. Stay tuned!