Introduction
Data visualization is a pretty literal term that means, quite simply, the visual representation of quantitative data. In this course we’ll learn common techniques for visualizing data, as well as some strategies for managing information digitally. But first, a brief history.
A Brief History of Visualization
The Early Days
Although visualization hasn’t been widely recognized as a discipline in and of itself until fairly recently, today’s most popular forms date back more than two centuries. Geographical exploration, mathematics, and popularized history spurred the creation of early maps, graphs, and timelines as far back as the 1600s; but William Playfair is widely credited as the inventor of the modern chart, having created the first widely distributed line and bar charts in his Commercial and Political Atlas of 1786, and what is generally considered to be the first pie chart in his Statistical Breviary, published in 1801.
In that same year geologist William Smith drew his first sketch of the 1815 geological map of Great Britain, which many cartographers even today refer to as “The Map that Changed the World”:
The 1800s saw the invention of many new mapping and visualization forms, from Francis Galton’s weather maps, to the innovative time lapse photography of scientist Étienne-Jules Marey, which he used to study the motion of people, birds, horses, cats, smoke, and fluids.
In 1858, nurse and statistician Florence Nightingale pioneered the use of circular area charts to show that more British soldiers had died during the Crimean War as a result of poor hygienic conditions in battlefield hospitals than in combat. Her famous charts eventually became known as the “coxcombs” of a voluminous Royal Commission report, not because they looked like the crest of a rooster, but because they were its most colorful and ostentatious part: they immediately communicated useful information and galvanized public support for reforms.
Perhaps the most notable innovator of information graphics during this period was Charles Minard, who in 1869 published a geographical chart illustrating the decimation of Napoleon’s army during the 1812 Russian campaign. Popular visualization critic Edward Tufte says that this “may well be the best statistical graphic ever drawn”, and rightly so:
Many other “greatest hits” of visualization were invented in the 1800s, and most of them are chronicled in both Tufte’s books and the Milestones in Visualization project (from which I’ve culled many of these examples).
The So-Called “Dark Ages”
The 1900s saw the rise of a more formal, empirical attitude toward visualization, which tended to focus on aspects such as color, value scales, and labeling. Willard Cope Brinton’s Graphic Presentation details hundreds of charts, graphs, and maps; and suggests methods for improving the legibility of each form. You can read the entire book for free online at archive.org, or check out Michael Stoll’s selected photos of the hard copy:
In the mid-1900s cartographer and theorist Jacques Bertin published his Sémiologie Graphique, which some say serves as the theoretical foundation of modern information visualization. While most of his patterns are either outdated by more recent research or completely inapplicable to digital media, many are still very relevant to what we’re doing in this course. In particular, his definition of six visual variables is directly applicable to any graphical visualization. John Krygier and Denis Wood more recently incorporated these variables into their book, Making Maps: A Visual Guide to Map Design for GIS, which expands upon Bertin’s theories and illustrates how each variable, when applied to various map symbologies, better communicates quantitative or qualitative differences in geographical data. This image is from Understanding Graphics:
Recent History
Fast forward about 50 years, and here we are in the 2000s. In the last ten years the internet has emerged as a new medium for visualization, and brought with it a bag full of new tricks. Not only has the worldwide, digital distribution of both data and visualization made them more accessible to a broader audience (raising visual literacy along the way), but it has also spurred the design of new forms that incorporate interaction, animation, graphics rendering technology unique to screen media, and real-time data feeds to create immersive environments for communicating and consuming data. On the internet, visualization has graduated from the status of chart sidebar on a newspaper page to the interface that tells the story.
People are, seemingly all of a sudden, interested in data; and that interest has in turn sparked a need for visual tools that help them understand it. Visualization, in response to this need, has become increasingly dynamic. It’s no longer practical to create most charts or graphs by hand. Instead, we’ve designed new patterns for dynamic value scales; new interfaces for interactively manipulating chart dimensions, such as time; and we’ve developed new tools for managing data. For example:
Google Finance has popularized the interactive timeline chart; and their Spreadsheets offering has effectively removed the software barriers (in particular, expensive desktop applications such as Microsoft Excel) to collecting and storing data.
IBM Many Eyes, by Martin Wattenberg and Fernanda Viegas, has made it possible to plug arbitrary data into a variety of well-designed, interactive visualizations that can be embedded and shared elsewhere on the internet.
Nicholas Felton’s Daytum is a delightfully simple tool for collecting and displaying everyday data that describes our own personal habits, thoughts, and aspirations.
Cheap hardware sensors and DIY frameworks for building your own are driving down the costs of collecting analog data. Countless other applications, software tools, and low-level code libraries are springing up even as I write this to help people collect, organize, manipulate, visualize, and understand data from practically any source. The internet has also served as a fantastic distribution channel for visualizations; and a diverse (though not very “tight-knit”) community of designers, programmers, cartographers, tinkerers, and data wonks has assembled to disseminate all sorts of new ideas and tools for working with data in both visual and non-visual forms. Here is just a tiny sampling of my favorite visualization projects on the web:
Ben Fry’s Salary vs. Performance is an interactive visualization comparing American baseball teams’ sum player salaries with game winnings.
GE’s Health Visualizer visualizes American health statistics and allows you to interactively compare gender, risk factors, and conditions.
EveryBlock, though not particularly innovative in terms of visualization, presents a common visual vocabulary for quantitative and geographical data and applies it to everything from building inspections, to restaurant reviews, to crime reports.
I’d be remiss if I neglected to mention the Stack, Swarm, and Arc visualizations from Digg Labs, which all present the same real-time social news activity in their own ways. My colleagues and I at Stamen both developed these visualizations and designed the API (or Application Programming Interface) that drives them.
You’ll find many, many more fantastic examples (published on the web and elsewhere) on sites such as information aesthetics, Flowing Data, and visual complexity.
Google Maps has also single-handedly democratized both the interface conventions (click to pan, double-click to zoom) and the technology (256-pixel square map tiles with predictable file names) for displaying interactive geography online, to the extent that most people just know what to do when they’re presented with a map online. Flash has served well as a cross-browser platform on which to design and develop rich, beautiful internet applications incorporating interactive data visualization and maps; and now, new browser-native technologies such as canvas and SVG (sometimes collectively included under the umbrella of HTML5) are emerging to challenge Flash’s supremacy and extend the reach of dynamic visualization interfaces to mobile devices.
Advocates for various causes have also embraced visualization as a medium for communicating the breadth and depth of the problems they seek to publicize and, ultimately, solve. Hans Rosling, an expert on world development, used a specially designed tool called Gapminder as a storytelling device in a rousing, visualization-driven TED talk. Artists have also latched onto visualization as a medium for expressing information. Chris Jordan, who creates sprawling images of American consumption, explains in his own TED talk how visualization is an effective and necessary means for evoking emotional responses to data.
What Is Data, Anyway?
So what, exactly, are we referring to when we say “data”? York University professor and data visualization historian Michael Friendly defines it as “information which has been abstracted in some schematic form, including attributes or variables for the units of information”. I find it useful, though, not to think of data as an abstraction; but rather as an expression of occurrence. At Stamen we’re fond of saying that our favorite data to work with is anything that is created by humans, such as:
Activity on social networking sites
Geographical locations and categorizations of crime
Health, education, and economic indicators for nations of the world, over time
Financial transactions, often categorized by the type of goods or service they purchased; or grouped by day, month, or financial quarter when they relate to businesses
Tons of CO2 emitted by specific activities, by people’s aggregate activity, or averaged by nation
Web site visits, typically grouped by time and date
All of these examples can be expressed in tabular structure, which you’ll commonly encounter stored as an Excel file or in a Google Docs spreadsheet. In fact, most of the data that you’ll ever encounter will be tabular, and even some data models that you wouldn’t think of as tabular can be expressed as rows and columns, or what we sometimes refer to as a matrix. The spreadsheet, despite having been tarred by Microsoft’s notoriously unwieldy tools and proprietary file formats, is today the most well-understood and universal form in which to encode and transmit data.
Data Formats
Typically, tabular data will be provided as a plain-text CSV (Comma-Separated Values) file or a binary Excel file. Excel files can be converted into CSV by uploading them to Google Docs and exporting them from there. Here are some sites that provide CSV or Excel versions of some, if not all, of their data:
Data.gov, the federally maintained clearinghouse for US government data
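To make the CSV format a little more concrete, here is a minimal Python sketch of reading one. The file contents, the column names, and the read_table helper are invented placeholders of mine, not data from any of the sites above; only the csv and io modules are standard library.

```python
import csv
import io

# Hypothetical CSV text. A real file would be opened with open(path, newline="").
CSV_TEXT = """city,year,population
Springfield,2000,1200
Springfield,2010,1500
Shelbyville,2000,900
"""

def read_table(text):
    """Parse CSV text into a list of row dictionaries keyed by column name."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = read_table(CSV_TEXT)
columns = list(rows[0].keys())

print(columns)   # the header row becomes the column names
print(rows[0])   # every value arrives as a string, numbers included
```

Note that a CSV file carries no type information, so deciding which columns are numbers is left to whoever reads it.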
Data Models
A tabular structure can be used to express a variety of different data models. A data model, as far as it relates to a table, is simply a description of its rows and columns. In most models columns represent attribute or variable names, and each row represents a sample with a value for each column. Consider the following:
Here we’ve got several columns, each of which might serve as a potential variable for visualization. Age and income stand out as the ones most suitable for graphing, simply because they’re both numbers. Gender, if we had more than three rows, might be an attribute suitable for coloring dots or lines. Profession, like gender, is a qualitative (referring to quality rather than quantity, sometimes referred to as categorical) variable that might not lend itself to any particular visual distinction; but it may serve as an interesting filter. Name is the one that I would refer to as the “identifier”, the (hopefully) unique column that we could use to label individual points.
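The distinctions above can be sketched in a few lines of Python. The three rows below are invented stand-ins with the same columns as the table described here, and classify_columns and find_identifiers are hypothetical helper names. Treat this as one rough way to sort columns into roles, not a rule.

```python
# Hypothetical rows with the columns described above; every value is invented.
people = [
    {"name": "Alice", "age": "34", "gender": "F", "profession": "Engineer", "income": "72000"},
    {"name": "Bob", "age": "45", "gender": "M", "profession": "Teacher", "income": "48000"},
    {"name": "Carol", "age": "29", "gender": "F", "profession": "Engineer", "income": "53000"},
]

def is_number(value):
    """Return True if the string can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def classify_columns(rows):
    """Label each column quantitative (all numbers) or qualitative (anything else)."""
    kinds = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        all_numeric = all(is_number(v) for v in values)
        kinds[column] = "quantitative" if all_numeric else "qualitative"
    return kinds

def find_identifiers(rows, kinds):
    """Qualitative columns with no repeated values could label individual points."""
    return [
        column for column, kind in kinds.items()
        if kind == "qualitative"
        and len({row[column] for row in rows}) == len(rows)
    ]

kinds = classify_columns(people)
identifiers = find_identifiers(people, kinds)
print(kinds)
print(identifiers)
```

With only three rows the uniqueness check is fragile, which is why I called the identifier "(hopefully) unique".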
Some data sets may not have unique identifiers for each row, but may instead describe changes in one variable relative to another. For instance, a table of average temperatures for major American cities over time wouldn’t need unique identifiers for each row; we could make a line chart that connects the plotted points for each city with a unique color. (It’s worth noting that, in data sets which describe changes in one or more variables over time, time is itself one of the variables.)
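As a rough illustration of the temperature example, this Python sketch regroups a table with one row per city and month into one time-ordered series per city, which is the shape a line chart wants. The readings are made up, and series_by_city is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical table: (city, month number, average temperature). Values invented.
readings = [
    ("Chicago", 1, -3.0),
    ("Chicago", 7, 24.0),
    ("Miami", 1, 20.0),
    ("Miami", 7, 29.0),
    ("Chicago", 4, 10.0),
]

def series_by_city(rows):
    """Group rows by city and sort each group's points by time."""
    series = defaultdict(list)
    for city, month, temperature in rows:
        series[city].append((month, temperature))
    return {city: sorted(points) for city, points in series.items()}

lines = series_by_city(readings)
for city, points in lines.items():
    print(city, points)
```

Each key would become one colored line, and time serves as the shared horizontal variable.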
Aggregation and Granularity
One important thing to note is that most data is not, in fact, “raw” in any sense. The word “raw”, to me, implies collection closest to the source of the activity that it describes. Financial information is rarely expressed as a list of transactions, but as an aggregation of transaction totals grouped by uniform time periods. Aggregation is not necessarily a bad thing, though, because most data sets aren’t particularly revealing in their “raw” form. We typically refer to most aggregations and mathematical operations on more granular data as statistics.
Aggregations of highly granular data allow us to understand changes in variables over long periods of time; at the scales of cities or countries; or relative to particular demographic groups, such as gender, age, and political orientation. There is such a thing as “premature aggregation”, though. Collecting data and “bucketing” it without recording all of the variables may result in a loss of useful information, and later prevent aggregation along interesting axes.
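Here is a small Python sketch of what I mean by premature aggregation. The transactions are invented and total_by is a hypothetical helper. The point is that the granular rows can be totaled along either axis, while the monthly totals alone can no longer be split by category.

```python
from collections import defaultdict

# Hypothetical granular transactions: (ISO date, category, amount). Values invented.
transactions = [
    ("2010-01-05", "food", 12.50),
    ("2010-01-20", "books", 30.00),
    ("2010-02-03", "food", 8.25),
    ("2010-02-14", "food", 41.75),
]

def total_by(rows, key):
    """Sum the amount column, grouped by whatever key(row) returns."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# From the granular rows we can aggregate along either axis.
by_month = total_by(transactions, lambda row: row[0][:7])   # "YYYY-MM"
by_category = total_by(transactions, lambda row: row[1])

print(by_month)
print(by_category)
# If only by_month had been recorded, by_category could not be rebuilt from it.
```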
“Special” Variables
As you can see by browsing through some of the data catalogs I’ve listed above, most data is not limited by having to be expressed in a table. There are some exceptions that you should be aware of, though, which I’m going to refer to as “special” because they deserve extra attention when it comes to formatting input values, labeling, and positioning.
Time
As mentioned previously, time is one of the most common variables. Time is special because it can be represented on many different scales, from the second (and, in some cases, the millisecond) to one or more years. Often, the rawest forms of data are aggregated into tables that list totals of other variables grouped by hour, day, month, or year. You’ll need to be sensitive to the time scale of certain data sets when plotting them on charts and graphs.
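To show how the same events look at different time scales, here is a minimal Python sketch that counts timestamps by hour, day, month, or year. The timestamps are invented, and FORMATS and count_by are hypothetical names of mine; the strftime format codes are standard.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps, such as web site visits. Values invented.
visits = [
    datetime(2010, 3, 1, 9, 15, 2),
    datetime(2010, 3, 1, 9, 48, 30),
    datetime(2010, 3, 1, 17, 5, 0),
    datetime(2010, 3, 2, 8, 0, 0),
    datetime(2010, 4, 10, 12, 30, 0),
]

# Each scale truncates a timestamp to a coarser label.
FORMATS = {
    "hour": "%Y-%m-%d %H:00",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
    "year": "%Y",
}

def count_by(timestamps, scale):
    """Count events per bucket at the given time scale."""
    fmt = FORMATS[scale]
    return Counter(ts.strftime(fmt) for ts in timestamps)

for scale in FORMATS:
    print(scale, dict(count_by(visits, scale)))
```

The coarser the scale, the fewer and larger the buckets, so a pattern visible by hour may vanish by month.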
Location
The word “location” can mean many things:
A street address
The name of a specific place, such as a park or lake
A more general area, such as city, state, or country
Precise geographical coordinates, typically expressed as latitude and longitude
To some degree of precision, the first three can be turned into the last. Google Maps, for example, exists primarily as a service for translating an address or place name into geographical coordinates so that it can be pinpointed on a map. This process is generally called geocoding, and can be reversed to derive place names from geographical coordinates. Latitude and longitude aren’t typically necessary (or even useful) unless you’re working with tools that understand them, though, so we’ll likely be focusing on the less specific location types. Street addresses, for instance, can be plotted on a map that lists block numbers (or you can just Google the address and use the map as guidance).
Location, like time, is also ripe for aggregation. Certain data is much less interesting at the street level than it is at the neighborhood or city level. The US Census Bureau collects multivariate data aggregated by “tracts”, which they designate to contain a relatively uniform number of people so that they can be compared independent of population density. Tract statistics are then aggregated into cities, counties, and states. For privacy reasons, the Census Bureau never releases the “raw” results of their surveys.
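The tract-to-county-to-state roll-up can be sketched in Python as follows. I’m assuming tract identifiers laid out as a 2-digit state code, a 3-digit county code, and a 6-digit tract code, so that a prefix names the enclosing area. The identifiers and population counts are invented for illustration, and roll_up is a hypothetical helper.

```python
from collections import Counter

# Hypothetical tract populations keyed by 11-digit identifiers
# (assumed layout: 2-digit state + 3-digit county + 6-digit tract). Counts invented.
tract_population = {
    "06075010100": 4100,
    "06075010200": 3900,
    "06001400100": 4300,
    "36061000100": 4000,
}

def roll_up(values, prefix_length):
    """Sum tract values into the larger areas named by an identifier prefix."""
    totals = Counter()
    for identifier, value in values.items():
        totals[identifier[:prefix_length]] += value
    return dict(totals)

by_county = roll_up(tract_population, 5)   # state + county digits
by_state = roll_up(tract_population, 2)    # state digits only

print(by_county)
print(by_state)
```

Cities don’t fit this prefix scheme, so aggregating tracts into cities would need a separate lookup table.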
Data Model Patterns
A number of common patterns have arisen to deal with the visualization of common data models, which often deal with one or both of our “special” variables: time and location. The line chart is an obvious example that tracks the change of one or more variables over time. The choropleth, or “heat map”, is another. This is the New York Times’ excellent 2008 electoral map which demonstrates aggregation of presidential voting tallies by both county and state:
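A choropleth boils down to turning each region’s value into a color. This Python sketch does it with equal-interval classes, which is only one of several classification schemes and not necessarily what the Times map uses. The region names, values, and palette are invented placeholders, and equal_interval_class is a hypothetical helper.

```python
def equal_interval_class(value, low, high, classes):
    """Return the index (0-based) of the equal-width class that contains value."""
    if high == low:
        return 0
    fraction = (value - low) / (high - low)
    # The maximum value would land one past the end, so clamp it into the last class.
    return min(int(fraction * classes), classes - 1)

# Hypothetical vote shares per region and a light-to-dark palette. Both invented.
share = {"A": 0.31, "B": 0.48, "C": 0.55, "D": 0.72}
palette = ["#eff3ff", "#bdd7e7", "#6baed6", "#2171b5"]

low, high = min(share.values()), max(share.values())
classes = {
    region: equal_interval_class(value, low, high, len(palette))
    for region, value in share.items()
}
colors = {region: palette[index] for region, index in classes.items()}

print(classes)
print(colors)
```

Aggregating by county rather than by state only changes which regions appear as keys; the value-to-color mapping stays the same.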
We’ll investigate a few of these patterns in the next three weeks.
Next Week
Next week we’re going to visualize some data. Your homework in the meantime is to collect three distinct data sets that you find interesting, and would like to learn something about through the process of visualization. For the purposes of sorting and filtering data, your tables should be saved either in Excel (or Numbers) locally, or on Google Docs, which allows you to upload CSV and binary spreadsheets from Excel or Numbers. Feel free to share your spreadsheets with me (shawn at stamen dot com) and I will gladly look them over to ensure that they’re usable for next week’s exercise.
For inspiration and some more historical perspective, Edward Tufte’s books come highly recommended. The Beautiful Visualization O’Reilly book also provides a nice overview of both modern visualization and some of the canonical classics.
If you’re interested in creating your own data set, I would recommend reading The Data-Driven Life, a Times article summarizing a decade’s worth of innovation in the field of self-quantification via automated and/or obsessive data collection. I can highly recommend both Daytum and your.flowingdata as collection tools, but you may find it easier just to carry around a pen and a pad of paper.
Good luck, and see you next week!