roadtolarissa
Adam Pearce

Data Exploration With D3

D3 is best known for polished interactive visualizations. With its rich API, however, it is also an excellent tool for acquiring and, with a bit of work, exploring data. This post will walk through scraping and plotting different dimensions of the history of the Oscars as an instructive example.

Scraping data

The Academy Awards Database displays all award nominations on a single page (pick award years 1927 to 2014 and click search). The Elements tab of the dev tools reveals the structure of the page:

All the awards are contained within a single dl element. Each year and award type are denoted with dt and div elements, while the actual nominations are table elements interwoven - not nested - between. While document.querySelectorAll or the already loaded jQuery could be used to traverse the DOM, injecting D3 onto the page allows us to use the same API for gathering and displaying data. A little bit of JavaScript in the console does the trick:

var script = document.createElement("script")
script.src = 'http://d3js.org/d3.v3.min.js'
document.body.appendChild(script)

Iterating over each child of the dl element, we build an array of nominations by tracking the current year and award type. Each time a table element is encountered, a new object is added to the nominations array with the year, award and name of the nominee parsed from the table text.

var nominations = [],
    curYear,
    curAward

d3.selectAll('dl > *').each(function(){
  var sel = d3.select(this)
  if      (this.tagName == 'DT'){
    curYear = sel.text()
  }
  else if (this.tagName == 'DIV'){
    curAward = sel.text()
  }
  else{
    nominations.push({
      year: curYear, 
      award: curAward, 
      name: sel.text().split(' -- ')[0]
    })
  }
})
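
The state-tracking loop can also be tried outside the browser. The sketch below assumes a flat array of {tag, text} objects standing in for the children of the dl element; the children array, the collectNominations name and the exact text layout are made up for illustration and are not copied from the Academy's page.

```javascript
// Hypothetical stand-in for the children of the dl element, in document order
var children = [
  {tag: 'DT',    text: '2013 (86th)'},
  {tag: 'DIV',   text: 'ACTRESS'},
  {tag: 'TABLE', text: 'Cate Blanchett -- Blue Jasmine'},
  {tag: 'TABLE', text: 'Meryl Streep -- August: Osage County'},
  {tag: 'DIV',   text: 'ACTOR'},
  {tag: 'TABLE', text: 'Matthew McConaughey -- Dallas Buyers Club'}
]

// Walk the siblings once, remembering the most recent year and award seen
function collectNominations(nodes){
  var rows = [], curYear, curAward
  nodes.forEach(function(node){
    if      (node.tag == 'DT')  curYear  = node.text
    else if (node.tag == 'DIV') curAward = node.text
    else rows.push({
      year:  curYear,
      award: curAward,
      name:  node.text.split(' -- ')[0]
    })
  })
  return rows
}

var sampleNominations = collectNominations(children)
console.log(sampleNominations.length)    // 3
console.log(sampleNominations[2].award)  // ACTOR
```

Because the tables are siblings of the year and award markers rather than children, the only way to know which year a table belongs to is to remember the last marker seen, which is all curYear and curAward do.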

Writing this code in the sources tab as a snippet and checking the output by running > table(nominations) in the console creates a pleasantly short feedback loop. To get more information from each nomination, set a breakpoint inside the else block and try looking for something else that can be programmatically read from the text (I also get the name of the movie, the winner and clean up the year field).

For a larger project, setting up a reproducible data pipeline is generally a best practice; in the interest of taking a quick look at the Oscars, we’ll rely on the clipboard. d3.csv.format converts an array of objects to a csv string. With > copy(d3.csv.format(nominations)), the nomination data is copied to the clipboard, which can then be pasted and saved into a csv file locally.
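
To make the conversion less of a black box, here is a rough, hand-rolled sketch of what turning an array of objects into a csv string involves. It is not d3's implementation: toCsv is a hypothetical name, the header is assumed to come from the first row, and the second sample name is deliberately contrived to contain a comma.

```javascript
// Hand-rolled sketch of array-of-objects to csv; not d3's implementation
function toCsv(rows){
  // assumption: every row has the same keys as the first one
  var columns = Object.keys(rows[0])

  // quote fields containing commas, quotes or newlines; double embedded quotes
  function formatValue(value){
    var text = String(value)
    return /[",\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text
  }

  var lines = rows.map(function(row){
    return columns.map(function(column){ return formatValue(row[column]) }).join(',')
  })
  return [columns.map(formatValue).join(',')].concat(lines).join('\n')
}

var csvString = toCsv([
  {year: '2013', award: 'ACTRESS', name: 'Cate Blanchett'},
  {year: '2013', award: 'ACTRESS', name: 'Streep, Meryl'}
])
console.log(csvString)
// year,award,name
// 2013,ACTRESS,Cate Blanchett
// 2013,ACTRESS,"Streep, Meryl"
```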

Seeing data

To get started quickly with minimal fuss, I usually grab a copy of my d3-starter-kit repo. It contains d3, lodash, d3-jetpack and some helper functions for generating tooltips, scales, axes and styling them.

Having saved the array of nominations as data.csv, we can load it into the starter-kit template and check the integrity of our data:

d3.csv('data.csv', function(nominations){
  //convert the award ceremony index to a number  
  nominations.forEach(function(d){ d.ceremonyNum = +d.ceremonyNum })

  //check that every ceremony has been loaded
  d3.extent(nominations, ƒ('ceremonyNum')) //[1, 87]

  //later snippets that use nominations belong inside this callback
})

Passed a single string, ƒ returns a function that takes an object and returns the property of that object named by the string. For the computer, ƒ('ceremonyNum') is equivalent to function(d){ return d.ceremonyNum }. For humans, the lack of syntactic noise makes it more expressive and quicker to type - critical for rapid prototyping.
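
For anyone curious how little is needed for the single-string case, a minimal sketch follows. It is an illustration rather than the d3-jetpack source, and the nominee object is sample data.

```javascript
// Sketch of the single-string case only; not the d3-jetpack source
function ƒ(field){
  return function(d){ return d[field] }
}

var nominee = {name: 'Meryl Streep', ceremonyNum: 86}
var getCeremony = ƒ('ceremonyNum')

console.log(getCeremony(nominee))       // 86
console.log([nominee].map(ƒ('name')))   // [ 'Meryl Streep' ]
```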

Let's focus on actress nominations:

//select only actress nominations
var actressNominations = nominations.filter(function(d){ 
  return d.award == 'ACTRESS' })

//group by name
var byActress = d3.nest().key(ƒ('name')).entries(actressNominations)

//sanity check - Meryl Streep has 15 nominations
d3.max(byActress, ƒ('values', 'length'))  //15

d3.nest takes a key function and an entries array, grouping the members of the entries by the value returned from calling the key function on each entry. It returns an array of group objects. Each has a key property, here the name of an actress, and an array of values, here an array of nominations.

When passed multiple string arguments, ƒ converts each string into a field accessor function and returns their composition. ƒ('values', 'length') is equivalent to function(d){ return d.values.length }. Calling it with every group object and calculating the maximum returns the most Best Actress nominations a single person has received. Calculating known summary statistics from your data - here Meryl’s 15 Best Actress nominations - is a great way of double checking your data and calculations.
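
The grouping step itself is small enough to sketch in plain JavaScript. The version below handles a single key only and keeps groups in first-seen order; nestByKey and the sample rows are made up for illustration, and d3.nest does considerably more (multiple keys, sorting, rollups).

```javascript
// Sketch of single-key nesting; returns [{key, values}, ...] like the groups described above
function nestByKey(entries, keyFn){
  var groups = [], groupByKey = {}
  entries.forEach(function(entry){
    var key = String(keyFn(entry))
    if (!Object.prototype.hasOwnProperty.call(groupByKey, key)){
      groupByKey[key] = {key: key, values: []}
      groups.push(groupByKey[key])   // keep groups in first-seen order
    }
    groupByKey[key].values.push(entry)
  })
  return groups
}

// made-up rows
var sampleRows = [
  {name: 'A', ceremonyNum: 1},
  {name: 'B', ceremonyNum: 1},
  {name: 'A', ceremonyNum: 2},
  {name: 'A', ceremonyNum: 4}
]

var byName = nestByKey(sampleRows, function(d){ return d.name })
var mostNominations = Math.max.apply(null, byName.map(function(d){ return d.values.length }))

console.log(byName.map(function(d){ return d.key }))  // [ 'A', 'B' ]
console.log(mostNominations)                          // 3
```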

I’m curious about the relationship between the number of previous nominations that nominees and actual winners had. To get an overview of the data, I’ll start by making an Amanda Cox style record chart. To do that, each nomination needs information about the nominee’s previous nominations:

//count previous nominations
byActress.forEach(function(actress){
  actress.values.forEach(function(nomination, i){
    nomination.prevNominations = i
    nomination.otherNominations = actress.values
  })
})

Since the nominations are already sorted by year, the index of each actress’ nomination in its nested values array is equal to the number of previous nominations the actress had at the time of the nomination. While looping over the nominations in the values array, I also attach a reference to the actress’ other nominations so it will be easy to find a nominee’s other nominations later.
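
The index trick only works when the values are in chronological order. If that can't be taken for granted, sorting first is cheap insurance; the sketch below uses made-up nominations for a single nominee, deliberately out of order.

```javascript
// made-up nominations for one nominee, deliberately out of order
var careerValues = [
  {ceremonyNum: 55, won: true},
  {ceremonyNum: 51, won: false},
  {ceremonyNum: 54, won: false}
]

// sort chronologically so that index == number of earlier nominations
careerValues.sort(function(a, b){ return a.ceremonyNum - b.ceremonyNum })

careerValues.forEach(function(nomination, i){
  nomination.prevNominations = i
  nomination.otherNominations = careerValues
})

console.log(careerValues.map(function(d){ return d.ceremonyNum + ':' + d.prevNominations }))
// [ '51:0', '54:1', '55:2' ]
```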

Time to graph the data!

var c = d3.conventions({parentSel: d3.select('#nominations-scatter')})

//compute domain of scales
c.x.domain(d3.extent(actressNominations, ƒ('ceremonyNum')))
c.y.domain(d3.extent(actressNominations, ƒ('prevNominations')))

//draw x and y axis
c.drawAxis()

//draw a circle for each actress nomination
c.svg.dataAppend(actressNominations, 'circle.nomination')
    .attr('cx', ƒ('ceremonyNum', c.x))
    .attr('cy', ƒ('prevNominations', c.y))
    .classed('winner', ƒ('won'))
    .attr('r', 3)
    .call(d3.attachTooltip)

There are a couple of abstractions from d3-starter-kit that make this code shorter and more readable. d3.conventions returns an object with automatically configured margins, svg, scales and axes - saving the tedium of cobbling together snippets from several bl.ocks over and over again.

c.svg.dataAppend(actressNominations, 'circle.nomination') is shorthand for:

c.svg.selectAll('circle.nomination')
    .data(actressNominations).enter()
  .append('circle')
    .classed('nomination', true)

In addition to converting strings into accessor functions, ƒ also composes functions. Typically the functions passed to attr or style select a single property from the data bound to an element and encode it as a visual property with a scale function. Instead of typing this same type of function over and over - .attr('cx', function(d){ return c.x(d.ceremonyNum) }) - we can strip it down to its bare essentials with .attr('cx', ƒ('ceremonyNum', c.x)).
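
A sketch of how such a helper might mix strings and functions is below. It is an illustration, not the d3-jetpack source, and xScale is a hypothetical hand-written linear scale standing in for c.x (ceremonies 1 to 87 mapped onto 0 to 860 pixels).

```javascript
// Sketch of a ƒ-style helper that accepts strings and functions; not the d3-jetpack source
function ƒ(){
  // strings become field accessors, functions are kept as they are
  var steps = Array.prototype.slice.call(arguments).map(function(arg){
    return typeof arg == 'function' ? arg : function(d){ return d[arg] }
  })
  // apply the steps left to right
  return function(d){
    return steps.reduce(function(value, step){ return step(value) }, d)
  }
}

// hypothetical linear scale: ceremonies 1..87 onto 0..860 pixels
function xScale(ceremonyNum){ return (ceremonyNum - 1) / (87 - 1) * 860 }

var longForm  = function(d){ return xScale(d.ceremonyNum) }
var shortForm = ƒ('ceremonyNum', xScale)

var datum = {ceremonyNum: 44, values: [1, 2, 3]}
console.log(longForm(datum) === shortForm(datum))  // true
console.log(shortForm(datum))                      // 430
console.log(ƒ('values', 'length')(datum))          // 3
```

Reading the arguments left to right mirrors the order in which the data flows: pick the field, then hand it to the scale.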

d3.attachTooltip adds a basic tooltip showing all the properties attached to an element, removing the need to Inspect element and run > d3.select($0).datum() to examine outliers and other interesting points.

The result:

Unfortunately nominations in the same year with the same number of previous nominations cover each other up. We can fix that by grouping on year and number of previous nominations, then offsetting nominations in the same group so they don’t overlap:

d3.nest()
  .key(function(d){ return d.ceremonyNum + '-' + d.prevNominations })
  .entries(actressNominations)
.forEach(function(year){
  //sort nominations so winners come first  
  year.values.sort(d3.ascendingKey('won')).forEach(function(d, i){
    d.offset = i
    //save new position as a property for labels later
    d.pos = [c.x(d.ceremonyNum) + i*1.5, c.y(d.prevNominations) - i*3]
  })
})

var circles = c.svg.dataAppend(actressNominations, 'circle.nomination')
    //position with transform translate instead
    .translate(ƒ('pos'))
    .classed('winner', ƒ('won'))
    .attr('r', 3)

Just like with the calculation of previous nominations, we’ve grouped the data, sorted items in the same group, and saved their index. This pattern is useful in a wide range of situations and d3 makes it easy to use.

This graph is functional but it is difficult to see the arcs of different careers. We can start by highlighting all of an actress’ nominations on mouseover:

var mouseoverPath = c.svg.append('path.connection')

circles.on('mouseover', function(d){
  //make nominations with the same name larger
  circles.attr('r', function(e){ return d.name == e.name ? 7 : 3 })

  //connect them with a path
  mouseoverPath.attr('d', 'M' + d.otherNominations.map(ƒ('pos')).join('L'))
})

Saving a reference to an actress’ other nominations and storing position as an [x, y] property makes drawing a path connecting them simple.
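
The path string construction leans on how JavaScript stringifies nested arrays, which is easy to check in isolation. The positions below are made-up pixel coordinates.

```javascript
// made-up [x, y] pixel positions for one nominee's nominations
var positions = [[10, 200], [40, 180], [95.5, 160]]

// 'M' moves to the first point and each 'L' draws a straight line to the next;
// joining an array of pairs stringifies every pair as "x,y"
var pathData = 'M' + positions.join('L')

console.log(pathData)  // M10,200L40,180L95.5,160
```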

Connecting and labeling the nominations of very successful actresses helps provide context while examining other careers. Oscar nominations aren’t nearly as sparse as home runs or touchdown passing records; only the actresses with more than 5 nominations are connected so the bottom of the graph doesn’t turn to spaghetti.

var topActresses = byActress.filter(function(d){ return d.values.length > 5 })

c.svg.dataAppend(topActresses, 'path.connection')
    .attr('d', function(d){ return 'M' + d.values.map(ƒ('pos')).join('L') })

c.svg.dataAppend(topActresses, 'text')
    //values are sorted by time - most recent nomination is always last 
    .translate(function(d){ return _.last(d.values).pos })
    .text(ƒ('key'))
    .attr({dy: -4, 'text-anchor': 'middle'})

While JavaScript doesn’t have an abundance of statistics packages like R or Python, d3.nest and d3.mean make rudimentary trend analysis possible:

//group by year
var byYear = d3.nest().key(ƒ('ceremonyNum')).entries(actressNominations)
byYear.forEach(function(d, i){
  //for each year, select previous 15 years
  var prevYears = byYear.slice(Math.max(0, i - 15), i + 1)
  //create array of all nominations over previous 15 years
  var prevNoms = _.flatten(prevYears.map(ƒ('values')))

  //average previous nominations for nominees and winners 
  d.nomAvgPrev = d3.mean(prevNoms,                  ƒ('prevNominations'))
  d.wonAvgPrev = d3.mean(prevNoms.filter(ƒ('won')), ƒ('prevNominations'))
})

Looping over each year, the average number of previous nominations over the past 15 years is computed and attached to each year group. This isn’t a particularly efficient way of calculating a rolling average - see science.js or simple-statistics for that - but our data set is small and it gets the job done.
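
For comparison, one way to avoid re-slicing on every step is to keep a running sum. The sketch below is a simplification of the calculation above: it assumes one number per year rather than pooling every nomination in the window, and trailingMean, span and the sample counts are made-up names and data.

```javascript
// Trailing mean over the current item and up to `span` preceding ones,
// maintained with a running sum instead of slicing each time
function trailingMean(values, span){
  var sum = 0
  return values.map(function(value, i){
    sum += value
    // drop the item that just fell out of the window
    if (i > span) sum -= values[i - span - 1]
    return sum / Math.min(i + 1, span + 1)
  })
}

var yearlyCounts = [0, 2, 4, 6]
console.log(trailingMean(yearlyCounts, 1))  // [ 0, 1, 3, 5 ]
```

As in the slice above, a span of 15 covers the current year plus the 15 before it, so 16 ceremonies in total once enough history exists.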

var line = d3.svg.line()
    .x(ƒ('key', c.x))
    .y(ƒ('nomAvgPrev', c.y))

c.svg.append('path.nomAvg').attr('d', line(byYear))
c.svg.append('path.winAvg').attr('d', line.y(ƒ('wonAvgPrev', c.y))(byYear))

Again, ƒ provides a succinct way of grabbing a property from an object and transforming it with a scale.

Over the last 20 years, the Academy has picked best actresses with fewer previous nominations than the other nominees. Since all the nominations have already been scraped and there’s only one line of actress-specific code, return d.award == 'ACTRESS', seeing if this pattern holds across supporting actresses, actors and directors isn’t too difficult - grab a copy of the repo and try!

Animating data

Encoding the data differently shows different patterns. Combining D3 with these helper functions allows us to rapidly explore the space of potential visualizations.

For example, while it’s clear that Streep has the most nominations, by deemphasizing time we can see the distribution of nominations across actresses. First, create a g element for each actress and sort vertically by number of nominations:

c.y.domain([0, topActresses.length - 1])

topActresses = topActresses.sort(d3.ascendingKey(ƒ('values', 'length')))
var actressG = c.svg.dataAppend(topActresses, 'g')
    .translate(function(d, i){ return [0, c.y(i)] })

actressG.append('text.name').text(ƒ('key'))
    .attr({'text-anchor': 'end', dy: '.33em', x: -8})

Then append a circle for each nomination:

c.x.domain([0, d3.max(topActresses, ƒ('values', 'length'))])
actressG.dataAppend(ƒ('values'), 'circle.nomination')
    .classed('winner', ƒ('won'))
    .attr('cx', function(d, i){ return c.x(i) })
    .attr('r', 4)
    .call(d3.attachTooltip)

Just like we’ve abstracted the process of creating functions to transform properties of objects into visual attributes, we can abstract the sorting of actress rows and positioning of nominations into functions:

var positionByNomintions = { 
  label:  'Most Nominations',
  //position circles
  setX: function(){
    c.x.domain([0, d3.max(topActresses, ƒ('values', 'length'))])

    topActresses.forEach(function(actress){
      actress.values.forEach(function(d, i){ d.x = c.x(i) })
    })
  },
  //order for rows
  sortBy: ƒ('values', 'length')
}

function renderPositioning(d){
  //position circles by updating their x property
  d.setX()
  actressG.transition()
    .selectAll('circle')
      .attr('cx', ƒ('x'))

  //save order to actress object
  topActresses
    .sort(d3.ascendingKey(d.sortBy))
    .forEach(function(d, i){ d.i = i })

  actressG.transition()
      .translate(function(d){ return [0, c.y(d.i)] })

}

renderPositioning(positionByNomintions)

By creating more objects with setX and sortBy functions, we can quickly investigate other arrangements of the data like the distribution of wins or the longest career:

var positionByWins = { 
  label:  'Most Wins',
  setX: function(){
    c.x.domain([0, d3.max(topActresses, ƒ('values', 'length'))])

    topActresses.forEach(function(actress){
      actress.values
        .sort(d3.ascendingKey(ƒ('won')))
        .forEach(function(d, i){ d.x = c.x(i) })
    })
  },
  //lexicographic sort
  sortBy: function(d){ return d.wins*100 + d.noms }
}

var positionByCareerLength = {
  label: 'Longest Career',
  setX: function(){
    c.x.domain([0, d3.max(topActresses, careerLength)])

    topActresses.forEach(function(actress){
      actress.values.forEach(function(d){
        d.x = c.x(d.ceremonyNum - actress.values[0].ceremonyNum)
      })
    })
  },
  //order rows by career length
  sortBy: careerLength
}

//number of years between first and last nomination
function careerLength(d){
  return _.last(d.values).ceremonyNum - d.values[0].ceremonyNum 
}
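
The sortBy functions rely on sorting by a computed key, and the wins*100 + noms expression packs two keys into one number. A small sketch of both ideas follows; ascendingBy is a hypothetical comparator factory written in the spirit of d3.ascendingKey (not its source), and the win and nomination counts are made up.

```javascript
// Sketch of a comparator factory that sorts by a computed key
function ascendingBy(keyFn){
  return function(a, b){
    var ka = keyFn(a), kb = keyFn(b)
    return ka < kb ? -1 : ka > kb ? 1 : 0
  }
}

// made-up win and nomination counts
var sampleCareers = [
  {key: 'A', wins: 1, noms: 12},
  {key: 'B', wins: 2, noms: 6},
  {key: 'C', wins: 1, noms: 7}
]

// wins dominate; nominations only break ties, as long as noms stays below 100
var byWinsThenNoms = function(d){ return d.wins*100 + d.noms }

var ordered = sampleCareers.slice()
  .sort(ascendingBy(byWinsThenNoms))
  .map(function(d){ return d.key })

console.log(ordered)  // [ 'C', 'A', 'B' ]
```

The multiplier only has to exceed the largest possible value of the secondary key, so 100 leaves plenty of room for nomination counts.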

Storing the created positionBy objects in an array makes creating a toggle to switch between them simple:

var positionings = [positionByNomintions, positionByWins, positionByCareerLength]

d3.select('#buttons').dataAppend(positionings, 'span.button')
    .text(ƒ('label'))
    .on('click', renderPositioning)

This is just a starting point. If we think of anything to sort or group our data on, seeing it only requires writing a short function. We’ve also only looked at a small slice of the whole nomination dataset.

Interesting things to read

ggplot2, dplyr, magrittr and rstudio create a wonderfully integrated environment for quickly and efficiently analyzing data. While using these tools to create visualizations for the web unfortunately requires some duplication of effort, the advice in Hadley Wickham’s Tidy Data and The Split-Apply-Combine papers applies to anyone manipulating data. This post is essentially an amalgamation of different ways I’ve been trying to pull ideas from the Hadleyverse into my D3 work.

Tamara Munzner’s Visualization Analysis & Design has helped me think significantly more clearly about ways of representing data.

Learn JS Data is a great introduction to lightweight data analysis with D3 and underscore. Other good resources on functional JavaScript include Eloquent JavaScript, Functional JavaScript and JavaScript Allongé.

Code for this post on github.