NBA Draft

Interactive graphic comparing the top picks to the top performers in the NBA

This is another remake, courtesy of the NYT:

[Image: the NYT’s NFL draft graphic]

Assorted thoughts:

  • Clearer differences between players of different skill levels. In basketball, there is a huge difference between the best player in a draft (LeBron James in ’03) and the 5th best (David West in ’03). Representing the top players with a gradient highlights these differences. Shades of grey also differentiate average players from those who never played. I’m not sure this color scheme would have worked as well with the NFL graphic, since the larger drafts require thinner bars.
  • More information about players.
    Instead of a tricky formula based on “number of starts, Pro Bowls and other factors,” I went with the product of minutes played and player efficiency rating. Adding a chart of those numbers exposes those calculations, and making it yearly creates a snapshot of a player’s entire career. Looking at the bar graph in isolation now, it really looks like it needs axis tick marks. In the context of the whole graphic though, they make things way too busy.
    I also copied from the last two projects I worked on and added an image of each player to the tooltip.
    Finally, clicking a bar opens a link with more information about the player which makes it significantly simpler for the viewer to continue finding things out.
  • Team-specific information. To make it easier to find a player or see how a team has done over time, mousing over or clicking a team name highlights picks by that team. Moving the slider also updates each team’s number of top picks. I’m a big fan of having animated transitions be accompanied by text updates. It makes the comparisons between two stats more concrete and quantitative while generating memorable takeaway numbers on the fly (“did you know that the Grizzlies have never drafted a top 3 player?”). I think the NYT’s incorporation of this effect was more successful; by only updating two numbers at a time, it could explain those numbers in a sentence and draw attention to them with bold text.
  • Because I was tacking on features, I ended up with a more cluttered display. I usually try to make the instructions only a sentence long to reduce the amount of text on the page. That wasn’t possible here, so I ended up with a big block of words at the end. I probably should add some images, color, and/or formatting to it, but it has been almost a month since my last post and I’m ready to move on to the next project now.
  • I really like how the NYT graphic passively shows player names in an unobtrusive manner.
Posted in Uncategorized | Leave a comment

Meteor Map

[Image: the finished meteor map]

An interactive map made for visualizing.org’s Meteorites contest.

I started from ‘scratch’ with a blank map of the world generated by d3. Each meteor in the contest data set had a latitude and longitude, so I used those coordinates to plot a small circle on the map to represent each impact. After coloring the circles orange, the map looked almost like the one above.

Most of the ~30 hours I’ve spent working on this map were spent adding some of these small improvements:

  • Circle size and color proportional to meteor mass
    The data set also included the meteors’ mass. I initially tried to make each circle’s radius proportional to the square root of the mass so each pixel of circle would correspond to some amount of mass. Because the masses of the meteors varied so much, the largest circles were too big and the smallest were too small to see. After playing around with exponents smaller than 1/2, I eventually switched to logarithmic scaling. This gets rid of any simple correspondence between a circle’s area and mass, but fits the data much better. Glancing at the histogram, it appears that the distribution of masses is approximately log-normal.
  • Mouseover effects
    The data set had some additional information that I wasn’t sure how to represent graphically. To keep that information accessible and make the map interactive, I added it to a mouseover tooltip. Each meteor also had a URL pointing to its Meteoritical Society page, which was used to create an onclick event for the circles. The pages had pictures of meteors which I used to create thumbnail previews. This went fairly smoothly until I tried to upload the images to my webserver. I was using the meteor name as the filename, and some of the names had Unicode characters, which resulted in weird errors that were difficult to diagnose.
  • Crossfilters
    To make the map even more interactive, I incorporated the year and mass data into histogram crossfilters. I’ve used the crossfilter library indirectly before, through dc.js which has several chart types premade. Getting the circles on the map to change with transitions went beyond what dc.js can do out of the box and I had to get my hands dirty with the actual crossfilter library. The messiness of the code I ended up with reflects that - I’m getting closer to doing things correctly, but I’m not quite there yet.
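
A rough sketch of the two radius scalings described in the first bullet, not the code behind the map. The function names, the pixel ranges, and the sample masses are all made up for illustration.

```javascript
// Hypothetical helpers comparing two ways of turning a mass into a circle radius.

// Area-proportional: radius grows with the square root of the mass.
function sqrtRadius(mass, maxMass, maxRadius) {
  return maxRadius * Math.sqrt(mass / maxMass);
}

// Logarithmic: radius grows with the order of magnitude of the mass.
function logRadius(mass, minMass, maxMass, minRadius, maxRadius) {
  const t = (Math.log(mass) - Math.log(minMass)) /
            (Math.log(maxMass) - Math.log(minMass));
  return minRadius + t * (maxRadius - minRadius);
}

// Invented masses in grams spanning many orders of magnitude (not contest data).
const masses = [1, 1000, 1000000, 60000000];
const minMass = Math.min(...masses);
const maxMass = Math.max(...masses);

// With a 20px maximum, the square-root version shrinks the smallest
// meteor to a tiny fraction of a pixel.
const sqrtRadii = masses.map(m => sqrtRadius(m, maxMass, 20));

// The log version keeps every circle between 1px and 20px.
const logRadii = masses.map(m => logRadius(m, minMass, maxMass, 1, 20));
```

The trade-off is the one mentioned above: with the log version, circle area no longer corresponds to mass in any simple way.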

Ultimately, I ended up with a presentation pretty close to Javier de la Torre’s:

[Image: Javier de la Torre’s CartoDB map]

I think my version has a number of improvements (obviously, I just finished making it!) – better looking tooltips which activate on mouseover rather than on click, the crossfilters, a dotmap instead of a heatmap (I really don’t like what happens to their contiguous areas on zoom) – but de la Torre apparently made his in only 30 minutes. I don’t regret the additional time spent on mine since most of it was spent learning, but CartoDB or Fusion Tables look like they would be the appropriate tool to use the vast majority of the time. I am a little frustrated I wasn’t able to do more with the flexibility of d3 – the only other published entry for the contest I’ve seen, bolid.es, is stunning – and I’m looking forward to seeing what else gets made.


Film Strips


http://roadtolarissa.com/film-strips/

I’ve had this idea bouncing around in the back of my head for a few months. When I finally got around to working on it, I was pleasantly surprised to have a functional prototype after only a few hours.

The idea is pretty simple: Take a bunch of sequential stills from a movie, create a map of how their average color changes over time, and add a mouseover effect that shows the enlarged still. Thanks to Google and Stack Overflow, it didn’t take too long to find programs and libraries that did most of what I was picturing, and with my experience on other projects, gluing everything together went smoothly.
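
The averaging step on its own looks roughly like this. It is a JavaScript sketch rather than the tools I actually glued together, and it assumes each still is already available as a flat RGBA pixel array (the layout a canvas getImageData().data call returns). The function names and the tiny two-pixel “stills” are invented.

```javascript
// Average the red, green and blue channels of a flat RGBA pixel array.
function averageColor(rgba) {
  let r = 0, g = 0, b = 0;
  const pixelCount = rgba.length / 4;
  for (let i = 0; i < rgba.length; i += 4) {
    r += rgba[i];
    g += rgba[i + 1];
    b += rgba[i + 2]; // rgba[i + 3] is alpha and is ignored here
  }
  return [r, g, b].map(sum => Math.round(sum / pixelCount));
}

// Format an [r, g, b] triple as a CSS color for one bar of the strip.
function toCssColor([r, g, b]) {
  return `rgb(${r}, ${g}, ${b})`;
}

// Two invented stills, each only two pixels big.
const stills = [
  [255, 0, 0, 255,   0, 0, 255, 255],   // one red pixel, one blue pixel
  [10, 20, 30, 255,  30, 40, 50, 255],
];

// One color per still, in order: this is the film strip.
const stripColors = stills.map(still => toCssColor(averageColor(still)));
```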

Because of disk space constraints, I’m only hosting the 2013 Best Picture Nominees. Many more (static) film strips are located at MovieBarcode, which inspired this project. You can also make interactive film strips from your own movie files – the whole process is automated – but if you’re not familiar with Python, getting all the tools set up will take a few hours.

I have a couple of ideas for additional features (subtitle integration, variable bar width, experiments with other ways of generating and showing color, etc.) but since this is sitting in a distinctly gray area of fair use, I’m a little hesitant to put more time into it. I am glad I did what I’ve done. I’ve started working at an actual workplace and I’d like to stay (/get back) in the habit of working on things for fun.


Whale Words

This is a display of word frequency in Moby Dick that I just finished making. It draws heavily on Bibly, which displays word frequency in the Bible.

To get more practice with d3.js, I wrote everything from scratch. Having a visual reference was still extremely helpful, though. Some things I really liked about their implementation/would have run into trouble without:

  • Grouping words into equally sized buckets to create a histogram. This sounds obvious, but I had some trouble getting to it. My first idea for this project came when I saw an infographic of word frequency in The Wire. I wanted to make an interactive version of the same data, showing the number of occurrences of each word per episode. Since each episode contains roughly the same amount of dialog, it would be relatively simple to draw a single bar for each episode. When I decided to work with Moby Dick instead, my first attempt drew one bar per chapter, and with the variable width bars it looked awful. Unable to find a solution, I put the project aside until I came across Bibly.
  • The use of two colors of bars on the x-axis to denote chapter length. I would have used tick marks, but solid bars are a significantly better visual metaphor for chapters, which persist for the entirety of their length and aren’t one time events. Bars are also better mouseover targets.
  • This might be a little silly, but I really like the ‘[Enter] a Word’ formulation. I tried a lot of different instructions for the splash page of redditgraphs, but really struggled to find one that was both concise and clear.
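
The bucketing from the first bullet can be sketched in a few lines. This is an illustration and not the code on the page; the function name, the bucket size, and the sample sentence are placeholders.

```javascript
// Count how often `target` appears in each fixed-size run of words.
// Every bucket covers the same number of words, so every bar has the same width.
function bucketCounts(words, target, bucketSize) {
  const counts = new Array(Math.ceil(words.length / bucketSize)).fill(0);
  words.forEach((word, i) => {
    if (word === target) counts[Math.floor(i / bucketSize)] += 1;
  });
  return counts;
}

// Placeholder text standing in for the full book.
const text = 'call me ishmael the whale the white whale and the sea';
const words = text.toLowerCase().split(/\s+/);

// With 4 words per bucket, 'whale' only shows up in the middle bucket.
const whaleCounts = bucketCounts(words, 'whale', 4);
```

Chapter boundaries then become a separate layer drawn along the x-axis instead of determining the bar widths.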

Some things I added or removed:

  • An autocomplete drop down for the text box. In addition to downloading the text of Moby Dick (about 1 MB), the webpage also downloads an array containing all the non-stop words in Moby Dick and the number of their occurrences (about 0.2 MB). Typing a few letters searches through the array for the most common words that match, then displays them and their frequencies in a drop down menu. This is pretty sweet for a couple of reasons. Displaying the frequency of many words at once increases the information density of the site – the drop down with no text entered yet basically shows as much information as the entire Wire infographic. At the same time, it increases usability (if the user doesn’t realize the need to press enter, the suggestions are click targets) and invites exploration (it is hard to think of interesting words to search for on the fly; the autocomplete shows several of them at a time).
  • I removed the highlight functionality and the recent search display. I’ve been reading through publicly available portions of the cs488b syllabus and can see why Bibly included them, but I wanted to keep the page as simple as possible.
  • Mousing over a bar immediately shows the corresponding quotation at the bottom of the page instead of showing a tooltip with a list of quotations to click on.
  • None of the height, width, word bucket size, or chapter length values depend on magic numbers. This makes it easy to resize the graph to fill the browser window, rescale the y-axis for different words, and potentially use the same code to display other books.
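
The autocomplete lookup from the first bullet amounts to something like the sketch below. It assumes “match” means “starts with the typed letters” and that the downloaded array holds [word, count] pairs; both are guesses for the sake of the example, and the counts are invented.

```javascript
// Return the `limit` most frequent words that start with `prefix`.
// filter() makes a copy, so sorting does not reorder the original array.
function suggest(wordCounts, prefix, limit) {
  return wordCounts
    .filter(([word]) => word.startsWith(prefix))
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}

// Invented counts, not the real Moby Dick numbers.
const wordCounts = [
  ['whale', 1200],
  ['white', 280],
  ['ship', 500],
  ['what', 600],
  ['whaling', 130],
];

const forWh = suggest(wordCounts, 'wh', 3);   // whale, what, white
const forEmpty = suggest(wordCounts, '', 2);  // the two most common words overall
```

An empty prefix matches everything, which is why the drop down with nothing typed yet shows the most common words in the book.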

Some things that I would like to add or improve:

    • Resizing the page after it has loaded should resize the graph too. I didn’t work with MVC in mind though, so it would take some work to implement this.
    • Loading splash screen should have a loading bar or changing text.
    • The text display should have a way to read more of the passage. Maybe clicking could open a new scrollable window containing the entire text with passage highlighted.
    • The numbers on the autocomplete drop down should align to the right. This is surprisingly difficult to do without breaking jquery ui.
    • It’d be super cool to include more books. Most of the process is automated, but some hand editing was required to make the raw text from Project Gutenberg usable for the parseBook.py program I wrote. It would be feasible to do the same for the freely available texts in the St. John’s program. Alternatively (or additionally), if I wrote a parseEPUB.py program (which wouldn’t require any hand editing for each text since chapter titles are written in constant ways) and ran it server side, it might be possible to allow users to upload any book formatted as an EPUB without DRM that they own and display a graph for their book. I’m not sure how useful or used this would be; even if the user has and knows where to find an ebook on their hard drive, it will probably be from Amazon and/or contain DRM.
    • While I was looking through the nltk documentation, I came across this graph. I’m not sure it is necessarily a better display of the same information. Frequency is a little harder to discern and it just looks less interesting, but it takes up less space and opens up the possibility of easily comparing the frequency of different words.

Unemployment Rates

Three years ago, the New York Times published a nifty interactive graphic showing how the recession had impacted the unemployment rate of different demographic groups. Wanting to show a broader range of dates, add features, and get more practice with d3.js, I started working on a remake.

During my job at Goody Goody, I got some practice getting time series from the Bureau of Labor Statistics. Since I was only interested in a few types of data like the CPI and unemployment rate in Texas, I used their clunky Java app to get the data I needed.

I briefly contemplated manually selecting and downloading all the series that I wanted to use, but decided not to because of the amount of time and the number of errors that would involve. Poking around the BLS website more, I found their FTP site, which contained huge text files with unemployment time series. This wasn’t particularly user friendly, but all the data was machine readable and I wrote a small Python program to extract and repackage the time series I wanted to show.

Even at this relatively early stage, I made a couple of decisions to limit the project’s scope. The BLS collects a huge amount of demographic data – marital status, occupation, disabilities, veteran status, reasons for not working more, etc. – and I omitted most of them because only a limited number of groupings have crosstabs. I also picked the semi-arbitrary start year of 2001 to keep the amount of data the page requires down. Many of the series go back to the 1950s and including them would have required either long page load times or a trickier system of only downloading requested time frames.

Implementing the page in d3.js wasn’t too difficult. I’m getting more familiar with the library and the style of coding that works well with it. The biggest issue with the library right now is its inability to do smooth transitions in non-WebKit browsers. The NYT’s version of the graphic was written in Flash and doesn’t lag in Firefox. Beta builds of Firefox with the right experimental settings enabled display the page fine, but I’m not sure how long it will be before those improvements are widely installed.

In terms of new features, I added another set of buttons to allow for easy comparisons between two different demographic groups and a brush on the bottom graph for selecting time. These additions take some of the focus away from the recession, but also historically contextualize the unemployment rate and draw the viewer to explore the data.


Zoomable Sierpinski Triangle with d3.js

http://roadtolarissa.com/triangles/

After finishing up the petition project, I wanted to use what I learned about d3.js to create something a little more fun. After a few hours* of work, I had this:

//triangle centered at (cx, cy) with circumradius r; svg, mobile, sin30, cos30 and randomColor() are defined elsewhere on the page
function addTriangle(cx, cy, r){
  svg.append('polygon')
    .on(mobile ? "click" : "mouseover", function(d){
      addTriangle(  cx,             cy - r/2,       r/2);     
      addTriangle(  cx - r*sin30/2, cy + r*cos30/2, r/2);     
      addTriangle(  cx + r*sin30/2, cy + r*cos30/2, r/2);
      
      d3.select(this).on('mouseover', function(){});
      d3.select(this).on('click', function(){
        addTriangle(cx, cy, r);});
    })
    .attr('fill', 'white')
    .attr('points', (cx)  +','+   (cy)  +' '+ 
                    (cx)  +','+   (cy)  +' '+
                    (cx)  +','+   (cy))
    .transition()
    .duration(600)
    .delay(10)
      .attr('fill', randomColor())
      .attr('points', (cx)  +','+   (cy-r)          +' '+ 
              (cx-r*sin30)  +','+   (cy + r*cos30)  +' '+
              (cx+r*sin30)  +','+   (cy + r*cos30))
}

which is currently live at the above link. There are a lot of things I’d like to add – proper mobile support, more fractal patterns, deeper zooming, and more interesting coloring – but I’ve just been clicking and scrolling around the triangles instead. It’s incredible to me that such a small snippet of code could create something so visually engaging. Even though the internet already has dozens of Sierpinski Triangles, I haven’t found any as interactive and eloquent as this one (which is because of d3.js, not anything I’ve done) so I feel ok posting it.

*The most frustrating part of making this: tricking myself into thinking that I had forgotten everything about geometry. From about 2 AM to 3:30 AM, I was trying to find the vertices of an equilateral triangle given the circumradius and center. This is pretty easy to do with trigonometry (see lines 21-23), but I kept getting lopsided triangles no matter how I calculated the vertices. I even ended up pulling a geometry textbook off the shelf to check that I was doing the math correctly. Eventually (and I can’t remember what led me to this realization or why it didn’t come earlier) it dawned on me that Math.sin(x) thought x was in radians and I was entering degrees.
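
For anyone hitting the same wall, here is a standalone sketch of the vertex calculation with the degree-to-radian conversion made explicit. It is written fresh for this note and does not reuse the constants from the snippet above; toRadians and triangleVertices are names I made up.

```javascript
// Math.sin and Math.cos expect radians, so convert first.
function toRadians(degrees) {
  return degrees * Math.PI / 180;
}

// Vertices of an equilateral triangle centered at (cx, cy) with circumradius r.
// Screen coordinates: y grows downward, so -90 degrees points straight up.
function triangleVertices(cx, cy, r) {
  return [-90, 30, 150].map(degrees => {
    const angle = toRadians(degrees);
    return [cx + r * Math.cos(angle), cy + r * Math.sin(angle)];
  });
}

const vertices = triangleVertices(0, 0, 100);

// The mistake: 30 is treated as 30 radians, which gives a negative number.
const wrong = Math.sin(30);
// The fix: roughly 0.5, as the textbook says.
const right = Math.sin(toRadians(30));
```

All three sides come out to r times the square root of 3, which is a quick way to check that a triangle isn’t lopsided.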


Interactive Visualization of White House Petition Signatures

http://roadtolarissa.com/whitehouse/

I finished writing the scraper and graphing aspects of this a few weeks ago; I’ve been putting off publicly posting while trying to make it more attractive/usable.

I’m mostly worried about the color scheme. I didn’t want to embarrass myself making something garish, so I went in a very drab, utilitarian direction. I think I may have gone too far, but I’m not sure what a good alternative would look like. I got away with not worrying too much about color with redditgraphs because the data was naturally graphable with many different colors which allowed the layout to be understated. That really isn’t possible with the petitions.

I’m also not sure about the layout. On widescreens, the two charts are shown side by side so all the petition info can be viewed on one screen. On smaller displays though, this makes the charts a little too small. I could make the charts bigger with a toggle button or by showing them stacked on top of each other like they are with non-widescreens, but that would require clicking a button or scrolling to view all the information about a petition. Additionally, if I start tracking more petitions, I will need to make the left panel scrollable.

Finally, I’ve derived estimates of party affiliation and age for each petition and I don’t know how to include them. Since these estimates are based only on name and location, they aren’t nearly as accurate as the gender ones – 35% of the people wanting to impeach Obama are apparently Democrats – so I’ve left them out to avoid cluttering the table more. They do a decent job of showing the relative rank of each petition though and should probably be included.


Next Project

Finished with redditgraphs, I have a couple of ideas about what I’d like to work on next; I’m posting them to clarify my own thoughts and to get feedback.

Games

  •  Hangout Boardgames
    I hadn’t touched this project in about a month until two days ago, when I started working on it again. I added backgammon, fixed some bugs, and made it publicly accessible on Google+ (I think). Before sharing it with more people, the UI common to each game needs some more work. I think the concept is better than all the other hangout apps I’ve seen – instead of having the game board take up the entire screen, the game takes place entirely on a small (on most screens) 400×400 tile and the video feed from your opponent fills the rest of the space. The UI to switch teams and change the game is very spartan. I’d like to make it better, but I’m having trouble finding a solution that doesn’t clash with the individual games’ differing art styles:

    Rejoining a hangout also causes a mysterious crash about 20% of the time, which needs to be fixed before release.
  •  Lyric Typing Game
    Over the summer, I made a rudimentary typing game for Spotify. Pulling lyrics from tunewiki, it displays the words as they are being sung and the user tries to type them before they finish playing. It’d be really cool to make this app available on Spotify; unfortunately, licensing lyrics is super expensive (minimum $20,000). I might try to contact providers of tunewiki to see if they want to include the game in their own Spotify application, or at the very least make a post about the app and record a video of it working.
  • Neocolonialism
    Over the summer, I exchanged several lengthy emails with Seth Alter, the creator of Neocolonialism. Playing the game with my friends was great, but we ran into several nasty bugs. I sent him an email with game play suggestions and an offer to help debug the game. After corresponding for a few days, he took me up on the offer but I got nervous and didn’t respond for two weeks. He didn’t reply (understandably) and I haven’t talked with him since, but I’d still really like to help out. I don’t think it would be super hard for me to make a demo version of the game for Google hangouts. Playing on hangouts would automatically integrate voice and video while avoiding many of the hangups that make it difficult to get the game up and running – currently, everyone has to separately install the game, type in the correct IP address, and fix router issues. Before contacting Seth again, I should finish up the board game app in order to have a solid proof of concept. I’m not sure I’ll get a response – we’re still on each other’s gtalk friends list, but I’ve been removed from the ‘Thnx’ section of the rulebook – still, trying probably won’t hurt anything.

Visualizations

  • Whitehouse Petitions
    In response to Obama’s reelection, secessionists posted a petition requesting “Peacefully grant the State of Texas to withdraw from the United States of America and create its own NEW government [sic].” By scraping the page for the locations of all the signers, this graph was generated: While there are a number of problems with the graph – the locations should be aggregated by county instead of city and it needs to show density instead of logged total numbers – the idea is very good. I would like to expand both the types of graphs shown – the rate of signing, the gender of signers, the predicted partisanship of the signers – and the number of petitions. There are dozens of active petitions and if everything was automated, it would not be hard to rescrape and update the graphs once a day.
  • More redditgraphs
    I’d like to generate more user specific statistics using comment history text pulled from the reddit API. I’ve read several papers which used text corpora from Blogger, Twitter, and YouTube to predict user age and gender. Doing the same with reddit comments is feasible, but the lack of a training set (reddit has no user profile page) and structural limitations with the reddit API make the task much more difficult than it should be. Since I’m taking a break from reddit, I’m putting this idea on hold for a while.
  • Pitchfork Statistics 
    Scrape the text, date, score, author and band information from every published review on pitchfork to see the score distribution, what words correlate with better scores, and how word usage has changed during the site’s history.

Redditgraphs Retrospective

It’s been nearly a month since my last post, about a comment visualizer I created for reddit. Since then, I’ve mostly been polishing the application and trying to share it with people. After posting the basic demo on /r/javascript, I was encouraged to make improvements and host the project on its own domain. Registering “redditgraphs.com” for a year only cost $5 and it seemed more memorable and easier to access than “roadtolarissa.com/javascript/reddit-comment-visualizer”. I spent another week adding functionality – hourly trends, weekly trends, direct linking to user names – and making the UI prettier.

Improving the UI was harder than adding functionality. I don’t think I’ve ever created anything particularly visually beautiful before in my life. The previous UI was definitely not pretty but since I was using the default radio button and drop down menus, it didn’t feel like I was exposing part of myself or actively making something bad or mockable. Still, several changes were necessary to improve usability. Most importantly, the previous graph picker was too small and did not attract enough attention:

One of the biggest difficulties in creating this type of interactive display is leading the user to interesting interactions. After posting this version, I received several requests to add histograms and karma plotting – features already included, but buried in ambiguously titled drop down menus. I had thought putting the options in the upper left hand corner would make them impossible to miss. Since that wasn’t the case, I used large, tile-like icons to draw even more attention to the different graph types in the redesign:

By making the icons the largest, most colorful non-graph item on the page, it is much clearer that they are supposed to be clicked on and played with. Additionally, conveying the options in a less textual manner allowed me to move away from the confusing “Graph Settings: Graph Type: Type” formulation. I think there is still a lot of room for improvement, especially on the data icons, but I’m not sure how to do it without learning a lot more or paying someone. The pile of books and the ruler look like clip art because that’s what they are.

The rest of the redesign followed a similar pattern. All the radios and spinboxes were replaced by more clickable and colorful buttons and sliders. I added a banner and logo at the top, which took much longer than I expected to get just right.

This was the result:

 

Satisfied with both the look and functionality, I posted a link and a description to /r/dataisbeautiful. The surprising amount of feedback was almost entirely positive and several thousand people visited the site. This was very exciting – I don’t think anything else I’ve done has ever been used by so many people in so short a time. I also tweeted links to several tech journalists and one ended up writing a short article.

Afterwards, I spent a few days making some of the suggested improvements – viewing submissions as well as comments, adding the option to log scale some of the axes, and direct linking to different graph types. Around this time, working with the code started to become much more unpleasant. The main culprit was (and is) the function that takes raw data downloaded from reddit and transforms it into a graph. Since I hadn’t anticipated having so many options, features were added one at a time until the thing started to resemble a huge knot.

Encouraged by the suggestions to share the site with more people, I posted links on some of the default subreddits. Unfortunately, the reception wasn’t nearly as positive. Some of the time, my posts were basically ignored (/r/WTF, /r/YouShouldKnow) without many people seeing them. More often, they would start doing well and then get removed by a moderator for being off topic (/r/askreddit, /r/technology, /r/LifeProTips, /r/offbeat).

When a post or a comment did do moderately well and directed thousands more people to the site, it wasn’t nearly as exciting as it was the first time – it had already happened before, and since it seemed almost achievable to get 100,000+ or 1,000,000 people to the site, I was always disappointed when that didn’t happen. Additionally, while the feedback was still positive, it became more and more inane. After getting over a dozen comments about the lack of axis labels (including three people who sent the same xkcd), I added them against my better judgement (they take up valuable space and don’t add any valuable information – the graph title and selected image very clearly show what is being graphed; is it really necessary to clarify that a series of dates along the bottom of a graph refers to ‘time’?). I tried using more playful, Cracked-like titles: “Your Reddit Comments Show When You Sleep” and the majority of the comments made the same joke about how the comments really showed when they were at work.

I also tried sending the site to more journalists and bloggers. Reddit adds a nofollow attribute to all links submitted, so getting more organic traffic from people searching for ‘reddit comment history’ required posts on separate sites. Aside from that first article, I didn’t have any success. After reading Trust Me, I’m Lying, I thought it would be, if not easy, at least doable to get story starved bloggers to write a post about redditgraphs. Based on the limited experience I had, I’d recommend anyone trying the same to find an actual person’s email address rather than using a contact form or tips@website.com email. Some of the sites I tried submitting to seemed kind of … sleazy:

During most of this time, I wasn’t really doing anything productive. It was super easy to write up a post or an email and then sit around writing responses and browsing reddit for a few too many hours. To avoid this problem without giving up hope on getting more users, I spent a few days learning Python and wrote a bot to scan recent comments for mentions of ‘your comment history’ or ‘my comment history’ and reply with a link to the mentioned person’s redditgraph. Initially, I manually copied/pasted the comments and only posted a few per day. Seeing that 75% of the replies to the ‘bot’ were positive, I decided to totally automate its posting and expand the number of terms it searched for. The maker of statit contacted me on gtalk, telling me that I should throttle or turn off the bot or the admins would ban it. I knew he was probably right, but I couldn’t bring myself to disable the continuous stream of comments I was creating. The thrill I got from redditgraphs came primarily from the idea that thousands of people around the world were using their computers to simultaneously automate a process that I had shared with them; with the commenting bot the sharing itself was automated.

The next morning, the bot was banned.

Getting responses and posting wasn’t fun anymore, so I’ve decided to stop posting on reddit. While I did learn more about reddit and communicated with admins, moderators, and bloggers more than I had before, I don’t think it is the most useful topic to learn about and I didn’t need to spend weeks on it. I definitely don’t need to spend more time making the green lines on this graph go up higher:

One more generally applicable thing I found out: feedback isn’t helpful in terms of adding features. Before the redesign no one was asking for large icons instead of drop down menus, but from all the comments asking about different graph types I inferred that they needed to be added. One of the more useful features I came up with allowed for searching for specific phrases in the comment history. This didn’t get added to the live version mainly because I wasn’t sure how to add it to the UI – should there be a user name text box and a search text box right on top of each other? Requiring people to type in a regex didn’t seem feasible, but where would the rules of a simpler syntax be communicated? I assumed that someone would ask for searching and I would have a reason to add it, but no one did and I never got around to it.

A small proportion of the comments were useful, maybe 10 or 20 out of the ~250 that I received. That number is even more minuscule compared to the total number of people who have viewed the site:

It makes sense that most people wouldn’t send feedback – I look at lots of things people have made on the internet and I write something back to the creator far less than 1% of the time. This highlights the importance of having a clear idea of what you’d like to make and generating and refining your own work towards that idea. Getting feedback in the form of a conversation with someone you actually know before sharing widely is also extremely valuable. Unlike your actual friends, internet commenters don’t have any incentive to engage deeply with your idea and most don’t end up doing so.

I’ve been working on a few smaller projects since. Nothing is super finished, but I should have a post about the works in progress up later today.


Reddit Comment Visualizer

I’ve spent the last few days working on a visualizer for reddit comments.  Using reddit’s API, the program downloads a user’s comments and graphs them with flot.
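
The download step can be sketched roughly as follows. This is a minimal Python illustration rather than the app's actual JavaScript: it assumes the public listing endpoint at `/user/<name>/comments.json` and the usual listing shape (`data.children[].data` plus an `after` cursor). The helper names and `example_user` are made up for the example.

```python
from urllib.parse import urlencode


def comments_url(username, after=None, limit=100):
    """Build the URL for one page of a user's comments (assumed endpoint)."""
    params = {"limit": limit}
    if after:
        params["after"] = after  # cursor returned by the previous page
    return f"https://www.reddit.com/user/{username}/comments.json?{urlencode(params)}"


def parse_listing(payload):
    """Flatten one listing page into records, plus the cursor for the next page."""
    listing = payload["data"]
    comments = [
        {
            "body": child["data"]["body"],
            "subreddit": child["data"]["subreddit"],
            "created_utc": child["data"]["created_utc"],
            "score": child["data"].get("score", 0),
        }
        for child in listing["children"]
    ]
    return comments, listing.get("after")


# A hand-written page in the assumed shape, standing in for a real response.
page = {
    "data": {
        "after": "t1_abc",
        "children": [
            {"kind": "t1", "data": {"body": "hi", "subreddit": "gaming",
                                    "created_utc": 1300000000.0, "score": 3}},
        ],
    }
}
comments, cursor = parse_listing(page)
next_url = comments_url("example_user", after=cursor)
```

Fetching would repeat until the cursor comes back empty, feeding each page's `after` value into the next request.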

The most obvious way to graph a set of data points is with a scatter plot. Since reddit’s user page only displays 20 comments at a time, it is very difficult to get a sense of how time has been spent on the site. On this scatter plot, every one of my comments is represented by a small circle whose position along the y-axis represents its length in (non-quotation) characters. Mousing over a circle displays the comment it represents on the right panel. I can see that I’ve spent most of my time talking about League of Legends, that I commented a lot last July, and that I’m posting less frequently now.
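
One plausible way to compute the y value is sketched below in Python. It is an assumption about how "non-quotation characters" could be counted, not the app's actual code: it treats any line starting with `>` (Markdown quote syntax) as quoted text, and it assumes the comment body is plain, unescaped Markdown. Timestamps are converted to milliseconds on the assumption that the plotting library wants JavaScript-style times.

```python
def non_quote_length(body):
    """Count characters in a comment, skipping Markdown quote lines ('>' prefix)."""
    kept = [line for line in body.splitlines() if not line.lstrip().startswith(">")]
    return sum(len(line.strip()) for line in kept)


def scatter_points(comments):
    """Turn comments into (time in ms, length) pairs, one per circle."""
    return [(c["created_utc"] * 1000, non_quote_length(c["body"])) for c in comments]


example = [{"created_utc": 10, "body": "> quoted text\nmy reply"}]
points = scatter_points(example)
```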

Unfortunately, while it is easy to see that most of my comments are not very long, it isn’t very clear exactly how many shorter comments there are since the points cluster together closely at the bottom of the graph. Adding a heat map or fisheye zoom to the scatter plot could fix this problem, but neither is implemented in flot. Instead, I use a totally different graph type to display the data:

flot also does not include histograms, but sorting and grouping the comments then displaying them with the stacking plugin is simpler than creating a fisheye zoom effect within flot. By removing the time component of the graph, the distribution of comment length becomes much easier to see – the vast majority of comments I’ve made are quite short, even more so than the scatter plot shows.
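
The sorting and grouping step might look something like this Python sketch. The bin width, field names, and output shape are all assumptions for illustration. Each subreddit becomes its own series, and every series gets a value for every bin so that the bars line up when stacked.

```python
from collections import defaultdict


def stacked_histogram(comments, bin_width=100):
    """Return {subreddit: [[bin_start, count], ...]} with identical bins per series."""
    counts = defaultdict(lambda: defaultdict(int))
    for c in comments:
        bin_start = (c["length"] // bin_width) * bin_width
        counts[c["subreddit"]][bin_start] += 1
    # Use the union of all occupied bins so every series covers the same x values.
    all_bins = sorted({b for per_sub in counts.values() for b in per_sub})
    return {
        sub: [[b, per_sub.get(b, 0)] for b in all_bins]
        for sub, per_sub in counts.items()
    }


sample = [
    {"subreddit": "a", "length": 30},
    {"subreddit": "a", "length": 250},
    {"subreddit": "b", "length": 40},
]
series = stacked_histogram(sample)
```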

Still, while the distribution of comment length is apparent in the histogram, the distribution of comment length in each subreddit is difficult to discern. Like with the scatter plot, it is clear that the most common element – this time the League of Legends subreddit instead of short posts – occurs quite often and others less frequently, but the actual ratio between them is not clear.

Pie charts generally don’t get a lot of love, but I use one to display total karma (and total comment length, and number of comments) by subreddit because it does a decent job of displaying the ratios and is a simpler, more immediately understandable graph than the others. The second point is particularly important with the pie chart because mousing over a wedge shows a quantitative breakdown of the comments on a particular subreddit:

The mouse over detail for all the other, more complicated charts is mostly composed of the text of the comment, which is easier to process than sums and averages. The simpler display of the pie graph allows for comparatively more complicated details.
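
The sums and averages behind a wedge could be computed along these lines. This is a Python sketch with assumed field names (`score`, `length`), not the app's own code; the totals size the wedges and the averages fill in the mouse over detail.

```python
def subreddit_breakdown(comments):
    """Total and average karma and length per subreddit."""
    totals = {}
    for c in comments:
        t = totals.setdefault(
            c["subreddit"], {"comments": 0, "karma": 0, "characters": 0}
        )
        t["comments"] += 1
        t["karma"] += c["score"]
        t["characters"] += c["length"]
    for t in totals.values():
        t["avg_karma"] = t["karma"] / t["comments"]
        t["avg_characters"] = t["characters"] / t["comments"]
    return totals


sample = [
    {"subreddit": "gaming", "score": 4, "length": 100},
    {"subreddit": "gaming", "score": 2, "length": 50},
    {"subreddit": "programming", "score": 1, "length": 10},
]
breakdown = subreddit_breakdown(sample)
```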

While removing time from the graph allows for a closer examination of different properties of commenting patterns, it also (obviously) masks how those patterns change over time. One of my initial motivations in undertaking this project was to see if my own habits on Reddit have changed over time, particularly how my commenting on the League of Legends subreddit had changed after I had stopped playing. For that, a sort of pie chart on a timeline graph was needed, like the histograph from Civilization 3:

When I tried to find a histograph plugin for flot, though, I ran into a problem:

Apparently, by ‘histograph’ Firaxis meant ‘history graph’, not ‘histograph, a term of art describing a type of graph’ like I have been assuming they meant for the last 11 years. I’ve tried finding the actual name for this type of graph, but so far haven’t been able to unearth one. Undeterred, I implemented a histograph in flot:

This graph shows, in a way none of the others would be able to clearly, how rrenaud initially spent most of his time commenting on r/programming but then transitioned to r/gaming, r/MachineLearning, and r/dominion. (Since reddit shows everyone’s comments to everyone else, this app can be used to view other users’ comment histories. Doing so seems a little stalkerish. At the same time, the scatter plot view provides the best, quickest overview of a reddit profile that I’ve ever seen. By mousing over outliers and fiddling with the plot settings, it is possible to see, in seconds, someone else’s most and least popular comments, the posts that they’ve spent the most time writing, and what they wrote when they first started using reddit. If rrenaud ever comes across this post, hopefully he isn’t too weirded out – the graphs on his excellent Race to the Galaxy stats page inspired some of my work here and his site preserves my mediocre rating at the game (In my defense, I lost 10 of my last 11 games; there is probably some sort of connection between the win/loss record and my decision to stop playing))

While the graph is valuable, it needs more work. By showing the accumulated number of comments at each time interval instead of the rate of commenting, the start of the graph is extremely sensitive to small changes in commenting patterns while the end doesn’t move nearly enough. rrenaud essentially doesn’t comment in r/programming anymore, but since the graph shows a stock instead of a flow, the graph doesn’t clearly show that information. I’d like a graph of the commenting rate instead, but I’m having trouble creating one which both displays the flow of comments in different subreddits and easily communicates its meaning. Even the current form of the histograph struggles on the communication front – the title of the graph and scale of the y axis need to be changed to better convey the concept of proportions changing over time.
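
The stock versus flow difference is easy to see with made-up numbers. The Python sketch below is an illustration, not the app's implementation: it buckets comments into periods (the period indices and counts are invented) and computes each subreddit's share two ways, from everything posted so far (stock) and from that period alone (flow).

```python
from collections import Counter


def shares_by_period(comments, n_periods):
    """comments: list of (period_index, subreddit) pairs.

    Returns (stock, flow): per-subreddit lists of shares, one entry per period.
    """
    subs = sorted({s for _, s in comments})
    per_period = [Counter() for _ in range(n_periods)]
    for period, sub in comments:
        per_period[period][sub] += 1

    stock = {s: [] for s in subs}
    flow = {s: [] for s in subs}
    running = Counter()
    for counts in per_period:
        running.update(counts)  # accumulate everything seen so far
        period_total = sum(counts.values())
        running_total = sum(running.values())
        for s in subs:
            stock[s].append(running[s] / running_total if running_total else 0.0)
            flow[s].append(counts[s] / period_total if period_total else 0.0)
    return stock, flow


# Invented history: heavy r/programming use early on, then a switch to r/gaming.
history = (
    [(0, "programming")] * 8
    + [(1, "programming")] * 2
    + [(1, "gaming")] * 2
    + [(2, "gaming")] * 4
)
stock, flow = shares_by_period(history, 3)
```

In the last period the flow share of r/programming has dropped to zero, while the stock share is still well over half. That lag is exactly what the cumulative graph hides.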

I’ve come a little closer to these goals by graphing something similar:

This graph needs less explanation because “Comments per Day” is an easier idea to communicate than what the histograph is trying to show even though both are reconstructing a flow from discrete points.

I’d still like to improve this graph. In particular, the smoothing algorithm needs more work. Ideally, the leftmost part of the smoother would not have a large gap. I’ve tried different kernel smoothers and radii but haven’t found anything that is responsive enough to small changes in the data without leaving gaps.
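
For reference, a basic kernel smoother for a comments-per-day curve can be sketched as below. This is a generic Gaussian kernel estimate in Python, not the smoother used on the site, and the bandwidth value is arbitrary. Each comment contributes a bump with unit area, so the area under the curve equals the number of comments.

```python
import math


def comments_per_day(timestamps_days, grid, bandwidth=7.0):
    """Gaussian kernel estimate of the commenting rate at each grid point.

    timestamps_days: comment times measured in days.
    grid: times (in days) at which to evaluate the smoothed rate.
    bandwidth: kernel radius in days; larger values give a smoother curve.
    """
    norm = 1.0 / (bandwidth * math.sqrt(2 * math.pi))
    return [
        sum(norm * math.exp(-0.5 * ((t - ti) / bandwidth) ** 2)
            for ti in timestamps_days)
        for t in grid
    ]


times = [0.0, 3.0, 3.5]
grid = [i * 0.1 for i in range(-500, 501)]
rate = comments_per_day(times, grid, bandwidth=2.0)
```

A small bandwidth reacts quickly to bursts but drops toward zero between comments, while a large one fills those stretches at the cost of flattening real changes. That is the same trade-off described above.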

Other things to improve:

  1. Splitting the graphing code into separate functions based on graph type. I was planning on doing this originally, but as I added more graph types it became less clear which parts of the program were unique to a graph type and which needed to be shared. The code was fairly concise and well organized when the page only showed a scatter plot; now it is a huge mess.
  2. I’m really not sure how to manage errors that occur when connecting to Reddit’s servers. The console displays a 404 message when the user name doesn’t exist and another message if no connection is made, but I have no idea how to make those events trigger things in my program. Currently, if a valid response isn’t received within 4 seconds an error message is displayed. This isn’t that great of a solution since sometimes Reddit’s servers take more than 4 seconds to reply with a valid response and sometimes they respond with a 404 message in less than 4 seconds. jQuery’s .ajax() has settings to fix this problem, but those callbacks don’t work with JSONP (something else I don’t understand – how does using it prevent cross scripting?).
  3. The page only uses CSS to position and hide elements. I think everything lines up ok in Chrome, but in other browsers it looks pretty bad. Defining the appearance of the buttons and forms so different browsers use the same defined appearance instead of their different defaults would help. I’m hoping I can find a way of making pages look ‘nice’ without having to spend too much time on it; looking into different frameworks is probably a good idea.
  4. Watching the graph change as older comments load (Reddit will send a maximum of 100 comments at a time) or slowly increasing the minimum comment length by holding down the button results in a nifty animation. Drawing the graph takes only a few milliseconds, so drawing the graph over and over again with slightly different parameters creates the appearance of motion. All of the graphs could be improved with optional animation. For example, an animated pie chart could show the same information as the histograph while being easier to understand and more engaging to look at. UI issues are preventing me from adding animation to the live version - the page is already getting close to being cluttered and there are a huge number of ways to add motion to a graph. I need to figure out which animated graphs are best along with a concise way of displaying the option to view them.
  5. Allowing users to log in with their Reddit information could also improve usability. The Reddit API only exposes friends and voting patterns of the currently logged in user. Displaying a drop down list of those friends would remove the need to copy/paste their strangely spelled names. More interestingly, the same analysis that is performed on the comments could be performed on upvoted content. I think I vote more often and in more places than I comment; that information might provide a better sense of time spent on the site. I suspect that the number of votes that Reddit exposes might limit the value of this information. Only the 1000 most recent comments and submissions can be viewed and the same rule probably applies to votes. Since everyone automatically upvotes every comment they make, many of the votes would have to be thrown out. Additionally, votes on comments and submissions should probably be treated separately for everything other than subreddit analysis, which would further reduce the number of available data points.
  6. There are also comment analytics which are not directly exposed by Reddit. It would be neat to see who replies to you the most and vice versa. Keeping track of different commenters on Reddit isn’t easy – I’ve had several conversations with people I know in person on Reddit and not realized it until they brought it up in real life. Reddit Enhancement Suite helps with this, but I’m still curious to see how often I actually communicate back and forth with different people. Unfortunately, loading information about each comment’s parent and children might take too long to run in a webpage. Assuming 1000 comments each with an average of 1.5 combined parents and children, 1500 requests would take 3000 rate limited seconds to process – nearly an hour.
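
The estimate in the last item works out as follows. This is a quick Python check, assuming one request every 2 seconds, which is the rate implied by 1500 requests taking 3000 seconds.

```python
def crawl_time_seconds(n_comments, related_per_comment, seconds_per_request):
    """Time needed to fetch every comment's parents and children, one request each."""
    n_requests = n_comments * related_per_comment
    return n_requests * seconds_per_request


seconds = crawl_time_seconds(1000, 1.5, 2)  # 1500 requests at 2 s each
minutes = seconds / 60
```

That comes to 3000 seconds, or 50 minutes.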

Even considering the above reservations, I’m pretty happy about the current state of this app so I’m going to post it to reddit to get some feedback. I’ve also been pleasantly surprised about how long this took me to make – 40 hours of not especially difficult work over a week is still a longish amount of time but it’s significantly less than I would have spent a few months ago.
