Javascript and MapReduce

Don't we all love feeling like big data scientists, data miners or <insert data buzzword here>?
Well here's a way to get that feeling in your everyday life:

Recently I discovered the map, reduce, filter functions in Javascript. They are absolutely awesome! To demonstrate their abilities we'll be building a wordcount program, using MapReduce, in Javascript, all chained.

Digging Data

As a datasource we'll use HackerNews' big RSS feed and since we can treat RSS as plain HTML, it's very simple to get started:

var titles = document.getElementsByTagName('title');

This is simply getting all the titles from the RSS feed. But we have a problem, getElementsByTagName returns a NodeList and not an Array, which hasn't got the cool MapReduce functionallity we want.

So we have to convert it to an array, and whilst doing so, we can also grab all the titles as strings, instead of HTML Nodes

var titles = Array.prototype.slice.call(document.getElementsByTagName('title'))
  .map(function(node) {
    return node.innerText;
  });

Simple isn't it? And now we have a list of all the titles on HackerNews (HN).

Introducing .map(callback [, thisArg])

map is a nice small function that takes 2 parameters, but usually only the first one is specified.

As you can tell from the above code, the first argument is a callback function that is to be called over every element in the array, the second one is just to specify a value for this during the function execution. More important are the parameters passed to the callback function. They are: the array element itself, its index and the whole array/context. So the callback function's signature would look something like this:

var callback = function(element, index, context) { /* omitted */ }

Unwanted punctuation & Individual words

As you may have noticed, the HN titles don't just contain letters, but they also carry numbers and punctuation. These are unwanted in our final wordcount, so we remove them. This can be done in the same map call. Isn't Javascript a great language:

var words = Array.prototype.slice.call(document.getElementsByTagName('title'))
  .map(function(node) {
    return node.innerText.toLowerCase().match(/([a-z]+)/g);
  });

If you're coding along, you'll have noticed this had a side-effect: It automatically gave us all the individual words, thanks to the RegEx, which returns all matches in an array.

Even with this helpful side-effect, we are presented with a further problem. The words array is 2d, which means we need to flatten it. Good job we've got reduce:

var words = Array.prototype.slice.call(document.getElementsByTagName('title'))
  .map(function(node) {
    return node.innerText.toLowerCase().match(/([a-z]+)/g);
  })
  .reduce(function(last, now) {
    return last.concat(now);
  }, []);

Introducing .reduce(callback [, initialValue])

As map, reduce takes 2 arguments. The first one is a callback function again, which is to be called on every element in the array. The initialValue parameter will be easier to understand once you see the callback's signature:

var callback = function(previousValue, currentValue, index, context) { /* omitted */ }

As you can see, the first parameter passed to the callback is the previous value. If you wanted to sum up an array of numbers, this would be not problem. But in our case we want an array returned, so we specify an inital value for the previousValue. If we didn't do this we would end up with a ginormous string.

Counting words

For this task, we can again use reduce, but before we start let me show you what the result will look like:

[['the', 'on', 'news', 'hacker', ...], [50, 66, 20, 19, ...]]

It'll be a 2d array again. The index of a word will correspond to the number of its occurrences. As an example the word news has the index 2, in the first array, so its score is at index 2 in the second array, in this case 20.

To code this, we'll use reduce again and as an inital value we'll specify: [[], []]

var scores = Array.prototype.slice.call(document.getElementsByTagName('title'))
  .map(function(node) {
    return node.innerText.toLowerCase().match(/([a-z]+)/g);
  })
  .reduce(function(last, now) {
    now.forEach(function(word) {
      last.push(word);
    });

    return last;
  }, [])
  .reduce(function(last, now) {
    var index = last[0].indexOf(now);

    if (index === -1) {
      last[0].push(now);
      last[1].push(1);
    } else {
      last[1][index] += 1;
    }

    return last;
  }, [[], []]);

Zipping up the arrays

We're nearly done with collecting the data. All we need to do now is to combine the 2 arrays into one. The format this final array will be in is as follows:

[['the', 50], ['on', 66], ['news', 20], ['hacker', 19], ...]

For this we'll use reduce again and utilize the 4th parameter passed to the callback:

var scores = Array.prototype.slice.call(document.getElementsByTagName('title'))
  .map(function(node) {
    return node.innerText.toLowerCase().match(/([a-z]+)/g);
  })
  .reduce(function(last, now) {
    now.forEach(function(word) {
      last.push(word);
    });

    return last;
  }, [])
  .reduce(function(last, now) {
    var index = last[0].indexOf(now);

    if (index === -1) {
      last[0].push(now);
      last[1].push(1);
    } else {
      last[1][index] += 1;
    }

    return last;
  }, [[], []])
  .reduce(function(last, now, index, context) {
    var zip = [];
    last.forEach(function(word, i) {
      zip.push([word, context[1][i]])
    });
    return zip;
  });

This works, because the callback is only called once, so we can zip up the words with their scores.

Now we have all our data collected. We could stop here, but humans love visualisations!

Visualisation

We're going to use D3.js to visualize our data. And to be specific we'll create a force-directed chart.

I won't explain how the visualisation works, because that isn't the subject of this post, but to find out you can always view source!