Posted on 25 Apr 2012 by Miles

The world is rapidly becoming more and more connected, with people communicating over multiple streams - social media, newswire, Wikipedia etc. - on a bewildering range of topics and at a furious rate. Twitter alone receives more than 250 million new posts every day. This massive interconnection means that content can appear and quickly spread within and across different streams. For example, during the recent London riots, many tweets reported rioting events in real time as they happened. However, not all posted content is of good quality or factually correct, which complicates the job of monitoring such streams for any purpose. Systems that can rapidly identify such posts, their origin and their rate of propagation are of paramount importance for security monitoring.

The effective management and efficient processing of multiple streams of real-time data poses new technological and scientific challenges:

  • Challenge 1: Identify interesting new stories without drowning in a sea of false positives, while reducing the effects of bias and rumour.

  • Challenge 2: Minimise system latency, so that new stories are detected in real time.

We tackle the first challenge from the novel perspective of processing multiple streams, exploiting the fact that repeated reports of a story across several streams can cancel out stream-specific bias and errors. For example, if a story is true, it is more likely to manifest both on Twitter and as increased views of a Wikipedia article. Alternatively, a story might appear on Twitter and also in a governmental cable. The more often a story occurs within and across streams, the more likely it is to be interesting.
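
To make the cross-stream idea concrete, here is a minimal sketch in Python. It is not the CROSS implementation: the function names and the scoring rule are illustrative assumptions, and it assumes each event has already been assigned a story identifier, which is itself a hard problem. Stories are ranked first by how many distinct streams mention them, then by how often they are mentioned overall.

```python
from collections import defaultdict


def corroboration_scores(events):
    """Score stories from an iterable of (stream, story_id) pairs.

    Hypothetical helper: returns {story_id: (distinct_streams, total_mentions)}.
    """
    streams_seen = defaultdict(set)
    mentions = defaultdict(int)
    for stream, story_id in events:
        streams_seen[story_id].add(stream)
        mentions[story_id] += 1
    return {
        story_id: (len(streams_seen[story_id]), mentions[story_id])
        for story_id in mentions
    }


def rank_stories(events):
    """Order stories by cross-stream support, then by mention count."""
    scores = corroboration_scores(events)
    return sorted(scores, key=lambda story_id: scores[story_id], reverse=True)


if __name__ == "__main__":
    sample = [
        ("twitter", "riot"),
        ("twitter", "riot"),
        ("wikipedia", "riot"),
        ("twitter", "rumour"),
        ("twitter", "rumour"),
        ("twitter", "rumour"),
    ]
    print(rank_stories(sample))
```

In the sample, both stories are mentioned three times, but "riot" is seen in two streams and "rumour" in only one, so "riot" ranks first. A real system would presumably weight streams differently and decay old mentions over time; this sketch does neither.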

The second challenge arises from the volume of data being processed. Indeed, across the streams under consideration, CROSS will have to deal with over 50 million events per day. To achieve low-latency story detection, CROSS will use a distributed real-time data processing architecture, similar to MapReduce but better suited to real-time operation.