One of the most frequent questions I get from readers of this blog is: can you recommend a graph visualization package? In general, I've always liked to roll my own - many of the layouts available in packages out there are much alike, so if you are interested in novel visualizations you would probably have to do this anyway.
I've not had time to really look at all the possibilities out there and settle on one that I can recommend. That being said, my most common answer is to take a look at GUESS, a system created by Eytan Adar at the University of Washington (and a co-chair of ICWSM 2008).
If you do feel inclined to dig a little deeper in your evaluations, here are some things I would recommend looking out for:
Does the package allow you to provide a simple per-line file of data points describing the graph? This is usually the quickest way to evaluate anything.
Does the package allow you to abstract the data layer and, for example, implement a database-backed graph? (A rough sketch of what I mean by this, and by a couple of the points below, follows the list.)
Does the package provide multiple layout options?
Does the package provide a clean API for adding your own layout managers?
How quickly does it lay out the data?
Does it produce the same layout when the same layout manager is applied repeatedly to the same data? Many layouts use some amount of randomization, but repeatability is key to good data visualization, so if a layout produces different results on each run over the same data, it is a dud in my book.
How much data can it handle? Of course, this will be determined by your application, but intuition suggests that if it can handle large volumes of data it will be better engineered.
Does the package allow for graph computations (e.g. distance between two nodes)?
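To make a few of these points concrete, here is a rough sketch in Python of the kind of surface area I look for: a pluggable data layer sitting behind a simple per-line edge file, a layout manager that takes an explicit seed so randomized layouts are repeatable, and a basic graph computation (shortest-path distance). All of the class and function names are my own invention - this is the shape of an API I'd want, not any particular package's.

```python
import random
from collections import deque

# A minimal data-layer abstraction: any source of nodes and edges
# (a flat file, a database, a web API) can sit behind this interface.
class GraphSource:
    def nodes(self):
        raise NotImplementedError
    def edges(self):  # yields (source, target) pairs
        raise NotImplementedError

class EdgeListFile(GraphSource):
    """Reads a simple per-line format, e.g. 'alice bob' on each line."""
    def __init__(self, path):
        self.path = path
    def edges(self):
        with open(self.path) as f:
            for line in f:
                a, b = line.split()
                yield a, b
    def nodes(self):
        seen = set()
        for a, b in self.edges():
            seen.update((a, b))
        return seen

# A layout manager interface: given a graph source, return {node: (x, y)}.
# Taking an explicit seed makes a randomized layout repeatable.
class RandomLayout:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
    def apply(self, graph):
        return {n: (self.rng.random(), self.rng.random()) for n in graph.nodes()}

# A basic graph computation: unweighted shortest-path distance via BFS.
def distance(graph, start, goal):
    adjacency = {}
    for a, b in graph.edges():
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # unreachable
```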
Of course, you won't be able to see Beyonce falling down stairs, as YouTube has removed it due to a "copyright claim." It's interesting to see how media companies use copyright infringement as a means to censor undesirable representations of their artists. Of course, when those infringements help promote the content, there's no trouble at all - pirate away. It is far cheaper for YouTube to run a reactive policy (comply whenever a content owner requests a takedown) than to police its hosted content, but this leaves it open to manipulation by copyright owners.
We've just returned from Vancouver, where AAAI 2007 was held. Wakako and I were really impressed with the city (possibly what Seattle could be if it didn't suffer from major highways intersecting the downtown area). It is a mystery why Vancouver hasn't had much attention from the city builders at either Google (Earth) or Microsoft (Virtual Earth). VE has no data and GE offers one solitary building:
Vancouver and Seattle have approximately the same population, though Seattle, perhaps due to obvious vocational biases, gets more blogosphere attention:
Media Post, the "home on the web for media, marketing and advertising professionals," requires that you sign up with them if you'd like to add a comment to one of their blogs. I did this. Of course, I then started to receive emails from them (I may have elected this; the story is the same either way). At the footer of each email there was a link which, the email claimed, I could click to be removed from their subscription list. It doesn't work. So now, not only do I receive spam from Media Post, I'm completely turned off from whatever content their bloggers provide. With all the hoopla around ethical behaviour and engagement in social media that WOMMA kicked up, it's ironic that Media Post (which has strong ties to WOMMA) can't even get it right.
Chris Harrison is a grad student at CMU's Human Computer Interaction Institute. He's been working on a suite of visualization projects around internet/web data, some of which involve social media content.
ClusterBall visualizes link structure within Wikipedia:
The visualization shows the structure of three levels of Wikipedia category pages and their interconnections. Centered in the graph is a parent node. Pages that are linked from this parent node are rendered inside the ball. Finally, pages that are linked to the latter (secondary) nodes are rendered on the outer ring. Links between category pages are illustrated by edges, which are color coded to represent their depth from the parent node. Nodes are clustered such that edge lengths are minimized. This forces highly connected groups of pages to clump together, essentially forming topical groups. The center acts as an anchor while the ring provides a fixed perimeter. This allows the secondary, super-categories to "float" above clusters.
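The geometry described there is simple enough to mock up. Below is a toy reconstruction of my own (not Harrison's code, and it skips the edge-length-minimizing clustering step entirely): the parent is pinned at the origin, the outer-ring pages are spread around a fixed perimeter grouped by category, and each category node floats inside the ball near the centroid of its pages.

```python
import math

def clusterball_layout(parent, categories, ring_radius=1.0, ball_radius=0.6):
    """categories maps each category page (linked from the parent) to the
    list of pages that link to it.  Returns {node: (x, y)} with the parent
    at the origin, ring pages spread around a fixed outer perimeter grouped
    by category, and each category node floating inside the ball near the
    centroid of its ring pages."""
    positions = {parent: (0.0, 0.0)}
    total = sum(len(pages) for pages in categories.values())
    slot = 0
    for category, pages in categories.items():
        members = []
        for page in pages:
            theta = 2 * math.pi * slot / total
            positions[page] = (ring_radius * math.cos(theta),
                               ring_radius * math.sin(theta))
            members.append(positions[page])
            slot += 1
        # The category node sits inside the ball, pulled toward its cluster.
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        scale = ball_radius / ring_radius
        positions[category] = (cx * scale, cy * scale)
    return positions

print(clusterball_layout("Science",
                         {"Physics": ["Optics", "Mechanics"],
                          "Biology": ["Genetics", "Ecology", "Botany"]}))
```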
Internet Map pictures internet traffic density between locations on the planet.
The stronger the contrast, the more connectivity there is. It is immediately obvious, for example, that North America and Europe are considerably more connected than Africa or South America. Additionally, three graphs showing network connections were created. I should note this is not the first time graphs like this have been created - I've seen dozens of variations, most being practical in nature (e.g. cable locations, bandwidth). I decided to pursue an aesthetic approach - one more visually intriguing and interesting to explore than useful. The intensity of edge contrast reflects the number of connections between the two points. No country borders or geographic features are shown. However, it should be fairly easy to orient yourself.
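The core encoding is straightforward: draw an edge between every pair of locations and let its opacity scale with the connection count. A back-of-the-envelope version of that idea (my own sketch with made-up traffic numbers, using matplotlib) would look something like this:

```python
import matplotlib.pyplot as plt

# (lon, lat) for a handful of cities, plus invented connection counts.
cities = {"New York": (-74, 41), "London": (0, 52), "Tokyo": (140, 36),
          "Lagos": (3, 6), "Sao Paulo": (-47, -24)}
traffic = {("New York", "London"): 900, ("New York", "Tokyo"): 400,
           ("London", "Tokyo"): 300, ("New York", "Sao Paulo"): 80,
           ("London", "Lagos"): 40, ("Lagos", "Sao Paulo"): 10}

max_count = max(traffic.values())
fig, ax = plt.subplots(figsize=(8, 4))
ax.set_facecolor("black")
for (a, b), count in traffic.items():
    (x1, y1), (x2, y2) = cities[a], cities[b]
    # Edge contrast (opacity) encodes connection volume, as in the original.
    ax.plot([x1, x2], [y1, y2], color="white", alpha=count / max_count)
ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)
ax.axis("off")
plt.show()
```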
LinguisticAgents is an NLP company based in Israel that provides a natural language interpretation service designed to be used as a middle layer in applications. I spoke with Ari Applbaum a while ago about their service but never found the time to do a longer post on the company. A commenter on a recent post has nudged them back onto my radar.
Functionally, what LA offers is a service which will provide parse trees for any input you throw at it. While this service doesn't provide all of the pieces required to integrate an NLP solution into most applications (for example, some of the disambiguation is left up to the client), it does provide a simple interaction that allows developers to rapidly prototype applications that could take advantage of NLP. I'm assuming that if the results look good, a customer could elect to bring the technology closer to the application rather than rely on a web-based interaction (which would never scale with anything other than the most modest data rates).
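To be concrete about what that simple interaction might look like from the developer's side: I don't have their API documentation to hand, so the endpoint and response shape below are entirely invented, but the prototyping loop I have in mind is just "post a sentence, get a parse tree back."

```python
import json
import urllib.request

# Hypothetical endpoint and response format -- purely illustrative,
# not LinguisticAgents' actual API.
PARSE_URL = "http://example.com/parse"

def parse(sentence):
    payload = json.dumps({"text": sentence}).encode("utf-8")
    request = urllib.request.Request(
        PARSE_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        # Assume the service answers with something like
        # {"tree": "(S (NP ...) (VP ...))"}
        return json.load(response)

tree = parse("The chess horse moves in an L shape.")
print(tree["tree"])
```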
LinguisticAgents use a novel approach to syntactic analysis called NanoSyntax:
Every word of a sentence contains a significant amount of complex information. Words are communicated in a particular order, via sentences. Humans, naturally having the correct decoding algorithms, effortlessly decode messages on a subconscious level. The human brain takes the information a person wants to communicate and packages it into encoded packages - words. That is why accurate understanding of Human Language cannot be achieved using common computer science methods alone. Computers, lacking the necessary algorithms, are unable to remove the encryption and are unable to understand sentences in Natural Language (i.e., Hebrew, English, Russian, etc.).
The theoretical linguistics of the 80s and 90s made progress in finding a solution to this problem, but the true giant leap in NLP has taken place over the last few years, through a system called Nanosyntax, which has made the code breaking of natural language possible.
Linguistic Agents' software is the only natural language software available today based on the latest linguistic theory, NanoSyntax.
I've had a little scratch around the web on this topic, but haven't found anything immediately readable that intuitively captures what this approach is really about. The text on the company's site certainly seems to dismiss more common approaches to syntactic and semantic processing (such as LFG - the formalism used by Powerset). In addition, much of the battle between the new breed of linguistically motivated application providers (web search included) is going to be over basic lexical resources that provide the necessary mappings between lexically and semantically related concepts. On this note, it has been interesting to see a number of papers presented here at AAAI 2007 on mining Wikipedia for ontological and taxonomic information. In addition, at last night's reception, the Freebase booth was seeing a good rate of traffic to hear Kurt Bollacker's animated demonstration of the system/data.
Digger has a table set up here at AAAI 2007 in Vancouver. I've not yet had a look at their beta (a random encounter in an elevator may lead to getting an invite) but they do have a few example queries available on their website (the demo at the conference allows you to enter any query).
They do a couple of nice things in their demo interface. Firstly, they give you a breakdown of the interpretation they have given to your query. Secondly, in the snippets they return, they provide extra information for each of the terms that were used in the match (which may not be literally present in the query).
Below is the query information for the query 'moving a chess horse.'
While Digger is going for a similar level of semantic resolution as Powerset, it is applying its technology to the query and the snippets from the result set - not computing an entire semantic index in the back end.
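If I had to guess at the machinery behind that snippet annotation, it is something like query-side expansion plus a record of which original term licensed each match. A toy version (the expansion table is invented, and this is certainly not Digger's actual pipeline):

```python
# Match snippet terms against the query *and* its expansions, recording
# which query term licensed each match.
expansions = {"horse": {"horse", "knight"},
              "moving": {"moving", "move", "moves"}}

def annotate(query_terms, snippet):
    notes = []
    for word in snippet.lower().split():
        word = word.strip(".,")
        for term in query_terms:
            if word in expansions.get(term, {term}):
                notes.append((word, term))
    return notes

print(annotate(["moving", "chess", "horse"],
               "The knight moves two squares in one direction."))
# -> [('knight', 'horse'), ('moves', 'moving')]
```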
I'm interested in the general strategy of exposing the results of analysis in the manner that Digger does. It allows for direct comparisons with other enterprises adopting the same approach (here is Hakia's interpretation of the same input). That being said, it also reminds me of the way in which Ask Jeeves (the original iteration) reported results. It is impressive when it is right, but when the reporting of the interpretation allows the user to edit mistakes - even though that may help the user at that instant - the perception is of a hedged interface.