I remember an interview – years ago – with Benoit Mandelbrot talking about how he arrived at his famous fractal set. He quoted advice he got from his mentor, Julia, which amounted to: when a problem resists a simple treatment, complexify it.
This was, of course, in reference to complex numbers. By casting real numbers into their two-dimensional complex counterparts, an extra degree of freedom is introduced. I’ve been taking this advice to heart when considering how to analyse social media for expressions of appraisal (the text mining problem formerly known as ‘sentiment mining’). Much research in this space views the problem simply – positive and negative words, classifying whole documents, and so on. In this field, complexifying really means working back to a model – a rich description of social context, discourse context, linguistics and psychology – which aims to describe how these expressions of attitude end up in documents. Pushing for this model is already a big win. However, complexification has another great advantage: by proposing a rich model, one can find a principled component of that model to focus on and make valuable, incremental contributions to our understanding of the space.
Something that I’m having fun with just now is the relationship between appraisal and word sense. While much work in this space centres on building lists of ‘positive’ and ‘negative’ words, examples like ‘strong smell’, ‘strong candidate’ and ‘strong personality’ suggest that the adjective (in these examples) requires a finer, sense-based analysis.
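To make the point concrete, here is a minimal sketch of the difference (the word lists, and the use of the modified noun as a crude proxy for sense, are purely illustrative – not from any production lexicon):

```python
# A flat polarity lexicon, as used by typical word-list approaches.
FLAT_LEXICON = {"strong": "positive"}

# A crude proxy for word sense: condition the adjective's polarity on the noun
# it modifies. The entries here are illustrative only.
SENSE_LEXICON = {
    ("strong", "candidate"): "positive",
    ("strong", "personality"): "positive",  # arguably context-dependent
    ("strong", "smell"): "negative",
}

def flat_polarity(adjective):
    return FLAT_LEXICON.get(adjective, "neutral")

def sense_polarity(adjective, noun):
    # Fall back to the flat lexicon when no sense-specific entry exists.
    return SENSE_LEXICON.get((adjective, noun), flat_polarity(adjective))

for adjective, noun in [("strong", "smell"), ("strong", "candidate"), ("strong", "personality")]:
    print(f"{adjective} {noun}: flat={flat_polarity(adjective)}, sense-based={sense_polarity(adjective, noun)}")
```

The flat list calls ‘strong smell’ positive; the sense-conditioned lookup at least has a chance of getting it right.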
Kate Niederhoffer (a former colleague at BuzzMetrics and now at a yet-to-be-named startup with Jeffrey Dachis and Peter Kim) has put together a summary of a discussion she organized on the Future of Measurement (in the social media space). End to end, this is a neat case study: she organized the discussion via a (Google) group, put the document up on Scribd and is publicizing it via her (excellently named) weblog, Social Abacus. And, if that isn’t enough, the discussion is summarized in a tag cloud.
Measurement is generally thought of as the battleground for the emerging industry of social media analysis. Figuring out what to measure, and then convincing your customers that you are doing it accurately, is what it is all about. Strategically, there is still a lot of room to manoeuvre in the first dimension – what to measure. In addition, there is a strong relationship between what you measure and your ability to do so accurately. The company that nails this intersection could end up driving the space and defining the standards.
[Update: I was having issues with this app due to corporate proxy rules – it’s all good from home.]
Jeff Clark over at Neoformix has come up with a really simple but compelling application/visualization. For two or three keywords, it displays the Venn diagram of search results over Twitter for the sets and their intersections. Of course, there is no reason why it needs to be pointing at Twitter – the data could come from anywhere.
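Under the hood this is just set algebra over the result sets. A rough sketch of the idea (the placeholder data stands in for search hits from Twitter or any other source; none of this is Jeff’s code):

```python
from itertools import combinations

# Placeholder result sets standing in for search hits (ids of matching tweets/posts).
# In practice these would come from a search API; the data source is interchangeable.
FAKE_RESULTS = {
    "python": {1, 2, 3, 4},
    "ruby": {3, 4, 5},
    "perl": {4, 6},
}

def search(keyword):
    return FAKE_RESULTS.get(keyword, set())

def venn_regions(keywords):
    """Sizes of each result set and of every intersection, for two or three keywords."""
    results = {k: search(k) for k in keywords}
    regions = {}
    for r in range(1, len(keywords) + 1):
        for combo in combinations(keywords, r):
            regions[combo] = len(set.intersection(*(results[k] for k in combo)))
    return regions

print(venn_regions(["python", "ruby", "perl"]))
```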
Unfortunately the app is down right now (not sure if it is the client or Twitter search that is broken), but you can get a good idea of what it does from Jeff’s blog posts here and here.
Today I’m seeing a lot of spam coming from Google’s blog search feed for my (this) weblog. Our platform here at Live Labs recently dealt with a lot of novel blog spam (which I discovered others in the space also struggled with). Today looks like a new type is going around. Should competitors work more as collaborators in this space?
What is innovation? I’ve been chewing on that question for a while, and a recent post from Hal has finally nudged me to write something. Hal’s post, in summary, discusses the nature of ownership of progress and breakthroughs (‘that was my idea!’). The reason I care about innovation is that I really care about environments that are intentionally set up to deliver it. While we might often hear ‘I work in an innovation centre’, or something similar, does that really have any meaning?
At best, innovation is something that we (that is to say, users) recognize, and I think this is central to understanding the nature of innovation processes in the internet space. Innovation is not just the idea (ideas are free!), nor is the implementation (often the hardest part) sufficient – connection with the user, and the user’s recognition of the value, completes it. Thus, I think the key elements of an innovation centre are (at least):
smart people: to generate ideas
engineering excellence: to implement the ideas
connection with users: the people whose recognition ultimately determines the innovation
The last part is the hardest as it requires some sort of faith: one has to find the right users, a task which requires an interesting mixture of skill, perseverance, tenacity and luck.
One can possibly generalize the three components above. So, in the case of Hal’s discussion, the issue of how freely one publishes one’s research results is part of the mechanism for connecting with users. You may have solved some important problem, but if no-one knows, then there is no impact. Of course, the reason this is more complex than Hal’s post describes is the economic system: breakthroughs at Google or Microsoft are not socialized in the same way as their academic cousins.
Looking back at the environments I’ve worked in, I’ve witnessed situations where we had plenty of the first ingredient (lots of smart people) but the company failed, and situations where the users have been breaking down the doors while we’ve been scrambling to match the smarts with the solution that would satisfy them. In addition, there are plenty of cases where the first two are clearly not a problem, but where the current behaviour of the users (driven in part by the existing solution paradigms) presents a barrier to connection.
The three-part innovation system works at many levels. For example, the academic innovation system feeds not only academia and the public good of science and knowledge in general, but also industrial systems. Industrial R&D labs themselves often interface with academia, but also deliver internal innovations to product-group customers, and so on.
Yesterday, Live Labs launched a project called Thumbtack. The project is an exploration of a collection metaphor for information objects found on the web. It supports the grouping and analysis of objects, as well as the sharing of these collections with, and collaboration on them within, different communities. Steve Drucker, whose voice you will hear on the Thumbtack video, has been working in this area for a while as a principal scientist here at Live Labs, specializing in HCI and UX/UI.
For me, in addition to the social aspects of the project, the most exciting thing about this area in general is the fundamental change in how it views information presented on the web: from a collection of pages linked by hyperlinks to an underlying space of information articulated via online resources (including web pages). The revolution, when it comes, will be a result of this shift in how we think about web-published data, and any project (Live Labs or otherwise) that starts that behavioural change is well worth the investment.
Jason Priem recently pinged me with a link to a project he’s working on: FeedViz. FeedViz provides several dimensions along which to explore and consume feeds: time (via a time series), tags (via a linearized tag cloud) and specific blogs (via a list). Selecting on any of these dimensions updates the display of the other two. Finally, you can read the posts that exist at the intersection of (the settings for) these dimensions.
The tag cloud is generated using “two numbers for each word:
The first is frequency. Frequency says how many times a word is used per 1000 words. If you hover over a word, you'll see its frequency to the left of the frequency change value.
The second is frequency change. Often, a word will be more (or less) popular than usual in a certain time period (for instance, "election" in early November). Frequency change measures that difference as a percentage: greener words are unusually popular; redder words are the opposite.”
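As I read that description, the two numbers amount to something like the following (my own sketch, not FeedViz’s code):

```python
from collections import Counter

def per_mille(tokens):
    """Frequency of each word per 1000 words in a list of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: 1000.0 * count / total for word, count in counts.items()}

def frequency_change(current_tokens, baseline_tokens):
    """Percentage change in each word's per-1000 frequency versus a baseline period."""
    current = per_mille(current_tokens)
    baseline = per_mille(baseline_tokens)
    change = {}
    for word, freq in current.items():
        base = baseline.get(word)
        if base:  # only words seen in the baseline have a meaningful percentage change
            change[word] = 100.0 * (freq - base) / base
    return change
```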
While I really like the design, animations and implementation, I’m not convinced that the above approach is the best way to surface keywords. Of course, it depends on what the purpose of the keywords is (descriptive, discriminative, or trendive), but I’d love to see this stuff running on something like BLRT or TF.IDF.
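For the curious, the BLRT I have in mind is Dunning’s log-likelihood ratio, which scores how surprising a word’s count in a foreground period is relative to a background corpus, rather than taking a raw percentage change. A sketch (the function names are mine):

```python
import math

def _xlogx_entropy(*counts):
    """Un-normalised entropy term: N*log(N) - sum(k*log(k)), with 0*log(0) taken as 0."""
    xlogx = lambda k: k * math.log(k) if k > 0 else 0.0
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table:
    k11 = word count in the foreground, k12 = word count in the background,
    k21 = all other words in the foreground, k22 = all other words in the background."""
    row_entropy = _xlogx_entropy(k11 + k12, k21 + k22)
    col_entropy = _xlogx_entropy(k11 + k21, k12 + k22)
    mat_entropy = _xlogx_entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# Example: 'election' appears 30 times in 10,000 foreground words
# and 5 times in 100,000 background words.
print(log_likelihood_ratio(30, 5, 10_000 - 30, 100_000 - 5))
```

High scores flag words that are unusually over- (or under-) represented in the foreground; unlike a raw percentage change, low-count noise doesn’t dominate the ranking.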
An interesting update from TechMeme regarding their new use of a human in the loop. The post talks a lot about the results being better – but I don’t see any clear description of what their criteria are. The thing I get out of these sites, in addition to timely notification of news stories, is an indication of how much attention a news story is getting. For me, it is very interesting if suddenly everyone is linking to an old news story because it has become relevant in the context of a recent issue.
Ultimately, this will push TechMeme and related sites towards tools that assist editors in selecting news items. Their goal is to break news stories, which is quite different from tracking what is important to bloggers and what is getting attention. Personally, I think the latter is far more interesting as the agenda is emergent and synthetic. In the editorial model, if the editor thinks a certain story is ‘important’ it will find its way to the top of the stack. Will TechMeme be able to continue to differentiate itself?
Named entity recognition – the discovery in text of strings that refer to classes of things like places, people, companies, etc. – has become a standard tool in the text mining world. For many classes (e.g. people) recognition is pretty good. But what are these entities doing? Figuring out the verb associated with a named entity, and the type of association (or role), is trickier than you might think.
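Entity recognition itself is the easy half, and off-the-shelf tools handle it reasonably well. As a point of reference only (spaCy here is an illustration, not the system discussed in this post):

```python
import spacy

# Any off-the-shelf NER model will do; this one needs `python -m spacy download en_core_web_sm`.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Steve Jobs announced the deal in Cupertino on Tuesday.")
for ent in doc.ents:
    # Typical labels include PERSON, ORG, GPE (places) and DATE.
    print(ent.text, ent.label_)
```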
I’ve been playing a little with Thomson Reuters’ OpenCalais – their publicly available service built on top of the ClearForest technology acquired in 2007. To illustrate the challenges that the verb poses, take this sentence:
The Director of National Intelligence, Mike McConnell, has meanwhile implied he suspects the Pakistan-based group Lashkar-e-Toiba was responsible.
Verbs here are has, imply and suspect. OpenCalais analyses this as follows:
First, it works with the named entities, which it discovers in:
The Director of National Intelligence, Mike McConnell, has meanwhile implied he suspects the Pakistan-based group Lashkar-e-Toiba was responsible
Not too bad. Note that they are smart enough to pick up the use of ‘he’. In addition, interaction with the tool reveals that they resolve the pronoun to the preceding entity. Now, with the propositional analysis, things start getting tricky. Rather than recognizing that McConnell implied that he suspected [the …], the system looks for the verb (imply) and then takes the preceding and succeeding noun phrases as the grammatical subject and object, resulting in The Director of National Intelligence implying Mike McConnell.
This example illustrates some of the challenges in the space. It is costly and difficult to do a full parse (which in this case would have discovered the subordinate clause), and discourse analysis (figuring out what pronouns refer to) is complex. Finally, the lexical resources required to guide which noun phrases go with which verbs are a large investment.
While, to a large extent, one can get away with lightweight solutions to named entities that treat the text as linear, getting the verbs right requires a richer understanding of the structure. You can no longer treat the text sequentially (or, to be fair, you have to at least approximate the grammatical, tree-like structure of the language).
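As a sketch of what approximating the tree buys you, here is what pulling subject / object / clausal-complement relations out of a dependency parse looks like (spaCy used purely as an illustration; this is not how OpenCalais works internally):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Director of National Intelligence, Mike McConnell, has meanwhile implied "
          "he suspects the Pakistan-based group Lashkar-e-Toiba was responsible.")

for token in doc:
    if token.pos_ in ("VERB", "AUX"):
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
        # A clausal complement (ccomp/xcomp) is a subordinate clause, e.g. what was implied.
        complements = [c for c in token.children if c.dep_ in ("ccomp", "xcomp")]
        if subjects:
            print(token.lemma_,
                  "| subjects:", [s.text for s in subjects],
                  "| objects:", [o.text for o in objects],
                  "| clausal complements:",
                  [" ".join(t.text for t in c.subtree) for c in complements])
```

The point isn’t this particular library: it’s that the clause attached to ‘implied’ is found by walking the tree, not by grabbing whichever noun phrase happens to sit next to the verb.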
To some degree, certain text mining applications can help here, if designed appropriately. While the above example surfaces a clear error in analysis, if the application relies on aggregates of this type of analysis, and if in aggregate you tend to get your facts right, then the user can still win. However, if, at any stage, one has to reveal a specific document, the user is likely to be exposed to errors.