I'm as excited as the next blogger about Wolfram|Alpha. One thing I'd like to add to the discussion is the notion of data literacy. The traditional UI for search is one dimensional. The value of the data in the web (and the extended notion of the data web) is multi-dimensional. Effort is required on the part of the user to become data literate - being comfortable with results that are not lists. I don't think we will get to all the value in the web if we wait around for AI to arrive (which is Google's strategy). Rather I believe that a commitment from both users and vendors to provide data rich interactions with the web will get us there.
There were a couple of papers that I really liked at WWW. One was Mapping the World's Photos by David Crandall et al.; another was SOFIE: A Self-Organizing Framework for Information Extraction by Fabian Suchanek et al. Both papers, while operating in different domains, used features from orthogonal spaces (images, tags and geocoding in the first case; entities, relations, text patterns and logic in the second) to automatically mine new facts from a data set.
However, after the initial impression of how great these things are (and they are great) one realises that the facts that they have surfaced are already known. The first system excels at discovering landmarks and images of landmarks, but this is a well known knowledge set – landmarks, by their definition, are discovered facts. In the second case, the fact used in the example (which we shouldn’t, of course, judge the entire system by) was also well known.
What one would want to see in these papers, and certainly in their presentations, is the long tail of facts. The head of the knowledge-sphere is well known by definition; rather than discovering it, we should assume it. The long tail will have weaker signals, and it is there that the power of these systems really matters.
Other papers in this and related areas from the conference include:
An ontology may well be considered a reduction (of the infinite kind). We might just be replacing one system of symbols (strings) with another (‘semantic objects’). However, this may still be a very useful thing to do. In addition, these new symbols can still be related to the external world of objects.
I don’t think that I disagree with our integration/leveraging/exploitation of the world ‘out there’ as a point of reference for our ‘internal’ processes. That being said:
We do have to make a link to those external things, and if a more compact or mnemonic ‘internal’ symbology is a useful and efficient mechanism, why can’t that be related rather than ‘raw’ strings?
There are plenty of things that we think and talk about that have no external reference at all.
Language itself requires linkages (referents), but I can’t point to language.
Regarding Noë’s book, my current reaction is that he struggles with language and metaphor from the very beginning.
“Mind is life” page 42
“Mind is the lower boundary of consciousness” page 45
“I use the term ‘consciousness’ to mean, roughly, experience” page 8
[consciousness is something we do] page 8
Do we ‘do’ our experiences?
Regarding the apparent duality of attributing minds to those with whom we interact (children, lovers, colleagues) and the scientific approach of cold objectivism – I have no problem with this at all. Perhaps I’m a lobster…
At any rate – it is a very stimulating read. @Fernando – I’m looking forward to your thoughts on the book.
Update: it is interesting to see that, in building up his arguments, Noë makes claims about neural plasticity that are the central theme (to be refuted) of Pinker’s book The Blank Slate, and yet Noë doesn’t cite Pinker. Noë’s account of the ferret visual cortex experiments is (if memory serves) not in line with Pinker’s account in terms of the perfection of mutability. I believe Pinker indicated that the experiment produced some sort of visual system but not a perfect solution.
I saw a tweet from danah about a talk she had given. I replied to her indicating that I appreciated it. However, that reply – though a public act – has no context at all. You can’t determine which of danah’s tweets I’m referring to, and you can’t determine what it is I’m talking about.
Twitter, due to its character limits, does not do a good job of quotation. One can re-tweet (as about 2% of tweets currently are), but that is not particularly satisfying as one can’t add to the conversation.
Perhaps this is the first axiom of Twitter discourse structure: there are no explicit cascades.
Update – well, I really showed my ignorance there. I don’t use the txt version of Twitter, but was reading tweets as if I were. I would guess, however, that in the txt world, there is no ‘in reply to’ link – is that correct?
Note also that in search results, the ‘in reply to’ information is lost (I’ve always been confused as to why Twitter search doesn’t produce Twitter XML formatted results). This tweet is in reply to this one, but the search results don’t preserve that relationship. I see the ‘show conversation’ links in search results but, as Andrew Baron points out, it is only a feature of search. I’m guessing that there are still system divisions between search and the main Twitter CMS and that integration is going to be like changing the wheels on an aeroplane barrelling down the runway.
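To make the point about lost context concrete, here is a minimal sketch of how a conversation could be rebuilt when the ‘in reply to’ link is preserved. It assumes each tweet is a plain dictionary carrying an `in_reply_to_status_id` field (the name Twitter's API uses for this link); the sample tweets, ids and the `conversation_chain` helper are hypothetical.

```python
def conversation_chain(tweets_by_id, tweet_id):
    """Walk 'in reply to' links backwards and return the thread, oldest first."""
    chain = []
    seen = set()  # guards against a malformed cycle of reply links
    current = tweet_id
    while current is not None and current in tweets_by_id and current not in seen:
        seen.add(current)
        tweet = tweets_by_id[current]
        chain.append(tweet)
        current = tweet.get("in_reply_to_status_id")
    chain.reverse()
    return chain


# Hypothetical data: a reply that keeps its link, and the same reply without it.
with_links = {
    1: {"id": 1, "text": "slides from my talk", "in_reply_to_status_id": None},
    2: {"id": 2, "text": "appreciated it", "in_reply_to_status_id": 1},
}
without_links = {
    1: {"id": 1, "text": "slides from my talk"},
    2: {"id": 2, "text": "appreciated it"},
}

thread = [t["id"] for t in conversation_chain(with_links, 2)]
orphan = [t["id"] for t in conversation_chain(without_links, 2)]
```

With the link in place the reply resolves to the tweet it answers; once the field is dropped, as it appears to be in search results, the reply stands alone.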
Every time Technorati published its state of the blogosphere posts it would come under fire regarding the number of blogs: 100 million you say? but how many are active, huh? At Blogpulse, we also suffered with this issue and side-stepped it by referring to the number of blogs that we had identified.
While everyone is crazy about those insane Twitter numbers, how many are active? What do we mean by active? Let’s say – if you haven’t tweeted in the past 5 days you are inactive.
Well, how on earth are we going to measure that? I came up with the cunning idea of doing a search way back in time on Twitter search for some term and then seeing how many of those that tweeted in the search results were still tweeting. Unfortunately, Twitter search only appears to provide results from about a month ago at best. So I went back to March 23rd – the oldest date I could get results for – and searched for ‘first tweet’.
Of the accounts for the first 15 results, 6 hadn’t updated on or after April 16th (40%); 9 hadn’t updated in the past 24 hours (60%).
That was a pretty unscientific experiment.
I then looked at Obama’s followers and found that 10 of the first 20 (50%) had no updates. However, I don’t think this means anything as I believe plenty of people sign up and follow Obama not knowing that he doesn’t Twitter. Perhaps followers are listed in order of most recently attached.
Obama has 904,954 followers, and with 20 per page, that is 45,248 pages. So if I go back to page 45,000 I should be back with the older crowd. 5 of these 20 hadn’t updated in the past 5 days (25%), the next 20 also showed 5 inactive accounts (25%), and the next 20 showed 10 (50%).
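The arithmetic here is easy to check. The sketch below uses only the figures quoted above and assumes that the third sample means 10 inactive accounts out of 20; the function names are my own.

```python
FOLLOWERS = 904_954  # follower count quoted above
PER_PAGE = 20        # followers listed per page, as quoted above


def page_count(total, per_page):
    """Number of pages needed to list `total` items, rounding up."""
    return (total + per_page - 1) // per_page


def inactive_rate(inactive, sampled):
    """Percentage of sampled accounts that were inactive."""
    return 100.0 * inactive / sampled


pages = page_count(FOLLOWERS, PER_PAGE)

# (inactive, sampled) for the three consecutive pages near page 45,000.
# The third pair is an assumed reading of "the next 10 (50%)".
samples = [(5, 20), (5, 20), (10, 20)]
rates = [inactive_rate(i, n) for i, n in samples]

# Pooling the three pages gives one estimate over 60 accounts.
pooled = inactive_rate(sum(i for i, _ in samples), sum(n for _, n in samples))
```

Pooled, that is 20 inactive accounts out of 60, or roughly a third, which is still a very small sample.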
Taking some of these inactive accounts at random, I found 3 of 8 (37.5%) followers inactive; 5 of 10 (50%); 8 of 20 (40%).
Update – a sample from Ashton Kutcher’s followers (aplusk) at page 50,000 found that 13 of the 20 were inactive (65%).
Update – a sample from Britney Spears’s followers (britneyspears) at page 50,000 found that 15 of the 20 were inactive (75%).
Conclusions – there’s no science here, so everything should be taken with a pinch of salt. That being said, it seems you can’t move too far along the Twitter tree before you find some dead wood. Of course, I chose the 5 day limit because, with the first experiment, that is when something interesting happened. It’d be better to get a full distribution of inactivity for a good sample of accounts.
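A minimal sketch of what that fuller measurement might look like, assuming we already have the date of each sampled account's most recent tweet. The dates below are invented for illustration, and treating a gap of exactly 5 days as inactive is a boundary choice of mine, not something the definition above settles.

```python
from collections import Counter
from datetime import date


def days_inactive(last_update, today):
    """Whole days since the account's most recent tweet."""
    return (today - last_update).days


def inactive_fraction(last_updates, today, threshold_days=5):
    """Fraction of accounts with no tweet in the past `threshold_days` days."""
    if not last_updates:
        return 0.0
    inactive = sum(
        1 for d in last_updates if days_inactive(d, today) >= threshold_days
    )
    return inactive / len(last_updates)


def inactivity_histogram(last_updates, today):
    """Map 'days since last tweet' to the number of accounts with that gap."""
    return Counter(days_inactive(d, today) for d in last_updates)


# Hypothetical sample: last-tweet dates for four accounts.
today = date(2009, 4, 21)
sample = [date(2009, 4, 20), date(2009, 4, 18), date(2009, 4, 10), date(2009, 3, 23)]

fraction = inactive_fraction(sample, today)
histogram = inactivity_histogram(sample, today)
```

With the histogram in hand, the 5 day cut-off stops being an arbitrary choice: one can see the whole distribution of gaps and pick a threshold from it.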
I’ve just finished reading “Artificial Intelligence Meets Natural Stupidity” by Drew McDermott, as kindly suggested by Fernando. I’m going to have to read it at least once more to figure out how it relates to the broader discussion. McDermott makes excellent points about the audacity of naming styles in AI programs (particularly in the knowledge understanding/reasoning space). I wonder what he’d think about the vast field of machine learning which has, to some extent, defined AI over the past decade.
In a thread of ongoing work, I’m using the terms ‘assertion’ to represent the storing of something in memory and ‘relationship’ to suggest that one assertion has a relationship with another (of an unknown type). I suspect that McDermott would not get too upset about that (he does, after all, get upset about a great many things).
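Purely as an illustration of how little those two terms commit to, here is a hypothetical sketch (not the actual ongoing work): an assertion is just something stored, and a relationship is an untyped link from one stored assertion to another. All class and method names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    """Something stored in memory; the content is left uninterpreted."""
    content: str


@dataclass
class Memory:
    assertions: list = field(default_factory=list)
    # Each relationship is an ordered pair of assertions with no type label.
    relationships: list = field(default_factory=list)

    def store(self, content):
        """Store a new assertion and return it."""
        assertion = Assertion(content)
        self.assertions.append(assertion)
        return assertion

    def relate(self, source, target):
        """Record that `source` has some (unknown) relationship with `target`."""
        self.relationships.append((source, target))

    def related_to(self, source):
        """All assertions that `source` has a relationship with."""
        return [t for s, t in self.relationships if s == source]


memory = Memory()
a = memory.store("danah gave a talk")
b = memory.store("I appreciated the talk")
memory.relate(b, a)
linked = memory.related_to(b)
```

Nothing in the structure names what kind of relationship holds, which is the modest naming that the paragraph above is aiming for.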
One of the issues that McDermott’s paper complains about is the confusion of instances and types in any knowledge base or semantic network. He’s right, this is a problem area, but I’m under the impression that it is well understood by practitioners now.
As far as I can tell, McDermott is not against mentalese (a language of the mind which is used mnemonically in memory and reasoning) but warns of assumptions regarding this language and its relationship with the natural language used to communicate.
@Fernando – I also picked up a copy of Alva Noë’s book for this trip, thanks for another great reference.
I’ve just read The Big Switch by Nick Carr – a great read. One of the key issues he summarizes is the tyranny of preferential attachment (my phrase), viz: small preferences lead to segregated communities and homophily leads to amplification of extreme views. These are the things that scare me when we don’t think about communities, network effects, ‘recommendation’ and ‘relevance’ more generally.
[Nick – page 163 of the paperback edition – Natalie Glance was never employed by Infoseek, it was Intelliseek.]
Jonathan Mendez has an interesting piece on analysing the potential for real time search (thanks Barney). I suspect, however, that he is only looking at half of the picture. I see two pieces:
New content is being created, let’s grab it immediately and figure out how to serve it.
The stream of real time data is pointing (attending to) arbitrary resources on the web, let’s make sure we grab those and serve them in a way which reflects the amount of attention that they are getting.
Jonathan, I think, focuses only on the first.
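The second piece can be sketched in a few lines: treat each message in the real time stream as a vote for the resources it links to, and rank resources by the attention they collect. This is a rough illustration under my own assumptions (a simple URL regular expression, one vote per message per URL, hypothetical example.com links), not a description of how any search engine does it.

```python
import re
from collections import Counter

# Simplistic URL matcher, good enough for a sketch.
URL_PATTERN = re.compile(r"https?://\S+")


def attention_counts(messages):
    """Count how many messages point at each URL (one vote per message)."""
    counts = Counter()
    for text in messages:
        # Strip trailing punctuation and de-duplicate within a message.
        urls = {u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(text)}
        counts.update(urls)
    return counts


def rank_by_attention(messages, top_n=10):
    """Return the most-linked URLs, highest attention first."""
    return attention_counts(messages).most_common(top_n)


# Hypothetical stream of messages.
stream = [
    "reading http://example.com/a now",
    "http://example.com/a is great, also http://example.com/b",
    "http://example.com/a http://example.com/a",
    "nothing to link to here",
]

ranking = rank_by_attention(stream)
```

A real system would also need to resolve shortened URLs and decay old votes over time; both are left out here.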
Update: regarding the claims in Jonathan’s post about Google having solved this problem, take a look at the search results for ‘barney pell’. It gets the home page of the blog (good) but the snippet is from the state of the blog 5 months ago – not really real time…