While it seems like everything in the online space is hunky-dory and progress is making predictable strides towards our inevitable AI-infested future, I often see such utter failures in search engine results that they make me think we haven't even started to lay the foundations.
Here's the story: as I've become interested in mining the news cycle for various reasons, I've started attempting to understand who the editors of major news sources are. The current version of the Hapax Page on d8taplex tracks the attribution of article authors and editors (I conflate the concept of writer, reporter and un-typed contributors under the term 'author' while explicit editors are tracked separately). From this analysis, I see that there is someone called Cynthia Johnston who is often associated with articles from Reuters (in fact, she is currently at the top of the list ranked by count of articles).
So, I want to know who Cynthia Johnston is.
I search in Google for 'Cynthia Johnston Reuters bio', and I get the following:
- The result of a search on the Reuters site for the name 'Cynthia Johnston'. It turns out that Reuters has a blog reserved for most of its people but not for CJ.
- An article about 'Bio Gaudiano' olives posted on Reuters and edited by CJ - this is actually duplicated in the results, one with the normal link and the other a mobile version.
- An article posted on Reuters by CJ.
- Various teaser pages from people aggregation and search engines with lists of Cynthia Johnstons.
It is important to note here that there is no evidence that the search engine understands the trivial difference between an article written by someone and an article written about someone.
What would have been a good result here? In the existing 'I can only show you a web page' paradigm, the only reasonable page I can find thus far is here, which contains an out-of-date bio for CJ.
Cynthia Johnston is a correspondent with Reuters News based in Cairo. This is her fifth international assignment for Reuters, and her second stint in Egypt. She has previously worked for Reuters in Beirut, Jerusalem and London, and has also reported on short-term assignments from Syria, Iraq, Sudan and many other locations in the region. She covered Israel's 2005 withdrawal from the Gaza Strip and reported from Gaza during several Israeli military incursions there. She was in Beirut for the Syrian pullout from Lebanon, and covered the death of Yasser Arafat from outside his hospital in Paris and then returned to see him buried in Ramallah.
Johnston first joined Reuters in 2000 fresh out of graduate school as part of the Reuters graduate training scheme, a program that involves a year of intensive on-the-job training in London in financial and general news in preparation for future assignments. Her assignments have included a mix of field reporting, writing and editing. Much of her work for Reuters has also involved financial journalism, from covering currency devaluations to inflation crises and privatization efforts.
Johnston holds a Bachelors degree in journalism from Northwestern University and a Masters degree in Middle Eastern and North African Studies from the University of Michigan. She spent two years studying at the American University in Cairo -- one year as a study abroad student and a second year in the Center for Arabic Study Abroad (CASA) program.
But what do I really want?
Firstly, I want the search engine to be able to understand the query. I am asking for a biography of Cynthia Johnston who (I believe) currently works for Reuters. This is not to be confused with:
- Articles by Cynthia Johnston published by Reuters.
- Lists of people called Cynthia Johnston who may or may not work for Reuters.
- Failed search results from other search engines.
Secondly, I want the search engine to work with entities, not pages. I also want the search engine to really understand how the web works. A 'page' (HTML data) is not a static resource like a page in a book. It is, in our current version of the web, often an active resource and search engines need to know the difference. The results I show above give me a choice between static data (articles on Reuters), search results (the end point or midway point of some unknown interaction), etc. A good result would present me with a list of entities, below which should be aggregated pages that support the identity of the entity and justify its distinction from the other entities.
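To make the entity-first idea concrete, here is a minimal sketch of how results might be grouped. It assumes some upstream step has already labelled each page with the entity it supports and with its relation to that entity ('about', 'authored_by' or 'edited_by'). The names `Page`, `EntityResult` and `group_by_entity`, the entity labels and the example.com URLs are all hypothetical, and this is not a description of how any real search engine works.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Page:
    url: str
    entity_id: str  # hypothetical label for the entity this page supports
    relation: str   # "about", "authored_by" or "edited_by"


@dataclass
class EntityResult:
    entity_id: str
    # Maps a relation name to the URLs that support the entity in that way.
    evidence: dict = field(default_factory=lambda: defaultdict(list))


def group_by_entity(pages):
    """Aggregate pages under the entity they support, one result per entity."""
    results = {}
    for page in pages:
        result = results.setdefault(page.entity_id, EntityResult(page.entity_id))
        result.evidence[page.relation].append(page.url)
    # For a biography query, entities with more pages *about* them rank first.
    return sorted(
        results.values(),
        key=lambda r: len(r.evidence["about"]),
        reverse=True,
    )


# Hypothetical input: two different people who share a name.
pages = [
    Page("https://example.com/olives-article", "cj-reuters", "edited_by"),
    Page("https://example.com/some-article", "cj-reuters", "authored_by"),
    Page("https://example.com/bio", "cj-reuters", "about"),
    Page("https://example.com/other-article", "cj-other", "authored_by"),
]

for result in group_by_entity(pages):
    print(result.entity_id)
    for relation, urls in result.evidence.items():
        print("  ", relation, urls)
```

The point of the sketch is that the distinction between an article by someone and a page about someone becomes a field on the evidence rather than something I have to work out by reading each result.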
Thirdly, I don't want the search engine to be used as a patsy for other sites trying to get my attention (e.g. people search engines).
Fourthly, if the search engine believes it understands the query but thinks it doesn't have the answer, I want to know that. A softer requirement would be for it to show me how confident it is in the top results. If I can see I'm starting off with poor results, I will modify my query more rapidly.
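As a rough sketch of what that could look like, assume the engine can produce a flag for whether it understood the query and a confidence score between 0 and 1 for its top results. The `Answer` type, the `summarize` function and the 0.5 threshold are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    understood: bool   # did the engine believe it parsed the query?
    confidence: float  # hypothetical score in [0.0, 1.0] for the top results
    results: list


def summarize(answer, threshold=0.5):
    """Tell the user where they stand before they read any results."""
    if not answer.understood:
        return "I did not understand the query."
    if not answer.results:
        return "I understood the query but do not have an answer."
    if answer.confidence < threshold:
        return (
            f"Low confidence ({answer.confidence:.0%}) in the top results; "
            "consider rephrasing the query."
        )
    return f"Confidence in the top results: {answer.confidence:.0%}"


print(summarize(Answer(understood=True, confidence=0.3, results=["some page"])))
print(summarize(Answer(understood=True, confidence=0.0, results=[])))
```

The second case is the one I care about most: an explicit 'understood, but no answer' is more useful than a page of loosely related links.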
My view on things-we-currently-call-search-engines going into 2012 is that the web-page paradigm is not only outdated, it is fundamentally wrong for lots of scenarios. The entity-based view of the world (which accommodates web pages, sites and activities as things in the world just as it does people and places) is already up and running (Yelp for places, Amazon for consumer products, Wikipedia for documents) and search engines need to flip sooner rather than later.
Update: note that now this post is out there, the link I provided to the bio has increased in relevance, so you will probably not be able to replicate the search experience I describe in the post. Kudos to Google for their real-time indexing and reranking technology!