I've now spent a little time playing with Sphere - Tony Conrad's new blog search engine, currently in Beta. Before talking about the engine, here are some highlights from an interview with Tony posted on TECHNOSIGHT (thanks Natalie).
- Tony is very aware of the advantages of being a second-generation blog search engine. The ability to see what others are doing, where they are failing and what scalability issues they have makes for an easier task of prioritising features: speed, relevance, comprehensive coverage, data quality and time to index.
- Given Tony's strong background in consumer marketing, as well as his history in venture capital and startups, it is clear that Sphere is also being marketed beyond the consumer-facing community, as a way to add value to other publications. In fact, this is likely to be the main business model - supplying specific or syndicated results to augment content on other sites (though one can't ignore the potential of ads on a search engine).
The first thing that strikes me about Sphere is the clean design of the interface. It is hard to bring any criticism to bear on a design which takes the simplicity of the standard text box interface to search and adds a little something to make it recognizable and snappy. However, this search engine interface is one which can prove limiting to users in a rich and dynamic space like the blogosphere. BlogPulse and Technorati are characterised somewhat by the extra elements on the page which lead to things like trend mining, conversation mining, tag search, etc. As this is a beta product, we can expect these features (or similar features) to appear later - so from the get go, Sphere is going to be judged on a small number of features: search speed, coverage, quality and time to index.
Search Speed
The launch of Google's blogsearch highlighted how others were doing a less than perfect job of this important feature. Search speed is not simply a matter of architecture, but involves a number of related issues: user load, index size, relevance quality and post filtering results. Sphere is currently doing an impressive job here - certainly comparable with Google.
Coverage
Sphere has a number of operators which permit some interesting searches. One of these is the site operator which restricts searches to a certain domain. Thus site:blogspot.com returns all posts from blogspot. Combining this with restricting the search to the 'last week', we can issue a number of queries which allow us to discover how many posts are in the index for certain hosts:
host | posts indexed (last week) |
blogspot | 11,560 |
xanga | 3 |
livejournal | 9,671 |
typepad | 10,107 |
(spaces) msn | 68,600 |
bravejournal | 0 |
weblogs.us | 171 |
wordpress | 4 |
I'm not going to compare these with results from other search engines, partly due to the difficulty of replicating this type of search. However, note that Livejournal reports 861,395 journals have been updated in the last 7 days, and the lack of Xanga results is confusing. Sphere is doing a real crawl of blogs (as BlogPulse does, unlike Google) so there is no reason why these blogs should have been missed.
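To put the Livejournal figures above side by side, here is a rough back-of-the-envelope sketch. It compares indexed posts with updated journals, which are different units, so the result is best read as an upper bound on the share of updated journals that have at least one post in the index. The function name is hypothetical and only for illustration.

```python
# Figures quoted in the post above.
indexed_posts_last_week = 9_671        # Sphere's last-week count for livejournal
journals_updated_last_week = 861_395   # figure reported by Livejournal


def coverage_upper_bound(indexed_posts: int, updated_blogs: int) -> float:
    """Upper bound on the fraction of updated blogs with an indexed post.

    Every indexed post belongs to at most one blog, so the number of blogs
    represented in the index cannot exceed the number of indexed posts.
    """
    if updated_blogs <= 0:
        raise ValueError("updated_blogs must be positive")
    return min(1.0, indexed_posts / updated_blogs)


ratio = coverage_upper_bound(indexed_posts_last_week, journals_updated_last_week)
print(f"At most {ratio:.2%} of updated journals appear in the index")
```

On these numbers the bound comes out at a little over 1%, which is why the coverage looks thin even without a like-for-like comparison against other engines.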
Sphere's interface permits searching as far back as 4 months (the beta service is running on a portion of their index). A search for bush yields 194,554 results. The same search on Google blogsearch, restricting the result to 4 months, yields 1,800 results. Huh? Ok, so Sphere is actually running on an index of a little over 4 months. However, even extending Google's search to the maximum window (approx 7 months) yields the same number of results: 1,800. I suspect this is a bug in Google's advanced search - when everyone is in beta, comparison is hard!
Quality
This is about spam. Testing for spam is tricky - how do you search for spam? The most trivial test is to search for 'viagra'. This brings back spam results in Sphere, Google, BlogPulse and Technorati (Technorati appears to be doing slightly better here). Switching Sphere to date ranking (the default is relevance) shows more spam results than their relevance ranking does. One of the aims of Sphere was to provide good relevance ranking of blogs - this has the side effect, if done right, of suppressing spam. However, the real challenge is to combine this quality with date/time ranking. I believe that pure relevance ranking is important for blogs - but possibly more for finding blogs themselves, not posts (an intuition, no real data).
Time to Index
Time to index is currently the domain of Technorati. Stats in this area are very confusing. What a blogger wants to know is: if I post now, how long will it be until I'm indexed? Stating average times to index only makes sense if you have 100% coverage - a post that is never indexed means the average time to index is infinite for all data. A quick evaluation for precision (that is to say, of those posts indexed, how timely are they) can be done via searching on common terms and observing the time stamps (note that Technorati provides this function, but recent experience there shows that their time information is horribly wrong).
A search for bush produces results from 11 hours ago. A search for earthquake - in the context of the recent quake in Kashmir - produces results from 11 hours ago. In fact, all tests appear to be limited to 11 hours.
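The point about averages and coverage can be made concrete with a small sketch. The sample timestamps below are invented purely for illustration (they are not measurements from Sphere or anyone else), and the function and field names are hypothetical. The idea is to report coverage and the delay among indexed posts separately, and to treat the overall mean as infinite as soon as any post is missing from the index.

```python
import math
from datetime import datetime, timedelta
from statistics import median
from typing import Optional, Sequence, Tuple

# (time published, time indexed or None if the post never showed up)
Post = Tuple[datetime, Optional[datetime]]


def index_stats(posts: Sequence[Post]) -> dict:
    """Summarise coverage and time to index for a sample of posts."""
    delays = [
        (indexed - published).total_seconds() / 3600.0
        for published, indexed in posts
        if indexed is not None
    ]
    coverage = len(delays) / len(posts) if posts else 0.0
    # A post that is never indexed has an unbounded delay, so the mean over
    # all posts is only finite when every post was indexed.
    if posts and len(delays) == len(posts):
        mean_all = sum(delays) / len(delays)
    else:
        mean_all = math.inf
    return {
        "coverage": coverage,
        "median_delay_hours_indexed": median(delays) if delays else None,
        "mean_delay_hours_all": mean_all,
    }


# Invented sample: four posts published at the same moment, one never indexed.
t0 = datetime(2005, 10, 23, 12, 0)
sample = [
    (t0, t0 + timedelta(hours=1)),
    (t0, t0 + timedelta(hours=3)),
    (t0, None),
    (t0, t0 + timedelta(hours=11)),
]
stats = index_stats(sample)
print(stats)
```

Reporting the median over indexed posts alongside coverage keeps the two questions apart: how much gets in, and how quickly the part that gets in arrives.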
Interface
I've left talking about the details of the interface until now as writing the above has let me play around with some of the features. As I mentioned earlier - the look is clean and recognizable. The search interface has a number of drop downs for time period (last day, week, etc.) and ranking (time, relevance). There is also an English/non English language filter. These all work as expected.
One thing that is very frustrating is the state of the interface. It is reset whenever a new search is entered. Perhaps this is less annoying in standard use, but testing the results as I have above requires that those combo boxes stay put!
Extras
One of the axes of search important to blog search users is post versus blog. In other words, am I searching for a post or a blog? Sphere provides both of these integrated in the results page, as well as relevant news articles. Anecdotal usage suggests that the relevant blogs are indeed relevant, but there just aren't that many of them. A search for data mining produces a single blog (mine) and, I note, the most relevant blog is a spam blog by Marcus Zillman - spam expert. Google and Technorati both provide more - though different - results. This was one of the features I had hoped Sphere was going to hit out of the park. I know that doing something in this area well is going to be of huge value, and I believe that it isn't as hard as all that. I'm surprised, then, that none of the main players, BlogPulse included, has really solved it. Everyone but Google has an excuse here - resources, priorities, etc. So we have to assume that Google doesn't see the value here, or perhaps they believe that their general search engine provides this capability.
Summary
A very nice interface, fast results, nice features, but struggling with coverage. I'd like to see far more work done on the relevant blog search - even to the extent of making this a panel of its own. I admit that I've used some horrible anecdotal tests in getting to know the system - something which I'm ashamed of. Evaluation is hard!
I believe that in evaluating the spam filters you have made a very common mistake. Everyone seems to assume that blog spam targets the same products as e-mail spam, and this does not match my experience. For example, "mesothelioma" is probably more common in blog spam than "viagra", even though it never seems to appear in e-mail spam at all. In any case, both of those are incredibly trivial to filter, and thus don't make good test cases. If you really want to check the effectiveness of spam filtering, do a search for "hotels".
Posted by: Robert Stockton | October 24, 2005 at 02:02 AM
Robert - it turns out that there is plenty of pharma spam in blogs as well, so a search for this term is as valid as any other anecdotal test for spam. See here for the first page on Technorati: http://www.livejournal.com/users/medinfo/12555.html, here for the first page on Sphere: http://www.bestpharma.net/pharmacy-online/440/online-pharmacy-online-7, here for Google: http://texas-holdem-players.blogspot.com/2005/10/viagra.html (actually a poker spam site which contains the word 'viagra' - go figure), ...
Posted by: Matthew Hurst | October 24, 2005 at 08:23 AM