

September 08, 2005



Hi Matthew, I just realized I forgot to open my post up for comments, so thanks for the trackback. It was not my intent to call the search engines listed in the article backwards. Believe me, I use them daily and have been impressed with the speed with which blogs have been included. I only wanted to point out that PubSub should have been included in the article as another option. I was at a recent conference in Boston where Steven Cohen of PubSub spoke about their matching technology. Since I admitted I'm no expert, I'm sure PubSub can provide more info if you're interested. Thanks, Cindy


I think the Cymfony blog's more serious transgression is in utilizing those excruciatingly trite clip art images of contented-looking business drones.

Matthew Hurst


I was just playing with words there - backwards *facing* was meant to indicate that the search engine was looking back over articles already published, as opposed to looking forward (prospective) which is what PubSub's product language implies. I still believe I am missing something important here as I don't see how 'prospective search' is fundamentally different from search feed subscription.


Bob Wyman

Matt, there are many ways to implement a prospective search. The best way is to use algorithms that are explicitly designed and optimized for prospective search. However, it is also possible to get some of the effect of a well-implemented prospective search by using a "repeated retrospective search," which simply re-executes a retrospective search on a regular basis and then filters out already seen entries.

A "search feed subscription," as you use the term, is implemented using a repeated retrospective search. Typically, an RSS aggregator will "poll" for new results every hour or so. If the URL being repeatedly polled translates to a retrospective search query, this is a "repeated retrospective search." Alternatively, the backend service might repeat retrospective searches on a regular basis (hourly, daily, etc.) and then present results that an aggregator can collect.
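As a rough illustration of the "repeated retrospective search" idea, here is a minimal Python sketch. The `search_index` function and the in-memory `index` list are hypothetical stand-ins for a real search backend; only the re-run-and-filter pattern comes from the description above.

```python
from typing import Callable, Iterable, List, Set


def make_repeated_search(
    search: Callable[[str], Iterable[str]], query: str
) -> Callable[[], List[str]]:
    """Wrap a retrospective search so each poll returns only unseen results."""
    seen: Set[str] = set()

    def poll() -> List[str]:
        fresh = []
        for item in search(query):   # re-execute the full retrospective query
            if item not in seen:     # filter out entries already delivered
                seen.add(item)
                fresh.append(item)
        return fresh

    return poll


# Hypothetical in-memory "index" standing in for a real search engine.
index: List[str] = ["post-1 pubsub", "post-2 blogs"]


def search_index(query: str) -> List[str]:
    """Return every indexed entry containing the query word."""
    return [doc for doc in index if query in doc.split()]


poll = make_repeated_search(search_index, "pubsub")
first = poll()                   # everything that matches right now
index.append("post-3 pubsub")    # a new item is published between polls
second = poll()                  # only the entry added since the last poll
```

Note that nothing reaches the subscriber until the next call to `poll()`, which is where the latency discussed next comes from.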

A key determinant of the quality of a prospective search is the latency between when an item is published or discovered by the service and when it is made available or delivered to a user. In a true prospective system, that latency can be reduced to seconds or milliseconds. However, if repeated retrospective searches are being used, the maximum delay will be at least the polling interval, and the average delay will be about half the polling interval (assuming items are published at a roughly even rate). Thus, if you are repeating the search every hour, the maximum latency would be one hour and the average latency would be half an hour, which is much slower than the seconds it might take a real prospective search system to deliver results.
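The arithmetic can be checked with a small Python sketch. It assumes polls land exactly on multiples of the interval and that items are published evenly through the hour; it ignores any indexing delay inside the search engine itself, so real figures would be somewhat worse.

```python
import math


def polling_delay(publish_time: float, interval: float) -> float:
    """Seconds from publication until the next poll (polls at multiples of interval)."""
    next_poll = math.ceil(publish_time / interval) * interval
    return next_poll - publish_time


INTERVAL = 3600.0  # hourly polling, in seconds

# One item published at the midpoint of every second within the hour.
delays = [polling_delay(t + 0.5, INTERVAL) for t in range(3600)]

average = sum(delays) / len(delays)  # 1800.0 seconds, i.e. half the interval
worst = max(delays)                  # 3599.5 seconds, approaching the full interval
```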

Because a true prospective system is able to discover matches at any time and because it is continuously active, it becomes possible for such a system to do things like push out email notifications, XMPP IM messages, and otherwise alert the user to newly discovered results. On the other hand, a repeated retrospective system is, relative to any specific user's needs, largely passive during the inter-polling interval. The implication is that true prospective technology is able to push out notifications with much lower latency than the notifications that will be picked up by a repeated retrospective system.

It should also be noted that a true prospective system can often provide much greater expressiveness than a retrospective engine can. For instance, at PubSub, we implement proximity searches ("foo /near bar") which are relatively cheap for us since we only need to "index" one document at a time. For a retrospective system to implement proximity searches typically means a dramatic increase in document size and significant performance problems.
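To show why proximity is cheap when only one document is examined at a time, here is a hedged Python sketch. The `near` function and its `window` parameter are assumptions made for illustration; the actual semantics of PubSub's `/near` operator are not given here.

```python
import re


def near(text: str, a: str, b: str, window: int = 5) -> bool:
    """True if words a and b occur within `window` token positions of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, tok in enumerate(tokens) if tok == a.lower()]
    pos_b = [i for i, tok in enumerate(tokens) if tok == b.lower()]
    # Positions only need to live for the duration of this one document,
    # so no persistent positional index is required.
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


close = near("foo is right next to bar", "foo", "bar")   # 5 tokens apart
far = near("foo " + "x " * 10 + "bar", "foo", "bar")     # 11 tokens apart
```

The token positions are computed and then discarded per document, whereas a retrospective engine would generally need to store positions for every term in every indexed document to answer the same query.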

In general, true prospective systems scale better and perform better than repeated retrospective systems since processing a prospective query is much less "expensive" than a retrospective query. A prospective query will typically require less CPU, less memory and fewer disk accesses than a retrospective query. The result is great hardware savings and the ability to scale more rapidly and smoothly.
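One way to picture the inversion is to index the stored queries rather than the documents, and match each new item against them as it arrives. The sketch below is an illustrative assumption, not PubSub's actual algorithm; the `SubscriptionMatcher` class and its all-terms-must-appear query model are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class SubscriptionMatcher:
    """Match each incoming document against stored all-terms-required queries."""

    def __init__(self) -> None:
        self.terms_by_sub: Dict[str, Set[str]] = {}
        self.subs_by_term: Dict[str, Set[str]] = defaultdict(set)

    def subscribe(self, sub_id: str, terms: List[str]) -> None:
        wanted = {t.lower() for t in terms}
        self.terms_by_sub[sub_id] = wanted
        for term in wanted:
            self.subs_by_term[term].add(sub_id)  # index the query, not the document

    def match(self, text: str) -> List[str]:
        words = set(text.lower().split())
        candidates: Set[str] = set()
        for word in words:  # only subscriptions sharing a word are considered
            candidates |= self.subs_by_term.get(word, set())
        # Keep subscriptions whose every term appears in this document.
        return sorted(s for s in candidates if self.terms_by_sub[s] <= words)


matcher = SubscriptionMatcher()
matcher.subscribe("cindy", ["pubsub"])
matcher.subscribe("matt", ["prospective", "search"])
hits = matcher.match("PubSub does prospective search")
```

Each document is touched once, on arrival, and the matching subscribers are known immediately, so a notification could be pushed at that moment rather than waiting for a poll.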

These are just some of the differences between a true prospective implementation and "repeated retrospective" implementations.

bob wyman

Matthew Hurst

Bob - these are great comments, they are appreciated. I think what I was trying to understand was not the implementation differences but the user experience differences. From the above, I think the major claim is that a system that is designed from the bottom up to do prospective search can be built to respond far more efficiently than one which adds the functionality onto an existing, index-based search engine.

In other words, you see huge value in reacting as swiftly as possible to information publication. Minutes and seconds make a big difference.

