Aggregating and analysing posts from user generated media (blogs, boards, Usenet, etc.) about products is what Intelliseek, BuzzMetrics and their peers do for a living. This level of analysis is not currently available in consumer facing products other than search (blog search, board search, Usenet search) and some specialised search applications (trend search and the simple sentiment search of OpinMind). The content universes for enterprise products in this space generally include blogs, message boards and Usenet, and can extend to review sites and other genres.
On the other hand, there are social content hosting sites, such as riffs.com and Amazon, which support research on products by aggregating content found only on their own site - what we might call captured content search. These are useful to a degree. However, to get a fuller picture - one might say a more accurate picture - of what people think about products, or any topic for that matter, a solution is required that can span multiple content types.
When considering user generated media, we are forced to consider the boundary between this type of content and professional content - a boundary which has blurred considerably in the last year and will continue to do so. In other words, there is a problem in deciding when, and how, to include opinion from the professional end of the spectrum. For example, how does one weigh a movie review from a blog, one from Rotten Tomatoes, and one from, say, Mr. Ebert? An additional issue when providing consumer facing interfaces for this type of data is capturing the product being researched and finding those posts that refer to it. Captured content, especially on sites like Amazon, may well have the advantage of a complete product taxonomy which can be searched against, rather than searching the text of the reviews themselves. In other cases, the user needs help retrieving results for all the variations of expression used to refer to the product - a sketch of one approach follows.
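To make that concrete, here is a minimal Python sketch of query expansion over product name variants. The variant table and function name are hypothetical - a real system would mine variants from the posts themselves rather than hand-code them.

```python
# Minimal sketch: expand a product query into an OR over known name
# variants. The VARIANTS table is a hand-built, hypothetical stand-in
# for variants mined from real posts.
VARIANTS = {
    "ipod nano": ['"ipod nano"', '"i-pod nano"', '"apple nano"'],
}

def expand_query(product: str) -> str:
    """Turn a canonical product name into an OR query over known variants."""
    terms = VARIANTS.get(product.lower(), [f'"{product}"'])
    return " OR ".join(terms)

print(expand_query("iPod Nano"))
# "ipod nano" OR "i-pod nano" OR "apple nano"
```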
In this context, it is interesting to note two developments. First is Google's review search, which Kevin pointed to. This appears to aggregate reviews for movies from a number of sources (and types of sources). It also provides access, in a faceted search style of interface, to positive, negative and neutral reviews (based on segmentation via numeric review scores), and it surfaces common phrases mined from the reviews. This is pretty cool, and brings features found in standard enterprise offerings to the consumer. Note that BlogPulse has had consumer facing features along similar lines for over a year - for example blog bites, which both extracts common terms and finds snippets containing terms with significant relationships. A sketch of the score-based segmentation and phrase mining is below.
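As an illustration, here is a minimal Python sketch of the two ideas: bucketing reviews into sentiment facets via score thresholds, and mining common phrases as frequent bigrams. The thresholds and the bigram approach are assumptions for illustration, not Google's actual method.

```python
# Sketch: score-based sentiment segmentation plus common-phrase mining.
# Reviews are assumed to arrive as (score out of 10, text) pairs; the
# thresholds below are illustrative, not Google's actual cut-offs.
from collections import Counter

reviews = [
    (9, "great acting and a great script"),
    (2, "weak plot and weak acting"),
    (5, "decent film nothing special"),
]

def bucket(score: int) -> str:
    """Map a numeric review score to a sentiment facet."""
    if score >= 7:
        return "positive"
    if score <= 4:
        return "negative"
    return "neutral"

facets: dict[str, list[str]] = {"positive": [], "negative": [], "neutral": []}
for score, text in reviews:
    facets[bucket(score)].append(text)

# Common phrases per facet, approximated as the most frequent word bigrams.
for facet, texts in facets.items():
    bigrams = Counter(
        f"{a} {b}" for t in texts for a, b in zip(t.split(), t.split()[1:])
    )
    print(facet, bigrams.most_common(2))
```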
I'll post a deeper dive on Google's review stuff later.
The second new consumer facing user generated media interface is reported by Smart Mobs and TechCrunch, and appears to be a service that will let users scan a product's bar code with their phones and then access user generated commentary drawn from blogs. A hypothetical sketch of that flow is below.
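Here is one way such a service might be wired together: resolve the scanned bar code to a product name, then hand that name to a blog search. Both functions are invented stand-ins; no real product database or blog search API is implied.

```python
# Hypothetical sketch of the bar-code-to-commentary flow. Both functions
# are invented stand-ins for a product database and a blog search API.
def lookup_product(barcode: str) -> str:
    """Resolve a UPC/EAN code to a product name (stubbed catalogue)."""
    catalogue = {"0123456789012": "Acme Widget"}
    return catalogue.get(barcode, "unknown product")

def search_blogs(query: str) -> list[str]:
    """Stub for a call out to a blog search engine."""
    return [f"blog post mentioning {query!r}"]

product = lookup_product("0123456789012")
for post in search_blogs(product):
    print(post)
```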
Is 2006 going to be the year of consumer facing analytical tools over the user generated media space? And how will the enterprise players in this space respond?
Sifry reports that spam is at 5% of post volume; BlogPulse (credit to Natalie and Robert, via Pete's ClickZ column) reports 30%. The splog issue is getting more and more visible. I'd like to know why these numbers are so different. One way to get at this would be to report the percentage of spam per host - at least for BlogSpot, LiveJournal, Xanga, TypePad, Spaces and AOL. These stats, together with a breakdown of how much data comes from each of these sources, would give everyone an idea of where the problems are, as well as throw some light on what is going to be an index size claims race amongst blog search engines.
Google has already been outed as a source of spam in many places; I suspect that some of the hosts that contribute major volumes of blog data have relatively low spam percentages (I'm thinking of LiveJournal and Spaces). A minimal sketch of the per-host breakdown is below.
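For concreteness, a minimal Python sketch of the per-host breakdown, assuming posts have already been labelled spam or not by some upstream classifier. The hosts and counts here are made up for illustration.

```python
# Sketch: per-host spam percentage and volume share, assuming each post
# is already labelled spam/ham upstream. The data is invented.
from collections import defaultdict

posts = [
    ("blogspot.com", True), ("blogspot.com", False), ("blogspot.com", False),
    ("livejournal.com", False), ("livejournal.com", False),
    ("spaces.msn.com", False),
]

totals: dict[str, int] = defaultdict(int)
spam: dict[str, int] = defaultdict(int)
for host, is_spam in posts:
    totals[host] += 1
    spam[host] += int(is_spam)

grand_total = sum(totals.values())
for host in sorted(totals):
    share = totals[host] / grand_total      # host's share of all posts
    spam_pct = spam[host] / totals[host]    # spam rate within the host
    print(f"{host}: {share:.0%} of volume, {spam_pct:.0%} spam")
```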
One can think of spam as a complex ecology involving:
- people wanting to make a quick buck
- people who actually buy products in part due to spam
- web site genres
The last element - types of web site - turns out to be key. We have, it appears, become very sensitive to blog spam in part because of the way search results are presented in blog space: along the time axis. Matt Siegler - a colleague of mine, and a master in the art of message board discovery - has handed me a list of spam message boards. The use of boards for spam has a somewhat different ecosystem dynamic, as boards don't appear in search engines with time as a major ranking axis. That is to say, they are hidden to some extent in the static rankings of major search engines like Google.
We are sensitive to blog spam because it appears in our search results; we are not sensitive to spam of other sorts because we don't search in a time dependent manner for those genres of web site. However, the effect in terms of boosting the rank of the pages these sites link to is approximately the same.
Perhaps when we have more board specific search engines, like BoardTracker, we will start to throw up our hands about this longer lived but well hidden pocket of spam. First dibs on 'sboards'!
My colleague Matt Siegler just posted on estimating the number of readers for a thread on a message board. This is great. He's been doing some pretty cool stuff recently on readership for online personal media (while everyone else out there has been looking at inlinks for blogs!). I'm not going to repost Matt's findings or the infoporn - go and read it from the source.