You may have noticed a list of book titles in the left-hand margin of this blog. These titles are being pulled in from my book blog on Vox. I've also started another blog - a WordPress blog - which I'm using to take notes on papers that I'm reading. Normally that feed would also appear somewhere on Data Mining, but TypePad only allows a single feed inclusion of this type at the basic rate.
Once I have a bit more content in the paper blog I'll place a link somewhere.
I think it's time there was a better general understanding of the shaky foundations of the blogosphere. Kevin Burton has been getting more and more upset with blog search. While pings have the potential to make the web far more efficient, the truth of the matter is that, like any other open 'standard', they are just another vehicle for systematic abuse by spammers. On top of this, there is a pretty hefty chunk of legitimate use which is simply poorly executed (incorrect dates, lack of synchronization with the feed, etc.).
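For the unfamiliar: a ping is nothing more than a tiny XML-RPC call that a blog platform fires at a ping server when a post is published. A minimal sketch in Python (the endpoint and blog details here are illustrative):

```python
import xmlrpc.client

# The standard weblogUpdates.ping call most blog platforms send on publish.
# Endpoint and blog details are illustrative.
server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")
response = server.weblogUpdates.ping(
    "Data Mining",                     # blog name
    "http://datamining.typepad.com/",  # blog URL
)
print(response)  # typically a struct like {'flerror': False, 'message': '...'}
```

Note that nothing in the call authenticates the sender, which is precisely why it is so easy to abuse.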
When one considers the expectations for real-time search in the blogosphere, periodic crawling (e.g. crawling every t minutes, which is what I believe both TailRank and Techmeme do - please correct me if this is wrong) is not a scalable solution. Leaving aside the issue of crawling 10s of millions of pages multiple times a day, it also fails to provide an adequate discovery mechanism for new blogs.
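To put rough numbers on the crawling claim (my own illustrative figures, not TailRank's or Techmeme's):

```python
# Back-of-envelope crawl load for periodic recrawling at blogosphere scale.
blogs = 10_000_000      # "10s of millions of pages"
crawls_per_day = 4      # "multiple times a day"
seconds_per_day = 86_400

rate = blogs * crawls_per_day / seconds_per_day
print(f"{rate:.0f} fetches/sec, sustained, around the clock")  # ~463
```

And that is before counting new blogs, which a crawler can only discover by stumbling across links to them.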
While memetrackers are crawling something like 10s-100s of thousands of blogs, I don't believe they could cope with two orders of magnitude more scale. So once one accepts the ping ecology, one has to deal with spam. And spam-filtering algorithms have, as a key parameter, the time available to make a decision. If there is a need to decide quickly - to deliver some sort of real-time system - then that limits the evidence one can use to make that decision.
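Here is a toy sketch of that trade-off (everything below is invented for illustration; it is no memetracker's actual filter):

```python
import time

def cheap_score(url: str) -> float:
    """Evidence available instantly from the ping itself (toy heuristic)."""
    return 0.8 if "casino" in url else 0.1

def expensive_score(url: str) -> float:
    """Evidence that requires fetching and analysing the page (stubbed)."""
    time.sleep(0.5)  # stands in for a network fetch plus content analysis
    return 0.1

def classify_ping(url: str, deadline_seconds: float) -> str:
    """The tighter the real-time deadline, the less evidence we can use."""
    score = cheap_score(url)
    if deadline_seconds >= 1.0:  # only fetch the page if we can afford the wait
        score = (score + expensive_score(url)) / 2
    return "spam" if score > 0.5 else "ham"

print(classify_ping("http://casino-pills.example/", 0.1))  # "spam", on the ping alone
print(classify_ping("http://casino-pills.example/", 2.0))  # "ham": verdict flips with page evidence
```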
What I'm getting at here, possibly in a roundabout manner, is that things are a mess. One can have high quality - as the memetrackers give us - but limited coverage and discoverability; or one can have near real-time and better discovery (coverage) with the danger of spam. Getting high coverage, a quick mean time to index and no spam is hard - perhaps the real problem is that we have transferred mainstream web expectations (where sandboxing is appropriate due to a different view of real time) to the blogosphere.
While some have been musing about the M&A window for Technorati, GigaOm reports that the company's partnership with Edelman, which would have had a major focus on international markets for both companies, has failed. Rubel gives a little more detail:
Work on the Asian language sites - Korean and Chinese - has ceased. In China there are access issues and Korea data quality is less than desirable because most blog platforms don't ping. That's the nature of...
It is interesting that he cites the ping issue. The blogosphere is built on very unreliable foundations (from the POV of data acquisition). This is not a secret, and I'm surprised that this was discovered during the relationship rather than being a known issue going in. Interestingly, this is not an issue for lots of other social media. As I've discussed many times on this blog, there is a lot more data in non-blog form that is highly relevant for those wishing to monitor the space of online conversation and opinion.
I've not mentioned Klostu for a while. But yesterday I saw TechCrunch talking about ProfileLinker, and today I read the following prediction on Wired News:
MySpace Spaces Out
MySpace splinters as teens head for niche sites. New services that control profiles across multiple social networking sites begin to take...
There are two approaches in play here. Klostu creates a profile and allows you to use it on different message board sites. ProfileLinker aggregates your existing profiles from different social network sites. Same but different. Identity online is going to be a major part of the web going forward, and it has the potential to challenge all the sites that rely on lock-in to retain users. Perhaps identity platforms like those above will be the real winners.
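To make the contrast concrete, here is a toy data model of the two shapes (purely illustrative; it reflects neither company's actual design):

```python
from dataclasses import dataclass, field

@dataclass
class PortableProfile:
    """Klostu-style: one canonical profile carried across many sites."""
    handle: str
    used_on: list[str] = field(default_factory=list)

@dataclass
class LinkedProfiles:
    """ProfileLinker-style: existing profiles gathered under one roof."""
    profiles: dict[str, str] = field(default_factory=dict)  # site -> profile URL

me = PortableProfile("matt", used_on=["board-a", "board-b"])
also_me = LinkedProfiles({"myspace": "http://myspace.com/matt-example"})
```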
Phrasetrain is a small new technology company in Seattle, Washington. Our vision is to draw on the power of peer production to create a new kind of natural language technology, and to use that technology to improve text search on the web.

Some natural language companies claim to create artificial intelligence. We make no such claim. One of our core principles is that the genius of language is in people, not machines. We want to aggregate the linguistic intelligence of our users in a way that benefits all of them.
If I understand their website correctly, Phrasetrain is more about mapping the terms used to reference the world (e.g. entities, noun phrases, etc.) to a logical space. In other words, they are talking about interpreting the text we use to mention things and providing an unambiguous logical reference for everything. This is in contrast to other applications of NLP, which are more or less about parsing. Of course, I could be totally wrong on this.
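If that reading is right, the core operation would look something like the toy below - mapping surface forms to canonical references, with context used to disambiguate. (This is my illustration, not anything from Phrasetrain's site.)

```python
# Toy lexicon: surface forms -> unambiguous logical references.
LEXICON = {
    "the big apple": "ref:New_York_City",
    "new york": "ref:New_York_City",
    "apple": "ref:Apple_Computer",  # ambiguous without context
}

def link_mention(mention: str, context: str = "") -> str:
    """Map a noun phrase to a single logical reference."""
    key = mention.lower().strip()
    if key == "apple" and "fruit" in context.lower():
        return "ref:Apple_(fruit)"  # crude context-based disambiguation
    return LEXICON.get(key, "ref:UNKNOWN")

print(link_mention("The Big Apple"))             # ref:New_York_City
print(link_mention("apple", "a bowl of fruit"))  # ref:Apple_(fruit)
```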
Things can get pretty silly in the blogosphere, especially at the end of the year. Witness Steve Rubel's claim that there isn't any value in the term 'Social Media'. This is nonsense on so many levels - where to start? Firstly, it ignores legacy media - did my DVD of Star Wars suddenly become a piece of social media? Secondly, the head-in-the-sand view that many bloggers have is American-centric and ignorant of the many other channels that are not used by (or not accessible to) bloggers. Recall that only 17% of the planet has access to the internet, for example. Thirdly, whenever something is broadcast, or recorded and released later, it fails to qualify as social media, as there is no backchannel. Ever tried having a relationship with a prime-time newscaster, or how about a movie? Good luck with that. Finally, and possibly most importantly, what many bloggers fail to notice is that the most successful people in the blogosphere are, in some sense, moving towards mainstream models - mainstream media may be getting closer to us, but we are also getting closer to them.
To take Steve seriously, I would have to assume that he is talking about some subset of media, though which subset I'm not sure.
LeeAnn Prescott, over at Hitwise, posts about blog search market share - the takeaway is that Google's blog search is beating Technorati.
Readers of this blog may recall a post I made back on October 23rd which delivered similar information. The graph below, which I've copied from that post, shows just the same jump in traffic from Google's blog search as the Hitwise graph above. It is important to note the differences, however. I was looking at actions taken as a result of searching - at how people arrived at a blog. I believe the Hitwise data looks at visitors to the search sites themselves, and search is only one thing that can be performed on Technorati. To take an extreme example, if Technorati's blog were hugely popular and no one were using Technorati for search, Hitwise's results wouldn't show this (at least, that is how I interpret their description of the data using the 'www.technorati.com' label).
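The difference is easy to see in code. My October numbers came from classifying the referrers in a blog's access log, roughly as below (the hostnames are real; the rest is a sketch):

```python
from urllib.parse import urlparse

# Map referrer hosts to the search site that sent the visitor.
SEARCH_SITES = {
    "blogsearch.google.com": "Google Blog Search",
    "www.technorati.com": "Technorati",
    "technorati.com": "Technorati",
}

def arrived_from(referrer: str) -> str:
    """Which search site, if any, delivered this visitor to the blog?"""
    return SEARCH_SITES.get(urlparse(referrer).netloc.lower(), "other")

print(arrived_from("http://blogsearch.google.com/blogsearch?q=data+mining"))
# -> Google Blog Search
```

Hitwise, by contrast, is counting traffic to the search sites themselves, whatever the visitors do once they get there.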
The Hitwise post has some further interesting details regarding the demographic breakdown of users.
Discussion over at Niall's blog revolves around Technorati's reliability issues driving people to Google's solution. Interestingly, this comment over on Search Engine Land claims the same problems for Google's blog search.
With all the talk of late regarding the reliability of visitor/impression data and how things have to change, I'm disappointed to see that there has been no fact-checking of the original Hitwise post - every post I've seen thus far on the topic simply quotes LeeAnn, with no discussion other than guesses at the reasons for the changes in traffic.
I've posted before on technologies available right now that are better than existing consumer (that is to say, free) geographic visualization tools/toys. Here is something similar for 3D buildings. This project from Edinburgh Uni's EPCC (where I was an intern many years ago) is a demonstration of high-detail 3D modeling for town planning. There are also movies on the site.
Note that this is non-Google data imported into Google Earth, which is essentially being used as a rendering engine.
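Concretely, anyone can hand Google Earth a KML file of their own data and let it do the drawing. A minimal sketch (the coordinates are illustrative):

```python
# Write a minimal KML file; opening it in Google Earth renders the data.
kml = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Edinburgh city centre (illustrative)</name>
    <Point><coordinates>-3.1883,55.9533,0</coordinates></Point>
  </Placemark>
</kml>"""

with open("my-data.kml", "w") as f:
    f.write(kml)
```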
One of the big issues I see looming on the horizon is how data like this is distributed and discovered. Should it be consolidated and licensed (as in the Google Earth model), should it be made a specialized commodity that people have to download onto their desktops to use, or should it be distributed and automatically (and opportunistically) integrated into data-smart clients? This seems like a data web issue to me.
John Battelle riffs about an interface to models and information about the human body. We are all used to rich interfaces to geographic information now, thanks to Google Earth and MSFT's Virtual Earth. However, I'm still amazed, flabbergasted even, that our web browsers don't understand documents. This is partly the fault of HTML, which is too limited and has to be hijacked if one is to present a rich document experience akin to the one we see encoded in PDF files. I can understand that there is a desire to leverage a simple information asymmetry - I mean, if we knew which parts were adverts, which were navigation and which were content, then perhaps we would do something about what we see. That being said, the fact that one can't work with tables, lists, headlines, etc. in a browser - e.g. select parts of the document with a simple cut-and-paste action - is beyond me.
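The maddening thing is that the structure is sitting right there in the markup - it is only the browser's UI that refuses to expose it. A few lines of Python can pull the text out of every table cell:

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect the text of every table cell in an HTML document."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed("<table><tr><th>Engine</th><th>Share</th></tr>"
            "<tr><td>Google</td><td>60%</td></tr></table>")
print(parser.cells)  # ['Engine', 'Share', 'Google', '60%']
```

If a fifteen-line script can do it, a browser certainly could.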
Sometimes I wish the web would freeze for a while so that we could catch up to its potential...