August 15, 2007

Rexer Analytics Data Mining Survey

I'm at KDD (Knowledge Discovery and Data Mining) right now and have just sat down for a chat with Karl Rexer, so it seems fitting to post the summary Karl shared of his recent data mining survey:

2007 HIGHLIGHTS:

·   27-item survey of data miners, conducted on-line in early 2007

·   314 responses from individuals in 35 countries

·   Regression, decision trees and cluster analysis were the most commonly used algorithms (mean number of algorithms used: 6.8)

·   Top challenges data miners report are dirty data, data access, and explaining data mining to others

·   SPSS, SPSS Clementine, and SAS are the three most frequently utilized tools (mean number of tools used: 4.5)

·   There is increasing interest in the Oracle Data Mining tool, and decreasing interest in C4.5/C5.0/See5   

·   The primary factors data miners consider when selecting an analytic tool are: 1) the dependability and stability of software, 2) the ability to handle large data sets, and 3) data manipulation capabilities

·   The findings vary somewhat depending on the domain in which the data miner works, the tools used, geography, and several other dimensions

August 01, 2007

KDD 2007 Programme

Briefly, the KDD 2007 Programme is now available - looks very good.

July 24, 2007

Mining Search Queries for Business Intelligence

Imagine you see a referrer on your site which comes from a Google search:

company negative positive sentiments news aggregator former google

What might we infer about the goals of the searcher? And what might we infer about the knowledge that the searcher is acting on? Could we conclude that the searcher knows that some former Google employees have started, or are about to start, a new enterprise which will be analysing the sentiment in or around news articles?

The searcher is located in New York (where Google has a large presence).

This is just one data point - and my assumptions about the intention of the searcher are probably a little out there. But the information hidden in search queries is certainly attractive. It is not clear what latitude Google and other search engines have in mining data from one service/application for exploitation in another. It is one thing to mine queries to improve your search experience; it is another to mine them to provide other services (e.g. business intelligence information based on signals found in queries).
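For anyone who wants to try this on their own referrer logs, here is a minimal sketch of pulling the search terms out of a referrer URL. It assumes the search engine carries the query in a `q` parameter; the function name `search_terms_from_referrer` and the example URL are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

def search_terms_from_referrer(referrer):
    """Return the search terms found in a referrer URL, or [] if there are none."""
    parsed = urlparse(referrer)
    # Assumption: the query string carries the search in a "q" parameter.
    values = parse_qs(parsed.query).get("q", [])
    if not values:
        return []
    # parse_qs has already decoded "+" and %xx escapes, so split on whitespace.
    return values[0].split()

# Hypothetical referrer built around the query quoted above.
example = ("http://www.google.com/search"
           "?q=company+negative+positive+sentiments+news+aggregator+former+google")
terms = search_terms_from_referrer(example)
print(terms)
```

The terms are only the raw material; any inference about intent is still a guess layered on top.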

July 18, 2007

KDD 2007 Programme Available

Briefly, the KDD 2007 Programme is now available.

July 07, 2007

KDD 2007

I'll be attending KDD 2007. There are some really interesting workshops associated with this year's conference. Two in particular:

I'm registered for the first one, but may also sneak a peek at the second.

April 30, 2007

Rexer Analytics Data Mining Survey

Karl Rexer, founder of Rexer Analytics, asked me to pass along this link to a survey he is running on data mining. The results will be made publicly available.

December 13, 2006

20,000,000 Data Points Going Free

A few searches on Google suggest that there are something like 2MM spreadsheets indexed on the web (found using the filetype operator and some numbers). We could conservatively estimate that each contains 10 data points, which gives something like 20,000,000 data points in total. Swivel slurps spreadsheet data (actually, CSV formatted). Where do we go from here?

December 06, 2006

Netflix Prize

First, a reminder - Netflix is offering $1MM to any team that delivers an algorithm that outperforms its current recommendation system (that is to say, its ability to predict how a subscriber would rate a movie) by 10%. Here are the stats taken from the leader board page for the Grand Prize:

There are currently 16660 contestants on 13466 teams from 124 different countries.
We have received 3059 valid submissions from 1001 different teams; 74 submissions in the last 24 hours.

The leader board shows the latest submission for each team. If we take these submissions as being from a single team and plot the results (the RMSE) over time, we get the following graph:

[Graph: Leaderboard]

Remember, we are looking at RMSE, so the lower the value the better. Of course, what we'd like to do is track the results for each team over time. So how can we interpret this graph? Firstly, the older points represent teams that have probably dropped out. Secondly, the competition is still pretty heated, with 6 teams whose latest submissions carry today's time stamp (note that, as there have been 74 submissions in the last 24 hours, it is likely that teams are submitting many results per day). Thirdly, there does seem to be a trend pushing lower and lower. Will the competition be over before the year is out?
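For reference, here is a minimal sketch of how RMSE and a relative improvement over a baseline can be computed. The ratings below are toy numbers invented for illustration, not Netflix data, and the helper names `rmse` and `relative_improvement` are my own.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_improvement(baseline_rmse, candidate_rmse):
    """Fraction by which the candidate lowers the baseline RMSE (0.10 means 10%)."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse

# Toy ratings, purely hypothetical.
actual = [4, 3, 5, 2]
baseline_predictions = [3, 3, 4, 3]
candidate_predictions = [4, 3, 4, 2]

baseline = rmse(baseline_predictions, actual)
candidate = rmse(candidate_predictions, actual)
print(f"baseline RMSE {baseline:.4f}, candidate RMSE {candidate:.4f}, "
      f"improvement {relative_improvement(baseline, candidate):.1%}")
```

Because the error is squared before averaging, a handful of badly missed ratings hurts the score more than many small misses.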

Swivel Is Live

I just read that Swivel is up and live (TechCrunch). I have a pretty crushed day today so won't be able to do a deep dive. However, in looking around I did spot something of interest (that is to say, if you are obsessed with tabular data processing). Although Swivel indicates where you can get data to upload, it currently doesn't have the ability to slurp tables in raw HTML format.

Swivel is not yet smart enough to automatically extract tables from web pages. Getting data from web pages into Swivel is tricky, because of the wild variety of formats and structures of data in web pages. Follow the instructions below and give it a shot.

I can believe this - it is a hard problem (which is why I spent at least 4 years researching it for my PhD). Swivel peeps may be interested in some of the literature in this area (appearing in a special issue of IJDAR on tables). Here's also an approximate list of papers that I've written/contributed to.
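To give a flavour of the easy end of the problem, here is a minimal sketch that turns a clean, well-formed HTML table into CSV using only the Python standard library. It assumes one cell per `td`/`th`, no nested tables and no row or column spans, which is exactly what real web pages tend not to give you. The names `TableExtractor` and `html_table_to_csv` are my own, not anything Swivel provides.

```python
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the text of td/th cells, grouped by tr, into self.rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row

def html_table_to_csv(html):
    """Return the table cells found in html as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()

# Hypothetical sample table.
sample = ("<table><tr><th>Year</th><th>Sales</th></tr>"
          "<tr><td>2005</td><td>10</td></tr>"
          "<tr><td>2006</td><td>12</td></tr></table>")
csv_text = html_table_to_csv(sample)
print(csv_text)
```

Layout tables, spanning cells, headers buried in the middle of the data and tables that are not marked up as tables at all are where the research problem starts.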

November 21, 2006

Careful With Those Stats, Paul

Paul Kedrosky points to a new feature on Flickr which exposes statistics for metadata associated with members of the service. The graphs below, for example, show popular point-and-shoot cameras and camera phones - each graph actually plots the percentage of members who have uploaded at least one photo to the service using that make of camera.

[Graph: Flickr_3]

These kinds of stats, however, always need to be taken with the right statistical caution. They have to be interpreted literally - as statistics describing the percentage of Flickr members who have uploaded photos which the Flickr system has identified as being taken by a particular camera or other image capture device. With that description, one needs to ask:

  • How accurate is the classification?
  • How does this relate to the total population of camera consumers, of which Flickr is a biased sample?

The first question is a potential problem given that Flickr states:

The graphs are only accurate to the extent that we can automatically detect the camera used to take the photo (about 2/3rds of the time). That is not usually possible with cameraphone photos and cameraphones are therefore under-represented.

Which seems to indicate that recall is about 66% (I'm assuming that a structured field is available and that precision is 100%).
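To make that assumption concrete, here is a minimal sketch of precision and recall on made-up counts: 300 photos, of which the camera is detected correctly for 200, never detected wrongly, and missed for 100. The counts and the function name `precision_recall` are hypothetical, chosen only to match the "about 2/3rds" figure.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for the given detection counts."""
    detected = true_positives + false_positives
    relevant = true_positives + false_negatives
    if detected == 0 or relevant == 0:
        raise ValueError("precision and recall are undefined for these counts")
    return true_positives / detected, true_positives / relevant

# Hypothetical counts: 200 correct detections, 0 wrong ones, 100 misses.
precision, recall = precision_recall(200, 0, 100)
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

If the detection misses are not spread evenly across camera types, as Flickr says is the case for cameraphones, then the recall differs per camera and the reported percentages are skewed accordingly.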
