In systems that execute inferences via a pipeline of steps, every step is an opportunity for failure. Therefore, it is imperative that implementers focus attention on the details of every step. For example, in text mining, systems have to
- Import and parse documents – did you get the title? did you recognize the footers? did you strip out the page numbers?
- Identify sentences and words – is the document in a latin alphabet language? are there word separations? are you dealing with acronyms? how is your unicode-fu?
- Provide part of speech tags for the words – is the text an example of the type of data that the POS tagger trained on?
- Identify entities – are you prepared to identify unusual names like Barack Obama?
I’ve seen a couple of attention to details bugs surface in the past few hours. The first was reported by Danny Sullivan, in which Google (and this is still the case at the time of writing) thinks that Michael Jackson the writer is the most salient person with that name.
The second is visible on WeSmirch, in which the system fails to identify the title of Lisa Marie Presley’s blog, naming it ‘Create Free Blogs & Online Journals on MySpace Blogs’:
Attention to detail will always be a killer feature!