
Thursday, May 18, 2006 

A/B Test Design for IR Systems

After cleaning my desk today, I found some tutorial notes from a talk given some years ago by Jan Pedersen and Knuth Magne Risvik from Yahoo! Inc. The tutorial covered only the basics of Web search, but in it I found a very interesting idea for doing IR evaluation with real search engine users. Coincidentally enough, I've recently been discussing similar ideas with Michael Stack from the Internet Archive, and I'm so excited about this that I think I'm actually going to try building a small testing prototype that implements it (I'm guessing a few lines of JavaScript/HTML, using IFRAMEs and the prototype/script.aculo.us libraries, should be enough). Here's the general evaluation methodology, simply called A/B surveys:

Strategy : Compare two IR systems (e.g. systems using different ranking formulas, two clustering systems using different algorithms or anything else you can think of) by asking panelists to compare pairs of search results and assign scores.

Queries : Users can select from a list of possible queries. The study can, for instance, reuse queries from WebCLEF or from the TREC Web/HARD tracks.

URLs : The interface shows the first page of results from both systems being compared.

Voting : Users assign votes on a 5-point (or maybe 3-point; I don't know which one works better) scale. Each user ideally rates 6 different queries, one of them a control query containing random URLs. The system should reject voters who take less than 10 seconds to vote.
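As a rough sketch of the voting logic I have in mind (all function and field names here are illustrative, not part of any existing library; the real prototype would render the two result pages side by side in IFRAMEs):

```javascript
// Minimal sketch of the A/B survey voting logic (hypothetical names).
// Each panelist rates a pair of result pages on a 5-point scale;
// votes cast less than 10 seconds after display are rejected.

const MIN_SECONDS = 10;

function makeSession(queries) {
  return {
    queries,          // e.g. 6 queries, one of them a random-URL control
    startedAt: null,  // timestamp when the current result pair appeared
    votes: [],
  };
}

function showQuery(session, now) {
  // Called when the two result pages are displayed to the panelist.
  session.startedAt = now;
}

function recordVote(session, query, score, now) {
  const elapsedSeconds = (now - session.startedAt) / 1000;
  if (elapsedSeconds < MIN_SECONDS) {
    return { accepted: false, reason: "voted too quickly" };
  }
  if (score < 1 || score > 5) {
    return { accepted: false, reason: "score out of range" };
  }
  session.votes.push({ query, score });
  return { accepted: true };
}
```

The 10-second check is the only quality filter sketched here; a real deployment would probably also compare each panelist's answer on the control query against the expected (low) score for random URLs.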

As opposed to more traditional ways of involving users in IR evaluation, this approach can actually produce interesting quantitative results (e.g., besides testing whether one system outvotes the other, we can also check whether the difference is statistically significant), and it has the added advantage that, since humans are involved, visual relevance plays its role in the evaluation.
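One simple way to run that significance check would be a two-sided sign test over the per-query preferences (ties discarded). A sketch, with 0.5 as the null-hypothesis probability that either system wins a query:

```javascript
// Two-sided sign test: given the number of queries where system A was
// preferred over B, and vice versa, compute the p-value under the null
// hypothesis that each outcome is a fair coin flip.

function binomialPMF(n, k) {
  // C(n, k) * 0.5^n, computed in log space to avoid overflow
  let logC = 0;
  for (let i = 1; i <= k; i++) {
    logC += Math.log(n - k + i) - Math.log(i);
  }
  return Math.exp(logC + n * Math.log(0.5));
}

function signTest(winsA, winsB) {
  const n = winsA + winsB;
  const k = Math.min(winsA, winsB);
  // Two-sided: sum the probabilities of all outcomes at least as
  // extreme as the observed split, in both tails.
  let p = 0;
  for (let i = 0; i <= k; i++) p += binomialPMF(n, i);
  for (let i = n - k; i <= n; i++) p += binomialPMF(n, i);
  return Math.min(1, p); // clamp when the two tails overlap
}
```

For example, if system A wins 15 queries and system B wins 3, `signTest(15, 3)` comes out around 0.0075, so the preference would be significant at the usual 5% level; a 10-to-9 split would not be.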

I also hope this methodology can be useful for evaluating search result clustering, which is much more difficult than evaluating ad-hoc search.
