After cleaning my desk today, I found some tutorial notes from a talk given some years ago by Jan Pedersen and Knuth Magne Risvik from Yahoo! Inc. The tutorial covered only the basics of Web search, but in it I found a very interesting idea for doing IR evaluation with real search engine users. Coincidentally enough, I've recently been discussing similar ideas with Michael Stack from the
Internet Archive, and I'm so excited about this that I think I'm actually going to try building a small testing prototype that implements it (I'm guessing a few lines of JavaScript/HTML, using IFRAMEs and the prototype/script.aculo.us libraries, should be enough). Here's the general evaluation methodology, simply called A/B surveys:
Strategy : Compare two IR systems (e.g. systems using different ranking formulas, two clustering systems using different algorithms or anything else you can think of) by asking panelists to compare pairs of search results and assign scores.
Queries : Users can select from a list of possible queries. The study can, for instance, reuse queries from WebCLEF or the TREC Web/HARD tracks.
URLs : The interface can show the first page of results for both systems being compared.
Voting : Users can assign votes using a 5-point (or 3-point, I don't know which one is better) scale. Each user ideally rates 6 different queries, one of them being a control query containing random URLs. The system should reject voters who take less than 10 seconds to vote (see the sketch right after this list).
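Here's a minimal sketch of the voting page I have in mind, in plain HTML/JavaScript for now (no prototype/script.aculo.us calls yet; the two result page URLs and the scale labels are just placeholders I made up):

    <!-- Two result pages shown side by side in IFRAMEs, a 5-point scale,
         and a timer that rejects votes cast in under 10 seconds.
         The src URLs and the labels are placeholders. -->
    <iframe src="results-system-a.html" width="49%" height="400"></iframe>
    <iframe src="results-system-b.html" width="49%" height="400"></iframe>

    <form id="voteForm">
      <label><input type="radio" name="score" value="1"> A much better</label>
      <label><input type="radio" name="score" value="2"> A slightly better</label>
      <label><input type="radio" name="score" value="3"> No difference</label>
      <label><input type="radio" name="score" value="4"> B slightly better</label>
      <label><input type="radio" name="score" value="5"> B much better</label>
      <input type="submit" value="Vote">
    </form>

    <script type="text/javascript">
      var shownAt = new Date().getTime();  // when this query pair was displayed

      document.getElementById('voteForm').onsubmit = function () {
        var elapsed = (new Date().getTime() - shownAt) / 1000;
        if (elapsed < 10) {                // too fast: probably didn't look at the results
          alert('Please spend a little time on both result pages before voting.');
          return false;                    // cancel the submit
        }
        return true;                       // the score and query id would be POSTed to the server here
      };
    </script>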
As opposed to more traditional ways of involving users in IR evaluation, this approach can actually produce interesting quantitative results (e.g. besides testing whether one system outvotes the other, we can also check whether the difference is statistically significant), and it also has the advantage that, since humans are involved, visual relevance plays its role in the evaluation.
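For instance, if we just count how many voters preferred one system over the other (ignoring ties), a plain two-sided sign test would do; a quick sketch in JavaScript, with made-up counts in the usage comment:

    // Two-sided sign test: winsA voters preferred system A, winsB preferred B
    // (ties are ignored). Returns the p-value under the null hypothesis that
    // both systems are equally likely to be preferred.
    function signTestPValue(winsA, winsB) {
      var n = winsA + winsB;
      var k = Math.min(winsA, winsB);
      var p = 0;
      for (var i = 0; i <= k; i++) {       // cumulative binomial tail, p = 0.5
        p += binomial(n, i) * Math.pow(0.5, n);
      }
      return Math.min(1, 2 * p);           // two-sided
    }

    // binomial coefficient C(n, k); fine in double precision for survey-sized n
    function binomial(n, k) {
      var c = 1;
      for (var i = 1; i <= k; i++) {
        c *= (n - k + i) / i;
      }
      return c;
    }

    // e.g. if 70 voters preferred A and 40 preferred B:
    // signTestPValue(70, 40) comes out well below 0.05, so the preference
    // for A would be statistically significant.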
I also hope this methodology can be useful in evaluating search result clustering, which is much more difficult than evaluating ad-hoc search.