A/B Test Design for IR Systems
Strategy : Compare two IR systems (e.g. systems using different ranking formulas, clustering systems using different algorithms, or anything else you can think of) by asking panelists to compare pairs of search result lists and assign preference scores.
Queries : Users can select from a list of possible queries. The study can, for instance, reuse queries from WebCLEF or (Web/HARD) TREC.
URLs : The interface can show the first page of results from both systems being compared.
Voting : Users assign votes on a 5-point (or possibly 3-point; it is not yet clear which is better) scale. Each user ideally rates 6 different queries, one of them a control query containing random URLs. The system should reject voters who take less than 10 seconds to vote; a small validation sketch follows.
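A minimal sketch of how such vote screening might look. The Vote structure, field names, and control-query rule are illustrative assumptions; only the 10-second threshold and the rating scale come from the design above.

```python
from dataclasses import dataclass

MIN_SECONDS = 10        # reject votes cast in under 10 seconds (from the design)
SCALE = range(1, 6)     # 5-point scale; could equally be a 3-point one

@dataclass
class Vote:
    query_id: str
    score: int            # preference score on the chosen scale (assumed field)
    seconds_taken: float   # time spent before voting (assumed field)
    is_control: bool       # control query filled with random URLs

def is_valid(vote: Vote) -> bool:
    """Drop hasty votes and votes outside the scale."""
    return vote.seconds_taken >= MIN_SECONDS and vote.score in SCALE

def passes_control(votes: list[Vote]) -> bool:
    """Assumed rule: a panelist who rates the random-URL control query highly is suspect."""
    controls = [v for v in votes if v.is_control]
    return all(v.score <= min(SCALE) + 1 for v in controls)
```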
As opposed to more traditional ways of involving users in IR evaluation, this approach can produce genuinely quantitative results: besides testing whether one system outvotes the other, we can also check whether the difference is statistically significant. It has the further advantage that, since humans are involved, visual relevance plays its role in the evaluation.
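One simple way to check significance, sketched below, is a two-sided sign test on per-query preferences. The counts and the exact test chosen here are illustrative assumptions; any paired test (e.g. Wilcoxon signed-rank) would serve the same purpose.

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test under H0: P(prefer A) = 0.5; ties are dropped."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical example: system A preferred on 41 queries, B on 24, rest tied.
p = sign_test_p_value(41, 24)
print(f"p = {p:.4f}")  # below 0.05 -> the preference is unlikely to be chance
```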
I also hope this methodology can be useful in evaluating search result clustering, which is much more difficult than evaluating ad-hoc search.