
February 21, 2009

Double-Blind Reviewing: More Placebo Than Cure-All?

In double-blind reviewing (DBR), both reviewers and authors are unaware of each other's identities and affiliations. DBR is said to increase review fairness. However, DBR may only be marginally effective in combating the randomness of the typical conference review process for highly-selective conferences. DBR may also make it more difficult to adequately review conference submissions that build on earlier work of the authors and have been partially published in workshops ("laddered publications"). I believe that DBR mainly increases the perceived fairness of the reviewing process, but that may be an important benefit. Rather than waiting until the final stages, the reviewing process needs to explicitly address the issue of laddered publications early on. [A version of this article, with citations, will appear in a future issue of CCR.]

In large parts of computer science, conference reviewing has two objectives, namely to select technically sound and interesting work for presentation and, for highly-selective conferences, to nominate the "best" set of papers in a particular discipline. From my personal experience, in many conferences, about 20-30% of the submitted papers address an interesting problem of some importance, appear technically sound (as far as a review of an hour or so can determine without reproducing the simulations, implementation or analysis), contain novel results and are sufficiently well-written. However, many ACM conferences select far fewer papers for presentation, making the review process significantly more difficult and charges of bias, favoritism, cliquishness and groupthink more common.

This problem is not unique to reviewing papers. For example, in our discipline, grant agencies face the same problem of receiving far more qualified proposals than can be funded. (There is a difference in that the conference selection problems are largely self-inflicted, as it is generally far easier to increase the number of accepted papers than to increase the NSF budget.)

As with all such processes, the details of the process remain, by necessity, confidential, so it is necessary that authors have some assurance that their labors are evaluated fairly, even if they cannot hear the detailed discussions or weighing that takes place behind closed committee doors.

Thus, the community has developed a set of fairness rules that try to limit both the very real possibility and the perception of favoritism. Such rules include conflict-of-interest stipulations that exclude advisors, collaborators and colleagues from the same institution from reviewing papers. Some conferences also impose special conditions on submissions from technical program committee chairs and even program committee members, from excluding them altogether to raising the bar for such papers. As a note, it is generally much easier with such rules to prevent biased positive reviews than unduly negative ones, a problem that the computer science community has acquired a reputation for.

After excluding reviews based on conflicts-of-interest and perceptions of insider advantage, the third approach to reducing biased reviews is to use double-blind reviews (DBR), where the author identities and affiliations are hidden from the reviewers. Single-blind reviews (SBR), on the other hand, simply hide the reviewer identities from authors, with the goal of allowing reviewers to provide frank feedback without worrying about personal or professional repercussions. (Experiments with open reviews, where all identities are disclosed, have been performed at Global Internet 2007, but are beyond the scope of this discussion.)

Below, I will try to highlight why I believe that DBR should be seen, at best, as a tool that mostly maintains the important perception of fairness, rather than one that dramatically changes the review outcome. Thus, similar to the often-derided "security theater" at airport checkpoints, DBR may be partially "review theater": it may not actually help in a scientifically-measurable fashion and is easily circumvented by a determined adversary, but it serves an important role in upholding the norms and values of the community for highly-selective conferences. However, as discussed below, DBR can cause serious side effects if it forces or encourages authors to obscure the relationship to their earlier work. Thus, I argue that if community norms and customs call for DBR, it should be handled judiciously, recognizing its limitations and addressing the practical problems.

Many ACM and, more specifically, ACM SIGCOMM conferences use double-blind reviewing (DBR). In double-blind reviewing, the paper submitted for review does not contain author names. As in single-blind reviewing, the authors are not told who reviewed the paper. While not always clearly stated, the goal of double-blind reviewing is to remove bias in the review process, particularly the perception that work by well-known ("prolific") authors or from highly-ranked institutions will be given more credence than work of similar quality from other authors and institutions. (Some claim the opposite effect: Reviewers may tire of the work of certain individuals or hold prolific authors to a higher standard.) DBR may also make it less likely that personal likes and dislikes color the review, outside the normal range of conflicts of interest that should prevent review assignments. Some also claim that double-blind reviewing may help reduce gender bias, although there does not appear to be any systematic study of this effect in computer science papers. In general, the quantitative evidence for the impact of double-blind reviewing on paper selection (or on perceived fairness) is not particularly strong and the result may depend on the metrics, as illustrated by the analysis of SIGMOD submissions.

We can view DBR as related to the relatively recent custom in the United States of omitting a candidate's birthday and picture from a resume, to avoid racial and age bias, or a preference for good-looking candidates, during the pre-interview selection process. The evidence for the effect of these omissions appears to be somewhat stronger, however, than for DBR.

DBR may also combat reviewer laziness when reading papers authored by recognized experts. For example, if the author is known to the reviewer, a reviewer may be tempted to skim the analysis since the author presumably knew what he or she was doing, and is likely to be more versed in the mathematical details than the reviewer. This may not only prevent the discovery of mistakes, but, more likely, let presentation errors, such as missing definitions, truncated equations or inconsistent use of variable names, slip through during the review process.

Different conferences are likely to be affected to a varying degree. For example, the normal "noise" in the review process already makes it difficult to have reproducible acceptance decisions for conferences that only accept one in eight or ten papers. For such highly-selective conferences, reducing the impact of author identities on the fate of borderline papers may be helpful, even if it would make almost no difference for slightly less selective conferences. However, it may also lead to false complacency: "we're doing double-blind reviewing, so the process must be fair."

The concern about the effectiveness of DBR is summarized in a recent SIGCSE bulletin article: "Submitting papers anonymously does not work as well as one would hope. Authors who report on `ongoing work' who have prior publications can be discovered with a quick search on the Web. Papers often include clues such as grants on which the work was supported, or names of proprietary software used for the work, that make such discovery easy if not trivial. In a world where most Computer Science education professionals have web presence, and finding information on the web is increasingly easy, it may be time to reexamine the need for, and the advantages of anonymous submissions."

Even without DBR, it is unlikely that an obviously wrong paper by a "star" author will be accepted in a well-run conference, or that truly interesting work will go unpublished altogether even if it is rejected at the first conference it is submitted to. There is hardly a dearth of networking conferences to submit papers to, after all. Unfortunately, the stakes for reviewing have gotten higher for authors, as the number of papers accepted at highly-selective conferences has become a metric used to evaluate faculty and tenure candidates for hiring and promotion. Particularly in that context, relying on DBR to give a process that has a fair amount of randomness the sheen of scientific rigor, similar to double-blind drug trials, may well cause significant harm.

As noted, one oft-cited advantage is the perception of fairness that is helped by DBR. However, the SIGCOMM conference fairly regularly attracts criticism that certain authors or institutions have a leg up, despite having used DBR for many years.

The debate about review policies may obscure a larger problem. One could argue that we are probably exceeding the resolution of our evaluation tools, so that the fixation with double-blind reviewing and other procedural considerations, such as author responses, merely obscures the inherent limitations of a process that strongly depends on the choice of the reviewers and the dynamics of a technical program committee meeting, even if there were an objective standard for what the "best" papers were in a particular conference year. For example, while not meant to measure such randomness, the SIGCOMM 2005 shadow PC experiment yielded a high variation in the set of accepted papers, with only roughly half the accepted papers appearing both in the set chosen by the shadow TPC and the real TPC.
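
As a rough illustration of this kind of variation, the following is a toy simulation, not a model of the SIGCOMM 2005 experiment. It assumes that each paper has a hidden quality score, that two independent committees each see that score plus Gaussian noise, and that each committee accepts its top-scoring papers. The function names (accepted_set, mean_overlap), the noise level and the paper counts are all illustrative assumptions; the sketch only shows how the overlap between two accepted sets could be computed for different acceptance rates.

```python
import random


def accepted_set(scores, k):
    """Return the indices of the k highest-scoring papers."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def mean_overlap(accept_rate, n_papers=1000, noise_sd=1.0, trials=20, seed=0):
    """Average fraction of accepted papers shared by two independent committees.

    All parameters are hypothetical; noise_sd is the assumed standard deviation
    of each committee's scoring noise relative to a unit-variance true quality.
    """
    rng = random.Random(seed)
    k = max(1, round(accept_rate * n_papers))
    total = 0.0
    for _ in range(trials):
        # Hidden "true" quality of each submission (assumption: standard normal).
        quality = [rng.gauss(0.0, 1.0) for _ in range(n_papers)]
        # Each committee observes quality plus its own independent noise.
        committee_a = [q + rng.gauss(0.0, noise_sd) for q in quality]
        committee_b = [q + rng.gauss(0.0, noise_sd) for q in quality]
        shared = accepted_set(committee_a, k) & accepted_set(committee_b, k)
        total += len(shared) / k
    return total / trials


if __name__ == "__main__":
    for rate in (0.10, 0.25, 0.50):
        print(f"accept rate {rate:.0%}: mean overlap {mean_overlap(rate):.2f}")
```

Under these assumptions, one would expect the overlap to shrink as the acceptance rate drops, but the actual numbers depend entirely on the assumed noise level and say nothing about any real conference.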

However, this note is less about the benefits of double-blind reviewing than about how to implement it effectively without causing undue collateral damage during the review process. It should be noted that there are two stages where anonymization may matter, namely the written reviews and the technical program committee discussion. During the latter, maintaining author anonymity can be particularly challenging if the author is on the TPC or from an institution well-represented on the TPC. If suddenly all Stanford University members of the committee are asked to take a coffee break during the deliberations for a paper, it may not take the rest of the committee all that much guesswork to identify the likely authors and institution. No amount of text obfuscation will compensate for the coffee break indicator. Even though this loss of anonymity at the TPC meeting may well remove most of the fairness advantages for the borderline papers that get discussed there, I will focus on the written reviews, simply because there does not appear to be a good solution to the TPC meeting problem, except maybe randomly ejecting TPC members to obscure the author institution.

DBR serves a goal, namely allowing reviewers to focus on the work, not the authors. However, in practice, the goal is mostly to remove the identity of prolific authors as there seems to be little difference in removing the names of authors that few of the reviewers are likely to have heard of. (Removing institutional information may still be helpful, as it may be harder to take papers seriously if they are authored by somebody teaching at a community college.)

Once a particular conference has decided to implement double-blind reviewing, the conference chairs and steering committee have to decide how far to push the removal of author-identifying information. The ACM Transactions on Database Systems provides a sample of such rules. Obviously, the paper under review should not list any authors and institutions, and should omit any "incriminating" acknowledgements and avoid revealing filenames. None of these are necessary for evaluating the paper content and can thus be omitted during the review process. These simple steps alone probably anonymize the paper for a majority of reviewers.

However, self-citations can easily reveal the authors with minimal inspection. Since authors routinely cite their earlier work and prolific authors have no incentive to hide their identity, the simple statement "As we showed in [42]" unmasks at least some of the authors for all but the most careless reviewer. Thus, the next step is to ask authors to convert such citations to the third person.

This is probably where the easy rules and guidelines end. Particularly for highly-selective conferences, many of the competitive papers will contain material that has been published earlier in workshops or been discussed at principal investigator meetings. Among other identifying items, it may contain project, protocol or algorithm acronyms. The current SIGCOMM FAQ encourages such an approach: "When reviewing the subsequent, more mature submissions of such work, reviewers are advised to first assess whether there is an adequate additional contribution over the previous, preliminary version of the paper, if so, then reviewers are advised to measure the full-length submission not just in terms of its additional contribution but on its entire content. This policy aims to encourage the development of work while also encouraging publication of work when it is in its earlier stages." However, the reviewer can only identify whether there is substantial additional contribution if they know the earlier work, either because they are personally familiar with it or because the authors cite that work. If the authors obscure the relationship to the earlier work, one of two undesirable outcomes is likely:

* Unjustified suspicion of plagiarism: If the reviewer recognizes the work, but does not recognize the project acronym and the text contains no indication of the earlier work, the reviewer may well get suspicious and search for text snippets online, most likely finding the earlier paper. The reviewer then has to confirm with the TPC chair that the paper under review is indeed original work and not plagiarized. At this point, clearly all author anonymity has been lost, and the reviewer has wasted time on a wild goose chase.

* Undiscovered self-plagiarism: Unfortunately, there are indications that self-plagiarism and double submissions are no longer rare. For example, during this year's IFIP Networking conference, where I serve as TPC co-chair, we have had two papers that were verbatim compilations of earlier work by the authors. While plagiarism detection systems such as docoloc can ferret out some of these cases, they are not perfect; in this case, they missed a publication in a very recent conference.

The SIGCOMM FAQ (http://www.sigcomm.org/about/policies/frequently-asked-questions-faq) currently recommends essentially a conditional review with the outcomes accept, accept-only-if-common-authors (i.e., not plagiarized) or reject. In my opinion, suspending suspicion of plagiarism while spending an hour or two on a review and possibly considerable time at the TPC meeting seems beyond what one can require of a reviewer or technical program committee. It is unclear whether this particular approach has ever been exercised.

Simply having the authors indicate that the paper contains material from an earlier publication, but not identifying the precise reference, does not help much since the reviewer can then make no judgement as to whether the incremental contribution is sufficient to merit publication. Alternatively, the conference can set up an incremental novelty examination (ICE) board, e.g., consisting of former TPC members, that only investigates such cases and declares papers to be fit for review or too much of a cut-and-paste job. To speed up the process, authors would be asked to declare which earlier papers of theirs the new submission is based on, and outline the major differences to the earlier submissions, with that information visible only to the ICE board. (This review could theoretically be performed by the chairs, but since it has to be done quickly to avoid delaying the review process, a larger committee seems appropriate.) While requiring extra effort and possibly delaying the review process, this seems preferable to having the reviewer do detective work. It also seems likely to increase consistency, so that all papers containing previously published results are held to the same standard of incremental contribution.

Such self-declaration may also make it less likely that an author "forgets" about double submission rules or self-plagiarism considerations, so it may be worthwhile even without DBR.

In general, since the effectiveness of blinding will vary with the paper and the set of TPC reviewers assigned, the fairness effect will also be somewhat random. Some reviewers for some papers will suspect or know the authors, others will not. Since good TPC members should be very familiar with the current work in the paper's technical area, it could almost be considered disqualifying if they do not recognize the work of a prolific author in that area.

Given that trying to obscure technical content by editorial changes is unlikely to be effective for prolific authors with high-visibility projects, heroic obfuscation measures seem likely to cause more harm and wasted effort than they gain in reviewer objectivity. Thus, DBR may be more of a placebo, working mostly because we believe it does, but that should not distract from the need for harder-to-make changes in our publication process.