----------------------- REVIEW 1 ---------------------
PAPER: 30
TITLE: The Tragedy of Defect Prediction, Prince of Empirical Software Engineering Research
AUTHORS: Michele Lanza, Andrea Mocci and Luca Ponzanelli

----------- Review -----------
Summary: The authors discuss how defect prediction approaches have traditionally been evaluated. They then argue that this style of evaluation has some major flaws, such as discounting the cascading effect of fixed bugs. Finally, they argue for evaluations situated in a real-world environment.

Positives:
• Entertaining to read, well-written
• Thought provoking
• Self-critiquing, too rare in academia
• Brings up valid concerns with existing evaluation techniques
• Pushes for more realistic evaluation techniques

Issues:
• The scope of the arguments against current evaluations is too narrow. The lack of realism in the evaluation is the primary issue, IMO, whereas the authors paint the primary issue as owing to the single problem of the "time space continuum".

Details: Act II is a bit too abstract and too long. The set-up feels drawn out. Nice point (and supporting picture) in "the evaluation of defect prediction approaches…". However, this is where you lose some credibility with me. I would rather you said this is one of the reasons that these types of evaluations are invalid. There are lots of other reasons, as the evaluation is a major departure from the in vivo evaluations you suggest. Good advice on which movie to watch and which ones to avoid.

----------------------- REVIEW 2 ---------------------
PAPER: 30
TITLE: The Tragedy of Defect Prediction, Prince of Empirical Software Engineering Research
AUTHORS: Michele Lanza, Andrea Mocci and Luca Ponzanelli

----------- Review -----------
Summary: This paper reflects on the perils of defect prediction in empirical software engineering. The central thesis is that evaluating defect predictions using source code history is misleading, since any change to the code (i.e., fixing a defect) will inevitably change the state of the code, and the defects that follow may be prevented or be different. In other words, the authors contend that evaluating existing defect predictors assumes that, if the defect predictor had been used in the past, it would have been ignored. The conclusion is that true defect prediction can only happen in vivo, and the authors issue a challenge to other researchers and to themselves to put defect predictors into the hands of developers and see what happens, as a means to evaluate their efficacy.

Evaluation: This paper provides valuable and original insights into the perils of defect prediction, and the rationale behind the authors' opinions appears sound, but the paper is somewhat lacking in impact and could be made more useful so it can better serve the research community.

I have a love-hate relationship with this paper. On the love side, the anecdotes in italics are insightful and, in my opinion, are the main contribution of the paper. This discussion is an important and valuable one, and the authors take care to show both sides and explore, to some extent, how the research community has sought to address the issue of using the past as the future. On the hate side, the casual dialog at times took away from the paper's insights and the authors may have gone too far in some cases, but in my opinion the paper can be saved. In particular, the paper should end on suggestions about the future of defect prediction in a call to arms, rather than suggesting edits to the reader's Netflix queue.
If this paper is to serve as a reflection on defect prediction evaluation, then to have a stronger impact it also needs to serve as a resource and point researchers to work (past or future) that can help them do better. As one suggestion, the reusable benchmark data sets for defect prediction should be recognized (Act III, second italicized paragraph), as these efforts are important to facilitate comparison of approaches, even if non-ideal for evaluating a single approach. The authors casually point to their own bug database. The following should also be recognized:

The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs. Claire Le Goues, Neal Holtschulte, Edward K. Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley Weimer. Accepted/to appear in IEEE Transactions on Software Engineering, 2015.

Acknowledge the effort and value (for comparison and *initial* evaluation), but also point out that the data sets are not the end - there's a bigger picture (in vivo evaluation). While I understand that the premise of the paper is that evaluations need to happen in real time and within software processes, rather than on historical data, these reusable data sets provide a starting point and an avenue for comparison against prior art.

The connection between Figure 1 and the text is a bit too loose. It should be specifically referenced at the beginning of Act III. The authors should also justify why bugs B and J are relevant when discussing the case in which a predictor predicts bug A and it is fixed (3rd paragraph of Act IV). In Act V, in vivo should also be defined in the context of the software development process, for clarity (final paragraph).

On a less serious note, I found this paper highly entertaining and outright laughed at several points. Thanks for that.

Grammar/typos:
Act I: one step after another -> one step at a time
Act IV: The problem are -> The problems are

Strengths:
+ Important topic of discussion, how to evaluate predictors, using defect prediction as an example
+ Provides specific responses to critiques of their central thesis (Act V)
+ Tone is light and self-deprecating, which is important for sensitive topics such as this

Weaknesses:
- Lacking in practical suggestions for the reader
- Not enough discussion of current research efforts to deal with this issue of evaluating defect prediction
- Tone may be too colloquial

----------------------- REVIEW 3 ---------------------
PAPER: 30
TITLE: The Tragedy of Defect Prediction, Prince of Empirical Software Engineering Research
AUTHORS: Michele Lanza, Andrea Mocci and Luca Ponzanelli

----------- Review -----------
Summary
The authors argue that defect prediction research in software engineering is contradictory: if the results are to have any impact, software developers would modify the software to fix the problems and thereby modify the assumptions on which the models run.

Strengths
* I agree that there are plenty of reasons to dislike bug prediction systems in practice.

Weaknesses
* I think the paper's argument is unsound. Consider Kalman filters, used to correct the flight path of rockets based on prior status and the equations of motion. If the rocket just presumed the equations of motion were all that were necessary to fly, it would never hit the target. As the rocket flies, its position drifts from the prediction due to low-order effects of air friction and weight imbalances inside the fuel tanks. Instead, a Kalman filter is used to update the rocket's position using live sensor data and then combine it with the equations of motion to predict where the rocket needs to be in the next time step. If there were only one time step, this system would never update and would never reach the target. Instead, the quality of the guidance system is related to the speed at which sensor data can be read in to correct the motion equations (see the sketch below). This is very similar to the authors' claim about bug prediction. We have theories about where bugs will show up in the system, which guide us in knowing what to test first. However, since the system itself has "instabilities" that we can't easily model, we gather statistics about where the bugs are that we've already found and use them to correct our theories about the bugs and test different places for correctness. The authors' argument suggests that the time step size is too long, or that the bug analysis is not run often enough, which may be legitimate in a real system, but it can't condemn the concept of bug prediction, as they try to do here. It just means the people running the system need to keep running it fairly often as the underlying system changes and evolves. That's a deployment issue, and yes, important, but not the fundamental contradiction that the authors claim it is.
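To make the analogy concrete, here is a minimal Python sketch of the predict/correct loop I have in mind. The noise variances, the "motion model" term u, and the sensor readings are made-up illustrative values, not anything taken from the paper.

def kalman_step(x_est, p_est, u, z, q=0.01, r=0.25):
    """One 1-D Kalman predict/correct cycle.
    x_est, p_est : previous state estimate and its variance
    u            : change predicted by the 'equations of motion'
    z            : noisy live sensor reading
    q, r         : assumed process and measurement noise variances
    """
    # Predict: trust the model alone...
    x_pred = x_est + u
    p_pred = p_est + q
    # Correct: ...then blend in the sensor data via the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new

# Without the correction step (k = 0), unmodeled drift accumulates and the
# estimate never recovers: the "only one time step" failure mode above.
x, p = 0.0, 1.0
for z in [0.35, 0.7, 1.1, 1.55]:   # made-up sensor readings
    x, p = kalman_step(x, p, u=0.3, z=z)

The quality of the estimate depends on how often fresh readings z arrive, which is exactly my point about how often the bug analysis needs to be re-run.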
----------------------- REVIEW 4 ---------------------
PAPER: 30
TITLE: The Tragedy of Defect Prediction, Prince of Empirical Software Engineering Research
AUTHORS: Michele Lanza, Andrea Mocci and Luca Ponzanelli

----------- Review -----------
Summary: This short paper points out several "flaws" in the evaluation of software defect prediction and suggests possible ways ahead.

Detailed comments:

- A main point/suggestion from the paper is "researchers active in this area should seriously consider putting their predictors out into the real world, and have them being used by developers who work on a live code base." While I agree with this suggestion, I am bothered by the fact that this paper misses such effort in the community entirely. Reading this sentence in the context of this paper suggests that no one has attempted to put defect predictors into practice. This is not true. Such adoption has been attempted at AT&T [1] (references are listed at the end), Google [2], Cisco [3], and IBM [4], just to name a few. Some of these attempts weren't very successful, yet they offer valuable lessons learned: explanations of prediction results are needed [2,3], cross-validation isn't appropriate for certain types of defect prediction [3], etc.

- I agree that we should aim to predict on new code and report bugs to developers, just as many of us do when evaluating bug detection tools. However, I cannot agree with "the evaluation of defect prediction approaches using the past bug history of a system is intrinsically flawed." It is still important to evaluate on known bugs because (1) the future often resembles the past, so a large number of known bugs allows for a valid evaluation quickly, and (2) we shouldn't leave our evaluation entirely at the mercy of developers. Even if some reported bugs are true bugs that developers will or should confirm and fix, sometimes developers simply don't do so (promptly) because they are busy, they missed the notification of the reported bug, or they are on vacation or sick. The point here is that both (using known bugs and reporting new bugs) are valid and important. They complement each other; a rough sketch of what I mean by the former is given below.
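Here is the kind of evaluation on known bugs I have in mind, as a Python sketch. DefectPredictor and the (timestamp, features, was_buggy) records are hypothetical placeholders, not a real tool or data set; the point is only that training strictly precedes testing in time, which also avoids the leakage that makes plain cross-validation questionable for this task [3].

def time_ordered_eval(changes, cutoff, DefectPredictor):
    """changes: list of (timestamp, features, was_buggy), oldest first.
    Train on everything before `cutoff`, test on everything after, so the
    model never sees the 'future' of the history it is evaluated on."""
    train = [c for c in changes if c[0] < cutoff]
    test = [c for c in changes if c[0] >= cutoff]
    model = DefectPredictor()
    model.fit([f for _, f, _ in train], [b for _, _, b in train])
    hits = sum(model.predict(f) == b for _, f, b in test)
    return hits / len(test)   # fraction of post-cutoff changes classified correctly

This is cheap and repeatable against known bugs; reporting fresh predictions to developers then covers the part that history alone cannot.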
- I agree that the lack of reusable datasets is an issue. However, this is known. Datasets such as PROMISE and the authors' dataset have been created and used. I believe that was the primary motivation for Tim Menzies et al. to initiate the effort behind the PROMISE database. To incentivize researchers to share data, the MSR conference has a data track. In addition, sometimes it is impossible for a technique to reuse an existing data set because the algorithm relies on certain features that don't exist in existing datasets. So I believe the problem is more complex, and requires more careful effort and investigation than simply pointing out the known fact that we need shared data sets.

- Overall, I agree with the authors that researchers need to seriously address the practicality issue of defect prediction. Yet I disagree with many of the concrete points and suggestions because related work is missing, some points are already known, etc.

Typo: a archetypal -> an archetypal

References:
[1] Predicting the Location and Number of Faults in Large Software Systems. Thomas J. Ostrand, Elaine J. Weyuker, Robert M. Bell. TSE 2005.
[2] Does Bug Prediction Support Human Developers? Findings from a Google Case Study. Chris Lewis, Zhongpeng Lin, Caitlin Sadowski, Xiaoyan Zhu, Rong Ou, E. James Whitehead Jr. ICSE 2013.
[3] Online Defect Prediction for Imbalanced Data. Ming Tan, Lin Tan, Sashank Dara, Caleb Mayuex. ICSE SEIP 2015.
[4] Merits of Organizational Metrics in Defect Prediction: An Industrial Replication. Bora Caglayan, Burak Turhan, Ayse Basar Bener, Mayy Habayeb, Andriy Miransky, Enzo Cialini. ICSE SEIP 2015.