Mining Unstructured Software Data

Staff - Faculty of Informatics

You are cordially invited to attend the PhD Dissertation Defense of Alberto BACCHELLI on Friday, June 14th, 2013 at 14h30 in room A34 (Red building).

The availability of large amounts of recorded data, produced during software development, has led to a research area called mining software repositories (MSR). Researchers mine software repositories both to support software understanding, development, and evolution, and to empirically validate novel ideas and techniques. Most MSR research focuses on mining archives of data that is either written by humans for a computer (e.g., source code) or generated by a computer for humans (e.g., execution traces). This data has an easily parseable structure that allows precise fact extraction and concerns the end product of software development. For this reason, the knowledge embedded in structured data can be extracted and modeled through well-established techniques.

Other software repositories archive data that is more unstructured, as it is produced by humans for humans: documents such as emails, change comments, or bug reports, written in natural language and used to exchange information among people. The data contained in these repositories is not widely exploited because of its noisy and unstructured nature. Yet the information stored in unstructured data encodes knowledge that cannot be found in other software artifacts, and it offers valuable insights into the human factors surrounding a software project.

Our thesis is that the analysis of unstructured data supports software understanding and evolution analysis, and complements the data mined from structured sources. To this end, we implemented the necessary toolset and investigated methods for exploring, exposing, and exploiting unstructured data.

To validate our thesis, we focused on development email data. We found two main challenges in using it to support program comprehension and software development: the disconnection between emails and code artifacts, and the noisy, mixed-language nature of email content. We tackle these challenges by proposing novel approaches. First, we devise lightweight techniques for linking email data to code artifacts. We use these techniques to create a tool that supports program comprehension with email data, and to create a new set of email-based metrics that improve existing defect prediction approaches. Subsequently, we devise techniques for giving structure to the content of emails, and we use this structure to conduct novel software analyses that support program comprehension.

In this dissertation, we show that unstructured data, in the form of development emails, is a valuable addition to structured data and, if correctly mined, can be used successfully to support software engineering activities.

Dissertation Committee:

  • Prof. Michele Lanza, Università della Svizzera italiana, Switzerland (Research Advisor)
  • Prof. Fabio Crestani, Università della Svizzera italiana, Switzerland (Internal Member)
  • Prof. Carlo Ghezzi, Politecnico di Milano, Italy (Internal Member)
  • Prof. Lionel Briand, University of Luxembourg, Luxembourg (External Member)
  • Prof. Massimiliano Di Penta, University of Sannio, Italy (External Member)
  • Dr. Thomas Zimmermann, Microsoft Research, USA (External Member)