Informatics Seminar on Wednesday, December 10th at 13.30 - Thomas Heinis

Faculty of Informatics - Academic Studies Administration

Start date: 10 December 2008

End date: 11 December 2008

The Faculty of Informatics is pleased to announce a seminar given by Thomas Heinis.

TITLE: Efficient Lineage Tracking for Scientific Workflows

SPEAKER: Thomas Heinis, Institute for Pervasive Computing, ETH Zurich

DATE: Wednesday, December 10th, 2008

PLACE: USI Università della Svizzera italiana, room SI-006, Informatics building (Via G. Buffi 13)

TIME: 13.30

ABSTRACT:

Data lineage and data provenance have been identified as key problems in the management of scientific data. Not knowing the exact provenance and processing pipeline used to produce a derived data set often renders the data set useless from a scientific point of view. On the positive side, capturing provenance information is facilitated by the widespread use of workflow tools for processing scientific data since the workflow process describes all the steps involved in producing a given data set. On the negative side, efficiently storing and querying such information has until now proven to be difficult. Known solutions use recursive queries and even recursive tables to represent scientific workflows. Such solutions do not scale and are rather inefficient.

In this talk I will present our approach to the problem. We use a space- and query-efficient interval representation for dependency graphs and show how to transform arbitrary workflow processes into graphs that can be stored using such a representation. The approach is very efficient with respect to the time required to encode the graph and to answer lineage-related questions. We have benchmarked our approach by using it to store the data lineage of several different scientific workflows.
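As a rough illustration of the general idea behind interval representations (a generic textbook-style sketch, not the encoding presented in the talk), the Python snippet below labels each node of a small dependency tree with a (start, end) interval during a depth-first traversal. A lineage question then reduces to an interval-containment test instead of a recursive query. The tree shape, the data set names and the function names `encode_intervals` and `in_lineage` are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def encode_intervals(children: Dict[str, List[str]], root: str) -> Dict[str, Interval]:
    """Label every node with a (start, end) interval via iterative depth-first traversal.

    `children` maps a data set to the data sets it was derived from
    (hypothetical structure; assumes the dependencies form a tree).
    """
    intervals: Dict[str, Interval] = {}
    start: Dict[str, int] = {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            # All inputs of `node` have been numbered; close its interval.
            intervals[node] = (start[node], counter)
        else:
            start[node] = counter
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
        counter += 1
    return intervals


def in_lineage(intervals: Dict[str, Interval], item: str, derived: str) -> bool:
    """True if `item` contributed (directly or indirectly) to `derived`."""
    item_start, item_end = intervals[item]
    derived_start, derived_end = intervals[derived]
    # Containment check: no traversal of the graph is needed at query time.
    return derived_start < item_start and item_end < derived_end


# Hypothetical example: `result` is derived from `normalized` and `reference`,
# and `normalized` is derived from two raw inputs.
children = {
    "result": ["normalized", "reference"],
    "normalized": ["raw_a", "raw_b"],
}
intervals = encode_intervals(children, "result")

print(in_lineage(intervals, "raw_a", "result"))         # True
print(in_lineage(intervals, "reference", "normalized"))  # False
```

This sketch only covers tree-shaped dependencies. Workflow graphs in which a data set feeds several downstream steps need an additional transformation step, which is the part addressed in the talk.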

In the remainder of the talk I will discuss how we have put our method to use in Sisyphus, a tool we have developed to process, manage and visualize proteomics data. Experiment data processing in Sisyphus is subject to constant change as the focus of the experiments changes and different or new processing algorithms must be considered. Clearly, with a perpetually changing data processing pipeline, tracking the lineage of the data becomes of the utmost importance. Tracking the lineage of data in Sisyphus is, however, difficult: data dependencies are intricate and the amount of lineage data is vast, making scalable and efficient tracking mechanisms mandatory.

BIO:

Thomas Heinis received a master's degree in computer science from the Federal Institute of Technology (ETH) in Zurich, Switzerland, in March 2002. After working in industry, he visited Purdue University from August 2003 to May 2004, where he carried out research in the area of grid computing. Since June 2004 he has been a research assistant at the Information and Communication Group in the Institute for Pervasive Computing at ETH Zurich, Switzerland. His main research interests are in the field of grid and autonomic computing as well as data lineage in the context of scientific applications.

HOST: Prof. Cesare Pautasso