Analysis and Optimization of Java Streams

Staff - Faculty of Informatics

Date: 13 September 2022 / 14:30 - 16:00

USI Campus EST, room D1.13, Sector D

You are cordially invited to attend the PhD Dissertation Defence of Edgar Eduardo Rosales Rosero on Tuesday 13 September 2022 at 14:30 in room D1.13 and online.

The Stream API was added in Java 8 to ease the development of data-processing logic. This API introduces two key abstractions: the stream, a sequence of elements made available by a data source, and the stream pipeline, a multistage structure containing operations (e.g., map, filter, reduce) that are applied to the elements of the stream upon execution. Streams have attracted the attention of Java developers because they are versatile, ease the parallelization of data transformations, and can improve software design by leveraging the extensibility and maintainability favored by functional programming styles.

Recent work has studied Java streams extensively, focusing either on stream-related optimizations or on how Java developers use the Stream API. Nonetheless, these studies mainly rely on manual code inspection and static analysis techniques, overlooking the analysis of runtime metrics specific to streams. Moreover, empirical studies on the use of streams consider only small sets of applications, leaving the large-scale analysis of stream processing an open research question. Our work fills this gap by introducing new techniques that enable the analysis and optimization of stream-related performance issues, along with a large-scale empirical study of stream processing in the wild. Our goal is to advance the understanding of both the impact of streams on application performance and how streams are used in modern open-source applications.

In this dissertation, we first present a new technique for measuring the computations performed by a stream in terms of reference cycles (cycles for short). We use cycle profiling to identify problematic stream executions that limit application performance. As accuracy is crucial to this end, we estimate the extra cycles caused by the inserted instrumentation code that makes cycle profiling possible, and remove them from the profiles. We implement our technique in a novel cycle-accurate stream profiler for the Java Virtual Machine (JVM). We use our tool to profile the state-of-the-art benchmark suite Renaissance, revealing previously unknown performance issues in the studied workloads. We use our profiles to optimize sequential and parallel stream-based workloads, achieving speedups of up to 5x.

Complementarily, we develop a new technique for profiling an extensive set of stream-related runtime metrics that are suitable for collection in the wild and whose study helps unveil common practices of Java developers when using streams. We implement our technique in a novel profiler that characterizes stream processing in the wild. We use a fully automatic approach to apply our tool at scale to code exercised via unit tests available in open-source software projects hosted on GitHub. We conduct the first large-scale empirical study on the use of the Stream API. Our findings confirm the observations that related work highlighted at a smaller scale. Moreover, our work is the first to report the popularity of many features of the Stream API, and it reveals inefficient stream code patterns and stream misuses that are currently present in multiple open-source software projects.
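
For readers unfamiliar with the Stream API, the two abstractions named in the abstract can be illustrated with a minimal sketch. It is not taken from the dissertation: the class and method names are hypothetical, and it uses only the standard java.util.stream API. It shows a pipeline built from filter, map, and reduce, the same pipeline run in parallel, and the fact that operations are applied only once a terminal operation executes the pipeline.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of the dissertation or its tools.
public class StreamPipelineSketch {

    // Sequential pipeline: data source -> filter -> map -> reduce (terminal operation).
    static int sumOfSquaresOfEvens(List<Integer> numbers) {
        return numbers.stream()
                .filter(n -> n % 2 == 0)   // keep even elements
                .map(n -> n * n)           // square each remaining element
                .reduce(0, Integer::sum);  // terminal operation: triggers execution
    }

    // Same pipeline, with only the data source switched to a parallel stream.
    static int sumOfSquaresOfEvensParallel(List<Integer> numbers) {
        return numbers.parallelStream()
                .filter(n -> n % 2 == 0)
                .map(n -> n * n)
                .reduce(0, Integer::sum);
    }

    // Builds a pipeline without a terminal operation and reports how many
    // elements the filter has seen; intermediate operations are lazy.
    static int elementsVisitedWithoutTerminalOp(List<Integer> numbers) {
        AtomicInteger visited = new AtomicInteger();
        Stream<Integer> pipeline = numbers.stream()
                .filter(n -> {
                    visited.incrementAndGet();
                    return n % 2 == 0;
                });
        // No terminal operation is invoked on 'pipeline', so nothing has run.
        return visited.get();
    }

    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 2, 3, 4, 5, 6);
        System.out.println(sumOfSquaresOfEvens(numbers));
        System.out.println(sumOfSquaresOfEvensParallel(numbers));
        System.out.println(elementsVisitedWithoutTerminalOp(numbers));
    }
}
```

The sequential and parallel variants differ only in how the stream is obtained from the data source, which is the sense in which streams ease the parallelization of data transformations. Whether the parallel variant is actually faster depends on the workload.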

Dissertation Committee:
- Prof. Walter Binder, Università della Svizzera italiana, Switzerland (Research Advisor)
- Prof. Andrea Rosà, Università della Svizzera italiana, Switzerland (Research co-Advisor)
- Prof. Patrick Thomas Eugster, Università della Svizzera italiana, Switzerland (Internal Member)
- Prof. Matthias Hauswirth, Università della Svizzera italiana, Switzerland (Internal Member)
- Prof. Andreas Krall, TU Wien, Austria (External Member)
- Prof. Petr Tuma, Charles University, Czech Republic (External Member)