I am a postdoctoral researcher at the Faculty of Informatics of the Università della Svizzera italiana (USI) in Lugano, Switzerland. I work in the Information Retrieval group, headed by Prof. Fabio Crestani.
In 2014, I received my Ph.D. in Engineering in Computer Science from Sapienza University of Rome, under the supervision of Prof. Stefano Leonardi. My Ph.D. thesis focused on query-log analysis, on the optimization of IR systems through caching of search results and augmented inverted indexes, and on user-click analysis for recommending news and blog websites. Part of my research was conducted while I was an intern at the Yahoo Research Lab in Barcelona (Spain) and at the Max Planck Institute for Informatics in Saarbrücken (Germany).
Before joining the IR group at USI, I was a postdoctoral researcher at the Max Planck Institute for Informatics. I worked with Prof. Gerhard Weikum in the Databases and Information Systems group, and I was involved in projects on information extraction and on the detection of privacy risks for web users.
This technical report presents the work of the Università della Svizzera italiana (USI) at the TREC 2016 Contextual Suggestion track. The goal of the Contextual Suggestion track is to develop systems that suggest venues a user will potentially like. Our proposed method models the users' behavior and opinions by training an SVM classifier for each user. It then enriches this basic model with additional data sources, such as venue categories and taste keywords, to model users' interests. To predict the contextual appropriateness of a venue to a user's context, we modeled the problem as binary classification. Furthermore, we built two datasets using crowdsourcing, which are used to train an SVM classifier to predict the contextual appropriateness of venues. Finally, we show how to incorporate the multimodal scores into our model to produce the final ranking. The experimental results show that our proposed method performed very well on all the evaluation metrics used in TREC.
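The per-user classification step can be illustrated with a short sketch. The feature encoding below (a bag of venue categories and taste keywords per venue) and the helper names are illustrative assumptions, not the exact pipeline of the TREC run.

```python
# Illustrative sketch of the per-user SVM step (assumed feature encoding,
# not the exact TREC pipeline): each venue is represented by its categories
# and taste keywords, and one classifier is trained per user on rated venues.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

def train_user_model(rated_venues):
    """rated_venues: list of (venue_text, liked) pairs, where venue_text
    concatenates the venue's categories and taste keywords."""
    texts = [text for text, _ in rated_venues]
    labels = [1 if liked else 0 for _, liked in rated_venues]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    model = SVC(kernel="linear", probability=True).fit(X, labels)
    return vectorizer, model

def score_candidate(vectorizer, model, venue_text):
    # Estimated probability that this user likes the candidate venue.
    return model.predict_proba(vectorizer.transform([venue_text]))[0, 1]
```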
An important task in recommender systems is suggesting relevant venues in a city to a user. These suggestions are usually created by exploiting the user's history of preferences, collected, for example, in previously visited cities. In this paper, we first introduce a user model based on venue categories and on descriptive keywords extracted from Foursquare tips. Then, we propose an enriched user model that leverages the users' reviews from Yelp. Our participation in the TREC 2015 Contextual Suggestion track confirmed that our model outperforms other approaches by a significant margin.
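A minimal sketch of a category/keyword user profile of this kind follows; the data layout and the cosine scoring are simplifying assumptions, since the full model also distinguishes positive from negative preferences.

```python
# Minimal sketch of a keyword/category-based user profile (assumed data
# layout; the actual model weights preferences more carefully).
from collections import Counter
import math

def build_profile(liked_venues):
    """liked_venues: list of keyword/category lists for venues the user liked."""
    profile = Counter()
    for keywords in liked_venues:
        profile.update(keywords)
    return profile

def cosine_score(profile, venue_keywords):
    # Cosine similarity between the user's profile and a candidate venue.
    venue = Counter(venue_keywords)
    dot = sum(profile[k] * venue[k] for k in venue)
    norm = math.sqrt(sum(v * v for v in profile.values())) * \
           math.sqrt(sum(v * v for v in venue.values()))
    return dot / norm if norm else 0.0
```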
Tracking public opinion in social media provides important information to enterprises and governments during a decision-making process. In addition, identifying and extracting the causes of sentiment spikes allows interested parties to redesign and adjust their strategies with the aim of attracting more positive sentiment. In this paper, we focus on the problem of tracking sentiment towards different entities, detecting sentiment spikes, and extracting and ranking the causes of a sentiment spike. Our approach combines the LDA topic model with relative entropy: the former is used to extract the topics discussed in the time window before the sentiment spike, while the latter ranks the detected topics by their contribution to the spike.
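One plausible reading of this combination is sketched below, under assumed simplifications: fit LDA on posts from the window before the spike, then score each topic by how much its top words contribute to the relative entropy between the spike-window word distribution and a background distribution. The exact formulation in the paper may differ.

```python
# Hedged sketch: LDA topics from the pre-spike window, ranked by their top
# words' contribution to KL(spike-window || background). Assumed formulation.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def rank_topics(spike_posts, background_posts, n_topics=10, top_words=10):
    vec = CountVectorizer(stop_words="english")
    X_spike = vec.fit_transform(spike_posts)
    X_bg = vec.transform(background_posts)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X_spike)

    # Smoothed unigram distributions for the spike window and the background.
    p = np.asarray(X_spike.sum(axis=0)).ravel() + 1.0
    q = np.asarray(X_bg.sum(axis=0)).ravel() + 1.0
    p, q = p / p.sum(), q / q.sum()
    kl_term = p * np.log(p / q)          # per-word contribution to KL(p || q)

    words = np.array(vec.get_feature_names_out())
    scores = []
    for topic in lda.components_:
        top = topic.argsort()[::-1][:top_words]
        scores.append((kl_term[top].sum(), words[top].tolist()))
    return sorted(scores, reverse=True)   # highest-contributing topics first
```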
The privacy of Internet users is at stake because they expose personal information in posts created in online communities, in search queries, and in other activities. An adversary that monitors a community may identify the users with the most sensitive properties and use this knowledge against them (e.g., by adjusting the pricing of goods or targeting ads of a sensitive nature). Existing privacy models for structured data are inadequate to capture privacy risks arising from user posts.
This paper presents a ranking-based approach to the assessment of privacy risks emerging from textual contents in online communities, focusing on sensitive topics, such as being depressed. We propose ranking as a means of modeling a rational adversary who targets the most afflicted users. To capture the adversary's background knowledge regarding vocabulary and correlations, we use latent topic models. We cast these considerations into the new model of R-Susceptibility, which can inform and alert users about their potential for being targeted, and devise measures for quantitative risk assessment. Experiments with real-world data show the feasibility of our approach.
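A heavily simplified sketch of the ranking idea follows; it is not the R-Susceptibility measure itself, only an assumed stand-in that ranks users by the probability mass a topic model assigns to one designated sensitive topic.

```python
# Hedged sketch (a simplification, not the exact R-Susceptibility measure):
# fit a topic model on user posts and rank users by the probability mass
# their posts assign to a sensitive topic.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def rank_users_by_risk(user_posts, sensitive_topic_idx, n_topics=20):
    """user_posts: dict mapping user id -> concatenated text of their posts."""
    users = list(user_posts)
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(user_posts[u] for u in users)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(X)            # per-user topic proportions
    risk = theta[:, sensitive_topic_idx]    # affinity to the sensitive topic
    order = np.argsort(risk)[::-1]          # most susceptible users first
    return [(users[i], float(risk[i])) for i in order]
```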
With the rapid proliferation of microblogging services such as Twitter, a large number of tweets is published every day, often making users feel overwhelmed with information. Helping these users discover potentially interesting tweets is an important task for such services. In this paper, we present a novel tweet-recommendation approach that exploits network, content, and retweet analyses to recommend tweets. The idea is to recommend tweets that are not visible to the user (i.e., they do not appear in the user's timeline) because nobody in her social circles published or retweeted them. To do so, we create the user's ego-network up to depth two and apply the transitivity property of the friends-of-friends relationship to determine interesting recommendations, which are then ranked to best match the user's interests. Experimental results demonstrate that our approach improves over the state-of-the-art technique.
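The candidate-generation step can be sketched as follows, with an assumed data layout; content-based ranking of the candidates against the user's interests would follow as a separate step.

```python
# Sketch of the friends-of-friends candidate generation (assumed data layout):
# tweets posted or retweeted by accounts at depth two of the user's ego-network
# are candidates, since they never appear in the user's own timeline.
def candidate_tweets(user, follows, tweets_by_user):
    """follows: dict user -> set of followed accounts;
    tweets_by_user: dict user -> list of tweet ids (posts and retweets)."""
    friends = follows.get(user, set())
    friends_of_friends = set()
    for f in friends:
        friends_of_friends |= follows.get(f, set())
    # Keep only accounts outside the user's direct circle (depth exactly two).
    friends_of_friends -= friends | {user}
    seen = {t for f in friends | {user} for t in tweets_by_user.get(f, [])}
    candidates = set()
    for fof in friends_of_friends:
        candidates.update(tweets_by_user.get(fof, []))
    return candidates - seen   # tweets not visible in the user's timeline
```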
Crowdsourcing is a computational paradigm whose distinctive feature is the involvement of human workers in key steps of the computation. It is used successfully to address problems that would be hard or impossible to solve for machines. As we highlight in this work, the exclusive use of non-expert individuals may prove ineffective in some cases, especially when the task at hand or the need for accurate solutions demands some degree of specialization to avoid excessive uncertainty and inconsistency in the answers. We address this limitation by proposing an approach that combines the wisdom of the crowd with the educated opinion of experts. We present a computational model for crowdsourcing that envisions two classes of workers with different expertise levels. One of its distinctive features is the adoption of the threshold error model, whose roots are in psychometrics and which we extend from previous theoretical work. Our computational model allows us to evaluate the performance of crowdsourcing algorithms with respect to accuracy and cost. We use our model to develop and analyze an algorithm for approximating the best element, in a broad sense, of a set. The algorithm uses naïve and expert workers to find an element that is a constant-factor approximation of the best. We prove upper and lower bounds on the number of comparisons needed to solve this problem, showing that our algorithm uses expert and naïve workers optimally up to a constant factor. Finally, we evaluate our algorithm on real and synthetic datasets using the CrowdFlower crowdsourcing platform, showing that our approach is also effective in practice.
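To make the two-class idea concrete, here is a purely illustrative simulation, not the algorithm analyzed in the paper: noisy "naïve" comparisons run a tournament that prunes the candidate set, and a few more reliable "expert" comparisons pick a winner among the finalists. Error rates and the number of finalists are arbitrary assumptions.

```python
# Illustrative simulation only (not the paper's algorithm or error model).
import random

def compare(a, b, error):
    """Return the larger of a and b, but err with the given probability."""
    better, worse = (a, b) if a >= b else (b, a)
    return worse if random.random() < error else better

def approximate_best(values, naive_error=0.3, expert_error=0.05, finalists=4):
    pool = list(values)
    random.shuffle(pool)
    while len(pool) > finalists:          # tournament rounds with naïve workers
        nxt = [compare(pool[i], pool[i + 1], naive_error)
               for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:                 # odd element gets a bye
            nxt.append(pool[-1])
        pool = nxt
    best = pool[0]
    for x in pool[1:]:                    # final round-robin with expert workers
        best = compare(best, x, expert_error)
    return best
```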
We design algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the collection such that high-quality answers to user queries can be provided efficiently using only the selected subset. This approach has applications when space is a constraint or when the query-processing time increases significantly with the size of the collection. We study our algorithms through the lens of stochastic analysis and prove that, even though they use only a small fraction of the entire collection, they can provide answers to most user queries, achieving performance close to optimal. To complement our theoretical findings, we experimentally show the versatility of our approach by considering two important cases in the context of Web search. In the first case, we favor the retrieval of documents that are relevant to the query, whereas in the second case we aim for document diversification. Both the theoretical and the experimental analyses provide strong evidence of the potential value of query covering in diverse application scenarios.
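The evaluation side of this idea can be sketched briefly, under assumed definitions: a query counts as answerable from the selected subset when all of its top-k results from the full index are contained in that subset.

```python
# Hedged sketch of measuring how well a selected subset covers sampled queries
# (the data layout and the notion of "answered" are illustrative assumptions).
def coverage(selected_docs, query_results, k=10):
    """query_results: dict query -> ranked list of doc ids from the full index."""
    selected = set(selected_docs)
    covered = sum(1 for results in query_results.values()
                  if set(results[:k]) <= selected)
    return covered / len(query_results)
```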
This paper proposes a new model of user-centric, global, probabilistic privacy, geared to today's challenge of helping users manage their privacy-sensitive information across a wide variety of social networks, online communities, QA forums, and search histories. Our approach anticipates an adversary that harnesses global background knowledge and rich statistics in order to make educated guesses, that is, probabilistic inferences about sensitive data. We aim for a tool that simulates such a powerful adversary, predicts privacy risks, and guides the user. In this paper, our framework is specialized to the case of Internet search histories. We present preliminary experiments that demonstrate how estimators of global correlations among sensitive and non-sensitive key-value items can be fed into a probabilistic graphical model in order to compute meaningful measures of privacy risk.
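As a toy stand-in for the full graphical model, the sketch below estimates, naive-Bayes style, how strongly a user's observed non-sensitive key-value items raise the probability of a sensitive item, using co-occurrence statistics from a background collection of histories. Both the data layout and the independence assumption are simplifications for illustration.

```python
# Hedged naive-Bayes sketch of the inference idea (not the paper's model).
import math

def risk_score(observed_items, sensitive_item, histories):
    """histories: list of sets of key-value items, one set per background user."""
    with_s = [h for h in histories if sensitive_item in h]
    without_s = [h for h in histories if sensitive_item not in h]
    # Smoothed prior log-odds of holding the sensitive item.
    log_odds = math.log((len(with_s) + 1) / (len(without_s) + 1))
    for item in observed_items:
        p_given_s = (sum(item in h for h in with_s) + 1) / (len(with_s) + 2)
        p_given_not = (sum(item in h for h in without_s) + 1) / (len(without_s) + 2)
        log_odds += math.log(p_given_s / p_given_not)
    return 1 / (1 + math.exp(-log_odds))   # posterior probability estimate
```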
Phrase queries are a key functionality of modern search engines. Beyond that, they increasingly serve as an important building block for applications such as entity-oriented search, text analytics, and plagiarism detection. Processing phrase queries is costly, though, since positional information has to be kept in the index and all words, including stopwords, need to be considered. We consider an augmented inverted index that indexes selected variable-length multi-word sequences in addition to single words, and we study how arbitrary phrase queries can be processed efficiently on such an index. We show that the underlying optimization problem is NP-hard in the general case and describe an exact exponential algorithm and an approximation algorithm for its solution. Experiments on the ClueWeb09 and The New York Times corpora with different real-world query workloads examine the practical performance of our methods.
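The query-decomposition step over such an augmented index can be sketched with a simple dynamic program; this is an illustration under assumed cost estimates, not the exact or approximation algorithms analyzed in the paper.

```python
# Hedged sketch: split a phrase query into word sequences present in the
# augmented index, minimizing total estimated postings cost (illustrative).
def segment_phrase(words, posting_cost):
    """posting_cost: dict mapping an indexed word sequence (tuple of words)
    to its estimated processing cost; single words are assumed indexed."""
    n = len(words)
    best = [0.0] + [float("inf")] * n      # best[i] = cheapest cover of words[:i]
    choice = [None] * (n + 1)
    for i in range(n):
        if best[i] == float("inf"):
            continue
        for j in range(i + 1, n + 1):
            seq = tuple(words[i:j])
            if seq in posting_cost and best[i] + posting_cost[seq] < best[j]:
                best[j] = best[i] + posting_cost[seq]
                choice[j] = i
    # Reconstruct the chosen segmentation.
    segments, j = [], n
    while j > 0:
        i = choice[j]
        segments.append(tuple(words[i:j]))
        j = i
    return list(reversed(segments)), best[n]
```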
Web usage mining is the application of data-mining techniques to the data generated by the interactions of users with web servers. This kind of data, stored in server logs, represents a valuable source of information, which can be exploited to optimize the document-retrieval task or to better understand, and thus satisfy, user needs. Our research focuses on two important issues: improving search-engine performance through static caching of search results, and helping users find interesting web pages by recommending news articles and blog posts. Concerning the static caching of search results, we present the query-covering approach. The general idea is to populate the cache with those documents that contribute to the result pages of a large number of queries, as opposed to caching the top documents of the most frequent queries. For the recommendation of web pages, we present a graph-based approach that leverages user-browsing logs to identify early adopters. These users discover interesting content before others, and by monitoring their activity we can find web pages to recommend.
In this paper we present a novel graph-based data abstraction for modeling the browsing behavior of web users. The objective is to identify users who discover interesting pages before others; we call these users early adopters. By tracking the browsing activity of early adopters we can identify new interesting pages early and recommend them to similar users. We focus on news and blog pages, which are more dynamic in nature and more appropriate for recommendation. Our proposed model is called the early-adopter graph. In this graph, nodes represent users, and a directed arc between users u and v expresses the fact that u and v visit similar pages and, in particular, that user u tends to visit those pages before user v. The weight of the edge is the degree to which the temporal rule "u visits a page before v" holds. Based on the early-adopter graph, we build a recommendation system for news and blog pages, which outperforms other off-the-shelf recommendation systems based on collaborative filtering.
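The edge-weight construction can be sketched as follows, with an assumed log format; the paper's exact weighting and thresholds may differ.

```python
# Sketch of early-adopter edge weights (assumed log format): for each ordered
# pair of users sharing enough visited pages, the weight is the fraction of
# common pages that u first visited before v.
def early_adopter_edges(first_visit, min_common=5):
    """first_visit: dict user -> {page: timestamp of the user's first visit}."""
    users = list(first_visit)
    edges = {}
    for u in users:
        for v in users:
            if u == v:
                continue
            common = set(first_visit[u]) & set(first_visit[v])
            if len(common) < min_common:
                continue
            earlier = sum(first_visit[u][p] < first_visit[v][p] for p in common)
            edges[(u, v)] = earlier / len(common)  # how often "u before v" holds
    return edges
```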
A. Anagnostopoulos, L. Becchetti, S. Leonardi, I. Mele, and P. Sankowski. Stochastic Query Covering. In WSDM '11: Proceedings of the 4th ACM International Conference on Web Search and Data Mining. (Best Poster Award)
In this paper we introduce the problem of query covering as a means to cache query results efficiently. The general idea is to populate the cache with documents that contribute to the result pages of a large number of queries, as opposed to caching the top documents for each query. It turns out that the problem is hard, and solving it requires knowledge of the structure of the queries and of the result space, as well as knowledge of the input query distribution. We formulate the problem within the framework of stochastic optimization; theoretically, it can be seen as a stochastic universal version of set multicover. While the problem is NP-hard to solve exactly, we show that for any distribution it can be approximated using a simple greedy approach. Our theoretical findings are complemented by experiments on real datasets, showing the feasibility and potential interest of query-covering approaches in practice.
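A greedy selection in the spirit described above can be sketched briefly; the coverage requirement, query sampling, and data layout below are simplifying assumptions rather than the exact formulation in the paper.

```python
# Hedged sketch of greedy query covering: repeatedly add the document that
# helps the largest number of still-uncovered queries sampled from the
# query distribution (simplified multicover variant).
def greedy_cover(query_results, budget, need=1):
    """query_results: dict query -> set of doc ids in its result page;
    a query counts as covered once `need` of its documents are selected."""
    selected, covered_count = set(), {q: 0 for q in query_results}
    while len(selected) < budget:
        gains = {}
        for q, docs in query_results.items():
            if covered_count[q] >= need:
                continue
            for d in docs - selected:
                gains[d] = gains.get(d, 0) + 1
        if not gains:
            break                                  # every sampled query is covered
        best_doc = max(gains, key=gains.get)
        selected.add(best_doc)
        for q, docs in query_results.items():
            if best_doc in docs:
                covered_count[q] += 1
    return selected
```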