Faculty Seminar Series

Authorship Attribution Based on a Probabilistic Topic Model

The Faculty of Informatics is pleased to announce a seminar given by Jacques Savoy

DATE: Monday, April 16th, 2012
PLACE: USI Università della Svizzera italiana, room SI-007, Informatics building (Via G. Buffi 13)
TIME: 15.30
In this presentation we describe the problem of authorship attribution (whereby the author of a given text must be determined based on text samples written by known authors) as well as related questions. Classical methods address these problems with stylistic measurements (e.g., the Delta rule), while the machine learning paradigm (e.g., naïve Bayes) offers further approaches based on training examples. In this talk, we will show how Latent Dirichlet Allocation (LDA) can be used as an approach to authorship attribution. Based on this generative probabilistic topic model, we can model each document as a mixture of topic distributions, with each topic specifying a distribution over words. Some examples will show the potential of this approach. Based on author profiles (the aggregation of all texts written by the same writer), we then suggest computing the distance between each profile and a disputed text to determine its possible writer. To evaluate this algorithm and demonstrate its relative effectiveness, we conduct two experiments, the first based on 5,408 newspaper articles (Glasgow Herald) written in English by 20 distinct authors and the second on 4,326 newspaper articles (La Stampa) written in Italian by 20 distinct authors. The results tend to show that the LDA-based classification scheme performs better than the Delta rule and the χ² distance, two classical approaches in authorship attribution based on a restricted number of terms. Compared to the Kullback-Leibler divergence, the LDA-based scheme can provide better effectiveness when considering a larger number of terms. When compared to the naïve Bayes approach, the performance differences tend to be small.
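
The profile-and-distance step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's LDA-based method: as a simplifying assumption, each author profile is a smoothed word-frequency distribution over a small fixed vocabulary rather than a topic distribution, and the distance is the Kullback-Leibler divergence, one of the baselines named above. The function names, the smoothing parameter `alpha`, and the toy texts are all hypothetical.

```python
import math
from collections import Counter

def profile(texts, vocabulary, alpha=1.0):
    """Aggregate texts into one smoothed word distribution over `vocabulary`."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w in vocabulary)
    # Additive smoothing (alpha is an assumed parameter) keeps every probability > 0.
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for distributions over the same words."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def attribute(disputed_text, texts_by_author, vocabulary):
    """Return the author whose profile is closest to the disputed text."""
    query = profile([disputed_text], vocabulary)
    profiles = {a: profile(t, vocabulary) for a, t in texts_by_author.items()}
    return min(profiles, key=lambda a: kl_divergence(query, profiles[a]))

# Toy data, invented purely for illustration.
vocabulary = ["the", "of", "and", "upon", "whilst", "while"]
texts_by_author = {
    "A": ["whilst the rain fell upon the roof",
          "upon the hill and whilst the sun set"],
    "B": ["while the rain fell on the roof of the house",
          "while the sun set and the moon rose"],
}
disputed = "the cat sat upon the mat whilst the dog slept"
print(attribute(disputed, texts_by_author, vocabulary))
```

In an LDA-based variant, `profile` would presumably return a distribution inferred by the topic model instead of relative word frequencies, while the nearest-profile decision would stay the same.
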
Prof. Jacques Savoy is a full Professor of Computer Science at the University of Neuchatel (Switzerland). He received a Ph.D. in quantitative economics from the University of Fribourg (Switzerland) in 1987. From 1987 to 1992 he was on the faculty of Computer Science at the University of Montreal (Canada). His research interests mainly cover natural language processing, particularly information retrieval for languages other than English (European, Asian, and Indian), as well as multilingual and cross-lingual information retrieval. He has participated for many years in various evaluation campaigns (CLEF, NTCIR and TREC) dealing with these questions. His current research interests are related to text clustering and categorization as well as authorship attribution.

HOST: Prof. Fabio Crestani