Informatics seminar on Wednesday, March 4th, 15.30 - Bruno Pouliquen and Hristo Tanev

Faculty of Informatics - Academic Studies Administration

Start date: 4 March 2009

End date: 5 March 2009

The Faculty of Informatics is pleased to announce a seminar given by Bruno Pouliquen and Hristo Tanev.

TITLE: Pattern based Information extraction from Multilingual News

SPEAKERS: Bruno Pouliquen, Hristo Tanev - Joint Research Centre

DATE: Wednesday, March 4th, 2009

PLACE: USI Università della Svizzera italiana, room SI-006, Informatics building (Via G. Buffi 13)

TIME: 15.30

ABSTRACT:

The Joint Research Centre of the European Commission in Ispra (Italy) has created a unique tool, Europe Media Monitor (EMM), that gathers, classifies and analyses over 80,000 newspaper articles in 42 languages from more than 2,000 sites worldwide on a daily basis.

Natural language processing is an essential component of the system. Our context is quite specific: we have many languages to deal with, few linguistic resources available and even fewer human resources. We try to avoid using parsers because they are not available for some languages (other languages have their own parsers, but these are not homogeneous). We try to separate rules from resources and use bottom-up approaches when possible.

We will first present tools that use simple surface analysis:

- Clustering and linking news over time and languages.

- Named Entity Recognition: geotagging (place name recognition using gazetteers and statistical disambiguation rules) and person name recognition (guessing names using trigger words, handling declensions, and identifying and merging name variants, especially across writing systems).

- Knowledge base updating and querying (linking news, persons, place names, keywords...)

- Quotation recognition (direct speech extraction)
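
As an illustration of the surface-level analysis listed above, the following is a minimal sketch of person name guessing with trigger words. It is not the EMM implementation: the trigger list, the name pattern and the function name guess_person_names are assumptions made for this example, and it only handles capitalised Latin-script names in English text.

```python
import re

# Hypothetical trigger words (titles that typically precede a person name).
TRIGGERS = ["President", "Prime Minister", "Minister", "Chancellor"]

# Assumption: a name is two or more capitalised words in Latin script.
NAME = r"((?:[A-Z][a-z]+)(?:\s[A-Z][a-z]+)+)"
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in TRIGGERS) + r")\s+" + NAME
)

def guess_person_names(text):
    """Return candidate person names that directly follow a trigger word."""
    return [match.group(1) for match in PATTERN.finditer(text)]

print(guess_person_names(
    "French President Nicolas Sarkozy met Chancellor Angela Merkel in Berlin."
))
```

A multilingual system would need per-language trigger lists as well as handling of declensions and name variants, which this sketch leaves out.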

Real-world event and relation extraction requires effective tools for automatic pattern learning and matching. To this end, we developed several algorithms:

- Learning of linear patterns using local context entropy
- Representation of a set of syntactic structures through syntactic networks
- Learning and efficient matching of syntactic patterns in the context of the syntactic networks
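
The abstract does not describe how local context entropy is used, so the following is only one plausible reading, given as a sketch. It computes the Shannon entropy of the tokens that appear next to a candidate linear pattern: low entropy suggests the neighbouring token is predictable and could be absorbed into the pattern, while high entropy suggests a pattern boundary or a slot. The function name context_entropy, the placeholder token PER and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter

def context_entropy(sentences, phrase, side="right"):
    """Shannon entropy (in bits) of the tokens adjacent to `phrase`.

    `sentences` is a list of token lists and `phrase` is a list of tokens.
    """
    n = len(phrase)
    neighbours = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == phrase:
                j = i + n if side == "right" else i - 1
                if 0 <= j < len(tokens):
                    neighbours[tokens[j]] += 1
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())

# Toy corpus; PER stands for an already recognised person name.
corpus = [
    ["PER", "was", "arrested", "by", "police"],
    ["PER", "was", "arrested", "by", "soldiers"],
    ["PER", "was", "arrested", "in", "Paris"],
    ["PER", "was", "released", "by", "police"],
]

print(context_entropy(corpus, ["PER"]))                    # only "was" follows
print(context_entropy(corpus, ["was", "arrested", "by"]))  # two different followers
```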

Using these algorithms we learned a library of patterns for event and relation extraction. Building on these libraries we implemented two working systems for event and social network extraction. We carried out several experiments with the automatically extracted social network:
- Automatically deriving the importance of a person in the social network
- Automatic expansion of a signed social network
- Analysis of the contacts of political leaders
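
The abstract does not state which measure is used to derive the importance of a person. As a simple stand-in, the sketch below ranks persons by degree centrality, that is, by the number of distinct contacts each person has in the extracted network. The function name degree_importance and the edge list are hypothetical.

```python
from collections import defaultdict

def degree_importance(edges):
    """Rank persons by number of distinct contacts, most connected first.

    `edges` is an iterable of (person_a, person_b) pairs. Ties are broken
    alphabetically so the ranking is deterministic.
    """
    contacts = defaultdict(set)
    for a, b in edges:
        contacts[a].add(b)
        contacts[b].add(a)
    return sorted(contacts, key=lambda person: (-len(contacts[person]), person))

# Hypothetical contact pairs extracted from news.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
print(degree_importance(edges))
```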

Our approach has some limits but clear advantages. We built a knowledge base on persons, events and worldwide news which is updated daily without human supervision. Our website (see http://press.jrc.it/NewsExplorer/home/it/latest.html) has existed for 4 years and receives an average of more than 1 million hits per day.

More information about the authors and projects is available at http://langtech.jrc.it/, and about the various applications at http://press.jrc.it/overview.html

HOST: Prof. Fabio Crestani