Information Extraction from Unstructured, Ungrammatical Data Sources on the Web

Staff - Faculty of Informatics

Start date: 12 July 2010

End date: 13 July 2010

The Faculty of Informatics is pleased to announce a seminar given by Craig A. Knoblock

DATE: Monday, July 12th, 2010
PLACE: USI Università della Svizzera italiana, room SI-008, Informatics building (Via G. Buffi 13)
TIME: 14.00

There is a vast amount of textual data on the Web that is neither grammatical nor formally structured. Examples of such data sources are online classified ads from Craigslist and auction item listings from eBay. The unstructured nature of this data makes querying and integration difficult because the attributes are embedded within the text. Traditional information extraction (IE) techniques are inadequate for this task because they rely on clues from the data, such as regular page structure or natural language grammar, neither of which is found in posts. Furthermore, traditional information extraction does not incorporate data cleaning, which is necessary to accurately query and integrate the source. The two-step approach described in this paper creates relational data sets from unstructured and ungrammatical text by addressing both issues. To do this, we require a set of known entities called a reference set. The first step aligns each post to each member of a reference set. The second step performs information extraction for the attributes, including attributes not easily represented by reference sets, such as a price. In this manner we create a relational structure over previously unstructured data, which supports deep and accurate queries over the data and provides standard values for integration. Our experimental results show that our technique matches the posts to the reference set accurately and efficiently, and that it outperforms state-of-the-art extraction systems on the task of extraction from posts.
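As a rough illustration of the two steps described above, the following Python sketch aligns a post to a small reference set and then extracts attributes, including a price that the reference set does not cover. It is not the authors' implementation: the reference set contents, the character-level similarity from Python's difflib, the 0.8 threshold, and the price pattern are all assumptions chosen only to keep the example small.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference set of known entities (here: car makes and models).
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]


def tokenize(post):
    """Lowercase the post and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", post.lower())


def best_token_score(value, tokens):
    """Highest string similarity between a reference value and any post token."""
    value = value.lower()
    return max(
        (SequenceMatcher(None, value, tok).ratio() for tok in tokens),
        default=0.0,
    )


def align(post, reference_set):
    """Step 1: score the post against every reference record, keep the best."""
    tokens = tokenize(post)

    def record_score(record):
        return sum(best_token_score(v, tokens) for v in record.values())

    return max(reference_set, key=record_score)


def extract(post, record, threshold=0.8):
    """Step 2: keep attributes supported by the post, plus a regex-found price."""
    tokens = tokenize(post)
    result = {}
    for attribute, value in record.items():
        # Assumed rule: an attribute counts as present if some token is similar enough.
        supported = best_token_score(value, tokens) >= threshold
        result[attribute] = value if supported else None
    # The price is not in the reference set, so it is found with a simple pattern.
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    result["price"] = int(price.group(1).replace(",", "")) if price else None
    return result


if __name__ == "__main__":
    post = "93 hnda civc 4dr, runs great $2,500 obo"
    match = align(post, REFERENCE_SET)
    print(match)
    print(extract(post, match))
```

Because the extracted values are taken from the matched reference record rather than from the misspelled post tokens, the output uses standard values, which is the data cleaning aspect the abstract mentions.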

This research is joint work with Matthew Michelson.

Craig Knoblock is the Director of Information Integration at the Information Sciences Institute, a unit of the University of Southern California (USC), and a Research Professor in the USC Computer Science Department. Dr. Knoblock is also a founder and Chief Scientist of Fetch Technologies, a web extraction and integration provider, and of Geosemble Technologies, which develops geospatial information solutions. At the Information Sciences Institute (ISI), Dr. Knoblock leads a team of about 20 researchers, staff and students in developing intelligent techniques for rapid, efficient information integration. He focuses on constructing integrated applications from online sources through information extraction, source modeling, record linkage, constraint reasoning and other technologies for geospatial and bioinformatics data integration.

Dr. Knoblock is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), a Distinguished Scientist of the Association for Computing Machinery (ACM), a Trustee of the International Joint Conference on Artificial Intelligence (IJCAI), and past President of the International Conference on Automated Planning and Scheduling (ICAPS). He has served on the Senior Program Committee of the National Artificial Intelligence Conference, among others, and is conference chair for the 2011 International Joint Conference on AI (IJCAI). Dr. Knoblock has published Generating Abstraction Hierarchies (Kluwer Academic Publishers, 1993), along with more than 200 journal articles, book chapters and conference papers. He serves on the Editorial Boards of several journals, including Artificial Intelligence and Computational Intelligence. Dr. Knoblock was awarded his Bachelor of Science degree by Syracuse University, and his Master's and Ph.D. by Carnegie Mellon University, all in computer science. He is currently spending his summer teaching a course on geospatial data integration at the University of Trento.

HOST: Prof. Fabio Crestani