Saturday, August 18, 2012

Information Extraction from Web Search perspective


In my opinion, the killer application of IE is search. Classic Web search is all about string matching and page ranking; modern search is about understanding user intent (moving from bag-of-keywords queries to entity/concept queries) and about ranking entities/objects.

Information Extraction (IE) refers to the extraction of structured information, typically by automatic means, from unstructured sources: entities (for example Location, Person), relationships between entities (for example Located-In, Employed-By), and attributes describing entities. Extracting structured information from noisy, unstructured sources is a challenging task. IE is a part of Natural Language Processing (NLP).

In order to turn the current information-based web into a knowledge-based web, we should convert HTML (the dominant format of the web) into RDF. IE is the first step in processing natural language text by machine. The next step is to map/link the annotated text to an ontology.
The third step is to convert the output into RDF. Generating RDF from annotated text is not a trivial task.
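
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Annotation` class, the `ONTOLOGY` lookup table, and all `example.org` URIs are hypothetical, and the output is a single line in N-Triples syntax (one of the plain-text RDF serializations). A real pipeline would need entity disambiguation and literal handling, which is where most of the difficulty lies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Step 1 output: a span of text tagged by an IE system."""
    text: str         # surface string found in the page
    entity_type: str  # e.g. "Person", "Organization"


# Step 2 (hypothetical): a tiny lookup that links surface strings to ontology URIs.
ONTOLOGY = {
    "Alice Smith": "http://example.org/entity/Alice_Smith",
    "Acme Corp": "http://example.org/entity/Acme_Corp",
}
EMPLOYED_BY = "http://example.org/ontology/employedBy"


def link(annotation):
    """Return the ontology URI for an annotation, or None if it is unknown."""
    return ONTOLOGY.get(annotation.text)


def to_ntriple(subject_uri, predicate_uri, object_uri):
    """Step 3: serialize one subject-predicate-object statement as N-Triples."""
    return f"<{subject_uri}> <{predicate_uri}> <{object_uri}> ."


person = Annotation("Alice Smith", "Person")
org = Annotation("Acme Corp", "Organization")
triple = to_ntriple(link(person), EMPLOYED_BY, link(org))
print(triple)
```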

The techniques of IE have evolved considerably over the last two decades.
1) Early systems were rule-based with manually coded rules. As manual coding of rules became complex and tedious, algorithms for automatically learning rules from examples were developed. As extraction systems were targeted at noisier unstructured sources, rules were found to be too brittle.

2) Then came the age of statistical learning, where in parallel two kinds of techniques were deployed: generative models based on Hidden Markov Models and conditional models based on maximum entropy. Both were superseded by global conditional models, popularly called Conditional Random Fields.

3) There also exist hybrid models that attempt to use the benefits of both statistical and rule-based methods.

As the scope of extraction systems widened to require a more holistic analysis of a document’s structure, techniques from grammar construction were developed. In spite of this journey of varied techniques, there is no clear winner. Rule-based methods and statistical methods continue to be used in parallel depending on the nature of the extraction task.
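
To make the rule-based end of this spectrum concrete, here is a minimal sketch of a hand-coded extractor. The title rule (`PERSON_RULE`) and the small `LOCATIONS` gazetteer are my own illustrative assumptions, not rules from any particular system. The second call shows the brittleness mentioned above: the same sentence in lower case, without the period after the title, yields nothing.

```python
import re

# Hypothetical gazetteer: a fixed list of known location names.
LOCATIONS = {"Paris", "London", "New York"}

# Hand-written rule: a title followed by one or two capitalized words is a person.
PERSON_RULE = re.compile(r"\b(?:Mr|Ms|Dr)\.\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)")


def extract_entities(text):
    """Return (entity_type, surface_string) pairs found by the rules."""
    entities = []
    for match in PERSON_RULE.finditer(text):
        entities.append(("Person", match.group(1)))
    for name in sorted(LOCATIONS):
        # Whole-word, case-sensitive lookup of each gazetteer entry.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            entities.append(("Location", name))
    return entities


print(extract_entities("Dr. Jane Doe moved to London."))
print(extract_entities("dr jane doe moved to london"))  # noisy input: rules miss everything
```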

Keyword searches are adequate for getting information about entities, which are typically nouns or noun phrases. They fail on the following kinds of queries:
- Abstract queries (Ex: an actor who won an Oscar in 2012)
- Queries that look for relationships between entities (Ex: Company X acquired Company Y)
- Queries that search by entity attributes (Ex: a car in the 15k to 20k range)
- Queries based on a knowledge base (Ex: entity mentions from DBpedia)
- etc.
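
The attribute case is easy to illustrate. In the sketch below I assume the "15k to 20k" range refers to price, and the `listings` data is made up; each listing keeps its original text plus a `price` attribute that an IE step would have extracted. Keyword matching finds nothing, because no listing literally contains "car" or "15k", while a filter over the extracted attribute answers the query directly.

```python
# Made-up listings: raw text plus a price attribute extracted from it.
listings = [
    {"text": "Compact hatchback, priced at $14,500", "price": 14500},
    {"text": "Family sedan, priced at $17,900", "price": 17900},
    {"text": "Sports coupe, priced at $26,000", "price": 26000},
]


def keyword_search(query):
    """Classic matching: every query term must appear in the listing text."""
    terms = query.lower().split()
    return [
        item["text"]
        for item in listings
        if all(term in item["text"].lower() for term in terms)
    ]


def price_range_search(low, high):
    """Attribute query: filter on the extracted price, not on the text."""
    return [item["text"] for item in listings if low <= item["price"] <= high]


print(keyword_search("car 15k to 20k"))    # no listing contains these strings
print(price_range_search(15000, 20000))
```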

What information can be extracted?
Entities, relationships, lists, tables, attributes, adjectives describing entities (opinion mining), and ontologies.

What is meant by annotating relationships?
Relationships are defined over two or more entities related in a predefined way. Examples are the “employed-by” relationship between a person and an organization and the “acquired-by” relationship between pairs of companies. These are binary relationships (i.e. the relationship is defined between two entities). The harder part of annotation is relationships over more than two entities, which is called annotating multi-way relationships. Two techniques are used by the NLP community for this: Semantic Role Labeling and record extraction.
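
The difference between a binary relationship and a multi-way one is easiest to see as data. The sketch below is an assumption-heavy illustration: the `ACQUIRED` pattern, the type names, and the company names "Alpha" and "Beta" are all hypothetical, and a single regular expression stands in for a real extractor. A binary relationship is a relation name with two arguments; a record has one named slot per participant (here acquirer, target, and an optional year).

```python
import re
from typing import NamedTuple, Optional


class BinaryRelation(NamedTuple):
    relation: str
    arg1: str
    arg2: str


class AcquisitionRecord(NamedTuple):
    """Multi-way relationship: one slot per participant."""
    acquirer: str
    target: str
    year: Optional[int]


# Toy pattern: "<Company> acquired <Company>", optionally followed by "in <year>".
ACQUIRED = re.compile(
    r"(?P<acquirer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+)(?: in (?P<year>\d{4}))?"
)


def extract(sentence):
    """Return (binary relation, record), or (None, None) if the pattern fails."""
    match = ACQUIRED.search(sentence)
    if match is None:
        return None, None
    # "acquired-by" reads target -> acquirer, so the target is the first argument.
    binary = BinaryRelation("acquired-by", match.group("target"), match.group("acquirer"))
    year = int(match.group("year")) if match.group("year") else None
    record = AcquisitionRecord(match.group("acquirer"), match.group("target"), year)
    return binary, record


binary, record = extract("Alpha acquired Beta in 2010.")
print(binary)
print(record)
```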
More will come on this topic…
