Grammar-Based Information Extraction

Contents

Grammar-Based Information Extraction

Scenario

Recognizing and extracting meaningful information from unstructured Web documents is an important problem in information and knowledge management. In the recent literature a number of approaches for Information Extraction from unstructured documents have been proposed. Among them, the HiLeX system [Ruffolo et al JELIA'06], based on ASP, is receiving quite some attention also in industry, and its industrial exploitation is the purpose of a joint-venture between an Italian and US company from Chicago.

A strong point of HiLeX is that its extraction method combines syntactic and SEMANTIC information, through the usage of domain ontologies. A pre-processor transforms input documents in ASP facts; extraction rules are translated into ASP; and information extraction amounts to reasoning on an ASP program, which is executed by DLV.

The present benchmark is "distilled" from a simple HiLeX application.

Problem Description

We are given the following context free grammar, which specifies arithmetic equations:

   <equation> -> <expression> <greater> <sign_1> <number> 
   <expression> -> <sign_1> <term> <term_1> 
   <term_1> -> <sign> <term> <term_1> | epsilon 
   <term> -> <number> | <op_bracket> <expression> <cl_bracket> 
   <sign> -> <plus> | <minus> 
   <sign_1> -> <sign> | epsilon 
   <number> -> <digit> <number_1> 
   <number_1> -> <digit> <number_1> | epsilon 
   <plus> -> "+" 
   <minus> -> "-" 
   <greater> -> ">" 
   <op_bracket> -> "(" 
   <cl_bracket> -> ")" 
   <digit> -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

The nonterminals, especially <equation>, <expression>, <term>, <number>, have an associated value, usually representing a numerical value, and in the case of <equation> a binary value.

Given an equation (sequence of terminals), the problem is determining whether it belongs to the language defined by this grammar (starting from <equation>) and if so, whether the value associated to the axiom is "true" proving that the equation is true as well.

Predicates

Input: char/2
Output: values/4, sol/0

Input format

An instance of this problem contains a fact (of arity two) for each character of the equation. The first argument is a constant from the set {p, m, g, o, c, 0, 1, ..., 9} representing, respectively a terminal of the alphabet {'+', '-', '>', '(', ')', '0', '1', ..., '9'}, and the second one is its position (starting from 1). The instance "9-(3+(4-2)+1)>-1" from above (of length 16) would be represented by the following atoms:

char(9,1). char(m,2). char(o,3). char(3,4). char(p,5). char(o,6). char(4,7). char(m,8). char(2,9). char(c,10). char(p,11). char(1,12). char(c,13). char(g,14). char(m,15). char(1,16).

In the instances provided for the competition, 1.500.000 (one million two hundred thousand) is an upper bound for the numerical values that may occur at any point (so also each intermediate result may be at most 1.500.000), while the length of the input string will not be longer than 4 megabytes characters.

Output format

If the string described by the input does not belong to the language (is not a well-formed expression), or if it is a well-formed but false expression, then the instance should be unsatisfiable. If the string is well-formed and true, the witness should consist of:

sol. values(S1,N1,S2,N2).

Here the value of two expressions are represented by (S1,N1), (S2,N2), where Si is a sign (p or m, standing for {'+','-'}) and Ni a positive integer in the set {0,1,2,...}. The value zero is represented as (p,0). Of course, the number represented by (S1,N1) should be larger than (S2,N2).

For instance, in the above example, as the value of "9-(3+(4-2)+1)" is 3, then the answer set has to contain the atoms:

sol. values(p,3,m,1).

Example(s)

Input: char(9,1). char(m,2). char(o,3). char(3,4). char(p,5). char(o,6). char(4,7). char(m,8). char(2,9). char(c,10). char(p,11). char(1,12). char(c,13). char(g,14). char(m,15). char(1,16).

Output: sol. values(p,3,m,1).

Problem Peculiarities

Type: Search/P Competition: both System and M&S

The problem reflects an applicative context in which a string, candidate to be member of a given language, has to be recognized (and evaluated) in extremely small times. The usage of a declarative language for specifying the grammar at hand makes applications (e.g. data extraction systems) very flexible, at the cost of some loss of performance. The problem, whose benchmarks will be executed on fairly big instances, will measure how reasonable such a loss can be with current declarative languages technology.

Author(s)

Author: Marco Manna
- Affiliation: University of Calabria, Italy

ASP Competition 2011: FinalProblemDescriptions/Grammar-BasedInformationExtraction (last edited 2011-02-02 17:59:27 by MarioAlviano)

ASP Competition 2011

Quick Links

Search Wiki

Page Tools