# Grammar-Based Information Extraction

Contents

## Scenario

Recognizing and extracting meaningful information from unstructured Web documents is an important problem in information and knowledge management. In the recent literature a number of approaches for Information Extraction from unstructured documents have been proposed. Among them, the HiLeX system [Ruffolo et al JELIA'06], based on ASP, is receiving quite some attention also in industry, and its industrial exploitation is the purpose of a joint-venture between an Italian and US company from Chicago.

A strong point of HiLeX is that its extraction method combines syntactic and SEMANTIC information, through the usage of domain ontologies. A pre-processor transforms input documents in ASP facts; extraction rules are translated into ASP; and information extraction amounts to reasoning on an ASP program, which is executed by DLV.

The present benchmark is "distilled" from a simple HiLeX application.

## Problem Description

We are given the following context free grammar, which specifies arithmetic equations:

<equation> -> <expression> <greater> <sign_1> <number> <expression> -> <sign_1> <term> <term_1> <term_1> -> <sign> <term> <term_1> | epsilon <term> -> <number> | <op_bracket> <expression> <cl_bracket> <sign> -> <plus> | <minus> <sign_1> -> <sign> | epsilon <number> -> <digit> <number_1> <number_1> -> <digit> <number_1> | epsilon <plus> -> "+" <minus> -> "-" <greater> -> ">" <op_bracket> -> "(" <cl_bracket> -> ")" <digit> -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

The nonterminals, especially <equation>, <expression>, <term>, <number>, have an associated value, usually representing a numerical value, and in the case of <equation> a binary value.

Given an equation (sequence of terminals), the problem is determining whether it belongs to the language defined by this grammar (starting from <equation>) and if so, whether the value associated to the axiom is "true" proving that the equation is true as well.

## Input format

An instance of this problem contains a fact (of arity two) for each character of the equation. The first argument is a constant from the set {p, m, g, o, c, 0, 1, ..., 9} representing (with a positional mapping) a terminal of the alphabet {'+', '-', '>', '(', ')', '0', '1', ..., '9'}, and the second one is its position (starting from 1). The instance "9-(3+(4-2)+1)>-1" from above (of length 16) would be represented by the following atoms:

char(9,1). char(m,2). char(o,3). char(3,4). char(p,5). char(o,6). char(4,7). char(m,8). char(2,9). char(c,10). char(p,11). char(1,12). char(c,13). char(g,14). char(m,15). char(1,16).

In the instances provided for the competition, 1.200.000 (one million two hundred thousand) is an upper bound for the numerical values that may occur at any point (so also each intermediate result may be at most 1.200.000), while the length of the input string will not be longer than 2.000.000 bytes characters.

## Output format

If the string described by the input does not belong to the language (is not a well-formed expression), or if it is a well-formed but false expression, then the instance should be unsatisfiable. If the string is well-formed and true, the witness should consist of:

sol. values(S1,N1,S2,N2).

Here the value of two expressions are represented by (S1,N1), (S2,N2), where Si is a sign (p or m, standing for {'+','-'}) and Ni a positive integer in the set {0,1,2,...}. Number Zero is represented as (p,0). Of course, the number represented by (S1,N1) should be larger than (S2,N2).

For instance, in the above example, as the value of "9-(3+(4-2)+1)" is 3, then the answer set has to contain the atoms:

sol. values(p,3,m,1).

## Example

## Author(s)

Author: Marco Manna

Affiliation: University of Calabria, Italy