MetaBJTool - A package for metaquery evaluation

 

Introduction

The MetaBJTool prototype implements the stages into which the whole metarule mining process is divided. It consists of two principal modules:

·        Management of the communication protocols

·        Algorithms for computing a metaquery

Our prototype accepts a metaquery in XML format and returns its results in XML format as well. The internal implementation of the algorithms aims at reducing computation time. The instantiation algorithm uses the well-known forward checking technique; in particular, the introduction of backjumping drastically cuts the space of possible solutions, making the evaluation step faster and more efficient. The filtering stage uses the concept of "up-support", an upper-bound estimate of the support index. It allows a pre-selection of the rules found, avoiding the support and confidence computation for rules with little informative content. The reader is referred to Report D3.R1 for an introductory discussion of the metaquery evaluation technique, and to D3.R3 for more insight into metarule evaluation algorithms.

 


Communication Protocol

 

The proposed communication protocol connects external modules (GUI, remote clients, etc.) to the “heart” of the prototype, the metaquery computation module. The instantiation algorithm, which we will call MEE, takes as input a metarule coupled with a set of parameters and constraints.

This information, encoded in XML format, is checked against the Metarules Interchange Format. The answer set to the metarule is then computed and output in XML format as well.

Here is a sketch of the MetaBJTool architecture:

 

 

 

The MIFIn and MIFOut formats are structured as specified in report D3.R2.

For instance, a document in MIFIn format looks as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<metaquery xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="MIFIn.xsd" support="50.00" confidence="75.00">
  <head>
    <metaatom name="P">
      <variable name="X" />
      <variable name="Y" />
    </metaatom>
  </head>
  <body>
    <metaatom name="Q">
      <variable name="X" />
      <variable name="Z" />
    </metaatom>
    <metaatom name="R">
      <variable name="Z" />
      <variable name="K" />
    </metaatom>
  </body>
</metaquery>
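
As an illustration of how a client or the computation module might read such a document, here is a minimal sketch based on the standard Java DOM parser. The class MifInReader and its method names are hypothetical and not part of MetaBJTool; the sketch assumes only the element and attribute names visible in the example above and performs no validation against MIFIn.xsd.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of MetaBJTool.
class MifInReader {

    /** Parses a MIFIn document held in a string (no schema validation). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed XML document", e);
        }
    }

    /** Reads a numeric attribute of the root element, e.g. "support". */
    static double threshold(Document doc, String attribute) {
        return Double.parseDouble(doc.getDocumentElement().getAttribute(attribute));
    }

    /** Lists the metaatoms under the first head or body element as text such as "Q(X,Z)". */
    static List<String> atomsOf(Document doc, String section) {
        Element parent = (Element) doc.getElementsByTagName(section).item(0);
        NodeList atoms = parent.getElementsByTagName("metaatom");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < atoms.getLength(); i++) {
            Element atom = (Element) atoms.item(i);
            NodeList vars = atom.getElementsByTagName("variable");
            List<String> names = new ArrayList<>();
            for (int j = 0; j < vars.getLength(); j++) {
                names.add(((Element) vars.item(j)).getAttribute("name"));
            }
            result.add(atom.getAttribute("name") + "(" + String.join(",", names) + ")");
        }
        return result;
    }
}
```

Applied to the example document, atomsOf(doc, "head") would list the single head literal P(X,Y), and atomsOf(doc, "body") the two body literals Q(X,Z) and R(Z,K).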

 


MetaBJTool Internals

 

Our prototype consists of four stages:

 

·        Access to the database

·        Overlap stage

·        Compute algorithm

·        Evaluation algorithm

 


Overlap stage

Among the various steps that lead to the acquisition of new patterns, preprocessing and data analysis play an important role in the whole mining system.

The aim of this phase is to discard information that is considered useless and could slow down the following steps. It is necessary to suitably choose the pairs of attributes that could take part in an instantiated rule with sufficient support and confidence values.

The same variable appearing in different literal patterns is usually bound to attributes of different relations with the same type. Even when the data types match, however, it is possible to find pairs of table columns that have very few values in common (overlapping values). This is why the concept of overlap was formally introduced.

The instantiation step will generate rules considering only pairs of attributes with enough overlapping values.
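
The idea can be sketched as follows. The formal definition of overlap is not reproduced in this document, so the sketch assumes one plausible measure: the number of distinct values shared by the two columns, divided by the number of distinct values in the smaller column. The class OverlapSketch and its methods are hypothetical names introduced for illustration only.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the overlap measure used here is an assumption,
// not necessarily the formal definition adopted by MetaBJTool.
class OverlapSketch {

    /** Shared distinct values divided by the distinct values of the smaller column. */
    static double overlap(List<String> columnA, List<String> columnB) {
        Set<String> a = new HashSet<>(columnA);
        Set<String> b = new HashSet<>(columnB);
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0; // an empty column shares nothing
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b); // keep only the values present in both columns
        return (double) common.size() / Math.min(a.size(), b.size());
    }

    /** True when the pair of columns reaches the overlap threshold. */
    static boolean compatible(List<String> columnA, List<String> columnB, double threshold) {
        return overlap(columnA, columnB) >= threshold;
    }
}
```

Under this assumption, only the pairs of columns for which compatible returns true would be passed on to the instantiation step.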

 


Instantiation algorithm

The purpose of this step is to instantiate each metapattern ( R_1(X_1), ... R_n(X_n) ) of a given metaquery. At the end of the process, each R_i is bound to a database relation and each variable to an attribute of some relation. To reduce computation time, the instantiation algorithm exploits the well-known technique of forward checking with backjumping. Like traditional backtracking, this algorithm performs an exhaustive search for all possible solutions in the space of the database attributes, but it explores a reduced search space.
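
To make the pruning idea concrete, here is a minimal, generic sketch of forward checking. It is not the MEE implementation and it deliberately omits backjumping: after each tentative choice it filters the candidate lists of all later positions and abandons the choice as soon as one of those lists becomes empty. The class ForwardCheckingSketch, the Compat interface and the modelling of the problem as a list of candidate lists are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of plain forward checking (no backjumping).
class ForwardCheckingSketch {

    /** Tells whether value vi at position i can coexist with value vj at position j. */
    interface Compat<T> {
        boolean ok(int i, T vi, int j, T vj);
    }

    /** Returns every assignment that picks one mutually compatible value per position. */
    static <T> List<List<T>> solve(List<List<T>> domains, Compat<T> compat) {
        List<List<T>> solutions = new ArrayList<>();
        search(0, domains, new ArrayList<>(), compat, solutions);
        return solutions;
    }

    private static <T> void search(int level, List<List<T>> domains, List<T> partial,
                                   Compat<T> compat, List<List<T>> out) {
        if (level == domains.size()) {
            out.add(new ArrayList<>(partial)); // all positions assigned
            return;
        }
        for (T value : domains.get(level)) {
            // Forward check: keep, for each later position, only the candidates
            // compatible with the value being tried at this level.
            List<List<T>> pruned = new ArrayList<>(domains.subList(0, level + 1));
            boolean deadEnd = false;
            for (int j = level + 1; j < domains.size() && !deadEnd; j++) {
                List<T> kept = new ArrayList<>();
                for (T candidate : domains.get(j)) {
                    if (compat.ok(level, value, j, candidate)) {
                        kept.add(candidate);
                    }
                }
                deadEnd = kept.isEmpty(); // a later position has no candidate left
                pruned.add(kept);
            }
            if (deadEnd) {
                continue; // discard this value without descending further
            }
            partial.add(value);
            search(level + 1, pruned, partial, compat, out);
            partial.remove(partial.size() - 1);
        }
    }
}
```

In the metaquery setting, each position would correspond to a literal scheme R_i(X_i), each candidate to a database relation with a choice of attributes, and compat would check that a variable shared by two literal schemes is bound to attributes with sufficient overlap. Backjumping would additionally let the search return directly to the earlier position responsible for a dead end instead of the immediately preceding one.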


The indices evaluation algorithm

Rules produced in the previous stage are validated by the evaluation algorithm, which uses the quality indices, support and confidence, to filter out rules whose support or confidence is too low. The up-support, an upper bound for the support, was introduced because it is cheaper to compute. An up-support value lower than the requested support is sufficient to conclude that a given rule cannot be valid. The join operations needed to compute the support and confidence values are carried out by ad hoc procedures, which turn the instantiated rules into SQL queries and send them to the data source through Java statements.
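
The benefit of the upper bound can be sketched as a two-stage filter: the cheap bound is computed first, and the expensive support computation is run only for the rules that survive it. The class TwoStageFilter is a hypothetical name, and the two index computations are passed in as functions because their actual definitions (and the SQL queries behind them) are not reproduced here. The confidence check is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of pruning with an upper bound on the support.
class TwoStageFilter {

    /**
     * Keeps the rules whose support reaches minSupport. The exact support is
     * evaluated only when the cheaper upper bound does not already rule it out.
     */
    static <R> List<R> filter(List<R> rules,
                              ToDoubleFunction<R> upSupport,
                              ToDoubleFunction<R> support,
                              double minSupport) {
        List<R> valid = new ArrayList<>();
        for (R rule : rules) {
            if (upSupport.applyAsDouble(rule) < minSupport) {
                continue; // support <= up-support < minSupport: cannot be valid
            }
            if (support.applyAsDouble(rule) >= minSupport) {
                valid.add(rule);
            }
        }
        return valid;
    }
}
```

The pruning is safe only as long as the value returned by upSupport is never smaller than the true support of the same rule.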

              


Package Interface

Our prototype can be tested over the network through a small set of simple procedures that remotely start the process of computing a metaquery. The interface was implemented with the Java RMI package. The procedures a client can use to interact with our prototype are listed here and then described in detail:

 

 1.   boolean setConnection(String URL, String username, String password)

 2.   boolean setValues(double overlap, int levels)

 3.   void setDataInput(File f)

 4.   boolean overlapCompute(String nomeFileLoad)

 5.   public void backJumpSolutions(String nomeFileLoad)

 6.   public void upSupportCompute()

 7.   public void supportCompute()

 8.   void getResult(File f)

 9.   String progress()

 

1.   Sets the parameters used to connect to the data source.

2.   Sets the overlap threshold value and the number of intervals into which the ranges of numeric columns are discretized.

3.   Submits an XML file in MIFIn format containing the metaquery to be computed.

4.   Starts the overlap process. It is possible to specify where to store the results.

5.   Starts the instantiation computation for the last submitted metarule, based on the overlap analysis stored in the file given as parameter. A set of instantiations is produced.

6.   Starts the computation of the up-support for the current instantiation set. Invalid rules are filtered out of the current result set.

7.   Computes support values for the current rule set. Invalid rules are filtered out of the current result set.

8.   Produces the current result set in MIFOut format.

9.   Returns information about the current progress of the computation.
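
A client session could be organized along the following lines. This is a sketch under stated assumptions: the method names and parameters are those listed above, but the interface name MetaBJToolRemote, the RemoteException clauses (required by RMI for remote methods), the call order and the reading of a boolean result as "success" are all assumptions, not documented behaviour of the prototype.

```java
import java.io.File;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical remote interface: only the method names and parameters come
// from the list above; everything else is assumed for illustration.
interface MetaBJToolRemote extends Remote {
    boolean setConnection(String url, String username, String password) throws RemoteException;
    boolean setValues(double overlap, int levels) throws RemoteException;
    void setDataInput(File f) throws RemoteException;
    boolean overlapCompute(String nomeFileLoad) throws RemoteException;
    void backJumpSolutions(String nomeFileLoad) throws RemoteException;
    void upSupportCompute() throws RemoteException;
    void supportCompute() throws RemoteException;
    void getResult(File f) throws RemoteException;
    String progress() throws RemoteException;
}

class MetaBJToolClientSketch {

    /**
     * Calls the procedures in the order in which they are listed.
     * Returns false if a step reports failure (assumed meaning of "false").
     */
    static boolean run(MetaBJToolRemote tool, String url, String user, String password,
                       double overlap, int levels, File metaquery,
                       String overlapFile, File result) {
        try {
            if (!tool.setConnection(url, user, password)) return false;
            if (!tool.setValues(overlap, levels)) return false;
            tool.setDataInput(metaquery);           // metaquery in MIFIn format
            if (!tool.overlapCompute(overlapFile)) return false;
            tool.backJumpSolutions(overlapFile);    // instantiation
            tool.upSupportCompute();                // cheap pre-filter
            tool.supportCompute();                  // exact support filter
            tool.getResult(result);                 // answer in MIFOut format
            return true;
        } catch (RemoteException e) {
            throw new IllegalStateException("remote call failed", e);
        }
    }
}
```

In a real session the tool reference would typically be obtained from the RMI registry, for example with java.rmi.Naming.lookup; the name under which the prototype is bound is not stated in this document. The progress() procedure could be polled from another thread while the computation steps are running.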