Online Additional Material for “Extended High Utility Pattern Mining: An Answer Set Programming Based Framework and Applications”

Francesco Cauteruccio1 and Giorgio Terracina2

1DII, Polytechnic University of Marche, Ancona (Italy)

f.cauteruccio@univpm.it

2DEMACS, University of Calabria, Rende (Italy)

terracina@mat.unical.it

Abstract

Detecting sets of relevant patterns from a given dataset is an important challenge in data mining. The relevance of a pattern, also called utility in the literature, is a subjective measure and can actually be assessed from very different points of view. Rule-based languages like Answer Set Programming (ASP) seem well suited for specifying user-provided criteria to assess pattern utility in the form of constraints; moreover, the declarativity of ASP allows for a very easy switch between several criteria in order to analyze the dataset from different points of view. In this paper, we make steps toward extending the notion of High Utility Pattern Mining (HUPM); in particular, we introduce a new framework that allows for new classes of utility criteria not considered in the previous literature. We also show how recent extensions of ASP with external functions can support a fast and effective encoding and testing of the new framework. To demonstrate the potential of the proposed framework, we exploit it as a building block for the definition of an innovative method for predicting ICU admission for COVID-19 patients. Finally, an extensive experimental activity demonstrates, both from a quantitative and a qualitative point of view, the effectiveness of the proposed approach.

About

This page provides all the additional material related to the paper “Extended High Utility Pattern Mining: An Answer Set Programming Based Framework and Applications” by F. Cauteruccio and G. Terracina, currently submitted to Theory and Practice of Logic Programming. It is an extended version, invited for publication, of the paper that appeared in the proceedings of the 5th International Joint Conference on Rules and Reasoning (RuleML+RR 2021), Leuven, Belgium, September 13-15, 2021.

Datasets in ASP format

In the proposed experimental evaluation, we employed two datasets. The first one is the paper review use case dataset presented in [1]. The second one is a dataset concerning the ICU admission prediction context for COVID-19 patients, which is available on Kaggle. Information on the process of representing the data in ASP format is given below.

Paper review use case dataset Download

The dataset is built upon the work presented in [1], which addresses aspect-based sentiment analysis of scientific reviews to extract useful information and correlate it with the accept/reject decision. Click on the Download button to download a .zip archive containing this description (in the README.md file) and a folder dataset containing a text file for each paper $p$. Each file contains the reviews for that specific paper, encoded in ASP format as reported in Listing 1. It is worth pointing out that, to comply with the syntax of modern solvers, the domain of the transaction facets, i.e., the aspects defined in [1], has been encoded as {0,1,2}, where 0 (resp., 1 and 2) indicates a neutral (resp., positive and negative) value.
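To make this encoding concrete, the following Python sketch (a hypothetical helper, not part of the released archive) shows the {0,1,2} sentiment mapping and how the words of a review sentence could be emitted as item facts:

```python
# Hypothetical helper illustrating the {0,1,2} facet encoding:
# 0 = neutral, 1 = positive, 2 = negative.
SENTIMENT_CODE = {"neutral": 0, "positive": 1, "negative": 2}

def item_facts(words, line_id):
    """Emit one item/3 fact per word of a review sentence,
    recording the word's position within the sentence."""
    return [f"item({w},{line_id},{pos})." for pos, w in enumerate(words)]
```

For instance, `item_facts(["great", "results"], 7)` yields the two facts `item(great,7,0).` and `item(results,7,1).`.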

ICU admission prediction for COVID-19 patients dataset Download

The dataset is publicly available here as a .xlsx file. It contains anonymized data for almost 2,000 confirmed COVID-19 patients; each patient in the database underwent several encounters. The dataset contains several types of information: (i) patient demographic information, (ii) patient previous grouped diseases, (iii) whether the patient was admitted to the ICU during the observation period (ICU=1) or not (ICU=0), and, for each encounter, (iv) 36 parameters for blood results, and (v) 6 parameters for vital signs. With respect to the proposed e-HUPM model, a patient is an Object, whose facet is ICU; each visit is a transaction whose items are categorical data from (i) and (ii) and whose facets are numerical data from (iv) and (v). Click on the Download button to download a .zip archive containing the following files:
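The e-HUPM mapping described above can be sketched in a few lines of Python; the field names and the exact fact layout are illustrative, not the precise released encoding:

```python
# Sketch (with illustrative fact names) of how one patient encounter
# is mapped to the e-HUPM model: the patient is an Object with the ICU
# facet, each visit is a transaction, categorical data become items,
# and numerical blood/vital-sign values become transaction facets.
def encounter_to_facts(patient_id, icu, visit_id, categorical_items, facet_values):
    facts = [f"object({patient_id},{icu})."]                 # patient + ICU facet
    facts.append(f"transaction({visit_id},{patient_id}).")   # one encounter
    for pos, it in enumerate(categorical_items):             # items from (i)-(ii)
        facts.append(f"item({it},{visit_id},{pos}).")
    vec = ",".join(str(v) for v in facet_values)             # facets from (iv)-(v)
    facts.append(f"transactionUtilityVector({visit_id},{vec}).")
    return facts
```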

Coding the Paper Reviews use case in ASP

In this section, we present the ASP listings for the Paper Review use case exploited in the paper. We present the various parts in separate listings to show that some code portions can be composed in different ways, in order to cope with different scenarios of interest. The schema of the dataset, along with the expected input parameters, is presented in Listing 1, whereas potential pre-filtering rules are shown in Listing 2; in this listing, only words associated with at least one positive or negative sentiment value for the sentences they appear in are considered useful for the analysis. A looser approach, which considers all words as useful, can be implemented using Listing 3, whereas a tighter approach, which includes only words for which a disagreement between Appropriateness and Decision exists, can be implemented using Listing 4; obviously, pre-filtering can be personalized depending on the objectives of the analysis.

Candidate generation and occurrence computation and checking are shown in Listing 5; in this case, the words in a pattern are unordered. If the relative order in which words appear in sentences is relevant, Listing 5 should be replaced with Listing 6.

Since in this use case items have no specific utility vector, and facets are defined only for sentences, reviews and papers, the utility computation is carried out as shown in Listing 7. As previously pointed out, different kinds of utility functions can be implemented, e.g. in Python, in the external function call computeUtility.
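The external function can be any Python callable operating on the collected occurrence utility vectors. A minimal sketch, assuming (purely for illustration, this is not the function used in the experiments) that utility is the average Recommendation facet over all occurrences:

```python
def compute_utility(occurrence_vectors):
    """Illustrative utility: average of the Recommendation facet over all
    occurrences. Each tuple is assumed to follow the argument order of
    occurrenceUtilityVector in Listing 7, so the Recommendation facet
    sits at index 8 (after Tid and the first seven transaction facets)."""
    if not occurrence_vectors:
        return 0
    return sum(v[8] for v in occurrence_vectors) / len(occurrence_vectors)
```

Any other aggregation (sum, product, weighted combinations of transaction, review, and paper facets) can be plugged in without touching the ASP program.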

Finally, different kinds of pattern selection rules (masks) can even be defined. The simplest one is on pattern length, shown in Listing 8. If we assume to know, for each word, its part-of-speech tag, a more elaborate mask can be defined as presented in Listing 9, which states that if a pattern contains at least three items, these must include at least a noun, a verb, and an adjective.

Interestingly, Listings 8 and 9 can be coupled together in the same program in order to further limit the kind of patterns we are interested in.

Summarizing, different versions of HUPM can be defined in this context, either by varying the computeUtility external function or by composing the different listing portions introduced above. Listings 1, 2, 6, 7, and 9, and Listings 1, 4, 6, 7, and 8, are two examples of programs that can be composed. As a consequence, even non-ASP practitioners can assemble several versions of the program to run, simply by picking sub-programs and utility functions from a pool of pre-defined available ones.
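Since a program is obtained by juxtaposing sub-programs, composition can be as simple as concatenating the chosen listing files; a sketch, with hypothetical file names:

```python
from pathlib import Path

def compose_program(listing_files, out_file="program.asp"):
    """Concatenate the selected sub-program files (e.g., the listings
    above, stored one per file) into a single ASP program."""
    text = "\n".join(Path(f).read_text() for f in listing_files)
    Path(out_file).write_text(text)
    return out_file

# e.g., compose_program(["listing1.asp", "listing4.asp", "listing6.asp",
#                        "listing7.asp", "listing8.asp"])
```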

%%%   Input schema:
%container(PaperId) - The papers
%object(ReviewId,PaperId) - The Reviews
%transaction(LineId, ReviewId) - Review sentences (each sentence has a unique identifier)
%item(Word, LineId, Position) - The Words in review sentences 
%containerUtilityVector(PaperId, Decision) - Decision: Reject (0) / Accept (1)
%objectUtilityVector(ReviewId, Rating, Confidence) - Rating: (1-10), Confidence: (1-5) 
%transactionUtilityVector(LineId, Appropriateness, Clarity, Originality, Empirical_Theoretical_Soundness, Meaningful_Comparison, Substance, Impact_of_Dataset_Software_Ideas, Recommendation) - Each facet: Positive (1), Negative (-1), Neutral (0) 

%%%   Parameters:
occurrencesThreshold(...). utilityTreshold(...).
minLength(...). maxLength(...).
Listing 1: Paper reviews use case: input schema and parameters
%%%   Item pre-filtering
usefulItem(W) :- item(W,LineId,_), 
            transactionUtilityVector(LineId, S, _, _, _, _, _, _, _), S != 0.
usefulItem(W) :- item(W,LineId,_), 
            transactionUtilityVector(LineId, _, S, _, _, _, _, _, _), S != 0.
usefulItem(W) :- item(W,LineId,_), 
            transactionUtilityVector(LineId, _, _, S, _, _, _, _, _), S != 0.
usefulItem(W) :- item(W,LineId,_), 
            transactionUtilityVector(LineId, _, _, _, S, _, _, _, _), S != 0.
usefulItem(W) :- item(W,LineId,_), 
            transactionUtilityVector(LineId, _, _, _, _, S, _, _, _), S != 0.
usefulItem(W) :- item(W,LineId,_), 
            transactionUtilityVector(LineId, _, _, _, _, _, S, _, _), S != 0.
usefulItem(W) :- item(W,LineId,_), 
            transactionUtilityVector(LineId, _, _, _, _, _, _, S, _), S != 0.
usefulItem(W) :- item(W,LineId,_),
            transactionUtilityVector(LineId, _, _, _, _, _, _, _, S), S != 0.
Listing 2: Paper reviews use case: pre-filtering - version 1
%%%   Item pre-filtering
                usefulItem(W) :- item(W,_,_).
Listing 3: Paper reviews use case: pre-filtering - version 2
%%%   Item pre-filtering
usefulItem(W) :- item(W,LineId,_), 
            transactionUtilityVector(LineId, 1, _, _, _, _, _, _, _), 
            transaction(LineId, ReviewId), object(ReviewId,PaperId),
            containerUtilityVector(PaperId, 0).
usefulItem(W) :- item(W,LineId,_), 
            transactionUtilityVector(LineId, -1, _, _, _, _, _, _, _), 
            transaction(LineId, ReviewId), object(ReviewId,PaperId),
            containerUtilityVector(PaperId, 1).          
Listing 4: Paper reviews use case: pre-filtering - version 3
%%%   Candidate pattern generation and occurrences check
{inCandidatePattern(I)} :- usefulItem(I), #count{Tid,Line: inSupport(Tid,Line), contains(I,Tid,Line)}=N, N >= Tho, occurrencesThreshold(Tho).
inSupport(Tid,Line) :- sentence(Tid,Line), #count{I: usefulItem(I), conflictAt(Tid,Line,I)}=0.
conflictAt(Tid,Line,I) :- inCandidatePattern(I), sentence(Tid,Line), not contains(I,Tid,Line).
contains(I,Tid,Line) :- word(I,Tid,Line,_).
Listing 5: Paper reviews use case: candidate pattern generation and occurrences check
%%%   Candidate pattern generation and occurrences check
{inCandidatePattern(I)} :- usefulItem(I), #count{Tid,Line: inSupport(Tid,Line), contains(I,Tid,Line)}=N, N >= Tho, occurrencesThreshold(Tho).
inSupport(Tid,Line) :- sentence(Tid,Line), #count{I: usefulItem(I), conflictAt(Tid,Line,I)}=0, not unordered(Tid).
conflictAt(Tid,Line,I) :- inCandidatePattern(I), sentence(Tid,Line), not contains(I,Tid,Line).
contains(I,Tid,Line) :- word(I,Tid,Line,_).

{itemOrder(W,N)} :- inCandidatePattern(W), position(N).
:- inCandidatePattern(W), #count{N : itemOrder(W,N)} != 1.
:- itemOrder(W1,N), itemOrder(W2,N), W1 != W2.

unordered(Tid) :- item(W1,Tid,Pos1), item(W2,Tid,Pos2), W1 != W2, 
    itemOrder(W1,N1), itemOrder(W2,N2), N1 < N2, Pos1 > Pos2. 
Listing 6: Paper reviews use case: ordered candidate pattern generation and occurrences check
%%%   Utility computation
occurrenceUtilityVector(Tid,A, Cl, O, So, Co, Su, I, R, Rating,Confidence,Decision):-
    inTransaction(Tid),  transactionUtilityVector(Tid, A, Cl, O, So, Co, Su, I, R), 
    transaction(Tid, ReviewId), objectUtilityVector(ReviewId, Rating,Confidence),
    object(ReviewId,PaperId), containerUtilityVector(PaperId, Decision).
:- &computeUtility[occurrenceUtilityVector](U), U < Thu, utilityTreshold(Thu).
Listing 7: Paper reviews use case: utility computation
%%%   Pattern mask (e.g.) on size
:- #count{T : inCandidatePattern(T)} < L, minLength(L). 
:- #count{T : inCandidatePattern(T)} >  L, maxLength(L).
Listing 8: Paper reviews use case: pattern mask on size
%%%   Pattern mask on part-of-speech
%%%   Assumes external knowledge of the form pos(Word,PartOfSpeech)
inPOS(P):- inCandidatePattern(W),pos(W,P).
:- #count{T : inCandidatePattern(T)} >= 3, not inPOS(noun).
:- #count{T : inCandidatePattern(T)} >= 3, not inPOS(verb).
:- #count{T : inCandidatePattern(T)} >= 3, not inPOS(adj).
Listing 9: Paper reviews use case: pattern mask on part-of-speech

Material for the analysis of the impact of external function calls in ASP

In this section, we report the additional material related to the analysis of the impact of external function calls in ASP. In particular, we report the ASP programs for the two utility functions sum and product, discussed in Section 5.1 of the manuscript. For each of these functions, we report both the direct encoding in ASP and the encoding exploiting the external function call.

ASP program Download

Click on the Download button to download a .zip archive containing the following files:

Requirements: the ASP program can be executed using clingo and wasp. The former is used in gringo mode. In our experimental evaluation we employed

Python 3.9 is needed.

Executing the ASP program: to execute the ASP program exploiting the external function call, both clingo and wasp are used. A command line interface is required to execute it. The following command executes the program:
clingo --mode=gringo --output=smodels ext-encoding.asp facts.db | wasp --interpreter=python --plugins-files=ext-fun -n0
The provided command requires the clingo and wasp binaries to be available in the executable search path (e.g., by setting the $PATH environment variable on Unix-like systems). It also requires the indicated files to be in the directory from which the command is executed. wasp also requires the file wasp.py to be in the same directory as the ext-fun.py file; wasp.py can be found in the wasp repository. facts.db is the dataset in ASP format of the paper reviews use case available here.

Material for ICU admission prediction for COVID-19 patients

In this section, we report the additional material related to the application of our proposed framework in a biomedical context, namely ICU admission prediction for COVID-19 patients. We recall that the dataset exploited in this application is publicly available here. The task is to find patterns, derived from patient demographic information and patient disease groups, that show a significant Pearson correlation between each of the 42 facets and the ICU outcome. In the following, we provide the exploited ASP programs and the results obtained from their evaluation.
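The correlation criterion itself is straightforward: over the transactions covered by a pattern, the facet values are correlated with the corresponding ICU outcomes (0/1). The released programs rely on the scipy library (listed among the requirements below); an equivalent pure-Python sketch of the criterion is:

```python
from math import sqrt

def pearson(facet_values, icu_labels):
    """Pearson correlation between a facet's values and the 0/1 ICU
    outcomes over the transactions covered by a pattern. Equivalent to
    scipy.stats.pearsonr(facet_values, icu_labels)[0]."""
    n = len(facet_values)
    mx, my = sum(facet_values) / n, sum(icu_labels) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(facet_values, icu_labels))
    sx = sqrt(sum((x - mx) ** 2 for x in facet_values))
    sy = sqrt(sum((y - my) ** 2 for y in icu_labels))
    return cov / (sx * sy)
```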

ASP program Download

Click on the Download button to download a .zip archive containing the following files:

Requirements: the ASP program can be executed using clingo and wasp. The former is used in gringo mode. In our experimental evaluation we employed

Python 3.8 is needed. For our experimental evaluation, the scipy library is also required.

Executing the ASP program: to execute the ASP program, both clingo and wasp are used. A command line interface is required to execute it. The following command executes the program:
clingo --mode=gringo --output=smodels icu_prediction.edb thresholds.edb program.asp | wasp --interpreter=python --plugins-files=propagator -n0 --silent=0
The provided command requires the clingo and wasp binaries to be available in the executable search path (e.g., by setting the $PATH environment variable on Unix-like systems). It also requires the indicated files to be in the directory from which the command is executed. wasp also requires the file wasp.py to be in the same directory as the propagator.py file; wasp.py can be found in the wasp repository.

Results Download

The execution of the ASP program requires a few thresholds, namely the minimum occurrences, the minimum Pearson value, and the maximum pattern cardinality. In the following, we provide the results obtained with different combinations of such thresholds.

We selected the following values for each threshold:

For each combination of these three values, we exploited the ASP program to extract patterns, complying with the threshold values, w.r.t. each of the 42 facets (36 parameters for blood results and 6 parameters for vital signs).

Click on the Download button to download a .zip archive containing this description (as a README.md file) and the folder results. This folder contains several .json files. The name of each file follows a specific syntax:

FACET_MINOCC_MINPEAR_MAXCARD.json

where:

As an example, the file BIC_ARTERIAL_MEAN_5_0.5_5.json contains the patterns resulting from the execution of the ASP program on the BIC_ARTERIAL_MEAN target facet, considering at least 5 occurrences and a Pearson value of at least 0.5, with a maximum pattern cardinality of 5 items. The results are preprocessed and represented as JSON objects.
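Given this naming syntax, the threshold values can be recovered from a file name; note that facet names may themselves contain underscores (as in BIC_ARTERIAL_MEAN), so the split must proceed from the right. A small Python sketch:

```python
def parse_result_filename(fname):
    """Recover (facet, min. occurrences, min. Pearson, max. cardinality)
    from a FACET_MINOCC_MINPEAR_MAXCARD.json file name. Facet names may
    contain underscores, so the three trailing fields are split from the
    right with rsplit."""
    stem = fname[: -len(".json")]
    facet, minocc, minpear, maxcard = stem.rsplit("_", 3)
    return facet, int(minocc), float(minpear), int(maxcard)
```

For instance, `parse_result_filename("BIC_ARTERIAL_MEAN_5_0.5_5.json")` yields `("BIC_ARTERIAL_MEAN", 5, 0.5, 5)`.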

Each file contains a list of JSON objects. Each object represents a pattern and contains several pieces of information about it. The keys of each object are the following:

The image to the side depicts an example of a JSON object representing a pattern. The items making up the pattern are the values of the key p: in this example, the pattern consists of the items dg4=0.0, immunocompromised=1.0, perc=60th, gender=1, and dg2=0.0; the cardinality of the pattern is 5; the transactions covered by the pattern are 70, 1325, 74, 1320, and 589 (note that these numbers indicate the $i$-th transaction as ordered in the original dataset); the number of transactions covered by the pattern is 5; and the Pearson value obtained by this pattern is 0.68.
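The result files can then be post-processed with a few lines of Python; in the sketch below only the key p is taken from the documentation above, while the remaining key names are assumptions for illustration:

```python
import json

# Illustrative pattern object mirroring the example described above;
# only the key "p" is documented, the other key names are assumed.
record = json.loads(
    '{"p": ["dg4=0.0", "immunocompromised=1.0", "perc=60th",'
    ' "gender=1", "dg2=0.0"], "cardinality": 5,'
    ' "transactions": [70, 1325, 74, 1320, 589],'
    ' "n_transactions": 5, "pearson": 0.68}'
)

def strong_patterns(records, min_pearson=0.6):
    """Keep the item sets of patterns whose Pearson value meets the threshold."""
    return [r["p"] for r in records if r["pearson"] >= min_pearson]
```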

References

[1] S. Chakraborty, P. Goyal, A. Mukherjee, Aspect-based sentiment analysis of scientific reviews, in: JCDL '20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event, China, August 1-5, 2020, pp. 207-216, ACM.