1 DII, Polytechnic University of Marche, Ancona (Italy)
f.cauteruccio@univpm.it
2 DEMACS, University of Calabria, Rende (Italy)
terracina@mat.unical.it
Detecting sets of relevant patterns from a given dataset is an important challenge in data mining. The relevance of a pattern, also called utility in the literature, is a subjective measure and can actually be assessed from very different points of view. Rule-based languages like Answer Set Programming (ASP) seem well suited for specifying user-provided criteria to assess pattern utility in the form of constraints; moreover, the declarativity of ASP makes it very easy to switch between several criteria in order to analyze the dataset from different points of view. In this paper, we take steps toward extending the notion of High Utility Pattern Mining (HUPM); in particular, we introduce a new framework that allows for new classes of utility criteria not considered in the previous literature. We also show how recent extensions of ASP with external functions can support a fast and effective encoding and testing of the new framework. To demonstrate the potential of the proposed framework, we exploit it as a building block for the definition of an innovative method for predicting ICU admission for COVID-19 patients. Finally, an extensive experimental activity demonstrates the effectiveness of the proposed approach from both a quantitative and a qualitative point of view.
This page provides all the additional material related to the paper “Extended High Utility Pattern Mining: An Answer Set Programming Based Framework and Applications” by F. Cauteruccio and G. Terracina, currently submitted to Theory and Practice of Logic Programming. It is an extended version, invited for publication, of the paper that appeared in the proceedings of the 5th International Joint Conference on Rules and Reasoning (RuleML+RR 2021), Leuven, Belgium, September 13-15, 2021.
In the proposed experimental evaluation, we employed two datasets. The first one is the paper review use case dataset presented in [1]. The second one is a dataset concerning the ICU admission prediction context for COVID-19 patients, which is available on Kaggle. Information on the process of representing the data in ASP format is given below.
The dataset is built upon the work presented in [1], which addresses aspect-based sentiment analysis of scientific reviews to extract useful information and correlate it with the accept/reject decision. Click on the Download button to download a .zip archive containing this description (in the README.md file) and a folder dataset containing a text file for each paper $p$. Each file contains the reviews for that specific paper, encoded in ASP format as reported in Listing 1. It is worth pointing out that, to comply with the syntax of modern solvers, the domain for the transaction facets, i.e., the aspects defined in [1], has been encoded as {0,1,2}, where 0 (resp., 1 and 2) indicates a neutral (resp., positive and negative) value.
The dataset is publicly available here as a .xlsx file. It contains anonymized data for almost 2000 confirmed COVID-19 patients; each patient in the database underwent several encounters. The dataset contains several kinds of information: (i) patient demographic information, (ii) patient previous grouped diseases, (iii) whether the patient was admitted to the ICU during the observation period (ICU=1) or not (ICU=0), and, for each encounter, (iv) 36 parameters for blood results, and (v) 6 parameters for vital signs. With respect to the proposed e-HUPM model, a patient is an Object, whose facet is ICU; each visit is a transaction, whose items are categorical data from (i) and (ii) and whose facets are numerical data from (iv) and (v). Click on the Download button to download a .zip archive containing the following files:
- README.md: this description;
- dataset_to_asp.py: a Python (3.8) script which takes as input the publicly available .xlsx file and produces a file containing the ASP encoding of the dataset according to the decided schema (a minimal conversion sketch is given below);
- icu_prediction.edb: a text file containing the ASP encoding of the dataset.
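For illustration, the following is a minimal, hypothetical sketch of the kind of conversion dataset_to_asp.py performs; the column names and predicate names are assumptions made here for the example, not necessarily those used by the released script.

# Hypothetical sketch of the xlsx-to-ASP conversion; the actual
# dataset_to_asp.py and its schema may differ.
import pandas as pd

df = pd.read_excel("icu_prediction.xlsx")  # the publicly available file

with open("icu_prediction.edb", "w") as out:
    for i, row in df.iterrows():
        pid = int(row["PATIENT_VISIT_IDENTIFIER"])  # a patient is an Object
        out.write(f"object(p{pid}).\n")
        # the Object facet is the ICU outcome (0/1)
        out.write(f"objectUtilityVector(p{pid},{int(row['ICU'])}).\n")
        # each encounter of the patient is a transaction
        out.write(f"transaction(t{i},p{pid}).\n")
        # items: categorical data, e.g., gender and grouped diseases
        out.write(f"item(gender_{int(row['GENDER'])},t{i}).\n")
        # facets: numerical data, e.g., the mean heart rate of the encounter
        out.write(f"transactionFacet(t{i},heart_rate,{row['HEART_RATE_MEAN']}).\n")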
In this section we present the ASP listings for the Paper Review use case exploited in the paper. We present the various parts in separate listings to show that some code portions can be composed in different ways, in order to cope with different scenarios of interest. The schema of the dataset, along with the expected input parameters, is presented in Listing 1, whereas potential pre-filtering rules are shown in Listing 2; in this listing, only words associated with at least one positive or negative sentiment value for the sentences they appear in are considered useful for the analysis. A looser approach, which considers all words as useful, can be implemented using Listing 3, whereas a tighter approach, which includes only words for which a disagreement between Appropriateness and Decision exists, can be implemented using Listing 4; obviously, pre-filtering can be personalized depending on the objectives of the analysis.
Candidate generation and occurrence computation and check are shown in Listing 5: candidate items are chosen via the choice rule, a sentence supports the pattern if none of the chosen items is missing from it, and a candidate must be contained in at least the threshold number of supporting sentences. In this case, the words in the pattern are unordered; if the relative order in which words appear in sentences is relevant, Listing 5 should be replaced by Listing 6.
Since in this use case items have no specific utility vector, and facets are defined only for sentences, reviews, and papers, the utility computation is carried out as shown in Listing 7. As previously pointed out, different kinds of utility functions can be implemented, e.g., in Python, in the external function call computeUtility; an illustrative sketch is given below.
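For illustration, the following is a minimal sketch of one possible computeUtility implementation in Python, not the one shipped with the material: it averages the review Rating weighted by the reviewer Confidence over all occurrences of the candidate pattern. The tuple layout mirrors occurrenceUtilityVector in Listing 7; the way wasp actually hands the atoms to the external function is documented in the wasp repository and is not reproduced here.

# Hypothetical computeUtility: Confidence-weighted average Rating of the
# occurrences of the candidate pattern (integer result, for ASP comparison).
def compute_utility(occurrences):
    # occurrences: list of tuples
    # (A, Cl, O, So, Co, Su, I, R, Rating, Confidence, Decision)
    num = sum(o[8] * o[9] for o in occurrences)  # Rating * Confidence
    den = sum(o[9] for o in occurrences)         # total Confidence
    return num // den if den else 0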
Finally, different kinds of pattern selection rules (masks) can also be defined. The simplest one is on pattern length, and is shown in Listing 8. If we assume to know, for each word, its part-of-speech tag, a more elaborate mask can be defined, as presented in Listing 9, which states that if a pattern contains at least three items, these must include at least a noun, a verb, and an adjective.
Interestingly, Listings 8 and 9 can be coupled together in the same program in order to further limit the kinds of patterns we are interested in.
Summarizing, different versions of HUPM can be defined in this context, either by varying the computeUtility external function or by composing the different listing portions introduced above. For instance, Listings 1, 2, 6, 7, and 9 and Listings 1, 4, 6, 7, and 8 are two potential programs that can be composed. As a consequence, even non-ASP practitioners can compose several versions of the program to run, simply by picking sub-programs and utility functions from a pool of pre-defined available ones.
%%% Input schema:
%container(PaperId) - The papers
%object(ReviewId,PaperId) - The Reviews
%transaction(LineId, ReviewId) - Review sentences (each sentence has a unique identifier)
%item(Word, LineId, Position) - The Words in review sentences
%containerUtilityVector(PaperId, Decision) - Decision: Reject (0) / Accept (1)
%objectUtilityVector(ReviewId, Rating, Confidence) - Rating: (1-10), Confidence: (1-5)
%transactionUtilityVector(LineId, Appropriateness, Clarity, Originality, Empirical_Theoretical_Soundness, Meaningful_Comparison, Substance, Impact_of_Dataset_Software_Ideas, Recommendation) - Each facet: Positive (1), Negative (-1), Neutral (0)
%%% Parameters:
occurrencesThreshold(...). utilityThreshold(...).
minLength(...). maxLength(...).
Listing 1: Paper reviews use case: input schema and parameters
%%% Item pre-filtering
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, S, _, _, _, _, _, _, _), S != 0.
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, _, S, _, _, _, _, _, _), S != 0.
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, _, _, S, _, _, _, _, _), S != 0.
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, _, _, _, S, _, _, _, _), S != 0.
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, _, _, _, _, S, _, _, _), S != 0.
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, _, _, _, _, _, S, _, _), S != 0.
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, _, _, _, _, _, _, S, _), S != 0.
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, _, _, _, _, _, _, _, S), S != 0.
Listing 2: Paper reviews use case: pre-filtering - version 1
%%% Item pre-filtering
usefulItem(W):- item(W,_,_).
Listing 3: Paper reviews use case: pre-filtering - version 2
%%% Item pre-filtering
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, 1, _, _, _, _, _, _, _),
transaction(LineId, ReviewId), object(ReviewId,PaperId),
containerUtilityVector(PaperId, 0).
usefulItem(W):- item(W,LineId,_),
transactionUtilityVector(LineId, -1, _, _, _, _, _, _, _),
transaction(LineId, ReviewId), object(ReviewId,PaperId),
containerUtilityVector(PaperId, 1).
Listing 4: Paper reviews use case: pre-filtering - version 3
%%% Candidate pattern generation and occurrences check
{inCandidatePattern(I)} :- usefulItem(I), #count{Tid,Line: inSupport(Tid,Line), contains(I,Tid,Line)}=N, N >= Tho, occurrencesThreshold(Tho).
inSupport(Tid,Line) :- sentence(Tid,Line), #count{I: usefulItem(I), conflictAt(Tid,Line,I)}=0.
conflictAt(Tid,Line,I) :- inCandidatePattern(I), sentence(Tid,Line), not contains(I,Tid,Line).
contains(I,Tid,Line) :- word(I,Tid,Line,_).
Listing 5: Paper reviews use case: candidate pattern generation and occurrences check
%%% Candidate pattern generation and occurrences check
{inCandidatePattern(I)} :- usefulItem(I), #count{Tid,Line: inSupport(Tid,Line), contains(I,Tid,Line)}=N, N >= Tho, occurrencesThreshold(Tho).
inSupport(Tid,Line) :- sentence(Tid,Line), #count{I: usefulItem(I), conflictAt(Tid,Line,I)}=0, not unordered(Tid).
conflictAt(Tid,Line,I) :- inCandidatePattern(I), sentence(Tid,Line), not contains(I,Tid,Line).
contains(I,Tid,Line) :- word(I,Tid,Line,_).
{itemOrder(W,N)} :- inCandidatePattern(W),position(N).
:- inCandidatePattern(W), #count{N: itemOrder(W,N)} != 1.
:- itemOrder(W1,N), itemOrder(W2,N), W1 != W2.
unordered(Tid):- item(W1,Tid,Pos1), item(W2,Tid,Pos2), W1 != W2,
itemOrder(W1,N1), itemOrder(W2,N2), N1<N2, Pos1>Pos2.
Listing 6: Paper reviews use case: ordered candidate pattern generation and occurrences check
%%% Utility computation
occurrenceUtilityVector(Tid,A, Cl, O, So, Co, Su, I, R, Rating,Confidence,Decision):-
inTransaction(Tid), transactionUtilityVector(Tid, A, Cl, O, So, Co, Su, I, R),
transaction(Tid, ReviewId), objectUtilityVector(ReviewId, Rating,Confidence),
object(ReviewId,PaperId), containerUtilityVector(PaperId, Decision).
:- &computeUtility[occurrenceUtilityVector](U), U < Thu, utilityThreshold(Thu).
Listing 7: Paper reviews use case: utility computation
%%% Pattern mask (e.g.) on size
:- #count{T : inCandidatePattern(T)} < L, minLength(L).
:- #count{T : inCandidatePattern(T)} > L, maxLength(L).
Listing 8: Paper reviews use case: pattern mask on size
%%% Pattern mask on part-of-speech
%%% Assumes external knowledge of the form pos(Word,PartOfSpeech)
inPOS(P):- inCandidatePattern(W),pos(W,P).
% A pattern with at least three items must contain at least a noun, a verb, and an adjective.
:- #count{T : inCandidatePattern(T)} >= 3, not inPOS(noun).
:- #count{T : inCandidatePattern(T)} >= 3, not inPOS(verb).
:- #count{T : inCandidatePattern(T)} >= 3, not inPOS(adj).
Listing 9: Paper reviews use case: pattern mask on part-of-speech
In this section, we report the additional material related to the analysis of the impact of external function calls in ASP. In particular, we report the ASP programs for the two utility functions sum and product, discussed in Section 5.1 of the manuscript. For each of these functions, we report both the direct encoding in ASP and the encoding exploiting the external function call.
Click on the Download button to download a .zip archive containing the following files:

- README.md: this description;
- sum/direct-encoding.asp: the direct encoding in ASP of the sum utility function;
- sum/ext-encoding.asp: the encoding in ASP, exploiting the external function call, of the sum utility function;
- sum/ext-fun.py: the external propagator, written in Python, implementing the sum utility function;
- product/direct-encoding.asp: the direct encoding in ASP of the product utility function;
- product/ext-encoding.asp: the encoding in ASP, exploiting the external function call, of the product utility function;
- product/ext-fun.py: the external propagator, written in Python, implementing the product utility function.

Requirements: the ASP program can be executed using clingo and wasp; the former is used in gringo mode. In our experimental evaluation we employed:

- clingo version 5.4.0 (address model 64-bit, libclingo version 5.4.0 configured with Python 3.7.4 and Lua 5.3.4; libclasp version 3.3.5 with libpotassco version 1.1.0); we installed clingo using homebrew. Download and more info here.
- wasp version 2.0, compiled via clang (version 11.0.3) with Python 3.9 support. Download and more info here.
- Python 3.9.
Executing the ASP program: to execute the ASP program exploiting the external function call, both clingo and wasp are used, through a command line interface. The following command executes the program:

clingo --mode=gringo --output=smodels ext-encoding.asp facts.db | wasp --interpreter=python --plugins-files=ext-fun -n0

Here, clingo only grounds the program (--mode=gringo) and emits it in smodels format, which is piped to wasp; wasp then computes all answer sets (-n0), evaluating the external function through the Python plugin loaded from ext-fun.py.
The provided command requires the clingo and wasp binaries to be available in the executable search path (e.g., via the $PATH environment variable on Unix-like systems), and the indicated files to be in the directory where the command is executed. wasp also requires the file wasp.py to be in the same directory as the ext-fun.py file; wasp.py can be found in the wasp repository. facts.db is the paper reviews use case dataset in ASP format, available here.
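For reference, the following is a minimal sketch, not the shipped ext-fun.py files, of the sum and product computations the two propagators implement; how wasp passes the facet values of the covered transactions to the plugin is omitted here.

from functools import reduce

# Hypothetical cores of the two utility functions compared in Section 5.1.
def sum_utility(facet_values):
    # facet values of the transactions covered by the candidate pattern
    return sum(facet_values)

def product_utility(facet_values):
    return reduce(lambda a, b: a * b, facet_values, 1)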
In this section, we report the additional material related to the application of our proposed framework in a biomedical context, namely ICU admission prediction for COVID-19 patients. We recall that the dataset exploited in this application is publicly available here. The task is to find patterns, derived from patient demographic information and patient disease groups, showing a significant Pearson correlation between each of the 42 facets and the ICU outcome. In the following, we provide the exploited ASP programs and the results obtained from their evaluation.
Click on the Download button to download a .zip archive containing the following files:

- README.md: this description;
- program.asp: the ASP program exploited to extract patterns;
- thresholds.edb: an ASP encoding of the input thresholds for the considered task;
- propagator.py: a Python (3.8) script defining the propagator used in wasp to execute the external function.

Requirements: the ASP program can be executed using clingo and wasp; the former is used in gringo mode. In our experimental evaluation we employed:

- clingo version 5.4.0 (address model 64-bit, libclingo version 5.4.0 configured with Python 3.7.4 and Lua 5.3.4; libclasp version 3.3.5 with libpotassco version 1.1.0); we installed clingo using homebrew. Download and more info here.
- wasp version 2.0, compiled via clang (version 11.0.3) with Python 3.9 support. Download and more info here.
- Python 3.8; for our experimental evaluation, the scipy library is also required.
Executing the ASP program: to execute the ASP program, both clingo and wasp are used, through a command line interface. The following command executes the program:

clingo --mode=gringo --output=smodels icu_prediction.edb thresholds.edb program.asp | wasp --interpreter=python --plugins-files=propagator -n0 --silent=0
The provided command requires the clingo and wasp binaries to be available in the executable search path (e.g., via the $PATH environment variable on Unix-like systems), and the indicated files to be in the directory where the command is executed. wasp also requires the file wasp.py to be in the same directory as the propagator.py file; wasp.py can be found in the wasp repository.
The execution of the ASP program requires a few thresholds, namely minimum occurrences, minimum Pearson value, and maximum pattern cardinality. In the following, we provide the results obtained with different combinations of such thresholds.
We selected a range of values for each threshold; the specific values appear in the result file names described below.
For each combination of these three values, we exploited the ASP program to extract the patterns complying with the threshold values, w.r.t. each of the 42 facets (36 parameters for blood results and 6 parameters for vital signs).
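To make the selection criterion concrete, the following is a minimal sketch, under our own assumptions on how propagator.py receives the data, of the Pearson-based check performed with scipy; the function and variable names are illustrative.

from scipy.stats import pearsonr

# Hypothetical core of the check in propagator.py: given, for the
# transactions covered by a candidate pattern, the values of the target
# facet and the corresponding ICU outcomes, keep the pattern only if the
# Pearson correlation reaches the minimum threshold.
def passes_pearson(facet_values, icu_outcomes, min_pearson):
    if len(facet_values) < 2:  # pearsonr needs at least two points
        return False
    r, _p = pearsonr(facet_values, icu_outcomes)
    return r >= min_pearson    # whether |r| is used instead is an assumption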
Click on the Download button to download a .zip archive containing this description (as a README.md file) and the folder results. This folder contains several .json files. The name of each file follows a specific syntax:

FACET_MINOCC_MINPEAR_MAXCARD.json

where:
- FACET: the name of the target facet; it can be one of the following values: ALBUMIN_MEAN, BE_ARTERIAL_MEAN, BE_VENOUS_MEAN, BIC_ARTERIAL_MEAN, BIC_VENOUS_MEAN, BILLIRUBIN_MEAN, BLAST_MEAN, CALCIUM_MEAN, CREATININ_MEAN, FFA_MEAN, GGT_MEAN, GLUCOSE_MEAN, HEMATOCRITE_MEAN, HEMOGLOBIN_MEAN, INR_MEAN, LACTATE_MEAN, LEUKOCYTES_MEAN, LINFOCITOS_MEAN, NEUTROPHILES_MEAN, P02_ARTERIAL_MEAN, P02_VENOUS_MEAN, PC02_ARTERIAL_MEAN, PC02_VENOUS_MEAN, PCR_MEAN, PH_ARTERIAL_MEAN, PH_VENOUS_MEAN, PLATELETS_MEAN, POTASSIUM_MEAN, SAT02_ARTERIAL_MEAN, SAT02_VENOUS_MEAN, SODIUM_MEAN, TGO_MEAN, TGP_MEAN, TTPA_MEAN, UREA_MEAN, DIMER_MEAN, BLOODPRESSURE_DIASTOLIC_MEAN, BLOODPRESSURE_SISTOLIC_MEAN, HEART_RATE_MEAN, RESPIRATORY_RATE_MEAN, TEMPERATURE_MEAN, OXYGEN_SATURATION_MEAN;
- MINOCC: minimum occurrences threshold;
- MINPEAR: minimum Pearson value threshold;
- MAXCARD: maximum pattern cardinality.

As an example, the file BIC_ARTERIAL_MEAN_5_0.5_5.json contains the patterns resulting from the execution of the ASP program on the BIC_ARTERIAL_MEAN target facet, considering at least 5 occurrences and a Pearson value of at least 0.5, with a maximum pattern cardinality of 5 items. The results are preprocessed and represented as JSON objects.
Each file contains a list of JSON objects. Each object represents a pattern and carries several pieces of information about it. The keys of each object are the following:

- p: an array of strings representing the items the pattern consists of;
- len_p: the cardinality of the pattern, i.e., the number of items;
- t: an array of strings representing the transactions covered by the pattern;
- len_t: the number of transactions covered by the pattern;
- pe: the Pearson value obtained for the pattern.

As an example, consider a JSON object representing a pattern whose items (the values of the key p) are dg4=0.0, immunocompromised=1.0, perc=60th, gender=1, and dg2=0.0; the cardinality of the pattern is 5; the transactions covered by the pattern are 70, 1325, 74, 1320, and 589 (note that these numbers indicate the $i$-th transaction as ordered in the original dataset); the number of covered transactions is 5, and the Pearson value obtained by this pattern is 0.68.
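The object just described would look as follows; this is a reconstruction from the values above, and the exact field order in the released files may differ:

{
  "p": ["dg4=0.0", "immunocompromised=1.0", "perc=60th", "gender=1", "dg2=0.0"],
  "len_p": 5,
  "t": ["70", "1325", "74", "1320", "589"],
  "len_t": 5,
  "pe": 0.68
}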
[1] S. Chakraborty, P. Goyal, A. Mukherjee, Aspect-based sentiment analysis of scientific reviews, in: JCDL '20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event, China, August 1-5, 2020, ACM, 2020, pp. 207-216.