Online Appendix for “An Answer Set Programming based framework for High-Utility Pattern Mining extended with Facets and Advanced Utility Functions”

Francesco Cauteruccio and Giorgio Terracina

DEMACS, University of Calabria, Rende (Italy)
{cauteruccio,terracina}@mat.unical.it

Abstract

In the context of pattern mining, the utility of a pattern can be described as a preference ordering over a choice set; it can be actually assessed from very different perspectives and at different abstraction levels. However, while the topic of High-Utility Pattern Mining (HUPM) has been widely studied, the basic assumption is that each item in a knowledge base is associated with one, static utility. In this paper we introduce, among others, the notion of facets for items, which allows to cope with this limitation and, moreover, we show how a more structured representation of available information, coupled with facets defined also for higher abstraction levels, paves the way to new opportunities for HUPM. In particular, the proposed framework allows to introduce some new advanced classes of utility functions in the detection process, whose relevance is also experimentally evaluated. A real use case on paper reviews is exploited to analyze the potentiality of the proposed framework in knowledge creation and discovery. Given the wide variety of analytical scenarios that can be envisioned in this new setting, we take full advantage of the capabilities of Answer Set Programming and its extensions for a fast encoding and testing of the framework.

Context example

In order to provide a better understanding of the introduced concepts, we develop an example on the context of a store chain. In this case, each container represents a store; for each store a list of customers is available: each customer is an object in our model. Finally, each customer's purchase is a transaction in the database. Table 1 depicts the selected facets for each domain element and its representation in the database, e.g., facets for containers, objects, transactions, items, and the database itself. Each container represents a store and for a container $C_s$ we have a single facet, that is the size of the store. Customers information are represented by objects and for an object $O_u$ we have the facets age, and frequency of purchase (per week). Each customer's purchase is a transaction and for a transaction $T_p$ we have the facets fidelity card availability, and applied discount. Finally, a purchased item is represented as an item $i$ and we consider the facets price and weight.

Table 2 illustrates the utility vectors for container, object, transaction, and item respectively, for an instance of the database that is reported in Table 3. Note that each transaction $T_p$ is a set of pair $(i, q(i, T_p))$ where $i$ is an item and $q(i, T_p)$ its quantity in the transaction $T_p$.

Let us now consider the pattern $P=\{i_2,i_4\}$, and the set of the transaction in which it occurs $T_P = \{T_1, T_4\}$. To proceed with the pattern utility computation, we first compute its intra-pattern utility by merging the item utility vectors into a unique item utility vector. Here, we use the example intra-pattern utility function provided in Section 2 of the paper. Considering that $P$ occurs in both $T_1$ and $T_4$, we obtain the following unique item utility vectors:

$$IU_{T_1} = ip(P,T_1,\{IU_{i_2},IU_{i_4}\}) = [4.97, 1.1]$$ $$IU_{T_4} = ip(P,T_4,\{IU_{i_2},IU_{i_4}\}) = [8.95, 1.7]$$

Having calculated the unique item utility vectors, we can now derive the corresponding occurrence utility vector $OccU_{T_{1}}$ and $OccU_{T_{4}}$ as the juxtaposition of unique item, transaction, object and container utility vectors:

$$OccU_{T_{1}} = [IU_{T_1}, TU_{T_1}, OU_{C_2}, CU_{S_1}] = [4.97, 1.1, 0, 0.2, 21, 8, \text{small}]$$ $$OccU_{T_{4}} = [IU_{T_4}, TU_{T_4}, OU_{C_1}, CU_{S_2}] = [8.95, 1.7, 0, 0.1, 42, 5, \text{medium}]$$

Here, the association between each transaction and its object (resp. each object and its container) is provided in Table 3. Then, the pattern utility vector $U_p$ consists of the occurrence utility vectors $OccU_{T_{1}}$ and $OccU_{T_{4}}$, that is

$$U_P=\{OccU_{T_1}, OccU_{T_4}\} = \{[4.97, 1.1, 0, 0.2, 21, 8, \text{small}], [8.95, 1.7, 0, 0.1, 42, 5, \text{medium}]\}$$

Indeed, using different utility functions can highlight different interesting patterns.

Table 1: Description of considered facets in a database relative to a store chain
Domain element e-HUPM model Facets
Store Container $C_S$ size
Customer Object $O_u$ age, frequency of purchase (per week)
Customer’s purchase Transaction Tp fidelity card availability, applied discount
Purchased item Item $i$ price, weight
Table 2: Container, object, transaction, and item utility vectors for the store chain example
Database elements Utility vectors
Containers (store)
$S_1$ $CU_{S_1} = [\text{small}]$
$S_2$ $CU_{S_2} = [\text{medium}]$
Object (customers)
$C_1$ $OU_{C_1} = [42, 5]$
$C_2$ $OU_{C_2} = [21, 8]$
$C_3$ $OU_{C_3} = [36, 4]$
Transaction (customers' purchase)
$T_1$ $TU_{T_1} = [0, 0.2]$
$T_2$ $TU_{T_2} = [1, 0.6]$
$T_3$ $TU_{T_3} = [1, 0.4]$
$T_4$ $TU_{T_4} = [0, 0.1]$
Items (purchased items)
$i_1$ $IU_{i_1} = [19.99, 2.0]$
$i_2$ $IU_{i_2} = [2.99, 0.1]$
$i_3$ $IU_{i_3} = [4.20, 0.850]$
$i_4$ $IU_{i_4} = [0.99, 0.5]$
$i_5$ $IU_{i_5} = [5.00, 0.750]$
Table 3: The set of transactions contained in the database along with the corresponding objects and containers
Container Object TID Transaction
$S_1$ $C_2$ $T_1$ $\{(i_2, 1), (i_4, 2)\}$
$C_3$ $T_2$ $\{(i_1, 1), (i_2, 2), (i_3, 5)\}$
$S_2$ $C_1$ $T_3$ $\{(i_3, 1), (i_4, 2), (i_5, 1)\}$
$T_4$ $\{(i_2, 2), (i_4, 3)\}$

To proceed with our example, let us consider the pattern utility vector $U_{P}$ for the pattern $P=\{i_2,i_4\}$ calculated above. We now give examples for some of the classes of utility functions.

As for Horizontal and Vertical first functions, we provide a simple example of an inter-transaction function involving Filter and MAX functions. In particular, be $f_h = \text{filter}(\cdot)$ and $f_v = \text{max}(\cdot)$ the filter and MAX function respectively. Then, $u(P)=f_v(f_h(U_P))$. Assume we focus on the price facet of each item utility vector, therefore $u(P)$ consists in filtering the price facet (by means of $f_h$) and then selecting the maximum (by means of $f_v$). In detail, $f_h(U_P) = \text{filter}(U_P) = \{[4.97], [8.95]\}$, where $[4.97]$ (resp. $[8.95]$) indicates the price facet of $OccU_{T_1}$ (resp. of $OccU_{T_4}$) represented as a singleton. Then, by applying $f_v$ across each occurrence, we obtain $u(P)=f_v(f_h(U_P)) = \text{max}(\text{filter}(U_P)) = \text{max}(\{4.97, 8.95\}) = 8.95$. The same result is obtained by applying the same functions in a vertical first fashion, e.g., $u(P)=f_h(f_v(U_P))$.

An example for a mixed pattern-vs-object utility function is the computation of the Pearson correlation coefficient. Let us define $f = r_{x,y}$, where $r_{x,y}$ is the classic Pearson correlation coefficient. In this example, suppose that $x$ indicates the price facet and $y$ indicates the frequency of purchase facet of a customer, across all occurrence utility vectors in $U_P$. According to $U_P$ we have $x = [4.97, 8.95]$ and $y=[8,5]$, and by applying the pattern-vs-object utility function $f$ we obtain $f(U_P) = r_{x,y} = -1$, indicating a perfect negative correlation between price and frequency of purchase facets for the pattern $P=\{i_2,i_4\}$.

Coding the Paper Reviews use case in ASP

In this section we present the ASP listing for the Paper Review use case exploited in the paper. We present the various parts in separate listings to show that some code portions can be composed in different ways, in order to cope with different scenarios of interest. The schema of the dataset, along with expected input parameters are presented in Listing 1, whereas potential pre-filtering rules are shown in Listing 2; in this listing, only words associated with at least one positive or negative sentiment value for the sentences they appear in are considered useful for the analysis. A looser approach, that considers all words as useful, can be implemented using Listing 3, whereas a tighter approach, that includes only words for which a disagreement between Appropriateness and Decision exists can be implemented using Listing 4; obviously, pre-filtering can be personalized depending on the objectives of the analysis.

Candidate generation and occurrences computation and check are shown in Listing 5; in this case words in the pattern are unordered; if the relative order in which words appear in sentences is relevant, Listing 6 should be substituted with Listing 6.

Since in this use case items have no specific utility vector, and facets are defined only for sentences, reviews and papers, the utility computation is carried out as shown in Listing 7. As previously pointed out, different kinds of utility functions can be implemented, e.g. in Python, in the external function call computeUtility.

Finally, different kinds of pattern selection rules (masks) can be even defined. The simplest one is on pattern length, which is shown in Listing 8. If we assume to know, for each word, its part-of-speech tag, a more elaborated mask can be defined as presented in Listing 9, which states that if a pattern contains at least three items, these must contain at least a noun, a verb, and an adjective.

Interestingly, Listings 7 and 8 can be coupled together in the same program in order to further limit the kind of patterns we are interested in.

Summarizing, different versions of HUPM can be defined in this context, either by varying the computeUtility external function or by composing the different listing portions introduced above. Listings 1, 2, 6, 7, and 9 and Listings 1, 4, 6, 7, and 8 are two potential programs that can be composed. As a consequence, even non ASP practitioners can compose several versions of the program to run, simply picking up sub-programs and utility functions from a pool of pre-defined available ones.

%%%   Input schema:
%container(PaperId) - The papers
%object(ReviewId,PaperId) - The Reviews
%transaction(LineId, ReviewId) - Review sentences (each sentence has a unique identifier)
%item(Word, LineId, Position) - The Words in review sentences 
%containerUtilityVector(PaperId, Decision) - Decision: Reject (0) / Accept (1)
%objectUtilityVector(ReviewId, Rating, Confidence) - Rating: (1-10), Confidence: (1-5) 
%transactionUtilityVector(LineId, Appropriatness, Clarity, Originality, Empirical_Theoretical_Soudness, Meaningful_Comparison, Substance, Impact_of Dataset_Software_Ideas, Recommendation) - Each facet: Positive (1), Negative (-1), Neutral (0) 

%%%   Parameters:
occurrencesThreshold(...). utilityTreshold(...).
minLength(...). maxLength(...).
Listing 1: Paper reviews use case: input schema and parameters
%%%   Item pre-filtering
usefulItem(W):- item(Word,LineId,_), 
            transactionUtilityVector(LineId, S, _, _, _, _, _, _, _),S<>0.
usefulItem(W):- item(Word,LineId,_, 
            transactionUtilityVector(LineId, _, S, _, _, _, _, _, _),S<>0.
usefulItem(W):- item(Word,LineId,_), 
            transactionUtilityVector(LineId, _, _, S, _, _, _, _, _),S<>0.
usefulItem(W):- item(Word,LineId,_), 
            transactionUtilityVector(LineId, _, _, _, S, _, _, _, _),S<>0.
usefulItem(W):- item(Word,LineId,_), 
            transactionUtilityVector(LineId, _, _, _, _, S, _, _, _),S<>0.
usefulItem(W):- item(Word,LineId,_), 
            transactionUtilityVector(LineId, _, _, _, _, _, S, _, _),S<>0.
usefulItem(W):- item(Word,LineId,_), 
            transactionUtilityVector(LineId, _, _, _, _, _, _, S, _),S<>0.
usefulItem(W):- item(Word,LineId,_),
            transactionUtilityVector(LineId, _, _, _, _, _, _, _, S),S<>0.
Listing 2: Paper reviews use case: pre-filtering - version 1
%%%   Item pre-filtering
                usefulItem(W):- item(Word,_,_).
Listing 3: Paper reviews use case: pre-fltering - version 2
%%%   Item pre-filtering
usefulItem(W):- item(Word,LineId,_), 
            transactionUtilityVector(LineId, 1, _, _, _, _, _, _, _), 
            transaction(LineId, ReviewId),object(ReviewId,PaperId),
            containerUtilityVector(PaperId, 0).
usefulItem(W):- item(Word,LineId,_), 
            transactionUtilityVector(LineId, -1, _, _, _, _, _, _, _), 
            transaction(LineId, ReviewId),object(ReviewId,PaperId),
            containerUtilityVector(PaperId, 1).          
Listing 4: Paper reviews use case: pre-fltering - version 3
%%%   Candidate pattern generation
{inCandidatePattern(W)}:- usefulItem(W).

%%%   Occurrences computation and check
inTransaction(Tid):- transaction(Tid,_), not incomplete(Tid).
incomplete(Tid):- sentence(Tid,_), inCandidatePattern(W), not contains(W,Tid).
contains(Word,Tid):- item(Word,Tid,_). 
:- #count{ Tid : inTransaction(Tid)}=N, N < Tho, occurrencesThreshold(Tho).
Listing 5: Paper reviews use case: candidate pattern generation and occurrences check
%%%   Candidate pattern generation
{inCandidatePattern(W)}:- usefulItem(W).
{itemOrder(W,N)}:-inCandidatePattern(W),position(N).
:-inCandidatePattern(W), #count{N:itemOrder(W,N)}<>1.
:-itemOrder(W1,N), itemOrder(W2,N), W1<>W2. 

%%%   Occurrences computation and check
inTransaction(Tid):- transaction(Tid,_), not incomplete(Tid), not unordered(Tid).
incomplete(TiD):- sentence(Tid,_), inCandidatePattern(W), not contains(W,Tid).
contains(I,Tid):- item(Word,Tid,_). 
unordered(Tid):- item(W1,Tid,Pos1), item(W2,Tid,Pos2), W1<>W2, 
        itemOrder(W1,N1), itemOrder(W2,N2), N1<N2, Pos1>Pos2. 
:- #count{ Tid : inTransaction(Tid)}=N, N < Tho, occurrencesThreshold(Tho).
Listing 6: Paper reviews use case: ordered candidate pattern generation and occurrences check
%%%   Utility computation
occurrenceUtilityVector(Tid,A, Cl, O, So, Co, Su, I, R, Rating,Confidence,Decision):-
    inTransaction(Tid),  transactionUtilityVector(Tid, A, Cl, O, So, Co, Su, I, R), 
    transaction(Tid, ReviewId), objectUtilityVector(ReviewId, Rating,Confidence),
    object(ReviewId,PaperId), containerUtilityVector(PaperId, Decision).
:- &computeUtility[occurrenceUtilityVector](U), U < Thu, utilityTreshold(Thu).
Listing 7: Paper reviews use case: utility computation
%%%   Pattern mask (e.g.) on length
:- #count{T : inCandidatePattern(T)} < L, minLength(L). 
:- #count{T : inCandidatePattern(T)} >  L, maxLength(L).
Listing 8: Paper reviews use case: pattern mask on length
%%%   Pattern mask on part-of-speech
%%%   Assumes external knowledge of the form pos(Word,PartOfSpeech)
inPOS(P):- inCandidatePattern(W),pos(W,P).
:- #count{T : inCandidatePattern(T)} >=3, inPOS(noun), inPOS(adj), inPOS(P), P<>noun, P<>adj, P<> verb.
:- #count{T : inCandidatePattern(T)} >=3, inPOS(P), inPOS(adj), inPOS(verb), P<>noun, P<>adj, P<> verb.
:- #count{T : inCandidatePattern(T)} >=3, inPOS(noun), inPOS(P), inPOS(verb), P<>noun, P<>adj, P<> verb.
Listing 9: Paper reviews use case: pattern mask on part-of-speech