This repository contains materials (Implementation and Experiments) concerning the paper in review: "DEvIANT : Discovering statistically significant exceptional (dis-)agreement within groups". It contains:
- Deviant-Code: includes all the Python scripts of the Deviant framework. Please run the scripts using Python 2.7 (we are so old fashioned :-) )
- Experiments_Scripts_and_Results: includes all experiments reported in the paper (quantitative and qualitative) as well as the set of scripts used to generate the experiments.
- Datasets: contains the behavioral datasets used for the qualitative and quantitative experiments reported in the paper (please extract the dataset files before running the experiments).
- Example: contains an example of an input parameter file which can be used to launch the method and uncover exceptional (dis-)agreement within groups.
Below we give the method overview. Note that DEvIANT stands for Discovering statistically significant Exceptional contextual Intra-group Agreement paTterns.
DEvIANT is an exceptional model mining technique which, given a behavioral dataset, mines for statistically significant exceptional (dis-)agreement within groups. The patterns returned by DEvIANT are of the form (g, c), where g is a group of individuals (reviewers) and c is a context characterizing a subset of entities (reviewees).
In a nutshell, DEvIANT starts by forming groups of individuals (1) by enumerating conditions/restrictions on the descriptive attributes of individuals. Next, (2) DEvIANT computes a bootstrap confidence interval to account for the variability of outcomes of the considered group. Subsequently, (3) DEvIANT selects a context by enumerating conditions/restrictions on the descriptive attributes of entities, which yields a subset of entities. For this subset of entities, (4) the corresponding contextual intra-group agreement is computed using Krippendorff's Alpha.
To evaluate how significantly the observed agreement deviates from the intra-group agreement expected by chance, (5) DEvIANT establishes the Distribution of False Discoveries, dubbed DFD. The DFD is the distribution of Krippendorff's Alpha observed over subsets drawn uniformly at random from the collection of subsets of entities whose cardinality equals that of the subset covered by the current context. The DFD makes it possible to determine whether the observed intra-group agreement is due to chance only (a baseline finding) or is significant enough (with respect to some critical value alpha). If (6) the observed contextual intra-group agreement falls within the critical region, the finding is significant and is therefore reported in the returned result set (7).
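Steps (4) to (6) can be illustrated with a small sketch. It is not the code shipped in Deviant-Code: it assumes nominal outcomes, approximates the DFD by plain Monte Carlo sampling of equal-size subsets of entities, and uses a two-tailed test. The function names (`krippendorff_alpha_nominal`, `empirical_dfd`, `is_significant`) and their parameters are hypothetical.

```python
from collections import Counter
import random


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal outcomes.

    `units` is a list of lists: one inner list per entity, holding the
    outcomes expressed on that entity by the members of the group.
    """
    coincidences = Counter()  # o_ck: weighted count of (c, k) value pairs
    for values in units:
        m = len(values)
        if m < 2:  # a single outcome cannot be paired with another one
            continue
        counts = Counter(values)
        for c, n_c in counts.items():
            for k, n_k in counts.items():
                pairs = n_c * (n_c - 1) if c == k else n_c * n_k
                coincidences[(c, k)] += pairs / float(m - 1)
    totals = Counter()  # n_c: marginal totals of the coincidence matrix
    for (c, _k), o_ck in coincidences.items():
        totals[c] += o_ck
    n = sum(totals.values())
    if n == 0:
        raise ValueError("no entity has at least two outcomes")
    observed = sum(o for (c, k), o in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k]
                   for c in totals for k in totals if c != k) / (n - 1)
    if expected == 0:
        # Only one outcome value occurs; treated here as full agreement
        # (a convention chosen for this sketch).
        return 1.0
    return 1.0 - observed / expected


def empirical_dfd(units, subset_size, n_samples=1000, seed=0):
    """Sorted Alpha values over random subsets of `subset_size` entities."""
    rng = random.Random(seed)
    return sorted(
        krippendorff_alpha_nominal(rng.sample(units, subset_size))
        for _ in range(n_samples)
    )


def is_significant(observed_alpha, dfd, critical_value=0.05):
    """Two-tailed check: is the observed Alpha outside the central mass of the DFD?"""
    lower = dfd[int((critical_value / 2) * (len(dfd) - 1))]
    upper = dfd[int((1 - critical_value / 2) * (len(dfd) - 1))]
    return observed_alpha < lower or observed_alpha > upper
```

In this sketch, a context covering `k` entities would be tested by computing its Alpha with `krippendorff_alpha_nominal` and comparing it against `empirical_dfd(all_units, k)`; values beyond the lower bound suggest exceptional disagreement, values beyond the upper bound exceptional agreement.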
To run DEvIANT on a behavioral dataset and obtain a set of relevant exceptional (dis-)agreement patterns, a configuration file must be passed to the method; its structure is given below. An example configuration file and the corresponding command are provided in the "Example" directory.
{
"objects_file":<entities collection file path>,
"individuals_file":<individuals collection file path>,
"reviews_file":<outcomes collection file path>,
"delimiter":<delimiter used in the input csv file> (e.g. "\t"),
"nb_objects":<number of entities to consider> (e.g. 50000000),
"nb_individuals":<number of individuals to consider> (e.g. 5000000),
"arrayHeader":<the set of attributes whose values are arrays (usually HMT attributes)> (e.g. ["PROCEDURE_SUBJECT"]),
"numericHeader":<the set of attributes whose values are numerical> (e.g. ["VOTE_DATE","EU_MEMBER_SINCE"]),
"vector_of_outcome":<the structure of the outcomes vector> (e.g. if null all the attributes in the outcomes file are considered as elements depicting the action of an individual over an entity),
"description_attributes_objects":<the descriptive attributes to consider for entities> (e.g. [["PROCEDURE_SUBJECT", "themes"],["VOTE_DATE","numeric"],["COMMITTEE","simple"]]),
"description_attributes_individuals":<the descriptive attributes to consider for individuals> (e.g. [["EU_MEMBER_SINCE", "numeric"],["CURRENCY", "simple"],["SCHENGEN_MEMBER", "simple"],["COUNTRY", "simple"],["GROUPE_ID", "simple"],["NATIONAL_PARTY", "simple"]]),
"threshold_objects":<the minimum entities support size threshold> (e.g. 40),
"threshold_individuals":<the minimum individuals support size threshold> (e.g. 10),
"threshold_quality":<the critical value alpha> (e.g. 0.05),
"quality_measure":"BOTH", <for now, this parameter must be set to BOTH, even though it also allows one-tailed tests>
"algorithm":"P_VALUE_PEERS", <for now, this parameter must be set to P_VALUE_PEERS; if set to COMMON_PEERS, a common exceptional model mining task is performed, in which the contextual intra-group agreement is compared with the overall one and reported if the distance between them is greater than threshold_quality>
"results_destination":<yielded patterns results file path> (e.g. ".//results.csv"),
"detailed_results_destination":<yielded patterns additional results directory path, such as the context and group information> (e.g. ".//DetailedResults//")
}
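As an illustration of the template above, the following sketch builds a configuration as a Python dictionary and serializes it to JSON. The values are the "e.g." values listed above; the three input file paths, the output file name `config.json`, and the helper `write_config` are hypothetical placeholders and are not taken from the repository.

```python
import json

config = {
    # Hypothetical paths: replace them with your extracted dataset files.
    "objects_file": "./Datasets/entities.csv",
    "individuals_file": "./Datasets/individuals.csv",
    "reviews_file": "./Datasets/outcomes.csv",
    "delimiter": "\t",
    "nb_objects": 50000000,
    "nb_individuals": 5000000,
    "arrayHeader": ["PROCEDURE_SUBJECT"],
    "numericHeader": ["VOTE_DATE", "EU_MEMBER_SINCE"],
    # null: every attribute of the outcomes file is part of the outcome vector.
    "vector_of_outcome": None,
    "description_attributes_objects": [
        ["PROCEDURE_SUBJECT", "themes"],
        ["VOTE_DATE", "numeric"],
        ["COMMITTEE", "simple"],
    ],
    "description_attributes_individuals": [
        ["EU_MEMBER_SINCE", "numeric"],
        ["COUNTRY", "simple"],
        ["NATIONAL_PARTY", "simple"],
    ],
    "threshold_objects": 40,
    "threshold_individuals": 10,
    "threshold_quality": 0.05,
    "quality_measure": "BOTH",
    "algorithm": "P_VALUE_PEERS",
    "results_destination": ".//results.csv",
    "detailed_results_destination": ".//DetailedResults//",
}


def write_config(path, cfg=config):
    """Write the configuration dictionary to `path` as JSON."""
    with open(path, "w") as handle:
        json.dump(cfg, handle, indent=2)
```

Calling `write_config("config.json")` would produce a file that can then be passed as the configuration file path.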
Once the configuration file is defined, the method can be run using the command below.
python .//Deviant-Code//main.py <configuration file path> -q
Other options are available and make it possible to override the parameters specified in the configuration file. All of them are described in the documentation of the main script. For more information about the available options, please run:
python .//Deviant-Code//main.py -h
The script can also launch performance experiments; examples of such commands for each benchmark dataset are given in Experiments_Scripts_and_Results.py.
Some patterns returned by DEvIANT when looking for exceptionally consensual/conflictual topics among Republicans in the 115th Congress (House of Representatives):
id_pattern | group | context | nb individuals | nb entities | nb outcomes | Overall intra-group agreement | Contextual intra-group agreement | Deviation | confidence interval | p value | state intra-agreement |
---|---|---|---|---|---|---|---|---|---|---|---|
Pattern 1 | Republican Party | ['20.11 Government Branch Relations, Administrative Issues, and Constitutional Reforms'] | 246 | 27 | 6178 | 0.83 | 0.32 | -0.51 | [0.67, 0.99] | <0.0001 | Conflictual |
Pattern 2 | Republican Party | ['5 Labor'] | 246 | 22 | 5071 | 0.83 | 0.64 | -0.20 | [0.659, 1.] | <0.01 | Conflictual |
Pattern 3 | Republican Party | ['20.05 Nominations and Appointments Not Codable Elsewhere'] | 246 | 177 | 40879 | 0.83 | 0.92 | +0.09 | [0.76, 0.89] | <0.0001 | Consensual |
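The columns of the table can be related to each other with a short sketch. It assumes that the deviation is the contextual agreement minus the overall agreement, and that a pattern is labelled Conflictual or Consensual when the contextual agreement falls below or above the reported confidence interval. This reading is consistent with the three rows above but is an assumption, and the function `describe_pattern` is hypothetical.

```python
def describe_pattern(overall, contextual, interval):
    """Return (deviation, state) for one row of the results table."""
    deviation = round(contextual - overall, 2)
    low, high = interval
    if contextual < low:
        state = "Conflictual"      # agreement lower than expected
    elif contextual > high:
        state = "Consensual"       # agreement higher than expected
    else:
        state = "Not significant"  # inside the interval: not reported
    return deviation, state


# Rows "Pattern 1" and "Pattern 3" of the table above.
print(describe_pattern(0.83, 0.32, (0.67, 0.99)))
print(describe_pattern(0.83, 0.92, (0.76, 0.89)))
```

For Pattern 2, the rounded figures give a deviation of -0.19 rather than the reported -0.20, presumably because the table was computed from unrounded values.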
1.0.0
For additional information, please contact: BELFODIL Adnene adnene.belfodil@gmail.com