graphxplore.GraphDataScience package

This subpackage contains functionality for exploratory data analysis using attribute association graphs. The starting point is a Neo4j database containing the data of a BaseGraph. With the AttributeAssociationGraphGenerator class, the user can generate an AttributeAssociationGraph. The user must provide one or multiple groups of primary keys, either with GroupSelector objects or by manually writing Cypher queries. For each group, several statistical traits of each attribute (table/variable/value combination) are measured, as well as conditional dependencies between attributes. This is done with a breadth-first search (BFS) strategy that starts from the primary keys contained in the group and traverses to all connected attributes in the table itself and potentially in foreign tables (and foreign tables of foreign tables…). Attributes can be pre-filtered by table, name and value using AttributeAssociationGraphPreFilter objects, and the generated AttributeAssociationGraph can additionally be post-filtered by statistical traits using AttributeAssociationGraphPostFilter objects.

The statistical traits of single attributes (absolute count, percentage in groups, missing data ratio, percentage difference and ratio) are stored in AttributeAssociationNode objects where all parameters are explained in detail. All nodes are assigned a FrequencyLabel based on their prevalence (percentage of occurrence in group) and the frequency_thresholds parameter of AttributeAssociationGraphGenerator. This label will impact the size of the circle depicting the node in the visualization. This way, the attention of the user is drawn to more frequently appearing attributes. Additionally, nodes are assigned a DistinctionLabel (if positive_group and negative_group of AttributeAssociationGraphGenerator are set) based on their prevalence difference and ratio and the prevalence_diff_thresholds and prevalence_ratio_thresholds parameters of AttributeAssociationGraphGenerator (at least one threshold must be passed). This label impacts the color of the nodes with red or orange (higher prevalence in positive_group), beige (roughly same prevalence in positive_group and negative_group), or turquoise or blue (higher prevalence in negative_group). This way, attention is drawn to attributes which might be associated with the selected groups.
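
The labeling rules above can be illustrated with a small sketch. It is not the library's implementation: the function names, the placeholder labels "infrequent" and "unrelated", the inclusive comparisons, and the assumption that both the difference and the ratio threshold must be met are illustrative choices. Only the default thresholds and the labels "frequent", "highly frequent", "related", "highly related", "inverse" and "highly inverse" are taken from the AttributeAssociationGraphGenerator parameters.

```python
def frequency_label(prevalence, thresholds=(0.1, 0.5)):
    """Map a prevalence to a size label ("infrequent" is a placeholder name)."""
    frequent, highly_frequent = thresholds
    if prevalence >= highly_frequent:
        return "highly frequent"
    if prevalence >= frequent:
        return "frequent"
    return "infrequent"


def distinction_label(pos_prev, neg_prev,
                      diff_thresholds=(0.1, 0.2),
                      ratio_thresholds=(1.5, 2.0)):
    """Compare the prevalence in the positive and negative group.

    Assumption: a label is assigned only if BOTH the difference and the
    ratio threshold of that level are met. "unrelated" is a placeholder.
    """
    if pos_prev >= neg_prev:
        high, low, names = pos_prev, neg_prev, ("related", "highly related")
    else:
        high, low, names = neg_prev, pos_prev, ("inverse", "highly inverse")
    diff = high - low
    ratio = high / low if low > 0 else float("inf")
    # check the stricter level (index 1) first, then the weaker one (index 0)
    for level in (1, 0):
        if diff >= diff_thresholds[level] and ratio >= ratio_thresholds[level]:
            return names[level]
    return "unrelated"


print(frequency_label(0.3))
print(distinction_label(0.6, 0.2))
print(distinction_label(0.2, 0.35))
```

In this sketch, a prevalence of 60% in the positive group versus 20% in the negative group exceeds both stricter thresholds, while 20% versus 35% only exceeds the weaker ones in favor of the negative group.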

The conditional dependencies (absolute co-occurrence, conditional prevalence of the target attribute given the source attribute, comparison of conditional and unconditional target prevalence via difference and ratio) are stored in AttributeAssociationEdge objects and explained there in detail. All edges are assigned an AttributeAssociationEdgeType based on the difference and ratio of the conditional and unconditional prevalence, and the cond_increase_thresholds and increase_ratio_thresholds parameters of AttributeAssociationGraphGenerator (at least one threshold must be passed).
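
As a worked example of the edge metrics for a relation A->B within one group, the following sketch computes them from raw counts. The function name, argument names and dictionary keys are hypothetical, and the handling of missing values in the library may differ; see AttributeAssociationEdge for the authoritative definitions.

```python
def conditional_metrics(count_source, count_target, co_occurrence, group_size):
    """Edge metrics for A->B in a single group (illustrative names only)."""
    # unconditional prevalence of the target attribute B in the group
    prevalence_target = count_target / group_size
    # prevalence of B among the group members that have attribute A
    cond_prevalence = co_occurrence / count_source
    return {
        "conditional_prevalence": cond_prevalence,
        "conditional_increase": cond_prevalence - prevalence_target,
        "increase_ratio": cond_prevalence / prevalence_target,
    }


# 100 group members, 40 with A, 25 with B, 20 with both A and B
metrics = conditional_metrics(count_source=40, count_target=25,
                              co_occurrence=20, group_size=100)
print(metrics)
```

Here B occurs in 25% of the group but in 50% of the members with A, so the conditional increase is 0.25 and the increase ratio is 2.0. With the default cond_increase_thresholds and increase_ratio_thresholds, both values reach the upper threshold.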

Code might look like

>>> from graphxplore.Basis import GraphDatabaseWriter
>>> from graphxplore.MetaDataHandling import MetaData
>>> from graphxplore.GraphDataScience import (AttributeAssociationGraphGenerator, CompositionGraphPostFilter,
>>>                                           AttributeFilter, StringFilterType, NumericFilterType,
>>>                                           AttributeAssociationGraphPreFilter, GroupSelector)
>>> from graphxplore.DataMapping.Conditionals import StringOperator, StringOperatorType, NegatedOperator
>>>
>>> my_meta = MetaData.load_from_json(filepath='/meta_path.json')
# define a group of primary keys which have the attribute 'apple' of variable 'food' and a control group not
# having this attribute
>>> apple_condition = StringOperator(table='table',variable='food',value='apple', compare=StringOperatorType.Equals)
>>> apple_group = GroupSelector(group_table='table',meta=my_meta,
>>>                             group_filter=apple_condition)
>>> control_group = GroupSelector(group_table='table',meta=my_meta,
>>>                               group_filter=NegatedOperator(pos_operator=apple_condition))
# captured variable/value combinations cannot originate from 'forbidden_table', cannot contain 'nana' in their
# variable name, and their value must be a string, or be negative or at least 100
>>> no_nanas = AttributeFilter('nana', StringFilterType.Contains, include=False)
>>> negative_or_large = [AttributeFilter(0, NumericFilterType.Smaller, include=True),
>>>                      AttributeFilter(100, NumericFilterType.LargerOrEqual, include=True)]
>>> pre_filter = AttributeAssociationGraphPreFilter(blacklist_tables=['forbidden_table'], name_filters=[no_nanas],
>>>                                                 value_filters=negative_or_large)
# use a composition filter for post-filtering which removes 50% of nodes and enforces a ratio of 10% high
# prevalence, 70% high prevalence difference, and 20% high prevalence ratio
>>> node_ratio = (0.1,0.7,0.2)
>>> post_filter = CompositionGraphPostFilter(node_comp_ratio=node_ratio)
>>> generator = AttributeAssociationGraphGenerator(
>>>     db_name='mygraphdb', group_selection={'Apple' : apple_group, 'NoApple' : control_group},
>>>     positive_group='Apple', negative_group='NoApple', pre_filter=pre_filter, post_filter=post_filter,
>>>     address='bolt://localhost:7687', auth=('my_user', 'my_password'))
>>> aag = generator.generate_graph()
# write graph to Neo4J database
>>> GraphDatabaseWriter.write_graph(db_name='apple_aag', graph=aag, address='bolt://localhost:7687',
>>>                                 auth=('my_user', 'my_password'))

Module contents

class graphxplore.GraphDataScience.AndThresholdFilterCascade(filters: Iterable[ThresholdFilter] | None = None)[source]

Bases: ThresholdFilterCascade

This class checks if all its sub-filter criteria are fulfilled (conjunction).

Parameters:: filters (Iterable[ThresholdFilter] | None) – The sub-filters

is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) → bool[source]

Checks the given filter criteria.

Parameters:: obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
Returns:: Returns True, if the object passed the filter criteria, False otherwise
Return type:: bool

class graphxplore.GraphDataScience.AttributeAssociationGraphGenerator(db_name: str, group_selection: Dict[str, GroupSelector | str], positive_group: str | None = None, negative_group: str | None = None, pre_filter: AttributeAssociationGraphPreFilter | None = None, post_filter: AttributeAssociationGraphPostFilter | None = None, frequency_thresholds: Tuple[float, float] = (0.1, 0.5), prevalence_diff_thresholds: Tuple[float, float] = (0.1, 0.2), prevalence_ratio_thresholds: Tuple[float, float] = (1.5, 2.0), cond_increase_thresholds: Tuple[float, float] = (0.1, 0.2), increase_ratio_thresholds: Tuple[float, float] = (1.5, 2.0), address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', ''))[source]

Bases: object

This class extracts statistical measurements for all attributes in a dataset regarding their association with one or multiple selected groups of primary keys (e.g. patient IDs). Absolute counts, missing value rates and prevalence of attributes within groups are calculated and compared by difference and ratio. These parameters are stored in AttributeAssociationNode objects. Additionally, conditional dependencies between attributes are measured by co-occurrence and conditional prevalence in groups, and compared to unconditional prevalence. The results are stored in AttributeAssociationEdge objects. For more detailed descriptions of the calculated metrics refer to AttributeAssociationNode and AttributeAssociationEdge.

Nodes are labeled based on the prevalence of their attribute in the defined groups and the distinction between the group prevalences. These labels are encoded in the size and color of the circles depicting the nodes in the Neo4j visualization. Additionally, edges are assigned a type based on the distinction between conditional and unconditional prevalence. This edge type influences the thickness of the drawn arrow representing the edge.

The origin dataset must be stored as a BaseGraph in a Neo4j database. The considered attributes can be pre-filtered by name and value using data types, string and numerical comparisons, and blacklist and whitelist conditions. Additionally, the generated graph can be post-filtered by assessing the calculated statistical measurements.

Parameters:

db_name (str) – The name of the database
group_selection (Dict[str, GroupSelector | str]) – For each group of primary keys, the name and selection condition as a GroupSelector object or as a Cypher query. The node IDs of primary keys must be returned with the Cypher variable “x_0” in the form “return id(<node variable>) as x_0”
positive_group (str | None) – The name of the positive group. Must be contained in group_selection if defined. Attributes which appear more frequently in this group compared to the negative_group will be labeled as “related” or “highly related” and colored in orange or red in the visualization. Defaults to None
negative_group (str | None) – The name of the negative group. Must be contained in group_selection if defined. Attributes which appear more frequently in this group compared to the positive_group will be labeled as “inverse” or “highly inverse” and colored in turquoise or blue in the visualization. Defaults to None
pre_filter (AttributeAssociationGraphPreFilter | None) – The filter applied to attribute nodes when querying the database, defaults to None
post_filter (AttributeAssociationGraphPostFilter | None) – The post filter applied to the generated knowledge graph, defaults to None
frequency_thresholds (Tuple[float, float]) – Thresholds of prevalence for “frequent” and “highly frequent” labels, defaults to 0.1 and 0.5
prevalence_diff_thresholds (Tuple[float, float]) – Thresholds for prevalence difference for “related”/“inverse” and “highly related”/“highly inverse” labels, defaults to 0.1 and 0.2
prevalence_ratio_thresholds (Tuple[float, float]) – Thresholds for prevalence ratio for “related”/“inverse” and “highly related”/“highly inverse” labels, defaults to 1.5 and 2.0
cond_increase_thresholds (Tuple[float, float]) – Thresholds of conditional increase for “medium relation” and “high relation” labels, defaults to 0.1 and 0.2
increase_ratio_thresholds (Tuple[float, float]) – Thresholds of conditional increase ratio for “medium relation” and “high relation” labels, defaults to 1.5 and 2.0
address (str) – The address of the Neo4J DBMS. Can be generated with get_neo4j_address()
auth (Tuple[str, str]) – username and password to access the Neo4j DBMS

generate_graph() → AttributeAssociationGraph[source]

Generates the graph by first identifying all group primary key nodes, and then retrieving all reachable attributes (directly connected or via a path to foreign tables) with a breadth-first search strategy from the Neo4J database. Pre and/or post filters are applied if they were specified.

Returns:: Returns the generated graph
Return type:: AttributeAssociationGraph
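
The traversal idea can be sketched in memory as a depth-limited breadth-first search. This is only an illustration: the library runs the traversal through Cypher queries against Neo4j, and the function name, the toy adjacency dictionary and the node names below are hypothetical. The depth cap plays the role of the max_path_length parameter of AttributeAssociationGraphPreFilter.

```python
from collections import deque


def reachable_within(graph, start, max_path_length):
    """Return every node reachable from `start` mapped to its BFS distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_path_length:
            continue  # do not expand beyond the allowed path length
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]  # the start node itself is not a result
    return distances


# toy graph: a primary key, its own attributes, and two chained foreign tables
toy_graph = {
    "patient_1": ["age=42", "visit_7"],
    "visit_7": ["diagnosis=flu", "lab_3"],
    "lab_3": ["crp=high"],
}
print(reachable_within(toy_graph, "patient_1", max_path_length=2))
```

With a maximum path length of 2, the attribute behind the second foreign table ("crp=high", three steps away) is not reached.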

class graphxplore.GraphDataScience.AttributeAssociationGraphPostFilter[source]

Bases: object

This is the abstract parent class of all post filters for attribute association graphs. Post filters assess the statistical metrics calculated for attributes and either use thresholds (ThresholdGraphPostFilter) or select the attributes based on a composition of metric values (CompositionGraphPostFilter).

filter_graph(graph: AttributeAssociationGraph) → AttributeAssociationGraph[source]

Filters the given graph by its statistical traits

Parameters:: graph (AttributeAssociationGraph) – The graph to filter
Returns:: Returns the new, filtered graph
Return type:: AttributeAssociationGraph

class graphxplore.GraphDataScience.AttributeAssociationGraphPreFilter(max_path_length: int = 3, whitelist_tables: Iterable[str] | None = None, blacklist_tables: Iterable[str] | None = None, target_tables: Iterable[str] | None = None, name_filters: Iterable[AttributeFilter] | None = None, value_filters: Iterable[AttributeFilter] | None = None)[source]

Bases: object

This class captures all filters that the AttributeAssociationGraphGenerator applies to the attribute nodes of a BaseGraph stored in a Neo4j database. Attribute nodes are selected for the statistical analysis based on these filters. Each node’s name and value parameter must match at least one whitelist filter (if specified) and cannot match any blacklist filter (if specified). With the different table filters, the BFS of the AttributeAssociationGraphGenerator can be narrowed down, potentially reducing its runtime dramatically for large databases.
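
The whitelist/blacklist rule can be sketched with plain predicates. This is a hedged illustration, not the library's code: the helper name passes and the representation of a filter as a (predicate, include) pair are assumptions, and the special handling of data types (e.g. string values versus numeric filters) is left out.

```python
def passes(value, filters):
    """Apply (predicate, include) pairs; include=True marks a whitelist filter."""
    whitelist = [pred for pred, include in filters if include]
    blacklist = [pred for pred, include in filters if not include]
    # a single blacklist match rejects the value
    if any(pred(value) for pred in blacklist):
        return False
    # without whitelist filters everything else passes,
    # otherwise at least one whitelist filter must match
    return not whitelist or any(pred(value) for pred in whitelist)


# name must not contain 'nana'; value must be negative or at least 100
name_filters = [(lambda name: "nana" in name, False)]
value_filters = [(lambda v: v < 0, True), (lambda v: v >= 100, True)]

print(passes("banana_intake", name_filters))
print(passes(150, value_filters))
print(passes(50, value_filters))
```

The two predicate lists mirror the no_nanas and negative_or_large filters of the package example.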

Parameters:

max_path_length (int) – The maximum allowed length of a path from a primary key node to an attribute node in the BFS
whitelist_tables (Iterable[str] | None) – If specified, only nodes of these tables and optionally the target_tables are traversed
blacklist_tables (Iterable[str] | None) – If specified, nodes of these tables are never traversed
target_tables (Iterable[str] | None) – If specified, only nodes of these tables can be the end of the BFS traversal (others can be traversed)
name_filters (Iterable[AttributeFilter] | None) – The filters on the name parameter of the nodes
value_filters (Iterable[AttributeFilter] | None) – The filters on the value parameter of the nodes

get_query(primary_node_id: int)[source]

Generates the Cypher query for the BFS search starting from the primary node with index primary_node_id.

Parameters:: primary_node_id (int) – The Neo4j internal node index
Returns:: Returns the query as string

class graphxplore.GraphDataScience.AttributeFilter(filter_value: str | int | float, filter_type: StringFilterType | NumericFilterType, include: bool)[source]

Bases: object

This class represents one of multiple filters applied to the attribute nodes, which are BaseNode objects. A node is valid if it matches the filter criteria.

Parameters:

filter_value (str | int | float) – The value against which will be filtered
filter_type (StringFilterType | NumericFilterType) – The type of filter
include (bool) – If True, the filter will be used as whitelist, otherwise as blacklist filter

class graphxplore.GraphDataScience.CompositionGraphPostFilter(min_prevalence: float = 0.01, min_prevalence_mode: GroupFilterMode = GroupFilterMode.All, min_cond_prevalence: float = 0.05, min_cond_prevalence_mode: GroupFilterMode = GroupFilterMode.All, max_missing: float = 0.2, max_missing_mode: GroupFilterMode = GroupFilterMode.All, perc_nof_nodes: float = 0.5, perc_nof_edges=0.25, node_comp_ratio: Tuple[float, float, float] = (0.2, 0.5, 0.3), edge_comp_ratio: Tuple[float, float, float] = (0.2, 0.5, 0.3), max_nof_nodes: int | None = None, max_nof_edges: int | None = None, include_conditional_decrease: bool = False)[source]

Bases: AttributeAssociationGraphPostFilter

This class filters an attribute association graph based on user-defined score composition ratios. For each score, the nodes or edges with the highest values will be selected and the graph will be filled with them according to the specified ratio. The node ratio can be built out of the following three metrics:

high maximum prevalence: These attributes appear often in the group with the highest prevalence, but are not necessarily selective for that specific group
high prevalence difference: These attributes appear more often in one group compared to another in absolute terms. Thus, these attributes are sensitive for that group. But they could still have some prevalence in another group, meaning their specificity could be low
high prevalence ratio: These attributes are specific for one group compared to another group. But all prevalence values could be low, and as a result the sensitivity of the attributes could be low as well.

The edge ratio can be built out of the following three metrics for a relation A->B:

high maximum conditional prevalence: Many members with attribute A, also exhibit attribute B in the group with the highest conditional prevalence. However, B could just have a high prevalence itself and thus the added condition of A would have little influence
high maximum conditional increase: The added condition of A has a high sensitivity for the presence of B in at least one group. However, the prevalence of B could be high as well, meaning A would not be specific for B
high maximum conditional increase ratio: The added condition of A has a high specificity for the presence of B in at least one group. However, the conditional prevalence could be low, meaning A would not be sensitive for the presence of B

Additionally, a minimum prevalence and conditional prevalence, as well as a maximum missing value ratio, can be specified. Moreover, the number of nodes and edges in the filtered graph can be adjusted using a percentage of the unfiltered amount or an absolute value. If both the percentage and the absolute value are specified, the smaller resulting number of nodes or edges will be taken.

NOTE: Since attribute association graphs with only one group have no prevalence difference and ratio metrics, only the nodes with the highest prevalence will be selected
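
How the size limits and the composition ratio interact can be sketched as follows. This is an assumption-laden illustration rather than the library's algorithm: the function name composition_budget, the rounding scheme, and the choice to give rounding leftovers to the first metric are hypothetical. Only the rule "take the smaller of percentage and absolute limit" and the requirement that ratios sum to 1.0 come from the description above.

```python
def composition_budget(nof_candidates, perc=0.5, max_nof=None,
                       comp_ratio=(0.2, 0.5, 0.3)):
    """Return the total budget and how many slots each metric receives."""
    # tolerate floating point noise, e.g. 0.1 + 0.7 + 0.2 != 1.0 exactly
    if abs(sum(comp_ratio) - 1.0) > 1e-9:
        raise ValueError("composition ratios must sum to 1.0")
    budget = int(nof_candidates * perc)
    if max_nof is not None:
        budget = min(budget, max_nof)  # the smaller limit wins
    counts = [int(budget * share) for share in comp_ratio]
    counts[0] += budget - sum(counts)  # assumption: leftovers go to metric 1
    return budget, tuple(counts)


# 200 nodes passed min_prevalence and max_missing
print(composition_budget(200, perc=0.5, max_nof=40))
print(composition_budget(200, perc=0.5, comp_ratio=(0.1, 0.7, 0.2)))
```

In the first call the absolute limit of 40 is smaller than 50% of 200, so 40 slots are split according to the default 20%/50%/30% ratio.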

Parameters:

min_prevalence (float) – Nodes with a prevalence below this value will be removed, defaults to 1%
min_prevalence_mode (GroupFilterMode) – Specifies if only one or all groups must pass min_prevalence, defaults to GroupFilterMode.All
min_cond_prevalence (float) – Edges with a conditional prevalence below this value in all groups will be removed, defaults to 5%
min_cond_prevalence_mode (GroupFilterMode) – Specifies if only one or all groups must pass min_cond_prevalence, defaults to GroupFilterMode.All
max_missing (float) – Nodes with a missing ratio above this value will be removed, defaults to 20%
max_missing_mode (GroupFilterMode) – Specifies if only one or all groups must pass max_missing, defaults to GroupFilterMode.All
perc_nof_nodes (float) – Percentage of nodes that should remain of all the nodes passing min_prevalence and max_missing, defaults to 50%
perc_nof_edges – Percentage of edges that should remain of all edges passing min_cond_prevalence, defaults to 25%
node_comp_ratio (Tuple[float, float, float]) – The percentage of nodes with high maximal prevalence/difference/ratio after filtering. Ratios must sum to 1.0. Defaults to 20%/50%/30%
edge_comp_ratio (Tuple[float, float, float]) – The percentage of edges with high maximal conditional prevalence/maximal conditional increase/maximal conditional increase ratio after filtering. Ratios must sum to 1.0. Defaults to 20%/50%/30%
max_nof_nodes (int | None) – The maximum number of nodes that should exist in the graph after filtering, defaults to None (meaning no filtering applied with this threshold)
max_nof_edges (int | None) – The maximum number of edges that should exist in the graph after filtering, defaults to None (meaning no filtering applied with this threshold)
include_conditional_decrease (bool) – Specifies, if negative absolute conditional increase and conditional increase ratio smaller than 1.0 should be identified as high conditional increase (and ratio) in the edge composition. Defaults to False

filter_graph(graph: AttributeAssociationGraph) → AttributeAssociationGraph[source]

Filters the given graph by its statistical traits

Parameters:: graph (AttributeAssociationGraph) – The graph to filter
Returns:: Returns the new, filtered graph
Return type:: AttributeAssociationGraph

class graphxplore.GraphDataScience.GroupFilterMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Specifies if only one or all group metrics must pass the filter criteria.

All = 'All'

Any = 'Any'

class graphxplore.GraphDataScience.GroupSelector(group_table: str, meta: ~graphxplore.MetaDataHandling.meta_data.MetaData, group_filter: ~graphxplore.DataMapping.Conditionals.logic_operators.LogicOperator = <graphxplore.DataMapping.Conditionals.logic_operators.AlwaysTrueOperator object>)[source]

Bases: object

This class generates Cypher statements to select a group of primary keys (e.g. patient IDs) from a Neo4J database based on a LogicOperator object. Variables from inverted foreign table chains can be aggregated, and variables from foreign table chains can be used for singular comparison. Negations, conjunctions and disjunctions can be used as well.

Parameters:

group_table (str) – The name of the origin table for the group to select
meta (MetaData) – The metadata of the database
group_filter (LogicOperator) – The conditional describing the group, defaults to the tautology (all primary keys of group_table will be selected)

get_cypher_query() → str[source]

Generates the Cypher query to select the primary keys for the group.

Returns:: Returns the generated query as a string
Return type:: str

class graphxplore.GraphDataScience.NumericFilterType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

The type of filter on attribute nodes with numeric value.

Equals = '='

Larger = '>'

LargerOrEqual = '>='

Smaller = '<'

SmallerOrEqual = '<='

UnequalTo = '<>'

class graphxplore.GraphDataScience.OrThresholdFilterCascade(filters: Iterable[ThresholdFilter] | None = None)[source]

Bases: ThresholdFilterCascade

This class checks if at least one of its sub-filter criteria is fulfilled (disjunction).

Parameters:: filters (Iterable[ThresholdFilter] | None) – The sub-filters

is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) → bool[source]

Checks the given filter criteria.

Parameters:: obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
Returns:: Returns True, if the object passed the filter criteria, False otherwise
Return type:: bool

class graphxplore.GraphDataScience.StringFilterType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

The type of filter on attribute nodes with string value.

Contains = 'contains'

Equals = '='

UnequalTo = '<>'

class graphxplore.GraphDataScience.ThresholdFilter[source]

Bases: object

This class is an abstract parent for classes filtering AttributeAssociationNode and AttributeAssociationEdge objects by parameter thresholds.

is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) → bool[source]

Checks the given filter criteria.

Parameters:: obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
Returns:: Returns True, if the object passed the filter criteria, False otherwise
Return type:: bool

class graphxplore.GraphDataScience.ThresholdFilterCascade(filters: Iterable[ThresholdFilter] | None = None)[source]

Bases: ThresholdFilter

This class contains a list of sub-filters (also ThresholdFilter objects) which are applied consecutively. A cascade is either a conjunction or a disjunction.

Parameters:: filters (Iterable[ThresholdFilter] | None) – The sub-filters

add_filter(filter_to_add: ThresholdFilter) → bool[source]

Add a filter to the cascade. The filter is only added, if it imposes a real constraint.

Parameters:: filter_to_add (ThresholdFilter) – The filter that should be added to the cascade
Returns:: Returns True, if the filter imposes a constraint and was added, False otherwise
Return type:: bool

is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) → bool[source]

Checks the given filter criteria.

Parameters:: obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
Returns:: Returns True, if the object passed the filter criteria, False otherwise
Return type:: bool
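
The conjunction and disjunction semantics of the cascades can be sketched with plain callables. The helpers and_cascade and or_cascade, the dictionary standing in for a node, and its keys are hypothetical and do not reflect the actual class interfaces; the sketch only shows how nested cascades combine sub-filter results.

```python
def and_cascade(filters):
    """Valid only if every sub-filter accepts the object (conjunction)."""
    return lambda obj: all(f(obj) for f in filters)


def or_cascade(filters):
    """Valid if at least one sub-filter accepts the object (disjunction)."""
    return lambda obj: any(f(obj) for f in filters)


def frequent(node):
    return node["prevalence"] >= 0.2


def complete(node):
    return node["missing"] <= 0.1


def rare(node):
    return node["prevalence"] < 0.01


node = {"prevalence": 0.3, "missing": 0.05}  # stand-in for a node object

print(and_cascade([frequent, complete])(node))
print(and_cascade([frequent, rare])(node))
# cascades are filters themselves, so they can be nested
print(or_cascade([rare, and_cascade([frequent, complete])])(node))
```

Because a cascade is itself a filter, conjunctions and disjunctions can be nested to arbitrary depth.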

class graphxplore.GraphDataScience.ThresholdGraphPostFilter(node_filter: ThresholdFilter | None = None, edge_filter: ThresholdFilter | None = None)[source]

Bases: AttributeAssociationGraphPostFilter

This class filters the nodes and edges of an AttributeAssociationGraph based on property thresholds. The thresholds can be arbitrarily combined by conjunctions and disjunctions.

Parameters:

node_filter (ThresholdFilter | None) – The filter applied to nodes
edge_filter (ThresholdFilter | None) – The filter applied to edges

filter_graph(graph: AttributeAssociationGraph) → AttributeAssociationGraph[source]

Filters the given graph by its statistical traits

Parameters:: graph (AttributeAssociationGraph) – The graph to filter
Returns:: Returns the new, filtered graph
Return type:: AttributeAssociationGraph

Bases: ThresholdFilter

This class filters one specific parameter of an AttributeAssociationNode or AttributeAssociationEdge object and checks if the parameter value lies in the interval [min_val; max_val]. If the property is group dependent, the filter mode must be specified.

Parameters:

param_to_filter (str) – The name of the parameter by which to filter. Must be a statistical parameter of AttributeAssociationNode or AttributeAssociationEdge
min_val (int | float | None) – The lowest allowed property value to pass the filter, defaults to None
max_val (int | float | None) – The highest allowed property value to pass the filter, defaults to None
mode (GroupFilterMode | None) – The filter mode required for group-dependent parameters. Specifies, if all or only one group value must meet the filter thresholds. Defaults to None

is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) → bool[source]

Checks the given filter criteria.

Parameters:: obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
Returns:: Returns True, if the object passed the filter criteria, False otherwise
Return type:: bool
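
The interval check for a group-dependent parameter can be sketched as below. The function in_interval and the dictionary of per-group values are hypothetical stand-ins; the closed interval [min_val; max_val] and the mode values 'All' and 'Any' follow the descriptions of this class and of GroupFilterMode.

```python
def in_interval(group_values, min_val=None, max_val=None, mode="All"):
    """Check per-group values against the closed interval [min_val; max_val]."""
    def ok(value):
        # a bound of None means that side of the interval is unconstrained
        return ((min_val is None or value >= min_val)
                and (max_val is None or value <= max_val))

    combine = all if mode == "All" else any
    return combine(ok(value) for value in group_values.values())


prevalence = {"Apple": 0.4, "NoApple": 0.05}  # one value per group

print(in_interval(prevalence, min_val=0.1, mode="All"))
print(in_interval(prevalence, min_val=0.1, mode="Any"))
```

With a minimum of 0.1, only the 'Apple' group passes, so the check fails in mode 'All' and succeeds in mode 'Any'.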