graphxplore.GraphDataScience package
This subpackage contains functionality for exploratory data analysis using attribute association graphs. The starting
point is a Neo4j database containing the data of a BaseGraph. With the
AttributeAssociationGraphGenerator class, the user can generate an
AttributeAssociationGraph. The user must provide one or multiple
groups of primary keys using the GroupSelector or by manually writing Neo4j Cypher
queries. Several statistical traits of
each attribute (table/variable/value combination) are measured for each group as well as conditional dependencies
between attributes. This is done with a breadth-first-search (BFS) strategy starting from the primary keys contained
in the group and traversing to all connected attributes in the table itself and potentially foreign tables (and
foreign tables of foreign tables…). Attributes can be pre-filtered by table, name and value using
AttributeAssociationGraphPreFilter objects, or the generated
AttributeAssociationGraph can be post-filtered by statistical
traits using AttributeAssociationGraphPostFilter objects.
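The traversal described above can be illustrated with a toy, in-memory breadth-first search. This is only a sketch under stated assumptions: the adjacency dictionary, the node names, and the helper `reachable_within` are hypothetical and not part of graphxplore, which runs its traversal against the Neo4j database. The `max_path_length` argument mirrors the parameter of the same name in AttributeAssociationGraphPreFilter.

```python
from collections import deque


def reachable_within(graph, start_nodes, max_path_length):
    """Return {node: distance} for every node reachable from the start
    nodes using at most max_path_length edges (plain BFS)."""
    distance = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        # Do not expand nodes that already sit at the maximum depth
        if distance[node] == max_path_length:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical toy data: a primary key, its attributes, and foreign tables
toy_graph = {
    "patient_1": ["food=apple", "visit_7"],
    "visit_7": ["ward=A", "doctor_3"],
    "doctor_3": ["specialty=cardiology"],
}
found = reachable_within(toy_graph, ["patient_1"], max_path_length=2)
```

With `max_path_length=2`, the attribute of the second foreign table (`specialty=cardiology`, three edges away) is not reached, which is how limiting the path length narrows the search.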
The statistical traits of single attributes (absolute count, percentage in groups, missing data ratio, percentage
difference and ratio) are stored in
AttributeAssociationNode objects where all parameters are
explained in detail. All nodes are assigned a FrequencyLabel
based on their prevalence (percentage of occurrence in group) and the frequency_thresholds parameter of
AttributeAssociationGraphGenerator. This label will impact the size of the
circle depicting the node in the visualization. This way, the attention of the user is drawn to more frequently
appearing attributes. Additionally, nodes are assigned a
DistinctionLabel (if positive_group and negative_group of
AttributeAssociationGraphGenerator are set) based on their prevalence
difference and ratio and the prevalence_diff_thresholds and prevalence_ratio_thresholds parameters of
AttributeAssociationGraphGenerator (at least one threshold must be passed).
This label impacts the color of the nodes with red or orange (higher prevalence in positive_group), beige (roughly
same prevalence in positive_group and negative_group), or turquoise or blue (higher prevalence in
negative_group). This way, attention is drawn to attributes which might be associated with the selected groups.
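The node labeling can be sketched in plain Python. The helper names below are hypothetical, and several details are assumptions rather than the library's documented behavior: the label names "infrequent" and "unrelated", the use of `>=` comparisons, and the rule that both the difference and the ratio threshold must be reached. The default thresholds are taken from the AttributeAssociationGraphGenerator signature.

```python
def prevalence(count, group_size):
    """Fraction of group members that exhibit the attribute."""
    return count / group_size if group_size else 0.0


def frequency_label(prev, thresholds=(0.1, 0.5)):
    frequent, highly_frequent = thresholds
    if prev >= highly_frequent:
        return "highly frequent"
    if prev >= frequent:
        return "frequent"
    return "infrequent"  # assumed name for the lowest level


def distinction_label(pos_prev, neg_prev,
                      diff_thresholds=(0.1, 0.2),
                      ratio_thresholds=(1.5, 2.0)):
    high, low = max(pos_prev, neg_prev), min(pos_prev, neg_prev)
    diff = high - low
    ratio = high / low if low > 0 else float("inf")
    direction = "related" if pos_prev > neg_prev else "inverse"
    # Assumption: a level is reached only if both thresholds are met
    if diff >= diff_thresholds[1] and ratio >= ratio_thresholds[1]:
        return "highly " + direction
    if diff >= diff_thresholds[0] and ratio >= ratio_thresholds[0]:
        return direction
    return "unrelated"  # assumed name for the neutral level


size_label = frequency_label(prevalence(60, 100))
color_label = distinction_label(pos_prev=0.6, neg_prev=0.2)
```

In this sketch, an attribute present in 60% of the positive group and 20% of the negative group is both highly frequent and highly related.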
The conditional dependencies (absolute co-occurrence, conditional prevalence of the target attribute given the source
attribute, comparison of conditional and unconditional target prevalence via difference and ratio) are stored in
AttributeAssociationEdge objects and explained there in detail.
All edges are assigned an AttributeAssociationEdgeType based on
the difference and ratio of the conditional and unconditional prevalence, and the cond_increase_thresholds
and increase_ratio_thresholds parameters of
AttributeAssociationGraphGenerator (at least one threshold must be passed).
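The edge metrics named above can be written down as a short calculation. This is a minimal sketch following the verbal definitions in this paragraph; the function name and dictionary keys are hypothetical and need not match the attributes of AttributeAssociationEdge.

```python
def conditional_metrics(co_occurrence, source_count, target_count, group_size):
    """Metrics for an edge source -> target within one group."""
    # Share of members with the source attribute that also have the target
    conditional_prevalence = co_occurrence / source_count
    # Unconditional prevalence of the target attribute in the group
    target_prevalence = target_count / group_size
    return {
        "conditional_prevalence": conditional_prevalence,
        "conditional_increase": conditional_prevalence - target_prevalence,
        "increase_ratio": conditional_prevalence / target_prevalence,
    }


# 200 group members, 50 with the source, 40 with the target, 30 with both
metrics = conditional_metrics(co_occurrence=30, source_count=50,
                              target_count=40, group_size=200)
```

Here the conditional prevalence is 0.6 against an unconditional prevalence of 0.2, so the increase (0.4) and the ratio (3.0) both lie above the default upper thresholds of 0.2 and 2.0.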
Code might look like
>>> from graphxplore.Basis import GraphDatabaseWriter
>>> from graphxplore.MetaDataHandling import MetaData
>>> from graphxplore.GraphDataScience import (AttributeAssociationGraphGenerator, CompositionGraphPostFilter,
...                                           AttributeFilter, StringFilterType, NumericFilterType,
...                                           AttributeAssociationGraphPreFilter, GroupSelector)
>>> from graphxplore.DataMapping.Conditionals import StringOperator, StringOperatorType, NegatedOperator
>>>
>>> my_meta = MetaData.load_from_json(filepath='/meta_path.json')
>>> # define a group of primary keys which have the attribute 'apple' of variable 'food' and a control group not
>>> # having this attribute
>>> apple_condition = StringOperator(table='table', variable='food', value='apple',
...                                  compare=StringOperatorType.Equals)
>>> apple_group = GroupSelector(group_table='table', meta=my_meta,
...                             group_filter=apple_condition)
>>> control_group = GroupSelector(group_table='table', meta=my_meta,
...                               group_filter=NegatedOperator(pos_operator=apple_condition))
>>> # captured variable/value combinations cannot originate from 'forbidden_table', cannot contain 'nana' in their
>>> # variable name, and their value must be a string, or be negative or at least 100
>>> no_nanas = AttributeFilter('nana', StringFilterType.Contains, include=False)
>>> negative_or_large = [AttributeFilter(0, NumericFilterType.Smaller, include=True),
...                      AttributeFilter(100, NumericFilterType.LargerOrEqual, include=True)]
>>> pre_filter = AttributeAssociationGraphPreFilter(blacklist_tables=['forbidden_table'], name_filters=[no_nanas],
...                                                 value_filters=negative_or_large)
>>> # use a composition filter for post-filtering which removes 50% of nodes and enforces a ratio of 10% high
>>> # prevalence, 70% high prevalence difference, and 20% high prevalence ratio
>>> node_ratio = (0.1, 0.7, 0.2)
>>> post_filter = CompositionGraphPostFilter(node_comp_ratio=node_ratio)
>>> generator = AttributeAssociationGraphGenerator(
...     db_name='mygraphdb', group_selection={'Apple': apple_group, 'NoApple': control_group},
...     positive_group='Apple', negative_group='NoApple', pre_filter=pre_filter, post_filter=post_filter,
...     address='bolt://localhost:7687', auth=('my_user', 'my_password'))
>>> aag = generator.generate_graph()
>>> # write graph to Neo4j database
>>> GraphDatabaseWriter.write_graph(db_name='apple_aag', graph=aag, address='bolt://localhost:7687',
...                                 auth=('my_user', 'my_password'))
Module contents
- class graphxplore.GraphDataScience.AndThresholdFilterCascade(filters: Iterable[ThresholdFilter] | None = None)[source]
Bases: ThresholdFilterCascade
This class checks if all its sub-filter criteria are fulfilled (conjunction).
- Parameters:
filters (Iterable[ThresholdFilter] | None) – The sub-filters
- is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) bool[source]
Checks the given filter criteria.
- Parameters:
obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
- Returns:
Returns True if the object passed the filter criteria, False otherwise
- Return type:
bool
- class graphxplore.GraphDataScience.AttributeAssociationGraphGenerator(db_name: str, group_selection: Dict[str, GroupSelector | str], positive_group: str | None = None, negative_group: str | None = None, pre_filter: AttributeAssociationGraphPreFilter | None = None, post_filter: AttributeAssociationGraphPostFilter | None = None, frequency_thresholds: Tuple[float, float] = (0.1, 0.5), prevalence_diff_thresholds: Tuple[float, float] = (0.1, 0.2), prevalence_ratio_thresholds: Tuple[float, float] = (1.5, 2.0), cond_increase_thresholds: Tuple[float, float] = (0.1, 0.2), increase_ratio_thresholds: Tuple[float, float] = (1.5, 2.0), address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', ''))[source]
Bases: object
This class extracts statistical measurements for all attributes in a dataset regarding their association with one or multiple selected groups of primary keys (e.g. patient IDs). Absolute counts, missing value rates and prevalence of attributes within groups are calculated and compared by difference and ratio. These parameters are stored in AttributeAssociationNode objects. Additionally, conditional dependencies between attributes are measured by co-occurrence and conditional prevalence in groups, and compared to unconditional prevalence. The results are stored in AttributeAssociationEdge objects. For more detailed descriptions of the calculated metrics refer to AttributeAssociationNode and AttributeAssociationEdge.
Nodes are labeled based on the prevalence of their attribute in the defined groups and the distinction between these prevalences. These labels are encoded in the size and color of the circle depicting the node in the Neo4j visualization. Additionally, edges are assigned a type based on the distinction between conditional and unconditional prevalence. This edge type influences the thickness of the drawn arrow representing the edge.
The origin dataset must be stored as a BaseGraph in a Neo4j database. The considered attributes can be pre-filtered by name and value using datatypes, string and numerical comparisons, blacklist and whitelist conditions. Additionally, the generated graph can be post-filtered by assessing the calculated statistical measurements.
- Parameters:
db_name (str) – The name of the database
group_selection (Dict[str, GroupSelector | str]) – For each group of primary keys, the name and selection condition as a GroupSelector object or as a Cypher query. The node IDs of primary keys must be returned with the Cypher variable “x_0” in the form “return id(<node variable>) as x_0”
positive_group (str | None) – The name of the positive group. Must be contained in group_selection if defined. Attributes which appear more frequently in this group compared to the negative_group will be labeled as “related” or “highly related” and colored in orange or red in the visualization. Defaults to None
negative_group (str | None) – The name of the negative group. Must be contained in group_selection if defined. Attributes which appear more frequently in this group compared to the positive_group will be labeled as “inverse” or “highly inverse” and colored in turquoise or blue in the visualization. Defaults to None
pre_filter (AttributeAssociationGraphPreFilter | None) – The filter applied to attribute nodes when querying the database, defaults to None
post_filter (AttributeAssociationGraphPostFilter | None) – The post filter applied to the generated knowledge graph, defaults to None
frequency_thresholds (Tuple[float, float]) – Thresholds of prevalence for “frequent” and “highly frequent” labels, defaults to 0.1 and 0.5
prevalence_diff_thresholds (Tuple[float, float]) – Thresholds for prevalence difference for “related”/“inverse” and “highly related”/“highly inverse” labels, defaults to 0.1 and 0.2
prevalence_ratio_thresholds (Tuple[float, float]) – Thresholds for prevalence ratio for “related”/“inverse” and “highly related”/“highly inverse” labels, defaults to 1.5 and 2.0
cond_increase_thresholds (Tuple[float, float]) – Thresholds of conditional increase for “medium relation” and “high relation” labels, defaults to 0.1 and 0.2
increase_ratio_thresholds (Tuple[float, float]) – Thresholds of conditional increase ratio for “medium relation” and “high relation” labels, defaults to 1.5 and 2.0
address (str) – The address of the Neo4j DBMS. Can be generated with get_neo4j_address()
auth (Tuple[str, str]) – Username and password to access the Neo4j DBMS
- generate_graph() AttributeAssociationGraph[source]
Generates the graph by first identifying all group primary key nodes, and then retrieving all reachable attributes (directly connected or via a path to foreign tables) with a breadth-first search strategy from the Neo4J database. Pre and/or post filters are applied if they were specified.
- Returns:
Returns the generated graph
- Return type:
AttributeAssociationGraph
- class graphxplore.GraphDataScience.AttributeAssociationGraphPostFilter[source]
Bases: object
This is the abstract parent class of all post filters for attribute association graphs. Post filters assess the statistical metrics calculated for attributes and either use thresholds (ThresholdGraphPostFilter) or select the attributes based on a composition of metric values (CompositionGraphPostFilter).
- filter_graph(graph: AttributeAssociationGraph) AttributeAssociationGraph[source]
Filters the given graph by its statistical traits
- Parameters:
graph (AttributeAssociationGraph) – The graph to filter
- Returns:
Returns the new, filtered graph
- Return type:
AttributeAssociationGraph
- class graphxplore.GraphDataScience.AttributeAssociationGraphPreFilter(max_path_length: int = 3, whitelist_tables: Iterable[str] | None = None, blacklist_tables: Iterable[str] | None = None, target_tables: Iterable[str] | None = None, name_filters: Iterable[AttributeFilter] | None = None, value_filters: Iterable[AttributeFilter] | None = None)[source]
Bases: object
This class captures all filters that are applied to the attribute nodes of a BaseGraph stored in a Neo4j database by the AttributeAssociationGraphGenerator. Attribute nodes are selected for the statistical analysis based on these filters. Each node’s name and value parameter must match at least one whitelist filter (if specified) and cannot match a blacklist filter (if specified). With the different table filters, the BFS of the AttributeAssociationGraphGenerator can be narrowed down, potentially reducing its runtime dramatically for large databases.
- Parameters:
max_path_length (int) – The maximum allowed length of a path from a primary key node to an attribute node in the BFS
whitelist_tables (Iterable[str] | None) – If specified, only nodes of these tables and optionally the target_tables are traversed
blacklist_tables (Iterable[str] | None) – If specified, nodes of these tables are never traversed
target_tables (Iterable[str] | None) – If specified, only nodes of these tables can be the end of the BFS traversal (others can be traversed)
name_filters (Iterable[AttributeFilter] | None) – The filters on the name parameter of the nodes
value_filters (Iterable[AttributeFilter] | None) – The filters on the value parameter of the nodes
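The whitelist/blacklist rule stated above (match at least one whitelist filter if any exist, and match no blacklist filter) can be sketched independently of the library. `ToyFilter` and `passes` are hypothetical stand-ins for illustration; they do not reproduce how AttributeFilter objects are evaluated against the database.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ToyFilter:
    matches: Callable[[object], bool]  # predicate on a node's name or value
    include: bool                      # True: whitelist, False: blacklist


def passes(candidate, filters: Iterable[ToyFilter]) -> bool:
    filters = list(filters)
    whitelist = [f for f in filters if f.include]
    blacklist = [f for f in filters if not f.include]
    # A single blacklist match rejects the candidate
    if any(f.matches(candidate) for f in blacklist):
        return False
    # Without whitelist filters everything else is accepted
    return not whitelist or any(f.matches(candidate) for f in whitelist)


no_nanas = ToyFilter(lambda name: "nana" in name, include=False)
negative_or_large = [ToyFilter(lambda v: v < 0, include=True),
                     ToyFilter(lambda v: v >= 100, include=True)]

name_ok = passes("food", [no_nanas])
name_rejected = passes("banana_count", [no_nanas])
value_ok = passes(-5, negative_or_large)
value_rejected = passes(50, negative_or_large)
```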
- class graphxplore.GraphDataScience.AttributeFilter(filter_value: str | int | float, filter_type: StringFilterType | NumericFilterType, include: bool)[source]
Bases: object
This class represents one of multiple filters applied to the attribute nodes, which are BaseNode objects. A node is valid if it matches the filter criteria.
- Parameters:
filter_value (str | int | float) – The value against which will be filtered
filter_type (StringFilterType | NumericFilterType) – The type of filter
include (bool) – If True, the filter will be used as whitelist, otherwise as blacklist filter
- class graphxplore.GraphDataScience.CompositionGraphPostFilter(min_prevalence: float = 0.01, min_prevalence_mode: GroupFilterMode = GroupFilterMode.All, min_cond_prevalence: float = 0.05, min_cond_prevalence_mode: GroupFilterMode = GroupFilterMode.All, max_missing: float = 0.2, max_missing_mode: GroupFilterMode = GroupFilterMode.All, perc_nof_nodes: float = 0.5, perc_nof_edges=0.25, node_comp_ratio: Tuple[float, float, float] = (0.2, 0.5, 0.3), edge_comp_ratio: Tuple[float, float, float] = (0.2, 0.5, 0.3), max_nof_nodes: int | None = None, max_nof_edges: int | None = None, include_conditional_decrease: bool = False)[source]
Bases: AttributeAssociationGraphPostFilter
This class filters an attribute association graph based on user-defined score composition ratios. For each score, the nodes or edges with the highest values will be selected and the graph will be filled with them according to the specified ratio. The node ratio can be built out of the following three metrics:
high maximum prevalence: These attributes appear often in the group with the highest prevalence, but are not necessarily selective for that specific group
high prevalence difference: These attributes appear more often in one group compared to another in absolute terms. Thus, these attributes have a sensitivity for that group. But they could still have some prevalence in another group, meaning their specificity could be low
high prevalence ratio: These attributes are specific for one group compared to another group. But all prevalences could be low, and as a result the sensitivity of the attribute could be low as well
The edge ratio can be built out of the following three metrics for a relation A->B:
high maximum conditional prevalence: Many members with attribute A also exhibit attribute B in the group with the highest conditional prevalence. However, B could just have a high prevalence itself and thus the added condition of A would have little influence
high maximum conditional increase: The added condition of A has a high sensitivity for the presence of B in at least one group. However, the prevalence of B could be high as well, meaning A would not be specific for B
high maximum conditional increase ratio: The added condition of A has a high specificity for the presence of B in at least one group. However, the conditional prevalence could be low, meaning A would not be sensitive for the presence of B
Additionally, a minimal prevalence and conditional prevalence, as well as maximum missing value ratio can be specified. Moreover, the number of nodes and edges in the filtered graph can be adjusted using a percentage of the unfiltered amount or an absolute value. If both the percentage and absolute value are specified, the smallest resulting number of nodes or edges will be taken.
NOTE: Since attribute association graphs with only one group have no prevalence difference and ratio metrics, only the nodes with the highest prevalence will be selected
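How the size limits and the composition ratio interact can be sketched as simple arithmetic. The helper `composition_budget` is hypothetical, and the rounding rule (round down, leftover slots go to the first metric) is an assumption; the library may round differently and has to deal with nodes that rank highly in more than one metric.

```python
import math


def composition_budget(nof_candidates, perc, comp_ratio, max_nof=None):
    """Number of slots per metric after applying percentage and cap."""
    if abs(sum(comp_ratio) - 1.0) > 1e-9:
        raise ValueError("composition ratios must sum to 1.0")
    # Small epsilon guards against floating-point results like 69.999...
    target = math.floor(nof_candidates * perc + 1e-9)
    if max_nof is not None:
        # If both limits are given, the smaller resulting number is taken
        target = min(target, max_nof)
    slots = [math.floor(target * share + 1e-9) for share in comp_ratio]
    slots[0] += target - sum(slots)  # assumed handling of leftover slots
    return slots


# 200 nodes passed the minimum prevalence and missing ratio checks
by_percentage = composition_budget(200, perc=0.5, comp_ratio=(0.1, 0.7, 0.2))
with_cap = composition_budget(200, perc=0.5, comp_ratio=(0.1, 0.7, 0.2),
                              max_nof=30)
```

In this sketch, keeping 50% of 200 nodes yields 100 slots split 10/70/20, while an additional cap of 30 nodes takes precedence and yields 3/21/6.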
- Parameters:
min_prevalence (float) – Nodes with a prevalence below this value will be removed, defaults to 1%
min_prevalence_mode (GroupFilterMode) – Specifies if only one or all groups must pass min_prevalence, defaults to GroupFilterMode.All
min_cond_prevalence (float) – Edges with a conditional prevalence below this value in all groups will be removed, defaults to 5%
min_cond_prevalence_mode (GroupFilterMode) – Specifies if only one or all groups must pass min_cond_prevalence, defaults to GroupFilterMode.All
max_missing (float) – Nodes with a missing ratio above this value will be removed, defaults to 20%
max_missing_mode (GroupFilterMode) – Specifies if only one or all groups must pass max_missing, defaults to GroupFilterMode.All
perc_nof_nodes (float) – Percentage of nodes that should remain of all the nodes passing min_prevalence and max_missing, defaults to 50%
perc_nof_edges – Percentage of edges that should remain of all edges passing min_cond_prevalence, defaults to 25%
node_comp_ratio (Tuple[float, float, float]) – The percentage of nodes with high maximal prevalence/difference/ratio after filtering. Ratios must sum to 1.0. Defaults to 20%/50%/30%
edge_comp_ratio (Tuple[float, float, float]) – The percentage of edges with high maximal conditional prevalence/maximal conditional increase/maximal conditional increase ratio after filtering. Ratios must sum to 1.0. Defaults to 20%/50%/30%
max_nof_nodes (int | None) – The maximum number of nodes that should exist in the graph after filtering, defaults to None (meaning no filtering applied with this threshold)
max_nof_edges (int | None) – The maximum number of edges that should exist in the graph after filtering, defaults to None (meaning no filtering applied with this threshold)
include_conditional_decrease (bool) – Specifies if negative absolute conditional increase and conditional increase ratio smaller than 1.0 should be identified as high conditional increase (and ratio) in the edge composition. Defaults to False
- filter_graph(graph: AttributeAssociationGraph) AttributeAssociationGraph[source]
Filters the given graph by its statistical traits
- Parameters:
graph (AttributeAssociationGraph) – The graph to filter
- Returns:
Returns the new, filtered graph
- Return type:
AttributeAssociationGraph
- class graphxplore.GraphDataScience.GroupFilterMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: str, Enum
Specifies if only one or all group metrics must pass the filter criteria.
- All = 'All'
- Any = 'Any'
- class graphxplore.GraphDataScience.GroupSelector(group_table: str, meta: ~graphxplore.MetaDataHandling.meta_data.MetaData, group_filter: ~graphxplore.DataMapping.Conditionals.logic_operators.LogicOperator = <graphxplore.DataMapping.Conditionals.logic_operators.AlwaysTrueOperator object>)[source]
Bases: object
This class generates Cypher statements to select a group of primary keys (e.g. patient IDs) from a Neo4j database based on a LogicOperator object. Variables from inverted foreign table chains can be aggregated, and variables from foreign table chains can be used for singular comparison. Negations, conjunctions and disjunctions can be used as well.
- Parameters:
group_table (str) – The name of the origin table for the group to select
meta (MetaData) – The metadata of the database
group_filter (LogicOperator) – The conditional describing the group, defaults to the tautology (all primary keys of group_table will be selected)
- class graphxplore.GraphDataScience.NumericFilterType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: str, Enum
The type of filter on attribute nodes with numeric value.
- Equals = '='
- Larger = '>'
- LargerOrEqual = '>='
- Smaller = '<'
- SmallerOrEqual = '<='
- UnequalTo = '<>'
- class graphxplore.GraphDataScience.OrThresholdFilterCascade(filters: Iterable[ThresholdFilter] | None = None)[source]
Bases: ThresholdFilterCascade
This class checks if at least one of its sub-filter criteria is fulfilled (disjunction).
- Parameters:
filters (Iterable[ThresholdFilter] | None) – The sub-filters
- is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) bool[source]
Checks the given filter criteria.
- Parameters:
obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
- Returns:
Returns True if the object passed the filter criteria, False otherwise
- Return type:
bool
- class graphxplore.GraphDataScience.StringFilterType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: str, Enum
The type of filter on attribute nodes with string value.
- Contains = 'contains'
- Equals = '='
- UnequalTo = '<>'
- class graphxplore.GraphDataScience.ThresholdFilter[source]
Bases: object
This class is an abstract parent for classes filtering AttributeAssociationNode and AttributeAssociationEdge objects by parameter thresholds.
- is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) bool[source]
Checks the given filter criteria.
- Parameters:
obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
- Returns:
Returns True if the object passed the filter criteria, False otherwise
- Return type:
bool
- class graphxplore.GraphDataScience.ThresholdFilterCascade(filters: Iterable[ThresholdFilter] | None = None)[source]
Bases: ThresholdFilter
This class contains a list of sub-filters (also ThresholdFilter objects) which are applied consecutively. It is either a conjunction or a disjunction.
- Parameters:
filters (Iterable[ThresholdFilter] | None) – The sub-filters
- add_filter(filter_to_add: ThresholdFilter) bool[source]
Add a filter to the cascade. The filter is only added, if it imposes a real constraint.
- Parameters:
filter_to_add (ThresholdFilter) – The filter that should be added to the cascade
- Returns:
Returns True if the filter imposes a constraint and was added, False otherwise
- Return type:
bool
- is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) bool[source]
Checks the given filter criteria.
- Parameters:
obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
- Returns:
Returns True if the object passed the filter criteria, False otherwise
- Return type:
bool
- class graphxplore.GraphDataScience.ThresholdGraphPostFilter(node_filter: ThresholdFilter | None = None, edge_filter: ThresholdFilter | None = None)[source]
Bases: AttributeAssociationGraphPostFilter
This class filters the nodes and edges of an AttributeAssociationGraph based on property thresholds. The thresholds can be arbitrarily combined by conjunctions and disjunctions.
- Parameters:
node_filter (ThresholdFilter | None) – The filter applied to nodes
edge_filter (ThresholdFilter | None) – The filter applied to edges
- filter_graph(graph: AttributeAssociationGraph) AttributeAssociationGraph[source]
Filters the given graph by its statistical traits
- Parameters:
graph (AttributeAssociationGraph) – The graph to filter
- Returns:
Returns the new, filtered graph
- Return type:
AttributeAssociationGraph
- class graphxplore.GraphDataScience.ThresholdParamFilter(param_to_filter: str, min_val: int | float | None = None, max_val: int | float | None = None, mode: GroupFilterMode | None = None)[source]
Bases: ThresholdFilter
This class filters one specific parameter of an AttributeAssociationNode or AttributeAssociationEdge object and checks if the parameter value lies in the interval [min_val;max_val]. If the property is group dependent, the filter mode must be specified.
- Parameters:
param_to_filter (str) – The parameter to filter by. Must be a statistical parameter of AttributeAssociationNode or AttributeAssociationEdge
min_val (int | float | None) – The lowest allowed property value to pass the filter, defaults to None
max_val (int | float | None) – The highest allowed property value to pass the filter, defaults to None
mode (GroupFilterMode | None) – The filter mode required for group-dependent parameters. Specifies, if all or only one group value must meet the filter thresholds. Defaults to None
- is_valid(obj_to_filter: AttributeAssociationNode | AttributeAssociationEdge) bool[source]
Checks the given filter criteria.
- Parameters:
obj_to_filter (AttributeAssociationNode | AttributeAssociationEdge) – The object to filter
- Returns:
Returns True if the object passed the filter criteria, False otherwise
- Return type:
bool
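The interplay of interval checks, group filter modes, and conjunction/disjunction cascades can be sketched without the library. All helper names, the plain dictionary standing in for a node, and its keys `prevalence` and `missing` are hypothetical; the actual parameter names are those documented for AttributeAssociationNode and AttributeAssociationEdge.

```python
def in_interval(value, min_val=None, max_val=None):
    """True if value lies in [min_val; max_val]; None means unbounded."""
    return ((min_val is None or value >= min_val)
            and (max_val is None or value <= max_val))


def group_param_valid(values_by_group, min_val=None, max_val=None, mode="All"):
    """Check a group-dependent parameter in 'All' or 'Any' mode."""
    checks = [in_interval(v, min_val, max_val)
              for v in values_by_group.values()]
    return all(checks) if mode == "All" else any(checks)


def conjunction(filters):
    return lambda obj: all(f(obj) for f in filters)


def disjunction(filters):
    return lambda obj: any(f(obj) for f in filters)


# Hypothetical node with per-group statistics
node = {"prevalence": {"Apple": 0.4, "NoApple": 0.05},
        "missing": {"Apple": 0.1, "NoApple": 0.3}}

prevalent = lambda n: group_param_valid(n["prevalence"], min_val=0.1, mode="Any")
complete = lambda n: group_param_valid(n["missing"], max_val=0.2, mode="All")

passes_and = conjunction([prevalent, complete])(node)
passes_or = disjunction([prevalent, complete])(node)
```

The node is prevalent enough in at least one group but exceeds the missing ratio limit in the other, so it fails the conjunction and passes the disjunction.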