graphxplore.Basis package

This subpackage contains all graph data structures, organized in the subpackages graphxplore.Basis.BaseGraph and graphxplore.Basis.AttributeAssociationGraph. The BaseUtils class provides utility functions such as file encoding detection and statistical score calculation. The GraphCSVReader, GraphCSVWriter and GraphDatabaseWriter classes handle IO for graph structures. Building a BaseGraph and writing it to a Neo4J database might look like the following (an AttributeAssociationGraph follows the same rationale):

>>> from graphxplore.Basis.BaseGraph import BaseGraph, BaseNode, BaseEdge, BaseLabels, BaseNodeType, BaseEdgeType
>>> from graphxplore.Basis import GraphDatabaseWriter
>>> base_graph = BaseGraph()
>>> pk_node = BaseNode(node_id=0, labels=BaseLabels(membership_labels=('FirstTable',),
...                    node_type=BaseNodeType.Key), name='first_primary_key', val=42)
>>> base_graph.nodes.append(pk_node)
>>> first_attr_node = BaseNode(node_id=1, labels=BaseLabels(membership_labels=('FirstTable',),
...                            node_type=BaseNodeType.Attribute), name='attribute', val='value')
>>> base_graph.nodes.append(first_attr_node)
>>> fk_node = BaseNode(node_id=2, labels=BaseLabels(membership_labels=('SecondTable',),
...                    node_type=BaseNodeType.Key), name='second_primary_key', val=1337)
>>> base_graph.nodes.append(fk_node)
>>> second_attr_node = BaseNode(node_id=3, labels=BaseLabels(membership_labels=('SecondTable',),
...                             node_type=BaseNodeType.Attribute), name='measurement', val=0.25)
>>> base_graph.nodes.append(second_attr_node)
>>> # row in table 'FirstTable' of primary key value 42 has value 'value' for variable 'attribute'
>>> base_graph.edges.append(BaseEdge(source=0, target=1, edge_type=BaseEdgeType.HAS_ATTR_VAL))
>>> # row in table 'SecondTable' of primary key value 1337 has value 0.25 for variable 'measurement'
>>> base_graph.edges.append(BaseEdge(source=2, target=3, edge_type=BaseEdgeType.HAS_ATTR_VAL))
>>> # row in table 'SecondTable' of primary key value 1337 is referenced as foreign key
>>> # in row in table 'FirstTable' of primary key value 42
>>> base_graph.edges.append(BaseEdge(source=2, target=0, edge_type=BaseEdgeType.CONNECTED_TO))
>>> # write graph to Neo4J database 'mygraph'
>>> GraphDatabaseWriter.write_graph(db_name='mygraph', graph=base_graph, overwrite=False,
...                                 address='bolt://localhost:7687', auth=('my_user', 'my_password'))

Submodules

Module contents

class graphxplore.Basis.BaseUtils

Bases: object

This class contains utility functions.

static calculate_mean(value_dist: Dict[int | float, int]) → float | None

Calculates the mean of a distribution given as a dictionary that maps each distribution value to its count. If the dictionary is empty, None is returned.

Parameters:

value_dist (Dict[int | float, int]) – The distribution

Returns:

Returns the mean or None for empty distributions

Return type:

float | None
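
The value-to-count representation can be illustrated with a minimal sketch. The function name mean_of_distribution is hypothetical and the code is a plain-Python stand-in written for illustration, not the library's implementation:

```python
from typing import Dict, Optional, Union

Number = Union[int, float]


def mean_of_distribution(value_dist: Dict[Number, int]) -> Optional[float]:
    """Weighted mean of a value -> count mapping (hypothetical helper)."""
    total = sum(value_dist.values())
    if total == 0:
        # No observations at all: mirror the documented None result
        return None
    return sum(value * count for value, count in value_dist.items()) / total


# The value 1 was observed twice, the value 4 once: (1 + 1 + 4) / 3
example_mean = mean_of_distribution({1: 2, 4: 1})
print(example_mean)
```

Storing counts instead of raw observations keeps the dictionary small when a variable takes only a few distinct values.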

static calculate_median(value_dist: Dict[int | float, int]) → float | None

Calculates the median of a distribution given as a dictionary that maps each distribution value to its count. If the dictionary is empty, None is returned.

Parameters:

value_dist (Dict[int | float, int]) – The distribution

Returns:

Returns the median or None for empty distributions

Return type:

float | None
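
A median can be read off such a dictionary without expanding it into a list of raw observations. The sketch below uses a hypothetical helper median_of_distribution and assumes that, for an even number of observations, the two middle values are averaged; the library may use a different convention:

```python
from typing import Dict, Optional, Union

Number = Union[int, float]


def median_of_distribution(value_dist: Dict[Number, int]) -> Optional[float]:
    """Median of a value -> count mapping (hypothetical helper)."""
    total = sum(value_dist.values())
    if total == 0:
        return None
    # Zero-based ranks of the middle observation(s); equal when total is odd
    lower_rank = (total - 1) // 2
    upper_rank = total // 2
    lower = upper = None
    seen = 0
    for value, count in sorted(value_dist.items()):
        seen += count
        if lower is None and seen > lower_rank:
            lower = value
        if seen > upper_rank:
            upper = value
            break
    # Assumption: average the two middle values for an even total
    return (lower + upper) / 2


print(median_of_distribution({1: 1, 2: 1, 3: 1}))
print(median_of_distribution({1: 2, 3: 2}))
```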

static calculate_median_quartiles(value_dist: Dict[int | float, int]) → Tuple[float, float | None, float | None] | None

Calculates the median and quartiles of a distribution given as a dictionary that maps each distribution value to its count. If the dictionary is empty, None is returned. If the accumulated counts are less than four, the quartiles are returned as None.

Parameters:

value_dist (Dict[int | float, int]) – The distribution

Returns:

Returns the median, first quartile and third quartile, or None for empty distributions

Return type:

Tuple[float, float | None, float | None] | None

static calculate_min_max(value_dist: Dict[int | float, int]) → Tuple[float, float] | None

Calculates the minimal and maximal value of a distribution given as a dictionary that maps each distribution value to its count. If the dictionary is empty, None is returned.

Parameters:

value_dist (Dict[int | float, int]) – The distribution

Returns:

Returns the minimum and maximum or None for empty distributions

Return type:

Tuple[float, float] | None

static calculate_quartile_quintile_sorted_dist(sorted_dist: Sequence[Tuple[int | float, int]], use_quartile: bool, quantile_id: int) → float | None

Calculates quartiles or quintiles from a sorted distribution. If the distribution is empty, None is returned.

Parameters:
  • sorted_dist (Sequence[Tuple[int | float, int]]) – The distribution with pairs of values and counts sorted in ascending value order

  • use_quartile (bool) – If True, quartiles are calculated, else quintiles

  • quantile_id (int) – The identifier for the quantile. Must be 1, 2, or 3 for quartiles, or 1, 2, 3, or 4 for quintiles

Returns:

Returns the quartile or quintile, or None for empty distributions

Return type:

float | None

static calculate_std(value_dist: Dict[int | float, int], mean: float | None = None) → float | None

Calculates the standard deviation of a distribution given as a dictionary that maps each distribution value to its count. If the dictionary is empty, None is returned. A precalculated mean can be specified to speed up the calculation.

Parameters:
  • value_dist (Dict[int | float, int]) – The distribution

  • mean (float | None) – The precalculated mean, defaults to None.

Returns:

Returns the standard deviation or None for empty distributions

Return type:

float | None
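
The role of the optional precalculated mean can be sketched as follows. The helper std_of_distribution is hypothetical, and the sketch assumes the population standard deviation (division by the total count); the source does not state whether the library divides by n or by n - 1:

```python
import math
from typing import Dict, Optional, Union

Number = Union[int, float]


def std_of_distribution(value_dist: Dict[Number, int],
                        mean: Optional[float] = None) -> Optional[float]:
    """Standard deviation of a value -> count mapping (hypothetical helper)."""
    total = sum(value_dist.values())
    if total == 0:
        return None
    if mean is None:
        # Only computed when the caller did not pass a precalculated mean
        mean = sum(value * count for value, count in value_dist.items()) / total
    # Assumption: population variance, i.e. division by the total count
    variance = sum(count * (value - mean) ** 2
                   for value, count in value_dist.items()) / total
    return math.sqrt(variance)


print(std_of_distribution({2: 1, 4: 1}))
print(std_of_distribution({2: 1, 4: 1}, mean=3.0))
```

Passing the mean skips one pass over the dictionary, which is the speed-up the description refers to.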

static check_csv_row(row: Dict[str, str], required_data: Dict[str, Any])

Checks if all required fields are present in the CSV row and have the correct data type.

Parameters:
  • row (Dict[str, str]) – The CSV row to check

  • required_data (Dict[str, Any]) – A dictionary of required field names and entry data types

static combine_group_info(groups: List[str], group_size: Dict[str, int], pos_group: str | None, neg_group: str | None) → List[str]
Parameters:
  • groups (List[str])

  • group_size (Dict[str, int])

  • pos_group (str | None)

  • neg_group (str | None)

Return type:

List[str]

static count_lines_in_file(file_path: str) → int

Counts the lines in a text file.

Parameters:

file_path (str) – The path to the text file

Returns:

Returns the number of lines

Return type:

int

static csv_row_string_to_list(row: Dict[str, str], row_key: str) → List[int] | List[float] | List[str]

Retrieves the value string for the key row_key from a CSV row. The value string should contain semicolons separating the individual property values. The string value is split and each entry is cast to string, integer or float.

Parameters:
  • row (Dict[str, str]) – The CSV row as dictionary

  • row_key (str) – The key for the property in row

Returns:

Returns the list of cast values

Return type:

List[int] | List[float] | List[str]
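
The split-and-cast step can be sketched in plain Python. The helper split_and_cast is hypothetical, and the order in which the casts are tried (integer, then float, then string) is an assumption chosen so that the whole list ends up with one uniform type, matching the documented return type:

```python
from typing import Dict, List, Union


def split_and_cast(row: Dict[str, str],
                   row_key: str) -> Union[List[int], List[float], List[str]]:
    """Split a semicolon-separated cell and cast all entries to one type
    (hypothetical helper)."""
    parts = row[row_key].split(";")
    # Assumption: try the narrowest type first and fall back step by step
    for caster in (int, float):
        try:
            return [caster(part) for part in parts]
        except ValueError:
            continue
    return parts


print(split_and_cast({"values": "1;2;3"}, "values"))
print(split_and_cast({"values": "1;2.5"}, "values"))
print(split_and_cast({"values": "low;high"}, "values"))
```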

static detect_file_encoding(file_path: str) → str

Reads the first 100k bytes from a file and guesses its encoding, e.g. ASCII or UTF-8. The result can afterwards be used with open(file_path, 'r', encoding=encoding). Uses the library chardet.

Parameters:

file_path (str) – The path to the file

Returns:

Returns the guessed encoding

Return type:

str

static extract_group_info_from_list(group_str_list: List[str]) → Tuple[List[str], Dict[str, int], str | None, str | None]

Extracts group names, their sizes and optionally the positive and negative group from a list of strings in the format “<group_name> (<group_size>)<[+] or [-] or blank>”.

Parameters:

group_str_list (List[str]) – The string list containing all group data in the specified format

Returns:

Returns all extracted data as a list of group names, dict of group sizes and positive and negative group if specified (or None)

Return type:

Tuple[List[str], Dict[str, int], str | None, str | None]

static extract_group_info_from_str(group_str: str) → Tuple[List[str], Dict[str, int], str | None, str | None]

Extracts group names, their sizes and optionally the positive and negative group from a string in the format “<group_name> (<group_size>)<[+] or [-] or blank>;<group_name> (<group_size>)<[+] or [-] or blank>;…”.

Parameters:

group_str (str) – The string containing all group data in the specified format

Returns:

Returns all extracted data as a list of group names, dict of group sizes and positive and negative group if specified (or None)

Return type:

Tuple[List[str], Dict[str, int], str | None, str | None]
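
The group string format can be made concrete with a small parser sketch. The helper parse_group_string and its regular expression are hypothetical and only reflect one reading of the format described above (a single space before the parenthesized size, and an optional [+] or [-] marker directly after it):

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed entry layout: "<group_name> (<group_size>)" plus optional "[+]" or "[-]"
GROUP_PATTERN = re.compile(
    r"^(?P<name>.+) \((?P<size>\d+)\)(?P<marker>\[\+\]|\[-\])?$")


def parse_group_string(
        group_str: str) -> Tuple[List[str], Dict[str, int], Optional[str], Optional[str]]:
    """Parse a semicolon-separated group string (hypothetical helper)."""
    groups: List[str] = []
    sizes: Dict[str, int] = {}
    pos_group = neg_group = None
    for entry in group_str.split(";"):
        match = GROUP_PATTERN.match(entry.strip())
        if match is None:
            raise ValueError(f"Entry {entry!r} does not match the expected format")
        name = match["name"]
        groups.append(name)
        sizes[name] = int(match["size"])
        if match["marker"] == "[+]":
            pos_group = name
        elif match["marker"] == "[-]":
            neg_group = name
    return groups, sizes, pos_group, neg_group


parsed = parse_group_string("treated (12)[+];control (30)[-];other (5)")
print(parsed)
```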

static file_has_more_lines(file_path: str, threshold: int) → bool

Checks if a text file has more than threshold lines.

Parameters:
  • file_path (str) – The path to the text file

  • threshold (int) – The threshold to be checked

Returns:

Returns True if the file contains more lines

Return type:

bool
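
A threshold check of this kind does not need to count every line. The sketch below, with the hypothetical helper has_more_lines, stops reading as soon as the threshold is exceeded; whether the library applies the same early exit is an assumption:

```python
import os
import tempfile


def has_more_lines(file_path: str, threshold: int) -> bool:
    """Return True once more than `threshold` lines were read (hypothetical helper)."""
    with open(file_path, "r", encoding="utf-8") as handle:
        for line_number, _ in enumerate(handle, start=1):
            if line_number > threshold:
                # Early exit: the rest of the file is never read
                return True
    return False


# Demonstration on a temporary file with exactly three lines
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("a\nb\nc\n")
three_line_file = tmp.name

more_than_two = has_more_lines(three_line_file, 2)
more_than_three = has_more_lines(three_line_file, 3)
os.remove(three_line_file)
print(more_than_two, more_than_three)
```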

static load_csv_data(file_or_dir_path: str, delimiter: str | None = None, file_encoding: str | None = None) → Dict[str, List[Dict[str, str]]]

Loads table data from one CSV file or from all CSV files contained in a directory.

Parameters:
  • file_or_dir_path (str) – Path to a directory or a single CSV file

  • delimiter (str | None) – CSV delimiter used for all files, inferred automatically if None is specified

  • file_encoding (str | None) – File encoding used for all files, inferred automatically if None is specified

Returns:

Returns a dict with the filename without ‘.csv’ extension as key and list of row dicts as table data

Return type:

Dict[str, List[Dict[str, str]]]
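
The shape of the returned dictionary can be illustrated with the standard csv module. The helper load_tables is a hypothetical stand-in; unlike the documented function it does not infer delimiter or encoding and instead assumes a comma and UTF-8:

```python
import csv
import os
import tempfile
from typing import Dict, List


def load_tables(path: str, delimiter: str = ",",
                encoding: str = "utf-8") -> Dict[str, List[Dict[str, str]]]:
    """Read one CSV file, or every .csv file of a directory, into row dicts
    (hypothetical helper)."""
    if os.path.isdir(path):
        csv_paths = [os.path.join(path, name)
                     for name in sorted(os.listdir(path))
                     if name.endswith(".csv")]
    else:
        csv_paths = [path]
    tables: Dict[str, List[Dict[str, str]]] = {}
    for csv_path in csv_paths:
        # Table name is the file name without its extension
        table_name = os.path.splitext(os.path.basename(csv_path))[0]
        with open(csv_path, "r", newline="", encoding=encoding) as handle:
            tables[table_name] = list(csv.DictReader(handle, delimiter=delimiter))
    return tables


# Demonstration with a temporary directory containing one small table
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "patients.csv"), "w",
              newline="", encoding="utf-8") as handle:
        handle.write("id,age\n1,42\n")
    tables = load_tables(tmp_dir)

print(tables)
```

Note that all cell values stay strings, which matches the Dict[str, str] row type in the signature.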

class graphxplore.Basis.Graph(graph_type: GraphType, nodes: list | None = None, edges: list | None = None)

Bases: object

This is the parent class of all types of graphs. It is a data holder of nodes and edges.

Parameters:
  • nodes (list | None) – The list of nodes

  • edges (list | None) – The list of edges

  • graph_type (GraphType) – The type of graph

class graphxplore.Basis.GraphCSVIODevice(graph_dir: str, graph_type: GraphType)

Bases: object

This is a parent class for reading and writing CSV files containing generated Graph objects.

Parameters:
  • graph_dir (str) – The directory to which the graph is written or from which it is read

  • graph_type (GraphType) – The type of Graph.

class graphxplore.Basis.GraphCSVReader(graph_dir: str, graph_type: GraphType)

Bases: GraphCSVIODevice

This class reads Graph objects from CSV files.

Parameters:
  • graph_dir (str) – The directory containing the CSV files that will be read

  • graph_type (GraphType) – The type of Graph.

read_graph() → Graph

Reads a graph from the specified source directory.

Returns:

Returns the read graph

Return type:

Graph

class graphxplore.Basis.GraphCSVWriter(graph_dir: str, graph_type: GraphType)

Bases: GraphCSVIODevice

This class writes individual nodes and edges, or a whole Graph object, to a target directory in the form of CSV files.

Parameters:
  • graph_dir (str) – The directory the CSV files are written to

  • graph_type (GraphType) – The type of Graph.

write_edge(edge: BaseEdge | AttributeAssociationEdge) → None

Writes a single edge to a CSV file.

Parameters:

edge (BaseEdge | AttributeAssociationEdge) – The edge to write

Return type:

None

static write_graph(graph_dir: str, graph: Graph) → None

Writes a whole graph to a specified target directory in the form of CSV files.

Parameters:
  • graph_dir (str) – The directory the CSV files are written to

  • graph (Graph) – The graph that will be written

Return type:

None

write_node(node: BaseNode | AttributeAssociationNode) → None

Writes a single node to a CSV file based on its datatype.

Parameters:

node (BaseNode | AttributeAssociationNode) – The node to write

Return type:

None

class graphxplore.Basis.GraphDatabaseUtils

Bases: object

static check_graph_type_of_db(db_name: str, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → GraphType

Retrieves the GraphType of a given Neo4J database by inspecting all labels found in the database and checking for BaseNodeType.Key, DistinctionLabel and FrequencyLabel. Raises an exception if the connection could not be established, the database does not exist in the DBMS, or the type of database is not recognized.

Parameters:
  • db_name (str) – The database name

  • address (str) – The address of the Neo4J DBMS

  • auth (Tuple[str, str]) – username and password to access the Neo4j DBMS

Returns:

Returns the type of graph

Return type:

GraphType

static database_contains_labels(db_name: str, labels: Iterable[str], address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → bool

Checks if the nodes of a given database contain all labels specified in labels

Parameters:
  • db_name (str) – The name of the database

  • labels (Iterable[str]) – The list of labels that should be contained

  • address (str) – The address of the Neo4J DBMS

  • auth (Tuple[str, str]) – username and password to access the Neo4j DBMS

Returns:

Returns True, if all labels are contained

Return type:

bool

static execute_query(query: str, database: str, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → List[Dict[str, Any]]

Execute a single Cypher query and retrieve the results. Raises an exception if the query fails

Parameters:
  • query (str) – The Cypher query

  • database (str) – The Neo4J database to query

  • address (str) – The address of the Neo4J DBMS

  • auth (Tuple[str, str]) – The authentication of the Neo4J DBMS

Returns:

Returns a list of dictionaries, one for each returned record

Return type:

List[Dict[str, Any]]

static get_edge_write_cypher_statement(edge: BaseEdge | AttributeAssociationEdge, node_id_mapping: Dict[int, int], separate_params: bool = False) → str | Tuple[str, Dict[str, Any]]

Generates a Cypher statement to insert a single edge into a Neo4J database given its edge type and parameters. Additionally, the incident nodes are matched prior to the merge by their internal database IDs.

Parameters:
  • edge (BaseEdge | AttributeAssociationEdge) – The edge to insert

  • node_id_mapping (Dict[int, int]) – A dictionary containing pairs of graphxplore node ID and associated internal node ID of the Neo4J database

  • separate_params (bool) – If True, a separate dict of parameter/values is generated and parameters are added as variables with a preceding $ character

Returns:

Returns the Cypher statement with or without parameter/value dictionary

Return type:

str | Tuple[str, Dict[str, Any]]

static get_existing_databases(address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → List[str]

Retrieves the names of all databases existing in a Neo4J DBMS (except the “system” database). Raises an exception if the connection could not be established. Note that existing databases are listed even if they are offline.

Parameters:
  • address (str) – The address of the Neo4J DBMS

  • auth (Tuple[str, str]) – username and password to access the Neo4j DBMS

Returns:

Returns a list of all database names

Return type:

List[str]

static get_neo4j_address(host: str = 'localhost', port: int = 7687, protocol: str = 'bolt') → str

Generates the address of a Neo4J DBMS with the given host, port and protocol.

Parameters:
  • host (str) – The host name where the Neo4J DBMS is running

  • port (int) – The port for the Neo4J Bolt protocol

  • protocol (str) – The protocol of the connection

Returns:

Returns the address as string

Return type:

str
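
The address strings used as defaults throughout this class, such as 'bolt://localhost:7687', follow the pattern protocol://host:port. A minimal sketch of that composition, with the hypothetical helper build_neo4j_address standing in for the documented function:

```python
def build_neo4j_address(host: str = "localhost", port: int = 7687,
                        protocol: str = "bolt") -> str:
    """Compose a DBMS address of the form protocol://host:port (hypothetical helper)."""
    return f"{protocol}://{host}:{port}"


default_address = build_neo4j_address()
print(default_address)
print(build_neo4j_address(host="db.example.org", port=7688))
```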

static get_node_write_cypher_statement(node: BaseNode | AttributeAssociationNode, separate_params: bool = False, use_create: bool = True) → str | Tuple[str, Dict[str, Any]]

Generates a Cypher CREATE or MERGE statement to insert a single node into a Neo4J database

Parameters:
  • node (BaseNode | AttributeAssociationNode) – The node to insert

  • use_create (bool) – If True, a CREATE statement is generated, else a MERGE statement

  • separate_params (bool) – If True, a separate dict of parameter/values is generated and parameters are added as variables with a preceding $ character

Returns:

Returns the Cypher statement with or without parameter/value dictionary

Return type:

str | Tuple[str, Dict[str, Any]]
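
The difference between an inline statement and a statement with separate parameters can be sketched generically. Everything in the block below is hypothetical: the helper node_create_statement, the labels and the property names are chosen for illustration and do not reflect the statements or node properties that graphxplore actually generates:

```python
import json
from typing import Any, Dict, List, Tuple, Union


def node_create_statement(
        labels: List[str], properties: Dict[str, Any],
        separate_params: bool = False) -> Union[str, Tuple[str, Dict[str, Any]]]:
    """Build a Cypher CREATE statement for one node (hypothetical helper)."""
    label_str = "".join(f":{label}" for label in labels)
    if separate_params:
        # Values are referenced as $name placeholders and returned separately
        placeholders = ", ".join(f"{key}: ${key}" for key in properties)
        return f"CREATE (n{label_str} {{{placeholders}}})", dict(properties)
    # Values are rendered inline; json.dumps quotes and escapes strings
    inline = ", ".join(f"{key}: {json.dumps(value)}"
                       for key, value in properties.items())
    return f"CREATE (n{label_str} {{{inline}}})"


example_props = {"name": "first_primary_key", "value": 42}
inline_statement = node_create_statement(["FirstTable", "Key"], example_props)
param_statement, params = node_create_statement(
    ["FirstTable", "Key"], example_props, separate_params=True)
print(inline_statement)
print(param_statement, params)
```

Keeping values in a separate dictionary lets the database driver handle quoting and allows the same statement text to be reused for many nodes.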

static get_nof_edges_in_database(db_name: str, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → int

Returns the number of edges stored in a Neo4J database

Parameters:
  • db_name (str) – The name of the database

  • address (str) – The address of the Neo4J DBMS

  • auth (Tuple[str, str]) – username and password to access the Neo4j DBMS

Returns:

Returns the number of edges

Return type:

int

static test_connection(address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → None

Tests if a connection to a Neo4J DBMS is possible with the given host, bolt (Neo4J protocol) port and credentials. Raises an exception if the connection could not be established

Parameters:
  • address (str) – The address of the Neo4J DBMS

  • auth (Tuple[str, str]) – username and password to access the Neo4j DBMS

Return type:

None

class graphxplore.Basis.GraphDatabaseWriter(graph_type: GraphType, db_name: str, overwrite: bool = False, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', ''))

Bases: object

This class writes individual nodes and edges, or a whole Graph object, to a Neo4J database. WARNING: This class is not suited for very large graphs, since all nodes and edges are held in memory and the neo4j Python interface is not designed for large bulk imports. For very large graphs, write the graph to CSV with GraphCSVWriter and then import the CSV files using the “neo4j-admin import” tool.

Parameters:
  • db_name (str) – The name of the database the data is written to

  • overwrite (bool) – If True, the database db_name is overwritten if it already exists

  • address (str) – The address of the Neo4J DBMS

  • auth (Tuple[str, str]) – username and password to access the Neo4j DBMS

  • graph_type (GraphType)

write_edge(edge: BaseEdge | AttributeAssociationEdge) → None

Stores a single edge for insertion into the Neo4J database. It will be cached and later written to the database

Parameters:

edge (BaseEdge | AttributeAssociationEdge) – The edge to write

Return type:

None

static write_graph(db_name: str, graph: Graph, overwrite: bool = False, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → None

Writes a Graph object to a Neo4J database.

Parameters:
  • db_name (str) – The name of the database the graph is written to

  • graph (Graph) – The graph to write

  • overwrite (bool) – If True, the database db_name is overwritten if it already exists

  • address (str) – The address of the Neo4J DBMS

  • auth (Tuple[str, str]) – username and password to access the Neo4j DBMS

Return type:

None

write_node(node: BaseNode | AttributeAssociationNode) → None

Stores a single node for insertion into the Neo4J database. It will be cached and later written to the database

Parameters:

node (BaseNode | AttributeAssociationNode) – The node to write

Return type:

None

class graphxplore.Basis.GraphOutputType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: str, Enum

The type of output format for a graph

CSV = 'CSV'
Database = 'Database'

class graphxplore.Basis.GraphType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: str, Enum

The type of graph.

AttributeAssociation = 'AttributeAssociation'
Base = 'Base'

class graphxplore.Basis.RelationalDataIODevice(data_location: str | Dict[str, List[Dict[str, str]]], table: str, write: bool = False, header: List[str] | None = None, file_encoding: str | None = None, delimiter: str | None = None)

Bases: object

This class reads and writes relational table data, either from/to a directory of CSV files or from/to a Python dict.

Parameters:
  • data_location (str | Dict[str, List[Dict[str, str]]]) – A directory path, or a dictionary of table name (without .csv extension) and list of table row dicts

  • table (str) – The current table to consider. Either table + ‘.csv’ must exist in the specified directory path, or table must be a key in the dict

  • write (bool) – True for write access, False for read access

  • header (List[str] | None) – The header to write in the csv. Can be omitted, if write is False

  • file_encoding (str | None) – The file encoding of the CSV file to read. Can be omitted, if write is True, or data dict is specified

  • delimiter (str | None) – The delimiter of the CSV file to read. Can be omitted, if write is True, or data dict is specified

static check_data_location(data_location: str | Dict[str, List[Dict[str, str]]], write: bool = False)

Checks that the data location exists as a path if it is a string, or that at least one table is present in the data dict if write is False.

Parameters:
  • data_location (str | Dict[str, List[Dict[str, str]]]) – A directory path, or a dictionary of table name (without .csv extension) and list of table row dicts

  • write (bool) – True for write access, False for read access

static get_available_table_names(data_source: str | Dict[str, List[Dict[str, str]]]) → List[str]

Retrieves all table names (without .csv extension) from a directory path, or all keys from a data dictionary

Parameters:

data_source (str | Dict[str, List[Dict[str, str]]]) – A directory path, or a dictionary of table name (without .csv extension) and list of table row dicts

Returns:

Returns the found table names as a list of strings

Return type:

List[str]
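
The two accepted forms of a data source, a directory path or an in-memory dictionary, can be illustrated with a short sketch. The helper available_table_names is hypothetical and only mirrors the behavior described above; the table contents are made up for the example:

```python
import os
from typing import Dict, List, Union

TableData = Dict[str, List[Dict[str, str]]]


def available_table_names(data_source: Union[str, TableData]) -> List[str]:
    """List table names of a directory or data dict (hypothetical helper)."""
    if isinstance(data_source, dict):
        # In-memory form: every key is a table name
        return list(data_source.keys())
    # Directory form: every .csv file is a table, named without its extension
    return sorted(os.path.splitext(name)[0]
                  for name in os.listdir(data_source)
                  if name.endswith(".csv"))


# In-memory data: table name -> list of rows, each row maps variable -> value
in_memory_data = {
    "patients": [{"id": "1", "age": "42"}],
    "visits": [{"visit_id": "10", "patient_id": "1"}],
}
table_names = available_table_names(in_memory_data)
print(table_names)
```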

get_header() → List[str]

Get the currently used header of the table

Returns:

Returns the header as list of strings

Return type:

List[str]

writerow(row: Dict[str, str | int | float | None])

Write a single table row to the output

Parameters:

row (Dict[str, str | int | float | None]) – The data row as a dict of variable name and value