graphxplore.Basis package
This subpackage contains all graph data structures in the subpackages graphxplore.Basis.BaseGraph and graphxplore.Basis.AttributeAssociationGraph. Additionally,
several utility functions, such as file format detection and statistical score calculation, are provided by
the BaseUtils class. The GraphCSVReader,
GraphCSVWriter and GraphDatabaseWriter classes handle
IO of graph structures. Building a BaseGraph and writing it to a Neo4J
database (AttributeAssociationGraph follows the same rationale)
might look like this:
>>> from graphxplore.Basis.BaseGraph import BaseGraph, BaseNode, BaseEdge, BaseLabels, BaseNodeType, BaseEdgeType
>>> from graphxplore.Basis import GraphDatabaseWriter
>>> base_graph = BaseGraph()
>>> pk_node = BaseNode(node_id=0, labels=BaseLabels(membership_labels=('FirstTable',),
...                    node_type=BaseNodeType.Key), name='first_primary_key', val=42)
>>> base_graph.nodes.append(pk_node)
>>> first_attr_node = BaseNode(node_id=1, labels=BaseLabels(membership_labels=('FirstTable',),
...                            node_type=BaseNodeType.Attribute), name='attribute', val='value')
>>> base_graph.nodes.append(first_attr_node)
>>> fk_node = BaseNode(node_id=2, labels=BaseLabels(membership_labels=('SecondTable',),
...                    node_type=BaseNodeType.Key), name='second_primary_key', val=1337)
>>> base_graph.nodes.append(fk_node)
>>> second_attr_node = BaseNode(node_id=3, labels=BaseLabels(membership_labels=('SecondTable',),
...                             node_type=BaseNodeType.Attribute), name='measurement', val=0.25)
>>> base_graph.nodes.append(second_attr_node)
>>> # row in table 'FirstTable' of primary key value 42 has value 'value' for variable 'attribute'
>>> base_graph.edges.append(BaseEdge(source=0, target=1, edge_type=BaseEdgeType.HAS_ATTR_VAL))
>>> # row in table 'SecondTable' of primary key value 1337 has value 0.25 for variable 'measurement'
>>> base_graph.edges.append(BaseEdge(source=2, target=3, edge_type=BaseEdgeType.HAS_ATTR_VAL))
>>> # row in table 'SecondTable' of primary key value 1337 is referenced as foreign key
>>> # in row in table 'FirstTable' of primary key value 42
>>> base_graph.edges.append(BaseEdge(source=2, target=0, edge_type=BaseEdgeType.CONNECTED_TO))
>>> # write graph to Neo4J database 'mygraph'
>>> GraphDatabaseWriter.write_graph(db_name='mygraph', graph=base_graph, overwrite=False,
...                                 address='bolt://localhost:7687', auth=('my_user', 'my_password'))
Submodules
Module contents
- class graphxplore.Basis.BaseUtils
Bases: object
This class contains utility functions.
- static calculate_mean(value_dist: Dict[int | float, int]) → float | None
Calculates the mean of a distribution dictionary that maps distribution values (keys) to their counts (values). If the dictionary is empty, None is returned.
- Parameters:
value_dist (Dict[int | float, int]) – The distribution
- Returns:
Returns the mean or None for empty distributions
- Return type:
float | None
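The count-weighted mean described above can be sketched in plain Python. This is an illustrative reimplementation, not the graphxplore source; the helper name `mean_from_counts` is hypothetical.

```python
from typing import Dict, Optional, Union

Number = Union[int, float]


def mean_from_counts(value_dist: Dict[Number, int]) -> Optional[float]:
    """Weighted mean of a {value: count} distribution; None if it holds no observations."""
    total = sum(value_dist.values())
    if total == 0:
        return None
    # Each value contributes once per observation, i.e. value * count.
    return sum(value * count for value, count in value_dist.items()) / total


print(mean_from_counts({1: 2, 4: 1}))  # (1*2 + 4*1) / 3 = 2.0
```

Working on counts avoids expanding the distribution into a full list of observations.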
- static calculate_median(value_dist: Dict[int | float, int]) → float | None
Calculates the median of a distribution dictionary that maps distribution values (keys) to their counts (values). If the dictionary is empty, None is returned.
- Parameters:
value_dist (Dict[int | float, int]) – The distribution
- Returns:
Returns the median or None for empty distributions
- Return type:
float | None
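A median can be read off a count dictionary without expanding it, by walking the sorted values and accumulating counts. The sketch below is a stand-alone illustration with a hypothetical helper name; averaging the two middle observations for an even total is an assumption, since the tie-breaking rule used by the library is not stated here.

```python
from typing import Dict, Optional, Union

Number = Union[int, float]


def median_from_counts(value_dist: Dict[Number, int]) -> Optional[float]:
    """Median of a {value: count} distribution; None if it holds no observations."""
    total = sum(value_dist.values())
    if total == 0:
        return None
    # 0-based positions of the middle observation(s); identical for odd totals.
    lower_idx, upper_idx = (total - 1) // 2, total // 2
    lower = upper = None
    seen = 0
    for value, count in sorted(value_dist.items()):
        seen += count
        if lower is None and seen > lower_idx:
            lower = value
        if seen > upper_idx:
            upper = value
            break
    return (lower + upper) / 2


print(median_from_counts({1: 1, 2: 1, 10: 2}))  # observations 1, 2, 10, 10 -> 6.0
```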
- static calculate_median_quartiles(value_dist: Dict[int | float, int]) → Tuple[float, float | None, float | None] | None
Calculates the median and quartiles of a distribution dictionary that maps distribution values (keys) to their counts (values). If the dictionary is empty, None is returned. If the accumulated counts are less than four, the quartiles are returned as None.
- Parameters:
value_dist (Dict[int | float, int]) – The distribution
- Returns:
Returns the median, first quartile and third quartile, or None for empty distributions
- Return type:
Tuple[float, float | None, float | None] | None
- static calculate_min_max(value_dist: Dict[int | float, int]) → Tuple[float, float] | None
Calculates the minimal and maximal value of a distribution dictionary that maps distribution values (keys) to their counts (values). If the dictionary is empty, None is returned.
- Parameters:
value_dist (Dict[int | float, int]) – The distribution
- Returns:
Returns the minimum and maximum or None for empty distributions
- Return type:
Tuple[float, float] | None
- static calculate_quartile_quintile_sorted_dist(sorted_dist: Sequence[Tuple[int | float, int]], use_quartile: bool, quantile_id: int) → float | None
Calculates quartiles or quintiles from a sorted distribution. If the distribution is empty, None is returned.
- Parameters:
sorted_dist (Sequence[Tuple[int | float, int]]) – The distribution with pairs of values and counts sorted in ascending value order
use_quartile (bool) – If True, quartiles are calculated, else quintiles
quantile_id (int) – The identifier of the quantile. Must be 1, 2 or 3 for quartiles, or 1, 2, 3 or 4 for quintiles
- Returns:
Returns the requested quartile or quintile, or None for empty distributions
- Return type:
float | None
- static calculate_std(value_dist: Dict[int | float, int], mean: float | None = None) → float | None
Calculates the standard deviation of a distribution dictionary that maps distribution values (keys) to their counts (values). If the dictionary is empty, None is returned. A precalculated mean can be specified to speed up the calculation.
- Parameters:
value_dist (Dict[int | float, int]) – The distribution
mean (float | None) – The precalculated mean, defaults to None.
- Returns:
Returns the standard deviation or None for empty distributions
- Return type:
float | None
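The role of the optional precalculated mean can be illustrated with a small stand-alone sketch. The helper name `std_from_counts` is hypothetical, and the use of the population formula (division by N rather than N - 1) is an assumption, since the convention used by the library is not stated here.

```python
import math
from typing import Dict, Optional, Union

Number = Union[int, float]


def std_from_counts(value_dist: Dict[Number, int],
                    mean: Optional[float] = None) -> Optional[float]:
    """Population standard deviation of a {value: count} distribution."""
    total = sum(value_dist.values())
    if total == 0:
        return None
    if mean is None:
        # Skipped when the caller already knows the mean.
        mean = sum(value * count for value, count in value_dist.items()) / total
    variance = sum(count * (value - mean) ** 2
                   for value, count in value_dist.items()) / total
    return math.sqrt(variance)


print(std_from_counts({2: 1, 4: 1}))            # 1.0
print(std_from_counts({2: 1, 4: 1}, mean=3.0))  # 1.0, without recomputing the mean
```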
- static check_csv_row(row: Dict[str, str], required_data: Dict[str, Any])
Checks if all required fields are present in the CSV row and have the correct data type.
- Parameters:
row (Dict[str, str]) – The CSV row to check
required_data (Dict[str, Any]) – A dictionary of required field names and entry data types
- static combine_group_info(groups: List[str], group_size: Dict[str, int], pos_group: str | None, neg_group: str | None) → List[str]
- Parameters:
groups (List[str])
group_size (Dict[str, int])
pos_group (str | None)
neg_group (str | None)
- Return type:
List[str]
- static count_lines_in_file(file_path: str) → int
Counts the lines in a text file.
- Parameters:
file_path (str) – The path to the text file
- Returns:
Returns the number of lines
- Return type:
int
- static csv_row_string_to_list(row: Dict[str, str], row_key: str) → List[int] | List[float] | List[str]
Retrieves the value string for the key row_key from a CSV row. The value string should contain semicolons separating the individual property values. The value string is split and each entry is cast to string, integer or float.
- Parameters:
row (Dict[str, str]) – The CSV row as dictionary
row_key (str) – The key for the property in row
- Returns:
Returns the list of cast values
- Return type:
List[int] | List[float] | List[str]
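The split-and-cast behaviour can be sketched as follows. This is an illustrative stand-in with a hypothetical name; trying integer first, then float, and falling back to strings is an assumed order that matches the homogeneous return type given above.

```python
from typing import Dict, List, Union


def split_and_cast(row: Dict[str, str],
                   row_key: str) -> Union[List[int], List[float], List[str]]:
    """Split a semicolon-separated cell and cast all entries to one common type."""
    parts = row[row_key].split(';')
    for caster in (int, float):
        try:
            # All entries must be castable, otherwise the next type is tried.
            return [caster(part) for part in parts]
        except ValueError:
            continue
    return parts


print(split_and_cast({'vals': '1;2;3'}, 'vals'))  # [1, 2, 3]
print(split_and_cast({'vals': '1;2.5'}, 'vals'))  # [1.0, 2.5]
print(split_and_cast({'vals': 'a;1'}, 'vals'))    # ['a', '1']
```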
- static detect_file_encoding(file_path: str) → str
Reads the first 100k bytes of a file and guesses its encoding (e.g., ASCII or UTF-8) using the chardet library. The result can afterwards be passed to open(file_path, 'r', encoding=encoding).
- Parameters:
file_path (str) – The path to the file
- Returns:
Returns the guessed encoding
- Return type:
str
- static extract_group_info_from_list(group_str_list: List[str]) → Tuple[List[str], Dict[str, int], str | None, str | None]
Extracts group names, their sizes and optionally the positive and negative group from a list of strings, each in the format “<group_name> (<group_size>)<[+] or [-] or blank>”.
- Parameters:
group_str_list (List[str]) – The string list containing all group data in the specified format
- Returns:
Returns all extracted data as a list of group names, dict of group sizes and positive and negative group if specified (or None)
- Return type:
Tuple[List[str], Dict[str, int], str | None, str | None]
- static extract_group_info_from_str(group_str: str) → Tuple[List[str], Dict[str, int], str | None, str | None]
Extracts group names, their sizes and optionally the positive and negative group from a string in the format “<group_name> (<group_size>)<[+] or [-] or blank>;<group_name> (<group_size>)<[+] or [-] or blank>;…”.
- Parameters:
group_str (str) – The string containing all group data in the specified format
- Returns:
Returns all extracted data as a list of group names, dict of group sizes and positive and negative group if specified (or None)
- Return type:
Tuple[List[str], Dict[str, int], str | None, str | None]
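Parsing the group format can be sketched with a regular expression. The code below is a stand-alone illustration, not the library implementation: the name `parse_group_string` is hypothetical, and it assumes that the markers appear literally as `[+]` and `[-]` directly after the closing parenthesis.

```python
import re
from typing import Dict, List, Optional, Tuple

# "<group_name> (<group_size>)" followed by an optional "[+]" or "[-]" marker.
GROUP_PATTERN = re.compile(r"^(?P<name>.+) \((?P<size>\d+)\)(?P<flag>\[\+\]|\[-\])?$")


def parse_group_string(
        group_str: str) -> Tuple[List[str], Dict[str, int], Optional[str], Optional[str]]:
    groups: List[str] = []
    sizes: Dict[str, int] = {}
    pos_group: Optional[str] = None
    neg_group: Optional[str] = None
    for entry in group_str.split(';'):
        match = GROUP_PATTERN.match(entry.strip())
        if match is None:
            raise ValueError(f'Entry "{entry}" does not follow the group format')
        name = match['name']
        groups.append(name)
        sizes[name] = int(match['size'])
        if match['flag'] == '[+]':
            pos_group = name
        elif match['flag'] == '[-]':
            neg_group = name
    return groups, sizes, pos_group, neg_group


print(parse_group_string('cases (10)[+];controls (25)[-];other (3)'))
```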
- static file_has_more_lines(file_path: str, threshold: int) → bool
Checks if a text file has more than threshold lines.
- Parameters:
file_path (str) – The path to the text file
threshold (int) – The threshold to be checked
- Returns:
Returns True if the file contains more lines
- Return type:
bool
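A threshold check of this kind does not need to read the whole file: iteration can stop as soon as the threshold is exceeded. The sketch below shows that idea with a hypothetical helper; whether the library stops early in the same way is an assumption.

```python
import os
import tempfile


def has_more_lines(file_path: str, threshold: int) -> bool:
    """Return True as soon as more than `threshold` lines have been seen."""
    with open(file_path, 'r') as handle:
        for line_number, _ in enumerate(handle, start=1):
            if line_number > threshold:
                return True
    return False


# Small demonstration on a temporary three-line file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')
    with open(path, 'w') as handle:
        handle.write('a\nb\nc\n')
    more_than_two = has_more_lines(path, 2)
    more_than_three = has_more_lines(path, 3)

print(more_than_two, more_than_three)  # True False
```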
- static load_csv_data(file_or_dir_path: str, delimiter: str | None = None, file_encoding: str | None = None) → Dict[str, List[Dict[str, str]]]
Loads table data from one CSV file or from all CSV files contained in a directory.
- Parameters:
file_or_dir_path (str) – Path to a directory or a file
delimiter (str | None) – CSV delimiter used for all files, inferred automatically if None is specified
file_encoding (str | None) – File encoding used for all files, inferred automatically if None is specified
- Returns:
Returns a dict with the filename without ‘.csv’ extension as key and list of row dicts as table data
- Return type:
Dict[str, List[Dict[str, str]]]
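The shape of the returned dictionary can be reproduced with the standard csv module. The following is a simplified stand-in with a hypothetical name; unlike the method documented above, it does not infer delimiter or encoding and instead assumes a comma and UTF-8 by default.

```python
import csv
import os
import tempfile
from typing import Dict, List


def load_tables(file_or_dir_path: str, delimiter: str = ',',
                encoding: str = 'utf-8') -> Dict[str, List[Dict[str, str]]]:
    """Read one CSV file, or every *.csv file of a directory, into row dicts."""
    if os.path.isdir(file_or_dir_path):
        paths = [os.path.join(file_or_dir_path, name)
                 for name in sorted(os.listdir(file_or_dir_path))
                 if name.endswith('.csv')]
    else:
        paths = [file_or_dir_path]
    tables = {}
    for path in paths:
        # The table name is the file name without its '.csv' extension.
        table_name = os.path.splitext(os.path.basename(path))[0]
        with open(path, newline='', encoding=encoding) as handle:
            tables[table_name] = list(csv.DictReader(handle, delimiter=delimiter))
    return tables


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'patients.csv')
    with open(csv_path, 'w', newline='', encoding='utf-8') as handle:
        handle.write('id,age\n1,42\n')
    tables = load_tables(tmp_dir)

print(tables)
```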
- class graphxplore.Basis.Graph(graph_type: GraphType, nodes: list | None = None, edges: list | None = None)
Bases: object
This is the parent class of all types of graphs. It is a data holder of nodes and edges.
- Parameters:
nodes (list | None) – The list of nodes
edges (list | None) – The list of edges
graph_type (GraphType) – The type of graph
- class graphxplore.Basis.GraphCSVIODevice(graph_dir: str, graph_type: GraphType)
Bases: object
This is a parent class for reading and writing CSV files containing generated Graph objects.
- class graphxplore.Basis.GraphCSVReader(graph_dir: str, graph_type: GraphType)
Bases: GraphCSVIODevice
This class reads Graph objects from CSV files.
- Parameters:
graph_dir (str)
graph_type (GraphType)
- class graphxplore.Basis.GraphCSVWriter(graph_dir: str, graph_type: GraphType)
Bases: GraphCSVIODevice
This class writes nodes and edges, or a whole Graph object, to a target directory in the form of CSV files.
- Parameters:
graph_dir (str)
graph_type (GraphType)
- write_edge(edge: BaseEdge | AttributeAssociationEdge) → None
Writes a single edge to a CSV file.
- Parameters:
edge (BaseEdge | AttributeAssociationEdge) – The edge to write
- Return type:
None
- static write_graph(graph_dir: str, graph: Graph) → None
Writes a whole graph to a specified target directory in the form of CSV files.
- Parameters:
graph_dir (str) – The directory the CSV files are written to
graph (Graph) – The graph that will be written
- Return type:
None
- write_node(node: BaseNode | AttributeAssociationNode) → None
Writes a single node to a CSV file based on its datatype.
- Parameters:
node (BaseNode | AttributeAssociationNode) – The node to write
- Return type:
None
- class graphxplore.Basis.GraphDatabaseUtils
Bases: object
- static check_graph_type_of_db(db_name: str, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → GraphType
Retrieves the GraphType of a given Neo4J database by inspecting all labels found in the database and checking for BaseNodeType.Key, DistinctionLabel and FrequencyLabel. Raises an exception if the connection could not be established, the database does not exist in the DBMS, or the type of database is not recognized.
- Parameters:
db_name (str) – The database name
address (str) – The address of the Neo4J DBMS
auth (Tuple[str, str]) – username and password to access the Neo4j DBMS
- Returns:
Returns the type of graph
- Return type:
GraphType
- static database_contains_labels(db_name: str, labels: Iterable[str], address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → bool
Checks if the nodes of a given database contain all labels specified in labels.
- Parameters:
db_name (str) – The name of the database
labels (Iterable[str]) – The list of labels that should be contained
address (str) – The address of the Neo4J DBMS
auth (Tuple[str, str]) – username and password to access the Neo4j DBMS
- Returns:
Returns True, if all labels are contained
- Return type:
bool
- static execute_query(query: str, database: str, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → List[Dict[str, Any]]
Executes a single Cypher query and retrieves the results. Raises an exception if the query fails.
- Parameters:
query (str) – The Cypher query
database (str) – The Neo4J database to query
address (str) – The address of the Neo4J DBMS
auth (Tuple[str, str]) – The authentication of the Neo4J DBMS
- Returns:
Returns a list of dictionaries, one for each returned record
- Return type:
List[Dict[str, Any]]
- static get_edge_write_cypher_statement(edge: BaseEdge | AttributeAssociationEdge, node_id_mapping: Dict[int, int], separate_params: bool = False) → str | Tuple[str, Dict[str, Any]]
Generates a Cypher statement to insert a single edge into a Neo4J database given its edge type and parameters. Additionally, the incident nodes are matched prior to the merge by their internal database IDs.
- Parameters:
edge (BaseEdge | AttributeAssociationEdge) – The edge to insert
node_id_mapping (Dict[int, int]) – A dictionary containing pairs of graphxplore node ID and associated internal node ID of the Neo4J database
separate_params (bool) – If True, a separate dict of parameter/values is generated and parameters are added as variables with a preceding $ character
- Returns:
Returns the Cypher statement with or without parameter/value dictionary
- Return type:
str | Tuple[str, Dict[str, Any]]
- static get_existing_databases(address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → List[str]
Retrieves the names of all databases existing in a Neo4J DBMS (except the “system” database). Raises an exception if the connection could not be established. Note that existing databases will be listed even if they are offline.
- Parameters:
address (str) – The address of the Neo4J DBMS
auth (Tuple[str, str]) – username and password to access the Neo4j DBMS
- Returns:
Returns a list of all database names
- Return type:
List[str]
- static get_neo4j_address(host: str = 'localhost', port: int = 7687, protocol: str = 'bolt') → str
Generates the address of a Neo4J DBMS with the given host, port and protocol.
- Parameters:
host (str) – The host name where the Neo4J DBMS is running
port (int) – The port for the Neo4J Bolt protocol
protocol (str) – The protocol of the connection
- Returns:
Returns the address as string
- Return type:
str
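Judging from the default address 'bolt://localhost:7687' used throughout this module, the address presumably has the form protocol://host:port. A minimal sketch under that assumption, with a hypothetical helper name:

```python
def build_neo4j_address(host: str = 'localhost', port: int = 7687,
                        protocol: str = 'bolt') -> str:
    """Combine protocol, host and port into a connection URI."""
    return f'{protocol}://{host}:{port}'


print(build_neo4j_address())                                         # bolt://localhost:7687
print(build_neo4j_address('db.example.org', 7688, protocol='neo4j'))  # neo4j://db.example.org:7688
```

The host name db.example.org is a placeholder used only for illustration.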
- static get_node_write_cypher_statement(node: BaseNode | AttributeAssociationNode, separate_params: bool = False, use_create: bool = True) → str | Tuple[str, Dict[str, Any]]
Generates a Cypher CREATE or MERGE statement to insert a single node into a Neo4J database.
- Parameters:
node (BaseNode | AttributeAssociationNode) – The node to insert
use_create (bool) – If True, a CREATE statement is generated, else a MERGE statement
separate_params (bool) – If True, a separate dict of parameter/values is generated and parameters are added as variables with a preceding $ character
- Returns:
Returns the Cypher statement with or without parameter/value dictionary
- Return type:
str | Tuple[str, Dict[str, Any]]
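The difference between an inlined statement and a statement with a separate parameter dictionary can be illustrated with a generic helper. Everything in this sketch is hypothetical: the function `node_statement`, the labels and the property names are chosen for illustration and do not reflect the statements graphxplore actually generates. Inlining values via `json.dumps` is a simplification that only covers basic strings, numbers and booleans.

```python
import json
from typing import Any, Dict, Sequence, Tuple, Union


def node_statement(labels: Sequence[str], properties: Dict[str, Any],
                   separate_params: bool = False,
                   use_create: bool = True) -> Union[str, Tuple[str, Dict[str, Any]]]:
    """Build a CREATE or MERGE statement for one node."""
    keyword = 'CREATE' if use_create else 'MERGE'
    label_part = ''.join(f':{label}' for label in labels)
    if separate_params:
        # Values are referenced as $name and shipped in a separate dict.
        prop_part = ', '.join(f'{key}: ${key}' for key in properties)
        return f'{keyword} (n{label_part} {{{prop_part}}})', dict(properties)
    # Values are written directly into the statement text.
    prop_part = ', '.join(f'{key}: {json.dumps(value)}'
                          for key, value in properties.items())
    return f'{keyword} (n{label_part} {{{prop_part}}})'


props = {'name': 'first_primary_key', 'value': 42}
print(node_statement(['FirstTable', 'Key'], props, separate_params=True))
print(node_statement(['FirstTable', 'Key'], props, use_create=False))
```

Passing values as parameters lets the database driver handle escaping and allows the same statement text to be reused for many nodes.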
- static get_nof_edges_in_database(db_name: str, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → int
Returns the number of edges stored in a Neo4J database
- Parameters:
db_name (str) – The name of the database
address (str) – The address of the Neo4J DBMS
auth (Tuple[str, str]) – username and password to access the Neo4j DBMS
- Returns:
Returns the number of edges
- Return type:
int
- static test_connection(address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → None
Tests whether a connection to a Neo4J DBMS can be established with the given address (protocol, host and Bolt port) and credentials. Raises an exception if the connection could not be established.
- Parameters:
address (str) – The address of the Neo4J DBMS
auth (Tuple[str, str]) – username and password to access the Neo4j DBMS
- Return type:
None
- class graphxplore.Basis.GraphDatabaseWriter(graph_type: GraphType, db_name: str, overwrite: bool = False, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', ''))
Bases: object
This class writes nodes and edges, or a whole Graph object, to a Neo4J database. WARNING: This class is not suited for very large graphs, since all nodes and edges are held in memory and the neo4j Python interface is not designed for large bulk imports. For very large graphs, write your graph to CSV with GraphCSVWriter and then import the CSV files using the “neo4j-admin import” tool.
- Parameters:
db_name (str) – The name of the database the data is written to
overwrite (bool) – If True, the database db_name will be overwritten if it already exists
address (str) – The address of the Neo4J DBMS
auth (Tuple[str, str]) – username and password to access the Neo4j DBMS
graph_type (GraphType)
- write_edge(edge: BaseEdge | AttributeAssociationEdge) → None
Stores a single edge for insertion into the Neo4J database. It is cached and written to the database later.
- Parameters:
edge (BaseEdge | AttributeAssociationEdge) – The edge to write
- Return type:
None
- static write_graph(db_name: str, graph: Graph, overwrite: bool = False, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → None
Writes a Graph object to a Neo4J database.
- Parameters:
db_name (str) – The name of the database the graph is written to
graph (Graph) – The graph to write
overwrite (bool) – If True, the database db_name will be overwritten if it already exists
address (str) – The address of the Neo4J DBMS
auth (Tuple[str, str]) – username and password to access the Neo4j DBMS
- Return type:
None
- write_node(node: BaseNode | AttributeAssociationNode) → None
Stores a single node for insertion into the Neo4J database. It is cached and written to the database later.
- Parameters:
node (BaseNode | AttributeAssociationNode) – The node to write
- Return type:
None
- class graphxplore.Basis.GraphOutputType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: str, Enum
The type of output format for a graph.
- CSV = 'CSV'
- Database = 'Database'
- class graphxplore.Basis.GraphType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: str, Enum
The type of graph.
- AttributeAssociation = 'AttributeAssociation'
- Base = 'Base'
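Because the enumeration derives from both str and Enum, its members compare equal to plain strings and can be looked up by value, which is convenient when the type is read from a file or configuration. The class below is a local stand-in that mirrors the members listed above; it is not imported from graphxplore.

```python
from enum import Enum


class GraphType(str, Enum):
    """Local stand-in mirroring the documented members."""
    AttributeAssociation = 'AttributeAssociation'
    Base = 'Base'


# Look-up by value, e.g. for a string read from a configuration file.
parsed = GraphType('Base')
print(parsed is GraphType.Base)            # True
# Members behave like their string value in comparisons.
print(GraphType.Base == 'Base')            # True
print([member.value for member in GraphType])
```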
- class graphxplore.Basis.RelationalDataIODevice(data_location: str | Dict[str, List[Dict[str, str]]], table: str, write: bool = False, header: List[str] | None = None, file_encoding: str | None = None, delimiter: str | None = None)
Bases: object
This class reads and writes relational table data, either from/to CSV files in a directory or from/to a Python dict.
- Parameters:
data_location (str | Dict[str, List[Dict[str, str]]]) – A directory path, or a dictionary of table names (without .csv extension) and lists of table row dicts
table (str) – The current table to consider. Either table + ‘.csv’ must exist in the specified directory path, or table must be a key in the dict
write (bool) – True for write access, False for read access
header (List[str] | None) – The header to write in the CSV. Can be omitted if write is False
file_encoding (str | None) – The file encoding of the CSV file to read. Can be omitted if write is True or a data dict is specified
delimiter (str | None) – The delimiter of the CSV file to read. Can be omitted if write is True or a data dict is specified
- static check_data_location(data_location: str | Dict[str, List[Dict[str, str]]], write: bool = False)
Checks that the data location exists as a path if it is a string, or that at least one table is present in the data dict if write is False.
- Parameters:
data_location (str | Dict[str, List[Dict[str, str]]]) – A directory path, or a dictionary of table name (without .csv extension) and list of table row dicts
write (bool) – True for write access, False for read access
- static get_available_table_names(data_source: str | Dict[str, List[Dict[str, str]]]) → List[str]
Retrieves all table names (without .csv extension) from a directory path, or all keys from a data dictionary
- Parameters:
data_source (str | Dict[str, List[Dict[str, str]]]) – A directory path, or a dictionary of table name (without .csv extension) and list of table row dicts
- Returns:
Returns the found table names as a list of strings
- Return type:
List[str]
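The two kinds of data source can be handled by one small function, sketched below. The helper name is hypothetical, and sorting the directory listing is an assumption made to obtain a deterministic order.

```python
import os
import tempfile
from typing import Dict, List, Union

TableData = Dict[str, List[Dict[str, str]]]


def available_table_names(data_source: Union[str, TableData]) -> List[str]:
    """Table names from dict keys, or from *.csv file names in a directory."""
    if isinstance(data_source, dict):
        return list(data_source.keys())
    return sorted(os.path.splitext(name)[0]
                  for name in os.listdir(data_source)
                  if name.endswith('.csv'))


names_from_dict = available_table_names({'patients': [], 'visits': []})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Only the two CSV files count as tables; notes.txt is ignored.
    for file_name in ('visits.csv', 'patients.csv', 'notes.txt'):
        open(os.path.join(tmp_dir, file_name), 'w').close()
    names_from_dir = available_table_names(tmp_dir)

print(names_from_dict, names_from_dir)
```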