graphxplore.GraphTranslation package

This subpackage contains the GraphTranslator, which transforms a relational dataset (in the form of CSV files) into a graph structure that can, for example, be loaded into a Neo4j database. A prerequisite is a MetaData object of the relational dataset, which can be generated with the MetaDataGenerator class.

The result of the transformation process is a BaseGraph which contains a node for each unique table/variable/value combination in the original relational dataset. A node x for a primary key value has an edge to another node y if the values of x and y appear in the same row of the relational dataset. As all table/variable/value combinations are unique within the graph, two primary key values (representing their respective CSV rows) x1 and x2 with the same value for one variable will both have an outgoing edge pointing to the same node y.
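
The node and edge layout described above can be illustrated with a minimal, library-independent sketch. It is not the GraphTranslator implementation: the function build_graph, the table name and the sample rows are hypothetical and only mirror the description (one node per unique table/variable/value combination, one edge from each primary key node to the values in its row).

```python
from collections import defaultdict


def build_graph(table, primary_key, rows):
    """Hypothetical helper: build (nodes, edges) for a single table.

    Nodes are (table, variable, value) triples, so identical values of the
    same variable collapse into one node.
    """
    nodes = set()
    edges = defaultdict(set)  # primary key node -> set of attribute nodes
    for row in rows:
        pk_node = (table, primary_key, row[primary_key])
        nodes.add(pk_node)
        for variable, value in row.items():
            if variable == primary_key:
                continue
            attr_node = (table, variable, value)
            nodes.add(attr_node)  # no-op if the node already exists
            edges[pk_node].add(attr_node)
    return nodes, edges


# Two rows (x1 and x2) with the same value for the variable "city"
rows = [
    {"patient_id": "1", "city": "Berlin"},
    {"patient_id": "2", "city": "Berlin"},
]
nodes, edges = build_graph("patient", "patient_id", rows)

# Both primary key nodes point to the same node y
shared = edges[("patient", "patient_id", "1")] & edges[("patient", "patient_id", "2")]
```

Here the set shared contains the single node for the city value, which is what makes lookups by value cheap: all rows with that value are one hop away from it.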

The BaseGraph can be stored in a Neo4j database (or as CSV files). The graph structure enables efficient lookups by value with the Neo4j Cypher query language (the counterpart of select statements in SQL). As foreign key relations are also stored as edges, efficient lookups across tables are possible without tedious join statements. The code to generate and store a BaseGraph might look like:

>>> from graphxplore.Basis import GraphType, GraphOutputType
>>> from graphxplore.MetaDataHandling import MetaData
>>> from graphxplore.GraphTranslation import GraphTranslator
>>> meta = MetaData.load_from_json(filepath='path_to_meta.json')
>>> translator = GraphTranslator(meta)
>>> translator.transform_to_graph(csv_data='/relational_csv_dir', output='mygraphdb',
...                               output_type=GraphOutputType.Database, address='bolt://localhost:7687',
...                               auth=('my_user', 'my_password'))

Module contents

class graphxplore.GraphTranslation.GraphTranslator(metadata: MetaData, missing_vals: Iterable[str | None] = (None, '', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na'), file_encoding: str | None = None)[source]

Bases: object

This class transforms relational data, represented by one or multiple CSV files, into a graph structure given a MetaData object. Each unique triplet of table, variable and cell value is assigned to a node in the graph structure. Cell nodes are connected to the node of their primary key value via an edge. This way, primary keys that share a cell value have a common neighbor, and primary keys that are in a foreign key relation are connected by an edge. As a result, efficient data lookup can be achieved while avoiding complex joins across different tables. The generated BaseGraph forms the basis for all further data exploration/analysis methods.

Parameters:
  • metadata (MetaData) – The metadata of the relational dataset

  • missing_vals (Iterable[str | None]) – These cell values are skipped and not added to the generated graph. Convenient for data with missing values, defaults to common missing value definitions

  • file_encoding (str | None) – The file encoding of the CSV files (ascii, utf-8, …) as named by chardet. The encoding is guessed if not specified, defaults to None
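
The effect of missing_vals can be sketched as follows. This is only an illustration under the assumption that cell values are compared by exact match against the configured collection; the helper drop_missing and the sample row are hypothetical, and the actual class may handle details such as surrounding whitespace differently.

```python
# Default taken from the constructor signature documented above
MISSING_VALS = (None, '', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na')


def drop_missing(row, missing_vals=MISSING_VALS):
    """Hypothetical helper: keep only the cells that would become graph nodes."""
    return {
        variable: value
        for variable, value in row.items()
        if value not in missing_vals  # exact match, e.g. 'n/a' is NOT skipped
    }


row = {"patient_id": "1", "city": "", "age": "NaN", "sex": "f"}
cleaned = drop_missing(row)
```

Under this assumption, a dataset that encodes missing data differently (for example as 'n/a' or '-999') would need those strings passed explicitly via missing_vals.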

transform_to_graph(csv_data: str | Dict[str, Iterable[Dict[str, str]]], output: str, output_type: GraphOutputType = GraphOutputType.CSV, overwrite: bool = False, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) None[source]

Reads all CSV files that are specified in the supplied metadata from a data directory. Generates a graph with nodes for primary keys and attributes. Primary keys are linked if they appear in a primary/foreign key relation between different CSV files. The generated graph is stored in the specified output directory as CSV files or in a Neo4j database.

Parameters:
  • csv_data (str | Dict[str, Iterable[Dict[str, str]]]) – The input data of the CSV files, either as the path of the directory containing the CSV files or as a dictionary mapping each table name to its table data, with one dictionary per row

  • output (str) – The output location of the generated graph: either the directory to which the CSV files are written or the name of the Neo4j database

  • output_type (GraphOutputType) – The type of output. Either CSV or a Neo4j database, defaults to CSV

  • overwrite (bool) – Has to be set to True if the graph is written to an existing Neo4j database that should be overwritten, defaults to False

  • address (str) – The address of the Neo4j DBMS. Can be generated with get_neo4j_address(). Will only be used if the graph should be written to a database

  • auth (Tuple[str, str]) – Username and password to access the Neo4j DBMS. Will only be used if the graph should be written to a database

Return type:

None
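
As an alternative to a directory path, csv_data accepts in-memory tables. The sketch below shows one way such a dictionary could be assembled with the standard library. The table names, columns and CSV contents are made up for illustration, and it is an assumption that the dictionary keys have to match the table names used in the MetaData object.

```python
import csv
import io

# Hypothetical CSV contents; in practice these would come from files or an upload
patient_csv = "patient_id,city\n1,Berlin\n2,Hamburg\n"
visit_csv = "visit_id,patient_id\n10,1\n11,2\n"

# One entry per table, each row as a dictionary of column name -> string value,
# which matches the type Dict[str, Iterable[Dict[str, str]]]
csv_data = {
    "patient": list(csv.DictReader(io.StringIO(patient_csv))),
    "visit": list(csv.DictReader(io.StringIO(visit_csv))),
}

# With a MetaData object `meta` describing these tables, the dictionary could be
# passed on using the signature documented above (not executed here):
# GraphTranslator(meta).transform_to_graph(csv_data=csv_data, output='graph_out')
```

csv.DictReader yields every cell as a string, so no further conversion is needed to satisfy the documented row type.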