graphxplore.GraphTranslation package
This subpackage contains the GraphTranslator which transforms a relational
dataset (in the form of CSV files) into a graph structure which can e.g. be loaded into a Neo4J database.
A prerequisite is a MetaData object of the relational dataset which can be
generated with the MetaDataGenerator class.
The result of the transformation process is a BaseGraph which contains a node for
each unique table/variable/value combination in the original relational dataset. A node x for a primary key value has
an edge to another node y if the values of x and y appear in the same row of the relational dataset. As all
table/variable/value combinations are unique within the graph, two primary key values (representing their respective
CSV rows) x1 and x2 with the same value for one variable will both have an outgoing edge pointing to the same node
y.
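The node and edge rule described above can be illustrated with a minimal pure-Python sketch. It does not use graphxplore itself: the toy table, the column names and the tuple-based node representation are assumptions made only for illustration, and the real BaseGraph may store nodes differently.

```python
from collections import defaultdict

# Hypothetical toy table; "patient_id" is assumed to be the primary key.
rows = [
    {"patient_id": "1", "city": "Berlin", "sex": "f"},
    {"patient_id": "2", "city": "Berlin", "sex": "m"},
]
table, primary_key = "patient", "patient_id"

# A node is modelled as a (table, variable, value) triple, so identical
# combinations collapse into a single node of the set.
nodes = set()
# Outgoing edges of each primary key node.
edges = defaultdict(set)

for row in rows:
    key_node = (table, primary_key, row[primary_key])
    nodes.add(key_node)
    for variable, value in row.items():
        if variable == primary_key:
            continue
        value_node = (table, variable, value)
        nodes.add(value_node)
        # The primary key node points to every value found in its row.
        edges[key_node].add(value_node)

# Both patients share the node for city "Berlin".
shared = edges[("patient", "patient_id", "1")] & edges[("patient", "patient_id", "2")]
print(shared)
```

In this sketch the two primary key nodes have exactly one common neighbor, which is what makes lookups by value cheap: all rows with a given value are reachable from a single node.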
The BaseGraph can be stored in a Neo4J database (or as CSV files). The graph
structure enables efficient lookups by value with the Neo4J Cypher query language (the counterpart of select statements in SQL). As
foreign key relations are also stored via edges, efficient lookups across tables are possible without tedious join
statements. The code to generate and store a BaseGraph might look like
>>> from graphxplore.Basis import GraphOutputType
>>> from graphxplore.MetaDataHandling import MetaData
>>> from graphxplore.GraphTranslation import GraphTranslator
>>> meta = MetaData.load_from_json(filepath='path_to_meta.json')
>>> translator = GraphTranslator(meta)
>>> translator.transform_to_graph(csv_data='/relational_csv_dir', output='mygraphdb',
...                               output_type=GraphOutputType.Database, address='bolt://localhost:7687',
...                               auth=('my_user', 'my_password'))
Module contents
- class graphxplore.GraphTranslation.GraphTranslator(metadata: MetaData, missing_vals: Iterable[str | None] = (None, '', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na'), file_encoding: str | None = None)[source]
Bases: object
This class transforms relational data represented by one or multiple CSVs to a graph structure given a MetaData object. Each unique triplet of table, variable and cell is assigned to a node in the graph structure. Cell nodes are connected to the node for their primary key value via an edge. This way, multiple primary keys with some identical cell value share a neighbor and are connected if they are in a foreign key relation. As a result, efficient data lookup can be achieved while avoiding complex joins across different tables. The generated BaseGraph forms the basis for all further data exploration/analysis methods.
- Parameters:
metadata (MetaData) – The metadata of the relational dataset
missing_vals (Iterable[str | None]) – These cell values are skipped and not added to the generated graph. Convenient for data with missing values, defaults to common missing value definitions
file_encoding (str | None) – The file encoding of the CSV files (ascii, utf-8, …) as named by chardet. Is guessed if not specified, defaults to None
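The effect of missing_vals can be pictured with the following sketch. The default tuple is copied from the signature above, but the helper keep_cells and the exact-match membership test are assumptions for illustration and are not taken from the graphxplore implementation.

```python
# Default taken from the documented signature of GraphTranslator.
MISSING_VALS = (None, '', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na')


def keep_cells(row, missing_vals=MISSING_VALS):
    """Return only the cells whose value is not listed as missing.

    Hypothetical helper: assumes an exact, case-sensitive comparison.
    """
    return {var: val for var, val in row.items() if val not in missing_vals}


# Hypothetical CSV row in which two cells carry missing value markers.
row = {"patient_id": "1", "city": "", "sex": "NA", "age": "42"}
kept = keep_cells(row)
print(kept)
```

Under these assumptions only the patient_id and age cells would become nodes; the empty string and 'NA' cells would be dropped before the graph is built.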
- transform_to_graph(csv_data: str | Dict[str, Iterable[Dict[str, str]]], output: str, output_type: GraphOutputType = GraphOutputType.CSV, overwrite: bool = False, address: str = 'bolt://localhost:7687', auth: Tuple[str, str] = ('neo4j', '')) → None[source]
Reads all CSV files specified in the supplied metadata from a data directory. Generates a graph with nodes for primary keys and attributes. Primary keys are linked if they appear in a primary/foreign key relation between different CSV files. Stores the generated graph in the specified output directory as CSV files or in a Neo4j database.
- Parameters:
csv_data (str | Dict[str, Iterable[Dict[str, str]]]) – The input data of the CSV files either as directory path containing the CSV files or as dictionary of table name and table data as dictionary per row
output (str) – The output directory to which the generated graph is written as CSV files, or the name of the Neo4j database
output_type (GraphOutputType) – The type of output. Either CSV or a Neo4j database, defaults to CSV
overwrite (bool) – Must be set to True if the graph is written to an existing Neo4j database, defaults to False
address (str) – The address of the Neo4J DBMS. Can be generated with get_neo4j_address(). Will only be used if the graph should be written to database
auth (Tuple[str, str]) – username and password to access the Neo4j DBMS. Will only be used if graph should be written to database
- Return type:
None
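As an alternative to a directory path, csv_data accepts a dictionary of table name and row dictionaries. The sketch below shows one way such a dictionary could be assembled with the standard library. The table names, columns and the helper rows_from_csv_text are hypothetical, and whether the keys have to match the table names in the metadata exactly (for example with or without a file extension) is not stated here.

```python
import csv
import io

# Hypothetical CSV content held in memory instead of on disk.
patient_csv = "patient_id,city\n1,Berlin\n2,Hamburg\n"
visit_csv = "visit_id,patient_id,diagnosis\n10,1,flu\n11,2,cold\n"


def rows_from_csv_text(text):
    """Parse CSV text into a list of row dictionaries (column name -> cell string)."""
    return list(csv.DictReader(io.StringIO(text)))


# Shape matching Dict[str, Iterable[Dict[str, str]]]: table name -> rows.
csv_data = {
    "patient": rows_from_csv_text(patient_csv),
    "visit": rows_from_csv_text(visit_csv),
}

# With a GraphTranslator built from matching metadata, this dictionary could
# presumably be passed as translator.transform_to_graph(csv_data=csv_data, ...).
print(csv_data["visit"][0])
```

Keeping every cell as a string mirrors the documented type of the row dictionaries, so no type conversion is attempted in this sketch.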