graphxplore.MetaDataHandling package

This subpackage contains functionality based around metadata. You can either define metadata by hand or use the MetaDataGenerator for automatic extraction from a dataset. The result is a MetaData object which contains (among others) the following features:

list of all tables and variables
primary/foreign key relations between tables
metadata on the variable-level stored in VariableInfo objects which contain:
- data types (string, integer or decimal) and variable types (primary key, foreign key, metric, or categorical)
- value distributions
- detected or annotated artifacts (data type mismatches and extreme outliers)
- labels and descriptions
- BinningInfo for assigning metric variable values to bins

The MetaData objects can be stored and loaded as JSON files. The code could look like this:

>>> from graphxplore.MetaDataHandling import MetaDataGenerator, MetaData
>>> generator = MetaDataGenerator(csv_data='/dir_with_csv_files')
>>> metadata = generator.gather_meta_data()
>>> # the metadata can be adjusted before storage
>>> metadata.store_in_json(file_path='path/to/json')

Module contents

class graphxplore.MetaDataHandling.ArtifactMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Here, you can choose the level to which GraphXplore should detect artifacts:

NoArtifacts: GraphXplore detects no artifacts
OnlyDataTypeMismatch: GraphXplore considers cell values artifacts which do not match the data type of the variable
DataTypeMismatchAndOutliers: In addition to data type mismatch artifacts, GraphXplore considers extreme outliers as artifacts. For categorical variables where the top 10 most frequent categories account for at least 50% of the data, cell values which are not in the top 10 and appear only once are detected as artifacts. GraphXplore assumes these values to be typos. For metric variables, values which have no other value within 1.5 times the interquartile range are considered artifacts
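
The categorical rule described above can be illustrated with a short standalone sketch. It is not GraphXplore's implementation: the function name find_categorical_artifacts and its parameters are hypothetical, tie-breaking among equally frequent categories is left to Counter.most_common, and the 50% condition is read as "at least half of all cells".

```python
from collections import Counter
from typing import Dict, List


def find_categorical_artifacts(value_counts: Dict[str, int], top_n: int = 10,
                               min_top_share: float = 0.5) -> List[str]:
    """Return values flagged as likely typos (illustrative sketch only)."""
    total = sum(value_counts.values())
    if total == 0:
        return []
    top = Counter(value_counts).most_common(top_n)
    top_values = {value for value, _ in top}
    top_share = sum(count for _, count in top) / total
    # The rule only applies when the most frequent categories dominate the data.
    if top_share < min_top_share:
        return []
    # Values outside the top categories that occur exactly once are flagged.
    return sorted(value for value, count in value_counts.items()
                  if value not in top_values and count == 1)


counts = {f"c{i}": 10 for i in range(10)}
counts.update({"typo": 1, "rare": 2})
flagged = find_categorical_artifacts(counts)
```

In this example only "typo" is flagged: "rare" is outside the top 10 but occurs twice, so it does not match the singleton condition.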

DataTypeMismatchAndOutliers = 'DataTypeMismatchAndOutliers'

NoArtifacts = 'NoArtifacts'

OnlyDataTypeMismatch = 'OnlyDataTypeMismatch'

class graphxplore.MetaDataHandling.BinningInfo(should_bin: bool, exclude_from_binning: List[float] | None = None, ref_high: float | None = None, ref_low: float | None = None)[source]

Bases: object

This class contains information about the binning of a metric variable's values into “low”, “normal”, and “high” bins. If desired, lower and upper bounds for the reference range (the “normal” bin) can be specified, and values such as artifacts can be excluded from binning.

Parameters:

should_bin (bool) – Determines if the variable will be binned by the GraphTranslator
exclude_from_binning (List[float] | None) – These values are excluded during the binning process, defaults to None
ref_high (float | None) – The optionally set upper bound of the reference range, defaults to None
ref_low (float | None) – The optionally set lower bound of the reference range, defaults to None
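
As a rough illustration of how these parameters interact, the sketch below assigns a single value to a bin. It is a simplified stand-in, not the GraphTranslator logic: the function assign_bin is hypothetical, both reference bounds are assumed to be set, and treating the bounds as inclusive for the “normal” bin is an assumption.

```python
from typing import Iterable, Optional


def assign_bin(value: float, ref_low: float, ref_high: float,
               exclude_from_binning: Optional[Iterable[float]] = None) -> Optional[str]:
    """Return 'low', 'normal' or 'high', or None for excluded values (sketch)."""
    excluded = tuple(exclude_from_binning or ())
    # Excluded values (for example known artifacts) receive no bin at all.
    if value in excluded:
        return None
    if value < ref_low:
        return "low"
    if value > ref_high:
        return "high"
    # Assumption: the reference bounds themselves count as "normal".
    return "normal"


bins = [assign_bin(v, ref_low=4.0, ref_high=6.0, exclude_from_binning=[999.0])
        for v in (3.0, 4.0, 5.5, 7.2, 999.0)]
```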

exclude_from_binning: List[float] | None = None

ref_high: float | None = None

ref_low: float | None = None

should_bin: bool

class graphxplore.MetaDataHandling.CategoricalDistribution(category_counts: Dict[str | int | float, int], other_count: int, missing_count: int, artifact_count: int)[source]

Bases: object

Value distribution for categorical variables

Parameters:

category_counts (Dict[str | int | float, int]) – Counts for the top 10 most frequent categories
other_count (int) – Accumulated count of categories not listed in category_counts
missing_count (int) – Count of cell values which are missing values
artifact_count (int) – Count of artifact cells
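
The relationship between category_counts, other_count and missing_count can be sketched as follows. This is an illustrative reconstruction from the parameter descriptions, not the library's code: summarize_categories is a hypothetical helper, and artifact counting is left out.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_categories(value_counts: Dict[str, int],
                         missing_vals: Iterable[str] = ("", "NaN", "NA"),
                         top_n: int = 10) -> Tuple[Dict[str, int], int, int]:
    """Return (category_counts, other_count, missing_count) for raw value counts."""
    missing = set(missing_vals)
    missing_count = sum(c for v, c in value_counts.items() if v in missing)
    present = Counter({v: c for v, c in value_counts.items() if v not in missing})
    # Keep the most frequent categories and pool the remainder into one count.
    category_counts = dict(present.most_common(top_n))
    other_count = sum(present.values()) - sum(category_counts.values())
    return category_counts, other_count, missing_count


raw = {f"cat{i}": 12 - i for i in range(12)}  # cat0 occurs 12 times, cat11 once
raw[""] = 5                                   # five missing cells
top, other, missing = summarize_categories(raw)
```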

artifact_count: int

category_counts: Dict[str | int | float, int]

missing_count: int

other_count: int

class graphxplore.MetaDataHandling.DataType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

A variable’s data type.

Decimal = 'Decimal'

Integer = 'Integer'

String = 'String'

class graphxplore.MetaDataHandling.MetaData(tables: Iterable[str])[source]

Bases: object

This class is the core of all ETL processes in graphxplore. It stores the metadata of a relational dataset: its CSV tables, their variables, primary/foreign keys, and further information on the variable level. For more information, check out VariableInfo

Parameters:: tables (Iterable[str]) – The names of the CSV tables of the relational data set (without .csv)

add_foreign_key(table: str, foreign_table: str, foreign_key: str) → None[source]

Adds a foreign key and its foreign origin table to a specified table. foreign_key must be a variable of table and a primary key of foreign_table.

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
foreign_table (str) – The name of the foreign table, i.e. its file name with ‘.csv’ omitted
foreign_key (str) – The name of the foreign key, i.e. the column name

Return type:

None

add_table(table: str) → None[source]

Add a table to the metadata

Parameters:: table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
Return type:: None

add_variable(table: str, variable: str) → VariableInfo[source]

Adds a variable for a specified table to the metadata.

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
variable (str) – The name of the variable, i.e. the column name

Returns:

Returns the generated variable info that can be filled

Return type:

VariableInfo

assign_label(table: str, label: str) → None[source]

Assigns a label to a table, e.g. describing the contained data. Existing labels will be overwritten

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
label (str) – The label that should be assigned, should not contain whitespace or line breaks

Return type:

None

assign_primary_key(table: str, primary_key: str) → None[source]

Assigns a primary key for the specified table. Raises an exception if table already has a primary key, or primary_key is not a variable of table

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
primary_key (str) – The name of the primary key, i.e. the column name

Return type:

None

change_primary_key(table: str, primary_key: str) → None[source]

Changes the primary key for the specified table. Raises an exception if primary_key is not a variable of table

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
primary_key (str) – The name of the primary key, i.e. the column name

Return type:

None

static from_dict(data: dict) → MetaData[source]

Parses a MetaData object from a dictionary.

Parameters:: data (dict) – The input dictionary
Returns:: Returns the parsed object
Return type:: MetaData

get_foreign_keys(table) → Dict[str, str][source]

Retrieve all foreign keys of a table as a dictionary with the keys being the foreign keys and the values the foreign tables.

Parameters:: table – The name of the table, i.e. its file name with ‘.csv’ omitted
Returns:: Returns the foreign key/table dictionary
Return type:: Dict[str, str]

get_label(table) → str[source]

Returns the label of the table or the empty string if none was assigned.

Parameters:: table – The name of the table, i.e. its file name with ‘.csv’ omitted
Returns:: Returns the table label as string
Return type:: str

get_primary_key(table: str) → str[source]

Retrieve the primary key of the table. Returns the empty string if not yet assigned.

Parameters:: table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
Returns:: Returns the name of the primary key
Return type:: str

get_table_names() → List[str][source]

Retrieve all table names (file names with ‘.csv’ omitted) of the metadata.

Returns:: Returns the list of table names
Return type:: List[str]

get_total_nof_variables() → int[source]

Counts all variables in the metadata across all tables

Returns:: Returns the count as an integer
Return type:: int

get_variable(table: str, variable: str) → VariableInfo[source]

Retrieves the information about a given variable for inspection or altering.

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
variable (str) – The name of the variable, i.e. the column name

Returns:

Returns the variable information object

Return type:

VariableInfo

get_variable_names(table: str) → List[str][source]

Retrieves all variable names for a given table.

Parameters:: table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
Returns:: Returns the list of retrieved variable names
Return type:: List[str]

has_artifacts() → bool[source]

Checks if at least one variable has annotated artifacts

Returns:: Returns True if at least one annotated artifact was found
Return type:: bool

has_primary_key(table: str) → bool[source]

Checks if the table has a primary key assigned.

Parameters:: table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
Returns:: Returns True if a primary key was assigned
Return type:: bool

static load_from_json(filepath: str, file_encoding: str | None = None) → MetaData[source]

Reads a MetaData object from a JSON file.

Parameters:

filepath (str) – Path to the JSON
file_encoding (str | None) – file encoding of the JSON

Returns:

Returns a MetaData object

Return type:

MetaData

remove_foreign_key(table: str, foreign_key: str) → None[source]

Removes a foreign key for a specified table. foreign_key must be a variable of table.

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
foreign_key (str) – The name of the foreign key, i.e. the column name

Return type:

None

remove_table(table: str) → None[source]

Remove a table from the metadata. All foreign keys pointing to this table are changed to categorical variables

Parameters:: table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
Return type:: None

remove_variable(table: str, variable: str) → None[source]

Delete the variable for the specified table from the metadata. If it is a primary key, foreign key references from other tables are deleted as well

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
variable (str) – The name of the variable, i.e. the column name

Return type:

None

store_in_json(file_path: str, file_encoding: str | None = None) → None[source]

Stores the object as a JSON file.

Parameters:

file_path (str) – Path to the JSON
file_encoding (str | None) – file encoding that should be used for writing the JSON

Return type:

None

to_dict() → dict[source]

Converts the object to a dictionary.

Returns:: Returns the generated dictionary
Return type:: dict

class graphxplore.MetaDataHandling.MetaDataGenerator(csv_data: str | Dict[str, List[Dict[str, str]]], artifact_mode: ArtifactMode = ArtifactMode.DataTypeMismatchAndOutliers, missing_vals: Iterable[str | None] = ('', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na'), nof_read_lines: int = 1000000, str_len_free_text: int = 300, binning_threshold: int = 20, categorical_threshold: int = 20, file_encoding: str | None = None)[source]

Bases: object

This class extracts metadata information from CSV files. It detects primary keys and foreign key relations between tables. Additionally, VariableInfo objects are inferred for all columns of all CSV files. The result is a MetaData object.

Parameters:

csv_data (str | Dict[str, List[Dict[str, str]]]) – The input data as CSV files either as directory path containing the CSV files or as dictionary of table name and table data as list of dictionaries per row
artifact_mode (ArtifactMode) – Determines if artifacts should be detected and at what level. For further information check ArtifactMode
missing_vals (Iterable[str | None]) – These strings indicate missing values, defaults to the empty string, None, and variations of “NaN” and “Na”
nof_read_lines (int) – Maximum number of lines read from each CSV file to gather metadata, defaults to 1 million
str_len_free_text (int) – Strings with at least this number of characters are considered free text, and a variable containing them is disfavored as primary key. Defaults to 300.
binning_threshold (int) – Metric variables with more than this number of distinct values are marked for binning, defaults to 20
categorical_threshold (int) – Variables with at most this number of distinct values are considered categorical, defaults to 20
file_encoding (str | None) – The file encoding of the CSV files (ascii, utf-8,…) in chardet definition. Is guessed if not specified. Only used when CSV data is read from a directory, defaults to None

assign_foreign_keys() → None[source]

Assigns foreign keys by detecting occurrences of primary keys in other tables.
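
The description does not state how occurrences are detected. The sketch below assumes the simplest reading, matching by column name: a column is treated as a foreign key if it carries the name of another table's primary key. The helper detect_foreign_keys and its inputs are hypothetical, and the result mirrors the dictionary shape returned by get_foreign_keys.

```python
from typing import Dict, List


def detect_foreign_keys(columns: Dict[str, List[str]],
                        primary_keys: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Map each table to {foreign_key: foreign_table} by column-name matching."""
    # Look up which table owns each primary key column.
    owner = {pk: table for table, pk in primary_keys.items()}
    result: Dict[str, Dict[str, str]] = {}
    for table, cols in columns.items():
        result[table] = {col: owner[col] for col in cols
                         if col in owner and owner[col] != table}
    return result


tables = {"patient": ["patient_id", "age"],
          "visit": ["visit_id", "patient_id", "date"]}
keys = {"patient": "patient_id", "visit": "visit_id"}
foreign = detect_foreign_keys(tables, keys)
```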

Return type:: None

extract_variable_infos() → None[source]

Extracts all information about variables contained in CSVs of the source directory and detects primary keys. Artifacts are detected if specified by artifact_mode. For more information, check out ArtifactMode

Return type:: None

gather_meta_data() → MetaData[source]

Extracts variables and primary/foreign key relations between CSV files. Each CSV MUST contain a column with unique entries and no empty cells. Among these columns, a primary key is selected, prioritizing integer columns. Additionally, VariableInfo objects are inferred for all columns of all CSV files. Artifacts are detected if specified by artifact_mode. For more information, check out ArtifactMode
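
The primary key requirement (unique entries, no empty cells, integer columns preferred) can be pictured with the following sketch. It works on rows given as a list of dictionaries, the same shape accepted by csv_data, but pick_primary_key is a hypothetical helper and not the generator's actual selection logic, which may apply further criteria such as str_len_free_text.

```python
from typing import Dict, Iterable, List, Optional


def pick_primary_key(rows: List[Dict[str, str]],
                     missing_vals: Iterable[str] = ("", "NaN", "NA")) -> Optional[str]:
    """Return a column with unique, non-missing values, preferring integers."""
    if not rows:
        return None
    missing = set(missing_vals)
    candidates = []
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        if any(v is None or v in missing for v in values):
            continue  # empty cells disqualify the column
        if len(set(values)) != len(values):
            continue  # duplicate entries disqualify the column
        candidates.append(column)

    def is_integer_column(column: str) -> bool:
        try:
            for row in rows:
                int(row[column])
        except (TypeError, ValueError):
            return False
        return True

    # Integer candidates win; otherwise fall back to the first candidate.
    for column in candidates:
        if is_integer_column(column):
            return column
    return candidates[0] if candidates else None


rows = [{"code": "a", "id": "1", "group": "x"},
        {"code": "b", "id": "2", "group": "x"}]
chosen = pick_primary_key(rows)
```

Here both "code" and "id" are unique and complete, and "id" is chosen because all of its values parse as integers.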

Returns:: Returns the gathered metadata
Return type:: MetaData

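class graphxplore.MetaDataHandling.MetricDistribution(median: int | float, q1: int | float, q3: int | float, lower_fence: int | float, upper_fence: int | float, outliers: List[int | float], missing_count: int, artifact_count: int)[source]

Bases: object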

Value distribution for metric variables

Parameters:

median (int | float) – The median
q1 (int | float) – The first quartile
q3 (int | float) – The third quartile
lower_fence (int | float) – The maximum of the minimal value and q1 minus 1.5 times the interquartile range
upper_fence (int | float) – The minimum of the maximal value and q3 plus 1.5 times the interquartile range
outliers (List[int | float]) – The list of values smaller than lower_fence or larger than upper_fence which are not annotated as artifacts
missing_count (int) – Count of cell values which are missing values
artifact_count (int) – Count of artifact cells
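
The quantities above can be computed with the standard library, as in the sketch below. The helper summarize_metric is hypothetical, and the quartile method (statistics.quantiles with method="inclusive") is an assumption; GraphXplore may compute quartiles differently, which would shift the fences slightly.

```python
import statistics
from typing import Dict, List, Union

Number = Union[int, float]


def summarize_metric(values: List[Number]) -> Dict[str, object]:
    """Compute box plot statistics following the parameter descriptions above."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # The fences are clamped to the observed value range.
    lower_fence = max(min(values), q1 - 1.5 * iqr)
    upper_fence = min(max(values), q3 + 1.5 * iqr)
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {"median": median, "q1": q1, "q3": q3,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outliers": outliers}


summary = summarize_metric([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

For this input the interquartile range is 4.5, so the upper fence is 7.75 + 6.75 = 14.5 and the value 100 is the only outlier, while the lower fence is clamped to the minimum of 1.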

artifact_count: int

lower_fence: int | float

median: int | float

missing_count: int

outliers: List[int | float]

q1: int | float

q3: int | float

upper_fence: int | float

class graphxplore.MetaDataHandling.VariableInfo(name: str, table: str, labels: List[str], variable_type: VariableType, data_type: DataType, description: str | None = None, data_type_distribution: Dict[DataType, float] | None = None, default_value: str | int | float | None = None, value_distribution: MetricDistribution | CategoricalDistribution | None = None, binning: BinningInfo | None = None, artifacts: List[str] | None = None, reviewed: bool | None = None)[source]

Bases: object

This class contains all information about a single variable.

Parameters:

name (str) – The name of the variable, i.e. the column name
table (str) – The name of the origin table, i.e. its file name with ‘.csv’ omitted
labels (List[str]) – One or multiple labels describing the variable
variable_type (VariableType) – The type of variable
data_type (DataType) – The data type of the variable
description (str | None) – A description of the variable, e.g. containing units of measurement or SNOMED CT codes, defaults to None
data_type_distribution (Dict[DataType, float] | None) – The percentage of different data types in the variable, defaults to None
default_value (str | int | float | None) – The optional default value of the variable, defaults to None
value_distribution (MetricDistribution | CategoricalDistribution | None) – Distribution of values depending on the variable type, defaults to None
binning (BinningInfo | None) – The binning info of the variable, defaults to None
artifacts (List[str] | None) – Potential artifacts existing for the variable, defaults to None
reviewed (bool | None) – Variable information was reviewed, defaults to None
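
To make data_type_distribution more concrete, the sketch below infers a data type per cell and reports the share of each type. Everything here is illustrative: infer_type and type_distribution are hypothetical helpers, plain strings stand in for the DataType members, and expressing shares as fractions between 0 and 1 (rather than percentages up to 100) is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable


def infer_type(value: str) -> str:
    """Return 'Integer', 'Decimal' or 'String' for a raw cell value."""
    for caster, name in ((int, "Integer"), (float, "Decimal")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "String"


def type_distribution(values: Iterable[str]) -> Dict[str, float]:
    """Return the share of each inferred data type among the given cells."""
    counts = Counter(infer_type(v) for v in values)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}


shares = type_distribution(["1", "2", "3.5", "abc"])
```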

add_label(label: str)[source]

Add a label to the variable, e.g. describing its broad category such as “Laboratory”.

Parameters:: label (str) – The label to add, must only contain letters, numbers, hyphens or underscores
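
A label can be checked against this constraint before calling add_label. The pattern below is a sketch that assumes "letters" means ASCII letters; is_valid_label is a hypothetical helper, and the library may accept a wider character set or raise its own error.

```python
import re

# Letters, digits, hyphens and underscores only (ASCII letters assumed).
LABEL_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_label(label: str) -> bool:
    """Return True if the label consists solely of the allowed characters."""
    return LABEL_PATTERN.fullmatch(label) is not None


checks = {label: is_valid_label(label)
          for label in ("Laboratory", "Lab_Value-1", "Lab Value", "")}
```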

artifacts: List[str] | None = None

binning: BinningInfo | None = None

static cast_value(val_to_cast: str, data_type: DataType) → str | int | float | None[source]

Casts a value to the specified data type. Returns None if the value could not be cast.

Parameters:

val_to_cast (str) – The value which should be cast
data_type (DataType) – The data type to which the value should be cast

Returns:

Returns the cast value

Return type:

str | int | float | None
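
The contract of cast_value (return the converted value, or None when conversion fails) can be mimicked with a few lines of plain Python. This is a behavioural sketch, not the library's code: cast_value_sketch is a hypothetical name, and plain strings equal to the DataType member values are used instead of the enum.

```python
from typing import Optional, Union


def cast_value_sketch(val_to_cast: str,
                      data_type: str) -> Optional[Union[str, int, float]]:
    """Cast to 'Integer', 'Decimal' or 'String'; return None on failure."""
    try:
        if data_type == "Integer":
            return int(val_to_cast)
        if data_type == "Decimal":
            return float(val_to_cast)
        if data_type == "String":
            return str(val_to_cast)
    except (TypeError, ValueError):
        return None
    # Unknown data types are treated as a failed cast.
    return None


results = [cast_value_sketch("42", "Integer"),
           cast_value_sketch("4.5", "Integer"),
           cast_value_sketch("4.5", "Decimal"),
           cast_value_sketch("abc", "Decimal"),
           cast_value_sketch("abc", "String")]
```

A None result is what a data type mismatch looks like from the caller's side, which is the kind of cell that OnlyDataTypeMismatch treats as an artifact.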

Casts a value to the data type of the variable. Returns None if the value could not be cast.

Parameters:: val_to_cast (str | int | float) – The value which should be cast
Returns:: Returns the cast value
Return type:: str | int | float | None

data_type: DataType

data_type_distribution: Dict[DataType, float] | None = None

default_value: str | int | float | None = None

description: str | None = None

detect_artifacts_and_value_distribution(value_count_dict: Dict[str, int], artifact_mode: ArtifactMode = ArtifactMode.DataTypeMismatchAndOutliers, missing_vals: Iterable[str | None] = ('', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na'))[source]

Calculates a value distribution based on the variable type. For categorical variables, a distribution with counts is calculated. For metric variables, data for a box-and-whisker plot is calculated. For primary and foreign keys, no value distribution is derived. For more information, check out MetricDistribution and CategoricalDistribution. Depending on artifact_mode, artifacts are detected on the specified level. Pre-existing artifacts are preserved. For more information, check out ArtifactMode

Parameters:

value_count_dict (Dict[str, int]) – The dictionary with all values (as string) and their occurrence count
artifact_mode (ArtifactMode) – Determines if artifacts should be detected and at what level. For further information check ArtifactMode
missing_vals (Iterable[str | None]) – The list of possible missing values as string

static from_dict(var_name: str, table: str, variable_dict: dict) → VariableInfo[source]

Parses a VariableInfo object from a dictionary.

Parameters:

var_name (str) – The name of the variable, i.e. the column name
table (str) – The name of the origin table, i.e. its file name with ‘.csv’ omitted
variable_dict (dict) – A dictionary containing all information about the variable

Returns:

Returns the parsed object

Return type:

VariableInfo

labels: List[str]

name: str

reviewed: bool | None = None

table: str

to_dict() → dict[source]

Converts the object to a dictionary.

Returns:: Returns the generated dictionary
Return type:: dict

value_distribution: MetricDistribution | CategoricalDistribution | None = None

variable_type: VariableType

class graphxplore.MetaDataHandling.VariableType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

The type of variable.

Categorical = 'Categorical'

ForeignKey = 'ForeignKey'

Metric = 'Metric'

PrimaryKey = 'PrimaryKey'