graphxplore.MetaDataHandling package

This subpackage contains functionality for working with metadata. You can either define metadata by hand or use the MetaDataGenerator to extract it automatically from a dataset. The result is a MetaData object which contains, among other things, the following features:

  • list of all tables and variables

  • primary/foreign key relations between tables

  • metadata on the variable-level stored in VariableInfo objects which contain:

    • data types (string, integer or decimal) and variable types (primary key, foreign key, metric, or categorical)

    • value distributions

    • detected or annotated artifacts (data type mismatches and extreme outliers)

    • labels and descriptions

    • BinningInfo for assigning metric variable values to bins

The MetaData objects can be stored and loaded as JSON files. The code could look like this:

>>> from graphxplore.MetaDataHandling import MetaDataGenerator, MetaData
>>> generator = MetaDataGenerator(csv_data='/dir_with_csv_files')
>>> metadata = generator.gather_meta_data()
>>> # the metadata can be adjusted before storage
>>> metadata.store_in_json(file_path='path/to/json')

Module contents

class graphxplore.MetaDataHandling.ArtifactMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Here, you can choose the level to which GraphXplore should detect artifacts:

  • NoArtifacts: GraphXplore detects no artifacts

  • OnlyDataTypeMismatch: GraphXplore considers cell values artifacts which do not match the data type of the variable

  • DataTypeMismatchAndOutliers: In addition to data type mismatch artifacts, GraphXplore considers extreme outliers as artifacts. For categorical variables where the top 10 most frequent categories account for at least 50% of the data, cell values which are not in the top 10 and appear only once are detected as artifacts. GraphXplore assumes these values to be typos. For metric variables, values which have no other value within 1.5 times the interquartile range are considered artifacts

DataTypeMismatchAndOutliers = 'DataTypeMismatchAndOutliers'
NoArtifacts = 'NoArtifacts'
OnlyDataTypeMismatch = 'OnlyDataTypeMismatch'
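The two outlier rules described for DataTypeMismatchAndOutliers can be illustrated with a small standalone sketch. It is not GraphXplore's implementation: the function names, the reading of the 50% condition as "at least 50%", the quartile method, and the treatment of duplicate values are all assumptions made for illustration.

```python
import statistics
from collections import Counter
from typing import Dict, List


def categorical_artifacts(value_counts: Dict[str, int], top_n: int = 10,
                          min_top_share: float = 0.5) -> List[str]:
    """Flag values outside the top categories that occur exactly once."""
    total = sum(value_counts.values())
    if total == 0:
        return []
    top = Counter(value_counts).most_common(top_n)
    top_share = sum(count for _, count in top) / total
    if top_share < min_top_share:
        # Assumption: without a few dominating categories nothing is flagged.
        return []
    top_values = {value for value, _ in top}
    return [value for value, count in value_counts.items()
            if value not in top_values and count == 1]


def metric_artifacts(values: List[float]) -> List[float]:
    """Flag values with no neighbouring value within 1.5 times the IQR."""
    if len(values) < 2:
        return []
    ordered = sorted(values)
    # Assumption: quartiles via the "inclusive" method of the stdlib.
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    reach = 1.5 * (q3 - q1)
    isolated = []
    for i, value in enumerate(ordered):
        left_gap = value - ordered[i - 1] if i > 0 else float("inf")
        right_gap = ordered[i + 1] - value if i < len(ordered) - 1 else float("inf")
        if min(left_gap, right_gap) > reach:
            isolated.append(value)
    return isolated
```

In this sketch a value that occurs twice is never isolated, because its duplicate lies at distance zero. Whether the library treats duplicates this way is not stated here.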
class graphxplore.MetaDataHandling.BinningInfo(should_bin: bool, exclude_from_binning: List[float] | None = None, ref_high: float | None = None, ref_low: float | None = None)[source]

Bases: object

This class contains information about the binning of a metric variable's values into “low”, “normal”, and “high” bins. If desired, lower and upper bounds for the reference range (the “normal” bin) can be specified, and values such as artifacts can be excluded from binning.

Parameters:
  • should_bin (bool) – Determines if the variable will be binned by the GraphTranslator

  • exclude_from_binning (List[float] | None) – These values are excluded during the binning process, defaults to None

  • ref_high (float | None) – The optionally set upper bound of the reference range, defaults to None

  • ref_low (float | None) – The optionally set lower bound of the reference range, defaults to None

exclude_from_binning: List[float] | None = None
ref_high: float | None = None
ref_low: float | None = None
should_bin: bool
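As a rough illustration of how the fields above could drive the assignment of a value to a bin, consider the sketch below. BinningSketch is a hypothetical stand-in that only mirrors the documented fields, and assign_bin is a hypothetical helper. The boundary handling (bounds count as “normal”) and the behaviour for unset bounds are assumptions, not the GraphTranslator's actual logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BinningSketch:
    """Hypothetical stand-in mirroring the documented BinningInfo fields."""
    should_bin: bool
    exclude_from_binning: Optional[List[float]] = None
    ref_high: Optional[float] = None
    ref_low: Optional[float] = None


def assign_bin(value: float, info: BinningSketch) -> Optional[str]:
    """Return "low", "normal" or "high", or None if the value is not binned."""
    if not info.should_bin:
        return None
    if info.exclude_from_binning and value in info.exclude_from_binning:
        return None
    # Assumption: an unset bound simply disables that side of the range.
    if info.ref_low is not None and value < info.ref_low:
        return "low"
    if info.ref_high is not None and value > info.ref_high:
        return "high"
    return "normal"
```

With a reference range of 3.5 to 5.0 and 999.0 excluded (for example as a placeholder code), 3.0 falls into “low”, 4.0 into “normal”, 6.0 into “high”, and 999.0 is skipped.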
class graphxplore.MetaDataHandling.CategoricalDistribution(category_counts: Dict[str | int | float, int], other_count: int, missing_count: int, artifact_count: int)[source]

Bases: object

Value distribution for categorical variables

Parameters:
  • category_counts (Dict[str | int | float, int]) – Counts for the top 10 most frequent categories

  • other_count (int) – Accumulated count of categories not listed in category_counts

  • missing_count (int) – Count of cell values which are missing values

  • artifact_count (int) – Count of artifact cells

artifact_count: int
category_counts: Dict[str | int | float, int]
missing_count: int
other_count: int
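The four counts can be derived from a dictionary of values and their occurrence counts roughly as follows. This is a standalone sketch rather than the library's code: summarize_categories and its parameters are hypothetical, and it assumes that missing values and artifacts are counted separately and never appear among the categories.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_categories(value_counts: Dict[str, int],
                         missing_vals: Iterable[str] = ("", "NaN", "NA"),
                         artifacts: Iterable[str] = (),
                         top_n: int = 10) -> Dict[str, object]:
    """Split value counts into top categories, other, missing and artifacts."""
    missing = set(missing_vals)
    flagged = set(artifacts)
    regular: Counter = Counter()
    missing_count = 0
    artifact_count = 0
    for value, count in value_counts.items():
        if value in missing:
            missing_count += count
        elif value in flagged:
            artifact_count += count
        else:
            regular[value] += count
    category_counts = dict(regular.most_common(top_n))
    # Everything that is a regular value but not a top category.
    other_count = sum(regular.values()) - sum(category_counts.values())
    return {
        "category_counts": category_counts,
        "other_count": other_count,
        "missing_count": missing_count,
        "artifact_count": artifact_count,
    }
```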
class graphxplore.MetaDataHandling.DataType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

A variable’s data type.

Decimal = 'Decimal'
Integer = 'Integer'
String = 'String'
class graphxplore.MetaDataHandling.MetaData(tables: Iterable[str])[source]

Bases: object

This class is the core of all ETL processes in graphxplore. It stores the metadata of a relational dataset, including information about its CSV tables, variables, and primary/foreign keys, as well as detailed information on the variable level. For more information, check out VariableInfo

Parameters:

tables (Iterable[str]) – The names of the CSV tables of the relational data set (without .csv)

add_foreign_key(table: str, foreign_table: str, foreign_key: str) None[source]

Adds a foreign key and its foreign origin table to a specified table. foreign_key must be a variable of table and a primary key of foreign_table.

Parameters:
  • table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

  • foreign_table (str) – The name of the foreign table, i.e. its file name with ‘.csv’ omitted

  • foreign_key (str) – The name of the foreign key, i.e. the column name

Return type:

None

add_table(table: str) None[source]

Add a table to the metadata

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

Return type:

None

add_variable(table: str, variable: str) VariableInfo[source]

Adds a variable for a specified table to the metadata.

Parameters:
  • table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

  • variable (str) – The name of the variable, i.e. the column name

Returns:

Returns the generated variable info that can be filled

Return type:

VariableInfo

assign_label(table: str, label: str) None[source]

Assigns a label to a table, e.g. describing the contained data. Existing labels will be overwritten

Parameters:
  • table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

  • label (str) – The label that should be assigned, should not contain whitespace or line breaks

Return type:

None

assign_primary_key(table: str, primary_key: str) None[source]

Assigns a primary key for the specified table. Raises an exception if table already has a primary key, or primary_key is not a variable of table

Parameters:
  • table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

  • primary_key (str) – The name of the primary key, i.e. the column name

Return type:

None

change_primary_key(table: str, primary_key: str) None[source]

Changes the primary key for the specified table. Raises an exception if primary_key is not a variable of table

Parameters:
  • table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

  • primary_key (str) – The name of the primary key, i.e. the column name

Return type:

None

static from_dict(data: dict) MetaData[source]

Parses a MetaData object from a dictionary.

Parameters:

data (dict) – The input dictionary

Returns:

Returns the parsed object

Return type:

MetaData

get_foreign_keys(table) Dict[str, str][source]

Retrieve all foreign keys of a table as a dictionary with the keys being the foreign keys and the values the foreign tables.

Parameters:

table – The name of the table, i.e. its file name with ‘.csv’ omitted

Returns:

Returns the foreign key/table dictionary

Return type:

Dict[str, str]

get_label(table) str[source]

Returns the label of the table or the empty string if none was assigned.

Parameters:

table – The name of the table, i.e. its file name with ‘.csv’ omitted

Returns:

Returns the table label as string

Return type:

str

get_primary_key(table: str) str[source]

Retrieve the primary key of the table. Returns the empty string if not yet assigned.

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

Returns:

Returns the name of the primary key

Return type:

str

get_table_names() List[str][source]

Retrieves all table names (file names with ‘.csv’ omitted) of the metadata.

Returns:

Returns the list of table names

Return type:

List[str]

get_total_nof_variables() int[source]

Counts all variables in the metadata across all tables

Returns:

Returns the count as an integer

Return type:

int

get_variable(table: str, variable: str) VariableInfo[source]

Retrieves the information about a given variable for inspection or altering.

Parameters:
  • table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

  • variable (str) – The name of the variable, i.e. the column name

Returns:

Returns the variable information object

Return type:

VariableInfo

get_variable_names(table: str) List[str][source]

Retrieves all variable names for a given table.

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

Returns:

Returns the list of retrieved variable names

Return type:

List[str]

has_artifacts() bool[source]

Checks if at least one variable has annotated artifacts

Returns:

Returns True, if at least one annotated artifact was found

Return type:

bool

has_primary_key(table: str) bool[source]

Checks if the table has a primary key assigned.

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

Returns:

Returns True if a primary key was assigned

Return type:

bool

static load_from_json(filepath: str, file_encoding: str | None = None) MetaData[source]

Reads a MetaData object from a JSON file.

Parameters:
  • filepath (str) – Path to the JSON

  • file_encoding (str | None) – file encoding of the JSON

Returns:

Returns a MetaData object

Return type:

MetaData

remove_foreign_key(table: str, foreign_key: str) None[source]

Removes a foreign key for a specified table. foreign_key must be a variable of table.

Parameters:
  • table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

  • foreign_key (str) – The name of the foreign key, i.e. the column name

Return type:

None

remove_table(table: str) None[source]

Remove a table from the metadata. All foreign keys pointing to this table are changed to categorical variables

Parameters:

table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

Return type:

None

remove_variable(table: str, variable: str) None[source]

Delete the variable for the specified table from the metadata. If it is a primary key, foreign key references from other tables are deleted as well

Parameters:
  • table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted

  • variable (str) – The name of the variable, i.e. the column name

Return type:

None

store_in_json(file_path: str, file_encoding: str | None = None) None[source]

Stores the object as a JSON file.

Parameters:
  • file_path (str) – Path to the JSON

  • file_encoding (str | None) – file encoding that should be used for writing the JSON

Return type:

None

to_dict() dict[source]

Converts the object to a dictionary.

Returns:

Returns the generated dictionary

Return type:

dict

class graphxplore.MetaDataHandling.MetaDataGenerator(csv_data: str | Dict[str, List[Dict[str, str]]], artifact_mode: ArtifactMode = ArtifactMode.DataTypeMismatchAndOutliers, missing_vals: Iterable[str | None] = ('', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na'), nof_read_lines: int = 1000000, str_len_free_text: int = 300, binning_threshold: int = 20, categorical_threshold: int = 20, file_encoding: str | None = None)[source]

Bases: object

This class extracts metadata information from CSV files. It detects primary keys and foreign key relations between tables. Additionally, VariableInfo objects are inferred for all columns of all CSV files. The result is a MetaData object.

Parameters:
  • csv_data (str | Dict[str, List[Dict[str, str]]]) – The input data, either as the path to a directory containing the CSV files or as a dictionary mapping each table name to its data, given as a list with one dictionary per row

  • artifact_mode (ArtifactMode) – Determines if artifacts should be detected and at what level. For further information check ArtifactMode

  • missing_vals (Iterable[str | None]) – These strings indicate missing values, defaults to the empty string, None, and variations of “NaN” and “Na”

  • nof_read_lines (int) – Maximum number of lines read from each CSV file to gather metadata, defaults to 1 million

  • str_len_free_text (int) – Strings with at least this number of characters are considered free text, and the variable containing them is disfavored as a primary key. Defaults to 300.

  • binning_threshold (int) – Metric variables with more than this number of distinct values are marked for binning, defaults to 20

  • categorical_threshold (int) – Variables with at most this number of distinct values are considered categorical, defaults to 20

  • file_encoding (str | None) – The file encoding of the CSV files (ascii, utf-8, …) as named by chardet. It is guessed if not specified. Only used when CSV data is read from a directory, defaults to None

assign_foreign_keys() None[source]

Assigns foreign keys by detecting occurrences of primary keys in other tables.

Return type:

None
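How "occurrences of primary keys in other tables" are detected is not spelled out here. One plausible reading, matching by column name, can be sketched as below. This is purely an assumption for illustration, and find_foreign_keys is a hypothetical helper, not part of graphxplore. Its per-table result has the same shape as the dictionary returned by MetaData.get_foreign_keys (foreign key mapped to foreign table).

```python
from typing import Dict, List


def find_foreign_keys(primary_keys: Dict[str, str],
                      columns: Dict[str, List[str]]) -> Dict[str, Dict[str, str]]:
    """Map each table to its foreign keys and their origin tables.

    Assumption: a column is a foreign key if it carries the same name as
    the primary key of another table.
    """
    result: Dict[str, Dict[str, str]] = {table: {} for table in columns}
    for origin_table, key in primary_keys.items():
        for table, names in columns.items():
            if table != origin_table and key in names:
                result[table][key] = origin_table
    return result
```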

extract_variable_infos() None[source]

Extracts all information about variables contained in the CSVs of the source directory and detects primary keys. Artifacts are detected if specified by artifact_mode. For more information, check out ArtifactMode

Return type:

None

gather_meta_data() MetaData[source]

Extracts variables and primary/foreign key relations between CSV files. Each CSV MUST contain a column with unique entries and no empty cells. Among these columns, a primary key is selected, prioritizing integer columns. Additionally, VariableInfo objects are inferred for all columns of all CSV files. Artifacts are detected if specified by artifact_mode. For more information, check out ArtifactMode

Returns:

Returns the gathered metadata

Return type:

MetaData
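The primary key requirement (a column with unique entries and no empty cells, with integer columns preferred) can be checked on a table before running the generator. The following is a minimal sketch under stated assumptions: pick_primary_key and looks_like_integer are hypothetical helpers, the table is given as a list of row dictionaries, and ties between candidates are broken by column order, which may differ from the library's choice.

```python
from typing import Dict, Iterable, List, Optional


def looks_like_integer(text: str) -> bool:
    """Return True if the string can be parsed as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def pick_primary_key(rows: List[Dict[str, str]],
                     missing_vals: Iterable[str] = ("", "NaN", "NA")) -> Optional[str]:
    """Pick a column with unique, non-missing entries, preferring integers."""
    if not rows:
        return None
    missing = set(missing_vals)
    candidates = []
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        if any(value is None or value in missing for value in values):
            continue  # empty cells disqualify the column
        if len(set(values)) != len(values):
            continue  # duplicate entries disqualify the column
        candidates.append(column)
    for column in candidates:
        if all(looks_like_integer(row[column]) for row in rows):
            return column
    return candidates[0] if candidates else None
```

A return value of None indicates that no column of the table satisfies the requirement.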

class graphxplore.MetaDataHandling.MetricDistribution(median: int | float, q1: int | float, q3: int | float, lower_fence: int | float, upper_fence: int | float, outliers: List[int | float], missing_count: int, artifact_count: int)[source]

Bases: object

Value distribution for metric variables

Parameters:
  • median (int | float) – The median

  • q1 (int | float) – The first quartile

  • q3 (int | float) – The third quartile

  • lower_fence (int | float) – The maximum of the minimal value and q1 - 1.5 times the interquartile range

  • upper_fence (int | float) – The minimum of the maximal value and q3 + 1.5 times the interquartile range

  • outliers (List[int | float]) – The list of values smaller than lower_fence or larger than upper_fence which are not annotated as artifacts

  • missing_count (int) – Count of cell values which are missing values

  • artifact_count (int) – Count of artifact cells

artifact_count: int
lower_fence: int | float
median: int | float
missing_count: int
outliers: List[int | float]
q1: int | float
q3: int | float
upper_fence: int | float
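The definitions of the fences and outliers above can be written out as a short standalone sketch. summarize_metric is a hypothetical helper, and the quartile method (the "inclusive" method of Python's statistics module) is an assumption. The library may compute quartiles differently, which would shift the fences slightly.

```python
import statistics
from typing import Dict, List, Union

Number = Union[int, float]


def summarize_metric(values: List[Number]) -> Dict[str, object]:
    """Compute box plot statistics following the documented definitions."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    reach = 1.5 * (q3 - q1)
    # Fences are clipped to the observed minimum and maximum.
    lower_fence = max(min(values), q1 - reach)
    upper_fence = min(max(values), q3 + reach)
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }
```

For the values 1 to 8 plus a single 100, the quartiles are 3, 5, and 7, so the interquartile range is 4 and the upper fence is 7 + 6 = 13. The lower fence is clipped to the minimum, 1, and 100 is the only outlier.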
class graphxplore.MetaDataHandling.VariableInfo(name: str, table: str, labels: List[str], variable_type: VariableType, data_type: DataType, description: str | None = None, data_type_distribution: Dict[DataType, float] | None = None, default_value: str | int | float | None = None, value_distribution: MetricDistribution | CategoricalDistribution | None = None, binning: BinningInfo | None = None, artifacts: List[str] | None = None, reviewed: bool | None = None)[source]

Bases: object

This class contains all information about a single variable.

Parameters:
  • name (str) – The name of the variable, i.e. the column name

  • table (str) – The name of the origin table, i.e. its file name with ‘.csv’ omitted

  • labels (List[str]) – One or multiple labels describing the variable

  • variable_type (VariableType) – The type of variable

  • data_type (DataType) – The data type of the variable

  • description (str | None) – A description of the variable, e.g. containing units of measurement or SNOMED CT codes, defaults to None

  • data_type_distribution (Dict[DataType, float] | None) – The percentage of different data types in the variable, defaults to None

  • default_value (str | int | float | None) – The optional default value of the variable, defaults to None

  • value_distribution (MetricDistribution | CategoricalDistribution | None) – Distribution of values depending on the variable type, defaults to None

  • binning (BinningInfo | None) – The binning info of the variable, defaults to None

  • artifacts (List[str] | None) – Potential artifacts existing for the variable, defaults to None

  • reviewed (bool | None) – Variable information was reviewed, defaults to None
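To make the data_type_distribution parameter more concrete, the sketch below computes the share of each data type among a list of raw cell values. It is an illustration only: infer_type and type_shares are hypothetical helpers, plain strings stand in for the DataType members, and shares are expressed as fractions between 0 and 1, whereas the library may use a different scale.

```python
from collections import Counter
from typing import Dict, Iterable


def infer_type(text: str) -> str:
    """Return the narrowest type name that can represent the string."""
    try:
        int(text)
        return "Integer"
    except ValueError:
        pass
    try:
        float(text)
        return "Decimal"
    except ValueError:
        return "String"


def type_shares(values: Iterable[str]) -> Dict[str, float]:
    """Compute the fraction of values per inferred data type."""
    counts = Counter(infer_type(value) for value in values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}
```

Note that Python's float() also accepts strings such as "nan" or "inf", so missing value markers should be filtered out before a computation like this one.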

add_label(label: str)[source]

Add a label to the variable, e.g. describing its broad category such as “Laboratory”.

Parameters:

label (str) – The label to add, must only contain letters, numbers, hyphens or underscores

artifacts: List[str] | None = None
binning: BinningInfo | None = None
static cast_value(val_to_cast: str, data_type: DataType) str | int | float | None[source]

Casts a value to the specified data type. Returns None if the value could not be cast.

Parameters:
  • val_to_cast (str) – The value which should be cast

  • data_type (DataType) – The data type to which the value should be cast

Returns:

Returns the cast value

Return type:

str | int | float | None
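The documented contract (cast to the requested data type, return None on failure) can be mimicked with a few lines of standalone Python. DataTypeSketch and cast_sketch are hypothetical stand-ins, not the library's classes, and the use of Python's int() and float() parsers is an assumption about what counts as a valid integer or decimal.

```python
from enum import Enum
from typing import Optional, Union


class DataTypeSketch(str, Enum):
    """Hypothetical copy of the documented DataType members."""
    Decimal = "Decimal"
    Integer = "Integer"
    String = "String"


def cast_sketch(val_to_cast: str,
                data_type: DataTypeSketch) -> Optional[Union[str, int, float]]:
    """Cast a string to the given type, returning None if that fails."""
    try:
        if data_type == DataTypeSketch.Integer:
            return int(val_to_cast)
        if data_type == DataTypeSketch.Decimal:
            return float(val_to_cast)
        return str(val_to_cast)
    except ValueError:
        return None
```

A None result is what makes a data type mismatch detectable: in this sketch, "4.2" cannot be cast to an integer and would therefore count as a mismatch for an integer variable.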

cast_value_to_data_type(val_to_cast: str | int | float) str | int | float | None[source]

Casts a value to the data type of the variable. Returns None if the value could not be cast.

Parameters:

val_to_cast (str | int | float) – The value which should be cast

Returns:

Returns the cast value

Return type:

str | int | float | None

data_type: DataType
data_type_distribution: Dict[DataType, float] | None = None
default_value: str | int | float | None = None
description: str | None = None
detect_artifacts_and_value_distribution(value_count_dict: Dict[str, int], artifact_mode: ArtifactMode = ArtifactMode.DataTypeMismatchAndOutliers, missing_vals: Iterable[str | None] = ('', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na'))[source]

Calculates a value distribution based on the variable type. For categorical variables, a distribution with counts is calculated. For metric variables, data for a box-and-whisker plot is calculated. For primary and foreign keys, no value distribution is derived. For more information, check out MetricDistribution and CategoricalDistribution. Depending on artifact_mode, artifacts are detected on the specified level. Pre-existing artifacts are preserved. For more information, check out ArtifactMode

Parameters:
  • value_count_dict (Dict[str, int]) – The dictionary with all values (as string) and their occurrence count

  • artifact_mode (ArtifactMode) – Determines if artifacts should be detected and at what level. For further information check ArtifactMode

  • missing_vals (Iterable[str | None]) – The list of possible missing values as string

static from_dict(var_name: str, table: str, variable_dict: dict) VariableInfo[source]

Parses a VariableInfo object from a dictionary.

Parameters:
  • var_name (str) – The name of the variable, i.e. the column name

  • table (str) – The name of the origin table, i.e. its file name with ‘.csv’ omitted

  • variable_dict (dict) – A dictionary containing all information about the variable

Returns:

Returns the parsed object

Return type:

VariableInfo

labels: List[str]
name: str
reviewed: bool | None = None
table: str
to_dict() dict[source]

Converts the object to a dictionary.

Returns:

Returns the generated dictionary

Return type:

dict

value_distribution: MetricDistribution | CategoricalDistribution | None = None
variable_type: VariableType
class graphxplore.MetaDataHandling.VariableType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

The type of variable.

Categorical = 'Categorical'
ForeignKey = 'ForeignKey'
Metric = 'Metric'
PrimaryKey = 'PrimaryKey'