graphxplore.MetaDataHandling package
This subpackage contains functionality based around metadata. You can either define metadata by hand or use the
MetaDataGenerator for automatic extraction from a dataset. The result is a
MetaData object which contains (among others) the following features:
list of all tables and variables
primary/foreign key relations between tables
metadata on the variable level stored in VariableInfo objects, which contain:
data types (string, integer or decimal) and variable types (primary key, foreign key, metric, or categorical)
value distributions
detected or annotated artifacts (data type mismatches and extreme outliers)
labels and descriptions
BinningInfo for assigning metric variable values to bins
The MetaData objects can be stored and loaded as JSON files. The code could
look like this:
>>> from graphxplore.MetaDataHandling import MetaDataGenerator, MetaData
>>> generator = MetaDataGenerator(csv_data='/dir_with_csv_files')
>>> metadata = generator.gather_meta_data()
>>> # the metadata could be adjusted before storage
>>> metadata.store_in_json(file_path='path/to/json')
Module contents
- class graphxplore.MetaDataHandling.ArtifactMode(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: str, Enum
Here, you can choose the level at which GraphXplore should detect artifacts:
NoArtifacts: GraphXplore detects no artifacts
OnlyDataTypeMismatch: GraphXplore considers cell values artifacts which do not match the data type of the variable
DataTypeMismatchAndOutliers: In addition to data type mismatch artifacts, GraphXplore considers extreme outliers as artifacts. For categorical variables where the 10 most frequent categories account for at least 50% of the data, cell values which are not among those 10 categories and appear only once are detected as artifacts. GraphXplore assumes these values to be typos. For metric variables, values which have no other value within 1.5 times the interquartile range are considered artifacts
- DataTypeMismatchAndOutliers = 'DataTypeMismatchAndOutliers'
- NoArtifacts = 'NoArtifacts'
- OnlyDataTypeMismatch = 'OnlyDataTypeMismatch'
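The two outlier rules of DataTypeMismatchAndOutliers can be illustrated with a minimal standalone sketch. It does not call GraphXplore and is not its implementation. The function names, the use of statistics.quantiles for the quartiles, a top-10 share threshold of at least 50%, and the comparison of distinct values only are all assumptions made for illustration.

```python
from collections import Counter
from statistics import quantiles


def categorical_outliers(value_counts, top_n=10, min_top_share=0.5):
    """Return values outside the top_n categories that occur exactly once.

    The rule is only applied if the top_n categories cover at least
    min_top_share of all cells (assumed reading of the 50% condition).
    """
    counts = Counter(value_counts)
    total = sum(counts.values())
    if total == 0:
        return []
    top = counts.most_common(top_n)
    top_share = sum(count for _, count in top) / total
    if top_share < min_top_share:
        return []
    top_values = {value for value, _ in top}
    return [v for v, c in counts.items() if v not in top_values and c == 1]


def metric_outliers(values, factor=1.5):
    """Return distinct values with no other distinct value within factor * IQR."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return []
    q1, _, q3 = quantiles(values, n=4)
    max_gap = factor * (q3 - q1)
    isolated = []
    for i, value in enumerate(distinct):
        # Distance to the nearest neighbour on either side
        gaps = []
        if i > 0:
            gaps.append(value - distinct[i - 1])
        if i < len(distinct) - 1:
            gaps.append(distinct[i + 1] - value)
        if min(gaps) > max_gap:
            isolated.append(value)
    return isolated
```

In this sketch, a single misspelled category next to ten frequent ones is flagged, and a metric value far away from all other values is flagged as isolated.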
- class graphxplore.MetaDataHandling.BinningInfo(should_bin: bool, exclude_from_binning: List[float] | None = None, ref_high: float | None = None, ref_low: float | None = None)[source]
Bases: object
This class contains information about the binning of a metric variable's values into “low”, “normal”, and “high” bins. If desired, lower and upper bounds for the reference range (the “normal” bin) can be specified, and values such as artifacts can be excluded from binning.
- Parameters:
should_bin (bool) – Determines if the variable will be binned by the GraphTranslator
exclude_from_binning (List[float] | None) – These values are excluded during the binning process, defaults to None
ref_high (float | None) – The optionally set upper bound of the reference range, defaults to None
ref_low (float | None) – The optionally set lower bound of the reference range, defaults to None
- exclude_from_binning: List[float] | None = None
- ref_high: float | None = None
- ref_low: float | None = None
- should_bin: bool
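How a reference range splits values into the three bins can be sketched as follows. This is a hypothetical helper, not the GraphTranslator logic: the name assign_bin, the requirement that both bounds are given, and treating the bounds themselves as part of the “normal” bin are assumptions of this sketch.

```python
def assign_bin(value, ref_low, ref_high, exclude_from_binning=None):
    """Assign a metric value to "low", "normal" or "high".

    Returns None for values excluded from binning. Counting both bounds
    as "normal" is an assumption of this sketch.
    """
    if exclude_from_binning and value in exclude_from_binning:
        return None
    if value < ref_low:
        return "low"
    if value > ref_high:
        return "high"
    return "normal"
```

With a reference range of 4.0 to 6.0, a value of 3.0 would fall into “low” and 7.0 into “high”, while a placeholder such as -999.0 listed in exclude_from_binning would receive no bin.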
- class graphxplore.MetaDataHandling.CategoricalDistribution(category_counts: Dict[str | int | float, int], other_count: int, missing_count: int, artifact_count: int)[source]
Bases: object
Value distribution for categorical variables
- Parameters:
category_counts (Dict[str | int | float, int]) – Counts for the top 10 most frequent categories
other_count (int) – Accumulated count of categories not listed in category_counts
missing_count (int) – Count of cell values which are missing values
artifact_count (int) – Count of artifact cells
- artifact_count: int
- category_counts: Dict[str | int | float, int]
- missing_count: int
- other_count: int
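The four counts can be derived from a dictionary of values and their occurrence counts roughly as shown below. This is a standalone sketch under assumptions, not the library's code: the function name summarize_categories, the shortened default list of missing values, and passing artifacts in as a ready-made collection are all hypothetical.

```python
from collections import Counter


def summarize_categories(value_counts, missing_vals=("", "NaN", "NA"),
                         artifacts=(), top_n=10):
    """Split occurrence counts into top categories, other, missing and artifacts."""
    missing_count = 0
    artifact_count = 0
    regular = Counter()
    for value, count in value_counts.items():
        if value in missing_vals:
            missing_count += count
        elif value in artifacts:
            artifact_count += count
        else:
            regular[value] += count
    # Keep the top_n most frequent categories and accumulate the rest
    top = dict(regular.most_common(top_n))
    other_count = sum(regular.values()) - sum(top.values())
    return {
        "category_counts": top,
        "other_count": other_count,
        "missing_count": missing_count,
        "artifact_count": artifact_count,
    }
```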
- class graphxplore.MetaDataHandling.DataType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: str, Enum
A variable’s data type.
- Decimal = 'Decimal'
- Integer = 'Integer'
- String = 'String'
- class graphxplore.MetaDataHandling.MetaData(tables: Iterable[str])[source]
Bases: object
This class is the core of all ETL processes in graphxplore. It stores the metadata of a relational dataset: its CSV tables, variables, primary/foreign keys, and further information on the variable level. For more information, check out VariableInfo.
- Parameters:
tables (Iterable[str]) – The names of the CSV tables of the relational data set (without .csv)
- add_foreign_key(table: str, foreign_table: str, foreign_key: str) None[source]
Adds a foreign key and its foreign origin table to a specified table.
foreign_key must be a variable of table and a primary key of foreign_table.
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
foreign_table (str) – The name of the foreign table, i.e. its file name with ‘.csv’ omitted
foreign_key (str) – The name of the foreign key, i.e. the column name
- Returns:
- Return type:
None
- add_table(table: str) None[source]
Add a table to the metadata
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
- Return type:
None
- add_variable(table: str, variable: str) VariableInfo[source]
Adds a variable for a specified table to the metadata.
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
variable (str) – The name of the variable, i.e. the column name
- Returns:
Returns the generated variable info that can be filled
- Return type:
VariableInfo
- assign_label(table: str, label: str) None[source]
Assigns a label to a table, e.g. describing the contained data. Existing labels will be overwritten
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
label (str) – The label that should be assigned, should not contain whitespace or line breaks
- Return type:
None
- assign_primary_key(table: str, primary_key: str) None[source]
Assigns a primary key for the specified table. Raises an exception if table already has a primary key, or primary_key is not a variable of table.
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
primary_key (str) – The name of the primary key, i.e. the column name
- Return type:
None
- change_primary_key(table: str, primary_key: str) None[source]
Changes the primary key for the specified table. Raises an exception if primary_key is not a variable of table.
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
primary_key (str) – The name of the primary key, i.e. the column name
- Return type:
None
- static from_dict(data: dict) MetaData[source]
Parses a MetaData object from a dictionary.
- Parameters:
data (dict) – The input dictionary
- Returns:
Returns the parsed object
- Return type:
MetaData
- get_foreign_keys(table) Dict[str, str][source]
Retrieve all foreign keys of a table as a dictionary with the keys being the foreign keys and the values the foreign tables.
- Parameters:
table – The name of the table, i.e. its file name with ‘.csv’ omitted
- Returns:
Returns the foreign key/table dictionary
- Return type:
Dict[str, str]
- get_label(table) str[source]
Returns the label of the table or the empty string if none was assigned.
- Parameters:
table – The name of the table, i.e. its file name with ‘.csv’ omitted
- Returns:
Returns the table label as string
- Return type:
str
- get_primary_key(table: str) str[source]
Retrieve the primary key of the table. Returns the empty string if not yet assigned.
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
- Returns:
Returns the name of the primary key
- Return type:
str
- get_table_names() List[str][source]
Retrieves all table names (file names with ‘.csv’ omitted) of the metadata.
- Returns:
Returns the list of table names
- Return type:
List[str]
- get_total_nof_variables() int[source]
Counts all variables in the metadata across all tables
- Returns:
Returns the count as an integer
- Return type:
int
- get_variable(table: str, variable: str) VariableInfo[source]
Retrieves the information about a given variable for inspection or altering.
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
variable (str) – The name of the variable, i.e. the column name
- Returns:
Returns the variable information object
- Return type:
VariableInfo
- get_variable_names(table: str) List[str][source]
Retrieves all variable names for a given table.
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
- Returns:
Returns the list of retrieved variable names
- Return type:
List[str]
- has_artifacts() bool[source]
Checks if at least one variable has annotated artifacts.
- Returns:
Returns True if at least one annotated artifact was found
- Return type:
bool
- has_primary_key(table: str) bool[source]
Checks if the table has a primary key assigned.
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
- Returns:
Returns True if a primary key was assigned
- Return type:
bool
- static load_from_json(filepath: str, file_encoding: str | None = None) MetaData[source]
Reads a MetaData object from a JSON file.
- Parameters:
filepath (str) – Path to the JSON
file_encoding (str | None) – file encoding of the JSON
- Returns:
Returns a MetaData object
- Return type:
MetaData
- remove_foreign_key(table: str, foreign_key: str) None[source]
Removes a foreign key for a specified table.
foreign_key must be a variable of table.
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
foreign_key (str) – The name of the foreign key, i.e. the column name
- Returns:
- Return type:
None
- remove_table(table: str) None[source]
Remove a table from the metadata. All foreign keys pointing to this table are changed to categorical variables
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
- Return type:
None
- remove_variable(table: str, variable: str) None[source]
Delete the variable for the specified table from the metadata. If it is a primary key, foreign key references from other tables are deleted as well
- Parameters:
table (str) – The name of the table, i.e. its file name with ‘.csv’ omitted
variable (str) – The name of the variable, i.e. the column name
- Return type:
None
- class graphxplore.MetaDataHandling.MetaDataGenerator(csv_data: str | Dict[str, List[Dict[str, str]]], artifact_mode: ArtifactMode = ArtifactMode.DataTypeMismatchAndOutliers, missing_vals: Iterable[str | None] = ('', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na'), nof_read_lines: int = 1000000, str_len_free_text: int = 300, binning_threshold: int = 20, categorical_threshold: int = 20, file_encoding: str | None = None)[source]
Bases: object
This class extracts metadata information from CSV files. It detects primary keys and foreign key relations between tables. Additionally, VariableInfo objects are inferred for all columns of all CSV files. The result is a MetaData object.
- Parameters:
csv_data (str | Dict[str, List[Dict[str, str]]]) – The input data as CSV files either as directory path containing the CSV files or as dictionary of table name and table data as list of dictionaries per row
artifact_mode (ArtifactMode) – Determines if artifacts should be detected and at what level. For further information check ArtifactMode
missing_vals (Iterable[str | None]) – These strings indicate missing values, defaults to empty string, None and variations of “NaN” and “Na”
nof_read_lines (int) – Maximum number of lines read from each CSV file to gather metadata, defaults to 1 million
str_len_free_text (int) – Strings with at least this number of characters are considered free text, and the variable containing them is disfavored as a primary key. Defaults to 300.
binning_threshold (int) – Metric variables with more than this number of distinct values are marked for binning, defaults to 20
categorical_threshold (int) – Variables with at most this number of distinct values are considered categorical, defaults to 20
file_encoding (str | None) – The file encoding of the CSV files (ascii, utf-8,…) in chardet definition. Is guessed if not specified. Only used when CSV data is read from a directory, defaults to None
- assign_foreign_keys() None[source]
Assigns foreign keys by detecting occurrences of primary keys in other tables.
- Return type:
None
- extract_variable_infos() None[source]
Extracts all information about variables contained in CSVs of the source directory and detects primary keys. Artifacts are detected if specified by artifact_mode. For more information, check out ArtifactMode.
- Return type:
None
- gather_meta_data() MetaData[source]
Extracts variables and primary/foreign key relations between CSV files. Each CSV MUST contain a column with unique entries and no empty cells. Among these columns, a primary key is selected, prioritizing integer columns. Additionally, VariableInfo objects are inferred for all columns of all CSV files. Artifacts are detected if specified by artifact_mode. For more information, check out ArtifactMode.
- Returns:
Returns the gathered metadata
- Return type:
MetaData
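The primary key requirement of gather_meta_data (a column with unique entries and no empty cells, with integer columns preferred) can be illustrated with a small standalone sketch. It is not the MetaDataGenerator implementation: the function name choose_primary_key, the shortened list of missing values, and falling back to the first remaining candidate are assumptions. The input mirrors the documented dictionary form of csv_data for a single table, a list of row dictionaries with string values.

```python
def choose_primary_key(rows, missing_vals=("", "NaN", "NA")):
    """Pick a primary key column from a table given as a list of row dicts."""
    if not rows:
        return None
    candidates = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        # A primary key must not contain empty cells
        if any(v is None or v in missing_vals for v in values):
            continue
        # A primary key must be unique across all rows
        if len(set(values)) == len(values):
            candidates.append(column)

    def is_integer_column(column):
        for row in rows:
            try:
                int(row[column])
            except ValueError:
                return False
        return True

    # Prefer integer columns among the candidates
    for column in candidates:
        if is_integer_column(column):
            return column
    return candidates[0] if candidates else None
```

A table without any unique, fully populated column yields None here, which corresponds to the case the documentation rules out with its MUST requirement.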
- class graphxplore.MetaDataHandling.MetricDistribution(median: int | float, q1: int | float, q3: int | float, lower_fence: int | float, upper_fence: int | float, outliers: List[int | float], missing_count: int, artifact_count: int)[source]
Bases: object
Value distribution for metric variables
- Parameters:
median (int | float) – The median
q1 (int | float) – The first quartile
q3 (int | float) – The third quartile
lower_fence (int | float) – The maximum of the minimal value and q1 - 1.5 interquartile range
upper_fence (int | float) – The minimum of the maximal value and q3 + 1.5 interquartile range
outliers (List[int | float]) – The list of values smaller than lower_fence or larger than upper_fence which are not annotated as artifacts
missing_count (int) – Count of cell values which are missing values
artifact_count (int) – Count of artifact cells
- artifact_count: int
- lower_fence: int | float
- median: int | float
- missing_count: int
- outliers: List[int | float]
- q1: int | float
- q3: int | float
- upper_fence: int | float
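The fence definitions above can be worked through in a short standalone sketch. It is an illustration only, not the library's computation: the function name summarize_metric is hypothetical, the quartiles come from statistics.quantiles with its default method (GraphXplore may use a different quantile definition), outliers are reported as distinct values, and the missing and artifact counts are left out.

```python
from statistics import median, quantiles


def summarize_metric(values):
    """Compute box plot statistics following the field definitions above."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    # Fences are clipped to the observed value range
    lower_fence = max(min(values), q1 - 1.5 * iqr)
    upper_fence = min(max(values), q3 + 1.5 * iqr)
    outliers = sorted(
        v for v in set(values) if v < lower_fence or v > upper_fence
    )
    return {
        "median": median(values),
        "q1": q1,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }
```

Because the fences are clipped to the minimal and maximal value, a fence never lies outside the observed data, and outliers can only exist on a side where the unclipped fence is inside the value range.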
- class graphxplore.MetaDataHandling.VariableInfo(name: str, table: str, labels: List[str], variable_type: VariableType, data_type: DataType, description: str | None = None, data_type_distribution: Dict[DataType, float] | None = None, default_value: str | int | float | None = None, value_distribution: MetricDistribution | CategoricalDistribution | None = None, binning: BinningInfo | None = None, artifacts: List[str] | None = None, reviewed: bool | None = None)[source]
Bases: object
This class contains all information about a single variable.
- Parameters:
name (str) – The name of the variable, i.e. the column name
table (str) – The name of the origin table, i.e. its file name with ‘.csv’ omitted
labels (List[str]) – One or multiple labels describing the variable
variable_type (VariableType) – The type of variable
data_type (DataType) – The data type of the variable
description (str | None) – A description of the variable, e.g. containing units of measurement or SNOMED CT codes, defaults to None
data_type_distribution (Dict[DataType, float] | None) – The percentage of different data types in the variable, defaults to None
default_value (str | int | float | None) – The optional default value of the variable, defaults to None
value_distribution (MetricDistribution | CategoricalDistribution | None) – Distribution of values depending on the variable type, defaults to None
binning (BinningInfo | None) – The binning info of the variable, defaults to None
artifacts (List[str] | None) – Potential artifacts existing for the variable, defaults to None
reviewed (bool | None) – Variable information was reviewed, defaults to None
- add_label(label: str)[source]
Add a label to the variable, e.g. describing its broad category such as “Laboratory”.
- Parameters:
label (str) – The label to add, must only contain letters, numbers, hyphens or underscores
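The character restriction on labels can be checked with a regular expression, as in the following sketch. The helper is_valid_label is hypothetical and not part of GraphXplore, and restricting letters to ASCII is an assumption; the library may accept a wider set of letters.

```python
import re

# Letters, digits, hyphens and underscores only (ASCII letters assumed)
LABEL_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_label(label):
    """Return True if the label consists only of the allowed characters."""
    return LABEL_PATTERN.fullmatch(label) is not None
```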
- artifacts: List[str] | None = None
- binning: BinningInfo | None = None
- static cast_value(val_to_cast: str, data_type: DataType) str | int | float | None[source]
Casts a value to the specified data type. Returns None if the value could not be cast.
- Parameters:
val_to_cast (str) – The value which should be cast
data_type (DataType) – The data type to which the value should be cast
- Returns:
Returns the cast value
- Return type:
str | int | float | None
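The cast-or-None behaviour can be sketched without the library as follows. The DataType enum below is a stand-in that mirrors the documented members, and relying on Python's built-in int() and float() conversions is an assumption; the actual casting rules of GraphXplore (for example for strings such as "3.0" with an integer target) may differ.

```python
from enum import Enum


class DataType(str, Enum):
    """Stand-in mirroring the documented DataType members."""
    String = "String"
    Integer = "Integer"
    Decimal = "Decimal"


def cast_value(val_to_cast, data_type):
    """Cast a string to the given data type, returning None on failure."""
    try:
        if data_type is DataType.Integer:
            return int(val_to_cast)
        if data_type is DataType.Decimal:
            return float(val_to_cast)
        return str(val_to_cast)
    except ValueError:
        # The value does not match the requested data type
        return None
```

A None result is what allows a cell to be treated as a data type mismatch in this sketch: the string "3.5" casts cleanly to a decimal but yields None for an integer target.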
- cast_value_to_data_type(val_to_cast: str | int | float) str | int | float | None[source]
Casts a value to the data type of the variable. Returns None if the value could not be cast.
- Parameters:
val_to_cast (str | int | float) – The value which should be cast
- Returns:
Returns the cast value
- Return type:
str | int | float | None
- default_value: str | int | float | None = None
- description: str | None = None
- detect_artifacts_and_value_distribution(value_count_dict: Dict[str, int], artifact_mode: ArtifactMode = ArtifactMode.DataTypeMismatchAndOutliers, missing_vals: Iterable[str | None] = ('', 'NaN', 'Na', 'NA', 'NAN', 'nan', 'na'))[source]
Calculates a value distribution based on the variable type. For categorical variables, a distribution with counts is calculated. For metric variables, data for a box-and-whisker plot is calculated. For primary and foreign keys, no value distribution is derived. For more information, check out MetricDistribution and CategoricalDistribution. Depending on artifact_mode, artifacts are detected on the specified level. Pre-existing artifacts are preserved. For more information, check out ArtifactMode.
- Parameters:
value_count_dict (Dict[str, int]) – The dictionary with all values (as string) and their occurrence count
artifact_mode (ArtifactMode) – Determines if artifacts should be detected and at what level. For further information check ArtifactMode
missing_vals (Iterable[str | None]) – The list of possible missing values as string
- static from_dict(var_name: str, table: str, variable_dict: dict) VariableInfo[source]
Parses a VariableInfo object from a dictionary.
- Parameters:
var_name (str) – The name of the variable, i.e. the column name
table (str) – The name of the origin table, i.e. its file name with ‘.csv’ omitted
variable_dict (dict) – A dictionary containing all information about the variable
- Returns:
Returns the parsed object
- Return type:
VariableInfo
- labels: List[str]
- name: str
- reviewed: bool | None = None
- table: str
- to_dict() dict[source]
Converts the object to a dictionary.
- Returns:
Returns the generated dictionary
- Return type:
dict
- value_distribution: MetricDistribution | CategoricalDistribution | None = None
- variable_type: VariableType