Modules

Submodules

klib.describe module

Functions for descriptive analytics.

author: Andreas Kanz
klib.describe.cat_plot(data: pandas.core.frame.DataFrame, figsize: Tuple = (18, 18), top: int = 3, bottom: int = 3, bar_color_top: str = '#5ab4ac', bar_color_bottom: str = '#d8b365')[source]

Two-dimensional visualization of the number and frequency of categorical features.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

figsize : Tuple, optional

Use to control the figure size, by default (18, 18)

top : int, optional

Show the “top” most frequent values in a column, by default 3

bottom : int, optional

Show the “bottom” most frequent values in a column, by default 3

bar_color_top : str, optional

Use to control the color of the bars indicating the most common values, by default “#5ab4ac”

bar_color_bottom : str, optional

Use to control the color of the bars indicating the least common values, by default “#d8b365”

cmap : str, optional

The mapping from data values to color space, by default “BrBG”

Returns:
GridSpec

gs: Figure with array of Axes objects
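The top/bottom counts this plot visualizes can be reproduced with plain pandas. A minimal sketch (the helper name `top_bottom_counts` is illustrative, not part of klib):

```python
import pandas as pd

def top_bottom_counts(series: pd.Series, top: int = 3, bottom: int = 3):
    """Return the `top` most and `bottom` least frequent values of a column."""
    counts = series.value_counts()  # sorted by frequency, descending
    return counts.head(top), counts.tail(bottom)

df = pd.DataFrame({"color": ["red"] * 5 + ["blue"] * 3 + ["green"] * 2 + ["cyan"]})
top_vals, bottom_vals = top_bottom_counts(df["color"], top=2, bottom=2)
```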

klib.describe.corr_mat(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray, str, None] = None, method: str = 'pearson', colored: bool = True) → Union[pandas.core.frame.DataFrame, Any][source]

Returns a color-encoded correlation matrix.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

split : Optional[str], optional

Type of split to be performed {None, “pos”, “neg”, “high”, “low”}, by default None

threshold : float, optional

Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3

target : Optional[Union[pd.DataFrame, str]], optional

Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None

method : str, optional
method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
  • pearson: measures linear relationships and requires normally distributed and homoscedastic data.
  • spearman: ranked/ordinal correlation, measures monotonic relationships.
  • kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust in smaller datasets than “spearman”.

colored : bool, optional

If True the negative values in the correlation matrix are colored in red, by default True

Returns:
Union[pd.DataFrame, pd.Styler]

If colored = True - corr: Pandas Styler object If colored = False - corr: Pandas DataFrame
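The split/threshold filtering can be sketched with plain pandas; this is an illustrative approximation of the described behavior, not klib's implementation:

```python
import pandas as pd

def corr_mat_sketch(df, split=None, threshold=0.0, method="pearson"):
    """Illustrative only: filter a correlation matrix by split/threshold."""
    corr = df.corr(method=method)
    if split == "pos":
        corr = corr.where(corr >= threshold)   # keep positive corrs above threshold
    elif split == "neg":
        corr = corr.where(corr <= -threshold)  # keep negative corrs below -threshold
    elif split == "high":
        corr = corr.where(corr.abs() >= threshold)
    elif split == "low":
        corr = corr.where(corr.abs() <= threshold)
    return corr

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
corr = corr_mat_sketch(df, split="pos", threshold=0.5)
```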

klib.describe.corr_plot(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.series.Series, str, None] = None, method: str = 'pearson', cmap: str = 'BrBG', figsize: Tuple = (12, 10), annot: bool = True, dev: bool = False, **kwargs)[source]

Two-dimensional visualization of the correlation between feature-columns excluding NA values.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

split : Optional[str], optional
Type of split to be performed {None, “pos”, “neg”, “high”, “low”}, by default None
  • None: visualize all correlations between the feature-columns
  • pos: visualize all positive correlations between the feature-columns above the threshold
  • neg: visualize all negative correlations between the feature-columns below the threshold
  • high: visualize all correlations between the feature-columns for which abs (corr) > threshold is True
  • low: visualize all correlations between the feature-columns for which abs(corr) < threshold is True
threshold : float, optional

Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3

target : Optional[Union[pd.Series, str]], optional

Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None

method : str, optional
method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
  • pearson: measures linear relationships and requires normally distributed and homoscedastic data.
  • spearman: ranked/ordinal correlation, measures monotonic relationships.
  • kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust in smaller datasets than “spearman”.
cmap : str, optional

The mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default “BrBG”

figsize : Tuple, optional

Use to control the figure size, by default (12, 10)

annot : bool, optional

Use to show or hide annotations, by default True

dev : bool, optional

Display figure settings in the plot by setting dev = True. If False, the settings are not displayed, by default False

Keyword Arguments : optional

Additional elements to control the visualization of the plot, e.g.:

  • mask: bool, default True
    If set to False the entire correlation matrix, including the upper triangle is shown. Set dev = False in this case to avoid overlap.
  • vmax: float, default is calculated from the given correlation coefficients.
    Value between -1 and 1 with vmin <= vmax, limits the upper range of the cbar.
  • vmin: float, default is calculated from the given correlation coefficients.
    Value between -1 and 1 with vmin <= vmax, limits the lower range of the cbar.
  • linewidths: float, default 0.5
    Controls the line-width between the squares.
  • annot_kws: dict, default {“size” : 10}
    Controls the font size of the annotations. Only available when annot = True.
  • cbar_kws: dict, default {“shrink”: .95, “aspect”: 30}
    Controls the size of the colorbar.
  • Many more kwargs are available, i.e. “alpha” to control blending, or options to adjust labels, ticks …

Kwargs can be supplied through a dictionary of key-value pairs (see above).

Returns:
ax: matplotlib Axes

Returns the Axes object with the plot for further tweaking.
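The default mask=True behavior hides the redundant upper triangle of the matrix. A sketch of how such a mask can be built with NumPy (klib handles this internally):

```python
import numpy as np
import pandas as pd

corr = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]}).corr()
# Boolean mask that is True on and above the diagonal; cells where the
# mask is True are hidden, leaving only the lower triangle visible.
mask = np.triu(np.ones_like(corr, dtype=bool))
```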

klib.describe.dist_plot(data: pandas.core.frame.DataFrame, mean_color: str = 'orange', size: int = 2.5, fill_range: Tuple = (0.025, 0.975), showall: bool = False, kde_kws: Dict[str, Any] = None, rug_kws: Dict[str, Any] = None, fill_kws: Dict[str, Any] = None, font_kws: Dict[str, Any] = None)[source]

Two-dimensional visualization of the distribution of non-binary numerical features.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

mean_color : str, optional

Color of the vertical line indicating the mean of the data, by default “orange”

size : int, optional

Controls the plot size, by default 2.5

fill_range : Tuple, optional

Set the quantiles for shading. Default spans 95% of the data, which is about two std. deviations above and below the mean, by default (0.025, 0.975)

showall : bool, optional

Set to True to remove the output limit of 20 plots, by default False

kde_kws : Dict[str, Any], optional

Keyword arguments for kdeplot(), by default {“color”: “k”, “alpha”: 0.75, “linewidth”: 1.5, “bw_adjust”: 0.8}

rug_kws : Dict[str, Any], optional

Keyword arguments for rugplot(), by default {“color”: “#ff3333”, “alpha”: 0.15, “lw”: 3, “height”: 0.075}

fill_kws : Dict[str, Any], optional

Keyword arguments to control the fill, by default {“color”: “#80d4ff”, “alpha”: 0.2}

font_kws : Dict[str, Any], optional

Keyword arguments to control the font, by default {“color”: “#111111”, “weight”: “normal”, “size”: 11}

Returns:
ax: matplotlib Axes

Returns the Axes object with the plot for further tweaking.
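The shading bounds implied by fill_range are simply quantiles of the data; a small sketch with plain pandas:

```python
import pandas as pd

s = pd.Series(range(1, 101))
# fill_range=(0.025, 0.975) shades the central ~95% of the data.
lower, upper = s.quantile(0.025), s.quantile(0.975)
mean = s.mean()  # position of the vertical mean line
```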

klib.describe.missingval_plot(data: pandas.core.frame.DataFrame, cmap: str = 'PuBuGn', figsize: Tuple = (20, 20), sort: bool = False, spine_color: str = '#EEEEEE')[source]

Two-dimensional visualization of the missing values in a dataset.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

cmap : str, optional

Any valid colormap can be used. E.g. “Greys”, “RdPu”. More information can be found in the matplotlib documentation, by default “PuBuGn”

figsize : Tuple, optional

Use to control the figure size, by default (20, 20)

sort : bool, optional

Sort columns based on missing values in descending order and drop columns without any missing values, by default False

spine_color : str, optional

Set to “None” to hide the spines on all plots or use any valid matplotlib color argument, by default “#EEEEEE”

Returns:
GridSpec

gs: Figure with array of Axes objects
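The per-column counts underlying the plot, including the effect of sort=True, can be sketched with plain pandas:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, None], "b": [1, 2, None], "c": [1, 2, 3]})
mv = df.isna().sum()
# sort=True: order columns by missing values in descending order and
# drop columns without any missing values.
mv_sorted = mv[mv > 0].sort_values(ascending=False)
```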

klib.clean module

Functions for data cleaning.

author: Andreas Kanz
klib.clean.clean_column_names(data: pandas.core.frame.DataFrame, hints: bool = True) → pandas.core.frame.DataFrame[source]

Cleans the column names of the provided Pandas Dataframe and optionally provides hints on duplicate and long column names.

Parameters:
data : pd.DataFrame

Original Dataframe with columns to be cleaned

hints : bool, optional

Print out hints on column name duplication and column name length, by default True

Returns:
pd.DataFrame

Pandas DataFrame with cleaned column names
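A simplified sketch of this kind of column-name cleaning (klib's actual rules handle more cases, such as CamelCase splitting and duplicate names):

```python
import re
import pandas as pd

def clean_names_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: strip, replace non-alphanumerics with '_', lowercase."""
    df = df.copy()
    df.columns = [
        re.sub(r"[^0-9a-zA-Z_]+", "_", str(col).strip()).lower().strip("_")
        for col in df.columns
    ]
    return df

df = clean_names_sketch(pd.DataFrame(columns=[" First Name ", "Zip-Code"]))
```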

klib.clean.convert_datatypes(data: pandas.core.frame.DataFrame, category: bool = True, cat_threshold: float = 0.05, cat_exclude: Optional[List[Union[str, int]]] = None) → pandas.core.frame.DataFrame[source]

Converts columns to best possible dtypes using dtypes supporting pd.NA. Temporarily not converting to integers due to an issue in pandas. This is expected to be fixed in pandas 1.1. See https://github.com/pandas-dev/pandas/issues/33803

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame

category : bool, optional

Change dtypes of columns with dtype “object” to “category”. Set threshold using cat_threshold or exclude columns using cat_exclude, by default True

cat_threshold : float, optional

Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.05

cat_exclude : Optional[List[Union[str, int]]], optional

List of columns to exclude from categorical conversion, by default None

Returns:
pd.DataFrame

Pandas DataFrame with converted Datatypes
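The category step can be sketched as follows; an illustrative approximation based on the unique-value ratio, not klib's exact logic:

```python
import pandas as pd

def to_category_sketch(df, cat_threshold=0.05, cat_exclude=None):
    """Illustrative only: convert object columns with a low unique-value ratio."""
    df = df.copy()
    exclude = set(cat_exclude or [])
    for col in df.select_dtypes(include="object").columns:
        if col not in exclude and df[col].nunique() / len(df) < cat_threshold:
            df[col] = df[col].astype("category")
    return df

df = pd.DataFrame({"city": ["NY", "LA"] * 10, "id": [str(i) for i in range(20)]})
out = to_category_sketch(df, cat_threshold=0.5)
```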

klib.clean.data_cleaning(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 0.9, drop_threshold_rows: float = 0.9, drop_duplicates: bool = True, convert_dtypes: bool = True, col_exclude: Optional[List[str]] = None, category: bool = True, cat_threshold: float = 0.03, cat_exclude: Optional[List[Union[str, int]]] = None, clean_col_names: bool = True, show: str = 'changes') → pandas.core.frame.DataFrame[source]

Perform initial data cleaning tasks on a dataset, such as dropping single valued and empty rows, empty columns as well as optimizing the datatypes.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame

drop_threshold_cols : float, optional

Drop columns with NA-ratio equal to or above the specified threshold, by default 0.9

drop_threshold_rows : float, optional

Drop rows with NA-ratio equal to or above the specified threshold, by default 0.9

drop_duplicates : bool, optional

Drop duplicate rows, keeping the first occurrence. This step comes after the dropping of missing values, by default True

convert_dtypes : bool, optional

Convert dtypes using pd.convert_dtypes(), by default True

col_exclude : Optional[List[str]], optional

Specify a list of columns to exclude from dropping, by default None

category : bool, optional

Enable changing dtypes of “object” columns to “category”. Set threshold using cat_threshold. Requires convert_dtypes=True, by default True

cat_threshold : float, optional

Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.03

cat_exclude : Optional[List[str]], optional

List of columns to exclude from categorical conversion, by default None

clean_col_names : bool, optional

Cleans the column names and provides hints on duplicate and long names, by default True

show : str, optional

{“all”, “changes”, None}, by default “changes”. Specify verbosity of the output:

  • “all”: Print information about the data before and after cleaning as well as information about changes and memory usage (deep). Please be aware, that this can slow down the function by quite a bit.
  • “changes”: Print out differences in the data before and after cleaning.
  • None: No information about the data and the data cleaning is printed.
Returns:
pd.DataFrame

Cleaned Pandas DataFrame

See also

convert_datatypes
Convert columns to best possible dtypes.
drop_missing
Flexibly drop columns and rows.
_memory_usage
Gives the total memory usage in megabytes.
_missing_vals
Metrics about missing values in the dataset.

Notes

The category dtype is not grouped in the summary, unless it contains exactly the same categories.
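The main steps can be sketched with plain pandas; an approximation of the described behavior, not the actual implementation:

```python
import pandas as pd

def data_cleaning_sketch(df, drop_threshold_cols=0.9, drop_threshold_rows=0.9):
    """Illustrative approximation of the described cleaning steps."""
    df = df.copy()
    # Drop columns with NA-ratio equal to or above the column threshold.
    df = df.loc[:, df.isna().mean() < drop_threshold_cols]
    # Drop rows with NA-ratio equal to or above the row threshold.
    df = df.loc[df.isna().mean(axis=1) < drop_threshold_rows]
    # Drop duplicate rows, keeping the first occurrence.
    df = df.drop_duplicates(keep="first")
    # Convert to the best possible pd.NA-supporting dtypes.
    return df.convert_dtypes()

df = pd.DataFrame({"a": [1, 1, 2, None], "empty": [None] * 4})
out = data_cleaning_sketch(df)
```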

klib.clean.drop_missing(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 1, drop_threshold_rows: float = 1, col_exclude: Optional[List[str]] = None) → pandas.core.frame.DataFrame[source]

Drops completely empty columns and rows by default and optionally provides flexibility to loosen restrictions to drop additional non-empty columns and rows based on the fraction of NA-values.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame

drop_threshold_cols : float, optional

Drop columns with NA-ratio equal to or above the specified threshold, by default 1

drop_threshold_rows : float, optional

Drop rows with NA-ratio equal to or above the specified threshold, by default 1

col_exclude : Optional[List[str]], optional

Specify a list of columns to exclude from dropping. The excluded columns do not affect the drop thresholds, by default None

Returns:
pd.DataFrame

Pandas DataFrame without any empty columns or rows

Notes

Columns are dropped first
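With the default thresholds of 1, the behavior corresponds to dropping fully empty columns first, then fully empty rows; a sketch in plain pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],  # completely empty -> dropped
})
# Columns are dropped first, then any rows that are left fully empty.
df = df.dropna(axis=1, how="all").dropna(axis=0, how="all")
```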

klib.clean.mv_col_handling(data: pandas.core.frame.DataFrame, target: Union[str, pandas.core.series.Series, List[T], None] = None, mv_threshold: float = 0.1, corr_thresh_features: float = 0.5, corr_thresh_target: float = 0.3, return_details: bool = False) → pandas.core.frame.DataFrame[source]

Converts columns with a high ratio of missing values into binary features and eventually drops them based on their correlation with other features and the target variable. This function follows a three-step process:
  1) Identify features with a high ratio of missing values (above ‘mv_threshold’).
  2) Identify high correlations of these features among themselves and with other features in the dataset (above ‘corr_thresh_features’).
  3) Features with a high ratio of missing values and high correlation among each other are dropped unless they correlate reasonably well with the target variable (above ‘corr_thresh_target’).

Note: If no target is provided, the process exits after step two and drops columns identified up to this point.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame

target : Optional[Union[str, pd.Series, List]], optional

Specify target for correlation. I.e. label column to generate only the correlations between each feature and the label, by default None

mv_threshold : float, optional

Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger than mv_threshold are candidates for dropping and undergo further analysis, by default 0.1

corr_thresh_features : float, optional

Value between 0 <= threshold <= 1. Maximum correlation a previously identified feature (with a high mv-ratio) is allowed to have with another feature. If this threshold is exceeded, the feature undergoes further analysis, by default 0.5

corr_thresh_target : float, optional

Value between 0 <= threshold <= 1. Minimum required correlation of a remaining feature (i.e. feature with a high mv-ratio and high correlation to another existing feature) with the target. If this threshold is not met the feature is ultimately dropped, by default 0.3

return_details : bool, optional

Provides flexibility to return intermediary results, by default False

Returns:
pd.DataFrame

Updated Pandas DataFrame

optional:
cols_mv: Columns with missing values included in the analysis
drop_cols: List of dropped columns
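Step 1 of the process can be sketched with plain pandas: flag high-missing columns and view them as binary is-missing indicators (illustrative, not klib's implementation):

```python
import pandas as pd

mv_threshold = 0.1
df = pd.DataFrame({"mostly_missing": [1, None, None, None], "full": [1, 2, 3, 4]})
# Step 1: features with a missing-value ratio above mv_threshold.
mv_ratio = df.isna().mean()
cols_mv = mv_ratio[mv_ratio > mv_threshold].index.tolist()
# These columns are viewed as binary features: 1 where the value was missing.
binary = df[cols_mv].isna().astype(int)
```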

klib.preprocess module

Functions for data preprocessing.

author: Andreas Kanz
klib.preprocess.feature_selection_pipe(var_thresh=VarianceThreshold(threshold=0.1), select_from_model=SelectFromModel(estimator=LassoCV(cv=4, random_state=408), threshold='0.1*median'), select_percentile=SelectPercentile(percentile=95), var_thresh_info=PipeInfo(name='after var_thresh'), select_from_model_info=PipeInfo(name='after select_from_model'), select_percentile_info=PipeInfo(name='after select_percentile'))[source]

Preprocessing operations for feature selection.

Parameters:
var_thresh: default, VarianceThreshold(threshold=0.1)

Specify a threshold to drop low variance features.

select_from_model: default, SelectFromModel(LassoCV(cv=4, random_state=408), threshold=”0.1 * median”)

Specify an estimator which is used for selecting features based on importance weights.

select_percentile: default, SelectPercentile(f_classif, percentile=95)

Specify a score-function and a percentile value of features to keep.

var_thresh_info, select_from_model_info, select_percentile_info

Prints the shape of the dataset after applying the respective function. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].

Returns:
Pipeline
klib.preprocess.num_pipe(imputer=IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408), scaler=RobustScaler())[source]

Standard preprocessing operations on numerical data.

Parameters:
imputer: default, IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408)
scaler: default, RobustScaler()
Returns:
Pipeline
klib.preprocess.cat_pipe(imputer=SimpleImputer(strategy='most_frequent'), encoder=OneHotEncoder(handle_unknown='ignore'), scaler=MaxAbsScaler(), encoder_info=PipeInfo(name='after encoding categorical data'))[source]

Standard preprocessing operations on categorical data.

Parameters:
imputer: default, SimpleImputer(strategy=’most_frequent’)
encoder: default, OneHotEncoder(handle_unknown=’ignore’)

Encode categorical features as a one-hot numeric array.

scaler: default, MaxAbsScaler()

Scale each feature by its maximum absolute value. MaxAbsScaler() does not shift/center the data, and thus does not destroy any sparsity. It is recommended to check for outliers before applying MaxAbsScaler().

encoder_info:

Prints the shape of the dataset at the end of ‘cat_pipe’. Set to ‘None’ to avoid printing the shape of dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].

Returns:
Pipeline
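A minimal pipeline mirroring cat_pipe's default steps can be assembled directly with scikit-learn (omitting the PipeInfo shape-printing step):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

# Impute -> one-hot encode -> scale, as in cat_pipe's defaults.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("scaler", MaxAbsScaler()),
])
X = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)
out = pipe.fit_transform(X)  # one column per category
```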
klib.preprocess.train_dev_test_split(data, target, dev_size=0.1, test_size=0.1, stratify=None, random_state=408)[source]

Split a dataset and a label column into train, dev and test sets.

Parameters:
data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots.
target: string, list, np.array or pd.Series, default None

Specify the target, i.e. the label column or label values, to be separated from the data and split alongside it.

dev_size: float, default 0.1

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the dev split.

test_size: float, default 0.1

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.

stratify: target column, default None

If not None, data is split in a stratified fashion, using the input as the class labels.

random_state: integer, default 408

Random_state is the seed used by the random number generator.

Returns:
tuple: Tuple containing train-dev-test split of inputs.
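The three-way split can be sketched by shuffling the indices once and slicing; an illustrative NumPy/pandas approximation of this behavior:

```python
import numpy as np
import pandas as pd

def train_dev_test_sketch(data, dev_size=0.1, test_size=0.1, random_state=408):
    """Illustrative only: shuffle indices once, then slice into three parts."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(data))
    n_dev = int(len(data) * dev_size)
    n_test = int(len(data) * test_size)
    dev, test, train = idx[:n_dev], idx[n_dev:n_dev + n_test], idx[n_dev + n_test:]
    return data.iloc[train], data.iloc[dev], data.iloc[test]

df = pd.DataFrame({"x": range(100), "y": range(100)})
train, dev, test = train_dev_test_sketch(df)
```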

Module contents

Data Science Module for Python

klib is an easy to use Python library of customized functions for cleaning and analyzing data.

klib.clean_column_names(data: pandas.core.frame.DataFrame, hints: bool = True) → pandas.core.frame.DataFrame[source]

Cleans the column names of the provided Pandas Dataframe and optionally provides hints on duplicate and long column names.

Parameters:
data : pd.DataFrame

Original Dataframe with columns to be cleaned

hints : bool, optional

Print out hints on column name duplication and column name length, by default True

Returns:
pd.DataFrame

Pandas DataFrame with cleaned column names

klib.convert_datatypes(data: pandas.core.frame.DataFrame, category: bool = True, cat_threshold: float = 0.05, cat_exclude: Optional[List[Union[str, int]]] = None) → pandas.core.frame.DataFrame[source]

Converts columns to best possible dtypes using dtypes supporting pd.NA. Temporarily not converting to integers due to an issue in pandas. This is expected to be fixed in pandas 1.1. See https://github.com/pandas-dev/pandas/issues/33803

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame

category : bool, optional

Change dtypes of columns with dtype “object” to “category”. Set threshold using cat_threshold or exclude columns using cat_exclude, by default True

cat_threshold : float, optional

Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.05

cat_exclude : Optional[List[Union[str, int]]], optional

List of columns to exclude from categorical conversion, by default None

Returns:
pd.DataFrame

Pandas DataFrame with converted Datatypes

klib.data_cleaning(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 0.9, drop_threshold_rows: float = 0.9, drop_duplicates: bool = True, convert_dtypes: bool = True, col_exclude: Optional[List[str]] = None, category: bool = True, cat_threshold: float = 0.03, cat_exclude: Optional[List[Union[str, int]]] = None, clean_col_names: bool = True, show: str = 'changes') → pandas.core.frame.DataFrame[source]

Perform initial data cleaning tasks on a dataset, such as dropping single valued and empty rows, empty columns as well as optimizing the datatypes.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame

drop_threshold_cols : float, optional

Drop columns with NA-ratio equal to or above the specified threshold, by default 0.9

drop_threshold_rows : float, optional

Drop rows with NA-ratio equal to or above the specified threshold, by default 0.9

drop_duplicates : bool, optional

Drop duplicate rows, keeping the first occurrence. This step comes after the dropping of missing values, by default True

convert_dtypes : bool, optional

Convert dtypes using pd.convert_dtypes(), by default True

col_exclude : Optional[List[str]], optional

Specify a list of columns to exclude from dropping, by default None

category : bool, optional

Enable changing dtypes of “object” columns to “category”. Set threshold using cat_threshold. Requires convert_dtypes=True, by default True

cat_threshold : float, optional

Ratio of unique values below which categories are inferred and column dtype is changed to categorical, by default 0.03

cat_exclude : Optional[List[str]], optional

List of columns to exclude from categorical conversion, by default None

clean_col_names : bool, optional

Cleans the column names and provides hints on duplicate and long names, by default True

show : str, optional

{“all”, “changes”, None}, by default “changes”. Specify verbosity of the output:

  • “all”: Print information about the data before and after cleaning as well as information about changes and memory usage (deep). Please be aware, that this can slow down the function by quite a bit.
  • “changes”: Print out differences in the data before and after cleaning.
  • None: No information about the data and the data cleaning is printed.
Returns:
pd.DataFrame

Cleaned Pandas DataFrame

See also

convert_datatypes
Convert columns to best possible dtypes.
drop_missing
Flexibly drop columns and rows.
_memory_usage
Gives the total memory usage in megabytes.
_missing_vals
Metrics about missing values in the dataset.

Notes

The category dtype is not grouped in the summary, unless it contains exactly the same categories.

klib.drop_missing(data: pandas.core.frame.DataFrame, drop_threshold_cols: float = 1, drop_threshold_rows: float = 1, col_exclude: Optional[List[str]] = None) → pandas.core.frame.DataFrame[source]

Drops completely empty columns and rows by default and optionally provides flexibility to loosen restrictions to drop additional non-empty columns and rows based on the fraction of NA-values.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame

drop_threshold_cols : float, optional

Drop columns with NA-ratio equal to or above the specified threshold, by default 1

drop_threshold_rows : float, optional

Drop rows with NA-ratio equal to or above the specified threshold, by default 1

col_exclude : Optional[List[str]], optional

Specify a list of columns to exclude from dropping. The excluded columns do not affect the drop thresholds, by default None

Returns:
pd.DataFrame

Pandas DataFrame without any empty columns or rows

Notes

Columns are dropped first

klib.mv_col_handling(data: pandas.core.frame.DataFrame, target: Union[str, pandas.core.series.Series, List[T], None] = None, mv_threshold: float = 0.1, corr_thresh_features: float = 0.5, corr_thresh_target: float = 0.3, return_details: bool = False) → pandas.core.frame.DataFrame[source]

Converts columns with a high ratio of missing values into binary features and eventually drops them based on their correlation with other features and the target variable. This function follows a three-step process:
  1) Identify features with a high ratio of missing values (above ‘mv_threshold’).
  2) Identify high correlations of these features among themselves and with other features in the dataset (above ‘corr_thresh_features’).
  3) Features with a high ratio of missing values and high correlation among each other are dropped unless they correlate reasonably well with the target variable (above ‘corr_thresh_target’).

Note: If no target is provided, the process exits after step two and drops columns identified up to this point.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame

target : Optional[Union[str, pd.Series, List]], optional

Specify target for correlation. I.e. label column to generate only the correlations between each feature and the label, by default None

mv_threshold : float, optional

Value between 0 <= threshold <= 1. Features with a missing-value-ratio larger than mv_threshold are candidates for dropping and undergo further analysis, by default 0.1

corr_thresh_features : float, optional

Value between 0 <= threshold <= 1. Maximum correlation a previously identified feature (with a high mv-ratio) is allowed to have with another feature. If this threshold is exceeded, the feature undergoes further analysis, by default 0.5

corr_thresh_target : float, optional

Value between 0 <= threshold <= 1. Minimum required correlation of a remaining feature (i.e. feature with a high mv-ratio and high correlation to another existing feature) with the target. If this threshold is not met the feature is ultimately dropped, by default 0.3

return_details : bool, optional

Provides flexibility to return intermediary results, by default False

Returns:
pd.DataFrame

Updated Pandas DataFrame

optional:
cols_mv: Columns with missing values included in the analysis
drop_cols: List of dropped columns
klib.pool_duplicate_subsets(data: pandas.core.frame.DataFrame, col_dupl_thresh: float = 0.2, subset_thresh: float = 0.2, min_col_pool: int = 3, exclude: Optional[List[str]] = None, return_details=False) → pandas.core.frame.DataFrame[source]

Checks for duplicates in subsets of columns and pools them. This can reduce the number of columns in the data without losing much information. Suitable columns are combined into subsets and tested for duplicates. In case sufficient duplicates can be found, the respective columns are aggregated into a “pooled_var” column. Identical numbers in the “pooled_var” column indicate identical information in the respective rows.

Note: It is advised to exclude features that provide sufficient informational content by themselves as well as the target column by using the “exclude” setting.
Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame

col_dupl_thresh : float, optional

Columns with a ratio of duplicates higher than “col_dupl_thresh” are considered in the further analysis. Columns with a lower ratio are not considered for pooling, by default 0.2

subset_thresh : float, optional

The first subset with a duplicate threshold higher than “subset_thresh” is chosen and aggregated. If no subset reaches the threshold, the algorithm continues with continuously smaller subsets until “min_col_pool” is reached, by default 0.2

min_col_pool : int, optional

Minimum number of columns to pool. The algorithm attempts to combine as many columns as possible to suitable subsets and stops when “min_col_pool” is reached, by default 3

exclude : Optional[List[str]], optional

List of column names to be excluded from the analysis. These columns are passed through without modification, by default None

return_details : bool, optional

Provides flexibility to return intermediary results, by default False

Returns:
pd.DataFrame

DataFrame with low cardinality columns pooled

optional:
subset_cols: List of columns used as subset
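The column-selection criterion and the pooling idea can be sketched with plain pandas; this is illustrative only, not the actual algorithm, which searches over column subsets:

```python
import pandas as pd

col_dupl_thresh = 0.2
df = pd.DataFrame({
    "low_card": ["a", "a", "b", "b", "b"],  # many repeats -> pooling candidate
    "high_card": [1, 2, 3, 4, 5],           # all unique -> passed through
})
# Columns whose ratio of duplicated values exceeds col_dupl_thresh.
dupl_ratio = {col: df[col].duplicated().mean() for col in df.columns}
candidates = [col for col, ratio in dupl_ratio.items() if ratio > col_dupl_thresh]
# Pooling: one integer id per distinct row of the candidate subset.
pooled = df[candidates].apply(tuple, axis=1).factorize()[0]
```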
klib.cat_plot(data: pandas.core.frame.DataFrame, figsize: Tuple = (18, 18), top: int = 3, bottom: int = 3, bar_color_top: str = '#5ab4ac', bar_color_bottom: str = '#d8b365')[source]

Two-dimensional visualization of the number and frequency of categorical features.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

figsize : Tuple, optional

Use to control the figure size, by default (18, 18)

top : int, optional

Show the “top” most frequent values in a column, by default 3

bottom : int, optional

Show the “bottom” most frequent values in a column, by default 3

bar_color_top : str, optional

Use to control the color of the bars indicating the most common values, by default “#5ab4ac”

bar_color_bottom : str, optional

Use to control the color of the bars indicating the least common values, by default “#d8b365”

cmap : str, optional

The mapping from data values to color space, by default “BrBG”

Returns:
GridSpec

gs: Figure with array of Axes objects

klib.corr_mat(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray, str, None] = None, method: str = 'pearson', colored: bool = True) → Union[pandas.core.frame.DataFrame, Any][source]

Returns a color-encoded correlation matrix.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

split : Optional[str], optional

Type of split to be performed, by default None {None, “pos”, “neg”, “high”, “low”}

threshold : float, optional

Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3

target : Optional[Union[pd.DataFrame, str]], optional

Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None

method : str, optional

method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
  • pearson: measures linear relationships and requires normally distributed and homoscedastic data.
  • spearman: ranked/ordinal correlation, measures monotonic relationships.
  • kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust in smaller datasets than “spearman”.

colored : bool, optional

If True the negative values in the correlation matrix are colored in red, by default True

Returns:
Union[pd.DataFrame, pd.Styler]

If colored = True: corr as a Pandas Styler object. If colored = False: corr as a Pandas DataFrame.
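The underlying computation can be approximated directly with pandas; a sketch on synthetic data (the column names are illustrative, and the split = “high” filtering is emulated with a simple mask rather than klib's own logic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(408)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["d"] = df["a"] * 2 + rng.normal(scale=0.1, size=100)  # strongly correlated with "a"

corr = df.corr(method="pearson")
# emulate split="high" with its 0.3 default threshold: keep only strong correlations
high = corr.where(corr.abs() > 0.3)
```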

klib.corr_plot(data: pandas.core.frame.DataFrame, split: Optional[str] = None, threshold: float = 0, target: Union[pandas.core.series.Series, str, None] = None, method: str = 'pearson', cmap: str = 'BrBG', figsize: Tuple = (12, 10), annot: bool = True, dev: bool = False, **kwargs)[source]

Two-dimensional visualization of the correlation between feature-columns excluding NA values.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

split : Optional[str], optional
Type of split to be performed {None, “pos”, “neg”, “high”, “low”}, by default None
  • None: visualize all correlations between the feature-columns
  • pos: visualize all positive correlations between the feature-columns above the threshold
  • neg: visualize all negative correlations between the feature-columns below the threshold
  • high: visualize all correlations between the feature-columns for which abs(corr) > threshold is True
  • low: visualize all correlations between the feature-columns for which abs(corr) < threshold is True
threshold : float, optional

Value between 0 and 1 to set the correlation threshold, by default 0 unless split = “high” or split = “low”, in which case default is 0.3

target : Optional[Union[pd.Series, str]], optional

Specify target for correlation. E.g. label column to generate only the correlations between each feature and the label, by default None

method : str, optional
method: {“pearson”, “spearman”, “kendall”}, by default “pearson”
  • pearson: measures linear relationships and requires normally distributed and homoscedastic data.
  • spearman: ranked/ordinal correlation, measures monotonic relationships.
  • kendall: ranked/ordinal correlation, measures monotonic relationships. Computationally more expensive but more robust in smaller datasets than “spearman”.
cmap : str, optional

The mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default “BrBG”

figsize : Tuple, optional

Use to control the figure size, by default (12, 10)

annot : bool, optional

Use to show or hide annotations, by default True

dev : bool, optional

Display figure settings in the plot by setting dev = True. If False, the settings are not displayed, by default False

Keyword Arguments : optional

Additional elements to control the visualization of the plot, e.g.:

  • mask: bool, default True
    If set to False, the entire correlation matrix, including the upper triangle, is shown. Set dev = False in this case to avoid overlap.
  • vmax: float, default is calculated from the given correlation coefficients.
    Upper bound of the colorbar range; must satisfy vmin <= vmax <= 1.
  • vmin: float, default is calculated from the given correlation coefficients.
    Lower bound of the colorbar range; must satisfy -1 <= vmin <= vmax.
  • linewidths: float, default 0.5
    Controls the line width between the squares.
  • annot_kws: dict, default {“size” : 10}
    Controls the font size of the annotations. Only available when annot = True.
  • cbar_kws: dict, default {“shrink”: .95, “aspect”: 30}
    Controls the size of the colorbar.
  • Many more kwargs are available, e.g. “alpha” to control blending, or options to adjust labels, ticks …

Kwargs can be supplied through a dictionary of key-value pairs (see above).

Returns:
ax: matplotlib Axes

Returns the Axes object with the plot for further tweaking.
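The mask keyword described above hides the redundant upper triangle of the matrix; its effect can be sketched with numpy and pandas alone (column names are illustrative, and masking the diagonal here is an assumption, not necessarily klib's exact mask):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(1).normal(size=(50, 4)), columns=list("abcd"))
corr = df.corr()

# hide the upper triangle (including the diagonal in this sketch)
mask = np.triu(np.ones_like(corr, dtype=bool))
masked = corr.mask(mask)
```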

klib.dist_plot(data: pandas.core.frame.DataFrame, mean_color: str = 'orange', size: int = 2.5, fill_range: Tuple = (0.025, 0.975), showall: bool = False, kde_kws: Dict[str, Any] = None, rug_kws: Dict[str, Any] = None, fill_kws: Dict[str, Any] = None, font_kws: Dict[str, Any] = None)[source]

Two-dimensional visualization of the distribution of non-binary numerical features.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

mean_color : str, optional

Color of the vertical line indicating the mean of the data, by default “orange”

size : int, optional

Controls the plot size, by default 2.5

fill_range : Tuple, optional

Set the quantiles for shading. The default (0.025, 0.975) spans 95% of the data, which is about two standard deviations above and below the mean

showall : bool, optional

Set to True to remove the output limit of 20 plots, by default False

kde_kws : Dict[str, Any], optional

Keyword arguments for kdeplot(), by default {“color”: “k”, “alpha”: 0.75, “linewidth”: 1.5, “bw_adjust”: 0.8}

rug_kws : Dict[str, Any], optional

Keyword arguments for rugplot(), by default {“color”: “#ff3333”, “alpha”: 0.15, “lw”: 3, “height”: 0.075}

fill_kws : Dict[str, Any], optional

Keyword arguments to control the fill, by default {“color”: “#80d4ff”, “alpha”: 0.2}

font_kws : Dict[str, Any], optional

Keyword arguments to control the font, by default {“color”: “#111111”, “weight”: “normal”, “size”: 11}

Returns:
ax: matplotlib Axes

Returns the Axes object with the plot for further tweaking.
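The fill_range shading corresponds to plain quantiles of the data; a sketch of the computation on synthetic data:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(0).normal(size=1000))

# the fill_range default (0.025, 0.975) bounds the shaded band
lower, upper = s.quantile(0.025), s.quantile(0.975)
inside = ((s >= lower) & (s <= upper)).mean()  # fraction of data inside the band
```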

klib.missingval_plot(data: pandas.core.frame.DataFrame, cmap: str = 'PuBuGn', figsize: Tuple = (20, 20), sort: bool = False, spine_color: str = '#EEEEEE')[source]

Two-dimensional visualization of the missing values in a dataset.

Parameters:
data : pd.DataFrame

2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots

cmap : str, optional

Any valid colormap can be used. E.g. “Greys”, “RdPu”. More information can be found in the matplotlib documentation, by default “PuBuGn”

figsize : Tuple, optional

Use to control the figure size, by default (20, 20)

sort : bool, optional

Sort columns based on missing values in descending order and drop columns without any missing values, by default False

spine_color : str, optional

Set to “None” to hide the spines on all plots or use any valid matplotlib color argument, by default “#EEEEEE”

Returns:
GridSpec

gs: Figure with array of Axes objects
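The per-column statistics behind the plot, including the effect of sort = True, can be reproduced with pandas:

```python
import pandas as pd

# illustrative data: "b" has two missing values, "a" one, "c" none
df = pd.DataFrame({"a": [1, None, 3], "b": [None, None, 3], "c": [1, 2, 3]})

missing = df.isna().sum()
# sort=True: order columns by missing count and drop columns without missing values
sorted_missing = missing[missing > 0].sort_values(ascending=False)
```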

klib.feature_selection_pipe(var_thresh=VarianceThreshold(threshold=0.1), select_from_model=SelectFromModel(estimator=LassoCV(cv=4, random_state=408), threshold='0.1*median'), select_percentile=SelectPercentile(percentile=95), var_thresh_info=PipeInfo(name='after var_thresh'), select_from_model_info=PipeInfo(name='after select_from_model'), select_percentile_info=PipeInfo(name='after select_percentile'))[source]

Preprocessing operations for feature selection.

Parameters:
var_thresh: default, VarianceThreshold(threshold=0.1)

Specify a threshold to drop low variance features.

select_from_model: default, SelectFromModel(LassoCV(cv=4, random_state=408), threshold=”0.1 * median”)

Specify an estimator which is used for selecting features based on importance weights.

select_percentile: default, SelectPercentile(f_classif, percentile=95)

Specify a score-function and a percentile value of features to keep.

var_thresh_info, select_from_model_info, select_percentile_info

Prints the shape of the dataset after applying the respective function. Set to ‘None’ to avoid printing the shape of the dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].

Returns:
Pipeline
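As a small illustration of the first step above, VarianceThreshold drops features whose variance falls below the threshold; a sketch with illustrative data (scikit-learn required):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# the first column is constant (zero variance), the second varies
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

selector = VarianceThreshold(threshold=0.1)
X_out = selector.fit_transform(X)  # only the second column survives
```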
klib.num_pipe(imputer=IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408), scaler=RobustScaler())[source]

Standard preprocessing operations on numerical data.

Parameters:
imputer: default, IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=25, n_jobs=4, random_state=408), random_state=408)
scaler: default, RobustScaler()
Returns:
Pipeline
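A hedged sketch of an equivalent pipeline assembled directly from scikit-learn parts; note that IterativeImputer requires the experimental import, and the estimator defaults here are simplified relative to the ones above:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
pipe = Pipeline([
    ("imputer", IterativeImputer(random_state=408)),
    ("scaler", RobustScaler()),
])
X_out = pipe.fit_transform(X)  # imputed and robust-scaled, no NaNs remain
```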
klib.cat_pipe(imputer=SimpleImputer(strategy='most_frequent'), encoder=OneHotEncoder(handle_unknown='ignore'), scaler=MaxAbsScaler(), encoder_info=PipeInfo(name='after encoding categorical data'))[source]

Standard preprocessing operations on categorical data.

Parameters:
imputer: default, SimpleImputer(strategy=’most_frequent’)
encoder: default, OneHotEncoder(handle_unknown=’ignore’)

Encode categorical features as a one-hot numeric array.

scaler: default, MaxAbsScaler()

Scale each feature by its maximum absolute value. MaxAbsScaler() does not shift/center the data, and thus does not destroy any sparsity. It is recommended to check for outliers before applying MaxAbsScaler().

encoder_info:

Prints the shape of the dataset at the end of ‘cat_pipe’. Set to ‘None’ to avoid printing the shape of the dataset. This parameter can also be set as a hyperparameter, e.g. ‘pipeline__pipeinfo-1’: [None] or ‘pipeline__pipeinfo-1__name’: [‘my_custom_name’].

Returns:
Pipeline
klib.train_dev_test_split(data, target, dev_size=0.1, test_size=0.1, stratify=None, random_state=408)[source]

Split a dataset and a label column into train, dev and test sets.

Parameters:
data: 2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is preserved in the resulting splits.
target: string, list, np.array or pd.Series, default None

Specify the target for the split, e.g. the label column to be separated from the feature columns.

dev_size: float, default 0.1

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the dev split.

test_size: float, default 0.1

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.

stratify: target column, default None

If not None, data is split in a stratified fashion, using the input as the class labels.

random_state: integer, default 408

Seed used by the random number generator.

Returns:
tuple: Tuple containing train-dev-test split of inputs.
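The three-way split can be sketched as two successive scikit-learn train_test_split calls; integer sizes are used here for determinism, whereas the function above takes fractions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100), "y": [0] * 50 + [1] * 50})
data, target = df[["x"]], df["y"]

# split off the test set first, then carve the dev set out of the remainder
X_rem, X_test, y_rem, y_test = train_test_split(data, target, test_size=10, random_state=408)
X_train, X_dev, y_train, y_dev = train_test_split(X_rem, y_rem, test_size=10, random_state=408)
```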