Machine Learning Functions

This section documents the complete ML workflow functions introduced in edaflow v0.13.0.

ML Configuration & Setup

setup_ml_experiment([data, target_column, ...])

Set up a complete ML experiment with train/validation/test splits.

configure_model_pipeline(data_config[, ...])

Configure a preprocessing pipeline for the ML experiment.

validate_ml_data([experiment_data, ...])

Validate data quality for ML experiments.

Model Comparison & Ranking

compare_models(models[, X_train, X_val, ...])

Compare multiple models across various performance metrics.

rank_models(comparison_df, primary_metric[, ...])

Rank models based on performance metrics.

display_leaderboard([comparison_results, ...])

Display a visual leaderboard of model performance.

export_model_comparison(comparison_df, filepath)

Export model comparison results to file.

Hyperparameter Optimization

optimize_hyperparameters(model, ...[, cv, ...])

Optimize hyperparameters using various search strategies.

grid_search_models(models, param_grids, ...)

Perform grid search optimization for multiple models.

bayesian_optimization(model, param_space, ...)

Perform Bayesian optimization using scikit-optimize.

random_search_models(models, ...[, n_iter, ...])

Perform random search optimization for multiple models.

Performance Visualization

plot_learning_curves(model, X_train, y_train)

Plot learning curves to analyze model performance vs training set size.

plot_validation_curves(model, X_train, ...)

Plot validation curves for hyperparameter analysis.

plot_roc_curves(models, X_val, y_val[, ...])

Plot ROC curves for multiple models (binary classification only).

plot_precision_recall_curves(models, X_val, ...)

Plot Precision-Recall curves for multiple models.

plot_confusion_matrix(model, X_val, y_val[, ...])

Plot confusion matrix for a classification model.

plot_feature_importance(model, feature_names)

Plot feature importance for models that support it.

Model Artifacts & Tracking

save_model_artifacts(model, model_name, ...)

Save complete model artifacts including model, config, and metadata.

load_model_artifacts(artifact_path[, ...])

Load model artifacts from saved files.

track_experiment(experiment_name, ...[, ...])

Track experiment results in a CSV log file.

create_model_report(model, model_name, ...)

Generate a comprehensive model report.

Function Details

Configuration Functions

edaflow.ml.setup_ml_experiment(data: DataFrame | None = None, target_column: str | None = None, test_size: float = 0.2, validation_size: float | None = None, random_state: int = 42, stratify: bool = True, verbose: bool = True, experiment_name: str | None = None, X: DataFrame | None = None, y: Series | None = None, val_size: float | None = None, primary_metric: str | None = None) -> Dict[str, Any]

Set up a complete ML experiment with train/validation/test splits.

This function supports two calling patterns:

1. DataFrame with target column: setup_ml_experiment(data, target_column)
2. Sklearn-style: setup_ml_experiment(X=X, y=y)

Parameters:

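data : pd.DataFrame, optional

Input DataFrame containing the features and the target column (used together with target_column)

primary_metric : str, optional

The main metric used for model selection and ranking (e.g., 'roc_auc', 'f1', 'accuracy', 'r2'). It is stored in the experiment config for downstream use.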

target_column : str, optional

Name of the target variable column (required if using data parameter)

test_size : float, default=0.2

Proportion of data to use for testing

validation_size : float, optional

Proportion of training data to use for validation (default=0.2)

random_state : int, default=42

Random seed for reproducibility

stratify : bool, default=True

Whether to stratify the splits (for classification)

verbose : bool, default=True

Whether to print experiment setup details

experiment_name : str, optional

Name for the experiment (default='ml_experiment')

X : pd.DataFrame, optional

Feature matrix (alternative to data + target_column pattern)

y : pd.Series, optional

Target vector (alternative to data + target_column pattern)

val_size : float, optional

Alternative name for validation_size (for compatibility)

Returns:

Dict[str, Any]

Dictionary containing X_train, X_val, X_test, y_train, y_val, y_test, feature_names, target_name, and experiment_config

Examples:

# Method 1: DataFrame with target column (recommended)
>>> experiment = ml.setup_ml_experiment(df, target_column='target')

# Method 2: Sklearn-style (also supported)
>>> X = df.drop('target', axis=1)
>>> y = df['target']
>>> experiment = ml.setup_ml_experiment(X=X, y=y)
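The parameter descriptions state that test_size is a proportion of the whole data set and validation_size a proportion of the training data. The following standard-library sketch shows how those two fractions would combine into split sizes under that reading. The helper split_indices is hypothetical and is not part of edaflow; the library's actual splitting (including stratification) may differ.

```python
import random


def split_indices(n_rows, test_size=0.2, validation_size=0.2, random_state=42):
    """Shuffle row indices and cut them into train/validation/test parts.

    Assumption: validation_size is a fraction of the rows that remain
    after the test split has been taken out.
    """
    indices = list(range(n_rows))
    random.Random(random_state).shuffle(indices)

    n_test = round(n_rows * test_size)
    n_val = round((n_rows - n_test) * validation_size)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 640 160 200
```

With both fractions at 0.2, the validation set is 16% of the original rows rather than 20%, because it is carved out of the 80% that remains after the test split.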

edaflow.ml.configure_model_pipeline(data_config: Dict[str, Any], numerical_strategy: str = 'standard', categorical_strategy: str = 'onehot', handle_missing: str = 'drop', verbose: bool = True) -> Pipeline

Configure a preprocessing pipeline for the ML experiment.

Parameters:

data_config : Dict[str, Any]

Configuration dictionary from setup_ml_experiment

numerical_strategy : str, default='standard'

Scaling strategy for numerical features ('standard', 'minmax', 'robust', 'none')

categorical_strategy : str, default='onehot'

Encoding strategy for categorical features ('onehot', 'target', 'none')

handle_missing : str, default='drop'

Missing value strategy ('drop', 'impute', 'flag')

verbose : bool, default=True

Whether to print pipeline configuration details

Returns:

Pipeline

Configured sklearn Pipeline for preprocessing

edaflow.ml.validate_ml_data(experiment_data: Dict[str, Any] | None = None, check_missing: bool = True, check_duplicates: bool = True, check_outliers: bool = True, verbose: bool = True, X: DataFrame | None = None, y: Series | None = None, check_cardinality: bool = True, check_distributions: bool = True) -> Dict[str, Any]

Validate data quality for ML experiments.

This function supports two calling patterns:

1. Experiment config: validate_ml_data(experiment_config)
2. Sklearn-style: validate_ml_data(X=X_train, y=y_train)

Parameters:

experiment_data : Dict[str, Any], optional

Dictionary from setup_ml_experiment containing splits

check_missing : bool, default=True

Whether to check for missing values

check_duplicates : bool, default=True

Whether to check for duplicate rows

check_outliers : bool, default=True

Whether to check for outliers

verbose : bool, default=True

Whether to print validation details

X : pd.DataFrame, optional

Feature data (alternative to experiment_data)

y : pd.Series, optional

Target data (alternative to experiment_data)

check_cardinality : bool, default=True

Whether to check feature cardinality

check_distributions : bool, default=True

Whether to check feature distributions

Returns:

Dict[str, Any]

Dictionary containing validation results and recommendations

Model Comparison Functions

edaflow.ml.compare_models(models: Dict[str, BaseEstimator], X_train: DataFrame | None = None, X_val: DataFrame | None = None, X_test: DataFrame | None = None, y_train: Series | None = None, y_val: Series | None = None, y_test: Series | None = None, experiment_config: Dict[str, Any] | None = None, problem_type: str = 'auto', metrics: List[str] | None = None, cv_folds: int = 5, scoring: str | List[str] | None = None, verbose: bool = True) -> DataFrame

Compare multiple models across various performance metrics.

Parameters:

models : Dict[str, BaseEstimator]

Dictionary of model name -> fitted model pairs

X_train : pd.DataFrame, optional

Training features (can be provided via experiment_config)

X_val : pd.DataFrame, optional

Validation features (can be provided via experiment_config)

X_test : pd.DataFrame, optional

Test features for final evaluation

y_train : pd.Series, optional

Training target (can be provided via experiment_config)

y_val : pd.Series, optional

Validation target (can be provided via experiment_config)

y_test : pd.Series, optional

Test target for final evaluation

experiment_config : Dict[str, Any], optional

Complete experiment configuration from setup_ml_experiment(). If provided, X_train, X_val, y_train, and y_val are extracted from it.

problem_type : str, default='auto'

'classification', 'regression', or 'auto' to detect

metrics : List[str], optional

Specific metrics to calculate. If None, uses default metrics

cv_folds : int, default=5

Number of cross-validation folds (if applicable)

scoring : str or List[str], optional

Scoring metric(s) to use for evaluation

verbose : bool, default=True

Whether to print comparison progress

Returns:

pd.DataFrame

Comparison results with models as rows and metrics as columns

edaflow.ml.rank_models(comparison_df: DataFrame, primary_metric: str, ascending: bool = False, secondary_metrics: List[str] | None = None, weights: Dict[str, float] | None = None, return_format: str = 'dataframe') -> DataFrame | List[Dict]

Rank models based on performance metrics.

Parameters:

comparison_df : pd.DataFrame

Results from compare_models()

primary_metric : str

Main metric to rank by

ascending : bool, default=False

Whether to sort in ascending order (True for error metrics)

secondary_metrics : List[str], optional

Additional metrics to consider for tie-breaking

weights : Dict[str, float], optional

Weights for weighted ranking across multiple metrics

return_format : str, default='dataframe'

Format to return: 'dataframe' or 'list'

Returns:

Union[pd.DataFrame, List[Dict]]

If 'dataframe': Ranked models DataFrame

If 'list': List of dicts for easy access with pattern [0]["model_name"]

Examples:

# DataFrame format (default)
ranked_df = rank_models(results, 'accuracy')
best_model = ranked_df.iloc[0]['model']

# List format for easier access
ranked_list = rank_models(results, 'accuracy', return_format='list')
best_model = ranked_list[0]["model_name"]
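The interaction of ascending and weights can be illustrated without the library. The sketch below ranks plain dictionaries; rank_rows is a hypothetical helper, the scores are made-up illustration values, and the weighted-sum rule is an assumption, since this page does not specify how rank_models combines weighted metrics.

```python
def rank_rows(rows, primary_metric, ascending=False, weights=None):
    """Return rows sorted best-first, each with a 1-based 'rank' key added.

    Assumption: when weights are given, models are ordered by the weighted
    sum of the listed metrics; otherwise by primary_metric alone.
    """
    def score(row):
        if weights:
            return sum(row[metric] * w for metric, w in weights.items())
        return row[primary_metric]

    # ascending=False means "higher is better", so sort in reverse.
    ordered = sorted(rows, key=score, reverse=not ascending)
    return [dict(row, rank=i) for i, row in enumerate(ordered, start=1)]


# Illustrative scores only, not real benchmark results.
results = [
    {"model_name": "logreg", "accuracy": 0.86, "f1": 0.80},
    {"model_name": "forest", "accuracy": 0.90, "f1": 0.78},
    {"model_name": "svm", "accuracy": 0.88, "f1": 0.84},
]

by_accuracy = rank_rows(results, "accuracy")
weighted = rank_rows(results, "accuracy", weights={"accuracy": 0.5, "f1": 0.5})
print(by_accuracy[0]["model_name"])  # forest
print(weighted[0]["model_name"])     # svm
```

For error metrics such as RMSE, where lower is better, the same call with ascending=True puts the smallest value first.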

edaflow.ml.display_leaderboard(comparison_results: DataFrame = None, ranked_df: DataFrame = None, sort_by: str = None, ascending: bool = False, show_std: bool = False, top_n: int = 10, show_metrics: List[str] | None = None, highlight_best: bool = True, figsize: Tuple[int, int] = (12, 8)) -> None

Display a visual leaderboard of model performance.

Parameters:

comparison_results : pd.DataFrame, optional

Raw comparison results from compare_models()

ranked_df : pd.DataFrame, optional

Pre-ranked results (alternative to comparison_results)

sort_by : str, optional

Metric to sort by. If None, uses first numeric column

ascending : bool, default=False

Whether to sort in ascending order

show_std : bool, default=False

Whether to show standard deviation columns

top_n : int, default=10

Number of top models to display

show_metrics : List[str], optional

Specific metrics to show. If None, shows all numeric metrics

highlight_best : bool, default=True

Whether to highlight the best performing model

figsize : Tuple[int, int], default=(12, 8)

Figure size for the visualization

edaflow.ml.export_model_comparison(comparison_df: DataFrame, filepath: str, include_config: bool = True, format: str = 'csv') -> None

Export model comparison results to file.

Parameters:

comparison_df : pd.DataFrame

Comparison results to export

filepath : str

Path where to save the file

include_config : bool, default=True

Whether to include experiment configuration

format : str, default='csv'

Export format ('csv', 'excel', 'json')

Hyperparameter Tuning Functions

edaflow.ml.optimize_hyperparameters(model: BaseEstimator, param_distributions: Dict[str, Any], X_train: DataFrame, y_train: Series, cv: int = 5, scoring: str = 'auto', n_iter: int = 50, method: str = 'random', verbose: bool = True, random_state: int = 42) -> Dict[str, Any]

Optimize hyperparameters using various search strategies.

Parameters:

model : BaseEstimator

The base model to optimize

param_distributions : Dict[str, Any]

Parameter distributions to search over

X_train : pd.DataFrame

Training features

y_train : pd.Series

Training target

cv : int, default=5

Number of cross-validation folds

scoring : str, default='auto'

Scoring metric ('auto' detects based on problem type)

n_iter : int, default=50

Number of iterations for random/bayesian search

method : str, default='random'

Search method ('grid', 'random', 'bayesian')

verbose : bool, default=True

Whether to print optimization progress

random_state : int, default=42

Random seed for reproducibility

Returns:

Dict[str, Any]

Dictionary containing best model, parameters, and optimization results
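The practical difference between the 'grid' and 'random' methods is how many candidate settings get evaluated. The standard-library sketch below enumerates both kinds of candidate list for a small search space. The parameter names and values are illustrative, the helpers are hypothetical, and edaflow's own sampling may differ (this sketch draws with replacement, for instance).

```python
import itertools
import random

# Illustrative search space; each key maps to the values to try.
param_space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}


def grid_candidates(space):
    """Every combination of the listed values (exhaustive grid search)."""
    names = list(space)
    return [dict(zip(names, combo))
            for combo in itertools.product(*space.values())]


def random_candidates(space, n_iter, random_state=42):
    """n_iter independent draws, one value per parameter (random search)."""
    rng = random.Random(random_state)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_iter)]


grid = grid_candidates(param_space)
sampled = random_candidates(param_space, n_iter=10)
print(len(grid), len(sampled))  # 27 10
```

Each candidate is fitted once per cross-validation fold, so with cv=5 the grid above costs 135 model fits while the random search costs 50.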

edaflow.ml.grid_search_models(models: Dict[str, BaseEstimator], param_grids: Dict[str, Dict[str, Any]], X_train: DataFrame, y_train: Series, cv: int = 5, scoring: str = 'auto', verbose: bool = True) -> Dict[str, Dict[str, Any]]

Perform grid search optimization for multiple models.

Parameters:

models : Dict[str, BaseEstimator]

Dictionary of model name -> model pairs

param_grids : Dict[str, Dict[str, Any]]

Dictionary of model name -> parameter grid pairs

X_train : pd.DataFrame

Training features

y_train : pd.Series

Training target

cv : int, default=5

Number of cross-validation folds

scoring : str, default='auto'

Scoring metric

verbose : bool, default=True

Whether to print progress

Returns:

Dict[str, Dict[str, Any]]

Dictionary of model name -> optimization results pairs

edaflow.ml.bayesian_optimization(model: BaseEstimator, param_space: Dict[str, Any], X_train: DataFrame, y_train: Series, n_calls: int = 50, cv: int = 5, scoring: str = 'auto', verbose: bool = True, random_state: int = 42) -> Dict[str, Any]

Perform Bayesian optimization using scikit-optimize.

Parameters:

model : BaseEstimator

The base model to optimize

param_space : Dict[str, Any]

Parameter space definition (requires skopt)

X_train : pd.DataFrame

Training features

y_train : pd.Series

Training target

n_calls : int, default=50

Number of optimization calls

cv : int, default=5

Number of cross-validation folds

scoring : str, default='auto'

Scoring metric

verbose : bool, default=True

Whether to print progress

random_state : int, default=42

Random seed for reproducibility

Returns:

Dict[str, Any]

Optimization results including best parameters and convergence plot

edaflow.ml.random_search_models(models: Dict[str, BaseEstimator], param_distributions: Dict[str, Dict[str, Any]], X_train: DataFrame, y_train: Series, n_iter: int = 50, cv: int = 5, scoring: str = 'auto', verbose: bool = True, random_state: int = 42) -> Dict[str, Dict[str, Any]]

Perform random search optimization for multiple models.

Parameters:

models : Dict[str, BaseEstimator]

Dictionary of model name -> model pairs

param_distributions : Dict[str, Dict[str, Any]]

Dictionary of model name -> parameter distributions pairs

X_train : pd.DataFrame

Training features

y_train : pd.Series

Training target

n_iter : int, default=50

Number of random search iterations

cv : int, default=5

Number of cross-validation folds

scoring : str, default='auto'

Scoring metric

verbose : bool, default=True

Whether to print progress

random_state : int, default=42

Random seed for reproducibility

Returns:

Dict[str, Dict[str, Any]]

Dictionary of model name -> optimization results pairs

Performance Visualization Functions

edaflow.ml.plot_learning_curves(model: BaseEstimator, X_train: DataFrame, y_train: Series, cv: int = 5, scoring: str = 'auto', train_sizes: ndarray | None = None, title: str | None = None, figsize: Tuple[int, int] = (10, 6), show_std: bool = True) -> Figure

Plot learning curves to analyze model performance vs training set size.

Parameters:

model : BaseEstimator

The model to analyze

X_train : pd.DataFrame

Training features

y_train : pd.Series

Training target

cv : int, default=5

Number of cross-validation folds

scoring : str, default='auto'

Scoring metric

train_sizes : np.ndarray, optional

Training set sizes to use

title : str, optional

Plot title

figsize : Tuple[int, int], default=(10, 6)

Figure size

show_std : bool, default=True

Whether to show standard deviation bands

Returns:

plt.Figure

The matplotlib figure

edaflow.ml.plot_validation_curves(model: BaseEstimator, X_train: DataFrame, y_train: Series, param_name: str, param_range: List[Any], cv: int = 5, scoring: str = 'auto', title: str | None = None, figsize: Tuple[int, int] = (10, 6), log_scale: bool = False) -> Figure

Plot validation curves for hyperparameter analysis.

Parameters:

model : BaseEstimator

The model to analyze

X_train : pd.DataFrame

Training features

y_train : pd.Series

Training target

param_name : str

Name of the parameter to vary

param_range : List[Any]

Range of parameter values to test

cv : int, default=5

Number of cross-validation folds

scoring : str, default='auto'

Scoring metric

title : str, optional

Plot title

figsize : Tuple[int, int], default=(10, 6)

Figure size

log_scale : bool, default=False

Whether to use log scale for x-axis

Returns:

plt.Figure

The matplotlib figure

edaflow.ml.plot_roc_curves(models: Dict[str, BaseEstimator], X_val: DataFrame, y_val: Series, title: str | None = None, figsize: Tuple[int, int] = (10, 8)) -> Figure

Plot ROC curves for multiple models (binary classification only).

Parameters:

models : Dict[str, BaseEstimator]

Dictionary of model name -> fitted model pairs

X_val : pd.DataFrame

Validation features

y_val : pd.Series

Validation target

title : str, optional

Plot title

figsize : Tuple[int, int], default=(10, 8)

Figure size

Returns:

plt.Figure

The matplotlib figure

edaflow.ml.plot_precision_recall_curves(models: Dict[str, BaseEstimator], X_val: DataFrame, y_val: Series, title: str | None = None, figsize: Tuple[int, int] = (10, 8)) -> Figure

Plot Precision-Recall curves for multiple models.

Parameters:

models : Dict[str, BaseEstimator]

Dictionary of model name -> fitted model pairs

X_val : pd.DataFrame

Validation features

y_val : pd.Series

Validation target

title : str, optional

Plot title

figsize : Tuple[int, int], default=(10, 8)

Figure size

Returns:

plt.Figure

The matplotlib figure

edaflow.ml.plot_confusion_matrix(model: BaseEstimator, X_val: DataFrame, y_val: Series, normalize: bool = False, title: str | None = None, figsize: Tuple[int, int] = (8, 6)) -> Figure

Plot confusion matrix for a classification model.

Parameters:

model : BaseEstimator

Fitted classification model

X_val : pd.DataFrame

Validation features

y_val : pd.Series

Validation target

normalize : bool, default=False

Whether to normalize the confusion matrix

title : str, optional

Plot title

figsize : Tuple[int, int], default=(8, 6)

Figure size

Returns:

plt.Figure

The matplotlib figure
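To make the effect of normalize concrete, the sketch below builds a confusion matrix from two label lists in plain Python. The helper is hypothetical and the labels are toy values. Row-wise normalization (each true class sums to 1) is an assumption, because this page does not say which axis plot_confusion_matrix normalizes over.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, normalize=False):
    """Rows are true labels, columns are predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    if normalize:
        # Assumption: divide each row by its total (per true class).
        matrix = [[c / sum(row) if sum(row) else 0.0 for c in row]
                  for row in matrix]
    return labels, matrix


# Toy labels for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

labels, raw = confusion_matrix(y_true, y_pred)
_, norm = confusion_matrix(y_true, y_pred, normalize=True)
print(raw)   # [[3, 1], [2, 2]]
print(norm)  # [[0.75, 0.25], [0.5, 0.5]]
```

Under this convention the diagonal of the normalized matrix gives per-class recall, which is easier to compare across imbalanced classes than raw counts.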

edaflow.ml.plot_feature_importance(model: BaseEstimator, feature_names: List[str], top_n: int = 20, title: str | None = None, figsize: Tuple[int, int] = (10, 8)) -> Figure

Plot feature importance for models that support it.

Parameters:

model : BaseEstimator

Fitted model with feature_importances_ attribute

feature_names : List[str]

Names of the features

top_n : int, default=20

Number of top features to display

title : str, optional

Plot title

figsize : Tuple[int, int], default=(10, 8)

Figure size

Returns:

plt.Figure

The matplotlib figure

Model Artifacts Functions

edaflow.ml.save_model_artifacts(model: Any, model_name: str, experiment_config: Dict[str, Any], performance_metrics: Dict[str, float], save_dir: str = 'model_artifacts', include_data_sample: bool = True, X_sample: DataFrame | None = None, format: str = 'joblib') -> Dict[str, str]

Save complete model artifacts including model, config, and metadata.

Parameters:

model : Any

The trained model to save

model_name : str

Name of the model for file naming

experiment_config : Dict[str, Any]

Configuration dictionary from setup_ml_experiment

performance_metrics : Dict[str, float]

Dictionary of performance metrics

save_dir : str, default='model_artifacts'

Directory to save artifacts

include_data_sample : bool, default=True

Whether to save a sample of training data

X_sample : pd.DataFrame, optional

Sample data to save (if not provided, uses first 100 rows)

format : str, default='joblib'

Format to save model ('joblib' or 'pickle')

Returns:

Dict[str, str]

Dictionary with paths to saved artifacts
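The general shape of an artifact bundle (a serialized model plus JSON side files, returned as a dictionary of paths) can be sketched with the standard library alone. Everything below is hypothetical: the helper name, the file names, and the use of pickle are illustrative choices, not edaflow's actual layout.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_artifacts(model, model_name, config, metrics, save_dir):
    """Write model, config, and metrics to save_dir; return the file paths."""
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical file naming scheme, chosen for this sketch only.
    paths = {
        "model": out / f"{model_name}_model.pkl",
        "config": out / f"{model_name}_config.json",
        "metrics": out / f"{model_name}_metrics.json",
    }
    with open(paths["model"], "wb") as fh:
        pickle.dump(model, fh)
    paths["config"].write_text(json.dumps(config, indent=2))
    paths["metrics"].write_text(json.dumps(metrics, indent=2))
    return {key: str(path) for key, path in paths.items()}


# A plain dict stands in for a trained model here.
toy_model = {"coef": [0.4, -1.2], "intercept": 0.1}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_artifacts(toy_model, "demo", {"random_state": 42},
                           {"accuracy": 0.9}, tmp)
    with open(saved["model"], "rb") as fh:
        restored = pickle.load(fh)
    reloaded_metrics = json.loads(Path(saved["metrics"]).read_text())

print(restored == toy_model, reloaded_metrics)  # True {'accuracy': 0.9}
```

Keeping config and metrics as JSON next to the binary model makes them readable without unpickling anything. Pickle and joblib files can execute code on load, so only load artifacts from sources you trust.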

edaflow.ml.load_model_artifacts(artifact_path: str, load_model: bool = True, load_config: bool = True, load_metrics: bool = True) -> Dict[str, Any]

Load model artifacts from saved files.

Parameters:

artifact_path : str

Path to the model file or directory containing artifacts

load_model : bool, default=True

Whether to load the model

load_config : bool, default=True

Whether to load the configuration

load_metrics : bool, default=True

Whether to load the metrics

Returns:

Dict[str, Any]

Dictionary containing loaded artifacts

edaflow.ml.track_experiment(experiment_name: str, model_results: Dict[str, Any], experiment_config: Dict[str, Any], notes: str | None = None, log_file: str = 'experiment_log.csv') -> None

Track experiment results in a CSV log file.

Parameters:

experiment_name : str

Name of the experiment

model_results : Dict[str, Any]

Results dictionary from model comparison

experiment_config : Dict[str, Any]

Configuration dictionary from setup_ml_experiment

notes : str, optional

Additional notes about the experiment

log_file : str, default='experiment_log.csv'

Path to the log file
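An append-only CSV log of this kind can be sketched with the csv module. The helper name, the column layout, and the example values below are assumptions made for illustration; the columns that track_experiment actually writes are not documented on this page.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def append_log_row(log_file, experiment_name, metrics, notes=None):
    """Append one row to a CSV log, writing the header on first use."""
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "experiment_name": experiment_name,
        "notes": notes or "",
        **metrics,
    }
    is_new = not os.path.exists(log_file) or os.path.getsize(log_file) == 0
    with open(log_file, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "experiment_log.csv")
    # Illustrative metric values only.
    append_log_row(log_path, "baseline", {"accuracy": 0.86})
    append_log_row(log_path, "tuned", {"accuracy": 0.90}, notes="random search")
    with open(log_path, newline="") as fh:
        logged = list(csv.DictReader(fh))

print([r["experiment_name"] for r in logged])  # ['baseline', 'tuned']
```

This sketch assumes every call logs the same metric keys; a log whose columns change between runs would need the header rewritten or a more flexible format such as JSON lines.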

edaflow.ml.create_model_report(model: Any, model_name: str, experiment_config: Dict[str, Any], performance_metrics: Dict[str, float], feature_importance: DataFrame | None = None, validation_results: Dict[str, Any] | None = None, save_path: str | None = None) -> str

Generate a comprehensive model report.

Parameters:

model : Any

The trained model

model_name : str

Name of the model

experiment_config : Dict[str, Any]

Configuration dictionary

performance_metrics : Dict[str, float]

Performance metrics dictionary

feature_importance : pd.DataFrame, optional

Feature importance data

validation_results : Dict[str, Any], optional

Validation results from validate_ml_data

save_path : str, optional

Path to save the report

Returns:

str

The generated report as a string