Machine Learning Functions
This section documents the complete ML workflow functions introduced in edaflow v0.13.0.
ML Configuration & Setup
| setup_ml_experiment | Set up a complete ML experiment with train/validation/test splits. |
| configure_model_pipeline | Configure a preprocessing pipeline for the ML experiment. |
| validate_ml_data | Validate data quality for ML experiments. |
Model Comparison & Ranking
| compare_models | Compare multiple models across various performance metrics. |
| rank_models | Rank models based on performance metrics. |
| display_leaderboard | Display a visual leaderboard of model performance. |
| export_model_comparison | Export model comparison results to file. |
Hyperparameter Optimization
| optimize_hyperparameters | Optimize hyperparameters using various search strategies. |
| grid_search_models | Perform grid search optimization for multiple models. |
| bayesian_optimization | Perform Bayesian optimization using scikit-optimize. |
| random_search_models | Perform random search optimization for multiple models. |
Performance Visualization
| plot_learning_curves | Plot learning curves to analyze model performance vs training set size. |
| plot_validation_curves | Plot validation curves for hyperparameter analysis. |
| plot_roc_curves | Plot ROC curves for multiple models (binary classification only). |
| plot_precision_recall_curves | Plot Precision-Recall curves for multiple models. |
| plot_confusion_matrix | Plot confusion matrix for a classification model. |
| plot_feature_importance | Plot feature importance for models that support it. |
Model Artifacts & Tracking
| save_model_artifacts | Save complete model artifacts including model, config, and metadata. |
| load_model_artifacts | Load model artifacts from saved files. |
| track_experiment | Track experiment results in a CSV log file. |
| create_model_report | Generate a comprehensive model report. |
Function Details
Configuration Functions
- edaflow.ml.setup_ml_experiment(data: DataFrame | None = None, target_column: str | None = None, test_size: float = 0.2, validation_size: float | None = None, random_state: int = 42, stratify: bool = True, verbose: bool = True, experiment_name: str | None = None, X: DataFrame | None = None, y: Series | None = None, val_size: float | None = None, primary_metric: str | None = None) -> Dict[str, Any]
Set up a complete ML experiment with train/validation/test splits.
This function supports two calling patterns:
1. DataFrame with target column: setup_ml_experiment(data, target_column)
2. Sklearn-style: setup_ml_experiment(X=X, y=y)
Parameters:
- target_column : str, optional
Name of the target variable column (required if using data parameter)
- test_size : float, default=0.2
Proportion of data to use for testing
- validation_size : float, optional
Proportion of training data to use for validation (default=0.2)
- random_state : int, default=42
Random seed for reproducibility
- stratify : bool, default=True
Whether to stratify the splits (for classification)
- verbose : bool, default=True
Whether to print experiment setup details
- experiment_name : str, optional
Name for the experiment (default='ml_experiment')
- X : pd.DataFrame, optional
Feature matrix (alternative to data + target_column pattern)
- y : pd.Series, optional
Target vector (alternative to data + target_column pattern)
- val_size : float, optional
Alternative name for validation_size (for compatibility)
- primary_metric : str, optional
The main metric used for model selection and ranking (e.g., 'roc_auc', 'f1', 'accuracy', 'r2'). This is stored in the config for downstream use.
Returns:
- Dict[str, Any]
Dictionary containing X_train, X_val, X_test, y_train, y_val, y_test, feature_names, target_name, and experiment_config
Examples:
# Method 1: DataFrame with target column (recommended)
>>> experiment = ml.setup_ml_experiment(df, target_column='target')

# Method 2: Sklearn-style (also supported)
>>> X = df.drop('target', axis=1)
>>> y = df['target']
>>> experiment = ml.setup_ml_experiment(X=X, y=y)
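To make the meaning of test_size, validation_size, and stratify concrete, here is a minimal sketch of a stratified three-way split written directly with scikit-learn. It is not edaflow's implementation; the toy DataFrame and the two-step approach (test set first, then validation carved from the remainder) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a balanced binary target (illustrative only).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "target": [0, 1] * 50,
})

X = df.drop("target", axis=1)
y = df["target"]

# Step 1: hold out the test set (20% of all rows), stratified on the target.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: carve the validation set out of the remaining training data
# (20% of the remainder), again stratified.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.2, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))
```

Because the validation proportion applies to the data left after the test split, the validation set here is 16% of the original rows rather than 20%.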
- edaflow.ml.configure_model_pipeline(data_config: Dict[str, Any], numerical_strategy: str = 'standard', categorical_strategy: str = 'onehot', handle_missing: str = 'drop', verbose: bool = True) -> Pipeline
Configure a preprocessing pipeline for the ML experiment.
Parameters:
- data_config : Dict[str, Any]
Configuration dictionary from setup_ml_experiment
- numerical_strategy : str, default='standard'
Scaling strategy for numerical features ('standard', 'minmax', 'robust', 'none')
- categorical_strategy : str, default='onehot'
Encoding strategy for categorical features ('onehot', 'target', 'none')
- handle_missing : str, default='drop'
Missing value strategy ('drop', 'impute', 'flag')
- verbose : bool, default=True
Whether to print pipeline configuration details
Returns:
- Pipeline
Configured sklearn Pipeline for preprocessing
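As a rough picture of what the 'standard' and 'onehot' strategies correspond to, the sketch below builds a comparable preprocessing pipeline by hand with scikit-learn. The column names and the way columns are split by dtype are assumptions; the pipeline that configure_model_pipeline actually returns may be organised differently.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training features: two numerical columns, one categorical.
X_train = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 88_000, 61_000],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

# Split columns by dtype (an assumption about how features are grouped).
numerical_cols = X_train.select_dtypes(include="number").columns.tolist()
categorical_cols = X_train.select_dtypes(exclude="number").columns.tolist()

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical_cols),                           # 'standard'
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),   # 'onehot'
])
pipeline = Pipeline([("preprocess", preprocessor)])

# Two scaled numerical columns plus one indicator column per city.
X_processed = pipeline.fit_transform(X_train)
print(X_processed.shape)
```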
- edaflow.ml.validate_ml_data(experiment_data: Dict[str, Any] | None = None, check_missing: bool = True, check_duplicates: bool = True, check_outliers: bool = True, verbose: bool = True, X: DataFrame | None = None, y: Series | None = None, check_cardinality: bool = True, check_distributions: bool = True) -> Dict[str, Any]
Validate data quality for ML experiments.
This function supports two calling patterns:
1. Experiment config: validate_ml_data(experiment_config)
2. Sklearn-style: validate_ml_data(X=X_train, y=y_train)
Parameters:
- experiment_data : Dict[str, Any], optional
Dictionary from setup_ml_experiment containing splits
- check_missing : bool, default=True
Whether to check for missing values
- check_duplicates : bool, default=True
Whether to check for duplicate rows
- check_outliers : bool, default=True
Whether to check for outliers
- verbose : bool, default=True
Whether to print validation details
- X : pd.DataFrame, optional
Feature data (alternative to experiment_data)
- y : pd.Series, optional
Target data (alternative to experiment_data)
- check_cardinality : bool, default=True
Whether to check feature cardinality
- check_distributions : bool, default=True
Whether to check feature distributions
Returns:
- Dict[str, Any]
Dictionary containing validation results and recommendations
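The kinds of checks listed above can be approximated in a few lines of pandas. The sketch below is only an illustration: the report keys and the 1.5 x IQR outlier rule are assumptions, not the structure or thresholds that validate_ml_data uses.

```python
import pandas as pd

# Small feature table with one missing value, one duplicate row, one outlier.
X = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, None, 100.0, 3.0],
    "b": ["x", "y", "y", "x", "x", "z"],
})

report = {
    # Missing values per column.
    "missing": X.isna().sum().to_dict(),
    # Number of rows that repeat an earlier row exactly.
    "duplicates": int(X.duplicated().sum()),
    # Distinct values per column (cardinality).
    "cardinality": X.nunique().to_dict(),
}

# Outliers in numerical columns using the 1.5 * IQR rule (assumed rule).
num = X.select_dtypes(include="number")
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
is_outlier = (num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)
report["outliers"] = is_outlier.sum().to_dict()

print(report)
```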
Model Comparison Functions
- edaflow.ml.compare_models(models: Dict[str, BaseEstimator], X_train: DataFrame | None = None, X_val: DataFrame | None = None, X_test: DataFrame | None = None, y_train: Series | None = None, y_val: Series | None = None, y_test: Series | None = None, experiment_config: Dict[str, Any] | None = None, problem_type: str = 'auto', metrics: List[str] | None = None, cv_folds: int = 5, scoring: str | List[str] | None = None, verbose: bool = True) -> DataFrame
Compare multiple models across various performance metrics.
Parameters:
- models : Dict[str, BaseEstimator]
Dictionary of model name -> fitted model pairs
- X_train : pd.DataFrame, optional
Training features (can be provided via experiment_config)
- X_val : pd.DataFrame, optional
Validation features (can be provided via experiment_config)
- X_test : pd.DataFrame, optional
Test features for final evaluation
- y_train : pd.Series, optional
Training target (can be provided via experiment_config)
- y_val : pd.Series, optional
Validation target (can be provided via experiment_config)
- y_test : pd.Series, optional
Test target for final evaluation
- experiment_config : Dict[str, Any], optional
Complete experiment configuration from setup_ml_experiment(). If provided, X_train, X_val, y_train, and y_val are extracted from it
- problem_type : str, default='auto'
'classification', 'regression', or 'auto' to detect
- metrics : List[str], optional
Specific metrics to calculate. If None, uses default metrics
- cv_folds : int, default=5
Number of cross-validation folds (if applicable)
- scoring : str or List[str], optional
Scoring metric(s) to use for evaluation
- verbose : bool, default=True
Whether to print comparison progress
Returns:
- pd.DataFrame
Comparison results with models as rows and metrics as columns
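The result shape described above (models as rows, metrics as columns) can be reproduced by hand as follows. This is a sketch with scikit-learn and pandas on synthetic data, not the logic inside compare_models; the model names and the two metrics chosen are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

# Fit each model, then score it on the validation split.
rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rows[name] = {
        "accuracy": accuracy_score(y_val, preds),
        "f1": f1_score(y_val, preds),
    }

# One row per model, one column per metric.
results = pd.DataFrame.from_dict(rows, orient="index")
print(results)
```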
- edaflow.ml.rank_models(comparison_df: DataFrame, primary_metric: str, ascending: bool = False, secondary_metrics: List[str] | None = None, weights: Dict[str, float] | None = None, return_format: str = 'dataframe') -> DataFrame | List[Dict]
Rank models based on performance metrics.
Parameters:
- comparison_df : pd.DataFrame
Results from compare_models()
- primary_metric : str
Main metric to rank by
- ascending : bool, default=False
Whether to sort in ascending order (True for error metrics)
- secondary_metrics : List[str], optional
Additional metrics to consider for tie-breaking
- weights : Dict[str, float], optional
Weights for weighted ranking across multiple metrics
- return_format : str, default='dataframe'
Format to return: 'dataframe' or 'list'
Returns:
- Union[pd.DataFrame, List[Dict]]
If 'dataframe': ranked models DataFrame.
If 'list': list of dicts for easy access with the pattern [0]["model_name"].
Examples:
# DataFrame format (default)
ranked_df = rank_models(results, 'accuracy')
best_model = ranked_df.iloc[0]['model']

# List format for easier access
ranked_list = rank_models(results, 'accuracy', return_format='list')
best_model = ranked_list[0]["model_name"]
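The role of the ascending flag is easiest to see on a small table. The sketch below ranks a hand-written comparison table with plain pandas; the helper rank_by, the column names, and the scores are all hypothetical and do not reflect rank_models internals.

```python
import pandas as pd

# Hypothetical comparison table: higher accuracy is better, lower log_loss is better.
comparison_df = pd.DataFrame({
    "model": ["logreg", "rf", "svm"],
    "accuracy": [0.85, 0.91, 0.88],
    "log_loss": [0.40, 0.31, 0.35],
})


def rank_by(df, metric, ascending=False):
    """Sort by one metric and add a 1-based rank column (illustrative helper)."""
    ranked = df.sort_values(metric, ascending=ascending).reset_index(drop=True)
    ranked.insert(0, "rank", range(1, len(ranked) + 1))
    return ranked


# Score metric: largest value first.
by_accuracy = rank_by(comparison_df, "accuracy")

# Error metric: smallest value first, hence ascending=True.
by_log_loss = rank_by(comparison_df, "log_loss", ascending=True)

# List-of-dicts view of the same ranking.
as_list = by_accuracy.to_dict(orient="records")
print(as_list[0]["model"])
```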
- edaflow.ml.display_leaderboard(comparison_results: DataFrame = None, ranked_df: DataFrame = None, sort_by: str = None, ascending: bool = False, show_std: bool = False, top_n: int = 10, show_metrics: List[str] | None = None, highlight_best: bool = True, figsize: Tuple[int, int] = (12, 8)) -> None
Display a visual leaderboard of model performance.
Parameters:
- comparison_results : pd.DataFrame, optional
Raw comparison results from compare_models()
- ranked_df : pd.DataFrame, optional
Pre-ranked results (alternative to comparison_results)
- sort_by : str, optional
Metric to sort by. If None, uses first numeric column
- ascending : bool, default=False
Whether to sort in ascending order
- show_std : bool, default=False
Whether to show standard deviation columns
- top_n : int, default=10
Number of top models to display
- show_metrics : List[str], optional
Specific metrics to show. If None, shows all numeric metrics
- highlight_best : bool, default=True
Whether to highlight the best performing model
- figsize : Tuple[int, int], default=(12, 8)
Figure size for the visualization
- edaflow.ml.export_model_comparison(comparison_df: DataFrame, filepath: str, include_config: bool = True, format: str = 'csv') -> None
Export model comparison results to file.
Parameters:
- comparison_df : pd.DataFrame
Comparison results to export
- filepath : str
Path where to save the file
- include_config : bool, default=True
Whether to include experiment configuration
- format : str, default='csv'
Export format ('csv', 'excel', 'json')
Hyperparameter Tuning Functions
- edaflow.ml.optimize_hyperparameters(model: BaseEstimator, param_distributions: Dict[str, Any], X_train: DataFrame, y_train: Series, cv: int = 5, scoring: str = 'auto', n_iter: int = 50, method: str = 'random', verbose: bool = True, random_state: int = 42) -> Dict[str, Any]
Optimize hyperparameters using various search strategies.
Parameters:
- model : BaseEstimator
The base model to optimize
- param_distributions : Dict[str, Any]
Parameter distributions to search over
- X_train : pd.DataFrame
Training features
- y_train : pd.Series
Training target
- cv : int, default=5
Number of cross-validation folds
- scoring : str, default='auto'
Scoring metric ('auto' detects based on problem type)
- n_iter : int, default=50
Number of iterations for random/bayesian search
- method : str, default='random'
Search method ('grid', 'random', 'bayesian')
- verbose : bool, default=True
Whether to print optimization progress
- random_state : int, default=42
Random seed for reproducibility
Returns:
- Dict[str, Any]
Dictionary containing best model, parameters, and optimization results
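For orientation, the 'random' search method corresponds conceptually to scikit-learn's RandomizedSearchCV. The sketch below runs such a search directly; it is not edaflow's code, and the keys of the result dictionary are assumptions rather than the documented return structure.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic training data (illustrative only).
X_train, y_train = make_classification(n_samples=150, n_features=6, random_state=42)

# A distribution for one parameter and a discrete list for another.
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": [2, 4, 6, None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter settings
    cv=3,                # cross-validation folds
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)

# Hypothetical result layout, loosely mirroring "best model, parameters, results".
result = {
    "best_model": search.best_estimator_,
    "best_params": search.best_params_,
    "best_score": search.best_score_,
}
print(result["best_params"], round(result["best_score"], 3))
```

Each of the n_iter sampled settings is fitted cv times, so the cost of a search grows as n_iter x cv model fits.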
- edaflow.ml.grid_search_models(models: Dict[str, BaseEstimator], param_grids: Dict[str, Dict[str, Any]], X_train: DataFrame, y_train: Series, cv: int = 5, scoring: str = 'auto', verbose: bool = True) -> Dict[str, Dict[str, Any]]
Perform grid search optimization for multiple models.
Parameters:
- models : Dict[str, BaseEstimator]
Dictionary of model name -> model pairs
- param_grids : Dict[str, Dict[str, Any]]
Dictionary of model name -> parameter grid pairs
- X_train : pd.DataFrame
Training features
- y_train : pd.Series
Training target
- cv : int, default=5
Number of cross-validation folds
- scoring : str, default='auto'
Scoring metric
- verbose : bool, default=True
Whether to print progress
Returns:
- Dict[str, Dict[str, Any]]
Dictionary of model name -> optimization results pairs
- edaflow.ml.bayesian_optimization(model: BaseEstimator, param_space: Dict[str, Any], X_train: DataFrame, y_train: Series, n_calls: int = 50, cv: int = 5, scoring: str = 'auto', verbose: bool = True, random_state: int = 42) -> Dict[str, Any]
Perform Bayesian optimization using scikit-optimize.
Parameters:
- model : BaseEstimator
The base model to optimize
- param_space : Dict[str, Any]
Parameter space definition (requires skopt)
- X_train : pd.DataFrame
Training features
- y_train : pd.Series
Training target
- n_calls : int, default=50
Number of optimization calls
- cv : int, default=5
Number of cross-validation folds
- scoring : str, default='auto'
Scoring metric
- verbose : bool, default=True
Whether to print progress
- random_state : int, default=42
Random seed for reproducibility
Returns:
- Dict[str, Any]
Optimization results including best parameters and convergence plot
- edaflow.ml.random_search_models(models: Dict[str, BaseEstimator], param_distributions: Dict[str, Dict[str, Any]], X_train: DataFrame, y_train: Series, n_iter: int = 50, cv: int = 5, scoring: str = 'auto', verbose: bool = True, random_state: int = 42) -> Dict[str, Dict[str, Any]]
Perform random search optimization for multiple models.
Parameters:
- models : Dict[str, BaseEstimator]
Dictionary of model name -> model pairs
- param_distributions : Dict[str, Dict[str, Any]]
Dictionary of model name -> parameter distributions pairs
- X_train : pd.DataFrame
Training features
- y_train : pd.Series
Training target
- n_iter : int, default=50
Number of random search iterations
- cv : int, default=5
Number of cross-validation folds
- scoring : str, default='auto'
Scoring metric
- verbose : bool, default=True
Whether to print progress
- random_state : int, default=42
Random seed for reproducibility
Returns:
- Dict[str, Dict[str, Any]]
Dictionary of model name -> optimization results pairs
Performance Visualization Functions
- edaflow.ml.plot_learning_curves(model: BaseEstimator, X_train: DataFrame, y_train: Series, cv: int = 5, scoring: str = 'auto', train_sizes: ndarray | None = None, title: str | None = None, figsize: Tuple[int, int] = (10, 6), show_std: bool = True) -> Figure
Plot learning curves to analyze model performance vs training set size.
Parameters:
- model : BaseEstimator
The model to analyze
- X_train : pd.DataFrame
Training features
- y_train : pd.Series
Training target
- cv : int, default=5
Number of cross-validation folds
- scoring : str, default='auto'
Scoring metric
- train_sizes : np.ndarray, optional
Training set sizes to use
- title : str, optional
Plot title
- figsize : Tuple[int, int], default=(10, 6)
Figure size
- show_std : bool, default=True
Whether to show standard deviation bands
Returns:
- plt.Figure
The matplotlib figure
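The numbers behind a learning-curve plot can be computed with scikit-learn's learning_curve. The sketch below shows only that computation (no plotting) on synthetic data; whether plot_learning_curves relies on this function internally is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic training data (illustrative only).
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=42)

# Fractions of the per-fold training data to evaluate.
fractions = np.array([0.25, 0.5, 0.75, 1.0])

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    cv=5,
    scoring="accuracy",
    train_sizes=fractions,
)

# Mean and standard deviation across folds; the std is what a shaded
# band around each curve would represent.
train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
val_mean, val_std = val_scores.mean(axis=1), val_scores.std(axis=1)

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"{n:4d} samples: train={tr:.3f} val={va:.3f}")
```

A large, persistent gap between the training and validation curves suggests overfitting, while two low curves that have converged suggest underfitting.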
- edaflow.ml.plot_validation_curves(model: BaseEstimator, X_train: DataFrame, y_train: Series, param_name: str, param_range: List[Any], cv: int = 5, scoring: str = 'auto', title: str | None = None, figsize: Tuple[int, int] = (10, 6), log_scale: bool = False) -> Figure
Plot validation curves for hyperparameter analysis.
Parameters:
- model : BaseEstimator
The model to analyze
- X_train : pd.DataFrame
Training features
- y_train : pd.Series
Training target
- param_name : str
Name of the parameter to vary
- param_range : List[Any]
Range of parameter values to test
- cv : int, default=5
Number of cross-validation folds
- scoring : str, default='auto'
Scoring metric
- title : str, optional
Plot title
- figsize : Tuple[int, int], default=(10, 6)
Figure size
- log_scale : bool, default=False
Whether to use log scale for x-axis
Returns:
- plt.Figure
The matplotlib figure
- edaflow.ml.plot_roc_curves(models: Dict[str, BaseEstimator], X_val: DataFrame, y_val: Series, title: str | None = None, figsize: Tuple[int, int] = (10, 8)) -> Figure
Plot ROC curves for multiple models (binary classification only).
Parameters:
- models : Dict[str, BaseEstimator]
Dictionary of model name -> fitted model pairs
- X_val : pd.DataFrame
Validation features
- y_val : pd.Series
Validation target
- title : str, optional
Plot title
- figsize : Tuple[int, int], default=(10, 8)
Figure size
Returns:
- plt.Figure
The matplotlib figure
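Each ROC curve is a set of (false positive rate, true positive rate) points derived from a fitted model's predicted scores. The sketch below computes those points and the area under the curve with scikit-learn for two models; it omits plotting, and the use of predict_proba for scoring is an assumption about how such a plot would typically be built.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

roc_data = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class is the ranking score.
    scores = model.predict_proba(X_val)[:, 1]
    fpr, tpr, _ = roc_curve(y_val, scores)
    roc_data[name] = {"fpr": fpr, "tpr": tpr, "auc": auc(fpr, tpr)}

for name, data in roc_data.items():
    print(f"{name}: AUC = {data['auc']:.3f}")
```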
- edaflow.ml.plot_precision_recall_curves(models: Dict[str, BaseEstimator], X_val: DataFrame, y_val: Series, title: str | None = None, figsize: Tuple[int, int] = (10, 8)) -> Figure
Plot Precision-Recall curves for multiple models.
Parameters:
- models : Dict[str, BaseEstimator]
Dictionary of model name -> fitted model pairs
- X_val : pd.DataFrame
Validation features
- y_val : pd.Series
Validation target
- title : str, optional
Plot title
- figsize : Tuple[int, int], default=(10, 8)
Figure size
Returns:
- plt.Figure
The matplotlib figure
- edaflow.ml.plot_confusion_matrix(model: BaseEstimator, X_val: DataFrame, y_val: Series, normalize: bool = False, title: str | None = None, figsize: Tuple[int, int] = (8, 6)) -> Figure
Plot confusion matrix for a classification model.
Parameters:
- model : BaseEstimator
Fitted classification model
- X_val : pd.DataFrame
Validation features
- y_val : pd.Series
Validation target
- normalize : bool, default=False
Whether to normalize the confusion matrix
- title : str, optional
Plot title
- figsize : Tuple[int, int], default=(8, 6)
Figure size
Returns:
- plt.Figure
The matplotlib figure
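The effect of normalization is easiest to see on a tiny example. The sketch below computes a raw and a row-normalized confusion matrix with scikit-learn; the labels are made up, and row-wise normalization (each true class sums to 1) is an assumption about what normalize=True means here.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for six validation samples.
y_val = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Raw counts: rows are true classes, columns are predicted classes.
cm_counts = confusion_matrix(y_val, y_pred)

# Row-normalized: each row is divided by the number of samples in that true class.
cm_normalized = confusion_matrix(y_val, y_pred, normalize="true")

print(cm_counts)
print(cm_normalized.round(2))
```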
- edaflow.ml.plot_feature_importance(model: BaseEstimator, feature_names: List[str], top_n: int = 20, title: str | None = None, figsize: Tuple[int, int] = (10, 8)) -> Figure
Plot feature importance for models that support it.
Parameters:
- model : BaseEstimator
Fitted model with feature_importances_ attribute
- feature_names : List[str]
Names of the features
- top_n : int, default=20
Number of top features to display
- title : str, optional
Plot title
- figsize : Tuple[int, int], default=(10, 8)
Figure size
Returns:
- plt.Figure
The matplotlib figure
Model Artifacts Functions
- edaflow.ml.save_model_artifacts(model: Any, model_name: str, experiment_config: Dict[str, Any], performance_metrics: Dict[str, float], save_dir: str = 'model_artifacts', include_data_sample: bool = True, X_sample: DataFrame | None = None, format: str = 'joblib') -> Dict[str, str]
Save complete model artifacts including model, config, and metadata.
Parameters:
- model : Any
The trained model to save
- model_name : str
Name of the model for file naming
- experiment_config : Dict[str, Any]
Configuration dictionary from setup_ml_experiment
- performance_metrics : Dict[str, float]
Dictionary of performance metrics
- save_dir : str, default='model_artifacts'
Directory to save artifacts
- include_data_sample : bool, default=True
Whether to save a sample of training data
- X_sample : pd.DataFrame, optional
Sample data to save (if not provided, uses first 100 rows)
- format : str, default='joblib'
Format to save model ('joblib' or 'pickle')
Returns:
- Dict[str, str]
Dictionary with paths to saved artifacts
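The general idea of saving a model next to its metadata can be sketched with joblib and json. The file names, the metadata layout, and the keys of the returned paths dictionary below are hypothetical and are not the layout that save_model_artifacts writes.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model to have something to save (illustrative only).
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Use a temporary directory so the sketch leaves no files in the project.
save_dir = Path(tempfile.mkdtemp()) / "model_artifacts"
save_dir.mkdir(parents=True, exist_ok=True)

# Serialize the fitted model.
model_path = save_dir / "logreg_model.joblib"
joblib.dump(model, model_path)

# Store human-readable metadata alongside it.
metadata = {
    "model_name": "logreg",
    "performance_metrics": {"train_accuracy": float(model.score(X, y))},
}
metadata_path = save_dir / "logreg_metadata.json"
metadata_path.write_text(json.dumps(metadata, indent=2))

# Hypothetical "paths to saved artifacts" mapping.
paths = {"model": str(model_path), "metadata": str(metadata_path)}

# Round trip: the reloaded model should predict identically.
reloaded = joblib.load(paths["model"])
print((reloaded.predict(X) == model.predict(X)).all())
```

joblib and pickle files can execute arbitrary code when loaded, so artifacts should only be loaded from trusted sources.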
- edaflow.ml.load_model_artifacts(artifact_path: str, load_model: bool = True, load_config: bool = True, load_metrics: bool = True) -> Dict[str, Any]
Load model artifacts from saved files.
Parameters:
- artifact_path : str
Path to the model file or directory containing artifacts
- load_model : bool, default=True
Whether to load the model
- load_config : bool, default=True
Whether to load the configuration
- load_metrics : bool, default=True
Whether to load the metrics
Returns:
- Dict[str, Any]
Dictionary containing loaded artifacts
- edaflow.ml.track_experiment(experiment_name: str, model_results: Dict[str, Any], experiment_config: Dict[str, Any], notes: str | None = None, log_file: str = 'experiment_log.csv') -> None
Track experiment results in a CSV log file.
Parameters:
- experiment_name : str
Name of the experiment
- model_results : Dict[str, Any]
Results dictionary from model comparison
- experiment_config : Dict[str, Any]
Configuration dictionary from setup_ml_experiment
- notes : str, optional
Additional notes about the experiment
- log_file : str, default='experiment_log.csv'
Path to the log file
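An append-only CSV log of this kind can be sketched with the standard library alone. The helper append_experiment_row and its column layout are hypothetical, not the format track_experiment writes, and the sketch assumes every call logs the same metric names.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_experiment_row(log_file, experiment_name, metrics, notes=None):
    """Append one experiment record to a CSV file (illustrative helper)."""
    log_path = Path(log_file)
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "experiment_name": experiment_name,
        "notes": notes or "",
        **metrics,
    }
    # Write the header only when the file is created.
    write_header = not log_path.exists()
    with log_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# Log two runs to a temporary file, then read them back.
log_file = Path(tempfile.mkdtemp()) / "experiment_log.csv"
append_experiment_row(log_file, "baseline", {"accuracy": 0.85, "f1": 0.82})
append_experiment_row(log_file, "tuned", {"accuracy": 0.91, "f1": 0.89}, notes="after search")

with log_file.open(newline="") as fh:
    logged = list(csv.DictReader(fh))

print([r["experiment_name"] for r in logged])
```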
- edaflow.ml.create_model_report(model: Any, model_name: str, experiment_config: Dict[str, Any], performance_metrics: Dict[str, float], feature_importance: DataFrame | None = None, validation_results: Dict[str, Any] | None = None, save_path: str | None = None) -> str
Generate a comprehensive model report.
Parameters:
- model : Any
The trained model
- model_name : str
Name of the model
- experiment_config : Dict[str, Any]
Configuration dictionary
- performance_metrics : Dict[str, float]
Performance metrics dictionary
- feature_importance : pd.DataFrame, optional
Feature importance data
- validation_results : Dict[str, Any], optional
Validation results from validate_ml_data
- save_path : str, optional
Path to save the report
Returns:
- str
The generated report as a string