edaflow Documentation .. toctree:

:maxdepth: 2
:caption: Table of Contents

quickstart
installation
user_guide/learning_path
user_guide/index
user_guide/ml_workflow
user_guide/advanced_features
user_guide/best_practices
api_reference/index
examples/index

PyPI version Python versions License

edaflow is a Python package designed to streamline both exploratory data analysis (EDA) and machine learning (ML) workflows. It provides 19+ comprehensive EDA functions and 26 powerful ML functions that cover the essential steps from data exploration to model deployment.

edaflow simplifies and accelerates data science workflows by providing a collection of powerful functions for data scientists and analysts. The package integrates popular data science libraries to create a cohesive workflow for data exploration, visualization, preprocessing, machine learning model development, and intelligent categorical encoding - now including computer vision datasets and quality assessment.

🎯 Key Features

Exploratory Data Analysis (EDA)

  • ⭐ NEW: Automated Profiling Report: Generate comprehensive EDA reports with a single function call (similar to ydata-profiling)

  • Missing Data Analysis: Color-coded analysis of null values with customizable thresholds

  • Categorical Data Insights: Identify object columns that might be numeric, detect data type issues

  • Automatic Data Type Conversion: Smart conversion of object columns to numeric when appropriate

  • Data Imputation: Smart missing value imputation using median for numerical and mode for categorical columns

  • Advanced Visualizations: Interactive boxplots, comprehensive heatmaps, statistical histograms

  • Scatter Matrix Analysis: Advanced pairwise relationship visualization with regression lines

  • Computer Vision EDA: Class-wise image sample visualization for image classification datasets

  • Image Quality Assessment: Automated detection of corrupted, blurry, or low-quality images

  • Smart Categorical Encoding: Intelligent analysis and automated application of optimal encoding strategies

  • Outlier Handling: Automated outlier detection and replacement using multiple statistical methods

Machine Learning (ML) Workflow

  • ML Experiment Setup: Automated train/validation/test splits and configuration management

  • Model Comparison: Multi-model evaluation with comprehensive performance leaderboards

  • Hyperparameter Optimization: Grid search, random search, and Bayesian optimization

  • Performance Visualization: Learning curves, ROC curves, confusion matrices, and feature importance

  • Model Persistence: Complete model artifacts saving with metadata and experiment tracking

  • Pipeline Configuration: Automated preprocessing pipeline setup for ML workflows

Professional Output: Beautiful, color-coded results optimized for Jupyter notebooks

πŸ“¦ Quick Installation

pip install edaflow

πŸš€ Quick Start Example

EDA Workflow:

import edaflow
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')

# Complete EDA workflow
edaflow.check_null_columns(df)
edaflow.analyze_categorical_columns(df)
edaflow.visualize_heatmap(df)
edaflow.visualize_scatter_matrix(df)

ML Workflow:

import edaflow.ml as ml
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Setup ML experiment (supports both calling patterns)

# DataFrame-style (recommended)
config = ml.setup_ml_experiment(
   df, 'target',
   val_size=0.15,
   test_size=0.2,
   experiment_name="model_comparison",
   random_state=42,
   stratify=True,
   primary_metric="roc_auc"  # πŸ‘ˆ Set your main metric here! (e.g., 'f1', 'accuracy', 'r2', etc.)
)

# Alternative: sklearn-style
config = ml.setup_ml_experiment(
   X=X, y=y,
   val_size=0.15,
   test_size=0.2,
   experiment_name="model_comparison",
   random_state=42,
   stratify=True,
   primary_metric="roc_auc"  # πŸ‘ˆ Set your main metric here! (e.g., 'f1', 'accuracy', 'r2', etc.)
)

# Compare multiple models
models = {
    'rf': RandomForestClassifier(),
    'lr': LogisticRegression()
}

results = ml.compare_models(
    models=models,
    X_train=config['X_train'],
    y_train=config['y_train'],
    X_test=config['X_test'],
    y_test=config['y_test']
)
df = pd.read_csv('your_data.csv')

# Complete EDA workflow with 18 functions
edaflow.check_null_columns(df)                    # 1. Missing data analysis
edaflow.analyze_categorical_columns(df)           # 2. Categorical insights
df_clean = edaflow.convert_to_numeric(df)         # 3. Smart type conversion
edaflow.visualize_categorical_values(df_clean)    # 4. Category exploration
edaflow.visualize_scatter_matrix(df_clean)        # 5. Relationship analysis
edaflow.visualize_heatmap(df_clean)              # 6. Correlation heatmaps
edaflow.visualize_histograms(df_clean)           # 7. Distribution analysis
# ... and 11 more powerful functions!

# NEW: Computer Vision EDA & Quality Assessment
edaflow.visualize_image_classes(
    data_source='dataset/images/',  # Simple directory path
    samples_per_class=4,
    max_classes_display=8
)
edaflow.assess_image_quality(
    image_paths=image_list,         # List of image paths
    check_corruption=True,
    analyze_color=True,
    detect_blur=True,
    sample_size=200
)

πŸ“š Documentation Contents

Contents:

πŸ“Š Function Overview

edaflow provides 18 comprehensive EDA functions organized into logical categories:

Data Quality & Analysis

Data Cleaning & Preprocessing

Visualization & Analysis

Computer Vision EDA πŸ–ΌοΈ NEW in v0.9.0-v0.12.3!

  • visualize_image_classes() - Class-wise image sample visualization for image classification datasets

Image Quality Assessment πŸ” NEW in v0.10.0-v0.12.3!

  • assess_image_quality() - Comprehensive automated quality assessment and corruption detection for image datasets

Smart Encoding 🧠 NEW in v0.12.4-v0.12.7!

  • analyze_encoding_needs() - Intelligent categorical encoding analysis and recommendations

  • apply_smart_encoding() - Automated encoding application with optimal strategy selection

Helper Functions

  • hello() - Package verification function

πŸŽ“ Background

edaflow was developed in part of a Capstone project during an AI/ML course conducted by NTUC LearningHub (Cohort 15). Special thanks to our instructor, Ms. Isha Sehgal, who inspired the project works which led to the development of this comprehensive EDA toolkit.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.