edaflow Documentation
edaflow is a Python package designed to streamline both exploratory data analysis (EDA) and machine learning (ML) workflows. It provides 19+ comprehensive EDA functions and 26 powerful ML functions that cover the essential steps from data exploration to model deployment.
The package integrates popular data science libraries into one cohesive workflow for data exploration, visualization, preprocessing, machine learning model development, and categorical encoding. It also supports computer vision datasets and image quality assessment.
Key Features
Exploratory Data Analysis (EDA)
NEW: Automated Profiling Report: Generate comprehensive EDA reports with a single function call (similar to ydata-profiling)
Missing Data Analysis: Color-coded analysis of null values with customizable thresholds
Categorical Data Insights: Identify object columns that might be numeric, detect data type issues
Automatic Data Type Conversion: Smart conversion of object columns to numeric when appropriate
Data Imputation: Smart missing value imputation using median for numerical and mode for categorical columns
Advanced Visualizations: Interactive boxplots, comprehensive heatmaps, statistical histograms
Scatter Matrix Analysis: Advanced pairwise relationship visualization with regression lines
Computer Vision EDA: Class-wise image sample visualization for image classification datasets
Image Quality Assessment: Automated detection of corrupted, blurry, or low-quality images
Smart Categorical Encoding: Intelligent analysis and automated application of optimal encoding strategies
Outlier Handling: Automated outlier detection and replacement using multiple statistical methods
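To make two of the items above concrete, the sketch below shows in plain pandas what converting numeric-looking object columns and median/mode imputation usually amount to. It is an illustration under stated assumptions, not edaflow's implementation: the helper names `convert_numeric_like` and `impute_simple` and the 0.9 threshold are hypothetical.

```python
import pandas as pd


def convert_numeric_like(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Convert non-numeric columns whose non-null values mostly parse as numbers."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # already numeric, nothing to do
        parsed = pd.to_numeric(out[col], errors="coerce")
        non_null = out[col].notna().sum()
        # Hypothetical rule: convert when at least `threshold` of values parse.
        if non_null and parsed.notna().sum() / non_null >= threshold:
            out[col] = parsed
    return out


def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out


sample = pd.DataFrame({
    "price": ["10", "20", None, "40"],         # numbers stored as strings
    "city": ["Oslo", None, "Oslo", "Bergen"],  # genuinely categorical
})
cleaned = impute_simple(convert_numeric_like(sample))
print(cleaned)
```

The missing `price` is filled with the median of the parsed values and the missing `city` with the most frequent label, which is the behavior the feature descriptions suggest.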
Machine Learning (ML) Workflow
ML Experiment Setup: Automated train/validation/test splits and configuration management
Model Comparison: Multi-model evaluation with comprehensive performance leaderboards
Hyperparameter Optimization: Grid search, random search, and Bayesian optimization
Performance Visualization: Learning curves, ROC curves, confusion matrices, and feature importance
Model Persistence: Complete model artifacts saving with metadata and experiment tracking
Pipeline Configuration: Automated preprocessing pipeline setup for ML workflows
Professional Output: Beautiful, color-coded results optimized for Jupyter notebooks
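As a point of reference for the hyperparameter optimization item, this is how grid search and random search look in plain scikit-learn. It is a minimal sketch on synthetic data and does not show edaflow's own API. The dataset, the parameter ranges, and the choice of `roc_auc` as the score are assumptions made for the example. Bayesian optimization is left out because it needs an additional third-party library.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Grid search: evaluate every value in a fixed list.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)

# Random search: draw a fixed number of candidates from a distribution.
rand = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-2, 1e2)},
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
rand.fit(X, y)

print("grid best:", grid.best_params_, round(grid.best_score_, 3))
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```

Grid search is exhaustive over a small list, while random search covers a wider range with a fixed budget of `n_iter` candidates.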
Quick Installation
pip install edaflow
Quick Start Example
EDA Workflow:
import edaflow
import pandas as pd
# Load your data
df = pd.read_csv('your_data.csv')
# Complete EDA workflow
edaflow.check_null_columns(df)
edaflow.analyze_categorical_columns(df)
edaflow.visualize_heatmap(df)
edaflow.visualize_scatter_matrix(df)
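For readers who want to see what a threshold-based missing-data check computes, here is a small pandas sketch. It is not the implementation of `check_null_columns()`. The function name `null_summary`, the 10% and 20% thresholds, and the level labels are hypothetical.

```python
import pandas as pd


def null_summary(df: pd.DataFrame, warn_pct: float = 10.0,
                 critical_pct: float = 20.0) -> pd.DataFrame:
    """Report the percentage of nulls per column with a severity level."""
    null_pct = df.isna().mean() * 100  # share of missing values per column

    def level(pct: float) -> str:
        if pct > critical_pct:
            return "critical"
        if pct > warn_pct:
            return "warning"
        return "ok"

    return pd.DataFrame({
        "null_pct": null_pct.round(1),
        "level": null_pct.map(level),
    })


sample = pd.DataFrame({
    "age": [25, 31, None, 47, 52, 38, 29, 44],               # 1 of 8 missing
    "income": [None, None, None, None, 60, 70, 80, 90],      # 4 of 8 missing
    "city": ["A", "B", "C", "D", "E", "F", "G", "H"],        # complete
})
report = null_summary(sample)
print(report)
```

A color-coded display like the one edaflow describes could be layered on top of such a table, for example by mapping each level to a background color.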
ML Workflow:
import edaflow.ml as ml
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Set up the ML experiment (both calling patterns are supported)
# DataFrame-style (recommended)
config = ml.setup_ml_experiment(
    df, 'target',
    val_size=0.15,
    test_size=0.2,
    experiment_name="model_comparison",
    random_state=42,
    stratify=True,
    primary_metric="roc_auc"  # Set your main metric here (e.g., 'f1', 'accuracy', 'r2')
)
# Alternative: sklearn-style
X = df.drop(columns=['target'])  # feature columns
y = df['target']                 # target column
config = ml.setup_ml_experiment(
    X=X, y=y,
    val_size=0.15,
    test_size=0.2,
    experiment_name="model_comparison",
    random_state=42,
    stratify=True,
    primary_metric="roc_auc"  # Set your main metric here (e.g., 'f1', 'accuracy', 'r2')
)
# Compare multiple models
models = {
    'rf': RandomForestClassifier(),
    'lr': LogisticRegression()
}
results = ml.compare_models(
    models=models,
    X_train=config['X_train'],
    y_train=config['y_train'],
    X_test=config['X_test'],
    y_test=config['y_test']
)
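The `val_size` and `test_size` arguments above imply a three-way split. The sketch below shows one way to produce such a split with scikit-learn, assuming both sizes are fractions of the full dataset. How edaflow interprets them internally is not stated here, so treat this as an illustration. The helper name `three_way_split` is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, target, val_size=0.15, test_size=0.2,
                    random_state=42, stratify=True):
    """Split into train/validation/test sets, sizes as fractions of the full data."""
    X, y = df.drop(columns=[target]), df[target]
    # Use absolute counts so the second split does not need a rescaled fraction.
    n_val = round(val_size * len(df))
    n_test = round(test_size * len(df))

    # First carve out the test set, then the validation set from the rest.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=random_state,
        stratify=y if stratify else None,
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=random_state,
        stratify=y_rest if stratify else None,
    )
    return {
        "X_train": X_train, "y_train": y_train,
        "X_val": X_val, "y_val": y_val,
        "X_test": X_test, "y_test": y_test,
    }


demo = pd.DataFrame({"feature": range(100), "target": [0, 1] * 50})
parts = three_way_split(demo, "target")
print({name: len(part) for name, part in parts.items() if name.startswith("X_")})
```

With 100 rows and the defaults shown, this yields 65 training, 15 validation, and 20 test rows, and no row appears in more than one set.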
df = pd.read_csv('your_data.csv')
# Complete EDA workflow with 18 functions
edaflow.check_null_columns(df) # 1. Missing data analysis
edaflow.analyze_categorical_columns(df) # 2. Categorical insights
df_clean = edaflow.convert_to_numeric(df) # 3. Smart type conversion
edaflow.visualize_categorical_values(df_clean) # 4. Category exploration
edaflow.visualize_scatter_matrix(df_clean) # 5. Relationship analysis
edaflow.visualize_heatmap(df_clean) # 6. Correlation heatmaps
edaflow.visualize_histograms(df_clean) # 7. Distribution analysis
# ... and 11 more powerful functions!
# NEW: Computer Vision EDA & Quality Assessment
edaflow.visualize_image_classes(
    data_source='dataset/images/',  # Simple directory path
    samples_per_class=4,
    max_classes_display=8
)
# image_list is assumed to be a list of image file paths defined beforehand
edaflow.assess_image_quality(
    image_paths=image_list,  # List of image paths
    check_corruption=True,
    analyze_color=True,
    detect_blur=True,
    sample_size=200
)
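Blur detection, enabled above with `detect_blur=True`, is often based on the variance of the Laplacian: sharp images have strong local intensity changes and therefore a high variance. The NumPy sketch below illustrates that common heuristic. Whether edaflow uses this exact measure is an assumption, and the helper names are hypothetical.

```python
import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of a 4-neighbor Laplacian over the image interior."""
    g = gray.astype(float)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def box_blur(gray: np.ndarray) -> np.ndarray:
    """3x3 mean filter over the image interior (output is 2 pixels smaller)."""
    g = gray.astype(float)
    h, w = g.shape
    total = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            total += g[dy:dy + h - 2, dx:dx + w - 2]
    return total / 9.0


rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(64, 64)).astype(float)  # high-detail noise
blurred = box_blur(sharp)                                  # smoothed copy
flat = np.full((64, 64), 128.0)                            # no detail at all

for name, img in [("sharp", sharp), ("blurred", blurred), ("flat", flat)]:
    print(name, round(laplacian_variance(img), 1))
```

An image would typically be flagged as blurry when its score falls below a threshold tuned for the dataset at hand.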
Documentation Contents
Contents:
- Installation Guide
- Quick Start Guide
- Installation & Basic Setup
- Automated Profiling - Get Started in Seconds!
- Computer Vision Quick Start
- Function Categories
- Perfect Display Optimization (New in v0.12.30)
- Complete EDA Workflow
- Complete ML Workflow (Enhanced in v0.14.0)
- Key Function Examples
- Computer Vision EDA (New in v0.9.0-v0.12.3)
- Function Categories
- Pro Tips
- Next Steps
- User Guide
- Learning Path for Data Science with edaflow
- Data Quality & Cleaning
- Advanced Time Series Topics
- 1. Forecasting
- 2. Autocorrelation & Lag Analysis
- 3. Feature Engineering for Time Series
- 4. Integrating Time Series Models
- Getting Started
- Correlation Analysis
- Advanced & Interactive Plots
- More Visualization Examples
- Best Practices
- Feature Dependency Table:
- How edaflow Splits Your Dataset: Training, Validation, and Test
- Further Resources & FAQ
- Choosing the Right Performance Visualization
- Overview
- Data Validation: A Critical First Step
- Best practice: Aim for a high data quality score to ensure robust, reliable model results.
- Complete ML Workflow Example
- Individual Function Examples
- Why is it important?
- How does it work?
- Best Practices
- This approach ensures you have a solid reference point and helps you build more robust, trustworthy machine learning solutions.
- Widely Used Model Types in Machine Learning
- Refer to scikit-learn and the respective library documentation for more details and advanced options.
- What's Next After Training the Model?
- Machine Learning Workflow with edaflow
- Advanced Features in edaflow
- Best Practices for edaflow
- Overview
- New Features & Advanced Visualization
- External Library Requirements
- Getting Started
- API Reference
- Basic EDA Workflow
- Advanced Visualization
- Data Cleaning Pipeline
- Overview
- Example Datasets
- Running the Examples
- Version 0.18.0 (2025-12-01) - Automated Profiling Report Release
- Version 0.16.4 (2025-09-12) - Major Examples & Docs Update
- Version 0.12.33 (2025-01-11) - Major API Improvement
- Version 0.12.32 (2025-08-11) - Critical Input Validation Fix
- Version 0.12.31 (2025-01-05) - Critical KeyError Hotfix
- Version 0.12.30 (2025-01-05) - Universal Display Optimization Breakthrough
- Version 0.12.29 (2025-08-11) - Critical Bug Fix for Unhashable Types
- Version 0.12.28 (2025-08-11) - Comprehensive Display Formatting Excellence
- Version 0.12.26 (2025-08-09) - Categorical Display Polish
- Version 0.12.25 (2025-08-08) - Missing Data Display Enhancement
- Version 0.12.24 (2025-08-08) - Texture Analysis Warning Fix
- Version 0.12.23 (2025-08-08) - Critical RTD Documentation Parameter Fix
- Version 0.12.22 (2025-08-08) - Google Colab Compatibility & Clean Workflow
- Version 0.12.3 (2025-08-06) - Complete Positional Argument Compatibility Fix
- Version 0.12.2 (2025-08-06) - Documentation Refresh Release
- Version 0.12.1 (2025-08-05) - Enhanced Computer Vision EDA
- Version 0.10.0 (2025-08-05) - Image Quality Assessment Release
- Version 0.9.0 (2025-08-05) - Computer Vision EDA Release
- Version 0.8.6 (2025-08-05) - PyPI Changelog Display Fix
- Version 0.8.5 (2025-08-05) - Code Organization and Structure Improvement
- Version 0.8.4 (2025-08-05) - Comprehensive Scatter Matrix Visualization Release
- Version 0.8.3 (2025-08-04) - Critical Documentation Fix Release
- Version 0.8.2 (2025-08-04) - Metadata Enhancement Release
- Version 0.8.1 (2025-08-04) - Changelog Formatting Release
- Version 0.8.0 (2025-08-04) - Statistical Histogram Analysis Release
- Version 0.7.0 (2025-08-03) - Comprehensive Heatmap Visualization Release
- Version 0.6.0 (2025-08-02) - Interactive Boxplot Visualization Release
- Version 0.5.1 (2024-01-14) - Documentation Enhancement
- Version 0.5.0 (2025-08-04) - Outlier Handling Release
- Earlier Versions
- Contributing
Useful Links
GitHub Repository: https://github.com/evanlow/edaflow
PyPI Package: https://pypi.org/project/edaflow/
Issue Tracker: https://github.com/evanlow/edaflow/issues
Changelog: Version 0.18.0 (2025-12-01) - Automated Profiling Report Release
Function Overview
edaflow provides 18 comprehensive EDA functions organized into logical categories:
Data Quality & Analysis
- check_null_columns() - Missing data analysis with color coding
- analyze_categorical_columns() - Categorical data insights
- convert_to_numeric() - Smart data type conversion
- display_column_types() - Column type classification
Data Cleaning & Preprocessing
- impute_numerical_median() - Numerical missing value imputation
- impute_categorical_mode() - Categorical missing value imputation
- handle_outliers_median() - Outlier detection and handling
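As an illustration of the outlier handling listed above, the sketch below flags values outside the usual 1.5 × IQR fences and replaces them with the column median. It is a generic pandas example, not the code behind `handle_outliers_median()`. The function name `replace_outliers_with_median` and the multiplier `k` are assumptions.

```python
import pandas as pd


def replace_outliers_with_median(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    # The median is computed on the original values, outliers included;
    # it is robust enough that a few extreme points barely move it.
    return series.mask(is_outlier, series.median())


values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 300.0])
cleaned_values = replace_outliers_with_median(values)
print(cleaned_values.tolist())
```

Here only the value 300.0 falls outside the fences, so it alone is replaced while the remaining values are left untouched.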
Visualization & Analysis
- visualize_categorical_values() - Categorical value exploration
- visualize_numerical_boxplots() - Distribution and outlier analysis
- visualize_interactive_boxplots() - Interactive Plotly visualizations
- visualize_heatmap() - Comprehensive heatmap analysis
- visualize_histograms() - Statistical distribution analysis
- visualize_scatter_matrix() - Advanced pairwise relationship analysis
Computer Vision EDA (NEW in v0.9.0-v0.12.3!)
- visualize_image_classes() - Class-wise image sample visualization for image classification datasets
Image Quality Assessment (NEW in v0.10.0-v0.12.3!)
- assess_image_quality() - Comprehensive automated quality assessment and corruption detection for image datasets
Smart Encoding (NEW in v0.12.4-v0.12.7!)
- analyze_encoding_needs() - Intelligent categorical encoding analysis and recommendations
- apply_smart_encoding() - Automated encoding application with optimal strategy selection
Helper Functions
- hello() - Package verification function
Background
edaflow was developed as part of a capstone project during an AI/ML course conducted by NTUC LearningHub (Cohort 15). Special thanks to our instructor, Ms. Isha Sehgal, whose guidance inspired the project work that led to the development of this comprehensive EDA toolkit.
License
This project is licensed under the MIT License - see the LICENSE file for details.