Data Quality & Cleaning

This guide covers edaflow’s data quality assessment and cleaning capabilities.

Overview

Data quality is the foundation of any successful data analysis or machine learning project. edaflow provides comprehensive tools for:

Missing Data Analysis: Identify and visualize null values
Data Type Issues: Detect and fix incorrect data types
Data Imputation: Smart strategies for handling missing values
Outlier Detection: Identify and handle anomalous values

Missing Data Analysis What Happens Under the Hood ~~~~~~~~~~~~~~~~~~~~~~~~~~ - edaflow.check_null_columns uses pandas to scan each column for null values and calculates the percentage of missing data. - It generates a visual summary (e.g., heatmap or color-coded table) to highlight missingness patterns. - The function applies user-defined thresholds to trigger warnings and provides actionable recommendations based on the severity of missing data. ———————

The first step in any EDA workflow should be understanding your missing data patterns.

Basic Null Analysis

import edaflow
import pandas as pd

df = pd.read_csv('your_data.csv')

# Comprehensive null analysis
edaflow.check_null_columns(df)

This function provides:

Visual null pattern display with color coding
Percentage calculations for each column
Threshold-based warnings for high null percentages
Actionable recommendations for handling missing data

Customizing Null Analysis

# Custom threshold for warnings
edaflow.check_null_columns(df, threshold=15)  # Warn if >15% null

Data Type Conversion What Happens Under the Hood ~~~~~~~~~~~~~~~~~~~~~~~~~~ - edaflow.analyze_categorical_columns inspects object columns to determine if they can be safely converted to numeric types, using heuristics and value checks. - edaflow.convert_to_numeric attempts conversion for columns that meet the threshold, handling errors gracefully and reporting any columns that could not be converted. - edaflow.display_column_types summarizes the current data types for all columns, making it easy to spot inconsistencies. ——————–

Many datasets have columns that should be numeric but are stored as objects/strings.

Smart Type Detection

# Analyze which object columns could be numeric
edaflow.analyze_categorical_columns(df)

Automatic Conversion

# Convert object columns to numeric when appropriate
df_converted = edaflow.convert_to_numeric(df, threshold=25)

Display Current Types

# See all column data types in a clean format
edaflow.display_column_types(df)

Data Imputation Strategies What Happens Under the Hood ~~~~~~~~~~~~~~~~~~~~~~~~~~ - edaflow’s imputation utilities detect column types (numerical, categorical) and select the appropriate imputation strategy (mean, median, mode, or custom logic). - The imputation process is applied column-wise, with missing values filled in-place or in a copy, depending on user preference. - All imputation steps are logged or reported, ensuring transparency and reproducibility in the cleaning workflow. —————————

edaflow provides intelligent imputation methods for different data types.

Numerical Imputation

# Use median for numerical columns (robust to outliers)
df_imputed = edaflow.impute_numerical_median(df)

Categorical Imputation

# Use mode (most frequent value) for categorical columns
df_imputed = edaflow.impute_categorical_mode(df)

Combined Approach

# Complete imputation workflow
df_clean = edaflow.convert_to_numeric(df)
df_clean = edaflow.impute_numerical_median(df_clean)
df_clean = edaflow.impute_categorical_mode(df_clean)

Outlier Detection and Handling What Happens Under the Hood ~~~~~~~~~~~~~~~~~~~~~~~~~~ - edaflow’s outlier detection tools use statistical methods (e.g., IQR, Z-score) to flag values that deviate significantly from the norm. - Visualizations (such as boxplots) are generated to help users quickly spot and interpret outliers. - The functions can optionally remove or cap outliers, with all actions logged for transparency. ——————————-

Outliers can significantly impact analysis and model performance.

Automated Outlier Handling

# Replace outliers with median values
df_no_outliers = edaflow.handle_outliers_median(
    df,
    method='z_score',  # or 'iqr'
    threshold=3
)

Method Options:

Z-Score Method: Identifies values >3 standard deviations from mean
IQR Method: Uses interquartile range to identify outliers
Median Replacement: Robust strategy that maintains data distribution

Best Practices Workflow What Happens Under the Hood ~~~~~~~~~~~~~~~~~~~~~~~~~~ - The workflow combines all edaflow data quality utilities in a recommended sequence, ensuring each cleaning step builds on the previous one. - Each function logs its actions and results, making the process auditable and reproducible. - The workflow is modular, so users can customize or skip steps as needed for their specific data challenges. ————————

Here’s a recommended data quality workflow:

import edaflow
import pandas as pd

# 1. Load and inspect data
df = pd.read_csv('your_data.csv')
print(f"Dataset shape: {df.shape}")

# 2. Check for missing values
edaflow.check_null_columns(df, threshold=10)

# 3. Analyze data types
edaflow.analyze_categorical_columns(df)
edaflow.display_column_types(df)

# 4. Convert types where appropriate
df_converted = edaflow.convert_to_numeric(df, threshold=30)

# 5. Handle missing values
df_imputed = edaflow.impute_numerical_median(df_converted)
df_imputed = edaflow.impute_categorical_mode(df_imputed)

# 6. Handle outliers
df_clean = edaflow.handle_outliers_median(
    df_imputed,
    method='iqr'
)

# 7. Verify improvements
print(f"Null values after cleaning: {df_clean.isnull().sum().sum()}")
edaflow.display_column_types(df_clean)

Common Data Quality Issues What Happens Under the Hood ~~~~~~~~~~~~~~~~~~~~~~~~~~ - edaflow scans for common data quality issues using a library of heuristics and pattern checks (e.g., constant columns, duplicate rows, mixed types). - Detected issues are summarized in a report, with severity levels and suggested remediation steps. - The system is extensible, allowing new issue types to be added as data challenges evolve. —————————

Mixed Data Types in Columns

# Example: Price column with '$' symbols and 'Free' text
# edaflow automatically handles these cases
df_converted = edaflow.convert_to_numeric(df)

Inconsistent Missing Value Representations

# Handle various null representations before using edaflow
df = df.replace(['N/A', 'n/a', 'NULL', ''], pd.NA)
edaflow.check_null_columns(df)

Date Columns as Objects

# Convert dates after using edaflow's type analysis
date_columns = ['created_date', 'modified_date']
for col in date_columns:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')

Integration with ML Workflow What Happens Under the Hood ~~~~~~~~~~~~~~~~~~~~~~~~~~ - edaflow’s data quality outputs (cleaned data, reports, logs) are designed to integrate seamlessly with its ML experiment setup and modeling functions. - Cleaned datasets and quality scores can be passed directly to ML workflows, ensuring consistency and traceability. - Integration points are documented so users can automate the transition from EDA to modeling. —————————–

Clean data is essential for machine learning:

# After cleaning with edaflow
import edaflow.ml as ml

# The ML functions expect clean data
X = df_clean.drop('target', axis=1)
y = df_clean['target']

# Setup ML experiment with validated data
config = ml.setup_ml_experiment(
   X=X, y=y,
   primary_metric="roc_auc"  # 👈 Set your main metric here! (e.g., 'f1', 'accuracy', 'r2', etc.)
)

# Additional ML-specific validation
validation_report = ml.validate_ml_data(
    X=config['X_train'],
    y=config['y_train']
)

Tips for Large Datasets What Happens Under the Hood ~~~~~~~~~~~~~~~~~~~~~~~~~~ - edaflow uses efficient, vectorized operations (via pandas/numpy) to handle large datasets without excessive memory usage. - For very large data, sampling and chunked processing options are available to maintain performance. - All tips are based on practical experience with real-world, large-scale data cleaning tasks. ————————

Memory Efficiency

# For large datasets, process in chunks or use specific columns
columns_to_analyze = ['col1', 'col2', 'col3']
edaflow.check_null_columns(df[columns_to_analyze])

Sampling Strategy

# Analyze a representative sample first
sample_df = df.sample(n=10000, random_state=42)
edaflow.analyze_categorical_columns(sample_df)

# Apply insights to full dataset
df_converted = edaflow.convert_to_numeric(df)

Quality Assessment Checklist What Happens Under the Hood ~~~~~~~~~~~~~~~~~~~~~~~~~~ - The checklist is generated from the results of all previous edaflow data quality checks and cleaning steps. - It provides a structured summary of what has been validated, cleaned, or flagged for review. - The checklist can be exported or included in reports for documentation and audit purposes. —————————–

Use this checklist to ensure comprehensive data quality assessment:

[ ] Missing Data: Check null patterns and percentages
[ ] Data Types: Verify appropriate types for each column
[ ] Duplicates: Check for and remove duplicate rows
[ ] Outliers: Identify and decide on handling strategy
[ ] Consistency: Check for consistent formatting within columns
[ ] Completeness: Ensure all required fields are present
[ ] Validity: Verify data values make sense in context
[ ] Uniqueness: Check ID fields are truly unique