Data Quality & Cleaning

This guide covers edaflow's data quality assessment and cleaning capabilities.

Overview

Data quality is the foundation of any successful data analysis or machine learning project. edaflow provides comprehensive tools for:

  • Missing Data Analysis: Identify and visualize null values

  • Data Type Issues: Detect and fix incorrect data types

  • Data Imputation: Smart strategies for handling missing values

  • Outlier Detection: Identify and handle anomalous values

Missing Data Analysis

What Happens Under the Hood

  • edaflow.check_null_columns uses pandas to scan each column for null values and calculates the percentage of missing data.

  • It generates a visual summary (e.g., heatmap or color-coded table) to highlight missingness patterns.

  • The function applies user-defined thresholds to trigger warnings and provides actionable recommendations based on the severity of missing data.

The first step in any EDA workflow should be understanding your missing data patterns.

Basic Null Analysis

import edaflow
import pandas as pd

df = pd.read_csv('your_data.csv')

# Comprehensive null analysis
edaflow.check_null_columns(df)

This function provides:

  • Visual null pattern display with color coding

  • Percentage calculations for each column

  • Threshold-based warnings for high null percentages

  • Actionable recommendations for handling missing data

Customizing Null Analysis

# Custom threshold for warnings
edaflow.check_null_columns(df, threshold=15)  # Warn if >15% null
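
To make the threshold idea concrete, here is a minimal pandas-only sketch of the same kind of check. It is not edaflow's implementation; the helper names null_percentages and columns_over_threshold are hypothetical and exist only for this illustration.

```python
import pandas as pd


def null_percentages(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of null values per column, highest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)


def columns_over_threshold(df: pd.DataFrame, threshold: float = 10.0) -> list:
    """List the columns whose null percentage is strictly above `threshold`."""
    pct = null_percentages(df)
    return pct[pct > threshold].index.tolist()


example = pd.DataFrame({
    "age": [25, None, 31, None],             # 2 of 4 values missing
    "city": ["Oslo", "Rome", None, "Lima"],  # 1 of 4 values missing
    "id": [1, 2, 3, 4],                      # nothing missing
})

flagged = columns_over_threshold(example, threshold=15)
print(null_percentages(example))
print("Columns above threshold:", flagged)
```

A lower threshold flags more columns for review, while a higher one surfaces only the most severely affected columns.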

Data Type Conversion

What Happens Under the Hood

  • edaflow.analyze_categorical_columns inspects object columns to determine if they can be safely converted to numeric types, using heuristics and value checks.

  • edaflow.convert_to_numeric attempts conversion for columns that meet the threshold, handling errors gracefully and reporting any columns that could not be converted.

  • edaflow.display_column_types summarizes the current data types for all columns, making it easy to spot inconsistencies.

Many datasets have columns that should be numeric but are stored as objects/strings.

Smart Type Detection

# Analyze which object columns could be numeric
edaflow.analyze_categorical_columns(df)

Automatic Conversion

# Convert object columns to numeric when appropriate
df_converted = edaflow.convert_to_numeric(df, threshold=25)
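
The following sketch shows one way a threshold-based conversion can work in plain pandas. It assumes the threshold means the maximum percentage of values that are allowed to fail numeric parsing; that reading is an assumption for illustration, and convert_object_columns is a hypothetical helper, not an edaflow function.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def convert_object_columns(df: pd.DataFrame, threshold: float = 35.0) -> pd.DataFrame:
    """Convert text columns to numeric when few enough values fail to parse.

    `threshold` is assumed to be the maximum percentage of non-null values
    that may be non-numeric for the column to still be converted.
    """
    result = df.copy()
    for col in result.columns:
        column = result[col]
        if not (is_object_dtype(column) or is_string_dtype(column)):
            continue  # already numeric, datetime, etc.
        non_null = column.dropna()
        if non_null.empty:
            continue
        # Values that cannot be parsed become NaN with errors="coerce".
        failed_pct = pd.to_numeric(non_null, errors="coerce").isna().mean() * 100
        if failed_pct < threshold:
            result[col] = pd.to_numeric(column, errors="coerce")
    return result


raw = pd.DataFrame({
    "price": ["10", "20", "abc", "40", "50"],  # 20% of values are not numeric
    "name": ["a", "b", "c", "d", "e"],         # 100% of values are not numeric
})
converted = convert_object_columns(raw, threshold=25)
print(converted.dtypes)
```

Note that values which fail to parse become NaN after conversion, so it is worth re-running the null analysis afterwards.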

Display Current Types

# See all column data types in a clean format
edaflow.display_column_types(df)

Data Imputation Strategies

What Happens Under the Hood

  • edaflow's imputation utilities detect column types (numerical, categorical) and select the appropriate imputation strategy (mean, median, mode, or custom logic).

  • The imputation process is applied column-wise, with missing values filled in-place or in a copy, depending on user preference.

  • All imputation steps are logged or reported, ensuring transparency and reproducibility in the cleaning workflow.

edaflow provides intelligent imputation methods for different data types.

Numerical Imputation

# Use median for numerical columns (robust to outliers)
df_imputed = edaflow.impute_numerical_median(df)

Categorical Imputation

# Use mode (most frequent value) for categorical columns
df_imputed = edaflow.impute_categorical_mode(df)

Combined Approach

# Complete imputation workflow
df_clean = edaflow.convert_to_numeric(df)
df_clean = edaflow.impute_numerical_median(df_clean)
df_clean = edaflow.impute_categorical_mode(df_clean)
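
For readers who want to see the underlying idea, the sketch below applies median and mode imputation column by column using only pandas. It is an illustration of the technique, not edaflow's code, and impute_median_and_mode is a hypothetical helper name.

```python
import pandas as pd


def impute_median_and_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their median and other columns with their mode.

    Works on a copy, so the input DataFrame is left unchanged.
    """
    result = df.copy()
    for col in result.columns:
        if not result[col].isnull().any():
            continue  # nothing to fill
        if pd.api.types.is_numeric_dtype(result[col]):
            result[col] = result[col].fillna(result[col].median())
        else:
            modes = result[col].mode(dropna=True)
            if not modes.empty:  # an all-null column has no mode
                result[col] = result[col].fillna(modes.iloc[0])
    return result


people = pd.DataFrame({
    "income": [30.0, 40.0, None, 1000.0],    # one extreme value
    "color": ["red", "blue", "red", None],
})
imputed = impute_median_and_mode(people)
print(imputed)
```

In this example the missing income is filled with the median of the observed values (40) rather than their mean (about 357), which is why the median is the more robust choice when a column contains extreme values.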

Outlier Detection and Handling

What Happens Under the Hood

  • edaflow's outlier detection tools use statistical methods (e.g., IQR, Z-score) to flag values that deviate significantly from the norm.

  • Visualizations (such as boxplots) are generated to help users quickly spot and interpret outliers.

  • The functions can optionally remove or cap outliers, with all actions logged for transparency.

Outliers can significantly impact analysis and model performance.

Automated Outlier Handling

# Replace outliers with median values
df_no_outliers = edaflow.handle_outliers_median(
    df,
    method='z_score',  # or 'iqr'
    threshold=3
)

Method Options:

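  • Z-Score Method: Flags values more than threshold standard deviations from the mean (threshold=3 in the example above)

  • IQR Method: Flags values that fall too far outside the first and third quartiles, measured in multiples of the interquartile range (1.5 × IQR is the common convention)

  • Median Replacement: Replaces flagged values with the column median, so no rows are dropped and the fill value is not itself distorted by the outliers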
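
The two detection rules and the median replacement can be sketched in plain pandas as follows. This is an illustration of the general technique under stated assumptions (a 1.5 × IQR fence, a sample standard deviation, and replacement with the median of the whole column), not edaflow's implementation; outlier_mask and replace_outliers_with_median are hypothetical names.

```python
import numpy as np
import pandas as pd


def outlier_mask(series: pd.Series, method: str = "iqr", threshold=None) -> pd.Series:
    """Return a boolean mask that is True where a value is flagged as an outlier."""
    values = series.astype(float)
    if method == "iqr":
        k = 1.5 if threshold is None else threshold
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        return (values < q1 - k * iqr) | (values > q3 + k * iqr)
    if method == "z_score":
        k = 3.0 if threshold is None else threshold
        std = values.std()  # sample standard deviation (ddof=1)
        if std == 0 or np.isnan(std):
            return pd.Series(False, index=series.index)
        return ((values - values.mean()) / std).abs() > k
    raise ValueError(f"Unknown method: {method!r}")


def replace_outliers_with_median(series: pd.Series, method: str = "iqr", threshold=None) -> pd.Series:
    """Replace flagged values with the median of the whole column."""
    values = series.astype(float)
    mask = outlier_mask(values, method=method, threshold=threshold)
    return values.mask(mask, values.median())


values = pd.Series([10, 12, 11, 13, 12, 11, 300])
cleaned = replace_outliers_with_median(values, method="iqr")
print(cleaned.tolist())
```

The choice of method matters on small samples. In this seven-value example the IQR rule flags 300, but the z-score rule with a threshold of 3 flags nothing, because the extreme value inflates the standard deviation it is measured against. With a sample standard deviation, no value in a sample of n points can have a z-score above (n - 1) / sqrt(n), which is about 2.27 for n = 7.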

Best Practices Workflow

What Happens Under the Hood

  • The workflow combines all edaflow data quality utilities in a recommended sequence, ensuring each cleaning step builds on the previous one.

  • Each function logs its actions and results, making the process auditable and reproducible.

  • The workflow is modular, so users can customize or skip steps as needed for their specific data challenges.

Here's a recommended data quality workflow:

import edaflow
import pandas as pd

# 1. Load and inspect data
df = pd.read_csv('your_data.csv')
print(f"Dataset shape: {df.shape}")

# 2. Check for missing values
edaflow.check_null_columns(df, threshold=10)

# 3. Analyze data types
edaflow.analyze_categorical_columns(df)
edaflow.display_column_types(df)

# 4. Convert types where appropriate
df_converted = edaflow.convert_to_numeric(df, threshold=30)

# 5. Handle missing values
df_imputed = edaflow.impute_numerical_median(df_converted)
df_imputed = edaflow.impute_categorical_mode(df_imputed)

# 6. Handle outliers
df_clean = edaflow.handle_outliers_median(
    df_imputed,
    method='iqr'
)

# 7. Verify improvements
print(f"Null values after cleaning: {df_clean.isnull().sum().sum()}")
edaflow.display_column_types(df_clean)

Common Data Quality Issues

What Happens Under the Hood

  • edaflow scans for common data quality issues using a library of heuristics and pattern checks (e.g., constant columns, duplicate rows, mixed types).

  • Detected issues are summarized in a report, with severity levels and suggested remediation steps.

  • The system is extensible, allowing new issue types to be added as data challenges evolve.

Mixed Data Types in Columns

# Example: Price column with '$' symbols and 'Free' text
# edaflow automatically handles these cases
df_converted = edaflow.convert_to_numeric(df)
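
If you prefer to control how such a column is cleaned, the same result can be produced explicitly in pandas before any conversion. The sketch below is a manual alternative, not a description of edaflow's internals. Treating 'Free' as a price of 0 is an assumption that may not suit every dataset.

```python
import pandas as pd

prices = pd.Series(["$19.99", "Free", "$5", "1,200.50", None])

# Strip currency symbols and thousands separators, then map 'Free' to '0'
# (an assumption about what 'Free' should mean numerically).
stripped = prices.str.replace(r"[$,]", "", regex=True).replace({"Free": "0"})

# Anything still unparseable becomes NaN instead of raising an error.
numeric_prices = pd.to_numeric(stripped, errors="coerce")
print(numeric_prices)
```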

Inconsistent Missing Value Representations

# Handle various null representations before using edaflow
df = df.replace(['N/A', 'n/a', 'NULL', ''], pd.NA)
edaflow.check_null_columns(df)

Date Columns as Objects

# Convert dates after using edaflow's type analysis
date_columns = ['created_date', 'modified_date']
for col in date_columns:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')

Integration with ML Workflow

What Happens Under the Hood

  • edaflow's data quality outputs (cleaned data, reports, logs) are designed to integrate seamlessly with its ML experiment setup and modeling functions.

  • Cleaned datasets and quality scores can be passed directly to ML workflows, ensuring consistency and traceability.

  • Integration points are documented so users can automate the transition from EDA to modeling.

Clean data is essential for machine learning:

# After cleaning with edaflow
import edaflow.ml as ml

# The ML functions expect clean data
X = df_clean.drop('target', axis=1)
y = df_clean['target']

# Setup ML experiment with validated data
config = ml.setup_ml_experiment(
    X=X, y=y,
    primary_metric="roc_auc"  # Set your main metric here (e.g., 'f1', 'accuracy', 'r2')
)

# Additional ML-specific validation
validation_report = ml.validate_ml_data(
    X=config['X_train'],
    y=config['y_train']
)

Tips for Large Datasets

What Happens Under the Hood

  • edaflow uses efficient, vectorized operations (via pandas/numpy) to handle large datasets without excessive memory usage.

  • For very large data, sampling and chunked processing options are available to maintain performance.

  • All tips are based on practical experience with real-world, large-scale data cleaning tasks.

Memory Efficiency

# For large datasets, process in chunks or use specific columns
columns_to_analyze = ['col1', 'col2', 'col3']
edaflow.check_null_columns(df[columns_to_analyze])
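
When a file is too large to load at once, null statistics can also be accumulated chunk by chunk with pandas alone. The sketch below uses the chunksize and usecols parameters of pd.read_csv; null_percentages_chunked is a hypothetical helper, and the small in-memory CSV stands in for a real file path.

```python
import io

import pandas as pd


def null_percentages_chunked(source, chunksize: int = 100_000, usecols=None) -> pd.Series:
    """Compute per-column null percentages without loading the whole file."""
    null_counts = None
    total_rows = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, usecols=usecols):
        counts = chunk.isnull().sum()
        null_counts = counts if null_counts is None else null_counts.add(counts, fill_value=0)
        total_rows += len(chunk)
    if null_counts is None or total_rows == 0:
        return pd.Series(dtype=float)  # no data rows were read
    return null_counts / total_rows * 100


# A tiny in-memory CSV used in place of a large file on disk.
csv_text = "a,b\n1,x\n,y\n3,\n4,z\n"
chunked_result = null_percentages_chunked(io.StringIO(csv_text), chunksize=2)
print(chunked_result)
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize rather than on the size of the file.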

Sampling Strategy

# Analyze a representative sample first
sample_df = df.sample(n=10000, random_state=42)
edaflow.analyze_categorical_columns(sample_df)

# Apply insights to full dataset
df_converted = edaflow.convert_to_numeric(df)

Quality Assessment Checklist

What Happens Under the Hood

  • The checklist is generated from the results of all previous edaflow data quality checks and cleaning steps.

  • It provides a structured summary of what has been validated, cleaned, or flagged for review.

  • The checklist can be exported or included in reports for documentation and audit purposes.

Use this checklist to ensure comprehensive data quality assessment:

  • [ ] Missing Data: Check null patterns and percentages

  • [ ] Data Types: Verify appropriate types for each column

  • [ ] Duplicates: Check for and remove duplicate rows

  • [ ] Outliers: Identify and decide on handling strategy

  • [ ] Consistency: Check for consistent formatting within columns

  • [ ] Completeness: Ensure all required fields are present

  • [ ] Validity: Verify data values make sense in context

  • [ ] Uniqueness: Check ID fields are truly unique
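
Several checklist items (duplicates, completeness, uniqueness) can be verified with a few lines of pandas. The sketch below is one possible way to do so; quick_quality_summary is a hypothetical helper and not part of edaflow.

```python
import pandas as pd


def quick_quality_summary(df: pd.DataFrame, id_column=None, required_columns=()) -> dict:
    """Collect a few simple data quality facts about a DataFrame."""
    summary = {
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns holding a single distinct value (nulls counted as a value)
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
        # Required fields that are absent from the DataFrame
        "missing_required_columns": [c for c in required_columns if c not in df.columns],
    }
    if id_column is not None and id_column in df.columns:
        summary["id_is_unique"] = bool(df[id_column].is_unique)
    return summary


records = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "status": ["ok", "ok", "ok", "ok"],
    "value": [5, 7, 7, 9],
})
summary = quick_quality_summary(records, id_column="id", required_columns=("id", "email"))
print(summary)
```

Exact duplicates can then be removed with df.drop_duplicates(). Whether a repeated ID is an error or an expected repeat depends on the dataset.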