Data Cleaning Pipeline ====================== This example demonstrates a robust data cleaning workflow using edaflow functions. **1. Identify Missing Data** ---------------------------- .. code-block:: python import edaflow as eda missing_report = eda.check_null_columns(df) **2. Impute Missing Values** ---------------------------- .. code-block:: python df = eda.impute_numerical_median(df, column='income') df = eda.impute_categorical_mode(df, column='category') **3. Convert Data Types** ------------------------- .. code-block:: python df = eda.convert_to_numeric(df, columns=['score', 'income']) **4. Handle Outliers** ---------------------- .. code-block:: python df = eda.handle_outliers_median(df, column='score') **5. Validate Data Quality** ---------------------------- .. code-block:: python eda.display_column_types(df) eda.validate_ml_data(df) **6. Remove Duplicates** ------------------------ .. code-block:: python df = df.drop_duplicates() **7. Handle Rare Categories** ----------------------------- .. code-block:: python df['region'] = df['region'].replace(df['region'].value_counts()[df['region'].value_counts() < 10].index, 'Other') **8. Feature Scaling** ---------------------- .. code-block:: python from sklearn.preprocessing import StandardScaler scaler = StandardScaler() df['income_scaled'] = scaler.fit_transform(df[['income']]) **Best Practices:** - Always check for missing and invalid values before modeling - Use appropriate imputation strategies for each data type - Validate data after cleaning to ensure readiness for analysis - Remove duplicates to avoid bias - Group rare categories to improve model stability - Scale features for algorithms sensitive to magnitude Refer to the ML Workflow guide for next steps after cleaning.