Data Cleaning Pipeline
This example demonstrates a robust data cleaning workflow using edaflow functions.
1. Identify Missing Data
import edaflow as eda
missing_report = eda.check_null_columns(df)
2. Impute Missing Values
df = eda.impute_numerical_median(df, column='income')
df = eda.impute_categorical_mode(df, column='category')
3. Convert Data Types
df = eda.convert_to_numeric(df, columns=['score', 'income'])
4. Handle Outliers
df = eda.handle_outliers_median(df, column='score')
5. Validate Data Quality
eda.display_column_types(df)
eda.validate_ml_data(df)
6. Remove Duplicates
df = df.drop_duplicates()
7. Handle Rare Categories
df['region'] = df['region'].replace(df['region'].value_counts()[df['region'].value_counts() < 10].index, 'Other')
8. Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['income_scaled'] = scaler.fit_transform(df[['income']])
Best Practices: - Always check for missing and invalid values before modeling - Use appropriate imputation strategies for each data type - Validate data after cleaning to ensure readiness for analysis - Remove duplicates to avoid bias - Group rare categories to improve model stability - Scale features for algorithms sensitive to magnitude
Refer to the ML Workflow guide for next steps after cleaning.