Data Quality & Cleaning
========================

This guide covers edaflow's data quality assessment and cleaning capabilities.

Overview
--------

Data quality is the foundation of any successful data analysis or machine learning project. edaflow provides comprehensive tools for:

* **Missing Data Analysis**: Identify and visualize null values
* **Data Type Issues**: Detect and fix incorrect data types
* **Data Imputation**: Smart strategies for handling missing values
* **Outlier Detection**: Identify and handle anomalous values

Missing Data Analysis
---------------------

What Happens Under the Hood
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``edaflow.check_null_columns`` uses pandas to scan each column for null values and calculates the percentage of missing data.
- It generates a visual summary (e.g., heatmap or color-coded table) to highlight missingness patterns.
- The function applies user-defined thresholds to trigger warnings and provides actionable recommendations based on the severity of missing data.

The first step in any EDA workflow should be understanding your missing data patterns.

**Basic Null Analysis**

.. code-block:: python

    import edaflow
    import pandas as pd

    df = pd.read_csv('your_data.csv')

    # Comprehensive null analysis
    edaflow.check_null_columns(df)

This function provides:

* **Visual null pattern display** with color coding
* **Percentage calculations** for each column
* **Threshold-based warnings** for high null percentages
* **Actionable recommendations** for handling missing data

**Customizing Null Analysis**

.. code-block:: python

    # Custom threshold for warnings
    edaflow.check_null_columns(df, threshold=15)  # Warn if >15% null

Data Type Conversion
--------------------

What Happens Under the Hood
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``edaflow.analyze_categorical_columns`` inspects object columns to determine if they can be safely converted to numeric types, using heuristics and value checks.
- ``edaflow.convert_to_numeric`` attempts conversion for columns that meet the threshold, handling errors gracefully and reporting any columns that could not be converted.
- ``edaflow.display_column_types`` summarizes the current data types for all columns, making it easy to spot inconsistencies.

Many datasets have columns that should be numeric but are stored as objects/strings.

**Smart Type Detection**

.. code-block:: python

    # Analyze which object columns could be numeric
    edaflow.analyze_categorical_columns(df)

**Automatic Conversion**

.. code-block:: python

    # Convert object columns to numeric when appropriate
    df_converted = edaflow.convert_to_numeric(df, threshold=25)

**Display Current Types**

.. code-block:: python

    # See all column data types in a clean format
    edaflow.display_column_types(df)

Data Imputation Strategies
--------------------------

What Happens Under the Hood
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- edaflow’s imputation utilities detect column types (numerical, categorical) and select the appropriate imputation strategy (mean, median, mode, or custom logic).
- The imputation process is applied column-wise, with missing values filled in-place or in a copy, depending on user preference.
- All imputation steps are logged or reported, ensuring transparency and reproducibility in the cleaning workflow.

edaflow provides intelligent imputation methods for different data types.

**Numerical Imputation**

.. code-block:: python

    # Use median for numerical columns (robust to outliers)
    df_imputed = edaflow.impute_numerical_median(df)

**Categorical Imputation**

.. code-block:: python

    # Use mode (most frequent value) for categorical columns
    df_imputed = edaflow.impute_categorical_mode(df)

**Combined Approach**

.. code-block:: python

    # Complete imputation workflow
    df_clean = edaflow.convert_to_numeric(df)
    df_clean = edaflow.impute_numerical_median(df_clean)
    df_clean = edaflow.impute_categorical_mode(df_clean)

Outlier Detection and Handling
------------------------------

What Happens Under the Hood
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- edaflow’s outlier detection tools use statistical methods (e.g., IQR, Z-score) to flag values that deviate significantly from the norm.
- Visualizations (such as boxplots) are generated to help users quickly spot and interpret outliers.
- The functions can optionally remove or cap outliers, with all actions logged for transparency.

Outliers can significantly impact analysis and model performance.

**Automated Outlier Handling**

.. code-block:: python

    # Replace outliers with median values
    df_no_outliers = edaflow.handle_outliers_median(
        df,
        method='z_score',  # or 'iqr'
        threshold=3
    )

**Method Options:**

* **Z-Score Method**: Flags values more than ``threshold`` standard deviations from the mean (3 in the example above)
* **IQR Method**: Uses the interquartile range to identify outliers
* **Median Replacement**: Replaces flagged values with the column median, which is itself robust to extreme values

Best Practices Workflow
-----------------------

What Happens Under the Hood
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- The workflow combines all edaflow data quality utilities in a recommended sequence, ensuring each cleaning step builds on the previous one.
- Each function logs its actions and results, making the process auditable and reproducible.
- The workflow is modular, so users can customize or skip steps as needed for their specific data challenges.

Here's a recommended data quality workflow:

.. code-block:: python

    import edaflow
    import pandas as pd

    # 1. Load and inspect data
    df = pd.read_csv('your_data.csv')
    print(f"Dataset shape: {df.shape}")

    # 2. Check for missing values
    edaflow.check_null_columns(df, threshold=10)

    # 3. Analyze data types
    edaflow.analyze_categorical_columns(df)
    edaflow.display_column_types(df)

    # 4. Convert types where appropriate
    df_converted = edaflow.convert_to_numeric(df, threshold=30)

    # 5. Handle missing values
    df_imputed = edaflow.impute_numerical_median(df_converted)
    df_imputed = edaflow.impute_categorical_mode(df_imputed)

    # 6. Handle outliers
    df_clean = edaflow.handle_outliers_median(
        df_imputed,
        method='iqr'
    )

    # 7. Verify improvements
    print(f"Null values after cleaning: {df_clean.isnull().sum().sum()}")
    edaflow.display_column_types(df_clean)

Common Data Quality Issues
--------------------------

What Happens Under the Hood
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- edaflow scans for common data quality issues using a library of heuristics and pattern checks (e.g., constant columns, duplicate rows, mixed types).
- Detected issues are summarized in a report, with severity levels and suggested remediation steps.
- The system is extensible, allowing new issue types to be added as data challenges evolve.

**Mixed Data Types in Columns**

.. code-block:: python

    # Example: Price column with '$' symbols and 'Free' text
    # edaflow automatically handles these cases
    df_converted = edaflow.convert_to_numeric(df)

**Inconsistent Missing Value Representations**

.. code-block:: python

    # Handle various null representations before using edaflow
    df = df.replace(['N/A', 'n/a', 'NULL', ''], pd.NA)
    edaflow.check_null_columns(df)

**Date Columns as Objects**

.. code-block:: python

    # Convert dates after using edaflow's type analysis
    date_columns = ['created_date', 'modified_date']
    for col in date_columns:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')

Integration with ML Workflow
----------------------------

What Happens Under the Hood
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- edaflow’s data quality outputs (cleaned data, reports, logs) are designed to integrate seamlessly with its ML experiment setup and modeling functions.
- Cleaned datasets and quality scores can be passed directly to ML workflows, ensuring consistency and traceability.
- Integration points are documented so users can automate the transition from EDA to modeling.

Clean data is essential for machine learning:

.. code-block:: python

    # After cleaning with edaflow
    import edaflow.ml as ml

    # The ML functions expect clean data
    X = df_clean.drop('target', axis=1)
    y = df_clean['target']

    # Setup ML experiment with validated data
    config = ml.setup_ml_experiment(
        X=X, y=y,
        primary_metric="roc_auc"  # 👈 Set your main metric here! (e.g., 'f1', 'accuracy', 'r2', etc.)
    )

    # Additional ML-specific validation
    validation_report = ml.validate_ml_data(
        X=config['X_train'],
        y=config['y_train']
    )

Tips for Large Datasets
-----------------------

What Happens Under the Hood
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- edaflow uses efficient, vectorized operations (via pandas/numpy) to handle large datasets without excessive memory usage.
- For very large data, sampling and chunked processing options are available to maintain performance.
- All tips are based on practical experience with real-world, large-scale data cleaning tasks.

**Memory Efficiency**

.. code-block:: python

    # For large datasets, process in chunks or use specific columns
    columns_to_analyze = ['col1', 'col2', 'col3']
    edaflow.check_null_columns(df[columns_to_analyze])

**Sampling Strategy**

.. code-block:: python

    # Analyze a representative sample first
    sample_df = df.sample(n=10000, random_state=42)
    edaflow.analyze_categorical_columns(sample_df)

    # Apply insights to full dataset
    df_converted = edaflow.convert_to_numeric(df)

Quality Assessment Checklist
----------------------------

What Happens Under the Hood
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- The checklist is generated from the results of all previous edaflow data quality checks and cleaning steps.
- It provides a structured summary of what has been validated, cleaned, or flagged for review.
- The checklist can be exported or included in reports for documentation and audit purposes.

Use this checklist to ensure comprehensive data quality assessment:

- [ ] **Missing Data**: Check null patterns and percentages
- [ ] **Data Types**: Verify appropriate types for each column
- [ ] **Duplicates**: Check for and remove duplicate rows
- [ ] **Outliers**: Identify and decide on handling strategy
- [ ] **Consistency**: Check for consistent formatting within columns
- [ ] **Completeness**: Ensure all required fields are present
- [ ] **Validity**: Verify data values make sense in context
- [ ] **Uniqueness**: Check ID fields are truly unique
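
Some checklist items, such as duplicates and ID uniqueness, are not demonstrated with an edaflow function in this guide. The sketch below shows one way to check them with plain pandas. The helper name ``quality_checklist``, its ``id_column`` parameter, and the example data are hypothetical and are not part of edaflow.

.. code-block:: python

    import pandas as pd


    def quality_checklist(df, id_column=None):
        """Summarize a few checklist items with plain pandas.

        Hypothetical helper for illustration; not an edaflow function.
        """
        report = {
            # Missing Data: percentage of nulls per column
            "null_percent": (df.isna().mean() * 100).round(2).to_dict(),
            # Duplicates: number of rows that repeat an earlier row exactly
            "duplicate_rows": int(df.duplicated().sum()),
            # Validity: columns with at most one distinct value (NaN counted)
            "constant_columns": [
                col for col in df.columns if df[col].nunique(dropna=False) <= 1
            ],
        }
        if id_column is not None:
            # Uniqueness: an ID field should contain no repeated values
            report["id_is_unique"] = bool(df[id_column].is_unique)
        return report


    # Made-up example data, used only to exercise the helper
    example = pd.DataFrame({
        "id": [1, 2, 2, 4],
        "price": [9.99, None, 5.00, 5.00],
        "currency": ["USD", "USD", "USD", "USD"],
    })
    print(quality_checklist(example, id_column="id"))

Returning a plain dictionary keeps the result easy to log or to compare before and after cleaning; items that need domain knowledge, such as validity of values in context, still require manual review.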