Visualization Functions
Advanced visualization functions for comprehensive data exploration and analysis.
Distribution Analysis
- edaflow.visualize_numerical_boxplots(df: DataFrame, columns: List[str] | None = None, figsize: tuple | None = None, rows: int | None = None, cols: int | None = None, title: str = 'Boxplots for Numerical Columns', show_skewness: bool = True, orientation: str = 'horizontal', color_palette: str = 'Set2') -> None
Create boxplots for numerical columns to visualize distributions and outliers.
This function automatically detects numerical columns and creates a grid of boxplots to help identify outliers, skewness, and distribution characteristics. Each boxplot can optionally display the skewness value in the title.
- Parameters:
df (pd.DataFrame) – The input DataFrame to analyze
columns (Optional[List[str]], optional) – Specific columns to plot. If None, all numerical columns are used. Defaults to None.
figsize (Optional[tuple], optional) – Figure size (width, height). If None, automatically calculated based on subplot grid. Defaults to None.
rows (Optional[int], optional) – Number of rows in subplot grid. If None, automatically calculated. Defaults to None.
cols (Optional[int], optional) – Number of columns in subplot grid. If None, automatically calculated. Defaults to None.
title (str, optional) – Main title for the entire plot. Defaults to “Boxplots for Numerical Columns”.
show_skewness (bool, optional) – Whether to show skewness values in subplot titles. Defaults to True.
orientation (str, optional) – Boxplot orientation. Either ‘horizontal’ or ‘vertical’. Defaults to ‘horizontal’.
color_palette (str, optional) – Seaborn color palette to use. Defaults to ‘Set2’.
- Returns:
Displays the boxplot visualization
- Return type:
None
- Raises:
ValueError – If orientation is not ‘horizontal’ or ‘vertical’
ValueError – If no numerical columns are found
Example
>>> import pandas as pd
>>> import edaflow
>>> df = pd.DataFrame({
...     'age': [25, 30, 35, 40, 100, 28, 32],  # 100 is outlier
...     'salary': [50000, 60000, 75000, 80000, 200000, 55000, 65000],  # 200000 is outlier
...     'experience': [2, 5, 8, 12, 25, 3, 6],
...     'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C']
... })
>>>
>>> # Basic boxplot visualization
>>> edaflow.visualize_numerical_boxplots(df)
>>>
>>> # Custom layout and styling
>>> edaflow.visualize_numerical_boxplots(df,
...                                      rows=2, cols=2,
...                                      title="Custom Boxplots",
...                                      orientation='vertical',
...                                      color_palette='viridis')
>>>
>>> # Specific columns only
>>> edaflow.visualize_numerical_boxplots(df, columns=['age', 'salary'])
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_numerical_boxplots
>>> visualize_numerical_boxplots(df, show_skewness=False)
Notes
Automatically identifies numerical columns (int64, float64, etc.)
Skips columns with all missing values
Outliers are clearly visible as points beyond the whiskers
Skewness interpretation:
* |skewness| < 0.5: Approximately symmetric
* 0.5 ≤ |skewness| < 1: Moderately skewed
* |skewness| ≥ 1: Highly skewed
Uses seaborn styling for better visual appearance
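The skewness thresholds in the notes above can also be applied by hand. The following is a minimal sketch, not edaflow's implementation: `moment_skewness` and `classify_skewness` are hypothetical helper names, and the estimator is the simple population moment coefficient, which may differ slightly from whatever estimator edaflow reports in the subplot titles.

```python
from statistics import fmean


def moment_skewness(values):
    """Population moment skewness g1 = m3 / m2**1.5 (assumed estimator)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: treat as symmetric
    return m3 / m2 ** 1.5


def classify_skewness(skew):
    """Map a skewness value to the three bands listed in the notes."""
    magnitude = abs(skew)
    if magnitude < 0.5:
        return "approximately symmetric"
    if magnitude < 1:
        return "moderately skewed"
    return "highly skewed"


# 'age' column from the example above; the value 100 pulls the tail right.
ages = [25, 30, 35, 40, 100, 28, 32]
print(classify_skewness(moment_skewness(ages)))
```

The sign of the skewness is discarded on purpose, since the bands in the notes are defined on the absolute value.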
- edaflow.visualize_interactive_boxplots(df: DataFrame, columns: str | List[str] | None = None, title: str = 'Interactive Boxplot Analysis', height: int = 600, color_sequence: List[str] | None = None, show_points: str = 'outliers', verbose: bool = True) -> None
Create interactive boxplots for numerical columns using Plotly Express.
This function provides an interactive alternative to matplotlib-based boxplots, allowing users to hover, zoom, and explore data distributions dynamically. Perfect for final visualization after data cleaning and outlier handling.
- Parameters:
df (pd.DataFrame) – The input DataFrame
columns (Optional[Union[str, List[str]]], optional) – Column name(s) to visualize. If None, processes all numerical columns. Defaults to None.
title (str, optional) – Title for the interactive plot. Defaults to “Interactive Boxplot Analysis”.
height (int, optional) – Height of the plot in pixels. Defaults to 600.
color_sequence (Optional[List[str]], optional) – Custom color sequence for the boxplots. If None, uses Plotly’s default colors. Defaults to None.
show_points (str, optional) – Points to show on boxplots. Options:
- “outliers”: Show only outlier points
- “all”: Show all data points
- “suspectedoutliers”: Show suspected outliers
- False: Show no points
Defaults to “outliers”.
verbose (bool, optional) – If True, displays detailed information about the visualization process. Defaults to True.
- Returns:
Displays the interactive plot directly
- Return type:
None
- Raises:
ValueError – If no valid numerical columns are found.
KeyError – If specified column(s) don’t exist in the DataFrame.
ImportError – If plotly is not installed.
Example
>>> import pandas as pd
>>> import edaflow
>>>
>>> # Create sample data
>>> df = pd.DataFrame({
...     'age': [25, 30, 28, 35, 32, 29, 31, 33],
...     'income': [50000, 55000, 48000, 62000, 51000, 45000, 53000, 49000],
...     'score': [85, 90, 78, 92, 88, 95, 81, 87],
...     'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B']
... })
>>>
>>> # Interactive visualization of all numerical columns
>>> edaflow.visualize_interactive_boxplots(df)
>>>
>>> # Visualize specific columns with custom styling
>>> edaflow.visualize_interactive_boxplots(
...     df,
...     columns=['age', 'income'],
...     title="Age and Income Distribution",
...     height=500,
...     show_points="all"
... )
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_interactive_boxplots
>>> visualize_interactive_boxplots(df, verbose=True)
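To make the "outliers" option concrete: box plots conventionally draw whiskers at 1.5 × IQR beyond the quartiles and mark anything outside as an outlier. The sketch below illustrates that convention only. `whisker_fences` and `find_outliers` are hypothetical helpers, the quartile method is an assumption (plotting libraries differ in how they interpolate quartiles), and none of this is taken from edaflow's source.

```python
from statistics import quantiles


def whisker_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    # Assumption: 'inclusive' linear interpolation for the quartiles.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """List the values that fall outside the whisker fences."""
    low, high = whisker_fences(values, k)
    return [v for v in values if v < low or v > high]


ages = [25, 30, 35, 40, 100, 28, 32]
print(whisker_fences(ages))
print(find_outliers(ages))
```

Because quartile conventions vary, a point close to a fence may be flagged by one tool and not by another; treat the fences as a guide rather than an exact match for the rendered plot.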
- edaflow.visualize_histograms(df: DataFrame, columns: str | List[str] | None = None, title: str | None = None, figsize: tuple | None = None, bins: int | str = 'auto', kde: bool = True, show_stats: bool = True, show_normal_curve: bool = True, color_palette: str = 'Set2', alpha: float = 0.7, grid_alpha: float = 0.3, rows: int | None = None, cols: int | None = None, statistical_tests: bool = True, verbose: bool = True) -> None
Create comprehensive histogram visualizations with distribution analysis and skewness detection.
This function provides detailed histogram analysis for numerical columns, including:
- Distribution shape visualization with histograms and KDE curves
- Skewness and kurtosis analysis with interpretation
- Normal distribution comparison overlay
- Statistical tests for normality (Shapiro-Wilk, Anderson-Darling)
- Comprehensive distribution statistics and insights
- Parameters:
df (pd.DataFrame) – The input DataFrame
columns (Optional[Union[str, List[str]]], optional) – Column name(s) to visualize. If None, processes all numerical columns. Defaults to None.
title (Optional[str], optional) – Main title for the entire plot. If None, auto-generated. Defaults to None.
figsize (Optional[tuple], optional) – Figure size (width, height). If None, auto-calculated. Defaults to None.
bins (Union[int, str], optional) – Number of bins or binning strategy. Options: int, ‘auto’, ‘sturges’, ‘fd’, ‘scott’, ‘sqrt’. Defaults to ‘auto’.
kde (bool, optional) – Whether to show Kernel Density Estimation curve. Defaults to True.
show_stats (bool, optional) – Whether to display statistics on each subplot. Defaults to True.
show_normal_curve (bool, optional) – Whether to overlay normal distribution curve. Defaults to True.
color_palette (str, optional) – Seaborn color palette. Defaults to ‘Set2’.
alpha (float, optional) – Transparency of histogram bars (0-1). Defaults to 0.7.
grid_alpha (float, optional) – Transparency of grid lines (0-1). Defaults to 0.3.
rows (Optional[int], optional) – Number of rows in subplot grid. If None, auto-calculated. Defaults to None.
cols (Optional[int], optional) – Number of columns in subplot grid. If None, auto-calculated. Defaults to None.
statistical_tests (bool, optional) – Whether to run normality tests (Shapiro-Wilk, etc.). Defaults to True.
verbose (bool, optional) – If True, displays detailed distribution analysis. Defaults to True.
- Returns:
Displays the histogram visualization
- Return type:
None
- Raises:
ValueError – If no numerical columns are found or DataFrame is empty.
KeyError – If specified column(s) don’t exist in the DataFrame.
Example
>>> import pandas as pd
>>> import numpy as np
>>> import edaflow
>>>
>>> # Create sample data with different distributions
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'normal': np.random.normal(100, 15, 1000),
...     'skewed_right': np.random.exponential(2, 1000),
...     'skewed_left': 10 - np.random.exponential(2, 1000),
...     'uniform': np.random.uniform(0, 100, 1000)
... })
>>>
>>> # Basic histogram analysis
>>> edaflow.visualize_histograms(df)
>>>
>>> # Custom analysis with specific columns
>>> edaflow.visualize_histograms(
...     df,
...     columns=['normal', 'skewed_right'],
...     bins=30,
...     show_normal_curve=True,
...     statistical_tests=True
... )
>>>
>>> # Detailed styling
>>> edaflow.visualize_histograms(
...     df,
...     title="Distribution Analysis Dashboard",
...     color_palette='viridis',
...     alpha=0.8,
...     figsize=(15, 10)
... )
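The string values accepted by `bins` share their names with NumPy's binning rules. Assuming edaflow forwards them unchanged (an assumption, not something this page confirms), two of the simpler rules can be sketched as follows. `sturges_bins` and `sqrt_bins` are hypothetical helper names used only for illustration.

```python
import math


def sturges_bins(n):
    """Sturges' rule: ceil(log2(n) + 1) bins for n observations."""
    return math.ceil(math.log2(n) + 1)


def sqrt_bins(n):
    """Square-root rule: ceil(sqrt(n)) bins for n observations."""
    return math.ceil(math.sqrt(n))


# The example DataFrame above has 1000 rows per column.
n_rows = 1000
print(sturges_bins(n_rows), sqrt_bins(n_rows))
```

Both rules depend only on the sample size. The 'scott' and 'fd' rules also use the spread of the data, so they adapt better to heavy tails at the cost of being harder to predict by eye.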
Relationship Analysis
- edaflow.visualize_heatmap(df: DataFrame, heatmap_type: str = 'correlation', columns: str | List[str] | None = None, title: str | None = None, figsize: tuple | None = None, cmap: str = 'RdYlBu_r', annot: bool = True, fmt: str = '.2f', square: bool = True, linewidths: float = 0.5, cbar_kws: dict | None = None, method: str = 'pearson', missing_threshold: float = 5.0, verbose: bool = True) -> None
Create comprehensive heatmap visualizations for exploratory data analysis.
This function provides multiple types of heatmaps for different EDA purposes:
- Correlation heatmaps for numerical relationships
- Missing data pattern heatmaps
- Numerical data value heatmaps
- Cross-tabulation heatmaps for categorical relationships
- Parameters:
df (pd.DataFrame) – The input DataFrame
heatmap_type (str, optional) – Type of heatmap to create. Options:
- “correlation”: Correlation matrix heatmap (default)
- “missing”: Missing data pattern heatmap
- “values”: Raw data values heatmap (for small datasets)
- “crosstab”: Cross-tabulation heatmap for categorical data
Defaults to “correlation”.
columns (Optional[Union[str, List[str]]], optional) – Column name(s) to include. If None, uses appropriate columns based on heatmap_type. Defaults to None.
title (Optional[str], optional) – Custom title for the heatmap. If None, auto-generated. Defaults to None.
figsize (Optional[tuple], optional) – Figure size (width, height). If None, auto-calculated. Defaults to None.
cmap (str, optional) – Colormap for the heatmap. Defaults to “RdYlBu_r”.
annot (bool, optional) – Whether to annotate cells with values. Defaults to True.
fmt (str, optional) – String formatting code for annotations. Defaults to “.2f”.
square (bool, optional) – Whether to make cells square-shaped. Defaults to True.
linewidths (float, optional) – Width of lines separating cells. Defaults to 0.5.
cbar_kws (Optional[dict], optional) – Keyword arguments for colorbar. Defaults to None.
method (str, optional) – Correlation method for correlation heatmaps. Options: “pearson”, “kendall”, “spearman”. Defaults to “pearson”.
missing_threshold (float, optional) – Threshold for missing data highlighting (%). Only used for missing data heatmaps. Defaults to 5.0.
verbose (bool, optional) – If True, displays detailed information about the heatmap creation process. Defaults to True.
- Returns:
Displays the heatmap visualization
- Return type:
None
- Raises:
ValueError – If heatmap_type is not supported or no suitable data found.
KeyError – If specified column(s) don’t exist in the DataFrame.
Example
>>> import pandas as pd
>>> import edaflow
>>>
>>> # Create sample data
>>> df = pd.DataFrame({
...     'age': [25, 30, 28, 35, 32, 29, 31, 33],
...     'income': [50000, 55000, 48000, 62000, 51000, 45000, 53000, 49000],
...     'score': [85, 90, 78, 92, 88, 95, 81, 87],
...     'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B']
... })
>>>
>>> # Correlation heatmap (default)
>>> edaflow.visualize_heatmap(df)
>>>
>>> # Missing data pattern heatmap
>>> edaflow.visualize_heatmap(df, heatmap_type="missing")
>>>
>>> # Custom styling
>>> edaflow.visualize_heatmap(
...     df,
...     heatmap_type="correlation",
...     method="spearman",
...     cmap="viridis",
...     title="Spearman Correlation Analysis"
... )
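The choice of `method` matters when a relationship is monotonic but not linear. The sketch below is a plain-Python illustration of how "pearson" and "spearman" differ. It does not reproduce edaflow's internals; `pearson`, `average_ranks`, and `spearman` are hypothetical helpers written for this example.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but clearly non-linear
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Spearman reaches 1 for any strictly increasing relationship, while Pearson drops below 1 as soon as the relationship bends. Comparing the two heatmaps is a quick way to spot such cases.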
- edaflow.visualize_scatter_matrix(df: DataFrame, columns: str | List[str] | None = None, diagonal: str = 'hist', upper: str = 'scatter', lower: str = 'scatter', color_by: str | None = None, show_regression: bool = True, regression_type: str = 'linear', alpha: float = 0.6, figsize: tuple | None = None, title: str = 'Scatter Matrix Analysis', color_palette: str = 'Set2', verbose: bool = True) -> None
Create comprehensive scatter matrix visualization for pairwise relationship analysis.
This function provides a powerful scatter matrix (also known as pairs plot) that shows:
- Diagonal: Distribution of individual variables (histograms, KDE, or box plots)
- Off-diagonal: Scatter plots showing pairwise relationships between variables
- Optional: Color coding by categorical variables
- Optional: Regression lines to highlight trends
- Statistical insights: Correlation coefficients and relationship patterns
Perfect for:
- Exploring pairwise relationships between numerical variables
- Validating correlation analysis with visual patterns
- Identifying non-linear relationships missed by correlation coefficients
- Feature engineering and transformation planning
- Publication-ready relationship visualization
- Parameters:
df (pd.DataFrame) – The input DataFrame
columns (Optional[Union[str, List[str]]], optional) – Columns to include in scatter matrix. If None, uses all numerical columns. If str, uses single column with others. If list, uses specified columns. Defaults to None.
diagonal (str, optional) – Type of plot for diagonal elements. Options:
- “hist”: Histograms (default)
- “kde”: Kernel Density Estimation curves
- “box”: Box plots
Defaults to “hist”.
upper (str, optional) – Type of plot for upper triangle. Options:
- “scatter”: Scatter plots (default)
- “corr”: Correlation coefficients
- “blank”: Empty (for cleaner look)
Defaults to “scatter”.
lower (str, optional) – Type of plot for lower triangle. Options:
- “scatter”: Scatter plots (default)
- “corr”: Correlation coefficients
- “blank”: Empty (for cleaner look)
Defaults to “scatter”.
color_by (Optional[str], optional) – Name of categorical column to use for color coding. If provided, scatter plots will be colored by this variable. Defaults to None.
show_regression (bool, optional) – Whether to add regression lines to scatter plots. Defaults to True.
regression_type (str, optional) – Type of regression line. Options:
- “linear”: Linear regression (default)
- “poly2”: 2nd degree polynomial
- “poly3”: 3rd degree polynomial
- “lowess”: LOWESS smoothing
Defaults to “linear”.
alpha (float, optional) – Transparency level for scatter plot points (0.0 to 1.0). Defaults to 0.6.
figsize (Optional[tuple], optional) – Figure size as (width, height). If None, automatically calculated based on number of variables. Defaults to None.
title (str, optional) – Main title for the scatter matrix. Defaults to “Scatter Matrix Analysis”.
color_palette (str, optional) – Color palette for categorical coloring. Defaults to “Set2”.
verbose (bool, optional) – If True, displays detailed information about the analysis. Defaults to True.
- Returns:
Displays the scatter matrix plot directly
- Return type:
None
- Raises:
ValueError – If DataFrame is empty or no numerical columns found
ValueError – If specified columns don’t exist or aren’t numerical
ValueError – If color_by column doesn’t exist or isn’t categorical
ValueError – If invalid diagonal, upper, or lower options provided
Examples
>>> import pandas as pd
>>> import numpy as np
>>> import edaflow
>>>
>>> # Create sample data
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'height': np.random.normal(170, 10, 100),
...     'weight': np.random.normal(70, 15, 100),
...     'age': np.random.uniform(20, 60, 100),
...     'income': np.random.lognormal(10, 0.5, 100),
...     'category': np.random.choice(['A', 'B', 'C'], 100)
... })
>>>
>>> # Basic scatter matrix (all numerical columns)
>>> edaflow.visualize_scatter_matrix(df)
>>>
>>> # Custom configuration with specific columns
>>> edaflow.visualize_scatter_matrix(
...     df,
...     columns=['height', 'weight', 'age'],
...     diagonal='kde',
...     upper='corr',
...     lower='scatter',
...     show_regression=True,
...     title="Body Measurements Relationships"
... )
>>>
>>> # Color-coded by categorical variable
>>> edaflow.visualize_scatter_matrix(
...     df,
...     columns=['height', 'weight', 'income'],
...     color_by='category',
...     regression_type='poly2',
...     alpha=0.7
... )
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_scatter_matrix
>>> visualize_scatter_matrix(df, diagonal='box', upper='blank')
Notes
Scatter matrices work best with 2-7 numerical variables (readability)
For large datasets (>1000 rows), consider sampling for performance
Regression lines help identify linear vs non-linear relationships
Color coding reveals group-specific patterns in relationships
Upper/lower triangle customization allows focus on specific aspects
Compatible with matplotlib.pyplot.savefig() for export
- Statistical Insights:
Diagonal plots show univariate distributions and skewness
Scatter plots reveal bivariate relationship patterns
Regression lines indicate trend strength and direction
Color coding shows group differences in relationships
Correlation values validate visual relationship strength
- Integration with other edaflow functions:
Use after visualize_heatmap() to validate correlation patterns
Combine with visualize_histograms() for detailed distribution analysis
Follow up with handle_outliers_median() based on scatter plot insights
Use before feature engineering to identify transformation needs
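For readers who want to see what a “linear” regression line summarizes, here is a minimal ordinary least-squares sketch. It is an illustration under stated assumptions: `fit_line` is a hypothetical helper, and how edaflow actually fits or draws its regression lines is not documented on this page.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Points that lie exactly on y = 2x + 1.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
slope, intercept = fit_line(x, y)
print(slope, intercept)
```

The sign of the slope gives the trend direction. Trend strength is better read from the correlation coefficient than from the slope, because the slope depends on the units of both variables.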
Categorical Analysis
- edaflow.visualize_categorical_values(df: DataFrame, max_unique_values: int | None = 20, show_counts: bool = True, show_percentages: bool = True) -> None
Visualize unique values in categorical (object-type) columns with counts and percentages.
This function provides a comprehensive overview of categorical columns by displaying:
- Unique values in each categorical column
- Value counts (frequency of each unique value)
- Percentages (relative frequency)
- Summary statistics for each column
- Parameters:
df (pd.DataFrame) – The input DataFrame to analyze
max_unique_values (Optional[int], optional) – Maximum number of unique values to display per column. If a column has more unique values, only the top N most frequent will be shown. Defaults to 20.
show_counts (bool, optional) – Whether to show the count of each unique value. Defaults to True.
show_percentages (bool, optional) – Whether to show the percentage of each unique value. Defaults to True.
- Returns:
Prints visualization results directly to console with formatting
- Return type:
None
Example
>>> import pandas as pd
>>> import edaflow
>>> df = pd.DataFrame({
...     'category': ['A', 'B', 'A', 'C', 'B', 'A'],
...     'status': ['active', 'inactive', 'active', 'pending', 'active', 'active'],
...     'region': ['North', 'South', 'North', 'East', 'West', 'North'],
...     'score': [85, 92, 78, 88, 95, 82]
... })
>>>
>>> # Basic visualization
>>> edaflow.visualize_categorical_values(df)
>>>
>>> # Show only top 10 values per column, without percentages
>>> edaflow.visualize_categorical_values(df, max_unique_values=10, show_percentages=False)
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_categorical_values
>>> visualize_categorical_values(df, max_unique_values=15)
Notes
Only analyzes columns with object dtype (categorical/string columns)
Columns with many unique values are truncated to show most frequent ones
Provides summary statistics including total unique values and most common value
Uses color coding to highlight column names and important information
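The counts and percentages this function reports can be approximated in a few lines of plain Python. The sketch below only illustrates the idea of "top N most frequent values with relative frequency". `summarize_values` is a hypothetical helper, and the rounding and ordering are assumptions rather than edaflow's actual output format.

```python
from collections import Counter


def summarize_values(values, max_unique_values=20):
    """Return (value, count, percentage) for the most frequent values."""
    counts = Counter(values)
    total = len(values)
    # most_common(None) returns every value, mirroring an unlimited display.
    return [
        (value, count, round(100 * count / total, 2))
        for value, count in counts.most_common(max_unique_values)
    ]


# 'status' column from the example above.
status = ['active', 'inactive', 'active', 'pending', 'active', 'active']
for value, count, pct in summarize_values(status, max_unique_values=2):
    print(f"{value}: {count} ({pct}%)")
```

With `max_unique_values=2` only the two most frequent values are kept, which mirrors the truncation described in the notes for columns with many unique values.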