Visualization Functions

Advanced visualization functions for comprehensive data exploration and analysis.

Distribution Analysis

edaflow.visualize_numerical_boxplots(df: DataFrame, columns: List[str] | None = None, figsize: tuple | None = None, rows: int | None = None, cols: int | None = None, title: str = 'Boxplots for Numerical Columns', show_skewness: bool = True, orientation: str = 'horizontal', color_palette: str = 'Set2') → None

Create boxplots for numerical columns to visualize distributions and outliers.

This function automatically detects numerical columns and creates a grid of boxplots to help identify outliers, skewness, and distribution characteristics. Each boxplot can optionally display the skewness value in the title.

Parameters:
  • df (pd.DataFrame) – The input DataFrame to analyze

  • columns (Optional[List[str]], optional) – Specific columns to plot. If None, all numerical columns are used. Defaults to None.

  • figsize (Optional[tuple], optional) – Figure size (width, height). If None, automatically calculated based on subplot grid. Defaults to None.

  • rows (Optional[int], optional) – Number of rows in subplot grid. If None, automatically calculated. Defaults to None.

  • cols (Optional[int], optional) – Number of columns in subplot grid. If None, automatically calculated. Defaults to None.

  • title (str, optional) – Main title for the entire plot. Defaults to “Boxplots for Numerical Columns”.

  • show_skewness (bool, optional) – Whether to show skewness values in subplot titles. Defaults to True.

  • orientation (str, optional) – Boxplot orientation. Either ‘horizontal’ or ‘vertical’. Defaults to ‘horizontal’.

  • color_palette (str, optional) – Seaborn color palette to use. Defaults to ‘Set2’.

Returns:

Displays the boxplot visualization

Return type:

None

Raises:
  • ValueError – If orientation is not ‘horizontal’ or ‘vertical’

  • ValueError – If no numerical columns are found

Example

>>> import pandas as pd
>>> import edaflow
>>> df = pd.DataFrame({
...     'age': [25, 30, 35, 40, 100, 28, 32],  # 100 is outlier
...     'salary': [50000, 60000, 75000, 80000, 200000, 55000, 65000],  # 200000 is outlier
...     'experience': [2, 5, 8, 12, 25, 3, 6],
...     'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C']
... })
>>>
>>> # Basic boxplot visualization
>>> edaflow.visualize_numerical_boxplots(df)
>>>
>>> # Custom layout and styling
>>> edaflow.visualize_numerical_boxplots(df,
...                                     rows=2, cols=2,
...                                     title="Custom Boxplots",
...                                     orientation='vertical',
...                                     color_palette='viridis')
>>>
>>> # Specific columns only
>>> edaflow.visualize_numerical_boxplots(df, columns=['age', 'salary'])
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_numerical_boxplots
>>> visualize_numerical_boxplots(df, show_skewness=False)

Notes

  • Automatically identifies numerical columns (int64, float64, etc.)

  • Skips columns with all missing values

  • Outliers are clearly visible as points beyond the whiskers

  • Skewness interpretation:
      - |skewness| < 0.5: Approximately symmetric
      - 0.5 ≤ |skewness| < 1: Moderately skewed
      - |skewness| ≥ 1: Highly skewed

  • Uses seaborn styling for better visual appearance
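The skewness thresholds in the notes above can be sketched in plain Python. This is a minimal illustration, not edaflow's implementation: the helper names `sample_skewness` and `classify_skewness` are hypothetical, and the choice of the bias-adjusted Fisher-Pearson estimator is an assumption about how the value shown in the subplot titles is computed.

```python
import math


def sample_skewness(values):
    """Bias-adjusted Fisher-Pearson skewness (assumed estimator, see lead-in)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: this sketch treats it as symmetric
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


def classify_skewness(skew):
    """Apply the thresholds from the notes: 0.5 and 1 on |skewness|."""
    magnitude = abs(skew)
    if magnitude < 0.5:
        return "approximately symmetric"
    if magnitude < 1:
        return "moderately skewed"
    return "highly skewed"


# The 'age' column from the example above; 100 pulls the right tail out.
ages = [25, 30, 35, 40, 100, 28, 32]
age_skew = sample_skewness(ages)
print(f"age: skewness={age_skew:.2f} -> {classify_skewness(age_skew)}")
```

A single extreme value is enough to push a small column into the "highly skewed" band, so a high skewness in a subplot title is worth reading together with the outlier points on the same boxplot.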

edaflow.visualize_interactive_boxplots(df: DataFrame, columns: str | List[str] | None = None, title: str = 'Interactive Boxplot Analysis', height: int = 600, color_sequence: List[str] | None = None, show_points: str = 'outliers', verbose: bool = True) → None

Create interactive boxplots for numerical columns using Plotly Express.

This function provides an interactive alternative to matplotlib-based boxplots, allowing users to hover, zoom, and explore data distributions dynamically. Perfect for final visualization after data cleaning and outlier handling.

Parameters:
  • df (pd.DataFrame) – The input DataFrame

  • columns (Optional[Union[str, List[str]]], optional) – Column name(s) to visualize. If None, processes all numerical columns. Defaults to None.

  • title (str, optional) – Title for the interactive plot. Defaults to “Interactive Boxplot Analysis”.

  • height (int, optional) – Height of the plot in pixels. Defaults to 600.

  • color_sequence (Optional[List[str]], optional) – Custom color sequence for the boxplots. If None, uses Plotly’s default colors. Defaults to None.

  • show_points (str, optional) – Points to show on boxplots. Defaults to “outliers”. Options:
      - “outliers”: Show only outlier points
      - “all”: Show all data points
      - “suspectedoutliers”: Show suspected outliers
      - False: Show no points

  • verbose (bool, optional) – If True, displays detailed information about the visualization process. Defaults to True.

Returns:

Displays the interactive plot directly

Return type:

None

Raises:
  • ValueError – If no valid numerical columns are found.

  • KeyError – If specified column(s) don’t exist in the DataFrame.

  • ImportError – If plotly is not installed.

Example

>>> import pandas as pd
>>> import edaflow
>>>
>>> # Create sample data
>>> df = pd.DataFrame({
...     'age': [25, 30, 28, 35, 32, 29, 31, 33],
...     'income': [50000, 55000, 48000, 62000, 51000, 45000, 53000, 49000],
...     'score': [85, 90, 78, 92, 88, 95, 81, 87],
...     'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B']
... })
>>>
>>> # Interactive visualization of all numerical columns
>>> edaflow.visualize_interactive_boxplots(df)
>>>
>>> # Visualize specific columns with custom styling
>>> edaflow.visualize_interactive_boxplots(
...     df,
...     columns=['age', 'income'],
...     title="Age and Income Distribution",
...     height=500,
...     show_points="all"
... )
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_interactive_boxplots
>>> visualize_interactive_boxplots(df, verbose=True)
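The “outliers” setting of show_points depends on which points count as lying beyond the whiskers. A minimal sketch of the common Tukey convention (fences at 1.5 × IQR beyond the quartiles) follows. The helper names `tukey_fences` and `find_outliers` are hypothetical, the data is illustrative, and it is an assumption that the plotted whiskers use exactly this quartile method; different tools interpolate quartiles slightly differently.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the fences, in input order."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


ages = [25, 30, 35, 40, 100, 28, 32]  # illustrative data with one extreme value
print(tukey_fences(ages))   # (16.25, 50.25)
print(find_outliers(ages))  # [100]
```

Running a check like this before plotting gives a quick count of the points that an “outliers” view would be expected to draw individually.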

edaflow.visualize_histograms(df: DataFrame, columns: str | List[str] | None = None, title: str | None = None, figsize: tuple | None = None, bins: int | str = 'auto', kde: bool = True, show_stats: bool = True, show_normal_curve: bool = True, color_palette: str = 'Set2', alpha: float = 0.7, grid_alpha: float = 0.3, rows: int | None = None, cols: int | None = None, statistical_tests: bool = True, verbose: bool = True) → None

Create comprehensive histogram visualizations with distribution analysis and skewness detection.

This function provides detailed histogram analysis for numerical columns, including:
  • Distribution shape visualization with histograms and KDE curves

  • Skewness and kurtosis analysis with interpretation

  • Normal distribution comparison overlay

  • Statistical tests for normality (Shapiro-Wilk, Anderson-Darling)

  • Comprehensive distribution statistics and insights

Parameters:
  • df (pd.DataFrame) – The input DataFrame

  • columns (Optional[Union[str, List[str]]], optional) – Column name(s) to visualize. If None, processes all numerical columns. Defaults to None.

  • title (Optional[str], optional) – Main title for the entire plot. If None, auto-generated. Defaults to None.

  • figsize (Optional[tuple], optional) – Figure size (width, height). If None, auto-calculated. Defaults to None.

  • bins (Union[int, str], optional) – Number of bins or binning strategy. Options: int, ‘auto’, ‘sturges’, ‘fd’, ‘scott’, ‘sqrt’. Defaults to ‘auto’.

  • kde (bool, optional) – Whether to show Kernel Density Estimation curve. Defaults to True.

  • show_stats (bool, optional) – Whether to display statistics on each subplot. Defaults to True.

  • show_normal_curve (bool, optional) – Whether to overlay normal distribution curve. Defaults to True.

  • color_palette (str, optional) – Seaborn color palette. Defaults to ‘Set2’.

  • alpha (float, optional) – Transparency of histogram bars (0-1). Defaults to 0.7.

  • grid_alpha (float, optional) – Transparency of grid lines (0-1). Defaults to 0.3.

  • rows (Optional[int], optional) – Number of rows in subplot grid. If None, auto-calculated. Defaults to None.

  • cols (Optional[int], optional) – Number of columns in subplot grid. If None, auto-calculated. Defaults to None.

  • statistical_tests (bool, optional) – Whether to run normality tests (Shapiro-Wilk, etc.). Defaults to True.

  • verbose (bool, optional) – If True, displays detailed distribution analysis. Defaults to True.

Returns:

Displays the histogram visualization

Return type:

None

Raises:
  • ValueError – If no numerical columns are found or DataFrame is empty.

  • KeyError – If specified column(s) don’t exist in the DataFrame.

Example

>>> import pandas as pd
>>> import numpy as np
>>> import edaflow
>>>
>>> # Create sample data with different distributions
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'normal': np.random.normal(100, 15, 1000),
...     'skewed_right': np.random.exponential(2, 1000),
...     'skewed_left': 10 - np.random.exponential(2, 1000),
...     'uniform': np.random.uniform(0, 100, 1000)
... })
>>>
>>> # Basic histogram analysis
>>> edaflow.visualize_histograms(df)
>>>
>>> # Custom analysis with specific columns
>>> edaflow.visualize_histograms(
...     df,
...     columns=['normal', 'skewed_right'],
...     bins=30,
...     show_normal_curve=True,
...     statistical_tests=True
... )
>>>
>>> # Detailed styling
>>> edaflow.visualize_histograms(
...     df,
...     title="Distribution Analysis Dashboard",
...     color_palette='viridis',
...     alpha=0.8,
...     figsize=(15, 10)
... )
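The string options for bins name standard bin-count rules. The sketch below works through three of them from their textbook definitions, using only the standard library. The function names are hypothetical, and it is an assumption that edaflow resolves these strings the same way (the names match the estimators accepted by numpy.histogram_bin_edges, but this section does not confirm that the call is forwarded there).

```python
import math
import statistics


def sturges_bins(n):
    """Sturges' rule: log2(n) + 1 bins, rounded up."""
    return math.ceil(math.log2(n)) + 1


def sqrt_bins(n):
    """Square-root rule: sqrt(n) bins, rounded up."""
    return math.ceil(math.sqrt(n))


def fd_bins(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    width = 2 * (q3 - q1) / len(values) ** (1 / 3)
    if width == 0:
        return 1  # degenerate spread: fall back to a single bin
    return math.ceil((max(values) - min(values)) / width)


values = list(range(1, 101))  # illustrative data: 100 evenly spaced points
print("sturges:", sturges_bins(len(values)))  # 8
print("sqrt:   ", sqrt_bins(len(values)))     # 10
print("fd:     ", fd_bins(values))            # 5
```

Sturges' rule and the square-root rule depend only on the sample size, while Freedman-Diaconis also reacts to the spread of the data, so on heavy-tailed columns the three can disagree substantially. Passing an explicit integer, as in the bins=30 call above, avoids the choice altogether.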

Relationship Analysis

edaflow.visualize_heatmap(df: DataFrame, heatmap_type: str = 'correlation', columns: str | List[str] | None = None, title: str | None = None, figsize: tuple | None = None, cmap: str = 'RdYlBu_r', annot: bool = True, fmt: str = '.2f', square: bool = True, linewidths: float = 0.5, cbar_kws: dict | None = None, method: str = 'pearson', missing_threshold: float = 5.0, verbose: bool = True) → None

Create comprehensive heatmap visualizations for exploratory data analysis.

This function provides multiple types of heatmaps for different EDA purposes:
  • Correlation heatmaps for numerical relationships

  • Missing data pattern heatmaps

  • Numerical data value heatmaps

  • Cross-tabulation heatmaps for categorical relationships

Parameters:
  • df (pd.DataFrame) – The input DataFrame

  • heatmap_type (str, optional) – Type of heatmap to create. Defaults to “correlation”. Options:
      - “correlation”: Correlation matrix heatmap (default)
      - “missing”: Missing data pattern heatmap
      - “values”: Raw data values heatmap (for small datasets)
      - “crosstab”: Cross-tabulation heatmap for categorical data

  • columns (Optional[Union[str, List[str]]], optional) – Column name(s) to include. If None, uses appropriate columns based on heatmap_type. Defaults to None.

  • title (Optional[str], optional) – Custom title for the heatmap. If None, auto-generated. Defaults to None.

  • figsize (Optional[tuple], optional) – Figure size (width, height). If None, auto-calculated. Defaults to None.

  • cmap (str, optional) – Colormap for the heatmap. Defaults to “RdYlBu_r”.

  • annot (bool, optional) – Whether to annotate cells with values. Defaults to True.

  • fmt (str, optional) – String formatting code for annotations. Defaults to “.2f”.

  • square (bool, optional) – Whether to make cells square-shaped. Defaults to True.

  • linewidths (float, optional) – Width of lines separating cells. Defaults to 0.5.

  • cbar_kws (Optional[dict], optional) – Keyword arguments for colorbar. Defaults to None.

  • method (str, optional) – Correlation method for correlation heatmaps. Options: “pearson”, “kendall”, “spearman”. Defaults to “pearson”.

  • missing_threshold (float, optional) – Threshold for missing data highlighting (%). Only used for missing data heatmaps. Defaults to 5.0.

  • verbose (bool, optional) – If True, displays detailed information about the heatmap creation process. Defaults to True.

Returns:

Displays the heatmap visualization

Return type:

None

Raises:
  • ValueError – If heatmap_type is not supported or no suitable data found.

  • KeyError – If specified column(s) don’t exist in the DataFrame.
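The method parameter changes what a correlation heatmap measures. As a rough illustration, the sketch below computes Pearson and Spearman coefficients by hand for a monotonic but non-linear relationship. The helpers `pearson`, `average_ranks`, and `spearman` are hypothetical stand-ins written for this example; they are not edaflow functions, and the data is made up.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic but clearly non-linear
print("pearson: ", round(pearson(x, y), 3))
print("spearman:", round(spearman(x, y), 3))
```

Because Spearman works on ranks, it reports a perfect association for any strictly increasing relationship, while Pearson drops below 1 as soon as the relationship bends. Comparing heatmaps drawn with both methods is a quick way to spot such columns.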

Example

>>> import pandas as pd
>>> import edaflow
>>>
>>> # Create sample data
>>> df = pd.DataFrame({
...     'age': [25, 30, 28, 35, 32, 29, 31, 33],
...     'income': [50000, 55000, 48000, 62000, 51000, 45000, 53000, 49000],
...     'score': [85, 90, 78, 92, 88, 95, 81, 87],
...     'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B']
... })
>>>
>>> # Correlation heatmap (default)
>>> edaflow.visualize_heatmap(df)
>>>
>>> # Missing data pattern heatmap
>>> edaflow.visualize_heatmap(df, heatmap_type="missing")
>>>
>>> # Custom styling
>>> edaflow.visualize_heatmap(
...     df,
...     heatmap_type="correlation",
...     method="spearman",
...     cmap="viridis",
...     title="Spearman Correlation Analysis"
... )

edaflow.visualize_scatter_matrix(df: DataFrame, columns: str | List[str] | None = None, diagonal: str = 'hist', upper: str = 'scatter', lower: str = 'scatter', color_by: str | None = None, show_regression: bool = True, regression_type: str = 'linear', alpha: float = 0.6, figsize: tuple | None = None, title: str = 'Scatter Matrix Analysis', color_palette: str = 'Set2', verbose: bool = True) → None

Create comprehensive scatter matrix visualization for pairwise relationship analysis.

This function provides a powerful scatter matrix (also known as pairs plot) that shows:
  • Diagonal: Distribution of individual variables (histograms, KDE, or box plots)

  • Off-diagonal: Scatter plots showing pairwise relationships between variables

  • Optional: Color coding by categorical variables

  • Optional: Regression lines to highlight trends

  • Statistical insights: Correlation coefficients and relationship patterns

Perfect for:
  • Exploring pairwise relationships between numerical variables

  • Validating correlation analysis with visual patterns

  • Identifying non-linear relationships missed by correlation coefficients

  • Feature engineering and transformation planning

  • Publication-ready relationship visualization

Parameters:
  • df (pd.DataFrame) – The input DataFrame

  • columns (Optional[Union[str, List[str]]], optional) – Columns to include in scatter matrix. If None, uses all numerical columns. If str, uses single column with others. If list, uses specified columns. Defaults to None.

  • diagonal (str, optional) – Type of plot for diagonal elements. Defaults to “hist”. Options:
      - “hist”: Histograms (default)
      - “kde”: Kernel Density Estimation curves
      - “box”: Box plots

  • upper (str, optional) – Type of plot for upper triangle. Defaults to “scatter”. Options:
      - “scatter”: Scatter plots (default)
      - “corr”: Correlation coefficients
      - “blank”: Empty (for cleaner look)

  • lower (str, optional) – Type of plot for lower triangle. Defaults to “scatter”. Options:
      - “scatter”: Scatter plots (default)
      - “corr”: Correlation coefficients
      - “blank”: Empty (for cleaner look)

  • color_by (Optional[str], optional) – Name of categorical column to use for color coding. If provided, scatter plots will be colored by this variable. Defaults to None.

  • show_regression (bool, optional) – Whether to add regression lines to scatter plots. Defaults to True.

  • regression_type (str, optional) – Type of regression line. Defaults to “linear”. Options:
      - “linear”: Linear regression (default)
      - “poly2”: 2nd degree polynomial
      - “poly3”: 3rd degree polynomial
      - “lowess”: LOWESS smoothing

  • alpha (float, optional) – Transparency level for scatter plot points (0.0 to 1.0). Defaults to 0.6.

  • figsize (Optional[tuple], optional) – Figure size as (width, height). If None, automatically calculated based on number of variables. Defaults to None.

  • title (str, optional) – Main title for the scatter matrix. Defaults to “Scatter Matrix Analysis”.

  • color_palette (str, optional) – Color palette for categorical coloring. Defaults to “Set2”.

  • verbose (bool, optional) – If True, displays detailed information about the analysis. Defaults to True.

Returns:

Displays the scatter matrix plot directly

Return type:

None

Raises:
  • ValueError – If DataFrame is empty or no numerical columns found

  • ValueError – If specified columns don’t exist or aren’t numerical

  • ValueError – If color_by column doesn’t exist or isn’t categorical

  • ValueError – If invalid diagonal, upper, or lower options provided

Examples

>>> import pandas as pd
>>> import numpy as np
>>> import edaflow
>>>
>>> # Create sample data
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'height': np.random.normal(170, 10, 100),
...     'weight': np.random.normal(70, 15, 100),
...     'age': np.random.uniform(20, 60, 100),
...     'income': np.random.lognormal(10, 0.5, 100),
...     'category': np.random.choice(['A', 'B', 'C'], 100)
... })
>>>
>>> # Basic scatter matrix (all numerical columns)
>>> edaflow.visualize_scatter_matrix(df)
>>>
>>> # Custom configuration with specific columns
>>> edaflow.visualize_scatter_matrix(
...     df,
...     columns=['height', 'weight', 'age'],
...     diagonal='kde',
...     upper='corr',
...     lower='scatter',
...     show_regression=True,
...     title="Body Measurements Relationships"
... )
>>>
>>> # Color-coded by categorical variable
>>> edaflow.visualize_scatter_matrix(
...     df,
...     columns=['height', 'weight', 'income'],
...     color_by='category',
...     regression_type='poly2',
...     alpha=0.7
... )
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_scatter_matrix
>>> visualize_scatter_matrix(df, diagonal='box', upper='blank')

Notes

  • Scatter matrices work best with 2-7 numerical variables (readability)

  • For large datasets (>1000 rows), consider sampling for performance

  • Regression lines help identify linear vs non-linear relationships

  • Color coding reveals group-specific patterns in relationships

  • Upper/lower triangle customization allows focus on specific aspects

  • Compatible with matplotlib.pyplot.savefig() for export
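To make the diagonal, upper, and lower options concrete, and to show why readability drops beyond a handful of variables, here is a small sketch of how the cells of an n × n scatter matrix divide up. The helpers `panel_roles` and `panel_counts` are hypothetical and only model the grid layout; they draw nothing and are not part of edaflow. The sketch assumes the usual convention that the upper triangle holds the cells whose column index exceeds the row index.

```python
def panel_roles(columns):
    """Map each (row, column) cell to 'diagonal', 'upper', or 'lower'."""
    roles = {}
    for i, row_name in enumerate(columns):
        for j, col_name in enumerate(columns):
            if i == j:
                role = "diagonal"  # one variable against itself
            elif j > i:
                role = "upper"     # cell above the diagonal
            else:
                role = "lower"     # cell below the diagonal
            roles[(row_name, col_name)] = role
    return roles


def panel_counts(n_variables):
    """Count panels per region for an n x n scatter matrix."""
    pairs = n_variables * (n_variables - 1) // 2
    return {
        "total": n_variables ** 2,
        "diagonal": n_variables,
        "upper": pairs,
        "lower": pairs,
    }


roles = panel_roles(["height", "weight", "age"])
print(roles[("height", "weight")])  # upper
print(roles[("age", "height")])     # lower

for n in (3, 7, 12):
    print(n, panel_counts(n))
```

The panel count grows with the square of the number of variables: 7 variables already give 49 panels and 12 give 144. With the default settings the upper and lower triangles show the same pairs, which is why setting one of them to “corr” or “blank” loses no information.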

Statistical Insights:
  • Diagonal plots show univariate distributions and skewness

  • Scatter plots reveal bivariate relationship patterns

  • Regression lines indicate trend strength and direction

  • Color coding shows group differences in relationships

  • Correlation values validate visual relationship strength

Integration with other edaflow functions:
  • Use after visualize_heatmap() to validate correlation patterns

  • Combine with visualize_histograms() for detailed distribution analysis

  • Follow up with handle_outliers_median() based on scatter plot insights

  • Use before feature engineering to identify transformation needs

Categorical Analysis

edaflow.visualize_categorical_values(df: DataFrame, max_unique_values: int | None = 20, show_counts: bool = True, show_percentages: bool = True) → None

Visualize unique values in categorical (object-type) columns with counts and percentages.

This function provides a comprehensive overview of categorical columns by displaying:
  • Unique values in each categorical column

  • Value counts (frequency of each unique value)

  • Percentages (relative frequency)

  • Summary statistics for each column

Parameters:
  • df (pd.DataFrame) – The input DataFrame to analyze

  • max_unique_values (Optional[int], optional) – Maximum number of unique values to display per column. If a column has more unique values, only the top N most frequent will be shown. Defaults to 20.

  • show_counts (bool, optional) – Whether to show the count of each unique value. Defaults to True.

  • show_percentages (bool, optional) – Whether to show the percentage of each unique value. Defaults to True.

Returns:

Prints visualization results directly to console with formatting

Return type:

None

Example

>>> import pandas as pd
>>> import edaflow
>>> df = pd.DataFrame({
...     'category': ['A', 'B', 'A', 'C', 'B', 'A'],
...     'status': ['active', 'inactive', 'active', 'pending', 'active', 'active'],
...     'region': ['North', 'South', 'North', 'East', 'West', 'North'],
...     'score': [85, 92, 78, 88, 95, 82]
... })
>>>
>>> # Basic visualization
>>> edaflow.visualize_categorical_values(df)
>>>
>>> # Show only top 10 values per column, without percentages
>>> edaflow.visualize_categorical_values(df, max_unique_values=10, show_percentages=False)
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_categorical_values
>>> visualize_categorical_values(df, max_unique_values=15)

Notes

  • Only analyzes columns with object dtype (categorical/string columns)

  • Columns with many unique values are truncated to show most frequent ones

  • Provides summary statistics including total unique values and most common value

  • Uses color coding to highlight column names and important information