edaflow.visualize_scatter_matrix

edaflow.visualize_scatter_matrix(df: DataFrame, columns: str | List[str] | None = None, diagonal: str = 'hist', upper: str = 'scatter', lower: str = 'scatter', color_by: str | None = None, show_regression: bool = True, regression_type: str = 'linear', alpha: float = 0.6, figsize: tuple | None = None, title: str = 'Scatter Matrix Analysis', color_palette: str = 'Set2', verbose: bool = True) → None

Create comprehensive scatter matrix visualization for pairwise relationship analysis.

This function provides a powerful scatter matrix (also known as pairs plot) that shows:
- Diagonal: Distribution of individual variables (histograms, KDE, or box plots)
- Off-diagonal: Scatter plots showing pairwise relationships between variables
- Optional: Color coding by categorical variables
- Optional: Regression lines to highlight trends
- Statistical insights: Correlation coefficients and relationship patterns

Perfect for:
- Exploring pairwise relationships between numerical variables
- Validating correlation analysis with visual patterns
- Identifying non-linear relationships missed by correlation coefficients
- Feature engineering and transformation planning
- Publication-ready relationship visualization

Parameters:

df (pd.DataFrame) – The input DataFrame
columns (Optional[Union[str, List[str]]], optional) – Columns to include in scatter matrix. If None, uses all numerical columns. If str, uses single column with others. If list, uses specified columns. Defaults to None.
diagonal (str, optional) – Type of plot for diagonal elements. Options:
- “hist”: Histograms (default)
- “kde”: Kernel Density Estimation curves
- “box”: Box plots
Defaults to “hist”.
upper (str, optional) – Type of plot for upper triangle. Options:
- “scatter”: Scatter plots (default)
- “corr”: Correlation coefficients
- “blank”: Empty (for cleaner look)
Defaults to “scatter”.
lower (str, optional) – Type of plot for lower triangle. Options:
- “scatter”: Scatter plots (default)
- “corr”: Correlation coefficients
- “blank”: Empty (for cleaner look)
Defaults to “scatter”.
color_by (Optional[str], optional) – Name of categorical column to use for color coding. If provided, scatter plots will be colored by this variable. Defaults to None.
show_regression (bool, optional) – Whether to add regression lines to scatter plots. Defaults to True.
regression_type (str, optional) – Type of regression line. Options:
- “linear”: Linear regression (default)
- “poly2”: 2nd degree polynomial
- “poly3”: 3rd degree polynomial
- “lowess”: LOWESS smoothing
Defaults to “linear”.
alpha (float, optional) – Transparency level for scatter plot points (0.0 to 1.0). Defaults to 0.6.
figsize (Optional[tuple], optional) – Figure size as (width, height). If None, automatically calculated based on number of variables. Defaults to None.
title (str, optional) – Main title for the scatter matrix. Defaults to “Scatter Matrix Analysis”.
color_palette (str, optional) – Color palette for categorical coloring. Defaults to “Set2”.
verbose (bool, optional) – If True, displays detailed information about the analysis. Defaults to True.
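
To make the regression_type options concrete, here is a minimal sketch of what the “linear”, “poly2”, and “poly3” choices mean as least-squares polynomial fits. It uses numpy.polyfit as a stand-in and is not edaflow's internal code; the names DEGREE_BY_TYPE and fit_trend are hypothetical, and “lowess” is left out because it needs a separate smoothing library.

```python
import numpy as np

# Assumption for illustration: each documented regression_type name maps to
# a polynomial degree. This is not taken from edaflow's source.
DEGREE_BY_TYPE = {"linear": 1, "poly2": 2, "poly3": 3}


def fit_trend(x, y, regression_type="linear", n_points=50):
    """Return (x_grid, y_fit) for a least-squares polynomial trend line."""
    degree = DEGREE_BY_TYPE[regression_type]
    coeffs = np.polyfit(x, y, deg=degree)  # coefficients, highest power first
    x_grid = np.linspace(np.min(x), np.max(x), n_points)
    return x_grid, np.polyval(coeffs, x_grid)


# A noiseless quadratic: the linear fit misses the curvature,
# while the degree-2 fit reproduces it.
x = np.linspace(-3, 3, 40)
y = x ** 2
x_grid, y_lin = fit_trend(x, y, "linear")
_, y_quad = fit_trend(x, y, "poly2")
linear_error = np.max(np.abs(y_lin - x_grid ** 2))
quad_error = np.max(np.abs(y_quad - x_grid ** 2))
```

If a straight line leaves a large, systematic error like linear_error here, a higher-degree option such as regression_type='poly2' is likely the more informative overlay.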

Returns:

Displays the scatter matrix plot directly

Return type:

None

Raises:

ValueError – If the DataFrame is empty or no numerical columns are found
ValueError – If the specified columns don’t exist or aren’t numerical
ValueError – If the color_by column doesn’t exist or isn’t categorical
ValueError – If an invalid diagonal, upper, or lower option is provided
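
The error conditions above can be checked before calling the function. The sketch below is a hypothetical pre-check written with plain pandas. The name check_scatter_matrix_inputs and its messages are assumptions, not part of edaflow, and it simplifies the string form of columns to a single-column list.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_scatter_matrix_inputs(df, columns=None, color_by=None):
    """Hypothetical pre-check mirroring the documented ValueError cases.

    Returns the list of numerical columns that would be plotted.
    """
    if df.empty:
        raise ValueError("DataFrame is empty")
    if columns is None:
        # Documented default: use all numerical columns.
        selected = df.select_dtypes(include="number").columns.tolist()
    else:
        # Simplification: a single string is treated as a one-item list.
        selected = [columns] if isinstance(columns, str) else list(columns)
        missing = [c for c in selected if c not in df.columns]
        if missing:
            raise ValueError(f"Columns not found: {missing}")
        non_numeric = [c for c in selected if not is_numeric_dtype(df[c])]
        if non_numeric:
            raise ValueError(f"Columns are not numerical: {non_numeric}")
    if not selected:
        raise ValueError("No numerical columns found")
    if color_by is not None and color_by not in df.columns:
        raise ValueError(f"color_by column not found: {color_by!r}")
    return selected
```

Running a check like this first gives a clear message about which column is the problem before any plotting starts.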

Examples

>>> import pandas as pd
>>> import numpy as np
>>> import edaflow
>>>
>>> # Create sample data
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'height': np.random.normal(170, 10, 100),
...     'weight': np.random.normal(70, 15, 100),
...     'age': np.random.uniform(20, 60, 100),
...     'income': np.random.lognormal(10, 0.5, 100),
...     'category': np.random.choice(['A', 'B', 'C'], 100)
... })
>>>
>>> # Basic scatter matrix (all numerical columns)
>>> edaflow.visualize_scatter_matrix(df)
>>>
>>> # Custom configuration with specific columns
>>> edaflow.visualize_scatter_matrix(
...     df,
...     columns=['height', 'weight', 'age'],
...     diagonal='kde',
...     upper='corr',
...     lower='scatter',
...     show_regression=True,
...     title="Body Measurements Relationships"
... )
>>>
>>> # Color-coded by categorical variable
>>> edaflow.visualize_scatter_matrix(
...     df,
...     columns=['height', 'weight', 'income'],
...     color_by='category',
...     regression_type='poly2',
...     alpha=0.7
... )
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_scatter_matrix
>>> visualize_scatter_matrix(df, diagonal='box', upper='blank')

Notes

Scatter matrices work best with 2-7 numerical variables (readability)
For large datasets (>1000 rows), consider sampling for performance
Regression lines help identify linear vs non-linear relationships
Color coding reveals group-specific patterns in relationships
Upper/lower triangle customization allows focus on specific aspects
Compatible with matplotlib.pyplot.savefig() for export
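
The sampling advice above can be sketched with pandas alone. The helper name sample_for_plotting and the 1000-row cap are illustrative assumptions; the only library call relied on is DataFrame.sample.

```python
import numpy as np
import pandas as pd


def sample_for_plotting(df, max_rows=1000, random_state=42):
    """Return df unchanged if it is small, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=random_state)


# Synthetic data large enough to trigger sampling.
rng = np.random.default_rng(42)
big = pd.DataFrame({
    "height": rng.normal(170, 10, 5000),
    "weight": rng.normal(70, 15, 5000),
})
plot_df = sample_for_plotting(big)
```

The reduced frame can then be passed on, for example edaflow.visualize_scatter_matrix(plot_df). Fixing random_state keeps the plot reproducible between runs.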

Statistical Insights:

Diagonal plots show univariate distributions and skewness
Scatter plots reveal bivariate relationship patterns
Regression lines indicate trend strength and direction
Color coding shows group differences in relationships
Correlation values validate visual relationship strength
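
As a rough illustration of why correlation values and scatter plots should be read together, the sketch below compares Pearson and Spearman coefficients on synthetic data using pandas' DataFrame.corr. The column names and data are made up for this example and the calls are independent of edaflow.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 200)
df = pd.DataFrame({
    "x": x,
    "monotonic": np.exp(x),       # strictly increasing but strongly curved
    "u_shaped": (x - 2.5) ** 2,   # non-monotonic relationship
})

# Pearson measures linear association; Spearman measures monotonic association.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# A Spearman value well above the Pearson value suggests a curved but
# monotonic relationship; both near zero can still hide a U-shaped pattern.
gap = spearman.loc["x", "monotonic"] - pearson.loc["x", "monotonic"]
```

A pair like x and u_shaped is the case where only the scatter plot reveals the structure, since both coefficients stay close to zero.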

Integration with other edaflow functions:

Use after visualize_heatmap() to validate correlation patterns
Combine with visualize_histograms() for detailed distribution analysis
Follow up with handle_outliers_median() based on scatter plot insights
Use before feature engineering to identify transformation needs