edaflow.visualize_scatter_matrix

edaflow.visualize_scatter_matrix(df: DataFrame, columns: str | List[str] | None = None, diagonal: str = 'hist', upper: str = 'scatter', lower: str = 'scatter', color_by: str | None = None, show_regression: bool = True, regression_type: str = 'linear', alpha: float = 0.6, figsize: tuple | None = None, title: str = 'Scatter Matrix Analysis', color_palette: str = 'Set2', verbose: bool = True) -> None

Create comprehensive scatter matrix visualization for pairwise relationship analysis.

This function draws a scatter matrix (also known as a pairs plot) that shows:
  • Diagonal: Distribution of individual variables (histograms, KDE, or box plots)

  • Off-diagonal: Scatter plots showing pairwise relationships between variables

  • Optional: Color coding by categorical variables

  • Optional: Regression lines to highlight trends

  • Statistical insights: Correlation coefficients and relationship patterns

Perfect for:
  • Exploring pairwise relationships between numerical variables

  • Validating correlation analysis with visual patterns

  • Identifying non-linear relationships missed by correlation coefficients

  • Feature engineering and transformation planning

  • Publication-ready relationship visualization

Parameters:
  • df (pd.DataFrame) – The input DataFrame

  • columns (Optional[Union[str, List[str]]], optional) – Columns to include in scatter matrix. If None, uses all numerical columns. If str, uses single column with others. If list, uses specified columns. Defaults to None.

  • diagonal (str, optional) – Type of plot for diagonal elements. Defaults to “hist”. Options:
      - “hist”: Histograms (default)
      - “kde”: Kernel Density Estimation curves
      - “box”: Box plots

  • upper (str, optional) – Type of plot for upper triangle. Defaults to “scatter”. Options:
      - “scatter”: Scatter plots (default)
      - “corr”: Correlation coefficients
      - “blank”: Empty (for cleaner look)

  • lower (str, optional) – Type of plot for lower triangle. Defaults to “scatter”. Options:
      - “scatter”: Scatter plots (default)
      - “corr”: Correlation coefficients
      - “blank”: Empty (for cleaner look)

  • color_by (Optional[str], optional) – Name of categorical column to use for color coding. If provided, scatter plots will be colored by this variable. Defaults to None.

  • show_regression (bool, optional) – Whether to add regression lines to scatter plots. Defaults to True.

  • regression_type (str, optional) – Type of regression line. Defaults to “linear”. Options:
      - “linear”: Linear regression (default)
      - “poly2”: 2nd degree polynomial
      - “poly3”: 3rd degree polynomial
      - “lowess”: LOWESS smoothing

  • alpha (float, optional) – Transparency level for scatter plot points (0.0 to 1.0). Defaults to 0.6.

  • figsize (Optional[tuple], optional) – Figure size as (width, height). If None, automatically calculated based on number of variables. Defaults to None.

  • title (str, optional) – Main title for the scatter matrix. Defaults to “Scatter Matrix Analysis”.

  • color_palette (str, optional) – Color palette for categorical coloring. Defaults to “Set2”.

  • verbose (bool, optional) – If True, displays detailed information about the analysis. Defaults to True.
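
To make the polynomial regression_type options more concrete, the sketch below fits degree 1, 2, and 3 polynomials to synthetic data with numpy.polyfit and compares their residual sums of squares. It is an illustration only: the mapping from option names to degrees and the helper residual_sum_of_squares are assumptions made here, not edaflow's actual implementation, and “lowess” is left out because it would need an additional library.

```python
import numpy as np

# Synthetic data with a curved trend (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 0.5 * x**2 + x + rng.normal(0, 0.3, x.size)

# Assumed mapping from regression_type names to polynomial degree.
DEGREES = {"linear": 1, "poly2": 2, "poly3": 3}


def residual_sum_of_squares(x, y, degree):
    """Least-squares polynomial fit of the given degree; returns the RSS."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    return float(np.sum((y - fitted) ** 2))


rss = {name: residual_sum_of_squares(x, y, deg) for name, deg in DEGREES.items()}

for name, value in rss.items():
    print(f"{name}: RSS = {value:.2f}")
```

On data like this, the straight line leaves a much larger residual than the quadratic, which is the kind of pattern that would suggest switching regression_type from “linear” to “poly2”.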

Returns:

Displays the scatter matrix plot directly

Return type:

None

Raises:
  • ValueError – If DataFrame is empty or no numerical columns found

  • ValueError – If specified columns don’t exist or aren’t numerical

  • ValueError – If color_by column doesn’t exist or isn’t categorical

  • ValueError – If invalid diagonal, upper, or lower options provided
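
The conditions above can be checked before plotting. The following is a minimal sketch using only pandas; check_scatter_matrix_inputs is a hypothetical helper written for this example, not part of edaflow, and it covers only the existence and numeric-type checks listed above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_scatter_matrix_inputs(df, columns=None, color_by=None):
    """Hypothetical pre-flight check; returns a list of problem descriptions."""
    problems = []
    if df.empty:
        problems.append("DataFrame is empty")
    if not any(is_numeric_dtype(df[c]) for c in df.columns):
        problems.append("no numerical columns found")

    # Accept a single column name as well as a list of names.
    if isinstance(columns, str):
        columns = [columns]
    for col in columns or []:
        if col not in df.columns:
            problems.append(f"column {col!r} does not exist")
        elif not is_numeric_dtype(df[col]):
            problems.append(f"column {col!r} is not numerical")

    if color_by is not None and color_by not in df.columns:
        problems.append(f"color_by column {color_by!r} does not exist")
    return problems


df = pd.DataFrame({"height": [170.0, 182.5], "category": ["A", "B"]})
print(check_scatter_matrix_inputs(df, columns=["height"], color_by="category"))
print(check_scatter_matrix_inputs(df, columns=["category", "weight"]))
```

An empty list means none of the checked conditions were hit; a non-empty list names each column that would likely trigger a ValueError.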

Examples

>>> import pandas as pd
>>> import numpy as np
>>> import edaflow
>>>
>>> # Create sample data
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'height': np.random.normal(170, 10, 100),
...     'weight': np.random.normal(70, 15, 100),
...     'age': np.random.uniform(20, 60, 100),
...     'income': np.random.lognormal(10, 0.5, 100),
...     'category': np.random.choice(['A', 'B', 'C'], 100)
... })
>>>
>>> # Basic scatter matrix (all numerical columns)
>>> edaflow.visualize_scatter_matrix(df)
>>>
>>> # Custom configuration with specific columns
>>> edaflow.visualize_scatter_matrix(
...     df,
...     columns=['height', 'weight', 'age'],
...     diagonal='kde',
...     upper='corr',
...     lower='scatter',
...     show_regression=True,
...     title="Body Measurements Relationships"
... )
>>>
>>> # Color-coded by categorical variable
>>> edaflow.visualize_scatter_matrix(
...     df,
...     columns=['height', 'weight', 'income'],
...     color_by='category',
...     regression_type='poly2',
...     alpha=0.7
... )
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import visualize_scatter_matrix
>>> visualize_scatter_matrix(df, diagonal='box', upper='blank')

Notes

  • Scatter matrices work best with 2-7 numerical variables (readability)

  • For large datasets (>1000 rows), consider sampling for performance

  • Regression lines help identify linear vs non-linear relationships

  • Color coding reveals group-specific patterns in relationships

  • Upper/lower triangle customization allows focus on specific aspects

  • Compatible with matplotlib.pyplot.savefig() for export
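
The sampling suggestion above can be sketched with pandas alone. The 1000-row threshold comes from the note; the names big, MAX_ROWS, and sample are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

# A synthetic "large" dataset (illustrative only).
rng = np.random.default_rng(42)
big = pd.DataFrame({
    "height": rng.normal(170, 10, 50_000),
    "weight": rng.normal(70, 15, 50_000),
})

MAX_ROWS = 1000  # threshold taken from the note above

# Draw a reproducible random sample only when the data exceeds the threshold.
if len(big) > MAX_ROWS:
    sample = big.sample(n=MAX_ROWS, random_state=42)
else:
    sample = big

print(len(big), "rows reduced to", len(sample))
# `sample` can then be passed in place of the full DataFrame,
# e.g. edaflow.visualize_scatter_matrix(sample)
```

Fixing random_state keeps the plot reproducible between runs; a uniform random sample preserves the overall shape of the relationships but may drop rare outliers.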

Statistical Insights:
  • Diagonal plots show univariate distributions and skewness

  • Scatter plots reveal bivariate relationship patterns

  • Regression lines indicate trend strength and direction

  • Color coding shows group differences in relationships

  • Correlation values validate visual relationship strength
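
A small numeric example shows why the scatter plots are worth reading alongside the correlation values. In this sketch, a perfectly quadratic relationship on a symmetric range has a Pearson correlation close to zero, even though y is fully determined by x. The data is synthetic and the variable names are chosen for this example.

```python
import numpy as np

# x is symmetric around zero, so the quadratic has no linear trend.
x = np.linspace(-1, 1, 201)
y_linear = 2 * x + 1   # exact linear relationship
y_curved = x ** 2      # exact but non-linear relationship

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"linear:    r = {r_linear:.3f}")   # close to 1
print(f"quadratic: r = {r_curved:.3f}")   # close to 0
```

A “corr” panel would report almost no relationship for the quadratic pair, while the corresponding scatter panel would show a clear parabola.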

Integration with other edaflow functions:
  • Use after visualize_heatmap() to validate correlation patterns

  • Combine with visualize_histograms() for detailed distribution analysis

  • Follow up with handle_outliers_median() based on scatter plot insights

  • Use before feature engineering to identify transformation needs