Core Functions

These are the core analysis functions that form the foundation of edaflow’s EDA capabilities.

Data Quality Analysis

edaflow.check_null_columns(df: DataFrame, threshold: float | None = 10) → DataFrame

Check null values in DataFrame columns with rich styled output.

Calculates the percentage of null values per column and applies color styling based on the percentage of nulls relative to the threshold.

Parameters:

df (pd.DataFrame) – The input DataFrame to analyze
threshold (Optional[float], optional) – The threshold percentage for highlighting. Defaults to 10.

Returns:

A styled DataFrame showing column names and null percentages with color coding:

- Red: > 2*threshold (high null percentage)
- Yellow: > threshold but <= 2*threshold (medium null percentage)
- Light yellow: > 0 but <= threshold (low null percentage)
- Gray: 0 (no nulls)

Return type:

pd.DataFrame

Example

>>> import pandas as pd
>>> import edaflow
>>> df = pd.DataFrame({'A': [1, 2, None], 'B': [1, None, None]})
>>> styled_result = edaflow.check_null_columns(df, threshold=20)
>>> # Returns styled DataFrame with null percentages

>>> # Alternative import style:
>>> from edaflow.analysis import check_null_columns
>>> styled_result = check_null_columns(df, threshold=20)
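The color rules listed under Returns can be sketched in plain pandas. This is an illustration of the documented thresholds only, not edaflow's implementation; the helper name null_bucket and the extra column C are hypothetical.

```python
import pandas as pd


def null_bucket(pct: float, threshold: float = 10) -> str:
    """Map a null percentage to the color bucket described in the docs."""
    if pct > 2 * threshold:
        return "red"
    if pct > threshold:
        return "yellow"
    if pct > 0:
        return "light yellow"
    return "gray"


df = pd.DataFrame({"A": [1, 2, None], "B": [1, None, None], "C": [1, 2, 3]})

# Percentage of null values per column.
null_pct = df.isnull().mean() * 100
buckets = {col: null_bucket(pct, threshold=20) for col, pct in null_pct.items()}
print(buckets)
```

With threshold=20, column A (about 33% null) lands in the yellow bucket, B (about 67% null) in red, and C (no nulls) in gray.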

edaflow.analyze_categorical_columns(df: DataFrame, threshold: float | None = 35) → None

Analyze categorical columns of object type to identify potential data issues.

This function examines object-type columns to detect:

1. Columns that might be numeric but stored as strings
2. Categorical columns with their unique values
3. Data type consistency issues

Parameters:

df (pd.DataFrame) – The input DataFrame to analyze
threshold (Optional[float], optional) – The threshold percentage for non-numeric values. If a column has less than this percentage of non-numeric values, it’s flagged as potentially numeric. Defaults to 35.

Returns:

Prints analysis results directly to console with rich color coding

Return type:

None

Example

>>> import pandas as pd
>>> import edaflow
>>> df = pd.DataFrame({
...     'name': ['Alice', 'Bob', 'Charlie'],
...     'age_str': ['25', '30', '35'],
...     'mixed': ['1', '2', 'three'],
...     'numbers': [1, 2, 3]
... })
>>> edaflow.analyze_categorical_columns(df, threshold=35)
# Output with rich color coding and tables

>>> # Alternative import style:
>>> from edaflow.analysis import analyze_categorical_columns
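To see how the threshold comparison might play out on the example data, here is a minimal sketch using only pandas. It is not edaflow's code: the helper non_numeric_pct is hypothetical, and it assumes the percentage is taken over non-null values.

```python
import pandas as pd


def non_numeric_pct(series: pd.Series) -> float:
    """Percentage of non-null values that fail numeric parsing."""
    values = series.dropna()
    if values.empty:
        return 0.0
    # errors="coerce" turns unparseable entries into NaN instead of raising.
    parsed = pd.to_numeric(values, errors="coerce")
    return float(parsed.isna().mean() * 100)


df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age_str": ["25", "30", "35"],
    "mixed": ["1", "2", "three"],
})

threshold = 35
for col in df.columns:
    pct = non_numeric_pct(df[col])
    label = "potentially numeric" if pct < threshold else "categorical"
    print(f"{col}: {pct:.1f}% non-numeric -> {label}")
```

Under these assumptions, age_str (0% non-numeric) and mixed (about 33%) fall below the threshold of 35, while name (100%) does not.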

edaflow.convert_to_numeric(df: DataFrame, threshold: float | None = 35, inplace: bool = False) → DataFrame

Convert object columns to numeric when appropriate based on data analysis with rich formatting.

This function examines object-type columns and converts them to numeric if the percentage of non-numeric values is below the specified threshold. This helps clean datasets where numeric data is stored as strings.

Parameters:

df (pd.DataFrame) – The input DataFrame to process
threshold (Optional[float], optional) – The threshold percentage for non-numeric values. Columns with fewer non-numeric values than this threshold will be converted to numeric. Defaults to 35.
inplace (bool, optional) – If True, modify the DataFrame in place and return None. If False, return a new DataFrame with conversions applied. Defaults to False.

Returns:

If inplace=False, returns a new DataFrame with numeric conversions applied. If inplace=True, modifies the original DataFrame and returns None.

Return type:

pd.DataFrame or None

Example

>>> import pandas as pd
>>> import edaflow
>>> df = pd.DataFrame({
...     'name': ['Alice', 'Bob', 'Charlie'],
...     'age_str': ['25', '30', '35'],
...     'mixed': ['1', '2', 'three'],
...     'numbers': [1, 2, 3]
... })
>>>
>>> # Create a copy with conversions
>>> df_cleaned = edaflow.convert_to_numeric(df, threshold=35)
>>>
>>> # Or modify the original DataFrame
>>> edaflow.convert_to_numeric(df, threshold=35, inplace=True)
>>>
>>> # Alternative import style:
>>> from edaflow.analysis import convert_to_numeric
>>> df_cleaned = convert_to_numeric(df, threshold=50)

Notes

Values that cannot be converted to numeric become NaN
The function provides colored output showing which columns were converted
Use a lower threshold to be more strict about conversions
Use a higher threshold to be more lenient about mixed data
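The first note (unconvertible values become NaN) and the effect of the threshold can be illustrated with pandas alone. This sketch assumes the conversion behaves like pd.to_numeric with errors="coerce"; it is not taken from edaflow's source.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "mixed": ["1", "2", "three"],
})

threshold = 35
converted = df.copy()  # leave the input untouched, similar to inplace=False
for col in ["name", "mixed"]:
    parsed = pd.to_numeric(df[col], errors="coerce")
    pct_bad = parsed.isna().mean() * 100
    if pct_bad < threshold:
        converted[col] = parsed  # unparseable entries are now NaN

print(converted.dtypes)
print(converted["mixed"].tolist())
```

Here mixed has one non-numeric value out of three (about 33%), so it is converted and 'three' becomes NaN. With a stricter threshold such as 30, the same column would be left as text.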

edaflow.display_column_types(df)

Display categorical and numerical columns in a DataFrame with rich formatting.

This function separates DataFrame columns into categorical (object dtype) and numerical (non-object dtypes) columns and displays them in a clear format.

Parameters:

df (pandas.DataFrame) – The DataFrame to analyze

Returns:

dict: Dictionary containing ‘categorical’ and ‘numerical’ lists of column names

Example:

>>> import pandas as pd
>>> from edaflow import display_column_types
>>>
>>> # Create sample data
>>> data = {
...     'name': ['Alice', 'Bob', 'Charlie'],
...     'age': [25, 30, 35],
...     'city': ['NYC', 'LA', 'Chicago'],
...     'salary': [50000, 60000, 70000],
...     'is_active': [True, False, True]
... }
>>> df = pd.DataFrame(data)
>>>
>>> # Display column types
>>> result = display_column_types(df)
>>> print("Categorical columns:", result['categorical'])
>>> print("Numerical columns:", result['numerical'])
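The categorical/numerical split described above can be approximated without edaflow. The sketch below is an assumption about the rule, not the library's code; the helper is_text_like is hypothetical and also accepts pandas' dedicated string dtype, since text columns are not always stored as object depending on the pandas version.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "is_active": [True, False, True],
})


def is_text_like(series: pd.Series) -> bool:
    # Text columns are object dtype in older pandas and may use a dedicated
    # string dtype in newer versions, so both are checked here.
    return is_object_dtype(series) or is_string_dtype(series)


categorical = [col for col in df.columns if is_text_like(df[col])]
numerical = [col for col in df.columns if col not in categorical]
print({"categorical": categorical, "numerical": numerical})
```

Note that under a "non-object means numerical" rule, the boolean column is_active ends up in the numerical list.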

Data Imputation

edaflow.impute_numerical_median(df, columns=None, inplace=False)

Impute missing values in numerical columns using median values with rich formatting.

This function identifies numerical columns and fills missing values (NaN) with the median value of each column. It provides detailed reporting of the imputation process and handles edge cases safely.

Parameters:

df (pandas.DataFrame) – The DataFrame containing data to impute
columns (list, optional) – Specific columns to impute. If None, all numerical columns will be processed
inplace (bool, default False) – If True, modify the original DataFrame. If False, return a new DataFrame

Returns:

If inplace=False, returns the DataFrame with imputed values. If inplace=True, returns None and modifies the original DataFrame.

Return type:

pandas.DataFrame or None

Examples

>>> import pandas as pd
>>> import edaflow
>>>
>>> # Create sample data with missing values
>>> df = pd.DataFrame({
...     'age': [25, None, 35, None, 45],
...     'salary': [50000, 60000, None, 70000, None],
...     'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
... })
>>>
>>> # Impute all numerical columns
>>> df_imputed = edaflow.impute_numerical_median(df)
>>>
>>> # Impute specific columns only
>>> df_imputed = edaflow.impute_numerical_median(df, columns=['age'])
>>>
>>> # Impute in place
>>> edaflow.impute_numerical_median(df, inplace=True)
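For reference, median imputation itself can be written in a few lines of pandas. This is a sketch of the technique on the same sample data, not edaflow's implementation, and it omits the reporting the function provides.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 35, None, 45],
    "salary": [50000, 60000, None, 70000, None],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
})

imputed = df.copy()  # work on a copy so the original keeps its NaNs
for col in imputed.select_dtypes(include="number").columns:
    median = imputed[col].median()  # NaN values are skipped by default
    if pd.notna(median):  # an all-NaN column has no median to fill with
        imputed[col] = imputed[col].fillna(median)

print(imputed)
```

The missing ages are filled with 35 and the missing salaries with 60000; the name column is not numeric and is left alone.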

edaflow.impute_categorical_mode(df, columns=None, inplace=False)

Impute missing values in categorical columns using mode (most frequent value).

This function identifies categorical columns and fills missing values (NaN) with the mode (most frequent value) of each column. It provides detailed reporting of the imputation process and handles edge cases safely.

Parameters:

df (pandas.DataFrame) – The DataFrame containing data to impute
columns (list, optional) – Specific columns to impute. If None, all categorical columns will be processed
inplace (bool, default False) – If True, modify the original DataFrame. If False, return a new DataFrame

Returns:

If inplace=False, returns the DataFrame with imputed values. If inplace=True, returns None and modifies the original DataFrame.

Return type:

pandas.DataFrame or None

Examples

>>> import pandas as pd
>>> import edaflow
>>>
>>> # Create sample data with missing values
>>> df = pd.DataFrame({
...     'category': ['A', 'B', 'A', None, 'A'],
...     'status': ['Active', None, 'Active', 'Inactive', None],
...     'age': [25, 30, 35, 40, 45]
... })
>>>
>>> # Impute all categorical columns
>>> df_imputed = edaflow.impute_categorical_mode(df)
>>>
>>> # Impute specific columns only
>>> df_imputed = edaflow.impute_categorical_mode(df, columns=['category'])
>>>
>>> # Impute in place
>>> edaflow.impute_categorical_mode(df, inplace=True)
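Mode imputation follows the same pattern. The sketch below uses plain pandas and names the columns explicitly; how edaflow detects categorical columns or breaks ties between equally frequent values is not shown here and is not assumed.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", None, "A"],
    "status": ["Active", None, "Active", "Inactive", None],
})

imputed = df.copy()
for col in ["category", "status"]:
    modes = imputed[col].mode()  # drops NaN by default; sorted when tied
    if not modes.empty:  # empty when the column is entirely NaN
        imputed[col] = imputed[col].fillna(modes.iloc[0])

print(imputed)
```

The guard on an empty mode is one example of the edge cases such a function has to handle: a column with no observed values has no mode to fill with.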

Outlier Handling

edaflow.handle_outliers_median(df: DataFrame, columns: str | List[str] | None = None, method: str = 'iqr', iqr_multiplier: float = 1.5, inplace: bool = False, verbose: bool = True) → DataFrame

Replace outliers in numerical columns with the median value.

This function identifies outliers using statistical methods and replaces them with the median value of the respective column. It’s designed to work seamlessly with the visualize_numerical_boxplots function for a complete outlier workflow.

Parameters:

df (pd.DataFrame) – The input DataFrame
columns (Optional[Union[str, List[str]]], optional) – Column name(s) to process. If None, processes all numerical columns. Defaults to None.
method (str, optional) – Method to identify outliers. Defaults to ‘iqr’. Options:
- ‘iqr’: Interquartile Range method (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
- ‘zscore’: Z-score method (values with |z-score| > 3)
- ‘modified_zscore’: Modified Z-score using median absolute deviation
iqr_multiplier (float, optional) – Multiplier for IQR method. Defaults to 1.5.
inplace (bool, optional) – If True, modifies the original DataFrame. If False, returns a new DataFrame. Defaults to False.
verbose (bool, optional) – If True, displays detailed information about the outlier handling process. Defaults to True.

Returns:

DataFrame with outliers replaced by median values. If inplace=True, returns the modified original DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If no valid numerical columns are found or if an invalid method is specified.
KeyError – If specified column(s) don’t exist in the DataFrame.

Example

>>> import pandas as pd
>>> import edaflow
>>>
>>> # Create sample data with outliers
>>> df = pd.DataFrame({
...     'A': [1, 2, 3, 4, 5, 100],  # 100 is an outlier
...     'B': [10, 20, 30, 40, 50, 60],
...     'C': ['x', 'y', 'z', 'x', 'y', 'z']
... })
>>>
>>> # First visualize outliers
>>> edaflow.visualize_numerical_boxplots(df)
>>>
>>> # Then handle outliers
>>> df_clean = edaflow.handle_outliers_median(df)
>>>
>>> # Or handle specific columns
>>> df_clean = edaflow.handle_outliers_median(df, columns=['A'])
>>>
>>> # Or modify inplace
>>> edaflow.handle_outliers_median(df, inplace=True)

>>> # Alternative import style:
>>> from edaflow.analysis import handle_outliers_median
>>> df_clean = handle_outliers_median(df, method='zscore')
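The ‘iqr’ method can be worked through by hand on column A from the example. The sketch below is a plain-pandas illustration, not edaflow's code: the function name replace_outliers_iqr is hypothetical, quartiles use pandas' default linear interpolation, and the replacement median is assumed to be computed over all values, outliers included.

```python
import pandas as pd


def replace_outliers_iqr(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Replace values outside the IQR fences with the series median."""
    values = series.astype(float)
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    is_outlier = (values < lower) | (values > upper)
    # Assumption: the median is computed over all values, outliers included.
    return values.mask(is_outlier, values.median())


a = pd.Series([1, 2, 3, 4, 5, 100])
b = pd.Series([10, 20, 30, 40, 50, 60])
print(replace_outliers_iqr(a).tolist())
print(replace_outliers_iqr(b).tolist())
```

For A, Q1 = 2.25 and Q3 = 4.75, so the fences are -1.5 and 8.5; 100 lies outside and is replaced by the median 3.5. Column B has no values outside its fences and comes back unchanged.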

EDA Insights & Summary

edaflow.summarize_eda_insights(df: DataFrame, target_column: str | None = None, eda_functions_used: List[str] | None = None, class_threshold: float = 0.1) → dict

Generate comprehensive EDA insights and recommendations after completing analysis workflow.

This function analyzes the DataFrame and provides intelligent insights about:

- Dataset characteristics and shape
- Data quality assessment
- Class distribution and imbalance detection
- Missing data patterns
- Feature type analysis
- Actionable recommendations for modeling

Parameters:

df (pandas.DataFrame) – The DataFrame that has been analyzed
target_column (str, optional) – The name of the target column for classification/regression analysis
eda_functions_used (list of str, optional) – List of edaflow functions that have been executed
class_threshold (float, default 0.1) – Proportion of rows below which a class is considered underrepresented (0.1 means 10%)

Returns:

Comprehensive insights dictionary with analysis results and recommendations

Return type:

dict

Examples

>>> import pandas as pd
>>> import edaflow
>>>
>>> # After completing EDA workflow
>>> df = pd.read_csv('healthcare_data.csv')
>>> # ... run various edaflow functions ...
>>>
>>> # Generate comprehensive insights
>>> insights = edaflow.summarize_eda_insights(df, target_column='ckd_status')
>>>
>>> # Insights with specific functions tracked
>>> functions_used = ['check_null_columns', 'analyze_categorical_columns',
...                   'visualize_histograms', 'handle_outliers_median']
>>> insights = edaflow.summarize_eda_insights(df, 'ckd_status', functions_used)
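The class_threshold check can be pictured as a comparison of class shares against a fraction. The following is a minimal sketch with made-up target values, assuming the threshold is a proportion of rows; it does not reproduce the structure of the dictionary that summarize_eda_insights returns.

```python
import pandas as pd

# Hypothetical target column: 19 rows of one class, 1 row of the other.
target = pd.Series(["healthy"] * 19 + ["ckd"] * 1, name="ckd_status")

class_threshold = 0.1
shares = target.value_counts(normalize=True)  # fraction of rows per class
underrepresented = shares[shares < class_threshold]
print(underrepresented.to_dict())
```

Here the minority class accounts for 5% of rows, which is below the 10% threshold, so it would be reported as underrepresented under this reading.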

Automated Profiling

edaflow.profile_report(df: DataFrame, top_n_categorical: int = 5, output_format: str = 'html') → Any

Generate a comprehensive profiling report for a DataFrame.

This function creates an automated EDA report similar to ydata-profiling’s ProfileReport, including dataset overview, missing value analysis, categorical insights, and visualizations.

Parameters:

df (pd.DataFrame) – The input DataFrame to profile
top_n_categorical (int, optional) – Number of top categorical columns to analyze. Defaults to 5.
output_format (str, optional) – Output format for the report. Options: “html” (saves to temp file), “dict” (returns dict). Defaults to “html”.

Returns:

If output_format="html", returns path to HTML file. If output_format="dict", returns dict with:

- ‘overview’: DataFrame with dataset info
- ‘summary_stats’: DataFrame with summary statistics
- ‘missing_values’: DataFrame with null analysis
- ‘categorical_insights’: Dict with category distributions
- ‘numeric_insights’: Dict with numeric column info
- ‘visualizations’: Dict with matplotlib figures

Return type:

Any

Raises:

ValueError – If df is empty or output_format is invalid
TypeError – If df is not a pandas DataFrame

Examples

>>> import pandas as pd
>>> import edaflow
>>>
>>> # Create sample data
>>> df = pd.DataFrame({
...     'age': [25, 30, 35, 28, None, 45],
...     'salary': [50000, 60000, 70000, 55000, 65000, 80000],
...     'department': ['HR', 'IT', 'IT', 'HR', 'Finance', 'IT'],
...     'city': ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA']
... })
>>>
>>> # Generate HTML report
>>> report_path = edaflow.profile_report(df)
>>> print(f"Report saved to: {report_path}")
>>>
>>> # Generate dict report
>>> report_dict = edaflow.profile_report(df, output_format="dict")
>>> print(report_dict['overview'])
>>>
>>> # Analyze top 3 categorical columns
>>> report = edaflow.profile_report(df, top_n_categorical=3, output_format="dict")
>>> print(report['categorical_insights'])

>>> # Alternative import:
>>> from edaflow.analysis import profile_report
>>> report = profile_report(df)
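To give a feel for the kind of information a dict-style report carries, here is a hand-rolled sketch in plain pandas. It borrows three of the documented key names but is otherwise an assumption: the values are plain dicts for brevity rather than the DataFrames the function is documented to return, and nothing here comes from edaflow's source.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 28, None, 45],
    "department": ["HR", "IT", "IT", "HR", "Finance", "IT"],
})

report = {
    # Basic shape information.
    "overview": {"rows": len(df), "columns": df.shape[1]},
    # Percentage of nulls per column.
    "missing_values": (df.isnull().mean() * 100).round(2).to_dict(),
    # Value counts for a text column chosen by hand.
    "categorical_insights": {
        "department": df["department"].value_counts().to_dict(),
    },
}
print(report["missing_values"])
print(report["categorical_insights"])
```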

Helper Functions

edaflow.hello()

A sample hello function to test the package installation.

Returns:

A greeting message

Return type:

str