Guide

Advanced Data Transformation And CleaningDeep dive

Cleaning Excel Data With Pandas: A Production Workflow

Q: Does read_excel() have a skip_blank_lines argument?

No — that parameter belongs to read_csv(). With read_excel(), remove fully blank rows after loading using df.dropna(how="all").

Q: Why apply string operations on the status column with .astype(str) first?

Once a column is converted to the category dtype, vectorized .str methods operate on the categories rather than giving the behaviour you expect. Cast back with .astype(str) before .str.upper() and .replace(...), as the deduplicate step does.

Q: Why strip $ and , before pd.to_numeric?

to_numeric with errors="coerce" turns any unparseable string into NaN, so a value like "$1,200.50" collapses to NaN and the whole column can end up empty. Remove currency symbols and thousands separators first, then cast.

Q: How do I exclude footer notes or merged cells that inflate the column count?

Restrict parsing to the real data region with usecols (for example usecols="A:D") and skipfooter so trailing notes don't become phantom rows or columns.

Turn messy Excel exports into reliable reporting inputs with pandas: load as text, normalize headers and types, drop noise, deduplicate, validate, and export.

Raw Excel exports rarely arrive analysis-ready: inconsistent headers, hidden whitespace, placeholder strings, duplicate rows, and numbers stored as text all break downstream reports. Cleaning that data in pandas gives you a scriptable, version-controlled, testable alternative to manual spreadsheet edits. This guide builds a linear cleaning pipeline — ingest, normalize, denoise, deduplicate, validate, export — one stage at a time, extending the Advanced Data Transformation and Cleaning overview.

Every block runs in order against a sample workbook created in the first step.

Install dependencies

Bash

pip install pandas openpyxl numpy

Create a sample workbook

This workbook deliberately contains the problems you meet in the wild: messy header casing, whitespace, a placeholder "N/A", a fully blank row, a duplicate order, a negative amount, and amounts stored as text:

Python

import pandas as pd

raw = pd.DataFrame({
    " Order ID ": ["1001", "1002", "1003", "1003", "1004", None],
    "Transaction Date": ["2024-01-05", "2024-01-06", "2024-01-07",
                         "2024-01-07", "2019-12-31", None],
    "Status": ["pending", "Complete", "PENDING", "PENDING", "complete", None],
    "Amount": ["$1,200.50", "980.00", "450.00", "450.00", "-75.00", None],
})
raw.to_excel("report_input.xlsx", sheet_name="Data", index=False)
print(f"Wrote {len(raw)} rows")

Step 1: Load everything as text

Load with dtype=str so pandas does not guess a type per cell — guessing is what produces object columns mixing strings, numbers, and dates. We coerce deliberately in Step 2.

Python

import logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

def load_excel_data(file_path: str, sheet_name=0) -> pd.DataFrame:
    df = pd.read_excel(
        file_path,
        sheet_name=sheet_name,
        header=0,
        engine="openpyxl",
        dtype=str,        # prevent premature, per-cell type coercion
    )
    logging.info(f"Loaded {len(df)} rows from {file_path}")
    return df

df = load_excel_data("report_input.xlsx", sheet_name="Data")
print(df.columns.tolist())

read_excel() has no skip_blank_lines argument (that belongs to read_csv()); remove blank rows with dropna(how="all") in Step 3 instead.

Step 2: Standardize headers and types

Normalize column names to predictable snake_case, then coerce each column to its intended type. Stripping currency symbols and separators before pd.to_numeric keeps the cast from collapsing to all-NaN.

Python

def standardize_schema(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )

    # amount: strip "$" and "," before numeric cast
    if "amount" in df.columns:
        df["amount"] = (df["amount"].str.replace(r"[$,]", "", regex=True)
                        .pipe(pd.to_numeric, errors="coerce"))

    if "transaction_date" in df.columns:
        df["transaction_date"] = pd.to_datetime(df["transaction_date"], errors="coerce")

    if "status" in df.columns:
        df["status"] = df["status"].astype("category")

    return df

df = standardize_schema(df)
print(df.dtypes)

Step 3: Drop structural noise

Excel exports carry empty rows from copy-paste, template padding, and footer notes. Drop fully blank rows, then drop rows missing a critical identifier, and strip residual whitespace from text columns.

Python

def purge_noise(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    initial = len(df)

    df = df.dropna(how="all")                       # fully blank rows
    df = df.dropna(subset=["order_id", "transaction_date"])  # missing keys

    text_cols = df.select_dtypes(include=["object", "string"]).columns
    for col in text_cols:
        df[col] = df[col].str.strip()

    logging.info(f"Purged {initial - len(df)} noisy/empty rows")
    return df

df = purge_noise(df)
print(f"{len(df)} rows remain")

Step 4: Deduplicate and normalize values

Duplicates come from repeated exports and overlapping date ranges. Sort first so the survivor is deterministic, then drop on the business key. For a focused treatment of single-column deduplication, see Pandas Drop Duplicates from Excel Column.

Python

def deduplicate_records(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Newest first, so keep="first" retains the most recent record per key
    df = df.sort_values("transaction_date", ascending=False)
    df = df.drop_duplicates(subset=["order_id"], keep="first")

    # Normalize the status vocabulary (string ops on a category need .astype(str))
    df["status"] = (df["status"].astype(str).str.upper()
                    .replace({"PENDING": "OPEN", "COMPLETE": "CLOSED"}))
    return df

df = deduplicate_records(df)
print(df[["order_id", "status"]])

Step 5: Validate and derive

Run business-rule checks and compute the columns the report needs. Logging the row counts you drop gives you an audit trail when a scheduled run produces unexpected totals.

Python

def validate_and_prepare(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    neg = df["amount"] < 0
    if neg.any():
        logging.warning(f"Dropping {int(neg.sum())} rows with negative amounts")
        df = df[~neg]

    # Keep only in-scope dates
    df = df[df["transaction_date"] >= pd.Timestamp("2020-01-01")]

    df["fiscal_quarter"] = df["transaction_date"].dt.quarter
    df["fiscal_year"] = df["transaction_date"].dt.year
    return df

df = validate_and_prepare(df)
print(df)

Step 6: Export the cleaned data

Write the result for the next stage. CSV is universally interoperable; for a styled workbook, use to_excel with the openpyxl engine.

Python

def export_clean_data(df: pd.DataFrame, output_path: str):
    df.to_csv(output_path, index=False)
    logging.info(f"Exported {len(df)} rows to {output_path}")

export_clean_data(df, "report_clean.csv")
df.to_excel("report_clean.xlsx", index=False, engine="openpyxl")
print("Export complete")

For large static datasets you read repeatedly, Parquet (via pip install pyarrow) compresses well and preserves dtypes — swap to_csv for df.to_parquet(path, index=False).

Frequently asked questions

Does read_excel() have a skip_blank_lines argument? No — that parameter belongs to read_csv(). With read_excel(), remove fully blank rows after loading using df.dropna(how="all").

Why apply string operations on the status column with .astype(str) first? Once a column is converted to the category dtype, vectorized .str methods operate on the categories rather than giving the behaviour you expect. Cast back with .astype(str) before .str.upper() and .replace(...), as the deduplicate step does.

Why strip $ and , before pd.to_numeric?to_numeric with errors="coerce" turns any unparseable string into NaN, so a value like "$1,200.50" collapses to NaN and the whole column can end up empty. Remove currency symbols and thousands separators first, then cast.

How do I exclude footer notes or merged cells that inflate the column count? Restrict parsing to the real data region with usecols (for example usecols="A:D") and skipfooter so trailing notes don't become phantom rows or columns.

Should I export to CSV, Excel, or something else? CSV is the most interoperable for handoff to the next stage. Use to_excel with the openpyxl engine when you need a styled workbook, or to_parquet (via pyarrow) for large static datasets you re-read often, since it compresses well and preserves dtypes.

Conclusion

A reliable cleaning pipeline follows the same order every time: load as text, normalize headers and types, drop structural noise, deduplicate on the business key, validate business rules, and export. Keeping each stage in its own function means you can log row counts at every boundary, test each step in isolation, and add new rules without touching the rest of the pipeline. The result is a script you can re-run on the next month's export and trust to produce the same clean output.

Common errors and fixes

ValueError: could not convert string to float — currency symbols, thousands separators, or trailing spaces in a numeric column. Strip them first:

Python

s = pd.Series(["$1,200.50", "980.00"])
cleaned = pd.to_numeric(s.str.replace(r"[$,]", "", regex=True), errors="coerce")
print(cleaned.tolist())

Footer notes or merged cells inflate the column count — restrict parsing to the real data region with usecols and skipfooter:

Python

subset = pd.read_excel("report_input.xlsx", usecols="A:D", skipfooter=0,
                       engine="openpyxl")
print(subset.shape)

MemoryError on a large workbook — declare dtypes up front and drop columns you do not need; convert high-cardinality strings to category to shrink memory.

Ambiguous dates (MM/DD vs DD/MM) — coerce explicitly and set dayfirst to match the source:

Python

dates = pd.to_datetime(pd.Series(["05/01/2024", "31/12/2019"]),
                       dayfirst=True, errors="coerce")
print(dates.tolist())

Where to go next

Remove repeated entries precisely: Pandas Drop Duplicates from Excel Column.
Strip out empty rows: Remove Blank Rows from Excel with Pandas.
Combine cleaned exports: Merging and Joining Excel DataFrames.
Style the output: Applying Conditional Formatting with openpyxl.