Guide

Advanced Data Transformation And CleaningDeep dive

Pandas: Drop Duplicates From an Excel Column

Remove duplicate rows by a single Excel column with pandas drop_duplicates and the subset parameter — including keep options, NaN behavior, and pre-cleaning.

To drop duplicates by a single Excel column with pandas, load the workbook with pd.read_excel(), call df.drop_duplicates(subset=["column"]), and write the result back. This removes whole rows where the target column repeats, keeping the first occurrence by default — the other columns of the surviving row stay intact.

Every block below runs in order against a sample workbook built in the first step.

drop_duplicates keeps the first occurrence of each key A column with repeated keys A-100, A-100, B-200, C-300, B-200; drop_duplicates with keep first retains the first A-100, B-200, C-300 and greys out the two later repeats. TargetColumn A-100 A-100 repeat B-200 C-300 B-200 repeat drop_duplicates keep="first" Result A-100 B-200 C-300 first of each key kept

Create a sample workbook

Python
import pandas as pd

df = pd.DataFrame({
    "TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
    "MetricA": [10, 99, 20, 30, 21],
    "UpdatedAt": ["2024-01-01", "2024-03-01", "2024-01-02",
                  "2024-01-03", "2024-02-15"],
})
df.to_excel("report_input.xlsx", index=False, engine="openpyxl")
print(f"Wrote {len(df)} rows")

The core operation

Python
df = pd.read_excel("report_input.xlsx", engine="openpyxl")

df_clean = df.drop_duplicates(subset=["TargetColumn"], keep="first", ignore_index=True)

df_clean.to_excel("report_output.xlsx", index=False, engine="openpyxl")
print(df_clean)

Five rows collapse to three — one row per distinct TargetColumn.

The parameters that matter

  • subset — the column(s) checked for uniqueness. ["TargetColumn"] evaluates only that column while keeping every other column of the surviving rows.
  • keep — which duplicate survives: "first" (default), "last", or False (drop all rows that have any duplicate).
  • ignore_index — resets the index to 0, 1, 2, …. Set True for clean exports and predictable downstream joins.
Python
# keep="last" retains the final occurrence instead of the first
last = df.drop_duplicates(subset=["TargetColumn"], keep="last", ignore_index=True)
print(last)

# keep=False removes every row whose TargetColumn appears more than once
unique_only = df.drop_duplicates(subset=["TargetColumn"], keep=False)
print(unique_only)

Clean the column before deduplicating

Exact matching is literal: "A-100" and "a-100 " are different values. Manual Excel entry routinely introduces trailing whitespace and inconsistent casing, so normalize the key first or duplicates slip through:

Python
df["TargetColumn"] = df["TargetColumn"].astype(str).str.strip().str.lower()
deduped = df.drop_duplicates(subset=["TargetColumn"], ignore_index=True)
print(deduped["TargetColumn"].tolist())

Keep the highest-value record, not just the first

keep="first" is positional. To keep a meaningful survivor — say the row with the largest MetricA per key — sort first, then drop:

Python
best = (df.sort_values("MetricA", ascending=False)
        .drop_duplicates(subset=["TargetColumn"], keep="first")
        .sort_index())
print(best[["TargetColumn", "MetricA"]])

NaN behavior

pandas treats NaN values as equal to each other for deduplication, so multiple null keys collapse to one. To preserve them, fill with a sentinel before dropping and revert afterward:

Python
import numpy as np

s = pd.DataFrame({"TargetColumn": ["x", np.nan, np.nan, "x"]})
filled = s.assign(TargetColumn=s["TargetColumn"].fillna("__NULL__"))
kept = filled.drop_duplicates(subset=["TargetColumn"])
kept["TargetColumn"] = kept["TargetColumn"].replace("__NULL__", np.nan)
print(kept)

Inspect before you drop

To review what would be removed, build a mask with duplicated() rather than dropping blind:

Python
mask = df.duplicated(subset=["TargetColumn"], keep="first")
removed = df[mask]
print(f"{len(removed)} rows would be removed:")
print(removed)

Resolve conflicting columns with groupby

When duplicates carry different values in other columns and you want a deterministic single row per key, groupby is clearer than drop_duplicates:

Python
resolved = df.groupby("TargetColumn", as_index=False).first()
print(resolved)

Log removal counts in a pipeline

Tracking how many duplicates you remove surfaces upstream issues — repeated exports, sync errors, template drift. Wrap it so a missing column fails loudly:

Python
def dedupe_logged(df: pd.DataFrame, column: str) -> pd.DataFrame:
    if column not in df.columns:
        raise KeyError(f"Column not found: {column}")
    before = len(df)
    out = df.drop_duplicates(subset=[column], ignore_index=True)
    print(f"[INFO] Removed {before - len(out)} duplicate rows on '{column}'")
    return out

result = dedupe_logged(df, "TargetColumn")

This fits inside the broader Cleaning Excel Data with Pandas workflow. For multi-source joins and reshaping, see the Advanced Data Transformation and Cleaning overview.

Frequently asked questions

What does keep=False do compared to the default? The default keep="first" retains one row per key; keep="last" retains the final occurrence. keep=False is different — it drops every row whose key appears more than once, leaving only keys that were unique to begin with.

Why do "A-100" and "a-100 " count as separate values? Matching is exact and literal, so casing and trailing whitespace make otherwise-equal keys distinct. Normalize first with .astype(str).str.strip().str.lower() or duplicates slip through.

How are NaN keys treated? pandas considers NaN values equal to each other for deduplication, so multiple null keys collapse to one. To keep them, fill with a sentinel like "__NULL__" before dropping and revert afterward.

How do I keep the highest-value row per key instead of the first?keep="first" is purely positional. Sort by the column you care about first — df.sort_values("MetricA", ascending=False) — then drop_duplicates, optionally .sort_index() to restore the original order.

What's the point of ignore_index=True? It resets the index to 0, 1, 2, … after rows are removed, instead of leaving gaps. Set it for clean exports and predictable downstream joins.