Guide
Pandas: Drop Duplicates From an Excel Column
Remove duplicate rows by a single Excel column with pandas drop_duplicates and the subset parameter — including keep options, NaN behavior, and pre-cleaning.
To drop duplicates by a single Excel column with pandas, load the workbook with pd.read_excel(), call df.drop_duplicates(subset=["column"]), and write the result back. This removes whole rows where the target column repeats, keeping the first occurrence by default — the other columns of the surviving row stay intact.
Every block below runs in order against a sample workbook built in the first step.
Create a sample workbook
import pandas as pd
df = pd.DataFrame({
"TargetColumn": ["A-100", "A-100", "B-200", "C-300", "B-200"],
"MetricA": [10, 99, 20, 30, 21],
"UpdatedAt": ["2024-01-01", "2024-03-01", "2024-01-02",
"2024-01-03", "2024-02-15"],
})
df.to_excel("report_input.xlsx", index=False, engine="openpyxl")
print(f"Wrote {len(df)} rows")
The core operation
df = pd.read_excel("report_input.xlsx", engine="openpyxl")
df_clean = df.drop_duplicates(subset=["TargetColumn"], keep="first", ignore_index=True)
df_clean.to_excel("report_output.xlsx", index=False, engine="openpyxl")
print(df_clean)
Five rows collapse to three — one row per distinct TargetColumn.
The parameters that matter
subset— the column(s) checked for uniqueness.["TargetColumn"]evaluates only that column while keeping every other column of the surviving rows.keep— which duplicate survives:"first"(default),"last", orFalse(drop all rows that have any duplicate).ignore_index— resets the index to0, 1, 2, …. SetTruefor clean exports and predictable downstream joins.
# keep="last" retains the final occurrence instead of the first
last = df.drop_duplicates(subset=["TargetColumn"], keep="last", ignore_index=True)
print(last)
# keep=False removes every row whose TargetColumn appears more than once
unique_only = df.drop_duplicates(subset=["TargetColumn"], keep=False)
print(unique_only)
Clean the column before deduplicating
Exact matching is literal: "A-100" and "a-100 " are different values. Manual Excel entry routinely introduces trailing whitespace and inconsistent casing, so normalize the key first or duplicates slip through:
df["TargetColumn"] = df["TargetColumn"].astype(str).str.strip().str.lower()
deduped = df.drop_duplicates(subset=["TargetColumn"], ignore_index=True)
print(deduped["TargetColumn"].tolist())
Keep the highest-value record, not just the first
keep="first" is positional. To keep a meaningful survivor — say the row with the largest MetricA per key — sort first, then drop:
best = (df.sort_values("MetricA", ascending=False)
.drop_duplicates(subset=["TargetColumn"], keep="first")
.sort_index())
print(best[["TargetColumn", "MetricA"]])
NaN behavior
pandas treats NaN values as equal to each other for deduplication, so multiple null keys collapse to one. To preserve them, fill with a sentinel before dropping and revert afterward:
import numpy as np
s = pd.DataFrame({"TargetColumn": ["x", np.nan, np.nan, "x"]})
filled = s.assign(TargetColumn=s["TargetColumn"].fillna("__NULL__"))
kept = filled.drop_duplicates(subset=["TargetColumn"])
kept["TargetColumn"] = kept["TargetColumn"].replace("__NULL__", np.nan)
print(kept)
Inspect before you drop
To review what would be removed, build a mask with duplicated() rather than dropping blind:
mask = df.duplicated(subset=["TargetColumn"], keep="first")
removed = df[mask]
print(f"{len(removed)} rows would be removed:")
print(removed)
Resolve conflicting columns with groupby
When duplicates carry different values in other columns and you want a deterministic single row per key, groupby is clearer than drop_duplicates:
resolved = df.groupby("TargetColumn", as_index=False).first()
print(resolved)
Log removal counts in a pipeline
Tracking how many duplicates you remove surfaces upstream issues — repeated exports, sync errors, template drift. Wrap it so a missing column fails loudly:
def dedupe_logged(df: pd.DataFrame, column: str) -> pd.DataFrame:
if column not in df.columns:
raise KeyError(f"Column not found: {column}")
before = len(df)
out = df.drop_duplicates(subset=[column], ignore_index=True)
print(f"[INFO] Removed {before - len(out)} duplicate rows on '{column}'")
return out
result = dedupe_logged(df, "TargetColumn")
This fits inside the broader Cleaning Excel Data with Pandas workflow. For multi-source joins and reshaping, see the Advanced Data Transformation and Cleaning overview.
Frequently asked questions
What does keep=False do compared to the default?
The default keep="first" retains one row per key; keep="last" retains the final occurrence. keep=False is different — it drops every row whose key appears more than once, leaving only keys that were unique to begin with.
Why do "A-100" and "a-100 " count as separate values?
Matching is exact and literal, so casing and trailing whitespace make otherwise-equal keys distinct. Normalize first with .astype(str).str.strip().str.lower() or duplicates slip through.
How are NaN keys treated?
pandas considers NaN values equal to each other for deduplication, so multiple null keys collapse to one. To keep them, fill with a sentinel like "__NULL__" before dropping and revert afterward.
How do I keep the highest-value row per key instead of the first?keep="first" is purely positional. Sort by the column you care about first — df.sort_values("MetricA", ascending=False) — then drop_duplicates, optionally .sort_index() to restore the original order.
What's the point of ignore_index=True?
It resets the index to 0, 1, 2, … after rows are removed, instead of leaving gaps. Set it for clean exports and predictable downstream joins.