Guide

Advanced Data Transformation And CleaningDeep dive

Merge Two Excel Files on a Common Column in Python

Merge two Excel files on a shared column in Python with pandas: load both workbooks, normalize the key, join with pd.merge, handle different names and duplicate keys, export.

To merge two Excel files on a shared column in Python, read each workbook into a DataFrame with pd.read_excel(), then join them with pd.merge() on the common key. The recipe below is fully runnable: the first block writes both source files so the reads have something to open.

For the wider set of join patterns, see Merging and Joining Excel DataFrames.

Two Excel files merged on a common id column File A holds id and name, file B holds id and amount; pd.merge on id joins matching rows into one table with id, name, and amount side by side. a.xlsx id name 1 Acme 2 Beta b.xlsx id amount 1 500 pd.merge on="id" merged on id id name amount 1 Acme 500 columns combined, one row per id

Create the two sample workbooks

Python
import pandas as pd

sales = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "C-300", "A-100"],
    "region": ["North", "South", "West", "East"],
    "units": [10, 5, 8, 3],
})
sales.to_excel("sales_Q3.xlsx", index=False)

catalog = pd.DataFrame({
    "product_sku": ["A-100", "B-200", "C-300"],
    "product_name": ["Widget", "Gadget", "Gizmo"],
    "unit_price": [19.99, 49.50, 8.75],
})
catalog.to_excel("product_catalog.xlsx", index=False)

Merge on the common column

Load both files, then join. A left join keeps every row of the primary table and attaches matching catalog columns; suffixes disambiguates any non-key columns that share a name:

Python
df_primary = pd.read_excel("sales_Q3.xlsx", engine="openpyxl")
df_lookup = pd.read_excel("product_catalog.xlsx", engine="openpyxl")

merged = pd.merge(
    df_primary,
    df_lookup,
    on="product_sku",
    how="left",
    suffixes=("_sales", "_catalog"),
)
merged.to_excel("merged_sales_report.xlsx", index=False, engine="openpyxl")
print(merged)

Normalize the key before joining

merge matches keys exactly, so casing, trailing whitespace, or a dtype difference (object vs int64) silently produce empty matches. Normalize both sides first:

Python
for d in (df_primary, df_lookup):
    d["product_sku"] = d["product_sku"].astype(str).str.strip().str.upper()
print(df_primary["product_sku"].tolist())

Choosing the join type

howKeeps
innerOnly keys present in both files
leftAll rows from the first file
rightAll rows from the second file
outerAll keys from either file

When the column names differ

If the key has a different name in each file, use left_on/right_on, then drop the redundant column:

Python
alt_lookup = df_lookup.rename(columns={"product_sku": "ProductCode"})
merged_alt = pd.merge(
    df_primary, alt_lookup,
    left_on="product_sku", right_on="ProductCode", how="inner",
).drop(columns=["ProductCode"])
print(merged_alt.columns.tolist())

When duplicate keys multiply rows

If the lookup has more than one row per key, a join repeats the primary rows. Deduplicate the lookup before merging, or pass validate="m:1" to fail loudly if it isn't unique:

Python
df_lookup_unique = df_lookup.drop_duplicates(subset=["product_sku"], keep="last")
safe = pd.merge(df_primary, df_lookup_unique, on="product_sku",
                how="left", validate="m:1")
print(f"{len(safe)} rows (primary has {len(df_primary)})")

Validate the merge

After joining, confirm you didn't lose or gain rows unexpectedly and report unmatched keys. With a left join the row count should not drop below the primary's:

Python
assert len(merged) >= len(df_primary), "Unexpected row loss during merge"
missing = merged["unit_price"].isna().sum()
print(f"{missing} unmatched keys" if missing else "All keys matched")

Frequently asked questions

What if the common column has a different name in each file? Use left_on and right_on instead of on, e.g. pd.merge(df_primary, alt_lookup, left_on="product_sku", right_on="ProductCode"), then drop the redundant duplicate column afterward.

Why do matching rows fail to join?merge matches keys exactly, so casing, trailing whitespace, or a dtype difference (object vs int64) silently produces empty matches. Normalize both sides with .astype(str).str.strip().str.upper() before joining.

Which how should I pick to keep all rows of my main file? Use how="left". It keeps every row of the first file and attaches matching columns from the lookup, leaving NaN where no match exists.

How do I stop a duplicate lookup from multiplying my rows? Deduplicate the lookup with drop_duplicates(subset=["product_sku"]) before merging, or pass validate="m:1" so the merge raises loudly if the lookup isn't unique on the key.

What does suffixes do? When both files have a non-key column with the same name, suffixes=("_sales", "_catalog") renames the clashing columns so neither is overwritten. It does not affect the join key itself.

Where to go next