Merge Two Excel Files on a Common Column in Python

To merge two Excel files on a shared column in Python, load both workbooks into pandas DataFrames with pd.read_excel(), then align them using pd.merge(). This pattern preserves the left table's row order for left joins and integrates cleanly into automated reporting pipelines; note that pandas does not coerce mismatched key dtypes for you, so normalize keys first (see Critical Configuration Notes below). For broader strategies on Merging and Joining Excel DataFrames, follow the production-ready implementation below.

Core Implementation

Python
import pandas as pd

# 1. Load workbooks (explicit engine prevents silent fallbacks)
df_primary = pd.read_excel("sales_Q3.xlsx", engine="openpyxl")
df_lookup = pd.read_excel("product_catalog.xlsx", engine="openpyxl")

# 2. Merge on shared key with collision-safe suffixes
merged = pd.merge(
    df_primary,
    df_lookup,
    on="product_sku",
    how="left",
    suffixes=("_sales", "_catalog")
)

# 3. Export result
merged.to_excel("merged_sales_report.xlsx", index=False, engine="openpyxl")

Critical Configuration Notes

  • Dependencies: Requires pandas>=1.5.0 and openpyxl>=3.1.0. Install via pip install pandas openpyxl.
  • Key Alignment: The on parameter is strictly case-sensitive. Mismatched dtypes (object vs int64) or trailing whitespace cause silent empty joins. Normalize keys first: df["product_sku"] = df["product_sku"].astype(str).str.strip() (applied to both frames in the sketch after this list).
  • Memory Limits: read_excel() loads entire sheets into RAM. For files >500MB or 1M+ rows, convert to Parquet/CSV first or switch to Polars.
  • Python Version: Use Python 3.8+; pandas 1.5 requires it, and EOL interpreters no longer receive fixes for the Excel I/O stack.
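
Normalization must hit both frames, or the keys still compare unequal. A minimal sketch, reusing df_primary and df_lookup from the core implementation:

Python
# Coerce both key columns to stripped strings so the join keys compare equal
for df in (df_primary, df_lookup):
    df["product_sku"] = df["product_sku"].astype(str).str.strip()

# Fail fast if the key dtypes still diverge
assert df_primary["product_sku"].dtype == df_lookup["product_sku"].dtype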

Targeted Fallbacks for Edge Cases

Different Column Names

Use left_on/right_on and drop the duplicate post-join:

Python
merged = pd.merge(
    df_primary,
    df_lookup,
    left_on="sku_id",
    right_on="ProductCode",
    how="inner"
).drop(columns=["ProductCode"])

Duplicate Keys Causing Row Multiplication

Deduplicate the lookup table before merging to prevent Cartesian expansion:

Python
df_lookup = df_lookup.drop_duplicates(subset=["product_sku"], keep="last")
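
Since keep="last" silently discards data, it is worth confirming that duplicates exist and inspecting them first. A quick check, reusing df_lookup from above:

Python
# Show every row whose SKU appears more than once in the lookup table
dupes = df_lookup[df_lookup.duplicated(subset=["product_sku"], keep=False)]
print(f"{dupes['product_sku'].nunique()} SKUs appear more than once")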

Large Dataset Memory Overflow

Switch to Polars for a markedly lower RAM footprint (its lazy API can reduce memory further, as sketched after the snippet):

Python
import polars as pl
# Requires: pip install polars xlsx2csv xlsxwriter (xlsxwriter backs write_excel)
df1 = pl.read_excel("sales_Q3.xlsx")
df2 = pl.read_excel("product_catalog.xlsx")
merged = df1.join(df2, on="product_sku", how="left")
merged.write_excel("merged_sales_report.xlsx")
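
The snippet above is still eager: pl.read_excel materializes the full sheet. If even that overflows, one pattern (a sketch, reusing the file names above) is to convert each workbook to Parquet once, then join through Polars' lazy API so nothing is materialized until collect():

Python
import polars as pl

# One-time conversion: Parquet is columnar and far cheaper to re-read
pl.read_excel("sales_Q3.xlsx").write_parquet("sales_Q3.parquet")
pl.read_excel("product_catalog.xlsx").write_parquet("product_catalog.parquet")

# Lazy scan + join; the plan executes only at collect()
merged = (
    pl.scan_parquet("sales_Q3.parquet")
    .join(pl.scan_parquet("product_catalog.parquet"), on="product_sku", how="left")
    .collect()
)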

Air-Gapped/Restricted Environments

Bypass third-party libraries using Python's built-in sqlite3 and csv modules. Excel itself cannot be parsed with the standard library alone, so export each workbook to CSV first:

Python
import csv, sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (sku TEXT, qty INT)")
conn.execute("CREATE TABLE t2 (sku TEXT, price REAL)")

with open("sales.csv", newline="") as f:
    rows = csv.reader(f)
    next(rows)  # skip the header row so it is not inserted as data
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", rows)
with open("catalog.csv", newline="") as f:
    rows = csv.reader(f)
    next(rows)  # skip the header row
    conn.executemany("INSERT INTO t2 VALUES (?, ?)", rows)

cursor = conn.execute("SELECT t1.sku, t1.qty, t2.price FROM t1 LEFT JOIN t2 ON t1.sku = t2.sku")
with open("merged_output.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
conn.close()

Validation & Pipeline Integration

Embed deterministic checks to catch upstream data drift before reports deploy, and wrap execution in error handling that logs row counts and missing keys. After a left join the key column itself is never NaN, so detect unmatched rows with indicator=True rather than isna() on the key:

Python
# Re-run the merge with indicator=True so unmatched rows are flagged
merged = pd.merge(df_primary, df_lookup, on="product_sku", how="left",
                  suffixes=("_sales", "_catalog"), indicator=True)

# A left join can gain rows (via duplicate keys) but never lose them
assert len(merged) >= len(df_primary), "Unexpected row loss during merge"

missing_keys = (merged["_merge"] == "left_only").sum()
if missing_keys > 0:
    print(f"Warning: {missing_keys} unmatched keys detected")
merged = merged.drop(columns=["_merge"])

Parameterize file paths, enforce strict schema contracts, and log merge statistics; a minimal parameterized sketch follows. This workflow slots directly into larger Advanced Data Transformation and Cleaning pipelines where audit trails and idempotent execution are mandatory.
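
A minimal sketch of that pattern (the function name and the REQUIRED contract are illustrative assumptions, not a fixed API):

Python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("merge_pipeline")

# Hypothetical schema contract: columns each input must provide
REQUIRED = {"sales": {"product_sku"}, "catalog": {"product_sku"}}

def merge_reports(sales_path, catalog_path, out_path):
    df_sales = pd.read_excel(sales_path, engine="openpyxl")
    df_catalog = pd.read_excel(catalog_path, engine="openpyxl")

    # Enforce the schema contract before merging
    for name, df in (("sales", df_sales), ("catalog", df_catalog)):
        missing = REQUIRED[name] - set(df.columns)
        if missing:
            raise ValueError(f"{name} file missing columns: {missing}")

    merged = pd.merge(df_sales, df_catalog, on="product_sku", how="left",
                      suffixes=("_sales", "_catalog"))
    log.info("merged %d input rows into %d output rows", len(df_sales), len(merged))
    merged.to_excel(out_path, index=False, engine="openpyxl")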