Guide

Getting Started With Python Excel AutomationDeep dive

Combine Multiple Excel Files into One in Python

Q: Does pd.concat need the files to have columns in the same order?

No. pd.concat aligns columns by name, so a file that lists its columns in a different order still slots into place. Use ignore_index=True to renumber the stacked rows.

Q: What happens when files have different columns?

A file missing a column gets NaN there, and any extra column appears as NaN for every other file. To pin the output to a fixed schema, reindex(columns=master_cols) after concatenating to drop unexpected columns and fix the order.

Q: Why does my run break when a workbook is open in Excel?

Excel writes a lock file whose name starts with ~$, and reading it raises PermissionError. Filter those out in your glob: f for f in src.glob("*.xlsx") if not f.name.startswith("~$").

Q: How do I track which file each row came from?

Add a column as you read each file — df["source_file"] = f.name — before appending the DataFrame to your list. Every row then carries its origin into the combined table.

Merge many .xlsx files into one workbook with pandas — glob the folder, concat with name-based column alignment, track the source file, and export the result.

When a folder fills up with monthly exports or per-region submissions, you usually want them stacked into one table. The standard approach reads each .xlsx into a DataFrame and stacks them with pd.concat(), which aligns columns by name so files in slightly different orders still line up. This guide builds the whole pipeline on sample files the first step generates, so every snippet runs in order.

Step 1: Install the toolchain

Bash

pip install pandas openpyxl

openpyxl is the engine pandas uses for .xlsx I/O. Convert any legacy .xls files to .xlsx first.

Step 2: Create a folder of sample files

To make the example self-contained, generate three monthly files in a monthly_reports/ directory. Note the third file lists its columns in a different order — pd.concat will still align them by name:

Python

import pandas as pd
from pathlib import Path

src = Path("monthly_reports")
src.mkdir(exist_ok=True)

pd.DataFrame({
    "date": ["2024-01-31", "2024-01-31"],
    "region": ["North", "South"],
    "revenue": [1200.0, 980.0],
}).to_excel(src / "jan.xlsx", index=False)

pd.DataFrame({
    "date": ["2024-02-29", "2024-02-29"],
    "region": ["North", "West"],
    "revenue": [1350.0, 720.0],
}).to_excel(src / "feb.xlsx", index=False)

# Same columns, different order
pd.DataFrame({
    "revenue": [1410.0, 1010.0],
    "region": ["East", "South"],
    "date": ["2024-03-31", "2024-03-31"],
}).to_excel(src / "mar.xlsx", index=False)

print("Created", len(list(src.glob('*.xlsx'))), "files")

Step 3: Find the files with glob

Path.glob("*.xlsx") lists matching files. Filter out Excel's lock files (their names start with ~$) so a workbook left open in Excel does not break the run:

Python

files = sorted(f for f in src.glob("*.xlsx") if not f.name.startswith("~$"))
print([f.name for f in files])

Step 4: Read and concatenate

Read each file into a DataFrame, collect them in a list, then pd.concat with ignore_index=True to renumber the rows. Column alignment is by name, so mar.xlsx's reordered columns slot into place:

Python

frames = []
for f in files:
    df = pd.read_excel(f, engine="openpyxl")
    if not df.empty:
        frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined)
print("Combined shape:", combined.shape)

Step 5: Track which file each row came from

A consolidated table is far more useful when each row remembers its origin. Add a source_file column as you read:

Python

frames = []
for f in files:
    df = pd.read_excel(f, engine="openpyxl")
    df["source_file"] = f.name          # tag every row with its origin
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined[["date", "region", "revenue", "source_file"]])

Step 6: Validate the row count

Before exporting, confirm the combined frame holds exactly as many rows as the source files did. Catching a mismatch here beats discovering missing data downstream:

Python

expected = sum(pd.read_excel(f, engine="openpyxl").shape[0] for f in files)
assert combined.shape[0] == expected, f"Expected {expected} rows, got {combined.shape[0]}"
print(f"Row count verified: {combined.shape[0]} rows")

Step 7: Export the consolidated workbook

Write the result to a single .xlsx. Use index=False so the DataFrame's row numbers do not become a stray column:

Python

combined.to_excel("consolidated_report.xlsx", index=False, engine="openpyxl")
print("Wrote consolidated_report.xlsx")

Handling schema drift

pd.concat aligns by column name, so a file missing a column gets NaN there, and an extra column appears for every other file as NaN. To pin the output to an exact schema, reindex after concatenating — this drops unexpected columns and guarantees column order:

Python

master_cols = ["date", "region", "revenue", "source_file"]
combined = combined.reindex(columns=master_cols)
print("Final columns:", list(combined.columns))

If different files use different names for the same field (rev vs revenue), rename each file's columns to a canonical set with a rename map before appending it to the list.

Fallback: append raw cells with openpyxl

When files carry merged cells or quirks that trip up pandas, you can copy raw values cell by cell. This keeps the first file's header and appends only data rows from the rest. It preserves raw values but skips pandas' type handling and name-based alignment, so it assumes a consistent column order:

Python

from openpyxl import load_workbook, Workbook

out = Workbook()
ws_out = out.active
first = True

for f in files:
    wb_in = load_workbook(f, read_only=True, data_only=True)
    ws_in = wb_in.active
    start = 1 if first else 2          # keep the header only from the first file
    for i, row in enumerate(ws_in.iter_rows(values_only=True), start=1):
        if i < start:
            continue
        if any(cell is not None for cell in row):
            ws_out.append(row)
    wb_in.close()
    first = False

out.save("consolidated_raw.xlsx")
print("Wrote consolidated_raw.xlsx via openpyxl")

Common errors and fixes

Error	Cause	Fix
`FileNotFoundError` / empty result	`glob` pattern or directory wrong	Print the resolved path and `list(src.glob("*.xlsx"))` to confirm
Unexpected `NaN` columns	Column names differ across files	Rename to a canonical schema, then `reindex(columns=...)`
`PermissionError` on read	A source file is open in Excel	Skip `~$` lock files (Step 3); close the workbook
Row count assertion fails	An empty or unreadable file was skipped	Log skipped files; decide whether to fail or continue

Frequently asked questions

Does pd.concat need the files to have columns in the same order? No. pd.concat aligns columns by name, so a file that lists its columns in a different order still slots into place. Use ignore_index=True to renumber the stacked rows.

What happens when files have different columns? A file missing a column gets NaN there, and any extra column appears as NaN for every other file. To pin the output to a fixed schema, reindex(columns=master_cols) after concatenating to drop unexpected columns and fix the order.

Why does my run break when a workbook is open in Excel? Excel writes a lock file whose name starts with ~$, and reading it raises PermissionError. Filter those out in your glob: f for f in src.glob("*.xlsx") if not f.name.startswith("~$").

How do I track which file each row came from? Add a column as you read each file — df["source_file"] = f.name — before appending the DataFrame to your list. Every row then carries its origin into the combined table.

When should I use the openpyxl fallback instead of pandas? When files carry merged cells or other quirks that trip up pandas. Copying raw values cell by cell preserves them, but it skips pandas' type handling and name-based alignment, so it assumes a consistent column order across files.

Conclusion

The glob-read-concat pipeline is the idiomatic pandas approach: three lines in the common case, three failure modes to guard against (lock files, schema drift, and empty inputs), and a row-count assertion before export that catches mismatches early. If column-name inconsistency is common in your file set, build the reindex step with a master_cols list into the pipeline from the start rather than adding it after the first failure.

Where to go next

Working with Multiple Excel Sheets in Python — combine across tabs within one workbook.
Merging and Joining Excel DataFrames — when files relate by key rather than stacking.
Scheduling Python Excel Scripts with cron — run this consolidation on a schedule.