Guide

Getting Started With Python Excel AutomationDeep dive

Combine Multiple Excel Files into One in Python

Merge many .xlsx files into one workbook with pandas — glob the folder, concat with name-based column alignment, track the source file, and export the result.

When a folder fills up with monthly exports or per-region submissions, you usually want them stacked into one table. The standard approach reads each .xlsx into a DataFrame and stacks them with pd.concat(), which aligns columns by name so files in slightly different orders still line up. This guide builds the whole pipeline on sample files the first step generates, so every snippet runs in order.

Combining many Excel files into one with pd.concat Three separate files — file1.xlsx, file2.xlsx, and file3.xlsx — are read into DataFrames and stacked with pd.concat into a single combined.xlsx. Many files file1.xlsx file2.xlsx file3.xlsx pd.concat() aligns by column name combined.xlsx one stacked table

Step 1: Install the toolchain

Bash
pip install pandas openpyxl

openpyxl is the engine pandas uses for .xlsx I/O. Convert any legacy .xls files to .xlsx first.

Step 2: Create a folder of sample files

To make the example self-contained, generate three monthly files in a monthly_reports/ directory. Note the third file lists its columns in a different order — pd.concat will still align them by name:

Python
import pandas as pd
from pathlib import Path

src = Path("monthly_reports")
src.mkdir(exist_ok=True)

pd.DataFrame({
    "date": ["2024-01-31", "2024-01-31"],
    "region": ["North", "South"],
    "revenue": [1200.0, 980.0],
}).to_excel(src / "jan.xlsx", index=False)

pd.DataFrame({
    "date": ["2024-02-29", "2024-02-29"],
    "region": ["North", "West"],
    "revenue": [1350.0, 720.0],
}).to_excel(src / "feb.xlsx", index=False)

# Same columns, different order
pd.DataFrame({
    "revenue": [1410.0, 1010.0],
    "region": ["East", "South"],
    "date": ["2024-03-31", "2024-03-31"],
}).to_excel(src / "mar.xlsx", index=False)

print("Created", len(list(src.glob('*.xlsx'))), "files")

Step 3: Find the files with glob

Path.glob("*.xlsx") lists matching files. Filter out Excel's lock files (their names start with ~$) so a workbook left open in Excel does not break the run:

Python
files = sorted(f for f in src.glob("*.xlsx") if not f.name.startswith("~$"))
print([f.name for f in files])

Step 4: Read and concatenate

Read each file into a DataFrame, collect them in a list, then pd.concat with ignore_index=True to renumber the rows. Column alignment is by name, so mar.xlsx's reordered columns slot into place:

Python
frames = []
for f in files:
    df = pd.read_excel(f, engine="openpyxl")
    if not df.empty:
        frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined)
print("Combined shape:", combined.shape)

Step 5: Track which file each row came from

A consolidated table is far more useful when each row remembers its origin. Add a source_file column as you read:

Python
frames = []
for f in files:
    df = pd.read_excel(f, engine="openpyxl")
    df["source_file"] = f.name          # tag every row with its origin
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined[["date", "region", "revenue", "source_file"]])

Step 6: Validate the row count

Before exporting, confirm the combined frame holds exactly as many rows as the source files did. Catching a mismatch here beats discovering missing data downstream:

Python
expected = sum(pd.read_excel(f, engine="openpyxl").shape[0] for f in files)
assert combined.shape[0] == expected, f"Expected {expected} rows, got {combined.shape[0]}"
print(f"Row count verified: {combined.shape[0]} rows")

Step 7: Export the consolidated workbook

Write the result to a single .xlsx. Use index=False so the DataFrame's row numbers do not become a stray column:

Python
combined.to_excel("consolidated_report.xlsx", index=False, engine="openpyxl")
print("Wrote consolidated_report.xlsx")

Handling schema drift

pd.concat aligns by column name, so a file missing a column gets NaN there, and an extra column appears for every other file as NaN. To pin the output to an exact schema, reindex after concatenating — this drops unexpected columns and guarantees column order:

Python
master_cols = ["date", "region", "revenue", "source_file"]
combined = combined.reindex(columns=master_cols)
print("Final columns:", list(combined.columns))

If different files use different names for the same field (rev vs revenue), rename each file's columns to a canonical set with a rename map before appending it to the list.

Fallback: append raw cells with openpyxl

When files carry merged cells or quirks that trip up pandas, you can copy raw values cell by cell. This keeps the first file's header and appends only data rows from the rest. It preserves raw values but skips pandas' type handling and name-based alignment, so it assumes a consistent column order:

Python
from openpyxl import load_workbook, Workbook

out = Workbook()
ws_out = out.active
first = True

for f in files:
    wb_in = load_workbook(f, read_only=True, data_only=True)
    ws_in = wb_in.active
    start = 1 if first else 2          # keep the header only from the first file
    for i, row in enumerate(ws_in.iter_rows(values_only=True), start=1):
        if i < start:
            continue
        if any(cell is not None for cell in row):
            ws_out.append(row)
    wb_in.close()
    first = False

out.save("consolidated_raw.xlsx")
print("Wrote consolidated_raw.xlsx via openpyxl")

Common errors and fixes

ErrorCauseFix
FileNotFoundError / empty resultglob pattern or directory wrongPrint the resolved path and list(src.glob("*.xlsx")) to confirm
Unexpected NaN columnsColumn names differ across filesRename to a canonical schema, then reindex(columns=...)
PermissionError on readA source file is open in ExcelSkip ~$ lock files (Step 3); close the workbook
Row count assertion failsAn empty or unreadable file was skippedLog skipped files; decide whether to fail or continue

Frequently asked questions

Does pd.concat need the files to have columns in the same order? No. pd.concat aligns columns by name, so a file that lists its columns in a different order still slots into place. Use ignore_index=True to renumber the stacked rows.

What happens when files have different columns? A file missing a column gets NaN there, and any extra column appears as NaN for every other file. To pin the output to a fixed schema, reindex(columns=master_cols) after concatenating to drop unexpected columns and fix the order.

Why does my run break when a workbook is open in Excel? Excel writes a lock file whose name starts with ~$, and reading it raises PermissionError. Filter those out in your glob: f for f in src.glob("*.xlsx") if not f.name.startswith("~$").

How do I track which file each row came from? Add a column as you read each file — df["source_file"] = f.name — before appending the DataFrame to your list. Every row then carries its origin into the combined table.

When should I use the openpyxl fallback instead of pandas? When files carry merged cells or other quirks that trip up pandas. Copying raw values cell by cell preserves them, but it skips pandas' type handling and name-based alignment, so it assumes a consistent column order across files.

Conclusion

The glob-read-concat pipeline is the idiomatic pandas approach: three lines in the common case, three failure modes to guard against (lock files, schema drift, and empty inputs), and a row-count assertion before export that catches mismatches early. If column-name inconsistency is common in your file set, build the reindex step with a master_cols list into the pipeline from the start rather than adding it after the first failure.

Where to go next