How to read and compare two excel files with multiple worksheets? - python

I have two Excel files and both of them have 10 worksheets. I want to read each worksheet, compare them, and print the data to a third Excel file, which would also be split across multiple worksheets.
The program below works for a single worksheet:
import pandas as pd
df1 = pd.read_excel('zyx_5661.xlsx')
df2 = pd.read_excel('zyx_5662.xlsx')
df1.rename(columns= lambda x : x + '_file1', inplace=True)
df2.rename(columns= lambda x : x + '_file2', inplace=True)
df_join = df1.merge(right = df2, left_on = df1.columns.to_list(), right_on = df2.columns.to_list(), how = 'outer')
with pd.ExcelWriter('xl_join_diff.xlsx') as writer:
    df_join.to_excel(writer, sheet_name='testing', index=False)
How can I optimize it to work with multiple worksheets?

I think this should achieve what you need. Loop through each sheet name (assuming the sheets are named the same across both Excel documents; if not, you can use sheet indexes instead), write each merged output to a new sheet, and save the Excel document.
import pandas as pd

writer = pd.ExcelWriter('xl_join_diff.xlsx')
for sheet in ['sheet1', 'sheet2', 'sheet3']:  # list of sheet names
    # Pull in data for each sheet, and merge together.
    df1 = pd.read_excel('zyx_5661.xlsx', sheet_name=sheet)
    df2 = pd.read_excel('zyx_5662.xlsx', sheet_name=sheet)
    df1.rename(columns=lambda x: x + '_file1', inplace=True)
    df2.rename(columns=lambda x: x + '_file2', inplace=True)
    df_join = df1.merge(right=df2, left_on=df1.columns.to_list(),
                        right_on=df2.columns.to_list(), how='outer')
    df_join.to_excel(writer, sheet_name=sheet, index=False)  # write to excel as new sheet
writer.close()  # save the excel document once all sheets have been done
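If you'd rather not hardcode the sheet names, one option is to read them from one of the files first with pandas' ExcelFile; this is only a sketch and assumes both files contain the same tabs:
import pandas as pd

# sheet names taken from the first file; the second file is assumed to have the same tabs
sheets = pd.ExcelFile('zyx_5661.xlsx').sheet_names
You can then loop over sheets instead of a hardcoded list.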

You can use a loop to read the files and their sheets:
writer = pd.ExcelWriter('multiple.xlsx', engine='xlsxwriter')
# create writer for writing all sheets in 1 file
list_files = ['zyx_5661.xlsx', 'zyx_5662.xlsx']
count_sheets = 0
for file_name in list_files:
    file = pd.ExcelFile(file_name)
    for sheet_name in file.sheet_names:
        df = pd.read_excel(file, sheet_name)
        # ... you can do your process
        count_sheets = count_sheets + 1
        df.to_excel(writer, sheet_name='Sheet-' + str(count_sheets))
writer.close()

Related

Combine excel files

Can someone help me get the output in an Excel-readable format? I am getting the output as a dataframe, but the data is embedded as a string in rows 2 and 3.
import pandas as pd
import os
input_path = 'C:/Users/Admin/Downloads/Test/'
output_path = 'C:/Users/Admin/Downloads/Test/'
excel_file_list = os.listdir(input_path)
df = pd.DataFrame()
for file in excel_file_list:
    if file.endswith('.xlsx'):
        df1 = pd.read_excel(input_path + file, sheet_name=None)
        df = df.append(df1, ignore_index=True)
writer = pd.ExcelWriter('combined.xlsx', engine='xlsxwriter')
for sheet_name in df.keys():
    df[sheet_name].to_excel(writer, sheet_name=sheet_name, index=False)
writer.save()
Your issue may be in using sheet_name=None. If any of the files have multiple sheets, pd.read_excel() returns a dictionary in {'sheet_name': dataframe} format.
To .append() with this, you can try something like the following, using Python's dict.items() method:
def combotime(dfinput):
    df1 = pd.DataFrame()
    for k, v in dfinput.items():
        df1 = df1.append(v)
    return df1
EDIT: If you mean to keep the sheets separate, as implied by your writer loop, do not use a pd.DataFrame() object like your df to collect the dictionary items. Instead, add them to an existing dictionary:
sheets = {}
sheets.update(df1)  # df1 is your read_excel dictionary
for sheet in sheets.keys():
    sheets[sheet].to_excel(writer, sheet_name=sheet, index=False)

Read each excel sheet as a different dataframe in Python

I have an Excel file with 40 sheets. I want to read each sheet into a different dataframe, so I can export an xlsx file for each sheet.
Instead of writing out all the sheet names one by one, I want to create a loop that gets all the sheet names and passes each one as the sheet_name argument of pandas.read_excel.
I am trying to avoid this:
df1 = pd.read_excel(r'C:\Users\filename.xlsx', sheet_name= 'Sheet1');
df2 = pd.read_excel(r'C:\Users\filename.xlsx', sheet_name= 'Sheet2');
....
df40 = pd.read_excel(r'C:\Users\filename.xlsx', sheet_name= 'Sheet40');
thank you all guys
Specifying sheet_name as None with read_excel reads all worksheets and returns a dict of DataFrames.
import pandas as pd

file = r'C:\Users\filename.xlsx'  # raw string so the backslashes are not treated as escapes
xl = pd.read_excel(file, sheet_name=None)
sheets = xl.keys()
for sheet in sheets:
    xl[sheet].to_excel(f"{sheet}.xlsx")
I think this is what you are looking for.
import pandas as pd
xlsx = pd.read_excel('file.xlsx', sheet_name=None, header=None)
for sheet in xlsx.keys():
    xlsx[sheet].to_excel(sheet + '.xlsx', header=False, index=False)

How do I drop certain columns by colname for workbooks using Python?

I am trying to understand how I can extend my current script to make changes at the sheet level. I want to be able to delete columns from the worksheets in my flat file. For example, if a column is called 'company' I want to delete it, so that the final wb.save drops those columns. I have multiple column names I want to drop from all sheets in the workbook:
cols_to_drop = ['Company','Type','Firstname','lastname']
My code so far, where I have managed to delete specific sheets from a file and update the column names, is below:
from openpyxl import load_workbook
import os

column_name_update_map = {'LocationName': 'Company Name', 'StreetAddress': 'Address', 'City': 'City', 'State': 'State',
                          'Zip': 'Zip', 'GeneralPhone': 'Phone Number', 'GeneralEmail': 'Email', 'DateJoined': 'Status Date',
                          'Date Removed': 'Status Date'}

for file in os.listdir("C:/Users/hhh/Desktop/aaa/python/Matching"):
    if file.startswith("TVC"):
        wb = load_workbook(file)
        if 'Opt-Ins' in wb.sheetnames:
            wb.remove(wb['Opt-Ins'])
            wb.remove(wb['New Voting Members'])
            wb.remove(wb['Temporary Members'])
        for ws in wb:
            for header in next(ws.rows):
                try:
                    header.value = column_name_update_map[header.value]
                except KeyError:
                    pass
        wb.save(file + " (updated headers).xlsx")
This part of the code works perfectly and gives me the desired result. However, I'm unable to apply dataframe logic like df.drop(['Company', 'Type', 'Firstname'], axis=1), since it is a workbook and not a dataframe.
Since you've tagged the question as pandas, you could just use pandas to read and drop:
for file in os.listdir("C:/Users/hhh/Desktop/aaa/python/Matching"):
    if file.startswith("TVC"):
        dfs = pd.read_excel(file, sheet_name=None)
        output = dict()
        for ws, df in dfs.items():
            if ws in ["Opt-Ins", "New Voting Members", "Temporary Members"]:
                continue
            # drop unneeded columns
            temp = df.drop(cols_to_drop, errors="ignore", axis=1)
            # rename columns
            temp = temp.rename(columns=column_name_update_map)
            # drop empty columns
            temp = temp.dropna(how="all", axis=1)
            output[ws] = temp
        writer = pd.ExcelWriter(f'{file.replace(".xlsx","")} (updated headers).xlsx')
        for ws, df in output.items():
            df.to_excel(writer, index=None, sheet_name=ws)
        writer.close()
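Alternatively, if you'd rather stay in openpyxl and keep the rest of your existing loop, columns can be removed at sheet level with Worksheet.delete_cols. This is only a sketch (the file name is a placeholder), which looks up the 1-based index of each matching header in row 1:
from openpyxl import load_workbook

cols_to_drop = ['Company', 'Type', 'Firstname', 'lastname']

wb = load_workbook('TVC_example.xlsx')  # placeholder file name for illustration
for ws in wb.worksheets:
    header_row = next(ws.iter_rows(min_row=1, max_row=1))
    # collect the 1-based indexes of columns whose header matches, then delete
    # them right-to-left so earlier deletions do not shift the remaining indexes
    matches = [cell.column for cell in header_row if cell.value in cols_to_drop]
    for idx in sorted(matches, reverse=True):
        ws.delete_cols(idx)
wb.save('TVC_example (columns dropped).xlsx')
Note that cell.column returns an integer index in recent openpyxl versions; in very old releases it returned a column letter.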

How do I write to individual excel sheets for each dataframe generated from for loop?

I have input data in the form of a dictionary consisting of 3 dataframes of numbers. I wish to iterate through each dataframe with some operations and then finally write results for each dataframe to excel.
The following code works fine except that it only writes the resulting dataframe for the last key in the dictionary.
How do I get results for all 3 dataframes written to individual sheets?
Input_Data = {'k1': test1, 'k2': test24, 'k3': test3}
for v in Input_Data.values():
    df1 = v[126:236]
    df = df1.sort_index(ascending=False)
    Indexer = df.columns.tolist()
    df = [(pd.concat([df[Indexer[0]], df[Indexer[num]]], axis=1)) for num in [1, 2, 3, 4, 5, 6]]
    df = [(df[num].astype(str).agg(','.join, axis=1)) for num in [0, 1, 2, 3, 4, 5]]
    df = pd.DataFrame(df)
    dff = df.loc[0].append(df.loc[1].append(df.loc[2].append(df.loc[3].append(df.loc[4].append(df.loc[5])))))
    dff.to_excel('test.xlsx', index=False, header=False)
Your first issue is that each iteration of the loop creates the output file from scratch, overwriting whatever the previous iteration wrote.
As per pandas documentation:
"Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it is necessary to save the changes. Note that creating an ExcelWriter object with a file name that already exists will result in the contents of the existing file being erased."
Second, you are not providing a variable sheet name, so the data is written to the same sheet every time.
An example solution, with ExcelWriter
# df1, df2, df3 - dataframes
input_data = {
    'sheet_name1': df1,
    'sheet_name2': df2,
    'sheet_name3': df3
}

# Initiate ExcelWriter - use the xlsxwriter engine
writer = pd.ExcelWriter('multiple_sheets.xlsx', engine='xlsxwriter')

# Iterate over the input_data dictionary
for sheet_name, df in input_data.items():
    """
    Perform operations here
    """
    # Write each dataframe to a different worksheet.
    df.to_excel(writer, sheet_name=sheet_name)

# Finally, close the ExcelWriter to save the file
writer.close()
Note 1. You only initiate the ExcelWriter object and save it (via close()) once; the iterations only add sheets to that object.
Note 2. Compared to your code, the variable sheet_name is passed to the to_excel() function.
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')

# Write each dataframe to a different worksheet.
for sheet_name, df in zip(sheet_names, dfs):
    df.to_excel(writer, sheet_name=sheet_name)

# Close the Pandas Excel writer and output the Excel file.
writer.close()
Try to change the file name at each iteration:
Input_Data = {'k1': test1, 'k2': test24, 'k3': test3}
file_number = 1
for v in Input_Data.values():
    df1 = v[126:236]
    df = df1.sort_index(ascending=False)
    Indexer = df.columns.tolist()
    df = [(pd.concat([df[Indexer[0]], df[Indexer[num]]], axis=1)) for num in [1, 2, 3, 4, 5, 6]]
    df = [(df[num].astype(str).agg(','.join, axis=1)) for num in [0, 1, 2, 3, 4, 5]]
    df = pd.DataFrame(df)
    dff = df.loc[0].append(df.loc[1].append(df.loc[2].append(df.loc[3].append(df.loc[4].append(df.loc[5])))))
    file_name = 'test'
    dff.to_excel(file_name + str(file_number) + ".xlsx", index=False, header=False)
    file_number = file_number + 1

Append to Existing excel without changing formatting

I'm trying to take data from other Excel/CSV sheets and append it to an existing workbook/worksheet. The code below works in terms of appending; however, it removes the formatting not only of the sheet I've appended to, but of all other sheets in the workbook.
From what I understand, this happens because I'm reading the entire workbook as a dictionary of dataframes and then rewriting it all back to Excel. But I'm not sure how to go about it otherwise.
How do I need to modify my code to make it so that I'm only appending the data I need, and leaving everything else untouched? Is pandas the incorrect way to go about this?
import os
import pandas as pd
import openpyxl
#Read Consolidated Sheet as Dictionary
#the 'Consolidation' excel sheet has 3 sheets:
#Consolidate, Skip1, Skip2
ws_dict = pd.read_excel(r'.\Consolidation.xlsx', sheet_name=None)
#convert relevant sheet to dataframe
mod_df = ws_dict['Consolidate']
#check that mod_df is the 'Consolidate' Tab
mod_df
#do work on mod_df
#grab extra sheets with data and make into pd dataframes
excel1 = 'doc1.xlsx'
excel2 = 'doc2.xlsx'
df1 = pd.read_excel(excel1)
df1 = df1.reset_index(drop=True)
df2 = pd.read_excel(excel2)
df2 = df2.reset_index(drop=True)
#concate the sheets
mod_df = pd.concat([mod_df, df1, df2], axis=0, ignore_index = True, sort=False)
#reassign the modified to the sheet
ws_dict['Consolidate'] = mod_df
#write to the consolidation workbook
with pd.ExcelWriter('Consolidation.xlsx', engine='xlsxwriter') as writer:
    for ws_name, df_sheet in ws_dict.items():
        df_sheet.to_excel(writer, sheet_name=ws_name, index=False)
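A minimal sketch of an alternative approach, assuming the new data in doc1.xlsx/doc2.xlsx has the same column order as the 'Consolidate' sheet: open the existing workbook with openpyxl and append only the new rows, so the other sheets are never rewritten and their formatting stays untouched.
import pandas as pd
from openpyxl import load_workbook

# new rows to append; column order is assumed to match the 'Consolidate' sheet
new_rows = pd.concat([pd.read_excel('doc1.xlsx'), pd.read_excel('doc2.xlsx')], ignore_index=True)

wb = load_workbook('Consolidation.xlsx')  # existing sheets are loaded, not recreated
ws = wb['Consolidate']
for row in new_rows.itertuples(index=False):
    ws.append(list(row))  # appends each row below the last used row
wb.save('Consolidation.xlsx')
Cell styles on the untouched sheets are preserved this way, though some workbook features (for example charts or images) may not survive an openpyxl load/save round-trip in older versions.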
