How to merge Excel files into one in pandas? - python

I have 9 Excel files named "1_mock" to "10_mock" (skipping "5_mock").
These files each have a single sheet.
I want to combine those 9 files into one workbook with 9 separate sheets, not concatenate them into a single sheet.
I've found this online merger https://products.aspose.app/cells/merger. Though it's convenient to use, I have to stop running my code and switch to that site every time, which is not efficient at all. So I hope there is a way to write this step into the code itself.
Thanks!

You just have to read all the files first and then write them to the same output file with to_excel, passing a different sheet_name parameter every time. You can try something like this:
import pandas as pd

files_list = ["1_mock.xlsx", "2_mock.xlsx", "3_mock.xlsx", .....]
df_list = [pd.read_excel(file_name) for file_name in files_list]

with pd.ExcelWriter('output.xlsx') as writer:
    for i in range(len(files_list)):
        df_list[i].to_excel(writer, sheet_name=files_list[i].replace(".xlsx", ""))
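If you also want to build the file list in code instead of typing it out (the question says 5_mock is skipped), here is a minimal sketch, assuming the files sit in the working directory and each sheet should simply be named after its file without the extension:

import pandas as pd

# Build the list 1_mock.xlsx ... 10_mock.xlsx, skipping 5_mock.xlsx
files_list = [f"{i}_mock.xlsx" for i in range(1, 11) if i != 5]

with pd.ExcelWriter("output.xlsx") as writer:
    for file_name in files_list:
        df = pd.read_excel(file_name)
        # One sheet per source file, named after the file (without the extension)
        df.to_excel(writer, sheet_name=file_name.replace(".xlsx", ""), index=False)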

Related

Iterate through multiple sheets in an Excel file, filter all the data after a value in a row, and append all the sheets

I have an Excel file with around 50 sheets. All the data in the Excel file looks like below:
I want to read the first row and all the rows after 'finish' in the first column.
I have written my script as something like this:
df = pd.read_excel('excel1_all_data.xlsx')
df = df.head(1).append(df[df.index > df[df.iloc[:, 0] == 'finish'].index[0] + 1])
The output looks like below:
The start and finish rows are gone.
My question is: how can I iterate through all the sheets in a similar way and append them into one DataFrame? I would also like a column containing the sheet name, please.
The data in the other sheets is similar, but will have different dates and names. Start and finish will still be present, and we want everything after 'finish'.
Thank you so much for your help!
Try this code and let me know if it works for you:
import pandas as pd

wbSheets = pd.ExcelFile("excel1_all_data.xlsx").sheet_names
frames = []

for st in wbSheets:
    df = pd.read_excel("excel1_all_data.xlsx", st)
    frames.append(df.iloc[[0]])
    frames.append(df[5:])

res = pd.concat(frames)
print(res)
pd.ExcelFile("excel1_all_data.xlsx").sheet_names is what gets you the list of sheet names you need to iterate over.
In pandas.read_excel's documentation you'll find that you can read a specific sheet of the workbook, which is what I've used in the for loop.
I don't know if concat is the best way to solve this for huge files, but it worked fine on my sample.
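If the row positions vary between sheets, here is a hedged variant of the same idea that locates 'finish' in each sheet and also adds the sheet-name column the question asks for (that the marker is literally the string 'finish' in the first column is an assumption):

import pandas as pd

xls = pd.ExcelFile("excel1_all_data.xlsx")
frames = []

for sheet in xls.sheet_names:
    df = pd.read_excel(xls, sheet_name=sheet)
    # Row index where the first column equals 'finish'
    finish_idx = df[df.iloc[:, 0] == 'finish'].index[0]
    # Keep the first row plus everything after the 'finish' row
    part = pd.concat([df.head(1), df[df.index > finish_idx]])
    part["sheet_name"] = sheet  # remember which sheet the rows came from
    frames.append(part)

res = pd.concat(frames, ignore_index=True)
print(res)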

How to improve my append and read_excel for loop in Python

Hope you can help me.
I have a folder containing several .xlsx files with a similar structure (note that some of the files might be larger than 50 MB). I want to combine them all and (eventually) send them to a database. But before that, I need to improve the performance of this block of code, because sometimes it takes a long time to process all those files.
The code in question is this:
df_list = []
for file in location:
    df_list.append(pd.read_excel(file, header=0, engine='openpyxl'))
df_concat = pd.concat(df_list)
Any suggestions?
Somewhere I read that converting the Excel files to CSV might improve performance, but should I do that before appending the files or after everything is concatenated?
And considering df_list is a list, can I do that conversion at all?
I've found a solution with xlsx2csv:
import glob
import re
import subprocess
import pandas as pd

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')

for xlsx in list_of_xlsx:
    # Extract the file name in group 2 "(.+)"
    filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
    # Set up the call for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # On Windows use shell=True
    except:
        print('Failed with {}'.format(xlsx))

outputcsv = './data/bigcsv.csv'  # specify filepath + filename of the output csv
listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # make sure there are 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))

bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)
I tried to make this work for me but had no success. Maybe you can get it working for you? Or does anyone have other ideas?
Reading Excel files is quite slow in pandas, as you stated; you should have a look at this answer. It basically runs a VBScript before the Python script to convert the Excel files to CSV files, which are much faster for the Python script to read.
To be more specific and answer the second part of your question: you should convert the Excel files to CSV before loading them with pandas. The read_excel call is the slow part.
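As a rough sketch of that convert-first idea, using the xlsx2csv package's Python API instead of a subprocess (the exact call shown is an assumption and may differ between versions of the package):

import glob
import os
import pandas as pd
from xlsx2csv import Xlsx2csv

xlsx_dir = './data/Extract/'
csv_dir = './data/csv/'
os.makedirs(csv_dir, exist_ok=True)

# Convert every workbook to CSV once; reading CSV is much faster than read_excel
for xlsx in glob.glob(xlsx_dir + '*.xlsx'):
    csv_name = os.path.join(csv_dir, os.path.splitext(os.path.basename(xlsx))[0] + '.csv')
    Xlsx2csv(xlsx, outputencoding="utf-8").convert(csv_name)

# Then load and concatenate the CSVs in a single pass
df_concat = pd.concat(
    (pd.read_csv(f) for f in glob.glob(csv_dir + '*.csv')),
    ignore_index=True,
)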

How do I assemble a bunch of Excel files into one or more files using Python?

There are around 10k .csv files named data0, data1, and so on in sequence. I want to combine them and have a master sheet in one file, or at least a couple of sheets, using Python, because I think there is a limit of around 1,070,000 rows per sheet in an Excel file.
import pandas as pd
import os

master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        master_df = master_df.append(pd.read_csv(file))
master_df.to_csv('master file.CSV', index=False)
A few things to note:
Please check your CSV file content first. Columns can easily get mismatched when reading CSVs containing text (maybe a ; in the content). Or you can try changing the CSV engine:
df = pd.read_csv(csvfilename, sep=';', encoding='utf-8', engine='python')
If you want to combine everything into one sheet, you can concat into one DataFrame first, then call to_excel:
df = pd.concat([df, sh_tmp], axis=0, sort=False)
Note: concat or append is a straightforward way to combine data. However, 10k files raises a performance concern. If you face performance issues, collect the frames in a list and concat once instead of calling concat or append repeatedly; see the sketch after these notes.
Excel has a maximum row limit. 10k files would easily exceed it (1,048,576 rows per sheet). You might change the output to a CSV file or split it into multiple .xlsx files.
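A minimal sketch of that list-then-concat pattern (the data0.csv, data1.csv, ... naming comes from the question; everything else is an assumption):

import glob
import pandas as pd

# Read each CSV once into a list, then concatenate in a single pass;
# this avoids re-allocating a growing DataFrame on every append.
frames = [pd.read_csv(f) for f in sorted(glob.glob('data*.csv'))]
master_df = pd.concat(frames, ignore_index=True)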
---- update on the 3rd point ----
You can try grouping the data first (1,000,000 rows each), then writing each group to its own sheet:
row_limit = 1000000
master_df['group'] = master_df.index // row_limit

with pd.ExcelWriter(path_out) as writer:
    for gr in range(0, master_df['group'].max() + 1):
        master_df.loc[master_df['group'] == gr].to_excel(writer, sheet_name='Sheet' + str(gr), index=False)

Multiple excel files and multiple sheets to a single excel file but with different excel sheets in Python

Firstly, I ask the admins not to close this topic. The last time I opened one, it was closed because there were supposedly similar topics, but those are not the same. Thanks in advance.
Every day I receive 15-20 Excel files with a huge number of worksheets (more than 200). Fortunately, the worksheet names and counts are the same for all the files. I want to merge all the Excel files into one, but with multiple worksheets. I am new to Python; I have watched and read a lot about the options but could not find a way. Thanks for your support.
Example I tried: I have two files with two sheets each (the actual sheet count is huge, as mentioned above), and I want to merge both files into one with two sheets, as sum.xlsx.
(Screenshots of the Data1.xlsx input files and sum.xlsx omitted.)
import os
import openpyxl
import pandas as pd

files = os.listdir(r'C:\Python\db')  # the folder where the files are located
os.chdir(r'C:\Python\db')  # change the working directory
df = pd.DataFrame()  # create an empty data frame
wb = openpyxl.load_workbook(r'C:\Python\db\Data1.xlsx')  # load one file to extract the list of sheet names
sh_name = wb.sheetnames  # extract the sheet names into a list

for i in sh_name:
    for f in files:
        data = pd.read_excel(f, sheet_name=i)
        df = df.append(data)
    df.to_excel('sum.xlsx', index=False, sheet_name=i)
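A hedged sketch of one way this could be done: read the same sheet from every workbook, concatenate those frames, and write each combined sheet through a single ExcelWriter (the folder path and the sum.xlsx name come from the question; the rest is an assumption):

import glob
import pandas as pd

# All received workbooks, skipping the output file itself on re-runs
files = [f for f in glob.glob(r'C:\Python\db\*.xlsx') if not f.endswith('sum.xlsx')]
sheet_names = pd.ExcelFile(files[0]).sheet_names  # sheet names are identical in every file

with pd.ExcelWriter(r'C:\Python\db\sum.xlsx') as writer:
    for sheet in sheet_names:
        # Stack the same sheet from every workbook into one DataFrame
        combined = pd.concat(
            (pd.read_excel(f, sheet_name=sheet) for f in files),
            ignore_index=True,
        )
        combined.to_excel(writer, sheet_name=sheet, index=False)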

Looping through .xlsx files using pandas, only does first file

My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.
I thought the below code would suffice, but it only does the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So, where am I going wrong? To be clear, there is no error returned, it just does not do the entire for loop. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.
I'm doing this with Anaconda on Windows 8.
import pandas as pd
import glob

f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")  # creates my file list
all_data = pd.DataFrame()  # creates my DataFrame

for f in f_list:  # basic for loop to go through the file list, but it doesn't
    df = pd.read_excel(f)  # reads the .xlsx file
    all_data = all_data.append(df)  # appends the file contents to the DataFrame

all_data.to_excel("output.xlsx")  # creates the new .xlsx
Edit with new information:
After trying some of the suggested changes, I noticed the output claims the files are empty, except for one of them, which is slightly larger than the others. If I put them into the DataFrame, it claims the DataFrame is empty. If I put them into the dict, it claims there are no values associated. Could this have something to do with the file size? Many, if not most, of these files have 3-5 rows with 5 columns. The one it does see has 12 rows.
I strongly recommend reading the DataFrames into a dict:
sheets = {f: pd.read_excel(f) for f in f_list}
For one thing this is very easy to debug: just inspect the dict in the REPL.
Another is that you can then concat these into one DataFrame efficiently in one pass:
pd.concat(sheets.values())
Note: This is significantly faster than append, which has to allocate a temporary DataFrame at each append-call.
An alternative issue is that your glob may not be picking up all the files; you should check that it is by printing f_list.
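Put together, a minimal sketch of that dict-then-concat approach (the glob pattern is the one from the question):

import glob
import pandas as pd

f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")
print(f_list)  # sanity-check that the glob really matches every file

# Read each workbook once, keyed by file name, so individual frames are easy to inspect
sheets = {f: pd.read_excel(f) for f in f_list}

# Concatenate all the frames in a single pass and write the merged result
all_data = pd.concat(sheets.values(), ignore_index=True)
all_data.to_excel("output.xlsx", index=False)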
