Stuck dataframes in while loop - Python [duplicate]

I am accessing a series of Excel files in a for loop, then reading the data in each Excel file into a pandas DataFrame. I can't figure out how to append these DataFrames together so I can save the result (now containing the data from all the files) as a new Excel file.
Here's what I tried:
for infile in glob.glob("*.xlsx"):
    data = pandas.read_excel(infile)
    appended_data = pandas.DataFrame.append(data)  # requires at least two arguments
appended_data.to_excel("appended.xlsx")
Thanks!

Use pd.concat to merge a list of DataFrames into a single big DataFrame.
import glob
import pandas as pd

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    # store each DataFrame in the list
    appended_data.append(data)
# see pd.concat documentation for more info
appended_data = pd.concat(appended_data)
# write the DataFrame to an Excel sheet
appended_data.to_excel('appended.xlsx')

You can try this:
import glob
import pandas as pd

data_you_need = pd.DataFrame()
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    data_you_need = data_you_need.append(data, ignore_index=True)
I hope it helps.

DataFrame.append() and Series.append() have been deprecated and will be removed in a future version. Use pandas.concat() instead (GH35407).
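A minimal sketch of that migration, with small in-memory frames standing in for the per-file read_excel results:

```python
import pandas as pd

# Stand-ins for the DataFrames read from each Excel file.
frames = [
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"a": [3, 4]}),
]

# pd.concat replaces the deprecated DataFrame.append pattern;
# ignore_index=True renumbers the combined index 0..n-1.
combined = pd.concat(frames, ignore_index=True)
print(combined["a"].tolist())  # [1, 2, 3, 4]
```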

Related

Iterate through multiple sheets in an Excel file, filter all the data after a value in a row, and append all the sheets

I have an excel file with around 50 sheets. All the data in the excel file looks like below:
I want to read the first row and all the rows after 'finish' in the first column.
I have written my script as something like this.
df = pd.read_excel('excel1_all_data.xlsx')
df = df.head(1).append(df[df.index > df[df.iloc[:, 0] == 'finish'].index[0]])
The output looks like below:
The start and finish are gone.
My question is: how can I iterate through all the sheets in a similar way and append them into one DataFrame? I would also like a column containing the sheet name, please.
The data in the other sheets is similar, but will have different dates and names. 'start' and 'finish' will still be present, and we want everything after 'finish'.
Thank you so much for your help !!
Try this code and let me know if it works for you:
import pandas as pd

wbSheets = pd.ExcelFile("excel1_all_data.xlsx").sheet_names
frames = []
for st in wbSheets:
    df = pd.read_excel("excel1_all_data.xlsx", st)
    frames.append(df.iloc[[0]])
    frames.append(df[5:])
res = pd.concat(frames)
print(res)
pd.ExcelFile("excel1_all_data.xlsx").sheet_names is what gets you the sheets you need to iterate over.
In pandas.read_excel's documentation you'll find that you can read a specific sheet of the workbook, which I've used in the for loop.
I don't know if concat is the best way to solve this for huge files but it worked fine on my sample.
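To also tag each row with its sheet name, as the question asks, here is one sketch; the small in-memory frames stand in for the dict that pd.read_excel("excel1_all_data.xlsx", sheet_name=None) would return for a real workbook:

```python
import pandas as pd

# Stand-in for: sheets = pd.read_excel("excel1_all_data.xlsx", sheet_name=None)
# which returns {sheet name: DataFrame}.
sheets = {
    "Sheet1": pd.DataFrame({"val": [1, 2]}),
    "Sheet2": pd.DataFrame({"val": [3]}),
}

# assign() adds a column holding the sheet name before concatenating.
tagged = [df.assign(sheet=name) for name, df in sheets.items()]
res = pd.concat(tagged, ignore_index=True)
print(res["sheet"].tolist())  # ['Sheet1', 'Sheet1', 'Sheet2']
```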

How do I assemble a bunch of Excel files into one or more files using Python?

There are around 10k .csv files, named data0, data1, and so on in sequence. I want to combine them into a master sheet in one file, or at least a couple of sheets, using Python, because I think there is a limit of around 1,048,576 rows in one Excel sheet.
import os
import pandas as pd

master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        master_df = master_df.append(pd.read_csv(file))
master_df.to_csv('master file.CSV', index=False)
A few things to note:
Please check your CSV file content first. Columns can easily mismatch when reading a CSV containing text (maybe a ';' inside the content). Or you can try changing the CSV engine:
df = pd.read_csv(csvfilename, sep=';', encoding='utf-8', engine='python')
If you want to combine everything into one sheet, you can concat into one DataFrame first, then call to_excel:
df = pd.concat([df, sh_tmp], axis=0, sort=False)
Note: concat or append is a straightforward way to combine data. However, 10k files can become a performance problem; try accumulating the frames in a list and concatenating once, rather than calling pd.concat repeatedly, if you face performance issues.
Excel has a maximum row limit. 10k files would easily exceed it (1,048,576 rows). You might change the output to a CSV file or split it into multiple .xlsx files.
----update the 3rd----
You can try grouping the data first (1,000,000 rows per group), then writing the groups to separate sheets one by one.
row_limit = 1000000
master_df = master_df.reset_index(drop=True)  # give the combined frame a unique 0..n-1 index
master_df['group'] = master_df.index // row_limit
writer = pd.ExcelWriter(path_out)
for gr in range(0, master_df['group'].max() + 1):
    master_df.loc[master_df['group'] == gr].to_excel(writer, sheet_name='Sheet' + str(gr), index=False)
writer.close()
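The list-then-concat advice above can be sketched like this, with three tiny frames standing in for the 10k CSV reads:

```python
import pandas as pd

# Collect parts in a plain list; calling append/concat inside the loop
# would recopy the growing frame on every iteration.
parts = []
for i in range(3):  # stand-in for looping over the 10k CSV files
    parts.append(pd.DataFrame({"x": [i]}))

# One concat at the end combines everything in a single pass.
master_df = pd.concat(parts, ignore_index=True)
print(master_df["x"].tolist())  # [0, 1, 2]
```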

How to save into one Excelsheet many (n) dataframes using python without deleting the previous dataframe

After parsing EML files and extracting their contents, I created many DataFrames that I want to save to one Excel file.
After saving all my DataFrames to the Excel file, I still have only the last one in the sheet, not all of them.
Does someone have an idea how I can solve that?
You should write all the frames through a single pd.ExcelWriter, giving each its own sheet_name; calling to_excel(filename, ...) repeatedly overwrites the file each time, which is why only the last DataFrame survives:
import pandas as pd

df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
with pd.ExcelWriter(filename) as writer:
    df_1.to_excel(writer, sheet_name='df1')
    df_2.to_excel(writer, sheet_name='df2')

Python Append an excel sheet into Pandas-DataFrame

I am trying to create a database and fill it with values gotten from an excel sheet.
My code:
new_db = pd.DataFrame()
workbook = pd.ExcelFile(filename)
df = workbook.parse('Sheet1')
print(df)
new_db.append(df)
print(new_db.head())
But whenever I seem to do this, I get an empty dataframe back.
My Excel sheet, however, is packed with values. When it is printed (print(df)) it shows the ID values and all the correct columns and rows.
My knowledge with Pandas-Dataframes is limited so excuse me if I do not know something I should. All help is appreciated.
I think pandas.read_excel is what you're looking for. Here is an example:
import pandas as pd
df = pd.read_excel(filename)
print(df.head())
df will have the type pandas.DataFrame
The default parameters of read_excel are set so that the first sheet in the Excel file is read; check the documentation for more options (if you provide a list of sheets to read by setting the sheet_name parameter, df will be a dictionary with sheet names as keys and their corresponding DataFrames as values). Depending on the version of Python you're using and its distribution, you may need to install the xlrd module, which you can do using pip.
You need to reassign the df after appending to it, as @ayhan pointed out in the comments:
new_db = new_db.append(df)
From the pandas documentation for append, it returns the appended DataFrame, which means you need to assign the result to a variable.
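A minimal illustration that the result must be assigned back (shown here with pd.concat, which replaces append in current pandas and likewise returns a new DataFrame rather than mutating in place):

```python
import pandas as pd

new_db = pd.DataFrame({"a": [1]})
df = pd.DataFrame({"a": [2]})

# Without reassignment the combined result would be discarded,
# leaving new_db unchanged; assigning it back keeps the new frame.
new_db = pd.concat([new_db, df], ignore_index=True)
print(new_db["a"].tolist())  # [1, 2]
```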

How to open an excel file with multiple sheets in pandas?

I have an Excel file composed of several sheets. I need to load each sheet as a separate DataFrame. What would be a similar function to pd.read_csv("") for this kind of task?
P.S. due to the size I cannot copy and paste individual sheets in excel
Use pandas read_excel() method that accepts a sheet_name parameter:
import pandas as pd
df = pd.read_excel(excel_file_path, sheet_name="sheet_name")
Multiple data frames can be loaded by passing in a list. For a more in-depth explanation of how read_excel() works see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
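A sketch of that list form, building a small two-sheet workbook in an in-memory buffer first (this assumes the openpyxl engine is installed):

```python
import io
import pandas as pd

# Build a two-sheet workbook in memory so the example is self-contained.
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1]}).to_excel(writer, sheet_name="s1", index=False)
    pd.DataFrame({"a": [2]}).to_excel(writer, sheet_name="s2", index=False)
buf.seek(0)

# Passing a list to sheet_name returns a dict of {sheet name: DataFrame}.
sheets = pd.read_excel(buf, sheet_name=["s1", "s2"])
print(sorted(sheets))  # ['s1', 's2']
```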
If you can't type out each sheet name and want to read the whole workbook, try this:
dfname = pd.ExcelFile('C://full_path.xlsx')
print(dfname.sheet_names)
df = pd.read_excel('C://full_path.xlsx')
for items in dfname.sheet_names[1:]:
    dfnew = pd.read_excel('C://full_path.xlsx', sheet_name=items)
    df = pd.concat([df, dfnew])
The thing is that pd.read_excel() only reads the very first sheet by default, leaving the rest unread, so you can use this:
import pandas
# setting sheet_name=None reads all sheets into a dict
sheets = pandas.read_excel(filepath, sheet_name=None)
# i will be the keys (sheet names) in the dictionary;
# the values are the DataFrames of each sheet
for i in sheets:
    print(f"sheet[{i}]")
    print(f"sheet[{i}].columns={sheets[i].columns}")
    for index, row in sheets[i].iterrows():
        print(f"index={index} row={row}")
exFile = pd.ExcelFile(f)  # load file f (assumes import pandas as pd)
data = exFile.parse()  # this creates a DataFrame out of the first sheet in the file
