Error while appending many Excel files to one in Python

I am trying to append 10 Excel files into one in Python. The code below was used and I am getting:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Once I change the sheet_name argument to None, the code runs perfectly.
However, all 10 Excel files have three sheets and I only want a specific sheet from each file.
Is there a way to get it done? Your help is appreciated.
import pandas as pd
import glob
path = r'Folder path'
filenames = glob.glob(path + "\*.xlsx")
finalexcelsheet = pd.DataFrame()
for file in filenames:
    df = pd.concat(pd.read_excel(file, sheet_name='Selected Sheet'), ignore_index=True, sort=False)
    finalexcelsheet = finalexcelsheet.append(df, ignore_index=True)

I can't test it, but the problem is that you use concat in the wrong way - or rather, you don't need concat in your situation.
concat needs a list of dataframes, like
concat([df1, df2, ...], ...)
but read_excel returns different objects for different values of sheet_name=..., and this causes the problem.
read_excel with sheet_name=None returns a dict with all sheets as separate DataFrames
{'sheet_1': df_sheet_1, 'sheet_2': df_sheet_2, ...}
and then concat can join them into one dataframe.
read_excel with sheet_name=name returns a single dataframe
df_sheet
and then concat has nothing to join - and it gives the error.
But it means you don't need concat.
You should assign the result of read_excel directly to df:
for file in filenames:
    df = pd.read_excel(file, sheet_name='Selected Sheet')
    finalexcelsheet = finalexcelsheet.append(df, ignore_index=True)

Related

Iterate through all sheets of all workbooks in a directory

I am trying to combine all spreadsheets from all workbooks in a directory into a single df. I've tried with glob and with os.scandir, but either way I keep getting only the first sheet of each workbook.
First attempt:
import pandas as pd
import glob
workbooks = glob.glob(r"\mydirectory\*.xlsx")
list = []
for file in workbooks:
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True)
    list.append(df)
dataframe = pd.concat(list, axis=0)
Second attempt:
import os
import pandas as pd
df = pd.DataFrame()
path = r"\mydirectory"
with os.scandir(path) as files:
    for file in files:
        data = pd.read_excel(file, sheet_name=None)
        df = df.append(data)
I think the issue lies with the for loop but I'm too inexperienced to pin down the problem. Any help would be greatly appreciated, thx!!!
If I understand what you have written correctly, you want something like this:
import pandas as pd
import glob
# list of workbooks in directory
workbooks = glob.glob(r"\mydirectory\*.xlsx")
l = []
# for each file in list
for file in workbooks:
    # the ExcelFile object allows retrieving the sheet names
    xl_file = pd.ExcelFile(file)
    # concatenate DataFrames created from each sheet in the file
    df = pd.concat([pd.read_excel(file, sheet) for sheet in xl_file.sheet_names], ignore_index=True)
    # append to list
    l.append(df)
# concatenate all file DataFrames to one DataFrame
dataframe = pd.concat(l, axis=0)
This loops through the sheets within each Excel file for the concatenation, which is the only difference from what you had already written.
Alternative:
Alternatively, without needing to find the sheet names first, the dictionary created by pd.read_excel(file, sheet_name=None) can be used.
import pandas as pd
import glob
# list of workbooks in directory
workbooks = glob.glob(r"\mydirectory\*.xlsx")
l = []
# for each file in list
for file in workbooks:
    # concatenate the dictionary of dataframes from pd.read_excel
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True)
    l.append(df)
# concatenate all file DataFrames to one DataFrame
dataframe = pd.concat(l, axis=0)
A good explanation/example of the use of sheet_name=None can be found here. In short, this returns a dictionary of DataFrames, one for each sheet. This can then be concatenated into one DataFrame, as above, or an individual sheet's DataFrame can be accessed through dictionary["sheet_name"].
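For illustration, a minimal sketch of both uses; the workbook and sheet names here are hypothetical:
import pandas as pd

# sheet_name=None -> dictionary mapping each sheet name to its DataFrame
sheets = pd.read_excel("workbook.xlsx", sheet_name=None)

# combine every sheet into one DataFrame
combined = pd.concat(sheets, ignore_index=True)

# or access a single sheet's DataFrame by name
first = sheets["Sheet1"]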

Import several sheets from the same excel into one dataframe in pandas

I have one Excel file with several identically structured sheets in it (same headers and number of columns) (sheet names: 01, 02, ..., 12).
How can I get this into one dataframe?
Right now I would load each sheet separately with:
df1 = pd.read_excel('path.xls', sheet_name='01')
df2 = pd.read_excel('path.xls', sheet_name='02')
...
and would then concatenate them.
What is the most pythonic way to do it and get one dataframe with all the sheets directly? Also assuming I do not know every sheet name in advance.
read the file as:
collection = pd.read_excel('path.xls', sheet_name=None)
combined = pd.concat([value.assign(sheet_source=key)
                      for key, value in collection.items()],
                     ignore_index=True)
sheet_name = None ensures all the sheets are read in.
collection is a dictionary, with the sheet name as the key and the actual data as the value. combined uses the pandas concat method to get you one dataframe. I added the extra column sheet_source in case you need to track where the data for each row comes from.
You can read more about it in the pandas doco.
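As a usage sketch, the sheet_source column added above lets you slice the combined frame back apart by its origin (sheet '01' is taken from the question's example):
import pandas as pd

collection = pd.read_excel('path.xls', sheet_name=None)
combined = pd.concat([value.assign(sheet_source=key)
                      for key, value in collection.items()],
                     ignore_index=True)

# rows that originally came from sheet "01"
sheet_01_rows = combined[combined["sheet_source"] == "01"]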
you can use (note range(1, 13), since the sheets are named 01 through 12):
df_final = pd.concat([pd.read_excel('path.xls', sheet_name="{:02d}".format(sheet))
                      for sheet in range(1, 13)], axis=0)

python efficient way to append all worksheets in multiple excel into pandas dataframe

I have around 20++ xlsx files, and each xlsx file might contain a different number of worksheets. But thank god, all the columns are the same in all worksheets and all xlsx files. By referring to here, I got some idea. I have been trying a few ways to import and append all Excel files (all worksheets) into a single dataframe (around 4 million rows of records).
Note: I did check here as well, but it only covers the file level; mine goes down from the file level to the worksheet level.
I have tried the code below:
# import all necessary package
import pandas as pd
from pathlib import Path
import glob
import sys
# set source path
source_dataset_path = "C:/Users/aaa/Desktop/Sample_dataset/"
source_dataset_list = glob.iglob(source_dataset_path + "Sales transaction *")
for file in source_dataset_list:
    #xls = pd.ExcelFile(source_dataset_list[i])
    sys.stdout.write(str(file))
    sys.stdout.flush()
    xls = pd.ExcelFile(file)
    out_df = pd.DataFrame()  ## create empty output dataframe
    for sheet in xls.sheet_names:
        sys.stdout.write(str(sheet))
        sys.stdout.flush()  ## view the excel file's sheet names
        #df = pd.read_excel(source_dataset_list[i], sheet_name=sheet)
        df = pd.read_excel(file, sheet_name=sheet)
        out_df = out_df.append(df)  ## this will append rows of one dataframe to another (just like your expected output)
Question:
My approach is to first read every single Excel file and get a list of the sheets inside it, then load the sheets and append them all. The looping seems not very efficient, especially as the data size grows with every append.
Is there any other efficient way to import and append all sheets from multiple Excel files?
Use sheet_name=None in read_excel to return an OrderedDict of DataFrames created from all sheet names, then join them together with concat and finally DataFrame.append to the final DataFrame:
out_df = pd.DataFrame()
for f in source_dataset_list:
    df = pd.read_excel(f, sheet_name=None)
    cdf = pd.concat(df.values())
    out_df = out_df.append(cdf, ignore_index=True)
Another solution:
cdf = [pd.read_excel(excel_names, sheet_name=None).values()
       for excel_names in source_dataset_list]
out_df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
If I understand your problem correctly, setting sheet_name=None in pd.read_excel does the trick.
import os
import pandas as pd
path = "C:/Users/aaa/Desktop/Sample_dataset/"
dfs = [
    pd.concat(pd.read_excel(path + x, sheet_name=None))
    for x in os.listdir(path)
    if x.endswith(".xlsx") or x.endswith(".xls")
]
df = pd.concat(dfs)
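Since the question already imports pathlib, a sketch of roughly the same idea with Path.glob instead of os.listdir, assuming the same directory layout:
from pathlib import Path

import pandas as pd

path = Path("C:/Users/aaa/Desktop/Sample_dataset/")

# one concat per workbook (all sheets), then one concat across workbooks
df = pd.concat(
    [pd.concat(pd.read_excel(f, sheet_name=None)) for f in path.glob("*.xlsx")],
    ignore_index=True,
)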
I have a pretty straightforward solution if you want to read all the sheets of a single workbook.
import pandas as pd
df = pd.concat(pd.read_excel(path + file_name, sheet_name=None),
               ignore_index=True)

Python Pandas - Create function to use read_excel() for multiple files

I am trying to make reusable functions in Python that would read two Excel files and save them to other files.
My function looks like this:
def excel_reader(df1, file1, sheet1, df2, file2, sheet2):
    df1 = pd.read_excel(file1, sheet1)
    df2 = pd.read_excel(file2, sheet2)

def save_to_excel(df1, filename1, sheet1, df2, filename2, sheet2):
    df1.to_excel(filename1, sheet1)
    df2.to_excel(filename2, sheet2)
I am calling the functions as:
excel_reader(df1, 'some_file1.xlsx', 'sheet_name1',
             df2, 'some_file2.xlsx', 'sheet_name2')
save_to_excel(df1, 'some_file1.xlsx', 'sheet_name1',
              df2, 'some_file2.xlsx', 'sheet_name2')
It does not raise any errors, but it does not create the Excel files that save_to_excel should produce.
It reads the function parameters up to the df2 parameter and returns an error for the last two.
I will be using pd.read_excel() quite a number of times in my code, so I am trying to make it a function. I am also aware that read_excel() takes the filename as a string and tried passing 'somefile.xlsx', but still the same result.
The Excel files that will be read are in the same directory as the Python script.
Question: Any advice on how this would work? Is it advisable to make this a function or should I just use read_excel() repetitively?
I don't think this function would improve anything...
Just imagine that you have to call pd.read_excel() with different parameters for different Excel files - for example:
df1 = pd.read_excel(file1, 'Sheet1', skiprows=1)
df2 = pd.read_excel(file2, 'Sheet2', usecols="A,C,E:F")
You would lose all this flexibility using your custom function...
You are missing the return statement. If you just want the function to return the dataframes, then:
def excel_reader(file1, sheet1, file2, sheet2):
    df1 = pd.read_excel(file1, sheet1)
    df2 = pd.read_excel(file2, sheet2)
    return df1, df2

df1, df2 = excel_reader('some_file1.xlsx', 'sheet_name1', 'some_file2.xlsx', 'sheet_name2')
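If you still want a shared helper without losing the flexibility the first answer mentions, one option is to forward extra keyword arguments to read_excel; a sketch, where excel_reader_kw is a hypothetical name:
import pandas as pd

def excel_reader_kw(file, sheet, **kwargs):
    # pass any extra read_excel options (skiprows, usecols, ...) straight through
    return pd.read_excel(file, sheet_name=sheet, **kwargs)

# the per-file options from the first answer still work
df1 = excel_reader_kw('some_file1.xlsx', 'Sheet1', skiprows=1)
df2 = excel_reader_kw('some_file2.xlsx', 'Sheet2', usecols="A,C,E:F")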

Python / glob glob - change datatype during import

I'm looping through all excel files in a folder and appending them to a dataframe. One column (column C) has an ID number. In some of the sheets, the ID is formatted as text and in others it's formatted as a number. What's the best way to change the data type during or after the import so that the datatype is consistent? I could always change them in each excel file before importing but there are 40+ sheets.
for f in glob.glob(path):
    dftemp = pd.read_excel(f, sheetname=0, skiprows=13)
    dftemp['file_name'] = os.path.basename(f)
    df = df.append(dftemp, ignore_index=True)
Don't append to a dataframe in a loop; every append copies the whole dataframe to a new location in memory, which is very slow. Do one single concat after reading all your dataframes:
dfs = []
for f in glob.glob(path):
    df = pd.read_excel(f, sheetname=0, skiprows=13)
    df['file_name'] = os.path.basename(f)
    df['c'] = df['c'].astype(str)
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
It sounds like your ID, that's the c column, is a string but sometimes contains only digits, so pandas reads it as a number in some files. Casting it with astype(str) after each read keeps the dtype consistent; ideally it should always be treated as a string.
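Alternatively, the column can be forced to string while the file is parsed, using read_excel's converters argument, so mixed types never enter the frame; a sketch, assuming the column header is literally 'c' and the glob pattern is a placeholder:
import glob
import os

import pandas as pd

path = "folder/*.xlsx"  # placeholder pattern

dfs = []
for f in glob.glob(path):
    # converters applies str to every cell of column 'c' during parsing
    df = pd.read_excel(f, sheet_name=0, skiprows=13, converters={'c': str})
    df['file_name'] = os.path.basename(f)
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)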
