Python Pandas - Create function to use read_excel() for multiple files

I am trying to make a reusable function in Python that reads two Excel files and saves each to another file.
My function looks like this:
def excel_reader(df1, file1, sheet1, df2, file2, sheet2):
    df1 = pd.read_excel(file1, sheet1)
    df2 = pd.read_excel(file2, sheet2)

def save_to_excel(df1, filename1, sheet1, df2, filename2, sheet2):
    df1.to_excel(filename1, sheet1)
    df2.to_excel(filename2, sheet2)
I am calling the functions as:
excel_reader(df1, 'some_file1.xlsx', 'sheet_name1',
             df2, 'some_file2.xlsx', 'sheet_name2')
save_to_excel(df1, 'some_file1.xlsx', 'sheet_name1',
              df2, 'some_file2.xlsx', 'sheet_name2')
It does not raise any errors, but it does not create the Excel files that the save_to_excel function should produce.
It reads the function arguments up to the df2 parameter and returns an error for the last two.
I will be using pd.read_excel() quite a number of times in my code, so I am trying to wrap it in a function. I am also aware that read_excel() takes the filenames as strings and tried 'somefile.xlsx', but got the same result.
The Excel files to be read are in the same directory as the Python script.
Question: Any advice on how this would work? Is it advisable to make this a function or should I just use read_excel() repetitively?

I don't think this function would improve anything...
Just imagine that you have to call pd.read_excel() with different parameters for different Excel files - for example:
df1 = pd.read_excel(file1, 'Sheet1', skiprows=1)
df2 = pd.read_excel(file2, 'Sheet2', usecols="A,C,E:F")
You will lose all this flexibility using your custom function...
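If a helper is still wanted, here is a minimal sketch (my own illustration, not from the question; read_many and its tuple format are hypothetical) that forwards per-file keyword arguments to pd.read_excel(), so that flexibility is kept:

import pandas as pd

def read_many(*specs, **common_kwargs):
    # Hypothetical helper: each spec is a (filename, sheet_name, extra_kwargs)
    # tuple; extra_kwargs override common_kwargs and are forwarded to read_excel.
    frames = []
    for filename, sheet_name, extra in specs:
        frames.append(pd.read_excel(filename, sheet_name=sheet_name,
                                    **{**common_kwargs, **extra}))
    return frames

df1, df2 = read_many(
    ('some_file1.xlsx', 'Sheet1', {'skiprows': 1}),
    ('some_file2.xlsx', 'Sheet2', {'usecols': 'A,C,E:F'}),
)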

You are missing the return statement. If you just want the function to return the dataframes, then:
def excel_reader(file1, sheet1, file2, sheet2):
    df1 = pd.read_excel(file1, sheet1)
    df2 = pd.read_excel(file2, sheet2)
    return df1, df2

df1, df2 = excel_reader('some_file1.xlsx', 'sheet_name1',
                        'some_file2.xlsx', 'sheet_name2')

Related

Amending dataframe from a generator that reads multiple excel files

My question ultimately is - is it possible to amend in place each dataframe in a generator of dataframes?
I have a series of Excel files in a folder that each have a table in the same format. Ultimately I want to concatenate every file into one large dataframe. They all have unique column headers but share the same indices (historical dates, though possibly across different time frames), so I want to concatenate the dataframes aligned by date. I first created a generator to build a dataframe from the 'Data1' worksheet of each Excel file:
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_excel(f, 'Data1') for f in all_files)  # generator expression
The code below is the formatting that needs to be applied to each dataframe so that I can concatenate them correctly in the final line. I change the index to the date column; there are also some rows that contain irrelevant data.
def format_ABS(df):
    df.drop(labels=range(0, 9), axis=0, inplace=True)
    df.set_index(df.iloc[:, 0], inplace=True)
    df.drop(df.columns[0], axis=1, inplace=True)
However, this doesn't work when I place the function within a generator comprehension (as I am amending all the dataframes in place). The generator produced has no objects. Why doesn't the line below work? Is it because a generator can only be looped through once?
format_df = (format_ABS(x) for x in df_from_each_file)
but
format_ABS(next(df_from_each_file))
does work on each individual dataframe.
The final product is then:
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
I have gotten what I wanted by assigning index_col=0 in the pd.read_excel line, but it got me thinking about generators and amending dataframes in general.
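As a side note, here is a minimal sketch (toy data of my own, not from the question) of the two behaviours at play: a generator expression is lazy and single-use, and a function that amends a dataframe in place returns None, so a generator wrapped around it yields only None values:

import pandas as pd

dfs = (pd.DataFrame({'a': [i]}) for i in range(3))  # lazy, single-use

def mutate(df):
    df['b'] = df['a'] * 2  # amends in place, returns None

results = (mutate(df) for df in dfs)  # nothing has run yet
print(list(results))  # [None, None, None] - mutate() returns None
print(list(dfs))      # [] - the source generator is now exhausted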

Error while appending many Excel files to one in Python

I am trying to append 10 Excel files into one in Python.
The code below was used and I am getting:
TypeError: first argument must be an iterable of pandas objects,
you passed an object of type "DataFrame"
Once I change the sheet_name argument to None, the code runs perfectly.
However, all 10 Excel files have three sheets and I only want a specific sheet from each file.
Is there a way to get it done?
Your help is appreciated.
import pandas as pd
import glob
path = r'Folder path'
filenames = glob.glob(path + "\*.xlsx")
finalexcelsheet = pd.DataFrame()
for file in filenames:
    df = pd.concat(pd.read_excel(file, sheet_name='Selected Sheet'),
                   ignore_index=True, sort=False)
    finalexcelsheet = finalexcelsheet.append(df, ignore_index=True)
I can't test it, but the problem is that you use concat in the wrong way - or rather, you don't need concat in your situation.
concat needs a list of dataframes, like
concat([df1, df2, ...], ...)
but read_excel returns different objects for different values of sheet_name=..., and this causes the problem.
read_excel with sheet_name=None returns a dict with every sheet as a separate dataframe,
{'sheet_1': df_sheet_1, 'sheet_2': df_sheet_2, ...}
and then concat can join them into one dataframe.
read_excel with sheet_name=name returns a single dataframe,
df_sheet
and then concat has nothing to join - so it raises that error.
But this means you don't need concat at all.
You should assign the result of read_excel directly to df:
for file in filenames:
    df = pd.read_excel(file, sheet_name='Selected Sheet')
    finalexcelsheet = finalexcelsheet.append(df, ignore_index=True)
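Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same loop can be written (a sketch, assuming the same filenames and sheet name) by collecting the frames and calling pd.concat once:

# Collect one dataframe per file, then concatenate them in a single call.
frames = [pd.read_excel(file, sheet_name='Selected Sheet') for file in filenames]
finalexcelsheet = pd.concat(frames, ignore_index=True)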

Import several sheets from the same excel into one dataframe in pandas

I have one Excel file with several identically structured sheets (same headers and number of columns; sheet names: 01, 02, ..., 12).
How can I get this into one dataframe?
Right now I would load each sheet separately with:
df1 = pd.read_excel('path.xls', sheet_name='01')
df2 = pd.read_excel('path.xls', sheet_name='02')
...
and would then concatenate them.
What is the most pythonic way to do this and get one dataframe with all the sheets directly? Also assuming I do not know every sheet name in advance.
Read the file as:
collection = pd.read_excel('path.xls', sheet_name=None)
combined = pd.concat([value.assign(sheet_source=key)
                      for key, value in collection.items()],
                     ignore_index=True)
sheet_name=None ensures all the sheets are read in.
collection is a dictionary, with the sheet_name as key, and the actual data as the values. combined uses the pandas concat method to get you one dataframe. I added the extra column sheet_source, in case you need to track where the data for each row comes from.
You can read more about it in the pandas documentation.
You can also use:
df_final = pd.concat([pd.read_excel('path.xls', sheet_name="{:02d}".format(sheet))
                      for sheet in range(1, 13)], axis=0)
Note that since the sheets are named 01 through 12, the range must be range(1, 13), not range(12).

How to output multiple pandas dataframes to the same csv or excel with different dimensions

So I have multiple data tables saved as pandas dataframes, and I want to output all of them into the same CSV for ease of access. However, I am not really sure of the best way to go about this, as I want to maintain each dataframe's inherent structure (i.e. columns and index), so I can't combine them all into one single dataframe.
Is there a method by which I can write them all at once with ease, akin to the usual pd.to_csv method?
Use mode='a':
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, (4, 4)))
df1 = pd.DataFrame(np.random.randint(0, 500, (5, 5)))
df.to_csv('out.csv')
df1.to_csv('out.csv', mode='a')
!type out.csv  # IPython shell escape; 'type' is the Windows equivalent of 'cat'
Output:
,0,1,2,3
0,0,0,36,53
1,5,38,17,79
2,4,42,58,31
3,1,65,41,57
,0,1,2,3,4
0,291,358,119,267,430
1,82,91,384,398,99
2,53,396,121,426,84
3,203,324,262,452,47
4,127,131,460,356,180
For Excel you can do:
from pandas import ExcelWriter

frames = [df1, df2, df3]
saveFile = 'file.xlsx'
writer = ExcelWriter(saveFile)

for x in range(len(frames)):
    sheet_name = 'sheet' + str(x + 1)
    frames[x].to_excel(writer, sheet_name)

writer.save()
You should now have all of your dataframes in 3 different sheets: sheet1, sheet2 and sheet3.
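Note that writer.save() was removed in pandas 2.0 in favour of close(); here is a sketch of the same loop using pd.ExcelWriter as a context manager, which closes the file automatically:

import pandas as pd

frames = [df1, df2, df3]
# The 'with' block saves and closes the workbook on exit.
with pd.ExcelWriter('file.xlsx') as writer:
    for i, frame in enumerate(frames, start=1):
        frame.to_excel(writer, sheet_name='sheet' + str(i))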

Exported and imported DataFrames differ but should be the same

I tried to import some data from an Excel file into a pandas DataFrame, convert it into a csv file, and read it back in (I need to do some further file-based handling on that exported csv file later on, so that is a necessary step).
For the sake of data integrity, the exported and re-imported data should be the same. So I compared the different DataFrames and found that they are not the same, at least according to pandas' .equals() method.
I thought this might be an issue related to string encoding when exporting and re-importing the data, since I had to convert character encodings during file handling. However, I was able to reproduce similar behavior without any encoding-related issues, as follows:
import pandas as pd
import numpy as np
# https://stackoverflow.com/a/32752318
df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))
df1.to_csv('foo.csv', index=False)
df2 = pd.read_csv('foo.csv')
df1.to_csv('bar.csv', index=True)
df3 = pd.read_csv('bar.csv')
print(df1.equals(df2), df1.equals(df3), df2.equals(df3))
print(all(df1 == df2))
Why does .equals() say the DataFrames differ, while all(df1 == df2) says they are equal? According to the docs, .equals() even considers NaNs in the same locations to be equal, whereas df1 == df2 does not. Given this, comparing DataFrames with .equals() is less strict than df1 == df2, yet it does not return the same result in the example I provided.
Which criteria do df1 == df2 and df1.equals(df2) consider that I am not aware of? I assume the implementation inside pandas is correct (I did not look at the implementation in the code itself, but export and re-import should be a standard interface test case). What am I doing wrong, then?
I think that df1.equals(df2) returns False because it takes the DataFrame dtypes into account. df1 should have int32 columns, while df2 should have int64 columns (you can use the info() method to verify this).
You can specify the df2 dtype as follows in order to get the same dtype as df1:
df2 = pd.read_csv('foo.csv', dtype=np.int32)
If the dtypes are the same, .equals() should return True.
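A quick way to verify this (a sketch, assuming the df1 and df2 from the question) is to compare the dtypes directly and re-check equality once they match:

print(df1.dtypes)  # may show int32 (e.g. with numpy's integer default on Windows)
print(df2.dtypes)  # read_csv infers int64
print(df1.equals(df2.astype(df1.dtypes.to_dict())))  # True once dtypes match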
When you write a dataframe to .csv with index=True, the index is saved as an extra column, which comes back as Unnamed: 0 on re-import. That's why the comparison with df3 reports the dataframes as different. But if you write the .csv with index=False, no extra column is added, and the re-imported dataframe equals the input dataframe.
If you don't care about the dataframe index, you can set index=False while writing the dataframe to .csv, or use pd.read_csv('bar.csv').drop(['Unnamed: 0'], axis=1) while reading the csv.
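Alternatively (a sketch, assuming the bar.csv from the question), passing index_col=0 on read restores the saved index instead of importing it as an Unnamed: 0 data column:

df3 = pd.read_csv('bar.csv', index_col=0)
# The index is restored; only the column dtypes may still differ (e.g. int32 vs int64).
print(df1.equals(df3.astype(df1.dtypes.to_dict())))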
