I have my code ready, but I can't work out how to pass the sheet name into the column I have created called "Month" (s['Month'] = 0).
The purpose of the loop is to go to each sheet, clean up the data, and add a column that holds the current sheet name.
import pandas as pd

# list of sheet names
sheets = pd.read_excel('Royalties Jan to Dec 21.xlsx', sheet_name=None).values()
# create an empty dataframe to store the merged data
merged_df = []
# loop through the sheet names and read each sheet into a dataframe
for s in sheets:
    s['Month'] = 0  ## This is the line I need to change so instead of 0 I get Sheet Name.
    s = s.fillna('')
    s.columns = (s.iloc[2] + ' ' + s.iloc[3])
    s = s[s.iloc[:, 0] == 'TOTAL']
    # append the data from each sheet to the merged dataframe
    merged_df.append(s)
merged_df = pd.concat(merged_df)
merged_df
Any help would be appreciated! Thank you!
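A minimal sketch of one way to get the sheet name into the column: `pd.read_excel(..., sheet_name=None)` returns a dictionary keyed by sheet name, so iterating over `.items()` instead of `.values()` gives both the name and the DataFrame. The helper name `merge_with_sheet_name` and the sample data are made up for illustration, and the per-sheet cleanup from the question is left out.

```python
import pandas as pd


def merge_with_sheet_name(sheets):
    """Concatenate a {sheet_name: DataFrame} mapping, tagging each row with its sheet."""
    frames = []
    for name, df in sheets.items():
        df = df.copy()          # avoid modifying the caller's DataFrame
        df["Month"] = name      # the dictionary key is the sheet name
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Stand-in for pd.read_excel('Royalties Jan to Dec 21.xlsx', sheet_name=None)
sheets = {
    "Jan": pd.DataFrame({"Item": ["TOTAL"], "Amount": [10]}),
    "Feb": pd.DataFrame({"Item": ["TOTAL"], "Amount": [20]}),
}
merged_df = merge_with_sheet_name(sheets)
print(merged_df)
```

In the original loop, the "Month" assignment would probably need to come after the line that rebuilds `s.columns`, otherwise the new column gets renamed along with the others.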
I have a scenario where I need to read an Excel spreadsheet with multiple sheets inside and process each sheet separately.
The sheets inside the Excel workbook are named something like [sheet1, data1, data2, data3, summary, reference, other_info, old_records].
I need to read only the sheets [reference, data1, data2, data3].
I can hardcode the name reference, which is static every time, but the names data1, data2, data3 are not static: there may be data1 only, or data1 and data2, or (e.g.) data1, data2, ..., data(n).
Whatever the count of the sheets is, it will remain the same across all files (e.g. it is not allowed to have Data1, Data2 in one file and Data1, Data2, Data3 in another; just to clarify the requirement).
I can check the names by using the following code:
readall = [key for key in pd.read_excel(path, sheet_name=None) if 'Data' in key]
for n in range(0, len(readall)):
    sheetname = readall[n]
    dfname = df_list[n]  # trying to create dynamic dataframe names so that we can create separate tables at the end

for s in allsheets:
    sheetname = s
    data_df = readfile(path, s, "'Data1!C5'")  # function to read excel file into dataframe
    df_ref = readreference(path, s, "'Reference!A1'")
df_ref is the same for all sheets in a workbook, and data_df is joined with the reference. (Just adding this as info; there is some other processing that needs to be done as well, which I have already done.)
The above is sample code to read a particular sheet.
My problem is:
I have multiple Excel files (around 100 files) to read.
Matching sheets from all files should be combined together, e.g. 'Data1' from file1 should be combined with Data1 from file2, Data1 from file3, and so on. Similarly, Data2 from all files should be combined into a separate dataframe (all sheets have the same columns).
Separate Delta tables should be created for each tab, e.g. the table for Data1 should be something like Customers_Data1, the table for Data2 should be Customers_Data2, and so on.
Any help on this, please?
Thanks
Resolved my Issue through the following code.
final_dflist = []
sheet_names = [['Data1', 'CUSTOMER_DATA1'], ['Data2', 'CUSTOMER_DATA2']]
for shname in sheet_names:
    final_df = spark.createDataFrame([], final_schema)
    print(f'Initializing final df - record count: {final_df.count()}')
    sheetname = shname[0]
    dfname = shname[1]
    print(f'Processing: {sheetname}')
    for file in historic_files:
        fn = file.rsplit('/', 1)[-1]
        fpath = '/dbfs' + file
        print(f'Reading file:{fn} -->{sheetname}')
        indx_df = pd.read_excel(fpath, sheet_name='Index', skiprows=None)
        for index, row in indx_df.iterrows():
            if row[0] == 'Data Item Description':
                row_index = 'A' + str(index + 2)
        df_index = read_index(file, 'Index', row_index)
        df_index = df_index.selectExpr('`Col 1` as co_1', 'Col2', 'Col3', 'Col4')
        df_data = read_data(file, sheetname, 'A10')
        # Join Data with index here
        # Drop null columns from final df
        df = combine_df.na.drop("all")
    exec(f'{dfname} = final_df.select("*")')
    final_dflist.append([dfname, final_df])
    print(f'Data Frame created:{dfname}')
    print(f'final df - record count: {final_df.count()}')
Any suggestions to improve this ?
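One possible improvement, sketched here with plain pandas rather than Spark so that it runs on its own: keep the combined frames in a dictionary keyed by table name instead of creating variables with exec. The function name `combine_matching_sheets` and the sample frames are hypothetical; each workbook is assumed to be a {sheet_name: DataFrame} dictionary such as `pd.read_excel(path, sheet_name=None)` returns.

```python
import pandas as pd


def combine_matching_sheets(workbooks, table_names):
    """Stack the same-named sheet from every workbook into one frame per table.

    workbooks:   list of {sheet_name: DataFrame} dictionaries, one per file
    table_names: {sheet_name: table_name}, e.g. {"Data1": "CUSTOMER_DATA1"}
    """
    combined = {}
    for sheet_name, table_name in table_names.items():
        frames = [wb[sheet_name] for wb in workbooks if sheet_name in wb]
        if frames:  # pd.concat raises on an empty list, so skip missing sheets
            combined[table_name] = pd.concat(frames, ignore_index=True)
    return combined


# Made-up stand-ins for two files that share the same sheet layout
file1 = {"Data1": pd.DataFrame({"id": [1, 2]}), "Data2": pd.DataFrame({"id": [10]})}
file2 = {"Data1": pd.DataFrame({"id": [3]}), "Data2": pd.DataFrame({"id": [20, 30]})}

tables = combine_matching_sheets(
    [file1, file2],
    {"Data1": "CUSTOMER_DATA1", "Data2": "CUSTOMER_DATA2"},
)
for name, frame in tables.items():
    print(name, len(frame))
```

A dictionary lookup such as `tables["CUSTOMER_DATA1"]` then replaces the dynamically named variable, and the same idea should carry over to Spark DataFrames, for example by unioning them with unionByName.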
Hi :) I am trying to write a for loop to reduce redundancy in my code, where I need to access a number of different sheets within an Excel file, count the number of specific values, and later plot a graph.
My for loop looks like this at the moment:
df = pd.read_excel('C:/Users/julia/OneDrive/Documents/python assignment/2016 data -EU values.xlsx',
                   skiprows=6)
sheets_1 = ["PM10 ", "PM2.5", "O3 ", "NO2 ", "BaP", "SO2"]
resultM1 = 0
for sheet in sheets_1:
    print(sheet[0:5])
    for row in df.iterrows():
        if row[1]['Country'] == 'Malta':
            resultM1 += row[1]['AirPollutionLevel']
print(resultM1)
I would like the output to look something like this:
PM10 142
PM2.5 53
O3 21
NO2 3
BaP 21
SO2 32
But what I'm getting is just the sheet names printed one after another, followed by the total of the specific value I need across all sheets, i.e.
PM10
PM2.5
O3
NO2
BaP
SO2
284.913786
I really need the values separated by their respective sheet and not added together.
Attached is a screenshot of the Excel file. As you can see, there are different sheets with many values in them; I need to add up the values for a specific country in each sheet.
Any help would be greatly appreciated!
import pandas as pd

# Open as Excel object
xls = pd.ExcelFile('C:/Users/julia/OneDrive/Documents/python assignment/2016 data -EU values.xlsx')
# Get sheet names
sheets_1 = xls.sheet_names
# Dictionary of SheetNames:dfOfSheets
sheet_to_df_map = {}
for sheet_name in xls.sheet_names:
    sheet_to_df_map[sheet_name] = xls.parse(sheet_name)
# create list to store results
resultM1 = []
# Loop over keys and df in dictionary
for key, df in sheet_to_df_map.items():
    # remove top 5 blank rows
    df = df.iloc[5:]
    # set column names as first row values
    headers = df.iloc[0]
    df = pd.DataFrame(df.values[1:], columns=headers)
    # select the rows for Malta and keep their values as a pd series
    results = df.loc[df['Country'] == "Malta", 'AirPollutionLevel']
    # Loop over the results for Malta from each sheet; append the sheet name and then the value to a list
    for i in results:
        resultM1.append(key)
        resultM1.append(i)
# Convert list to df
df = pd.DataFrame(resultM1)
# Rename column
df = df.rename({0: 'Sheet'}, axis=1)
# create two columns
final = pd.DataFrame({'Sheet': df['Sheet'].iloc[::2].values, 'Value': df['Sheet'].iloc[1::2].values})
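A more compact sketch of the same idea, in case one total per sheet is what is wanted: build one row per sheet directly instead of interleaving names and values in a list. It assumes each sheet has already been loaded with its real header row (for example with skiprows, as in the question), so that 'Country' and 'AirPollutionLevel' are column names. The function name `total_per_sheet` and the sample numbers are invented for illustration.

```python
import pandas as pd


def total_per_sheet(sheets, country, value_col="AirPollutionLevel"):
    """Return one row per sheet with the summed value_col for the given country."""
    rows = []
    for name, df in sheets.items():
        mask = df["Country"] == country
        rows.append({"Sheet": name.strip(), "Value": df.loc[mask, value_col].sum()})
    return pd.DataFrame(rows)


# Made-up stand-in for pd.read_excel(path, sheet_name=None, skiprows=6)
sheets = {
    "PM10 ": pd.DataFrame(
        {"Country": ["Malta", "Italy", "Malta"], "AirPollutionLevel": [1.5, 9.0, 2.5]}
    ),
    "SO2": pd.DataFrame({"Country": ["Malta"], "AirPollutionLevel": [4.0]}),
}
result = total_per_sheet(sheets, "Malta")
print(result)
```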
I am working with a large Excel file with 22 sheets, where each sheet has the same column headings but not the same number of rows. I would like to obtain the mean values (excluding zeros) for columns AA to AX for all 22 sheets. The columns have titles, which I use in my code.
Rather than reading each sheet, I want to loop through the sheets and get as output the mean values.
With help from answers to other posts, I have this:
import pandas as pd

xls = pd.ExcelFile('myexcelfile.xlsx')
xls.sheet_names
#print(xls.sheet_names)
out_df = pd.DataFrame()
for sheets in xls.sheet_names:
    df = pd.read_excel('myexcelfile.xlsx', sheet_names= None)
    df1= df[df[:]!=0]
    df2=df1.loc[:,'aa':'ax'].mean()
    out_df.append(df2)  ## This will append rows of one dataframe to another(just like your expected output)
print(out_df2)
## out_df will have data from all the sheets
The code works so far, but only for one of the sheets. How do I get it to work for all 22 sheets?
You can use numpy to perform basic math on a pandas.DataFrame or pandas.Series.
Take a look at my code below:
import pandas as pd, numpy as np

XL_PATH = r'C:\Users\YourName\PythonProject\Book1.xlsx'
xlFile = pd.ExcelFile(XL_PATH)
xlSheetNames = xlFile.sheet_names
dfList = []  # variable to store all DataFrame
for shName in xlSheetNames:
    df = pd.read_excel(XL_PATH, sheet_name=shName)  # read sheet X as DataFrame
    dfList.append(df)  # put DataFrame into a list
for df in dfList:
    print(df)
    dfAverage = np.average(df)  # use numpy to get DataFrame average
    print(dfAverage)
# Try the code below
import os

import numpy as np
import pandas as pd

XL_PATH = "YOUR EXCEL FULL PATH"
SH_NAMES = []  # will contain the list of Excel sheet names
DF_DICT = {}  # will contain the dictionary of DataFrames


def readExcel():
    if not os.path.isfile(XL_PATH):
        raise FileNotFoundError(XL_PATH)
    sh_names = pd.ExcelFile(XL_PATH).sheet_names
    # pandas.read_excel() has the argument 'sheet_name';
    # when you pass a list to 'sheet_name',
    # pandas returns a dictionary of DataFrames with the sheet names as keys
    df_dict = pd.read_excel(XL_PATH, sheet_name=sh_names)
    return sh_names, df_dict


# Now you have DF_DICT that contains a DataFrame for each sheet in the Excel file
SH_NAMES, DF_DICT = readExcel()


# Next step is to append all rows of data from Sheet1 to SheetX
# This only works if all DataFrames have the same columns
def appendAllSheets():
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    return pd.concat(list(DF_DICT.values()), ignore_index=True)


# you can now call the function as below:
dfWithAllData = appendAllSheets()
# now you have one DataFrame with all rows combined from Sheet1 to SheetX
# you can clean the data, for example drop all rows that contain 0 in a column
dfZero_Removed = dfWithAllData[dfWithAllData['Column_Name'] != 0]
dfNA_Removed = dfWithAllData[~pd.isna(dfWithAllData['Column_Name'])]
# last step, to find the average or another math operation,
# just let numpy do the job
average_of_all_1 = np.average(dfZero_Removed)
average_of_all_2 = np.average(dfNA_Removed)
# show result
# this shows the average of all
# rows of data from Sheet1 to SheetX in your target Excel file
print(average_of_all_1, average_of_all_2)
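For the per-sheet means asked about in the question (columns 'aa' to 'ax', zeros excluded), a minimal sketch could mask the zeros and let `mean()` skip the resulting missing values. The function name `nonzero_means` and the tiny sample frames are assumptions for illustration; the input is a {sheet_name: DataFrame} dictionary such as `pd.read_excel('myexcelfile.xlsx', sheet_name=None)` returns.

```python
import pandas as pd


def nonzero_means(sheets, first_col, last_col):
    """Return one row per sheet holding the column means over non-zero values."""
    means = {}
    for name, df in sheets.items():
        block = df.loc[:, first_col:last_col]   # label-based column slice, inclusive
        # where() turns zeros into NaN, and mean() ignores NaN by default
        means[name] = block.where(block != 0).mean()
    # dict of Series -> columns are sheet names; transpose to get one row per sheet
    return pd.DataFrame(means).T


# Made-up stand-in for two sheets with identical columns but different rows
sheets = {
    "S1": pd.DataFrame({"aa": [0, 2, 4], "ab": [1, 0, 3], "zz": [9, 9, 9]}),
    "S2": pd.DataFrame({"aa": [0, 0, 6], "ab": [5, 5, 0], "zz": [1, 1, 1]}),
}
out_df = nonzero_means(sheets, "aa", "ab")
print(out_df)
```

Reading with `sheet_name=None` (note the singular keyword) loads every sheet in a single call, so the file does not have to be reopened inside the loop.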
How do I loop through my Excel sheets and add each 'Adjusted Close' to a dataframe? I want to combine all the adjusted closes and make a stock index.
When I try the code below, the dataframe Percent_Change is empty.
xls = pd.ExcelFile('databas.xlsx')
countSheets = len(xls.sheet_names)
Percent_Change = pd.DataFrame()
x = 0
for x in range(countSheets):
    data = pd.read_excel('databas.xlsx', sheet_name=x, index_col='Date')
    # Calculate the percent change from day to day
    Percent_Change[x] = pd.Series(data['Adj Close'].pct_change()*100, index=Percent_Change.index)
stock_index = data['Percent_Change'].cumsum()
Unfortunately I do not have the data to replicate your complete example. However, there appears to be a bug in your code.
You are looping over "x", and "x" takes integer values from range(countSheets). You probably want to loop over the sheet names and add them to your DF. If you want to do that, your code should be:
import pandas as pd

xls = pd.ExcelFile('databas.xlsx')
# pep8 unto thyself only, it is conventional to use "_" instead of camelCase or to avoid longer names if at all possible
sheets = xls.sheet_names
Percent_Change = pd.DataFrame()
# using sheet instead of x is more "pythonic"
for sheet in sheets:
    data = pd.read_excel('databas.xlsx', sheet_name=sheet, index_col='Date')
    # Calculate the percent change from day to day; assign the Series directly,
    # because reindexing it to the still-empty Percent_Change.index drops every row
    Percent_Change[sheet] = data['Adj Close'].pct_change() * 100
# data has no 'Percent_Change' column, so take the cumulative sum of the collected frame
stock_index = Percent_Change.cumsum()
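A small sketch of why Percent_Change comes out empty in the question, plus one alternative way to build it. Passing `index=Percent_Change.index` to `pd.Series` reindexes the percent changes to the index of a DataFrame that has no rows yet, so nothing survives. Collecting one Series per sheet and joining them with `pd.concat(..., axis=1)` avoids that. The function name `percent_change_table`, the sheet names, and the prices are made up; the input stands in for `pd.read_excel('databas.xlsx', sheet_name=None, index_col='Date')`.

```python
import pandas as pd


def percent_change_table(sheets, column="Adj Close"):
    """Return one column of day-to-day percent changes per sheet, aligned on the index."""
    changes = {name: df[column].pct_change() * 100 for name, df in sheets.items()}
    # The dictionary keys become the column names
    return pd.concat(changes, axis=1)


dates = pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"])
sheets = {
    "AAA": pd.DataFrame({"Adj Close": [100.0, 110.0, 99.0]}, index=dates),
    "BBB": pd.DataFrame({"Adj Close": [50.0, 50.0, 75.0]}, index=dates),
}

# What the question's code does: reindex to the index of an empty DataFrame
lost = pd.Series(sheets["AAA"]["Adj Close"], index=pd.DataFrame().index)
print(len(lost))  # no rows survive the reindexing

Percent_Change = percent_change_table(sheets)
stock_index = Percent_Change.cumsum()
print(Percent_Change)
```

How the per-sheet changes are then combined into a single index (sum, mean, weighting) is a separate choice that the question leaves open.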
I am trying to accomplish something that seems very simple using pandas, but I am getting stuck.
I want to merge multiple spreadsheets (each with multiple sheets) into one single MasterSpreadSheet with all the sheets.
input example:
spreadsheet1 -> sheetname_a, sheetname_b, sheetname_c, sheetname_d
spreadsheet2 -> sheetname_a, sheetname_b, sheetname_c, sheetname_d
spreadsheet3 ......
output desired:
one single file with the data from all spreadsheets, separated by the specific sheet name
MasterSpreadSheet -> sheetname_a, sheetname_b, sheetname_c, sheetname_d
Here is my code that generates that single MasterSpreadSheet, but it overwrites the previous spreadsheet's data, leaving the MasterFile with only the data from the last spreadsheet:
with pd.ExcelWriter(outputfolder + '/' + country + '-MasterSheet.xlsx') as writer:
    for spreadsheet in glob.glob(os.path.join(outputfolder, '*-Spreadsheet.xlsx')):
        sheets = pd.ExcelFile(spreadsheet).sheet_names
        for sheet in sheets:
            df = pd.DataFrame()
            sheetname = sheet.split('-')[-1]
            data = pd.read_excel(spreadsheet, sheet)
            data.index = [basename(spreadsheet)] * len(data)
            df = df.append(data)
            df.to_excel(writer, sheet_name = sheetname)
    writer.save()
writer.close()
Suggestions ?
Thank you !
Got that working now :). I loop over the sheets first and then over the spreadsheet files, appending each file's data for that sheet, and I added a pandas concat at the end of each sheet's loop:
df1 = []
sheet_list = []
sheet_counter = 0
with pd.ExcelWriter(outputfolder + '/' + country + '-MasterSheet.xlsx') as writer:
    for template in glob.glob(os.path.join(templatefolder, '*.textfsm')):
        template_name = template.split('\\')[-1].split('.textfsm')[0]
        sheet_list.append(template_name)  ## List of Sheets per Spreadsheet file
    for sheet in sheet_list:
        for spreadsheet in glob.glob(os.path.join(outputfolder, '*-Spreadsheet.xlsx')):
            data = pd.read_excel(spreadsheet, sheet_counter)
            data.index = [basename(spreadsheet)] * len(data)
            df1.append(data)
        df1 = pd.concat(df1)
        df1.to_excel(writer, sheet)
        df1 = []
        sheet_counter += 1  ## Adding a counter to get the next Sheet of each Spreadsheet
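A sketch of an alternative that groups by sheet name rather than by a position counter, in case the sheet order ever differs between files. It works on already-loaded data, {file_name: {sheet_name: DataFrame}}, which could be built with `pd.read_excel(path, sheet_name=None)` per file; the function name `merge_by_sheet` and the sample data are hypothetical. Passing a dictionary to `pd.concat` uses its keys as an outer index level, which gives the file name per row without building the index by hand.

```python
import pandas as pd


def merge_by_sheet(workbooks):
    """Turn {file_name: {sheet_name: DataFrame}} into {sheet_name: merged DataFrame}."""
    # Collect sheet names in first-seen order
    sheet_names = []
    for sheets in workbooks.values():
        for name in sheets:
            if name not in sheet_names:
                sheet_names.append(name)

    merged = {}
    for name in sheet_names:
        per_file = {
            fname: sheets[name] for fname, sheets in workbooks.items() if name in sheets
        }
        # Keys become the outer index level; drop the inner row numbers
        merged[name] = pd.concat(per_file).droplevel(1)
    return merged


# Made-up stand-in for two loaded spreadsheet files
workbooks = {
    "a-Spreadsheet.xlsx": {
        "sheetname_a": pd.DataFrame({"x": [1, 2]}),
        "sheetname_b": pd.DataFrame({"x": [3]}),
    },
    "b-Spreadsheet.xlsx": {"sheetname_a": pd.DataFrame({"x": [4]})},
}
merged = merge_by_sheet(workbooks)
print(merged["sheetname_a"])
```

Each merged frame can then be written once, with `to_excel(writer, sheet_name=name)` inside a single `pd.ExcelWriter` block, so no sheet is written twice.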