I am trying to read multiple Excel files in a loop using read_excel.
The files contain sheets whose names include the word "staff",
e.g. Staff_2013, Staff_list, etc.
Is there a way to read all of these sheets dynamically using some kind of wildcard?
Something like the code below:
df = pd.read_excel(folder,col_names=True,sheet_name='Staff*')
You can list the sheets and select the ones you want to read one by one.
For instance:
xls_file = pd.ExcelFile('my_excel_file.xls')
staff_fnames = [sheet for sheet in xls_file.sheet_names if sheet.startswith('Staff')]
for staff_fname in staff_fnames:
    df = pd.read_excel(xls_file, sheet_name=staff_fname)
Or, if you don't mind loading all the sheets, you can also use sheet_name=None to load all sheets in a dict and filter afterwards:
dfs_dict = pd.read_excel('my_excel_file.xls', sheet_name=None)
dfs_dict = {s: df for s, df in dfs_dict.items() if s.startswith('Staff')}
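If a shell-style wildcard such as 'Staff*' is really wanted, one option is to match the sheet names with the standard-library fnmatch module and loop over the workbooks with glob. This is only a sketch: the helper names matching_sheets and read_matching_sheets, the '*.xls*' file pattern, and the case-insensitive matching are assumptions made for illustration, not part of pandas.

```python
import fnmatch
import glob
import os

import pandas as pd


def matching_sheets(sheet_names, pattern):
    """Return the sheet names that match a shell-style pattern, ignoring case."""
    return [name for name in sheet_names
            if fnmatch.fnmatchcase(name.lower(), pattern.lower())]


def read_matching_sheets(folder, pattern="Staff*"):
    """Read every matching sheet from every workbook in `folder` (hypothetical helper).

    Returns a dict keyed by (file name, sheet name).
    """
    frames = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.xls*"))):
        with pd.ExcelFile(path) as xls:
            for sheet in matching_sheets(xls.sheet_names, pattern):
                frames[(os.path.basename(path), sheet)] = xls.parse(sheet)
    return frames
```

Keying the result by file name and sheet name keeps sheets with the same name in different workbooks apart.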
This is my first time using pandas. I am sure the answer is something along the lines of storing the worksheet names in a list, then looping through the list based on the name I am looking for. I'm just not experienced enough to know how to do that.
The goal is to use pandas to extract and concatenate data from multiple worksheets of a user-selected workbook. The final output is a single-worksheet Excel file containing all the data extracted from the various worksheets.
The Excel workbook consists of approximately 100 worksheets. The number of visible sheets will always vary, and the number of sheets occurring before 'Main Frames BUP1' varies as well.
I currently have the portion of code that checks sheet visibility working. I cannot figure out how to start at a specific worksheet when that worksheet's position in the workbook can vary (i.e. it is not always the 3rd worksheet counting from 0; it could be the 5th in a user's workbook). It will, however, always be the sheet from which data should start being pulled. Everything I find shows examples of specifying specific sheets to read.
Any help or direction would be appreciated.
# user selected file from GUI
xl = values["-xl_file-"]
loc = os.path.dirname(xl)
xls = pd.ExcelFile(xl)
sheets = xls.book.worksheets
for x in sheets:
    print(x.title, x.sheet_state)
    if x.sheet_state == 'visible':
        df = pd.concat(pd.read_excel(xls, sheet_name=None, header=None,
                                     skiprows=5, nrows=32, usecols='M:AD'), ignore_index=True)
writer = pd.ExcelWriter(f'{loc}/test.xlsx')
df.to_excel(writer, 'bananas')
writer.save()
Additional clarification on the final goal: exclude all sheets occurring before 'Main Frames BUP 1', consider only visible sheets, pull data from 'M6:AD37', skip (or at least remove) rows that are entirely blank, and stop pulling data at the sheet just before a worksheet whose name partially matches 'panel'.
If I create a dictionary of visible sheets, how do I build from it a new dictionary that runs only from 'Main Frames BUP 1' to the sheet just before a partial match of 'panel'? Then I can use that dictionary for my data pull.
I created a minimal sample myself and worked it out for you.
xls = pd.ExcelFile('data/Test.xlsx')
sheets = xls.book.worksheets
sList = [x.title for x in sheets if x.sheet_state == 'visible']
dfs = [pd.read_excel('data/Test.xlsx', sheet_name=s, skiprows=5, nrows=32, usecols='M:AD') for s in sList]
dfconcat = pd.concat(dfs)
Now you need to adjust the columns, headers and so on as you did in your question. I hope it works out for you; on my side it worked like a charm.
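The question also asks to start at 'Main Frames BUP 1' and to stop just before the first sheet whose name partially matches 'panel'. A minimal sketch of that slicing on a list of sheet names (such as sList above) could look like this; the function name select_sheet_range and the case-insensitive match are assumptions made for illustration.

```python
def select_sheet_range(names, start="Main Frames BUP 1", stop_substring="panel"):
    """Return the names from `start` up to, but not including, the first
    later name that contains `stop_substring` (case-insensitive)."""
    if start not in names:
        return []
    begin = names.index(start)
    selected = []
    for name in names[begin:]:
        if stop_substring.lower() in name.lower():
            break
        selected.append(name)
    return selected
```

Because the start sheet is located by name rather than by position, it does not matter how many sheets come before it. The filtered list can then be used in place of sList when reading the sheets.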
It is a bit hard to say without actually seeing what is going on with your data.
I believe what you are missing is that you need to create one dataframe first and then concatenate the others onto it. You also need to pass a single sheet (x) so that pandas returns a DataFrame; with sheet_name=None it returns a dictionary instead. If that does not work, read the first sheet into a df and then concatenate the rest.
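# user selected file from GUI
xl = values["-xl_file-"]
loc = os.path.dirname(xl)
xls = pd.ExcelFile(xl)
sheets = xls.book.worksheets
df = pd.DataFrame()
for x in sheets:
    print(x.title, x.sheet_state)
    if x.sheet_state == 'visible':
        # read_excel expects the sheet name, so pass x.title rather than the worksheet object
        sheet_df = pd.read_excel(xls, sheet_name=x.title, header=None,
                                 skiprows=5, nrows=32, usecols='M:AD')
        # pd.concat takes a list of DataFrames
        df = pd.concat([df, sheet_df], ignore_index=True)
# the context manager saves the file on exit (writer.save() is not available in recent pandas versions)
with pd.ExcelWriter(f'{loc}/test.xlsx') as writer:
    df.to_excel(writer, sheet_name='bananas')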
You can also put all the dfs in a dictionary; again, it is difficult to be specific without knowing what you are working with.
xl = pd.ExcelFile('yourFile.xlsx')
# collect the names of the visible sheets (sheet_state lives on the worksheet objects, not on the names)
sheets = [ws.title for ws in xl.book.worksheets if ws.sheet_state == 'visible']
# build a dictionary from all sheets by passing None to sheet_name
diDF = pd.read_excel(xl, sheet_name=None)
# keep only the visible sheets
dfs = {k: diDF[k] for k in sheets}
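The question also asks that entirely blank rows be left out of the result. One way to do that, sketched here under the assumption that blank cells are read as missing values (NaN), is to concatenate the frames and then call dropna(how="all"). The function name combine_without_blank_rows is hypothetical.

```python
import pandas as pd


def combine_without_blank_rows(frames):
    """Concatenate DataFrames and drop rows in which every cell is missing."""
    combined = pd.concat(list(frames), ignore_index=True)
    return combined.dropna(how="all").reset_index(drop=True)
```

With a dictionary of sheets this would be called as combine_without_blank_rows(dfs.values()). Cells that contain only whitespace are not missing values, so rows made of such cells would survive this filter.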
I'm currently creating a dataframe from an Excel spreadsheet in pandas. Most of the files contain only one sheet. In some files, however, the sheet I need is not the first one. The sheets I need have the same name format in every file: 'ITD_XXX_XXXX'. Is there a way to tell pandas to select the sheet whose name has this form? Something like:
df = pd.read_excel(path, sheet_name = contains('ITD_'))
so that pandas would only read the sheet whose name starts with 'ITD_'?
Cheers.
I think the answer here would probably give you what you need.
Open the file as an ExcelFile before reading it into a dataframe. Get the sheet_names, then pick the sheet name that starts with 'ITD_'.
excel = pd.ExcelFile("your_excel.xlsx")
excel.sheet_names
# ["Sheet1", "Sheet2"]
for n in excel.sheet_names:
    if n.startswith('ITD_'):
        sheetname = n
        break
df = excel.parse(sheetname)
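If a workbook has no sheet starting with 'ITD_', the loop above leaves sheetname undefined. A small helper that returns None in that case makes the situation explicit; the name find_sheet is hypothetical and this is only a sketch.

```python
def find_sheet(sheet_names, prefix="ITD_"):
    """Return the first sheet name starting with `prefix`, or None if there is none."""
    return next((name for name in sheet_names if name.startswith(prefix)), None)


# Possible use with the ExcelFile from above:
#   sheetname = find_sheet(excel.sheet_names)
#   if sheetname is not None:
#       df = excel.parse(sheetname)
```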
I have one Excel file with several identically structured sheets (same headers and number of columns; sheet names: 01, 02, ..., 12).
How can I get this into one dataframe?
Right now I would load each sheet separately with:
df1 = pd.read_excel('path.xls', sheet_name='01')
df2 = pd.read_excel('path.xls', sheet_name='02')
...
and would then concatenate them.
What is the most Pythonic way to get one dataframe with all the sheets directly? Also assume I do not know every sheet name in advance.
Read the file as:
collection = pd.read_excel('path.xls', sheet_name=None)
combined = pd.concat([value.assign(sheet_source=key)
                      for key, value in collection.items()],
                     ignore_index=True)
sheet_name=None ensures all the sheets are read in.
collection is a dictionary with the sheet name as key and the sheet's data as value. combined uses the pandas concat function to produce one dataframe. I added the extra column sheet_source in case you need to track which sheet each row comes from.
You can read more about it in the pandas docs.
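As an alternative sketch, pd.concat also accepts the dictionary itself and uses its keys as an outer index level, which can then be turned into a column. The small in-memory collection below is a stand-in for the result of read_excel, and the level name "row" is an arbitrary choice.

```python
import pandas as pd

# Stand-in for pd.read_excel('path.xls', sheet_name=None): sheet name -> DataFrame
collection = {
    "01": pd.DataFrame({"value": [1, 2]}),
    "02": pd.DataFrame({"value": [3]}),
}

# The dict keys become the outer index level, named here "sheet_source"
combined = pd.concat(collection, names=["sheet_source", "row"])
# Move the sheet name out of the index into an ordinary column
combined = combined.reset_index(level="sheet_source").reset_index(drop=True)
```

Both variants give one row per source row plus a sheet_source column; which one reads better is largely a matter of taste.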
You can use (the sheets are named 01 to 12, so the range has to run from 1 to 12):
df_final = pd.concat([pd.read_excel('path.xls', sheet_name="{:02d}".format(sheet)) for sheet in range(1, 13)], axis=0)
Does anyone know how to delete the same single column from multiple xlsx sheets using Python?
After that, the sheets should be saved to the same path.
First, create a list of the Excel workbook paths and store it in a variable called "files":
dfs = []
for file in files:
    # sheet_name=None returns a dict of {sheet name: DataFrame}
    sheets = pd.read_excel(file, sheet_name=None)
    dfs.append(sheets)
col_to_delete = ['Your columns']
for sheets in dfs:
    for df in sheets.values():
        # drop the columns from each sheet, one by one
        df.drop(columns=col_to_delete, inplace=True)
After that, I don't know whether you want to merge all the .xlsx files or overwrite them, but that is another question.
Look at the documentation for pd.ExcelWriter to save the files.
Hope it helps
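For the saving step, a minimal sketch that overwrites each workbook in place might look like the following. The function names drop_column and rewrite_workbook are hypothetical, and the approach assumes that only the cell values matter: writing the sheets back with pandas does not preserve formatting or formulas.

```python
import pandas as pd


def drop_column(sheets, column):
    """Return a new {sheet name: DataFrame} dict with `column` removed where present."""
    return {name: df.drop(columns=[column], errors="ignore")
            for name, df in sheets.items()}


def rewrite_workbook(path, column):
    """Read all sheets, drop `column`, and overwrite the workbook at `path`."""
    sheets = drop_column(pd.read_excel(path, sheet_name=None), column)
    with pd.ExcelWriter(path) as writer:
        for name, df in sheets.items():
            df.to_excel(writer, sheet_name=name, index=False)
```

Calling rewrite_workbook(file, 'Your column') for each entry of files would then update every workbook. Passing errors="ignore" means sheets that lack the column are written back unchanged instead of raising an error.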
I have an excel file with multiple sheets that I convert into a dictionary of dataframes where the key represents the sheet's name:
xl = pd.ExcelFile(r"D:\Python Code\PerformanceTable.xlsx")
pacdict = { i : pd.read_excel(xl, i) for c, i in enumerate(xl.sheet_names, 1)}
I would like to replace this input Excel file with a flat text file, but would still like to end up with the same outcome: a dictionary of dataframes.
Any suggestions on how I might format the text file so that it still contains data for multiple named tables/sheets and can be read into the above format? Preferably still letting pandas' built-in functionality do the heavy lifting.
Loop through each sheet. Create a new column called "sheet_source". Concatenate the sheet dataframes to a master dataframe. Lastly export to CSV file.
# create a master dataframe to store the sheets
df_master = pd.DataFrame()
# loop through each dict key
for each_df_key in pacdict.keys():
    # dataframe for each sheet
    sheet_df = pacdict[each_df_key]
    # add column for sheet name
    sheet_df['sheet_source'] = each_df_key
    # concatenate each sheet to the master
    df_master = pd.concat([df_master, sheet_df])
# after the for-loop, export the master dataframe to CSV
df_master.to_csv('new_dataframe.csv', index=False)
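To get back from the flat CSV to a dictionary of dataframes, the file can be grouped on the sheet_source column. The sketch below uses a small in-memory CSV as a stand-in for 'new_dataframe.csv'; the function name csv_to_sheet_dict is hypothetical, and it assumes every sheet shares the same columns.

```python
import io

import pandas as pd


def csv_to_sheet_dict(source, key="sheet_source"):
    """Split a flat CSV into {sheet name: DataFrame} using the `key` column."""
    # Read the key column as text so names such as "01" keep their leading zero
    flat = pd.read_csv(source, dtype={key: str})
    return {name: group.drop(columns=key).reset_index(drop=True)
            for name, group in flat.groupby(key, sort=False)}


# Small in-memory example standing in for 'new_dataframe.csv'
example = io.StringIO("a,b,sheet_source\n1,2,Sheet1\n3,4,Sheet1\n5,6,Sheet2\n")
tables = csv_to_sheet_dict(example)
```

If the sheets have different columns, the flat file will contain empty columns for each sheet, which could be removed per table with dropna(axis=1, how="all").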