Choose A Specific Sheet In Excel Containing a String Pandas - python

I'm currently creating a dataframe from an excel spreadsheet in Pandas. For most of the files, they only contain 1 sheet. However, with some of the files that I have the sheet is not the first sheet. However, all of the sheets in all of the files have the same format. They have 'ITD_XXX_XXXX'. Is there a way to input into pandas to select the sheet that has the form.
df = pd.read_excel(path, sheet_name = contains('ITD_')
Here pandas would only select data from the sheet that has the string 'ITD_' in front of it?
Cheers.

I think the answer here would probably give you what you need.
Bring in the file as an Excelfile before reading it as a dataframe. Get the Sheet_names, and then extract the sheet_name that has 'ITD_'.
excel = pd.ExcelFile("your_excel.xlsx")
excel.sheet_names
# ["Sheet1", "Sheet2"]
for n in excel.sheet_names:
if n.startswith('ITD_'):
sheetname = n
break
df = excel.parse(sheetname)

Related

Concatenate excel sheets using Pandas

My first time using pandas. I am sure the answer is something along the lines of storing the worksheet names in a list, then looping through the list based on the name I am looking for. I'm just not experienced enough to know how to do that.
The goal is to use pandas to extract and concatenate data from multiple worksheets from a user selected workbook. The final output being a single worksheet excel containing all data extracted from the various worksheets.
The excel workbook consist of approximately 100 worksheets. The qty of visible sheets will always vary, with the qty of sheets occurring before 'Main Frames BUP1' being variable as well.
I currently have the portion of code checking for page visibility working. I can not seem to figure out how to start at a specific worksheet when that worksheets position in the workbook could vary (i.e. not always the 3rd worksheet starting from 0 it could be the 5th in a users excel). It will however, always be the sheet that data should start being pulled from. Everything I find are examples of specifying specific sheets to read.
Any help/direction would be appreciated.
# user selected file from GUI
xl = values["-xl_file-"]
loc = os.path.dirname(xl)
xls = pd.ExcelFile(xl)
sheets = xls.book.worksheets
for x in sheets:
print(x.title, x.sheet_state)
if x.sheet_state == 'visible':
df = pd.concat(pd.read_excel(xls, sheet_name=None, header=None,
skiprows=5, nrows=32, usecols='M:AD'), ignore_index=True)
writer = pd.ExcelWriter(f'{loc}/test.xlsx')
df.to_excel(writer, 'bananas')
writer.save()
*******Additional clarification on final goal: Exclude all sheets occurring before 'Main Frames BUP 1', only consider visible sheets, pull data from 'M6:AD37', if entire row is blank do not add(or at least remove) from data frame, stop pulling data at the sheet just before a worksheet who's name has a partial match to 'panel'
If I create a dictionary of visible sheets, how do you create a new dictionary useing that dictionary only consisting of 'Main Frames BUP 1' to whatever sheet occurs just before a partial match of 'panel'? Then I can use that dictionary for my data pull.
I created a minimal sample myself and worked it out for you.
xls = pd.ExcelFile('data/Test.xlsx')
sheets = xls.book.worksheets
sList = [x.title for x in sheets if x.sheet_state == 'visible']
dfs = [pd.read_excel('data/Test.xlsx', sheet_name=s, skiprows=5, nrows=32, usecols='M:AD') for s in sList]
dfconcat = pd.concat(dfs)
Now you need adjust the columns, headers and so on as you did in your question. I hope that it works out for you. From my side here it worked like a charm.
It is a bit hard without actually see what is going on with your data.
I believe that what you are missing is that you need to create one dataframe first and after concat the others. Also you need to pass a sheet(x) in order to pandas be able to read it, otherwise it will become a dictionary. In case it does not work, get the first sheet and create a df, then you concat.
# user selected file from GUI
xl = values["-xl_file-"]
loc = os.path.dirname(xl)
xls = pd.ExcelFile(xl)
sheets = xls.book.worksheets
df = pd.DataFrame()
for x in sheets:
print(x.title, x.sheet_state)
if x.sheet_state == 'visible':
df = pd.concat(pd.read_excel(xls, sheet_name=x, header=None,
skiprows=5, nrows=32, usecols='M:AD'), ignore_index=True)
writer = pd.ExcelWriter(f'{loc}/test.xlsx')
df.to_excel(writer, 'bananas')
writer.save()
You can also put all the dfs in a dictionary, again it is difficult without knowing what you are working with.
xl = pd.ExcelFile('yourFile.xlsx')
#collect all sheet names
sheets = xl.sheet_names
#build dictionaries from all sheets passing None to sheet_name
diDF = pd.read_excel('yourFile.xlsx', sheet_name=None)
di = {k : diDF[k] for k in diDF if k in sheets}
for x in sheets:
if x.sheet_state == 'visible':
dfs = {x: pd.DataFrame(di[x])}

Reading through different excel sheets to plot

I am trying to read through multiple sheets within same excel file. I want to plot specific columns for every sheet on same figure but it says that 'Excelfile' has no attribute 'iloc'. Can someone tell me what is wrong here? thank you
df = pd.ExcelFile ('Current parametric sweep_reference.xlsx')
Sheet=df.sheet_names
print(Sheet)
for sheet_names in Sheet:
plt.plot(df.iloc[:,1],iloc[:,9])
You are not using the data-frame but the sheet-names. You can do the following
dfs = pd.ExcelFile ('Current parametric sweep_reference.xlsx')
for sheet in df.sheet_names: #loop over all sheets
df = pd.read_excel("Current parametric sweep_reference.xlsx",sheet_name=sheet)
plt.plot(df.iloc[:,1],df.iloc[:,9])
Your object df is not a pandas DataFrame but an ExcelFile object, which does not support iloc. To use iloc you should first represent the individual sheets as DataFrames, like so:
...
for sheet_name in Sheet:
sheet_df = df.parse(sheet_name)
you should use ´pd.read_excel´ for loading your excel file. By providing ´sheet=None´ to ´pd.read_excel´ you load all sheets into a dictionary of dataframes per sheet. Then you can iterate over the sheets as following:
import pandas as pd
sheets = pd.read_excel("'Current parametric sweep_reference.xlsx'", sheet_name=None)
for sheetname, df in sheets.items():
plt.plot(df.iloc[:,1],df.iloc[:,9])

Read excel sheet in pandas with different sheet names in a pattern

I am trying to read multiple excel files in a loop using read_excel :
Different excel files contain sheet names which contain the word "staff"
eg Staff_2013 , Staff_list etc
Is there a way to read all these files dynamically using some wild card concept ?
Something like the code below :
df = pd.read_excel(folder,col_names=True,sheet_name='Staff*')
You can list the sheets and select the ones you want to read one by one.
For instance:
xls_file = pd.ExcelFile('my_excel_file.xls')
staff_fnames = [sheet for sheet in xls.sheet_names if sheet.startswith('Staff')]
for staff_fname in staff_fnames:
df = pd.read_excel('my_excel_file.xls'), sheet_name=staff_fname)
Or, if you don't mind loading all the sheets, you can also use sheet_name=None to load all sheets in a dict and filter afterwards:
dfs_dict = pd.read_excel('my_excel_file.xls', sheet_name=None)
dfs_dict = {s: df for s, df in dfs_dict.items() if s.startswith('Staff')}

Convert Excel sheets to Pandas df's

I have an excel file with one sheet name "info" as follows
Name Number
S1 50
S2 100
S3 400
This sheet give info about other sheet which I need to convert into pandas df's.
but, when I read this sheet and loop to create other df's. My code is also looking for a sheet name "Name" and thus breaking...any way to avoid this?
Use a header row or skip the first row as mentioned in the comments.
df_info = pd.read_excel('file.xlsx', sheet_name='info', header=0)
sheets = {}
for sheet_name in df_info['Name']:
sheets[sheet_name] = pd.read_excel('file.xlsx', sheet_name=sheet_name, header=None)
Pandas Read Excel Documentation

How to read Excel Workbook (pandas)

First I want to say that I am not an expert by any means. I am versed but carry a burden of schedule and learning Python like I should have at a younger age!
Question:
I have a workbook that will on occasion have more than one worksheet. When reading in the workbook I will not know the number of sheets or their sheet name. The data arrangement will be the same on every sheet with some columns going by the name of 'Unnamed'. The problem is that everything I try or find online uses the pandas.ExcelFile to gather all sheets which is fine but i need to be able to skips 4 rows and only read 42 rows after that and parse specific columns. Although the sheets might have the exact same structure the column names might be the same or different but would like them to be merged.
So here is what I have:
import pandas as pd
from openpyxl import load_workbook
# Load in the file location and name
cause_effect_file = r'C:\Users\Owner\Desktop\C&E Template.xlsx'
# Set up the ability to write dataframe to the same workbook
book = load_workbook(cause_effect_file)
writer = pd.ExcelWriter(cause_effect_file)
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
# Get the file skip rows and parse columns needed
xl_file = pd.read_excel(cause_effect_file, skiprows=4, parse_cols = 'B:AJ', na_values=['NA'], convert_float=False)
# Loop through the sheets loading data in the dataframe
dfi = {sheet_name: xl_file.parse(sheet_name)
for sheet_name in xl_file.sheet_names}
# Remove columns labeled as un-named
for col in dfi:
if r'Unnamed' in col:
del dfi[col]
# Write dataframe to sheet so we can see what the data looks like
dfi.to_excel(writer, "PyDF", index=False)
# Save it back to the book
writer.save()
The link to the file i am working with is below
Excel File
Try to modify the following based on your specific need:
import os
import pandas as pd
df = pd.DataFrame()
xls = pd.ExcelFile(path)
Then iterate over all the available data sheets:
for x in range(0, len(xls.sheet_names)):
a = xls.parse(x,header = 4, parse_cols = 'B:AJ')
a["Sheet Name"] = [xls.sheet_names[x]] * len(a)
df = df.append(a)
You can adjust the header row and the columns to read for each sheet. I added a column that will indicate the name of the data sheet the row came from.
You probably want to look at using read_only mode in openpyxl. This will allow you to load only those sheets that you're interested and look at only the cells you're interested in.
If you want to work with Pandas dataframes then you'll have to create these yourself but that shouldn't be too hard.

Categories

Resources