First I want to say that I am not an expert by any means. I am reasonably versed, but I am juggling a tight schedule while learning Python, which I should have done at a younger age!
Question:
I have a workbook that will on occasion have more than one worksheet. When reading in the workbook I will not know the number of sheets or their names. The data arrangement will be the same on every sheet, with some columns going by the name 'Unnamed'. The problem is that everything I try or find online uses pandas.ExcelFile to gather all sheets, which is fine, but I need to be able to skip 4 rows, read only the 42 rows after that, and parse specific columns. Although the sheets have the same structure, the column names might be the same or different, and I would like the sheets to be merged.
So here is what I have:
import pandas as pd
from openpyxl import load_workbook
# Load in the file location and name
cause_effect_file = r'C:\Users\Owner\Desktop\C&E Template.xlsx'
# Set up the ability to write dataframe to the same workbook
book = load_workbook(cause_effect_file)
writer = pd.ExcelWriter(cause_effect_file)
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
# Get the file skip rows and parse columns needed
xl_file = pd.read_excel(cause_effect_file, skiprows=4, parse_cols = 'B:AJ', na_values=['NA'], convert_float=False)
# Loop through the sheets loading data in the dataframe
dfi = {sheet_name: xl_file.parse(sheet_name)
       for sheet_name in xl_file.sheet_names}
# Remove columns labeled as un-named
for col in dfi:
    if r'Unnamed' in col:
        del dfi[col]
# Write dataframe to sheet so we can see what the data looks like
dfi.to_excel(writer, "PyDF", index=False)
# Save it back to the book
writer.save()
Try to modify the following based on your specific need:
import pandas as pd
# path to the workbook (here, the file from the question)
path = r'C:\Users\Owner\Desktop\C&E Template.xlsx'
df = pd.DataFrame()
xls = pd.ExcelFile(path)
Then iterate over all the available data sheets:
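frames = []
for sheet_name in xls.sheet_names:
    # header=4 skips the first four rows and uses the fifth as the header
    a = xls.parse(sheet_name, header=4, usecols='B:AJ')
    a["Sheet Name"] = sheet_name
    frames.append(a)
# DataFrame.append was removed in pandas 2.0, so concatenate once instead
df = pd.concat(frames, ignore_index=True)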
You can adjust the header row and the columns to read for each sheet. I added a column that will indicate the name of the data sheet the row came from.
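The question also asks for a fixed window of rows and for the 'Unnamed' columns to be dropped. Here is a minimal sketch of how that could look, assuming a pandas version where `read_excel` accepts `usecols` and `nrows`. The helper name `read_all_sheets` and the throwaway two-sheet workbook are made up for illustration and are not the asker's file.

```python
import os
import tempfile

import pandas as pd


def read_all_sheets(path, skiprows=4, nrows=42, usecols="B:AJ"):
    """Read the same window from every sheet and stack the results (hypothetical helper)."""
    # sheet_name=None returns a dict of {sheet name: DataFrame}
    sheets = pd.read_excel(path, sheet_name=None, skiprows=skiprows,
                           nrows=nrows, usecols=usecols)
    frames = []
    for name, frame in sheets.items():
        # drop the columns pandas auto-named 'Unnamed: N' because their header cell was empty
        keep = [c for c in frame.columns if not str(c).startswith("Unnamed")]
        frame = frame[keep].copy()
        frame["Sheet Name"] = name
        frames.append(frame)
    # concat aligns on column names; a column missing from a sheet is filled with NaN
    return pd.concat(frames, ignore_index=True)


# Demo on a throwaway two-sheet workbook (made-up data)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with pd.ExcelWriter(demo_path) as writer:
    pd.DataFrame({"A": [1, 2], "B": [3, 4]}).to_excel(writer, sheet_name="S1", index=False)
    pd.DataFrame({"A": [5], "C": [6]}).to_excel(writer, sheet_name="S2", index=False)

combined = read_all_sheets(demo_path, skiprows=0, nrows=42, usecols="A:B")
print(combined)
```

For the workbook in the question, the defaults (`skiprows=4`, `nrows=42`, `usecols="B:AJ"`) would be the starting point. Whether the header row should count as one of the 42 rows depends on the sheet layout.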
You probably want to look at using read_only mode in openpyxl. This will allow you to load only those sheets that you're interested in and look at only the cells you need.
If you want to work with pandas DataFrames then you'll have to create these yourself, but that shouldn't be too hard.
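As a rough sketch of that idea, the snippet below opens a workbook in read-only mode, pulls one rectangle of cells with `iter_rows`, and builds the DataFrame by hand. The workbook, sheet name and cell range are made up so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd
from openpyxl import Workbook, load_workbook

# Build a throwaway workbook so the sketch is self-contained (made-up data)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
wb = Workbook()
ws = wb.active
ws.title = "Data"
ws.append(["ignored", "Name", "Value"])
ws.append(["ignored", "a", 1])
ws.append(["ignored", "b", 2])
wb.save(demo_path)

# read_only=True streams rows instead of loading the whole workbook into memory
wb = load_workbook(demo_path, read_only=True)
ws = wb["Data"]
# values_only=True yields plain tuples of cell values for the rectangle B1:C3
rows = list(ws.iter_rows(min_row=1, max_row=3, min_col=2, max_col=3, values_only=True))
wb.close()  # read-only workbooks keep the file open until closed

# the first tuple is the header row, the rest are data rows
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```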
Related
My first time using pandas. I am sure the answer is something along the lines of storing the worksheet names in a list, then looping through the list based on the name I am looking for. I'm just not experienced enough to know how to do that.
The goal is to use pandas to extract and concatenate data from multiple worksheets in a user-selected workbook. The final output is a single-worksheet Excel file containing all the data extracted from the various worksheets.
The Excel workbook consists of approximately 100 worksheets. The number of visible sheets will always vary, and the number of sheets occurring before 'Main Frames BUP1' varies as well.
I currently have the portion of code that checks sheet visibility working. I cannot figure out how to start at a specific worksheet when that worksheet's position in the workbook can vary (i.e. it is not always the 3rd worksheet counting from 0; it could be the 5th in another user's workbook). It will, however, always be the sheet that data should start being pulled from. Everything I find shows examples that name specific sheets to read.
Any help/direction would be appreciated.
# user selected file from GUI
xl = values["-xl_file-"]
loc = os.path.dirname(xl)
xls = pd.ExcelFile(xl)
sheets = xls.book.worksheets
for x in sheets:
    print(x.title, x.sheet_state)
    if x.sheet_state == 'visible':
        df = pd.concat(pd.read_excel(xls, sheet_name=None, header=None,
                                     skiprows=5, nrows=32, usecols='M:AD'), ignore_index=True)
writer = pd.ExcelWriter(f'{loc}/test.xlsx')
df.to_excel(writer, 'bananas')
writer.save()
Additional clarification on the final goal: exclude all sheets occurring before 'Main Frames BUP 1', only consider visible sheets, pull data from 'M6:AD37', do not add a row to the data frame (or at least remove it afterwards) if the entire row is blank, and stop pulling data at the sheet just before a worksheet whose name has a partial match to 'panel'.
If I create a dictionary of visible sheets, how do I create a new dictionary from it that consists only of the sheets from 'Main Frames BUP 1' up to whatever sheet occurs just before a partial match of 'panel'? Then I can use that dictionary for my data pull.
I created a minimal sample myself and worked it out for you.
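import pandas as pd
xls = pd.ExcelFile('data/Test.xlsx')
sheets = xls.book.worksheets
# titles of the visible worksheets only
sList = [x.title for x in sheets if x.sheet_state == 'visible']
# read the same block of columns M:AD from each visible sheet, reusing the open file
dfs = [pd.read_excel(xls, sheet_name=s, skiprows=5, nrows=32, usecols='M:AD') for s in sList]
dfconcat = pd.concat(dfs)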
Now you need to adjust the columns, headers and so on as you did in your question. I hope that it works out for you. On my side it worked like a charm.
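The clarification in the question also asks to start at 'Main Frames BUP 1' and stop just before the first sheet whose name partially matches 'panel'. One possible way to trim the list of visible sheet names is sketched below. The helper `select_sheet_range` and the sample names are hypothetical, and the match is assumed to be case-insensitive.

```python
def select_sheet_range(names, start_name, stop_substring):
    """Return the names from start_name up to, but not including, the first
    later name that contains stop_substring (case-insensitive).
    Hypothetical helper, not part of pandas or openpyxl."""
    if start_name not in names:
        raise ValueError(f"start sheet {start_name!r} not found")
    start = names.index(start_name)
    selected = []
    for name in names[start:]:
        if stop_substring.lower() in name.lower():
            break
        selected.append(name)
    return selected


# Made-up sheet names for illustration
visible = ["Cover", "Notes", "Main Frames BUP 1", "Main Frames BUP 2",
           "Control Panel A", "Extra"]
print(select_sheet_range(visible, "Main Frames BUP 1", "panel"))
# → ['Main Frames BUP 1', 'Main Frames BUP 2']
```

The trimmed list can then be used in place of the full list of visible sheets when reading, and calling `dropna(how='all')` on the concatenated frame removes rows that are entirely blank.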
It is a bit hard without actually seeing what is going on with your data.
I believe what you are missing is that you need to create one DataFrame first and then concat the others onto it. You also need to pass a single sheet (x) so that pandas returns a DataFrame; otherwise (with sheet_name=None) it returns a dictionary. If that does not work, read the first sheet into a df and then concat the rest.
import os
import pandas as pd
# user selected file from GUI
xl = values["-xl_file-"]
loc = os.path.dirname(xl)
xls = pd.ExcelFile(xl)
sheets = xls.book.worksheets
df = pd.DataFrame()
for x in sheets:
    print(x.title, x.sheet_state)
    if x.sheet_state == 'visible':
        # pass the sheet title so read_excel returns one DataFrame, and give concat a list
        df = pd.concat([df, pd.read_excel(xls, sheet_name=x.title, header=None,
                                          skiprows=5, nrows=32, usecols='M:AD')], ignore_index=True)
# writer.save() was removed in pandas 2.0; to_excel can write to the path directly
df.to_excel(f'{loc}/test.xlsx', sheet_name='bananas')
You can also put all the DataFrames in a dictionary. Again, it is difficult to be specific without knowing what you are working with.
xl = pd.ExcelFile('yourFile.xlsx')
# collect the titles of the visible sheets (sheet_state belongs to the worksheet object, not to its name)
visible = [ws.title for ws in xl.book.worksheets if ws.sheet_state == 'visible']
# build a dictionary of DataFrames from all sheets by passing None to sheet_name
diDF = pd.read_excel(xl, sheet_name=None)
# keep only the visible sheets
dfs = {k: diDF[k] for k in visible}
I'm currently creating a DataFrame from an Excel spreadsheet in pandas. Most of the files contain only one sheet. In some of the files, however, the sheet I need is not the first sheet. The sheets in all of the files follow the same naming format, 'ITD_XXX_XXXX'. Is there a way to tell pandas to select the sheet whose name has that form, something like this?
df = pd.read_excel(path, sheet_name = contains('ITD_'))
Here pandas would only select data from the sheet whose name starts with the string 'ITD_'?
Cheers.
I think the answer here would probably give you what you need.
Open the file as an ExcelFile before reading it into a DataFrame. Get the sheet_names, and then pick out the sheet name that starts with 'ITD_'.
excel = pd.ExcelFile("your_excel.xlsx")
excel.sheet_names
# ["Sheet1", "Sheet2"]
for n in excel.sheet_names:
    if n.startswith('ITD_'):
        sheetname = n
        break
df = excel.parse(sheetname)
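Note that the loop leaves sheetname undefined when no sheet matches, so the final parse call fails with a NameError. A small sketch of a lookup that fails with a clearer message is shown below. The function `find_sheet` and the sample names are hypothetical.

```python
def find_sheet(sheet_names, prefix="ITD_"):
    """Return the first sheet name that starts with prefix (hypothetical helper)."""
    for name in sheet_names:
        if name.startswith(prefix):
            return name
    raise ValueError(f"no sheet name starts with {prefix!r}")


# Made-up sheet names for illustration
names = ["Summary", "ITD_123_4567", "Notes"]
print(find_sheet(names))
```

With an open ExcelFile this could be used as `excel.parse(find_sheet(excel.sheet_names))`.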
I have an excel file with one sheet name "info" as follows
Name Number
S1 50
S2 100
S3 400
This sheet gives info about the other sheets, which I need to convert into pandas DataFrames.
But when I read this sheet and loop over it to create the other DataFrames, my code also looks for a sheet named "Name" and breaks. Is there any way to avoid this?
Use a header row or skip the first row as mentioned in the comments.
df_info = pd.read_excel('file.xlsx', sheet_name='info', header=0)
sheets = {}
for sheet_name in df_info['Name']:
    sheets[sheet_name] = pd.read_excel('file.xlsx', sheet_name=sheet_name, header=None)
Pandas Read Excel Documentation
I have been searching for how to append/insert/concat a row from one Excel file to another when the row contains merged cells, but I was not able to find what I am looking for.
What I need to get is this:
and append to the very first row of this:
I tried using pandas append() but it destroyed the arrangement of columns.
df = pd.DataFrame()
for f in ['merge1.xlsx', 'test1.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel('test3.xlsx')
Is there a way pandas could do it? I just need to literally insert the header into the top row.
I am still trying to find a way myself, so it would be fine with me if this question turned out to be a duplicate, as long as I can find answers or advice.
You can use pd.read_excel to read in the workbook with the data you want, which in your case is 'test1.xlsx'. You can then use openpyxl.load_workbook() to open the existing workbook with the header, which in your case is 'merge1.xlsx'. Finally, you can save the result under a new name ('test3.xlsx') without changing the two existing workbooks.
Below I've provided a fully reproducible example of how you can do this. To make this example fully reproducible, I create 'merge1.xlsx' and 'test1.xlsx'.
Please note that if your 'merge1.xlsx' contains only the header that you want and nothing else, you can use the two lines I've left commented out below. They simply append your data from 'test1.xlsx' below the header in 'merge1.xlsx'. In that case you can get rid of the two for loops at the end. Otherwise, as in my example, it's a bit more complicated.
In creating 'test3.xlsx', we loop through each row and we determine how many columns there are using len(df3.columns). In my example this is equal to two but this code would also work for a greater number of columns.
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
# create 'merge1.xlsx' with a merged header cell (merge_range is an xlsxwriter method)
df1 = pd.DataFrame()
writer = pd.ExcelWriter('merge1.xlsx', engine='xlsxwriter')
df1.to_excel(writer, sheet_name='Sheet1')
ws = writer.sheets['Sheet1']
ws.merge_range('A1:C1', 'This is a merged cell')
ws.write('A3', 'some string I might not want in other workbooks')
writer.close()  # writer.save() was removed in pandas 2.0
# create 'test1.xlsx' with the data
df2 = pd.DataFrame({'col_1': [1,2,3,4,5,6], 'col_2': ['A','B','C','D','E','F']})
writer = pd.ExcelWriter('test1.xlsx')
df2.to_excel(writer, sheet_name='Sheet1')
writer.close()
df3 = pd.read_excel('test1.xlsx')
wb = load_workbook('merge1.xlsx')
ws = wb['Sheet1']
#for row in dataframe_to_rows(df3):
#    ws.append(row)
# write the column names into row 2, starting at column B
column = 2
for item in list(df3.columns.values):
    ws.cell(row=2, column=column).value = str(item)
    column = column + 1
# write the data from row 3 down, overwriting anything already in those cells
for row_index, row in df3.iterrows():
    ws.cell(row=row_index+3, column=1).value = row_index #comment out to remove index
    for i in range(0, len(df3.columns)):
        ws.cell(row=row_index+3, column=i+2).value = row.iloc[i]
wb.save("test3.xlsx")
Expected Output of the 3 Workbooks:
I am having trouble updating an Excel sheet using pandas by writing new values to it. I already have an existing DataFrame df1 that reads the values from MySheet1.xlsx, so this needs to either be a new DataFrame or somehow copy and overwrite the existing one.
The spreadsheet is in this format:
I have a python list: values_list = [12.34, 17.56, 12.45]. My goal is to insert the list values under Col_C header vertically. It is currently overwriting the entire dataframe horizontally, without preserving the current values.
df2 = pd.DataFrame({'Col_C': values_list})
writer = pd.ExcelWriter('excelfile.xlsx', engine='xlsxwriter')
df2.to_excel(writer, sheet_name='MySheet1')
workbook = writer.book
worksheet = writer.sheets['MySheet1']
How to get this end result? Thank you!
Below I've provided a fully reproducible example of how you can go about modifying an existing .xlsx workbook using pandas and the openpyxl module.
First, for demonstration purposes, I create a workbook called test.xlsx:
import pandas as pd
df = pd.DataFrame({'Col_A': [1,2,3,4],
                   'Col_B': [5,6,7,8],
                   'Col_C': [0,0,0,0],
                   'Col_D': [13,14,15,16]})
# the context manager saves and closes the workbook on exit
with pd.ExcelWriter('test.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, index=False)
This is the Expected output at this point:
In this second part, we load the existing workbook ('test.xlsx') and modify the third column with different data.
from openpyxl import load_workbook
import pandas as pd
df_new = pd.DataFrame({'Col_C': [9, 10, 11, 12]})
wb = load_workbook('test.xlsx')
ws = wb['Sheet1']
for index, row in df_new.iterrows():
    # row 1 holds the header, so the data starts at row 2 of column C
    cell = 'C%d' % (index + 2)
    ws[cell] = row['Col_C']
wb.save('test.xlsx')
This is the Expected output at the end:
In my opinion, the easiest solution is to read the Excel file into a pandas DataFrame, modify it, and write it back out as an Excel file. So for example:
Comments:
Import pandas as pd.
Read the Excel sheet into a pandas DataFrame called ExcelDataInPandasDataFrame.
Take your data, which could be in a list, and assign it to the column you want (just make sure the list has as many items as the DataFrame has rows).
Save your DataFrame as an Excel file, either overwriting the old file or creating a new one.
Code:
import pandas as pd
ExcelDataInPandasDataFrame = pd.read_excel("./YourExcel.xlsx")
YourDataInAList = [12.34,17.56,12.45]
ExcelDataInPandasDataFrame["Col_C"] = YourDataInAList
ExcelDataInPandasDataFrame.to_excel("./YourNewExcel.xlsx", index=False)
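The length requirement matters because assigning a plain list whose length differs from the number of rows raises a ValueError. If the list is shorter than the sheet, one option is to wrap it in a Series so that pandas aligns it on the index. The sketch below uses a made-up DataFrame in place of the one read from the workbook.

```python
import pandas as pd

# Stand-in for the DataFrame read from the workbook (made-up values)
df = pd.DataFrame({"Col_A": [1, 2, 3, 4], "Col_B": [5, 6, 7, 8]})
values_list = [12.34, 17.56, 12.45]  # one value fewer than the number of rows

# Assigning the bare list would raise ValueError because the lengths differ.
# A Series is aligned on the index instead, so the unmatched row becomes NaN.
df["Col_C"] = pd.Series(values_list)
print(df)
```

This relies on the DataFrame having the default 0, 1, 2, ... index, which is what read_excel produces when no index column is specified.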