Process multiple excel files with different number of sheets - python

I need to process a number of excel files with different # of tabs and different names. I'm creating a function to load the files with pandas, loop over the sheets, and then return a data frame.
def process_file(file_name):
    # just junk code - will use pandas
    for sheet_name in file_name:
        sheet_x = sheet_name
    return sheet_x
sheet_1, sheet_2 = process_file(excel_file)
Because each file has an unknown number of sheets, creating a variable for each one by hand doesn't scale. If I wanted to return each sheet as a variable, whether it's 2 or 10 sheets, is there a way to do that instead of naming each one?

Use a list to store all of your sheets:
def process_file(file_name):
    sheets = []
    # just junk code - will use pandas
    for sheet_name in file_name:
        sheet_x = sheet_name
        sheets.append(sheet_x)
    return sheets
sheets_to_process = []
for excel_file in files:
    sheets_to_process += process_file(excel_file)
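With pandas itself, one possible shape for this is to pass sheet_name=None to pd.read_excel, which returns a dict mapping each sheet name to a DataFrame, so no per-sheet variables are needed. A minimal sketch: the helper name load_all_sheets and the small demo workbook are made up for illustration, and openpyxl is assumed to be installed.

```python
import io

import pandas as pd


def load_all_sheets(excel_file):
    """Return {sheet_name: DataFrame} for every sheet in one workbook."""
    # sheet_name=None asks pandas for all sheets at once, as a dict.
    return pd.read_excel(excel_file, sheet_name=None)


# Build a small two-sheet workbook in memory so the sketch is self-contained.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [3]}).to_excel(writer, sheet_name="second", index=False)
buffer.seek(0)

sheets = load_all_sheets(buffer)
for name, frame in sheets.items():
    print(name, frame.shape)
```

The same function accepts a file path instead of the in-memory buffer, and the dict works for 2 sheets or 10 without any change.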

Related

Reading only visible sheets using Pandas

I have an existing code which uses pandas to read an excel workbook.
It reads all the sheets, including the hidden ones; however, I want to read only the visible ones.
These are the snippets I tried, but they didn't help:
xlFile = '/path/file_name.xlsx'
xl = pd.ExcelFile(xlFile)
list_of_visible_sheets = []
sheets = xl.book.sheets()
for sheet in sheets:
    if sheet.visibility == 0:
        list_of_visible_sheets.append(sheets)
print(list_of_visible_sheets)
and
list_of_visible_sheets = []
sheets = xl.sheet_names
for sheet in sheets:
    if sheet.visibility == 0:
        list_of_visible_sheets.append(sheets)
print(list_of_visible_sheets)
How can I get the visible sheets alone?
You can use this code with openpyxl. It roughly does what the pandas read_excel() function does:
import openpyxl
import pandas as pd
filename = 'TestFile.xlsx'
# Number of rows to skip in each sheet
nSkip = 1
# Keep only visible sheets (sheet_state is 'visible', 'hidden' or 'veryHidden')
dfs = {ws.title: pd.DataFrame(list(ws.values)[nSkip:])
       for ws in openpyxl.load_workbook(filename, read_only=True).worksheets
       if ws.sheet_state == 'visible'}
print(dfs)
Try passing the sheet_name argument (called sheetname in old pandas versions) to pandas.read_excel.
If there are not too many sheets, you can create the desired list manually, or use the recipe from that answer and lambdas.
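Putting the two suggestions together, one option is to collect the visible sheet names with openpyxl and hand that list to pd.read_excel, which then returns a dict of data frames for just those sheets. This is a sketch under assumptions: read_visible_sheets is a hypothetical helper name, and the demo workbook with one hidden sheet is created only so the example runs on its own.

```python
import os
import tempfile

import openpyxl
import pandas as pd


def read_visible_sheets(path):
    """Return {sheet_name: DataFrame} for the visible sheets only."""
    workbook = openpyxl.load_workbook(path)
    # sheet_state is 'visible', 'hidden' or 'veryHidden' in openpyxl.
    visible = [ws.title for ws in workbook.worksheets
               if ws.sheet_state == "visible"]
    # Passing a list of names makes read_excel return a dict of DataFrames.
    return pd.read_excel(path, sheet_name=visible)


# Demo workbook: one visible sheet and one hidden sheet.
demo = openpyxl.Workbook()
demo.active.title = "shown"
demo.active.append(["x"])
demo.active.append([1])
secret = demo.create_sheet("secret")
secret.append(["y"])
secret.append([2])
secret.sheet_state = "hidden"

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.xlsx")
    demo.save(demo_path)
    frames = read_visible_sheets(demo_path)

print(list(frames))
```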

Concat Read Excel Pandas

I need to read in an Excel file, including all of the sheets inside it.
I've tried:
sample_df = pd.concat(pd.read_excel("sample_master.xlsx", sheet_name=None), ignore_index=True)
This code worked, but it's suddenly giving me this error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
After reading in the excel file, I need to run the following command:
new_id = sample_df.loc[(sample_df['Sequencing_ID'] == line) & (sample_df['Experiment_ID'] == experiment_id), \
'Sample_ID_for_report'].item()
Any help?
First, you need to know which sheets should be read in. Second, you iterate over each of those sheets.
Getting sheet names: you can get a list of the sheet names in a workbook with sheets = pd.ExcelFile(path).sheet_names, where path is the full path to your file. The function below reads a workbook and returns the sheet names that contain specific keywords, skipping any names listed as excluded.
import re
import pandas as pd
def get_sheets(path):
    excludes = ['exclude_term1', 'exclude_term2']
    includes = ['find_term1', 'find_term2']
    sheets_to_process = []
    for sheet in pd.ExcelFile(path).sheet_names:
        # Normalize the name: keep only letters, digits and underscores
        sheet_stnd = re.sub('[^0-9A-Za-z_]+', '', sheet).lower()
        # Skip sheets whose normalized name is on the exclude list
        if sheet_stnd in excludes:
            continue
        # Keep the sheet if any include term appears in its name
        if any(include in sheet_stnd for include in includes):
            sheets_to_process.append(sheet)
    return sheets_to_process
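Looping over sheets: you can then loop over the selected sheets, read each one in, and concatenate the results. Note that pd.concat expects a list (or dict) of data frames rather than a single DataFrame, which is what the TypeError in the question is complaining about:
frames = [pd.read_excel(path, sheet_name=sheet) for sheet in get_sheets(path)]
df = pd.concat(frames, ignore_index=True)
Depending on your use case, you may also want to append the result to a larger data frame.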
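When every sheet should be combined, it can help to record which sheet each row came from before concatenating. A sketch under assumptions: the in-memory workbook and its values are invented, the source_sheet column name is hypothetical, and only the Sequencing_ID column name comes from the question.

```python
import io

import pandas as pd

# Build a small two-sheet workbook in memory (stand-in for sample_master.xlsx).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"Sequencing_ID": ["s1", "s2"]}).to_excel(
        writer, sheet_name="run1", index=False)
    pd.DataFrame({"Sequencing_ID": ["s3"]}).to_excel(
        writer, sheet_name="run2", index=False)
buffer.seek(0)

# sheet_name=None returns a dict: sheet name -> DataFrame.
all_sheets = pd.read_excel(buffer, sheet_name=None)

# Tag each frame with its sheet name, then concatenate the list of frames.
frames = [frame.assign(source_sheet=name) for name, frame in all_sheets.items()]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```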

Python, how to combine different excel workbooks into one excel workbook as sheets

Is there any way in Python to combine different Excel workbooks into one workbook whose sheets contain the data from those workbooks?
For example, say I have two Excel workbooks, emp.xlsx and dept.xlsx. I want output.xlsx as the result, with worksheets emp and dept holding the data of emp.xlsx and dept.xlsx. Please share your thoughts on this.
Regards
Kawaljeet
What you need to do is get each sheet one by one and then create an excel with each one of those sheets. You can use the file name to name the new sheets as in emp-sheet1, emp-sheet2, dept-sheet1, and so on.
The next example assumes you have two Excel files named emp.xlsx and dept.xlsx and generates a new output.xlsx file containing all the sheets and values:
#!pip install openpyxl
#!pip install xlrd
import pandas as pd
def get_sheets(filenames):
    '''
    This function generates dataframes from excel sheets.
    Returns:
    - dfs: a list of dataframes, one for each sheet
    - sheets: combined names for the new sheets, filename-sheetname
    '''
    sheets = []
    dfs = []
    for file in filenames:
        xl = pd.ExcelFile(file)
        sheet_names = xl.sheet_names
        for sheet in sheet_names:
            dfs.append(xl.parse(sheet, header=None))
            sheets.append(file.split('.')[0] + '-' + sheet)
    return dfs, sheets
def save_xls(dfs, sheets, xls_path):
    '''
    Saves each dataframe in dfs as a sheet with the name in sheets
    into the file specified in xls_path
    '''
    # The with block saves and closes the file on exit
    with pd.ExcelWriter(xls_path) as writer:
        for df, sheet in zip(dfs, sheets):
            df.to_excel(writer, sheet_name=sheet, index=False, header=False)
filenames = ['emp.xlsx', 'dept.xlsx']
dfs, sheets = get_sheets(filenames)
save_xls(dfs, sheets, 'output.xlsx')

Overwrite sheets in Excel with Python

I'm new to Python (and programming in general) and am running into a problem when writing data out to sheets in Excel.
I'm reading in an Excel file, performing a sum calculation on specific columns, and then writing the results out to a new workbook. Then at the end, it creates two charts based on the results.
The code works, except every time I run it, it creates new sheets with numbers appended to the end. I really just want it to overwrite the sheet names I provide, instead of creating new ones.
I'm not familiar enough with all the modules to understand all the options that are available. I've researched openpyxl, and pandas, and similar examples to what I'm trying to do either aren't easy to find, or don't seem to work when I try them.
import pandas as pd
import xlrd
import openpyxl as op
from openpyxl import load_workbook
import matplotlib.pyplot as plt
# declare the input file
input_file = 'TestData.xlsx'
# declare the output_file name to be written to
output_file = 'TestData_Output.xlsx'
book = load_workbook(output_file)
writer = pd.ExcelWriter(output_file, engine='openpyxl')
writer.book = book
# read the source Excel file and calculate sums
excel_file = pd.read_excel(input_file)
num_events_main = excel_file.groupby(['Column1']).sum()
num_events_type = excel_file.groupby(['Column2']).sum()
# create dataframes and write names and sums out to new workbook/sheets
df_1 = pd.DataFrame(num_events_main)
df_2 = pd.DataFrame(num_events_type)
df_1.to_excel(writer, sheet_name = 'TestSheet1')
df_2.to_excel(writer, sheet_name = 'TestSheet2')
# save and close
writer.save()
writer.close()
# dataframe for the first sheet
df = pd.read_excel(output_file, sheet_name='TestSheet1')
values = df[['Column1', 'Column3']]
# dataframe for the second sheet
df = pd.read_excel(output_file, sheet_name='TestSheet2')
values_2 = df[['Column2', 'Column3']]
# create the graphs
events_graph = values.plot.bar(x = 'Column1', y = 'Column3', rot = 60) # rot = rotation
type_graph = values_2.plot.bar(x = 'Column2', y = 'Column3', rot = 60) # rot = rotation
plt.show()
I get the expected results, and the charts work fine. I'd really just like to get the sheets to overwrite with each run.
From the pd.DataFrame.to_excel documentation:
Multiple sheets may be written to by specifying unique sheet_name.
With all data written to the file it is necessary to save the changes.
Note that creating an ExcelWriter object with a file name that already
exists will result in the contents of the existing file being erased.
Try writing to the book like this:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
writer = pd.ExcelWriter('g.xlsx')
df.to_excel(writer, sheet_name='first_df')
df.to_excel(writer, sheet_name='second_df')
writer.close()  # saves the file; writer.save() was removed in pandas 2.0
If you inspect the workbook, you will have two worksheets.
Then let's say you wanted to write new data to the same workbook:
writer = pd.ExcelWriter('g.xlsx')
df.to_excel(writer, sheet_name='new_df')
writer.close()
If you inspect the workbook now, you will have just one worksheet, named new_df.
If there are other worksheets in the excel file that you want to keep and just overwrite the desired worksheets, you would need to use load_workbook.
Before you write any data, you could delete the sheets you want to write to with:
std = book['<sheet_name>']  # get_sheet_by_name() is deprecated in openpyxl
book.remove(std)  # remove_sheet() is deprecated in openpyxl
That will stop the behavior where a number gets appended to the worksheet name once you attempt to write a workbook with a duplicate sheet name.
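In newer pandas versions (1.3 or later, as far as I know) another route is to open the existing file in append mode and ask pandas to replace a sheet that already exists. A minimal sketch, assuming the openpyxl engine; the sheet names follow the question, while the data frames and the temporary g.xlsx path are placeholders.

```python
import os
import tempfile

import openpyxl
import pandas as pd

df_old = pd.DataFrame({"col1": [1, 2, 3]})
df_new = pd.DataFrame({"col1": [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "g.xlsx")

    # First run: create the workbook with two sheets.
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        df_old.to_excel(writer, sheet_name="TestSheet1", index=False)
        df_old.to_excel(writer, sheet_name="TestSheet2", index=False)

    # Later run: mode="a" keeps the other sheets, and
    # if_sheet_exists="replace" overwrites TestSheet1 in place.
    with pd.ExcelWriter(path, engine="openpyxl", mode="a",
                        if_sheet_exists="replace") as writer:
        df_new.to_excel(writer, sheet_name="TestSheet1", index=False)

    sheet_names = openpyxl.load_workbook(path).sheetnames
    replaced = pd.read_excel(path, sheet_name="TestSheet1")

print(sheet_names)
```

Append mode requires the file to exist already, so the very first run still needs a plain write.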

Writing multiple pandas dataframes to multiple excel worksheets

I'd like for the code to run 12345 thru the loop, input it in a worksheet, then start on 54321 and do the same thing except input the dataframe into a new worksheet but in the same workbook. Below is my code.
workbook = xlsxwriter.Workbook('Renewals.xlsx')
groups = ['12345', '54321']
for x in groups:
    # (Do a bunch of data manipulation and get pandas df called renewals)
    writer = pd.ExcelWriter('Renewals.xlsx', engine='xlsxwriter')
    worksheet = workbook.add_worksheet(str(x))
    renewals.to_excel(writer, sheet_name=str(x))
When this runs, I am left with a workbook with only 1 worksheet (54321).
Try something like this:
import pandas as pd
# initialize the Excel writer
writer = pd.ExcelWriter('MyFile.xlsx', engine='xlsxwriter')
# store your dataframes in a dict, where the key is the sheet name you want
frames = {'sheetName_1': dataframe1, 'sheetName_2': dataframe2,
          'sheetName_3': dataframe3}
# now loop through and put each on a specific sheet
for sheet, frame in frames.items():  # use .iteritems() on Python 2
    frame.to_excel(writer, sheet_name=sheet)
# critical last step: close() saves the file (writer.save() was removed in pandas 2.0)
writer.close()
import pandas as pd
writer = pd.ExcelWriter('Renewals.xlsx', engine='xlsxwriter')
renewals.to_excel(writer, sheet_name=groups[0])
renewals.to_excel(writer, sheet_name=groups[1])
writer.close()  # saves the file; writer.save() was removed in pandas 2.0
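The loop in the question ends up with a single worksheet because a new ExcelWriter is created on every iteration, and each new writer starts a fresh file. A sketch that creates the writer once, outside the loop: build_renewals is a hypothetical stand-in for the real data manipulation, and the openpyxl engine with an in-memory buffer is used here only so the example runs without touching disk.

```python
import io

import pandas as pd


def build_renewals(group):
    """Hypothetical stand-in for the real data manipulation."""
    return pd.DataFrame({"group": [group], "count": [len(group)]})


groups = ["12345", "54321"]
buffer = io.BytesIO()  # swap for 'Renewals.xlsx' to write a real file

# One writer for the whole loop; leaving the with block saves the workbook.
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    for x in groups:
        renewals = build_renewals(x)
        renewals.to_excel(writer, sheet_name=str(x), index=False)

buffer.seek(0)
written = pd.ExcelFile(buffer).sheet_names
print(written)
```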
Building on the accepted answer, you can find situations where the sheet name will cause the save to fail if it has invalid characters or is too long. This could happen if you are using grouped values for the sheet name as an example. A helper function could address this and save you some pain.
def clean_sheet_name(sheet):
    r"""Clean sheet name so that it is a valid Excel sheet name.
    Removes characters in []:*?/\ and limits to 30 characters.
    Args:
        sheet (str): Name to use for sheet.
    Returns:
        cleaned_sheet (str): Cleaned sheet name.
    """
    if sheet in (None, ''):
        return sheet
    clean_sheet = sheet.translate({ord(i): None for i in '[]:*?/\\'})
    if len(clean_sheet) > 30:  # Set value you feel is appropriate (Excel allows at most 31)
        clean_sheet = clean_sheet[:30]
    return clean_sheet
Then add a call to the helper function before writing to Excel.
for sheet, frame in frames.items():
    # Clean sheet name for length and invalid characters
    sheet = clean_sheet_name(sheet)
    frame.to_excel(writer, sheet_name=sheet, index=False)
writer.close()  # saves the file; writer.save() was removed in pandas 2.0
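Cleaning can make two different names identical (for example after truncation), and Excel requires sheet names within a workbook to be unique, ignoring case. One way to handle that is sketched below; unique_sheet_names is a hypothetical helper that expects names that have already been cleaned of invalid characters.

```python
def unique_sheet_names(names, max_len=31):
    """Return names de-duplicated case-insensitively by adding _2, _3, ..."""
    seen = set()
    result = []
    for name in names:
        candidate = name[:max_len]
        counter = 2
        while candidate.lower() in seen:
            suffix = f"_{counter}"
            # Trim the base so the suffixed name still fits in max_len.
            candidate = name[:max_len - len(suffix)] + suffix
            counter += 1
        seen.add(candidate.lower())
        result.append(candidate)
    return result


print(unique_sheet_names(["North", "north", "South"]))
# prints ['North', 'north_2', 'South']
```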
