Combining excel workbook sheet into one using python - python

I have roughly 30 Excel workbooks I need to combine into one. Each workbook has a variable number of sheets, but the sheet I need to combine from each workbook is called "Output", and the format of the columns in this sheet is consistent.
I need to import the Output sheet from the first file, then append the remaining files and ignore the header row.
I have tried to do this using glob/pandas to no avail.

You could use openpyxl. Here's a sketch of the code:
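from openpyxl import load_workbook

# Start from the first workbook so that its header row is kept
compiled_wb = load_workbook(filename='yourfile1.xlsx')
compiled_ws = compiled_wb['Output']

# Append the data rows of the remaining workbooks (files 2 to 30)
for i in range(2, 31):
    wb = load_workbook(filename='yourfile{}.xlsx'.format(i))
    ws = wb['Output']
    # min_row=2 skips the header row; values_only yields plain cell values
    for row in ws.iter_rows(min_row=2, values_only=True):
        compiled_ws.append(row)

compiled_wb.save('compiled.xlsx')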

Method shown by Clinton W. Brownley in Foundations for Analytics with Python:
Run the script from a shell, passing the path to the folder that contains the Excel files, followed by the path of the Excel output file (make sure the pattern that defines all_workbooks matches your files):
python script.py <path/to/excel/folder> <path/to/final/output.xlsx>
script.py:
import glob
import os
import sys

import pandas as pd

input_path = sys.argv[1]
output_file = sys.argv[2]

# Collect every .xlsx file in the input folder
all_workbooks = glob.glob(os.path.join(input_path, '*.xlsx'))

all_df = []
for workbook in all_workbooks:
    # With a single sheet name, read_excel returns one DataFrame
    data = pd.read_excel(workbook, sheet_name='Output', index_col=None)
    all_df.append(data)

# Stack the frames; the header row is written only once by to_excel
data_concatenated = pd.concat(all_df, axis=0, ignore_index=True)
data_concatenated.to_excel(output_file, sheet_name='concatenated_Output', index=False)

This will probably get down-voted because this isn't a Python answer, but honestly, I wouldn't use Python for this kind of task. I think you are far better off installing the AddIn below, and using that for the job.
https://www.rondebruin.nl/win/addins/rdbmerge.htm
Click 'Merge all files from the folder in the Files location selection' and click 'Use a Worksheet name' = 'Output', and finally, I think you want 'First cell'. Good luck!

Related

How can I add multiple sheets from multiple workbooks into one workbook without overwriting the whole file?

I have two Excel files (.xls) in the "Files" folder. I want to take every sheet from both of them and put them into one separate workbook, called masterFile.xls. The code below downloads some example files so you can see what I'm working with.
import pandas as pd
import os
import requests

resp = requests.get("https://www.ons.gov.uk/file?uri=%2femploymentandlabourmarket%2fpeopleinwork%2femploymentandemployeetypes%2fdatasets%2fsummaryoflabourmarketstatistics%2fcurrent/a01dec2021.xls")
output = open("1.xls", 'wb')
output.write(resp.content)
output.close()
resp = requests.get("https://www.ons.gov.uk/file?uri=%2femploymentandlabourmarket%2fpeopleinwork%2femploymentandemployeetypes%2fdatasets%2femploymentunemploymentandeconomicinactivityforpeopleaged16andoverandagedfrom16to64seasonallyadjusteda02sa%2fcurrent/a02sadec2021.xls")
output = open("2.xls", 'wb')
output.write(resp.content)
output.close()

cwd = os.path.abspath('')
files = os.listdir(cwd)
for file in files:
    if file.endswith('.xls'):
        excelFile = pd.ExcelFile(file)
        sheets = excelFile.sheet_names
        for sheet in sheets:
            data = pd.read_excel(excelFile, sheet_name=sheet)
            data.to_excel("masterFile.xls", sheet_name=sheet)
Each time it adds the sheet, it replaces whatever was already there instead of adding a new sheet.
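One way to avoid the overwrite is to open a single pd.ExcelWriter and pass it to every to_excel call, so that each call adds a sheet to the same file instead of creating the file again. The following is only a sketch under some assumptions: the names unique_sheet_name and combine_workbooks are made up for illustration, the output is assumed to be an .xlsx file (recent pandas versions no longer write the legacy .xls format), and reading .xls input assumes the xlrd package is installed. Because two workbooks can contain sheets with the same name, the helper also makes names unique and keeps them within Excel's 31-character limit.

```python
from pathlib import Path


def unique_sheet_name(name, used):
    """Return a sheet name of at most 31 characters that is not yet in `used`.

    `used` is a set of lower-cased names, since Excel compares sheet names
    case-insensitively. (Hypothetical helper, not part of pandas.)
    """
    candidate = name[:31]
    counter = 1
    while candidate.lower() in used:
        suffix = f"_{counter}"
        candidate = name[:31 - len(suffix)] + suffix
        counter += 1
    used.add(candidate.lower())
    return candidate


def combine_workbooks(folder, master_path):
    """Copy every sheet of every .xls file in `folder` into one workbook."""
    # pandas is imported here so the helper above can be used without it
    import pandas as pd

    used = set()
    # One writer for the whole run: each to_excel call adds a new sheet
    with pd.ExcelWriter(master_path) as writer:
        for file in sorted(Path(folder).glob("*.xls")):
            # sheet_name=None returns a dict of {sheet name: DataFrame}
            for sheet, data in pd.read_excel(file, sheet_name=None).items():
                target = unique_sheet_name(sheet, used)
                data.to_excel(writer, sheet_name=target, index=False)
```

It could be called as combine_workbooks(".", "masterFile.xlsx"); the "*.xls" pattern does not match the .xlsx output, so the master file is not read back in on a second run.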

Saving XLSX workbooks as multiple CSV files

Trying to save Excel files with multiple sheets as corresponding CSV files.
I tried the following method:
import xlrd
from openpyxl import Workbook, load_workbook
import pathlib
import shutil
import pandas as pd
def strip_xlsx(inputdir, file_name, targetdir):
    wb = load_workbook(inputdir)
    sheets = wb.sheetnames
    for s in sheets:
        temp_df = pd.read_excel(inputdir, sheet_name=s)
        temp_df.to_csv(targetdir + "/" + file_name.strip(".xlsx") + "_" + s + ".csv", encoding='utf-8-sig')
Where inputdir is an absolute path to the Excel file (say: "/Users/me/test/t.xlsx"), file_name is just the name of the file ("t.xlsx") and targetdir is the path to which I wish to save the CSV files.
The method works well, though it is super slow. I'm a newbie to Python and feel like I implemented the method in a very inefficient way.
Would appreciate tips from the masters.
You may have better luck if you keep everything in pandas. I see you are using openpyxl to get the sheet names; you can do this in pandas too. As for speed, you'll just have to see:
EDIT:
As Charlie (the person who probably knows the most about openpyxl on the planet) pointed out, using just openpyxl will be faster. In this case about 25% faster (9.29 ms -> 6.87 ms for my two-sheet test):
from os import path, makedirs
from openpyxl import load_workbook
import csv


def xlsx_to_multi_csv(xlsx_path: str, out_dir: str = '.') -> None:
    """Write each sheet of an Excel file to a csv
    """
    # make the out directory if it does not exist
    makedirs(out_dir, exist_ok=True)
    # set the prefix: the file name without its directory or extension,
    # so that the output always lands inside out_dir
    prefix = path.splitext(path.basename(xlsx_path))[0]
    # load the workbook
    wb = load_workbook(xlsx_path, read_only=True)
    for sheet_name in wb.sheetnames:
        # generate the out path
        out_path = path.join(out_dir, f'{prefix}_{sheet_name}.csv')
        # open that file
        with open(out_path, 'w', newline='') as file:
            # create the writer
            writer = csv.writer(file)
            # get the sheet
            sheet = wb[sheet_name]
            for row in sheet.rows:
                # write each row to the csv
                writer.writerow([cell.value for cell in row])


xlsx_to_multi_csv('data.xlsx')
You just need to specify a path to save the CSVs to, and iterate through the dictionary that pandas returns when sheet_name=None, saving each frame to that directory.
import os
import pandas as pd

csv_path = r'\path\to\dir'  # raw string, so that \t is not read as a tab
for name, df in pd.read_excel('xl_path', sheet_name=None).items():
    df.to_csv(os.path.join(csv_path, name + '.csv'))
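A side note on the question's code: file_name.strip(".xlsx") does not remove the extension. str.strip treats its argument as a set of characters and removes any of them from both ends of the string, which happens to work for "t.xlsx" but not in general. The small sketch below shows the difference using os.path.splitext; the helper name csv_name and the sample file name are made up for illustration.

```python
import os


def csv_name(file_name, sheet_name):
    """Build '<workbook stem>_<sheet>.csv' without relying on str.strip."""
    stem, _extension = os.path.splitext(file_name)
    return f"{stem}_{sheet_name}.csv"


# strip removes any of the characters '.', 'x', 'l', 's' from both ends
print("sales.xlsx".strip(".xlsx"))   # ale
# splitext removes only the extension
print(csv_name("sales.xlsx", "Q1"))  # sales_Q1.csv
```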

Excel python openpyxl methods

Python Excel methods don't work correctly.
import pandas as pd
import os
from openpyxl import load_workbook
path = "C:\\My files\\Staff\\xProject\\ProjektExcelPython\\test_files\\"
spreadsheet_file = pd.read_excel(os.path.join(path, "PlikExcelDoKonwersji.xlsx"), engine='openpyxl', header = 1)
print(spreadsheet_file)
This works perfectly, but when I try to use methods from openpyxl I get an error.
import pandas as pd
import os
from openpyxl import load_workbook
path = "C:\\My files\\Staff\\xProject\\ProjektExcelPython\\test_files\\"
spreadsheet_file = pd.read_excel(os.path.join(path, "PlikExcelDoKonwersji.xlsx"), engine='openpyxl', header = 1)
#sheet = spreadsheet_file.sheet_by_name('sheet')
book = load_workbook(path, "PlikExcelDoKonwersji.xlsx")
sheet = book['SendMail1']
data = []
for row in sheet.rows:
    print(row[1].value)
Error:
line 94, in _validate_archive
raise InvalidFileException(msg)
openpyxl.utils.exceptions.InvalidFileException: openpyxl does not support file format, please check you can open it with Excel first. Supported formats are: .xlsx,.xlsm,.xltx,.xltm
Process finished with exit code 1
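The load_workbook function takes a single argument for the file path. The second positional argument is read_only, so in your call the file name is passed as that flag and openpyxl tries to open the folder path itself. Try:
book = load_workbook(os.path.join(path, "PlikExcelDoKonwersji.xlsx"))
Instead of passing the path as two values, pass it as one.
For example, try:
book = load_workbook("C:\\My files\\Staff\\xProject\\ProjektExcelPython\\test_files\\PlikExcelDoKonwersji.xlsx")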
Thanks guys, it works, but now another problem occurs.
I would like to start without the first row, as before. With pandas I used header = 1, but that doesn't work here, so I checked the documentation:
def load_workbook(filename, read_only=False, keep_vba=KEEP_VBA,
                  data_only=False, keep_links=True):
So it isn't there. How do you manage it in openpyxl?
I would like to do something like this:
Find a name in column[1] (the same name will be in column[2]) and get one of the values from row[7] using column[1] and column[2].
Thanks for any suggestions.
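On the follow-up: load_workbook has no header argument, but rows can be skipped when iterating, for example with sheet.iter_rows(min_row=2, values_only=True), where min_row is 1-based. The sketch below works on plain tuples shaped like the output of iter_rows(values_only=True), so it runs without a workbook. It is only one possible reading of the lookup described above: the function names, the 0-based column positions and the sample data are all assumptions.

```python
from itertools import islice


def skip_header(rows, header_rows=1):
    """Drop the first `header_rows` rows from any iterable of rows."""
    return islice(rows, header_rows, None)


def find_value(rows, name, first_col=0, second_col=1, value_col=6):
    """Return `value_col` of the first row whose two key columns equal `name`.

    Column positions are 0-based and purely illustrative.
    """
    for row in rows:
        if row[first_col] == name and row[second_col] == name:
            return row[value_col]
    return None


# Stand-in data shaped like sheet.iter_rows(values_only=True) output
rows = [
    ("Name", "Alias", "c", "d", "e", "f", "Value"),
    ("anna", "anna", None, None, None, None, 10),
    ("ben", "bob", None, None, None, None, 20),
]
print(find_value(skip_header(rows), "anna"))  # 10
```

With a real workbook the rows could come from sheet.iter_rows(min_row=2, values_only=True) instead, in which case skip_header is not needed because the iteration already starts at the second row.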

Using openpyxl to average the value of cells from multiple workbooks

I am new to Python and openpyxl. I have the beginning of a script that will loop through a directory and its subdirectories and find any xlsx file that starts with "BEAR". Each of these files is in exactly the same format. What I am trying to do is find the average of cell I3 across all the xlsx files it finds. Here is where I am so far.
I have tried altering my method by appending the sheet named "Sensor Status" to a new workbook. The problem I am now having is that it copies over all the worksheets, not just the "Sensor Status" sheet, and it overwrites itself, so I only have the data from the last file it looked at. How can I copy just the sheet I want without overwriting the output each time? Here is my code:
import os
import openpyxl
from openpyxl import Workbook
from openpyxl.reader.excel import load_workbook
from openpyxl import load_workbook
import csv
directoryPath = r'c:\users\username\documents\reporting\export\q3'
os.chdir(directoryPath)
folder_list = os.listdir(directoryPath)
for folders, sub_folders, file in os.walk(directoryPath):
    for name in file:
        if name.startswith("BEA"):
            filename = os.path.join(folders, name)
            wb = load_workbook(filename, data_only=True)
            sheet = wb.get_sheet_by_name("Sensor Status")
            rows = []
            for row in sheet.iter_rows(min_row=1):
                row_data = []
                for cell in row:
                    row_data.append(cell.value)
            wb.save('test.xlsx')
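For the stated goal of averaging cell I3, copying sheets into a new workbook may not be necessary: the value can be read from each file and collected in a list. This is a minimal sketch, assuming I3 holds a number (or a formula whose cached result data_only=True returns) and that the files are .xlsx; the function names numeric_mean and average_cell are made up for illustration.

```python
import os
from statistics import mean


def numeric_mean(values):
    """Average the numeric entries of `values`, ignoring None and text."""
    numbers = [
        v for v in values
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return mean(numbers) if numbers else None


def average_cell(root, prefix="BEAR", sheet_name="Sensor Status", cell="I3"):
    """Average one cell over every matching workbook below `root`."""
    # openpyxl is imported here so numeric_mean can be used without it
    from openpyxl import load_workbook

    values = []
    for folder, _sub_folders, files in os.walk(root):
        for name in files:
            if name.startswith(prefix) and name.endswith(".xlsx"):
                wb = load_workbook(os.path.join(folder, name), data_only=True)
                # Read a single cell; nothing is saved, so nothing is overwritten
                values.append(wb[sheet_name][cell].value)
                wb.close()
    return numeric_mean(values)
```

Calling average_cell(directoryPath) would then return the average, or None if no numeric value was found.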

How to delete an existing worksheet in excel file using xlutils, xlwt, xlrd with python

I searched in many places but did not see any example snippet showing how to delete an existing worksheet in an Excel file using xlutils or xlwt with Python. Can anyone help me, please?
I just dealt with this and, although this is not generally a good coding choice, you can use the internal (name-mangled) attribute _Workbook__worksheets to access and set the worksheets of a Workbook object.
write_book._Workbook__worksheets = [write_book._Workbook__worksheets[0]]
This strips everything but the first worksheet associated with the Workbook.
I just wanted to confirm that I got this to work using the answer David gave. Here is an example where I had a spreadsheet (workbook) with 40+ sheets that needed to be split into their own workbooks. I copied the master workbook, removed all but one sheet, and saved the result to a new spreadsheet:
from xlrd import open_workbook
from xlutils import copy

workbook = open_workbook(filepath)
# Process each sheet
for sheet in workbook.sheets():
    # Make a copy of the master workbook
    new_workbook = copy.copy(workbook)
    # for each time we copy the master workbook, remove all sheets except
    # for the current sheet (as defined by sheet.name)
    new_workbook._Workbook__worksheets = [ worksheet for worksheet in new_workbook._Workbook__worksheets if worksheet.name == sheet.name ]
    # Save the new_workbook based on sheet.name
    new_workbook.save('{}_workbook.xls'.format(sheet.name))
The following method does what you need:
def deleteAllSheetBut(workingFolder, xlsxFILE, sheetNumberNotToDelete=1):
    import win32com.client as win32
    import os
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    excel.Visible = False
    excel.DisplayAlerts = False
    wb = excel.Workbooks.Open(os.path.join(workingFolder, xlsxFILE))
    # Worksheets is 1-based; go backwards so that deleting a sheet does not
    # shift the indexes still to be visited, and the last sheet is included
    for i in range(wb.Worksheets.Count, 0, -1):
        if i != sheetNumberNotToDelete:
            wb.Worksheets(i).Delete()
    wb.Save()
    excel.DisplayAlerts = True
    excel.Application.Quit()
    return
Not sure about those modules, but you can try win32:
from win32com import client

def delete(self, number = 1):
    """
    (if the sheet is the first use 1. -1 to use real number)
    example: r.delete(1)
    """
    sheetnumber = int(number) - 1
