Trying to save Excel files with multiple sheets as corresponding CSV files.
I tried the following method:
import xlrd
from openpyxl import Workbook, load_workbook
import pathlib
import shutil
import pandas as pd

def strip_xlsx(inputdir, file_name, targetdir):
    wb = load_workbook(inputdir)
    sheets = wb.sheetnames
    for s in sheets:
        temp_df = pd.read_excel(inputdir, sheet_name=s)
        temp_df.to_csv(targetdir + "/" + file_name.strip(".xlsx") + "_" + s + ".csv", encoding='utf-8-sig')
Here inputdir is an absolute path to the Excel file (say: "/Users/me/test/t.xlsx"), file_name is just the name of the file ("t.xlsx") and targetdir is the directory in which I want to save the CSV files.
The method works well, though it is super slow. I'm a newbie to Python and feel like I implemented it in a very inefficient way.
Would appreciate tips from the masters.
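A side note on the file-name handling in the snippet above: `str.strip(".xlsx")` removes any of the characters `.`, `x`, `l` and `s` from both ends of the string rather than the literal suffix, so `"sales.xlsx".strip(".xlsx")` yields `"ale"`. A minimal sketch of a safer way to build the output path with `pathlib` (the helper name `csv_path_for` is hypothetical, not part of the original code):

```python
from pathlib import Path


def csv_path_for(xlsx_path, sheet_name, target_dir):
    """Build '<target_dir>/<workbook stem>_<sheet_name>.csv'."""
    # Path.stem drops only the final extension, e.g. 'sales.xlsx' -> 'sales'
    stem = Path(xlsx_path).stem
    return Path(target_dir) / f"{stem}_{sheet_name}.csv"
```

With this helper the `file_name` argument is no longer needed, since the stem can be derived from the full path.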
You may have better luck if you keep everything in pandas. I see you are using openpyxl to get the sheet names; you can do this in pandas as well. As for speed, you'll just have to see:
EDIT:
As Charlie (the person who probably knows the most about openpyxl on the planet) pointed out, using just openpyxl will be faster. In this case about 25% faster (9.29 ms -> 6.87 ms for my two-sheet test):
from os import path, mkdir
from openpyxl import load_workbook
import csv

def xlsx_to_multi_csv(xlsx_path: str, out_dir: str = '.') -> None:
    """Write each sheet of an Excel file to a csv
    """
    # make the out directory if it does not exist (this is not EAFP)
    if not path.exists(out_dir):
        mkdir(out_dir)
    # set the prefix: file name only, without directories or extension,
    # so that joining it with out_dir works for any input path
    prefix = path.splitext(path.basename(xlsx_path))[0]
    # load the workbook
    wb = load_workbook(xlsx_path, read_only=True)
    for sheet_name in wb.sheetnames:
        # generate the out path
        out_path = path.join(out_dir, f'{prefix}_{sheet_name}.csv')
        # open that file
        with open(out_path, 'w', newline='') as file:
            # create the writer
            writer = csv.writer(file)
            # get the sheet
            sheet = wb[sheet_name]
            for row in sheet.rows:
                # write each row to the csv
                writer.writerow([cell.value for cell in row])

xlsx_to_multi_csv('data.xlsx')
You just need to specify a path to save the csv's to, and iterate through a dictionary created by pandas to save the frames to the directory.
import os
import pandas as pd

csv_path = r'\path\to\dir'  # raw string, so '\t' is not read as a tab
for name, df in pd.read_excel('xl_path', sheet_name=None).items():
    df.to_csv(os.path.join(csv_path, name + '.csv'))
Related
I have two excel files (.xls) in the "Files" folder. I want to take each sheet of them both and put them into one separate workbook, called masterFile.xls. The code below downloads some example files so you can see what I'm working with.
import pandas as pd
import os
import requests

resp = requests.get("https://www.ons.gov.uk/file?uri=%2femploymentandlabourmarket%2fpeopleinwork%2femploymentandemployeetypes%2fdatasets%2fsummaryoflabourmarketstatistics%2fcurrent/a01dec2021.xls")
output = open("1.xls", 'wb')
output.write(resp.content)
output.close()

resp = requests.get("https://www.ons.gov.uk/file?uri=%2femploymentandlabourmarket%2fpeopleinwork%2femploymentandemployeetypes%2fdatasets%2femploymentunemploymentandeconomicinactivityforpeopleaged16andoverandagedfrom16to64seasonallyadjusteda02sa%2fcurrent/a02sadec2021.xls")
output = open("2.xls", 'wb')
output.write(resp.content)
output.close()

cwd = os.path.abspath('')
files = os.listdir(cwd)

for file in files:
    if file.endswith('.xls'):
        excelFile = pd.ExcelFile(file)
        sheets = excelFile.sheet_names
        for sheet in sheets:
            data = pd.read_excel(excelFile,sheet_name = sheet)
            data.to_excel("masterFile.xls",sheet_name = sheet)
Each time it adds the sheet, it replaces whatever was already there instead of adding a new sheet.
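The overwrite happens because every `to_excel` call that is given a file name creates that file from scratch. A minimal sketch of one way around it: open a single `pd.ExcelWriter` and pass it to each `to_excel` call. This assumes pandas with openpyxl installed and an `.xlsx` output (pandas 2.x no longer writes `.xls`, and reading `.xls` inputs needs xlrd). The function name `combine_workbooks` and the sheet-naming scheme are illustrative assumptions.

```python
from pathlib import Path

import pandas as pd


def combine_workbooks(paths, out_path):
    """Copy every sheet of every input workbook into one output workbook."""
    # One writer for the whole job, so each to_excel call adds a sheet
    # instead of recreating the file.
    with pd.ExcelWriter(out_path) as writer:
        for path in paths:
            # sheet_name=None returns a dict of {sheet name: DataFrame}
            frames = pd.read_excel(path, sheet_name=None)
            for sheet, frame in frames.items():
                # Prefix with the file stem to avoid name clashes;
                # Excel limits sheet names to 31 characters.
                name = f"{Path(path).stem}_{sheet}"[:31]
                frame.to_excel(writer, sheet_name=name, index=False)
```

Calling `combine_workbooks(["1.xlsx", "2.xlsx"], "masterFile.xlsx")` would then produce one workbook with a sheet per input sheet.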
I am new to Python and openpyxl. I have the beginning of a script that will loop through a directory and its subdirectories and find any xlsx file that starts with "BEAR". All of these files have exactly the same format. What I am trying to do is find the average of cell I3 across all the xlsx files it finds. Here is where I am so far.
I have tried altering my method by appending the sheet named "Sensor Status" to a new workbook. The problem I am now having is that it copies over all the worksheets, not just the "Sensor Status" sheet, and it overwrites itself, so I only have the data from the last file it looked at. How can I copy just the sheet I want without overwriting the output each time? Here is my code:
import os
import openpyxl
from openpyxl import Workbook
from openpyxl.reader.excel import load_workbook
from openpyxl import load_workbook
import csv

directoryPath = r'c:\users\username\documents\reporting\export\q3'
os.chdir(directoryPath)
folder_list = os.listdir(directoryPath)
for folders, sub_folders, file in os.walk(directoryPath):
    for name in file:
        if name.startswith("BEA"):
            filename = os.path.join(folders, name)
            wb = load_workbook(filename, data_only=True)
            sheet = wb.get_sheet_by_name("Sensor Status")
            rows = []
            for row in sheet.iter_rows(min_row=1):
                row_data = []
                for cell in row:
                    row_data.append(cell.value)
            wb.save('test.xlsx')
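For the averaging part of the question, nothing needs to be saved at all: reading the one cell from each workbook and collecting the values is enough. A minimal sketch, assuming the cell holds a number (or is empty) and that the sheet is called "Sensor Status" in every file; the function name `average_cell` and its parameters are hypothetical.

```python
import os
from statistics import mean

from openpyxl import load_workbook


def average_cell(root, prefix="BEAR", sheet_name="Sensor Status", cell="I3"):
    """Average one cell across all matching .xlsx files under root."""
    values = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name.startswith(prefix) and name.endswith(".xlsx"):
                # data_only=True returns cached formula results, not formulas
                wb = load_workbook(os.path.join(folder, name), data_only=True)
                value = wb[sheet_name][cell].value
                if value is not None:  # skip empty cells
                    values.append(value)
                wb.close()
    # None signals that no usable value was found
    return mean(values) if values else None
```

Keeping the values in a plain list also sidesteps the overwrite problem, because no workbook is written inside the loop.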
I have roughly 30 excel workbooks I need to combine into one. Each workbook has a variable number of sheets but the sheet I need to combine from each workbook is called "Output" and the format of the columns in this sheet is consistent.
I need to import the Output sheet from the first file, then append the remaining files and ignore the header row.
I have tried to do this using glob/pandas to no avail.
You could use openpyxl. Here's a sketch of the code:
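from openpyxl import load_workbook

compiled_wb = load_workbook(filename = 'yourfile1.xlsx')
compiled_ws = compiled_wb['Output']
# files 2..30; file 1 is already loaded as the base workbook
for i in range(2, 31):
    wb = load_workbook(filename = 'yourfile{}.xlsx'.format(i))
    ws = wb['Output']
    # min_row=2 skips the header row; append() takes one row at a time
    for row in ws.iter_rows(min_row=2, values_only=True):
        compiled_ws.append(row)
compiled_wb.save('compiled.xlsx')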
Method shown by Clinton c. Brownley in Foundations for Analytics with Python:
Run the script from a shell, passing the path to the folder containing the Excel files (make sure the pattern that defines all_workbooks matches your files), followed by the path of the Excel output file:
python script.py <path/to/excel/folder> <path/to/output.xlsx>
script.py:
import pandas as pd
import sys
import os
import glob

input_path = sys.argv[1]
output_file = sys.argv[2]

all_workbooks = glob.glob(os.path.join(input_path, '*.xlsx'))
all_df = []
for workbook in all_workbooks:
    # a single sheet name returns one DataFrame, headers already parsed
    data = pd.read_excel(workbook, sheet_name='Output', index_col=None)
    all_df.append(data)
data_concatenated = pd.concat(all_df, axis=0, ignore_index=True)

# the context manager saves and closes the file on exit
with pd.ExcelWriter(output_file) as writer:
    data_concatenated.to_excel(writer, sheet_name='concatenated_Output', index=False)
This will probably get down-voted because this isn't a Python answer, but honestly, I wouldn't use Python for this kind of task. I think you are far better off installing the AddIn below, and using that for the job.
https://www.rondebruin.nl/win/addins/rdbmerge.htm
Click 'Merge all files from the folder in the Files location selection' and click 'Use a Worksheet name' = 'Output', and finally, I think you want 'First cell'. Good luck!
I saw this post to append a sheet using xlutils.copy:
https://stackoverflow.com/a/38086916/2910740
Is there any solution which uses only openpyxl?
I found a solution. It was very easy:
def store_excel(self, file_name, sheet_name):
    if os.path.isfile(file_name):
        self.workbook = load_workbook(filename = file_name)
        self.worksheet = self.workbook.create_sheet(sheet_name)
    else:
        self.workbook = Workbook()
        self.worksheet = self.workbook.active
        self.worksheet.title = time.strftime(sheet_name)
    .
    .
    .
    self.worksheet.cell(row=row_num, column=col_num).value = data
I would recommend storing data in a CSV file, which is a ubiquitous file format made specifically to store tabular data. Excel supports it fully, as do most open source Excel-esque programs.
In that case, it's as simple as opening up a file to append to it, rather than write or read:
import csv

# newline="" stops the csv module from writing blank lines on Windows
with open("output.csv", "a", newline="") as csvfile:
    wr = csv.writer(csvfile, dialect='excel')
    wr.writerow(YOUR_LIST)
As for Openpyxl:
end_of_sheet = your_sheet.max_row
returns the number of the last used row in your sheet, so you can start writing at the row after it.
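A minimal sketch of that idea with openpyxl, assuming the workbook and the sheet already exist and contain some data; `append_row` is a hypothetical helper. Writing to `max_row + 1` puts the values on the first row after the existing content, and `ws.append(values)` does much the same in a single call.

```python
from openpyxl import load_workbook


def append_row(file_name, sheet_name, values):
    """Write values into the first row after the sheet's existing data."""
    wb = load_workbook(file_name)
    ws = wb[sheet_name]
    next_row = ws.max_row + 1
    for col_num, value in enumerate(values, start=1):
        ws.cell(row=next_row, column=col_num, value=value)
    # Save back to the same file so the new row persists
    wb.save(file_name)
    return next_row
```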
I have quite a lot of xlsx files, and converting them one by one to tab-delimited files is a pain.
I would like to know whether there is a way to do this with Python. Here is what I found and what I tried, without success.
I found this question and tried its solution, but it did not work: "Mass Convert .xls and .xlsx to .txt (Tab Delimited) on a Mac".
I also tried converting a single file to see how it works, again with no success:
#!/usr/bin/python
import xlrd
import csv

def main():
    # I open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # I don't know the name of sheet
    mysheet = myfile.sheet_by_index(0)
    # I open the output csv
    myCsvfile = open('my.csv', 'wb')
    # I write the file into it
    wr = csv.writer(myCsvfile, delimiter="\t")
    for rownum in xrange(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
    myCsvfile.close()

if __name__ == '__main__':
    main()
There is no real need for the main function.
I'm not sure about your indentation problems, but this is how I would write what you have. (It should work, according to the first comment above.)
#!/usr/bin/python
import xlrd
import csv

# open the output csv
with open('my.csv', 'wb') as myCsvfile:
    # define a writer
    wr = csv.writer(myCsvfile, delimiter="\t")
    # open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # get a sheet
    mysheet = myfile.sheet_by_index(0)
    # write the rows
    for rownum in xrange(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
Why go through so much pain when you can do it in 3 lines:
import pandas as pd
file = pd.read_excel('myfile.xlsx')
# write to a separate .txt file so the source workbook is not overwritten
file.to_csv('myfile.txt',
            sep="\t",
            index=False)
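Since the question is about many files, here is a minimal sketch that applies the same idea to a whole folder. It assumes pandas with openpyxl installed, that only the first sheet of each workbook is wanted, and that each output should sit next to its input with a `.txt` extension; `convert_folder` is a hypothetical name.

```python
from pathlib import Path

import pandas as pd


def convert_folder(folder):
    """Write the first sheet of each .xlsx file in folder as tab-delimited text."""
    written = []
    for xlsx_path in sorted(Path(folder).glob("*.xlsx")):
        frame = pd.read_excel(xlsx_path)  # reads the first sheet by default
        out_path = xlsx_path.with_suffix(".txt")
        frame.to_csv(out_path, sep="\t", index=False)
        written.append(out_path)
    return written
```

The returned list of paths makes it easy to log or check which files were produced.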