Related questions:
1. Error in converting txt to xlsx using python
Converting txt to xlsx while setting the cell property for number cells as number
My code is
import csv
import openpyxl
import sys
def convert(input_path, output_path):
    """
    Read a csv file (with no quoting), and save its contents in an excel file.
    """
    wb = openpyxl.Workbook()
    ws = wb.worksheets[0]
    with open(input_path) as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row_index, row in enumerate(reader, 1):
            for col_index, value in enumerate(row, 1):
                ws.cell(row=row_index, column=col_index).value = value
    print 'hello world'
    wb.save(output_path)
    print 'hello world2'

def main():
    try:
        input_path, output_path = sys.argv[1:]
    except ValueError:
        print 'Usage: python %s input_path output_path' % (sys.argv[0],)
    else:
        convert(input_path, output_path)

if __name__ == '__main__':
    main()
This code works, except for some input files. I couldn't figure out what the difference is between the input txt files that cause this problem and those that don't.
My first guess was encoding. I tried changing the encoding of the input file to UTF-8 and UTF-8 with BOM, but this failed.
My second guess was that it simply uses too much memory, but my computer has an SSD and 32 GB of RAM.
So perhaps this code is not fully utilizing the capacity of this RAM?
How can I fix this?
Edit: I added the lines
print 'hello world'
and
print 'hello world2'
to check whether all the parts before 'hello world' run correctly.
I confirmed that the code prints 'hello world', but not 'hello world2'.
So, it really seems likely that
wb.save(output_path)
is causing the problem.
openpyxl has optimised modes for reading and writing large files.
wb = Workbook(write_only=True) will enable this.
I'd also recommend that you install lxml for speed. This is all covered in the documentation.
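For illustration, a minimal sketch of the question's convert() function rewritten with the write-only mode (same tab-delimited input assumed):
import csv
import openpyxl

def convert(input_path, output_path):
    # a write-only workbook streams rows to disk instead of keeping every cell in memory
    wb = openpyxl.Workbook(write_only=True)
    ws = wb.create_sheet()
    with open(input_path) as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            ws.append(row)   # whole rows are appended; cells cannot be revisited later
    wb.save(output_path)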
Below are three alternatives:
RANGE FOR LOOP
Possibly, the two enumerate() calls add overhead, since indexing must occur in a nested for loop. Consider reading the csv.reader content into a list (which is subscriptable) and using range(). Admittedly even this may not be efficient: in Python 2, each range() call (unlike xrange()) builds its own list in memory, although in Python 3 range() is a lazy sequence.
with open(input_path) as f:
    reader = csv.reader(f, delimiter='\t')   # tab-delimited input as in the question
    rows = []
    for data in reader:
        rows.append(data)
for i in range(len(rows)):
    for j in range(len(rows[0])):
        # openpyxl rows and columns are 1-indexed
        ws.cell(row=i + 1, column=j + 1).value = rows[i][j]
OPTIMIZED WRITER
OpenPyXL even warns that scrolling through cells, even without assigning values, will retain them in memory. As a solution, you can use the optimized (write-only) writer with the rows list produced above from csv.reader. This route appends entire rows to a write-only workbook instance:
from openpyxl import Workbook

wb = Workbook(write_only=True)
ws = wb.create_sheet()

for irow in rows:
    # coerce every value to a string and append the whole row at once
    ws.append(['%s' % cell for cell in irow])

wb.save(r'C:\Path\To\Outputfile.xlsx')
WIN32COM LIBRARY
Finally, consider using the win32com library (part of the pywin32 package), where you open the csv in Excel and save it as an xlsx or xls workbook. Do note this package works only on Windows installations of Python with Excel available.
import win32com.client as win32
excel = win32.Dispatch('Excel.Application')
# OPEN CSV DIRECTLY INSIDE EXCEL
wb = excel.Workbooks.Open(input_path)
excel.Visible = False
outxl=r'C:\Path\To\Outputfile.xlsx'
# SAVE EXCEL AS xlOpenXMLWorkbook TYPE (51)
wb.SaveAs(outxl, FileFormat=51)
wb.Close(False)
excel.Quit()
Here are a few points you can consider:
Check the /tmp folder, the default location where temporary files are created;
Your code may be using up all the space in that folder. Either increase the space available there or change the temporary-file path when creating the workbook;
I used an in-memory workbook for my task and it worked.
Below is my code:
#!/usr/bin/python
import os
import csv
import io
import sys
import traceback
from xlsxwriter.workbook import Workbook

fileNames = sys.argv[1]
try:
    f = open(fileNames, mode='r')
    workbook = Workbook(fileNames[:-4] + '.xlsx', {'in_memory': True})
    worksheet = workbook.add_worksheet()
    workbook.use_zip64()
    rowCnt = 0
    # Create the bold style for the header row
    for line in f:
        rowCnt = rowCnt + 1
        row = line.split("\001")
        for j in range(len(row)):
            worksheet.write(rowCnt, j, row[j].strip())
    f.close()
    workbook.close()
    print ('success')
except ValueError:
    print ('failure')
Related
I'm having a problem with reading .xls files in Pandas.
Here's the code
df = pd.read_excel('sample.xls')
And the output states,
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xff\xfeD\x00A\x00T\x00'
Anyone experiencing the same issue? How to fix it?
# Changing the data types of all strings in the module at once
from __future__ import unicode_literals
# Used to save the file as excel workbook
# Need to install this library
from xlwt import Workbook
# Used to open the corrupt excel file
import io
filename = r'sample.xls'
# Opening the file using 'utf-16' encoding
file1 = io.open(filename, "r", encoding="utf-16")
data = file1.readlines()
# Creating a workbook object
xldoc = Workbook()
# Adding a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
# Iterating and saving the data to sheet
for i, row in enumerate(data):
    # Two things are done here
    # Removing the '\n' which comes while reading the file using io.open
    # Getting the values after splitting using '\t'
    for j, val in enumerate(row.replace('\n', '').split('\t')):
        sheet.write(i, j, val)
# Saving the file as an excel file
xldoc.save('1.xls')
Credits to this Medium Article
I have a python script to convert Xlsx files to CSV like this -
import sys
import xlrd
import csv
def toCsv(xlsPath, csvPath):
    wb = xlrd.open_workbook(xlsPath)
    sh = wb.sheet_by_index(0)
    your_csv_file = open(csvPath, 'wb')
    wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
    for rownum in xrange(sh.nrows):
        rv = sh.row(rownum)
        print rv
        wr.writerow(sh.row_values(rownum))
    your_csv_file.close()

toCsv(sys.argv[1], sys.argv[2])
Some of the cells in the XLSX file contain single and double quotes. When the script runs on such cells, the output CSV comes out empty. How do I escape the single or double quotes? I tried, unsuccessfully, setting Dialect.doublequote to False and to True as the third argument to the writer() call. Please help.
Thanks
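For what it's worth, doublequote is a keyword argument to csv.writer() (not a positional Dialect argument); a minimal sketch under that assumption, with a placeholder file name and sample row:
import csv

with open('out.csv', 'wb') as f:   # Python 2; on Python 3 use open('out.csv', 'w', newline='')
    wr = csv.writer(f, quoting=csv.QUOTE_ALL, doublequote=True)
    # doublequote=True (the default) escapes an embedded " by writing it as ""
    wr.writerow(['plain', 'has "double" quotes', "has 'single' quotes"])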
I saw this post to append a sheet using xlutils.copy:
https://stackoverflow.com/a/38086916/2910740
Is there any solution which uses only openpyxl?
I found a solution. It was very easy:
def store_excel(self, file_name, sheet_name):
    if os.path.isfile(file_name):
        self.workbook = load_workbook(filename=file_name)
        self.worksheet = self.workbook.create_sheet(sheet_name)
    else:
        self.workbook = Workbook()
        self.worksheet = self.workbook.active
        self.worksheet.title = time.strftime(sheet_name)
    .
    .
    .
    self.worksheet.cell(row=row_num, column=col_num).value = data
I would recommend storing data in a CSV file, which is a ubiquitous file format made specifically to store tabular data. Excel supports it fully, as do most open source Excel-esque programs.
In that case, it's as simple as opening up a file to append to it, rather than write or read:
with open("output.csv", "a") as csvfile:
wr = csv.writer(csvfile, dialect='excel')
wr.writerow(YOUR_LIST)
As for Openpyxl:
end_of_sheet = your_sheet.max_row
will return how many rows your sheet has, so you can start writing at the position after that.
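For illustration, a rough sketch of appending below the existing rows using only openpyxl (the file name and row contents are placeholders):
import os
from openpyxl import Workbook, load_workbook

file_name = 'output.xlsx'   # placeholder path
if os.path.isfile(file_name):
    wb = load_workbook(file_name)
    ws = wb.active
else:
    wb = Workbook()
    ws = wb.active

# ws.append() writes to the first row after ws.max_row,
# so rows accumulate across runs
ws.append(['col_a', 'col_b', 'col_c'])
wb.save(file_name)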
I have quite a lot of xlsx files, and it is a pain to convert them one by one to tab-delimited files.
I would like to know if there is any solution to do this with Python. Here is what I found and what I tried, without success.
I found and tried the solution from Mass Convert .xls and .xlsx to .txt (Tab Delimited) on a Mac, but it did not work.
I also tried to do it for a single file to see how it works, but with no success:
#!/usr/bin/python
import xlrd
import csv

def main():
    # I open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # I don't know the name of sheet
    mysheet = myfile.sheet_by_index(0)
    # I open the output csv
    myCsvfile = open('my.csv', 'wb')
    # I write the file into it
    wr = csv.writer(myCsvfile, delimiter="\t")
    for rownum in xrange(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
    myCsvfile.close()

if __name__ == '__main__':
    main()
There's no real need for the main function.
I'm not sure about your indentation problems, but this is how I would write what you have. (And it should work, according to the first comment above.)
#!/usr/bin/python
import xlrd
import csv

# open the output csv
with open('my.csv', 'wb') as myCsvfile:
    # define a writer
    wr = csv.writer(myCsvfile, delimiter="\t")
    # open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # get a sheet
    mysheet = myfile.sheet_by_index(0)
    # write the rows
    for rownum in xrange(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
Why go with so much pain when you can do it in 3 lines:
import pandas as pd

df = pd.read_excel('myfile.xlsx')
df.to_csv('myfile.txt',
          sep="\t",
          index=False)
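Since the question is about many files, here is a rough sketch of the same idea looped over a folder with glob (the output name simply swaps the extension for .txt; adjust as needed):
import glob
import os
import pandas as pd

for xlsx_path in glob.glob('*.xlsx'):
    df = pd.read_excel(xlsx_path)
    # write a tab-delimited text file next to each workbook
    txt_path = os.path.splitext(xlsx_path)[0] + '.txt'
    df.to_csv(txt_path, sep='\t', index=False)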
I've got a section of code in a project that's supposed to be reading a CSV file and writing each row to an XLSX file. Right now I'm getting the error "argument 1 must be an iterator" when I run via command line.
Here is the relevant code:
import os
import openpyxl
import csv
from datetime import datetime
from openpyxl.reader.excel import load_workbook
...
plannum = 4
...
alldata_sheetname = ("All Test Data " + str(plannum))
wb = load_workbook("testingtemplate.xlsx", keep_vba=True)
...
ws_testdata = wb.get_sheet_by_name(alldata_sheetname)
...
with open("testdata.csv", 'r') as csvfile:
table = csv.reader(csvfile)
for row in table:
ws_testdata.append(row)
csv_read = csv.reader(csvfile)
...
And the specific error reads: "TypeError: argument 1 must be an iterator", and is referencing the last line of code I've provided.
Since it didn't complain the first time I used csvfile, would it be better if I did something like csvfile = open("testdata.csv", "r") instead of using the with statement (and is that what I'm doing wrong here)? If that's the case, is there anything else I need to change?
Thanks to anyone who helps!!
You've closed the file by the time you get to csv_read = csv.reader(csvfile). Alternatively, you can keep the file open and store what you need in variables so you don't have to iterate over the file twice. E.g.:
csvfile = open("testdata.csv", "r")
table = csv.reader(csvfile)
for row in table:
    ws_testdata.append(row)
    # store what you need in variables
csvfile.close()
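Alternatively, a small sketch that keeps the with block from the question and rewinds the file before building the second reader (assuming the same ws_testdata sheet), so both readers run while the file is still open:
with open("testdata.csv", 'r') as csvfile:
    table = csv.reader(csvfile)
    for row in table:
        ws_testdata.append(row)
    csvfile.seek(0)                 # rewind so the file can be read again
    csv_read = csv.reader(csvfile)
    # ... use csv_read here, before the with block closes the file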