lxml + loads of files = random SerialisationError: IO_WRITE

I'm using lxml and Python 3 to parse many files and merge the files that belong together.
The files are stored in pairs (which are also merged first) inside zip files, but I don't think that matters here.
We're talking about 100k files that are about 900MB in zipped form.
My problem is that the script works fine, but at some point I get the error below. The point differs between runs, so it shouldn't be a problem with a particular file:
      File "C:\Users\xxx\workspace\xxx\src\zip2xml.py", line 110, in _writetonorm
        normroot.getroottree().write(norm_file_path)
      File "lxml.etree.pyx", line 1866, in lxml.etree._ElementTree.write (src/lxml\lxml.etree.c:46006)
      File "serializer.pxi", line 481, in lxml.etree._tofilelike (src/lxml\lxml.etree.c:93719)
      File "serializer.pxi", line 187, in lxml.etree._raiseSerialisationError (src/lxml\lxml.etree.c:90965)
    lxml.etree.SerialisationError: IO_WRITE
I have no idea what causes this error.
The entire code is a little cumbersome, so I hope the relevant parts suffice:
    def _writetonorm(self, outputpath):
        '''Writes the current XML to a file.
        It'll update the file if it already exists and create the file otherwise'''
        #Find Name
        name = None
        try:
            name = self._xml.xpath("xxx")[0].text.rstrip().lstrip()
        except Exception as e:
            try:
                name = self._xml.xpath("xxx")[0].text.rstrip().lstrip()
            except Exception as e:
                name = "damn it!"
        if name != None:
            #clean name a bit
            name = name[:35]
            table = str.maketrans(' /#*"$!&<>-:.,;()','_________________')
            name = name.translate(table)
            name = name.lstrip("_-").rstrip("_-")
            #generate filename
            norm_file_name = name + ".xml"
            norm_file_path = os.path.join(outputpath, norm_file_name)
            #Check if we have that completefile already. If we do, update it.
            if os.path.isfile(norm_file_path):
                norm_file = etree.parse(norm_file_path, self._parser)
                try:
                    normroot = norm_file.getroot()
                except:
                    print(norm_file_path + "is broken !!!!")
                    time.sleep(10)
            else:
                normroot = etree.Element("norm")
            jurblock = etree.Element("jurblock")
            self._add_jurblok_attributes(jurblock)
            jurblock.insert(0, self._xml)
            normroot.insert(0, jurblock)
            try:
                normroot.getroottree().write(norm_file_path) #here the Exception occurs
            except Exception as e:
                print(norm_file_path)
                raise e
I know that my exception handling isn't great, but this is just a proof of concept for now.
Can anyone tell me why the error happens?
Looking at the file that causes the error, it is not well-formed, but I suspect that is a result of the error and that the file was fine before the latest iteration.

It seems to have been a mistake to use mapped network drives for this. The exception does not occur when the script works with the files locally.
Learned something :)
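
If the output ultimately has to live on a network share, one possible workaround is to write each file to local disk first and copy the finished file over afterwards. The following is only a minimal sketch under that assumption; `write_locally_then_copy` and `write_func` are hypothetical names, not part of lxml or of the script in the question.

```python
import os
import shutil
import tempfile


def write_locally_then_copy(write_func, final_path):
    """Run write_func(local_path) on a local temp file, then copy it to final_path.

    write_func is any callable that takes a file path and writes the output there.
    """
    # Create the temp file in the local temp directory, not on the share.
    fd, tmp_path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        write_func(tmp_path)
        # Copy the finished file to its destination in one step.
        shutil.copyfile(tmp_path, final_path)
    finally:
        # Always clean up the local temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With the code from the question, the call might look like `write_locally_then_copy(normroot.getroottree().write, norm_file_path)`. The share is then touched once per file, for the final copy, rather than throughout serialisation.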

Related

Python win32com SaveAs2 error when I try to save file

    def convert_to_word():
        target = pwd + "/open.doc"
        source = pwd + "/template.html"
        pythoncom.CoInitialize()
        app = win32com.client.Dispatch("Word.Application")
        pythoncom.CoInitialize()
        # try:
        app.Documents.Open(source)
        app.Documents.SaveAs2(target,FileFormat=0)
        app.Documents.Open(source)
        app.Selection.WholeStory()
        app.Selection.Fields.Unlink()
        app.Documents.Save()
        # except Exception as e:
        #     print(e)
        # finally:
        app.ActiveDocument.Close()
I need to save an HTML file as .doc, but it reports an error, <unknown>.SaveAs2, which I can't solve.
Can anyone help me? Thanks

My first answer on Stack Overflow...
I had the same trouble and then found a solution:
Your code seems to be almost OK. You just need to store the opened document in a variable and then call SaveAs2 on that variable.
    my_doc = app.Documents.Open(source)
    my_doc.SaveAs2(target, FileFormat=0)

Fixing "TypeError: '_io.TextIOWrapper' object is not subscriptable" in python script

I'm trying to store some data in a config.json for a bot that I'm working on, but I keep getting the same error every time I attempt to run it.
I'm running Python 3.7.3 and the latest version of the rewrite. I've tried moving the config.json file around, to no avail. I'm probably missing something incredibly obvious, but I don't know what.
Where the exception is being raised:
    with open("config.json", "r") as infile:
        try:
            CONFIG = json.load(infile)
            _ = infile["token"]
            _ = infile["owner"]
        except (KeyError, FileNotFoundError):
            raise EnvironmentError(
                "Your config.json file is either missing, or incomplete. Check your config.json and ensure it has the keys 'token' and 'owner_id'"
            )
Expected Result: Code pulls token and owner from the file, and proceeds to run the bot.
Actual Result: Bot doesn't get launched. Traceback output -
      File "/Users/prismarine/Desktop/Project_Prismarine/core.py", line 11, in <module>
        _ = infile["token"]
    TypeError: '_io.TextIOWrapper' object is not subscriptable

You're trying to index the file handle as if it were a dictionary, instead of indexing the JSON dictionary stored in CONFIG. Instead, try:
    with open("config.json", "r") as infile:
        try:
            CONFIG = json.load(infile)
            token = CONFIG["token"]
            owner = CONFIG["owner"]
        except (KeyError, FileNotFoundError):
            raise EnvironmentError(
                "Your config.json file is either missing, or incomplete. Check your config.json and ensure it has the keys 'token' and 'owner_id'"
            )
Note also that an underscore is usually used as a variable name for a value that won't be used anywhere, and that in your case the underscore would be assigned CONFIG['token'] and then immediately reassigned CONFIG['owner']. I gave them unique variable names in case you're planning to use them later.
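
One thing the snippet above leaves open is that `open()` runs before the `try` block, so a missing file would raise `FileNotFoundError` before the handler is reached. A minimal sketch of one way to cover both cases follows. `load_config` and `REQUIRED_KEYS` are hypothetical names chosen for illustration, and the key names `token` and `owner` are taken from the question's code.

```python
import json

# Keys the bot expects to find in the config file (assumed from the question).
REQUIRED_KEYS = ("token", "owner")


def load_config(path="config.json"):
    """Load the JSON config and check that the required keys are present."""
    try:
        # open() is inside the try block so a missing file is caught here.
        with open(path, "r") as infile:
            config = json.load(infile)
    except FileNotFoundError:
        raise EnvironmentError("{} is missing".format(path)) from None

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise EnvironmentError(
            "{} is incomplete, missing keys: {}".format(path, ", ".join(missing))
        )
    return config
```

The caller can then read `config["token"]` and `config["owner"]` from the returned dictionary rather than from the file handle.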

How to skip one part of a single loop iteration in Python

I am creating about 200 variables within a single iteration of a Python loop (extracting fields from Excel documents and pushing them to a SQL database) and I am trying to figure something out.
Let's say that a single iteration is a single Excel workbook in a directory that I am looping through. I am extracting around 200 fields from each workbook.
Suppose one of the fields I extract (say field #56 out of 200) isn't in the proper format (say the date was filled out wrong, e.g. 9/31/2015, which isn't a real date), so the operation I am performing on it raises an error.
I want the loop to skip that variable and proceed to creating variable #57. I don't want the loop to move on to the next iteration or workbook; I just want it to ignore the error on that variable and continue with the rest of the variables for that iteration.
How would I go about doing something like this?
In this sample code I would like to continue extracting "PolicyState" even if ExpirationDate has an error.
Some sample code:
    import datetime as dt
    import os as os
    import xlrd as rd
    files = os.listdir(path)
    for file in files: #Loop through all files in path directory
        filename = os.fsdecode(file)
        if filename.startswith('~'):
            continue
        elif filename.endswith( ('.xlsx', '.xlsm') ):
            try:
                book = rd.open_workbook(os.path.join(path,file))
            except KeyError:
                print ("Error opening file for "+ file)
                continue
            SoldModelInfo=book.sheet_by_name("SoldModelInfo")
            AccountName=str(SoldModelInfo.cell(1,5).value)
            ExpirationDate=dt.datetime.strftime(xldate_to_datetime(SoldModelInfo.cell(1,7).value),'%Y-%m-%d')
            PolicyState=str(SoldModelInfo.cell(1,6).value)
            print("Insert data of " + file +" was successful")
        else:
            continue

Use multiple try blocks. Wrap each decode operation that might go wrong in its own try block to catch the exception, do something, and carry on with the next one.
    try:
        book = rd.open_workbook(os.path.join(path,file))
    except KeyError:
        print ("Error opening file for "+ file)
        continue
    errors = []
    SoldModelInfo=book.sheet_by_name("SoldModelInfo")
    AccountName=str(SoldModelInfo.cell(1,5).value)
    try:
        ExpirationDate=dt.datetime.strftime(xldate_to_datetime(SoldModelInfo.cell(1,7).value),'%Y-%m-%d')
    except WhateverError as e:
        # do something, maybe set a default date?
        ExpirationDate = default_date
        # and/or record that it went wrong?
        errors.append( [ "ExpirationDate", e ])
    PolicyState=str(SoldModelInfo.cell(1,6).value)
    ...
    # at the end
    if not errors:
        print("Insert data of " + file +" was successful")
    else:
        # things went wrong somewhere above.
        # the contents of errors will let you work out what
        pass
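
With around 200 fields, the same idea can be driven from a table of field names and parser functions, so that there is a single try block inside a loop over the fields. The following is a minimal, self-contained sketch under assumptions: `FIELD_PARSERS`, `parse_date` and `extract_fields` are hypothetical names, and a plain dictionary stands in for the worksheet so the example runs without xlrd.

```python
import datetime as dt


def parse_date(text):
    """Parse a month/day/year string and return it as YYYY-MM-DD."""
    return dt.datetime.strptime(text, "%m/%d/%Y").strftime("%Y-%m-%d")


# One parser per field; a real script would list all of its fields here.
FIELD_PARSERS = {
    "AccountName": str,
    "ExpirationDate": parse_date,
    "PolicyState": str,
}


def extract_fields(raw, parsers=FIELD_PARSERS, default=None):
    """Parse every field, recording failures instead of aborting."""
    values, errors = {}, []
    for name, parse in parsers.items():
        try:
            values[name] = parse(raw[name])
        except (ValueError, TypeError, KeyError) as exc:
            # Skip only this field and carry on with the next one.
            values[name] = default
            errors.append((name, exc))
    return values, errors


raw_row = {"AccountName": "Acme", "ExpirationDate": "9/31/2015", "PolicyState": "NY"}
values, errors = extract_fields(raw_row)
```

Here the invalid date ends up as the default value and is recorded in `errors`, while `PolicyState` is still extracted.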

As suggested, you could use a separate try block for each variable you extract, or you could streamline it with a custom function that handles the try for you:
    from functools import reduce
    def try_funcs(cell, default, funcs):
        try:
            return reduce(lambda val, func: func(val), funcs, cell)
        except Exception as e:
            # do something with your Exception if necessary, like logging.
            return default
    # Usage:
    AccountName = try_funcs(SoldModelInfo.cell(1,5).value, "some default str value", [str])
    ExpirationDate = try_funcs(SoldModelInfo.cell(1,7).value, "some default date", [xldate_to_datetime, lambda d: d.strftime('%Y-%m-%d')])
    PolicyState = try_funcs(SoldModelInfo.cell(1,6).value, "some default str value", [str])
Here we use reduce to apply a list of functions in sequence, each one receiving the previous result, and a lambda to call strftime with a fixed format.
This can keep your code tidy without cluttering it with lots of try blocks. But the better, more explicit way is to handle individually just the fields you anticipate might raise an error.

So, basically you need to wrap your xldate_to_datetime() call in try ... except:
    import datetime as dt
    v = SoldModelInfo.cell(1,7).value
    try:
        d = dt.datetime.strftime(xldate_to_datetime(v), '%Y-%m-%d')
    except TypeError as e:
        print('Could not parse "{}": {}'.format(v, e))

Getting "file object does not exist" error in python django

I have a few objects where I have no file attached.
I have this code:
    if os.path.isfile(object.pdf_file.url):
        object.url = object.pdf_file.url
    else:
        object.url = ""
But I am getting this error:
The 'pdf_file' attribute has no file associated with it.

    if os.path.isfile(object.pdf_file.url):
This will throw the error because the field needs an associated file before it can produce a URL. I don't think it would work even if the file existed, since isfile() needs a filesystem path, not a URL; the URL is relative to your web server and Django media URL settings, not to where the file is located on your server.
Try:
    if object.pdf_file:
        object.url = object.pdf_file.url
    else:
        object.url = ""
This works because the file field evaluates to False when no file is associated with it.

try: and exception: error

I am working on the code below. It runs all right when my Reff.txt has more than one line, but it doesn't work when Reff.txt has only one line. Why is that? I am also wondering why the "try" portion of my code never runs and only the "except" part does.
I have a reference file that contains a list of IDs (one ID per line).
I use the reference file (Reff.txt) to search a database on a website and a database on a server within my network.
The result should be an output file and a file with information about that ID, for each reference ID.
However, this code doesn't do anything in the "try:" portion at all.
    import sys
    import urllib2
    from lxml import etree
    import os
    getReference = open('Reff.txt','r') #open the file that contains list of reference ids
    global tID
    for tID in getReference:
        tID = tID.strip()
        try:
            with open(''+tID.strip()+'.txt') as f: pass
            fileInput = open(''+tID+'.txt','r')
            readAA = fileInput.read()
            store_value = (readAA.partition('\n'))
            aaSequence = store_value[2].replace('\n', '') #concatenate lines
            makeList = list(aaSequence)#print makeList
            inRange = ''
            fileAddress = '/database/int/data/'+tID+'.txt'
            filename = open(fileAddress,'r')#name of the working file
            print fileAddress
            with open(fileAddress,'rb') as f:
                root = etree.parse(f)
                for lcn in root.xpath("/protein/match[@dbname='PFAM']/lcn"):#find dbname =PFAM
                    start = int(lcn.get("start"))#if it is PFAM then look for start value
                    end = int(lcn.get("end"))#if it is PFAM then also look for end value
                    while start <= end:
                        inRange = makeList[start]
                        start += 1
                        print outputFile.write(inRange)
                    outputFile.close()
                    break
                break
            break
        except IOError as e:
            newURL ='http://www.uniprot.org/uniprot/'+tID+'.fasta'
            print newURL
            response = urllib2.urlopen(''+newURL) #go to the website and grab the information
            creatNew = open(''+uniprotID+'.txt','w')
            html = response.read() #read file
            creatNew.write(html)
            creatNew.close()

With try/except, the except block runs if the try block fails. Your except block always runs because your try block always fails.
The most likely reason is the line "print outputFile.write(inRange)": you have not declared outputFile anywhere before it.
ETA: Also, it looks like you are only interested in the first pass of the for loop? You break at that point. Your other breaks are extraneous in that case, because they will never be reached while that one is there.
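
To see which exception is actually being raised, it can help to keep only the file access inside the try block and to print the caught error. The following is a minimal sketch; `read_local_or_none` is a hypothetical helper name, not part of the question's code, and it is written so that it runs on both Python 2 and Python 3.

```python
def read_local_or_none(path):
    """Return the file's contents, or None if the file cannot be opened."""
    try:
        # Only the operation expected to fail is guarded.
        with open(path, "r") as handle:
            return handle.read()
    except IOError as exc:
        # Printing the exception shows why the except branch was taken.
        print("could not open {}: {}".format(path, exc))
        return None
```

The caller can then fall back to the download step only when the function returns None, and any other kind of error in the remaining code surfaces with its own traceback instead of being mistaken for a missing file.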
