Parsing Errno22 with xml Element Tree - python

I am trying to develop a simple web scraper of sorts, and I keep having issues with the parsing code for the XML file it uses.
Whenever I run it, it gives me Errno 22, even though the path is valid. Could anyone assist?
try:
    xmlTree = ET.parse('C:\TestWork\RWPlus\test.xml')
    root = xmlTree.getroot()
    returnValue = root[tariffPOS][childPOS].text
    return returnValue
except Exception as error:
    errorMessage = "A " + str(error) + " error occurred when trying to read the XML file."
    ErrorReport(errorMessage)

You need to escape backslashes in Python string literals. In 'C:\TestWork\RWPlus\test.xml' the \t in \test.xml is interpreted as a tab character, which mangles the path and produces Errno 22:
ET.parse('C:\\TestWork\\RWPlus\\test.xml')
or you can use a raw string (note the r), which leaves the backslashes alone:
ET.parse(r'C:\TestWork\RWPlus\test.xml')
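For completeness, here is a minimal sketch of the corrected snippet; the function name is made up for the sketch, and tariffPOS, childPOS, and ErrorReport come from the question and are assumed to be defined elsewhere.
import xml.etree.ElementTree as ET

def read_tariff_value(tariffPOS, childPOS):
    try:
        # Raw string so the backslashes are not treated as escape sequences.
        xmlTree = ET.parse(r'C:\TestWork\RWPlus\test.xml')
        root = xmlTree.getroot()
        # Index into the parsed tree exactly as the original code does.
        return root[tariffPOS][childPOS].text
    except Exception as error:
        errorMessage = "A " + str(error) + " error occurred when trying to read the XML file."
        ErrorReport(errorMessage)  # assumed to be defined elsewhere in the question's project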

Related

How to extract the json data without square brackets in python

I suspect I am facing an issue because my json data is within square brackets ([]).
I am processing pubsub data which is in the format below.
[{"ltime":"2022-04-12T11:33:00.970Z","cnt":199,"fname":"MYFILENAME","data":[{"NAME":"N1","ID":11.4,"DATE":"2005-10-14 00:00:00"},{"NAME":"M1","ID":25.0,"DATE":"2005-10-14 00:00:00"}]}]
I am successfully processing/extracting all the fields except 'data'. I need to create a json file from the 'data' and store it in cloud storage.
When I use msg['data'] I get the following:
[{"NAME":"N1","ID":11.4,"DATE":"2005-10-14 00:00:00"},{"NAME":"M1","ID":25.0,"DATE":"2005-10-14 00:00:00"}]
Sample code:
pubsub_msg = base64.b64decode(event['data']).decode('utf-8')
pubsub_data = json.loads(pubsub_msg)
json_file = pubsub_data['fname'] + '.json'
json_content = pubsub_data['data']
try:
    << call function >>
except Exception as e:
    << Error >>
Below is the error I am getting:
2022-04-12T17:35:01.761Zmyapp-pipelineeopcm2kjno5h Error in uploading json document to gcs: [{"NAME":"N1","ID":11.4,"DATE":"2005-10-14 00:00:00"},{"NAME":"M1","ID":25.0,"DATE":"2005-10-14 00:00:00"}] could not be converted to bytes
I am not sure whether the issue is because of the square brackets [].
Correct me if I am wrong, and please help me get the exact json data for creating the file.
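The error message suggests that pubsub_data['data'] is still a Python list when it reaches the upload call, which expects str or bytes. A minimal sketch of one way to handle that, assuming google-cloud-storage's Blob.upload_from_string and a hypothetical bucket name, is to serialize the list with json.dumps before uploading:
import json
from google.cloud import storage

def upload_json_to_gcs(json_file, json_content):
    # json_content is a Python list, so serialize it back to a JSON string first.
    payload = json.dumps(json_content)
    client = storage.Client()
    bucket = client.bucket('my-output-bucket')  # hypothetical bucket name
    blob = bucket.blob(json_file)
    # upload_from_string expects str or bytes; the json.dumps step avoids the
    # "could not be converted to bytes" error.
    blob.upload_from_string(payload, content_type='application/json')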

Extracting values from parsed xml text

I am using lxml to parse the following XML text block:
<block>{<block_content><argument_list>(<argument><expr><name><name>String</name><operator>.</operator><name>class</name></name></expr></argument>, <argument><expr><name><name>Object</name><operator>.</operator><name>class</name></name></expr></argument>)</argument_list></block_content>}</block>
<block>{<block_content><argument_list>(<argument><expr><literal type="string">"Expected exception to be thrown"</literal></expr></argument>)</argument_list></block_content>}</block>
<block>{<block_content></block_content>}</block>
My requirement is to print the following from the above xml snippet:
String.class
Object.class
"Expected exception to be thrown"
Basically, I need to print the text values contained within the argument nodes of the xml snippet.
Below is the code block that I am using.
from lxml import etree

xml_text = '<unit>' \
    '<block>{<block_content><argument_list>(<argument><expr><name><name>String</name><operator>.</operator><name>class</name></name></expr></argument>, <argument><expr><name><name>Object</name><operator>.</operator><name>class</name></name></expr></argument>)</argument_list></block_content>}</block> ' \
    '<block>{<block_content><argument_list>(<argument><expr><literal type="string">"Expected exception to be thrown"</literal></expr></argument>)</argument_list></block_content>}</block> ' \
    '<block>{<block_content></block_content>}</block>' \
    '</unit>'
tree = etree.fromstring(xml_text)
args = tree.xpath('//argument_list/argument')
for i in range(len(args)):
    print('%s. %s' % (i+1, etree.tostring(args[i]).decode("utf-8")))
However, the output produced by this code, shown below, does not meet my requirement.
1. <argument><expr><name><name>String</name><operator>.</operator><name>class</name></name></expr></argument>,
2. <argument><expr><name><name>Object</name><operator>.</operator><name>class</name></name></expr></argument>)
3. <argument><expr><literal type="string">"Expected exception to be thrown"</literal></expr></argument>)
I would appreciate it if someone could point out what modifications I need to make to my code.
I found that the strip_tags function gets the job done. Below is the updated code:
for i in range(len(args)):
    etree.strip_tags(args[i], "*")
    print('%s. %s' % (i+1, args[i].text))
Output from the updated code:
String.class
Object.class
"Expected exception to be thrown"

Error tolerant RDF parsing using RDFlib in python

I am writing code that gathers some statistics about ontologies. As input I have a folder of files; some are RDF/XML, some are Turtle or NT.
My problem is that when I try to parse a file using the wrong format, even if I then parse it with the correct format, it fails.
Here the test file is in Turtle format. If I first parse it with the Turtle format, all is fine. But if I first parse it with the wrong format, the first error is understandable (file:///test:1:0: not well-formed (invalid token)), yet the error for the second parse is (Unknown namespace prefix : owl). Like I said, when I parse with the correct format first, I don't get the namespace error.
Please help; after 2 days, I'm getting desperate.
query = 'SELECT DISTINCT ?s ?o WHERE { ?s ?p owl:Ontology . ?s rdfs:comment ?o}'
data = open("test", "r")
g = rdflib.Graph("IOMemory")
try:
    result = g.parse(file=data, format="xml")
    relations = g.query(query)
    print((" graph has %s statements." % len(g)))
except:
    print "bad1"
    e = sys.exc_info()[1]
    print e
try:
    result = g.parse(file=data, format="turtle")
    relations = g.query(query)
    print((" graph has %s statements." % len(g)))
except:
    print "bad2"
    e = sys.exc_info()[1]
    print e
The problem is that g.parse reads part of the input stream of data first, only to figure out afterwards that it is not XML. The second call (with the turtle format) then continues reading the input stream from where the previous attempt stopped. The part read by the first parser is lost to the second one.
If your test file is small, the XML parser might have read it all, leaving an "empty" rest. It seems the turtle parser did not complain; it just read in nothing. Only the query in the next statement failed to find anything owl-like, as the graph is empty. (I have to admit I cannot reproduce this part; the turtle parser does complain in my case, but maybe I have a different version of rdflib.)
To fix it, reopen the file: either reorganize the code so you have a data = open("test", "r") every time you call result = g.parse(file=data, format="(some format)"), or call data.seek(0) in the except: clause, like:
for format in 'xml', 'turtle':
    try:
        print 'reading', format
        result = g.parse(data, format=format)
        print 'success'
        break
    except Exception:
        print 'failed'
        data.seek(0)
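Another option (my own suggestion, not part of the answer above) is to skip the trial-and-error and let rdflib guess the serialization from the file extension, assuming the files in the folder use conventional extensions such as .rdf, .ttl, or .nt:
import rdflib
from rdflib.util import guess_format

path = "test.ttl"  # hypothetical file name with a conventional extension
g = rdflib.Graph()
# guess_format maps extensions like .rdf/.ttl/.nt to the matching parser name.
g.parse(path, format=guess_format(path))
print(" graph has %s statements." % len(g))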

Python: Why will this string print but not write to a file?

I am new to Python and working on a utility that converts an XML file into HTML. The XML comes from a call to request = urllib2.Request(url), where I generate the custom url earlier in the code; then I set response = urllib2.urlopen(request) and, finally, xml_response = response.read(). This works okay, as far as I can tell.
My trouble is with parsing the response. For starters, here is a partial example of the XML structure I get back:
I tried adapting the slideshow example in the minidom tutorial here to parse my XML (which is ebay search results, by the way): http://docs.python.org/2/library/xml.dom.minidom.html
My code so far looks like this, with try blocks as an attempt to diagnose issues:
doc = minidom.parseString(xml_response)

#Extract relevant information and prepare it for HTML formatting.
try:
    handleDocument(doc)
except:
    print "Failed to handle document!"

def getText(nodelist): #taken straight from slideshow example
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            print "A TEXT NODE!"
            rc.append(node.data)
    return ''.join(rc) #this is a string, right?

def handleDocument(doc):
    outputFile = open("EbaySearchResults.html", "w")
    outputFile.write("<html>\n")
    outputFile.write("<body>\n")
    try:
        items = doc.getElementsByTagName("item")
    except:
        "Failed to get elements by tag name."
    handleItems(items)
    outputFile.write("</html>\n")
    outputFile.write("</body>\n")

def handleItems(items):
    for item in items:
        title = item.getElementsByTagName("title")[0] #there should be only one title
        print "<h2>%s</h2>" % getText(title.childNodes) #this works fine!
        try: #none of these things work!
            outputFile.write("<h2>%s</h2>" % getText(title.childNodes))
            #outputFile.write("<h2>" + getText(title.childNodes) + "</h2>")
            #str = getText(title.childNodes)
            #outputFIle.write(string(str))
            #outputFile.write(getText(title.childNodes))
        except:
            print "FAIL"
I do not understand why the correct title text prints to the console but throws an exception and never gets written to the output file. Writing plain strings like outputFile.write("<html>\n") works fine. What is going on with my string construction? As far as I can tell, the getText method I am using from the minidom example returns a string, which is just the sort of thing you can write to a file..?
If I print the actual stack trace...
...
except:
    print "Exception when trying to write to file:"
    print '-'*60
    traceback.print_exc(file=sys.stdout)
    print '-'*60
    traceback.print_tb(sys.last_traceback)
...
...I will instantly see the problem:
------------------------------------------------------------
Traceback (most recent call last):
File "tohtml.py", line 85, in handleItems
outputFile.write(getText(title.childNodes))
NameError: global name 'outputFile' is not defined
------------------------------------------------------------
Looks like something has gone out of scope!
Fellow beginners, take note.
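In other words, outputFile is local to handleDocument and simply does not exist inside handleItems. A minimal sketch of one way to fix it (my own suggestion, reusing the getText helper from the question) is to pass the open file handle along as a parameter:
def handleDocument(doc):
    outputFile = open("EbaySearchResults.html", "w")
    outputFile.write("<html>\n")
    outputFile.write("<body>\n")
    items = doc.getElementsByTagName("item")
    handleItems(items, outputFile)  # hand the open file to the helper
    outputFile.write("</body>\n")
    outputFile.write("</html>\n")
    outputFile.close()

def handleItems(items, outputFile):
    for item in items:
        title = item.getElementsByTagName("title")[0]
        outputFile.write("<h2>%s</h2>\n" % getText(title.childNodes))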

Trigger to automatically remove EOL whitespace?

Can one write a Perforce trigger to automatically remove whitespace at submission time? Preferably in Python? What would that look like? Or can you not modify files as they're being submitted?
To my knowledge this cannot be done, since you cannot put the modified file content back on the server. The only two trigger types that allow you to see the file content with p4 print are change-content and change-commit. For the latter, the files are already submitted on the server; for the former, while you can see the (unsubmitted) file content, there is no way to modify it and put it back on the server.
The only trigger that is possible is one that rejects submissions of files with EOL whitespace, so that the submitters can fix the files on their own. Here is an excerpt of a similar one that checks for tabs in files; please read the documentation on triggers and look at the Perforce site for examples:
def fail(sComment):
    print sComment
    sys.exit(1)
    return

sCmd = "p4 -G files //sw/...#=%s" % sChangeNr
stream = os.popen(sCmd, 'rb')
dictResult = []
try:
    while 1:
        dictResult.append(marshal.load(stream))
except EOFError:
    pass
stream.close()

failures = []
# check all files for tabs
for element in dictResult:
    depotFile = element['depotFile']
    sCmd = "p4 print -q %s#=%s" % (depotFile, sChangeNr)
    content = os.popen(sCmd, 'rb').read()
    if content.find('\t') != -1:
        failures.append(depotFile)

if len(failures) != 0:
    error = "Files contain tabulators (instead of spaces):\n"
    for i in failures:
        error = error + str(i) + "\n"
    fail(error)
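For the end-of-line whitespace the question actually asks about, the tab check inside the loop can be swapped for a regular expression. Here is a minimal sketch of that variant (my adaptation of the excerpt above, not tested against a real Perforce server; dictResult, sChangeNr and fail are as in the excerpt):
import re

# A space or tab immediately before a line ending.
EOL_WS = re.compile(r'[ \t]+(\r?\n)')

failures = []
for element in dictResult:
    depotFile = element['depotFile']
    sCmd = "p4 print -q %s#=%s" % (depotFile, sChangeNr)
    content = os.popen(sCmd, 'rb').read()
    if EOL_WS.search(content):
        failures.append(depotFile)

if len(failures) != 0:
    fail("Files contain end-of-line whitespace:\n" + "\n".join(failures))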
