Error-tolerant RDF parsing using RDFLib in Python

I am writing code that gathers some statistics about ontologies. As input I have a folder of files: some are RDF/XML, some are Turtle or N-Triples (nt).
My problem is that when I try to parse a file using the wrong format, the next attempt fails even if I then parse it with the correct format.
Here the test file is in Turtle format. If I parse it with the turtle format first, all is fine. But if I first parse it with the wrong format, the first error is understandable (file:///test:1:0: not well-formed (invalid token)), but the second error is (Unknown namespace prefix : owl). As I said, when I parse with the correct format first, I don't get the namespace error.
Please help; after 2 days, I'm getting desperate.
query = 'SELECT DISTINCT ?s ?o WHERE { ?s ?p owl:Ontology . ?s rdfs:comment ?o}'
data = open("test", "r")
g = rdflib.Graph("IOMemory")
try:
    result = g.parse(file=data,format="xml")
    relations = g.query(query)
    print(( " graph has %s statements." % len(g)))
except:
    print "bad1"
    e = sys.exc_info()[1]
    print e
try:
    result = g.parse(file=data,format="turtle")
    relations = g.query(query)
    print(( " graph has %s statements." % len(g)))
except :
    print "bad2"
    e = sys.exc_info()[1]
    print e

The problem is that g.parse first reads some of the file input stream data, only to discover afterwards that it is not XML. The second call (with the turtle format) then continues reading from the input stream at the point where the previous attempt stopped. The part read by the first parser is lost to the second one.
If your test file is small, the XML parser might have read all of it, leaving an "empty" remainder. It seems the Turtle parser did not complain; it just read in nothing. Only the query in the next statement failed to find anything owl-like, because the graph was empty. (I have to admit I cannot reproduce this part; the Turtle parser does complain in my case, but maybe I have a different version of rdflib.)
To fix it, reopen the file: either reorganize the code so that you have a data = open("test", "r") every time you call result = g.parse(file=data, format="(some format)"), or call data.seek(0) in the except: clause, like:
for format in 'xml','turtle':
    try:
        print 'reading', format
        result = g.parse(data, format=format)
        print 'success'
        break
    except Exception:
        print 'failed'
        data.seek(0)
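The rewind-and-retry idea can be tried without rdflib at all. The following is a minimal Python 3 sketch using only the standard library; `parse_with_fallback` and `parse_key_value` are hypothetical names made up for illustration, and `json.load` plus the key-value parser merely stand in for the XML and Turtle parsers.

```python
import io
import json


def parse_with_fallback(stream, parsers):
    """Try each (name, parser) pair in order, rewinding the stream before each attempt."""
    for name, parser in parsers:
        stream.seek(0)  # start from the beginning, whatever an earlier parser consumed
        try:
            return name, parser(stream)
        except ValueError:
            continue  # wrong format: move on to the next parser
    raise ValueError("no parser accepted the input")


def parse_key_value(stream):
    """Hypothetical stand-in parser for lines of the form 'key = value'."""
    pairs = {}
    for line in stream:
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("missing '=' in line: %r" % line)
        pairs[key.strip()] = value.strip()
    return pairs


stream = io.StringIO("prefix = owl\nformat = turtle\n")
name, result = parse_with_fallback(
    stream, [("json", json.load), ("key-value", parse_key_value)]
)
print(name, result)
```

Here the JSON attempt reads the whole stream before failing, so without the `seek(0)` the second parser would see an empty input, which is the same situation as in the question.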

Related

How to skip one part of a single loop iteration in Python

I am creating about 200 variables within a single iteration of a Python loop (extracting fields from Excel documents and pushing them to a SQL database) and I am trying to figure something out.
Let's say that a single iteration is a single Excel workbook that I am looping through in a directory. I am extracting around 200 fields from each workbook.
Suppose one of the fields I extract (say field #56 out of 200) isn't in the proper format (say the date was filled out wrong, e.g. 9/31/2015, which isn't a real date), so the operation I am performing on it raises an error.
I want the loop to skip that variable and proceed to creating variable #57. I don't want the loop to completely go to the next iteration or workbook, I just want it to ignore that error on that variable and continue with the rest of the variables for that single loop iteration.
How would I go about doing something like this?
In this sample code I would like to continue extracting "PolicyState" even if ExpirationDate has an error.
Some sample code:
import datetime as dt
import os as os
import xlrd as rd
files = os.listdir(path)
for file in files: #Loop through all files in path directory
    filename = os.fsdecode(file)
    if filename.startswith('~'):
        continue
    elif filename.endswith( ('.xlsx', '.xlsm') ):
        try:
            book = rd.open_workbook(os.path.join(path,file))
        except KeyError:
            print ("Error opening file for "+ file)
            continue
        SoldModelInfo=book.sheet_by_name("SoldModelInfo")
        AccountName=str(SoldModelInfo.cell(1,5).value)
        ExpirationDate=dt.datetime.strftime(xldate_to_datetime(SoldModelInfo.cell(1,7).value),'%Y-%m-%d')
        PolicyState=str(SoldModelInfo.cell(1,6).value)
        print("Insert data of " + file +" was successful")
    else:
        continue
Use multiple try blocks. Wrap each decode operation that might go wrong in its own try block to catch the exception, do something, and carry on with the next one.
try:
    book = rd.open_workbook(os.path.join(path,file))
except KeyError:
    print ("Error opening file for "+ file)
    continue
errors = []
SoldModelInfo=book.sheet_by_name("SoldModelInfo")
AccountName=str(SoldModelInfo.cell(1,5).value)
try:
    ExpirationDate=dt.datetime.strftime(xldate_to_datetime(SoldModelInfo.cell(1,7).value),'%Y-%m-%d')
except WhateverError as e:
    # do something, maybe set a default date?
    ExpirationDate = default_date
    # and/or record that it went wrong?
    errors.append( [ "ExpirationDate", e ])
PolicyState=str(SoldModelInfo.cell(1,6).value)
...
# at the end
if not errors:
    print("Insert data of " + file +" was successful")
else:
    # things went wrong somewhere above.
    # the contents of errors will let you work out what
    pass
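The same per-field pattern can be written as a loop over a table of converters, which may be easier to maintain with around 200 fields. This is a minimal Python 3 sketch under stated assumptions: `raw_fields`, `converters`, and `to_iso_date` are hypothetical names, and the values are plain strings standing in for cells read with xlrd.

```python
import datetime as dt

# Hypothetical raw values standing in for cells read from one workbook.
raw_fields = {
    "AccountName": "Acme Ltd",
    "ExpirationDate": "9/31/2015",  # not a real date
    "PolicyState": "NY",
}


def to_iso_date(text):
    """Convert an m/d/Y string to Y-m-d; raises ValueError for impossible dates."""
    return dt.datetime.strptime(text, "%m/%d/%Y").strftime("%Y-%m-%d")


# One converter per field; str is enough for plain text fields.
converters = {
    "AccountName": str,
    "ExpirationDate": to_iso_date,
    "PolicyState": str,
}

record = {}
errors = []
for name, convert in converters.items():
    try:
        record[name] = convert(raw_fields[name])
    except (ValueError, TypeError) as exc:
        record[name] = None  # placeholder default for the failed field
        errors.append((name, exc))  # remember what failed, then carry on

print(record)
print([name for name, _ in errors])
```

Only the field that fails is replaced by the default; the fields after it are still extracted, and `errors` records which ones need attention.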
As suggested, you could use multiple try blocks, one for each variable you extract, or you could streamline it with your own custom function that handles the try for you:
from functools import reduce
from operator import methodcaller
def try_funcs(cell, default, funcs):
    try:
        return reduce(lambda val, func: func(val), funcs, cell)
    except Exception as e:
        # do something with your Exception if necessary, like logging.
        return default
# Usage:
AccountName = try_funcs(SoldModelInfo.cell(1,5).value, "some default str value", [str])
ExpirationDate = try_funcs(SoldModelInfo.cell(1,7).value, "some default date", [xldate_to_datetime, methodcaller('strftime', '%Y-%m-%d')])
PolicyState = try_funcs(SoldModelInfo.cell(1,6).value, "some default str value", [str])
Here we use reduce to apply several functions in sequence, and methodcaller to build a frozen strftime call whose format argument is fixed in advance. Note that funcs must always be a list, even when it holds a single function.
This can help your code look tidy without cluttering it up with lots of try blocks. But the better, more explicit way is just to handle individually the fields you anticipate might error out.
So, basically you need to wrap your xldate_to_datetime() call in try ... except:
import datetime as dt
v = SoldModelInfo.cell(1,7).value
try:
    d = dt.datetime.strftime(xldate_to_datetime(v), '%Y-%m-%d')
except TypeError as e:
    print('Could not parse "{}": {}'.format(v, e))

Getting KeyError when parsing JSON containing three layers of keys, using Python

I'm building a Python program to parse some calls to a social media API into CSV and I'm running into an issue with a key that has two keys above it in the hierarchy. I get this error when I run the code with PyDev in Eclipse.
Traceback (most recent call last):
  line 413, in <module>
    main()
  line 390, in main
    postAgeDemos(monitorID)
  line 171, in postAgeDemos
    age0To17 = str(i["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
KeyError: 'ZERO_TO_SEVENTEEN'
Here's the section of the code I'm using for it. I have a few other functions built already that work with two layers of keys.
import urllib.request
import json
def postAgeDemos(monitorID):
    print("Enter the date you'd like the data to start on")
    startDate = input('The date must be in the format YYYY-MM-DD. ')
    print("Enter the date you'd like the data to end on")
    endDate = input('The date must be in the format YYYY-MM-DD. ')
    dates = "&start="+startDate+"&end="+endDate
    urlStart = getURL()
    authToken = getAuthToken()
    endpoint = "/monitor/demographics/age?id="
    urlData = urlStart+endpoint+monitorID+authToken+dates
    webURL = urllib.request.urlopen(urlData)
    fPath = getFilePath()+"AgeDemographics"+startDate+"&"+endDate+".csv"
    print("Connecting...")
    if (webURL.getcode() == 200):
        print("Connected to "+urlData)
        print("This query returns information in a CSV file.")
        csvFile = open(fPath, "w+")
        csvFile.write("postDate,totalPosts,totalPostsWithIdentifiableAge,0-17,18-24,25-34,35+\n")
        data = webURL.read().decode('utf8')
        theJSON = json.loads(data)
        for i in theJSON["ageCounts"]:
            postDate = i["startDate"]
            totalDocs = str(i["numberOfDocuments"])
            totalAged = str(i["ageCount"]["totalAgeCount"])
            age0To17 = str(i["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
            age18To24 = str(i["ageCount"]["sortedAgeCounts"]["EIGHTEEN_TO_TWENTYFOUR"])
            age25To34 = str(i["ageCount"]["sortedAgeCounts"]["TWENTYFIVE_TO_THIRTYFOUR"])
            age35Over = str(i["ageCount"]["sortedAgeCounts"]["THIRTYFIVE_AND_OVER"])
            csvFile.write(postDate+","+totalDocs+","+totalAged+","+age0To17+","+age18To24+","+age25To34+","+age35Over+"\n")
        print("File printed to "+fPath)
        csvFile.close()
    else:
        print("Server Error, No Data" + str(webURL.getcode()))
Here's a sample of the JSON I'm trying to parse.
{"ageCounts":[{"startDate":"2016-01-01T00:00:00","endDate":"2016-01-02T00:00:00","numberOfDocuments":520813,"ageCount":{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3245,"EIGHTEEN_TO_TWENTYFOUR":4289,"TWENTYFIVE_TO_THIRTYFOUR":2318,"THIRTYFIVE_AND_OVER":70249},"totalAgeCount":80101}},{"startDate":"2016-01-02T00:00:00","endDate":"2016-01-03T00:00:00","numberOfDocuments":633709,"ageCount":{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3560,"EIGHTEEN_TO_TWENTYFOUR":1702,"TWENTYFIVE_TO_THIRTYFOUR":2786,"THIRTYFIVE_AND_OVER":119657},"totalAgeCount":127705}}],"status":"success"}
Here it is again with line breaks so it's a little easier to read.
{"ageCounts":[{"startDate":"2016-01-01T00:00:00","endDate":"2016-01-02T00:00:00","numberOfDocuments":520813,"ageCount":
{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3245,"EIGHTEEN_TO_TWENTYFOUR":4289,"TWENTYFIVE_TO_THIRTYFOUR":2318,"THIRTYFIVE_AND_OVER":70249},"totalAgeCount":80101}},
{"startDate":"2016-01-02T00:00:00","endDate":"2016-01-03T00:00:00","numberOfDocuments":633709,"ageCount":
{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3560,"EIGHTEEN_TO_TWENTYFOUR":1702,"TWENTYFIVE_TO_THIRTYFOUR":2786,"THIRTYFIVE_AND_OVER":119657},"totalAgeCount":127705}}],"status":"success"}
I've tried removing the ["sortedAgeCounts"] from the middle of
age0To17 = str(i["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
but I still get the same error. I've removed the 0-17 section to test the other age ranges and I get the same error for them as well. I also tried removing all the underscores from the JSON and then using keys without the underscores.
I've also tried moving the str() to convert to string from the call to where the output is printed but the error persists.
Any ideas? Is this section not actually a JSON key, maybe a problem with the all caps or am I just doing something dumb? Any other code improvements are welcome as well but I'm stuck on this one.
Let me know if you need to see anything else. Thanks in advance for your help.
Edited (this works):
parsed = json.loads(s)
for entry in parsed["ageCounts"]:
    print(str(entry["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"]))
s is a string that contains your JSON.
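Since the sample JSON in the question does contain the key, one possible explanation (an assumption, not something the post confirms) is that some entries in the live response omit an age bucket. A minimal Python 3 sketch of a lookup that tolerates that, using a trimmed, partly hypothetical sample in which the second entry deliberately lacks ZERO_TO_SEVENTEEN:

```python
import json

# Trimmed sample shaped like the response in the question. The second entry
# deliberately omits one bucket, and the totals are made up for illustration.
s = '''{"ageCounts": [
  {"startDate": "2016-01-01T00:00:00", "numberOfDocuments": 520813,
   "ageCount": {"sortedAgeCounts": {"ZERO_TO_SEVENTEEN": 3245,
                                    "EIGHTEEN_TO_TWENTYFOUR": 4289},
                "totalAgeCount": 7534}},
  {"startDate": "2016-01-02T00:00:00", "numberOfDocuments": 633709,
   "ageCount": {"sortedAgeCounts": {"EIGHTEEN_TO_TWENTYFOUR": 1702},
                "totalAgeCount": 1702}}
], "status": "success"}'''

BUCKETS = ["ZERO_TO_SEVENTEEN", "EIGHTEEN_TO_TWENTYFOUR"]

rows = []
for entry in json.loads(s)["ageCounts"]:
    counts = entry.get("ageCount", {}).get("sortedAgeCounts", {})
    # dict.get returns the default instead of raising KeyError for a missing bucket
    rows.append([entry["startDate"]] + [counts.get(bucket, 0) for bucket in BUCKETS])

for row in rows:
    print(",".join(str(value) for value in row))
```

Whether 0 is the right default for a missing bucket depends on what the API means by leaving it out; printing the offending entry before choosing a default is a reasonable first step.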

Python -- get at JSON info that's written like XML

In Python, I usually do simple JSON with this sort of template:
url = "url"
file = urllib2.urlopen(url)
raw = file.read()  # named raw so it does not shadow the json module
parsed = json.loads(raw)
and then get at the variables with calls like:
parsed[obj name][value name]
But, this works with JSON that's formatted roughly like:
{'object':{'index':'value', 'index':'value'}}
The JSON I just encountered is formatted like:
{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}
so there are no names for me to reference the different blocks. Of course the blocks give different info, but have the same "keys" -- much like XML is usually formatted. Using my method above, how would I parse through this JSON?
The following is not valid JSON.
{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}
Whereas
[{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}] is valid JSON (once the single quotes are replaced with double quotes, which JSON requires).
The Python traceback shows this:
import json
json_string = '{"a":"value", "b":"value"},{"a":"value", "b":"value"}'
parsed_json = json.loads(json_string)
print parsed_json
Traceback (most recent call last):
  File "/Users/tron/Desktop/test3.py", line 3, in <module>
    parsed_json = json.loads(json_string)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 27 - line 1 column 54 (char 26 - 53)
[Finished in 0.0s with exit code 1]
whereas if you do
json_string = '[{"a":"value", "b":"value"},{"a":"value", "b":"value"}]'
everything works fine.
If that is the case, you can treat the parsed result as an array of JSON objects, where parsed_json[0] is the first object, parsed_json[1] is the second, and so on.
Otherwise, if you think this is going to be an issue that you "just have to deal with", here is one option:
Think of the ways JSON can be malformed and write a simple class to account for them. In the case above, here is a hacky way you can deal with it.
import json
json_string = '{"a":"value", "b":"value"},{"a":"value", "b":"value"}'
def parseJson(string):
    parsed_json = None
    try:
        parsed_json = json.loads(string)
        print parsed_json
    except ValueError as e:
        print string, "didnt parse"
        if "Extra data" in str(e.args):
            # wrap the comma-separated objects in brackets and retry
            newString = "["+string+"]"
            print newString
            return parseJson(newString)
    return parsed_json
You could add more if/else to deal with various things you run into. I have to admit, this is very hacky and I don't think you can ever account for every possible mutation.
Good luck
The result must be a list of dicts:
[{'index1':'value1', 'index2':'value2'},{'index1':'value1', 'index2':'value2'}]
so you can reference an item by its position: item[1]['index1']
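If the input really does arrive as several objects separated by commas, another option is to decode them one at a time instead of patching the string. The sketch below is Python 3 and uses `json.JSONDecoder.raw_decode` from the standard library; `parse_concatenated` is a hypothetical helper name, and it assumes the objects use double quotes as JSON requires.

```python
import json


def parse_concatenated(text):
    """Decode several JSON values separated by commas and/or whitespace into a list."""
    decoder = json.JSONDecoder()
    items = []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t\r\n,":
            pos += 1  # skip separators between values
            continue
        # raw_decode returns the decoded value and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        items.append(obj)
    return items


json_string = '{"a":"value", "b":"value"},{"a":"other", "b":"value"}'
items = parse_concatenated(json_string)
print(items[1]["a"])
```

The result is an ordinary list of dicts, so each block is referenced by position rather than by name.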

Python: Why will this string print but not write to a file?

I am new to Python and working on a utility that changes an XML file into an HTML. The XML comes from a call to request = urllib2.Request(url), where I generate the custom url earlier in the code, and then set response = urllib2.urlopen(request) and, finally, xml_response = response.read(). This works okay, as far as I can tell.
My trouble is with parsing the response. For starters, here is a partial example of the XML structure I get back:
I tried adapting the slideshow example in the minidom tutorial here to parse my XML (which is ebay search results, by the way): http://docs.python.org/2/library/xml.dom.minidom.html
My code so far looks like this, with try blocks as an attempt to diagnose issues:
doc = minidom.parseString(xml_response)
#Extract relevant information and prepare it for HTML formatting.
try:
    handleDocument(doc)
except:
    print "Failed to handle document!"
def getText(nodelist): #taken straight from slideshow example
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            print "A TEXT NODE!"
            rc.append(node.data)
    return ''.join(rc) #this is a string, right?
def handleDocument(doc):
    outputFile = open("EbaySearchResults.html", "w")
    outputFile.write("<html>\n")
    outputFile.write("<body>\n")
    try:
        items = doc.getElementsByTagName("item")
    except:
        "Failed to get elements by tag name."
    handleItems(items)
    outputFile.write("</html>\n")
    outputFile.write("</body>\n")
def handleItems(items):
    for item in items:
        title = item.getElementsByTagName("title")[0] #there should be only one title
        print "<h2>%s</h2>" % getText(title.childNodes) #this works fine!
        try: #none of these things work!
            outputFile.write("<h2>%s</h2>" % getText(title.childNodes))
            #outputFile.write("<h2>" + getText(title.childNodes) + "</h2>")
            #str = getText(title.childNodes)
            #outputFIle.write(string(str))
            #outputFile.write(getText(title.childNodes))
        except:
            print "FAIL"
I do not understand why the correct title text does print to the console but throws an exception and does not work for the output file. Writing plain strings like this works fine: outputFile.write("<html>\n") What is going on with my string construction? As far as I can tell, the getText method I am using from the minidom example returns a string--which is just the sort of thing you can write to a file..?
If I print the actual stack trace...
...
except:
    print "Exception when trying to write to file:"
    print '-'*60
    traceback.print_exc(file=sys.stdout)
    print '-'*60
    traceback.print_tb(sys.last_traceback)
...
...I will instantly see the problem:
------------------------------------------------------------
Traceback (most recent call last):
  File "tohtml.py", line 85, in handleItems
    outputFile.write(getText(title.childNodes))
NameError: global name 'outputFile' is not defined
------------------------------------------------------------
Looks like something has gone out of scope!
Fellow beginners, take note.
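One way to avoid the scope problem is to pass the file object to the function that writes to it. This is a minimal Python 3 sketch, not the poster's final code: the XML is a made-up, trimmed-down stand-in (the real eBay response is not shown in the post), the function names are adapted, and an `io.StringIO` buffer stands in for the HTML file.

```python
import io
from xml.dom import minidom

# Hypothetical, trimmed-down response used only for illustration.
xml_response = (
    "<results>"
    "<item><title>Red bicycle</title></item>"
    "<item><title>Blue kettle</title></item>"
    "</results>"
)


def get_text(nodelist):
    """Join the text nodes directly under an element."""
    return "".join(node.data for node in nodelist if node.nodeType == node.TEXT_NODE)


def handle_items(items, output_file):
    # The file object arrives as a parameter, so it is in scope here.
    for item in items:
        title = item.getElementsByTagName("title")[0]
        output_file.write("<h2>%s</h2>\n" % get_text(title.childNodes))


def handle_document(doc, output_file):
    output_file.write("<html>\n<body>\n")
    handle_items(doc.getElementsByTagName("item"), output_file)
    output_file.write("</body>\n</html>\n")


output = io.StringIO()  # stands in for open("EbaySearchResults.html", "w")
handle_document(minidom.parseString(xml_response), output)
html = output.getvalue()
print(html)
```

With a real file, wrapping the call in `with open("EbaySearchResults.html", "w") as output:` would also make sure the file is closed afterwards.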

try: and exception: error

So I am working on the code below. It worked all right when my Reff.txt had more than one line, but it doesn't work when Reff.txt has only one line. Why is that? I am also wondering why my code never runs the "try" portion and always runs only the "except" part.
I have a reference file which has a list of ids (one id per line).
I use the reference file (Reff.txt) to search through the database from the website and the database on a server within my network.
The result I should get is an output file, and a file with information about that id, for each reference id.
However, this code doesn't do anything in my "try:" portion at all.
import sys
import urllib2
from lxml import etree
import os
getReference = open('Reff.txt','r') #open the file that contains list of reference ids
global tID
for tID in getReference:
    tID = tID.strip()
    try:
        with open(''+tID.strip()+'.txt') as f: pass
        fileInput = open(''+tID+'.txt','r')
        readAA = fileInput.read()
        store_value = (readAA.partition('\n'))
        aaSequence = store_value[2].replace('\n', '') #concatenate lines
        makeList = list(aaSequence)#print makeList
        inRange = ''
        fileAddress = '/database/int/data/'+tID+'.txt'
        filename = open(fileAddress,'r')#name of the working file
        print fileAddress
        with open(fileAddress,'rb') as f:
            root = etree.parse(f)
            for lcn in root.xpath("/protein/match[@dbname='PFAM']/lcn"):#find dbname =PFAM
                start = int(lcn.get("start"))#if it is PFAM then look for start value
                end = int(lcn.get("end"))#if it is PFAM then also look for end value
                while start <= end:
                    inRange = makeList[start]
                    start += 1
                    print outputFile.write(inRange)
                    outputFile.close()
                    break
                break
            break
    except IOError as e:
        newURL ='http://www.uniprot.org/uniprot/'+tID+'.fasta'
        print newURL
        response = urllib2.urlopen(''+newURL) #go to the website and grab the information
        creatNew = open(''+uniprotID+'.txt','w')
        html = response.read() #read file
        creatNew.write(html)
        creatNew.close()
So, when you do Try/Except - if try fails, Except runs. Except is always running, because Try is always failing.
Most likely reason for this is that you have this - "print outputFile.write(inRange)", but you have not previously declared outputFile.
ETA: Also, it looks like you are only interested in testing to the first pass of the for loop? You break at that point. Your other breaks are extraneous in that case, because they will never be reached while that one is there.
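To see which branch runs and why, it can help to keep the try body down to the one call that is expected to fail. The following Python 3 sketch shows that shape under stated assumptions: `load_record` and `fake_fetch` are hypothetical names, a temporary directory stands in for the local files, and `fake_fetch` replaces the urllib2 download from the question.

```python
import os
import tempfile

calls = []  # records every fallback download, so the behaviour can be inspected


def fake_fetch(record_id):
    """Hypothetical stand-in for the download in the question."""
    calls.append(record_id)
    return ">" + record_id + "\nMKT\n"


def load_record(record_id, directory, fetch):
    """Return the cached text for record_id, calling fetch() only if the file is missing."""
    path = os.path.join(directory, record_id + ".txt")
    try:
        with open(path) as handle:
            return handle.read()  # only the file access sits inside the try
    except IOError:
        text = fetch(record_id)  # fallback when the local file cannot be opened
        with open(path, "w") as handle:
            handle.write(text)
        return text


with tempfile.TemporaryDirectory() as directory:
    first = load_record("P12345", directory, fake_fetch)   # file missing: except branch
    second = load_record("P12345", directory, fake_fetch)  # file now cached: try branch
    print(first == second, calls)
```

Because the except clause names IOError, it only handles failures to open or read the file; other mistakes, such as using a name that was never defined, would surface as their own traceback rather than silently triggering the fallback.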
