Having an issue with parsing XML in Python - python

I am really new to Python. This is actually my first script with it, and most of it is a copied example. I have an XML file that I need to parse an attribute out of. I got that part figured out, but my issue is that the attribute does not always exist in the XML file. Here is my code:
#!/usr/bin/python
#import library to do http requests:
import urllib2
import os
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#download the history:
history = urllib2.urlopen('http://192.168.1.1/example.xml')
#convert to string:
historydata = history.read()
history.close()
#parse the xml you downloaded
dom = parseString(historydata)
xmlTagHistory = dom.getElementsByTagName('loaded')[0].toxml()
xmlDataHistory=xmlTagHistory.replace('<loaded>','').replace('</loaded>','')
print xmlDataHistory
When the attribute doesn't exist, I get "IndexError: list index out of range". What I am attempting to do with this code is run a command if the attribute doesn't exist or is false. The other issue I will probably have is that the attribute will sometimes appear more than once, so I also need to account for that scenario by NOT running the command if there is even one instance of "loaded" being true. As I said, I am really new at this, so I could use all the help I can get. Much appreciated.

Since dom.getElementsByTagName('loaded') returns a list, you can check its size with len(). Indexing with [0] is only valid when the list length is above 0.
An alternative is to wrap the code in a try/except block and catch the IndexError.

http://docs.python.org/tutorial/errors.html
Using try and except, you should be able to handle everything you need.
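A minimal sketch of how the answers above might fit together in Python 3. It assumes `loaded` is an element whose text is `true` or `false`; the sample documents and the helper names (`first_loaded`, `any_loaded_true`, `should_run_command`) are made up for illustration. Looping over the returned list avoids the `[0]` lookup entirely and also covers the case where `loaded` appears more than once.

```python
from xml.dom.minidom import parseString


def first_loaded(xml_text):
    """Return the first <loaded> element, or None when there is none."""
    try:
        return parseString(xml_text).getElementsByTagName('loaded')[0]
    except IndexError:
        return None


def any_loaded_true(xml_text):
    """Return True if at least one <loaded> element contains the text 'true'."""
    dom = parseString(xml_text)
    for node in dom.getElementsByTagName('loaded'):
        # Join the element's direct text nodes instead of stripping tags by hand.
        text = ''.join(child.data for child in node.childNodes
                       if child.nodeType == child.TEXT_NODE)
        if text.strip().lower() == 'true':
            return True
    return False


def should_run_command(xml_text):
    """The command should run only when <loaded> is missing or never true."""
    return not any_loaded_true(xml_text)


# Made-up sample documents (assumptions, not the real example.xml).
missing = '<status><name>a</name></status>'
all_false = '<status><loaded>false</loaded></status>'
mixed = '<status><loaded>false</loaded><loaded>true</loaded></status>'

for sample in (missing, all_false, mixed):
    print(should_run_command(sample))
```

With the downloaded data from the question, the call would be `should_run_command(historydata)`.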

Related

Is there a way to feed downloaded xml continuously into XMLPullParser?

In my Python script, I'm downloading some XML from a URL. It contains a list of elements within the root element. The download takes quite some time, and since the etree documentation suggests XMLPullParser for cases like this, I wanted to try it, but I didn't find any way of continuously reading the URL into the XMLPullParser. I had hoped to process the list entries one by one that way while the download is still running. Does anyone have an idea?
You could try using urllib.request.urlopen from the standard library. Like open, you can use this as a context manager:
import urllib.request
with urllib.request.urlopen("http://www.python.org/") as uf:
    while True:
        data = uf.read(1024)  # read returns empty bytes when finished.
        if data:
            # feed to pullparser here...
            print(data)
        else:
            break
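The "feed to pullparser here" step could look roughly like the sketch below. The helper name `iter_elements`, the tag name `item`, and the in-memory sample are assumptions for illustration; any object with a `read(size)` method that returns bytes, such as the response from `urllib.request.urlopen`, can be passed as `stream`.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(stream, tag, chunk_size=1024):
    """Yield each completed <tag> element while the stream is still being read."""
    parser = ET.XMLPullParser(events=('end',))
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand over every element whose closing tag has been parsed so far.
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem
    parser.close()
    # Drain events that only became available once parsing was finalized.
    for _event, elem in parser.read_events():
        if elem.tag == tag:
            yield elem


# Made-up document standing in for the download; tiny chunks imitate a slow network.
sample = io.BytesIO(b'<root><item>1</item><item>2</item></root>')
texts = [elem.text for elem in iter_elements(sample, 'item', chunk_size=8)]
print(texts)
```

Depending on the Python and Expat versions, the parser may hold back some events until more data arrives, which is why the queue is drained once more after `close()`.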

scrapy get data dict from json dictionary

I am trying to get all of the data stored in this json
as a dictionary that I can load and access. I am still new to writing spiders, but I believe I need something like
response.xpath().extract()
and then json.load().split() to get an element from it.
But the exact syntax I am not sure of, since there are so many elements in this file.
You can use re_first() to extract the JSON from the JavaScript code and then parse it with loads() from the json module:
import json
d = response.xpath('//script[contains(., "windows.PAGE_MODEL")]/text()').re_first(r'(?s)windows.PAGE_MODEL = (.+?\});')
data = json.loads(d)
property_id = data['propertyData']['id']
You're right, it pretty much works like you suggested in your question.
You can check the script tags for 'windows.PAGE_MODEL' with a simple xpath query.
Please try the following code in the callback for your request:
d = response.xpath('//script[text()[contains(., "windows.PAGE_MODEL")]]/text()').get()
from json import loads
data = loads(d)
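The extraction step can be tried without Scrapy. The sketch below applies the same regular expression to a made-up string standing in for the script text; the sample content and the helper name `extract_page_model` are assumptions. Note that if the script text still carries the `windows.PAGE_MODEL = ` prefix, passing it straight to `json.loads` would raise a `JSONDecodeError`, so the prefix likely needs to be stripped first.

```python
import json
import re

# Made-up stand-in for the text of the <script> tag.
script_text = 'windows.PAGE_MODEL = {"propertyData": {"id": 12345}};'


def extract_page_model(text):
    """Return the object assigned to windows.PAGE_MODEL as a dict, or None."""
    match = re.search(r'(?s)windows\.PAGE_MODEL = (.+?\});', text)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_page_model(script_text)
print(data['propertyData']['id'])
```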

How do I load JSON into Couchbase Headless Server in Python?

I am trying to create a Python script that can take a JSON object and insert it into a headless Couchbase server. I have been able to successfully connect to the server and insert some data. I'd like to be able to specify the path of a JSON object and upsert that.
So far I have this:
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError
import json
cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')
print cb.server_nodes
#tempJson = json.loads(open("myData.json","r"))
try:
    result = cb.upsert('healthRec', {'record': 'bob'})
    # result = cb.upsert('healthRec', {'record': tempJson})
except CouchbaseError as e:
    print "Couldn't upsert", e
    raise
print(cb.get('healthRec').value)
I know that the first commented out line that loads the json is incorrect because it is expecting a string not an actual json... Can anyone help?
Thanks!
Figured it out:
with open('myData.json', 'r') as f:
    data = json.load(f)
try:
    result = cb.upsert('healthRec', {'record': data})
except CouchbaseError as e:
    print "Couldn't upsert", e
    raise
I am looking into using cbdocloader, but this was my first step getting this to work. Thanks!
I know that you've found a solution that works for you in this instance but I thought I'd correct the issue that you experienced in your initial code snippet.
json.loads() takes a string as an input and decodes the json string into a dictionary (or whatever custom object you use based on the object_hook), which is why you were seeing the issue as you are passing it a file handle.
There is actually a method json.load() which works as expected, as you have used in your eventual answer.
You would have been able to use it as follows (if you wanted something slightly less verbose than the with statement):
tempJson = json.load(open("myData.json","r"))
As Kirk mentioned though if you have a large number of json documents to insert then it might be worth taking a look at cbdocloader as it will handle all of this boilerplate code for you (with appropriate error handling and other functionality).
This readme covers the uses of cbdocloader and how to format your data correctly to allow it to load your documents into Couchbase Server.
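The difference between json.load() and json.loads() can be checked without a Couchbase server. The sketch below writes a made-up record to a temporary file (the record and the temporary path are assumptions) and reads it back both ways, then shows the error raised when a file handle is passed to json.loads().

```python
import json
import os
import tempfile

# Made-up record standing in for the contents of myData.json.
record = {'name': 'bob', 'age': 42}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'myData.json')
    with open(path, 'w') as f:
        json.dump(record, f)

    # json.load() reads from a file object.
    with open(path) as f:
        from_file = json.load(f)

    # json.loads() parses a string, so the file has to be read first.
    with open(path) as f:
        from_text = json.loads(f.read())

    # Passing the file handle to json.loads() fails with a TypeError.
    error = None
    try:
        with open(path) as f:
            json.loads(f)
    except TypeError as exc:
        error = exc

print(from_file == from_text)
print(type(error).__name__)
```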

Parsing XML Object Python 3.4

Basically what I am doing is using urllib.request to make an API call to pubmed, receive an XML file in return, and am trying to parse it with no luck.
I have tried using Element Tree and other modules with no luck. I believe there may be an issue with XML object itself.
#Importing URL request modules for API calls
#Also importing ElementTree as it seems to be best for XML parsing
import urllib.request
import urllib.parse
import re
import xml.etree.ElementTree as ET
from urllib import request
#Now I can make the API call.
id_request = urllib.request.urlopen('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568')
#id_request will be an object that I'm not sure I understand?
#id_request Returns: "<http.client.HTTPResponse object at 0x0000000003693FD0>"
#Let's now read this baby in XML format!
id_pubmed = id_request.read()
#If I look at the id_pubmed object, I now have the XML file I want to parse.
You can see what the XML file id_pubmed is calling/prints here: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568
My issue is I can't get Element Tree to parse this at all. I have tried:
tree = ET.parse(id_pubmed)
root = tree.getroot()
as well as various other suggestions from https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
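The ET.parse() method requires either the location of the XML file (on the local file system) or a file-like object, but your id_pubmed holds the document contents that were already read into memory (a bytes object in Python 3).
In that case, you should use ET.fromstring(), which accepts those contents directly. Example: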
root = ET.fromstring(id_pubmed)
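A small offline sketch of the two options. The XML below is a made-up stand-in, not the real PubMed response, and the element names in it are assumptions. `ET.fromstring()` takes the contents directly, while `ET.parse()` needs a file-like object, so the unread response object (or an `io.BytesIO` wrapper around the bytes) would work there.

```python
import io
import xml.etree.ElementTree as ET

# Made-up document standing in for what id_request.read() returns.
id_pubmed = (b'<?xml version="1.0" encoding="UTF-8"?>'
             b'<Result><Doc><Id>17570568</Id></Doc></Result>')

# Option 1: parse the in-memory contents directly.
root = ET.fromstring(id_pubmed)

# Option 2: hand ET.parse() a file-like object instead.
tree = ET.parse(io.BytesIO(id_pubmed))
root_from_file = tree.getroot()

print(root.tag)
print(root.find('Doc/Id').text)
```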

Open URL stored in a csv file

I'm almost an absolute beginner in Python, but I have been asked to handle a difficult task. I have read many tutorials and found some very useful tips on this website, but I think this question has not been asked before, or at least not in the way I phrased it in the search engine.
I have managed to write some URLs to a CSV file. Now I would like to write a script that opens this file, opens the URLs, and writes their content into a dictionary. But I have failed: my script can print these addresses, but cannot process the file.
Interestingly, my script did not give the same error message each time. Here is the last one: req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
So I think my script faces several problems:
1 - Is my method of opening the URLs the right one?
2 - What is wrong in the way I build the dictionary?
Here is my attempt below. Thanks in advance to those who can help me!
import csv
import urllib
dict = {}
test = csv.reader(open("read.csv","rb"))
for z in test:
    sock = urllib.urlopen(z)
    source = sock.read()
    dict[z] = source
    sock.close()
print dict
First thing, don't shadow built-ins. Rename your dictionary to something else as dict is used to create new dictionaries.
Secondly, the csv reader creates a list per line that contains all the columns. Either reference the column explicitly with urllib.urlopen(z[0]) (the first column in the line), or open the file with a normal open() and iterate through it.
Apart from that, it works for me.
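Both points can be sketched in Python 3 as follows. The helper names `read_urls` and `fetch_all` and the `opener` parameter are assumptions for illustration; `opener` exists only so the download step can be swapped out when trying the code offline.

```python
import csv
import io
import urllib.request


def read_urls(csv_file):
    """Return the first column of every non-empty row."""
    return [row[0] for row in csv.reader(csv_file) if row]


def fetch_all(urls, opener=urllib.request.urlopen):
    """Map each URL to the raw bytes downloaded from it."""
    pages = {}  # named 'pages' so the built-in dict is not shadowed
    for url in urls:
        with opener(url) as sock:
            pages[url] = sock.read()
    return pages


# Made-up CSV content standing in for read.csv.
sample_csv = io.StringIO('http://a.example/,first\nhttp://b.example/,second\n')
urls = read_urls(sample_csv)
print(urls)
```

With a real file, this would be called as `with open("read.csv", newline="") as f: pages = fetch_all(read_urls(f))`.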
