Is there a way to feed downloaded xml continuously into XMLPullParser? - python

In my python script, I'm downloading some XML from a url. It contains the a list of elements within the root element. It really takes quite some time to do so and since the documentation of etree suggested to use the XMLPullParser for things like that, I wanted to try it, but didn't find any way of continuously reading the url into the XMLPullParser. I had hoped to already be able to process the list entries one by one that way, while still downloading. Anyone any idea?

You could try using urllib.request.urlopen from the standard library. Like open, you can use this as a context manager;
with urllib.request.urlopen("http://www.python.org/") as uf:
while True:
data = uf.read(1024) # read returns empty string when finished.
if data:
# feed to pullparser here...
print(data)
else:
break;

Related

How do I load JSON into Couchbase Headless Server in Python?

I am trying to create a Python script that can take a JSON object and insert it into a headless Couchbase server. I have been able to successfully connect to the server and insert some data. I'd like to be able to specify the path of a JSON object and upsert that.
So far I have this:
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError
import json
cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')
print cb.server_nodes
#tempJson = json.loads(open("myData.json","r"))
try:
result = cb.upsert('healthRec', {'record': 'bob'})
# result = cb.upsert('healthRec', {'record': tempJson})
except CouchbaseError as e:
print "Couldn't upsert", e
raise
print(cb.get('healthRec').value)
I know that the first commented out line that loads the json is incorrect because it is expecting a string not an actual json... Can anyone help?
Thanks!
Figured it out:
with open('myData.json', 'r') as f:
data = json.load(f)
try:
result = cb.upsert('healthRec', {'record': data})
I am looking into using cbdocloader, but this was my first step getting this to work. Thanks!
I know that you've found a solution that works for you in this instance but I thought I'd correct the issue that you experienced in your initial code snippet.
json.loads() takes a string as an input and decodes the json string into a dictionary (or whatever custom object you use based on the object_hook), which is why you were seeing the issue as you are passing it a file handle.
There is actually a method json.load() which works as expected, as you have used in your eventual answer.
You would have been able to use it as follows (if you wanted something slightly less verbose than the with statement):
tempJson = json.load(open("myData.json","r"))
As Kirk mentioned though if you have a large number of json documents to insert then it might be worth taking a look at cbdocloader as it will handle all of this boilerplate code for you (with appropriate error handling and other functionality).
This readme covers the uses of cbdocloader and how to format your data correctly to allow it to load your documents into Couchbase Server.

Parameter with dictionary path

I am very new to Python and am not very familiar with the data structures in Python.
I am writing an automatic JSON parser in Python, the JSON message is read into a dictionary using Ultra-JSON:
jsonObjs = ujson.loads(data)
Now, if I try something like:
jsonObjs[param1][0][param2] it works fine
However, I need to get the path from an external source (I read it from the DB), we initially thought we'll just write in the DB:
myPath = [param1][0][param2]
and then try to access:
jsonObjs[myPath]
But after a couple of failures I realized I'm trying to access:
jsonObjs[[param1][0][param2]]
Is there a way to fix this without parsing myPath?
Many thanks for your help and advice
Store the keys in a format that preserves type information, e.g. JSON, and then use reduce() to perform recursive accesses on the structure.

How to write a python script for downloading?

I want to download some files from this site: http://www.emuparadise.me/soundtracks/highquality/index.php
But I only want to get certain ones.
Is there a way to write a python script to do this? I have intermediate knowledge of python
I'm just looking for a bit of guidance, please point me towards a wiki or library to accomplish this
thanks,
Shrub
Here's a link to my code
I looked at the page. The links seem to redirect to another page, where the file is hosted, clicking which downloads the file.
I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resultant page to get the filename.
Then it's a simple matter of opening the file using urlopen and writing its contents out into a local file like so:
f = open(localFilePath, 'w')
f.write(urlopen(remoteFilePath).read())
f.close()
Hope that helps
Make a url request for the page. Once you have the source, filter out and get urls.
The files you want to download are urls that contain a specific extension. It is with this that you can do a regular expression search for all urls that match your criteria.
After filtration, then do a url request for each matched url's data and write it to memory.
Sample code:
#!/usr/bin/python
import re
import sys
import urllib
#Your sample url
sampleUrl = "http://stackoverflow.com"
urlAddInfo = urllib.urlopen(sampleUrl)
data = urlAddInfo.read()
#Sample extensions we'll be looking for: pngs and pdfs
TARGET_EXTENSIONS = "(png|pdf)"
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE|re.MULTILINE)
#Let's get all the urls: match criteria{no spaces or " in a url}
urls = re.findall('(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE)
#We want these folks
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls)
#The rest of the unmatched urls for which the scrapping can also be repeated.
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls)
def fileDl(targetUrl):
#Function to handle downloading of files.
#Arg: url => a String
#Output: Boolean to signify if file has been written to memory
#Validation of the url assumed, for the sake of keeping the illustration short
urlAddInfo = urllib.urlopen(targetUrl)
data = urlAddInfo.read()
fileNameSearch = re.search("([^\/\s]+)$", targetUrl) #Text right before the last slash '/'
if not fileNameSearch:
sys.stderr.write("Could not extract a filename from url '%s'\n"%(targetUrl))
return False
fileName = fileNameSearch.groups(1)[0]
with open(fileName, "wb") as f:
f.write(data)
sys.stderr.write("Wrote %s to memory\n"%(fileName))
return True
#Let's now download the matched files
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches)
successfulDls = filter(lambda s: s, dlResults)
sys.stderr.write("Downloaded %d files from %s\n"%(len(successfulDls), sampleUrl))
#You can organize the above code into a function to repeat the process for each of the
#other urls and in that way you can make a crawler.
The above code is written mainly for Python2.X. However, I wrote a crawler that works on any version starting from 2.X
Why yes! 5 years later and, not only is this possible, but you've now got a lot of ways to do it.
I'm going to avoid code-examples here, because mainly want to help break your problem into segments and give you some options for exploration:
Segment 1: GET!
If you must stick to the stdlib, for either python2 or python3, urllib[n]* is what you're going to want to use to pull-down something from the internet.
So again, if you don't want dependencies on other packages:
urllib or urllib2 or maybe another urllib[n] I'm forgetting about.
If you don't have to restrict your imports to the Standard Library:
you're in luck!!!!! You've got:
requests with docs here. requests is the golden standard for gettin' stuff off the web with python. I suggest you use it.
uplink with docs here. It's relatively new & for more programmatic client interfaces.
aiohttp via asyncio with docs here. asyncio got included in python >= 3.5 only, and it's also extra confusing. That said, it if you're willing to put in the time it can be ridiculously efficient for exactly this use-case.
...I'd also be remiss not to mention one of my favorite tools for crawling:
fake_useragent repo here. Docs like seriously not necessary.
Segment 2: Parse!
So again, if you must stick to the stdlib and not install anything with pip, you get to use the extra-extra fun and secure (<==extreme-sarcasm) xml builtin module. Specifically, you get to use the:
xml.etree.ElementTree() with docs here.
It's worth noting that the ElementTree object is what the pip-downloadable lxml package is based on, and made make easier to use. If you want to recreate the wheel and write a bunch of your own complicated logic, using the default xml module is your option.
If you don't have to restrict your imports to the Standard Library:
lxml with docs here. As i said before, lxml is a wrapper around xml.etree that makes it human-usable & implements all those parsing tools you'd need to make yourself. However, as you can see by visiting the docs, it's not easy to use by itself. This brings us to...
BeautifulSoup aka bs4 with docs here. BeautifulSoup makes everything easier. It's my recommendation for this.
Segment 3: GET GET GET!
This section is nearly exactly the same as "Segment 1," except you have a bunch of links not one.
The only thing that changes between this section and "Segment 1" is my recommendation for what to use: aiohttp here will download way faster when dealing with several URLs because it's allows you to download them in parallel.**
* - (where n was decided-on from python-version to ptyhon-version in a somewhat frustratingly arbitrary manner. Look up which urllib[n] has .urlopen() as a top-level function. You can read more about this naming-convention clusterf**k here, here, and here.)
** - (This isn't totally true. It's more sort-of functionally-true at human timescales.)
I would use a combination of wget for downloading - http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/#more-1885 and BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the downloaded file

having issue with parsing xml in python

i am really new to python. this is actually my first script with it and most of it is a copied example. i have an xml file that i need to parse out an attribute for. i got that part figured out but my issue is that that attribute does not always exist in the xml file. here is my code:
#!/usr/bin/python
#import library to do http requests:
import urllib2
import os
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#download the history:
history = urllib2.urlopen('http://192.168.1.1/example.xml')
#convert to string:
historydata = history.read()
history.close()
#parse the xml you downloaded
dom = parseString(historydata)
xmlTagHistory = dom.getElementsByTagName('loaded')[0].toxml()
xmlDataHistory=xmlTagHistory.replace('<loaded>','').replace('</loaded>','')
print xmlDataHistory
when the attribute doesnt exist i get a return of "IndexError: list index out of range". what i am attempting to do with this code is to get it to run a command if the attribute doesnt exist, or it is false. the other issue i will probably have is that there will be times when that attribute appears more than once so i would also need it to account for that scenario by NOT running the command if there is even one instance of "loaded" being true. as i said, i am really new at this so i could use all the help i can get. much appreciated.
Since dom.getElementsByTagName('loaded') returns a list, you can just check the list size with the len(list) function. Only if the list length is above 0, is it valid to do the [0] dereferencing.
An alternative is to wrap the code in try/exception pair and catch the parse exception.
http://docs.python.org/tutorial/errors.html
using the try and except you should be able to handle everything you need.

Open URL stored in a csv file

I'm almost an absolute beginner in Python, but I am asked to manage some difficult task. I have read many tutorials and found some very useful tips on this website, but I think that this question was not asked until now, or at least in the way I tried it in the search engine.
I have managed to write some url in a csv file. Now I would like to write a script able to open this file, to open the urls, and write their content in a dictionary. But I have failed : my script can print these addresses, but cannot process the file.
Interestingly, my script dit not send the same error message each time. Here the last : req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
So I think my script faces several problems :
1- is my method to open url the right one ?
2 - and what is wrong in the way I build the dictionnary ?
Here is my attempt below. Thanks in advance to those who would help me !
import csv
import urllib
dict = {}
test = csv.reader(open("read.csv","rb"))
for z in test:
sock = urllib.urlopen(z)
source = sock.read()
dict[z] = source
sock.close()
print dict
First thing, don't shadow built-ins. Rename your dictionary to something else as dict is used to create new dictionaries.
Secondly, the csv reader creates a list per line that would contain all the columns. Either reference the column explicitly by urllib.urlopen(z[0]) # First column in the line or open the file with a normal open() and iterate through it.
Apart from that, it works for me.

Categories

Resources