I'm trying to consume the Exchange GetAttachment webservice using requests, lxml and base64io. This service returns a base64-encoded file in a SOAP XML HTTP response. The file content is contained in a single line in a single XML element. GetAttachment is just an example, but the problem is more general.
I would like to stream the decoded file contents directly to disk without ever holding the entire attachment in memory, since an attachment could be several hundred MB.
I have tried something like this:
r = requests.post('https://example.com/EWS/Exchange.asmx', data=..., stream=True)
with open('foo.txt', 'wb') as f:
    for action, elem in lxml.etree.iterparse(GzipFile(fileobj=r.raw)):
        if elem.tag == 't:Content':
            b64_encoder = Base64IO(BytesIO(elem.text))
            f.write(b64_encoder.read())
but lxml still stores a copy of the attachment as elem.text. Is there any way I can create a fully streaming XML parser that also streams the content of an element directly from the input stream?
Don't use iterparse in this case. The iterparse() method can only issue element start and end events, so any text in an element is given to you when the closing XML tag has been found.
Instead, use a SAX parser interface. This is a general standard for XML parsing libraries to pass parsed data on to a content handler. The ContentHandler.characters() callback is passed character data in chunks (assuming that the implementing XML library actually makes use of this possibility). This is a lower-level API than the ElementTree API, and the Python standard library already bundles the Expat parser to drive it.
So the flow then becomes:
wrap the incoming request stream in a GzipFile for easy decompression. Or, better still, set response.raw.decode_content = True and leave decompression to the requests library based on the content-encoding the server has set.
Pass the GzipFile instance or raw stream to the .parse() method of a parser created with xml.sax.make_parser(). The parser then proceeds to read from the stream in chunks. By using make_parser() you can first enable features such as namespace handling (which ensures your code doesn't break if Exchange decides to alter the short prefixes used for each namespace).
The content handler characters() method is called with chunks of XML data; check for the correct element start event, so you know when to expect base64 data. You can decode that base64 data in chunks of (a multiple of) 4 characters at a time, and write it to a file. I'd not use base64io here, just do your own chunking.
A simple content handler could be:
from xml.sax import handler
from base64 import b64decode

class AttachmentContentHandler(handler.ContentHandler):
    types_ns = 'http://schemas.microsoft.com/exchange/services/2006/types'

    def __init__(self, filename):
        self.filename = filename

    def startDocument(self):
        self._buffer = None
        self._file = None

    def startElementNS(self, name, *args):
        if name == (self.types_ns, 'Content'):
            # we can expect base64 data next
            self._file = open(self.filename, 'wb')
            self._buffer = []

    def endElementNS(self, name, *args):
        if name == (self.types_ns, 'Content'):
            # all attachment data received, close the file
            try:
                if self._buffer:
                    raise ValueError("Incomplete Base64 data")
            finally:
                self._file.close()
                self._file = self._buffer = None

    def characters(self, data):
        if self._buffer is None:
            return
        self._buffer.append(data)
        self._decode_buffer()

    def _decode_buffer(self):
        remainder = ''
        for data in self._buffer:
            available = len(remainder) + len(data)
            overflow = available % 4
            if remainder:
                data = remainder + data
                remainder = ''
            if overflow:
                remainder, data = data[-overflow:], data[:-overflow]
            if data:
                self._file.write(b64decode(data))
        self._buffer = [remainder] if remainder else []
and you'd use it like this:
import requests
from xml.sax import make_parser, handler
parser = make_parser()
parser.setFeature(handler.feature_namespaces, True)
parser.setContentHandler(AttachmentContentHandler('foo.txt'))
r = requests.post('https://example.com/EWS/Exchange.asmx', data=..., stream=True)
r.raw.decode_content = True # if content-encoding is used, decompress as we read
parser.parse(r.raw)
This will parse the input XML in chunks of up to 64KB (the default IncrementalParser buffer size), so attachment data is decoded in at most 48KB blocks of raw data.
I'd probably extend the content handler to take a target directory and then look for <t:Name> elements to extract the filename, then use that to extract the data to the correct filename for each attachment found. You'd also want to verify that you are actually dealing with a GetAttachmentResponse document, and handle error responses.
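A rough sketch of such an extension (untested; it assumes each attachment's <t:Name> element arrives before its <t:Content> element, and the target_dir parameter plus the fallback filename are my own additions, not taken from a real GetAttachment response):

import os

class AttachmentDirectoryHandler(AttachmentContentHandler):
    """Write each attachment into target_dir, named after its <t:Name> element."""

    def __init__(self, target_dir):
        self.target_dir = target_dir
        self._name_parts = None
        self._pending_name = 'attachment.bin'  # fallback if no <t:Name> was seen

    def startElementNS(self, name, *args):
        if name == (self.types_ns, 'Name'):
            self._name_parts = []  # start collecting the attachment name
        elif name == (self.types_ns, 'Content'):
            # decide on the output path just before the base64 data starts
            self.filename = os.path.join(
                self.target_dir, os.path.basename(self._pending_name))
            AttachmentContentHandler.startElementNS(self, name, *args)  # opens self._file

    def endElementNS(self, name, *args):
        if name == (self.types_ns, 'Name'):
            self._pending_name = ''.join(self._name_parts)
            self._name_parts = None
        elif name == (self.types_ns, 'Content'):
            AttachmentContentHandler.endElementNS(self, name, *args)  # closes self._file

    def characters(self, data):
        if self._name_parts is not None:
            self._name_parts.append(data)
        else:
            AttachmentContentHandler.characters(self, data)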
Related
I'm sending a text file with a string in a python script via POST to my server:
fo = open('data.txt', 'a')
fo.write("hi, this is my testing data")
fo.close()
with open('data.txt', 'rb') as f:
    r = requests.post("http://XXX.XX.X.X", data={'data.txt': f})
    f.close()
And receiving and handling it here in my server handler script, built off an example found online:
def do_POST(self):
    data = self.rfile.read(int(self.headers.getheader('Content-Length')))
    empty = [data]
    with open('processing.txt', 'wb') as file:
        for item in empty:
            file.write("%s\n" % item)
        file.close()
    self._set_headers()
    self.wfile.write("<html><body><h1>POST!</h1></body></html>")
My question is, how does:
self.rfile.read(int(self.headers.getheader('Content-Length')))
take the length of my data (an integer count of bytes) and read my file? I am confused about how it knows what my data contains. What is going on behind the scenes with HTTP?
It outputs data.txt=hi%2C+this+is+my+testing+data
to my processing.txt, but I am expecting "hi this is my testing data"
I tried but failed to find documentation for what exactly rfile.read() does, and if simply finding that answers my question I'd appreciate it, and I could just delete this question.
Your client code snippet reads contents from the file data.txt and makes a POST request to your server with data structured as a key-value pair. The data sent to your server in this case is one key data.txt with the corresponding value being the contents of the file.
Your server code snippet reads the entire HTTP request body and dumps it into a file. The key-value pair built and sent by the client arrives URL-encoded (application/x-www-form-urlencoded), a format that can be decoded with Python's built-in urlparse module.
Here is a solution that could work:
import urlparse

def do_POST(self):
    length = int(self.headers.getheader('content-length'))
    field_data = self.rfile.read(length)
    fields = urlparse.parse_qs(field_data)
This snippet of code was shamefully borrowed from: https://stackoverflow.com/a/31363982/705471
If you'd like to extract the contents of your text file back, adding the following line to the above snippet could help:
data_file = fields["data.txt"][0]  # parse_qs maps each key to a list of values
To learn more about how such information is encoded for the purposes of HTTP, read more at: https://en.wikipedia.org/wiki/Percent-encoding
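For example, running the exact body shown in the question through parse_qs recovers the original text (a quick Python 2 sketch, matching the urlparse module used above):

import urlparse

body = "data.txt=hi%2C+this+is+my+testing+data"
fields = urlparse.parse_qs(body)
print fields["data.txt"][0]  # prints: hi, this is my testing data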
import urllib.request,io
url = 'http://www.image.com/image.jpg'
path = io.BytesIO(urllib.request.urlopen(url).read())
I'd like to check the file size of the URL image in the file stream path before saving; how can I do this?
Also, I don't want to rely on the Content-Length header; I'd like to fetch the image into a file stream, check its size, and then save it.
You can get the size of the io.BytesIO() object the same way you can get it for any file object: by seeking to the end and asking for the file position:
path = io.BytesIO(urllib.request.urlopen(url).read())
path.seek(0, 2) # 0 bytes from the end
size = path.tell()
However, you could just as easily have just taken the len() of the bytestring you just read, before inserting it into an in-memory file object:
data = urllib.request.urlopen(url).read()
size = len(data)
path = io.BytesIO(data)
Note that this means your image has already been loaded into memory. You cannot use this to prevent loading too large an image object; for that, the Content-Length header is the only option.
If the server uses a chunked transfer encoding to facilitate streaming (so no content length has been set up front), you can use a loop to limit how much data is read.
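A minimal sketch of such a loop, assuming an arbitrary 10 MB cap (the limit, the chunk size, and the url variable from the question are illustrative):

import io
import urllib.request

MAX_BYTES = 10 * 1024 * 1024  # refuse anything larger than 10 MB
CHUNK_SIZE = 64 * 1024

path = io.BytesIO()
with urllib.request.urlopen(url) as response:
    while True:
        chunk = response.read(CHUNK_SIZE)
        if not chunk:
            break
        if path.tell() + len(chunk) > MAX_BYTES:
            raise ValueError("image larger than %d bytes" % MAX_BYTES)
        path.write(chunk)

size = path.tell()  # total number of bytes actually read
path.seek(0)        # rewind before handing the buffer to a consumer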
Try importing urllib.request
import urllib.request, io
url = 'http://www.elsecarrailway.co.uk/images/Events/TeddyBear-3.jpg'
path = urllib.request.urlopen(url)
meta = path.info()
>>> meta.get("Content-Length")
'269898'  # i.e. roughly 270 kB
You could ask the server for the content-length information. Using urllib2 (which I hope is available in your python):
req = urllib2.urlopen(url)
meta = req.info()
length_text = meta.getheader("Content-Length")
try:
    length = int(length_text)
except (TypeError, ValueError):
    # length unknown, you may need to read
    length = -1
I'm trying to parse the following feed into ElementTree in python: "http://smarkets.s3.amazonaws.com/oddsfeed.xml" (warning large file)
Here is what I have tried so far:
feed = urllib.urlopen("http://smarkets.s3.amazonaws.com/oddsfeed.xml")
# feed is compressed
compressed_data = feed.read()
import StringIO
compressedstream = StringIO.StringIO(compressed_data)
import gzip
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()
# Parse XML
tree = ET.parse(data)
but it seems to just hang on compressed_data = feed.read(), perhaps indefinitely? (I know it's a big file, but it takes far longer than the other, non-compressed feeds I've parsed, and a delay that long kills any bandwidth gains from the gzip compression in the first place).
Next I tried requests, with
url = "http://smarkets.s3.amazonaws.com/oddsfeed.xml"
headers = {'accept-encoding': 'gzip, deflate'}
r = requests.get(url, headers=headers, stream=True)
but now both
tree = ET.parse(r.content)
and
tree = ET.parse(r.text)
raise exceptions.
What's the proper way to do this?
You can pass the value returned by urlopen() directly to GzipFile() and in turn you can pass it to ElementTree methods such as iterparse():
#!/usr/bin/env python3
import xml.etree.ElementTree as etree
from gzip import GzipFile
from urllib.request import urlopen, Request

with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
                     headers={"Accept-Encoding": "gzip"})) as response, \
     GzipFile(fileobj=response) as xml_file:
    for elem in getelements(xml_file, 'interesting_tag'):
        process(elem)
where getelements() lets you parse files that do not fit in memory:
def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementally."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # free memory
To preserve memory, the constructed xml tree is cleared on each tag element.
The ET.parse function takes "a filename or file object containing XML data". You're giving it a string full of XML. It's going to try to open a file whose name is that big chunk of XML. There is probably no such file.
You want the fromstring function, or the XML constructor.
Or, if you prefer, you've already got a file object, gzipper; you could just pass that to parse instead of reading it into a string.
This is all covered by the short Tutorial in the docs:
We can import this data by reading from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Or directly from a string:
root = ET.fromstring(country_data_as_string)
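Applied to the code in the question, either of these should work (a sketch; gzipper and data are the objects already built in the question's snippet):

import xml.etree.ElementTree as ET

# parse straight from the file object, no intermediate string needed
tree = ET.parse(gzipper)
root = tree.getroot()

# or, if you already called gzipper.read(), parse the string instead
root = ET.fromstring(data)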
I have to read a txt ini file from my browser. [this is required]
res = urllib2.urlopen(URL)
inifile = res.read()
Then I want to basically use this the same way as I would have read any txt file.
config = ConfigParser.SafeConfigParser()
config.read( inifile )
But now it looks like I can't use it that way, since this is actually a string.
Can anybody suggest a way around?
You want configparser.readfp -- Presumably, you might even be able to get away with:
res = urllib2.urlopen(URL)
config = ConfigParser.SafeConfigParser()
config.readfp(res)
assuming that urllib2.urlopen returns an object that is sufficiently file-like (i.e. it has a readline method). For easier debugging, you could do:
config.readfp(res, URL)
If you have to read it the data from a string, you could pack the whole thing into a io.StringIO (or StringIO.StringIO) buffer and read from that:
import io

res = urllib2.urlopen(URL)
inifile_text = res.read()
# io.StringIO expects unicode, so on Python 2 decode the bytes first
# (or use StringIO.StringIO with the raw string); UTF-8 is assumed here
inifile = io.StringIO(inifile_text.decode('utf-8'))
inifile.seek(0)
config.readfp(inifile)
I'm currently storing all my photos on amazon s3 and using django for my website. I want a to have a button that allows users to click it and have all their photos zipped and returned to them.
I'm currently using boto to interface with amazon and found that I can go through the entire bucket list / use get_key to look for specific files and download them
After this I would need temporarily store them, then zip and return.
What is the best way to go about doing this?
Thanks
you can take a look at this question or at this snippet to download the file
# This is not a full working example, just a starting point
# for downloading images in different formats.
import subprocess
import Image
from django.http import HttpResponse

def image_as_png_pdf(request):
    output_format = request.GET.get('format')
    im = Image.open(path_to_image)  # any Image object should work
    if output_format == 'png':
        response = HttpResponse(mimetype='image/png')
        response['Content-Disposition'] = 'attachment; filename=%s.png' % filename
        im.save(response, 'png')  # will call response.write()
    else:
        # Temporary disk space, server process needs write access
        tmp_path = '/tmp/'
        # Full path to ImageMagick convert binary
        convert_bin = '/usr/bin/convert'
        im.save(tmp_path + filename + '.png', 'png')
        response = HttpResponse(mimetype='application/pdf')
        response['Content-Disposition'] = 'attachment; filename=%s.pdf' % filename
        ret = subprocess.Popen([convert_bin,
                                "%s%s.png" % (tmp_path, filename), "pdf:-"],
                               stdout=subprocess.PIPE)
        response.write(ret.stdout.read())
    return response
To create a zip, follow the link I gave you; you can also use the standard zipfile module as shown here (examples are at the bottom of the page; follow the documentation for newer versions).
You might also be interested in this, although it was made for Django 1.2 and might not work on 1.3.
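A rough sketch of that temporarily-store-then-zip idea with boto and the standard zipfile module (the bucket name, key prefix, and per-user layout are placeholders, and the whole archive is buffered in memory, so this only suits modest amounts of data):

import zipfile
from StringIO import StringIO

import boto
from django.http import HttpResponse

def download_photos_as_zip(request):
    bucket = boto.connect_s3().get_bucket('my-photo-bucket')  # placeholder bucket
    buffer = StringIO()
    archive = zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED)
    for key in bucket.list(prefix='photos/%s/' % request.user.id):  # placeholder layout
        # download each photo and add it to the archive under its base name
        archive.writestr(key.name.split('/')[-1], key.get_contents_as_string())
    archive.close()
    response = HttpResponse(buffer.getvalue(), mimetype='application/zip')
    response['Content-Disposition'] = 'attachment; filename=photos.zip'
    return response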
Using python-zipstream as patched with this pull request you can do something like this:
import boto
import io
import zipstream
import sys
def iterable_to_stream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module). For efficiency, the stream is buffered.

    From: https://stackoverflow.com/a/20260030/729491
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None

        def readable(self):
            return True

        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0  # indicate EOF

    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def iterate_key():
    bucket = boto.connect_s3().get_bucket('lastage')
    key = bucket.get_key('README.markdown')
    for chunk in key:
        yield chunk

with open('/tmp/foo.zip', 'wb') as f:
    z = zipstream.ZipFile(mode='w')
    z.write(iterable_to_stream(iterate_key()), arcname='foo1')
    z.write(iterable_to_stream(iterate_key()), arcname='foo2')
    z.write(iterable_to_stream(iterate_key()), arcname='foo3')
    for chunk in z:
        print "CHUNK", len(chunk)
        f.write(chunk)
Basically we iterate over the key contents using boto, convert this iterator to a stream using the iterable_to_stream method from this answer and then have python-zipstream create a zip file on-the-fly.