I need to process (if possible) an XML file that sits inside a GZ archive, while streaming it over HTTPS.
If saved to disk, the resulting file is very big: 23 GB.
Right now I GET the data over HTTPS using streaming and save the file to storage. Since the Python script needs to be deployed on AWS as a Batch job, storage is not an option, and I would prefer not to use the S3 service as storage.
The algorithm should be:
while stream GET HTTPS in chunk:
- get xml chunk from GZ chunk
- process xml chunk
The XML, for example, has the following structure:
<List>
<Property>
<id = '123>
<PhotoProperties>
<Photo>
<url = 'https://www.url.com/photo/1.jpg>
</Photo>
</PhotoProperties>
</Property>
<Property>...</Property>
I need to extract the data as a list of
@dataclass
class Picture:
    id: int
    url: str
Yes, this is possible.
The key is that all the operations support streaming, and there are libraries to do so:
- urllib.request for streaming the content
- zlib for decompressing a gzip stream
Regarding XML parsing, it is important to understand that there are two major ways to parse an XML file:
- DOM parsing is useful when the full XML can be stored in memory. It allows easy manipulation and discovery of your XML content.
- SAX-style parsing is useful when the XML cannot be stored in memory, e.g. because it is too big or because you want to start handling it before the full stream has been read. This is what you need in your case; xml.parsers.expat can be used for this.
I created a (well-formed) xml fragment based on your example:
<?xml version="1.0" encoding="UTF-8"?>
<List>
  <Property id="123">
    <PhotoProperties>
      <Photo url="https://www.url.com/photo/1.jpg"/>
    </PhotoProperties>
  </Property>
  <Property id="456">
    <PhotoProperties>
      <Photo url="https://www.url.com/photo/2.jpg"/>
    </PhotoProperties>
  </Property>
</List>
Because you do not load the full XML into memory, it is a bit more complex to parse it. You need to create handlers that get called when, for example, an XML element is opened or closed. In the example below I've put these handlers in a class that keeps state in a Picture object and prints it when the closing tag is found:
import urllib.request
import zlib
import xml.parsers.expat
from dataclasses import dataclass

URL = 'https://some.url.com/pictures.xml'

@dataclass
class Picture:
    id: int
    url: str

class ParseHandler:
    def __init__(self):
        self.currentPicture = None

    def start_element(self, name, attrs):
        if name == 'Property':
            self.currentPicture = Picture(int(attrs['id']), None)
        elif name == 'Photo':
            self.currentPicture.url = attrs['url']

    def end_element(self, name):
        if name == 'Property':
            print(self.currentPicture)
            self.currentPicture = None

handler = ParseHandler()
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element

# 32 + MAX_WBITS tells zlib to expect and skip the gzip header
decompressor = zlib.decompressobj(32 + zlib.MAX_WBITS)

with urllib.request.urlopen(URL) as stream:
    for gzchunk in stream:
        xmlchunk = decompressor.decompress(gzchunk)
        parser.Parse(xmlchunk)
    # flush remaining buffered data and tell the parser the document is complete
    parser.Parse(decompressor.flush(), True)
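One refinement worth considering: iterating over the response yields one "line" per chunk, which for XML can be either tiny or enormous depending on where the newlines fall. Reading fixed-size blocks keeps memory use predictable. A sketch of just the download loop, under the same assumptions as above:

with urllib.request.urlopen(URL) as stream:
    while True:
        gzchunk = stream.read(64 * 1024)  # 64 KiB per read
        if not gzchunk:
            break
        parser.Parse(decompressor.decompress(gzchunk))
    # signal end of input once the stream is exhausted
    parser.Parse(decompressor.flush(), True)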
Related
I'm trying to implement a server-side multilanguage service on my website. This is the folder structure:
data
--locale
static
--css
--images
--js
templates
--index.html
--page1.html
...
main.py
I use Crowdin to translate the website and the output files are in XML. The locale folder contains one folder for each language with one xml file for every page.
I store the language in a cookie, and here is my Python code:
from flask import request
from xml.dom.minidom import parseString

def languages(page):
    langcode = request.cookies.get("Language")
    xml = "/data/locale/%s/%s.xml" % (langcode, page)
    dom = parseString(xml)
    ................
    .............
I call this on every page, like languages("index").
This is an example of the exported XML files:
<?xml version="1.0" encoding="utf-8"?>
<!--Generated by crowdin.com-->
<!--
This is a description of my page
-->
<resources>
  <string name="name1">value 1</string>
  <string name="name2">value 2</string>
  <string name="name3">value 3</string>
</resources>
However, I get the following error: ExpatError: not well-formed (invalid token): line 1, column 0.
I googled it and ended up at other Stack Overflow questions, but most of them talk about encoding problems, and I cannot find any in my example.
You have to use parse() if you want to parse a file. parseString() parses a string; in your case that string is the file name itself, which is not well-formed XML (hence the error at line 1, column 0).
from flask import request
from xml.dom.minidom import parse

def languages(page):
    langcode = request.cookies.get("Language")
    xml = "/data/locale/%s/%s.xml" % (langcode, page)
    dom = parse(xml)
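Alternatively, if you want to keep parseString(), read the file contents first and pass those in. A minimal sketch, assuming the path resolves to a readable file:

from flask import request
from xml.dom.minidom import parseString

def languages(page):
    langcode = request.cookies.get("Language")
    path = "/data/locale/%s/%s.xml" % (langcode, page)
    # parseString() expects the XML text itself, not a file name
    with open(path) as f:
        dom = parseString(f.read())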
I am trying to parse XML where the URI for the same namespace does not use the same case (some XML owners decided to lower-case URIs). If I parse data with one form of the URI followed by data with the other form, the parser fails to find my data, even though I update the ns dictionary to match the document's URI. Here is an example:
from cStringIO import StringIO
import xml.etree.ElementTree as ET
DATA_lc = '''<?xml version="1.0" encoding="utf-8"?>
<container xmlns:roktatar="http://www.example.com/lower/case/bug">
<item>
<roktatar:author>Boby Mac Gallinger</roktatar:author>
</item>
</container>'''
DATA_UC = '''<?xml version="1.0" encoding="utf-8"?>
<container xmlns:roktatar="http://www.example.com/Lower/Case/Bug">
<item>
<roktatar:author>John-John Le Grandiosant</roktatar:author>
</item>
</container>'''
tree = ET.parse(StringIO(DATA_lc))
root = tree.getroot()
ns = {'roktatar': 'http://www.example.com/lower/case/bug'}
for item in root.iter('item'):
    print item.find('roktatar:author', namespaces=ns).text.strip()

tree = ET.parse(StringIO(DATA_UC))
root = tree.getroot()
ns = {'roktatar': 'http://www.example.com/Lower/Case/Bug'}
for item in root.iter('item'):
    print item.find('roktatar:author', namespaces=ns).text.strip()
If each parsing block is processed on its own, the data gets collected properly, but when they run one after the other, the second always fails. Am I missing some reset/cleanup of the parser between documents? Is this a bug?
Thanks
The ElementTree search code parses arguments to find() and related functions as XPath expressions, and caches the resulting closed-over functions for reuse.
When you search for roktatar:author, that expression is cached as a search for '{http://www.example.com/lower/case/bug}author', but in your second document the binding has changed.
In other words, ElementTree assumes that the same namespace prefix will always map to the same namespace URI.
The better solution to this problem is to use a different prefix, like roktatar_uc, for the title-case version of the URI:
ns = {'roktatar_uc': 'http://www.example.com/Lower/Case/Bug'}
for item in root.iter('item'):
    print item.find('roktatar_uc:author', namespaces=ns).text.strip()
but if that is not an option, you'll have to clear the cache instead:
from xml.etree import ElementPath
ElementPath._cache.clear()
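Putting the distinct-prefix approach together, here is a minimal sketch that parses both documents from the question in sequence (DATA_lc and DATA_UC are reused from above; the print_authors helper is mine, purely for illustration):

import xml.etree.ElementTree as ET

def print_authors(xml_text, path, ns):
    # each document gets its own prefix, so the cached compiled
    # paths never collide across the two namespace bindings
    root = ET.fromstring(xml_text)
    for item in root.iter('item'):
        print(item.find(path, namespaces=ns).text.strip())

print_authors(DATA_lc, 'roktatar:author',
              {'roktatar': 'http://www.example.com/lower/case/bug'})
print_authors(DATA_UC, 'roktatar_uc:author',
              {'roktatar_uc': 'http://www.example.com/Lower/Case/Bug'})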
I am very new to the Python scripting language and am currently working on a parser which parses a web-based XML file.
I am able to retrieve all but one of the elements using minidom in Python with no issues; however, there is one node I am having trouble with. The last node I require from the XML file is the 'url' within the 'image' tag, which can be found in the following XML example:
<events>
  <event id="abcde01">
    <title> Name of event </title>
    <url> The URL of the Event <- the url tag I do not need </url>
    <image>
      <url> THE URL I DO NEED </url>
    </image>
  </event>
</events>
Below I have copied brief sections of my code which I feel may be relevant. I really appreciate any help with retrieving this last image url node. I will also include what I have tried and the error I received when I ran this code in GAE. The Python version I am using is Python 2.7, and I should probably also point out that I am saving the results in an array (for later input to a database).
class XMLParser(webapp2.RequestHandler):
    def get(self):
        base_url = 'http://api.eventful.com/rest/events/search?location=Dublin&date=Today'
        # downloads data from xml file:
        response = urllib.urlopen(base_url)
        # converts data to string
        data = response.read()
        unicode_data = data.decode('utf-8')
        data = unicode_data.encode('ascii', 'ignore')
        # closes file
        response.close()
        # parses xml downloaded
        dom = mdom.parseString(data)
        node = dom.documentElement  # needed for declaration of variable
        # print out all event names (titles) found in the eventful xml
        event_main = dom.getElementsByTagName('event')
        # URLs list parsing - MY ATTEMPT -
        urls_list = []
        for im in event_main:
            image_url = im.getElementsByTagName("image")[0].childNodes[0]
            urls_list.append(image_url)
The error I receive is the following; any help is much appreciated. Karen
image_url = im.getElementsByTagName("image")[0].childNodes[0]
IndexError: list index out of range
First of all, do not reencode the content. There is no need to do so; XML parsers are perfectly capable of handling encoded content.
Next, I'd use the ElementTree API for a task like this:
import urllib
from xml.etree import ElementTree as ET

response = urllib.urlopen(base_url)
tree = ET.parse(response)

urls_list = []
for event in tree.findall('.//event[image]'):
    # find the text content of the first <image><url> tag combination:
    image_url = event.find('.//image/url')
    if image_url is not None:
        urls_list.append(image_url.text)
This only considers event elements that have a direct image child element.
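If you'd rather stay with minidom, the IndexError just means that some event elements have no <image> child, so guard against that before indexing. A minimal sketch, assuming each <image> wraps a <url> element as in your sample:

urls_list = []
for im in event_main:
    images = im.getElementsByTagName("image")
    if images:  # some events have no <image> child at all
        url_node = images[0].getElementsByTagName("url")[0]
        urls_list.append(url_node.firstChild.data.strip())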
I'm trying to write some unit tests in Python 2.7 to validate against some extensions I've made to the OAI-PMH schema: http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd
The problem I'm running into is with multiple nested namespaces, which is caused by this specification in the above-mentioned XSD:
<complexType name="metadataType">
  <annotation>
    <documentation>Metadata must be expressed in XML that complies
      with another XML Schema (namespace=#other). Metadata must be
      explicitly qualified in the response.</documentation>
  </annotation>
  <sequence>
    <any namespace="##other" processContents="strict"/>
  </sequence>
</complexType>
Here's a snippet of the code I'm using:
from lxml import etree
import urllib2

query = "http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm"

schema_file = file("../schemas/OAI/2.0/OAI-PMH.xsd", "r")
schema_doc = etree.parse(schema_file)
oaischema = etree.XMLSchema(schema_doc)

request = urllib2.Request(query, headers=xml_headers)
response = urllib2.urlopen(request)
body = response.read()
response_doc = etree.fromstring(body)

try:
    oaischema.assertValid(response_doc)
except etree.DocumentInvalid as e:
    line = 1
    for i in body.split("\n"):
        print "{0}\t{1}".format(line, i)
        line += 1
    print(e.message)
I end up with the following error:
AssertionError: http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm
Element '{http://www.openarchives.org/OAI/2.0/oai_dc/}oai_dc': No matching global element declaration available, but demanded by the strict wildcard., line 22
I understand the error, in that the schema requires the child element of the metadata element to be strictly validated, which the sample XML complies with.
Now, I've written a validator in Java that works; however, it would be helpful for this to be in Python, since the rest of the solution I'm building is Python-based. To make my Java variant work, I had to make my DocumentFactory namespace-aware; otherwise I got the same error. I've not found any working example in Python that performs this validation correctly.
Does anyone have an idea how I can get an XML document with multiple nested namespaces, like my sample doc, to validate with Python?
Here is the sample XML document that I'm trying to validate:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
                             http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" identifier="oai:arXiv.org:cs/0112017"
           metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        <oai_dc:dc
            xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
                                http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          <dc:title>Using Structural Metadata to Localize Experience of
            Digital Content</dc:title>
          <dc:creator>Dushay, Naomi</dc:creator>
          <dc:subject>Digital Libraries</dc:subject>
          <dc:description>With the increasing technical sophistication of
            both information consumers and providers, there is
            increasing demand for more meaningful experiences of digital
            information. We present a framework that separates digital
            object experience, or rendering, from digital object storage
            and manipulation, so the rendering can be tailored to
            particular communities of users.
          </dc:description>
          <dc:description>Comment: 23 pages including 2 appendices,
            8 figures</dc:description>
          <dc:date>2001-12-14</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
Found this in lxml's doc on validation:
>>> schema_root = etree.XML('''\
... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
... <xsd:element name="a" type="xsd:integer"/>
... </xsd:schema>
... ''')
>>> schema = etree.XMLSchema(schema_root)
>>> parser = etree.XMLParser(schema = schema)
>>> root = etree.fromstring("<a>5</a>", parser)
So, perhaps, what you need is this? (See last two lines.):
schema_doc = etree.parse(schema_file)
oaischema = etree.XMLSchema(schema_doc)
request = urllib2.Request(query, headers=xml_headers)
response = urllib2.urlopen(request)
body = response.read()
parser = etree.XMLParser(schema = oaischema)
response_doc = etree.fromstring(body, parser)
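One caveat: with the schema attached to the parser, an invalid document fails at parse time, so the failure surfaces as an XMLSyntaxError raised by fromstring() rather than a DocumentInvalid from assertValid(). A small sketch of the adjusted error handling:

try:
    response_doc = etree.fromstring(body, parser)
except etree.XMLSyntaxError as e:
    # schema violations are reported here when parse-time validation is on
    print(e)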
I need to load an XML file and convert the contents into an object-oriented Python structure. I want to take this:
<main>
  <object1 attr="name">content</object1>
</main>
And turn it into something like this:
main
main.object1 = "content"
main.object1.attr = "name"
The XML data will have a more complicated structure than that and I can't hard code the element names. The attribute names need to be collected when parsing and used as the object properties.
How can I convert XML data into a Python object?
It's worth looking at lxml.objectify.
xml = """<main>
<object1 attr="name">content</object1>
<object1 attr="foo">contenbar</object1>
<test>me</test>
</main>"""
from lxml import objectify
main = objectify.fromstring(xml)
main.object1[0] # content
main.object1[1] # contenbar
main.object1[0].get("attr") # name
main.test # me
Or the other way around to build xml structures:
item = objectify.Element("item")
item.title = "Best of python"
item.price = 17.98
item.price.set("currency", "EUR")
order = objectify.Element("order")
order.append(item)
order.item.quantity = 3
order.price = sum(item.price * item.quantity for item in order.item)
import lxml.etree
print(lxml.etree.tostring(order, pretty_print=True))
Output:
<order>
  <item>
    <title>Best of python</title>
    <price currency="EUR">17.98</price>
    <quantity>3</quantity>
  </item>
  <price>53.94</price>
</order>
I've been recommending this more than once today, but try Beautiful Soup (easy_install BeautifulSoup).
from BeautifulSoup import BeautifulSoup

xml = """
<main>
<object attr="name">content</object>
</main>
"""

soup = BeautifulSoup(xml)
# look in the main node for objects with attr=name; optionally look up attrs with regex
my_objects = soup.main.findAll("object", attrs={'attr': 'name'})
for my_object in my_objects:
    # this will print a list of the contents of the tag
    print my_object.contents
    # if only text is inside the tag you can use this
    # print tag.string
David Mertz's gnosis.xml.objectify would seem to do this for you. Documentation's a bit hard to come by, but there are a few IBM articles on it, including this one (text only version).
from gnosis.xml import objectify
xml = "<root><nodes><node>node 1</node><node>node 2</node></nodes></root>"
root = objectify.make_instance(xml)
print root.nodes.node[0].PCDATA # node 1
print root.nodes.node[1].PCDATA # node 2
Creating xml from objects in this way is a different matter, though.
How about this: http://evanjones.ca/software/simplexmlparse.html
@Stephen: "can't hardcode the element names, so I need to collect them at parse and use them somehow as the object names." I don't think that's possible. Instead you can do the following; it will help you get any object with a required name.
import BeautifulSoup

class Coll(object):
    """A class which can hold your Foo class objects
    and retrieve them easily when you want,
    abstracting the storage and retrieval logic.
    """
    def __init__(self):
        self.foos = {}

    def add(self, fooobj):
        self.foos[fooobj.name] = fooobj

    def get(self, name):
        return self.foos[name]

class Foo(object):
    """The required class."""
    def __init__(self, name, attr1=None, attr2=None):
        self.name = name
        self.attr1 = attr1
        self.attr2 = attr2

s = """<main>
  <object name="somename">
    <attr name="attr1">value1</attr>
    <attr name="attr2">value2</attr>
  </object>
  <object name="someothername">
    <attr name="attr1">value3</attr>
    <attr name="attr2">value4</attr>
  </object>
</main>
"""

soup = BeautifulSoup.BeautifulSoup(s)
bars = Coll()
for each in soup.findAll('object'):
    bar = Foo(each['name'])
    attrs = each.findAll('attr')
    for attr in attrs:
        setattr(bar, attr['name'], attr.renderContents())
    bars.add(bar)

# retrieve objects by name
print bars.get('somename').__dict__
print '\n\n', bars.get('someothername').__dict__
Output:
{'attr2': 'value2', 'name': u'somename', 'attr1': 'value1'}
{'attr2': 'value4', 'name': u'someothername', 'attr1': 'value3'}
There are three common XML parsers for Python: xml.dom.minidom, ElementTree, and BeautifulSoup.
IMO, BeautifulSoup is by far the best.
http://www.crummy.com/software/BeautifulSoup/
If googling around for a code generator doesn't work, you could write your own that uses XML as input and outputs objects in your language of choice.
It's not terribly difficult; however, the three-step process of Parse XML, Generate Code, Compile/Execute Script does make debugging a bit harder.
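If you do roll your own without full code generation, a small recursive wrapper over ElementTree already gets close to the access pattern asked for above. A minimal sketch (the Node class is my own illustration, not a library API; note that repeated child tags overwrite each other here):

import xml.etree.ElementTree as ET

class Node(object):
    """Generic object wrapper around an XML element."""
    def __init__(self, elem):
        self.text = (elem.text or "").strip()
        # XML attributes become object attributes
        for name, value in elem.attrib.items():
            setattr(self, name, value)
        # child elements become nested Node attributes
        for child in elem:
            setattr(self, child.tag, Node(child))

main = Node(ET.fromstring('<main><object1 attr="name">content</object1></main>'))
print(main.object1.text)  # content
print(main.object1.attr)  # name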