I'm trying to implement a server-side multilanguage service on my website. This is the folder structure:
data
--locale
static
--css
--images
--js
templates
--index.html
--page1.html
...
main.py
I use Crowdin to translate the website, and the output files are in XML. The locale folder contains one folder per language, with one XML file for every page.
I store the language in a cookie, and here is my Python code:
from flask import request
from xml.dom.minidom import parseString

def languages(page):
    langcode = request.cookies.get("Language")
    xml = "/data/locale/%s/%s.xml" % (langcode, page)
    dom = parseString(xml)
    ...
which I call on every page, like languages("index").
This is an example of the exported XML files:
<?xml version="1.0" encoding="utf-8"?>
<!--Generated by crowdin.com-->
<!--
This is a description of my page
-->
<resources>
<string name="name1">value 1</string>
<string name="name2">value 2</string>
<string name="name3">value 3</string>
</resources>
However, I get the following error:
ExpatError: not well-formed (invalid token): line 1, column 0
I googled it and ended up at other Stack Overflow questions, but most of them talk about encoding problems, and I cannot find any in my example.
You have to use parse() if you want to parse a file. parseString() parses a string, which in your case is the file name itself, so expat fails on the very first character ("/" is not valid XML), hence "line 1, column 0".
from flask import request
from xml.dom.minidom import parse

def languages(page):
    langcode = request.cookies.get("Language")
    xml = "/data/locale/%s/%s.xml" % (langcode, page)
    dom = parse(xml)
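From there, one possible way to finish the function is to collect the <string> elements into a lookup dict. This is only a sketch of what the elided body might do, based on the sample file you posted:

from flask import request
from xml.dom.minidom import parse

def languages(page):
    langcode = request.cookies.get("Language")
    dom = parse("/data/locale/%s/%s.xml" % (langcode, page))
    # assumption: map each <string name="...">value</string> to a dict entry
    strings = {}
    for node in dom.getElementsByTagName("string"):
        strings[node.getAttribute("name")] = node.firstChild.nodeValue
    return strings

For the sample file above, languages("index") would then return {'name1': 'value 1', 'name2': 'value 2', 'name3': 'value 3'}.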
I'm using Python 3.7.2 and elementtree to copy the content of a tag in an XML file.
This is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.003.03">
<CstmrCdtTrfInitn>
<GrpHdr>
<MsgId>nBblsUR-uH..6jmGgZNHLQAAAXgXN1Lu</MsgId>
<CreDtTm>2016-11-10T12:00:00.000+01:00</CreDtTm>
<NbOfTxs>1</NbOfTxs>
<CtrlSum>6</CtrlSum>
<InitgPty>
<Nm>TC 03000 Kunde 55 Protokollrückführung</Nm>
</InitgPty>
</GrpHdr>
</CstmrCdtTrfInitn>
</Document>
I want to copy the content of the 'MsgId' tag and save it as a string.
I've managed to do this with minidom before, but due to new circumstances, I have to settle for elementtree for now.
This is that code with minidom:
import xml.dom.minidom

dom = xml.dom.minidom.parse('H:\\app_python/in_spsh/{}'.format(filename_string))
message = dom.getElementsByTagName('MsgId')
for MsgId in message:
    print(MsgId.firstChild.nodeValue)
Now I want to do the exact same thing with elementtree. How can I achieve this?
To get the text value of a single element, you can use the findtext() method. The namespace needs to be taken into account.
from xml.etree import ElementTree as ET
tree = ET.parse("test.xml") # Your XML document
msgid = tree.findtext('.//{urn:iso:std:iso:20022:tech:xsd:pain.001.003.03}MsgId')
With Python 3.8 and later, it is possible to use a wildcard for the namespace:
msgid = tree.findtext('.//{*}MsgId')
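If you prefer not to repeat the full namespace URI inside the path, findtext() also accepts a namespaces mapping (standard ElementTree behavior; the prefix doc is arbitrary):

ns = {"doc": "urn:iso:std:iso:20022:tech:xsd:pain.001.003.03"}
msgid = tree.findtext(".//doc:MsgId", namespaces=ns)

Either variant should give you nBblsUR-uH..6jmGgZNHLQAAAXgXN1Lu for the document above.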
I need to process (if possible) an XML file inside a GZ archive while streaming it over HTTPS.
If saved to disk, the resulting file is very big: 23 GB.
Right now I GET the data over HTTPS using streaming and save the file to storage. As the Python script needs to be deployed on AWS as a Batch job, that storage is not an option, and I prefer not to use the S3 service as storage.
The algorithm should be:
while stream GET HTTPS in chunk:
    - get xml chunk from GZ chunk
    - process xml chunk
The XML has, for example, the following structure:
<List>
<Property>
<id = '123>
<PhotoProperties>
<Photo>
<url = 'https://www.url.com/photo/1.jpg>
</Photo>
</PhotoProperties>
</Property>
<Property>...</Property>
I need to extract the data as a list of
@dataclass
class Picture:
    id: int
    url: str
Yes, this is possible.
The key is that all operations support streaming, and there are libraries for each step:
urllib.request for streaming the content
zlib can be used to decompress a gzip stream
Regarding XML parsing, it is key to understand that there are two major ways to parse an XML file:
DOM parsing: useful when the full XML can be stored in memory. This allows easy manipulation and discovery of your XML content.
SAX parsing: useful when the XML cannot be stored in memory, e.g. because it is too big or because you want to start handling it before reading the full stream. This is what you need in your case. xml.parsers.expat can be used for this.
I created a (well-formed) xml fragment based on your example:
<?xml version="1.0" encoding="UTF-8"?>
<List>
<Property id = "123">
<PhotoProperties>
<Photo url = "https://www.url.com/photo/1.jpg"/>
</PhotoProperties>
</Property>
<Property id = "456">
<PhotoProperties>
<Photo url = "https://www.url.com/photo/2.jpg"/>
</PhotoProperties>
</Property>
</List>
Because you do not load the full XML into memory, parsing is a bit more complex. You need to create handlers that get called when, e.g., an XML element is opened or closed. In the example below I've put these handlers in a class that keeps state in a Picture object and prints it when the closing tag is found:
import urllib.request
import zlib
import xml.parsers.expat
from dataclasses import dataclass

URL = 'https://some.url.com/pictures.xml'

@dataclass
class Picture:
    id: int
    url: str

class ParseHandler:
    def __init__(self):
        self.currentPicture = None

    def start_element(self, name, attrs):
        # a <Property> opens a new Picture; a <Photo> fills in its url
        if name == 'Property':
            self.currentPicture = Picture(attrs['id'], None)
        elif name == 'Photo':
            self.currentPicture.url = attrs['url']

    def end_element(self, name):
        # </Property> means the Picture is complete
        if name == 'Property':
            print(self.currentPicture)
            self.currentPicture = None

handler = ParseHandler()
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element

# 32 + MAX_WBITS tells zlib to expect and skip a gzip header
decompressor = zlib.decompressobj(32 + zlib.MAX_WBITS)

with urllib.request.urlopen(URL) as stream:
    for gzchunk in stream:
        xmlchunk = decompressor.decompress(gzchunk)
        parser.Parse(xmlchunk)
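One finishing touch you may want to add (my addition, not strictly needed for the happy path): after the loop, flush anything still buffered in the decompressor and tell expat the document is complete, so truncated input raises an error instead of passing silently:

# hand over any bytes still buffered in the decompressor,
# then signal end of document (isfinal=True)
parser.Parse(decompressor.flush())
parser.Parse(b'', True)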
So I have an XML file in a local folder that I want to scrape using Python. It contains CDATA and looks like this:
<?xml version="1.0" encoding="utf-8"?>
<trial xmlns="urn::trial">
<drksId><![CDATA[DRKS00000024]]></drksId>
<firstDrksPublishDate><![CDATA[2008-09-05T12:36:54.000+02:00]]></firstDrksPublishDate>
<firstPartnerPublishDate><![CDATA[2004-01-15T00:00:00.000+01:00]]></firstPartnerPublishDate>
......
I tried:
import xml.etree.ElementTree as Et

tree = Et.parse(filename)
root = tree.getroot()
print(root.find('drksId').text)
There is no output: I am getting root.find('drksId') as None (and therefore an AttributeError on .text). Thanks in advance.
Try searching for the element with the namespace taken into account:
ns = {'ns': 'urn::trial'}
drksId = root.find('./ns:drksId', ns)
print(drksId.text)
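Put together with your parsing code, the whole thing looks like this (a sketch reusing your filename variable from above):

import xml.etree.ElementTree as Et

tree = Et.parse(filename)
root = tree.getroot()

# the default namespace declared on <trial> must be mapped to a prefix
ns = {'ns': 'urn::trial'}
drksId = root.find('./ns:drksId', ns)
print(drksId.text)  # DRKS00000024

The default namespace urn::trial applies to every element in the document, which is why the unqualified name 'drksId' did not match anything. The CDATA content is exposed as ordinary text via .text.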
I am trying to read an XML document using Beautiful Soup (Python 3.6.2, IPython 6.1.0, Windows 10), and I can't get the encoding right.
Here's my test XML, saved as a file in UTF-8 encoding:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<info name="愛よ">ÜÜÜÜÜÜÜ</info>
<items>
<item thing="ÖöÖö">"23Äßßß"</item>
</items>
</root>
First, check the XML using ElementTree:
import xml.etree.ElementTree as ET

def printXML(xml, indent=''):
    print(indent + str(xml.tag) + ': ' + (xml.text if xml.text is not None else '').replace('\n', ''))
    if len(xml.attrib) > 0:
        for k, v in xml.attrib.items():
            print(indent + '\t' + k + ' - ' + v)
    if xml.getchildren():
        for child in xml.getchildren():
            printXML(child, indent + '\t')

xml0 = ET.parse("test.xml").getroot()
printXML(xml0)
The output is correct:
root:
	info: ÜÜÜÜÜÜÜ
		name - 愛よ
	items:
		item: "23Äßßß"
			thing - ÖöÖö
Now read the same file with Beautiful Soup and pretty-print it:
import bs4

with open("test.xml") as ff:
    xml = bs4.BeautifulSoup(ff, "html5lib")
print(xml.prettify())
Output:
<!--?xml version="1.0" encoding="UTF-8"?-->
<html>
<head>
</head>
<body>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
</body>
</html>
This is just wrong. Making the call with an explicit encoding, bs4.BeautifulSoup(ff, "html5lib", from_encoding="UTF-8"), doesn't change the result.
Doing
print(xml.original_encoding)
outputs
None
So Beautiful Soup is apparently unable to detect the original encoding, even though the file is encoded in UTF-8 (according to Notepad++) and the header says UTF-8 as well, and I do have chardet installed as the docs recommend.
Am I making a mistake here? What could be causing this?
EDIT:
When I invoke the code without the "html5lib" argument I get this warning:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib").
This usually isn't a problem, but if you run this code on another system, or in a different virtual environment,
it may use a different parser and behave differently.
The code that caused this warning is on line 241 of the file C:\Users\My.Name\AppData\Local\Continuum\Anaconda2\envs\Python3\lib\site-packages\spyder\utils\ipython\start_kernel.py.
To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html5lib")
markup_type=markup_type))
EDIT 2:
As suggested in a comment, I tried bs4.BeautifulSoup(ff, "html.parser"), but the problem remains.
Then I installed lxml and tried bs4.BeautifulSoup(ff, "lxml-xml"), still the same output.
What also strikes me as odd is that even when specifying an encoding, like bs4.BeautifulSoup(ff, "lxml-xml", from_encoding='UTF-8'), the value of xml.original_encoding is None, contrary to what is written in the docs.
EDIT 3:
I put my xml contents into a string
xmlstring = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"
And used bs4.BeautifulSoup(xmlstring,"lxml-xml"), now I'm getting the correct output:
<?xml version="1.0" encoding="utf-8"?>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
So it seems something is wrong with the file after all.
Found the error: I have to specify the encoding when opening the file:

with open("test.xml", encoding='UTF-8') as ff:
    xml = bs4.BeautifulSoup(ff, "html5lib")
As I'm on Python 3, I thought the default value of encoding was UTF-8, but it turns out to be system-dependent, and on my system it's cp1252.
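An alternative that should work just as well (a variant of the same fix, not something I verified on this exact setup): open the file in binary mode and let Beautiful Soup run its own encoding detection, which only happens for byte input:

with open("test.xml", "rb") as ff:
    xml = bs4.BeautifulSoup(ff, "html5lib")
print(xml.original_encoding)  # detection runs on bytes, so this is no longer None

This also explains the original_encoding of None: when you pass an already-decoded text stream, there is no encoding left for Beautiful Soup to detect.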
I am very new to the Python scripting language and am currently working on a parser which parses a web-based XML file.
I am able to retrieve all but one of the elements using minidom in Python with no issues. The last node that I require from the XML file is the 'url' within the 'image' tag, which can be found in the following XML example:
<events>
<event id="abcde01">
<title> Name of event </title>
<url> The URL of the Event <- the url tag I do not need </url>
<image>
<url> THE URL I DO NEED </url>
</image>
</event>
Below I have copied the sections of my code which I feel may be of relevance. I really appreciate any help in retrieving this last image url node. I will also include what I have tried and the error I received when I ran this code in GAE. The Python version I am using is 2.7, and I should probably also point out that I am saving the values in a list (for later input to a database).
class XMLParser(webapp2.RequestHandler):
    def get(self):
        base_url = 'http://api.eventful.com/rest/events/search?location=Dublin&date=Today'
        # downloads data from xml file:
        response = urllib.urlopen(base_url)
        # converts data to string
        data = response.read()
        unicode_data = data.decode('utf-8')
        data = unicode_data.encode('ascii', 'ignore')
        # closes file
        response.close()
        # parses xml downloaded
        dom = mdom.parseString(data)
        node = dom.documentElement  # needed for declaration of variable
        # print out all event names (titles) found in the eventful xml
        event_main = dom.getElementsByTagName('event')
        # URLs list parsing - MY ATTEMPT -
        urls_list = []
        for im in event_main:
            image_url = im.getElementsByTagName("image")[0].childNodes[0]
            urls_list.append(image_url)
The error I receive is the following; any help is much appreciated. Karen
image_url = im.getElementsByTagName("image")[0].childNodes[0]
IndexError: list index out of range
First of all, do not re-encode the content. There is no need to do so; XML parsers are perfectly capable of handling encoded content.
Next, I'd use the ElementTree API for a task like this:
import urllib
from xml.etree import ElementTree as ET

response = urllib.urlopen(base_url)
tree = ET.parse(response)

urls_list = []
for event in tree.findall('.//event[image]'):
    # find the text content of the first <image><url> tag combination:
    image_url = event.find('.//image/url')
    if image_url is not None:
        urls_list.append(image_url.text)
This only considers event elements that have a direct image child element, so events without an image no longer raise an IndexError.
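One small caveat based on your sample document: the text inside <url> has leading and trailing whitespace (<url> THE URL I DO NEED </url>), so you probably want to strip it when collecting the values:

urls_list.append(image_url.text.strip())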