How to parse out xml from noisy file using python

I have a file which contains a bunch of logging information, including XML. I'd like to parse out the XML portion into a string object so I can then run some XPath queries on it to ensure the existence of certain information on the 'data' element.
File to parse:
Requesting event notifications...
Receiving command objects...
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected
Command execution successful...
Python:
import re
with open('./output.out', 'r') as outFile:
    data = outFile.read().replace('\n','')
regex = re.escape("<.*?>.*?<\/Root>");
p = re.compile(regex)
m = p.match(data)
if m:
    print(m.group())
else:
    print('No match')
Output:
No match
What am I doing wrong? How can I accomplish my goal? Any help would be much appreciated.

Thou shalt never use regular expressions for parsing XML/HTML. There is BeautifulSoup for this daunting task.
import bs4
soup = bs4.BeautifulSoup(open("output.out").read(), "lxml")
roots = soup.findAll('root')
#[<root xmlns="http://schemas.com/service"
# xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
# <data id="123" implementation="2016.122-SNAPSHOT" interface="2017.1"
# version="2016.1.2700-SNAPSHOT"></data></root>]
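roots[0] is a Tag object holding the Root element and everything under it. Note that the "lxml" HTML parser lowercases tag and attribute names, as the output above shows. You can search and navigate it with the usual BeautifulSoup methods.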
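As for why the code in the question prints "No match": re.escape() turns every metacharacter in the pattern into a literal, and match() only tries the very start of the string. If you only need to cut the XML fragment out of the log before handing it to a real parser, something like the following may be enough. It is a minimal sketch using only the standard library; the sample text is copied from the question, and the helper name extract_root_xml and the prefix "svc" are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Sample log text copied from the question; normally this is read from the file.
log_text = """Requesting event notifications...
Receiving command objects...
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected
Command execution successful...
"""


def extract_root_xml(text):
    """Return the <Root>...</Root> fragment from noisy text, or None."""
    # search() scans the whole string, while match() only tries position 0.
    # The pattern is not passed through re.escape().
    found = re.search(r"<Root\b.*?</Root>", text, re.DOTALL)
    return found.group(0) if found else None


xml_fragment = extract_root_xml(log_text)
root = ET.fromstring(xml_fragment)

# The document uses a default namespace, so queries need a prefix mapping.
ns = {"svc": "http://schemas.com/service"}
data = root.find("svc:data", ns)
print(data.get("id"), data.get("Version"))
```

The regex is only used to locate the fragment here; the attribute checks are done on the parsed tree, so case is preserved (Version stays Version).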

Related

Ahref html issue, html tags not showing

I don't seem to be having much luck solving this issue. I am pulling data from an XML URL that looks like:
<?xml version="1.0" encoding="utf-8"?>
<task>
<tasks>
<taskId>46</taskId>
<taskUserId>4</taskUserId>
<taskName>test</taskName>
<taskMode>2</taskMode>
<taskSite>thetest.org</taskSite>
<taskUser>NULL</taskUser>
<taskPass>NULL</taskPass>
<taskEmail>NULL</taskEmail>
<taskUrl>https://www.thetest.com/</taskUrl>
<taskTitle>test</taskTitle>
<taskBody>This is a test using html tags.</taskBody>
<taskCredentials>...</taskProxy>
</tasks>
</task>
This part is where i'm having issues:
<taskBody>This is a test using html tags.</taskBody>
I pull the data using BeautifulSoup like:
# beautifulsoup setup
soup = BeautifulSoup(projects.text, 'xml')
xml_task_id = soup.find('taskId')
xml_task_user_id = soup.find('taskUserId')
xml_task_name = soup.find('taskName')
xml_mode = soup.find('taskMode')
xml_site_name = soup.find('taskSite')
xml_username = soup.find('taskUser')
xml_password = soup.find('taskPass')
xml_email = soup.find('taskEmail')
xml_url = soup.find('taskUrl')
xml_content_title = soup.find('taskTitle')
xml_content_body = soup.find('taskBody')
xml_credentials = soup.find('taskCredentials')
xml_proxy = soup.find('taskProxy')
print(xml_content_body.get_text())
When i print out this part, it prints like: This is a test using html tags.
Instead of showing the ahref tag in full like: This is a test
I literally need the full string printed as is, but it keeps executing the html code instead of printing the string.
Any help would be appreciated.

Python doesn't "execute" HTML code. I'm guessing you're viewing it in a web browser, and that's interpreting the <a> tags just like it's designed to.
Use the html.escape method to turn all tags into escape sequences (with &gt; and &lt; and the like), which stops the browser from interpreting them.
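A small illustration of what html.escape does. The body string below is a guess at what taskBody might contain, since the actual markup is not visible in the question.

```python
import html

# Hypothetical taskBody content; the real markup is not shown in the question.
body = 'This is a <a href="https://www.thetest.com/">test</a> using html tags.'

# Replace <, >, & and quotes with character references so that a browser
# displays the markup as text instead of rendering it.
escaped = html.escape(body)
print(escaped)

# html.unescape reverses the operation.
restored = html.unescape(escaped)
```

If the output is only ever shown in a terminal, no escaping is needed; it matters once the string ends up inside an HTML page.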

For Python - Escape HTML Tags: https://wiki.python.org/moin/EscapingHtml
For PHP - Escape HTML Tags:
https://www.w3schools.com/php/showphp.asp?filename=demo_func_string_strip_tags2

Parse large python xml using xmltree

I have a Python script that parses huge XML files (the largest one is 446 MB)
try:
    parser = etree.XMLParser(encoding='utf-8')
    tree = etree.parse(os.path.join(srcDir, fileName), parser)
    root = tree.getroot()
except Exception as e:
    print("Error parsing file " + str(fileName) + " Reason " + str(e))
for child in root:
    if "PersonName" in child.tag:
        personName = child.text
This is what my xml looks like :
<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
<Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
<Description>myData</Description>
<Identifier>43hhjh87n4nm</Identifier>
</Aliases>
<RollNo uom="kPa">39979172.201167159</RollNo>
<PersonName>Miracle Smith</PersonName>
<Date>2017-06-02T01:10:32-05:00</Date>
....
All I want is the contents of the PersonName tag; I don't care about the other tags.
Sadly, my files are huge and I keep getting this error when I use the code above:
Error parsing file 2eb6d894-0775-e611.xml Reason unknown error, line 1, column 310915857
Error parsing file 2ecc18b5-ef41-e711-80f.xml Reason Extra content at the end of the document, line 1, column 3428182
Error parsing file 2f0d6926-b602-e711-80f4-005.xml Reason Extra content at the end of the document, line 1, column 6162118
Error parsing file 2f12636b-b2f5-e611-80f3-00.xml Reason Extra content at the end of the document, line 1, column 8014679
Error parsing file 2f14e35a-d22b-4504-8866-.xml Reason Extra content at the end of the document, line 1, column 8411238
Error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml Reason Extra content at the end of the document, line 1, column 7636614
Error parsing file 3a1a3806-b6af-e611-80ef-00505.xml Reason Extra content at the end of the document, line 1, column 11032486
My XML is perfectly fine and has no extra content. It seems that parsing the large files causes the error.
I have looked at iterparse(), but it seems too complex for what I want to achieve, as it parses the whole DOM while I just want that one tag under the root. Also, I could not find a good sample that gets the correct value by tag name.
Should I use a regex, grep or awk to do this? Or is there a tweak to my code that will let me get the person name from these huge files?
UPDATE:
I tried this sample and it seems to print everything in the XML except my tag.
Does iterparse read from the bottom to the top of the file? In that case it will take a long time to get to the top, i.e. my PersonName tag. I tried changing the line below to events=("end", "start") and it does the same thing.
path = []
for event, elem in ET.iterparse('D:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        print(elem.text)  # prints whole world
        if elem.tag == 'PersonName':
            print(elem.text)
        path.pop()

iterparse is not that difficult to use in this case.
temp.xml is the file presented in your question with a </MyRoot> added as a line at the end.
Think of the source = line as boilerplate, if you will: it parses the XML file and returns it element by element, indicating whether each chunk is the 'start' or the 'end' of an element and supplying information about the element.
In this case we need to consider only the 'start' events. We watch for the 'PersonName' tags and pick up their text. The tag is compared with endswith because the default namespace makes the full tag name '{http://www.example.org/yml/data/litsmlv2}PersonName', which is why comparing against 'PersonName' alone never matches. Having found the one and only such item in the XML file, we abandon the processing.
>>> from xml.etree import ElementTree
>>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
Edit, in response to question in a comment:
Normally you wouldn't do this since iterparse is intended for use with large chunks of xml. However, by wrapping a string in a StringIO object it can be processed with iterparse.
>>> from xml.etree import ElementTree
>>> from io import StringIO
>>> xml = StringIO('''\
... <?xml version="1.0" encoding="utf-8"?>
... <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
... <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
... <Description>myData</Description>
... <Identifier>43hhjh87n4nm</Identifier>
... </Aliases>
... <RollNo uom="kPa">39979172.201167159</RollNo>
... <PersonName>Miracle Smith</PersonName>
... <Date>2017-06-02T01:10:32-05:00</Date>
... </MyRoot>''')
>>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
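One caveat about the snippets above: the ElementTree documentation says that on a 'start' event the text of the element is not guaranteed to be available yet, so for large files it is safer to read the text on the 'end' event. A minimal sketch of that variant follows; the helper name first_person_name and the trimmed sample document are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Small stand-in for the large file in the question (structure copied from it).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns="http://www.example.org/yml/data/litsmlv2" uuid="ertr">
  <RollNo uom="kPa">39979172.201167159</RollNo>
  <PersonName>Miracle Smith</PersonName>
  <Date>2017-06-02T01:10:32-05:00</Date>
</MyRoot>"""


def first_person_name(source):
    """Return the text of the first PersonName element, or None.

    `source` is a filename or a binary file object.
    """
    # Only 'end' events are requested: at that point the element is complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like '{namespace}PersonName', so compare the local part.
        if elem.tag.rsplit("}", 1)[-1] == "PersonName":
            return elem.text
        # Drop the content of elements we are done with to reduce memory use.
        elem.clear()
    return None


name = first_person_name(io.BytesIO(SAMPLE))
print(name)
```

Because the function returns as soon as the tag is found, the rest of the file is never read, which answers the worry about iterparse having to go through the whole document.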

How to replace/remove XML tag with BeautifulSoup?

I have XML in a local file that is a template for a final message that gets POSTed to a REST service. The script preprocesses the template data before it gets posted.
So the template looks something like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<singleElement>
<subElementX>XYZ</subElementX>
</singleElement>
<repeatingElement id="11" name="Joe"/>
<repeatingElement id="12" name="Mary"/>
</root>
The message XML should look the same except that the repeatingElement tags need to be replaced with something else (XML generated by the script based on the attributes in the existing tag).
Here is my script so far:
xmlData = None
with open('conf//test1.xml', 'r') as xmlFile:
    xmlData = xmlFile.read()
xmlSoup = BeautifulSoup(xmlData, 'html.parser')
repElemList = xmlSoup.find_all('repeatingelement')
for repElem in repElemList:
    print("Processing repElem...")
    repElemID = repElem.get('id')
    repElemName = repElem.get('name')
    # now I do something with repElemID and repElemName
    # and no longer need it. I would like to replace it with <somenewtag/>
    # and dump what is in the soup object back into a string.
    # is it possible with BeautifulSoup?
Can I replace the repeating elements with something else and then dump the soup object into a new string that I can post to my REST API?
NOTE: I am using html.parser because I can't get the xml parser to work. It works all right, and I understand that HTML parsing is more permissive than XML parsing.

You can use .replace_with() and .new_tag() methods:
for repElem in repElemList:
    print("Processing repElem...")
    repElemID = repElem.get('id')
    repElemName = repElem.get('name')
    repElem.replace_with(xmlSoup.new_tag("somenewtag"))
Then, you can dump the "soup" using str(soup) or soup.prettify().
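One thing to watch with html.parser is that it lowercases tag names (singleElement becomes singleelement), which may matter if the REST service expects the original case. As an alternative, not what the answer above uses, the same replacement can be sketched with the standard library's xml.etree.ElementTree, which keeps names as they are. The tag somenewtag comes from the question; the ref attribute is a placeholder for whatever the script generates.

```python
import xml.etree.ElementTree as ET

# Template from the question, without the XML declaration.
TEMPLATE = """<root>
  <singleElement>
    <subElementX>XYZ</subElementX>
  </singleElement>
  <repeatingElement id="11" name="Joe"/>
  <repeatingElement id="12" name="Mary"/>
</root>"""

root = ET.fromstring(TEMPLATE)

# Iterate over a copy of the child list because the parent is modified in the loop.
for index, child in enumerate(list(root)):
    if child.tag != "repeatingElement":
        continue
    # Placeholder replacement built from the attributes of the old element.
    replacement = ET.Element("somenewtag", {"ref": child.get("id")})
    replacement.tail = child.tail  # keep the original whitespace layout
    root[index] = replacement      # one-for-one swap, so indexes stay valid

# Dump the modified tree back into a string that can be posted.
message = ET.tostring(root, encoding="unicode")
print(message)
```

This only looks at direct children of the root, which matches the template in the question; nested repeating elements would need a walk over the whole tree.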

How to find the value in particular tag element in xml using python?

I am trying to parse XML data received from a RESTful interface. In error conditions (when the query does not return anything on the server), I get the following text back. Now I want to parse this string and look up the value of the status attribute on the fifth line of the example below. How can I find out whether status is present and, if it is, what its value is?
content = """
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
<ops:meta name="elapsed-time" value="3"/>
<exchange-documents>
<exchange-document system="ops.epo.org" country="US" doc-number="20060159695" status="not found">
<bibliographic-data>
<publication-reference>
<document-id document-id-type="epodoc">
<doc-number>US20060159695</doc-number>
</document-id>
</publication-reference>
<parties/>
</bibliographic-data>
</exchange-document>
</exchange-documents>
</ops:world-patent-data>
"""
import xml.etree.ElementTree as ET
root = ET.fromstring(content)
res = root.iterfind(".//{http://www.epo.org/exchange}exchange-documents[@status='not found']/..")

Just use BeautifulSoup:
from bs4 import BeautifulSoup
# html.parser is the lenient parser bundled with Python
soup = BeautifulSoup(open('xml.txt', 'r'), 'html.parser')
print(soup.find('exchange-document')['status'])
#> not found
If you store every XML response in a single file, it is useful to iterate over them:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('xml.txt', 'r'), 'html.parser')
for tag in soup.find_all('exchange-document'):
    print(tag['status'])
#> not found
This prints the status attribute of every exchange-document element.
If you only want the statuses other than "not found", filter them:
for tag in soup.find_all('exchange-document'):
    if tag['status'] != 'not found':
        print(tag['status'])
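If you would rather stay with ElementTree as in the question, two things are likely tripping it up: the string starts with a newline before the XML declaration, and the status attribute sits on exchange-document, not on exchange-documents. A minimal sketch with a trimmed copy of the response follows; the prefix "ex" and the variable names are arbitrary choices for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from the question. The text starts directly
# with the XML declaration, with no newline after the opening quotes.
content = """<?xml version="1.0" encoding="UTF-8"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <exchange-documents>
    <exchange-document system="ops.epo.org" country="US" doc-number="20060159695" status="not found"/>
  </exchange-documents>
</ops:world-patent-data>"""

# Map a prefix of our choosing to the document's default namespace.
NS = {"ex": "http://www.epo.org/exchange"}
root = ET.fromstring(content)

statuses = {}
for doc in root.iterfind(".//ex:exchange-document", NS):
    # Element.get returns None when the attribute is absent.
    statuses[doc.get("doc-number")] = doc.get("status")
print(statuses)

# An attribute predicate selects only the documents that were not found.
not_found = root.findall(".//ex:exchange-document[@status='not found']", NS)
print(len(not_found))
```

Checking `doc.get("status") is None` tells you whether the attribute is present at all, which covers the "present or not" part of the question.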

Try this:
from xml.dom.minidom import parse
xmldoc = parse(filename)
elementList = xmldoc.getElementsByTagName(tagName)
elementList will contain all elements with the tag name you specify, then you can iterate over those.
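To get from the element list to the attribute value, minidom elements offer hasAttribute and getAttribute. A short sketch, using parseString on a trimmed inline document instead of a file so that it runs as is:

```python
from xml.dom.minidom import parseString

# Trimmed inline stand-in for the response in the question.
xmldoc = parseString(
    '<exchange-documents xmlns="http://www.epo.org/exchange">'
    '<exchange-document doc-number="20060159695" status="not found"/>'
    '<exchange-document doc-number="1"/>'
    '</exchange-documents>'
)

found_statuses = []
for node in xmldoc.getElementsByTagName("exchange-document"):
    # hasAttribute distinguishes a missing attribute from an empty one.
    if node.hasAttribute("status"):
        found_statuses.append(node.getAttribute("status"))
print(found_statuses)
```

The second exchange-document element is made up to show the case where the attribute is missing.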

Programmatically clean/ignore namespaces in XML - python

I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.
The XML looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
{...}
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
<cmdty:space>ISO4217</cmdty:space>
<cmdty:id>BRL</cmdty:id>
<cmdty:get_quotes/>
<cmdty:quote_source>currency</cmdty:quote_source>
<cmdty:quote_tz/>
</gnc:commodity>
Right now, I'm able to iterate and get results using
import xml.etree.ElementTree as ET
r = ET.parse("file.xml").findall('.//')
after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.
Note that I'm a complete noob in python, and I've read: Python and GnuCash: Extract data from GnuCash files, Cleaning an XML file in Python before parsing and python: xml.etree.ElementTree, removing "namespaces" along with ElementTree docs and I'm still lost...
I've come up with this solution:
def strip_namespaces(self, tree):
    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)
    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)
        end = re.sub(nspOpen, '<\/', tree.tag)
    # pprint(finaltree)
    return
But I'm failing to apply it. I can't seem to be able to retrieve the tag names as they appear on the file.

I think the Python code below will be helpful to you.
sample.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:change>
<gnc:lastUpdate>2018-12-21
</gnc:lastUpdate>
</gnc:change>
<gnc:bill>
<gnc:billAccountNumber>1234</gnc:billAccountNumber>
<gnc:roles>
<gnc:id>111111</gnc:id>
<gnc:pos>2</gnc:pos>
<gnc:genid>15</gnc:genid>
</gnc:roles>
</gnc:bill>
<gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>
PYTHON CODE: to remove xmlns for root tag.
import xml.etree.ElementTree as ET
def xmlns(tag):
    # drop every '{namespace}' part from the tag name
    parts = tag.split('{')
    l = []
    for i in parts:
        if '}' in i:
            l.append(i.split('}')[1])
        else:
            l.append(i)
    var = ''.join(l)
    return var
tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)  # returns root tag with xmlns as prefix
print(xmlns(root.tag))  # returns root tag without xmlns as prefix
Output:
{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
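The function above handles a single tag name. To make a whole parsed tree namespace-free, the same idea can be applied to every element, roughly as below. This is a sketch on a shortened version of sample.xml; it rewrites element names only, so namespaced attribute names such as cd:type in the original GnuCash file would keep their '{uri}' prefix.

```python
import xml.etree.ElementTree as ET

# Shortened version of sample.xml from the answer above.
SAMPLE = """<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc">
  <gnc:bill>
    <gnc:billAccountNumber>1234</gnc:billAccountNumber>
  </gnc:bill>
  <gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>"""


def strip_namespaces(root):
    """Rewrite every element tag in place so '{uri}name' becomes 'name'."""
    for elem in root.iter():
        # Guard against non-string tags (comments, processing instructions).
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(SAMPLE))
print(root.tag)
# Plain paths now work without any namespace mapping.
account = root.findtext("bill/billAccountNumber")
print(account)
```

On Python 3.8 and later, ElementTree path expressions also accept {*}name to match a tag in any namespace, which may be enough if you only need to read values and not rewrite the tree.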
