Parsing messy XML in Python - python

I'm very new to coding, and it would be great if someone could help me figure out how to parse an XML file.
I'm trying to write a Python script that reads all notes created in Gnome-Notes and displays them on the command line. I've got the note-loading part working, but I can't figure out how to parse the XML so that it displays only the text part. The sample data looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<note version="1" xmlns:link="http://projects.gnome.org/bijiben/link" xmlns:size="http://projects.gnome.org/bijiben/size" xmlns="http://projects.gnome.org/bijiben">
<title>Testnote</title>
<text xml:space="preserve"><html xmlns="http://www.w3.org/1999/xhtml"><head><link rel="stylesheet" href="Default.css" type="text/css" /><script language="javascript" src="bijiben.js"></script></head><body id="editable" style="color: white;">Some text for the note.</body></html></text>
<last-change-date>2021-04-01T20:03:08Z</last-change-date>
<last-metadata-change-date>2021-04-01T20:02:53Z</last-metadata-change-date>
<create-date>2021-03-29T10:37:14Z</create-date>
<cursor-position>0</cursor-position>
<selection-bound-position>0</selection-bound-position>
<width>0</width>
<height>0</height>
<x>0</x>
<y>0</y>
<color>rgb(0,0,0)</color>
<tags/>
<open-on-startup>False</open-on-startup>
After parsing I should get only the "Some text for the note." part. I've been trying ElementTree for this. While I have no issues working with the "clean" XML files provided in samples, I can't figure out what to do with this one.

Should be doable using ElementTree
from xml.etree import ElementTree as ET
data = '''\
<?xml version="1.0" encoding="UTF-8"?>
<note version="1" xmlns:link="http://projects.gnome.org/bijiben/link" xmlns:size="http://projects.gnome.org/bijiben/size" xmlns="http://projects.gnome.org/bijiben">
<title>Testnote</title>
<text xml:space="preserve">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link rel="stylesheet" href="Default.css" type="text/css"/>
<script language="javascript" src="bijiben.js"/>
</head>
<body id="editable" style="color: white;">Some text for the note.</body>
</html>
</text>
<last-change-date>2021-04-01T20:03:08Z</last-change-date>
<last-metadata-change-date>2021-04-01T20:02:53Z</last-metadata-change-date>
<create-date>2021-03-29T10:37:14Z</create-date>
<cursor-position>0</cursor-position>
<selection-bound-position>0</selection-bound-position>
<width>0</width>
<height>0</height>
<x>0</x>
<y>0</y>
<color>rgb(0,0,0)</color>
<tags/>
<open-on-startup>False</open-on-startup>
</note>
'''
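tree = ET.fromstring(data)
# Map a prefix of our own choosing to the XHTML namespace declared on <html>.
# The prefix is called 'xhtml' rather than 'xml' because 'xml' is reserved in XML itself.
nmsp = {
    'xhtml': 'http://www.w3.org/1999/xhtml',
}
print(tree.find('.//xhtml:body', namespaces=nmsp).text)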
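One caveat with `.text`: if the note body contains inline markup, `.text` returns only the text before the first child element. A minimal sketch of a more tolerant variant, assuming the same XHTML namespace as in the sample; `note_text` and the `xhtml` prefix are names chosen here for illustration, and the bold markup in the sample string is an invented variation of the note above:

```python
from xml.etree import ElementTree as ET

# Prefix chosen for this sketch; only the URI has to match the document.
XHTML_NS = {"xhtml": "http://www.w3.org/1999/xhtml"}


def note_text(xml_string):
    """Return all text inside the XHTML <body>, including text in nested tags."""
    root = ET.fromstring(xml_string)
    body = root.find(".//xhtml:body", namespaces=XHTML_NS)
    if body is None:
        return ""
    # itertext() walks the body and yields every text fragment in document order.
    return "".join(body.itertext()).strip()


sample = (
    '<note xmlns="http://projects.gnome.org/bijiben">'
    '<text xml:space="preserve"><html xmlns="http://www.w3.org/1999/xhtml">'
    '<body id="editable">Some <b>bold</b> text for the note.</body></html></text>'
    "</note>"
)
print(note_text(sample))
```

For a note without nested markup this gives the same result as reading `.text` directly.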

You can use regex to extract the string between the body tags:
<body.*>(.*)</body>
The first .* matches any characters, zero or more times, to account for any attributes in the body tag.
(.*) captures anything between the tags.
import re
with open('file.xml', 'r') as file:
    data = file.read()
x = re.search(r"<body.*>(.*)</body>", data)
print(x.group(1))
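The pattern above is greedy and does not cross line breaks, so it can misbehave when the body spans several lines or when more markup follows on the same line. A hedged sketch of a stricter pattern, for cases where a regex is still preferred over an XML parser; `BODY_RE` and `body_text` are illustrative names, not part of the original answer:

```python
import re

# [^>]* stops at the end of the opening tag, (.*?) is non-greedy,
# and DOTALL lets the body span several lines.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)


def body_text(markup):
    """Return the raw content between <body ...> and </body>, or None if absent."""
    match = BODY_RE.search(markup)
    return match.group(1) if match else None


sample = '<html><body id="editable" style="color: white;">Some text\nfor the note.</body></html>'
print(body_text(sample))
```

Returning None instead of calling `.group(1)` on a failed match avoids an AttributeError for files that have no body element.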

Related

How to get a href attribute value in xml content (atom feed)?

I'm saving the content (atom feed / xml content) from a get request as content = response.text and the content looks like this:
<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<subtitle type="text">content: application/abc</subtitle>
<updated>2021-08-05T16:29:20.202Z</updated>
<id>tag:tag-a,2021-08:27445852</id>
<generator uri="uri-a" version="v-5.1.0.3846329218047">abc</generator>
<author>
<name>name-a</name>
<email>email-a</email>
</author>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>
How can I get the value "url-b" of the href attribute with rel="next" ?
I tried it with the ElementTree module, for example:
import requests
from xml.etree import ElementTree
response = requests.get("myurl", headers={"Authorization": f"Bearer {my_access_token}"})
content = response.text
tree = ElementTree.fromstring(content)
tree.find('.//link[@rel="next"]')
# or
tree.find('./link').attrib['href']
but that didn't work.
I appreciate any help and thank you in advance.
If there is an easier, simpler solution (maybe feedparser) I welcome that too.
How can I get the value "url-b" of the href attribute with rel="next" ?
see below
from xml.etree import ElementTree as ET
xml = '''<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<subtitle type="text">content: application/abc</subtitle>
<updated>2021-08-05T16:29:20.202Z</updated>
<id>tag:tag-a,2021-08:27445852</id>
<generator uri="uri-a" version="v-5.1.0.3846329218047">abc</generator>
<author>
<name>name-a</name>
<email>email-a</email>
</author>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>'''
root = ET.fromstring(xml)
links = root.findall('.//{http://www.w3.org/2005/Atom}link[@rel="next"]')
for link in links:
    print(f'{link.attrib["href"]}')
output
url-b
You can use this XPath-1.0 expression:
./*[local-name()="feed"]/*[local-name()="link" and @rel="next"]/@href
This should result in "url-b".
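Note that an expression using local-name() needs a full XPath 1.0 engine such as lxml; the standard library's ElementTree supports only a subset of XPath and cannot evaluate it. A minimal sketch that stays within the standard library by using a prefix map instead; `ATOM_NS` and `next_link` are names made up for this example:

```python
from xml.etree import ElementTree as ET

# The prefix "atom" is arbitrary; only the namespace URI must match the feed.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def next_link(feed_xml):
    """Return the href of the <link rel="next"> element, or None if there is none."""
    root = ET.fromstring(feed_xml)
    link = root.find('atom:link[@rel="next"]', ATOM_NS)
    return link.get("href") if link is not None else None


feed = """<feed xmlns="http://www.w3.org/2005/Atom">
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>"""
print(next_link(feed))
```

The original attempts in the question most likely failed because `link` without a namespace does not match elements that live in the Atom default namespace.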

Use BeautifulSoup to Replace Every Occurrence of XML Tag with Another Tag

I am trying to replace every occurrence of an XML tag in a document (call it the target) with the contents of a tag in a different document (call it the source). The tag from the source could contain just text, or it could contain more XML.
Here is a simple example of what I am not able to get working:
test-source.htm:
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>
</body>
</html>
test-target.htm:
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
</head>
<body>
<replacethis src="test-source.htm"></replacethis>
<p>irrelevant, just here for filler</p>
<replacethis src="test-source.htm"></replacethis>
</body>
</html>
replace_example.py:
import os
import re
from bs4 import BeautifulSoup
# Just for testing
source_file = "test-source.htm"
target_file = "test-target.htm"
with open(source_file) as s:
    source = BeautifulSoup(s, "lxml")
with open(target_file) as t:
    target = BeautifulSoup(t, "lxml")
source_tag = source.srctxt
for tag in target():
    for attribute in tag.attrs:
        if re.search(source_file, str(tag[attribute])):
            tag.replace_with(source_tag)
with open(target_file, "w") as w:
    w.write(str(target))
This is my unfortunate test-target.htm after running replace_example.py
<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
The first replacethis tag is now gone and the second replacethis tag has been replaced. This same problem happens with "insert" and "insert_before".
The output I want is:
<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
Can someone please point me in the right direction?
Additional Complications: The example above is the simplest case where I could reproduce the problem I seem to be having with BeautifulSoup, but it does not convey the full detail of the problem I'm trying to solve. Actually, I have a list of targets and sources. The replacethis tag needs to be replaced by the contents of a source only if the src attribute contains a reference to a source in the list. So I could use the replace method, but it would require writing a lot more regex than if I could convince BeautifulSoup to work. If this problem is a BeautifulSoup bug, then maybe I'll just have to write the regex instead.
You could use another parser (html.parser) if you want to get rid of extra tags.
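BS4's replace_with behavior looks like a bug in the library at first, but a likely cause is that the same source_tag object is inserted twice: a tag can only sit in one place in a tree, so the second call moves it out of the first position.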
As a partial solution you can just call
target_text.replace('<replacethis></replacethis>', source_text)
First, it is highly advised not to use regex on [X]HTML documents. Since you are modifying XML content, consider an lxml solution; you already have lxml installed, since it is the parsing engine in your BeautifulSoup calls. No for or if logic is needed for this approach.
Specifically, consider XSLT, the special-purpose language designed to transform XML into other XML, HTML, or even json/csv/txt files. XSLT provides the document() function, which allows you to parse across documents. Python's lxml can run XSLT 1.0 scripts.
XSLT (save as .xsl in same folder as source file, adjust 'replacethis' and 'srctxt' names)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- UPDATE <replacethis> TAG WITH <srctxt> FROM SOURCE -->
  <xsl:template match="replacethis">
    <xsl:copy-of select="document('test-source.htm')/html/body/srctxt"/>
  </xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL SOURCES
doc = et.parse('test-target.htm')
xsl = et.parse('XSLTScript.xsl')
# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(doc)
# OUTPUT TO SCREEN
print(result)
# OUTPUT TO FILE
with open('test-target.htm', 'wb') as f:
    f.write(result)
Output
<?xml version="1.0"?>
<html>
<head/>
<body>
<srctxt>text to be added</srctxt>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
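For comparison, the same replacement can be sketched with the standard library's ElementTree instead of BeautifulSoup or XSLT. This is a swapped-in technique, not the method from the answers above. The key idea is to insert a fresh copy of the source element at every placeholder, since a single element object can only occupy one position in a tree. The function name `replace_placeholders` and the inline sample strings are assumptions made for this example:

```python
import copy
import xml.etree.ElementTree as ET

SOURCE_XML = "<html><head/><body><srctxt>text to be added</srctxt></body></html>"
TARGET_XML = (
    "<html><head/><body>"
    '<replacethis src="test-source.htm"/>'
    "<p>irrelevant, just here for filler</p>"
    '<replacethis src="test-source.htm"/>'
    "</body></html>"
)


def replace_placeholders(target_root, replacement, src_name):
    """Swap every <replacethis> whose src mentions src_name for a copy of replacement."""
    count = 0
    # Snapshot the elements first so the tree is not mutated while it is being walked.
    for parent in list(target_root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "replacethis" and src_name in child.get("src", ""):
                fresh = copy.deepcopy(replacement)  # one independent copy per placeholder
                fresh.tail = child.tail             # keep any text that followed the placeholder
                parent[index] = fresh
                count += 1
    return count


source_root = ET.fromstring(SOURCE_XML)
target_root = ET.fromstring(TARGET_XML)
replaced = replace_placeholders(target_root, source_root.find(".//srctxt"), "test-source.htm")
print(replaced)
print(ET.tostring(target_root, encoding="unicode"))
```

With a list of sources, the same function could be called once per source file name, so only placeholders that reference a known source are touched.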

edit the END TAG ONLY of an element using minidom in python?

I am having a few issues with python's minidom, and was wondering if you guys could help me out!
I am trying to convert some xml data from the wrong format to a format that will be read by our system. I have been working on a sort of proof-of-concept and I figured everything out except for this last little issue!
Here's what I have so far.
while y < length:
    tag = doc_root.getElementsByTagName("p")[y].toxml()
    xmlData = tag.replace(' region="bottom"','')
    z = z + 1
    xmlData = xmlData.replace(' xml:id="p%d"'%z,'')
    xmlData = xmlData.replace('<p','p')
    xmlData = xmlData.replace('/p>','p')
    print(xmlData)
    #div.appendChild(doc_root.getElementsByTagName("p")[x])
    print(xmlData)
    addData = doc.createElement(xmlData)
    div.appendChild(addData)
    y = y+1
Now this works great, except when printed into xml I get this
<p begin="00:00:02.470" end="00:00:05.388">Organizational Communication and<br/>Deaf Employees<p/>
when I need this
<p begin="00:00:02.470" end="00:00:05.388">Organizational Communication and<br/>Deaf Employees</p>
I think I understand the issue: I am appending the child as a string, and because of this the element automatically adds its own <> and </>, right? So how do I prevent that last </> from being added?
EDIT: mzjn has asked for an example of the xml data.
This is the xml (with sensitive parts removed)
<tt xml:lang="MI" ***********>
<head>
<metadata>
<ttm:title>*********</ttm:title>
<ttm:agent type="person" xml:id="author_1">
<ttm:name type="full">No Author</ttm:name>
</ttm:agent>
</metadata>
<styling>
<style tts:color="white" tts:fontFamily="Arial" tts:fontSize="24" tts:fontStyle="normal" tts:fontWeight="bold" tts:textAlign="center" xml:id="s1"/>
</styling>
<layout>
<region tts:displayAlign="before" tts:extent="80% 80%" tts:origin="10% 10%" xml:id="top"/>
<region tts:displayAlign="center" tts:extent="80% 80%" tts:origin="10% 10%" xml:id="center"/>
<region tts:displayAlign="after" tts:extent="80% 80%" tts:origin="10% 10%" xml:id="bottom"/>
</layout>
</head>
<body>
<div style="s1" xml:id="d1" xml:lang="MI">
<p begin="00:00:02.470" end="00:00:05.388" region="bottom" xml:id="p1">Organizational Communication and<br/>Deaf Employees</p>
this continues with each <p> tag until we get to the </div> </body> and </tt>
I need to change it to this:
<?xml version="1.0" ?>
<tt xmlns=****** xml:lang="en">
<body>
<div xml:lang="en" style="1">
<p begin="00:00:02.470" end="00:00:05.388">Organizational Communication and<br/>Deaf Employees</p>
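No answer is recorded for this question here. One possible approach, sketched under assumptions, is to avoid string manipulation altogether: copy each <p> node into the new document with importNode and remove the unwanted attributes with removeAttribute, so that minidom writes a proper closing tag itself. The function name `clean_paragraphs`, the reduced sample string, and the bare <div> wrapper are made up for this illustration:

```python
from xml.dom import minidom

SOURCE = (
    '<tt xml:lang="en"><body><div>'
    '<p begin="00:00:02.470" end="00:00:05.388" region="bottom" xml:id="p1">'
    "Organizational Communication and<br/>Deaf Employees</p>"
    "</div></body></tt>"
)


def clean_paragraphs(source_xml, drop=("region", "xml:id")):
    """Copy every <p> into a new document, removing the attributes named in drop."""
    src = minidom.parseString(source_xml)
    out = minidom.Document()
    div = out.createElement("div")
    out.appendChild(div)
    for p in src.getElementsByTagName("p"):
        clean = out.importNode(p, True)  # deep copy that keeps child nodes such as <br/>
        for name in drop:
            # removeAttribute raises NotFoundErr for a missing attribute, so check first.
            if clean.hasAttribute(name):
                clean.removeAttribute(name)
        div.appendChild(clean)
    return out


result = clean_paragraphs(SOURCE)
print(result.documentElement.toxml())
```

Because the element is copied as a node rather than passed to createElement as a string, its tag name stays `p` and the serializer closes it with `</p>`.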

How to remove all attributes of a tag

How can I remove all the attributes of an XML tag so I can get from this:
<xml blah blah blah> to just <xml>.
With lxml I know I can remove the whole element, but I couldn't find a way to remove only the attributes of a tag. (I found solutions on Stack Overflow for C#, but I want Python.)
I am opening a gpx(xml) file and this is my code so far (based on How do I get the whole content between two xml tags in Python?):
from lxml import etree
t = etree.parse("1.gpx")
e = t.xpath('//trk')[0]
print((e.text + ''.join(etree.tostring(child, encoding='unicode') for child in e)).strip())
Another approach I did was this:
from lxml import etree
TOPOGRAFIX_NS = './/{http://www.topografix.com/GPX/1/1}'
TRACKPOINT_NS = TOPOGRAFIX_NS + 'extensions/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}TrackPointExtension/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}'
doc1 = etree.parse("1.gpx")
for node1 in doc1.findall(TOPOGRAFIX_NS + 'trk'):
    node_to_string1 = etree.tostring(node1)
    print(node_to_string1)
But I get the trk tag with the TOPOGRAFIX_NS attributes, which I don't want, and that is why I want to remove the tag's attributes. I just want to get:
<trk> all the inside content </trk>
Thank you very much!
P.S. The content of the gpx file:
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="Endomondo.com" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/GpxExtensions/v3 http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<metadata>
<author>
<name>Blah Blah</name>
<email id="blah" domain="blah.com"/>
</author>
<link href="http://www.endomondo.com">
<text>Endomondo</text>
</link>
<time>2014-01-20T10:50:28Z</time>
</metadata>
<trk>
<name>Galati</name>
<src>http://www.endomondo.com/</src>
<link href="http://www.endomondo.com/workouts/260782567/13005122">
<text>Galati</text>
</link>
<type>MOUNTAIN_BIKING</type>
<trkseg>
<trkpt lat="45.431074" lon="28.021038">
<time>2013-10-20T05:49:04Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>
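No answer is recorded for this question here. What looks like attributes on the serialized trk tag is most likely the namespace declaration that the serializer re-emits, because trk lives in the GPX default namespace. A hedged sketch using the standard library's ElementTree (not lxml) that rewrites the tag names without their namespace before serializing; `strip_namespaces` is a hypothetical helper and the GPX string is a shortened version of the file above:

```python
from xml.etree import ElementTree as ET

GPX = """<gpx version="1.1" xmlns="http://www.topografix.com/GPX/1/1">
<trk><name>Galati</name><trkseg><trkpt lat="45.431074" lon="28.021038"><time>2013-10-20T05:49:04Z</time></trkpt></trkseg></trk>
</gpx>"""


def strip_namespaces(element):
    """Rewrite '{uri}name' tags to plain 'name' on element and all its descendants."""
    for node in element.iter():
        if isinstance(node.tag, str) and node.tag.startswith("{"):
            node.tag = node.tag.split("}", 1)[1]
    return element


root = ET.fromstring(GPX)
trk = root.find("{http://www.topografix.com/GPX/1/1}trk")
trk.tail = None  # drop the newline that followed </trk> in the source
print(ET.tostring(strip_namespaces(trk), encoding="unicode"))
```

Ordinary attributes such as lat and lon are left untouched; to remove real attributes from an element as well, `element.attrib.clear()` can be called on it.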

Python ElementTree find() using a wildcard?

I am parsing an XML feed in python to extract certain tags. My XML contains namespaces and this results in each tag containing a namespace followed by tag name.
Here is the xml:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:rte="http://www.rte.ie/schemas/vod">
<id>10038711/</id>
<updated>2013-01-24T22:52:43+00:00</updated>
<title type="text">Reeling in the Years</title>
<logo>http://www.rte.ie/iptv/images/logo.gif</logo>
<link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist?type=iptv&showId=10038711" />
<category term="feed"/>
<author>
<name>RTE</name>
<uri>http://www.rte.ie</uri>
</author>
<entry>
<id>10038711</id>
<published>2012-07-04T12:00:00+01:00</published>
<updated>2013-01-06T12:31:25+00:00</updated>
<title type="text">Reeling in the Years</title>
<content type="text">National and international events with popular music from the year 1989.First Broadcast: 08/11/1999</content>
<category term="WEB Exclusive" rte:type="channel"/>
<category term="Classics 1980" rte:type="genre"/>
<category term="rte player" rte:type="source"/>
<category term="" rte:type="transmision_details"/>
<category term="False" rte:type="copyprotectionoptout"/>
<category term="long" rte:type="form"/>
<category term="3275" rte:type="progid"/>
<link rel="site" type="text/html" href="http://www.rte.ie/tv50/"/>
<link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist/?itemId=10038711&type=iptv&format=xml" />
<link rel="alternate" type="text/html" href="http://www.rte.ie/player/#v=10038711"/>
<rte:valid start="2012-07-23T15:56:04+01:00" end="2017-08-01T15:56:04+01:00"/>
<rte:duration ms="842205" formatted="0:10"/>
<rte:statistics views="19"/>
<rte:bri id="na"/>
<rte:channel id="13"/>
<rte:item id="10038711"/>
<media:title type="plain">Reeling in the Years</media:title>
<media:description type="plain">National and international events with popular music from the year 1989. First Broadcast: 08/11/1999</media:description>
<media:thumbnail url="http://img.rasset.ie/00062efc200.jpg" height="288" width="512" time="00:00:00+00:00"/>
<media:teaserimgref1x1 url="" time="00:00:00+00:00"/>
<media:rating scheme="http://www.rte.ie/schemes/vod">NA</media:rating>
<media:copyright>RTÉ</media:copyright>
<media:group rte:format="single">
<media:content url="http://vod.hds.rasset.ie/manifest/2012/0728/20120728_reelingint_cl10038711_10039316_260_.f4m" type="video/mp4" medium="video" expression="full" duration="842205" rte:format="content"/>
</media:group>
<rte:ads>
<media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&iu=%2F3014%2FP_RTE_TV50_Pre&ciu_szs=300x250&impl=s&gdfp_req=1&env=vp&output=xml_vast2&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
<media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&iu=%2F3014%2FP_RTE_TV50_Pre2&ciu_szs=300x250&impl=s&gdfp_req=1&env=vp&output=xml_vast2&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
<media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&iu=%2F3014%2FP_RTE_TV50_Pre3&ciu_szs=300x250&impl=s&gdfp_req=1&env=vp&output=xml_vast2&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
</rte:ads>
</entry>
<!-- playlist.xml -->
</feed>
When the XML is parsed each element is coming out as:
{http://www.w3.org/2005/Atom}id
{http://www.w3.org/2005/Atom}published
{http://www.w3.org/2005/Atom}updated
.....
.....
{http://www.rte.ie/schemas/vod}valid
{http://www.rte.ie/schemas/vod}duration
....
....
{http://search.yahoo.com/mrss/}description
{http://search.yahoo.com/mrss/}thumbnail
....
As I have 3 different namespaces and I cannot guarantee that they will always be the same, I would prefer not to hard-code each tag like so:
for elem in tree.iter('{http://www.w3.org/2005/Atom}entry'):
    stream = str(elem.find('{http://www.w3.org/2005/Atom}id').text)
    date_tmp = str(elem.find('{http://www.w3.org/2005/Atom}published').text)
    name_tmp = str(elem.find('{http://www.w3.org/2005/Atom}title').text)
    short_tmp = str(elem.find('{http://www.w3.org/2005/Atom}content').text)
    channel_tmp = elem.find('{http://www.w3.org/2005/Atom}category', "channel")
    channel = str(channel_tmp.get('term'))
    icon_tmp = elem.find('{http://search.yahoo.com/mrss/}thumbnail')
    icon_url = str(icon_tmp.get('url'))
Is there any way that I can put a wildcard or something similar into the find so it will simply ignore the namespace?
stream = str(elem.find('*id').text)
I can hard-code them as above, but it would be just my luck that down the line the namespace changes and my queries stop returning data.
Thanks for the help.
You can use an XPath expression with the local-name() function:
<?xml version="1.0"?>
<root xmlns="ns">
<tag/>
</root>
Assuming "doc" is the ElementTree for the above XML:
import lxml.etree
doc = lxml.etree.parse(<some_file_like_object>)
root = doc.getroot()
root.xpath('//*[local-name()="tag"]')
[<Element {ns}tag at 0x7fcde6f7c960>]
replacing <some_file_like_object> as appropriate (alternatively, you can use lxml.etree.fromstring with an XML string to get the root element directly).
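If lxml is not available, the standard library offers a similar wildcard: since Python 3.8, ElementTree's find and findall accept `{*}name` to match a tag in any namespace. A minimal sketch, assuming Python 3.8 or later; the feed string is a shortened version of the one in the question and `entries` is an illustrative name:

```python
from xml.etree import ElementTree as ET

FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
<entry>
<id>10038711</id>
<title type="text">Reeling in the Years</title>
<media:thumbnail url="http://img.rasset.ie/00062efc200.jpg"/>
</entry>
</feed>"""

root = ET.fromstring(FEED)
entries = []
# "{*}entry" matches <entry> whatever namespace it is in (Python 3.8+ only).
for entry in root.findall(".//{*}entry"):
    thumbnail = entry.find("{*}thumbnail")
    entries.append({
        "stream": entry.findtext("{*}id"),
        "name": entry.findtext("{*}title"),
        "icon_url": thumbnail.get("url") if thumbnail is not None else None,
    })
print(entries)
```

One trade-off of ignoring namespaces is ambiguity: in the full feed both the Atom `title` and `media:title` exist, and `{*}title` returns whichever comes first in document order.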
