I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz
I've unzipped and saved the .xml.gz file as an .xml file. The structure looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.bestbuy.com/</loc>
<priority>0.0</priority>
</url>
<url>
<loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
<priority>0.0</priority>
</url>
<url>
<loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
<priority>0.0</priority>
</url>
I'm attempting to use ElementTree to extract all of the URLs within the loc nodes throughout this file, but I'm struggling to get it working.
Per the documentation, I'm trying something like this:
import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()
value = root.findall(".//loc")
However, nothing gets loaded into value. My goal is to extract all of the URLs in the loc nodes and write them out to a new flat file. Where am I going wrong?
You were close in your attempt but, as mzjn said in a comment, you didn't account for the default namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9").
Here's an example of how to account for the namespace:
import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for elem in tree.findall(".//sm:loc", ns):
    print(elem.text)
output:
https://www.bestbuy.com/
https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008
https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647
Note that I used the namespace prefix sm, but you could use any NCName.
See here for more information on parsing XML with namespaces in ElementTree.
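If you would rather not declare a prefix at all, ElementTree in Python 3.8 and later also accepts a {*} wildcard in place of the namespace. The following is only a minimal sketch under that version assumption; SAMPLE is a shortened inline stand-in for the sitemap file, not the real file.

```python
import xml.etree.ElementTree as ET

# Shortened inline stand-in for the sitemap file (illustration only).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.bestbuy.com/</loc><priority>0.0</priority></url>
  <url><loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc></url>
</urlset>"""

root = ET.fromstring(SAMPLE)

# "{*}loc" matches <loc> in any namespace (requires Python 3.8+).
locs = [elem.text for elem in root.findall(".//{*}loc")]
for url in locs:
    print(url)
```

The wildcard is convenient for quick scripts; the explicit prefix map above is stricter because it only matches elements in the sitemap namespace.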
We can iterate through the URLs, toss them into a list and write them to a file as such:
from xml.etree import ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
urls = []
for child in root.iter():
    for block in child.findall('{}url'.format(name_space)):
        for url in block.findall('{}loc'.format(name_space)):
            urls.append('{}\n'.format(url.text))
with open('sample_urls.txt', 'w+') as f:
    f.writelines(urls)
Note that we need to prepend the namespace from the opening urlset element to each tag name to parse the XML properly.
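The nested loops can probably be flattened, because iter() accepts a namespace-qualified tag name directly. This is a rough sketch rather than a drop-in replacement; extract_locs and write_urls are hypothetical helper names, and the sample document is a shortened stand-in for the sitemap.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_locs(root):
    """Return the text of every <loc> element below root, in document order."""
    return [loc.text for loc in root.iter(NS + "loc")]

def write_urls(urls, out_path):
    """Write one URL per line to out_path."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(url + "\n" for url in urls)

# Shortened stand-in for the parsed sitemap (illustration only).
sample = ET.fromstring(
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.bestbuy.com/</loc><priority>0.0</priority></url>"
    "</urlset>"
)
urls = extract_locs(sample)
```

With a real file you would call extract_locs(ET.parse(path).getroot()) and then write_urls(urls, 'sample_urls.txt').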
I am trying to parse through my .xml file using glob and then use etree to add more code to my .xml. However, I keep getting an error when using doc insert that says object has no attribute insert. Does anyone know how I can effectively add code to my .xml?
import glob
from lxml import etree
path = "D:/Test/"
for xml_file in glob.glob(path + '/*/*.xml'):
    doc = etree.parse(xml_file)
    new_elem = etree.fromstring("""<new_code abortExpression=""
                                elseExpression=""
                                errorIfNoMatch="false"/>""")
    doc.insert(1,new_elem)  # this is the line that raises the error
    new_elem.tail = "\n"
My original xml looks like this :
<data>
<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>
</data>
And I'd like to modify it to look like this:
<data>
<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>
<new_code abortExpression="" elseExpression="" errorIfNoMatch="false"/>
</data>
The problem is that you need to extract the root from your document before you can start modifying it: modify doc.getroot() instead of doc.
This works for me:
from lxml import etree
xml_file = "./doc.xml"
doc = etree.parse(xml_file)
new_elem = etree.fromstring("""<new_code abortExpression=""
elseExpression=""
errorIfNoMatch="false"/>""")
root = doc.getroot()
root.insert(1, new_elem)
new_elem.tail="\n"
To print the results to a file, you can use doc.write():
doc.write("doc-out.xml", encoding="utf8", xml_declaration=True)
Note the xml_declaration=True argument: it tells doc.write() to produce the <?xml version='1.0' encoding='UTF8'?> header.
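The same split between the tree wrapper and the root element also exists in the standard library's xml.etree.ElementTree, so the idea can be tried without lxml. The snippet below is a small sketch of that, using the sample XML from the question as an inline string; it is not the lxml code from this answer.

```python
import xml.etree.ElementTree as ET

doc = ET.ElementTree(ET.fromstring(
    '<data>\n'
    '<assesslet index="Test" hash-uptodate="False" types="TriggerRuleType" verbose="True"/>\n'
    '</data>'
))

# The tree wrapper has no insert(); only Element objects do.
tree_has_insert = hasattr(doc, "insert")

root = doc.getroot()
new_elem = ET.fromstring(
    '<new_code abortExpression="" elseExpression="" errorIfNoMatch="false"/>'
)
new_elem.tail = "\n"
root.insert(1, new_elem)  # index 1: right after the existing <assesslet/>

child_tags = [child.tag for child in root]
print(tree_has_insert, child_tags)
```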
Can someone help me extract a certain string from one XML file and replace it in another? I want to extract only the mediaId from the second line of 1.xml:
- <sample_settings_config version="23" mediaId="0x6868">
and put it into 2.xml (both XML files are identical except for the mediaId).
The location of 1.xml is dynamic (C:\users\xxxx\Documents), so I am looking for a way to build the path, then find, extract and replace the value.
I am trying to do this with Python or a batch script.
I am struggling with this batch script:
setlocal EnableDelayedExpansion
(Set UPD=C:\Users)
(Set UDS=\RAP\Documents\MB)
For /F "delims=" %%a in ('findstr /I /L "mediaId" 1.xml') do (
set "line=%%a"
set "line=!line:*<string>=!"
for /F "delims=<" %%b in ("!line!") do echo %%b
but it is not working, any help is appreciated, thank you.
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse("1.xml")
root = tree.getroot()
media_id = root.find('sample_settings_config').attrib['mediaId']
tree.parse("2.xml")
root = tree.getroot()
root.find('sample_settings_config').attrib['mediaId'] = media_id
tree.write('2.xml', xml_declaration=True)
You can use xml.etree.ElementTree to read and write XML in Python.
The code imports ElementTree and creates an instance named tree. It parses 1.xml with tree, gets the root, finds the sample_settings_config tag, and reads the value of its mediaId attribute.
It then repeats the parsing for 2.xml, gets the root, and finds the tag again. It sets the mediaId key of the attribute dictionary to the value stored in the variable media_id.
Finally, it writes the modified content back to the XML file.
1.xml:
<?xml version='1.0' encoding='us-ascii'?>
<root>
<sample_settings_config version="123" mediaId="0x6868">
"XML File 1"
</sample_settings_config>
</root>
2.xml:
<?xml version='1.0' encoding='us-ascii'?>
<root>
<sample_settings_config version="456" mediaId="0x0">
"XML File 2"
</sample_settings_config>
</root>
2.xml modified:
<?xml version='1.0' encoding='us-ascii'?>
<root>
<sample_settings_config mediaId="0x6868" version="456">
"XML File 2"
</sample_settings_config>
</root>
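In the question, sample_settings_config sits on the second line of the file, so it may be the root element itself rather than a child of a root wrapper as in the sample files above. A hedged sketch that copes with both layouts follows; find_config, copy_media_id and default_source_path are hypothetical helper names, and the Documents location is only an assumption based on the path in the question.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

TAG = "sample_settings_config"

def find_config(root):
    """Return the config element whether it is the root or a descendant."""
    return root if root.tag == TAG else root.find(f".//{TAG}")

def copy_media_id(src_root, dst_root):
    """Copy the mediaId attribute from src_root's config element to dst_root's."""
    media_id = find_config(src_root).get("mediaId")
    find_config(dst_root).set("mediaId", media_id)
    return media_id

def default_source_path():
    """Assumed location of 1.xml: the current user's Documents folder."""
    return Path.home() / "Documents" / "1.xml"

# Inline stand-ins for 1.xml (config as root) and 2.xml (config as child).
src = ET.fromstring('<sample_settings_config version="23" mediaId="0x6868"/>')
dst = ET.fromstring('<root><sample_settings_config version="456" mediaId="0x0"/></root>')
copied = copy_media_id(src, dst)
```

With real files you would pass ET.parse(path).getroot() for each document and then write the second tree back with tree.write().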
I was wondering how I would go about determining what the root tag of an XML document is using xml.dom.minidom.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child1></child1>
<child2></child2>
<child3></child3>
</root>
In the example XML above, my root tag could be 3 or 4 different things. All I want to do is pull the tag, and then use that value to get the elements by tag name.
def import_from_XML(self, file_name):
    file = open(file_name)
    document = file.read()
    if re.compile(r'^<\?xml').match(document):
        xml = parseString(document)
        root = '' # <-- THIS IS WHERE IM STUCK
        elements = xml.getElementsByTagName(root)
I tried searching through the documentation for xml.dom.minidom, but it is a little hard for me to wrap my head around, and I couldn't find anything that answered this question outright.
I'm using Python 3.6.x, and I would prefer to keep with the standard library if possible.
For the line you commented as Where I am stuck, the following should assign the value of the root tag of the XML document to the variable theNameOfTheRootElement:
theNameOfTheRootElement = xml.documentElement.tagName
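As a small self-contained illustration of that attribute, the sketch below parses the sample document from the question as an inline string and reads the root tag name. The child_names list is an extra added for illustration; it simply lists the element children of the root.

```python
from xml.dom.minidom import parseString

document = """<?xml version="1.0" encoding="UTF-8"?>
<root>
<child1></child1>
<child2></child2>
<child3></child3>
</root>"""

xml = parseString(document)

# documentElement is the single top-level element of the document.
root_name = xml.documentElement.tagName

# Whitespace between tags shows up as text nodes, so keep elements only.
child_names = [
    node.tagName
    for node in xml.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]
print(root_name, child_names)
```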
This is what I did the last time I processed XML. I didn't use your approach, but I hope it helps.
import urllib.error
import urllib.request
from xml.etree import ElementTree as ET
# `site` must hold the URL of the XML document to fetch.
req = urllib.request.Request(site)
try:
    with urllib.request.urlopen(req) as file:
        data = file.read()
except urllib.error.URLError as e:
    print(e.reason)
    raise
root = ET.fromstring(data)
print("root", root)
# Replace 'parent element' with the tag name you are looking for.
for child in root.findall('parent element'):
    print(child.text, child.attrib)
I am trying to add a vhost entry to tomcat server.xml using python lxml
import io
from lxml import etree
newdoc = etree.fromstring('<Host name="getrailo.com" appBase="webapps"><Context path="" docBase="/var/sites/getrailo.org" /><Alias>www.getrailo.org</Alias><Alias>my.getrailo.org</Alias></Host>')
doc = etree.parse('/root/server.xml')
root = doc.getroot()
for node1 in root.iter('Service'):
    for node2 in node1.iter('Engine'):
        node2.append(newdoc)
doc.write('/root/server.xml')
The problem is that it removes the
<?xml version='1.0' encoding='utf-8'?>
line at the top of the file from the output, and the vhost entry ends up all on one line. How can I add the XML element in a pretty-printed way, like this:
<Host name="getrailo.org" appBase="webapps">
<Context path="" docBase="/var/sites/getrailo.org" />
<Alias>www.getrailo.org</Alias>
<Alias>my.getrailo.org</Alias>
</Host>
First, parse the existing file with remove_blank_text=True so that whitespace-only text between elements is dropped; I think that leftover whitespace is what prevents pretty-printing in this case:
parser = etree.XMLParser(remove_blank_text=True)
doc = etree.parse('/root/server.xml', parser)
Then you can safely write it back to disk with pretty_print and xml_declaration set in doc.write():
doc.write('/root/server.xml',
xml_declaration=True,
encoding='utf-8',
pretty_print=True)
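If lxml is not available, the standard library offers a comparable route. The sketch below is an alternative, not the lxml approach from this answer, and it assumes Python 3.9 or later for ET.indent(). The bare Engine element is a stand-in for the real Engine node in server.xml.

```python
import xml.etree.ElementTree as ET

# Stand-in for the Engine node found in server.xml (illustration only).
engine = ET.fromstring("<Engine/>")

host = ET.fromstring(
    '<Host name="getrailo.org" appBase="webapps">'
    '<Context path="" docBase="/var/sites/getrailo.org" />'
    "<Alias>www.getrailo.org</Alias><Alias>my.getrailo.org</Alias></Host>"
)
engine.append(host)

# indent() rewrites the whitespace between elements in place (Python 3.9+).
ET.indent(engine, space="  ")

# xml_declaration=True adds the <?xml ...?> line to the serialized output.
text = ET.tostring(engine, encoding="utf-8", xml_declaration=True).decode("utf-8")
print(text)
```

When writing a whole tree to disk, tree.write(path, encoding="utf-8", xml_declaration=True) takes the same two arguments.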
I am learning ElementTree in python. Everything seems fine except when I try to parse the xml file with prefix:
test.xml:
<?xml version="1.0"?>
<abc:data>
<abc:country name="Liechtenstein" rank="1" year="2008">
</abc:country>
<abc:country name="Singapore" rank="4" year="2011">
</abc:country>
<abc:country name="Panama" rank="5" year="2011">
</abc:country>
</abc:data>
When I try to parse the xml:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
I got the following error:
xml.etree.ElementTree.ParseError: unbound prefix: line 2, column 0
Do I need to specify something in order to parse a xml file with prefix?
Add the abc namespace to your xml file.
<?xml version="1.0"?>
<abc:data xmlns:abc="your namespace">
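To see the difference the declaration makes, here is a short sketch that parses a shortened version of the document with and without it. The URI http://example.com/abc is a made-up placeholder; use whatever namespace your data actually belongs to.

```python
import xml.etree.ElementTree as ET

BODY = '<abc:country name="Liechtenstein" rank="1" year="2008"></abc:country>'
NS_URI = "http://example.com/abc"  # hypothetical namespace URI

# Without a declaration for the abc prefix, the parser rejects the document.
try:
    ET.fromstring("<abc:data>" + BODY + "</abc:data>")
    unbound_error = None
except ET.ParseError as exc:
    unbound_error = str(exc)

# With xmlns:abc declared on the root, parsing succeeds.
root = ET.fromstring(f'<abc:data xmlns:abc="{NS_URI}">{BODY}</abc:data>')

# Searches need the same URI, supplied through a prefix map.
names = [c.get("name") for c in root.findall("abc:country", {"abc": NS_URI})]
print(unbound_error, names)
```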
I encountered the same issue while processing an XML file. You can create a recovering lxml parser like the one below and pass it in when you parse your XML file. This should resolve the issue.
from lxml import etree
parser1 = etree.XMLParser(encoding="utf-8", recover=True)
tree1 = etree.parse('filename.xml', parser1)
See if this works:
from bs4 import BeautifulSoup
xml_file = "test.xml"
with open(xml_file, "r", encoding="utf8") as f:
    contents = f.read()
soup = BeautifulSoup(contents, "xml")
items = soup.find_all("country")
print (items)
The above will produce a list that you can then manipulate to achieve your aim (e.g. strip the tags):
[<country name="Liechtenstein" rank="1" year="2008">
</country>, <country name="Singapore" rank="4" year="2011">
</country>, <country name="Panama" rank="5" year="2011">
</country>]