I'm trying to create a function that counts the words in a .pptx document. The problem is that I can't figure out how to find only tags of this kind:
<a:t>Some Text</a:t>
When I try to: print xmlTree.findall('.//a:t'), it returns
SyntaxError: prefix 'a' not found in prefix map
Do you know what to do to make it work?
This is the function:
def get_pptx_word_count(filename):
    import xml.etree.ElementTree as ET
    import zipfile
    z = zipfile.ZipFile(filename)
    i = 0
    wordcount = 0
    while True:
        i += 1
        slidename = 'slide{}.xml'.format(i)
        try:
            slide = z.read("ppt/slides/{}".format(slidename))
        except KeyError:
            break
        xmlTree = ET.fromstring(slide)
        for elem in xmlTree.iter():
            if elem.tag == 'a:t':
                pass
                #text = elem.getText
                #num = len(text.split(' '))
                #wordcount += num
The way to specify the namespace inside ElementTree is:
{namespace}element
So, you should change your query to:
print xmlTree.findall('.//{a}t')
Edit:
As @mxjn pointed out, if a is a prefix and not the URI, you need to insert the URI instead of a:
print xmlTree.findall('.//{http://tempuri.org/name_space_of_a}t')
or you can supply a prefix map:
prefix_map = {"a": "http://tempuri.org/name_space_of_a"}
print xmlTree.findall('.//a:t', prefix_map)
You need to tell ElementTree about your XML namespaces.
References:
Official Documentation (Python 2.7): 19.7.1.6. Parsing XML with Namespaces
Existing answer on StackOverflow: Parsing XML with namespace in Python via 'ElementTree'
Article by the author of ElementTree: ElementTree: Working with Namespaces and Qualified Names
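Putting this together for the .pptx case: slide XML normally binds the a prefix to the DrawingML namespace http://schemas.openxmlformats.org/drawingml/2006/main, so a sketch of the fixed word-count function (assuming that standard binding) could look like this:
import xml.etree.ElementTree as ET
import zipfile

def get_pptx_word_count(filename):
    # Assumption: 'a' is the standard DrawingML prefix used in slide XML
    ns = {'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}
    wordcount = 0
    with zipfile.ZipFile(filename) as z:
        i = 0
        while True:
            i += 1
            try:
                slide = z.read('ppt/slides/slide{}.xml'.format(i))
            except KeyError:
                # no more slides in the archive
                break
            xmlTree = ET.fromstring(slide)
            for elem in xmlTree.findall('.//a:t', ns):
                if elem.text:
                    wordcount += len(elem.text.split())
    return wordcount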
I have an XML file downloaded from WordPress that is structured like this:
<wp:postmeta>
    <wp:meta_key><![CDATA[country]]></wp:meta_key>
    <wp:meta_value><![CDATA[Germany]]></wp:meta_value>
</wp:postmeta>
My goal is to search through the XML file for all the country keys and print their values. I'm completely new to the XML library, so I'm looking for where to take it from here.
# load libraries
# importing os to handle directory functions
import os
# import XML handlers
from xml.etree import ElementTree
# importing json to handle structured data saving
import json

# dictionary with namespaces
ns = {'wp:meta_key', 'wp:meta_value'}

tree = ElementTree.parse('/var/www/python/file.xml')
root = tree.getroot()

# item
for item in root.findall('wp:post_meta', ns):
    print '- ', item.text

print "Finished running"
This throws an error about using wp as a namespace, but I'm not sure where to go from here; the documentation is unclear to me. Any help is appreciated.
Downvoters, please let me know how I can improve my question.
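Note that ElementTree's namespaces argument has to be a prefix-to-URI dict, not a set of tag names, and the URI has to match the xmlns:wp declaration at the top of the export file. A minimal sketch, assuming the common WXR namespace URI http://wordpress.org/export/1.2/ (check your own file's declaration):
from xml.etree import ElementTree

# Assumption: the export declares xmlns:wp="http://wordpress.org/export/1.2/";
# older exports may use a different version number, so check the file header.
ns = {'wp': 'http://wordpress.org/export/1.2/'}

tree = ElementTree.parse('/var/www/python/file.xml')
root = tree.getroot()

# The element is wp:postmeta (not wp:post_meta)
for meta in root.findall('.//wp:postmeta', ns):
    key = meta.find('wp:meta_key', ns)
    value = meta.find('wp:meta_value', ns)
    if key is not None and key.text == 'country':
        print '- ', value.text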
I don't know XML, but I can treat it as a string like this.
from simplified_scrapy import SimplifiedDoc, req, utils
xml = '''
<wp:postmeta>
<wp:meta_key><![CDATA[country]]></wp:meta_key>
<wp:meta_value><![CDATA[Germany]]></wp:meta_value>
</wp:postmeta>
'''
doc = SimplifiedDoc(xml)
kvs = doc.select('wp:postmeta').selects('wp:meta_key|wp:meta_value').html
print (kvs)
Result:
['<![CDATA[country]]>', '<![CDATA[Germany]]>']
I'm using BeautifulSoup4 with Python 2.7 to parse some XML files. The reason I'm using BS is that I know the documents will contain invalid headers, inconsistent encodings, etc., though I don't know for certain that lxml and the like can't cope.
I'm trying to check if certain elements have a value so...
if soup.person.identifier.string is None:
    # reject file
Which is fine as long as the XML is:
<root>
    <person>
        <identifier></identifier>
    </person>
</root>
But if the "identifier" element is omitted entirely I get an error that "None does not have an attribute string".
My question is what is the neatest way to handle this? I'd prefer to avoid having to check first that the element exists before I check for a value.
There's
try:
    identifier = soup.something.identifier.string
except:
    identifier = None

if identifier is None:
    # reject file
but that, too, seems a bit long-winded.
If I were using lxml, I'd just do
if len(root.xpath('person/identifier/text()')) == 0:
Which would handle both.
maybe something like:
items = [item for item in soup.find_all(name='somethingelse') if item.text == ""]
ex.
import bs4
string = """
<root>
<something>
<somethingelse></somethingelse>
<somethingelse>haha</somethingelse>
</something>
</root>
"""
soup = bs4.BeautifulSoup(string, 'lxml')
items = [item for item in soup.find_all(name='somethingelse') if item.text == ""]
output: [<somethingelse></somethingelse>]
And it won't break if it can't find one.
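If the goal is simply "reject the file when person/identifier is missing or empty", a sketch using find (which returns None when the element is absent) sidesteps the attribute-chain problem:
identifier = soup.find('identifier')
if identifier is None or not identifier.get_text(strip=True):
    pass  # reject file: element missing or empty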
What I ended up doing was:
def bv(value_string, locals):
    try:
        result = eval(value_string, globals(), locals)
    except AttributeError:
        result = None
    return result

bv('person.identifier.string', locals())
This works but I suspect there's a better way to do this.
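A getattr-based helper does the same thing without eval; a minimal sketch (reusing the bv name from above):
def bv(tag, *names):
    # Follow the attribute chain, stopping at the first missing element
    for name in names:
        tag = getattr(tag, name, None)
        if tag is None:
            return None
    return tag

# Usage: yields None instead of raising when 'identifier' is absent
if bv(soup, 'person', 'identifier', 'string') is None:
    pass  # reject file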
I am working with CityGML data right now and am trying to parse it in Python.
To do so, I use ElementTree, which works fine with other XML files. But whenever I try to parse the CityGML file, I don't get any results.
As one example I want to get a list with all child tags named "creationDate" in the CityGML file. Here is the code:
import xml.etree.ElementTree as ET

tree = ET.parse('Gasometer.xml')
root = tree.getroot()

def child_list(child):
    list_child = list(tree.iter(child))
    return list_child

date = child_list('creationDate')
print (date)
I only get an empty list [].
Here is the very first part of the CityGML file (you can find the "creationDate" tag at the end):
<?xml version="1.0" encoding="UTF-8"?>
<CityModel>
  <cityObjectMember>
    <bldg:Building gml:id="UUID_899cac3f-e0b6-41e6-ae30-a91ce51d6d95">
      <gml:description>Wohnblock in geschlossener Bauweise</gml:description>
      <gml:boundedBy>
        <gml:Envelope srsName="urn:ogc:def:crs,crs:EPSG::3068,crs:EPSG::5783" srsDimension="3">
          <gml:lowerCorner>21549.6537889055 17204.3479916992 38.939998626709</gml:lowerCorner>
          <gml:upperCorner>21570.6420902953 17225.660050148 60.6840192923434</gml:upperCorner>
        </gml:Envelope>
      </gml:boundedBy>
      <creationDate>2014-03-28</creationDate>
This happens not only when I try to get lists of child tags; I can't print any attributes or tag names either. It looks like the way I parse the file is wrong. I hope somebody can help me out with my problem and tell me what I should do. Thanks!
Since this is an old post, I'll just leave this here in case someone else might need it.
To parse CityGML, try the following code; it should help give a general idea of how to fetch the information.
import xml.etree.ElementTree as ET

def loadfile():
    tree = ET.parse('filename')
    root = tree.getroot()
    for envelope in root.iter('{http://www.opengis.net/gml}Envelope'):
        print "ENV tag", envelope.tag
        print "ENV attrib", envelope.attrib
        print "ENV text", envelope.text
        lCorner = envelope.find('{http://www.opengis.net/gml}lowerCorner').text
        uCorner = envelope.find('{http://www.opengis.net/gml}upperCorner').text
        print "lC", lCorner
        print "uC", uCorner

if __name__ == "__main__":
    loadfile()
To get the srsName, try the following:
import xml.etree.ElementTree as ET

def loadfile():
    tree = ET.parse('filename')
    root = tree.getroot()
    for envelope in root.iter('{http://www.opengis.net/gml}Envelope'):
        key = envelope.attrib
        srsName = key.get('srsName')
        print "SRS Name: ", srsName

if __name__ == "__main__":
    loadfile()
I hope this helps you or anyone else who might try parsing CityGML with ElementTree.
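For the original question about creationDate: that element lives in the CityGML core namespace, so searching for the unqualified name finds nothing. A sketch, assuming a CityGML 1.0 file (2.0 files use .../citygml/2.0 instead; check the xmlns declarations on your CityModel root):
import xml.etree.ElementTree as ET

tree = ET.parse('Gasometer.xml')

# Assumption: the file uses the CityGML 1.0 core namespace as its default;
# adjust the URI to whatever your file actually declares.
core = '{http://www.opengis.net/citygml/1.0}'

dates = [d.text for d in tree.iter(core + 'creationDate')]
print dates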
I need to get the value of FLVPath from this link: http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite
import requests
import lxml.html

sub_r = requests.get("http://www.testpage.co/v2/videoConfigXmlCode.php?pg=video_%s_no_0_extsite" % list[6])
sub_root = lxml.html.fromstring(sub_r.content)

for sub_data in sub_root.xpath('//PLAYER_SETTINGS[@Name="FLVPath"]/@Value'):
    print sub_data.text
But no data is returned.
You're using lxml.html to parse the document, which causes lxml to lowercase all element and attribute names (since case doesn't matter in HTML), which means you'll have to use:
sub_root.xpath('//player_settings[@name="FLVPath"]/@value')
Or, since you're parsing an XML file anyway, you could use lxml.etree.
You could try
print sub_data.attrib['Value']
import requests
import lxml.etree

url = "http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite"
response = requests.get(url)

# Use `lxml.etree` rather than `lxml.html`,
# and unicode `response.text` instead of `response.content`
doc = lxml.etree.fromstring(response.text)

for path in doc.xpath('//PLAYER_SETTINGS[@Name="FLVPath"]/@Value'):
    print path
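Note that an XPath expression ending in @Value returns the attribute values as plain strings rather than elements, which is why print path works here while sub_data.text in the question would fail even once the names matched.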
I have an HTML file I got from Wikipedia and would like to find every link on the page, such as /wiki/Absinthe, and replace it with the current directory added to the front, such as /home/fergus/wikiget/wiki/Absinthe. So:
<a href="/wiki/Absinthe">Absinthe</a>
becomes:
<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>
and this is throughout the whole document.
Do you have any ideas? I'm happy to use BeautifulSoup or Regex!
If that's really all you have to do, you could do it with sed and its -i option to rewrite the file in-place:
sed -i -e 's,href="/wiki,href="/home/fergus/wikiget/wiki,' wiki-file.html
However, here's a Python solution using the lovely lxml API, in case you need to do anything more complex or you might have badly formed HTML, etc.:
from lxml import etree
import re

parser = etree.HTMLParser()
with open("wiki-file.html") as fp:
    tree = etree.parse(fp, parser)

for e in tree.xpath("//a[@href]"):
    link = e.attrib['href']
    if re.search('^/wiki', link):
        e.attrib['href'] = '/home/fergus/wikiget' + link

# Or you can just specify the same filename to overwrite it:
with open("wiki-file-rewritten.html", "w") as fp:
    fp.write(etree.tostring(tree))
Note that lxml is probably a better option than BeautifulSoup for this kind of task nowadays, for the reasons given by BeautifulSoup's author.
This is a solution using the re module:
#!/usr/bin/env python
import re
open('output.html', 'w').write(re.sub('href="http://en.wikipedia.org', 'href="/home/fergus/wikiget/wiki/Absinthe', open('file.html').read()))
Here's another one without using re:
#!/usr/bin/env python
open('output.html', 'w').write(open('file.html').read().replace('href="http://en.wikipedia.org', 'href="/home/fergus/wikiget/wiki/Absinthe'))
You can use a function with re.sub:
import re

def match(m):
    return '<a href="/home/fergus/wikiget' + m.group(1) + '">'

r = re.compile(r'<a\shref="([^"]+)">')
r.sub(match, yourtext)
An example:
>>> s = '<a href="/wiki/Absinthe">Absinthe</a>'
>>> r.sub(match, s)
'<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>'
from lxml import html
el = html.fromstring('<a href="/wiki/word">word</a>')
# or `el = html.parse(file_or_url).getroot()`

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print(html.tostring(el))
el.rewrite_links(repl)
print(html.tostring(el))
Output
<a href="/wiki/word">word</a>
<a href="/home/fergus/wikiget/wiki/word">word</a>
You could also use the function lxml.html.rewrite_links() directly:
from lxml import html

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print html.rewrite_links(htmlstr, repl)
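Here htmlstr is just the page source as a string, e.g.:
with open('wiki-file.html') as fp:
    htmlstr = fp.read()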
I would do
import re
ch = '<a href="/wiki/Absinthe">Absinthe</a>'
r = re.compile('(<a\s+href=")(/wiki/[^"]+">[^<]+</a>)')
print ch
print
print r.sub('\\1/home/fergus/wikiget\\2',ch)
EDIT:
It has been said that this solution doesn't capture tags with additional attributes. I thought the aim was a narrow pattern of string, such as <a href="/wiki/WORD">WORD</a>.
If not, well, no problem; a solution with a simpler RE is easy to write:
r = re.compile('(<a\s+href="/)([^>]+">)')
ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print ch
print r.sub('\\1home/fergus/wikiget/\\2',ch)
or why not:
r = re.compile('(<a\s+href="/)')
ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print ch
print r.sub('\\1home/fergus/wikiget/',ch)