I am attempting to parse Open XML from a Microsoft Word document. However, whenever I look at any tag or attribute, I receive the tag I want preceded by the openxmlformats namespace. Examples below. Does anybody know how I can remove this and receive only my tag name and value?
Current format:
for content in root.iter():
    print(content.tag)
returns:
'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tag'
and
for content in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tag'):
    print(content.attrib)
returns
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': 'Orange'}
Desired output:
for content in root.iter():
    print(content.tag)
returns
tag
and
for content in root.iter('tag'):
    print(content.attrib)
returns
{'val': 'Orange'}
You can write your own version of the iterator that does what you want:
from collections import namedtuple
import re

my_content = namedtuple('my_content', ['tag', 'attrib'])

def remove_namespace(name):
    return re.sub(r'^\{[^}]*\}', '', name)

def my_iterator(root, tag=None, namespace='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'):
    iterator = root.iter() if tag is None else root.iter(namespace + tag)
    for content in iterator:
        tag = remove_namespace(content.tag)
        attrib = {remove_namespace(key): val for key, val in content.attrib.items()}
        yield my_content(tag, attrib)
This will return objects that only have the tag and attrib attributes. You will have to write a more complex proxy object if you want more detailed functionality. You can use the generator as a drop-in replacement for the loops above:
for content in my_iterator(root):
    print(content.tag)
and
for content in my_iterator(root, 'tag'):
    print(content.attrib)
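As a self-contained sanity check, here is the iterator run against a tiny hand-written fragment in the same namespace (the root and tag names below are made up for illustration):

```python
from collections import namedtuple
import re
import xml.etree.ElementTree as ET

my_content = namedtuple('my_content', ['tag', 'attrib'])

def remove_namespace(name):
    # strip a leading '{namespace}' prefix, if any
    return re.sub(r'^\{[^}]*\}', '', name)

def my_iterator(root, tag=None,
                namespace='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'):
    iterator = root.iter() if tag is None else root.iter(namespace + tag)
    for content in iterator:
        yield my_content(remove_namespace(content.tag),
                         {remove_namespace(k): v for k, v in content.attrib.items()})

# a minimal made-up fragment in the wordprocessingml namespace
xml = ('<w:root xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
       '<w:tag w:val="Orange"/></w:root>')
root = ET.fromstring(xml)
for content in my_iterator(root, 'tag'):
    print(content.tag, content.attrib)  # tag {'val': 'Orange'}
```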
I'm trying to print the value of just one field of an XML tree. Here is the XML tree (an example), the one that I get when I make the request:
<puco>
<resultado>OK</resultado>
<coberturaSocial>O.S.P. TIERRA DEL FUEGO(IPAUSS)</coberturaSocial>
<denominacion>DAMIAN GUTIERREZ DEL RIO</denominacion>
<nrodoc>32443324</nrodoc>
<rnos>924001</rnos>
<tipodoc>DNI</tipodoc>
</puco>
Now, I just want to print the "coberturaSocial" value. Here is the request that I have in my views.py:
def get(request):
    r = requests.get('https://sisa.msal.gov.ar/sisa/services/rest/puco/38785898')
    dom = r.content
    asd = etree.fromstring(dom)
If I print "asd" I get this error: The view didn't return an HttpResponse object. It returned None instead.
I also get this in the console.
I just want to print coberturaSocial. Please help, I'm new to XML parsing!
You need to extract the contents of the tag and then return it wrapped in a response, like so:
return HttpResponse(asd.find('coberturaSocial').text)
I'm guessing etree is import xml.etree.ElementTree as etree
You can use:
text = r.content
dom = etree.fromstring(text)
el = dom.find('coberturaSocial')
el.text # this is where the string is
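Putting the pieces together as a runnable sketch (the sample response from the question stands in for r.content, and etree is assumed to be xml.etree.ElementTree):

```python
import xml.etree.ElementTree as etree

# Sample response from the question; in the view this would be r.content.
text = '''<puco>
<resultado>OK</resultado>
<coberturaSocial>O.S.P. TIERRA DEL FUEGO(IPAUSS)</coberturaSocial>
<denominacion>DAMIAN GUTIERREZ DEL RIO</denominacion>
<nrodoc>32443324</nrodoc>
<rnos>924001</rnos>
<tipodoc>DNI</tipodoc>
</puco>'''

dom = etree.fromstring(text)
el = dom.find('coberturaSocial')
print(el.text)  # O.S.P. TIERRA DEL FUEGO(IPAUSS)
```

In the Django view you would then return HttpResponse(el.text) rather than printing.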
I have an XML with the following structure that I'm getting from an API -
<entry>
<id>2397</id>
<title>action_alert</title>
<tes:actions>
<tes:name>action_alert</tes:name>
<tes:type>2</tes:type>
</tes:actions>
</entry>
I am scanning for the ID by doing the following -
sourceobject = etree.parse(urllib2.urlopen(fullsourceurl))
source_id = sourceobject.xpath('//id/text()')[0]
I also want to get the tes:type
source_type = sourceobject.xpath('//tes:actions/tes:type/text()')[0]
Doesn't work. It gives the following error -
lxml.etree.XPathEvalError: Undefined namespace prefix
How do I get it to ignore the namespace?
Alternatively, I know the namespace which is this -
<tes:action xmlns:tes="http://www.blah.com/client/servlet">
The proper way to access nodes in a namespace is by passing a prefix-to-namespace-URI mapping as an additional argument to the xpath() method, for example:
ns = {'tes': 'http://www.blah.com/client/servlet'}
source_type = sourceobject.xpath('//tes:actions/tes:type/text()', namespaces=ns)[0]
Or, another, less recommended way: literally ignore namespaces using the XPath function local-name():
source_type = sourceobject.xpath('//*[local-name()="actions"]/*[local-name()="type"]/text()')[0]
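A minimal, runnable version of both approaches with lxml. Note that the sample has to declare xmlns:tes (with the placeholder URI from the question) before it will parse at all:

```python
from lxml import etree

xml = b'''<entry xmlns:tes="http://www.blah.com/client/servlet">
  <id>2397</id>
  <tes:actions>
    <tes:name>action_alert</tes:name>
    <tes:type>2</tes:type>
  </tes:actions>
</entry>'''

root = etree.fromstring(xml)
ns = {'tes': 'http://www.blah.com/client/servlet'}

# with an explicit prefix mapping
print(root.xpath('//tes:actions/tes:type/text()', namespaces=ns)[0])  # 2

# ignoring namespaces via local-name()
print(root.xpath('//*[local-name()="type"]/text()')[0])  # 2
```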
I'm not exactly sure about the namespace thing, but I think it would be easier to use BeautifulSoup (here text is the XML string):
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')

ids = []
for tag in soup.find_all("id"):
    ids.append(tag.text)
# ids is now ['2397']

types = []
for child in soup.find_all("tes:actions"):
    for tag in child.find_all("tes:type"):
        types.append(tag.text)
# types is now ['2']
Does Beautiful Soup allow for the exclusion of html code by div (or other filters)?
I am trying to parse code that is very poorly written, where there is not an appropriate tag, id, class or anything else to key on parsing the desired content.
What I am looking for is a select or find_all of everything in an id that is not a certain class. Per the sample code below, I want everything in id='main' that is not contained in class='toc-indentation'.
Below I have main_txt and toc_txt, though my goal is to have main_txt with toc_txt parsed out.
soup = BeautifulSoup(orig_file)
title = soup.find('title')
main_txt = soup.findAll(id='main')[0]
toc_txt = soup.findAll(class_ ='toc-indentation')
I did my best to find the answer but just can't seem to locate anything that will help me.
Please let me know if you have any questions or required further info.
Any assistance will be highly appreciated.
To get all elements inside main_text except those that are inside elements with class 'toc-indentation':
def not_inside_toc(tag):
    # clear() empties a toc-indentation tag (so its descendants are never
    # visited) and returns None, which is falsy, excluding the tag itself
    return tag.get('class') != ['toc-indentation'] or tag.clear()

main_text = soup.find(id='main')
tags = main_text.find_all(not_inside_toc)
By passing a function to find_all you can make a filter doing what you want.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function
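A quick, runnable illustration of this filter on a made-up snippet. Keep in mind that tag.clear() mutates the tree, so run it on a soup you can throw away:

```python
from bs4 import BeautifulSoup

html = '''<div id="main">
  <p>keep me</p>
  <div class="toc-indentation"><p>skip me</p></div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

def not_inside_toc(tag):
    # clear() empties a toc-indentation tag (so its descendants are never
    # visited) and returns None, which is falsy, excluding the tag itself
    return tag.get('class') != ['toc-indentation'] or tag.clear()

main_text = soup.find(id='main')
tags = main_text.find_all(not_inside_toc)
print([t.name for t in tags])  # ['p']
```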
def my_filter(tag, ID, cls):
    '''
    Returns every element whose parent has id=ID and whose class != cls.
    '''
    if tag.has_attr('class') and cls not in tag['class']:
        parent = next(tag.parents)
    else:
        return False
    if parent.has_attr('id') and ID in parent['id']:
        return True
    else:
        return False

print(soup.find_all(lambda tag: my_filter(tag, 'main', 'toc-indentation')))
I am trying to use the Shrink The Web service for site thumbnails. They have an API that returns XML telling you whether the site thumbnail can be created. I am trying to use ElementTree to parse the XML, but I'm not sure how to get to the information I need. Here is an example of the XML response:
<?xml version="1.0" encoding="UTF-8"?>
<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
<stw:Response>
<stw:ThumbnailResult>
<stw:Thumbnail Exists="false"></stw:Thumbnail>
<stw:Thumbnail Verified="false">fix_and_retry</stw:Thumbnail>
</stw:ThumbnailResult>
<stw:ResponseStatus>
<stw:StatusCode>Blank Detected</stw:StatusCode>
</stw:ResponseStatus>
<stw:ResponseTimestamp>
<stw:StatusCode></stw:StatusCode>
</stw:ResponseTimestamp>
<stw:ResponseCode>
<stw:StatusCode></stw:StatusCode>
</stw:ResponseCode>
<stw:CategoryCode>
<stw:StatusCode>none</stw:StatusCode>
</stw:CategoryCode>
<stw:Quota_Remaining>
<stw:StatusCode>1</stw:StatusCode>
</stw:Quota_Remaining>
</stw:Response>
</stw:ThumbnailResponse>
I need to get the "stw:StatusCode". If I try to do a find on "stw:StatusCode" I get an "expected path separator" syntax error. Is there a way to just get the status code?
Grrr, namespaces... Try this:
STW_PREFIX = "{http://www.shrinktheweb.com/doc/stwresponse.xsd}"
(see line 2 of your sample XML)
Then when you want a tag like stw:StatusCode, use STW_PREFIX + "StatusCode"
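For example, to pull out the first StatusCode with the prefix prepended (a sketch using a copy of the sample response trimmed to the relevant part):

```python
import xml.etree.ElementTree as et

STW_PREFIX = "{http://www.shrinktheweb.com/doc/stwresponse.xsd}"

# sample response from the question, trimmed to the relevant elements
xml_response = '''<?xml version="1.0" encoding="UTF-8"?>
<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
  <stw:Response>
    <stw:ResponseStatus>
      <stw:StatusCode>Blank Detected</stw:StatusCode>
    </stw:ResponseStatus>
  </stw:Response>
</stw:ThumbnailResponse>'''

root = et.fromstring(xml_response)
# .// searches the whole tree for the namespace-qualified tag
status = root.find(".//" + STW_PREFIX + "StatusCode")
print(status.text)  # Blank Detected
```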
Update: That XML response isn't the most brilliant design. It's not possible to guess from your single example whether there can be more than one second-level node. Note that each third-level node has a "StatusCode" child. Here is some rough-and-ready code that shows you (1) why you need that STW_PREFIX caper and (2) an extract of the usable info.
import xml.etree.ElementTree as et

def showtag(elem):
    return repr(elem.tag.rsplit("}")[1])

def showtext(elem):
    return None if elem.text is None else repr(elem.text.strip())

root = et.fromstring(xml_response)  # xml_response is your input string
print(repr(root.tag))  # see exactly what tag is in the element
for child in root[0]:
    print(showtag(child), showtext(child))
    for gc in child:
        print("...", showtag(gc), showtext(gc), gc.attrib)
Result:
'{http://www.shrinktheweb.com/doc/stwresponse.xsd}ThumbnailResponse'
'ThumbnailResult' ''
... 'Thumbnail' None {'Exists': 'false'}
... 'Thumbnail' 'fix_and_retry' {'Verified': 'false'}
'ResponseStatus' ''
... 'StatusCode' 'Blank Detected' {}
'ResponseTimestamp' ''
... 'StatusCode' None {}
'ResponseCode' ''
... 'StatusCode' None {}
'CategoryCode' ''
... 'StatusCode' 'none' {}
'Quota_Remaining' ''
... 'StatusCode' '1' {}
I would like to get all the <script> tags in a document and then process each one based on the presence (or absence) of certain attributes.
E.g., for each <script> tag, if the attribute for is present do something; else if the attribute bar is present do something else.
Here is what I am doing currently:
outputDoc = BeautifulSoup(''.join(output))
scriptTags = outputDoc.findAll('script', attrs = {'for' : True})
But this way I filter only the <script> tags with the for attribute... and I lose the other ones (those without the for attribute).
If I understand correctly, you just want all the script tags, and then to check for some attributes in them?
scriptTags = outputDoc.findAll('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
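A runnable sketch of the branching described in the question (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''<script for="window">a</script>
<script bar="x">b</script>
<script>c</script>'''

soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script'):
    if script.has_attr('for'):
        print('for:', script.text)      # handle scripts with a for attribute
    elif script.has_attr('bar'):
        print('bar:', script.text)      # handle scripts with a bar attribute
    else:
        print('neither:', script.text)  # everything else
```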
You don't need any lambdas to filter by attribute, you can simply use some_attribute=True in find or find_all.
script_tags = soup.find_all('script', some_attribute=True)
# or
script_tags = soup.find_all('script', {"some-data-attribute": True})
Here are more examples with other approaches as well:
soup = bs4.BeautifulSoup(html)
# Find all with a specific attribute
tags = soup.find_all(src=True)
tags = soup.select("[src]")
# Find all meta with either name or http-equiv attribute.
soup.select("meta[name],meta[http-equiv]")
# find any tags with any name or source attribute.
soup.select("[name], [src]")
# find first/any script with a src attribute.
tag = soup.find('script', src=True)
tag = soup.select_one("script[src]")
# find all tags with a name attribute beginning with foo
# or any src beginning with /path
soup.select("[name^=foo], [src^=/path]")
# find all tags with a name attribute that contains foo
# or any src containing with whatever
soup.select("[name*=foo], [src*=whatever]")
# find all tags with a name attribute that endwith foo
# or any src that ends with whatever
soup.select("[name$=foo], [src$=whatever]")
You can also use regular expressions with find or find_all:
import re
# starting with
soup.find_all("script", src=re.compile("^whatever"))
# contains
soup.find_all("script", src=re.compile("whatever"))
# ends with
soup.find_all("script", src=re.compile("whatever$"))
For future reference, has_key has been deprecated in BeautifulSoup 4. Now you need to use has_attr:
scriptTags = outputDoc.find_all('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
If you only need to get tag(s) with attribute(s), you can use lambda:
soup = bs4.BeautifulSoup(YOUR_CONTENT)
Tags with attribute
tags = soup.find_all(lambda tag: 'src' in tag.attrs)
OR
tags = soup.find_all(lambda tag: tag.has_attr('src'))
Specific tag with attribute
tag = soup.find(lambda tag: tag.name == 'script' and 'src' in tag.attrs)
Etc ...
Thought it might be useful.
You can check whether some attribute is present:
scriptTags = outputDoc.findAll('script', some_attribute=True)
for script in scriptTags:
    do_something()
By using the pprint module you can examine the contents of an element.
from pprint import pprint
pprint(vars(element))
Using this on a bs4 element will print something similar to this:
{'attrs': {u'class': [u'pie-productname', u'size-3', u'name', u'global-name']},
'can_be_empty_element': False,
'contents': [u'\n\t\t\t\tNESNA\n\t'],
'hidden': False,
'name': u'span',
'namespace': None,
'next_element': u'\n\t\t\t\tNESNA\n\t',
'next_sibling': u'\n',
'parent': <h1 class="pie-compoundheader" itemprop="name">\n<span class="pie-description">Bedside table</span>\n<span class="pie-productname size-3 name global-name">\n\t\t\t\tNESNA\n\t</span>\n</h1>,
'parser_class': <class 'bs4.BeautifulSoup'>,
'prefix': None,
'previous_element': u'\n',
'previous_sibling': u'\n'}
To access an attribute - let's say the class list - use the following:
class_list = element.attrs.get('class', [])
You can filter elements using this approach:
for script in soup.find_all('script'):
    if script.attrs.get('for'):
        pass  # ... has a 'for' attribute
    elif "myClass" in script.attrs.get('class', []):
        pass  # ... has class "myClass"
    else:
        pass  # ... do something else
A simple way to select just what you need.
outputDoc.select("script[for]")
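For instance, on a tiny made-up document this CSS attribute selector keeps only the script tags that carry a for attribute:

```python
from bs4 import BeautifulSoup

html = '<script for="a">x</script><script>y</script>'
soup = BeautifulSoup(html, 'html.parser')

# [for] matches any element with a for attribute, regardless of its value
tags = soup.select('script[for]')
print(len(tags), tags[0]['for'])  # 1 a
```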