I'm using BeautifulSoup4 with Python 2.7 to parse some XML files. The reason I'm using BS is that I know the documents will contain invalid headers, inconsistent encodings, and so on, though I don't know for certain that lxml and the like can't cope with them.
I'm trying to check if certain elements have a value so...
if soup.person.identifier.string is None:
    # reject file
Which is fine as long as the XML is:
<root>
    <person>
        <identifier></identifier>
    </person>
</root>
But if the "identifier" element is omitted entirely I get AttributeError: 'NoneType' object has no attribute 'string'.
My question is what is the neatest way to handle this? I'd prefer to avoid having to check first that the element exists before I check for a value.
There's
try:
    identifier = soup.person.identifier.string
except AttributeError:
    identifier = None
if identifier is None:
    # reject file
but that too seems a bit long winded.
If I were using lxml I'd just do
if len(root.xpath('person/identifier/text()')) == 0:
Which would handle both.
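For comparison, when a document happens to be clean enough for the stdlib ElementTree, findtext covers both cases in one call: it returns None (or a supplied default) when the path doesn't match, and an empty string when the element exists but has no text. A small sketch:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root><person><identifier>A1</identifier></person></root>"
)
# Present element: findtext returns its text
print(doc.findtext('person/identifier'))  # 'A1'
# Missing element: findtext returns None instead of raising
print(doc.findtext('person/missing'))     # None
```

So `not doc.findtext('person/identifier')` is true for both a missing and an empty identifier.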
Maybe something like:
items = [item for item in soup.find_all(name='somethingelse') if item.text == ""]
ex.
import bs4
string = """
<root>
<something>
<somethingelse></somethingelse>
<somethingelse>haha</somethingelse>
</something>
</root>
"""
soup = bs4.BeautifulSoup(string, 'lxml')
items = [item for item in soup.find_all(name='somethingelse') if item.text == ""]
output: [<somethingelse></somethingelse>]
and it won't break if it can't find one
What I ended up doing was -
def bv(value_string, local_vars):
    try:
        result = eval(value_string, globals(), local_vars)
    except AttributeError:
        result = None
    return result

bv('person.identifier.string', locals())
This works but I suspect there's a better way to do this.
Related
I am reading an XML file and converting it to a DataFrame using xmltodict and pandas.
This is what one of the elements in the file looks like:
<net>
    <ref>https://whois.arin.net/rest/v1/net/NET-66-125-37-120-1</ref>
    <endAddress>66.125.37.127</endAddress>
    <handle>NET-66-125-37-120-1</handle>
    <name>SBC066125037120020307</name>
    <netBlocks>
        <netBlock>
            <cidrLenth>29</cidrLenth>
            <endAddress>066.125.037.127</endAddress>
            <type>S</type>
            <startAddress>066.125.037.120</startAddress>
        </netBlock>
    </netBlocks>
    <pocLinks/>
    <orgHandle>C00285134</orgHandle>
    <parentNetHandle>NET-66-120-0-0-1</parentNetHandle>
    <registrationDate>2002-03-08T00:00:00-05:00</registrationDate>
    <startAddress>66.125.37.120</startAddress>
    <updateDate>2002-03-08T07:56:59-05:00</updateDate>
    <version>4</version>
</net>
Since a large number of records like this are being pulled in by an API, sometimes some <net> objects at the end of the file are only partially downloaded, e.g. a tag without its closing tag.
This is what I wrote to parse the XML:
xml_data = open('/Users/dgoswami/Downloads/net.xml', 'r').read()  # Read data
xml_data = xmltodict.parse(xml_data,
                           process_namespaces=True,
                           namespaces={'http://www.arin.net/bulkwhois/core/v1': None})
When that happens, I get an error like:
no element found: line 30574438, column 37
I want to be able to parse till the last valid <net> element.
How can that be done?
You may need to fix your XML beforehand; xmltodict has no ability to do that for you.
You can leverage lxml, as described in Python xml - handle unclosed token, to fix your XML:
from lxml import etree

def fixme(x):
    # recover=True lets lxml repair broken XML as best it can
    p = etree.fromstring(x, parser=etree.XMLParser(recover=True))
    return etree.tostring(p).decode("utf8")
fixed = fixme("""<start><net>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
</net><net>
<endAddress>66.125.37.227</endAddress>
<handle>NET-66-125-37-220-1</handle>
""")
and then use the fixed xml:
import xmltodict
print(xmltodict.parse(fixed))
to get
OrderedDict([('start',
OrderedDict([('net', [
OrderedDict([('endAddress', '66.125.37.127'), ('handle', 'NET-66-125-37-120-1')]),
OrderedDict([('endAddress', '66.125.37.227'), ('handle', 'NET-66-125-37-220-1')])
])
]))
])
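If pulling in lxml isn't an option, a cruder string-level repair (assuming, as here, that the damage is always a truncated trailing element) is to cut the text after the last complete closing tag:

```python
def trim_to_last_complete(xml_text, tag="net"):
    # Drop any partially downloaded trailing element by cutting the
    # text right after the last closing </tag>; None if none found
    closer = "</%s>" % tag
    end = xml_text.rfind(closer)
    if end == -1:
        return None
    return xml_text[:end + len(closer)]
```

You would still need to re-append the closing tag of the surrounding root element before handing the result to xmltodict.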
I need to check the existence of certain tags in an XML file before parsing it; I'm using ElementTree in Python. Reading here, I tried writing this:
tgz_xml = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=PMC8300416"
response = urllib.request.urlopen(tgz_xml).read()
tree = ET.fromstring(response)
for OA in tree.findall('OA'):
    records = OA.find('records')
    if records is None:
        print('records missing')
    else:
        print('records found')
I need to check if the "records" tag exists. I don't get an error, but this doesn't print out anything. What did I do wrong?
Thank you!
When parsing this XML document, the variable tree already points to the root element OA, so the expression tree.findall('OA') searches the children of OA, returns an empty list, and the loop body never executes. Remove that line and the code will work:
import xml.etree.ElementTree as ET
from urllib.request import urlopen

tgz_xml = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=PMC8300416"
with urlopen(tgz_xml) as conn:
    response = conn.read()

tree = ET.fromstring(response)
records = tree.find('records')
if records is None:
    print('records missing')
else:
    print('records found')
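If several tags have to be present before the file is worth parsing further, the same find-based check generalizes; the tag names in REQUIRED below are hypothetical:

```python
import xml.etree.ElementTree as ET

REQUIRED = ("records", "request")  # hypothetical required child tags

def missing_tags(xml_text, required=REQUIRED):
    # Return the names of required direct children that are absent
    root = ET.fromstring(xml_text)
    return [name for name in required if root.find(name) is None]
```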
I'm trying to create a function which counts the words in a pptx document. The problem is that I can't figure out how to find only this kind of tag:
<a:t>Some Text</a:t>
When I try print xmlTree.findall('.//a:t'), it raises
SyntaxError: prefix 'a' not found in prefix map
Do you know what to do to make it work?
This is the function:
def get_pptx_word_count(filename):
    import xml.etree.ElementTree as ET
    import zipfile
    z = zipfile.ZipFile(filename)
    i = 0
    wordcount = 0
    while True:
        i += 1
        slidename = 'slide{}.xml'.format(i)
        try:
            slide = z.read("ppt/slides/{}".format(slidename))
        except KeyError:
            break
        xmlTree = ET.fromstring(slide)
        for elem in xmlTree.iter():
            if elem.tag == 'a:t':
                pass
                # text = elem.text
                # num = len(text.split(' '))
                # wordcount += num
    return wordcount
The way to specify the namespace inside ElementTree is:
{namespace}element
So, you should change your query to:
print xmlTree.findall('.//{a}t')
Edit:
As @mxjn pointed out, if a is a prefix and not the URI, you need to insert the URI instead of a:
print xmlTree.findall('.//{http://tempuri.org/name_space_of_a}t')
or you can supply a prefix map:
prefix_map = {"a": "http://tempuri.org/name_space_of_a"}
print xmlTree.findall('.//a:t', prefix_map)
You need to tell ElementTree about your XML namespaces.
References:
Official Documentation (Python 2.7): 19.7.1.6. Parsing XML with Namespaces
Existing answer on StackOverflow: Parsing XML with namespace in Python via 'ElementTree'
Article by the author of ElementTree: ElementTree: Working with Namespaces and Qualified Names
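Putting this together for the pptx case: the a: prefix in slide XML normally maps to the DrawingML namespace URI below, though it's worth verifying against the xmlns declarations in your own slides. A sketch with an inline slide fragment:

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}

slide = ET.fromstring(
    '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"'
    ' xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
    '<a:t>Hello world</a:t></p:sld>'
)
# Collect the text of every a:t element, then count the words
texts = [el.text for el in slide.findall('.//a:t', NS)]
wordcount = sum(len(t.split()) for t in texts)
print(texts)      # ['Hello world']
print(wordcount)  # 2
```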
I'm trying to use a regex to select the value between <title> and </title>.
However, sometimes these two tags are on different lines.
As the others have stated, it's more powerful and less brittle to use a full-fledged markup parser, like HTMLParser from the stdlib or even BeautifulSoup, rather than regex. Though, since regex seems to be a requirement, maybe something like this will work:
import urllib2
import re

URL = 'http://amazon.com'
page = urllib2.urlopen(URL)
stream = page.readlines()
flag = False
for line in stream:
    if re.search("<title>", line):
        print line
        if not re.search("</title>", line):
            flag = True
    elif re.search("</title>", line):
        print line
        flag = False
    elif flag:
        print line
When it finds the <title> tag it prints the line, checks to make sure the closing tag isn't on the same line, and then continues to print lines until it finds the closing </title>.
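Alternatively, reading the whole page into one string lets a single regex handle the multi-line case; re.DOTALL makes . match newlines (a sketch, not a general HTML parser):

```python
import re

def get_title(html):
    # Non-greedy match between the tags; DOTALL lets '.' cross newlines
    m = re.search(r'<title>(.*?)</title>', html, re.DOTALL | re.IGNORECASE)
    return m.group(1).strip() if m else None
```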
If you can't use a parser, just do it by brute force. Read the HTML doc into the string doc, then:
try:
    title = doc.split('<title>')[1].split('</title>')[0]
except IndexError:
    title = None  # no title tag, handle error as you see fit
Note that if there is an opening title tag without a matching closing tag, the lookup still succeeds and simply returns everything after the opening tag. Not a likely scenario in a well-formed HTML doc, but FYI.
I have a corpus in XML in which one of the tags is named extract (<EXTRACT>), but extract is also the name of a method in BeautifulSoup. How can I get the contents of this tag? When I write entry.extract.text it returns an error, and when I use entry.extract, the entire contents are extracted.
From what I know about BeautifulSoup, it performs case folding of tags. If there is some way to overcome this, it might also be helpful to me.
NB:
For the time being I resolved the issue with the following method:
extra = entry.find('extract')
absts.write(str(extra.text))
But I would like to know if there is any way to use it the way we use other tags, like entry.tagName.
According to the BS source code, tag.tagname actually calls tag.find("tagname") under the hood. Here's what the __getattr__() method of the Tag class looks like:
def __getattr__(self, tag):
    if len(tag) > 3 and tag.endswith('Tag'):
        # BS3: soup.aTag -> soup.find("a")
        tag_name = tag[:-3]
        warnings.warn(
            '.%sTag is deprecated, use .find("%s") instead.' % (
                tag_name, tag_name))
        return self.find(tag_name)
    # We special case contents to avoid recursion.
    elif not tag.startswith("__") and not tag == "contents":
        return self.find(tag)
    raise AttributeError(
        "'%s' object has no attribute '%s'" % (self.__class__, tag))
Note that it is built entirely on find(), so it's perfectly fine to use tag.find("extract") in your case:
from bs4 import BeautifulSoup
data = """<test><EXTRACT>extract text</EXTRACT></test>"""
soup = BeautifulSoup(data, 'html.parser')
test = soup.find('test')
print test.find("extract").text # prints 'extract text'
Also, you can use test.extractTag.text, but it is deprecated and I wouldn't recommend it.
Hope that helps.
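As a side note, if the case folding itself is the nuisance, a case-preserving XML parser such as the stdlib ElementTree sidesteps the whole name clash, since it never exposes tags as attributes:

```python
import xml.etree.ElementTree as ET

data = "<test><EXTRACT>extract text</EXTRACT></test>"
root = ET.fromstring(data)
# ElementTree keeps tag case intact, and lookup is by string, so the
# BeautifulSoup method name 'extract' never gets in the way
print(root.find("EXTRACT").text)  # extract text
```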