XML scanning for value - python

I have an XML with the following structure that I'm getting from an API -
<entry>
<id>2397</id>
<title>action_alert</title>
<tes:actions>
<tes:name>action_alert</tes:name>
<tes:type>2</tes:type>
</tes:actions>
</entry>
I am scanning for the ID by doing the following -
sourceobject = etree.parse(urllib2.urlopen(fullsourceurl))
source_id = sourceobject.xpath('//id/text()')[0]
I also want to get the tes:type
source_type = sourceobject.xpath('//tes:actions/tes:type/text()')[0]
This doesn't work. It gives the following error -
lxml.etree.XPathEvalError: Undefined namespace prefix
How do I get it to ignore the namespace?
Alternatively, I know the namespace which is this -
<tes:action xmlns:tes="http://www.blah.com/client/servlet">

The proper way to access nodes in a namespace is to pass a prefix-to-namespace-URI mapping as an additional argument to the xpath() method, for example:
ns = {'tes' : 'http://www.blah.com/client/servlet'}
source_type = sourceobject.xpath('//tes:actions/tes:type/text()', namespaces=ns)[0]
Alternatively, though this is less recommended, you can ignore namespaces entirely by using the XPath function local-name():
source_type = sourceobject.xpath('//*[local-name()="actions"]/*[local-name()="type"]/text()')[0]
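The prefix-to-URI mapping idea can also be tried without lxml. The following is only a sketch: it swaps in the standard library's xml.etree.ElementTree instead of lxml, and it assumes the tes namespace is declared on the root element, which the question does not actually show.

```python
import xml.etree.ElementTree as ET

# Sample document. The xmlns:tes declaration on <entry> is an assumption;
# the question only shows the namespace URI on a different element.
XML = """<entry xmlns:tes="http://www.blah.com/client/servlet">
  <id>2397</id>
  <title>action_alert</title>
  <tes:actions>
    <tes:name>action_alert</tes:name>
    <tes:type>2</tes:type>
  </tes:actions>
</entry>"""

# Map the prefix used in the path expression to the namespace URI.
ns = {"tes": "http://www.blah.com/client/servlet"}

root = ET.fromstring(XML)
source_id = root.findtext("id")  # un-namespaced child of the root
source_type = root.findtext("tes:actions/tes:type", namespaces=ns)

print(source_id, source_type)
```

The prefix in the mapping only has to match the prefix used in the path expression; what matters for matching is the URI.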

I'm not exactly sure about the namespace thing, but I think it would be easier to use beautifulsoup:
(text is the XML string)
from bs4 import BeautifulSoup
soup = BeautifulSoup(text)
ids = []
get_ids = soup.find_all("id")
for tag in get_ids:
    ids.append(tag.text)
#ids is now ['2397']
types = []
get_types = soup.find_all("tes:actions")
for child in get_types:
    type_tags = child.find_all("tes:type")
    for tag in type_tags:
        types.append(tag.text)
#types is now ['2']

Related

beautifulsoup - how to get data from within self closing tag

I'm trying to use beautifulsoup to retrieve the value "XXXXX" in the self-closing HTML tag below (apologies if my terminology is incorrect).
Is this possible? All the questions I can find are about getting data from between div tags, rather than from an attribute in a self-closing tag.
<input name="nonce" type="hidden" value="XXXXX"/>
Assuming the text you need to parse is in the file variable, you can use the following code:
soup = BeautifulSoup(file, "html.parser")
X = soup.find('input').get('value')
print(X)
I don't think it should make a difference that it's a self-closing tag in this case. The same methods should still be applicable. (Any of the methods in the comments should also work as an alternative.)
nonceInp = soup.select_one('input[name="nonce"]')
# nonceInp = soup.find('input', {'name': 'nonce'})
if nonceInp:
    nonceVal = nonceInp['value']
    # nonceVal = nonceInp.attrs['value']
    # nonceVal = nonceInp.get('value')
    # nonceVal = nonceInp.attrs.get('value')
else:
    nonceVal = None  # print('could not find an input named "nonce"')
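To illustrate that a self-closing tag is handled like any other start tag, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. The class name InputValueFinder is made up for this example.

```python
from html.parser import HTMLParser


class InputValueFinder(HTMLParser):
    """Hypothetical helper: remembers the value of the first <input> with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />,
        # because the default handle_startendtag delegates to it.
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("name") == self.name and self.value is None:
            self.value = attr_map.get("value")


finder = InputValueFinder("nonce")
finder.feed('<input name="nonce" type="hidden" value="XXXXX"/>')
finder.close()

nonce_val = finder.value
print(nonce_val)
```

If no matching input is found, nonce_val stays None, mirroring the else branch above.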

Getting specific child in xml file in Python

How can I get what is in the "Code" and "ModificationDate" attributes? The picture shows what is in the "IntraModelReport" tag. I know my code prints out everything in the IntraModelReport tag, but I just want those two attributes. I should also note that I get an error message at the end that says "find() missing 1 required positional argument: 'self'".
from bs4 import BeautifulSoup
with open(r'INTERACTION CDM.cdm') as f:
    data = f.read()
#passing the stored data inside the beautiful soup parser
soup = BeautifulSoup(data, 'xml')
unquieID = soup.find('ObjectID')
print(unquieID)
#Finding all instances of a tag.
intraModelReportTag = soup.find("IntraModelReport")
print(intraModelReportTag)
tag = soup.find(attrs={"IntraModelReport" : "Code"})
output = tag['Code']
print(tag)
print(output)
<Model xmlns:a="attribute" xmlns:c="collection" xmlns:o="object">
<o:RootObject Id="o1">
<a:SessionID>00000000-0000-0000-0000-000000000000</a:SessionID>
<c:Children>
<o:Model Id="o2">
<a:ObjectID>875D4C90-849D-43C2-827A-0BE7CA7265A4</a:ObjectID>
<a:Name>INTERACTION CDM</a:Name>
<a:Code>INTERACTION_CDM</a:Code>
<a:CreationDate>1578996736</a:CreationDate>
<a:Creator>b0000001</a:Creator>
<a:ModificationDate>1582198848</a:ModificationDate>
<a:Modifier>b0000001</a:Modifier>
<a:PackageOptionsText>[FolderOptions]
[FolderOptions\Conceptual Data Objects]
GenerationCheckModel=Yes
GenerationPath=
GenerationOptions=
GenerationTasks=
GenerationTargets=
GenerationSelections=</a:PackageOptionsText>
<a:ModelOptionsText>[ModelOptions]
.....
<c:Reports>
<o:IntraModelReport Id="o76">
<a:ObjectID>72517613-3F32-4E3D-8E4A-CDD186B0CBA3</a:ObjectID>
<a:Name>INTERACTION CDM</a:Name>
<a:Code>INTERACTION_CDM</a:Code>
<a:CreationDate>1578997381</a:CreationDate>
<a:ModificationDate>1578997500</a:ModificationDate>
<a:Modifier>b0000001</a:Modifier>
<a:ReportFirstPageTitle>INTERACTION CDM</a:ReportFirstPageTitle>
<a:ReportFirstPageAuthor>b0000001</a:ReportFirstPageAuthor>
<a:ReportFirstPageDate>%DATE%</a:ReportFirstPageDate>
<a:HtmlStylesheetFile>PWI_Theme.css</a:HtmlStylesheetFile>
<a:HtmlHeaderFile>Header_PWI.htm</a:HtmlHeaderFile>
<a:HtmlFooterFile>Footer_PWI.htm</a:HtmlFooterFile>
<a:HtmlHeaderSize>54</a:HtmlHeaderSize>
<a:HtmlFooterSize>18</a:HtmlFooterSize>
<a:HtmlTOCLevel>4</a:HtmlTOCLevel>
<a:HtmlHomePageFile>Home_PWI.html</a:HtmlHomePageFile>
<a:HtmlTemplate>PWI</a:HtmlTemplate>
<a:RtfTemplate>Professional</a:RtfTemplate>
<a:RtfUseSectionHeadFoot>1</a:RtfUseSectionHeadFoot>
<c:Paragraphs>
To get the "Code" and "ModificationDate", call the tag names as follows:
...
intra_model_report_tag = soup.find("o:IntraModelReport")
print(intra_model_report_tag.find("a:Code").text)
print(intra_model_report_tag.find("a:ModificationDate").text)
Output:
INTERACTION_CDM
1578997500
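If you would rather not spell out the prefixes, one possible alternative is the `{*}` namespace wildcard in the standard library's xml.etree.ElementTree (available from Python 3.8). This is a sketch using a swapped-in parser, not BeautifulSoup, and the XML below is a trimmed-down, assumed version of the file in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the question's file; only the elements needed here are kept.
XML = """<Model xmlns:a="attribute" xmlns:c="collection" xmlns:o="object">
  <o:RootObject Id="o1">
    <c:Reports>
      <o:IntraModelReport Id="o76">
        <a:Code>INTERACTION_CDM</a:Code>
        <a:ModificationDate>1578997500</a:ModificationDate>
      </o:IntraModelReport>
    </c:Reports>
  </o:RootObject>
</Model>"""

root = ET.fromstring(XML)

# "{*}" matches the tag name in any namespace (Python 3.8+).
report = root.find(".//{*}IntraModelReport")
code = report.findtext("{*}Code")
modified = report.findtext("{*}ModificationDate")

print(code)
print(modified)
```

Searching from the report element, rather than from the root, avoids picking up the a:Code of the enclosing o:Model.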

Parsing HTML using LXML Python

I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
class SkipException(Exception):
    def __init__(self, value):
        self.value = value
try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''
if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot seem to work out how to obtain the string of text I need. I know the code I have copied here is missing some lines, but I don't fully understand how HTML or lxml works. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping, especially when probably every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/. Get the API credentials from your account and do something like this:
import requests
import json
api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
headers = {
    'app_id': '',
    'app_key': ''
}
url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
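The nested lookups above can be tried offline. The sketch below walks the same keys (results, lexicalEntries, entries, senses, short_definitions) over a hard-coded payload. The payload and the helper name first_short_definitions are invented for illustration; the shape of a real API reply is an assumption taken from the code above.

```python
import json

# Hypothetical payload that mirrors only the keys used in the snippet above.
SAMPLE_REPLY = json.loads("""
{"results": [{"lexicalEntries": [{"entries": [{"senses": [
    {"short_definitions": ["hypothetical definition text"]}
]}]}]}]}
""")


def first_short_definitions(reply_dict):
    """Return short_definitions of the first sense, or None if any level is missing."""
    node = reply_dict
    for key in ("results", "lexicalEntries", "entries", "senses"):
        items = node.get(key)
        if not items:
            return None  # key absent or list empty: stop instead of raising
        node = items[0]  # descend into the first element at this level
    return node.get("short_definitions")


print(first_short_definitions(SAMPLE_REPLY))
print(first_short_definitions({}))
```

Unlike the chained call on lexicalEntries above, this version also returns None when that key is missing, rather than failing on indexing None.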
Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen
url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string you need to format the html so you can see the structure. I used the html formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the span elements with the 'ind' class attribute.

Is it possible to pass a variable to (Beautifulsoup) soup.find()?

Hi I need to pass a variable to the soup.find() function, but it doesn't work :(
Does anyone know a solution for this?
from bs4 import BeautifulSoup
html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''
sources = {'source1': '\'p\', class_=\'findme\'',
           'source2': '\'span\', class_=\'findme2\'',
           'source3': '\'div\', class_=\'findme3\''}
test = BeautifulSoup(html)
# this works
#print(test.find('p', class_='findme'))
# >>> <p class="findme"> p-tag content</p>
# this doesn't work
tag = '\'p\' class_=\'findme\''
# a source gets passed
print(test.find(sources[source]))
# >>> None
I am trying to split it up as suggested like this:
pattern = '"p", {"class": "findme"}'
tag = pattern.split(', ')
tag1 = tag[0]
filter = tag[1]
date = test.find(tag1, filter)
I don't get errors, just None for date. The problem is probably the content of tag1 and filter. The PyCharm debugger gives me:
tag1 = '"p"'
filter = '{"class": "findme"}'
Printing them doesn't show these apostrophes. Is it possible to remove these apostrophes?
The first argument is a tag name, and your string doesn't contain just that. BeautifulSoup (or Python, generally) won't parse a string like that; it cannot guess that you put some arbitrary Python syntax in that value.
Separate out the components:
tag = 'p'
filter = {'class_': 'findme'}
test.find(tag, **filter)
Okay I got it, thanks again.
dic_date = {'source1': 'p, class:findme', other sources ...}
pattern = dic_date[source]
tag = pattern.split(', ')
if len(tag) == 2:
    att = tag[1].split(':')  # getting the attribute
    att = {att[0]: att[1]}  # building a dictionary for the attributes
    date = soup.find(tag[0], att)
else:
    date = soup.find(tag[0])  # if there is only a tag without an attribute
Well it doesn't look very nice but it's working :)
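The string-splitting approach can be wrapped in a small helper so that the parsing is separate from the lookup. This is only a sketch: the function name parse_pattern is hypothetical, and it assumes the 'tag, attr:value' pattern format used above with at most one attribute.

```python
def parse_pattern(pattern):
    """Split 'tag, attr:value' into (tag, attrs_dict); attrs is empty if no attribute is given."""
    tag, _, attr_part = pattern.partition(",")
    attrs = {}
    if attr_part.strip():
        name, _, value = attr_part.partition(":")
        attrs[name.strip()] = value.strip()
    return tag.strip(), attrs


tag, attrs = parse_pattern("p, class:findme")
print(tag, attrs)

# With BeautifulSoup, the result would then be passed on as: soup.find(tag, attrs)
```

Because the attributes come back as a dict (possibly empty), the same find call covers both the tag-only and the tag-plus-attribute case.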

Test if an attribute is present in a tag in BeautifulSoup

I would like to get all the <script> tags in a document and then process each one based on the presence (or absence) of certain attributes.
E.g., for each <script> tag, if the attribute for is present do something; else if the attribute bar is present do something else.
Here is what I am doing currently:
outputDoc = BeautifulSoup(''.join(output))
scriptTags = outputDoc.findAll('script', attrs = {'for' : True})
But this way I filter all the <script> tags with the for attribute... but I lost the other ones (those without the for attribute).
If I understand correctly, you just want all the script tags, and then to check for some attributes in them?
scriptTags = outputDoc.findAll('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
You don't need any lambdas to filter by attribute, you can simply use some_attribute=True in find or find_all.
script_tags = soup.find_all('script', some_attribute=True)
# or
script_tags = soup.find_all('script', {"some-data-attribute": True})
Here are more examples with other approaches as well:
soup = bs4.BeautifulSoup(html)
# Find all with a specific attribute
tags = soup.find_all(src=True)
tags = soup.select("[src]")
# Find all meta with either name or http-equiv attribute.
soup.select("meta[name],meta[http-equiv]")
# find any tags with any name or source attribute.
soup.select("[name], [src]")
# find first/any script with a src attribute.
tag = soup.find('script', src=True)
tag = soup.select_one("script[src]")
# find all tags with a name attribute beginning with foo
# or any src beginning with /path
soup.select('[name^=foo], [src^="/path"]')
# find all tags with a name attribute that contains foo
# or any src containing whatever
soup.select("[name*=foo], [src*=whatever]")
# find all tags with a name attribute that ends with foo
# or any src that ends with whatever
soup.select("[name$=foo], [src$=whatever]")
You can also use regular expressions with find or find_all:
import re
# starting with
soup.find_all("script", src=re.compile("^whatever"))
# contains
soup.find_all("script", src=re.compile("whatever"))
# ends with
soup.find_all("script", src=re.compile("whatever$"))
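As far as I know, Beautiful Soup tests a compiled pattern against the attribute value with the pattern's search() method, which is why the ^ and $ anchors decide between "starts with", "contains" and "ends with". The sketch below shows that behaviour with plain re on a made-up list of src values, without BeautifulSoup.

```python
import re

# Made-up src values, for illustration only.
srcs = ["whatever.js", "js/whatever/app.js", "lib/whatever"]

# search() finds a match anywhere unless the pattern is anchored.
starts = [s for s in srcs if re.compile("^whatever").search(s)]
contains = [s for s in srcs if re.compile("whatever").search(s)]
ends = [s for s in srcs if re.compile("whatever$").search(s)]

print(starts)
print(contains)
print(ends)
```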
For future reference, has_key has been deprecated in BeautifulSoup 4. Now you need to use has_attr:
scriptTags = outputDoc.find_all('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
If you only need to get tag(s) with attribute(s), you can use lambda:
soup = bs4.BeautifulSoup(YOUR_CONTENT)
Tags with attribute
tags = soup.find_all(lambda tag: 'src' in tag.attrs)
OR
tags = soup.find_all(lambda tag: tag.has_attr('src'))
Specific tag with attribute
tag = soup.find(lambda tag: tag.name == 'script' and 'src' in tag.attrs)
Etc ...
Thought it might be useful.
You can check whether some attribute is present:
scriptTags = outputDoc.findAll('script', some_attribute=True)
for script in scriptTags:
    do_something()
By using the pprint module you can examine the contents of an element.
from pprint import pprint
pprint(vars(element))
Using this on a bs4 element will print something similar to this:
{'attrs': {u'class': [u'pie-productname', u'size-3', u'name', u'global-name']},
'can_be_empty_element': False,
'contents': [u'\n\t\t\t\tNESNA\n\t'],
'hidden': False,
'name': u'span',
'namespace': None,
'next_element': u'\n\t\t\t\tNESNA\n\t',
'next_sibling': u'\n',
'parent': <h1 class="pie-compoundheader" itemprop="name">\n<span class="pie-description">Bedside table</span>\n<span class="pie-productname size-3 name global-name">\n\t\t\t\tNESNA\n\t</span>\n</h1>,
'parser_class': <class 'bs4.BeautifulSoup'>,
'prefix': None,
'previous_element': u'\n',
'previous_sibling': u'\n'}
To access an attribute - let's say the class list - use the following:
class_list = element.attrs.get('class', [])
You can filter elements using this approach:
for script in soup.find_all('script'):
    if script.attrs.get('for'):
        pass  # ... Has 'for' attr
    elif "myClass" in script.attrs.get('class', []):
        pass  # ... Has class "myClass"
    else:
        pass  # ... Do something else
A simple way to select just what you need.
outputDoc.select("script[for]")
