XPath delivering blank text - python

I am trying to pull the text out of a tag that follows an element I'm starting with. The HTML looks like this, with multiple entries of the same structure:
<h5>
Title
</h5>
<div class="author">
<p>"Author A, Author B"</p>
</div>
<div id="abstract-more#####" class="collapse">
<p>
<strong>Abstract:</strong>
"Text here..."
</p>
<p>...</p>
So once I've isolated a given title element/node (stored as 'paper'), I want to store the author and abstract text. It works when I use this to get the author:
author = paper.find_element_by_xpath("./following::div[contains(@class, 'author')]/p").text
But it returns a blank string for 'abstract' when I use this:
abstract = paper.find_element_by_xpath("./following::div[contains(@id, 'abstract-more')]/p").text
Why does it work fine for the author but not for the abstract? I've tried using .// instead of ./ and other slight tweaks, but to no avail. I also don't understand why it returns a blank string instead of raising an error saying it can't find the abstract element.

Try this:
//div[contains(@id, 'abstract-more')]/p[1]

Please use starts-with in xpath instead of contains.
XPath: .//div[starts-with(@id, 'abstract-more')]/p
abstract = paper.find_element_by_xpath(".//div[starts-with(@id, 'abstract-more')]/p").text

You can try this XPath:
//div[@class="author"]/following-sibling::div[contains(@id,'abstract-more')]/p[1]
In code:
abstract = paper.find_element_by_xpath("//div[@class='author']/following-sibling::div[contains(@id,'abstract-more')]/p[1]")
print(abstract.text)
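One possible explanation, offered as an assumption rather than a confirmed diagnosis: Selenium's .text returns only text that is rendered on screen, and a div with class "collapse" is usually hidden until it is expanded. That would explain why the element is found (so no error is raised) but its visible text is empty. A minimal sketch of a fallback that reads the DOM's textContent when .text is blank; the helper name element_text is hypothetical:

```python
def element_text(element):
    """Return an element's visible text, or its raw textContent if nothing is visible.

    `element` is expected to be a Selenium WebElement (anything exposing
    `.text` and `.get_attribute()` works). The helper name is hypothetical.
    """
    visible = element.text.strip()
    if visible:
        return visible
    # Hidden elements report "" for .text; textContent ignores visibility.
    return (element.get_attribute("textContent") or "").strip()
```

It would be called on whatever element the abstract XPath returns, for example element_text(paper.find_element_by_xpath(...)).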

Related

BeautifulSoup4: Fail to find 'a' tag with specific href value by find()

I am trying to scrape the real-time Bitcoin-HKD price from https://www.coinbase.com/pt-PT/price/ with Python 3.
The only way I found to locate it specifically in the HTML is through this a tag with href="/pt-PT/price/bitcoin":
<a href="/pt-PT/price/bitcoin" title="Visite a moeda Bitcoin" data-element-handle="asset-highlight-top-daily-volume" class="Link__A-eh4rrz-0 hfBqui AssetHighlight__StyledLink-sc-1srucyv-1 cbFcph" color="slate">
<h2 class="AssetHighlight__Title-sc-1srucyv-2 jmJxYl">Volume mais alto (24 h)</h2>
<div class="Flex-l69ttv-0 gaVUrq">
<img src="https://dynamic-assets.coinbase.com/e785e0181f1a23a30d9476038d9be91e9f6c63959b538eabbc51a1abc8898940383291eede695c3b8dfaa1829a9b57f5a2d0a16b0523580346c6b8fab67af14b/asset_icons/b57ac673f06a4b0338a596817eb0a50ce16e2059f327dc117744449a47915cb2.png" alt="Visite a moeda Bitcoin" aria-label="Visite a moeda Bitcoin" loading="lazy" class="AssetHighlight__AssetImage-sc-1srucyv-5 lcjcxh"/>
<div class="Flex-l69ttv-0 kvilOX">
<div class="Flex-l69ttv-0 gTbYCC">
<h3 class="AssetHighlight__SubTitle-sc-1srucyv-3 gdcBEE">Bitcoin</h3>
<p class="AssetHighlight__Price-sc-1srucyv-4 bUAWAG">460 728,81 HK$</p>
Here 460 728,81 HK$ is the value I want.
So I wrote the following code:
import bs4
import urllib.request as req
url = "https://www.coinbase.com/price/bitcoin/hkd"
request = req.Request(url, headers={
    "user-agent": "..."
})
with req.urlopen(request) as response:
    data = response.read().decode("utf-8")
root = bs4.BeautifulSoup(data, "html.parser")
secBitcoin = root.find('a', href="/pt-PT/price/bitcoin")
realtimeCurrency = secBitcoin.find('p')
print(realtimeCurrency.string)
However, secBitcoin is always None; nothing matches.
The find function works fine when I search for a 'div' tag with a class parameter.
I have also tried a format like
.find('a[href="/pt-PT/price/bitcoin"]')
But nothing works.
It's possible the page is loading the currency values after the initial page load. You could try hitting ctrl+s to save the full webpage and open that file instead of using requests. If that also doesn't work, then I'm not sure where the problem is.
And if that does work, then you'll probably need to use something like selenium to get what you need
href is an attribute of an element and hence I think you cannot find it that way.
def is_a_and_href_matching(element):
    # True only for <a> tags whose href equals the target path
    is_a = element.name == 'a'
    if is_a and element.has_attr('href'):
        if element['href'] == "/pt-PT/price/bitcoin":
            return True
    return False
secBitcoins = root.find_all(is_a_and_href_matching)
for secBitcoin in secBitcoins:
    p = secBitcoin.find('p')
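To tell the two explanations above apart, it may help to run the same lookup against a static copy of the markup. A minimal sketch, assuming beautifulsoup4 is installed and using a trimmed, hand-written version of the snippet from the question (the long generated class names are dropped for brevity):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: simplified by hand).
html = """
<a href="/pt-PT/price/bitcoin" title="Visite a moeda Bitcoin">
  <h3>Bitcoin</h3>
  <p>460 728,81 HK$</p>
</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword arguments to find() are treated as attribute filters.
link = soup.find("a", href="/pt-PT/price/bitcoin")
price = link.find("p").get_text(strip=True) if link is not None else None
print(price)
```

If this prints the price while the same call on the downloaded page returns None, the anchor is probably not present in the HTML the server sends, which would point to content rendered by JavaScript.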

Find html-tag with one and only one attribute with BeautifulSoup

I have a html-site that I want to scrape some data from. The html looks like this:
<p class="provice hidden-xs">
<span class="provice-mobile">NEW YORK</span>
witespace
<span class="provice-mobile" style="color: #8888 !important">UNION</span>
</p>
I just want to choose "NEW YORK", and I tried this code:
city = soup.find('span', attrs={'class':'provice-mobile'})
city.text also includes "UNION", but I just want to see the span-tag that only has the attribute:
'class': 'provice-mobile'
If I understand your question correctly, you are looking for the span tags whose only attribute is class="provice-mobile". I suggest you start by finding all the tags that have that attribute and then filter out the ones that have more than that one attribute, i.e. keep only tags with exactly one attribute.
The code to accomplish this could look like this:
results = soup.findAll('span', attrs = {'class':'provice-mobile'})
results = [tag for tag in results if len(tag.attrs) == 1]
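The two steps can be checked end to end on the markup from the question. A minimal sketch, assuming beautifulsoup4 is installed; the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

html = """
<p class="provice hidden-xs">
  <span class="provice-mobile">NEW YORK</span>
  <span class="provice-mobile" style="color: #8888 !important">UNION</span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: every span carrying the class.
candidates = soup.find_all("span", class_="provice-mobile")
# Step 2: keep only the spans that have no attribute besides class.
only_class = [tag for tag in candidates if len(tag.attrs) == 1]

cities = [tag.get_text(strip=True) for tag in only_class]
print(cities)
```

The second span is dropped because its attrs dictionary holds both class and style.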

Is there a way to find the exact path of an element in the requests module in Python?

Is there a way to select the exact "div" in a source of a Beautiful Soup object? For example, let's say we have soup like this:
<div class="dialog-shadow" id="popupMenu1" onblur="hidePopup();" onmouseout="closePopup = contextMenuInputHasFocus() ? null : setTimeout('hidePopup()',500);" onmouseover="if(closePopup!=null){clearTimeout(closePopup);closePopup=null}"></div>
<div id="popupMenu2" onblur="hidePopup();" onmouseout="closePopup = contextMenuInputHasFocus() ? null : setTimeout('hidePopup()',500);" onmouseover="if(closePopup!=null){clearTimeout(closePopup);closePopup=null}"></div>
<div class="shadow" id="popupMenu3" onblur="hidePopup3();hidePopup();" onmouseout="closePopup = setTimeout('hidePopup();', 500); closePopup3 = setTimeout('hidePopup3()',500);" onmouseover="if(closePopup!=null){clearTimeout(closePopup);closePopup=null};if(closePopup3!=null){clearTimeout(closePopup3);closePopup3=null};"></div>
<div id="container">
<div class="background-menu-dark shadow" id="navHolder">
<span class="customBranding" id="logo" onclick="loadView(V_SUMMARY);" title="Özet Görünümü"><img height="40" src="Branding/SmallBanner.jpg?ts=20140403111116"/></span>
<div id="navigation">
<ul id="navigationLargeWidth">
<li id="mainInboxLink">
And I want to find the third div whose class is "shadow" in this piece of soup. But when I do something like this, it returns None:
soup.find('div',attrs={"class":"shadow"})
I know that it should be something like "ABC-->BC-->C" if I want to find C in the soup, but is there a way that I can find C just by knowing its unique class or ID?
(soup.select("div:nth-of-type(3)") is not what I'm looking for)
I see only 2 divs with that class. However, the reason your nth-of-type could be failing is due to you not including the class. Unless there is some reason (you haven't given) as to why nth-of-type itself is not acceptable.
div.shadow:nth-of-type(3)
Without proper HTML to test with, I cannot be sure of the index or whether the content is dynamically loaded (if it comes from a web page).
If you are trying to dynamically construct the path then something like this?
For a div with a unique class
soup.select_one('div.shadow')
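Looking an element up by class or by id, with no path, can be tried on a shortened copy of the markup. A minimal sketch, assuming beautifulsoup4 is installed; the event-handler attributes were removed by hand to keep it short:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (event handlers removed by hand).
html = """
<div class="dialog-shadow" id="popupMenu1"></div>
<div id="popupMenu2"></div>
<div class="shadow" id="popupMenu3"></div>
<div id="container">
  <div class="background-menu-dark shadow" id="navHolder"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_class = soup.select_one("div.shadow")  # first div whose class list contains "shadow"
by_id = soup.find(id="popupMenu3")        # lookup by id, no path needed
all_shadow = [div["id"] for div in soup.select("div.shadow")]

print(by_class["id"], by_id["id"], all_shadow)
```

On this static copy both lookups find popupMenu3, so a None result on the real page may mean the parsed HTML differs from what the browser shows, for instance because the content is loaded dynamically.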

How to set a text in a textarea by using Mechanical Soup?

I'm learning to create an Omegle bot, but the Omegle interface was created in HTML and I don't know very much about HTML nor MechanicalSoup.
In the part where the text is inserted, the code snippet is as follows:
<td class="chatmsgcell">
<div class="chatmsgwrapper">
<textarea class="chatmsg " cols="80" rows="3"></textarea>
</div>
</td>
In the part of the button to send the text, the code snippet is:
<td class="sendbthcell">
<div class="sendbtnwrapper">
<button class="sendbtn">Send<div class="btnkbshortcut">Enter</div></button>
</div>
</td>
I want to set a text in textarea and send it via button.
Looking at some examples in HTML, I guess the correct way to set text in a textarea is as follows:
<textarea>Here's a text.</textarea>
Also, I'm new at MechanicalSoup, but I think I know how to find and set a value in an HTML code:
# example in the Twitter interface
login_form = login_page.soup.find("form", {"class": "signin"})
LOGIN = "yourlogin"
login_form.find("input", {"name": "session[username_or_email]"})["value"] = LOGIN
From what I understand, the first argument is the name of the tag and a second argument is a dictionary whose first element is the name of the attribute and the second element is the value of the attribute.
But the textarea tag doesn't have an attribute for setting text, like value="Here's a text.". What should I do to set text in a textarea using MechanicalSoup?
I know it's not the answer you expect, but reading the doc would help ;-).
The full documentation is available at:
https://mechanicalsoup.readthedocs.io/
You probably want to start with the tutorial:
https://mechanicalsoup.readthedocs.io/en/stable/tutorial.html
In short, you need to select the form you want to fill-in:
browser.select_form('form[action="/post"]')
Then, filling-in fields is as simple as
browser["custname"] = "Me"
browser["custtel"] = "00 00 0001"
browser["custemail"] = "nobody@example.com"
browser["comments"] = "This pizza looks really good :-)"

Checking for a field error using Selenium Webdriver

I've been trying to implement tests to check for field validation in forms. A check for specific field error messages was straightforward, but I've also tried a generic check to identify the parent element of a field for an error class. This however isn't working.
A field with an error has the following HTML;
<div class="field clearfix error ">
<div class="error">
<p>Please enter a value</p>
</div>
<label for="id_fromDate">
<input id="id_fromDate" type="text" value="" name="fromDate">
</div>
So to check for an error I've got the following function;
def assertValidationFail(self, field_id):
    # Checks for a div.error sibling element
    el = self.find(field_id)
    try:
        error_el = el.find_element_by_xpath('../div[@class="error"]')
    except NoSuchElementException:
        error_el = None
    self.assertIsNotNone(error_el)
So el is the input field, but then the xpath always fails. I believed that ../ went up a level in the same way that command line navigation does - is this not the case?
Misunderstood your question earlier. You may try the following logic: find the parent div, then check if it contains class error, rather than find parent div.error and check NoSuchElementException.
Because .. is the way to go upper level, ../div means parent's children div.
# untested code, only the logic
parent_div = el.find_element_by_xpath("..")  # the parent div
self.assertTrue("error" in parent_div.get_attribute("class"))
When you're using a relative xpath (based on an existing element), it needs to start with ./ like this:
el.find_element_by_xpath('./../div[@class="error"]')
Only after the ./ can you start specifying xpath nodes etc.
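The "find the parent, then inspect its class" logic from the first answer can be tried without a browser. A minimal sketch using the standard library's xml.etree.ElementTree as a stand-in for Selenium (an assumption made purely for illustration). The markup is a hand-closed, well-formed copy of the snippet in the question, the second field is hypothetical and added for contrast, and the helper name field_has_error is also hypothetical. It splits the class attribute into tokens so that a class such as "no-error" would not count as a match:

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the question's markup; the id_toDate field is a
# hypothetical error-free field added for contrast.
markup = """
<form>
  <div class="field clearfix error ">
    <div class="error"><p>Please enter a value</p></div>
    <label for="id_fromDate"></label>
    <input id="id_fromDate" type="text" value="" name="fromDate"/>
  </div>
  <div class="field clearfix">
    <label for="id_toDate"></label>
    <input id="id_toDate" type="text" value="" name="toDate"/>
  </div>
</form>
"""

root = ET.fromstring(markup)


def field_has_error(root, field_id):
    """Return True if the parent of the input with this id has the class token 'error'."""
    # '/..' steps from the matched input up to its parent element.
    parent = root.find(f".//input[@id='{field_id}']/..")
    if parent is None:
        return False
    return "error" in parent.get("class", "").split()


print(field_has_error(root, "id_fromDate"), field_has_error(root, "id_toDate"))
```

With Selenium the same token check could be applied to parent_div.get_attribute("class").split().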
