Get text with BeautifulSoup CSS Selector - python

Example HTML
<h2 id="name">
ABC
<span class="numbers">123</span>
<span class="lower">abc</span>
</h2>
I can get the numbers with something like:
soup.select('#name > span.numbers')[0].text
How do I get the text ABC using BeautifulSoup and the select function?
What about in this case?
<div id="name">
<div id="numbers">123</div>
ABC
</div>

In the first case, get the previous sibling:
soup.select_one('#name > span.numbers').previous_sibling
In the second case, get the next sibling:
soup.select_one('#name > #numbers').next_sibling
Note that I assume that it is intentional that here you have the numbers as an id value and the tag is div instead of span. Hence, I've adjusted the CSS selector.
To cover both cases, you can go to the parent of the tag and find the non-empty text node in a non-recursive mode:
parent = soup.select_one('#name > .numbers,#numbers').parent
print(parent.find(text=lambda text: text and text.strip(), recursive=False).strip())
Note the change in the selector - we are asking to match either numbers id or numbers class.
Though, I have a feeling that this universal solution would not be quite reliable because, for starters, I don't know what your real inputs could be.
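As a rough sketch of the parent-based idea above, the snippet below collects only the text nodes that sit directly inside the #name element, so it handles both example documents. It assumes bs4 with the built-in html.parser; the helper name own_text is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

FIRST = """<h2 id="name">
ABC
<span class="numbers">123</span>
<span class="lower">abc</span>
</h2>"""

SECOND = """<div id="name">
<div id="numbers">123</div>
ABC
</div>"""


def own_text(html, selector="#name"):
    """Return the text directly inside the selected tag (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    parent = soup.select_one(selector)
    # Keep only direct text-node children; text inside child tags is skipped.
    pieces = [child.strip() for child in parent.children
              if isinstance(child, NavigableString)]
    # Drop the whitespace-only nodes that separate the child tags.
    return " ".join(piece for piece in pieces if piece)


print(own_text(FIRST))
print(own_text(SECOND))
```

Iterating over .children instead of relying on sibling order means the code does not need to know whether the text comes before or after the numbers element.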

Related

Python Selenium find element with following sibling by class, id, a (href or class)

I need help with finding an exact element and click it with following-sibling based on specific id number and then classes and a (href or class).
Here is simplified code, the below example occurs many times just with different id:
<div class="class_1" id="1234567">
<div class="class_2">
<div class="class_3">
<div class="class_3.1">
<div class="class_3.2">
<div class="class_3.3">
<div class="class_3.3.1">
<div class="class_3.3.1.1">
<div class="class_3.3.1.2">
<div class="class_3.3.1.3">
...
How can I locate an element with id and class for example something like this and click on it:
driver.find_element(By.XPATH, 'class=class_1 and id="2222222" and class="event-media-icon live-icon icon-white').click()
The xpath you are looking for will look like the following:
//div[@class='class_1' and @id='1234567']//a[@data-sport='soccer']
I assume the elements between the outer div and the target a are not important, so they can be omitted.
The href value does not look unique either, so I preferred the data-sport attribute, which is more likely to be unique.
To give a more precise answer I would need to inspect the web page with dev tools.
This XPath should work too:
.//div[@class='class_1' and @id='1234567']//following-sibling::a[@data-sport='soccer']
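One way to check such an expression before running it in the browser is to evaluate it offline. The sketch below uses lxml on invented markup: the nesting, the a elements and their href values are assumptions, since the real page was not shown, and build_xpath is a hypothetical helper.

```python
from lxml import html

# Hypothetical page: two blocks that differ only by id, as in the question.
PAGE = """<html><body>
<div class="class_1" id="1234567">
  <div class="class_2"><div class="class_3">
    <a data-sport="soccer" href="/event/1">first</a>
  </div></div>
</div>
<div class="class_1" id="2222222">
  <div class="class_2"><div class="class_3">
    <a data-sport="soccer" href="/event/2">second</a>
  </div></div>
</div>
</body></html>"""


def build_xpath(element_id, sport="soccer"):
    """Build the XPath for the link inside the block with the given id."""
    return (f"//div[@class='class_1' and @id='{element_id}']"
            f"//a[@data-sport='{sport}']")


tree = html.fromstring(PAGE)
links = tree.xpath(build_xpath("2222222"))
hrefs = [link.get("href") for link in links]
print(hrefs)
```

The same string can then be passed to Selenium, for example driver.find_element(By.XPATH, build_xpath("2222222")).click().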

Scrapy selector: get nth-child text of an element

I am using Scrapy selector to extract fields from html
xpath = /html/body/path/to/element/text()
This is similar to question scrapy get nth-child text of same class
and following the documentation we can use .getall() method to get all element and select specific one from the list.
selected_list = Selector(text=soup.prettify()).xpath(xpath).getall()
Is it possible to directly specify which nth element to select in the xpath itself?
Something like below
xpath = /html/body/path/to/element/text(2) #to select 3 child text
Example
<body>
<div>
<i class="ent_sprite remind_icon">
</i>
text that needs to be
</div>
</body>
The result of response.xpath('/body/div/text()').getall() consists of two elements:
'\n'
'text that needs to be'
You can use the following-sibling:: axis to select the siblings that come after a node. In this case you want the nearest text() node after the <i> tag, so you can do:
response.xpath('//i[@class="ent_sprite remind_icon"]/following-sibling::text()').get()
This gives you the nearest text() node after <i class="ent_sprite remind_icon">, because .get() returns the first match.
If you want the nth following sibling of a node, the XPath is following-sibling::node()[n], which in our case becomes:
'//i[@class="ent_sprite remind_icon"]/following-sibling::text()[n]'
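To illustrate the positional predicate, here is a small sketch that evaluates the same kind of expression with lxml rather than a Scrapy response, so that it runs on its own. The markup with three text nodes is invented for the demonstration, and nth_text_after_icon is a hypothetical helper.

```python
from lxml import html

# Hypothetical markup: three text nodes follow the <i> tag, separated by <br>.
DOC = ('<html><body><div><i class="ent_sprite remind_icon"></i>'
       'first<br>second<br>third</div></body></html>')


def nth_text_after_icon(tree, n):
    """Return the nth text node after the icon (1-based), or None."""
    expr = ('//i[@class="ent_sprite remind_icon"]'
            f'/following-sibling::text()[{n}]')
    matches = tree.xpath(expr)
    return matches[0].strip() if matches else None


tree = html.fromstring(DOC)
print(nth_text_after_icon(tree, 1))
print(nth_text_after_icon(tree, 2))
```

XPath positions start at 1, so [2] selects the second text node after the tag, not the third.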

xpath in lxml to find id number based on href

I'm trying to rewrite someone's library to parse some xml returned with requests. However, they use lxml in a way I'm not used to. I believe it's using regular expressions to find the data, and while most of the library works, it doesn't work when the site being parsed has the file id in a list structure. Essentially I get a page back and I'm looking for an id that matches the href athlete number. So say I want to just get ids for athlete 567377.
</div>
</a></div>
<ul class='list-entries'>
<li class='entity-details feed-entry' id='Activity-123120999590'>
<div class='avatar avatar-athlete avatar-default'>
<a class='avatar-content' href='/athletes/567377' >
</a>
</div>
</li>
<li class='entity-details feed-entry' id='Activity-16784940202'>
<div class='avatar avatar-athlete avatar-default'>
<a class='avatar-content' href='/athletes/5252525'>
</a>
</div>
The code:
lst_group_activity = parser.xpath(".//li[substring(@id, 1, 8)='Activity']")
Provides all list items perfectly but for all activities. I want to only have the one related to the right athlete. The library uses the following, with an @href test, to select the right athlete.
lst_athlethe_act_in_group_activity = parser.xpath(".//li[substring(@id, 1, 8)='Activity']/*[@href='/athletes/"+athlethe_id+"']/..")
However, this never seems to work. It finds the activity but then throws them all away.
Is there a better way to get this working? Any tutorial that can point me in the right direction to correlate to the next element.
The element with the href attribute isn't an immediate child of your li element, so your xpath is failing. You're matching:
.//li/*[@href="..."]
You want:
.//li/div/a[@href="..."]
(You could match * instead of a if you think another element might contain the href attribute, and you can match against .//li//a[@href="..."] if you think the path to the a element might not always be li/div/a).
So to find the li element:
parser.xpath(".//li[substring(@id, 1, 8)='Activity']/div/a[@href='/athletes/%s']/../.." % '5252525')
But you can also write that without the ../..:
parser.xpath(".//li[substring(@id, 1, 8)='Activity' and div/a/@href='/athletes/%s']" % '5252525')
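The predicate form can be tried out on a cleaned-up copy of the markup from the question. This is only a sketch: the helper name activity_ids is hypothetical, and it passes the href through an lxml XPath variable ($href) instead of string formatting, which is a substitution for the % formatting shown above.

```python
from lxml import html

# Cleaned-up version of the feed from the question (closing tags added).
FEED = """<html><body>
<ul class='list-entries'>
<li class='entity-details feed-entry' id='Activity-123120999590'>
<div class='avatar avatar-athlete avatar-default'>
<a class='avatar-content' href='/athletes/567377'></a>
</div>
</li>
<li class='entity-details feed-entry' id='Activity-16784940202'>
<div class='avatar avatar-athlete avatar-default'>
<a class='avatar-content' href='/athletes/5252525'></a>
</div>
</li>
</ul>
</body></html>"""


def activity_ids(parser, athlete_id):
    """Return the id attributes of the activities that link to the athlete."""
    return parser.xpath(
        ".//li[substring(@id, 1, 8)='Activity' and div/a/@href=$href]/@id",
        href=f"/athletes/{athlete_id}",  # bound to $href by lxml
    )


parser = html.fromstring(FEED)
print(activity_ids(parser, "567377"))
```

Using a variable keeps quoting problems out of the expression if the id ever contains unexpected characters.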

How to extract text from HTML (after certain string)

I have the following HTML:
<li class="group-ib medium-gap line-120 vertical-offset-10">
<i class="fa fa-angle-right font-bold font-95 text-primary text-dark">
::before
</i>
<span>
abc:
<b class="text-primary text-dark">st1</b>
</span>
</li>
And I want to extract str1 which always happens after abc. I was able to do it by using the XPATH link:
xpath('.//b[@class = "text-primary text-dark"]')[0].text
But the solution depended on it being the first appearance of this particular class, which appears more than once and isn't always in the same order. I was wondering if there was a way to search the HTML for abc and pull the subsequent text?
Maybe find the element that contains abc, navigate to child/parent if needed, get text.
Example of selectors:
Find any element (* matches any tag) that contains the text abc and select any of its children.
//*[contains(text(), 'abc')]/*
Find any element (* matches any tag) that contains the text abc and select its b child.
//*[contains(text(), 'abc')]/b
Find the li element that has a descendant containing the text abc and select the b element inside that li; use // since b is not a direct child of li.
//li[.//*[contains(text(), 'abc')]]//b
If you know abc then start from there, see what element is returned and if needed to navigate to parent/ancestor/child.
For more about xpath please see w3schools xpath selectors
The following XPath should give the text you are searching for:
//*[contains(text(),'abc')]/*[@class='text-primary text-dark'][1]/text()
This assumes the str1 you are looking for is always inside an element with class="text-primary text-dark".
It also assumes that you want the first such occurrence (ignoring the other text-primary text-dark elements), which is why [1] is there.
This XPath ensures that the parent element contains the text abc before its children are checked for those classes.
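A short sketch of the label-based lookup, using lxml. The second list item with the xyz label is invented to show that the class can repeat in any order, and value_after is a hypothetical helper.

```python
from lxml import html

# Simplified markup; the "xyz" item is a made-up second entry with the same class.
SNIPPET = """<html><body><ul>
<li class="group-ib"><span>xyz: <b class="text-primary text-dark">other</b></span></li>
<li class="group-ib"><span>abc: <b class="text-primary text-dark">st1</b></span></li>
</ul></body></html>"""


def value_after(tree, label):
    """Return the text of the b element whose parent text contains the label."""
    hits = tree.xpath("//*[contains(text(), $label)]/b/text()", label=label)
    return hits[0].strip() if hits else None


tree = html.fromstring(SNIPPET)
print(value_after(tree, "abc"))
```

Note that contains(text(), ...) only looks at the first text node of each element, which is enough here because the label comes before the b tag.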

How to read a particular value from a web page in Python/Selenium

I want to read the amount value (24.40) from this HTML.
<div id="order-total" class="clear-fix" style="margin-bottom:20px;">
<h3 class="col-left">Order total</h3>
<h3 class="col-right" style="display: block;">
<span class="credit-total-to-order" data-total-to-order="24.40">$ 24.40</span>
credits
</h3>
</div>
xpath - /html/body/div/header/section/form/div[5]/h3[2]/span
css - html body.ui-lang-en div#slave-edit.string-v2 header#slave-edit-header.edit
section#order-form form#frm-order-translation div#order-total.clear-fix
h3.col-right span.credit-total-to-order
I know I should use find_element_by_class_name or find_element_by_css_selector.
But not sure what should be the argument.
How can I do it?
Why not select the value from the element and parse the string to get the answer you need? For example, you can split the string and disregard the dollar sign to return the number you need.
someString = selenium.find_element_by_css_selector(".credit-total-to-order").text
someString.split(' ')[1]
Bear in mind - this will only work for the example you have provided.
It's not necessary to use find_element_by_class_name or find_element_by_css_selector. You can achieve it with XPath like this:
driver.find_element_by_xpath("//span[@class='credit-total-to-order']").text
UPDATE:
As per your updated HTML, it looks like the style makes your element hidden. Meanwhile, I also noticed that the value you want is stored in the data-total-to-order attribute.
So you can do something like this:
driver.find_element_by_xpath("//span[@class='credit-total-to-order']").get_attribute("data-total-to-order")
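The two ways of reading the amount (visible text versus the data attribute) can be compared offline. This sketch uses lxml as a stand-in for the browser, so it does not reflect whether the element is visible; the variable names are illustrative only.

```python
from decimal import Decimal
from lxml import html

PAGE = """<html><body>
<div id="order-total" class="clear-fix" style="margin-bottom:20px;">
<h3 class="col-left">Order total</h3>
<h3 class="col-right" style="display: block;">
<span class="credit-total-to-order" data-total-to-order="24.40">$ 24.40</span>
credits
</h3>
</div>
</body></html>"""

tree = html.fromstring(PAGE)
span = tree.xpath("//span[@class='credit-total-to-order']")[0]

# Option 1: the attribute already holds the bare number.
from_attribute = Decimal(span.get("data-total-to-order"))

# Option 2: strip the currency sign from the visible text.
from_text = Decimal(span.text.replace("$", "").strip())

print(from_attribute, from_text)
```

With Selenium 4 the attribute route would look like driver.find_element(By.CSS_SELECTOR, ".credit-total-to-order").get_attribute("data-total-to-order"). Reading the attribute avoids depending on how the amount is formatted for display.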
