How to extract text from HTML (after certain string)

I have the following HTML:
<li class="group-ib medium-gap line-120 vertical-offset-10">
<i class="fa fa-angle-right font-bold font-95 text-primary text-dark">
::before
</i>
<span>
abc:
<b class="text-primary text-dark">st1</b>
</span>
</li>
And I want to extract st1, which always appears after abc. I was able to do it with this XPath:
xpath('.//b[@class = "text-primary text-dark"]')[0].text
But the solution depended on it being the first appearance of this particular class, which appears more than once and isn't always in the same order. I was wondering if there was a way to search the HTML for abc and pull the subsequent text?

Maybe find the element that contains abc, navigate to child/parent if needed, get text.
Example of selectors:
Find any element (* matches any tag) whose text contains abc and select any of its children.
//*[contains(text(), 'abc')]/*
Find any element (* matches any tag) whose text contains abc and select its b child.
//*[contains(text(), 'abc')]/b
Find the li element that has a descendant whose text contains abc and select the b element inside that li. Use // because b is not a direct child of li.
//li[.//*[contains(text(), 'abc')]]//b
If you know the text abc, start from there, check which element is returned, and navigate to a parent, ancestor, or child if needed.
For more about xpath please see w3schools xpath selectors
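As a minimal sketch of the second selector above, the following uses lxml (an assumption; the question does not say which library runs the XPath). The extra li with the label xyz and the helper name text_after_label are hypothetical, added only to show that the lookup does not depend on the b element being the first one on the page.

```python
from lxml import html

# Hypothetical fragment: the asker's <li> plus a second <li> with a different
# label, so the first <b> on the page is not the one we want.
SNIPPET = """
<ul>
  <li class="group-ib medium-gap line-120 vertical-offset-10">
    <i class="fa fa-angle-right font-bold font-95 text-primary text-dark"></i>
    <span>
      xyz:
      <b class="text-primary text-dark">other</b>
    </span>
  </li>
  <li class="group-ib medium-gap line-120 vertical-offset-10">
    <i class="fa fa-angle-right font-bold font-95 text-primary text-dark"></i>
    <span>
      abc:
      <b class="text-primary text-dark">st1</b>
    </span>
  </li>
</ul>
"""


def text_after_label(tree, label):
    """Return the text of the <b> child of the element whose text contains label."""
    # $label is an XPath variable; lxml fills it from the keyword argument.
    matches = tree.xpath("//*[contains(text(), $label)]/b/text()", label=label)
    return matches[0].strip() if matches else None


tree = html.fromstring(SNIPPET)
value = text_after_label(tree, "abc")
print(value)
```

Passing the label as an XPath variable avoids building the expression with string formatting, which matters if the label can contain quotes.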

The following XPath should give the text you are searching for:
//*[contains(text(),'abc')]/*[@class='text-primary text-dark'][1]/text()
This assumes the text you are looking for is always inside an element with class="text-primary text-dark".
It also assumes you want only the first such element under the matched node (ignoring any other text-primary text-dark elements), which is why [1] is there.
The XPath first requires that the parent node contains the text abc and only then looks at its children with those classes.

Related

Scrapy selector: get nth-child text of an element

I am using Scrapy selector to extract fields from html
xpath = /html/body/path/to/element/text()
This is similar to the question "scrapy get nth-child text of same class".
Following the documentation, we can use the .getall() method to get all matching nodes and then pick a specific one from the list.
selected_list = Selector(text=soup.prettify()).xpath(xpath).getall()
Is it possible to directly specify which nth element to select in the xpath itself?
Something like below
xpath = /html/body/path/to/element/text(2) #to select 3 child text
Example
<body>
<div>
<i class="ent_sprite remind_icon">
</i>
text that needs to be
</div>
</body>
The result of response.xpath('/body/div/text()').getall() consists of 2 elements:
'\n'
'text that needs to be'

You can use the following-sibling:: axis to select the siblings that come after a node. In this case you want the nearest text() node after the <i> tag, so you can do:
response.xpath('//i[@class="ent_sprite remind_icon"]/following-sibling::text()').get()
This gives you the text() node nearest to <i class="ent_sprite remind_icon">.
If you want the nth following sibling of a node, the XPath pattern is following-sibling::node[n], which in our case becomes:
'//i[@class="ent_sprite remind_icon"]/following-sibling::text()[n]'
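A small sketch of the positional predicate, using lxml directly instead of the Scrapy response (an assumption made so the example runs without a crawl; the XPath itself is unchanged). The br element and the second text are hypothetical additions so that [2] has something to select.

```python
from lxml import html

# The <i> from the question, followed by two text nodes. The <br> and the
# second text are hypothetical, added only so that [2] matches something.
SNIPPET = (
    '<div><i class="ent_sprite remind_icon"></i>'
    "text that needs to be<br>another text</div>"
)

tree = html.fromstring(SNIPPET)
base = '//i[@class="ent_sprite remind_icon"]/following-sibling::text()'

# Positions on the following-sibling axis count forward in document order.
first = tree.xpath(base + "[1]")[0].strip()
second = tree.xpath(base + "[2]")[0].strip()
print(first)
print(second)
```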

xpath/python search then grab child nodes?

I am working on a scraper using python and selenium and I have an issue traversing xpath. I feel like this should be simple, but I'm obviously missing something.
I am able to navigate the site I am browsing fine, but I need to grab some SPAN text based on an XPATH search.
I am able to click the appropriate radio button(in this case the 1st one)
(driver.find_elements_by_name("start-date"))[0].click()
But I also need to capture the text next to the radio button, which is contained in the span tags.
<label>
<input type="radio" name="start-date" value="1" data-start-date="/Date(1507854300000)/" data-end-date="/Date(1508200200000)/" group="15" type-id="8">
<span class="start-date">
10/12/2017<br>Summary text
</span>
</label>
In the above example, I'm looking to capture "10/12/2017" and "Summary text" into 2 string variables based on the find_elements_by_name search I used to find the radio button.
I then have a second, similar, collection issue, where I need to capture the span tags after searching by class name. This finds the appropriate parent node on the page:
(driver.find_element_by_xpath("//div[@class=\"MyClass\"]"))
Based on the node returned by that search, I want to grab "Text 1" and "Text 2" from the span tags below it.
<div class="MyClass">
<span>
<span>Text 1</span>
</span>
<span class="bullet">
</span>
<span>
<span>Text 2</span>
</span>
</div>
I am new to xpath, but from what I can gather, the span nodes I am looking for should be children of the nodes I found with my searches, and I should be able to traverse down the hierarchy somehow to get the values, I'm just not sure how.

It's actually very simple, all WebElement objects have the same find_element_by_* methods that the WebDriver object has, with the main difference that the element methods change the context to that element, meaning that it will only have children of the selected element.
With that in mind you should be able to do:
my_element = driver.find_element_by_class_name('MyClass')
my_spans = my_element.find_elements_by_css_selector('span>span')
What happens here is that we grab the first element with class MyClass, then from the context of that element we search for elements that are span AND children of a span

You can try the following XPath expressions:
//div[@class='MyClass']/span[1]/span ---- to get Text 1
//div[@class='MyClass']/span[3]/span ---- to get Text 2
or
(//div[@class='MyClass']/span/span)[1] ---- to get Text 1
(//div[@class='MyClass']/span/span)[2] ---- to get Text 2
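The expressions above can be checked offline against the markup from the question. This is only a sketch: it uses lxml on a static string rather than Selenium, on the assumption that the same HTML could be obtained from driver.page_source. The outer div wrapper is hypothetical and the data-* attributes are trimmed for brevity. It also shows one possible way to split the date and summary, since the br divides the span content into two text nodes.

```python
from lxml import html

# Markup from the question, wrapped in a hypothetical outer <div>.
# No browser is used here; with Selenium the string might come from
# driver.page_source (an assumption).
PAGE = """
<div>
<label>
<input type="radio" name="start-date" value="1" group="15" type-id="8">
<span class="start-date">
10/12/2017<br>Summary text
</span>
</label>
<div class="MyClass">
<span>
<span>Text 1</span>
</span>
<span class="bullet">
</span>
<span>
<span>Text 2</span>
</span>
</div>
</div>
"""

tree = html.fromstring(PAGE)

# The <br> splits the span's content into two separate text nodes.
parts = tree.xpath(
    '//input[@name="start-date"]'
    '/following-sibling::span[@class="start-date"]/text()'
)
date_text, summary_text = [part.strip() for part in parts]

# span[3] counts child spans of the div (the bullet span is number 2),
# while (...)[2] counts positions in the final list of nested spans.
by_child_position = tree.xpath("//div[@class='MyClass']/span[3]/span/text()")
by_result_position = tree.xpath("(//div[@class='MyClass']/span/span)[2]/text()")

print(date_text, "|", summary_text)
print(by_child_position, by_result_position)
```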

How to select a radio button followed by text with XPath?

I have a radio button with value as HTML as follows:
<div class='result'>
<span>
<input type='radio'/>
option1
</span>
<span>
<input type='radio'/>
option2
</span>
<span>
<input type='radio'/>
option3
</span>
</div>
I tried the following XPath, but this isn't working:
//span[contains(text(),'option1')]/input[@type='radio']
Please help me write XPath for this.

There are actually two text nodes in the target span: the first one, before <input>, contains only whitespace, and the second one, after <input>, contains "option1".
Your XPath //span[contains(text(),'option1')] means: return the span whose first text node contains "option1".
You can use one of the expressions below to match the required input:
//span[normalize-space()="option1"]/input[@type="radio"]
//span[contains(text()[2],'option1')]/input[@type='radio']
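To see the difference in practice, here is a short sketch that runs the failing expression and the two alternatives against the markup from the question. It assumes lxml as the XPath engine (XPath 1.0 semantics); the variable names are illustrative.

```python
from lxml import html

SNIPPET = """
<div class='result'>
<span>
<input type='radio'/>
option1
</span>
<span>
<input type='radio'/>
option2
</span>
<span>
<input type='radio'/>
option3
</span>
</div>
"""

tree = html.fromstring(SNIPPET)

# contains() converts text() to a string using only the first text node,
# which here is the whitespace before <input>, so nothing matches.
failing = tree.xpath("//span[contains(text(), 'option1')]/input[@type='radio']")

# normalize-space() works on the whole string value of the span.
by_normalized = tree.xpath("//span[normalize-space()='option1']/input[@type='radio']")

# text()[2] addresses the text node after <input> explicitly.
by_second_text = tree.xpath("//span[contains(text()[2], 'option1')]/input[@type='radio']")

print(len(failing), len(by_normalized), len(by_second_text))
```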

There are two text elements per span. One precedes the input element, and one follows it, but the first one is essentially empty.
In this code I find the input elements, then their parents, then the second text elements of those span parents.
>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.html').read())
>>> for item in selector.xpath('.//input[@type="radio"]/../text()[2]'):
... item.extract()
...
'\noption1\n'
'\noption2\n'
'\noption3\n'

Try this to select option 1:
//input[@type='radio']/preceding::span[1][contains(.,'option1')]

I guess you can't use text() here, because this function returns a sequence of the child text nodes of the current span element. There are 2 text nodes in your example:
<span>
<input type='radio'/>
option1
</span>
The 1st text node is between <span> and <input type='radio'/> and contains just a newline.
The 2nd text node is between <input type='radio'/> and </span> and contains the text option1 plus 2 newlines (at the beginning and at the end).
The contains function expects a string argument instead of a sequence. I think it will take only the first text node from the sequence, which contains just a newline.
If you need to select an input followed by some text node, you can use the following expression:
//input[@type='radio'][contains(following-sibling::text(), 'option1')]
If you need to select a span containing the text option1 and an input with @type='radio', you can try the following expression:
//span[contains(., 'option1') and input/@type='radio']
If you need to select the input instead of the span, use the following expression:
//span[contains(., 'option1')]/input[@type='radio']
I can suggest the following resources for learning more about XPath. The W3C recommendations contain a full description of XPath. If you use XPath 2.0, you can look at:
XML Path Language (XPath) 2.0
XQuery 1.0 and XPath 2.0 Functions and Operators
For XPath 3.0 look at:
XML Path Language (XPath) 3.0
XPath and XQuery Functions and Operators 3
These recommendations are long and hard to read, but they list all available axes, including following-sibling::, and describe text(), contains(), and so on.
Also there are a lot of brief XPath tutorials. For example you can look at this one.
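The following-sibling predicate from this answer can be wrapped in a small lookup by label. This is a sketch assuming lxml; the value attributes on the inputs and the helper name radio_for are hypothetical and exist only to make it visible which input was selected.

```python
from lxml import html

# The value attributes are hypothetical; they are not in the question's markup.
SNIPPET = """
<div class='result'>
<span>
<input type='radio' value='1'/>
option1
</span>
<span>
<input type='radio' value='2'/>
option2
</span>
<span>
<input type='radio' value='3'/>
option3
</span>
</div>
"""


def radio_for(tree, label):
    """Return the radio input whose next sibling text contains label, or None."""
    found = tree.xpath(
        "//input[@type='radio'][contains(following-sibling::text(), $label)]",
        label=label,
    )
    return found[0] if found else None


tree = html.fromstring(SNIPPET)
selected = radio_for(tree, "option2")
print(selected.get("value"))
```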

Get text with BeautifulSoup CSS Selector

Example HTML
<h2 id="name">
ABC
<span class="numbers">123</span>
<span class="lower">abc</span>
</h2>
I can get the numbers with something like:
soup.select('#name > span.numbers')[0].text
How do I get the text ABC using BeautifulSoup and the select function?
What about in this case?
<div id="name">
<div id="numbers">123</div>
ABC
</div>

In the first case, get the previous sibling:
soup.select_one('#name > span.numbers').previous_sibling
In the second case, get the next sibling:
soup.select_one('#name > #numbers').next_sibling
Note that I assume that it is intentional that here you have the numbers as an id value and the tag is div instead of span. Hence, I've adjusted the CSS selector.
To cover both cases, you can go to the parent of the tag and find the non-empty text node in a non-recursive mode:
parent = soup.select_one('#name > .numbers,#numbers').parent
print(parent.find(text=lambda text: text and text.strip(), recursive=False).strip())
Note the change in the selector - we are asking to match either numbers id or numbers class.
Though, I have a feeling that this universal solution would not be quite reliable because, for starters, I don't know what your real inputs could be.
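A runnable sketch of the non-recursive text lookup, applied to both HTML variants from the question. It simplifies the approach above by selecting #name directly instead of going through the parent of the numbers element, and it uses the string= argument of find (the newer name for text= in Beautiful Soup 4). The helper name own_text is hypothetical.

```python
from bs4 import BeautifulSoup

FIRST = """
<h2 id="name">
ABC
<span class="numbers">123</span>
<span class="lower">abc</span>
</h2>
"""

SECOND = """
<div id="name">
<div id="numbers">123</div>
ABC
</div>
"""


def own_text(html_doc):
    """Return the first non-blank text node that is a direct child of #name."""
    soup = BeautifulSoup(html_doc, "html.parser")
    parent = soup.select_one("#name")
    # recursive=False limits the search to direct children, so text inside
    # the nested span or div elements is never considered.
    node = parent.find(string=lambda s: s and s.strip(), recursive=False)
    return node.strip() if node else None


print(own_text(FIRST))
print(own_text(SECOND))
```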

XPATH to check on a specific text within a node

I have this as a node to parse:
<h3 class="atag">
<a href="http://www.example.com">
<span class="btag">text to be ignored</span>
</a>
<span class="ctag">text to be checked</span>
</h3>
I need to extract "http://www.example.com" but not the text "text to be ignored". I also need to check whether ctag contains "text to be checked".
I came up with this but it seems it doesn't do the job.
response.xpath("//h3/a/@*[not(self::span)]").extract()
any idea on this?

If you just need to select the href from the a tag, use @href.
To also check whether the ctag span contains some text, I think you can use something like this:
'//h3[contains(span[@class="ctag"]/text(), "text to be checked")]/a/@href'
This checks whether there is a span with "text to be checked" inside the given h3 block. If the text exists, "http://www.example.com" is returned; otherwise the result is empty.

Do you mean something like this XPath?
//h3/a[following-sibling::span[@class='ctag' and .='text to be checked']]/@href
The XPath above selects the <a> tag that is followed by a <span class="ctag"> whose value is "text to be checked", then returns the href attribute of that <a> tag.
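Both expressions can be compared on a small document. This sketch assumes lxml rather than the Scrapy response from the question; the second h3 block with http://www.example.org is hypothetical, added to show that a block whose ctag text differs is skipped.

```python
from lxml import html

# First <h3> is from the question; the second one is a hypothetical
# non-matching block.
SNIPPET = """
<div>
<h3 class="atag">
<a href="http://www.example.com">
<span class="btag">text to be ignored</span>
</a>
<span class="ctag">text to be checked</span>
</h3>
<h3 class="atag">
<a href="http://www.example.org">
<span class="btag">text to be ignored</span>
</a>
<span class="ctag">something else</span>
</h3>
</div>
"""

tree = html.fromstring(SNIPPET)

# Substring check on the ctag span, evaluated from the <h3>.
by_contains = tree.xpath(
    '//h3[contains(span[@class="ctag"]/text(), "text to be checked")]/a/@href'
)

# Exact match on the ctag span, evaluated from the <a> via following-sibling.
by_exact = tree.xpath(
    "//h3/a[following-sibling::span[@class='ctag' and .='text to be checked']]/@href"
)

print(by_contains)
print(by_exact)
```

The first form tolerates extra text around the phrase, while the second requires the span's whole string value to equal it.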
