Scrapy: getting an item using XPath

I am trying to extract the text "This is it" from the following HTML:
<span id="theId" class="theClass1 nUl" title="This is it" >
I am trying this:
response.xpath('//span[@class="theClass1 nUl"]')
But I do not know how to get what is inside title.
How can I do that?

Select the title attribute with @title and call extract_first() to get its value as a string:
response.xpath('//span[@class="theClass1 nUl"]/@title').extract_first()
Output is:
This is it
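
As a quick offline check of that lookup, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of a Scrapy response. ElementTree supports only a subset of XPath and has no trailing /@title step, so the attribute is read with .get() instead. The wrapping <div> and the closing </span> are assumptions added so the snippet parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the span from the question, closed and wrapped in a <div>
# so that it parses as well-formed XML.
html = '<div><span id="theId" class="theClass1 nUl" title="This is it"></span></div>'
root = ET.fromstring(html)

# Same predicate as in the Scrapy expression: match on the full class string.
span = root.find('.//span[@class="theClass1 nUl"]')

# ElementTree has no "/@title" step, so read the attribute from the element.
title = span.get("title")
print(title)
```

In Scrapy itself the full expression, including /@title, goes straight into response.xpath().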

Related

How to get the value of href using text

I have the following text: 'abcdef'.
I want to use this text to find the link and get its href. How can I do it?
<span class="thumb_link">
<a class="link_txt" href="/14803/tool/4554"> abcdef</a>
</span>
I tried the following, but it failed:
driver.find_element_by_css_selector('a[text()="abcdef"]')
The text is the only thing I can match on, so the element has to be found by its text.

Matching by link text would do that. If several links have similar text, another way is XPath; normalize-space() trims the leading space in the link text so that the comparison matches:
driver.find_element_by_xpath("//a[normalize-space()='abcdef']").get_attribute("href")

Selenium's link text allows you to do that.
driver.find_element_by_link_text("abcdef").get_attribute("href")
will return the href.
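
The anchor in the question has a leading space in its text, so an exact string comparison can miss it. The sketch below illustrates a trimmed comparison offline with xml.etree.ElementTree rather than a browser; it is not Selenium code, and the helper name href_for_link_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question; note the space before "abcdef".
html = """
<span class="thumb_link">
<a class="link_txt" href="/14803/tool/4554"> abcdef</a>
</span>
"""


def href_for_link_text(root, wanted):
    """Return the href of the first <a> whose trimmed text equals `wanted`."""
    for a in root.iter("a"):
        if (a.text or "").strip() == wanted:
            return a.get("href")
    return None


root = ET.fromstring(html)
href = href_for_link_text(root, "abcdef")
print(href)
```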

Get text of HTML inline element class with python selenium chromedriver

I want to get the text of an inline element on a webpage.
The element looks like:
<span id="some_text" class="labelRight">SOME_TEXT</span>
I want to get the SOME_TEXT part.
I already tried:
driver.find_element_by_id('').get_text
driver.find_element_by_id('').get_attribute("class")

The right way is to find the element by its id and read the .text property:
driver.find_element_by_id('ctl00_contentBody_lblCount').text

Extract text from div class with scrapy

I am using Python with Scrapy. I want to extract the text from a div that is nested inside another div. For example:
<div class="ld-header">
<h1>2013 Gulfstream G650ER for Sale</h1>
<div id="header-price">Price - $46,500,000</div>
</div>
I've extracted the text from the h1 tag:
result.xpath('//div[@class="ld-header"]/h1/text()').extract()
but I can't extract the price. I've tried:
'price': result.xpath('//div[@class="ld-header"]/div[@id="header-price"]/text()').extract()

As the element has an id, you do not need the complete path to it. Ids are meant to be unique within a page.
This XPath:
//div[@id="header-price"]/text()
used on the given HTML will return:
'Price - $46,500,000'
For debugging XPath and CSS selectors, I always find it helpful to use an online checker (just use Google to find some suggestions).

Try this one and tell me how it goes :)
price = [x.replace('Price - ', '').replace('$', '') for x in result.xpath('//div[@id="header-price"]/text()').extract()]
This is a list comprehension over all the items in the extraction, where the replace() method strips the parts you don't need.
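
The cleanup step can be tried without a spider. In this minimal sketch the list extracted stands in for what extract() would return for the header-price div (an assumption based on the HTML in the question), and the integer conversion at the end is an extra step that is not part of the answer.

```python
# Stand-in for the result of .extract() on the header-price div.
extracted = ['Price - $46,500,000']

# Strip the label and the currency sign from every extracted string.
cleaned = [x.replace('Price - ', '').replace('$', '') for x in extracted]

# Optional extra step (not in the answer): drop the thousands separators
# so the value can be stored as an integer.
amounts = [int(x.replace(',', '')) for x in cleaned]

print(cleaned)
print(amounts)
```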

Selenium - Extract text in div without other tags (Python)

Trying to figure out how to access the text in the screenshot below without pulling all the span tags.
Doing element = driver.find_elements_by_id('response') gives me a list, but I can't seem to dig down further to access the text I want.
I also tried this after doing some searching:
element = driver.find_element_by_xpath("//div[@id='response']/pre")
But I get the same result.
Any tips?

element.get_attribute('innerHTML')
This returns the HTML between the element's opening and closing tags, including any nested tags.

element.text
Should give out the contents of the element without any HTML tags.

If the text sits directly inside a plain div, element.text may not extract it.
Example:
<div>the text here</div>
In that case I recommend the html2text library (from html2text import html2text) and then:
html2text(element.get_attribute("outerHTML"))
It will do the trick!

Extracting text from hyperlink using XPath

I am using Python along with Xpath to scrape Reddit. Currently I am working on the front page. I am trying to extract links from its front page and display their titles in the shell.
For this I am using the Scrapy framework. I am testing this in the Scrapy shell itself.
My question is this: how do I extract the text from the <a> ABC </a> element? I want the string "ABC". I have tried the following expressions, but none of them work.
response.xpath('//p[descendant::a[contains(@class,"title")]]/@value')
response.xpath('//p[descendant::a[contains(@class,"title")]]/@data')
response.xpath('//p[descendant::a[contains(@class,"title")]]').extract()
response.xpath('//p[descendant::a[contains(@class,"title")]]/text()')
When I use extract(), it gives me the whole element. For example, instead of giving me ABC, it gives me <a>ABC</a>.
How can I extract the text string?

If <p> and <a> are in this situation:
<p>
<something>
<a class="title">ABC</a>
</something>
</p>
This will give you "ABC":
>>> print(response.xpath('//p//a[@class="title"]/text()').extract()[0])
ABC
// is shorthand for stepping through all descendants. p[descendant::a] won't give you the result because the predicate only filters which <p> elements match; the selected node is still the <p>, not the <a>.

I only tested it with an online XPath evaluator, but it should work when you adjust it to:
response.xpath('//p/descendant::a[contains(@class,"title")]/text()')
If you're evaluating //p[descendant::a[contains(@class,"title")]]/text(), the <p> (with the descendant <a>) is the current element and not the <a>.
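
The difference between selecting the <p> and selecting the <a> below it can be seen offline. This is a minimal sketch using xml.etree.ElementTree as a stand-in for response.xpath; ElementTree has no contains() or text(), so it matches the class exactly and reads .text instead, and the wrapping <div> is an assumption added to give the fragment a root.

```python
import xml.etree.ElementTree as ET

# Markup from the answer above, wrapped in an assumed <div> root.
html = """
<div>
<p>
<something>
<a class="title">ABC</a>
</something>
</p>
</div>
"""
root = ET.fromstring(html)

# Selecting the <p>: its own text is only the whitespace before <something>.
p = root.find(".//p")
p_text = (p.text or "").strip()

# Selecting the <a> below the <p>: this is where "ABC" lives.
a = root.find('.//p//a[@class="title"]')
a_text = a.text

print(repr(p_text))
print(a_text)
```

The <p> has no text of its own, which is why asking for text() on it yields nothing useful while the <a> yields the string.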
