Extract text from div class with scrapy - python

I am using Python along with Scrapy. I want to extract the text from a div tag that is nested inside a div with a class. For example:
<div class="ld-header">
<h1>2013 Gulfstream G650ER for Sale</h1>
<div id="header-price">Price - $46,500,000</div>
</div>
I've extracted the text from the h1 tag with
result.xpath('//div[@class="ld-header"]/h1/text()').extract()
but I can't extract the price. I've tried
'price': result.xpath('//div[@class="ld-header"]/div[@id="header-price"]/text()').extract()

Because the element has an id, you do not need the complete path to it; ids are unique within a web page.
This XPath:
//div[@id="header-price"]/text()
used on the given XML will return:
'Price - $46,500,000'
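The id-based lookup can be tried without a running spider. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector (an assumption made only to keep the example self-contained; ElementTree supports just a subset of XPath, and the markup is the snippet from the question):

```python
import xml.etree.ElementTree as ET

# Markup copied from the question; it is well-formed, so ElementTree can parse it.
html = """
<div class="ld-header">
<h1>2013 Gulfstream G650ER for Sale</h1>
<div id="header-price">Price - $46,500,000</div>
</div>
"""

root = ET.fromstring(html.strip())

# ElementTree paths are relative to the parsed root, hence the leading ".".
price_div = root.find(".//div[@id='header-price']")
price_text = price_div.text
print(price_text)
```

In Scrapy itself the same idea is the //div[@id="header-price"]/text() expression shown above.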
For debugging Xpath and CSS Selectors, I always find it helpful to use an online checker (just use Google to find some suggestions).

Try this one and tell me how it goes :)
price = [x.replace('Price - ', '').replace('$', '') for x in result.xpath('//div[@id="header-price"]/text()').extract()]
This is a list comprehension over all the extracted items, where the replace() method strips out the parts you don't need.
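If the price is needed as a number rather than a string, the same replace() idea can be taken one step further. This is a sketch only: parse_price is a hypothetical helper name, and removing the thousands separators is an extra step that the answer above does not perform.

```python
def parse_price(raw):
    """Turn a string like 'Price - $46,500,000' into an integer number of dollars."""
    # Remove the label, the currency sign and the thousands separators.
    cleaned = raw.replace("Price - ", "").replace("$", "").replace(",", "").strip()
    return int(cleaned)


# The list stands in for the result of .extract() on the price XPath.
extracted = ["Price - $46,500,000"]
prices = [parse_price(x) for x in extracted]
print(prices)
```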

Related

How can I search a specific element on the page for text using XPath (Selenium)?

I can't seem to find an example of this.
What I am trying to do is search a specific div element on the page for text that has the potential to change.
So it'd be like this
<div id="coolId">
<div>This</div>
<div>Can</div>
<div>Change depending on the iteration of the page</div>
</div>
In my case, the div coolId will always be present, but the text within its inner divs and child elements will change depending on which iteration of the page is loaded. I need to search for the presence of certain terms within this coolId div only, because I know it will always be there, and I'd like to narrow the search as much as possible so as not to contaminate the results with text from other places on the page.
In my head, I sort of see it like this (using the above example):
"//div[@id='coolId', contains(text(), 'Change depending on the iteration of the page')]"
Or something to this effect.
Does anyone know how to do this?
I'm not completely sure you can write a correct XPath based on the texts of all 3 inner elements.
What you clearly can do is locate the outer div with id = coolId based on one of the inner texts that will be unique, and then extract all the inner texts from it.
the_total_text = driver.find_element_by_xpath("//div[@id='coolId' and contains(.,'Change depending on the iteration of the page')]").text
This will give you
the_total_text = This Can Change depending on the iteration of the page
You should try:
div_element_with_needed_text = driver.find_element_by_xpath("//div[@id='coolId']/div[text()[contains(.,'Change depending on the iteration of the page')]]")
Considering the HTML:
<div id="coolId">
<div>This</div>
<div>Can</div>
<div>Change depending on the iteration of the page</div>
</div>
to retrieve the variable texts with respect to the parent <div id="coolId"> you can use the following solutions:
Extracting This using xpath:
first_child_text = driver.find_element(By.XPATH, "//div[@id='coolId']//following::div[1]").text
Extracting Can using xpath:
second_child_text = driver.find_element(By.XPATH, "//div[@id='coolId']//following::div[2]").text
Extracting Change depending on the iteration of the page using xpath:
third_child_text = driver.find_element(By.XPATH, "//div[@id='coolId']//following::div[3]").text
To extract all the texts from the descendants using xpath:
all_child_text = driver.find_element(By.XPATH, "//div[@id='coolId']").text
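The "search only inside coolId" requirement can be sketched without a browser. The example below is an illustration under assumptions: it uses the standard library's ElementTree instead of Selenium, container_has_term is a hypothetical helper name, and the extra "other" div is made up to show that text elsewhere on the page is ignored.

```python
import xml.etree.ElementTree as ET

html = """
<body>
<div id="other">Change elsewhere on the page</div>
<div id="coolId">
<div>This</div>
<div>Can</div>
<div>Change depending on the iteration of the page</div>
</div>
</body>
"""


def container_has_term(root, container_id, term):
    """Return True if `term` occurs in the text of the element with the given id."""
    container = root.find(f".//*[@id='{container_id}']")
    if container is None:
        return False
    # itertext() walks the element and all of its descendants in document order.
    full_text = " ".join(t.strip() for t in container.itertext() if t.strip())
    return term in full_text


root = ET.fromstring(html.strip())
found = container_has_term(root, "coolId", "depending on the iteration")
missing = container_has_term(root, "coolId", "elsewhere")
print(found, missing)
```

With Selenium, the comparable check would be a plain substring test on the all_child_text value obtained above.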

XPATH target the text nested <a> tag inside of a <p> tag

I am trying to target the text inside a <p> tag. Some <p> tags have nested <a> tags as well, and my XPath isn't targeting the text value of those nested tags.
Link: https://help.lyft.com/hc/en-us/articles/115012925707-I-was-charged-incorrectly
Here is the XPATH I am using: //article//p/text()
Of course, I can do //article//p//text() and target the text, but that also gets other links I don't want to extract. I only want to get all the text inside of a <p> tag and, if there is any nested tag, take that value too.
How can I achieve such a result?
Thanks, everyone.
As most of the pink-colored links start with Learn, I would probably go about it this way:
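links = response.xpath('//article//p//a//text()').extract()
# Keep link texts that are neither "Learn ..." links nor the "Back to top" link.
kept = [t for t in links if not t.startswith("Learn") and t != "Back to top"]
print(response.xpath('//article//p/text()').extract() + kept)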
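Another way to keep nested link text together with its paragraph is to take the full string value of each <p>. A rough sketch follows, using the standard library's ElementTree as a stand-in for the Scrapy response; the sample markup is invented for illustration and is not taken from the linked page.

```python
import xml.etree.ElementTree as ET

# Made-up markup: one paragraph with a nested link, one without.
html = """
<article>
<p>You can dispute a charge <a href="/dispute">in the app</a> within 30 days.</p>
<p>Plain paragraph.</p>
</article>
"""

root = ET.fromstring(html.strip())

# itertext() yields the paragraph's own text and the text of nested tags in document order.
paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
print(paragraphs)
```

In Scrapy, the XPath function string(.) evaluated on each <p> selector should give a comparable result, although that has not been checked against the page in the question.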

Selenium - Extract text in div without other tags (Python)

Trying to figure out how to access the text in the screenshot below without pulling all the span tags.
Doing element = driver.find_elements_by_id('response') gives me a list, but I can't seem to dig down further to access the text I want.
I also tried this after doing some searching:
element = driver.find_element_by_xpath("//div[@id='response']/pre")
But I get the same result.
Any tips?
element.get_attribute('innerHTML')
This will give you the markup between the element's opening and closing tags.
element.text
Should give out the contents of the element without any HTML tags.
In the case of the text being in the pure div the text is not extracted using element.text
Example:
<div>the text here</div>
I recommend using a library called html2text, like this:
html2text(element.get_attribute("outerHTML"))
It will do the trick!
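If the goal is only the text that sits directly inside the element, skipping whatever is wrapped in child span tags, one option is to collect the element's own text nodes. This is a sketch under assumptions: the screenshot from the question is not available, so the markup below is invented, ElementTree stands in for the browser DOM, and own_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented markup: text mixed with <span> children inside <pre>.
html = '<div id="response"><pre>Status: <span>ignored</span>OK<span>also ignored</span> done</pre></div>'

root = ET.fromstring(html)
pre = root.find("pre")


def own_text(el):
    """Join the text nodes that are direct children of `el`, skipping child elements."""
    parts = [el.text or ""]
    # In ElementTree, text that follows a child element is stored on that child's .tail.
    parts.extend(child.tail or "" for child in el)
    return "".join(parts)


direct = own_text(pre)
print(direct)
```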

Extracting text from hyperlink using XPath

I am using Python along with Xpath to scrape Reddit. Currently I am working on the front page. I am trying to extract links from its front page and display their titles in the shell.
For this I am using the Scrapy framework. I am testing this in the Scrapy shell itself.
My question is this: how do I extract the text from the <a> ABC </a> element? I want the string "ABC". I cannot find it. I have tried the following expressions, but they do not seem to work.
response.xpath('//p[descendant::a[contains(@class,"title")]]/@value')
response.xpath('//p[descendant::a[contains(@class,"title")]]/@data')
response.xpath('//p[descendant::a[contains(@class,"title")]]').extract()
response.xpath('//p[descendant::a[contains(@class,"title")]]/text()')
None of them seem to work. When I use extract(), it gives me the whole element itself. For example, instead of giving me ABC, it will give me <a>ABC</a>.
How can I extract the text string?
If <p> and <a> are in this situation:
<p>
<something>
<a class="title">ABC</a>
</something>
</p>
This will give you "ABC":
>>> print(response.xpath('//p//a[@class="title"]/text()').extract()[0])
ABC
// is shorthand for selecting descendants. p[descendant::a] won't give you the result because the predicate only filters which <p> elements match; the expression still selects the <p>, not the <a>.
I only tested it with an online XPath evaluator, but it should work when you adjust it to
response.xpath('//p/descendant::a[contains(@class,"title")]/text()')
If you're evaluating //p[descendant::a[contains(@class,"title")]]/text(), the <p> (with the descendant <a>) is the current element and not the <a>.
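The difference between selecting the <p> and selecting the <a> can be seen in a small sketch. It uses the standard library's ElementTree rather than Scrapy (an assumption for self-containment), and because ElementTree has no contains() function, the class is matched exactly.

```python
import xml.etree.ElementTree as ET

# Markup from the answer above, wrapped in a <div> so there is a single root.
html = """
<div>
<p>
<something>
<a class="title">ABC</a>
</something>
</p>
</div>
"""

root = ET.fromstring(html.strip())

# Selecting the <p>: its own text is only the whitespace before <something>.
p = root.find(".//p")
p_own_text = (p.text or "").strip()

# Selecting the <a> below the <p>: its text is the string we are after.
a = root.find(".//p//a[@class='title']")
a_text = a.text

print(repr(p_own_text), repr(a_text))
```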

Iterating Over Elements and Sub Elements With lxml

This one is for legitimate lxml gurus. I have a web scraping application where I want to iterate over a number of div.content (content is the class) tags on a website. Once in a div.content tag, I want to see if there are any <a> tags that are the children of <h3> elements. This seems relatively simple by just trying to create a list using XPath from the div.cont tag, i.e.,
linkList = tree.xpath('div[contains(@class,"cont")]//h3//a')
The problem is, I then want to create a tuple that contains the link from the div.content box as well as the text from the paragraph element of the same div.content box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.
lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.
Edit: here's an extremely stripped down version of the HTML I want to parse:
<body>
<div class="cont">
<h1>Random Text</h1>
<p>The text I want to obtain</p>
<h3>The link I want to obtain</h3>
</div>
</body>
There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.
You could just use a less specific XPath expression:
for matchingdiv in tree.xpath('div[contains(@class,"cont")]'):
    # skip those without a h3 > a setup.
    link = matchingdiv.xpath('.//h3//a')
    if not link:
        continue
    # grab the `p` text and of course the link.
You could expand this (be ambitious) and select for the h3 > a tags, then go to the div.cont ancestor (based off XPath query with descendant and descendant text() predicates):
for matchingdiv in tree.xpath('.//h3//a/ancestor::*[self::div[contains(@class,"cont")]]'):
    # no need to skip anymore, this is a div.cont with h3 and a contained
    link = matchingdiv.xpath('.//h3//a')
    # grab the `p` text and of course the link
but since you need to then scan for the link anyway that doesn't actually buy you anything.
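The pairing step left as a comment in the loops above might look like the following. This is a sketch under stated assumptions: it uses the standard library's ElementTree instead of lxml, the <a href="/target"> inside the <h3> is invented (the stripped-down HTML in the question shows no <a>), and the second div.cont is added only to show the skip.

```python
import xml.etree.ElementTree as ET

html = """
<body>
<div class="cont">
<h1>Random Text</h1>
<p>The text I want to obtain</p>
<h3><a href="/target">The link I want to obtain</a></h3>
</div>
<div class="cont">
<p>No link in this box</p>
</div>
</body>
"""

root = ET.fromstring(html.strip())

pairs = []
for div in root.iter("div"):
    # Match on the class token, since ElementTree has no contains() function.
    if "cont" not in div.get("class", "").split():
        continue
    link = div.find(".//h3//a")
    if link is None:
        # skip boxes without an h3 > a setup
        continue
    # Pair the link target with the paragraph text from the same box.
    pairs.append((link.get("href"), div.findtext("p")))

print(pairs)
```

Because both values come from the same div element, the link and the paragraph cannot get mismatched across boxes.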
