With XPath, how do I select all the images that are not inside an <a> tag?
For example, here:
<a href='foo'> <img src='bar'/> </a>
<img src='ham' />
I should get the "ham" image as a result. To get the first image, I would use //a//img. Is there anything like //not(a)//img?
I use Python + lxml, so Python hacks are welcome if pure XPath would be too hairy.
That's easily done with
//img[not(ancestor::a)]
Read the spec on XPath axes if you want to find out about the other ones besides ancestor.
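A minimal sketch of that expression with lxml, which the question mentions. The wrapping <div> is an assumption added only so the fragment has a single root element:

```python
from lxml import html

# The two images from the question, wrapped in a <div> (an assumption,
# added so the fragment parses to one root element).
doc = html.fromstring("""
<div>
  <a href='foo'> <img src='bar'/> </a>
  <img src='ham' />
</div>
""")

# Keep only <img> elements that have no <a> among their ancestors.
outside_links = doc.xpath("//img[not(ancestor::a)]/@src")

# For comparison: images that do sit somewhere inside an <a>.
inside_links = doc.xpath("//a//img/@src")

print(outside_links, inside_links)
```

The ancestor axis checks every enclosing element, so an image nested several levels below a link is excluded as well.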
I'm working with Python Selenium, and in the following HTML structure:
<div>
<h2>Welcome</h2>
<div>
<p>some text <strong>important</strong></p>
<a>link</a>
</div>
</div>
I'd like to get the text from each descendant (h2, div, p, strong, a) of the parent div, e.g. for the <p> tag I want "some text".
I've been using the .text attribute and getting "some text important" instead. I'd like to use something similar to the BeautifulSoup attribute .string.
Edit: I need the code to work for any parent element containing descendants with more nested descendants - not just this particular HTML structure.
Thanks for your help.
Use the JavaScript executor to return the textContent of the element's first child node.
print(driver.execute_script('return arguments[0].firstChild.textContent;', driver.find_element_by_xpath("//h2[contains(.,'Welcome')]/following::div/p")))
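For the general case from the edit (any parent, arbitrarily nested descendants), here is a rough sketch of the same "own text only" idea. It swaps in lxml on the HTML string instead of Selenium, on the assumption that the markup is available as text (for example from driver.page_source); own_text is a hypothetical helper name:

```python
from lxml import html

# The structure from the question; in practice this string could come
# from driver.page_source (an assumption, no browser is used here).
page = """
<div>
<h2>Welcome</h2>
<div>
<p>some text <strong>important</strong></p>
<a>link</a>
</div>
</div>
"""
root = html.fromstring(page)


def own_text(element):
    """Join only the text nodes that are direct children of element."""
    return "".join(element.xpath("./text()")).strip()


# Walk every descendant of the parent <div> in document order.
result = [(el.tag, own_text(el)) for el in root.iterdescendants()]
for tag, text in result:
    print(tag, repr(text))
```

The inner <div> yields an empty string because it only contains whitespace between its children.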
I am using Python & Selenium to scrape the content of a certain webpage. Currently, I have the following problem: there are multiple divs with the same class name, but each div has different content. I only need the information for one particular div. In the following example, I would need the information in the first "show_result" div, since the "Important-Element" appears in its link text:
<div class="show_result">
<a href="?submitaction=showMoreid=77" title="Go-here">
<span class="new">Important-Element</span></a>
Other text, links, etc within the class...
</div>
<div class="show_result">
<a href="?submitaction=showMoreid=78" title="Go-here">
<span class="new">Not-Important-Element</span></a>
Other text, links, etc within the class...
</div>
<div class="show_result">
<a href="?submitaction=showMoreid=79" title="Go-here">
<span class="new">Not-Important-Element</span></a>
Other text, links, etc within the class...
</div>
With the following code I can get the "Important-Element" and its link:
driver.find_element_by_partial_link_text('Important-Element')
However, I also need the other information within the same "show_result" div. How can I refer to the entire div that contains the Important-Element in the link text? driver.find_elements_by_class_name('show_result') does not work, since I do not know in which of the divs the Important-Element is located.
Thanks,
Finn
Edit / Update: Oops, I found the solution on my own using XPath:
driver.find_element_by_xpath("//div[contains(@class, 'show_result') and contains(., 'Important-Element')]")
I know you've found an answer, but I believe it's wrong: it would also select the other nodes, because Important-Element is a substring of Not-Important-Element.
Maybe it works for your specific case since that's not really the text you're after. But here are a few more answers:
//div[@class='show_result' and starts-with(normalize-space(.),'Important-Element')]
//div[a/span[text()='Important-Element']]
//div[contains(a/span/text(),'Important-Element') and not(contains(a/span/text(),'Not'))]
There are more ways to write this...
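To make the substring problem concrete, here is a small sketch that runs both kinds of expression over the question's markup with lxml instead of Selenium (an assumption, so that no browser is needed). The outer wrapper <div> and the shortened "Other text" lines are also added for illustration:

```python
from lxml import html

# The three blocks from the question inside a wrapper <div> (assumption).
doc = html.fromstring("""
<div>
<div class="show_result">
<a href="?submitaction=showMoreid=77" title="Go-here">
<span class="new">Important-Element</span></a>
Other text
</div>
<div class="show_result">
<a href="?submitaction=showMoreid=78" title="Go-here">
<span class="new">Not-Important-Element</span></a>
Other text
</div>
<div class="show_result">
<a href="?submitaction=showMoreid=79" title="Go-here">
<span class="new">Not-Important-Element</span></a>
Other text
</div>
</div>
""")

# Substring test: "Important-Element" also occurs in "Not-Important-Element".
loose = doc.xpath(
    "//div[contains(@class, 'show_result') and contains(., 'Important-Element')]"
)

# Exact test on the text of the <span> nested inside the link.
exact = doc.xpath(
    "//div[@class='show_result'][a/span[text()='Important-Element']]"
)

href = exact[0].xpath("a/@href")[0]
print(len(loose), len(exact), href)
```

With Selenium, find_element returns only the first match, which is probably why the substring version appeared to work on this page.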
I am using Python along with Xpath to scrape Reddit. Currently I am working on the front page. I am trying to extract links from its front page and display their titles in the shell.
For this I am using the Scrapy framework. I am testing this in the Scrapy shell itself.
My question is this: how do I extract the text from the <a> ABC </a> element? I want the string "ABC", but I cannot find it. I have tried the following expressions:
response.xpath('//p[descendant::a[contains(@class,"title")]]/@value')
response.xpath('//p[descendant::a[contains(@class,"title")]]/@data')
response.xpath('//p[descendant::a[contains(@class,"title")]]').extract()
response.xpath('//p[descendant::a[contains(@class,"title")]]/text()')
None of them seem to work. When I use extract(), it gives me the whole element. For example, instead of giving me ABC, it gives me <a>ABC</a>.
How can I extract the text string?
If <p> and <a> are in this situation:
<p>
<something>
<a class="title">ABC</a>
</something>
</p>
This will give you "ABC":
>>> print(response.xpath('//p//a[@class="title"]/text()').extract()[0])
ABC
// is shorthand for the descendant-or-self axis. p[descendant::a] won't give you the result because the predicate only filters which <p> elements are selected; the selected node is still the <p>, not the <a>.
I only tested it with an online XPath evaluator, but it should work when you adjust it to
response.xpath('//p/descendant::a[contains(@class,"title")]/text()')
If you're evaluating //p[descendant::a[contains(@class,"title")]]/text(), the <p> (with the descendant <a>) is the current element and not the <a>.
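The difference between a predicate and a path step can be checked outside Scrapy. This sketch uses lxml's XML parser on the sample markup above (an assumption; a Scrapy response.xpath call accepts the same expressions):

```python
from lxml import etree

# Sample markup from the answer, written without whitespace so that
# <p> has no text nodes of its own.
doc = etree.fromstring(
    '<p><something><a class="title">ABC</a></something></p>'
)

# Predicate form: selects the <p>, then asks for the <p>'s own text nodes.
via_predicate = doc.xpath(
    '//p[descendant::a[contains(@class,"title")]]/text()'
)

# Path-step form: walks down to the <a> and returns its text node.
via_step = doc.xpath('//p//a[contains(@class,"title")]/text()')

print(via_predicate, via_step)
```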
Here is the HTML I'm dealing with
<a class="_54nc" href="#" role="menuitem">
<span>
<span class="_54nh">Other...</span>
</span>
</a>
I can't seem to get my XPath structured correctly to find this element with the link. There are other elements on the page with the same attributes as <a class="_54nc"> so I thought I would start with the child and then go up to the parent.
I've tried a number of variations, but I would think something like this:
crawler.get_element_by_xpath('//span[@class="_54nh"][contains(text(), "Other")]/../..')
None of the things I've tried seem to be working. Any ideas would be much appreciated.
Or, more cleanly, use //*[.='Other...']/../.. where . is the context node, so the predicate compares the element's full text with 'Other...'.
Alternatively, if you want to find the <a> tag directly, use the CSS selector [role='menuitem'], which is a better option if the role attribute is unique.
How about trying this:
crawler.get_element_by_xpath('//a[@class="_54nc"][./span/span[contains(text(), "Other")]]')
Try this:
crawler.get_element_by_xpath("//a[@class='_54nc']//span[.='Other...']")
This searches for an <a> element with the class "_54nc" and returns the <span> inside it whose text is exactly "Other...". You can replace "Other..." with other text to find the respective element(s).
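As a quick sanity check of the expressions in this thread, the sketch below evaluates them with lxml rather than with the crawler object from the question, whose API is not shown. The wrapper <div> and the second "Edit" link are assumptions, added to mimic other elements sharing the same attributes:

```python
from lxml import etree

# Two links with identical attributes; only the first has the target text.
doc = etree.fromstring(
    '<div>'
    '<a class="_54nc" href="#" role="menuitem">'
    '<span><span class="_54nh">Other...</span></span></a>'
    '<a class="_54nc" href="#" role="menuitem">'
    '<span><span class="_54nh">Edit</span></span></a>'
    '</div>'
)

# Start at the inner <span> and climb two levels to the <a>.
upward = doc.xpath('//span[@class="_54nh"][contains(text(), "Other")]/../..')

# Start at the <a> and filter on its nested <span>.
downward = doc.xpath(
    '//a[@class="_54nc"][./span/span[contains(text(), "Other")]]'
)

# contains() is case-sensitive, so a lowercase "other" matches nothing.
lowercase = doc.xpath(
    '//a[@class="_54nc"][./span/span[contains(text(), "other")]]'
)

print(upward[0].tag, upward == downward, lowercase)
```

Both directions select the same <a> here, so if the original attempt failed, the cause may lie outside the XPath itself.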
How can I get an image if the code looks like this:
<div class="galery-images">
<div class="galery-images-slide" style="width: 760px;">
<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>
I want to get 136666697057736800.jpg
I wrote:
images = soup.select("div.galery-item")
And I get a list:
[<div class="galery-item galery-item-selected" style="background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);"></div>,
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136013892671126300.jpg);" ></div>,
<div class="galery-item" style="background-image: url(/images/photo/1/20130206/30323/136666699218876700.jpg);"></div>]
I don't understand: how can I get all the images?
Use a regex or a CSS parser to extract the URL, prepend the host to the URL, and finally download the image like this:
import urllib.request  # Python 3; in Python 2 the function is urllib.urlretrieve
urllib.request.urlretrieve("https://www.google.com/images/srpr/logo11w.png", "google.png")
To make your life easier, you should use a regex:
import re

urls = []
# Compile once, outside the loop; the raw string keeps the backslashes literal.
pattern = re.compile(r'.*background-image:\s*url\((.*)\);')
for ele in soup.find_all('div', attrs={'class': 'galery-images-slide'}):
    # ele.div is the first child <div> of the slide container.
    match = pattern.match(ele.div['style'])
    if match:
        urls.append(match.group(1))
This works by finding every parent div with the class 'galery-images-slide', then taking the first child div of each one and parsing its style attribute (which contains the background-image URL) with a regex.
So, from your above example, this will output:
['/images/photo/1/20130206/30323/136666697057736800.jpg']
Now, to download the specified image, you append the site name in front of the url, and you should be able to download it.
NOTE:
This requires the regex module (re) in Python in addition to BeautifulSoup.
The regex I used is quite naive, but you can adjust it to suit your needs.
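Since the question asks for all the images, here is a minimal standard-library sketch that applies a similar regex to every style value and prepends a host. BASE_URL is a hypothetical placeholder, background_urls is an illustrative helper name, and the style strings are copied from the list in the question (with BeautifulSoup they could be collected as div['style'] for each selected div):

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://example.com"  # hypothetical host, replace with the real site

# Style attribute values copied from the question's output.
styles = [
    "background-image: url(/images/photo/1/20130206/30323/136666697057736800.jpg);",
    "background-image: url(/images/photo/1/20130206/30323/136013892671126300.jpg);",
    "background-image: url(/images/photo/1/20130206/30323/136666699218876700.jpg);",
]

# Capture whatever sits between "url(" and the next ")".
URL_IN_STYLE = re.compile(r"background-image:\s*url\(([^)]+)\)")


def background_urls(style_values, base_url):
    """Return absolute URLs for every background-image found."""
    urls = []
    for style in style_values:
        match = URL_IN_STYLE.search(style)
        if match:
            # Drop optional quotes or spaces around the path.
            path = match.group(1).strip("'\" ")
            urls.append(urljoin(base_url, path))
    return urls


full_urls = background_urls(styles, BASE_URL)
filenames = [url.rsplit("/", 1)[-1] for url in full_urls]
print(filenames)
```

Each entry in full_urls can then be passed to urlretrieve to download the file.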