How to skip over child element with Scrapy - python

I'm looking to scrape just the job description from this page: https://www.aha.io/company/careers/current-openings/customer_success_specialist_project_management_us
I'd like to get all of the text and HTML inside the div with the class of "container py2 content job", EXCEPT the button. It's in an <a> tag with the class of "btn btn-large btn-secondary".
I've got two different xpath selectors that I thought should work, but don't. The first doesn't exclude the button and the second gets rid of all of the other HTML, which I'd like to keep.
response.xpath('//div[@class="container py2 content job"][not(parent::a/@class="btn btn-large btn-secondary")]').extract()
response.xpath('//div[@class="container py2 content job"]/descendant::text()[not(parent::a/@class="btn btn-large btn-secondary")]').extract()
Neither is scraping all of the HTML in the div minus what's inside the a tag. I'm hoping there's something simple that I'm missing, but I can't find what I'm looking for in the documentation.

job_html = response.css('div.content *').extract()
job_html = [x for x in job_html if "Apply now" not in x]
print(job_html)

Related

Checking If Attribute Exists In Any Of Html Tags Selenium Python

I want to check whether any HTML tags have a style attribute, such as <a style=...>, <h1 style=...>, or <div style=...>.
I used the code below, but it does not run:
driver = webdriver.Chrome(web_driver_address, options=op)
driver.get(url)
elems = driver.find_elements_by_xpath("[@style]")
How can I fix this?
Your XPath is missing the element tag name.
In your case it can be any tag, but the syntax still requires one, so use * to match any element.
You are also missing the leading //, which means the element can be anywhere on the page.
So the correct XPath expression will be something like this:
elems = driver.find_elements_by_xpath("//*[@style]")
Don't forget to add a wait so the page can load all of its elements before you query them.
An XPath needs a tag to be valid. If you don't want a specific tag, use *:
find_elements_by_xpath("//*[@style]")
Or with a CSS selector:
find_elements_by_css_selector("[style]")
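
To see what //*[@style] matches without launching a browser, the expression can be tried on a small document with lxml. This is only a sketch: the markup is invented, and lxml stands in for Selenium here, although the XPath itself is the same.

```python
import lxml.html

# Invented markup: two elements carry a style attribute, one does not.
html = (
    '<div>'
    '<a style="color:red" href="#">x</a>'
    '<h1>Title</h1>'
    '<p style="margin:0">y</p>'
    '</div>'
)

doc = lxml.html.fromstring(html)

# "//" searches the whole document, "*" matches any tag,
# and [@style] keeps only elements that have a style attribute.
styled = doc.xpath("//*[@style]")
tags = [el.tag for el in styled]
has_style = bool(styled)
print(tags, has_style)
```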

Can't grab next sibling using css selector within scrapy

I'm trying to fetch the budget using scrapy implementing css selector within it. I can get it when I use xpath but in case of css selector I'm lost. I can even get the content when I go for BeautifulSoup and use next_sibling.
I've tried with:
import requests
from scrapy import Selector
url = "https://www.imdb.com/title/tt0111161/"
res = requests.get(url)
sel = Selector(res)
# budget = sel.xpath("//h4[contains(.,'Budget:')]/following::text()").get()
# print(budget)
budget = sel.css("h4:contains('Budget:')::text").get()
print(budget)
Output I'm getting using css selector:
Budget:
Expected output:
$25,000,000
Relevant portion of html:
<div class="txt-block">
<h4 class="inline">Budget:</h4>$25,000,000
<span class="attribute">(estimated)</span>
</div>
How can I get the budgetary information using css selector when it is used within scrapy?
The selector .css("h4:contains('Budget:')::text") selects the h4 tag, but the text you want belongs to its parent, the div element.
You could use .css('div.txt-block::text'), but this would return several elements, because the page has several blocks like that. CSS selectors don't have a parent selector. You could use .css('div.txt-block:nth-child(12)::text'), but if you are going to scrape more pages, this will probably fail on other pages.
The best option would be to use XPath:
response.xpath('//h4[text() = "Budget:"]/parent::div/text()').getall()
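
If only the value right after the heading is wanted, the following-sibling axis can narrow the result to a single text node. The sketch below runs the expression with lxml on the HTML fragment from the question. Using lxml rather than Scrapy is a substitution made here for a self-contained example, and the same XPath should be usable in sel.xpath(...).get().

```python
import lxml.html

# The relevant fragment from the question.
html = (
    '<div class="txt-block">'
    '<h4 class="inline">Budget:</h4>$25,000,000\n'
    '<span class="attribute">(estimated)</span>'
    '</div>'
)

doc = lxml.html.fromstring(html)

# Take the first text node that follows the h4 inside the same parent.
nodes = doc.xpath('//h4[text()="Budget:"]/following-sibling::text()[1]')
budget = nodes[0].strip() if nodes else None
print(budget)
```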

Getting href value of a tag of selenium web element

I want to get the url of the link of tag. I have attached the class of the element to type selenium.webdriver.remote.webelement.WebElement in python:
elem = driver.find_elements_by_class_name("_5cq3")
and the html is:
<div class="_5cq3" data-ft="{"tn":"E"}">
<a class="_4-eo" href="/9gag/photos/a.109041001839.105995.21785951839/10153954245456840/?type=1" rel="theater" ajaxify="/9gag/photos/a.109041001839.105995.21785951839/10153954245456840/?type=1&src=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fhphotos-xfp1%2Ft31.0-8%2F11894571_10153954245456840_9038620401603938613_o.jpg&smallsrc=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fhphotos-prn2%2Fv%2Ft1.0-9%2F11903991_10153954245456840_9038620401603938613_n.jpg%3Foh%3D0c837ce6b0498cd833f83cfbaeb577e7%26oe%3D567D8819&size=651%2C1000&fbid=10153954245456840&player_origin=profile" style="width:256px;">
<div class="uiScaledImageContainer _4-ep" style="width:256px;height:394px;" id="u_jsonp_2_r">
<img class="scaledImageFitWidth img" src="https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/v/t1.0-0/s526x395/11903991_10153954245456840_9038620401603938613_n.jpg?oh=15f59e964665efe28943d12bd00cefd9&oe=5667BDBA&__gda__=1448928574_a7c6da855842af4c152c2fdf8096e1ef" alt="9GAG's photo." width="256" height="395">
</div>
</a>
</div>
I want the href value of the a tag falling inside the class _5cq3.
Why not do it directly?
url = driver.find_element_by_class_name("_4-eo").get_attribute("href")
And if you need the div element first you can do it this way:
divElement = driver.find_element_by_class_name("_5cq3")
url = divElement.find_element_by_class_name("_4-eo").get_attribute("href")
or another way via XPath (given that there is only one link element inside your _5cq3 element):
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a").get_attribute("href")
You can use XPath for the same thing.
If you want the href of the a tag (the 2nd line of your HTML), use:
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a[@class='_4-eo']").get_attribute("href")
If you want the image URL from the img tag (the 4th line of your HTML), read its src attribute, since an img has no href:
url = driver.find_element_by_xpath("//div[@class='_5cq3']/a/div/img[@class='scaledImageFitWidth img']").get_attribute("src")
Use:
1) XPath to locate the links first, then read href from each match:
x = '//a[@class="_4-eo"]'
for link in driver.find_elements_by_xpath(x):
    print(link.get_attribute("href"))
2) Use @drkthng's solution (the simplest).
3) You can use:
parentElement = driver.find_element_by_class_name("_5cq3")
elementList = parentElement.find_elements_by_tag_name("a")
You can use whichever you prefer in Selenium; there are 2-3 more ways to find the same element.
And for the image src use the XPath below:
img_path = '//div[@class="uiScaledImageContainer _4-ep"]//img[@src]'
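
The XPath expressions in these answers can be checked offline against a cut-down copy of the markup. In the sketch below the href and src values are shortened placeholders, not the real URLs, and lxml is used instead of Selenium so that no browser is needed.

```python
import lxml.html

# Simplified version of the question's HTML; URLs are placeholders.
html = (
    '<div class="_5cq3">'
    '<a class="_4-eo" href="/9gag/photos/example/?type=1">'
    '<div class="uiScaledImageContainer _4-ep">'
    '<img class="scaledImageFitWidth img" src="https://example.com/photo.jpg">'
    '</div>'
    '</a>'
    '</div>'
)

doc = lxml.html.fromstring(html)

# href of the link that is a direct child of the _5cq3 div.
hrefs = doc.xpath("//div[@class='_5cq3']/a/@href")

# src of any image nested anywhere inside the same div.
srcs = doc.xpath("//div[@class='_5cq3']//img/@src")

print(hrefs, srcs)
```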

How can I get text of an element in Selenium WebDriver, without including child element text?

Consider:
<div id="a">This is some
<div id="b">text</div>
</div>
Getting "This is some" is nontrivial. For instance, this returns "This is some text":
driver.find_element_by_id('a').text
How does one, in a general way, get the text of a specific element without including the text of its children?
Here's a general solution:
def get_text_excluding_children(driver, element):
    return driver.execute_script("""
    return jQuery(arguments[0]).contents().filter(function() {
        return this.nodeType == Node.TEXT_NODE;
    }).text();
    """, element)
The element passed to the function can be something obtained from the find_element...() methods (i.e., it can be a WebElement object).
Or if you don't have jQuery or don't want to use it, you can replace the body of the function above with this:
    return driver.execute_script("""
    var parent = arguments[0];
    var child = parent.firstChild;
    var ret = "";
    while (child) {
        if (child.nodeType === Node.TEXT_NODE)
            ret += child.textContent;
        child = child.nextSibling;
    }
    return ret;
    """, element)
I'm actually using this code in a test suite.
In the HTML which you have shared:
<div id="a">This is some
<div id="b">text</div>
</div>
The text This is some is within a text node. To depict the text node in a structured way:
<div id="a">
This is some
<div id="b">text</div>
</div>
This use case
To extract and print the text This is some from the text node using Selenium's python client, you have two ways as follows:
Using splitlines(): You can identify the parent element i.e. <div id="a">, extract the innerHTML and then use splitlines() as follows:
using xpath:
print(driver.find_element_by_xpath("//div[@id='a']").get_attribute("innerHTML").splitlines()[0])
using css_selector:
print(driver.find_element_by_css_selector("div#a").get_attribute("innerHTML").splitlines()[0])
Using execute_script(): You can also use the execute_script() method which can synchronously execute JavaScript in the current window/frame as follows:
using xpath and firstChild:
parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].firstChild.textContent;', parent_element).strip())
using xpath and childNodes[n]:
parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].childNodes[1].textContent;', parent_element).strip())
Use:
def get_true_text(tag):
    children = tag.find_elements_by_xpath('*')
    original_text = tag.text
    for child in children:
        original_text = original_text.replace(child.text, '', 1)
    return original_text
You don't have to do a replace. You can get the length of the children text, subtract that from the overall length, and slice into the original text. That should be substantially faster.
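
A minimal sketch of that slicing idea, written on plain strings so it can be tried without a browser. The function name own_text_by_slicing is made up for illustration, and the approach assumes all of the children's text sits at the end of the parent's text with nothing interleaved, which holds for the example in the question but not in general.

```python
def own_text_by_slicing(parent_text, child_texts):
    """Return the parent's text with the trailing children text cut off.

    Assumes the children's text forms the tail of parent_text.
    """
    children_length = sum(len(text) for text in child_texts)
    return parent_text[:len(parent_text) - children_length].strip()


# Values similar to what element.text might return for the question's HTML.
parent_text = "This is some\ntext"
child_texts = ["text"]

result = own_text_by_slicing(parent_text, child_texts)
print(result)
```

With Selenium, parent_text would come from tag.text and child_texts from the text of each child element.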
Unfortunately, Selenium was only built to work with Elements, not Text nodes.
If you try to use a function like find_element_by_xpath to target the text nodes, Selenium will throw an InvalidSelectorException.
One workaround is to grab the relevant HTML with Selenium and then use an HTML parsing library like Beautiful Soup that can handle text nodes more elegantly.
import bs4
from bs4 import BeautifulSoup
inner_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("innerHTML")
inner_soup = BeautifulSoup(inner_html, 'html.parser')
outer_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("outerHTML")
outer_soup = BeautifulSoup(outer_html, 'html.parser')
From there, there are several ways to search for the Text content. You'll have to experiment to see what works best for your use case.
Here's a simple one-liner that may be sufficient:
inner_soup.find(text=True)
If that doesn't work, then you can loop through the element's child nodes with .contents and check their object type.
Beautiful Soup has four types of elements, and the one that you'll be interested in is the NavigableString type, which is produced by Text nodes. By contrast, Elements will have a type of Tag.
contents = inner_soup.contents
for bs4_object in contents:
    if (type(bs4_object) == bs4.Tag):
        print("This object is an Element.")
    elif (type(bs4_object) == bs4.NavigableString):
        print("This object is a Text node.")
Note that Beautiful Soup doesn't support XPath expressions. If you need those, then you can use some of the workarounds in this question.
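
If XPath is preferred, one option is to hand the HTML to lxml, which can address text nodes directly with text(). The sketch below parses the markup from the question as a literal string. In a Selenium session the string would presumably come from get_attribute("outerHTML") instead, which is an assumption about how the two would be combined.

```python
import lxml.html

# The markup from the question.
html = '<div id="a">This is some\n<div id="b">text</div>\n</div>'

doc = lxml.html.fromstring(html)

# text() as the last step selects only the div's own text nodes,
# not the text inside child elements.
own_text_nodes = doc.xpath('//div[@id="a"]/text()')
own_text = "".join(own_text_nodes).strip()
print(own_text)
```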

lxml: get element with a particular child element?

Working in lxml, I want to get the href attribute of all links with an img child that has title="Go to next page".
So in the following snippet:
<a class="noborder" href="StdResults.aspx">
<img src="arrowr.gif" title="Go to next page"></img>
</a>
I'd like to get StdResults.aspx back.
I've got this far:
next_link = doc.xpath("//a/img[@title='Go to next page']")
print(next_link[0].attrib['href'])
But next_link is the img, not the a tag - how can I get the a tag?
Thanks.
Just change a/img... to a[img...]: (the brackets sort of mean "such that")
import lxml.html as lh
content='''<a class="noborder" href="StdResults.aspx">
<img src="arrowr.gif" title="Go to next page"></img>
</a>'''
doc=lh.fromstring(content)
for elt in doc.xpath("//a[img[@title='Go to next page']]"):
    print(elt.attrib['href'])
# StdResults.aspx
Or, you could go even further and use
"//a[img[@title='Go to next page']]/@href"
to retrieve the values of the href attributes.
You can also select the parent node or arbitrary ancestors by using //a/img[@title='Go to next page']/parent::a or //a/img[@title='Go to next page']/ancestor::a respectively as XPath expressions.
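
The three variants can be compared side by side on the snippet from the question. This is a small sketch, and the variable names are only illustrative.

```python
import lxml.html

content = (
    '<a class="noborder" href="StdResults.aspx">'
    '<img src="arrowr.gif" title="Go to next page">'
    '</a>'
)
doc = lxml.html.fromstring(content)

# Predicate on the link: "an a that has such an img child".
by_predicate = doc.xpath("//a[img[@title='Go to next page']]/@href")

# Start at the img and step back up with the parent axis.
by_parent = doc.xpath("//a/img[@title='Go to next page']/parent::a/@href")

# Same idea, but ancestor:: also matches links further up the tree.
by_ancestor = doc.xpath("//a/img[@title='Go to next page']/ancestor::a/@href")

print(by_predicate, by_parent, by_ancestor)
```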
