Can't grab next sibling using css selector within scrapy

Can't grab next sibling using css selector within scrapy - python

I'm trying to fetch the budget using scrapy implementing css selector within it. I can get it when I use xpath but in case of css selector I'm lost. I can even get the content when I go for BeautifulSoup and use next_sibling.
I've tried with:
import requests
from scrapy import Selector
url = "https://www.imdb.com/title/tt0111161/"
res = requests.get(url)
sel = Selector(res)
# budget = sel.xpath("//h4[contains(.,'Budget:')]/following::text()").get()
# print(budget)
budget = sel.css("h4:contains('Budget:')::text").get()
print(budget)
Output I'm getting using css selector:
Budget:
Expected output:
$25,000,000
Relevant portion of html:
<div class="txt-block">
<h4 class="inline">Budget:</h4>$25,000,000
<span class="attribute">(estimated)</span>
</div>
website address
That portion in that site is visible as:
How can I get the budgetary information using css selector when it is used within scrapy?

This selector .css("h4:contains('Budget:')::text") is selecting the h4 tag, and the text you want is in it's parent, the div element.
You could use .css('div.txt-block::text') but this would return several elements, as the page have several elements like that. CSS selectors don't have a parent pseudo-element, I guess you could use .css('div.txt-block:nth-child(12)::text') but if you are going to scrape more pages, this will probably fail in other pages.
The best option would be to use XPath:
response.xpath('//h4[text() = "Budget:"]/parent::div/text()').getall()

Related

How do I extract text from a button using Beautiful Soup?

I am trying to scrape GoFundMe information but can't seem to extract the number of donors.
This is the html I am trying to navigate. I am attempting to retrieve 11.1K,
<ul class="list-unstyled m-meta-list m-meta-list--default">
<li class="m-meta-list-item">
<button class="text-stat disp-inline text-left a-button a-button--inline" data-element-
id="btn_donors" type="button" data-analytic-event-listener="true">
<span class="text-stat-value text-underline">11.1K</span>
<span class="m-social-stat-item-title text-stat-title">donors</span>
I've tried using
donors = soup.find_all('li', class_ = 'm-meta-list-item')
for donor in donors:
print(donor.text)
The class/button seems to be hidden inside another class? How can I extract it?
I'm new to beautifulsoup but have used selenium quite a bit.
Thanks in advance.

These fundraiser pages all have similar html and that value is dynamically retrieved. I would suggest using selenium and a css class selector
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.gofundme.com/f/treatmentforsiyona?qid=7375740208a5ee878a70349c8b74c5a6')
num = d.find_element_by_css_selector('.text-stat-value').text
print(num)
d.quit()
Learn more about selenium:
https://sqa.stackexchange.com/a/27856

get the id gofundme.com/f/{THEID} and call the API
/web-gateway/v1/feed/THEID/donations?sort=recent&limit=20&offset=20
process the Data
for people in apiResponse['references']['donations']
print(people['name'])
use browser console to find host API.

How to skip over child element with Scrapy

I'm looking to scrape just the job description from this page: https://www.aha.io/company/careers/current-openings/customer_success_specialist_project_management_us
I'd like to get all of the text and HTML inside the div with the class of "container py2 content job", EXCEPT the button. It's in an <a> tag with the class of "btn btn-large btn-secondary".
I've got two different xpath selectors that I thought should work, but don't. The first doesn't exclude the button and the second gets rid of all of the other HTML, which I'd like to keep.
response.xpath('//div[#class ="container py2 content job"]
[not(parent::a/#class="btn btn-large btn-secondary")]').extract()
response.xpath('//div[#class ="container py2 content
job"]/descendant::text()[not(parent::a/#class="btn btn-large btn-
secondary")]').extract()
Neither is scraping all of the HTML in the div minus what's inside the a tag. I'm hoping there's something simple that I'm missing, but I can't find what I'm looking for in the documentation.

job_html = response.css('div.content *').extract()
job_html = [x for x in job_html if "Apply now" not in x]
print(job_html)

how to find element by css selector using python / selenium

i'm trying to pick up links of youtube channels which are located as below:
<a id="author-text" class="yt-simple-endpoint style-scope ytd-comment-
renderer" href="/channel/UCUSy-h1fPG1L6X7KOe70asA"> <span class="style-
scope ytd-comment-renderer">Jörgen Nilsson</span></a>
So in the example above I would want to pick up "/channel/UCUSy-h1fPG1L6X7KOe70asA". So far i have tried many options but none work:
driver = webdriver.Chrome('C:/Users/me/Chrome Web Driver/chromedriver.exe')
api_url="https://www.youtube.com/watch?v=TQG7m1BFeRc"
driver.get(api_url)
time.sleep(2)
div = driver.find_element_by_class_name("yt-simple-endpoint style-scope ytd-comment-renderer")
but I get the following error:
InvalidSelectorException: Message: invalid selector: Compound class names not permitted
I also tried other approaches:
div = driver.find_elements_by_xpath("yt-simple-endpoint style-scope ytd-comment-renderer")
div = driver.find_element_by_class_name('yt-simple-endpoint style-scope ytd-comment-renderer')
div=driver.find_element_by_css_selector('.yt-simple-endpoint style-scope ytd-comment-renderer').get_attribute('href')
but no luck.. if someone could please help it would be much appreciated. Thank you

Your selectors are invalid:
driver.find_element_by_class_name("yt-simple-endpoint style-scope ytd-comment-renderer")
you cannot pass more than one class name to find_element_by_class_name method. You can try driver.find_element_by_class_name("ytd-comment-renderer")
driver.find_elements_by_xpath("yt-simple-endpoint style-scope ytd-comment-renderer")
it's not a correct XPath syntax. You probably mean driver.find_elements_by_xpath("//*[#class='yt-simple-endpoint style-scope ytd-comment-renderer']")
driver.find_element_by_css_selector('.yt-simple-endpoint style-scope ytd-comment-renderer')
each class name should start with the dot: driver.find_element_by_css_selector('.yt-simple-endpoint.style-scope.ytd-comment-renderer')
But the best way IMHO to identify by ID:
driver.find_element_by_id("author-text")

You can use BeautifulSoup in python to get the links in anchor tag having specific class names like soup.find_all('a', attrs={'class':'yt-simple-endpoint'}) you can read more here find_all using css

Get links from a certain div using Selenium in Python

I have the following HTML page. I want to get all the links inside a specific div. Here is my HTML code:
<div class="rec_view">
<a href='www.xyz.com/firstlink.html'>
<img src='imga.png'>
</a>
<a href='www.xyz.com/seclink.html'>
<img src='imgb.png'>
</a>
<a href='www.xyz.com/thrdlink.html'>
<img src='imgc.png'>
</a>
</div>
I want to get all the links that are present on the rec_view div. So those links that I want are,
www.xyz.com/firstlink.html
www.xyz.com/seclink.html
www.xyz.com/thrdlink.html
Here is the Python code which I tried with
from selenium import webdriver;
webpage = r"https://www.testurl.com/page/123/"
driver = webdriver.Chrome("C:\chromedriver_win32\chromedriver.exe")
driver.get(webpage)
element = driver.find_element_by_css_selector("div[class='rec_view']>a")
link = element.get_attribute("href")
print(link)
How can I get those links using selenium on Python?

As per the HTML you have shared to get the list of all the links that are present on the rec_view div you can use the following code block :
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\chromedriver_win32\chromedriver.exe')
driver.get('https://www.testurl.com/page/123/')
elements = driver.find_elements_by_css_selector("div.rec_view a")
for element in elements:
print(element.get_attribute("href"))
Note : As you need to collect all the href attributes from the div tag so instead of find_element_* you need to use find_elements_*. Additionally, > refers to immediate <a> child node where as you need to traverse all the <a> child nodes so the desired css_selector will be div.rec_view a

Find nested divs scrapy

I am trying to get the text from a div that is nested. Here is the code that I currently have:
sites = hxs.select('/html/body/div[#class="content"]/div[#class="container listing-page"]/div[#class="listing"]/div[#class="listing-heading"]/div[#class="price-container"]/div[#class="price"]')
But it is not returning a value. Is my syntax wrong? Essentially I just want the text out of <div class="price">
Any ideas?
The URL is here.

The price is inside an iframe so you should scrape https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978
Once you request this url:
hxs.select('//div[#class="price"]/text()').extract()[0]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Can't grab next sibling using css selector within scrapy - python

Related

How do I extract text from a button using Beautiful Soup?

How to skip over child element with Scrapy

how to find element by css selector using python / selenium

Get links from a certain div using Selenium in Python

Find nested divs scrapy

Categories

Resources