I am trying to get the text from a div that is nested. Here is the code that I currently have:
sites = hxs.select('/html/body/div[@class="content"]/div[@class="container listing-page"]/div[@class="listing"]/div[@class="listing-heading"]/div[@class="price-container"]/div[@class="price"]')
But it is not returning a value. Is my syntax wrong? Essentially I just want the text out of <div class="price">
Any ideas?
The URL is here.
The price is inside an iframe, so you should scrape https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978
Once you request that URL:
hxs.select('//div[@class="price"]/text()').extract()[0]
I'm using scrapy to fetch data from this webpage.
I'm relatively new to this. I need to get the href link of the next-page button (>) but can't find the solution.
Please help.
I tried this in the shell:
response.xpath('//a[@class="btn--pagination btn--pag-next pag-control"]/@href').extract()
but it just gives me [].
This is the html code of the button:
<a data-page="2" data-url="http://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior?regionType=world&timing=electronic&windReading=regular&page=1&bestResultsOnly=false&firstDay=1899-12-31&lastDay=2023-01-20" class="btn--pagination btn--pag-next pag-control" href="//www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior?regionType=world&timing=electronic&windReading=regular&page=2&bestResultsOnly=false&firstDay=1899-12-31&lastDay=2023-01-20" style="">
>
</a>
The issue is that the link element doesn't have an href attribute in the response Scrapy actually receives (which is why the expression returns []). Use the data-url attribute instead:
response.xpath('//a[@class="btn--pagination btn--pag-next pag-control"]/@data-url').get()
OUTPUT
'http://worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/women/senior?regionType=world&timing=electronic&windReading=regular&page=1&bestResultsOnly=false&firstDay=1899-12-31&lastDay=2023-01-21'
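As an offline check of the same XPath, you can run it with lxml against a trimmed-down copy of the button (the data-url below is shortened and illustrative, not the real one):

```python
from lxml import html

# simplified stand-in for the real pagination button
snippet = '''<html><body>
<a data-page="2"
   data-url="https://example.com/toplists?page=1"
   class="btn--pagination btn--pag-next pag-control" href="">&gt;</a>
</body></html>'''

doc = html.fromstring(snippet)
next_url = doc.xpath('//a[@class="btn--pagination btn--pag-next pag-control"]/@data-url')[0]
print(next_url)
```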
I am trying to scrape GoFundMe information but can't seem to extract the number of donors.
This is the HTML I am trying to navigate; I am attempting to retrieve the 11.1K value:
<ul class="list-unstyled m-meta-list m-meta-list--default">
  <li class="m-meta-list-item">
    <button class="text-stat disp-inline text-left a-button a-button--inline" data-element-id="btn_donors" type="button" data-analytic-event-listener="true">
      <span class="text-stat-value text-underline">11.1K</span>
      <span class="m-social-stat-item-title text-stat-title">donors</span>
    </button>
  </li>
</ul>
I've tried using
donors = soup.find_all('li', class_='m-meta-list-item')
for donor in donors:
    print(donor.text)
The button seems to be nested inside another element. How can I extract the value?
I'm new to beautifulsoup but have used selenium quite a bit.
Thanks in advance.
These fundraiser pages all have similar html and that value is dynamically retrieved. I would suggest using selenium and a css class selector
from selenium import webdriver
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
d.get('https://www.gofundme.com/f/treatmentforsiyona?qid=7375740208a5ee878a70349c8b74c5a6')
num = d.find_element(By.CSS_SELECTOR, '.text-stat-value').text
print(num)
d.quit()
Learn more about selenium:
https://sqa.stackexchange.com/a/27856
Get the fundraiser id from the url (gofundme.com/f/{THEID}) and call the API:
import requests

# API_HOST and THEID are placeholders; use the browser's network console to find the API host
api_url = f'https://{API_HOST}/web-gateway/v1/feed/{THEID}/donations?sort=recent&limit=20&offset=20'
api_response = requests.get(api_url).json()
# process the data
for person in api_response['references']['donations']:
    print(person['name'])
I'm trying to use BeautifulSoup in Python to scrape the 3rd li element within a CSS class. That said, I'm pretty new to this and am not sure of the best way to go about it.
In the example below, I'm trying to scrape the 170 votes from this list (in the real-world example there are hundreds of these on the page I'm looking to scrape, but they're all nested under the same CSS class within the 3rd li element):
<ul class="example-ul-class">
  <li class="example-li-class">EXAMPLE NAME</li>
  <li><i class="example-li-class">12 hours ago</i></li>
  <li><i class="example-li-class"> 170 votes</i></li>
  <li><i class="example-li-class">3 min read</i></li>
</ul>
I tried something like the below, but am getting the error shown after the code:
subtext = soup.select('.example-ul-class > li[2]')
print(subtext)
Error:
in selector_iter
raise SelectorSyntaxError(msg, self.pattern, index)
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 29
line 1:
.example-ul-class > li[2]
Again, the desired output would be just the string '170 votes'.
Appreciate the help!
Instead of a CSS selector, try selecting using normal BS methods:
print(soup.find('ul',class_='example-ul-class').find_all('li')[2].text.strip())
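Put together with the snippet from the question (closing tags tidied up), this is a self-contained check:

```python
from bs4 import BeautifulSoup

html_doc = '''<ul class="example-ul-class">
<li class="example-li-class">EXAMPLE NAME</li>
<li><i class="example-li-class">12 hours ago</i></li>
<li><i class="example-li-class"> 170 votes</i></li>
<li><i class="example-li-class">3 min read</i></li>
</ul>'''

soup = BeautifulSoup(html_doc, 'html.parser')
# third li in document order, then strip the surrounding whitespace
votes = soup.find('ul', class_='example-ul-class').find_all('li')[2].text.strip()
print(votes)
```

If you do want a CSS selector, soupsieve accepts soup.select_one('.example-ul-class > li:nth-of-type(3)'); li[2] is rejected because it parses as a malformed attribute selector.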
I'm trying to fetch the budget using a CSS selector within Scrapy. I can get it when I use XPath, but with a CSS selector I'm lost. I can even get the content when I go for BeautifulSoup and use next_sibling.
I've tried with:
import requests
from scrapy import Selector
url = "https://www.imdb.com/title/tt0111161/"
res = requests.get(url)
sel = Selector(text=res.text)
# budget = sel.xpath("//h4[contains(.,'Budget:')]/following::text()").get()
# print(budget)
budget = sel.css("h4:contains('Budget:')::text").get()
print(budget)
Output I'm getting using css selector:
Budget:
Expected output:
$25,000,000
Relevant portion of html:
<div class="txt-block">
<h4 class="inline">Budget:</h4>$25,000,000
<span class="attribute">(estimated)</span>
</div>
website address
How can I get the budget information using a CSS selector within Scrapy?
This selector .css("h4:contains('Budget:')::text") is selecting the h4 tag, and the text you want is in its parent, the div element.
You could use .css('div.txt-block::text'), but this would return several elements, as the page has several elements like that. CSS selectors don't have a parent pseudo-element; I guess you could use .css('div.txt-block:nth-child(12)::text'), but if you are going to scrape more pages, this will probably fail on other pages.
The best option would be to use XPath:
response.xpath('//h4[text() = "Budget:"]/parent::div/text()').getall()
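Outside a spider, you can verify that XPath with lxml against the snippet from the question; the parent div contributes several whitespace-only text nodes, hence the strip-and-filter step:

```python
from lxml import html

snippet = '''<div class="txt-block">
<h4 class="inline">Budget:</h4>$25,000,000
<span class="attribute">(estimated)</span>
</div>'''

doc = html.fromstring(snippet)
# all text nodes that are direct children of the h4's parent div
texts = doc.xpath('//h4[text() = "Budget:"]/parent::div/text()')
budget = [t.strip() for t in texts if t.strip()][0]
print(budget)
```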
I want to get all the social links of a company from this page. When doing
summary_div.find("div", {'class': "cp-summary__social-links"})
I am getting this
<div class="cp-summary__social-links">
<div data-integration-name="react-component" data-payload='{"props":
{"links":[{"url":"http://www.snapdeal.com?utm_source=craft.co","icon":"web","label":"Website"},
{"url":"http://www.linkedin.com/company/snapdeal?utm_source=craft.co","icon":"linkedin","label":"LinkedIn"},
{"url":"https://instagram.com/snapdeal/?utm_source=craft.co","icon":"instagram","label":"Instagram"},
{"url":"https://www.facebook.com/Snapdeal?utm_source=craft.co","icon":"facebook","label":"Facebook"},
{"url":"https://www.crunchbase.com/organization/snapdeal?utm_source=craft.co","icon":"cb","label":"CrunchBase"},
{"url":"https://www.youtube.com/user/snapdeal?utm_source=craft.co","icon":"youtube","label":"YouTube"},
{"url":"https://twitter.com/snapdeal?utm_source=craft.co","icon":"twitter","label":"Twitter"}],
"companyName":"Snapdeal"},"name":"CompanyLinks"}' data-rwr-element="true"></div></div>
I also tried getting the children of cp-summary__social-links, which is what I actually want, and then finding all a tags to get the links. That does not work either.
Any idea how to do this?
Update: As Sraw suggested, I managed to get all urls by doing like this.
urls = []
social_link = summary_div.find("div", {'class': "cp-summary__social-links"}).find("div", {"data-integration-name": "react-component"})
json_text = json.loads(social_link["data-payload"])
for link in json_text['props']['links']:
    urls.append(link['url'])
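For completeness, here is the whole approach run against a trimmed copy of the snippet above (only two of the seven links kept, to stay short):

```python
import json
from bs4 import BeautifulSoup

snippet = '''<div class="cp-summary__social-links">
<div data-integration-name="react-component" data-payload='{"props":{"links":[{"url":"http://www.snapdeal.com?utm_source=craft.co","icon":"web","label":"Website"},{"url":"https://twitter.com/snapdeal?utm_source=craft.co","icon":"twitter","label":"Twitter"}],"companyName":"Snapdeal"},"name":"CompanyLinks"}' data-rwr-element="true"></div></div>'''

soup = BeautifulSoup(snippet, 'html.parser')
# the links live in a JSON blob inside the data-payload attribute
social_link = soup.find('div', {'data-integration-name': 'react-component'})
payload = json.loads(social_link['data-payload'])
urls = [link['url'] for link in payload['props']['links']]
print(urls)
```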
Thanks in advance.