Using BeautifulSoup to scrape specific element within a CSS class - python

I'm trying to use BeautifulSoup in Python to scrape the 3rd li element within a CSS class. That said, I'm pretty new to this and am not sure of the best way to go about it.
In the example below, what I'm trying to do is scrape the 170 votes from this list (in the real-world example there are hundreds of these on the page I'm looking to scrape, but they're all nested under the same CSS class within the 3rd li element):
<ul class="example-ul-class">
<li class="example-li-class">EXAMPLE NAME</li>
<li class="example-li-class">12 hours ago</li>
<li class="example-li-class">170 votes</li>
<li class="example-li-class">3 min read</li>
</ul>
I tried using something like the below, but am getting the error shown after the code:
subtext = soup.select('.example-ul-class > li[2]')
print(subtext)
Error:
in selector_iter
raise SelectorSyntaxError(msg, self.pattern, index)
soupsieve.util.SelectorSyntaxError: Malformed attribute selector at position 29
line 1:
.example-ul-class > li[2]
Again, the desired output would be to return just the string '170 votes'.
Appreciate the help!

Instead of a CSS selector, try selecting using normal BS methods:
print(soup.find('ul',class_='example-ul-class').find_all('li')[2].text.strip())
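With a cleaned-up copy of the markup from the question, that one-liner can be checked end to end (bs4 assumed installed):

```python
from bs4 import BeautifulSoup

html = '''
<ul class="example-ul-class">
<li class="example-li-class">EXAMPLE NAME</li>
<li class="example-li-class">12 hours ago</li>
<li class="example-li-class">170 votes</li>
<li class="example-li-class">3 min read</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')
# find_all('li') is 0-indexed, so [2] is the 3rd <li>
print(soup.find('ul', class_='example-ul-class').find_all('li')[2].text.strip())  # 170 votes
```

If you prefer to stay in CSS, soupsieve understands :nth-of-type, so soup.select_one('.example-ul-class > li:nth-of-type(3)') reaches the same element. The original li[2] form fails because square brackets in a CSS selector are attribute selectors, not indexes, hence the Malformed attribute selector error.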

Related

Using Python 3.7 and Webdriver, how do I find a specific button inside a "<li", which is in turn found inside a "<div" element?

I am including portions of the HTML below. I believe I have found the larger element using the command:
driver.find_element_by_xpath('//div[@id="day-number-4"]')
That div ID is unique on the web page, and the command above did not create an exception nor error.
The HTML code that defines that element is:
<div id="day-number-4" class="c-schedule-calendar__class-schedule-content tabcontent js-class-schedule-content u-is-none u-is-block" data-index="3">
Now, the hard part. Inside that div element are a number of <li> elements that take the form of:
<li tabindex="0" class="row c-schedule-calendar__class-schedule-listitem-wrapper c-schedule-calendar__workout-schedule-list-item" data-index="0" data-workout-id="205687" data-club-id="229">
and then followed by a clickable button in this format:
<button class="c-btn-outlined class-action" data-class-action="book-class" data-class-action-step="class-action-confirmation" data-workout-id="205687" data-club-id="229" data-waitlistable="true"><span class="c-btn__label">Join Waitlist</span></button>
I always want to click on the 3rd button inside that <div> element. The only thing unique would be the data-index, which starts at 0 for the 1st <li>, so it is 2 for the 3rd. So I want to find the clickable button that follows this HTML code:
<li tabindex="0" class="row c-schedule-calendar__class-schedule-listitem-wrapper c-schedule-calendar__workout-schedule-list-item" data-index="2" data-workout-id="206706" data-club-id="229">
I cannot search on data-index alone, as data-index="2" appears many times on the web page.
How do I do this?
I answered my own question. I used multiple searches within the specific element using this code:
Day_of_Timeslot = driver.find_element_by_xpath('//div[@id="day-number-4"]')
Precise_Timeslot = Day_of_Timeslot.find_element_by_xpath(".//li[@data-index='1']")
Actual_Button = Precise_Timeslot.find_element_by_xpath(".//button[@data-class-action='book-class']").click()
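The same scoped-search idea can be sketched offline with lxml standing in for Selenium's XPath engine, on a trimmed-down, hypothetical copy of the page's markup (the button labels here are assumptions):

```python
from lxml import html as lhtml

# Hypothetical, trimmed-down copy of the schedule markup
snippet = '''
<div id="day-number-4" class="tabcontent" data-index="3">
  <ul>
    <li data-index="0"><button class="class-action" data-class-action="book-class">Book</button></li>
    <li data-index="1"><button class="class-action" data-class-action="book-class">Join Waitlist</button></li>
  </ul>
</div>
'''
doc = lhtml.fromstring(snippet)
# 1. narrow to the unique div, 2. narrow to the li, 3. find the button inside it
day = doc.xpath('//div[@id="day-number-4"]')[0]
slot = day.xpath(".//li[@data-index='1']")[0]
button = slot.xpath(".//button[@data-class-action='book-class']")[0]
print(button.text)  # Join Waitlist
```

The key detail is the leading dot in ".//li[...]": it anchors each search to the element already found, so a data-index that repeats elsewhere on the page can no longer match.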

Can't grab next sibling using css selector within scrapy

I'm trying to fetch the budget using Scrapy with a CSS selector. I can get it when I use XPath, but with a CSS selector I'm lost. I can even get the content when I go for BeautifulSoup and use next_sibling.
I've tried with:
import requests
from scrapy import Selector
url = "https://www.imdb.com/title/tt0111161/"
res = requests.get(url)
sel = Selector(res)
# budget = sel.xpath("//h4[contains(.,'Budget:')]/following::text()").get()
# print(budget)
budget = sel.css("h4:contains('Budget:')::text").get()
print(budget)
Output I'm getting using css selector:
Budget:
Expected output:
$25,000,000
Relevant portion of html:
<div class="txt-block">
<h4 class="inline">Budget:</h4>$25,000,000
<span class="attribute">(estimated)</span>
</div>
How can I get the budgetary information using css selector when it is used within scrapy?
The selector .css("h4:contains('Budget:')::text") selects the h4 tag, and the text you want is in its parent, the div element.
You could use .css('div.txt-block::text'), but this would return several elements, since the page has several elements like that. CSS selectors have no parent pseudo-element. You could use .css('div.txt-block:nth-child(12)::text'), but if you are going to scrape more pages, this will probably fail on other pages.
The best option would be to use XPath:
response.xpath('//h4[text() = "Budget:"]/parent::div/text()').getall()
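As a sanity check, the parent-axis XPath can be exercised against the snippet from the question with lxml (used here as a stand-in for Scrapy's selector engine):

```python
from lxml import html as lhtml

snippet = '''
<div class="txt-block">
<h4 class="inline">Budget:</h4>$25,000,000
<span class="attribute">(estimated)</span>
</div>
'''
doc = lhtml.fromstring(snippet)
# Step from the h4 up to its parent div, then take the div's own text nodes
texts = doc.xpath('//h4[text() = "Budget:"]/parent::div/text()')
budget = ''.join(texts).strip()
print(budget)  # $25,000,000
```

The div owns three text nodes (whitespace before the h4, the amount after it, and whitespace after the span), which is why the answer's getall()-style call returns a list that still needs joining and stripping.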

Python Selenium - Find element by class and text

I'm trying to paginate through the results of this search: Becoming Amazon search. I get a NoSuchElementException: 'Unable to locate element: < insert xpath here >'.
Here is the html:
<div id="pagn" class="pagnHy">
<span class="pagnLink">
<a href="...">2</a>
</span>
</div>
Here are the xpaths I've tried:
driver.find_element_by_xpath('//*[@id="pagn" and @class="pagnLink" and text()="2"]')
driver.find_element_by_xpath('//div[@id="pagn" and @class="pagnLink" and text()="2"]')
driver.find_element_by_xpath("//*[@id='pagn' and @class='pagnLink' and text()[contains(.,'2')]]")
driver.find_element_by_xpath("//span[@class='pagnLink' and text()='2']")
driver.find_element_by_xpath("//div[@class='pagnLink' and text()='2']")
If I just use find_element_by_link_text(...) then sometimes the wrong link will be selected. For example, if the number of reviews is equal to the page number I'm looking for (in this case, 2), then it will select the product with 2 reviews, instead of the page number '2'.
You're trying to mix attributes and text nodes from different WebElements in the same predicate. You should separate them, as below:
driver.find_element_by_xpath('//div[@id="pagn"]/span[@class="pagnLink"]/a[text()="2"]')
Sometimes it might be better to take an intermediate step and first get the element which contains the results.
Afterwards you just search within this element.
Doing it this way, you simplify your search terms.
from selenium import webdriver
url = 'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&fieldkeywords=becoming&rh=i%3Aaps%2Ck%3Abecoming'
driver = webdriver.Firefox()
resp = driver.get(url)
results_list_object = driver.find_element_by_id('s-results-list-atf')
results = results_list_object.find_elements_by_css_selector('li[id*="result"]')
for number, article in enumerate(results):
    print(">> article %d : %s \n" % (number, article.text))
When I look at the markup, I'm seeing the following:
<span class="pagnLink">
<a href="...">2</a>
</span>
So you want to find a span with class pagnLink that has a child a element with the text 2, or:
'//*[@class="pagnLink"]/a[text()="2"]'
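On a static copy of that markup, the selector can be checked with lxml (a stand-in for Selenium's XPath engine; the href is hypothetical):

```python
from lxml import html as lhtml

snippet = '<div id="pagn" class="pagnHy"><span class="pagnLink"><a href="...">2</a></span></div>'
doc = lhtml.fromstring(snippet)
# The predicate lives on the <a>, not on the span, so page "2" can't be
# confused with a product that happens to show "2" reviews
link = doc.xpath('//*[@class="pagnLink"]/a[text()="2"]')[0]
print(link.text)  # 2
```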

Pulling Text from Type 'Navigable String' and 'Tag' on Beautiful Soup

I'm stuck on parsing the part of the Rotten Tomatoes website that has the critics score as a tag and the "%" separately. I followed some SO suggestions such as using find_all('span', text=True), but the Python 3.5.1 shell returned this error: AttributeError: 'NavigableString' object has no attribute 'find_all'. I also tried finding the direct child of the BeautifulSoup object criticscore, but received the same error. Please tell me where I went wrong. Here's my Python code:
def get_rating(address):
    """pull ratings numbers from rotten tomatoes"""
    RTaddress = urllib.request.urlopen(address)
    tomatoe = BeautifulSoup(RTaddress, "lxml")
    for criticscore in tomatoe.find('span', class_=['meter-value superPageFontColor']):
        print(''.join(criticscore.find_all('span', recursive=False)))  # print the Tomatometer
Also, here's the code on Rotten Tomatoes I'm interested in scraping:
<div class="critic-score meter">
<a href="#contentReviews" class="unstyled articleLink" id="tomato_meter_link">
<span class="meter-tomato icon big medium-xs certified_fresh pull-left"></span>
<span class="meter-value superPageFontColor"><span>96</span>%</span>
</a>
</div>
The problem line is this one:
for criticscore in tomatoe.find('span', class_=['meter-value superPageFontColor']):
Here, you are locating a single element via find() and then iterating over its children, which can be text nodes as well as other elements (when you iterate over an element, this is what happens in BeautifulSoup).
Instead, you probably meant to use find_all():
for criticscore in tomatoe.find_all('span', class_=['meter-value superPageFontColor']):
Or, you can use a single CSS selector instead:
for criticscore in tomatoe.select('span.meter-value > span'):
print(criticscore.get_text())
where > means a direct parent-child relationship (this is your recursive=False replacement).
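Run against the snippet from the question, the selector version looks like this (bs4 with its bundled soupsieve CSS support assumed):

```python
from bs4 import BeautifulSoup

html = '''
<div class="critic-score meter">
<a href="#contentReviews" class="unstyled articleLink" id="tomato_meter_link">
<span class="meter-tomato icon big medium-xs certified_fresh pull-left"></span>
<span class="meter-value superPageFontColor"><span>96</span>%</span>
</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# 'span.meter-value > span' matches only the inner <span> holding the number,
# leaving the trailing "%" text node behind
for criticscore in soup.select('span.meter-value > span'):
    print(criticscore.get_text())  # 96
```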

Find nested divs scrapy

I am trying to get the text from a div that is nested. Here is the code that I currently have:
sites = hxs.select('/html/body/div[@class="content"]/div[@class="container listing-page"]/div[@class="listing"]/div[@class="listing-heading"]/div[@class="price-container"]/div[@class="price"]')
But it is not returning a value. Is my syntax wrong? Essentially I just want the text out of <div class="price">
Any ideas?
The URL is here.
The price is inside an iframe, so you should scrape https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978
Once you request this url:
hxs.select('//div[@class="price"]/text()').extract()[0]
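The shape of that call can be tried offline with lxml standing in for Scrapy's HtmlXPathSelector (the markup and price here are hypothetical):

```python
from lxml import html as lhtml

# Hypothetical fragment of the listing page inside the iframe
snippet = '<div class="price-container"><div class="price">$1,250/mo</div></div>'
doc = lhtml.fromstring(snippet)
# A short relative //div[@class="price"] reaches the node without
# spelling out the whole /html/body/... chain
price = doc.xpath('//div[@class="price"]/text()')[0]
print(price)  # $1,250/mo
```

This also shows why the original absolute path was fragile: one class name or nesting level off anywhere along the chain and it returns nothing, while the short relative path only depends on the price div itself.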
