Unable to parse a specific text in HTML with beautifulsoup - python

I am trying to parse the rating I have provided to a movie on IMDB. Below is my code:
with Session() as s:
    shaw = s.get('https://www.imdb.com/title/tt0111161/')
    shaw_soup = bs4.BeautifulSoup(shaw.content, 'html.parser')
    title_block = shaw_soup.find(class_='title_block')
    rating_widget = title_block.find('div', id='star-rating-widget')
    star_rating_value = rating_widget.find('span', class_='star-rating-value')
    print(star_rating_value)
The HTML structure of that portion of the page was shown in an attached screenshot (image not reproduced here).
The output of print(star_rating_value) is None.
The curious part is that when I parse other attributes there is no issue; the problem occurs only when parsing the rating I have given to the movie.

Related

Unable to scrape website for price element

I wanted to scrape the name, roast, and price, and I have successfully done it with the code below. However, I am not able to scrape the price; it shows up as None.
URLS = [
    "https://www.thirdwavecoffeeroasters.com/products/vienna-roast",
    "https://www.thirdwavecoffeeroasters.com/products/baarbara-estate",
    "https://www.thirdwavecoffeeroasters.com/products/el-diablo-blend",
    "https://www.thirdwavecoffeeroasters.com/products/organic-signature-filter-coffee-blend",
    "https://www.thirdwavecoffeeroasters.com/products/moka-pot-express-blend-1",
    "https://www.thirdwavecoffeeroasters.com/products/karadykan-estate",
    "https://www.thirdwavecoffeeroasters.com/products/french-roast",
    "https://www.thirdwavecoffeeroasters.com/products/signature-cold-brew-blend",
    "https://www.thirdwavecoffeeroasters.com/products/bettadakhan-estate",
    "https://www.thirdwavecoffeeroasters.com/products/monsoon-malabar-aa",
]
for url in range(0, 10):
    req = requests.get(URLS[url])
    soup = bs(req.text, "html.parser")
    coffees = soup.find_all("div", class_="col-md-4 col-sm-12 col-xs-12")
    for coffee in coffees:
        name = coffee.find("div", class_="product-details-main").find("ul", class_="uk-breadcrumb uk-text-uppercase").span.text
        roast = coffee.find("div", class_="uk-flex uk-flex-middle uk-width-1-1 coff_type_main").find("p", class_="coff_type uk-margin-small-left uk-text-uppercase").text.split("|")[0]
        prices = coffee.find("div", class_="uk-width-1-1 uk-first-column")
        print(name, roast, price)
The issue here is that the response does not contain the HTML element you are looking for: if you search for the price element in req.text, you won't find it. This is probably because the website renders the page dynamically with JavaScript.
If you look closely at req.text, however, you can find all the coffee variants with their respective prices in the meta JavaScript object. You can locate it and parse it into a Python dictionary using the loads() function from the json library.
Here is your working code, which also prints the prices. Note that all prices are stored multiplied by 100, so you may need to account for that.
import requests
from bs4 import BeautifulSoup as bs
import json

URLS = [
    "https://www.thirdwavecoffeeroasters.com/products/vienna-roast",
    "https://www.thirdwavecoffeeroasters.com/products/baarbara-estate",
    "https://www.thirdwavecoffeeroasters.com/products/el-diablo-blend",
    "https://www.thirdwavecoffeeroasters.com/products/organic-signature-filter-coffee-blend",
    "https://www.thirdwavecoffeeroasters.com/products/moka-pot-express-blend-1",
    "https://www.thirdwavecoffeeroasters.com/products/karadykan-estate",
    "https://www.thirdwavecoffeeroasters.com/products/french-roast",
    "https://www.thirdwavecoffeeroasters.com/products/signature-cold-brew-blend",
    "https://www.thirdwavecoffeeroasters.com/products/bettadakhan-estate",
    "https://www.thirdwavecoffeeroasters.com/products/monsoon-malabar-aa",
]

for url in URLS:
    req = requests.get(url)
    soup = bs(req.text, "html.parser")
    coffees = soup.find_all("div", class_="col-md-4 col-sm-12 col-xs-12")
    for coffee in coffees:
        name = coffee.find("div", class_="product-details-main").find("ul", class_="uk-breadcrumb uk-text-uppercase").span.text
        roast = coffee.find("div", class_="uk-flex uk-flex-middle uk-width-1-1 coff_type_main").find("p", class_="coff_type uk-margin-small-left uk-text-uppercase").text.split("|")[0]
        print(name, roast)
    # Find the JavaScript declaration (there is only one `meta` variable, so no chance of ambiguity)
    json_start = req.text.find("var meta = ") + len("var meta = ")
    # Find the end of the JavaScript object (there won't be any `;` inside)
    json_end = req.text.find(";", json_start)
    json_str = req.text[json_start:json_end]
    json_data = json.loads(json_str)
    # Get the prices from the JSON object
    product = json_data["product"]
    variants = product["variants"]
    for variant in variants:
        print(f'Variant: {variant["name"]}, price: {variant["price"]}')
    print("\n")
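To illustrate the extraction pattern in isolation, here is a minimal sketch on a made-up page snippet (the variant name and price below are invented; the real meta object is much larger):

```python
import json

# Hypothetical stand-in for req.text; the real page embeds a much
# larger `var meta = {...};` object
page_text = 'var meta = {"product": {"variants": [{"name": "250g Whole Beans", "price": 44900}]}};'

# Slice out everything between the declaration and the closing semicolon
json_start = page_text.find("var meta = ") + len("var meta = ")
json_end = page_text.find(";", json_start)
meta = json.loads(page_text[json_start:json_end])

for variant in meta["product"]["variants"]:
    # Prices are stored multiplied by 100, so divide to get the display price
    print(variant["name"], variant["price"] / 100)  # 250g Whole Beans 449.0
```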
I hope this helps, cheers.

Webscraping Issue w/ BeautifulSoup

I am new to Python web scraping, and I am scraping productreview.com for reviews. The following code pulls all the data I need for a single review:
#Scrape TrustPilot for User Reviews (Rating, Comments)
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import json
import requests
import datetime as dt

final_list = []
url = 'https://www.productreview.com.au/listings/world-nomads'
r = requests.get(url)
soup = bs(r.text, 'lxml')
for div in soup.find('div', class_='loadingOverlay_24D'):
    try:
        name = soup.find('h4', class_='my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
        name = name.find('span').text
        location = soup.find('h4').find('small').text
        policy = soup.find('div', class_='px-4_1Cw pt-4_9Zz pb-2_1Ex card-body_2iI').find('span').text
        title = soup.find('h3').find('span').text
        content = soup.find('p', class_='mb-0_2CX').text
        rating = soup.find('div', class_='mb-4_2RH align-items-center_3Oi flex-wrap_ATH d-flex_oSG')
        rating = rating.find('div')['title']
        final_list.append([name, location, policy, rating, title, content])
    except AttributeError:
        pass
reviews = pd.DataFrame(final_list, columns=['Name', 'Location', 'Policy', 'Rating', 'Title', 'Content'])
print(reviews)
But when I edit
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
to
for div in soup.findAll('div', class_ = 'loadingOverlay_24D'):
I don't get all reviews, I just get the same entry looped over and over.
Any help would be much appreciated.
Thanks!
Issue 1: Repeated data inside the loop
Your loop has the following form:
for div in soup.find('div', ...):
    name = soup.find('h4', ...)
    policy = soup.find('div', ...)
    ...
Notice that you are calling find on the soup object inside the loop. This means that every time you look up the value for name, the search starts from the beginning of the whole document and returns the first match, on every iteration.
This is why you are getting the same data over and over.
To fix this, call find on the review div of the current iteration instead. That is:
for div in soup.find('div', ...):
    name = div.find('h4', ...)
    policy = div.find('div', ...)
    ...
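A minimal, self-contained sketch of the difference (the markup and names below are invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="review"><h4>Alice</h4></div>
<div class="review"><h4>Bob</h4></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Wrong: soup.find() restarts from the top of the document on every
# iteration, so it always returns the first reviewer
wrong = [soup.find('h4').text for div in soup.find_all('div', class_='review')]

# Right: searching within the current div yields each reviewer in turn
right = [div.find('h4').text for div in soup.find_all('div', class_='review')]

print(wrong)  # ['Alice', 'Alice']
print(right)  # ['Alice', 'Bob']
```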
Issue 2: Missing data and error handling
In your code, any error inside the loop is silently ignored. However, several errors actually occur while parsing and extracting the values. For example:
location = div.find('h4').find('small').text
Not all reviews have location information, so the code will find the h4, then look for small, find none, and return None. Calling .text on that None object raises an exception, and that review is never added to the result data frame.
To fix this, you need to add more error checking. For example:
locationDiv = div.find('h4').find('small')
if locationDiv:
    location = locationDiv.text
else:
    location = ''
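Here is a runnable sketch of that guard on made-up markup (the second review deliberately has no <small> element):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first review carries a location
html = ('<div class="review"><h4>Alice <small>Sydney</small></h4></div>'
        '<div class="review"><h4>Bob</h4></div>')
soup = BeautifulSoup(html, "html.parser")

locations = []
for div in soup.find_all('div', class_='review'):
    small = div.find('h4').find('small')
    # Fall back to an empty string instead of crashing on .text of None
    locations.append(small.text if small else '')

print(locations)  # ['Sydney', '']
```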
Issue 3: Identifying and extracting data
The page you're trying to parse has broken HTML, and uses CSS classes that seem random or at least inconsistent. You need to find the correct and unique identifiers for the data that you are extracting such that they strictly match all the entries.
For example, you are extracting the review-container div using CSS class loadingOverlay_24D. This is incorrect. This CSS class seems to be for a "loading" placeholder div or something similar. Actual reviews are enclosed in div blocks that look like this:
<div itemscope="" itemType="http://schema.org/Review" itemProp="review">
....
</div>
Notice that the uniquely identifying property is the itemProp attribute. You can extract those div blocks using:
soup.find_all('div', {'itemprop': 'review'})
Similarly, you have to find the correct identifying properties of the other data you want to extract to ensure you get all your data fully and correctly.
One more thing, when a tag has more than one CSS class, usually only one of them is the identifying property you want to use. For example, for names, you have this:
name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
but in reality you don't need all of these classes; the first one is, in this case, sufficient to identify the name h4 blocks:
name = soup.find('h4', class_='my-0_27D')
Example:
Here's an example to extract the author names from review page:
for div in soup.find_all('div', {'itemprop': 'review'}):
    name = div.find('h4', class_='my-0_27D')
    if name:
        name = name.find('span').text
    else:
        name = '-'
    print(name)
Output:
Aidan
Bruno M.
Ba. I.
Luca Evangelista
Upset
Julian L.
Alison Peck
...
The page serves broken HTML code, and html.parser is better at dealing with it.
Change soup = bs(r.text, 'lxml') to soup = bs(r.text, 'html.parser')
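As a quick sketch of how forgiving html.parser is, here is a deliberately broken fragment (invented for illustration) that it still parses into a usable tree:

```python
from bs4 import BeautifulSoup

# A fragment with an unclosed <b> and a mis-nested closing tag,
# similar in spirit to what a page serving broken HTML might return
broken = '<div><span>Rating: <b>4.1</span></div>'

soup = BeautifulSoup(broken, "html.parser")
# All text content survives even though the markup is malformed
print(soup.get_text())
print(soup.find('b').text)
```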

Asking the user to input something and use Beautiful Soup to parse a website

I am supposed to use Beautiful Soup 4 to obtain course information from my school's website as an exercise. I have been at this for the past few days and my code still does not work.
The first thing I ask the user for is the course catalog abbreviation. For example, ICS is the abbreviation for Information and Computer Sciences. Beautiful Soup 4 is then supposed to list all of the courses and how many students are enrolled.
While I was able to get the input portion to work, I still get errors, or the program simply stops.
Question: Is there a way for Beautiful Soup to accept user input so that when the user inputs ICS, the output would be a list of all courses that are related to ICS?
Here is the code and my attempt at it:
from bs4 import BeautifulSoup
import requests
import re

#get input for course
course = input('Enter the course:')
#Here is the page link
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get request and response
page_response = requests.get(BASE_AVAILABILITY_URL)
#getting Beautiful Soup to gather the html content
page_content = BeautifulSoup(page_response.content, 'html.parser')
#getting course information
main = page_content.find_all(class_='parent clearfix')
main_p = "".join(str(x) for x in main)
#get the course anchor tags
main_q = BeautifulSoup(main_p, "html.parser")
courses = main.find('a', href=True)
#get each course name
#empty list for course names
courses_list = []
for a in courses:
    courses_list.append(a.text)
search = input('Enter the course title:')
for course in courses_list:
    if re.search(search, course, re.IGNORECASE):
        print(course)
This is the original code that was provided in the Jupyter Notebook:
import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')

def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
What's odd is that if the user saves the HTML file, uploads it into the Jupyter Notebook, and then opens the file to be read, the courses are displayed. But for this task the user cannot save files; the lookup must work directly from the input.
The problem with your code is that page_content.find_all(class_='parent clearfix') returns an empty list [], so that's the first thing you need to change. Looking at the HTML, you'll want to look for the <table>, <tr>, and <td> tags.
working off what was provided from the original code, you just need to alter a few things to flow logically:
I'll point out what I changed:
import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')

def scrape_availability(text):
    soup = bs4.BeautifulSoup(text) #<-- need to get the html text before creating a bs4 object. So I moved the request (line below) before this, and also adjusted the parameter of this function.
    # the rest of the code is fine
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
This will give you:
import requests, bs4

BASE_AVAILABILITY_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s="
#get input for course
course = input('Enter the course:')
url = BASE_AVAILABILITY_URL + course

def scrape_availability(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
scrape_availability(url)
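The row-parsing logic can be sanity-checked offline on a tiny stand-in table (the markup below is assumed for illustration; the real page has many more columns and rows):

```python
import bs4

# Minimal stand-in for the availability table; only the class name
# '.listOfClasses' is taken from the answer above
sample = '''
<table class="listOfClasses">
  <tr><th>Gen</th><th>CRN</th><th>Course</th></tr>
  <tr><td></td><td>101</td><td>ICS 101</td></tr>
</table>
'''
soup = bs4.BeautifulSoup(sample, 'html.parser')
rows = soup.select('.listOfClasses tr')
for row in rows[1:]:           # skip the header row
    columns = row.select('td')
    print(columns[2].contents[0])  # ICS 101
```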

Scraping each element from website with BeautifulSoup

I wrote a code for scraping one real estate website. This is the link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/
From this page I can get only the location, size, and price of each apartment, but is it possible to write code that visits the page of each apartment and scrapes values from it, since it contains much more info? Check this link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I have posted my code below. I noticed that the URL changes when I click on a specific listing. For example:
arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I thought about creating a for loop, but there is no way to know how the URL changes because it has an id at the end:
NkvJK0Ou5tV
This is the code that I have:
from bs4 import BeautifulSoup
import requests

website = "https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/"
soup = requests.get(website).text
my_html = BeautifulSoup(soup, 'lxml')
lokacija = my_html.find_all('p', class_='offer-location text-truncate')
ukupna_kvadratura = my_html.find_all('p', class_='offer-price offer-price--invert')
ukupna_cena = my_html.find_all('div', class_='d-flex justify-content-between w-100')
ukupni_opis = my_html.find_all('div', class_='mt-1 mb-1 mt-lg-0 mb-lg-0 d-md-block offer-meta-info offer-adress')
for lok, kvadratura, cena_stana, sumarno in zip(lokacija, ukupna_kvadratura, ukupna_cena, ukupni_opis):
    lok = lok.text.split(',')[0]  # location
    kv = kvadratura.span.text.split(' ')[0]  # area
    jed = kvadratura.span.text.split(' ')[1]  # unit of measure
    cena = cena_stana.span.text  # price
    sumarno = sumarno.text
    datum = sumarno.split('|')[0].strip()  # date
    status = sumarno.split('|')[1].strip()  # status
    opis = sumarno.split('|')[2].strip()  # description
    print(lok, kv, jed, cena, datum, status, opis)
You can get the href from the div with class="placeholder-preview-box ratio-4-3".
From there you can build each listing's URL.
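As a hedged sketch of that approach, here is how the detail-page links could be collected from the listing markup (the HTML below is assumed from the class name mentioned above, not copied from the live site):

```python
from bs4 import BeautifulSoup

# Minimal assumed listing markup containing a preview box with a link
listing_html = '''
<div class="placeholder-preview-box ratio-4-3">
  <a href="/stambeni-objekti/stanovi/arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/"></a>
</div>
'''
soup = BeautifulSoup(listing_html, "html.parser")
detail_urls = ['https://www.nekretnine.rs' + a['href']
               for a in soup.select('div.placeholder-preview-box a[href]')]
print(detail_urls)
# Each absolute URL can then be fetched with requests and parsed the
# same way to scrape the extra fields on the detail page.
```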
You can iterate over the links provided by the pagination at the bottom of the page:
from bs4 import BeautifulSoup as soup
import requests

d = soup(requests.get('https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/').text, 'html.parser')

def scrape_page(page):
    return [{'title': i.h2.get_text(strip=True),
             'loc': i.p.get_text(strip=True),
             'price': i.find('p', {'class': 'offer-price'}).get_text(strip=True)}
            for i in page.find_all('div', {'class': 'row offer'})]

result = [scrape_page(d)]
while d.find('a', {'class': 'pagination-arrow arrow-right'}):
    next_href = d.find('a', {'class': 'pagination-arrow arrow-right'})['href']
    d = soup(requests.get(f'https://www.nekretnine.rs{next_href}').text, 'html.parser')
    result.append(scrape_page(d))

Beautifulsoup can't find text

I'm trying to write a scraper in Python using urllib and Beautiful Soup. I have a CSV of URLs for news stories, and for about 80% of the pages the scraper works. However, when there is a picture at the top of the story, the script no longer pulls the time or the body text. I am mostly confused because soup.find and soup.find_all don't seem to produce different results. I have tried a variety of tags that should capture the text, as well as both 'lxml' and 'html.parser'.
Here is the code:
testcount = 0
titles1 = []
bodies1 = []
times1 = []
data = pd.read_csv('URLsALLjun27.csv', header=None)
for url in data[0]:
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")
        titlemess = soup.find(id="title").get_text()  # getting the title
        titlestring = str(titlemess)  # make it a string
        title = titlestring.replace("\n", "").replace("\r", "")
        titles1.append(title)
        bodymess = soup.find(class_="article").get_text()  # get the body with markup
        bodystring = str(bodymess)  # make body a string
        body = bodystring.replace("\n", "").replace("\u3000", "")  # scrub markup
        bodies1.append(body)  # add to list for export
        timemess = soup.find('span', {"class": "time"}).get_text()
        timestring = str(timemess)
        time = timestring.replace("\n", "").replace("\r", "").replace("年", "-").replace("月", "-").replace("日", "")
        times1.append(time)
        testcount = testcount + 1  # counter
        print(testcount)
    except Exception as e:
        print(testcount, e)
And here are some of the results I get (those marked 'NoneType' are the ones where the title was pulled successfully but the body/time lookup failed):
1 http://news.xinhuanet.com/politics/2016-06/27/c_1119122255.htm
2 http://news.xinhuanet.com/politics/2016-05/22/c_129004569.htm 'NoneType' object has no attribute 'get_text'
Any help would be much appreciated! Thanks.
EDIT: I don't have '10 reputation points' so I can't post more links to test but will comment with them if you need more examples of pages.
The issue is that the pages with pictures have no class="article" element, and likewise no "class":"time" element. Consequently, it seems you'll have to detect whether there is a picture on the page, and if there is, search for the date and text as follows:
For the date, try:
timemess = soup.find(id="pubtime").get_text()
For the body text, it seems that the article is rather just the caption for the picture. Consequently, you could try the following:
bodymess = soup.find('img').findNext().get_text()
In brief, the soup.find('img') finds the image and findNext() goes to the next block which, coincidentally, contains the text.
Thus, in your code, I would do something as follows:
try:
    bodymess = soup.find(class_="article").get_text()
except AttributeError:
    bodymess = soup.find('img').findNext().get_text()

try:
    timemess = soup.find('span', {"class": "time"}).get_text()
except AttributeError:
    timemess = soup.find(id="pubtime").get_text()
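Here is a small runnable sketch of the findNext() fallback on made-up picture-page markup:

```python
from bs4 import BeautifulSoup

# Assumed markup for illustration: an image followed by its caption
html = '<div><img src="photo.jpg"/><p>Photo caption text</p></div>'
soup = BeautifulSoup(html, "html.parser")

# find_next() (findNext is the legacy alias) steps to the next tag
# in document order after the match
caption = soup.find('img').find_next().get_text()
print(caption)  # Photo caption text
```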
As a general workflow for web scraping, I usually open the website in a browser first and locate the elements I need using the browser's developer tools.
