BeautifulSoup returns None - python

I am trying to get the title of the listings at this URL, but the code below returns None.
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# Update Header
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0',
})
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
listingsTitle = soup.find('div', { 'class': 'ListingCell-KeyInfo-title'})
print(listingsTitle)
Any idea why that is?
Thanks

The URL you request treats you as a bot.
Request response:
<h1>Pardon Our Interruption...</h1>
<p>
As you were browsing <strong>www.lamudi.com.ph</strong> something about your
browser made us think you were a bot. There are a few reasons this might happen:
</p>
<ul>
Before you parse anything from the response, print the content first to make sure you accessed the URL the right way. You have to add a User-Agent (or something else) to make you look like a real user.
Try adding this to your request headers:
USER_AGENT_FIREFOX = 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'

I tried with selenium and with an explicit wait, but it does not work.
If you print the soup you can see the error. In fact, the page returns this: "As you were browsing www.lamudi.com.ph something about your browser made us think you were a bot. There are a few reasons this might happen:... "
The website recognises that you are not human.
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
print(soup)  # --> this print shows the error
listingsTitle = soup.find('div', class_='ListingCell-KeyInfo-title')
print(listingsTitle)

Related

Why is BeautifulSoup leaving out parts of a website?

I'm completely new to Python and wanted to dip my toes into web scraping. So I tried to scrape the rankings of players in https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3#
But when I try to access the rankings and ratings of each player, it returns None. This is all inside the <tbody>, so I assume BeautifulSoup isn't able to access it because it's rendered by JavaScript, but I'm not sure. Please help ._.
Input:
from bs4 import BeautifulSoup
import requests
URL_USAFencingOctoberNac_2022 = "https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3"
October_Nac_2022 = requests.get(URL_USAFencingOctoberNac_2022)
October_Nac_2022 = BeautifulSoup(October_Nac_2022.text, "html.parser")
tbody = October_Nac_2022.tbody
print(tbody)
Output:
None
In this case the problem is not with BS4 but with your analysis before starting the scraping. The data which you are looking for is not available directly from the request you have made.
To get the data you have to make request to a different back end URL https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name, which will give you a JSON response.
The code will look something like this
from requests import get
url = 'https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0',
    'X-Requested-With': 'XMLHttpRequest',
}
response = get(url, headers=headers)
print(response.json())
If you want to test BS4 itself, consider the below example, which fetches the blog post links from the Zyte blog:
from requests import get
from bs4 import BeautifulSoup as bs
url = "https://www.zyte.com/blog/"
response = get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0'})
soup = bs(response.content, 'html.parser')
posts = soup.find_all('div', {"class": "oxy-posts"})
print(len(posts))
Note:
Before writing scraping code, analyse the website thoroughly. It will give you an idea of where the site's data actually comes from.

extracting text from id in beautifulsoup

I have HTML code like this and, using BeautifulSoup, I want to extract the text, which is 2,441.00.
There is a span element with an id equal to lastPrice.
<span id="lastPrice">2,441.00</span>
I have tried to look this up on the net, but I am still unable to do it. I am a beginner.
I have tried this:
tag = soup.span
price = soup.find(id="lastPrice")
print(price.text)
The data you see on the page is rendered via JavaScript, so BeautifulSoup doesn't see the price at that URL. However, the price is embedded within the page in JavaScript form. You can extract it, for example, like this:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www1.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HDFC"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(soup.find(id="responseDiv").contents[0])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
print(data["data"][0]["lastPrice"])
Prints:
2,407.90
Try this:
price = soup.select("#lastPrice")[0]
print(price.text)
Not with bs4, but with regex, as follows:
import re
line = '<span id="lastPrice">2,441.00</span>'
p = re.compile(r'<[^>]+>')  # strip the tags, keep the text
print(p.sub('', line))  # 2,441.00

soup.find_all returns empty list

I was trying to do some data scraping from booking.com for prices. But it just keeps on returning an empty list.
If anyone can explain to me what is happening, I would be really thankful.
Here is the website from which I am trying to scrape data:
https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1
Here is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get("https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1").text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('div', class_='fde444d7ef _e885fdc12')
print(prices)
After checking different possible problems, I found two:
the price is in a <span>, but you search in a <div>
the server sends different HTML for different browsers or devices, and the code needs a full User-Agent header from a real browser. It can't be a short Mozilla/5.0, and requests by default sends something like python-requests/2.27
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
}
url = "https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1"
response = requests.get(url, headers=headers)
#print(response.status_code)
html_text = response.text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('span', class_='fde444d7ef _e885fdc12')
for item in prices:
    print(item.text)

Trying to access hidden <div> tags when web scraping in python

So I'm trying to extract some data from a website by web scraping with Python, but some of the div tags are not expanding to show the data that I want.
This is my code.
import requests
from bs4 import BeautifulSoup as soup
uq_url = "https://my.uq.edu.au/programs-courses/requirements/program/2451/2021"
headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
}
web_r = requests.get(uq_url, headers=headers)
web_soup = soup(web_r.text, 'html.parser')
print(web_soup.prettify())
This is what the code scrapes, but it won't extract any of the data in the div with id="app". It's supposed to have a lot of data in there, as in the second picture. Any help would be appreciated.
All that content is present within a script tag, as shown in your image. You can regex out the appropriate JavaScript object, then parse it with hjson (which tolerates the unquoted keys that plain json would reject). Then extract whatever you want:
import requests, re, hjson
from bs4 import BeautifulSoup as bs #there is some data as embedded html you may wish to parse later from json
r = requests.get('https://my.uq.edu.au/programs-courses/requirements/program/2451/2021', headers = {'User-Agent':'Mozilla/5.0'})
data = hjson.loads(re.search(r'window\.AppData = ([\s\S]+?);\n' , r.text).group(1))
# hjson.dumpsJSON(data['programRequirements'])
core_courses = data['programRequirements']['payload']['components'][1]['payload']['body'][0]['body']
for course in core_courses:
    if 'curriculumReference' in course:
        print(course['curriculumReference'])

Beautiful Soup Python findAll returning empty list

I am trying to scrape an Amazon Alexa Skill: https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1
For now, I am just trying to get the name of the skill (Paypal), but for some reason this is returning an empty list. I have looked at the website's inspect element and I know that it should give me the name so I am not sure what is going wrong. My code is below:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

request = Request(skill_url, headers=request_headers)
response = urlopen(request)
response = response.read()
html = response.decode()
soup = BeautifulSoup(html, 'html.parser')
name = soup.find_all("h1", {"class" : "a2s-title-content"})
The page content is loaded with javascript, so you can't just use BeautifulSoup to scrape it. You have to use another module like selenium to simulate javascript execution.
Here is an example:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
url='YOUR URL'
driver = webdriver.Firefox()
driver.get(url)
page = driver.page_source
page_soup = soup(page,'html.parser')
containers = page_soup.find_all("h1", {"class" : "a2s-title-content"})
print(containers)
print(len(containers))
You can also use chrome-driver or edge-driver.
Try to set User-Agent and Accept-Language HTTP headers to prevent the server send you Captcha page:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
'Accept-Language': 'en-US,en;q=0.5'
}
url = 'https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1'
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'lxml')
name = soup.find("h1", {"class" : "a2s-title-content"})
print(name.get_text(strip=True))
Prints:
PayPal
