Why is BeautifulSoup leaving out parts of a website? - python

I'm completely new to Python and wanted to dip my toes into web scraping, so I tried to scrape the rankings of players at https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3#
But when I try to access the rankings and ratings of each player, it returns None. The data is all inside the <tbody>, so I assume BeautifulSoup isn't able to access it because it's rendered by JavaScript, but I'm not sure. Please help.
Input:
from bs4 import BeautifulSoup
import requests
URL_USAFencingOctoberNac_2022 = "https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3"
October_Nac_2022 = requests.get(URL_USAFencingOctoberNac_2022)
October_Nac_2022 = BeautifulSoup(October_Nac_2022.text, "html.parser")
tbody = October_Nac_2022.tbody
print(tbody)
Output:
None

In this case the problem is not with BS4 but with your analysis before starting the scraping. The data you are looking for is not available directly from the request you made.
To get the data, you have to make a request to a different back-end URL, https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name, which returns a JSON response.
The code will look something like this:
from requests import get
url = 'https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0',
           'X-Requested-With': 'XMLHttpRequest'}
response = get(url, headers=headers)
print(response.json())
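If you then want individual fields out of that JSON, a sketch like the one below may help. The key names 'name' and 'rating' used here are assumptions for illustration; print response.json() first and use the keys you actually see.
from requests import get
url = 'https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0'}
competitors = get(url, headers=headers).json()
# Each entry is assumed to be a dict describing one competitor.
# NOTE: 'name' and 'rating' are guessed keys -- inspect the real JSON for the actual ones.
for competitor in competitors:
    print(competitor.get('name'), competitor.get('rating'))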
If you want to test how BS4 performs on its own, consider the example below, which fetches the blog post links from https://www.zyte.com/blog/:
from requests import get
from bs4 import BeautifulSoup as bs
url = "https://www.zyte.com/blog/"
response = get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0'})
soup = bs(response.content, 'html.parser')
posts = soup.find_all('div', {"class": "oxy-posts"})
print(len(posts))
Note:
Before writing scraping code, analyse the website thoroughly. It will give you an idea of where the website actually gets its data from.

Related

soup.find_all returns empty list

I was trying to scrape prices from booking.com, but it just keeps returning an empty list.
If anyone can explain to me what is happening, I would be really thankful.
Here is the website from which I am trying to scrape data:
https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1
Here is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get("https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1").text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('div', class_='fde444d7ef _e885fdc12')
print(prices)
After checking different possibilities I found two problems:
the price is in a <span>, but you search in a <div>
the server sends different HTML to different browsers or devices, so the code needs a full User-Agent header from a real browser. It can't be the short Mozilla/5.0, and requests by default sends something like python-requests/2.27
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
}
url = "https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1"
response = requests.get(url, headers=headers)
#print(response.status_code)
html_text = response.text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('span', class_='fde444d7ef _e885fdc12')
for item in prices:
    print(item.text)

Scraping Website does not return correct source code

I'm trying to scrape a Quizlet match set with Python. I want to scrape all the <span> tags with class TermText.
Here's the URL: 'https://quizlet.com/291523268'
import requests
URL = 'https://quizlet.com/291523268'
raw = requests.get(URL).text
raw ends up containing none of the tags or cards at all. When I check the source of the website in the browser, it shows all the TermText spans I need, which means the content is not JS-loaded. So I don't understand why my HTML is coming out wrong, since it doesn't contain any of the HTML I need.
To get the correct response from the server, set a correct User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
url = 'https://quizlet.com/291523268/python-flash-cards/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for span in soup.select('span.TermText'):
    print(span.get_text(strip=True))
Prints:
algorithm
A set of specific steps for solving a category of problems
token
basic elements of a language(letters, numbers, symbols)
high-level language
A programming language like Python that is designed to be easy for humans to read and write.
low-level langauge
...and so on.

Web-scraping using python3

I'm trying to fetch some info about Amazon products. I don't know why my code doesn't work; every time I test these lines, I get None as output.
I'm using Visual Studio.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.amazon.it/Xiaomi-frequenza-Monitoraggio-Bracciale-Smartwatch/dp/B07T9DHKXL?pf_rd_r=F2MMPNCJR5AQ4KP5C82P&pf_rd_p=ff59f7ef-650d-5e5a-9ee5-6fd80bb0e21d&pd_rd_r=12e6add2-54cd-44b1-bfa4-81c70ad68010&pd_rd_w=Lo5MD&pd_rd_wg=t2rFz&ref_=pd_gw_ri")
soup = BeautifulSoup(page.content,'html.parser')
title = soup.find(id='productTitle')
price = soup.find(id='priceblock_ourprice')
print(title)
print(price)
Andrej Kesely gave you the answer while I was typing, but to understand why this happens,
just add this print line after the soup = ... line:
soup = BeautifulSoup(page.content,'html.parser')
print(soup.find_all("title"))
title = soup.find(id='productTitle')
This will print:
[<title dir="ltr">Amazon CAPTCHA</title>]
Amazon isn't "showing" the real page to your code, it is asking for a captcha.
There are 2 problems:
1.) Use the HTTP header User-Agent. Without it, Amazon sends you a CAPTCHA page.
2.) As the parser, select html5lib or lxml; html.parser has problems parsing this page.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.it/Xiaomi-frequenza-Monitoraggio-Bracciale-Smartwatch/dp/B07T9DHKXL'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html5lib') # or 'lxml'
title = soup.find(id='productTitle')
price = soup.find(id='priceblock_ourprice')
print(title.get_text(strip=True))
print(price.get_text(strip=True))
Prints:
Xiaomi Mi Band 4 Activity Tracker,Monitor attività,Monitor frequenza cardiaca Monitoraggio Fitness, Bracciale Smartwatch con Schermo AMOLED a Colori 0,95, con iOS e Android (Versione Globale)
30,96 €
Most modern websites, including Amazon, load parts of their pages dynamically using JavaScript. When you send a request with requests.get, you get only the initial HTML, without the dynamically loaded content. You could use a library like Selenium to render such pages in a real browser and then pass the page source to Beautiful Soup.
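A minimal sketch of that approach, assuming Selenium 4 with Firefox and geckodriver on PATH (Amazon may still serve a CAPTCHA to an automated browser):
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Firefox()  # opens a real browser that executes JavaScript
driver.get('https://www.amazon.it/Xiaomi-frequenza-Monitoraggio-Bracciale-Smartwatch/dp/B07T9DHKXL')
html = driver.page_source  # HTML after JavaScript has run
driver.quit()
soup = BeautifulSoup(html, 'html5lib')
title = soup.find(id='productTitle')
print(title.get_text(strip=True) if title else None)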

Get Bing search results in Python

I am trying to make a chatbot that can get Bing search results using Python. I've tried many websites, but they all use old Python 2 code or Google. I am currently in China and cannot access YouTube, Google, or anything else related to Google (Can't use Azure and Microsoft Docs either). I want the results to be like this:
This is the title
https://this-is-the-link.com
This is the second title
https://this-is-the-second-link.com
Code
import requests
import bs4
import re
import urllib.request
from bs4 import BeautifulSoup
page = urllib.request.urlopen("https://www.bing.com/search?q=programming")
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
And it gives me
/?FORM=Z9FD1
javascript:void(0);
javascript:void(0);
/rewards/dashboard
/rewards/dashboard
javascript:void(0);
/?scope=web&FORM=HDRSC1
/images/search?q=programming&FORM=HDRSC2
/videos/search?q=programming&FORM=HDRSC3
/maps?q=programming&FORM=HDRSC4
/news/search?q=programming&FORM=HDRSC6
/shop?q=programming&FORM=SHOPTB
http://go.microsoft.com/fwlink/?LinkId=521839
http://go.microsoft.com/fwlink/?LinkID=246338
https://go.microsoft.com/fwlink/?linkid=868922
http://go.microsoft.com/fwlink/?LinkID=286759
https://go.microsoft.com/fwlink/?LinkID=617297
Any help would be greatly appreciated (I'm using Python 3.6.9 on Ubuntu)
Actually, the code you've written works properly; the problem is in the HTTP request headers. By default, urllib uses Python-urllib/{version} as the User-Agent header value, which makes it easy for a website to recognize your request as automatically generated. To avoid this, you should use a custom value, which you can achieve by passing a Request object as the first parameter of urlopen():
from urllib.parse import urlencode, urlunparse
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
query = "programming"
url = urlunparse(("https", "www.bing.com", "/search", "", urlencode({"q": query}), ""))
custom_user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
req = Request(url, headers={"User-Agent": custom_user_agent})
page = urlopen(req)
# Further code I've left unmodified
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
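To get output closer to the title/link format you asked for, you could narrow the selection to the organic results instead of every <a>. A sketch, assuming Bing's markup at the time of writing puts each result in li.b_algo with the title in h2 > a (class names can change at any time):
from urllib.parse import urlencode, urlunparse
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
query = "programming"
url = urlunparse(("https", "www.bing.com", "/search", "", urlencode({"q": query}), ""))
req = Request(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"})
soup = BeautifulSoup(urlopen(req).read(), "html.parser")
# Assumed layout: <li class="b_algo"><h2><a href="...">title</a></h2>...</li>
for link in soup.select("li.b_algo h2 a"):
    print(link.get_text(strip=True))
    print(link["href"])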
P.S. Take a look at the comment left by @edd under your question.

BeautifulSoup returns None

I am trying to get the title of the listings at this URL, but this code returns None.
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# Update Header
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0',
})
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
listingsTitle = soup.find('div', { 'class': 'ListingCell-KeyInfo-title'})
print(listingsTitle)
Any idea why that is?
Thanks
The URL you request treats you as a bot.
Request response:
<h1>Pardon Our Interruption...</h1>
<p>
As you were browsing <strong>www.lamudi.com.ph</strong> something about your
browser made us think you were a bot. There are a few reasons this might happen:
</p>
<ul>
Before you parse anything from the response, print the content first to make sure you have accessed the URL the right way. You have to add a User-Agent or something else to make you look like a real user.
Try adding this to your request headers:
USER_AGENT_FIREFOX = 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'
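A minimal sketch of wiring that header into the original request (no guarantee it is enough to pass the bot check; the site may look at more than the User-Agent):
import requests
from bs4 import BeautifulSoup
USER_AGENT_FIREFOX = 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/',
                    headers={'User-Agent': USER_AGENT_FIREFOX})
soup = BeautifulSoup(data.text, 'html.parser')
listingsTitle = soup.find('div', {'class': 'ListingCell-KeyInfo-title'})
print(listingsTitle)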
I tried with Selenium and with an explicit wait, but it does not work either.
If you print the soup you can see the error. In fact the page returns this: "As you were browsing www.lamudi.com.ph something about your browser made us think you were a bot. There are a few reasons this might happen: ..."
The website recognises that you are not human.
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
print(soup)  # --> this print shows the error page
listingsTitle = soup.find('div', class_='ListingCell-KeyInfo-title')
print(listingsTitle)
