I'm trying to scrape some product info from Amazon, but I don't know why my code doesn't work. Every time I run these lines, I get a None output.
I'm using Visual Studio.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.amazon.it/Xiaomi-frequenza-Monitoraggio-Bracciale-Smartwatch/dp/B07T9DHKXL?pf_rd_r=F2MMPNCJR5AQ4KP5C82P&pf_rd_p=ff59f7ef-650d-5e5a-9ee5-6fd80bb0e21d&pd_rd_r=12e6add2-54cd-44b1-bfa4-81c70ad68010&pd_rd_w=Lo5MD&pd_rd_wg=t2rFz&ref_=pd_gw_ri")
soup = BeautifulSoup(page.content,'html.parser')
title = soup.find(id='productTitle')
price = soup.find(id='priceblock_ourprice')
print(title)
print(price)
Andrej Kesely gave you the answer while I was typing, but to understand why this happens, just add this print line right after the soup = ... line:
soup = BeautifulSoup(page.content,'html.parser')
print(soup.find_all("title"))
title = soup.find(id='productTitle')
This will print:
[<title dir="ltr">Amazon CAPTCHA</title>]
Amazon isn't "showing" the real page to your code; it is asking for a CAPTCHA instead.
There are two problems:
1.) Set the User-Agent HTTP header. Without it, Amazon sends you the CAPTCHA page.
2.) Select html5lib or lxml as the parser; html.parser has problems parsing this page.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.it/Xiaomi-frequenza-Monitoraggio-Bracciale-Smartwatch/dp/B07T9DHKXL'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html5lib') # or 'lxml'
title = soup.find(id='productTitle')
price = soup.find(id='priceblock_ourprice')
print(title.get_text(strip=True))
print(price.get_text(strip=True))
Prints:
Xiaomi Mi Band 4 Activity Tracker,Monitor attività,Monitor frequenza cardiaca Monitoraggio Fitness, Bracciale Smartwatch con Schermo AMOLED a Colori 0,95, con iOS e Android (Versione Globale)
30,96 €
Most modern websites, including Amazon, load pages dynamically using JavaScript. When you send a request with requests.get you only get the initial HTML, without the dynamically loaded content. You could use a library like Selenium to load the page in a real browser and then pass the page source to BeautifulSoup.
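As a rough illustration of that approach, here is a minimal sketch using Selenium with headless Chrome; the driver setup, the short /dp/ URL and the productTitle id are assumptions based on the snippets above and may need adjusting for your environment:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome headless so no browser window opens (assumes chromedriver is installed and on PATH)
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://www.amazon.it/dp/B07T9DHKXL")

# Hand the fully rendered HTML over to BeautifulSoup, then close the browser
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

title = soup.find(id="productTitle")
print(title.get_text(strip=True) if title else None)

Note that Amazon may still serve a CAPTCHA to automated browsers, so the header-based answer above is usually the simpler route for this particular page.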
I'm completely new to Python and wanted to dip my toes into web scraping, so I tried to scrape the rankings of players in https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3#
But when I try to access the rankings and ratings of each player, it returns None. This is all inside the tbody, so I assume BeautifulSoup isn't able to access it because it's JavaScript, but I'm not sure. Please help.
Input:
from bs4 import BeautifulSoup
import requests
URL_USAFencingOctoberNac_2022 = "https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3"
October_Nac_2022 = requests.get(URL_USAFencingOctoberNac_2022)
October_Nac_2022 = BeautifulSoup(October_Nac_2022.text, "html.parser")
tbody = October_Nac_2022.tbody
print(tbody)
Output:
None
In this case the problem is not with BS4 but with the analysis done before starting to scrape. The data you are looking for is not available in the response to the request you made.
To get the data you have to make a request to a different back-end URL, https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name, which will give you a JSON response.
The code will look something like this:
from requests import get
url = 'https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name'
response = get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0',
                             'X-Requested-With': 'XMLHttpRequest'})
print(response.json())
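The exact shape of that JSON isn't shown above, so here is a quick, structure-agnostic way to inspect it before committing to any field names (just a sketch added here, not part of the original answer):

data = response.json()

# Peek at the structure first: a list of fencer records or a dict wrapper?
print(type(data))
if isinstance(data, list) and data:
    print(data[0])            # first record, to see which keys are available
elif isinstance(data, dict):
    print(list(data.keys()))  # top-level keys, to find where the rows live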
If you want to test how BS4 itself performs, consider the example below, which fetches the blog post links from the Zyte blog:
from requests import get
from bs4 import BeautifulSoup as bs
url = "https://www.zyte.com/blog/"
response = get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0'})
soup = bs(response.content, 'html.parser')
posts = soup.find_all('div', {"class":"oxy-posts"})
print(len(posts))
Note:
Before writing any scraping code, analyse the website thoroughly. It will give you an idea of where the website gets its data.
I was trying to scrape prices from booking.com, but it just keeps returning an empty list.
If anyone can explain to me what is happening, I would be really thankful.
Here is the website from which I am trying to scrape data:
https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1
Here is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get("https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1").text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('div', class_='fde444d7ef _e885fdc12')
print(prices)
After checking different possibilities I found two problems:
The price is in a <span>, but you search in a <div>.
The server sends different HTML to different browsers or devices, so the code needs the full User-Agent header from a real browser. A short Mozilla/5.0 is not enough, and requests by default sends something like Python/3.8 Requests/2.27.
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
}
url = "https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1"
response = requests.get(url, headers=headers)
#print(response.status_code)
html_text = response.text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('span', class_='fde444d7ef _e885fdc12')
for item in prices:
    print(item.text)
Hi everyone, I receive an error message when executing this code:
from bs4 import BeautifulSoup
import requests
import html.parser
from requests_html import HTMLSession
session = HTMLSession()
response = session.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all("tr")
for table in tables:
    movie_name = table.find("span", class_ = "secondaryInfo").text
    print(movie_name)
output:
movie_name = table.find("span", class_ = "secondaryInfo").text
AttributeError: 'NoneType' object has no attribute 'text'
Your selection includes the first row, which is the table header; it doesn't have that class because it doesn't list the movie data. An alternative is to simply exclude the header with a CSS selector of nth-child(n+2). You also only need requests.
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
soup = BeautifulSoup(response.content, 'html.parser')
for row in soup.select('tr:nth-child(n+2)'):
    movie_name = row.find("span", class_ = "secondaryInfo")
    print(movie_name.text)
Just use the SelectorGadget Chrome extension to grab a CSS selector by clicking on the desired element in your browser, without inventing anything superfluous. However, it doesn't work perfectly if the HTML structure is terrible.
You're looking for this:
for result in soup.select(".titleColumn a"):
    movie_name = result.text
Also, there's no need to use HTMLSession if you don't want to persist certain parameters across requests to the same host (website); see the short Session sketch after the example below.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests
# user-agent is used to make the request look like a real user's visit
# this could reduce the chance (a little bit) of being blocked by a website
# and help prevent an IP rate-limit block or permanent ban
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht", headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
for result in soup.select(".titleColumn a"):
    movie_name = result.text
    print(movie_name)
# output
'''
Eternals
Dune: Part One
No Time to Die
Venom: Let There Be Carnage
Ron's Gone Wrong
The French Dispatch
Halloween Kills
Spencer
Antlers
Last Night in Soho
'''
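As mentioned above, a session is only worth it when you want to reuse parameters across several requests to the same host. Here is a minimal sketch with requests.Session (the second URL is just an illustrative placeholder):

import requests

# A Session keeps headers, cookies, and the underlying connection alive
# across requests to the same host, so you configure them once.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
})

first = session.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
second = session.get("https://www.imdb.com/chart/top/")  # same headers reused automatically
print(first.status_code, second.status_code)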
P.S. I have a dedicated web scraping blog. If you need to parse search engines, give SerpApi a try.
Disclaimer: I work for SerpApi.
I am trying to make a chatbot that can get Bing search results using Python. I've tried many websites, but they all use old Python 2 code or Google. I am currently in China and cannot access YouTube, Google, or anything else related to Google (I can't use Azure or Microsoft Docs either). I want the results to look like this:
This is the title
https://this-is-the-link.com
This is the second title
https://this-is-the-second-link.com
Code
import requests
import bs4
import re
import urllib.request
from bs4 import BeautifulSoup
page = urllib.request.urlopen("https://www.bing.com/search?q=programming")
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
And it gives me
/?FORM=Z9FD1
javascript:void(0);
javascript:void(0);
/rewards/dashboard
/rewards/dashboard
javascript:void(0);
/?scope=web&FORM=HDRSC1
/images/search?q=programming&FORM=HDRSC2
/videos/search?q=programming&FORM=HDRSC3
/maps?q=programming&FORM=HDRSC4
/news/search?q=programming&FORM=HDRSC6
/shop?q=programming&FORM=SHOPTB
http://go.microsoft.com/fwlink/?LinkId=521839
http://go.microsoft.com/fwlink/?LinkID=246338
https://go.microsoft.com/fwlink/?linkid=868922
http://go.microsoft.com/fwlink/?LinkID=286759
https://go.microsoft.com/fwlink/?LinkID=617297
Any help would be greatly appreciated (I'm using Python 3.6.9 on Ubuntu)
Actually, the code you've written works properly; the problem is in the HTTP request headers. By default urllib uses Python-urllib/{version} as the User-Agent header value, which makes it easy for the website to recognize your request as automatically generated. To avoid this, you should use a custom value, which can be achieved by passing a Request object as the first parameter of urlopen():
from urllib.parse import urlencode, urlunparse
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
query = "programming"
url = urlunparse(("https", "www.bing.com", "/search", "", urlencode({"q": query}), ""))
custom_user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
req = Request(url, headers={"User-Agent": custom_user_agent})
page = urlopen(req)
# Further code I've left unmodified
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
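Since you want title/link pairs rather than every href on the page, here is a hedged follow-up sketch; the li.b_algo h2 a selector is an assumption about Bing's current result markup and may need adjusting if the layout differs:

# Assumed structure: each organic result sits in <li class="b_algo"> with the
# title link inside an <h2>. Verify this in your browser's inspector first.
for result in soup.select("li.b_algo h2 a"):
    print(result.get_text(strip=True))   # the title text
    print(result.get("href"))            # the result URL
    print()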
P.S. Take a look at the comment left by @edd under your question.
I am trying to get the title of the listings at this URL, but this code returns None.
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# Update Header
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0',
})
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
listingsTitle = soup.find('div', { 'class': 'ListingCell-KeyInfo-title'})
print(listingsTitle)
Any idea why that is?
Thanks
The URL you request treats you as a bot.
Request response:
<h1>Pardon Our Interruption...</h1>
<p>
As you were browsing <strong>www.lamudi.com.ph</strong> something about your
browser made us think you were a bot. There are a few reasons this might happen:
</p>
<ul>
Before you parse anything from the response, print the content first to make sure you have accessed the URL in the right way.
You have to add a User-Agent or something else to make you look like a real user.
Try adding this to your request headers:
USER_AGENT_FIREFOX= 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'
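For example, a minimal sketch of passing that header with the request (note that in your original snippet the headers dict was built but never passed to requests.get, so make sure it actually goes into the call; whether this alone is enough to pass the bot check is not guaranteed):

import requests
from bs4 import BeautifulSoup

USER_AGENT_FIREFOX = 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'

# Pass the headers with the request so the site sees a browser-like User-Agent
data = requests.get(
    'https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/',
    headers={'User-Agent': USER_AGENT_FIREFOX},
)
soup = BeautifulSoup(data.text, 'html.parser')
listingsTitle = soup.find('div', {'class': 'ListingCell-KeyInfo-title'})
print(listingsTitle)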
I tried with Selenium and with an explicit wait, but it doesn't work either.
If you print the soup you can see the error. In fact the page returns this: "As you were browsing www.lamudi.com.ph something about your browser made us think you were a bot. There are a few reasons this might happen:..."
The website recognises that you are not human.
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
print(soup)  # --> this print shows the error
listingsTitle = soup.find('div', class_='ListingCell-KeyInfo-title')
print(listingsTitle)