Python BeautifulSoup not extracting every URL

I'm trying to find all the URLs on this page: https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments
More specifically, I want the links that are hyperlinked under each "Subject Code". However, when I run my code, barely any links get extracted.
I would like to know why this is happening, and how I can fix it.
from bs4 import BeautifulSoup
import requests

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
page = requests.get(url)
soup = BeautifulSoup(page.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))
This is my first attempt at web scraping.

There's anti-bot protection on that site; just add a User-Agent to your headers. And don't forget to inspect your soup when things go wrong.
from bs4 import BeautifulSoup
import requests

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
ua = {'User-Agent': 'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_8_2) AppleWebKit/531.2 (KHTML, like Gecko) Chrome/26.0.869.0 Safari/531.2'}
r = requests.get(url, headers=ua)
soup = BeautifulSoup(r.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))
Without the header, the message in the soup was:
Sorry for the inconvenience.
We have detected excess or unusual web requests originating from your browser, and are unable to determine whether these requests are automated.
To proceed to the requested page, please complete the captcha below.
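As the answer says, check your soup when things go wrong. Here is a minimal sanity-check sketch before parsing (testing for the word 'captcha' is an assumption based on the block page quoted above):
import requests
from bs4 import BeautifulSoup

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
# inspect the response before trusting the parse
print(r.status_code, len(r.text))
soup = BeautifulSoup(r.text, features="lxml")
if 'captcha' in soup.get_text().lower():
    print("Blocked: the server returned a captcha page instead of the course schedule.")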

I would use nth-child(1) to restrict matches to the first column of the table, which I locate by its id. Then simply extract the .text of each cell. If it contains *, store a default string meaning no course offered; otherwise, concatenate the retrieved course identifier onto a base query string:
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments', headers=headers)
soup = bs(r.content, 'lxml')

no_course = ''
base = 'https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-department&dept='
# map each subject code to its department URL, or to no_course for starred rows
course_info = {i.text: (no_course if '*' in i.text else base + i.text)
               for i in soup.select('#mainTable td:nth-child(1)')}
course_info
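If you only want the subjects that actually link to a department page, you can filter out the empty defaults afterwards (a small follow-up sketch reusing the course_info dict built above):
offered = {code: link for code, link in course_info.items() if link}
for code, link in sorted(offered.items()):
    print(code, link)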

Related

soup.find_all returns empty list

I was trying to scrape prices from booking.com, but it just keeps returning an empty list.
If anyone can explain what is happening, I would be really thankful.
Here is the website from which I am trying to scrape data:
https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1
Here is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get("https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1").text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('div', class_='fde444d7ef _e885fdc12')
print(prices)
After checking different possibilities, I found two problems:
The price is in a <span>, but you search in a <div>.
The server sends different HTML to different browsers and devices, so the code needs a full User-Agent header from a real browser. It can't be the short Mozilla/5.0, and requests by default sends something like python-requests/2.27.
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
}

url = "https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB&sid=2dad976fd78f6001d59007a49cb13017&sb=1&sb_lp=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaCeIAQGYATG4AQfIAQzYAQPoAQH4AQKIAgGoAgO4AuGQ8JAGwAIB0gIkYjFlZDljM2MtOGJiMy00MGZiLWIyMjMtMWIwYjNhYzU5OGQx2AIE4AIB%3Bsid%3D2dad976fd78f6001d59007a49cb13017%3Bsb_price_type%3Dtotal%26%3B&ss=Golden&is_ski_area=0&ssne=Golden&ssne_untouched=Golden&dest_id=-565331&dest_type=city&checkin_year=2022&checkin_month=3&checkin_monthday=15&checkout_year=2022&checkout_month=3&checkout_monthday=16&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1"
response = requests.get(url, headers=headers)
#print(response.status_code)
html_text = response.text
soup = BeautifulSoup(html_text, 'lxml')

prices = soup.find_all('span', class_='fde444d7ef _e885fdc12')
for item in prices:
    print(item.text)
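The extracted text still contains a currency symbol and separators. A small follow-up sketch to turn it into a number (the cleaning regex is an assumption, since Booking's price format varies by locale):
import re

def parse_price(text):
    # keep digits and the decimal point; assumes ',' is only a thousands separator
    digits = re.sub(r'[^\d.]', '', text.replace(',', ''))
    return float(digits) if digits else None

print(parse_price('US$120'))  # 120.0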

Beautiful Soup cannot scrape after the first div tag

Please see below. I would like to scrape the restaurant name shown here:
Popeyes
Please see the image below for the HTML of this website.
Can someone please show me how I can scrape that restaurant name "Popeyes" in Python using Beautiful Soup or any other web scraping package?
Thanks in advance!
Below is the code I used to scrape data; however, it stopped at the first div tag and I couldn't go further.
from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

# URL to web scrape from
page_url = "https://www.doordash.com/store/popeyes-toronto-254846/en-CA"

# opens the connection and downloads the html page from the url
uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
# as if it were a json data type
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

page_soup.div
You can try this (I may have made a mistake on the class name):
import urllib.request
import bs4 as bs

url_1 = 'https://www.doordash.com/store/popeyes-toronto-254846/en-CA'
sauce_1 = urllib.request.urlopen(url_1).read()
soup_1 = bs.BeautifulSoup(sauce_1, 'lxml')

for x in soup_1.find_all('h1', class_='sc-AnqlK keKZVr sc-jFpLkX bsGprJ'):
    print(x)
Let me know if this helps!
You can get the name by specifying the 'div' class.
from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

# the store page from the question
url = "https://www.doordash.com/store/popeyes-toronto-254846/en-CA"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

title = soup.find(class_='sc-AnqlK keKZVr sc-jFpLkX bsGprJ').get_text()
print(title)
I don't know if I wrote the class name right, but you can copy and paste it from the page.
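Those class names look auto-generated (styled-components hashes), so they may change between deployments. A more robust fallback is to target the tag itself (a sketch, assuming the restaurant name stays in the first h1):
title_tag = soup.find('h1')  # auto-generated class names churn; the tag is more stable
if title_tag:
    print(title_tag.get_text(strip=True))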

When using 'requests.get()', the html is less than the actual html

The first image is the HTML of the site as fetched in Python, and the second image is the actual HTML viewed by pressing F12 on the site. I don't know why the two results are different. Other sites return their HTML normally; I wonder why only this site does not.
Here is the Code:
import requests
from bs4 import BeautifulSoup
result = requests.get('https://www.overbuff.com/heroes')
soup = BeautifulSoup(result.text, "html.parser")
print(soup)
You're probably being blocked by the page; try it with some headers like:
headers = {"user-agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers)
Example
import requests
from bs4 import BeautifulSoup
url = "https://www.overbuff.com/heroes"
headers = {"user-agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
print(soup)
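To confirm the diagnosis, you can compare the response sizes with and without the header (a quick sketch; the exact numbers will vary):
import requests

url = "https://www.overbuff.com/heroes"
plain = requests.get(url)
with_ua = requests.get(url, headers={"user-agent": "Mozilla/5.0"})
# a much smaller body without the header usually means a block or stub page
print(len(plain.text), len(with_ua.text))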

Scraping image hrefs from an Ordered List using BeautifulSoup

I am trying to retrieve the images from this website (with permission). Here is my code below with the website I want to access:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.vgmuseum.com/nes.htm"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html5lib")
li = soup.select('ol > li > a')
for link in li:
    print(link.get('href'))
The images I would like to use are in this ordered list: list location for images
The page you are working with consists of iframes, which are basically a way of including one page inside another. Browsers understand how iframes work: they download the framed pages and display them in the browser window.
urllib2, though, is not a browser and cannot do that. You need to explore where the list of links is located, i.e. in which iframe, and then follow the URL that the iframe's content comes from. In your case, the list of links on the left comes from the http://www.vgmuseum.com/nes_b.html page.
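If you need to discover this yourself, you can list the frame sources programmatically (a small sketch; it checks both frame and iframe tags, since pages of this era may use either):
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://www.vgmuseum.com/nes.htm")
soup = BeautifulSoup(resp.content, "lxml")
# print where each (i)frame's content comes from
for frame in soup.find_all(["frame", "iframe"]):
    print(frame.get("src"))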
Here is a working solution that follows the links in the list, downloads the pages containing images, and saves the images into the images/ directory. I am using the requests module and the lxml parser teamed up with BeautifulSoup for faster HTML parsing:
from urllib.parse import urljoin
import os

import requests
from bs4 import BeautifulSoup

url = "http://www.vgmuseum.com/nes_b.html"

def download_image(session, url):
    print(url)
    local_filename = os.path.join("images", url.split('/')[-1])
    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

os.makedirs("images", exist_ok=True)  # the target directory has to exist

with requests.Session() as session:
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }

    response = session.get(url)
    soup = BeautifulSoup(response.content, "lxml")

    for link in soup.select('ol > li > a[href*=images]'):
        response = session.get(urljoin(response.url, link.get('href')))
        for image in BeautifulSoup(response.content, "lxml").select("img[src]"):
            download_image(session, url=urljoin(response.url, image["src"]))
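A practical note on this design: stream=True lets iter_content() write each image to disk in chunks instead of loading the whole file into memory, and reusing a single Session keeps the connection pool and headers across all requests. Adding a short time.sleep() between downloads is also a polite touch when pulling this many files.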
I used the URL in @Dan's comment above for parsing.
Code:
import requests
from bs4 import BeautifulSoup
url = 'http://www.vgmuseum.com/nes_b.html'
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')
li = soup.find('ol')
soup = BeautifulSoup(str(li), 'html.parser')
a = soup.find_all('a')

for link in a:
    if link.get('href') not in (None, '#top'):
        print(link.get('href'))
Output:
images/nes/10yard.html
images/nes2/10.html
pics2/100man.html
images/nes/1942.html
images/nes2/1942.html
images/nes/1943.html
images/nes2/1943.html
pics7/1944.html
images/nes/1999.html
images/nes2/2600.html
images/nes2/3dbattles.html
images/nes2/3dblock.html
images/nes2/3in1.html
images/nes/4cardgames.html
pics2/4.html
images/nes/4wheeldrivebattle.html
images/nes/634.html
images/nes/720NES.html
images/nes/8eyes.html
images/nes2/8eyes.html
images/nes2/8eyesp.html
pics2/89.html
images/nes/01/blob.html
pics5/boy.html
images/03/a.html
images/03/aa.html
images/nes/abadox.html
images/03/abadoxf.html
images/03/abadoxj.html
images/03/abadoxp.html
images/03/abarenbou.html
images/03/aces.html
images/03/action52.html
images/03/actionin.html
images/03/adddragons.html
images/03/addheroes.html
images/03/addhillsfar.html
images/03/addpool.html
pics/addamsfamily.html
pics/addamsfamilypugsley.html
images/nes/01/adventureislandNES.html
images/nes/adventureisland2.html
images/nes/advisland3.html
pics/adventureisland4.html
images/03/ai4.html
images/nes/magickingdom.html
pics/bayou.html
images/03/bayou.html
images/03/captain.html
images/nes/adventuresofdinoriki.html
images/03/ice.html
images/nes/01/lolo1.html
images/03/lolo.html
images/nes/01/adventuresoflolo2.html
images/03/lolo2.html
images/nes/adventuresoflolo3.html
pics/radgravity.html
images/03/rad.html
images/nes/01/rockyandbullwinkle.html
images/nes/01/tomsawyer.html
images/03/afroman.html
images/03/afromario.html
pics/afterburner.html
pics2/afterburner2.html
images/03/ai.html
images/03/aigiina.html
images/nes/01/airfortress.html
images/03/air.html
images/03/airk.html
images/nes/01/airwolf.html
images/03/airwolfe.html
images/03/airwolfj.html
images/03/akagawa.html
images/nes/01/akira.html
images/03/akka.html
images/03/akuma.html
pics2/adensetsu.html
pics2/adracula.html
images/nes/01/akumajo.html
pics2/aspecial.html
pics/alunser.html
images/nes/01/alfred.html
images/03/alice.html
images/nes/01/alien3.html
images/nes/01/asyndrome.html
images/03/alien.html
images/03/all.html
images/nes/01/allpro.html
images/nes/01/allstarsoftball.html
images/nes/01/alphamission.html
pics2/altered.html

finding unique web links using python

I am writing a program to extract unique web links from www.stevens.edu (it is an assignment), but there is one problem. My program works and extracts links for all sites except www.stevens.edu, for which I get the output 'None'. I am very frustrated with this and need help. I am using this URL for testing: http://www.stevens.edu/
import urllib
from bs4 import BeautifulSoup as bs

url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = bs(html)
tags = soup('a')

for tag in tags:
    print tag.get('href', None)
Please guide me here and let me know why it is not working with www.stevens.edu.
The site checks the User-Agent header and returns different HTML based on it.
You need to set a User-Agent header to get the proper HTML:
import urllib2
from bs4 import BeautifulSoup as bs

url = raw_input('enter - ')

req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # <--
html = urllib2.urlopen(req).read()

soup = bs(html)
tags = soup('a')

for tag in tags:
    print tag.get('href', None)
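Since the assignment asks for unique links, you can collect the hrefs into a set before printing (a small follow-up sketch that reuses the tags from the code above and runs under both Python 2 and 3):
unique_links = set()
for tag in tags:
    href = tag.get('href')
    if href:
        unique_links.add(href)
for href in sorted(unique_links):
    print(href)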
