Scraping <span> "text" </span> with BeautifulSoup - python

I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
url = 'https://www.amazon.com/GymCope-Anti-Tear-Cushioning-Non-Slip-Exercise/dp/B0921F1T2P/ref=sr_1_3_sspa?brr=1&pd_rd_r=4b40f0a8-f2d8-44dc-9a98-413c64d3fa34&pd_rd_w=P9ZJI&pd_rd_wg=RS7zW&pf_rd_p=9875e817-188b-48a2-986d-8146749644ac&pf_rd_r=AGWBT5KT04TYKGPZASKA&qid=1642452438&rd=1&rnid=3407731&s=sporting-goods&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExVjhWTk0xQU5WWldPJmVuY3J5cHRlZElkPUEwODE0MzYwMTdMTDZSNDVST08yMiZlbmNyeXB0ZWRBZElkPUEwODQ4MDM0MlE4WEtVUjFKMUdLMiZ3aWRnZXROYW1lPXNwX2F0Zl9icm93c2UmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl'
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html)
bsr = soup.find("div", class_="a-section table-padding").text
and this is what I see:
>>> bsr
' ASIN B0921F1T2P Customer Reviews \n\n \n 4.6 out of 5 stars \n 41 ratings \n\n\n 4.6 out of 5 stars Best Sellers Rank #69,660 in Sports & Outdoors (See Top 100 in Sports & Outdoors) #234 in Yoga Mats Date First Available April 8, 2021 '
I tried
bsra = soup.find("div", class_="a-section table-padding").find_next('span').get_text()
but it comes out
>>> bsra
'\n 4.6 out of 5 stars '
I only want to scrape the "Best Sellers Rank" value, as in the picture. Thanks.

The picture referenced in your question is missing, but you can get the rank by selecting your elements more specifically:
soup.select_one('th:-soup-contains("Best Sellers Rank") + td').text.split()[0]
Example
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
url = 'https://www.amazon.com/GymCope-Anti-Tear-Cushioning-Non-Slip-Exercise/dp/B0921F1T2P/ref=sr_1_3_sspa?brr=1&pd_rd_r=4b40f0a8-f2d8-44dc-9a98-413c64d3fa34&pd_rd_w=P9ZJI&pd_rd_wg=RS7zW&pf_rd_p=9875e817-188b-48a2-986d-8146749644ac&pf_rd_r=AGWBT5KT04TYKGPZASKA&qid=1642452438&rd=1&rnid=3407731&s=sporting-goods&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExVjhWTk0xQU5WWldPJmVuY3J5cHRlZElkPUEwODE0MzYwMTdMTDZSNDVST08yMiZlbmNyeXB0ZWRBZElkPUEwODQ4MDM0MlE4WEtVUjFKMUdLMiZ3aWRnZXROYW1lPXNwX2F0Zl9icm93c2UmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl'
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
soup.select_one('th:-soup-contains("Best Sellers Rank") + td').text.split()[0]
Output
#84,712
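If you need the rank as a number rather than the raw "#84,712" string, you can strip the "#" and the thousands separator (a small sketch, assuming the result of the one-liner above is stored in rank_text):
rank_text = soup.select_one('th:-soup-contains("Best Sellers Rank") + td').text.split()[0]
rank = int(rank_text.lstrip('#').replace(',', ''))  # "#84,712" -> 84712
print(rank)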

Related

I am trying to navigate through the pages of a website and scrape its links but the same page data is scraped even after changing page number

from bs4 import BeautifulSoup
import requests
import pymongo

def traverse_source():
    article_links = []
    for pgindx in range(9):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            "path": f"issue/S0196-0644(21)X0012-1?pageStart={pgindx}",
            "Sec-fetch-site": "same-origin",
        }
        source_url = ""
        source_data = requests.get(source_url, headers=headers)
        print(source_data.headers)
        source_url = None
        source_soup = BeautifulSoup(source_data.content, "html.parser")
        destination = source_soup.find_all("h3", attrs={'class': 'toc__item__title'})
        for dest in destination:
            try:
                article_links.append("https://www.annemergmed.com" + dest.a['href'])
            except:
                pass
        source_soup = None
    print(article_links)

if __name__ == "__main__":
    traverse_source()
Here, even after incrementing the page number, the content of the first page is always scraped. I also tried navigating through the pages with a GET request (changing the URL), but even after changing the source URL it still scrapes the data of page number 1.
This is one way of scraping that data:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)

big_list = []
for x in tqdm(range(9)):
    r = s.get(f'https://www.annemergmed.com/issue/S0196-0644(21)X0012-1?pageStart={x}')
    soup = BeautifulSoup(r.text, 'html.parser')
    titles = soup.select('div.articleCitation')
    for t in titles:
        url = t.select_one('h3 a').get('href')
        header = t.select_one('h3 a').text
        try:
            authors = t.select_one('ul.toc__item__authors').get_text(strip=True)
        except Exception as e:
            authors = 'Unknown'
        big_list.append((header, f'https://www.annemergmed.com{url}', authors))

df = pd.DataFrame(list(set(big_list)), columns=['Title', 'Url', 'Authors'])
print(df.shape)
print(df.head(50))
This will return:
(409, 3)
Title Url Authors
0 194 Challenging the Dogma of Radiographs a Joint Above and Below a Suspected Fracture: Quantification of Waste in Wrist Fracture Evaluation https://www.annemergmed.com/article/S0196-0644(21)01046-5/fulltext M. Rozum,D. Mark Courtney,D. Diercks,S. McDonald
1 112 A Geographical Analysis of Access to Trauma Care From US National Parks in 2018 https://www.annemergmed.com/article/S0196-0644(21)00963-X/fulltext S. Robichaud,K. Boggs,B. Bedell,...A. Sullivan,N. Harris,C. Camargo
2 87 Emergency Radiology Overreads Change Management of Transferred Patients With Traumatic Injuries https://www.annemergmed.com/article/S0196-0644(21)00937-9/fulltext M. Vrablik,R. Kessler,M. Vrablik,...J. Robinson,D. Hippe,M. Hall
[...]
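As a side note on why the original loop always returned page 1: a "path" entry in the request headers does not change the URL that requests actually fetches, so every iteration requested the same source_url. The page number has to be part of the URL itself, for example via the params argument (a minimal sketch reusing the names from the question):
for pgindx in range(9):
    source_data = requests.get(
        "https://www.annemergmed.com/issue/S0196-0644(21)X0012-1",
        params={"pageStart": pgindx},  # becomes ?pageStart=0, ?pageStart=1, ...
        headers=headers,
    )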

How to get all products from all categories

Could anyone assist me with my code? I am trying to scrape products and prices from a patisserie website, however it only retrieves the products on the main page. The rest of the products, which are classified in categories, have the same tags and attributes, but when I run my code only the products on the main page appear. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

cakes = []
url = "https://mrbakeregypt.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(requests.get(url).content, "html.parser")
articles = soup.find_all("div", class_="grid-view-item product-card")

for item in articles:
    product = item.find("div", class_="h4 grid-view-item__title product-card__title").text
    price_regular = item.find("div", class_="price__regular").text.strip().replace('\n', '')
    item_cost = {"name": product,
                 "cost": price_regular
                 }
    cakes.append(item_cost)
As mentioned, you have to process all collections / categories. One approach could be to collect the links from your baseUrl - note that I used a set comprehension to get only unique URLs and to avoid iterating over the same category more than once:
urlList = list(set(baseUrl+a['href'] for a in soup.select('a[href*="collection"]')))
Now you can iterate over this urlList to scrape your information:
...
for url in urlList:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    articles = soup.find_all("div", class_="grid-view-item product-card")
    ...
Example
Take a look - it also handles the type / category of product and both prices, so you could filter based on these in your dataframe.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

baseUrl = "https://mrbakeregypt.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

r = requests.get(baseUrl, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
urlList = list(set(baseUrl+a['href'] for a in soup.select('a[href*="collection"]')))

data = []
for url in urlList:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    articles = soup.find_all("div", class_="grid-view-item product-card")
    for item in articles:
        data.append({
            'name': item.a.text.strip(),
            'price_regular': item.find("div", class_="price__regular").dd.text.split()[-1].strip(),
            'price_sale': item.find("div", class_="price__sale").dd.text.split()[-1].strip(),
            'type': url.split('/')[-1],
            'url': baseUrl+item.a.get('href')
        })

df = pd.DataFrame(data)
Output
|   | name | price_regular | price_sale | type | url |
|---|------|---------------|------------|------|-----|
| 0 | Mini Sandwiches Mix - 20 Pieces Bread Basket | 402 | 402 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/mini-sandwiches-mix-bread-basket |
| 1 | Spiced Aubergine Mini Sandwiches - Box 2 Pieces | 35 | 35 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/spiced-aubergine-mini-sandwich |
| 2 | Tuna Mini Sandwiches - Box 2 Pieces | 49 | 49 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/tuna-mini-sandwich |
| 3 | Turkey Coleslaw Mini Sandwiches - Box 2 Pieces | 45 | 45 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/turkey-coleslaw-mini-sandwich |
| 4 | Roast Beef Mini Sandwiches - Box 2 Pieces | 49 | 49 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/roast-beef-mini-sandwich |
...
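Since the prices come back as strings, you could convert them and then filter the dataframe by category or by discount (a small sketch, assuming the price columns convert cleanly to numbers):
df['price_regular'] = pd.to_numeric(df['price_regular'], errors='coerce')
df['price_sale'] = pd.to_numeric(df['price_sale'], errors='coerce')

sandwiches = df[df['type'] == 'sandwiches']            # only one category
on_sale = df[df['price_sale'] < df['price_regular']]   # items currently discounted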

Table scraping from a website with ID using BeautifulSoup

I'm having a problem scraping the table of this website. I should be getting the headings, but instead I am getting
AttributeError: 'NoneType' object has no attribute 'tbody'
I'm a bit new to web scraping, so if you could help me out that would be great.
import requests
from bs4 import BeautifulSoup

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()
page = s.get(URL)
soup = BeautifulSoup(page.content, "lxml")

table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")

headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)
What happens?
Note: Always look at your soup first - therein lies the truth. The content can always be slightly to extremely different from the view in the dev tools.
Access Revoked
Your IP address has been blocked. We detected irregular, bot-like usage of our Property Search originating from your IP address. This block was instated to reduce stress on our webserver, to ensure that we're providing optimal site performance to the taxpayers of Collin County. We have not blocked your ability to download our data exports, which you can still use to acquire bulk property data.
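A quick way to confirm you are hitting this block rather than a parsing problem is to check the response text before touching the table (a minimal sketch):
page = s.get(URL)
if "Access Revoked" in page.text:
    print("Blocked - the response does not contain the results table.")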
How to fix?
Add a user-agent to your request so that it looks like you are requesting with a "browser".
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
page = s.get(URL,headers=headers)
Or, as an alternative, just use the downloadable data exports instead.
Example (scraping table)
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()
page = s.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")

data = []
for row in soup.select('#propertysearchresults tr'):
    data.append([c.get_text(' ', strip=True) for c in row.select('td')])

pd.DataFrame(data[1:], columns=data[0])
Output
|   | Property ID ↓ Geographic ID ↓ | Owner Name | Property Address | Legal Description | 2021 Market Value |
|---|-------------------------------|------------|------------------|-------------------|-------------------|
| 1 | 2709013 R-10644-00H-0010-1 | PARTHASARATHY SURESH & ANITHA HARIKRISHNAN | 12209 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 | $513,019 |
| 2 | 2709018 R-10644-00H-0020-1 | JOSHI PRASHANT & SHWETA PANT | 12235 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 2 | $546,254 |
| 3 | 2709019 R-10644-00H-0030-1 | THALLAPUREDDY RAVENDRA & UMA MAHESWARI VEMULA | 12261 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 3 | $550,768 |
| 4 | 2709020 R-10644-00H-0040-1 | KULKARNI BHEEMSEN T & GOURI R | 12287 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 4 | $509,593 |
| 5 | 2709021 R-10644-00H-0050-1 | BALAM GANESH & SHANTHIREKHA LOKULA | 12313 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 5 | $553,949 |
...
import requests
from bs4 import BeautifulSoup

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
page = s.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
Finding Table Data:
column_data = soup.find("table").find_all("tr")[0]
column = [i.get_text() for i in column_data.find_all("td") if i.get_text() != ""]

row = soup.find("table").find_all("tr")[1:]
main_lst = []
for row_details in row:
    lst = []
    for i in row_details.find_all("td")[1:]:
        if i.get_text() != "":
            lst.append(i.get_text())
    main_lst.append(lst)
Converting to pandas DataFrame:
import pandas as pd

df = pd.DataFrame(main_lst, columns=column)
Output:
Property ID↓ Geographic ID ↓ Owner Name Property Address Legal Description 2021 Market Value
0 2709013R-10644-00H-0010-1 PARTHASARATHY SURESH & ANITHA HARIKRISHNAN 12209 Willowgate DrFrisco, TX 75035 Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 $513,019
.....
If you look at page.content, you will see that "Your IP address has been blocked".
You should add some headers to your request because the website is blocking your request. In your specific case, it will be enough to add a User-Agent:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()
page = s.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")

table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")

headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)
If you add headers, you will still get an error, but on this line:
headings.append(td.b.text.replace('\n', ' ').strip())
You should change it to
headings.append(td.text.replace('\n', ' ').strip())
because not every td contains a b tag.
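Put together, the heading loop could then look like this (a small sketch; get_text(' ', strip=True) also collapses the newlines, so the replace() is no longer needed):
headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.get_text(' ', strip=True))
print(headings)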

How to fix the error "'NoneType' object has no attribute 'text'"

I am trying to scrape some content from a website. I keep getting this error, 'NoneType' object has no attribute 'text', but I don't know how to fix it.
I noticed that the error has to do with this line: tr.find("a", class_="sc-dakcWe sc-liNYZW cPIBpC").text.replace("\n", " "), but I have been stuck on how to fix it. After removing the .text.replace("\n", " ") part, I get the response as None. I realise my issue is finding the correct selector: what could I change tr.find("a", class_="sc-dakcWe sc-liNYZW cPIBpC") to so that it gives me the correct restaurant_name?
I am using the Zomato restaurants page; an example URL for this is https://www.zomato.com/kanpur/top-restaurants
python code
import requests
from bs4 import BeautifulSoup

city = input("Enter your city: ")
url = "https://www.zomato.com/" + city + "/top-restaurants"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"}

response = requests.get(url, headers=header)
html = response.text
soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)

top_rest = soup.find("div", class_="bke1zw-0 cMipmx")
list_tr = top_rest.find_all("div", class_="bke1zw-1")
for tr in list_tr:
    restaurant_name = tr.find("a", class_="sc-dakcWe sc-liNYZW cPIBpC").text.replace("\n", " ")
    print(restaurant_name)
Their classes seem to be dynamically generated, so they change on reload. There is a much simpler (and possibly more reliable) method that doesn't require accessing the <a> tag the way you're trying to. The images use alt texts for the restaurant names, which we can capitalize on:
Code
import requests
from bs4 import BeautifulSoup
city = input("Enter your city: ")
url = "https://www.zomato.com/" + city + "/top-restaurants"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"}
response = requests.get(url, headers=header)
html = response.text
soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)
top_rest = soup.find("div", class_="bke1zw-0 cMipmx")
# NEW METHOD
images = top_rest.find_all("img") # look for all images in top_rest
image_alts = [image.get('alt', '') for image in images] # get their alt texts
print(image_alts)
Result:
Enter your city: kanpur
Top Restaurants in Kanpur | Zomato
['Grill Inn', 'Shri Bhojnalaya Restaurant & Sweets', 'Barbeque Nation', 'Kukkad at Nukkad', 'Tadka The Food Hub', 'Smile Pizza', 'Arabian Broost Chicken', 'Chachi Vaishno Dhaba', 'Barra House', "Pashtun's", 'Agra Vala Sweets', 'Al-Baik.Com', 'The Imperial Cord', 'Google Fast Food', 'Baba Foods', 'R S Bhojnalaya', 'Kerela Cafe', 'Mama Hotel', 'Gyan Vaishnav', 'New Pizza Yum', 'Offline Cafe', 'The Chocolate Room', 'Mocha']
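If you also need the link for each restaurant, one possible extension - this assumes each image sits inside the restaurant's anchor tag, which you should verify against the actual markup - is to walk up from the image with find_parent:
for image in images:
    link = image.find_parent("a")  # assumed structure: <a href="..."><img alt="Restaurant name"></a>
    if link is not None:
        print(image.get('alt', ''), link.get('href'))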

How to get all URLs within a page from oddsportal?

I have a code that scrapes all URLs from oddsportal.com main page.
I want the subsequent links to all pages within the parent URL
e.g.
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
has further pages i.e. https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/, https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/, etc.
How can I get that?
My existing code:
import requests
import bs4 as bs
import pandas as pd

url = 'https://www.oddsportal.com/results/#soccer'
headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'html.parser')

base_url = 'https://www.oddsportal.com'
a = soup.findAll('a', attrs={'foo': 'f'})

# This set will have all the URLs of the main page
s = set()
for i in a:
    s.add(base_url + i['href'])
s = list(s)

# This will filter for all soccer URLs
s = [x for x in s if '/soccer/' in x]
s = pd.DataFrame(s)
print(s)
I am very new to web scraping, hence this question.
You can find the main div tag based on its class attribute and use the find_all method to get the a tags; by looping over them you can extract the href of each:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
source = requests.get("https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/", headers=headers)
soup = BeautifulSoup(source.text, 'html.parser')

main_div = soup.find("div", class_="main-menu2 main-menu-gray")
a_tag = main_div.find_all("a")
for i in a_tag:
    print(i['href'])
Output:
/soccer/africa/africa-cup-of-nations/results/
/soccer/africa/africa-cup-of-nations-2019/results/
/soccer/africa/africa-cup-of-nations-2017/results/
/soccer/africa/africa-cup-of-nations-2015/results/
/soccer/africa/africa-cup-of-nations-2013/results/
/soccer/africa/africa-cup-of-nations-2012/results/
/soccer/africa/africa-cup-of-nations-2010/results/
/soccer/africa/africa-cup-of-nations-2008/results/
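These hrefs are relative. If you need absolute URLs, as in your original code, you can join them with the base URL (a small sketch using urllib.parse.urljoin and the a_tag list from above):
from urllib.parse import urljoin

base_url = 'https://www.oddsportal.com'
season_urls = [urljoin(base_url, i['href']) for i in a_tag]
print(season_urls)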
