I am working on a web scraper using HTML requests and Beautiful Soup (I am new to this). For multiple webpages, e.g. https://www.selfridges.com/GB/en/cat/hermes-rose-herms-silky-blush-6g_R03752945/?previewAttribute=32%20Rose%20Pommette, I am trying to grab the image link, which sits in the same place on every page. The HTML is:
<img class="c-image-gallery__img" src="//images.selfridges.com/is/image/selfridges/R03752945_32ROSEPOMMETTE_M?$PDP_M_ZOOM$" loading="lazy">
I have tried to use the CSS selector:
r = scraper.get(link)
soup = BeautifulSoup(r.content, 'lxml')
imagelink = soup.select('body > section > section.c-product-hero.--multiple-product-shot > div.c-product-hero__product-shots.c-image-gallery > div > picture:nth-child(1) > img')
which returns None
or find_all:
soup.find_all('img')
But the specific link is not in the list. I am unsure why this is. Any help would be appreciated
The page you are trying to scrape is behind Cloudflare, which has some kind of protection against scraping: the server returns a 403 Forbidden HTTP status code. Some websites also rely heavily on JavaScript, and those are hard to scrape without a JavaScript-capable browser. I would suggest using a different technology, such as Puppeteer.
from bs4 import BeautifulSoup
import requests
link = "https://www.selfridges.com/GB/en/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 OPR/75.0.3969.171"}
page = requests.get(link, headers=headers)
print(page.status_code)
print(page.text)
soup = BeautifulSoup(page.text, "lxml")
soup_imgs = soup.find_all("img")
for img in soup_imgs:
    print(img)
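If you do end up needing a JavaScript-capable browser, here is a rough, untested sketch using Playwright for Python (a Puppeteer-style tool); the img selector is simply taken from the HTML in the question:

from playwright.sync_api import sync_playwright

link = "https://www.selfridges.com/GB/en/cat/hermes-rose-herms-silky-blush-6g_R03752945/?previewAttribute=32%20Rose%20Pommette"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(link)
    # class taken from the <img> tag shown in the question
    src = page.get_attribute("img.c-image-gallery__img", "src")
    print(src)
    browser.close()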
BeautifulSoup doesn’t find any tag on this page. Does anyone know what the problem can be?
I can find elements on the page with selenium, but since I have a list of pages, I don’t want to use selenium.
import requests
from bs4 import BeautifulSoup
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
soup.find_all('h1')
You can get the info on that page by adding headers to your requests, mimicking the headers you can see in Dev tools (Network tab) for the main request to that URL. Here is one way to get all links from that page:
import requests
from bs4 import BeautifulSoup as bs
headers = {
    'Cookie': 'sso_checked=1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
links = [a.get('href') for a in soup.select('a')]
print(links)
Result printed in terminal:
['/news', 'https://dzen.ru/news', 'https://dzen.ru/news/region/moscow', 'https://dzen.ru/news/rubric/mobilizatsiya', 'https://dzen.ru/news/rubric/personal_feed', 'https://dzen.ru/news/rubric/politics', 'https://dzen.ru/news/rubric/society', 'https://dzen.ru/news/rubric/business', 'https://dzen.ru/news/rubric/world', 'https://dzen.ru/news/rubric/sport', 'https://dzen.ru/news/rubric/incident', 'https://dzen.ru/news/rubric/culture', 'https://dzen.ru/news/rubric/computers', 'https://dzen.ru/news/rubric/science', 'https://dzen.ru/news/rubric/auto', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://mosregtoday.ru/soc/v-podmoskove-zapustili-chat-bot-po-voprosam-chastichnoj-mobilizacii/?utm_source=yxnews&utm_medium=desktop', ...]
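With those same headers in place, the h1 lookup from the original question should also return results (I haven't verified the exact markup, so treat this as a sketch):

h1s = soup.find_all('h1')
print([h.get_text(strip=True) for h in h1s])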
I am trying to scrape pdf files of Safety Data Sheets from this link: https://www.sigmaaldrich.com/PK/en/search/2127-03-9?focus=products&page=1&perpage=30&sort=relevance&term=2127-03-9&type=cas_number
The PDF link seems to be part of the SVG content on the webpage. I found "Scraping a webpage for link titles and URLs utilizing BeautifulSoup" and am trying to use the answer there to get the SVG content.
However, the code does not seem to extract SVG content.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'https://www.sigmaaldrich.com/PK/en/search/2127-03-9?focus=products&page=1&perpage=30&sort=relevance&term=2127-03-9&type=cas_number'
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

with requests.Session() as session:
    # extract the link to the svg
    res = session.get(base_url, headers=headers)
    soup = BeautifulSoup(res.content, 'html.parser')
    svg = soup.select_one("object.svg-content")
    svg_link = urljoin(base_url, svg["data"])
Error
There isn't an SVG. I clicked the icon and it opened a PDF directly. Could you try downloading it via a direct link instead, e.g. https://www.sigmaaldrich.com/BR/pt/sds/sigma/d5767?
Also, an alternative which doesn't require BS: Selenium. It's quite simple to achieve the desired effect with this tool.
Doc:
https://selenium-python.readthedocs.io/
https://www.browserstack.com/guide/download-file-using-selenium-python
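If a direct link like the one above works for you, a minimal requests-only sketch for saving the PDF could look like this (assuming the server returns the file without further protection):

import requests

# direct SDS link taken from the suggestion above
url = "https://www.sigmaaldrich.com/BR/pt/sds/sigma/d5767"
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

resp = requests.get(url, headers=headers)
resp.raise_for_status()

# save the response body as a PDF file
with open("d5767_sds.pdf", "wb") as f:
    f.write(resp.content)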
I'm trying to get a specific portion of text from this web page, using code I found in a similar post:
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('https://www.baseball-reference.com/players/k/kershcl01.shtml')
# Parsing the page
tree = html.fromstring(page.content)
# Get element using XPath
share = tree.xpath(
    '//div[@id="leaderboard_cyyoung"]/table/tbody/tr[11]/td/a')
print(share)
Output is just empty brackets []
You are getting empty results because the div element you are trying to query is commented out in the requested page's source. Note that when you use the requests.get method, you get the page's HTML source code, not the rendered HTML code generated by the browser from your interaction with the page and that you can inspect with the browser's developer tools.
So I would say: check again if this is really the element you see rendered on the page and see if you can identify what kind of interaction makes it rendered. Then you can use a tool to mock this interaction so that you can get the rendered HTML code within your Python environment. I would suggest helium for doing so. If this is not the right element, you can simply update the specified XPath to get the right source-code available element and successfully scrape it.
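As a rough, untested sketch of the helium suggestion (start_chrome returns a Selenium WebDriver, so page_source gives the rendered HTML; the XPath is the one from the question):

from helium import start_chrome, kill_browser
from lxml import html

# render the page in a real browser, then parse the rendered HTML with lxml
driver = start_chrome('https://www.baseball-reference.com/players/k/kershcl01.shtml', headless=True)
tree = html.fromstring(driver.page_source)

share = tree.xpath('//div[@id="leaderboard_cyyoung"]/table/tbody/tr[11]/td/a')
print(share)

kill_browser()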
As stated, this is a rendered/dynamic part of the site. It is there in the comments, so you'll need to pull the comments out of the HTML, then parse them. The other issue is that inside the comments there is no <tbody> tag, so your XPath won't find anything; you'd need to remove that part. I'm not sure what you want to pull out, though (is it the link, is it the text?). I altered your code to show you how to use it with lxml, but honestly I'm not a fan; I'd prefer to just use BeautifulSoup. BeautifulSoup doesn't integrate with XPath, however, so I used a CSS selector instead.
Your code altered:
import requests
from lxml import html
from bs4 import BeautifulSoup, Comment
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        htmlStr = str(each)
        # Parsing the page
        tree = html.fromstring(htmlStr)
        # Get element using XPath
        share = tree.xpath('//div[@id="leaderboard_cyyoung"]/table/tr[11]/td/a')
        print(share)
How I would do it:
import requests
from bs4 import BeautifulSoup, Comment
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        soup = BeautifulSoup(str(each), 'html.parser')
        share = soup.select('div#leaderboard_cyyoung > table > tr:nth-child(12) > td > a')
        print(share)
        break
Output:
[4.58 Career Shares]
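If what you actually want is the link text or the href rather than the tag itself, you can read them off the matched Tag with standard BeautifulSoup attribute access, e.g.:

if share:
    a = share[0]
    print(a.get_text(strip=True))  # the visible link text
    print(a.get('href'))           # the link target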
I've recently started looking into purchasing some land, and I'm writing a little app to organize details in Jira/Confluence so I can keep track of who I've talked to and what we discussed regarding each parcel of land.
So, I wrote this little scraper for landwatch(dot)com:
[url is just a listing on the website]
from bs4 import BeautifulSoup
import requests
def get_property_data(url):
    headers = ({'User-Agent':
                'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = requests.get(url, headers=headers)  # Maybe request Url with read more already gone
    soup = BeautifulSoup(response.text, 'html5lib')

    title = soup.find_all(class_='b442a')[0].text
    details = soup.find_all('p', class_='d19de')
    price = soup.find_all('div', class_='_260f0')[0].text

    deets = []
    for i in range(len(details)):
        if details[i].text != '':
            deets.append(details[i].text)

    detail = ''
    for i in deets:
        detail += '<p>' + i + '</p>'

    return [title, detail, price]
Everything works great except that the class d19de has a ton of values hidden behind the Read More button.
While Googling away at this, I discovered "How to Scrape reviews with read more from Webpages using BeautifulSoup"; however, I either don't understand what they're doing well enough to implement it, or it just doesn't work anymore:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.mouthshut.com/product-reviews/Lakeside-Chalet-Mumbai-reviews-925017044").text, "html.parser")
for title in soup.select("a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_]"):
    items = title.get('href')
    if items:
        broth = BeautifulSoup(requests.get(items).text, "html.parser")
        for item in broth.select("div.user-review p.lnhgt"):
            print(item.text)
Any thoughts on how to bypass that Read More button? I'm really hoping to do this in BeautifulSoup, and not selenium.
Here's an example URL for testing: https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403
That data is present within a script tag. Here is an example of extracting that content, parsing with json, and outputting land description info as a list:
from bs4 import BeautifulSoup
import requests, json
url = 'https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403'
headers = ({'User-Agent':
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
response = requests.get(url, headers=headers) # Maybe request Url with read more already gone
soup = BeautifulSoup(response.text, 'html5lib')
all_data = json.loads(soup.select_one('[type="application/ld+json"]').string)
details = all_data['description'].split('\r\r')
You may wish to examine what else is in that script tag:
from pprint import pprint
pprint(all_data)
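To plug this back into the question's get_property_data flow, the split description can be wrapped in <p> tags the same way the original code does (a sketch; it assumes the 'description' key is present, as above):

detail = ''.join('<p>' + d + '</p>' for d in details if d.strip())
print(detail)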
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://www.######.com/##/")
soup = BeautifulSoup(page)
for link in soup.findAll('a'):
    if link['href'].startswith('http://'):
        print(link)
I am using this code to parse the href attributes of anchor tags, but when I try the same with iframes it gives no output. I don't know what is happening there. Can anyone suggest a fix, please?
How about searching for iframe tags and reading their src attribute? Also, requests is a better choice than urllib2.
from bs4 import BeautifulSoup
import requests

url = "http://www.######.com/##/"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}

page = requests.get(url, headers=headers).text
soup = BeautifulSoup(page, 'lxml')

# find every iframe that has a src attribute and print it
for iframe in soup.find_all('iframe', src=True):
    print(iframe['src'])
Oh, so you are trying to get all iframes on the page? Everything looks OK, except you should use the src attribute with iframes. If that doesn't help, please provide an example page.