Scraping SVG content from a webpage - Python

I am trying to scrape pdf files of Safety Data Sheets from this link: https://www.sigmaaldrich.com/PK/en/search/2127-03-9?focus=products&page=1&perpage=30&sort=relevance&term=2127-03-9&type=cas_number
The PDF link seems to be part of the SVG content on the webpage. I found the question Scraping a webpage for link titles and URLs utilizing BeautifulSoup and am trying to use its answer to get the SVG content.
However, the code does not seem to extract the SVG content.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'https://www.sigmaaldrich.com/PK/en/search/2127-03-9?focus=products&page=1&perpage=30&sort=relevance&term=2127-03-9&type=cas_number'
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
with requests.Session() as session:
    # extract the link to the SVG
    res = session.get(base_url, headers=headers)
    soup = BeautifulSoup(res.content, 'html.parser')
    svg = soup.select_one("object.svg-content")
    svg_link = urljoin(base_url, svg["data"])
Error

There isn't an SVG there. I clicked it and it brought up a PDF directly. Could you try downloading via a direct link, e.g. https://www.sigmaaldrich.com/BR/pt/sds/sigma/d5767?
Also, an alternative that doesn't require BS: Selenium. It's quite simple to achieve the desired effect with this tool.
Docs:
https://selenium-python.readthedocs.io/
https://www.browserstack.com/guide/download-file-using-selenium-python
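Coming back to the direct-link idea, here is a minimal requests sketch. It assumes the SDS endpoint serves the PDF to a plain HTTP client; if the site insists on JavaScript, the Selenium route above is the fallback:
import requests

# Example SDS URL from the comment above; substitute the one for your product
sds_url = "https://www.sigmaaldrich.com/BR/pt/sds/sigma/d5767"
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

res = requests.get(sds_url, headers=headers)
res.raise_for_status()

# The response body is the PDF itself, so write it to disk
with open("sds_d5767.pdf", "wb") as f:
    f.write(res.content)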

Related

Is there a way to make the html elements of a website more visible?

While scraping the following website (https://www.middletownk12.org/Page/4113), this code could not locate the table rows (to get the staff name, email, and department) even though they are visible when I use the Chrome developer tools. The soup object is not readable enough to locate the tr tags that hold the info needed.
import requests
from bs4 import BeautifulSoup
url = "https://www.middletownk12.org/Page/4113"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
print(response.text)
I tried different libraries such as bs4, requests, and Selenium with no luck. I also tried CSS selectors and XPath with Selenium, but the tr elements could not be located.
That table of contact information is filled in by JavaScript after the page has loaded. The content doesn't exist in the page's HTML, so you won't see it using requests.
By using the developer tools available in the browser, we can examine the requests made after the page has loaded. There are a lot of them, but at least in my browser it's obvious the contact information is loaded near the end.
Looking at the request log, I see a request for a spreadsheet from docs.google.com:
If we examine that entry, we find that it's a request for:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0
And if we fetch the above link, we get a spreadsheet with the source data for that table.
Actually, I used Selenium and then bs4 without any results. The code does not find the 'tr' elements...
Why are you using Selenium? The whole point of this answer is that you don't need Selenium if you can figure out the link to retrieve the data -- which we have.
All we need is requests to fetch the data and BeautifulSoup to parse it:
import requests
import bs4

url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(f"{link.text}: {link.get('href')}")

Web Scraping using XPath - Not finding element after copying text xpath

I'm trying to get a specific portion of text from this web page, using code I found in a similar post:
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('https://www.baseball-reference.com/players/k/kershcl01.shtml')
# Parsing the page
tree = html.fromstring(page.content)
# Get element using XPath
share = tree.xpath('//div[@id="leaderboard_cyyoung"]/table/tbody/tr[11]/td/a')
print(share)
Output is just empty brackets []
You are getting empty results because the div element you are trying to query is commented out in the requested page's source. Note that when you use the requests.get method, you get the page's HTML source code, not the rendered HTML code generated by the browser from your interaction with the page and that you can inspect with the browser's developer tools.
So I would say: check again if this is really the element you see rendered on the page and see if you can identify what kind of interaction makes it rendered. Then you can use a tool to mock this interaction so that you can get the rendered HTML code within your Python environment. I would suggest helium for doing so. If this is not the right element, you can simply update the specified XPath to get the right source-code available element and successfully scrape it.
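If you go the rendered-HTML route, here is a minimal helium sketch. It is only a sketch: helium wraps Selenium, so it assumes Chrome and a matching chromedriver are available locally, and the sleep is a crude stand-in for a proper wait:
import time
from helium import start_chrome, kill_browser
from lxml import html

# Launch a headless Chrome and let the page's JavaScript run
driver = start_chrome('https://www.baseball-reference.com/players/k/kershcl01.shtml', headless=True)
time.sleep(3)                   # give the scripts a moment to inject the leaderboard
rendered = driver.page_source   # browser-rendered DOM, not the raw source
kill_browser()

# In the rendered DOM the element exists, so an XPath query can match it
tree = html.fromstring(rendered)
share = tree.xpath('//div[@id="leaderboard_cyyoung"]//tr[11]/td/a')
print(share)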
As stated, this is a rendered/dynamic part of the site. The markup is there in the HTML comments, so you'll need to pull the comments out of the HTML and then parse them. The other issue is that inside the comment there is no <tbody> tag, so your XPath won't find anything; you'd need to remove that part. I'm not sure what you want to pull out, though (is it the link, is it the text?). I altered your code to show how to use it with lxml, but honestly I'm not a fan; I'd prefer to just use BeautifulSoup. BeautifulSoup doesn't integrate with XPath, however, so I used a CSS selector instead.
Your code altered:
import requests
from lxml import html
from bs4 import BeautifulSoup, Comment

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# The leaderboard markup lives inside HTML comments, so pull those out first
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        htmlStr = str(each)
        # Parsing the comment's markup
        tree = html.fromstring(htmlStr)
        # Get element using XPath (note: no <tbody> inside the comment)
        share = tree.xpath('//div[@id="leaderboard_cyyoung"]/table/tr[11]/td/a')
        print(share)
How I would do it:
import requests
from bs4 import BeautifulSoup, Comment

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        soup = BeautifulSoup(str(each), 'html.parser')
        share = soup.select('div#leaderboard_cyyoung > table > tr:nth-child(12) > td > a')
        print(share)
        break
Output:
[4.58 Career Shares]

Python 3 BeautifulSoup Scraping Content After "Read More" Text

I've recently started looking into purchasing some land, and I'm writing a little app that organizes details in Jira/Confluence so I can keep track of who I've talked to, and what I talked to them about, for each parcel of land individually.
So, I wrote this little scraper for landwatch(dot)com:
[url is just a listing on the website]
from bs4 import BeautifulSoup
import requests

def get_property_data(url):
    headers = ({'User-Agent':
                'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = requests.get(url, headers=headers)  # Maybe request Url with read more already gone
    soup = BeautifulSoup(response.text, 'html5lib')
    title = soup.find_all(class_='b442a')[0].text
    details = soup.find_all('p', class_='d19de')
    price = soup.find_all('div', class_='_260f0')[0].text
    deets = []
    for i in range(len(details)):
        if details[i].text != '':
            deets.append(details[i].text)
    detail = ''
    for i in deets:
        detail += '<p>' + i + '</p>'
    return [title, detail, price]
Everything works great except that the class d19de has a ton of values hidden behind the Read More button.
While Googling away at this, I discovered How to Scrape reviews with read more from Webpages using BeautifulSoup; however, I either don't understand what they're doing well enough to implement it, or it just doesn't work anymore:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.mouthshut.com/product-reviews/Lakeside-Chalet-Mumbai-reviews-925017044").text, "html.parser")
for title in soup.select("a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_]"):
    items = title.get('href')
    if items:
        broth = BeautifulSoup(requests.get(items).text, "html.parser")
        for item in broth.select("div.user-review p.lnhgt"):
            print(item.text)
Any thoughts on how to bypass that Read More button? I'm really hoping to do this in BeautifulSoup, and not selenium.
Here's an example URL for testing: https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403
That data is present within a script tag. Here is an example of extracting that content, parsing it with json, and outputting the land description info as a list:
from bs4 import BeautifulSoup
import requests, json

url = 'https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403'
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
response = requests.get(url, headers=headers)  # Maybe request Url with read more already gone
soup = BeautifulSoup(response.text, 'html5lib')
all_data = json.loads(soup.select_one('[type="application/ld+json"]').string)
details = all_data['description'].split('\r\r')
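As a quick usage check (continuing from the snippet above), you can print each recovered paragraph of the description:
# Print each paragraph of the land description pulled from the JSON-LD blob
for i, para in enumerate(details, 1):
    print(f"--- paragraph {i} ---")
    print(para)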
You may wish to examine what else is in that script tag:
from pprint import pprint
pprint(all_data)

Select css selector using beautiful soup

I am working on a web scraper using HTML requests and Beautiful Soup (I am new to this). For multiple webpages, e.g. https://www.selfridges.com/GB/en/cat/hermes-rose-herms-silky-blush-6g_R03752945/?previewAttribute=32%20Rose%20Pommette, I am trying to grab the image link, which is always in the same format across these webpages. The HTML is:
<img class="c-image-gallery__img" src="//images.selfridges.com/is/image/selfridges/R03752945_32ROSEPOMMETTE_M?$PDP_M_ZOOM$" loading="lazy">
I have tried to use the CSS selector:
r = scraper.get(link)
soup = BeautifulSoup(r.content, 'lxml')
imagelink = soup.select('body > section > section.c-product-hero.--multiple-product-shot > div.c-product-hero__product-shots.c-image-gallery > div > picture:nth-child(1) > img')
which returns None
or find_all:
soup.find_all('img')
But the specific link is not in the list. I am unsure why this is; any help would be appreciated.
The page you are trying to scrape uses Cloudflare, which has some protection against scraping: the server returns a "403 Forbidden" HTTP status code. Some websites also use a lot of JavaScript, and those are hard to scrape without a JavaScript-capable browser. I would suggest you use a different technology, like Puppeteer.
from bs4 import BeautifulSoup
import requests

link = "https://www.selfridges.com/GB/en/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 OPR/75.0.3969.171"}

page = requests.get(link, headers=headers)
print(page.status_code)
print(page.text)

soup = BeautifulSoup(page.text, "lxml")
soup_imgs = soup.find_all("img")
for img in soup_imgs:
    print(img)
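Puppeteer is a Node.js tool; if you would rather stay in Python, a rough equivalent is to drive a real browser with Selenium. This is only a sketch (it assumes Chrome and chromedriver are installed, and Cloudflare may still challenge an automated browser):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")       # no visible browser window
driver = webdriver.Chrome(options=options)   # assumes Chrome + chromedriver are installed

driver.get("https://www.selfridges.com/GB/en/cat/hermes-rose-herms-silky-blush-6g_R03752945/?previewAttribute=32%20Rose%20Pommette")
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "lxml")
# The class below comes from the <img> snippet in the question
for img in soup.select("img.c-image-gallery__img"):
    print(img.get("src"))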

Using Python to Scrape Nested Divs and Spans in Twitter?

I'm trying to scrape the likes and retweets from the results of a Twitter search.
After running the Python below, I get an empty list, []. I'm not using the Twitter API because it doesn't cover tweets by hashtag this far back.
The code I'm using is:
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)
I can successfully save the HTML to a file using the code below, but when I search the text it is missing large amounts of information, such as the class names I am looking for...
So (part of) the problem is apparently in accurately accessing the source code.
filename = 'newfile2.txt'
with open(filename, 'w') as handle:
    handle.writelines(str(data))
This screenshot shows the span that I'm trying to scrape.
I've looked at this question, and others like it, but I'm not quite getting there.
How can I use BeautifulSoup to get deeply nested div values?
It seems that your GET request returns valid HTML but with no tweet elements in the #timeline element. However, adding a user agent to the request headers seems to remedy this.
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)
