How should I scrape href links from this website? - Python

I'm trying to get every product's individual URL from this page: https://www.goodricketea.com/product/darjeeling-tea
How should I do that with BeautifulSoup? Is there anyone who can help me?

To get product links from this site, you can for example do:
import requests
from bs4 import BeautifulSoup
url = "https://www.goodricketea.com/product/darjeeling-tea"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for a in soup.select("a:has(>h2)"):
    print("https://www.goodricketea.com" + a["href"])
Prints:
https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-250gm
https://www.goodricketea.com/product/darjeeling-tea/thurbo-darjeeling-tea-whole-leaf-250gm
https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-organic-250gm
https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/thurbo-darjeeling-tea-whole-leaf-100gm
https://www.goodricketea.com/product/darjeeling-tea/thurbo-darjeeling-tea-fannings-250gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-premium-muscatel-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-250gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-bags-50-tea-bags
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-bags-100-tea-bags
https://www.goodricketea.com/product/darjeeling-tea/badamtam-exclusive-organic-darjeeling-tea-250gm
https://www.goodricketea.com/product/darjeeling-tea/badamtam-exclusive-organic-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/seasons-3-in-1-darjeeling-leaf-tea-150gm-first-flush-second-flush-pre-winter-flush
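String concatenation works here because every href on the page is site-relative. A slightly more defensive sketch uses urllib.parse.urljoin, which resolves relative hrefs and leaves absolute ones unchanged. The href values below are made up for illustration and stand in for a["href"] from the loop above; absolute_links is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.goodricketea.com/product/darjeeling-tea"

# Hypothetical href values, standing in for a["href"] from the loop above.
hrefs = [
    "/product/darjeeling-tea/roasted-darjeeling-tea-250gm",
    "https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-100gm",
]


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


for link in absolute_links(BASE_URL, hrefs):
    print(link)
```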

Related

Unable to find element BeautifulSoup

I am trying to parse a specific href link from the following website: https://www.murray-intl.co.uk/en/literature-library.
Element I seek to parse:
<a class="btn btn--naked btn--icon-left btn--block focus-within" href="https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc&_ga=2.12911351.1364356977.1629796255-1577053129.1629192717" target="blank">Portfolio Holding Summary<i class="material-icons btn__icon">library_books</i></a>
However, using BeautifulSoup I am unable to obtain the desired element, perhaps due to cookies acceptance.
from bs4 import BeautifulSoup
import requests
page = requests.get('https://www.murray-intl.co.uk/en/literature-library')
soup = BeautifulSoup(page.content, 'html.parser')
link = soup.find_all('a', class_='btn btn--naked btn--icon-left btn--block focus-within')
url = link[0].get('href')
url
I am still new to BS4 and hope someone can point me in the right direction.
Thank you in advance!
To get the correct tags, remove the "focus-within" class (it's added later by JavaScript):
import requests
from bs4 import BeautifulSoup
url = "https://www.murray-intl.co.uk/en/literature-library"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
links = soup.find_all("a", class_="btn btn--naked btn--icon-left btn--block")
for u in links:
    print(u.get_text(strip=True), u.get("href", ""))
Prints:
...
Portfolio Holding Summarylibrary_books https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
...
EDIT: To get only the specified link, you can use, for example, a CSS selector:
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])
Prints:
https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
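The reason the extra class matters is that a multi-class string passed to class_ is compared against the whole class attribute, so one additional class breaks the match. A small sketch on made-up markup (not the real page) contrasts this with a CSS class selector, which only requires the listed classes to be present:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second link carries an extra class, as if added by a script.
html = """
<a class="btn btn--naked" href="/a">First</a>
<a class="btn btn--naked focus-within" href="/b">Second</a>
<a class="other" href="/c">Third</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string is matched against the exact class attribute value.
exact = [a["href"] for a in soup.find_all("a", class_="btn btn--naked")]

# A CSS class selector matches any tag that has at least these classes.
by_selector = [a["href"] for a in soup.select("a.btn.btn--naked")]

print(exact)
print(by_selector)
```

The selector form tends to be less brittle when a site adds or reorders classes.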

Web scraping href link in Python with Beautifulsoup

I'm trying to write a web scraper to collect information from LinkedIn job posts, including the job description, date, role, and link to the post. I have made good progress on the job details, but I'm stuck on how to get the 'href' link of each post. I have made many attempts, including driver.find_element_by_class_name and the select_one method, but neither obtains the 'canonical' link; both return None. Could you please shed some light?
This is the part of my code that tries to get the href link:
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('link'):
    print(link.get('href'))
link: https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click
I think you were trying to access the href attribute incorrectly. To access an attribute, use object["attribute_name"].
This works for me, searching only for links where rel="canonical":
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
for link in soup.find_all('link', rel='canonical'):
    print(link['href'])
The <link> has an attribute of rel="canonical". You can use an [attribute=value] CSS selector: [rel="canonical"] to get the value.
To use a CSS selector, use the .select_one() method instead of find().
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
print(soup.select_one('[rel="canonical"]')['href'])
Output:
https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D
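Note that select_one() returns None when nothing matches, so indexing the result directly raises a TypeError on pages without a canonical link. A minimal sketch that guards against this, run on inline made-up HTML rather than the live page (canonical_url is a hypothetical helper name):

```python
from bs4 import BeautifulSoup


def canonical_url(html):
    """Return the href of the first <link rel="canonical">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # [href] also skips a canonical tag that has no href attribute.
    tag = soup.select_one('link[rel="canonical"][href]')
    return tag["href"] if tag is not None else None


# Made-up document head for illustration.
sample = '<head><link rel="canonical" href="https://example.com/job/1"></head>'
print(canonical_url(sample))
```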

How to get specific urls from a website in a class tag with beautiful soup? (Python)

I'm trying to get the urls of the main articles from a news outlet using beautiful soup. Since I do not want to get ALL of the links on the entire page, I specified the class. My code only manages to display the titles of the news articles, not the links. This is the website: https://www.reuters.com/news/us
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.find_all("h3", {"class": "story-title"})
for i in links:
    print(i.get_text().strip())
    print()
Any help is greatly appreciated!
To get the links to all articles, you can use the following code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.find_all("div", {"class": "story-content"})
for i in links:
    print(i.a.get('href'))
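If a story-content block has no link inside it, i.a is None and the call above raises an AttributeError. A hedged sketch that skips such blocks and keeps each title next to its link, using made-up markup that only assumes the class names mentioned above (story_links is a hypothetical helper name):

```python
from bs4 import BeautifulSoup

# Made-up markup; the real page structure may differ.
html = """
<div class="story-content"><a href="/article/one"><h3 class="story-title"> First story </h3></a></div>
<div class="story-content"><h3 class="story-title">No link here</h3></div>
<div class="story-content"><a href="/article/two"><h3 class="story-title">Second story</h3></a></div>
"""


def story_links(html):
    """Return (title, href) pairs for story blocks that contain a link."""
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    for block in soup.find_all("div", class_="story-content"):
        anchor = block.find("a", href=True)
        if anchor is None:
            continue  # block without a link: skip instead of failing
        stories.append((anchor.get_text(strip=True), anchor["href"]))
    return stories


for title, href in story_links(html):
    print(title, href)
```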

How to get sub-content from wikipedia page using BeautifulSoup

I am trying to scrape a sub-section of a Wikipedia page, identified by its internal link, using Python. The problem is that my code scrapes content from the whole page. How can I scrape just the paragraph under the internal link? Thanks in advance.
import requests
from bs4 import BeautifulSoup as bs
base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
sub_link="#الأسباب"
total=base_link+sub_link
r=requests.get(total)
soup = bs(r.text, 'html.parser')
results=soup.find('p')
print(results)
That's because what you are trying to scrape is not a sublink. It's an anchor: the part after # is never sent to the server, so you always get the whole page back.
Request the entire page, then find the element with the given id.
Something like this:
from bs4 import BeautifulSoup as soup
import requests
base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
anchor_id="الأسباب"
r=requests.get(base_link)
page = soup(r.text, 'html.parser')
span = page.find('span', {'id': anchor_id})
results = span.parent.find_next_siblings('p')
print(results[0].text)
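When the full URL, fragment included, is all you have, the page URL and the anchor id can be separated with the standard library before making the request. A small sketch, assuming the fragment may be percent-encoded; split_anchor is a hypothetical helper name and the example.org URL is a placeholder:

```python
from urllib.parse import urldefrag, unquote


def split_anchor(url):
    """Return (url_without_fragment, decoded_fragment)."""
    page_url, fragment = urldefrag(url)
    # Fragments copied from a browser are often percent-encoded; decode them
    # so they can be compared with the id attribute in the parsed HTML.
    return page_url, unquote(fragment)


# Placeholder URL with a percent-encoded Arabic fragment.
page_url, anchor_id = split_anchor(
    "https://example.org/wiki/Page#%D8%A7%D9%84%D8%A3%D8%B3%D8%A8%D8%A7%D8%A8"
)
print(page_url)
print(anchor_id)
```

page_url is what gets passed to requests.get, and anchor_id is what gets passed to find.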

Search the frequency of words in the sub pages of a webpage using Python

I'm stuck on how to crawl every link (pages or sub-pages) on a webpage and find the frequency of a given word. I used Beautiful Soup for scraping, but I don't think I am doing it right. For example: I need to go to the ServiceNow official page > Solutions > View all Solutions, and find the frequency of "Intelligent" in all the links/sub-pages under View all Solutions.
Any help would be very much appreciated.
Thank you :)
My Code
import requests
from bs4 import BeautifulSoup
url = "https://www.servicenow.com/solutions-by-category.html"
serviceNow_r = requests.get(url)
sNow_soup = BeautifulSoup(serviceNow_r.text, 'html.parser')
print(sNow_soup.find_all('href',{'class':'cta-list component'}))
for name in sNow_soup.find_all('href',{'class':'cta-list component'}):
    print(name.text)
This is what you need to access the href attribute for every link in the page.
import requests
from bs4 import BeautifulSoup
url = "https://www.servicenow.com/solutions-by-category.html"
serviceNow_r = requests.get(url)
sNow_soup = BeautifulSoup(serviceNow_r.text, 'html.parser')
for anchor in sNow_soup.find_all('a', href=True):
    print(anchor['href'])
You are searching for an href tag, but href is an attribute, not a tag.
Search for a tags instead, then read each one's href attribute, which holds the URL of the linked page.
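Once the links are collected, the word-frequency part of the question can be handled separately from the crawling. A minimal sketch, assuming the text of each page has already been extracted (for instance with get_text()); the URLs and page texts below are made up, and count_word is a hypothetical helper name:

```python
import re


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of word in text."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))


# Hypothetical page texts keyed by URL; in practice each value would come
# from parsing one of the links collected above.
pages = {
    "https://example.com/a": "Intelligent automation for intelligent teams.",
    "https://example.com/b": "No match here, only intelligence.",
}

totals = {url: count_word(text, "intelligent") for url, text in pages.items()}
for url, count in totals.items():
    print(url, count)
print("total:", sum(totals.values()))
```

The word boundaries keep "intelligence" from being counted as "intelligent"; drop them if substring matches are wanted.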
