I am trying to parse a specific href link from the following website: https://www.murray-intl.co.uk/en/literature-library.
The element I seek to parse:
<a class="btn btn--naked btn--icon-left btn--block focus-within" href="https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc&_ga=2.12911351.1364356977.1629796255-1577053129.1629192717" target="blank">Portfolio Holding Summary<i class="material-icons btn__icon">library_books</i></a>
However, using BeautifulSoup I am unable to obtain the desired element, perhaps because of the cookie-acceptance overlay.
from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.murray-intl.co.uk/en/literature-library')
soup = BeautifulSoup(page.content, 'html.parser')
link = soup.find_all('a', class_='btn btn--naked btn--icon-left btn--block focus-within')
url = link[0].get('href')
url
I am still new to BS4 and hope someone can set me on the right course.
Thank you in advance!
To get correct tags, remove "focus-within" class (it's added later by JavaScript):
import requests
from bs4 import BeautifulSoup
url = "https://www.murray-intl.co.uk/en/literature-library"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
links = soup.find_all("a", class_="btn btn--naked btn--icon-left btn--block")
for u in links:
    print(u.get_text(strip=True), u.get("href", ""))
Prints:
...
Portfolio Holding Summarylibrary_books https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
...
EDIT: To get only the specified link, you can, for example, use a CSS selector:
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])
Prints:
https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
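To see the selector in isolation, here is a minimal, self-contained sketch of :-soup-contains on an inline snippet (the HTML and hrefs below are made up for illustration; the pseudo-class needs bs4 4.7+ with its bundled soupsieve):

```python
from bs4 import BeautifulSoup

# Made-up snippet mirroring the structure of the real page
html = '''
<a class="btn" href="/docs/other">Annual Report</a>
<a class="btn" href="/docs/phs">Portfolio Holding Summary</a>
'''
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains matches on the element's text content
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])  # /docs/phs
```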
Related
I'm trying to get every product's individual URL from this page: https://www.goodricketea.com/product/darjeeling-tea.
How should I do that with BeautifulSoup? Can anyone help me?
To get product links from this site, you can, for example, do:
import requests
from bs4 import BeautifulSoup
url = "https://www.goodricketea.com/product/darjeeling-tea"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for a in soup.select("a:has(>h2)"):
    print("https://www.goodricketea.com" + a["href"])
Prints:
https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-250gm
https://www.goodricketea.com/product/darjeeling-tea/thurbo-darjeeling-tea-whole-leaf-250gm
https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-organic-250gm
https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/thurbo-darjeeling-tea-whole-leaf-100gm
https://www.goodricketea.com/product/darjeeling-tea/thurbo-darjeeling-tea-fannings-250gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-premium-muscatel-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-250gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-bags-50-tea-bags
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-bags-100-tea-bags
https://www.goodricketea.com/product/darjeeling-tea/badamtam-exclusive-organic-darjeeling-tea-250gm
https://www.goodricketea.com/product/darjeeling-tea/badamtam-exclusive-organic-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/seasons-3-in-1-darjeeling-leaf-tea-150gm-first-flush-second-flush-pre-winter-flush
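String concatenation works here because every href is root-relative; urllib.parse.urljoin from the standard library is a slightly more robust way to resolve hrefs against the page URL:

```python
from urllib.parse import urljoin

base = "https://www.goodricketea.com/product/darjeeling-tea"
href = "/product/darjeeling-tea/roasted-darjeeling-tea-250gm"

# urljoin handles absolute, root-relative and relative hrefs alike
print(urljoin(base, href))
# https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-250gm
```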
I'm trying to write a web scraper to obtain information about LinkedIn job posts, including the job description, date, role, and link of each post. While I have made great progress collecting information about the posts, I'm currently stuck on how to get the 'href' link of each one. I have made many attempts, including driver.find_element_by_class_name and the select_one method, but neither obtains the 'canonical' link; both return None. Could you please shed some light?
This is the part of my code that tries to get the href link:
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('link'):
    print(link.get('href'))
The link I need: https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click
I think you were trying to access the href attribute incorrectly; to access an attribute, use tag["attribute_name"].
This works for me, searching only for <link> tags where rel="canonical":
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
for link in soup.find_all('link', rel='canonical'):
    print(link['href'])
The <link> has an attribute of rel="canonical". You can use an [attribute=value] CSS selector: [rel="canonical"] to get the value.
To use a CSS selector, use the .select_one() method instead of find().
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
print(soup.select_one('[rel="canonical"]')['href'])
Output:
https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D
I have been trying to randomize a wikipedia page and get the URL of that randomized site. Even though I can fetch every link on the site, I can not reach to this piece of html code and fetch the href for some reason.
An example of a randomized wikipedia page.
<a accesskey="v" href="https://en.wikipedia.org/wiki/T%C5%99eb%C3%ADvlice?action=edit" class="oo-ui-element-hidden"></a>
Every Wikipedia page has this element, and I need its href so I can manipulate it to get the current URL.
The code I have written this far:
from bs4 import BeautifulSoup
import requests
links = []
for x in range(0, 1):
    source = requests.get("https://en.wikipedia.org/wiki/Special:Random").text
    soup = BeautifulSoup(source, "lxml")
    print(soup.find(id="firstHeading"))
    for link in soup.findAll('a'):
        links.append(link.get('href'))
print(links)
Directly getting the current URL would also help, but I couldn't find a solution for that online.
Also, I'm using Linux, if that helps.
Take a look at the attributes.
You should narrow your search by using an attribute of this <a>:
soup.find_all('a', accesskey='e')
Example
import requests
from bs4 import BeautifulSoup
links = []
for x in range(0, 1):
    source = requests.get("https://en.wikipedia.org/wiki/Special:Random").text
    soup = BeautifulSoup(source, "lxml")
    print(soup.find(id="firstHeading"))
    for link in soup.find_all('a', accesskey='e'):
        links.append(link.get('href'))
print(links)
Output
<h1 class="firstHeading" id="firstHeading" lang="en">James Stack (golfer)</h1>
['/w/index.php?title=James_Stack_(golfer)&action=edit']
Just in case
You do not need the second loop; if you just want to handle that single <a>, use find() instead of find_all():
Example
import requests
from bs4 import BeautifulSoup
links = []
for x in range(0, 5):
    source = requests.get("https://en.wikipedia.org/wiki/Special:Random").text
    soup = BeautifulSoup(source, "lxml")
    links.append(soup.find('a', accesskey='e').get('href'))
links
Output
['/w/index.php?title=Rick_Moffat&action=edit',
'/w/index.php?title=Mount_Burrows&action=edit',
'/w/index.php?title=The_Rock_Peter_and_the_Wolf&action=edit',
'/w/index.php?title=Yamato,_Yamanashi&action=edit',
'/w/index.php?title=Craig_Henderson&action=edit']
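As for directly getting the current URL: the edit link's href carries the article title in its query string, so you can rebuild the article URL with the standard library alone. A sketch, using the sample href from the output above:

```python
from urllib.parse import parse_qs, urlsplit

# Sample href taken from the output above
href = "/w/index.php?title=James_Stack_(golfer)&action=edit"

# Pull the title parameter out of the query string
title = parse_qs(urlsplit(href).query)["title"][0]
print("https://en.wikipedia.org/wiki/" + title)
# https://en.wikipedia.org/wiki/James_Stack_(golfer)
```

Simpler still: requests follows the Special:Random redirect, so requests.get(url).url already holds the final article URL without any parsing.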
I'm trying to get the length of a YouTube video with BeautifulSoup and by inspecting the site I can find this line: <span class="ytp-time-duration">6:14:06</span> which seems to be perfect to get the duration, but I can't figure out how to do it.
My script:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
soup = BeautifulSoup(response.text, "html5lib")
mydivs = soup.findAll("span", {"class": "ytp-time-duration"})
print(mydivs)
The problem is that the output is []
From the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/), the explicit form would be:
soup.findAll('span', attrs={"class": "ytp-time-duration"})
But your syntax is an equivalent shortcut, so the more likely explanation is that the span.ytp-time-duration element is not present when the YouTube page first loads; it is only added afterwards by JavaScript, which is why the soup does not pick it up.
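Since the duration never appears in the HTML that requests downloads, one workaround is to pull the lengthSeconds field out of the player JSON embedded in the page source. Note this field name is an assumption about YouTube's current markup, which changes often; the sketch below runs against a sample fragment rather than a live request:

```python
import re

# Sample of the JSON fragment embedded in a watch page (not fetched live)
html = '"videoDetails":{"videoId":"_uQrJ0TkZlc","lengthSeconds":"22446"}'

match = re.search(r'"lengthSeconds":"(\d+)"', html)
if match:
    seconds = int(match.group(1))
    # Format as h:mm:ss
    print(f"{seconds // 3600}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}")
    # 6:14:06
```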
I've made a scraper in Python, and it runs smoothly. Now I would like to accept or discard specific links from that page, e.g. only links containing "mobiles", but even after adding a conditional statement I can't manage it. I hope someone can help me rectify my mistake.
import requests
from bs4 import BeautifulSoup
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div', class_='')[0].findAll('a'):
        if "mobiles" not in link:
            print(link.get('href'))

SpecificItem()
On the other hand, if I do the same thing using the lxml library with XPath, it works:
import requests
from lxml import html
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    tree = html.fromstring(Process.text)
    links = tree.xpath('//div[@class=""]//a/@href')
    for link in links:
        if "mobiles" not in link:
            print(link)

SpecificItem()
So, at this point I think that with the BeautifulSoup library the code needs to be somewhat different to serve the purpose.
The root of your problem is that the if condition works differently between BeautifulSoup and lxml. With lxml's XPath you iterate over plain href strings, so "mobiles" not in link is a substring check. With BeautifulSoup, link is a Tag object, and membership on a Tag checks the tag's child nodes, not the href attribute. Explicitly using the href field does the trick:
import requests
from bs4 import BeautifulSoup
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div', class_='')[0].findAll('a'):
        href = link.get('href')
        if href and "mobiles" not in href:
            print(href)

SpecificItem()
That prints out a bunch of links and none of them include "mobiles".
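To see why the Tag version behaves differently, here is a minimal demonstration on an inline snippet (the href is made up): membership on a Tag looks at its children, so a link whose text doesn't mention "mobiles" passes the check even when its href does.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/mobiles-store">Phones</a>', "html.parser")
link = soup.a

# Tag.__contains__ checks the tag's children (here, the text node "Phones")
print("mobiles" in link)          # False
# The href attribute is a plain string, so substring checks work on it
print("mobiles" in link["href"])  # True
```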