BeautifulSoup url scraping - python

Trying out BeautifulSoup for the first time.
I have this link http://www.mediafire.com/download/alv8dq6k35n4m2k/For+You.zip
I want to catch the direct download url from the download button which is
http://download2110.mediafire.com/niz8p9iu6r9g/alv8dq6k35n4m2k/For+You.zip
What I have tried so far.
r = requests.get(url)
soup = BeautifulSoup(r.content)
links = soup.findAll('a')
I think the last function findAll('a')would find all the links from that page, but I could not find the direct download url in my linkslist.
Am I doing something wrong here? If so, how can I grab that link with beautifulsoup. I inspect the element in Chrome Developer Console and I see that the link is there.

You can try this to extract the url from the javascript:
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.mediafire.com/download/alv8dq6k35n4m2k/For+You.zip")
soup = BeautifulSoup(r.content)
link = soup.find("div",{"class":"download_link"})
import re
url = re.findall("http.*.zip?",link.text)[0]

Related

Extract single URL from webpage by scraping

I have been trying to scrape a website such as the one below. In the footer there are a bunch of links of their social media out of which the LinkedIn URL is the point of focus for me. Is there a way to fish out only that link maybe using regex or any other libraries available in Python.
This is what I have tried so far -
import requests
from bs4 import BeautifulSoup
url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(reqs.text,"html.parser")
for link in soup.find_all('a'):
print(link.get('href'))
But I'm fetching all the URLs instead of the one I'm looking for.
Note: I'd appreciate a dynamic code which I can use for other sites as well.
Thanks in advance for you suggestion/help.
One approach could be to use css selectors and look for string linkedin.com/company/ in values of href attributes:
soup.select_one('a[href*="linkedin.com/company/"]')['href']
Example
import requests
from bs4 import BeautifulSoup
url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(req.text,"html.parser")
# single (first) link
link = e['href'] if(e := soup.select_one('a[href*="linkedin.com/company/"]')) else None
# multiple links
links = [link['href'] for link in soup.select('a[href*="linkedin.com/company/"]')]

How do I get the inspect element code instead of the page source when they are both different?

I was trying to get all the links from the inspect element code of this website with the following code.
import requests
from bs4 import BeautifulSoup
url = 'https://chromedriver.storage.googleapis.com/index.html?path=97.0.4692.71/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
for link in soup.find_all('a'):
print(link)
However, I got no links. Then, I checked what soup was by printing it, and I compared it to the code I got after inspecting element and viewing page source on the actual website. The code returned by print(source) matched that which showed up when I clicked view page source, but it did not match the code that showed up when I clicked inspect element. Firstly, how do I get the inspect element code instead of the page source code? Secondly, why are the two different?
Just use the other URL mentioned in the comments and parse the XML with BeautifulSoup.
For example:
import requests
from bs4 import BeautifulSoup
url = "https://chromedriver.storage.googleapis.com/?delimiter=/&prefix=97.0.4692.71/"
soup = BeautifulSoup(requests.get(url).text, features="xml").find_all("Key")
keys = [f"https://chromedriver.storage.googleapis.com/{k.getText()}" for k in soup]
print("\n".join(keys))
Output:
https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_linux64.zip
https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_mac64.zip
https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_mac64_m1.zip
https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_win32.zip
https://chromedriver.storage.googleapis.com/97.0.4692.71/notes.txt

Web scraping href link in Python with Beautifulsoup

I'm trying to code a web scraping to obtain information of Linkedin Jobs post, including Job Description, Date, role, and link of the Linkedin job post. While I have made great progress obtaining job information about the job posts I'm currently stuck on how I could get the 'href' link of each job post. I have made many attempts including using class driver.find_element_by_class_name, and select_one method, neither seems to obtain the 'canonical' link by resulting none value. Could you please provide me some light?
This is the part of my code that tries to get the href link:
import requests
from bs4 import BeautifulSoup
url = https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('link'):
print(link.get('href'))
link: https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click
Picture of the code where the href link is stored
I think you were trying to access the href attribute incorrectly, to access them, use object["attribute_name"].
this works for me, searching for just links where rel = "canonical":
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
for link in soup.find_all('link', rel='canonical'):
print(link['href'])
The <link> has an attribute of rel="canonical". You can use an [attribute=value] CSS selector: [rel="canonical"] to get the value.
To use a CSS selector, use the .select_one() method instead of find().
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
print(soup.select_one('[rel="canonical"]')['href'])
Output:
https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D

I need to get specific data from a website using inspect element

I'm new to Python and I'm struggling to find a solution or a way to do the following thing. I need 2 things from a website that I can get from inspect element: the link to the .m3u8 file which can be found in the html (Elements tab) of the website and a link to a .ts file (it doesn't matter which one) from the Network tab. Does anybody know how to do this? Thanks in advance!
Use BS4 and requests:
import requests
from bs4 import BeautifulSoup
URL = 'https://stackoverflow.com/questions/64828046/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='question-header')
print(results)
from urllib.request import urlopen
import lxml.html
connection = urlopen('http://yourwebsite')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/#href'):
if link.endswith(".m3u8") or link.endwith(".ts"):
print(link)
you can use other if conditions to check whether something is in the link, like:
if "m3u8" in link:
print(link)

How to sift through specific items from a webpage using conditional statement

I've made a scraper in python. It is running smoothly. Now I would like to discard or accept specific links from that page as in, links only containing "mobiles" but even after making some conditional statement I can't do so. Hope I'm gonna get any help to rectify my mistakes.
import requests
from bs4 import BeautifulSoup
def SpecificItem():
url = 'https://www.flipkart.com/'
Process = requests.get(url)
soup = BeautifulSoup(Process.text, "lxml")
for link in soup.findAll('div',class_='')[0].findAll('a'):
if "mobiles" not in link:
print(link.get('href'))
SpecificItem()
On the other hand if I do the same thing using lxml library with xpath, It works.
import requests
from lxml import html
def SpecificItem():
url = 'https://www.flipkart.com/'
Process = requests.get(url)
tree = html.fromstring(Process.text)
links = tree.xpath('//div[#class=""]//a/#href')
for link in links:
if "mobiles" not in link:
print(link)
SpecificItem()
So, at this point i think with BeautifulSoup library the code should be somewhat different to get the purpose served.
The root of your problem is your if condition works a bit differently between BeautifulSoup and lxml. Basically, if "mobiles" not in link: with BeautifulSoup is not checking if "mobiles" is in the href field. I didn't look too hard but I'd guess it's comparing it to the link.text field instead. Explicitly using the href field does the trick:
import requests
from bs4 import BeautifulSoup
def SpecificItem():
url = 'https://www.flipkart.com/'
Process = requests.get(url)
soup = BeautifulSoup(Process.text, "lxml")
for link in soup.findAll('div',class_='')[0].findAll('a'):
href = link.get('href')
if "mobiles" not in href:
print(href)
SpecificItem()
That prints out a bunch of links and none of them include "mobiles".

Categories

Resources