I wrote a function to find all .pdf files on a web page and download them. It works well when the page is publicly accessible, but when I use it for a course website (which can only be accessed on my university's network), the downloaded PDFs are corrupted and cannot be opened.
How can I fix it?
import urllib2
import wget
from bs4 import BeautifulSoup

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            links.append(my_url + current_link)
    print(links)
    for link in links:
        #urlretrieve(link)
        wget.download(link)

get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')
When I use this grader link, current_link is something like /courses/320241/2019_2/lectures/lecture_7_8.pdf, but the /courses/320241/2019_2/ part is already included in my_url, so when I append it the resulting URL obviously doesn't work. However, the function works perfectly for [this link][1].
Is there a way I can use the same function to work with both types of links?
OK, I think I understand the issue now. Try the code below on your data. I think it works, but obviously I couldn't try it directly on the page requiring login. Also, I changed your structure and variable definitions a bit, because I find it easier to think that way, but if it works, you can easily modify it to suit your own tastes.
Anyway, here goes:
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse

my_urls = ['https://cnds.jacobs-university.de/courses/os-2019/',
           'https://grader.eecs.jacobs-university.de/courses/320241/2019_2']
links = []

for url in my_urls:
    resp = requests.get(url)
    soup = bs(resp.text, 'lxml')
    og = soup.find("meta", property="og:url")
    base = urlparse(url)
    for link in soup.find_all('a'):
        current_link = link.get('href')
        # guard against anchors without an href
        if current_link and current_link.endswith('pdf'):
            if og:
                links.append(og["content"] + current_link)
            else:
                links.append(base.scheme + "://" + base.netloc + current_link)

for link in links:
    print(link)
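As a side note, urllib.parse.urljoin handles both absolute and relative hrefs in one call, so the og/base branching could collapse into something like this sketch (untested against the login-protected page):

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

def get_pdf_links(url):
    soup = bs(requests.get(url).text, 'lxml')
    # urljoin resolves each href against the page URL, whether the href is
    # absolute ('https://...'), root-relative ('/courses/...'),
    # or page-relative ('lectures/lecture_7_8.pdf')
    return [urljoin(url, a['href'])
            for a in soup.find_all('a', href=True)
            if a['href'].endswith('.pdf')]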
I'm very new to Python and I'm trying to create a website update tool that checks whether the link contained in a specific button has changed.
This is the code I have used:
import requests
from bs4 import BeautifulSoup

url = 'https://www.keldagroup.com/investors/creditor-considerations/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
However, it produces a long list of links like this:
/about-kelda-group/
/about-kelda-group/kelda-group-vision-and-values/
/about-kelda-group/group-profile/
/about-kelda-group/sustainability-and-corporate-social-responsibility/
/about-kelda-group/group-profile/chief-executive-statement/
None
The URL I want stays in the same place on the website. How do I pick out that URL from the list I've produced? I can then write some code to see if it has changed.
If you know of a simpler way to solve my issue please let me know.
You can do something like this (obviously, adapt it to your desired output/ alert etc):
import requests
from bs4 import BeautifulSoup
import schedule
import time

def get_link():
    url = 'https://www.keldagroup.com/investors/creditor-considerations/'
    reqs = requests.get(url)
    soup = BeautifulSoup(reqs.text, 'html.parser')
    the_ever_changing_link = soup.find('a', {'arial-label': 'Investor presentation'}).get('href')
    return the_ever_changing_link

def handle_changes_to_link(value):
    new_link = get_link()
    if new_link == value:
        print('link is the same')
    else:
        print('link has changed: ', new_link)

schedule.every(10).seconds.do(handle_changes_to_link, value='https://www.keldagroup.com/media/1399/yw-investor-update-1-dec-2021.pdf')

while True:
    schedule.run_pending()
    time.sleep(1)
This will check every 10 seconds (you may want to change that to hours or days) whether the link has changed from its original value (hardcoded as 'https://www.keldagroup.com/media/1399/yw-investor-update-1-dec-2021.pdf') and will print out:
link is the same
link is the same
link is the same
link is the same
link is the same
link is the same
link is the same
link is the same
link is the same
[...]
If the link changes, it will print out link has changed, along with the new link.
Documentation for schedule is at: https://schedule.readthedocs.io/en/stable/
Also, docs for BeautifulSoup(bs4) can be found at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
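One caveat: since the comparison value above is hardcoded, the job will keep printing link has changed on every run after the first change. If you'd rather track the latest value, a small stateful variant (a sketch reusing get_link and schedule from above) could look like this:

# remember the most recent link between runs
last_seen = {'link': 'https://www.keldagroup.com/media/1399/yw-investor-update-1-dec-2021.pdf'}

def handle_changes_to_link():
    new_link = get_link()
    if new_link == last_seen['link']:
        print('link is the same')
    else:
        print('link has changed: ', new_link)
        last_seen['link'] = new_link  # future runs compare against the new link

schedule.every(10).seconds.do(handle_changes_to_link)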
I have been trying to scrape a website such as the one below. In the footer there are a bunch of links to their social media, of which the LinkedIn URL is the one I'm focused on. Is there a way to fish out only that link, maybe using regex or any other libraries available in Python?
This is what I have tried so far -
import requests
from bs4 import BeautifulSoup

url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

for link in soup.find_all('a'):
    print(link.get('href'))
But I'm fetching all the URLs instead of the one I'm looking for.
Note: I'd appreciate dynamic code which I can use for other sites as well.
Thanks in advance for your suggestions/help.
One approach could be to use CSS selectors and look for the string linkedin.com/company/ in the values of href attributes:
soup.select_one('a[href*="linkedin.com/company/"]')['href']
Example
import requests
from bs4 import BeautifulSoup

url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# single (first) link
link = e['href'] if (e := soup.select_one('a[href*="linkedin.com/company/"]')) else None

# multiple links
links = [link['href'] for link in soup.select('a[href*="linkedin.com/company/"]')]
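Since the question asks for something dynamic enough to reuse on other sites, the fragment can be parameterized; a small sketch (the helper name and default value are mine, not from the question):

import requests
from bs4 import BeautifulSoup

def find_social_link(url, fragment='linkedin.com/company/'):
    # return the first href containing the fragment, or None if absent
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    e = soup.select_one(f'a[href*="{fragment}"]')
    return e['href'] if e else None

print(find_social_link("https://www.southcoast.org/"))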
I am trying to pass along a link I extracted with BeautifulSoup.
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]')]
print(links[1])
This is the link I want.
Output: https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip
Now I am trying to pass this link through so I can download the contents.
import io
import os
import zipfile
import requests

# make a folder if it doesn't already exist
folder_name = '../data'  # matching the extraction target below
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# pass the url
url = r'link from beautifulsoup result needs to go here'
response = requests.get(url, stream=True)

# extract contents
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    for elem in zf.namelist():
        zf.extract(elem, '../data')
My overall goal is to take the link I scraped and place it in the url variable, because the link on this website is always changing. I want to make it dynamic so I don't have to manually search for the link and update it whenever it changes. I hope this makes sense, and I appreciate any help I can get.
If I manually enter the URL into my code as follows, I know it works:
url = r'https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip'
If I can get my code to pass that value in exactly, I know it'll work; I'm just stuck on how to accomplish this.
I think you can do it with the find_all() method in Beautiful Soup
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content)  # response from your requests.get() call
for a in soup.find_all('a'):
    url = a.get('href')
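Putting that together with the snippets from the question, the scraped link can feed straight into the download step; a sketch (assuming links[1] keeps pointing at the zip, as in the output above):

import io
import zipfile
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src']
         for item in soup.select('[href^="http"], [src^="http"]')]

url = links[1]  # the scraped zip link replaces the hardcoded URL
response = requests.get(url, stream=True)
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    zf.extractall('../data')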
Hi, I want to get the text (the number 18) from the em tag, as shown in the picture above.
When I ran my code, it did not work and only gave me an empty list. Can anyone help me? Thank you~
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable JavaScript, you'll see that the like count is loaded dynamically, so you have to use a service that renders the website, after which you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own for example: https://github.com/scrapinghub/splash
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html (assuming you are using Python 3, which seems to be the case judging by the print statement).
Furthermore, looking at the issue again, it is a bit more complicated. When you look at the source code of the page, it actually loads an iframe, and in that iframe you have the actual content. Hit Ctrl+U to see the source code of the original URL, since the site seems to block the browser context menu.
So in order to achieve your crawling objective you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"

with urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')

# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']

# do your crawling
with urlopen(real_url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original URL and the URL in the iframe; at first glance, it looked like the iframe URL can be constructed from the original URL.
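For example, a sketch of that construction (the PostView pattern here is an assumption inferred from the printed iframe['src']; verify it before relying on it):

from urllib.parse import urlparse

def guess_iframe_url(blog_url):
    # blog URLs look like https://blog.naver.com/<blogId>/<logNo>;
    # the iframe src appeared to follow a PostView pattern built from those parts
    blog_id, log_no = urlparse(blog_url).path.strip('/').split('/')
    return f"https://blog.naver.com/PostView.nhn?blogId={blog_id}&logNo={log_no}"

print(guess_iframe_url("https://blog.naver.com/kwoohyun761/221945923725"))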
You'll still need a rendered version of the iframe url to grab your desired value.
I don't know what this site is about, but it seems they do not want to be crawled, so maybe you should respect that.
I've made a scraper in Python, and it runs smoothly. Now I would like to discard or accept specific links from that page, as in links containing "mobiles", but even after adding a conditional statement I can't do so. I hope to get some help rectifying my mistakes.
import requests
from bs4 import BeautifulSoup

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div', class_='')[0].findAll('a'):
        if "mobiles" not in link:
            print(link.get('href'))

SpecificItem()
On the other hand, if I do the same thing using the lxml library with XPath, it works:
import requests
from lxml import html

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    tree = html.fromstring(Process.text)
    links = tree.xpath('//div[@class=""]//a/@href')
    for link in links:
        if "mobiles" not in link:
            print(link)

SpecificItem()
So, at this point, I think the code needs to be somewhat different with the BeautifulSoup library to get the purpose served.
The root of your problem is that your if condition works a bit differently between BeautifulSoup and lxml. Basically, if "mobiles" not in link: with BeautifulSoup is not checking the href field; I didn't look too hard, but it appears to check against the tag's children rather than its attributes. Explicitly using the href field does the trick:
import requests
from bs4 import BeautifulSoup

def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div', class_='')[0].findAll('a'):
        href = link.get('href')
        # skip anchors without an href before checking the substring
        if href and "mobiles" not in href:
            print(href)

SpecificItem()
That prints out a bunch of links and none of them include "mobiles".
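As a side note, if your BeautifulSoup is recent enough to use soupsieve for CSS selectors (4.7+), the same filter can be expressed without the explicit condition; a sketch under the same page assumptions:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.flipkart.com/').text, "lxml")
first_div = soup.findAll('div', class_='')[0]
# keep anchors that have an href which does not contain "mobiles"
for link in first_div.select('a[href]:not([href*="mobiles"])'):
    print(link['href'])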