I am trying to download PDFs from several PDF URLs.
An example: https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf
This URL opens the PDF directly in my browser.
However, when I use this code to download it from the link, it returns an HTML file instead.
link = "https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf"
urllib.request.urlretrieve(link, f"/content/drive/MyDrive/Research/pdfs/1.pdf")
The downloaded "pdf" file actually contains HTML code instead of a PDF.
How do I solve this issue? Appreciate any help, thanks!
You can use BeautifulSoup or lxml to find the <iframe> and get its src, then use that URL to download the file:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup as BS

url = 'https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf'

# The page embeds the PDF in an <iframe>, so fetch the HTML first
response = urllib.request.urlopen(url)
soup = BS(response.read(), 'html.parser')

# The iframe's src points at the actual PDF file
iframe = soup.find('iframe')
url = iframe['src']

# Build a local filename from the last part of the decoded URL
filename = urllib.parse.unquote(url)
filename = filename.rsplit('/', 1)[-1]

urllib.request.urlretrieve(url, filename)
Alternatively, you can check a few files to see if they all use the same https://d2x0djib3vzbzj.cloudfront.net/ host and, if so, simply replace it in the URL:
import urllib.request
import urllib.parse

url = 'https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf'

# Swap the viewer URL for the CloudFront URL that serves the raw PDF
url = url.replace('https://www.fasb.org/page/showpdf?path=',
                  'https://d2x0djib3vzbzj.cloudfront.net/')

filename = urllib.parse.unquote(url)
filename = filename.rsplit('/', 1)[-1]

urllib.request.urlretrieve(url, filename)
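Since the question mentions several PDF URLs, either approach can be wrapped in a loop. A minimal sketch of the iframe approach, where the list of links and the numbered output names are placeholders rather than anything from the original post:

import urllib.request
from bs4 import BeautifulSoup as BS

# Placeholder list of FASB viewer URLs
links = [
    'https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf',
]

for i, link in enumerate(links, start=1):
    # Each viewer page embeds the PDF in an <iframe>; its src is the real file
    response = urllib.request.urlopen(link)
    soup = BS(response.read(), 'html.parser')
    pdf_url = soup.find('iframe')['src']

    # Save files under sequential names, as in the question
    urllib.request.urlretrieve(pdf_url, f"{i}.pdf")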
I need help with some code. I want to download the PDFs from all pages that have a "/#documentu" section, not just one, but I don't want to hard-code all of those links. The parsing must be automatic. Here's the code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
# Need to download from every page that has files in its "/#documentu" section,
# but without writing all the links in the code. It must be automatic.
urlpage = "https://fasie.ru/programs/programma-innoshkolnik/#documentu"

# If there is no such folder, the script will create one automatically
folder_location = r'C:\Download'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(urlpage)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(urlpage, link['href'])).content)

# Rename the downloaded files to sequential numbers, keeping each extension
path = r'C:\Download'
i = 1
for file_name in os.listdir(path):
    base_name, ext = os.path.splitext(file_name)
    abs_file_name = os.path.join(path, file_name)
    new_abs_file_name = os.path.join(path, str(i) + ext)
    os.rename(abs_file_name, new_abs_file_name)
    i += 1
I need help making the parsing automatic.
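One way to avoid hard-coding the page URLs is to collect them from a listing page first and then run the same PDF loop on each page. A rough sketch; the listing URL and the '/programs/' link pattern are assumptions about the site's structure, not something taken from the post:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

folder_location = r'C:\Download'
os.makedirs(folder_location, exist_ok=True)

# Assumed listing page that links to the individual program pages
index_url = "https://fasie.ru/programs/"
index_soup = BeautifulSoup(requests.get(index_url).text, "html.parser")

# Assumed pattern: program pages live under /programs/; a set removes duplicates
page_urls = {urljoin(index_url, a['href'])
             for a in index_soup.select("a[href*='/programs/']")}

for page_url in page_urls:
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    for link in soup.select("a[href$='.pdf']"):
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(page_url, link['href'])).content)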
After running the following code, I am unable to open the downloaded PDFs. Even though the code ran successfully, the downloaded PDF files are damaged.
My computer's error message is:
Unable to open file. It may be damaged or in a format Preview doesn't recognize.
Why are they damaged and how do I solve this?
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"

# If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/ Physics_Solutions'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
The issue is that you are requesting GitHub's 'blob' link:
'/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'
but you want the 'raw' link:
'/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'
So just adjust that. Full code below:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"

# If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/Physics_Solutions'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Point at the raw file instead of the GitHub 'blob' viewer page
    pdf_link = link['href'].replace('blob', 'raw')
    pdf_file = requests.get('https://github.com' + pdf_link)
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(pdf_file.content)
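If you want to confirm that a download really is a PDF and not an HTML page, you can look at the first bytes of the response, since every PDF starts with %PDF-. A small check along these lines, not part of the original answer:

import requests

# The raw URL quoted above
pdf_url = 'https://github.com/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'

resp = requests.get(pdf_url)

# A real PDF begins with the magic bytes %PDF- ; an HTML error page does not
if resp.content[:5] == b'%PDF-':
    print("Looks like a PDF:", resp.headers.get('Content-Type'))
else:
    print("Not a PDF; first bytes:", resp.content[:20])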
I had to use soup.select("a[href$=.pdf]") (without the inner quotes) to get it to select the links correctly.
After that, your script works, but: what you're downloading is not a PDF, but an HTML webpage! Try visiting one of the URLs: https://github.com/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf
You'll be presented with a GitHub webpage, not the actual PDF. To get that, you need the "raw" GitHub URL, which you can see when you hover over the Download button: https://github.com/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf
So, it looks like you just have to replace blob with raw at the proper spot to make it work:
href = link['href']
href = href.replace('/blob/', '/raw/')
pdf_content = requests.get(urljoin(url, href)).content
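Putting that together with the loop from the question (reusing its soup, folder_location and url variables), the download step might look like this sketch:

for link in soup.select("a[href$='.pdf']"):
    # Rewrite the GitHub viewer link to the raw file link
    href = link['href'].replace('/blob/', '/raw/')
    filename = os.path.join(folder_location, href.split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, href)).content)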
The issue is that the file is not properly closed after the open/write.
Just add f.close() at the end of the code to do that.
I'm trying to get the sub-URLs contained in a main URL. However, when I print to see if I get the content, I notice that I am only getting the HTML, not the URLs within it.
import urllib.request

file = 'http://example.com'

with urllib.request.urlopen(file) as url:
    collection = url.read().decode('UTF-8')
I think this is what you are looking for.
You can use Python's Beautiful Soup library; this code should work with Python 3:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_all_urls(url):
    page = urlopen(url)
    url_html = BeautifulSoup(page, 'html.parser')
    # Print every href, prefixing the page URL when the link is relative
    for link in url_html.find_all('a'):
        links = str(link.get('href'))
        if links.startswith('http'):
            print(links)
        else:
            print(url + str(links))

get_all_urls('http://example.com')
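If the page uses relative paths (e.g. /docs/page.html), plain string concatenation can produce malformed URLs. A variant of the same function using urllib.parse.urljoin, offered as a sketch rather than part of the original answer, handles both absolute and relative hrefs:

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_all_urls(url):
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    for link in soup.find_all('a', href=True):
        # urljoin leaves absolute URLs untouched and resolves relative ones against url
        print(urljoin(url, link['href']))

get_all_urls('http://example.com')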
I am trying to make a subtitle downloader which takes the names of all the files in a folder and searches for them on the website 'Subscene.com'.
I am able to scrape the HTML source using Beautiful Soup, but I am unable to get the link for the zip file from that source. Downloading is triggered by clicking the 'Download' button.
There is no explicit link to the zip file for downloading.
Is there any way to solve this problem?
You don't need an explicit link to download the zip file.
Here is the logic I used for my Python downloader script:
MyFile2 = urllib2.urlopen(your_url)  # input link (your_url is the subtitle page URL)
MyHtml2 = MyFile2.read()
soup2 = BeautifulSoup(MyHtml2, "lxml")

downloaddiv = soup2.find("div", {"class": "download"})     # find the div with class "download"
downloadlink = downloaddiv.find('a')                        # find the <a> tag inside that div
download = 'https://subscene.com/' + downloadlink['href']   # prepend the main domain to build the download URL

r = requests.get(download)                    # request the zip file
z = zipfile.ZipFile(io.BytesIO(r.content))    # open the zip file from memory
z.extractall()                                # extract the zip file
You will also need these imports (note that this is Python 2 code):
import zipfile
from bs4 import BeautifulSoup
import urllib2
import lxml.html
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
import requests, io
Hope this all makes sense!
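Since urllib2 and StringIO exist only in Python 2, here is a rough Python 3 equivalent of the same logic, assuming the subtitle page still wraps the link in a div with class "download":

import io
import zipfile
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def download_subtitle(page_url):
    # Fetch the subtitle page and locate the div that wraps the download link
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    download_div = soup.find("div", {"class": "download"})
    download_link = download_div.find('a')

    # The href is site-relative, so resolve it against the domain
    zip_url = urljoin('https://subscene.com/', download_link['href'])

    # Download the zip into memory and extract it into the current directory
    r = requests.get(zip_url)
    with zipfile.ZipFile(io.BytesIO(r.content)) as z:
        z.extractall()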
I am building a crawler in Python and I have a list of hrefs from the page.
Now I have a list of file extensions to download, like:
list = ['zip','rar','pdf','mp3']
How can I save the files from those URLs to a local directory using Python?
EDIT:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.example.com/downlaod"
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
list_urls = soup.find_all('a')
print list_urls[6]
Going by your posted example:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.example.com/downlaod"
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
list_urls = soup.find_all('a')
print list_urls[6]
So, the URL you want to fetch next is presumably list_urls[6]['href'].
The first trick is that this might be a relative URL rather than absolute. So:
import os, urlparse  # urlparse is Python 2; in Python 3 use urllib.parse

newurl = list_urls[6]['href']
absurl = urlparse.urljoin(site.url, newurl)
Also, you want to only fetch the file if it has the right extension, so:
# str.endswith needs a string or a tuple, e.g. extensions = ('.zip', '.rar', '.pdf', '.mp3')
if not absurl.endswith(extensions):
    return  # or break or whatever
But once you've decided what URL you want to download, it's no harder than your initial fetch:
page = urllib2.urlopen(absurl)
html = page.read()
path = urlparse.urlparse(absurl).path
name = os.path.basename(path)
with open(name, 'wb') as f:
    f.write(html)
That's mostly it.
There are a few things you might want to add, but if so, you have to add them all manually. For example:
Look for a Content-disposition header with a suggested filename to use in place of the URL's basename.
Copy from page to f (e.g., with shutil.copyfileobj) instead of reading the whole thing into memory and then writing it out, as in the sketch after this list.
Deal with existing files with the same name.
…
But that's the basics.
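A rough sketch of the first two refinements (a Content-Disposition filename and a streaming copy), kept in the same Python 2 style as the rest of the answer; the header parsing here is deliberately simplistic:

import os
import shutil
import urllib2
import urlparse

def save_url(absurl):
    page = urllib2.urlopen(absurl)

    # Prefer a filename suggested by the server, if any
    disposition = page.info().getheader('Content-Disposition') or ''
    if 'filename=' in disposition:
        name = disposition.split('filename=')[-1].strip('"; ')
    else:
        name = os.path.basename(urlparse.urlparse(absurl).path)

    # Stream the body straight to disk instead of holding it all in memory
    with open(name, 'wb') as f:
        shutil.copyfileobj(page, f)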
You can use the Python requests library, as you asked in the question: http://www.python-requests.org
You can save a file from a URL like this:
import requests

url = 'http://i.stack.imgur.com/0LJdh.jpg'
data = requests.get(url).content
filename = "image.jpg"

with open(filename, 'wb') as f:
    f.write(data)
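For large files you may not want to hold the whole body in memory; requests can stream the response instead. A sketch, not part of the original answer:

import requests

url = 'http://i.stack.imgur.com/0LJdh.jpg'

# stream=True defers downloading the body until it is read in chunks
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open("image.jpg", 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)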
Solution using urllib3:
import os
import urllib3
from bs4 import BeautifulSoup
import urllib.parse
url = "https://path/site"
site = urllib3.PoolManager()
html = site.request('GET', url)
soup = BeautifulSoup(html.data, "lxml")
list_urls = soup.find_all('a')
and then a recursive function to get all the files:
def recursive_function(list_urls):
    # Base case: stop when there are no links left
    if not list_urls:
        return

    newurl = list_urls[0]['href']
    absurl = url + newurl
    list_urls.pop(0)

    if absurl.endswith(extensions):  # only download the targeted extensions
        response = site.request('GET', absurl)
        name = os.path.basename(absurl)
        with open(name, 'wb') as f:
            f.write(response.data)

    return recursive_function(list_urls)
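A usage sketch tying the pieces together; note that str.endswith takes a tuple of suffixes rather than a list, and the suffixes need their leading dot:

# File types to keep; endswith() accepts a tuple of suffixes
extensions = ('.zip', '.rar', '.pdf', '.mp3')

# Keep only the anchors that actually carry an href attribute
list_urls = [a for a in soup.find_all('a') if a.get('href')]

recursive_function(list_urls)

For very long link lists a plain loop would avoid hitting Python's recursion limit.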