I am writing a Python script that saves a PDF file locally according to the format given in the URL, e.g.:
https://Hostname/saveReport/file_name.pdf # saves the content in a PDF file
I am opening this URL through a Python script:
import webbrowser
webbrowser.open("https://Hostname/saveReport/file_name.pdf")
The URL contains lots of images and text. Once this URL is opened, I want to save its content as a PDF file using a Python script.
This is what I have done so far.
Code 1:
import requests
url="https://Hostname/saveReport/file_name.pdf" #Note: It's https
r = requests.get(url, auth=('usrname', 'password'), verify=False)
file = open("file_name.pdf", 'w')
file.write(r.read())
file.close()
Code 2:
import urllib2
import ssl
url="https://Hostname/saveReport/file_name.pdf"
context = ssl._create_unverified_context()
response = urllib2.urlopen(url, context=context) #How should i pass authorization details here?
html = response.read()
In the above code I am getting: urllib2.HTTPError: HTTP Error 401: Unauthorized
If I use Code 2, how can I pass the authorization details?
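For reference, one way to attach Basic Auth credentials in urllib-style code is to add the Authorization header yourself. This is only a sketch using Python 3's urllib.request (the Hostname URL and the usrname/password placeholders come from the question; Python 2's urllib2.Request works the same way):

```python
import base64
import urllib.request

url = "https://Hostname/saveReport/file_name.pdf"  # placeholder host from the question

# Build the request and attach a Basic Auth header by hand.
req = urllib.request.Request(url)
credentials = base64.b64encode(b"usrname:password").decode("ascii")
req.add_header("Authorization", "Basic " + credentials)

# urllib.request.urlopen(req, context=context) would then send the
# authenticated request, where context is the unverified SSL context
# from Code 2.
print(req.get_header("Authorization"))
```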
I think this will work
import requests
import shutil
url="https://Hostname/saveReport/file_name.pdf" #Note: It's https
r = requests.get(url, auth=('usrname', 'password'), verify=False, stream=True)
r.raw.decode_content = True
with open("file_name.pdf", 'wb') as f:
    shutil.copyfileobj(r.raw, f)
One way you can do that is:
import urllib3
urllib3.disable_warnings()
url = r"https://websitewithfile.com/file.pdf"
fileName = r"file.pdf"
with urllib3.PoolManager() as http:
    r = http.request('GET', url)
    with open(fileName, 'wb') as fout:
        fout.write(r.data)
You can try something like:
import requests
response = requests.get('https://websitewithfile.com/file.pdf', verify=False, auth=('user', 'pass'))
with open('file.pdf', 'wb') as fout:
    fout.write(response.content)
For some files - at least tar archives (and perhaps other files too) - you can use pip:
import sys
from subprocess import call, run, PIPE
url = "https://blabla.bla/foo.tar.gz"
call([sys.executable, "-m", "pip", "download", url], stdout=PIPE, stderr=PIPE)
But you should confirm in some other way that the download was successful, since pip raises an error for any file that is not an archive containing setup.py - hence stderr=PIPE. (Alternatively, you may be able to determine whether the download succeeded by parsing the subprocess error message.)
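A minimal sketch of that return-code check (download_ok is a hypothetical helper name, not part of pip):

```python
import sys
from subprocess import PIPE, run

def download_ok(url):
    # Hypothetical helper: run `pip download` and report success via the exit code.
    result = run([sys.executable, "-m", "pip", "download", url],
                 stdout=PIPE, stderr=PIPE)
    # pip exits non-zero when the URL is not a downloadable archive,
    # so the return code is the simplest success signal.
    return result.returncode == 0, result.stderr.decode(errors="replace")
```

Checking result.returncode avoids parsing pip's error text, while the captured stderr is still available for logging.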
Related
I am trying to download a GIF file with urllib, but it is throwing this error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
This does not happen when I download from other blog sites. This is my code:
import requests
import urllib.request
url_1 = 'https://goodlogo.com/images/logos/small/nike_classic_logo_2355.gif'
source_code = requests.get(url_1,headers = {'User-Agent': 'Mozilla/5.0'})
path = 'C:/Users/roysu/Desktop/src_code/Python_projects/python/web_scrap/myPath/'
full_name = path + ".gif"
urllib.request.urlretrieve(url_1,full_name)
Don't use urllib.request.urlretrieve. Instead, use the requests library like this:
import requests
url = 'https://goodlogo.com/images/logos/small/nike_classic_logo_2355.gif'
path = "D:\\Test.gif"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
file = open(path, "wb")
file.write(response.content)
file.close()
Hope that this helps!
Solution:
The remote server is apparently checking the user agent header and rejecting requests from Python's urllib.
urllib.request.urlretrieve() doesn't allow you to change the HTTP headers; however, you can use
urllib.request.URLopener.retrieve():
import urllib.request
url_1='https://goodlogo.com/images/logos/small/nike_classic_logo_2355.gif'
path='/home/piyushsambhi/Downloads/'
full_name= path + "testimg.gif"
opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'Mozilla/5.0')
filename, headers = opener.retrieve(url_1, full_name)
print(filename)
NOTE: You are using Python 3, where these functions are now considered part of the "legacy interface" and URLopener is deprecated. For that reason you should not use them in new code.
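If you want to stay with urllib on Python 3, the non-deprecated route is to build a urllib.request.Request with the header and pass it to urlopen. A sketch, reusing the URL and download path from the answer above:

```python
import urllib.request

url_1 = 'https://goodlogo.com/images/logos/small/nike_classic_logo_2355.gif'
full_name = '/home/piyushsambhi/Downloads/testimg.gif'

# Request objects accept custom headers, unlike the legacy urlretrieve().
req = urllib.request.Request(url_1, headers={'User-Agent': 'Mozilla/5.0'})

# Uncomment to perform the actual download:
# with urllib.request.urlopen(req) as resp, open(full_name, 'wb') as f:
#     f.write(resp.read())
```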
Your code imports requests but doesn't use it - you should, though, because it is much easier than urllib. The code snippet below works for me:
import requests
url = 'https://goodlogo.com/images/logos/small/nike_classic_logo_2355.gif'
path='/home/piyushsambhi/Downloads/'
full_name= path + "testimg1.gif"
r = requests.get(url)
with open(full_name, 'wb') as outfile:
    outfile.write(r.content)
NOTE: CHANGE THE PATH VARIABLE ACCORDING TO YOUR MACHINE AND ENVIRONMENT
I need to download an image from a URL using Python. I'm using this to do so:
import requests
with requests.get(url, stream=True) as r:
    with open(img_path, "wb") as f:
        f.write(r.content)
In order for me to see the image in the browser, I need to be logged into my account on that site. The image may have been sent by someone else or myself.
The issue is that I am able to download some images successfully, but for other ones, I get an authentication error, i.e. that I'm not logged in.
In those failing cases, it sometimes downloads a file whose content is this:
{"result":"error","msg":"Not logged in: API authentication or user session required"}
And sometimes, it downloads the html file of the webpage which asks me to login to view the image.
Why am I getting this error for just some cases and not others? And how should I fix it?
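If the site uses a login session rather than HTTP Basic Auth, one possible approach (sketched here; the login URL and form fields are placeholders that depend on the site) is to do both the login and the download through a single requests.Session, and to check the status code so an error page is never written to disk as if it were the image:

```python
import requests

# A Session persists cookies, so credentials obtained at login are
# sent automatically with later requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Site-specific login step (placeholder URL and form fields):
# session.post("https://example.com/login", data={"user": "...", "pass": "..."})

# The image request then carries the session cookies:
# with session.get(img_url, stream=True) as r:
#     r.raise_for_status()  # fail loudly instead of saving the error body
#     with open(img_path, "wb") as f:
#         f.write(r.content)
```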
Use Response.content to get the image data as bytes and then write it to a file opened in wb (write binary) mode:
import requests
image_url = "https://www.python.org/static/community_logos/python-logo-master-v3-TM.png"
img_data = requests.get(image_url).content
with open('image_name.jpg', 'wb') as f:
    f.write(img_data)
Note, for authorization:
from requests.auth import HTTPBasicAuth
img_data = requests.get('image_url', auth=HTTPBasicAuth('user', 'pass')).content
You can either use the response.raw file object, or iterate over the response.
import requests
import shutil
from requests.auth import HTTPBasicAuth
r = requests.get(url, auth=HTTPBasicAuth('user', 'pass'), stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
This question already has answers here:
How to download a file over HTTP?
(30 answers)
Closed 4 years ago.
I'm completely new to Python. I want to download a file by sending a request to the server. When I enter the URL in my browser, I see the CSV file is downloaded, but when I try sending a GET request it does not return anything. For example:
import urllib2
response = urllib2.urlopen('https://publicwww.com/websites/%22google.com%22/?export=csv')
data = response.read()
print 'data: ', data
It does not show anything; how can I handle that? When I search the web, all the questions are about how to send a GET request. I can send the GET request, but I have no idea how the file can be downloaded, as it is not in the response of the request.
I do not have any idea of how to find a solution for that.
You can use urlretrieve to download the file.
EX:
import urllib.request
u = "https://publicwww.com/websites/%22google.com%22/?export=csv"
urllib.request.urlretrieve(u, "Ktest.csv")
You can also download the file using the requests module in Python:
import shutil
import requests
url = "https://publicwww.com/websites/%22google.com%22/?export=csv"
response = requests.get(url, stream=True)
with open('file.csv', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
You could try wget, if you have it:
import os
os.system('wget "https://publicwww.com/websites/%22google.com%22/?export=csv"')  # quote the URL so the shell does not expand '?'
I have tried to upload a PDF by sending a POST request to an API in both R and Python, but I am not having much success.
Here is my code in R
library(httr)
url <- "https://envoc-apply-api.azurewebsites.net/api/apply"
POST(url, body = upload_file("filename.pdf"))
The status I received is 500, when I want a status of 202.
I have also tried with the exact path instead of just the filename, but that comes up with a "file does not exist" error.
My code in Python
import requests
url ='https://envoc-apply-api.azurewebsites.net/api/apply'
files = {'file': open('filename.pdf', 'rb')}
r = requests.post(url, files=files)
The error I received:
FileNotFoundError: [Errno 2] No such file or directory: 'filename.pdf'
I have been trying to use these two guides as examples.
R https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
Python http://requests.readthedocs.io/en/latest/user/quickstart/
Please let me know if you need any more info.
Any help will be appreciated.
You need to specify a full path to the file:
import requests
url ='https://envoc-apply-api.azurewebsites.net/api/apply'
files = {'file': open(r'C:\Users\me\filename.pdf', 'rb')}  # raw string, so the backslashes are not treated as escapes
r = requests.post(url, files=files)
or something like that: otherwise it never finds filename.pdf when it tries to open it.
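A hedged sketch of building such a path with pathlib instead of hard-coding the drive and user (the documents folder name here is hypothetical):

```python
from pathlib import Path

# Build an absolute path instead of relying on the current working
# directory, which is what open('filename.pdf') falls back to.
base_dir = Path.home() / "documents"   # hypothetical folder holding the PDF
pdf_path = base_dir / "filename.pdf"

# requests.post(url, files={'file': pdf_path.open('rb')}) would then
# find the file regardless of where the script was launched from.
print(pdf_path.name)
```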
I'm trying to download a zip artifact from TeamCity using Python 3, and I'm not having much luck.
From the browser I would normally do this:
http://USERNAME:PWD@SERVER/httpAuth/repository/downloadAll/dood_dad/latest.lastSuccessful
But if I try this with urllib.request.urlretrieve, I get an exception about an invalid port, because it doesn't know about the username and password at the front of the URL and parses everything after the ':' as a port - fair enough.
So I guess I need to use TeamCity's httpAuth support and use the URL
http://SERVERNAME/httpAuth/repository/downloadAll/dood_dad/latest.lastSuccessful
When I try that I get a 401 Unauthorized, which I expected because I need to supply the username and password.
But I can't work out how.
I've added this:
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(None,
                          uri=url_to_open,
                          user='userame',
                          passwd='password')
opener = urllib.request.build_opener(auth_handler)
urllib.request.install_opener(opener)
local_filename, headers = urllib.request.urlretrieve(url)
But I'm still getting HTTP Error 401: Unauthorized
TIA.
You can use a lib like requests that lets you pass the basic auth as a param; see more here: http://docs.python-requests.org/en/latest/user/authentication/#basic-authentication
import requests
from requests.auth import HTTPBasicAuth
import shutil
response = requests.get('http://...', auth=HTTPBasicAuth('user', 'pass'), stream=True)
with open('filename.zip', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
This works OK:
import urllib
from urllib.request import HTTPPasswordMgrWithDefaultRealm
pwdmgr = HTTPPasswordMgrWithDefaultRealm()
pwdmgr.add_password(None, uri=url, user='XXXX', passwd='XXXX')
auth_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr)
opener = urllib.request.build_opener(auth_handler)
urllib.request.install_opener(opener)
local_filename, headers = urllib.request.urlretrieve(url)
I'm not entirely sure why the newer code works where the older code didn't.
FYI: the requests code never worked either:
response = requests.get('http://...', auth=HTTPBasicAuth('user', 'pass'), stream=True)
I kept getting Unauthorized HTTP errors.
Obtaining Artifacts from a Build Script
import getpass
import subprocess

USERNAME = getpass.getuser()
PWD = getpass.getpass(prompt='PWD:', stream=None)
subprocess.run(['wget', 'http://' + USERNAME + ':' + PWD + '@SERVER/httpAuth/repository/downloadAll/dood_dad/latest.lastSuccessful'])