Write PDF file from URL using urllib2 - python

I'm trying to save a dynamically generated PDF file from a web server using Python's urllib2 module.
I use the following code to fetch the data from the server and write it to a file on the local disk:
import urllib2
import cookielib
theurl = 'https://myweb.com/?pdf&var1=1'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders.append(('Cookie', cookie))
request = urllib2.Request(theurl)
print("... Sending HTTP GET to %s" % theurl)
f = opener.open(request)
data = f.read()
f.close()
opener.close()
FILE = open('report.pdf', "w")
FILE.write(data)
FILE.close()
This code runs without errors, but Adobe Reader does not recognize the written PDF file. If I make the request manually in Firefox, I receive the file and can view it without problems.
Comparing the received HTTP headers (Firefox vs. urllib2), the only difference is a header field, "Transfer-Encoding: chunked". Firefox receives this field, but it does not seem to be received when I make the urllib2 request.
Any suggestions?

Try changing,
FILE = open('report.pdf', "w")
to
FILE = open('report.pdf', "wb")
The extra 'b' opens the file in binary mode. Currently you are writing binary data in text mode, which on some platforms (notably Windows) translates newline bytes and corrupts the file.
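For readers on Python 3, the same fix can be sketched as follows. The helper name save_binary and the sample bytes are illustrative assumptions, not part of the original question; the only point is that the file is opened with "wb" so the bytes reach the disk unchanged.

```python
import os
import tempfile


def save_binary(path, data):
    """Write raw bytes to path without newline or encoding translation."""
    with open(path, "wb") as out_file:  # "b" selects binary mode
        out_file.write(data)


# Stand-in for the bytes returned by f.read(); a real PDF starts with b"%PDF".
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\r\n1 0 obj\n"

target = os.path.join(tempfile.mkdtemp(), "report.pdf")
save_binary(target, sample)

# Read the file back in binary mode to confirm nothing was altered.
with open(target, "rb") as check:
    round_trip = check.read()

print(round_trip == sample)
```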

Related

Why request fails to download an excel file from web?

The URL is a direct link to a file on the web (an xlsb file) that I am trying to download. The code below runs without errors and the file appears to be created at the given path, but when I try to open it, Excel reports that the file is corrupt. The response status is 400, so it is a bad request. Any advice on this?
url = 'http://rigcount.bakerhughes.com/static-files/55ff50da-ac65-410d-924c-fe45b23db298'
file_name = r'local path with xlsb extension'
with open(file_name, "wb") as file:
    response = requests.request(method="GET", url=url)
    file.write(response.content)
It works for me. Try this:
from requests import get
url = 'http://rigcount.bakerhughes.com/static-files/55ff50da-ac65-410d-924c-fe45b23db298'
# make HTTP request to fetch data
r = get(url)
# check if request is success
r.raise_for_status()
# write out byte content to file
with open('out.xlsb', 'wb') as out_file:
    out_file.write(r.content)

Sending file from URL in REST request Python

This is what I'm currently using to send images to the API:
import requests
response = requests.post("http://someAPI",
    auth=(username, password),
    files={'imgFile': open(filepath, 'rb')}
)
It works for local files, but I would like to be able to send images from a URL as well. I know how to do this by saving the file, then opening it as a local file, sending it to the API and then removing it.
Is there any way of doing this without having to save and remove the file?
You can use StringIO. A StringIO object is a file-like object that supports seek, read, and write, so you can load data into it and serve it from memory without writing a file to disk. I normally use this approach for creating CAPTCHAs.
import StringIO
import requests
# Imagine you read the image file in image as base64 encoded...
buff = StringIO.StringIO()
buff.write(image.decode('base64'))
buff.seek(0)
response = requests.post("http://someAPI",
    auth=(username, password),
    files={'imgFile': buff}
)
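On Python 3 the StringIO module no longer exists and binary data belongs in io.BytesIO. A minimal sketch, assuming the image bytes have already been fetched (for example with requests.get(image_url).content); the helper name in_memory_upload, the placeholder bytes, and the file name are illustrative only:

```python
import io


def in_memory_upload(image_bytes, field_name="imgFile", filename="image.jpg"):
    """Wrap raw bytes in a file-like object usable as requests' files= argument."""
    buff = io.BytesIO(image_bytes)  # held in memory, nothing is written to disk
    return {field_name: (filename, buff)}


# Placeholder bytes; in practice something like requests.get(image_url).content.
fake_image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
files = in_memory_upload(fake_image, filename="image.png")

# The dict can then be handed to the existing call, e.g.:
#   requests.post("http://someAPI", auth=(username, password), files=files)
name, stream = files["imgFile"]
print(name, len(stream.getvalue()))
```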

How do i download pdf file over https with python

I am writing a Python script that saves a PDF file locally, named according to the format given in the URL. For example:
https://Hostname/saveReport/file_name.pdf #saves the content in PDF file.
I am opening this URL through a Python script:
import webbrowser
webbrowser.open("https://Hostname/saveReport/file_name.pdf")
The page at this URL contains lots of images and text. Once the URL is opened, I want to save the file in PDF format using a Python script.
This is what I have done so far.
Code 1:
import requests
url="https://Hostname/saveReport/file_name.pdf" #Note: It's https
r = requests.get(url, auth=('usrname', 'password'), verify=False)
file = open("file_name.pdf", 'w')
file.write(r.read())
file.close()
Code 2:
import urllib2
import ssl
url="https://Hostname/saveReport/file_name.pdf"
context = ssl._create_unverified_context()
response = urllib2.urlopen(url, context=context) #How should i pass authorization details here?
html = response.read()
With the above code I get: urllib2.HTTPError: HTTP Error 401: Unauthorized
If I use Code 2, how can I pass the authorization details?
I think this will work:
import requests
import shutil
url="https://Hostname/saveReport/file_name.pdf" #Note: It's https
r = requests.get(url, auth=('usrname', 'password'), verify=False,stream=True)
r.raw.decode_content = True
with open("file_name.pdf", 'wb') as f:
    shutil.copyfileobj(r.raw, f)
One way you can do that is:
import urllib3
urllib3.disable_warnings()
url = r"https://websitewithfile.com/file.pdf"
fileName = r"file.pdf"
with urllib3.PoolManager() as http:
    r = http.request('GET', url)
    with open(fileName, 'wb') as fout:
        fout.write(r.data)
You can try something like:
import requests
response = requests.get('https://websitewithfile.com/file.pdf', verify=False, auth=('user', 'pass'))
# open in binary mode and write the raw response bytes
with open('file.pdf', 'wb') as fout:
    fout.write(response.content)
For some files, at least tar archives (and possibly other files as well), you can use pip:
import sys
from subprocess import call, run, PIPE
url = "https://blabla.bla/foo.tar.gz"
call([sys.executable, "-m", "pip", "download", url], stdout=PIPE, stderr=PIPE)
However, you should confirm that the download succeeded in some other way, because pip raises an error for any file that is not an archive containing setup.py, hence stderr=PIPE. (Alternatively, you may be able to determine whether the download succeeded by parsing the subprocess error message.)
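The question also asked how to pass credentials in Code 2. One option, sketched here for Python 3's urllib.request (the successor of urllib2), is to build the HTTP Basic Authorization header by hand. The helper name basic_auth_request and the placeholder credentials are assumptions for illustration, and the request is only constructed here, not sent:

```python
import base64
import urllib.request


def basic_auth_request(url, username, password):
    """Return a Request that carries an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


request = basic_auth_request(
    "https://Hostname/saveReport/file_name.pdf", "usrname", "password"
)
print(request.get_header("Authorization"))

# To actually download (needs network access and valid credentials):
#   with urllib.request.urlopen(request) as response, open("file_name.pdf", "wb") as f:
#       f.write(response.read())
```

This only helps if the server really uses Basic authentication, which the 401 response alone does not prove.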

Make an http POST request to upload a file using Python urllib/urllib2

I would like to make a POST request to upload a file to a web service (and get response) using Python. For example, I can do the following POST request with curl:
curl -F "file=@style.css" -F output=json http://jigsaw.w3.org/css-validator/validator
How can I make the same request with python urllib/urllib2? The closest I got so far is the following:
with open("style.css", 'r') as f:
    content = f.read()
post_data = {"file": content, "output": "json"}
request = urllib2.Request("http://jigsaw.w3.org/css-validator/validator", \
                          data=urllib.urlencode(post_data))
response = urllib2.urlopen(request)
I got a HTTP Error 500 from the code above. But since my curl command succeeds, it must be something wrong with my python request?
I am quite new to this topic and my question may have very simple answers or mistakes.
Personally I think you should consider the requests library to post files.
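import requests
url = 'http://jigsaw.w3.org/css-validator/validator'
# open in binary mode; data= carries the extra form field from the curl command
with open('style.css', 'rb') as f:
    response = requests.post(url, files={'file': f}, data={'output': 'json'})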
Uploading files using urllib2 is not impossible but quite a complicated task: http://pymotw.com/2/urllib2/#uploading-files
After some digging around, it seems this post solved my problem. It turns out I need to have the multipart encoder set up properly.
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2
register_openers()
with open("style.css", 'r') as f:
    datagen, headers = multipart_encode({"file": f})
    request = urllib2.Request("http://jigsaw.w3.org/css-validator/validator", \
                              datagen, headers)
    response = urllib2.urlopen(request)
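To make clearer what a multipart encoder produces, here is a rough standard-library sketch for Python 3. It is not the poster library's implementation; the function name encode_multipart, the sample CSS bytes, and the choice of text/css as content type are assumptions for illustration. The request is built but not sent.

```python
import urllib.request
import uuid


def encode_multipart(fields, files):
    """Build a multipart/form-data body.

    fields: dict of name -> str value
    files:  dict of name -> (filename, content bytes, content type)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    parts = []
    for name, value in fields.items():
        parts += [
            delimiter,
            f'Content-Disposition: form-data; name="{name}"'.encode("utf-8"),
            b"",
            value.encode("utf-8"),
        ]
    for name, (filename, content, ctype) in files.items():
        parts += [
            delimiter,
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"'.encode("utf-8"),
            f"Content-Type: {ctype}".encode("ascii"),
            b"",
            content,
        ]
    # The closing delimiter ends with two extra dashes.
    parts += [delimiter + b"--", b""]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


body, content_type = encode_multipart(
    {"output": "json"},
    {"file": ("style.css", b"body { color: red }", "text/css")},
)
request = urllib.request.Request(
    "http://jigsaw.w3.org/css-validator/validator",
    data=body,
    headers={"Content-Type": content_type},
)
# response = urllib.request.urlopen(request)  # uncomment to actually send
print(content_type.split(";")[0], len(body))
```

The original attempt failed because urlencode produces an application/x-www-form-urlencoded body, whereas curl -F sends multipart/form-data like the body built here.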
Well, there are multiple ways to do it. As mentioned above, you can send the file in "multipart/form-data". However, the target service may not be expecting this type, in which case you may try some more approaches.
Pass the file object
urllib2 can accept a file object as data. When you pass one, the library reads the file as a binary stream and sends it out. However, it will not set a proper Content-Type header. Moreover, if the Content-Length header is missing, it will try to call len() on the object, which file objects do not support. You must therefore provide both the Content-Type and the Content-Length headers for this method to work:
import os
import urllib2
filename = '/var/tmp/myfile.zip'
headers = {
    'Content-Type': 'application/zip',
    'Content-Length': os.stat(filename).st_size,
}
request = urllib2.Request('http://localhost', open(filename, 'rb'),
                          headers=headers)
response = urllib2.urlopen(request)
Wrap the file object
To avoid dealing with the length yourself, you can create a simple wrapper object. With just a small change you can adapt it to take the content from a string if you already have the file loaded in memory.
class BinaryFileObject:
    """Simple wrapper for a binary file for urllib2."""
    def __init__(self, filename):
        self.__size = int(os.stat(filename).st_size)
        self.__f = open(filename, 'rb')
    def read(self, blocksize):
        return self.__f.read(blocksize)
    def __len__(self):
        return self.__size
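To see what the wrapper provides, here is a small offline check written for Python 3. The close() method, the default block size, and the temporary file are additions for illustration, and nothing is sent over the network; as the answer describes, the object would be passed as the request data in place of the bare file object.

```python
import os
import tempfile


class BinaryFileObject:
    """File wrapper that also reports its size through len()."""

    def __init__(self, filename):
        self._size = os.stat(filename).st_size
        self._f = open(filename, "rb")

    def read(self, blocksize=-1):
        return self._f.read(blocksize)

    def __len__(self):
        return self._size

    def close(self):
        self._f.close()


# Throwaway file standing in for /var/tmp/myfile.zip.
path = os.path.join(tempfile.mkdtemp(), "myfile.zip")
with open(path, "wb") as f:
    f.write(b"PK\x03\x04" + b"\x00" * 96)

payload = BinaryFileObject(path)
print(len(payload))            # the size is known without reading the file
first_chunk = payload.read(4)  # reads are delegated to the underlying file
payload.close()
```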
Encode the content as base64
Another way is to encode the data with base64.b64encode and send a Content-Transfer-Encoding: base64 header. However, this method requires support on the server side. Depending on the implementation, the service may either accept the file and store it incorrectly, or return HTTP 400. For example, the GitHub API won't raise an error, but the uploaded file will be corrupted.
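A small sketch of the encoding step itself, using made-up sample bytes. It shows that the receiver has to decode the payload again, which is why server-side support matters, and that the encoded form is about one third larger:

```python
import base64

raw = bytes(range(256)) * 3          # 768 arbitrary bytes standing in for file content
encoded = base64.b64encode(raw)      # ASCII-safe representation of the same data
decoded = base64.b64decode(encoded)  # the step the server must perform on its side

print(len(raw), len(encoded), decoded == raw)
```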

Log into website/server with python 3.x

I would like to be able to log into a website or server by having the user input the username and password. The python program would then log into a website and print the source code of the redirected URL. I am able to print the source code of a webpage with the following code from eHow. The server I am trying to access is 192.168.0.1 which is the wifi server of my home network.
import urllib.request
from array import array
file = open("DLINK.txt","w")
filehandle = urllib.request.urlopen('http://192.168.0.1')
for lines in filehandle.readlines():
    print(lines)
    file.write(str(lines))
file.close()
filehandle.close()
I'm guessing the D-Link is using Basic Authentication. In that case you can use this code from the documentation to get through the login.
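import urllib.request
# Placeholders: the router address from the question and credentials typed by the user
url = 'http://192.168.0.1'
username = input('Username: ')
password = input('Password: ')
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
# The realm is a guess; it must match the realm the router reports in its 401 response
auth_handler.add_password(realm='D-Link', uri=url, user=username, passwd=password)
opener = urllib.request.build_opener(auth_handler)
urllib.request.install_opener(opener)
f = urllib.request.urlopen(url)
print(f.status)
print(f.reason)
print(f.read().decode('utf-8'))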
