I'm writing a Python parser (using urllib2) for addresses that contain non-English characters. The goal is to find the coordinates of every address.
When I open this url in Firefox:
http://maps.google.com/maps/geo?q=Czech%20Republic%2010000%20Male%C5%A1ice&output=csv
it is converted (changes in address box) to
http://maps.google.com/maps/geo?q=Czech Republic 10000 Malešice&output=csv
and returns
200,6,50.0865113,14.4918052
which is a correct result.
However, if I open the same url (encoded, with %20 and such) in urllib2 (or Opera browser), the result is
200,4,49.7715220,13.2955410
which is incorrect. How can I open the first url in urllib2 to get the "200,6,50.0865113,14.4918052" result?
Edit:
Code used
import urllib2
psc = '10000'
name = 'Malešice'
url = 'http://maps.google.com/maps/geo?q=%s&output=csv' % urllib2.quote('Czech Republic %s %s' % (psc, name))
response = urllib2.urlopen(url)
data = response.read()
print 'Parsed url %s, result %s\n' % (url, data)
output
Parsed url http://maps.google.com/maps/geo?q=Czech%20Republic%2010000%20Male%C5%A1ice&output=csv, result 200,4,49.7715220,13.2955410
I can reproduce this behavior, and at first I was dumbfounded as to why it's happening. Closer inspection of the HTTP requests with Wireshark showed that the requests sent by Firefox (not surprisingly) contain a few more HTTP headers.
In the end it turned out it's the Accept-Language header that makes the difference. You only get the correct result if
an Accept-Language header is set
and it has a non-English language listed first (the priorities don't seem to matter)
So, for example this Accept-Language header works:
headers = {'Accept-Language': 'de-ch,en'}
To summarize: modified like this, your code works for me:
# -*- coding: utf-8 -*-
import urllib2
psc = '10000'
name = 'Malešice'
url = 'http://maps.google.com/maps/geo?q=%s&output=csv' % urllib2.quote('Czech Republic %s %s' % (psc, name))
headers = {'Accept-Language': 'de-ch,en'}
req = urllib2.Request(url, None, headers)
response = urllib2.urlopen(req)
data = response.read()
print 'Parsed url %s, result %s\n' % (url, data)
Note: In my opinion, this is a bug in Google's geocoding API. The Accept-Language header indicates what languages the user agent prefers the content in, but it shouldn't have any effect on how the request is interpreted.
Related
I need to HTTP Basic Auth for a REST call. In the username I have to provide a domain (which has a hyphen) and then a backslash to separate it from the username, like this: DOM-AIN\user_name. Then the password is pretty benign.
This works fine with curl:
curl 'https://DOM-AIN\user_name:password@myurl.com'
I need to put this into Python now, but I've tried with requests and urllib/2/3, and they don't like the \, the :, or the @. Even when I URL-encode them (%40 and so on), the encoded characters get interpreted literally, urllib thinks I'm trying to define a port, and I get an error (invalid socket, I think; I forget the exact message).
So I tried passing the username and password in the header using urllib3, but I get an unauthorized access error. I suspect it's because I need to somehow encode the username in the header to account for the backslash (%5C), but that doesn't seem to be working either.
Here is some code that doesn't work:
# Attempt 1
http = urllib3.PoolManager()
url1 = 'https://ws.....'
headers = urllib3.util.make_headers(basic_auth='DOM-AIN\user_name:password')
r1 = http.request('GET', url1, headers=headers)
response = r1.data
# Attempt 2
passwordManager = urllib2.HTTPPasswordMgrWithDefaultRealm()
passwordManager.add_password(None, url1, 'DOM-AIN\user_name', password)
authenticationHandler = urllib2.HTTPBasicAuthHandler(passwordManager)
opener = urllib2.build_opener(authenticationHandler)
data = opener.open(url1)
There were some other attempts with request, but I don't have those anymore. I can get the errors of these if it would be useful, but if there is already a known thing I'm doing wrong that would be great...
Backslashes must be escaped in Python string literals:
username = 'DOM-AIN\\user_name' # OR
username = r'DOM-AIN\user_name' # raw-string literal
Example:
import urllib2, base64

username = r'DOM-AIN\user_name'
password = 'password'  # placeholder
request = urllib2.Request('https://example.com')
credentials = base64.b64encode(username + b':' + password)
request.add_header('Authorization', b'Basic ' + credentials)
response = urllib2.urlopen(request)
Note: unlike the HTTPBasicAuthHandler approach, this always sends the credentials preemptively, without waiting for a 401 response with a WWW-Authenticate header.
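For comparison, here's a minimal sketch of the handler-based approach with the backslash correctly escaped (the URL and credentials are placeholders); it only sends credentials after the server challenges with a 401:
import urllib2

# hypothetical endpoint and credentials, for illustration only
url = 'https://example.com/rest/endpoint'
password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, url, r'DOM-AIN\user_name', 'password')
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_manager))
response = opener.open(url)  # credentials sent only after a 401 challenge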
First convert your 'DOM-AIN\user_name:password' credentials into a base64 string. Let's say it's XXXXYYYYYYY. Now place this base64 string into the HTTP header, like the code below with urllib2.
headers = { 'Authorization' : 'Basic XXXXYYYYYYY' }
req = urllib2.Request(url, data, headers)
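A quick sketch of how that base64 string could be computed (placeholder credentials; note the escaped backslash in the string literal):
import base64

encoded = base64.b64encode('DOM-AIN\\user_name:password')  # e.g. 'XXXXYYYYYYY'
headers = {'Authorization': 'Basic ' + encoded}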
I found a way using urllib, with this post's mention of FancyURLopener sending me down the right path. This was the closest I could come to replicating the way it worked in curl. Looking at Sabuj's answer, there might be a way to use headers properly, but I haven't tried his method.
import urllib
opener = urllib.FancyURLopener()
data = opener.open('https://DOM-AIN%5Cuser_name:password@url.com?whatever_parameters')
response = data.read()
It works when I URL-encode only the backslash. It didn't work when I encoded the other characters like : and @.
I am writing some code to interface with Redmine, and I need to upload some files as part of the process, but I am not sure how to make a POST request from Python containing a binary file.
I am trying to mimic the commands here:
curl --data-binary "@image.png" -H "Content-Type: application/octet-stream" -X POST -u login:password http://redmine/uploads.xml
I am trying to mimic that in Python (below), but it does not seem to work. I am not sure if the problem is somehow related to encoding the file or if something is wrong with the headers.
import urllib2, os
FilePath = "C:\somefolder\somefile.7z"
FileData = open(FilePath, "rb")
length = os.path.getsize(FilePath)
password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, 'http://redmine/', 'admin', 'admin')
auth_handler = urllib2.HTTPBasicAuthHandler(password_manager)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
request = urllib2.Request(r'http://redmine/uploads.xml', FileData)
request.add_header('Content-Length', '%d' % length)
request.add_header('Content-Type', 'application/octet-stream')
try:
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.HTTPError as e:
    error_message = e.read()
    print error_message
I have access to the server, and it looks like an encoding error:
...
invalid byte sequence in UTF-8
Line: 1
Position: 624
Last 80 unconsumed characters:
7z¼¯'ÅÐз2^Ôøë4g¸R<süðí6kĤª¶!»=}jcdjSPúá-º#»ÄAtD»H7Ê!æ½]j):
(further down)
Started POST "/uploads.xml" for 192.168.0.117 at 2013-01-16 09:57:49 -0800
Processing by AttachmentsController#upload as XML
WARNING: Can't verify CSRF token authenticity
Current user: anonymous
Filter chain halted as :authorize_global rendered or redirected
Completed 401 Unauthorized in 13ms (ActiveRecord: 3.1ms)
Basically, what you do is correct. Looking at the Redmine docs you linked to, it seems that the suffix after the dot in the URL denotes the type of posted data (.json for JSON, .xml for XML), which agrees with the response you get: Processing by AttachmentsController#upload as XML. I guess maybe there's a bug in the docs, and to post binary data you should try using the http://redmine/uploads URL instead of http://redmine/uploads.xml.
By the way, I highly recommend the very good and very popular Requests library for HTTP in Python. It's much better than what's in the standard lib (urllib2). It supports authentication as well, but I skipped it for brevity here.
import requests
with open('./x.png', 'rb') as f:
    data = f.read()
res = requests.post(url='http://httpbin.org/post',
                    data=data,
                    headers={'Content-Type': 'application/octet-stream'})
# let's check if what we sent is what we intended to send...
import json
import base64
assert base64.b64decode(res.json()['data'][len('data:application/octet-stream;base64,'):]) == data
UPDATE
To find out why this works with Requests but not with urllib2, we have to examine the difference in what's being sent. To see this, I'm sending the traffic to an HTTP proxy (Fiddler) running on port 8888:
Using Requests
import requests
data = 'test data'
res = requests.post(url='http://localhost:8888',
                    data=data,
                    headers={'Content-Type': 'application/octet-stream'})
we see
POST http://localhost:8888/ HTTP/1.1
Host: localhost:8888
Content-Length: 9
Content-Type: application/octet-stream
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: python-requests/1.0.4 CPython/2.7.3 Windows/Vista
test data
and using urllib2
import urllib2
data = 'test data'
req = urllib2.Request('http://localhost:8888', data)
req.add_header('Content-Length', '%d' % len(data))
req.add_header('Content-Type', 'application/octet-stream')
res = urllib2.urlopen(req)
we get
POST http://localhost:8888/ HTTP/1.1
Accept-Encoding: identity
Content-Length: 9
Host: localhost:8888
Content-Type: application/octet-stream
Connection: close
User-Agent: Python-urllib/2.7
test data
I don't see any differences which would warrant the different behavior you observe. Having said that, it's not uncommon for HTTP servers to inspect the User-Agent header and vary their behavior based on its value. Try changing the headers sent by Requests one by one, making them the same as those sent by urllib2, and see when it stops working.
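For example, a sketch of overriding the Requests defaults to match urllib2's headers (values taken from the captures above), toggling one at a time:
import requests

# start from urllib2's header values and change them back one at a time
headers = {
    'Content-Type': 'application/octet-stream',
    'Accept-Encoding': 'identity',      # urllib2's default
    'User-Agent': 'Python-urllib/2.7',  # urllib2's default
}
res = requests.post('http://localhost:8888', data='test data', headers=headers)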
This has nothing to do with a malformed upload. The HTTP log clearly shows 401 Unauthorized, and warns that the CSRF token can't be verified. Try sending a valid CSRF token with the upload.
More about CSRF tokens here:
What is a CSRF token ? What is its importance and how does it work?
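As a sketch (assuming you can obtain the token, e.g. by scraping a previously fetched form page, and that the server accepts it via the Rails-style X-CSRF-Token header):
# csrf_token is hypothetical; it would have to be extracted from a page
# served by the application first
request.add_header('X-CSRF-Token', csrf_token)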
You need to add a Content-Disposition header, something like this (I used mod_python here, but the principle should be the same):
request.headers_out['Content-Disposition'] = 'attachment; filename=%s' % myfname
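In the OP's urllib2 code, the equivalent would be something like this (myfname being a placeholder for the file name):
request.add_header('Content-Disposition', 'attachment; filename=%s' % myfname)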
You can use unirest; it provides an easy way to make a post request.
import unirest

def callback(response):
    print "code:" + str(response.code)
    print "******************"
    print "headers:" + str(response.headers)
    print "******************"
    print "body:" + str(response.body)
    print "******************"
    print "raw_body:" + str(response.raw_body)

# consume async post request
def consumePOSTRequestASync():
    params = {'test1': 'param1', 'test2': 'param2'}
    # we need to pass a dummy open file, because unirest does not provide
    # a flag to switch between application/x-www-form-urlencoded and
    # multipart/form-data
    params['dummy'] = open('dummy.txt', 'r')
    url = 'http://httpbin.org/post'
    headers = {"Accept": "application/json"}
    # call the post service with headers and params
    unirest.post(url, headers=headers, params=params, callback=callback)

# post async request as multipart/form-data
consumePOSTRequestASync()
I want to make a POST request to an HTTPS site that should respond with a .csv file.
I have this Python code:
url = 'https://www.site.com/servlet/datadownload'
values = {
    'val1': '123',
    'val2': 'abc',
    'val3': '1b3',
}
data = urllib.urlencode(values)
req = urllib2.Request(url,data)
response = urllib2.urlopen(req)
myfile = open('file.csv', 'wb')
shutil.copyfileobj(response.fp, myfile)
myfile.close()
But I'm getting the error:
BadStatusLine: '' (in httplib.py)
I've tried the post request with the Chrome Extension: Advanced REST client (screenshot) and that works fine.
What could be the problem, and how could I solve it? (Is it because of the HTTPS?)
EDIT, refactored code:
try:
    # conn = httplib.HTTPSConnection(host="www.site.com", port=443)
    # => gives a BadStatusLine: '' error
    conn = httplib.HTTPConnection("www.site.com")
    params = urllib.urlencode({'val1': '123', 'val2': 'abc', 'val3': '1b3'})
    conn.request("POST", "/nps/servlet/exportdatadownload", params)
    content = conn.getresponse()
    print content.reason, content.status
    print content.read()
    conn.close()
except:
    import sys
    print sys.exc_info()[:2]
Output:
Found 302
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>302 Found</TITLE>
</HEAD><BODY>
<H1>Found</H1>
The document has moved here.<P>
<HR>
<ADDRESS>Oracle-Application-Server-10g/10.1.3.5.0 Oracle-HTTP-Server Server at mp-www1.mrco.be Port 7778</ADDRESS>
</BODY></HTML>
What am I doing wrong?
Is there a reason you've got to use urllib? Requests is simpler, better in almost every way, and abstracts away some of the cruft that makes urllib hard to work with.
As an example, I'd rework your example as something like:
import requests
resp = requests.post(url, data=values, allow_redirects=True)
At this point, the response from the server is available in resp.text, and you can do what you'd like with it. If requests wasn't able to POST properly (because you need a custom SSL certificate, for example), it should give you a nice error message that tells you why.
Even if you can't do this in your production environment, do this in a local shell to see what error messages you get from requests, and use that to debug urllib.
The BadStatusLine: '' (in httplib.py) gives away that something else might be going on here. This can happen when the server sends no reply back at all and just closes the connection.
As you mentioned that you're using an SSL connection, this might be particularly interesting to debug (with curl -v URL if you want).
If you find that curl -2 URL (which forces the use of SSLv2) seems to work while curl -3 URL (SSLv3) doesn't, you may want to take a look at issue #13636 and possibly #11220 on the Python bug tracker. Depending on your Python version and a possibly misconfigured web server, this might be causing the problem: the SSL defaults changed in v2.7.3.
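If the SSL handshake does turn out to be the culprit, a common workaround from that era (a sketch, untested against your server) is to force a specific protocol version by subclassing HTTPSConnection:
import httplib, socket, ssl, urllib2

class HTTPSConnectionV3(httplib.HTTPSConnection):
    """HTTPSConnection that forces SSLv3 for the handshake (workaround sketch)."""
    def connect(self):
        sock = socket.create_connection((self.host, self.port), self.timeout)
        self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file,
                                    ssl_version=ssl.PROTOCOL_SSLv3)

class HTTPSHandlerV3(urllib2.HTTPSHandler):
    def https_open(self, req):
        return self.do_open(HTTPSConnectionV3, req)

# install an opener so urllib2.urlopen uses SSLv3 for https:// URLs
urllib2.install_opener(urllib2.build_opener(HTTPSHandlerV3()))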
conn = httplib.HTTPSConnection(host='www.site.com', port=443, cert_file=_certfile)
params = urllib.urlencode({'cmd': 'token', 'device_id_st': 'AAAA-BBBB-CCCC',
                           'token_id_st': 'DDDD-EEEE_FFFF', 'product_id': 'Unit Test',
                           'product_ver': '1.6.3'})
conn.request("POST", "/servlet/datadownload", params)
response = conn.getresponse()
content = response.read()
#print response.status, response.reason
conn.close()
The server may not like the missing headers, particularly User-Agent and Content-Type. The Chrome screenshot shows what is used for these. Maybe try adding the headers:
import httplib, urllib

host = 'www.site.com'
url = '/servlet/datadownload'
values = {
    'val1': '123',
    'val2': 'abc',
    'val3': '1b3',
}
headers = {
    'User-Agent': 'python',
    'Content-Type': 'application/x-www-form-urlencoded',
}
values = urllib.urlencode(values)
conn = httplib.HTTPSConnection(host)
conn.request("POST", url, values, headers)
response = conn.getresponse()
data = response.read()
print 'Response: ', response.status, response.reason
print 'Data:'
print data
This is untested code, and you may want to experiment by adding other header values to match your screenshot. Hope it helps.
I need to perform preemptive basic authentication against an HTTP server, i.e., authenticate right away without waiting on a 401 response. Can this be done with httplib2?
Edit:
I solved it by adding an Authorization header to the request, as suggested in the accepted answer:
headers["Authorization"] = "Basic {0}".format(
base64.b64encode("{0}:{1}".format(username, password)))
Add an appropriately formed 'Authorization' header to your initial request.
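With httplib2 that looks roughly like this (a sketch; the URL and credentials are placeholders). Note that httplib2's own add_credentials() waits for a 401 challenge, which is exactly what you want to avoid here:
import base64, httplib2

h = httplib2.Http()
auth = base64.b64encode("{0}:{1}".format('username', 'password'))
headers = {'Authorization': 'Basic ' + auth}
# the Authorization header is sent on the very first request
response, content = h.request('https://example.com/protected', 'GET',
                              headers=headers)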
This also works with the built-in httplib (for anyone wishing to minimize 3rd-party libs/modules). I am using it to authenticate with our Jenkins server using the API Token that Jenkins can create for each user.
>>> import base64, httplib
>>> headers = {}
>>> headers["Authorization"] = "Basic {0}".format(
base64.b64encode("{0}:{1}".format('<username>', '<jenkins_API_token>')))
>>> ## Enable the job
>>> conn = httplib.HTTPConnection('jenkins.myserver.net')
>>> conn.request('POST', '/job/Foo-trunk/enable', None, headers)
>>> resp = conn.getresponse()
>>> resp.status
302
>>> ## Disable the job
>>> conn = httplib.HTTPConnection('jenkins.myserver.net')
>>> conn.request('POST', '/job/Foo-trunk/disable', None, headers)
>>> resp = conn.getresponse()
>>> resp.status
302
I realize this is old, but I figured I'd throw in the solution if you're using Python 3 with httplib2 since I haven't been able to find it anywhere else. I'm also authenticating against a Jenkins server using the API Token for each Jenkins user. If you're not concerned with Jenkins, simply substitute the actual user's password for the API Token.
b64encode expects a binary string of ASCII characters. In Python 3, a TypeError is raised if a plain string is passed in. To get around this, the "user:api_token" portion of the header must be encoded as either 'ascii' or 'utf-8' and passed to b64encode, and the resulting byte string must then be decoded back to a plain string before being placed in the header. The following code did what I needed:
import httplib2, base64
cred = base64.b64encode("{0}:{1}".format(
    '<user>', '<api_token>').encode('utf-8')).decode()
headers = {'Authorization': "Basic %s" % cred}
h = httplib2.Http('.cache')
response, content = h.request("http://my.jenkins.server/job/my_job/enable",
                              "GET", headers=headers)
Problem: When POSTing data with Python's urllib2, all data is URL encoded and sent as Content-Type: application/x-www-form-urlencoded. When uploading files, the Content-Type should instead be set to multipart/form-data and the contents be MIME-encoded.
To get around this limitation some sharp coders created a library called MultipartPostHandler which creates an OpenerDirector you can use with urllib2 to mostly automatically POST with multipart/form-data. A copy of this library is here: MultipartPostHandler doesn't work for Unicode files
I am new to Python and am unable to get this library to work. I wrote out essentially the following code. When I capture it in a local HTTP proxy, I can see that the data is still URL encoded and not multi-part MIME-encoded. Please help me figure out what I am doing wrong or a better way to get this done. Thanks :-)
FROM_ADDR = 'my@email.com'
try:
    data = open(file, 'rb').read()
except:
    print "Error: could not open file %s for reading" % file
    print "Check permissions on the file or folder it resides in"
    sys.exit(1)
# Build the POST request
url = "http://somedomain.com/?action=analyze"
post_data = {}
post_data['analysisType'] = 'file'
post_data['executable'] = data
post_data['notification'] = 'email'
post_data['email'] = FROM_ADDR
# MIME encode the POST payload
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler)
urllib2.install_opener(opener)
request = urllib2.Request(url, post_data)
request.set_proxy('127.0.0.1:8080', 'http') # For testing with Burp Proxy
# Make the request and capture the response
try:
    response = urllib2.urlopen(request)
    print response.geturl()
except urllib2.URLError, e:
    print "File upload failed..."
EDIT1: Thanks for your response. I'm aware of the ActiveState httplib solution to this (I linked to it above). I'd rather abstract away the problem and use a minimal amount of code to continue using urllib2 as I have been. Any idea why the opener isn't being installed and used?
It seems that the easiest and most compatible way to get around this problem is to use the 'poster' module.
# test_client.py
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2
# Register the streaming http handlers with urllib2
register_openers()
# Start the multipart/form-data encoding of the file "DSC0001.jpg"
# "image1" is the name of the parameter, which is normally set
# via the "name" parameter of the HTML <input> tag.
# headers contains the necessary Content-Type and Content-Length
# datagen is a generator object that yields the encoded parameters
datagen, headers = multipart_encode({"image1": open("DSC0001.jpg")})
# Create the Request object
request = urllib2.Request("http://localhost:5000/upload_image", datagen, headers)
# Actually do the request, and get the response
print urllib2.urlopen(request).read()
This worked perfect and I didn't have to muck with httplib. The module is available here:
http://atlee.ca/software/poster/index.html
I found this recipe for posting multipart using httplib directly (no external libraries involved):
import httplib
import mimetypes

def post_multipart(host, selector, fields, files):
    content_type, body = encode_multipart_formdata(fields, files)
    h = httplib.HTTP(host)
    h.putrequest('POST', selector)
    h.putheader('content-type', content_type)
    h.putheader('content-length', str(len(body)))
    h.endheaders()
    h.send(body)
    errcode, errmsg, headers = h.getreply()
    return h.file.read()

def encode_multipart_formdata(fields, files):
    LIMIT = '----------lImIt_of_THE_fIle_eW_$'
    CRLF = '\r\n'
    L = []
    for (key, value) in fields:
        L.append('--' + LIMIT)
        L.append('Content-Disposition: form-data; name="%s"' % key)
        L.append('')
        L.append(value)
    for (key, filename, value) in files:
        L.append('--' + LIMIT)
        L.append('Content-Disposition: form-data; name="%s"; filename="%s"' % (key, filename))
        L.append('Content-Type: %s' % get_content_type(filename))
        L.append('')
        L.append(value)
    L.append('--' + LIMIT + '--')
    L.append('')
    body = CRLF.join(L)
    content_type = 'multipart/form-data; boundary=%s' % LIMIT
    return content_type, body

def get_content_type(filename):
    return mimetypes.guess_type(filename)[0] or 'application/octet-stream'
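For example, the recipe could be called like this (httpbin.org and the field names are just stand-ins):
# post one plain field and one in-memory "file" to a test endpoint
response = post_multipart('httpbin.org', '/post',
                          fields=[('name', 'value')],
                          files=[('file', 'test.txt', 'file contents here')])
print response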
Just use python-requests; it will set the proper headers and do the upload for you:
import requests
files = {"form_input_field_name": open("filename", "rb")}
requests.post("http://httpbin.org/post", files=files)
I ran into the same problem and needed to do a multipart form post without using external libraries. I wrote a whole blog post about the issues I ran into.
I ended up using a modified version of http://code.activestate.com/recipes/146306/. The code at that URL actually just appends the content of the file as a string, which can cause problems with binary files. Here's my working code.
import codecs
import httplib
import io
import json
import mimetools
import mimetypes
import os
import socket
import urlparse

class MultiPartForm(object):
    """Accumulate the data to be used when posting a form."""

    def __init__(self):
        self.form_fields = []
        self.files = []
        self.boundary = mimetools.choose_boundary()

    def get_content_type(self):
        return 'multipart/form-data; boundary=%s' % self.boundary

    def add_field(self, name, value):
        """Add a simple field to the form data."""
        self.form_fields.append((name, value))

    def add_file(self, fieldname, filename, fileHandle, mimetype=None):
        """Add a file to be uploaded."""
        body = fileHandle.read()
        if mimetype is None:
            mimetype = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
        self.files.append((fieldname, filename, mimetype, body))

    def get_binary(self):
        """Return a binary buffer containing the form data, including attached files."""
        part_boundary = '--' + self.boundary
        binary = io.BytesIO()
        needsCLRF = False
        # Add the form fields
        for name, value in self.form_fields:
            if needsCLRF:
                binary.write('\r\n')
            needsCLRF = True
            block = [part_boundary,
                     'Content-Disposition: form-data; name="%s"' % name,
                     '',
                     value]
            binary.write('\r\n'.join(block))
        # Add the files to upload
        for field_name, filename, content_type, body in self.files:
            if needsCLRF:
                binary.write('\r\n')
            needsCLRF = True
            block = [part_boundary,
                     'Content-Disposition: file; name="%s"; filename="%s"' %
                         (field_name, filename),
                     'Content-Type: %s' % content_type,
                     '']
            binary.write('\r\n'.join(block))
            binary.write('\r\n')
            binary.write(body)
        # Add the closing boundary marker
        binary.write('\r\n--' + self.boundary + '--\r\n')
        return binary

form = MultiPartForm()
form.add_field("form_field", "my awesome data")

# Add a file
filepath = "/path/to/my/file.zip"
form.add_file("file_field", os.path.basename(filepath),
              fileHandle=codecs.open(filepath, "rb"))

# Build the request
url = "http://www.example.com/endpoint"
schema, netloc, url, params, query, fragments = urlparse.urlparse(url)

try:
    form_buffer = form.get_binary().getvalue()
    http = httplib.HTTPConnection(netloc)
    http.connect()
    http.putrequest("POST", url)
    http.putheader('Content-type', form.get_content_type())
    http.putheader('Content-length', str(len(form_buffer)))
    http.endheaders()
    http.send(form_buffer)
except socket.error, e:
    raise SystemExit(1)

r = http.getresponse()
if r.status == 200:
    print json.loads(r.read())
else:
    print 'Upload failed (%s): %s' % (r.status, r.reason)
What a coincidence! Two years and six months ago I created the project
https://pypi.python.org/pypi/MultipartPostHandler2, which fixes MultipartPostHandler for utf-8 systems. I have also made some minor improvements; you are welcome to test it :)
To answer the OP's question of why the original code didn't work: the handler passed in wasn't an instance of the class. The line
# MIME encode the POST payload
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler)
should read
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler())