I want to download some data from a website using Python's requests package. I'm sitting behind a proxy that needs authentication.
My problem is that my password contains the character #. I cannot change the password, since the machine is used by several people.
So if I use the syntax from http://docs.python-requests.org/en/latest/user/advanced/,
http://user:password@host/
requests splits the password at the # and interprets the part after it as the host. Is there a way to solve this? Maybe using quotes or something like that?
As far as I know, you can manually use HTTPProxyAuth:
import requests
from requests.auth import HTTPProxyAuth
auth = HTTPProxyAuth('username', 'password')
proxy = {'http': 'http://host/'}
req = requests.get('http://www.google.com', proxies=proxy, auth=auth)
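If you would rather keep the credentials in the proxy URL itself, percent-encoding the # as %23 should also work, since requests should unquote the username and password it extracts from the proxy URL. A rough sketch with a made-up password:
import urllib
import requests

password = 'pa#ss'                         # hypothetical password containing '#'
quoted = urllib.quote(password, safe='')   # 'pa%23ss', so '#' no longer breaks URL parsing
proxy = {'http': 'http://username:%s@host:8080' % quoted}
req = requests.get('http://www.google.com', proxies=proxy)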
I'm trying to send requests to an HTTPS website behind corporate proxy that requires authentication.
The below code works:
import requests
import urllib
requests.get('https://google.com',
             proxies={
                 'http': 'http://myusername:mypassword@10.20.30.40:8080',
                 'https': 'http://myusername:mypassword@10.20.30.40:8080'
             },
             verify=False)
but I want to avoid hardcoding the username and password in the script file, especially since we have to reset our passwords every 60 days. I need it to authenticate automatically as the logged-in Windows user, the same way a browser does.
I looked online and learned that this is not possible through requests (1, 2, 3, 4) and I would have to resort to something like pycurl or px but all examples online had username and password provided explicitly. There was also this solution utilizing win32com.client but I have no idea how to use this in place of requests.
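One possible route, sketched here with pycurl under the assumption that the local libcurl is built with SSPI support, so that empty proxy credentials make it negotiate as the logged-in Windows user (the proxy address is the one from the snippet above):
import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'https://google.com')
c.setopt(pycurl.PROXY, 'http://10.20.30.40:8080')
c.setopt(pycurl.PROXYUSERPWD, ':')                        # empty user:pass, use the current user's credentials
c.setopt(pycurl.PROXYAUTH, pycurl.HTTPAUTH_GSSNEGOTIATE)  # negotiate (Kerberos/NTLM) with the proxy
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.setopt(pycurl.SSL_VERIFYPEER, 0)                        # mirrors verify=False above
c.perform()
c.close()
print buf.getvalue()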
I am trying to use urllib2 through a proxy; however, after trying just about every variation of passing my authentication details with urllib2, I either get a request that hangs forever and returns nothing, or I get 407 errors. I can connect to the web fine with my browser, which uses a proxy PAC file and redirects accordingly; however, I can't seem to do anything from the command line with curl, wget, urllib2, etc., even if I use the proxies that the PAC file redirects to. I tried setting my proxy to each of the proxies from the PAC file using urllib2, and none of them work.
My current script looks like this:
import urllib2 as url
proxy = url.ProxyHandler({'http': 'username:password@my.proxy:8080'})
auth = url.HTTPBasicAuthHandler()
opener = url.build_opener(proxy, auth, url.HTTPHandler)
url.install_opener(opener)
url.urlopen("http://www.google.com/")
which throws HTTP Error 407: Proxy Authentication Required. I also tried:
import urllib2 as url
handlePass = url.HTTPPasswordMgrWithDefaultRealm()
handlePass.add_password(None, "http://my.proxy:8080", "username", "password")
auth_handler = url.HTTPBasicAuthHandler(handlePass)
opener = url.build_opener(auth_handler)
url.install_opener(opener)
url.urlopen("http://www.google.com")
which hangs the way curl and wget do, eventually timing out.
What do I need to do to diagnose the problem? How is it possible that I can connect via my browser but not from the command line on the same computer using what would appear to be the same proxy and credentials?
Might it be something to do with the router? If so, how could it distinguish between browser HTTP requests and command-line HTTP requests?
Frustrations like this are what drove me to use Requests. If you're doing significant amounts of work with urllib2, you really ought to check it out. For example, to do what you wish to do using Requests, you could write:
import requests
from requests.auth import HTTPProxyAuth
proxy = {'http': 'http://my.proxy:8080'}
auth = HTTPProxyAuth('username', 'password')
r = requests.get('http://www.google.com/', proxies=proxy, auth=auth)
print r.text
Or you could wrap it in a Session object and every request will automatically use the proxy information (plus it will store & handle cookies automatically!):
s = requests.Session()
s.proxies = proxy
s.auth = auth
r = s.get('http://www.google.com/')
print r.text
This is a newbie problem with python, advice is much appreciated.
no-ip.com provides an easy way to update a computer's changing ip-address, simply open the url
http://user:password@dynupdate.no-ip.com/nic/update?hostname=my.host.name
Both http and https work when the URL is entered in Firefox. I tried to implement that in a script residing in "/etc/NetworkManager/dispatcher.d" to be used by NetworkManager on a recent version of Ubuntu.
What works is the python script:
from urllib import urlopen
urlopen("http://user:password@dynupdate.no-ip.com/nic/update?hostname=my.host.name")
What I want to have is the same with "https", which does not work as easily. Could anyone, please,
(1) show me what the script should look like for https,
(2) give me some keywords, which I can use to learn about this.
(3) perhaps even explain why it no longer works when the script is changed to use urllib2:
from urllib2 import urlopen
urlopen("http://user:password@dynupdate.no-ip.com/nic/update?hostname=my.host.name")
Thank you!
The user:password part isn't part of the actual URL; it's a shortcut for HTTP basic authentication. The browser's URL parsing library filters it out and sends it as an Authorization header instead. With urllib2, you have to do that yourself:
import base64, urllib2

user, password = 'john_smith', '123456'
request = urllib2.Request('https://dynupdate.no-ip.com/nic/update?hostname=my.host.name')
auth = base64.b64encode(user + ':' + password)
request.add_header('Authorization', 'Basic ' + auth)
urllib2.urlopen(request)
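Alternatively, a sketch using urllib2's standard auth handlers instead of building the header by hand; this assumes the server answers with a 401 challenge, and it works the same way for the https URL:
import urllib2

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'https://dynupdate.no-ip.com/', 'user', 'password')
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
opener.open('https://dynupdate.no-ip.com/nic/update?hostname=my.host.name')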
Using urlopen for URL queries seems like the obvious approach. What I tried is:
import urllib2
query='http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
f = urllib2.urlopen(query)
s = f.read()
f.close()
However, for this specific URL the query fails with HTTP Error 403: Forbidden.
When I enter the same query in my browser, it works.
Also when using http://www.httpquery.com/ to submit the query, it works.
Do you have suggestions how to use Python right to grab the correct response?
Looks like it requires cookies (which you can handle with urllib2 too), but an easier way, if you're doing this, is to use requests:
import requests
session = requests.Session()
r = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')
This is generally a much easier and less-stressful method of retrieving URLs in Python.
requests will automatically store and re-use cookies for you. Creating a session is slightly overkill here, but it is useful when you need to submit data to login pages or re-use cookies across a site.
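For example, a hypothetical login flow (the URL and form fields are made up) might look like:
import requests

s = requests.Session()
# log in once; the session keeps any cookies the server sets
s.post('http://example.com/login', data={'user': 'me', 'password': 'secret'})
# later requests re-use those cookies automatically
r = s.get('http://example.com/members-only')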
Using urllib2, the equivalent is something like:
import urllib2, cookielib
cookies = cookielib.CookieJar()
opener = urllib2.build_opener( urllib2.HTTPCookieProcessor(cookies) )
data = opener.open('url').read()
It appears that the urllib2 default user agent is banned by the host. You can simply supply your own user agent string:
import urllib2
url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
request = urllib2.Request(url, headers={"User-Agent" : "MyUserAgent"})
contents = urllib2.urlopen(request).read()
print contents
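The same idea with requests, if you prefer it (the User-Agent value is arbitrary):
import requests

url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
r = requests.get(url, headers={'User-Agent': 'MyUserAgent'})
print r.text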
Edit: after much fiddling, it seems urlgrabber succeeds where urllib2 fails, even when telling it to close the connection after each file. It seems like there might be something wrong with the way urllib2 handles proxies, or with the way I use it!
Anyway, here is the simplest possible code to retrieve files in a loop:
import urlgrabber
for i in range(1, 100):
    url = "http://www.iana.org/domains/example/"
    urlgrabber.urlgrab(url, proxies={'http': 'http://<user>:<password>@<proxy url>:<proxy port>'}, keepalive=1, close_connection=1, throttle=0)
Hello all!
I am trying to write a very simple python script to grab a bunch of files via urllib2.
This script needs to work through the proxy at work (my issue does not exist if grabbing files on the intranet, i.e. without the proxy).
Said script fails after a couple of requests with "HTTPError: HTTP Error 401: basic auth failed". Any idea why that might be? It seems the proxy is rejecting my authentication, but why? The first couple of urlopen requests went through correctly!
Edit: adding a 10-second sleep between requests, to avoid any throttling the proxy might be doing, did not change the results.
Here is a simplified version of my script (with identifying information stripped, obviously):
import urllib2
passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
authinfo = urllib2.ProxyBasicAuthHandler(passmgr)
proxy_support = urllib2.ProxyHandler({"http" : "<proxy http address>"})
opener = urllib2.build_opener(authinfo, proxy_support)
urllib2.install_opener(opener)
for i in range(100):
    with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
        f = urllib2.urlopen("http://www.iana.org/domains/example/")
        outfile.write(f.read())
Thanks in advance!
You can minimize the number of connections by using the keepalive handler from the urlgrabber module.
import urllib2
from keepalive import HTTPHandler
keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(keepalive_handler)
urllib2.install_opener(opener)
fo = urllib2.urlopen('http://www.python.org')
I am unsure that this will work correctly with your Proxy setup.
You may have to hack the keepalive module.
The proxy might be throttling your requests. I guess it thinks you look like a bot.
You could add a timeout, and see if that gets you through.
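For example, a per-request timeout (30 seconds here, an arbitrary value) looks like:
import urllib2

# raises urllib2.URLError if the proxy or server does not answer within 30 seconds
f = urllib2.urlopen("http://www.iana.org/domains/example/", timeout=30)
print f.read()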