Edit: after much fiddling, it seems urlgrabber succeeds where urllib2 fails, even when telling it to close the connection after each file. There might be something wrong with the way urllib2 handles proxies, or with the way I use it!
Anyway, here is the simplest possible code to retrieve files in a loop:
import urlgrabber

for i in range(1, 100):
    url = "http://www.iana.org/domains/example/"
    urlgrabber.urlgrab(url, proxies={'http': 'http://<user>:<password>@<proxy url>:<proxy port>'},
                       keepalive=1, close_connection=1, throttle=0)
Hello all!
I am trying to write a very simple python script to grab a bunch of files via urllib2.
This script needs to work through the proxy at work (my issue does not exist if grabbing files on the intranet, i.e. without the proxy).
Said script fails after a couple of requests with "HTTPError: HTTP Error 401: basic auth failed". Any idea why that might be? It seems the proxy is rejecting my authentication, but why? The first couple of urlopen requests went through correctly!
Edit: adding a 10-second sleep between requests, to avoid any throttling the proxy might be doing, did not change the results.
Here is a simplified version of my script (with identifying information stripped, obviously):
import urllib2

passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
authinfo = urllib2.ProxyBasicAuthHandler(passmgr)

proxy_support = urllib2.ProxyHandler({"http": "<proxy http address>"})

opener = urllib2.build_opener(authinfo, proxy_support)
urllib2.install_opener(opener)

for i in range(100):
    with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
        f = urllib2.urlopen("http://www.iana.org/domains/example/")
        outfile.write(f.read())
Thanks in advance!
You can minimize the number of connections by using the keepalive handler from the urlgrabber module.
import urllib2
from keepalive import HTTPHandler
keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(keepalive_handler)
urllib2.install_opener(opener)
fo = urllib2.urlopen('http://www.python.org')
I am unsure whether this will work correctly with your proxy setup.
You may have to hack the keepalive module.
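If you do try it, here is a minimal sketch of wiring the keepalive handler together with the proxy handlers from your question; this is untested against an authenticating proxy, and the placeholders are the same ones you used:

import urllib2
from keepalive import HTTPHandler  # ships with the urlgrabber package

passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')

# Combine keep-alive handling with the proxy and proxy-auth handlers
opener = urllib2.build_opener(
    HTTPHandler(),
    urllib2.ProxyHandler({'http': '<proxy http address>'}),
    urllib2.ProxyBasicAuthHandler(passmgr),
)
urllib2.install_opener(opener)

f = urllib2.urlopen('http://www.iana.org/domains/example/')
print f.read()[:200]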
The proxy might be throttling your requests. I guess it thinks you look like a bot.
You could add a timeout, and see if that gets you through.
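A minimal sketch of what that could look like, adding a per-request socket timeout (and keeping a pause between requests, which the asker already tried):

import time
import urllib2

for i in range(100):
    # per-request socket timeout so a stalled request fails instead of hanging forever
    f = urllib2.urlopen("http://www.iana.org/domains/example/", timeout=30)
    data = f.read()
    # pause between requests in case the proxy rate-limits bursts
    time.sleep(10)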
Related
Requests is not honoring the proxies flag.
There is something I am missing about making a request over a proxy with the Python requests library.
If I enable the OS system proxy it works, but if I make the request with just the requests module's proxies setting, the remote machine does not see the proxy set in requests and sees my real IP instead; it is as if no proxy was set.
The example below shows this effect. At the time of this post the proxy below is alive, but any working proxy should replicate the effect.
import requests
proxy = {
    'http:': 'https://143.208.200.26:7878',
    'https:': 'http://143.208.200.26:7878'
}
data = requests.get(url='http://ip-api.com/json', proxies=proxy).json()
print('Ip: %s\nCity: %s\nCountry: %s' % (data['query'], data['city'], data['country']))
I also tried changing the proxy_dict format:
proxy = {
    'http:': '143.208.200.26:7878',
    'https:': '143.208.200.26:7878'
}
But it still has no effect.
I am using:
-Windows 10
-python 3.9.6
-urllib 1.25.8
Many thanks in advance for any response to help sort this out.
OK, it is working now!
The credit for solving this goes to Olvin Rogh. Thanks Olvin for your help and for pointing out my problem: I was adding a colon ":" inside the keys.
This code is working now.
import json
import requests

PROXY = {'https': 'https://143.208.200.26:7878',
         'http': 'http://143.208.200.26:7878'}

with requests.Session() as session:
    session.proxies = PROXY
    r = session.get('http://ip-api.com/json')
    print(json.dumps(r.json(), indent=2))
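For completeness, the corrected keys (no trailing colon) work just as well when passed per request instead of on a Session; this is a small sketch of the original snippet with only the keys fixed:

import requests

PROXY = {'https': 'https://143.208.200.26:7878',
         'http': 'http://143.208.200.26:7878'}

data = requests.get(url='http://ip-api.com/json', proxies=PROXY).json()
print('Ip: %s\nCity: %s\nCountry: %s' % (data['query'], data['city'], data['country']))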
I am trying to use urllib2 through a proxy; however, after trying just about every variation of passing my authentication details using urllib2, I either get a request that hangs forever and returns nothing, or I get 407 errors. I can connect to the web fine using my browser, which connects to a proxy PAC file and redirects accordingly; however, I can't seem to do anything via the command line: curl, wget, urllib2, etc. all fail, even if I use the proxies that the PAC file redirects to. I tried setting my proxy to each of the proxies from the PAC file using urllib2, none of which work.
My current script looks like this:
import urllib2 as url
proxy = url.ProxyHandler({'http': 'username:password@my.proxy:8080'})
auth = url.HTTPBasicAuthHandler()
opener = url.build_opener(proxy, auth, url.HTTPHandler)
url.install_opener(opener)
url.urlopen("http://www.google.com/")
which throws HTTP Error 407: Proxy Authentication Required. I also tried:
import urllib2 as url
handlePass = url.HTTPPasswordMgrWithDefaultRealm()
handlePass.add_password(None, "http://my.proxy:8080", "username", "password")
auth_handler = url.HTTPBasicAuthHandler(handlePass)
opener = url.build_opener(auth_handler)
url.install_opener(opener)
url.urlopen("http://www.google.com")
which hangs, just like curl or wget do when they time out.
What do I need to do to diagnose the problem? How is it possible that I can connect via my browser but not from the command line on the same computer, using what would appear to be the same proxy and credentials?
Might it be something to do with the router? If so, how can it distinguish between browser HTTP requests and command-line HTTP requests?
Frustrations like this are what drove me to use Requests. If you're doing significant amounts of work with urllib2, you really ought to check it out. For example, to do what you wish to do using Requests, you could write:
import requests
from requests.auth import HTTPProxyAuth
proxy = {'http': 'http://my.proxy:8080'}
auth = HTTPProxyAuth('username', 'password')
r = requests.get('http://www.google.com/', proxies=proxy, auth=auth)
print r.text
Or you could wrap it in a Session object and every request will automatically use the proxy information (plus it will store & handle cookies automatically!):
s = requests.Session()
s.proxies = proxy
s.auth = auth
r = s.get('http://www.google.com/')
print r.text
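If you'd rather not create a separate auth object, Requests also accepts the credentials embedded in the proxy URL itself; here is a small sketch using the same placeholder host and credentials:

import requests

# user:password in the proxy URL replaces the HTTPProxyAuth object
proxy = {'http': 'http://username:password@my.proxy:8080'}

r = requests.get('http://www.google.com/', proxies=proxy)
print r.text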
This is a newbie problem with Python; advice is much appreciated.
no-ip.com provides an easy way to update a computer's changing IP address: simply open the URL
http://user:password@dynupdate.no-ip.com/nic/update?hostname=my.host.name
...both http and https work when entered in firefox. I tried to implement that in a script residing in "/etc/NetworkManager/dispatcher.d" to be used by Network Manager on a recent version of Ubuntu.
What works is the python script:
from urllib import urlopen
urlopen("http://user:password@dynupdate.no-ip.com/nic/update?hostname=my.host.name")
What I want to have is the same with "https", which does not work as easily. Could anyone, please,
(1) show me what the script should look like for https,
(2) give me some keywords I can use to learn about this,
(3) perhaps even explain why it no longer works when the script is changed to use "urllib2":
from urllib2 import urlopen
urlopen("http://user:password@dynupdate.no-ip.com/nic/update?hostname=my.host.name")
Thank you!
The user:password part isn't in the actual URL, but a shortcut for HTTP authentication. The browser's URL parsing library will filter it out. In urllib2, you want to do something like:
import base64, urllib2

user, password = 'john_smith', '123456'

request = urllib2.Request('https://dynupdate.no-ip.com/nic/update?hostname=my.host.name')
auth = base64.b64encode(user + ':' + password)
request.add_header('Authorization', 'Basic ' + auth)
urllib2.urlopen(request)
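Alternatively, you can let urllib2 build the header for you with a password manager; note that, unlike the hand-built header above, these handlers only send the credentials after the server answers with a 401 challenge. A sketch with the same made-up credentials:

import urllib2

user, password = 'john_smith', '123456'
url = 'https://dynupdate.no-ip.com/nic/update?hostname=my.host.name'

# the password manager matches the credentials against the URL on a 401 response
passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, url, user, password)
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(passmgr))
print opener.open(url).read()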
My system is not behind any proxy.
import urllib

params = urllib.urlencode({'search': "August Rush"})
f = urllib.urlopen("http://www.thepiratebay.org/search/query", params)
This goes into an infinite loop (or just hangs). I can obviously get rid of this, use FancyURLopener and create the query myself rather than passing it parameters. But I think the way I'm doing it now is a better and cleaner approach.
Edit: This was more of a networking problem: my Ubuntu workstation was configured with a different proxy. I had to make certain changes and it worked. Thank you!
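If anyone else hits this, a quick way to see which proxy (if any) urllib is silently picking up from the environment or system settings is:

import urllib

# shows the proxies urlopen will use (from http_proxy, https_proxy, etc.)
print urllib.getproxies()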
The posted code works fine for me, with Python 2.7.2 on Windows.
Have you tried using an HTTP debugging tool, like Fiddler2, to see the actual conversation going on between your program and the site?
If you run Fiddler2 on port 8888 on localhost, you can do this to see the request and response:
import urllib
proxies = {"http": "http://localhost:8888"}
params = urllib.urlencode({'search':"August Rush"})
f = urllib.urlopen("http://www.thepiratebay.org/search/query", params, proxies)
print len(f.read())
This works for me:
import urllib
params = urllib.urlencode({'q': "August Rush", 'page': '0', 'orderby': '99'})
f = urllib.urlopen("http://www.thepiratebay.org/s/", params)
with open('text.html', 'w') as ff:
    ff.write('\n'.join(f.readlines()))
I opened http://www.thepiratebay.org with Google Chrome with the network inspector enabled. I put "August Rush" into the search field and pressed 'Search'. Then I analyzed the headers sent and wrote the code above.
For now I am doing this: (Python3, urllib)
url = 'someurl'
headers = (('HOST', 'somehost'),
           ('Connection', 'keep-alive'),
           ('Accept-Encoding', 'gzip,deflate'))

opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
for h in headers:
    opener.addheaders.append(h)

data = 'some login data'  # username, pw etc.
opener.open('somesite/login.php', data)
res = opener.open(someurl)
data = res.read()
... some stuff here...
res1 = opener.open(someurl2)
data = res1.read()
etc.
What is happening is this:
I keep getting gzipped responses from the server and I stay logged in (I am fetching some content which is not available if I am not logged in), but I think the connection is being dropped between every opener.open request.
I think that because connecting is very slow and it seems like there is a new connection every time. Two questions:
a) How do I test whether the connection is in fact staying alive or dying?
b) How do I make it stay alive between requests for other URLs?
Take care :)
This will be a very delayed answer, but:
You should look at urllib3. It is for Python 2.x, but you'll get the idea when you read its README.
And yes, urllib by default doesn't keep connections alive. I'm now implementing urllib3 for Python 3 so it stays in my toolbag :)
In case you didn't know yet, python-requests offers a keep-alive feature, thanks to urllib3.
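To make that concrete, here is a rough sketch of the same login-then-fetch flow with requests, which keeps the underlying connection alive through urllib3's connection pooling; the site, URLs and form fields are placeholders taken from the question:

import requests

session = requests.Session()
session.headers.update({'Accept-Encoding': 'gzip,deflate'})

# log in once; the session stores the cookies and reuses the connection
session.post('http://somesite/login.php', data={'username': 'user', 'password': 'pw'})

res = session.get('http://somesite/someurl')
data = res.text      # gzip is decoded transparently

res1 = session.get('http://somesite/someurl2')
data = res1.text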