Python - try statement breaking urllib2.urlopen

I'm writing a program in Python that has to make an HTTP request while being forced onto a direct connection in order to avoid a proxy. Here is the code I use, which successfully manages this:
print "INFO: Testing API..."
proxy = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
req = urllib2.urlopen('http://maps.googleapis.com/maps/api/geocode/json?address=blahblah&sensor=true')
returneddata = json.loads(req.read())
I then want to add a try statement around 'req', in order to handle a situation where the user is not connected to the internet, which I have tried like so:
try:
    req = urllib2.urlopen('http://maps.googleapis.com/maps/api/geocode/json?address=blahblah&sensor=true')
except urllib2.URLError:
    print "Unable to connect etc etc"
The trouble is that with the try statement in place it always throws the exception, even though the address is perfectly accessible and the code works without it.
Any ideas? Cheers.
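For reference, here is a minimal combined sketch of the two snippets above (same proxy bypass, same URL; the only addition is the try/except wrapped around the request and the JSON parsing):

import json
import urllib2

print "INFO: Testing API..."
proxy = urllib2.ProxyHandler({})   # empty dict: bypass any configured proxy
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

try:
    req = urllib2.urlopen('http://maps.googleapis.com/maps/api/geocode/json?address=blahblah&sensor=true')
    returneddata = json.loads(req.read())
except urllib2.URLError:
    print "Unable to connect etc etc"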

Related

Python 3 Read data from URL [duplicate]

I have this simple minimal 'working' example below that opens a connection to Google every two seconds. When I run this script while I have a working internet connection, I get the Success message; when I then disconnect, I get the Fail message; and when I reconnect again, I get the Success message again. So far, so good.
However, when I start the script while the internet is disconnected, I get the Fail messages, and when I connect later, I never get the Success message. I keep getting the error:
urlopen error [Errno -2] Name or service not known
What is going on?
import urllib2, time

while True:
    try:
        print('Trying')
        response = urllib2.urlopen('http://www.google.com')
        print('Success')
        time.sleep(2)
    except Exception, e:
        print('Fail ' + str(e))
        time.sleep(2)
This happens because the DNS name "www.google.com" cannot be resolved. If there is no internet connection the DNS server is probably not reachable to resolve this entry.
It seems I misread your question the first time. The behaviour you describe is, on Linux, a peculiarity of glibc. It only reads "/etc/resolv.conf" once, when loading. glibc can be forced to re-read "/etc/resolv.conf" via the res_init() function.
One solution would be to wrap the res_init() function and call it before calling getaddrinfo() (which is indirectly used by urllib2.urlopen()).
You might try the following (still assuming you're using Linux):
import ctypes
import urllib2

libc = ctypes.cdll.LoadLibrary('libc.so.6')
res_init = libc.__res_init

# ...

res_init()
response = urllib2.urlopen('http://www.google.com')
This might of course be optimized by waiting until "/etc/resolv.conf" is modified before calling res_init().
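A rough sketch of that optimization (the helper name and the mtime check are mine, not part of the answer): only call res_init() when /etc/resolv.conf has actually changed.

import ctypes
import os

libc = ctypes.cdll.LoadLibrary('libc.so.6')
res_init = libc.__res_init

_resolv_mtime = None

def refresh_resolver(path='/etc/resolv.conf'):
    # Re-read the resolver configuration only when the file was modified.
    global _resolv_mtime
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return  # file missing; nothing to refresh
    if mtime != _resolv_mtime:
        _resolv_mtime = mtime
        res_init()

Calling refresh_resolver() right before each urlopen() keeps the overhead to a single stat() in the common case.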
Another solution would be to install e.g. nscd (name service cache daemon).
For me, it was a proxy problem.
Running the following before importing urllib.request helped:
import os
os.environ['http_proxy'] = ''

import urllib.request
response = urllib.request.urlopen('http://www.google.com')
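An alternative that avoids touching the environment is to install an empty ProxyHandler, the urllib.request counterpart of the urllib2 snippet at the top of this page (a sketch, assuming Python 3):

import urllib.request

# An empty mapping tells urllib.request not to use any proxy at all.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
urllib.request.install_opener(opener)

response = urllib.request.urlopen('http://www.google.com')
print(response.getcode())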

urllib2 causing program to stop after 20 attempts

So I am trying to write a program which needs to check for a urllib2 111 error (connection refused).
I do this by using:
def Refresher():
    req = urllib2.Request('http://example.com/myfile.txt')
    try:
        urlopen = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        if e.code == 404 or e.code == 111:
            error = True
At the end of Refresher I schedule the next run with the following, because Refresher also edits a Tk window:
root.after(75, Refresher)
My problem is that when I reboot the server (and thereby cause a 111 error) this works fine for the first 20 times, but after the 20th time through, my function appears to stop running, with no error being thrown in the console. Then, when the server comes back up, my function starts running again.
How do I keep my program refreshing, given that the function does other things as well as checking whether the server is down?
Thanks in advance.
Use requests instead of urllib2; it's safer to use and easier to understand. If the error persists, then the problem will be in another part of the server configuration.
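A hedged sketch of that suggestion, using the URL from the question; the timeout and the exception names are the standard requests pattern, not code from this thread:

import requests

URL = 'http://example.com/myfile.txt'

def check_once():
    # Returns True when the file is unreachable (connection refused,
    # timeout, 404, ...), False when it was fetched successfully.
    try:
        response = requests.get(URL, timeout=5)
        response.raise_for_status()
        return False
    except requests.exceptions.RequestException:
        return True

print(check_once())

The Tk scheduling from the question stays the same: root.after(75, Refresher).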

Python: Access ftp like browsers do, with proxy

I want to access an FTP server, anonymously, just for download. My company has a proxy, and the FTP port (21) is blocked. I can't access the FTP server directly.
What I want to do is write some code that behaves exactly the same way browsers do. The idea is that if I can download the files with my browser, there is a way to do it with code.
My code works when I try to access a web site outside the company, but it is still not working for FTP servers.
proxy = urllib2.ProxyHandler({'https': 'proxy.mycompanhy.com:8080',
                              'http': 'proxy.mycompanhy.com:80',
                              'ftp': 'proxy.mycompanhy.com:21'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)

urlAddress = 'https://python.org'
# urlAddress = 'ftp://ftp1.cptec.inpe.br'

conn = urllib2.urlopen(urlAddress)
return_str = conn.read()
print return_str
When I try to access python.org, it works fine. If I remove the install_opener part, it no longer works, which proves that the proxy is required.
When I use the FTP URL, it blocks (or times out if I choose to use those parameters).
I understand that ftp and http are two very different protocols.
What I don't understand is the mechanism that browsers use to access these ftp servers.
I mean, I don't know if there is a layer on the server side that interfaces between HTTP and FTP, retrieving HTML; or if the browser, in some other manner, accesses the FTP server and builds the page.
There might also be some confusion between the FTP domain (or the URL) and the connection mode. It seems to me that when urllib2 reads ftp://... it automatically uses port 21.
I found a solution using wget. This package handles proxies, but the documentation was very obscure. You need to set up an environment variable with the proxy name.
import wget
import os
import errno

# setup proxy
os.environ["ftp_proxy"] = "proxy.mycompanhy.com"
os.environ["http_proxy"] = "proxy.mycompanhy.com"
os.environ["https_proxy"] = "proxy.mycompanhy.com"

src = "http://domain.gov/data/fileToDownload.txt"
out = "C:\\outFolder\\outFileName.txt"  # out is optional

# create the output folder if it doesn't exist
outFolder, _ = os.path.split(out)
try:
    os.makedirs(outFolder)
except OSError as exc:  # Python >2.5
    if exc.errno == errno.EEXIST and os.path.isdir(outFolder):
        pass
    else:
        raise

# download
filename = wget.download(src, out)
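On the mechanism asked about above: browsers typically hand an ftp:// URL to the HTTP proxy as an ordinary HTTP request, and the proxy performs the FTP transfer and returns the content. Below is a hedged urllib2 sketch of the same idea; whether this particular company proxy accepts ftp:// URLs on port 8080 is an assumption, not something confirmed in the thread:

import urllib2

# Send ftp:// requests to the HTTP proxy port instead of the blocked port 21.
proxy = urllib2.ProxyHandler({'ftp': 'http://proxy.mycompanhy.com:8080',
                              'http': 'http://proxy.mycompanhy.com:80',
                              'https': 'http://proxy.mycompanhy.com:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

print urllib2.urlopen('ftp://ftp1.cptec.inpe.br').read()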

Skip Connection Interruptions (Site & BeautifulSoup)

I'm currently doing this with my script:
It gets the body (from the source code) and searches for a string; it keeps doing this until the string is found (i.e. if the site updates).
However, if the connection is lost, the script stops.
My 'connection' code looks something like this (it keeps repeating in a while loop every 20 seconds):
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = ('url')
openUrl = opener.open(url).read()
soup = BeautifulSoup(openUrl)
I've used urllib2 & BeautifulSoup.
Can anyone tell me how I could tell the script to "freeze" if the connection is lost, check whether the internet connection is alive, and then continue based on the answer? (So, the point is to check whether the script CAN connect, not whether the site is up. If it does the checking this way, the script will stop with a bunch of errors.)
Thank you!
Found the solution!
So, I need to check the connection every LOOP, before actually doing stuff.
So I created this function:
def check_internet(self):
    try:
        header = {"pragma": "no-cache"}
        req = urllib2.Request("http://www.google.ro", headers=header)
        response = urllib2.urlopen(req, timeout=2)
        return True
    except urllib2.URLError as err:
        return False
And it works, tested it with my connection down & up!
For the other newbies wondering:
while True:
    conn = check_internet('Site or just Google, just checking for connection.')
    try:
        if conn is True:
            pass  # code
        else:
            # need to make it wait and re-do the while.
            time.sleep(30)
    except urllib2.URLError as err:
        # need to wait
        time.sleep(20)
Works perfectly, the script has been running for about 10 hours now and it handles errors perfectly! It also works with my connection off and shows proper messages.
Open to suggestions for optimization!
Rather than "freeze" the script, I would have the script continue to run only if the connection is alive. If it's alive, run your code. If it's not alive, either attempt to reconnect, or halt execution.
while keepRunning:
    if connectionIsAlive():
        run_your_code()
    else:
        reconnect_maybe()
One way to check whether the connection is alive is described here: Checking if a website is up via Python
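A possible implementation of connectionIsAlive() along those lines (the probe URL and timeout are arbitrary choices, not part of this answer):

import urllib2

def connectionIsAlive(url='http://www.google.com', timeout=3):
    # Any URLError (DNS failure, refused connection, timeout during
    # connect) counts as "no usable connection".
    try:
        urllib2.urlopen(url, timeout=timeout)
        return True
    except urllib2.URLError:
        return False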
If your program "stops with a bunch of errors" then that is likely because you're not properly handling the situation where you're unable to connect to the site (for various reasons such as you not having internet, their website is down, etc.).
You need to use a try/except block to make sure that you catch any errors that occur because you were unable to open a live connection.
try:
    openUrl = opener.open(url).read()
except urllib2.URLError:
    pass  # something went wrong, how to respond?
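One possible way to fill in that except block, sketched here with the 20-second interval mentioned in the question (the placeholder URL is mine):

import time
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = 'http://example.com'  # placeholder

while True:
    try:
        openUrl = opener.open(url).read()
        # ... hand openUrl to BeautifulSoup and search for the string ...
    except urllib2.URLError:
        # Connection problem: wait and try again instead of crashing.
        time.sleep(20)
        continue
    time.sleep(20)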

Python problems with FancyURLopener, 401, and "Connection: close"

I'm new to Python, so forgive me if I am missing something obvious.
I am using urllib.FancyURLopener to retrieve a web document. It works fine when authentication is disabled on the web server, but fails when authentication is enabled.
My guess is that I need to subclass urllib.FancyURLopener to override the get_user_passwd() and/or prompt_user_passwd() methods. So I did:
class my_opener (urllib.FancyURLopener):
    # Redefine
    def get_user_passwd(self, host, realm, clear_cache=0):
        print "get_user_passwd() called; host %s, realm %s" % (host, realm)
        return ('name', 'password')
Then I attempt to open the page:
try:
    opener = my_opener()
    f = opener.open('http://1.2.3.4/whatever.html')
    content = f.read()
    print "Got it: ", content
except IOError:
    print "Failed!"
I expect FancyURLopener to handle the 401, call my get_user_passwd(), and retry the request.
It does not; I get the IOError exception when I call "f = opener.open()".
Wireshark tells me that the request is sent, and that the server is sending a "401 Unauthorized" response with two headers of interest:
WWW-Authenticate: BASIC
Connection: close
The connection is then closed, I catch my exception, and it's all over.
It fails the same way even if I retry the "f = opener.open()" after IOError.
I have verified that my my_opener() class is working by overriding the http_error_401() method with a simple "print 'Got 401 error'". I have also tried overriding the prompt_user_passwd() method, but it is never called either.
I see no way to proactively specify the user name and password.
So how do I get urllib to retry the request?
Thanks.
I just tried your code on my webserver (nginx) and it works as expected:
1. GET from the urllib client
2. HTTP/1.1 401 Unauthorized from the server, with headers:
   Connection: close
   WWW-Authenticate: Basic realm="Restricted"
3. Client tries again with an Authorization header:
   Authorization: Basic <Base64encoded credentials>
4. Server responds with 200 OK + content
So I guess your code is right (I tried it with Python 2.7.1) and maybe the webserver you are trying to access is not working as expected. Here is the code, tested against the free HTTP basic auth test site browserspy.dk (they seem to be using Apache; the code works as expected):
import urllib

class my_opener (urllib.FancyURLopener):
    # Redefine
    def get_user_passwd(self, host, realm, clear_cache=0):
        print "get_user_passwd() called; host %s, realm %s" % (host, realm)
        return ('test', 'test')

try:
    opener = my_opener()
    f = opener.open('http://browserspy.dk/password-ok.php')
    content = f.read()
    print "Got it: ", content
except IOError:
    print "Failed!"
