Feedparser (and urllib2) issue: Connection timed out - python

Starting out with the urllib2 and feedparser libraries in Python, I'm getting the following error most of the time whenever I try to connect and fetch content from a particular URL:
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>
Minimal reproducible examples are pasted below: a basic one that calls feedparser.parse directly, and an advanced one that uses the urllib2 library first to fetch the XML content.
# test-1
import feedparser
f = feedparser.parse('http://www.zurnal24.si/index.php?ctl=show_rss')
title = f['channel']['title']
print title
# test-2
import urllib2
import feedparser
url = 'http://www.zurnal24.si/index.php?ctl=show_rss'
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
request = opener.open(url)
response = request.read()
feed = feedparser.parse(response)
title = feed['channel']['title']
print title
When I try different URL addresses (e.g., http://www.delo.si/rss/), everything works fine. Please note that all URLs lead to non-English (i.e., Slovenian) RSS feeds.
I run my experiments both from a local and a remote machine (via ssh). The reported error occurs more frequently on the remote machine, although it is unpredictable even on the local host.
Any suggestions would be greatly appreciated.

How often does the timeout occur? If it's not frequent, you could wait after each timeout and then retry the request:
import urllib2
import feedparser
import time
import sys

url = 'http://www.zurnal24.si/index.php?ctl=show_rss'

opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

# Try to connect a few times, waiting longer after each consecutive failure
MAX_ATTEMPTS = 8
for attempt in range(MAX_ATTEMPTS):
    try:
        request = opener.open(url)
        break
    except urllib2.URLError, e:
        sleep_secs = attempt ** 2
        print >> sys.stderr, 'ERROR: %s.\nRetrying in %s seconds...' % (e, sleep_secs)
        time.sleep(sleep_secs)

response = request.read()
feed = feedparser.parse(response)
title = feed['channel']['title']
print title

As the error denotes, it is a connection problem. It may be an issue with your internet connection or with their server's connection/bandwidth.
A simple workaround is to do your feed parsing in a while loop, keeping a counter with a maximum number of retries, for example:
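A minimal sketch of that while-loop idea (the retry limit and delay below are arbitrary values you would tune for your setup; feedparser's bozo flag is used here to detect a failed fetch instead of catching an exception):
import time
import feedparser

MAX_RETRIES = 5          # arbitrary limit for this sketch
RETRY_DELAY_SECS = 5     # arbitrary delay for this sketch
url = 'http://www.zurnal24.si/index.php?ctl=show_rss'

feed = None
retries = 0
while retries < MAX_RETRIES:
    feed = feedparser.parse(url)
    # feedparser records fetch/parse failures in feed.bozo instead of raising
    if not feed.bozo:
        break
    retries += 1
    time.sleep(RETRY_DELAY_SECS)

if feed is not None and not feed.bozo:
    print feed['channel']['title']
else:
    print 'Giving up after %d retries' % MAX_RETRIES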

Related

How to Handle an Expired SSL/TLS Certificate with Python Requests?

What's the correct way to handle an expired certificate with Python Requests?
I want the code to differentiate between a "connection error" and a connection with an "expired TLS certificate".
import requests

def conn(URL):
    try:
        response = requests.get(URL)
    except requests.exceptions.RequestException:
        print(URL, "Cannot connect")
        return False
    print(URL, "connection successful")
    return True

# valid cert
conn("https://www.google.com")
# nonexistent domain
conn("https://unexistent-domain-example.com")
# expired cert
conn("https://expired-rsa-dv.ssl.com")
You can look at the exception details and see if 'CERTIFICATE_VERIFY_FAILED' is there.
import requests

def conn(URL):
    try:
        response = requests.get(URL)
    except requests.exceptions.RequestException as e:
        if 'CERTIFICATE_VERIFY_FAILED' in str(e):
            print('CERTIFICATE_VERIFY_FAILED')
        print(URL, f"Cannot connect: {str(e)}")
        print('--------------------------')
        return False
    print(URL, "connection successful")
    return True

# valid cert
conn("https://www.google.com")
# nonexistent domain
conn("https://unexistent-domain-example.com")
# expired cert
conn("https://expired-rsa-dv.ssl.com")
requests is a great tool for making HTTP requests, but your task is to check the server certificate's expiration date, which requires using a lower-level API. The algorithm is to retrieve the server certificate, parse it, and check its end date.
To get the certificate from the server there's the function ssl.get_server_certificate(). It will return the certificate in PEM encoding.
There are plenty of ways to parse a PEM-encoded certificate (check this question); I'd stick with the "undocumented" one.
To parse the time from a string you can use ssl.cert_time_to_seconds().
To parse the URL you can use urllib.parse.urlparse(). To get the current timestamp you can use time.time().
Code:
import ssl
from time import time
from urllib.parse import urlparse
from pathlib import Path

def conn(url):
    parsed_url = urlparse(url)
    cert = ssl.get_server_certificate((parsed_url.hostname, parsed_url.port or 443))
    # save cert to temporary file (filename required for _test_decode_cert())
    temp_filename = Path(__file__).parent / "temp.crt"
    with open(temp_filename, "w") as f:
        f.write(cert)
    try:
        parsed_cert = ssl._ssl._test_decode_cert(temp_filename)
    except Exception:
        return
    finally:  # delete temporary file
        temp_filename.unlink()
    return ssl.cert_time_to_seconds(parsed_cert["notAfter"]) > time()
It'll throw an exception on any connection error; you can handle that with try..except around the get_server_certificate() call (if needed), for example:
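A minimal sketch of such handling (get_cert_safely() is a hypothetical helper, and the exception types caught here are an assumption, not necessarily exhaustive):
import socket
import ssl

def get_cert_safely(hostname, port=443):
    # hypothetical helper: returns the PEM certificate, or None on connection problems
    try:
        return ssl.get_server_certificate((hostname, port))
    except (socket.gaierror, socket.timeout, ConnectionError, ssl.SSLError) as e:
        print("Connection problem for %s: %s" % (hostname, e))
        return None

print(get_cert_safely("unexistent-domain-example.com") is not None)  # expected: False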

Retrieve the source code of a url using pycurl and a port number?

Is there any way to retrieve a URL's source code and store it in a string, provided that a particular proxy port is defined in the pycurl module so that it works on a proxied network?
Platform: Ubuntu or any other Linux distro.
Use this code to get the source of a URL:
from StringIO import StringIO
import pycurl
url = 'http://www.google.com/'
storage = StringIO()
c = pycurl.Curl()
c.setopt(c.URL, url)
c.setopt(c.WRITEFUNCTION, storage.write)
c.perform()
c.close()
content = storage.getvalue()
print content
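The answer above doesn't actually set a proxy; libcurl exposes PROXY and PROXYPORT options for that. A minimal sketch, where the proxy host and port are placeholders you would replace with your own values:
from StringIO import StringIO
import pycurl

url = 'http://www.google.com/'
storage = StringIO()
c = pycurl.Curl()
c.setopt(c.URL, url)
# placeholder proxy settings -- replace with your proxy's host and port
c.setopt(c.PROXY, 'proxy.example.com')
c.setopt(c.PROXYPORT, 3128)
c.setopt(c.WRITEFUNCTION, storage.write)
c.perform()
c.close()
content = storage.getvalue()
print content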

Skip Connection Interruptions (Site & BeautifulSoup)

I'm currently doing this with my script:
Get the body (from the source code) and search for a string; it keeps doing this until the string is found (i.e., when the site updates).
However, if the connection is lost, the script stops.
My 'connection' code looks something like this (it repeats in a while loop every 20 seconds):
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = ('url')
openUrl = opener.open(url).read()
soup = BeautifulSoup(openUrl)
I've used urllib2 & BeautifulSoup.
Can anyone tell me how I could tell the script to "freeze" if the connection is lost, check whether the internet connection is alive, and then continue based on the answer? (So: check whether the script CAN connect, not whether the site is up. If it checks it that way, the script will stop with a bunch of errors.)
Thank you!
Found the solution!
So, I need to check the connection every LOOP, before actually doing stuff.
So I created this function:
def check_internet(url):
    try:
        header = {"pragma": "no-cache"}
        req = urllib2.Request(url, headers=header)
        response = urllib2.urlopen(req, timeout=2)
        return True
    except urllib2.URLError as err:
        return False
And it works, tested it with my connection down & up!
For the other newbies wondering:
while True:
    conn = check_internet("http://www.google.ro")  # the site, or just Google; just checking for a connection
    try:
        if conn is True:
            pass  # your code here
        else:
            # need to make it wait and re-do the while loop
            time.sleep(30)
    except urllib2.URLError as err:
        # need to wait
        time.sleep(20)
Works perfectly: the script has been running for about 10 hours now and it handles errors properly! It also works with my connection off and shows the proper messages.
Open to suggestions for optimization!
Rather than "freeze" the script, I would have the script continue to run only if the connection is alive. If it's alive, run your code. If it's not alive, either attempt to reconnect, or halt execution.
while keepRunning:
    if connectionIsAlive():
        run_your_code()
    else:
        reconnect_maybe()
One way to check whether the connection is alive is described here: Checking if a website is up via Python. For example:
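A minimal sketch of such a connectionIsAlive() check, assuming that reaching a known, reliable host is an acceptable stand-in for "the connection is alive":
import urllib2

def connectionIsAlive():
    # try to reach a well-known host with a short timeout
    try:
        urllib2.urlopen("http://www.google.com", timeout=2)
        return True
    except urllib2.URLError:
        return False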
If your program "stops with a bunch of errors", that is likely because you're not properly handling the situation where you're unable to connect to the site (for various reasons, such as not having internet access or their website being down).
You need to use a try/except block to make sure that you catch any errors that occur because you were unable to open a live connection.
try:
    openUrl = opener.open(url).read()
except urllib2.URLError:
    pass  # something went wrong, how to respond?
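One possible way to respond (just a sketch, reusing opener and url from the question's code; the retry limit and delay are arbitrary) is to wait and retry a limited number of times before giving up:
import time
import urllib2

MAX_RETRIES = 3  # arbitrary limit for this sketch
for attempt in range(MAX_RETRIES):
    try:
        openUrl = opener.open(url).read()
        break
    except urllib2.URLError:
        # connection problem: wait a bit, then try again
        time.sleep(20)
else:
    print "Still no connection after %d attempts, giving up." % MAX_RETRIES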

Python - try statement breaking urllib2.urlopen

I'm writing a program in Python that has to make an HTTP request while being forced onto a direct connection in order to avoid a proxy. Here is the code I use, which successfully manages this:
print "INFO: Testing API..."
proxy = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
req = urllib2.urlopen('http://maps.googleapis.com/maps/api/geocode/json?address=blahblah&sensor=true')
returneddata = json.loads(req.read())
I then want to add a try statement around 'req', in order to handle a situation where the user is not connected to the internet, which I have tried like so:
try:
    req = urllib2.urlopen('http://maps.googleapis.com/maps/api/geocode/json?address=blahblah&sensor=true')
except urllib2.URLError:
    print "Unable to connect etc etc"
The trouble is that by doing that, it always throws the exception, even though the address is perfectly accessible and the code works without it.
Any ideas? Cheers.

Python problems with FancyURLopener, 401, and "Connection: close"

I'm new to Python, so forgive me if I am missing something obvious.
I am using urllib.FancyURLopener to retrieve a web document. It works fine when authentication is disabled on the web server, but fails when authentication is enabled.
My guess is that I need to subclass urllib.FancyURLopener to override the get_user_passwd() and/or prompt_user_passwd() methods. So I did:
class my_opener (urllib.FancyURLopener):
    # Redefine
    def get_user_passwd(self, host, realm, clear_cache=0):
        print "get_user_passwd() called; host %s, realm %s" % (host, realm)
        return ('name', 'password')
Then I attempt to open the page:
try:
    opener = my_opener()
    f = opener.open('http://1.2.3.4/whatever.html')
    content = f.read()
    print "Got it: ", content
except IOError:
    print "Failed!"
I expect FancyURLopener to handle the 401, call my get_user_passwd(), and retry the request.
It does not; I get the IOError exception when I call "f = opener.open()".
Wireshark tells me that the request is sent, and that the server is sending a "401 Unauthorized" response with two headers of interest:
WWW-Authenticate: BASIC
Connection: close
The connection is then closed, I catch my exception, and it's all over.
It fails the same way even if I retry the "f = opener.open()" after IOError.
I have verified that my my_opener() class is working by overriding the http_error_401() method with a simple "print 'Got 401 error'". I have also tried to override the prompt_user_passwd() method, but that doesn't happen either.
I see no way to proactively specify the user name and password.
So how do I get urllib to retry the request?
Thanks.
I just tried your code on my webserver (nginx) and it works as expected:
1. GET from the urllib client
2. HTTP/1.1 401 Unauthorized from the server, with headers:
   Connection: close
   WWW-Authenticate: Basic realm="Restricted"
3. The client tries again with an Authorization header:
   Authorization: Basic <Base64encoded credentials>
4. The server responds with 200 OK + content
So I guess your code is right (I tried it with Python 2.7.1), and maybe the webserver you are trying to access is not working as expected. Here is the code, tested against the free HTTP basic auth test site browserspy.dk (they seem to be using Apache; the code works as expected):
import urllib

class my_opener (urllib.FancyURLopener):
    # Redefine
    def get_user_passwd(self, host, realm, clear_cache=0):
        print "get_user_passwd() called; host %s, realm %s" % (host, realm)
        return ('test', 'test')

try:
    opener = my_opener()
    f = opener.open('http://browserspy.dk/password-ok.php')
    content = f.read()
    print "Got it: ", content
except IOError:
    print "Failed!"
