urllib2 timeout - python

I'm using the urllib2 library in my code, and I call urlopen a lot (EDIT: loadurl).
I have a problem on my network: when I'm browsing sites, my browser sometimes gets stuck on "Connecting" to a certain website, and sometimes it returns a timeout.
My question is: if I use urllib2 in my code, can it time out when it tries to connect to a certain website for too long, or will the code get stuck on that line?
I know that urllib2 can handle timeouts without specifying them in code, but does that apply to this kind of situation?
Thanks for your time.
EDIT:
def checker(self):
    try:
        html = self.loadurl("MY URL HERE")
        if self.ip_ != html:
            (...)
    except Exception, error:
        html = "bad"

From my small research, the timeout parameter of urllib2.urlopen() was added in Python 2.6,
so the timeout problem should be solved by passing a custom timeout to the urllib2.urlopen function. The code should look like this:
response = urllib2.urlopen("---insert url here---", None, your_timeout_value)
The your_timeout_value argument is optional and defines the timeout in seconds.
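For example, a minimal sketch (the fetch helper, the placeholder URL and the 10-second value are illustrative, not from the original question) that passes the timeout explicitly and catches the errors urllib2 raises when the connection hangs:

import socket
import urllib2

def fetch(url, timeout=10):
    # Hypothetical helper: return the page body, or "bad" if the request fails or times out.
    try:
        # The third positional argument is the timeout in seconds (Python 2.6+).
        return urllib2.urlopen(url, None, timeout).read()
    except urllib2.URLError:
        # Connection failures; a connect timeout shows up here with a socket.timeout reason.
        return "bad"
    except socket.timeout:
        # read() can also time out after the connection has been established.
        return "bad"

html = fetch("http://example.com/")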
EDIT: According to your comment, I understand that you don't want the code to wait too long; in that case you can use the following so it doesn't get stuck:
import socket
import urllib2
socket.setdefaulttimeout(10)
The value 10 can be adjusted based on your connection speed and how long the website takes to load.
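As a rough sketch of how that plays out (the URL is a placeholder), once the default timeout is set, any urlopen call that does not pass its own timeout gives up after that many seconds instead of hanging:

import socket
import urllib2

socket.setdefaulttimeout(10)  # applies to sockets that don't set their own timeout

try:
    html = urllib2.urlopen("http://example.com/").read()
except (urllib2.URLError, socket.timeout):
    # The hung connection is aborted after roughly 10 seconds instead of blocking forever.
    html = "bad"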

Related

Why don't I get a response from my request?

I'm trying to make one simple request:
from fake_useragent import UserAgent
import requests

ua = UserAgent()
req = requests.get('https://www.casasbahia.com.br/', headers={'User-Agent': ua.random})
I would understand if I received <Response [403]> or something like that, but instead I receive nothing; the code keeps running with no response.
Using logging I see:
I know I could use a timeout to avoid keeping the code running, but I just want to understand why I don't get a response.
Thanks in advance.
I have never used this API before, but from what I researched just now, there are sites that block requests from fake user agents.
So, to reproduce this example on my PC, I installed the fake_useragent and requests modules on Python 3.10 and tried to execute your script. It turns out that with my authentic User-Agent string, the request goes through: when printed to the console, req.text shows the entire HTML file received from the request.
But if I try again with a fake user agent, using ua.random, it fails. The site was probably built to detect and reject requests from fake agents (or bots).
Then again, this is just a theory; I have no way to access this site's server files to confirm it.
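A minimal sketch of that experiment, assuming a hard-coded browser-like User-Agent string and a timeout so the call cannot hang indefinitely (both values are illustrative, not from the original posts):

import requests

# Illustrative desktop-browser User-Agent string; any realistic one should behave similarly.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

# timeout=(connect, read) in seconds keeps the request from blocking forever.
req = requests.get('https://www.casasbahia.com.br/', headers=headers, timeout=(5, 30))

print(req.status_code)  # e.g. 200 if the site accepts the request
print(len(req.text))    # size of the HTML actually received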

Python "requests" module truncating responses

When I use the Python requests module, calling requests.get(url), I find that the response from the URL is being truncated.
import requests
url = 'https://gtfsrt.api.translink.com.au/Feed/SEQ'
response = requests.get(url)
print response.text
The response I get from the URL is being truncated. Is there a way to get requests to retrieve the full set of data and not truncate it?
Note: The given URL is a public transport feed which puts out a huge quantity data during the peak of day.
I ran into the same issue. The problem is not your Python code; it is likely PyCharm or whatever console utility you are using: the console has a buffer limit, and you may have to increase it to see your full response.
Refer to this article for more help:
Increase output buffer when running or debugging in PyCharm
Add "gzip":true to your request options.
That fixed the issue for me.
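One way to check whether the data is really truncated or only the console display is being cut off (an illustrative check, not part of the original answers) is to write the body to a file and compare sizes:

import requests

url = 'https://gtfsrt.api.translink.com.au/Feed/SEQ'
response = requests.get(url)

# If this number matches the size of the file written below, the response itself is
# complete and only the console output was being cut off by its buffer limit.
print(len(response.content))

with open('feed.bin', 'wb') as f:
    f.write(response.content)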

urllib2.urlopen wait forever for specific url, though curl, browser get immediately

I just used the simple code below:
urllib2.urlopen(url).read()
but it doesn't return anything and waits forever. After adding a timeout, it just times out.
I'm testing my API with Python, but urllib2.urlopen waits forever for my API, even though curl and the browser get a response immediately. urllib2 can fetch other URLs, and my API server shows that it returned a value when the Python version requested it. At least the server tried to return a value when Python made the request. It looks like there is some server-side issue, but which part of urllib2 causes this, and how can I fix it?
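A hedged diagnostic sketch (the URL, the header values and the timeout are placeholders, not a confirmed fix): sending the same request with a timeout and headers similar to what curl sends can help tell whether the server is holding the connection open for this particular client:

import urllib2

url = "http://example.com/my-api"  # placeholder for the API endpoint

request = urllib2.Request(url, headers={
    # Mimic what curl or a browser would send, in case the server keys on these.
    "User-Agent": "curl/7.35.0",
    "Accept": "*/*",
})

try:
    body = urllib2.urlopen(request, timeout=10).read()
    print len(body)
except urllib2.URLError as error:
    print "request failed:", error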

Link Scraping with requests, bs4. Getting Warning: unresponsive script

I'm trying to collect all links from a webpage using requests, BeautifulSoup4 and SoupStrainer in Python 3.3. I write my code in Komodo Edit 8.0 and also run my scripts from Komodo Edit. So far everything works fine, but on some webpages I get a popup with the following warning:
Warning unresponsive script
A script on this page may be busy, or it may have stopped responding. You can stop the script
now, or you can continue to see if the script will complete.
Script: viewbufferbase:797
Then I can choose whether I want to continue or stop the script.
Here is a little code snippet:
import requests
from bs4 import BeautifulSoup, SoupStrainer

try:
    r = requests.get(adress, headers=headers)
    soup = BeautifulSoup(r.text, parse_only=SoupStrainer('a', href=True))
    for link in soup.find_all('a'):
        pass  # some code
except requests.exceptions.RequestException as e:
    print(e)
My question is: what is causing this warning? Is it my Python script taking too long on a webpage, or a script on the webpage I'm scraping? I can't see how it could be the latter, because technically I'm not executing the scripts on the page, right?
Or could it be my bad internet connection?
Oh, and another little question: with the above code snippet, am I downloading pictures or just the plain HTML code? Sometimes when I look at my connection status, it seems like far too much data for a request that should only fetch plain HTML.
If so, how can I avoid downloading such stuff, and how is it possible in general to avoid downloads with requests? Sometimes my program ends up on a download page.
Many thanks!
The issue might be either long loading times of a site, or a cycle in your website's link graph - i.e. page1 (Main Page) has a link to page2 (Terms of Service), which in turn links back to page1. You could try a timing snippet to see how long it takes to get a response from a website.
Regarding your last question:
I'm pretty sure requests doesn't parse your response's content (except for the .json() method). What you might be experiencing is a link to a resource, like "Free Cookies!", which your script would visit. requests has mechanics to counter such cases, and the same technique lets you check the Content-Type header to make sure you're only downloading pages you're interested in.
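A minimal sketch of that idea, assuming the goal is to skip non-HTML resources (the helper name and timeout are illustrative): with stream=True the body is not downloaded until you access it, so you can inspect the Content-Type header first and close the response if it is not a page you want.

import requests

def get_html_only(url, headers=None):
    # stream=True defers downloading the body until we explicitly ask for it.
    r = requests.get(url, headers=headers, stream=True, timeout=10)
    content_type = r.headers.get('Content-Type', '')
    if 'text/html' not in content_type:
        # Not a page (e.g. an image or a file download): skip the body entirely.
        r.close()
        return None
    return r.text  # safe to read now; it is an HTML page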

Unexpected behaviour with Python urllib

I am trying to consume a JSON response, but I am seeing some very weird behaviour. The endpoint is a Java app running on Tomcat.
I want to load the following URL:
http://opendata.diavgeia.gov.gr/api/decisions?count=50&output=json_full&from=1
Using Ruby's open-uri I can load the JSON, and if I hit it in the browser I still get the response. But once I try to use Python's urllib or urllib2, I get an error:
javax.servlet.ServletException: Could not resolve view with name 'jsonView' in servlet with name 'diavgeia-api'
It's quite strange, and I guess the error lies in the API server. Any hints?
The server appears to need an 'Accept' header:
>>> import urllib2
>>> print urllib2.urlopen(
...     urllib2.Request(
...         "http://opendata.diavgeia.gov.gr/api/decisions?count=50&output=json_full&from=1",
...         headers={"accept": "*/*"})).read()[:200]
{"model":{"queryInfo":{"total":117458,"count":50,"order":"desc","from":1},"expandedDecisions":[{"metadata":{"date":1291932000000,"tags":{"tag":[]},"decisionType":{"uid":27,"label":"ΔΑΠΑΝΗ","extr
Two possibilities, neither of which hold water:
The server is only prepared to use HTTP 1.1 (which urllib apparently doesn't support, but urllib2 does)
It's doing user agent sniffing, and rejecting Python (I tried using Firefox's UA string instead, but it still gave me an error)
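A small sketch of that comparison (illustrative only; the Firefox User-Agent string is made up, and the outcomes in the comments are what the answers above report rather than something re-verified here):

import urllib2

url = "http://opendata.diavgeia.gov.gr/api/decisions?count=50&output=json_full&from=1"

# Variant 1: browser-like User-Agent only -- reportedly still returns the ServletException.
req_ua = urllib2.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0"})

# Variant 2: explicit Accept header -- reportedly returns the JSON body.
req_accept = urllib2.Request(url, headers={"Accept": "*/*"})

print urllib2.urlopen(req_accept).read()[:200]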
