I wrote a web scraper using the requests module. I open a session and send all subsequent requests through it. It has two phases:
1) Scrape page by page and collect IDs in an array.
2) Get details about each ID in the array using requests to an AJAX server on the same host.
The scraper works fine on my Linux machine. However, when I run it on Windows 10, phase 1 completes without problems, but after a couple of requests in phase 2 Python throws this exception:
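  File "c:\python27\lib\site-packages\requests\adapters.py", line 453, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(10054, 'Varolan bir ba\xf0lant\xfd uzaktaki bir ana bilgisayar taraf\xfdndan zorla kapat\xfdld'))
(Error 10054 is a connection reset; the Turkish message translates to "An existing connection was forcibly closed by the remote host.")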
What differs between the two operating systems that causes this? How can I overcome this problem?
Modifying my request code as shown below to use the retrying module did not help. Now the script no longer throws exceptions but simply hangs, doing nothing.
import time
from retrying import retry

@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000, stop_max_attempt_number=7)
def doReq(self, url):
    time.sleep(0.5)
    response = self.session.get(url, headers=self.headers)
    return response
I still don't know why this problem occurs only on Windows. However, the retrying decorator seems to have fixed the socket error. The script was hanging because the server did not respond to a request: by default, requests waits forever for a response. With a timeout value added, requests raises a timeout exception, and the retry decorator catches it and tries again. I know this is a workaround rather than a solution, but it is the best I've got right now.
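For anyone who wants to see the retry-plus-timeout idea without the retrying package, here is a minimal sketch using only the standard library. The `retry` helper below and its parameter names are hypothetical stand-ins, not the retrying module's API; the delays only loosely mirror the settings above.

```python
import functools
import time


def retry(max_attempts=7, base_delay=1.0, max_delay=10.0, exceptions=(Exception,)):
    """Retry the wrapped call with exponential backoff (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the caller see the error
                    # Wait base_delay, then 2x, 4x, ... capped at max_delay.
                    time.sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
        return wrapper
    return decorator
```

In the scraper, the decorated function would call something like self.session.get(url, headers=self.headers, timeout=10), with exceptions set to (requests.exceptions.ConnectionError, requests.exceptions.Timeout). The 10-second value is only an example; without a timeout argument the call can still block forever and the retry never gets a chance to run.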
I'm working on Python 2.7 code that reads a value from an HTML page using the urllib2 library. I want urllib2.urlopen to time out after 5 seconds when there is no Internet connection, so that the rest of the code can run.
It works as expected when the computer has a working Internet connection. As a test, setting timeout=0.1 makes the call time out almost immediately without opening the URL, as expected. But when there is no Internet connection, the timeout has no effect, whether I set it to 0.1, 5, or any other value. The call simply does not time out.
This is my code:
import urllib2
url = "https://alfahd.witorbit.net/fingerprint.php?n"
try:
    response = urllib2.urlopen(url, timeout=5).read()
    print response
except Exception as e:
    print e
Result when connected to Internet with timeout value 5:
180
Result when connected to Internet with timeout value 0.1 :
<urlopen error timed out>
The timeout seems to be working.
Result when NOT connected to the Internet, with any timeout value (the call fails after about 40 seconds every time, regardless of the value I set for timeout):
<urlopen error [Errno -3] Temporary failure in name resolution>
How can I make urllib2.urlopen time out when there is no Internet connectivity? Am I missing something? Please guide me to solve this issue. Thanks!
Because name resolution happens before the request is made, it is not subject to the timeout. You can prevent this name resolution error by adding the host's IP address to your /etc/hosts file. For example, if the host is subdomain.example.com and the IP is 10.10.10.10, you would add the following line to /etc/hosts:
10.10.10.10 subdomain.example.com
Alternatively, you may be able to simply use the IP address directly. However, some web servers require the hostname, in which case you'll need to modify the hosts file so that the name can be used offline.
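If editing the hosts file is not an option, one possible workaround is to bound the lookup yourself by running it in a background thread and giving up after a deadline. This is only a sketch: `resolve_with_timeout` is a hypothetical helper, and the `resolver` parameter exists purely so the function can be exercised without a network. Note that the lookup thread is not cancelled; it keeps running in the background until the system resolver gives up.

```python
import socket
import threading


def resolve_with_timeout(host, timeout, resolver=socket.gethostbyname):
    """Return the IP for host, or None if the lookup fails or takes too long."""
    result = {}

    def worker():
        try:
            result["ip"] = resolver(host)
        except (socket.error, OSError) as exc:
            result["error"] = exc

    thread = threading.Thread(target=worker)
    thread.daemon = True  # do not keep the interpreter alive for a stuck lookup
    thread.start()
    thread.join(timeout)  # wait at most `timeout` seconds for the lookup
    return result.get("ip")
```

The caller can then skip the urlopen call entirely when the function returns None, instead of waiting for the resolver's own (much longer) timeout.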
I have an API written in Flask and am testing the endpoints with nosetests, using requests to send requests to the API. During the tests, I randomly get this error:
ConnectionError: HTTPConnectionPool(host='localhost', port=5555): Max retries exceeded with url: /api (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fe4e794fd50>: Failed to establish a new connection: [Errno 111] Connection refused',))
This error only seems to happen when running tests, and it randomly affects anywhere between none and all of the tests. All of my tests are run from one subclass of unittest.TestCase:
import multiprocessing
import unittest

import requests


class WebServerTests(unittest.TestCase):
    # Args to run web server with
    server_args = {'port': WEB_SERVER_PORT, 'debug': True}
    # Process to run web server
    server_process = multiprocessing.Process(
        target=les.web_server.run_server, kwargs=server_args)

    @classmethod
    def setup_class(cls):
        """
        Set up testing
        """
        # Start server
        cls.server_process.start()

    @classmethod
    def teardown_class(cls):
        """
        Clean up after testing
        """
        # Kill server
        cls.server_process.terminate()
        cls.server_process.join()

    def test_api_info(self):
        """
        Tests /api route that gives information about API
        """
        # Test to make sure the web service returns the expected output, which at
        # the moment is just the version of the API
        url = get_endpoint_url('api')
        response = requests.get(url)
        assert response.status_code == 200, 'Status Code: {:d}'.format(
            response.status_code)
        assert response.json() == {
            'version': module.__version__}, 'Response: {:s}'.format(response.json())
Everything is happening on localhost and the server is listening on 127.0.0.1. My guess would be that too many requests are being sent to the server and some are being refused, but I'm not seeing anything like that in the debug logs. I had also thought that it may be an issue with the server process not being up before the requests were being made, but the issue persists with a sleep after starting the server process. Another attempt was to let requests attempt retrying the connection by setting requests.adapters.DEFAULT_RETRIES. That didn't work either.
I've tried running the tests on two machines both normally and in docker containers and the issue seems to occur regardless of the platform on which they are run.
Any ideas of what may be causing this and what could be done to fix it?
It turns out that my problem was indeed the server not having enough time to start up, so the tests were running before it could respond to requests. I thought I had tried to fix this with a sleep, but I had accidentally placed it after creating the process instead of after starting it. In the end, changing
cls.server_process.start()
to
cls.server_process.start()
time.sleep(1)
fixed the issue.
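A fixed sleep can still be too short on a slow machine, so another option is to poll the port until it accepts connections. The sketch below uses only the standard library; `wait_for_port` and its default values are assumptions for illustration, not part of the original test suite.

```python
import socket
import time


def wait_for_port(host, port, timeout=5.0, interval=0.1):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            conn = socket.create_connection((host, port), timeout=1.0)
        except (socket.error, OSError):
            time.sleep(interval)  # server not listening yet; try again shortly
        else:
            conn.close()
            return True
    return False
```

In setup_class this could replace the sleep: call cls.server_process.start() and then assert wait_for_port('127.0.0.1', WEB_SERVER_PORT), so the tests fail early with a clear message if the server never comes up.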
What happens if I send a GET/POST request, hit a connect timeout (not a read timeout), and then retry the request?
Will the old request also be cancelled on the server, or might it still arrive at the server later and be executed there?
Also, if we do not hit a connect timeout but the response simply takes longer, that should mean the request arrived at the server and is probably still being executed there, right? So should we wait until the response is received, since the connection was definitely established?
Thank you in advance!
I'm trying to run a few simple Python scripts to retrieve information from the Internet. Unfortunately, I'm behind a corporate proxy, which makes this somewhat tricky. So far I have installed CNTLM and configured it to work with PyCharm 1.4. I have configured both so that when I 'Check Connection' to www.google.com in PyCharm using my manual proxy settings, it returns 'Connection Successful'.
However, when I try to run simple scripts from PyCharm, they all seem to time out. Any advice? By way of code example, this will return a 503 response. Thanks!
import requests
URL = "http://google.com"
try:
    response = requests.get(URL)
    print response
except Exception as e:
    print "Something went wrong:"
    print e
I want to use python to log into a website which uses Microsoft Forefront, and retrieve the content of an internal webpage for processing.
I am not new to python but I have not used any URL libraries.
I checked the following posts:
How can I log into a website using python?
How can I login to a website with Python?
How to use Python to login to a webpage and retrieve cookies for later usage?
Logging in to websites with python
I have also tried a couple of modules such as requests. Still, I am unable to understand how this should be done. Is it enough to enter the username and password, or should I somehow use the cookies to authenticate? Any sample code would really be appreciated.
This is the code I have so far:
import requests
NAME = 'XXX'
PASSWORD = 'XXX'
URL = 'https://intra.xxx.se/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3'
def main():
    # Start a session so we can have persistent cookies
    session = requests.session()
    # This is the form data that the page sends when logging in
    login_data = {
        'username': NAME,
        'password': PASSWORD,
        'SubmitCreds': 'login',
    }
    # Authenticate
    r = session.post(URL, data=login_data)
    # Try accessing a page that requires you to be logged in
    r = session.get('https://intra.xxx.se/?t=1-2')
    print r
main()
However, the above code raises the following exception on the session.post line:
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='intra.xxx.se', port=443): Max retries exceeded with url: /CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3 (Caused by <class 'socket.error'>: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond)
UPDATE:
I noticed that I was providing the wrong username/password.
Once that was corrected, I get an HTTP 200 response with the above code, but when I try to access any internal site I get an HTTP 401 response. Why is this happening? What is wrong with the above code? Should I be using the cookies somehow?
TMG can be notoriously fussy about what types of connections it blocks. The next step is to find out why TMG is blocking your connection attempts.
If you have access to the TMG server, log in to it, start the TMG management user-interface (I can't remember what it is called) and have a look at the logs for failed requests coming from your IP address. Hopefully it should tell you why the connection was denied.
It seems you are attempting to connect to it over an intranet. One way I've seen it block connections is if it receives them from an address it considers to be on its 'internal' network. (TMG has two network interfaces as it is intended to be used between two networks: an internal network, whose resources it protects from threats, and an external network, where threats may come from.) If it receives on its external network interface a request that appears to have come from the internal network, it assumes the IP address has been spoofed and blocks the connection. However, I can't be sure that this is the case as I don't know what this TMG server's internal network is set up as nor whether your machine's IP address is on this internal network.