How to overcome the problem of maximum response limit in web scraping?

I'm trying to scrape Yahoo Finance for stock market info. Since I want data for the whole NASDAQ (over 8000 stocks), I'm using multithreading to reduce the execution time. The problem is that Yahoo seems to respond to only a certain number of my requests and block all the others, giving this error:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='finance.yahoo.com', port=443): Max retries exceeded with url: /quote/CAT?p=CAT&.tsrc=fin-srch (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000024386785A30>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
Is there any way to resolve this issue?
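If the failures really do come from too many simultaneous connections (the question suspects this but does not confirm it), one common mitigation is to cap the number of worker threads and pause briefly before each request. The sketch below is only an illustration: scrape_all, fetch, max_workers and pause are hypothetical names, and the default values are guesses rather than limits documented by Yahoo. It catches OSError because the built-in ConnectionError derives from it, and so does requests.exceptions.RequestException.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(symbols, fetch, max_workers=4, pause=0.5):
    """Call fetch(symbol) for each symbol with at most max_workers in flight.

    fetch is any callable supplied by the caller, for example a function
    that wraps requests.get for a single ticker (an assumption, not shown).
    Returns (results, errors), both keyed by symbol.
    """
    results, errors = {}, {}

    def polite_fetch(symbol):
        time.sleep(pause)  # small delay so each worker does not hammer the host
        return fetch(symbol)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(polite_fetch, s): s for s in symbols}
        for future in as_completed(futures):
            symbol = futures[future]
            try:
                results[symbol] = future.result()
            except OSError as exc:  # connection errors end up here
                errors[symbol] = exc
    return results, errors
```

Symbols that end up in errors can be passed to scrape_all again later, which keeps one blocked request from aborting the whole run.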

Related

HTTP request error: what is the distinction between "Name or service not known" and "Temporary failure in name resolution" errors

While scraping some sites with the requests package in Python, I came across these two HTTP connection errors:
Name or service not known
ConnectionError: HTTPConnectionPool(host='1xbet666041.top', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f04cd03ab10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
Temporary failure in name resolution
ConnectionError: HTTPConnectionPool(host='1xbet666041.top', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fb9bacf7f10>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')
Can anyone please explain the difference between the two?
Notice that I got these 2 different errors when using the same domain. The difference was that the first GET call was made from Databricks and the other from Google Colab.
Thanks.

Name or service not known
A name resolution has been done, and it succeeded in getting a reply, but the reply was: "this name has no IP address". So it is a positive assertion on a negative result.
Temporary failure in name resolution
No name resolution was possible at all, so it is a negative assertion on the possibility of resolving that name.
Notice that I got these 2 different errors when using the same domain.
Yes, but using which nameserver(s)?
Check that the domain's DNS configuration is correct using a tool such as the online DNSViz. Any errors it reports have to be fixed, and so do the warnings.
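The two messages correspond to two different getaddrinfo error codes, which Python exposes as socket.EAI_NONAME and socket.EAI_AGAIN. A minimal sketch of telling them apart in code follows; classify_dns_error and resolve are hypothetical helper names introduced only for illustration.

```python
import socket


def classify_dns_error(exc):
    """Map a socket.gaierror to a short description of the failure."""
    if exc.errno == socket.EAI_NONAME:
        # "Name or service not known": the resolver answered, and the
        # answer was that the name has no address.
        return "name not known"
    if exc.errno == socket.EAI_AGAIN:
        # "Temporary failure in name resolution": no usable answer was
        # obtained, so a later retry may succeed.
        return "temporary failure"
    return "other resolver error"


def resolve(host, port=80):
    """Try to resolve host and report how the attempt ended."""
    try:
        socket.getaddrinfo(host, port)
        return "resolved"
    except socket.gaierror as exc:
        return classify_dns_error(exc)
```

On Linux with glibc these constants are -2 and -3, which matches the [Errno -2] and [Errno -3] in the two tracebacks.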

Max retries exceeded with url: /api/json. Failed to establish a new connection: [Errno 111] Connection refused

I am using the jenkins package for Python to fetch job details from Jenkins servers. For most of the servers I am able to fetch the job data, but for a few servers I get the error below.
Unable to authenticate with any scheme: auth(kerberos) HTTPConnectionPool(host='xxxxxxxx', port=8080): Max retries exceeded with url: /api/json (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb1d4b55898>: Failed to establish a new connection: [Errno 111] Connection refused',)) auth(anonymous) HTTPConnectionPool(host='xxxxxxxx', port=8080): Max retries exceeded with url: /api/json (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb1cbce20f0>: Failed to establish a new connection: [Errno 111] Connection refused',))
The error occurs in the code below.
server_object = jenkins.Jenkins("http://"+["jenkins_hostname"]+":"+['jenkins_port'], username=xxxxx, password=xxxxx, timeout=120,)
I tried logging in to the server with the user ID and password, and it works from a browser.
I cannot figure out what exactly the issue is, since I am able to log in to Jenkins from the browser. I am stuck because this error does not occur for all servers. Please help me resolve it.
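[Errno 111] Connection refused is raised before any authentication takes place, so one way to narrow the problem down is to test, from the machine that runs the script, whether the TCP port accepts connections at all. The helper below is a minimal sketch; port_is_open is a hypothetical name, and the host and port in the usage comment are placeholders taken from the masked error message.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes ConnectionRefusedError and timeouts
        return False


# Usage (placeholder host, as in the error message):
# print(port_is_open("xxxxxxxx", 8080))
```

If this returns False for the failing servers only, the cause is more likely network related (a firewall, a different port, or a proxy that the browser uses and the script does not) than the credentials.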

Kuali Python API Connection

I have recently begun a project for my work that involves me having to connect to an accounting application's API (Kuali).
I have only recently begun working with APIs and am having a great degree of difficulty connecting to this server. When running the following code I receive this error:
import requests
requests.get('https://university.kuali.co/api/v1/auth/authenticate','Authorization: Basic mykeyhere')
print(requests.status_codes)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='university.kuali.co', port=443): Max retries exceeded with url: /api/v1/auth/authenticate?Authorization:%20Basic%20mykeyhere (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x109c276d0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
I would greatly appreciate any help as this project would save a huge amount of my time.
The link to the API documentation can be found below.
https://developers.kuali.co/#general

The Authorization value should be passed under headers:
requests.get(url, headers={"Authorization": "Basic your key here"})
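In the question, the string was passed as the second positional argument of requests.get, which is params, so it ended up in the query string (visible in the URL of the traceback). The sketch below shows the header form without sending anything over the network. It uses the standard library's urllib.request.Request rather than requests purely so the result can be inspected offline; build_request, the example.invalid URL and the key are placeholders.

```python
import urllib.request


def build_request(url, key):
    """Build (but do not send) a GET request carrying the key in a header."""
    headers = {"Authorization": f"Basic {key}"}
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.invalid/api/v1/auth/authenticate", "mykeyhere")
print(req.full_url)                     # the URL carries no credentials
print(req.get_header("Authorization"))  # the key travels in the header
```

The same headers dictionary can be passed as headers= to requests.get. Note that the traceback in the question reports a name-resolution failure, so the hostname is worth checking as well.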

Is it possible to check number of HTTPConnection request made

I have a uWSGI server running on a Linux VM node, and multiple requests are made to it.
Only occasionally there are errors such as ReadTimeout and HTTPConnectionPool failures, and they recover automatically.
ConnectionError: HTTPConnectionPool(host='10.1.1.1', port=8000): Max retries exceeded with url: /app_servers (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f16e8a89190>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
Is this due to too many requests, or to some network lookup issue?
I tried using the netstat and sar commands to identify the root cause, but the CPU and IO stats are fine.
The number of connections in the ESTABLISHED and CLOSE_WAIT states is also low. I am not sure how to check these for a past point in time.
How can I check the number of HTTP connections made at that point in time, or find out why the HTTPConnectionPool (Max retries exceeded) error occurs?

Python requests.get(url, timeout=75) does not wait for specified timeout

requests.get("http://172.19.235.178", timeout=75)
is my piece of code.
It sends a GET request to the URL, which is a phone, and is supposed to wait up to 75 seconds for it to return a 200 OK.
This request works perfectly on one Ubuntu machine but does not wait for 75 seconds on another machine.

According to the documentation at https://2.python-requests.org/en/master/user/advanced/#timeouts you can set a timeout for the connect phase of a request, but the timeout you are encountering is an OS-level socket timeout.
Notice that if you do:
requests.get("http://172.19.235.178", timeout=1)
you get:
ConnectTimeout: HTTPConnectionPool(host='172.19.235.178', port=80):
Max retries exceeded with url: / (Caused by
ConnectTimeoutError(, 'Connection to 172.19.235.178 timed out. (connect
timeout=1)'))
whereas when you do:
requests.get("http://172.19.235.178", timeout=75)
you get:
ConnectionError: HTTPConnectionPool(host='172.19.235.178', port=80): Max
retries exceeded with url: / (Caused by
NewConnectionError(': Failed to establish a new connection: [Errno
10060] A connection attempt failed because the connected party did not
properly respond after a period of time, or established connection
failed because connected host has failed to respond',))
You could change your OS behavior as described here: http://willbryant.net/overriding_the_default_linux_kernel_20_second_tcp_socket_connect_timeout
In your case, though, I would set a timeout of 10 and retry the request a few times inside a try/except statement.
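The retry suggestion can be sketched as below. get_with_retries, attempts and delay are hypothetical names and values chosen for illustration. The request itself is passed in as a callable so that the loop stays independent of the HTTP library. Catching OSError covers the requests exceptions shown above, because requests.exceptions.RequestException derives from IOError (an alias of OSError).

```python
import time


def get_with_retries(fetch, attempts=3, delay=1.0):
    """Call fetch() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns the first successful result; re-raises the last error if
    every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as exc:  # connection errors and timeouts
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Usage with requests (assumes the package is installed):
# import requests
# response = get_with_retries(
#     lambda: requests.get("http://172.19.235.178", timeout=10),
#     attempts=7,
# )
```

With timeout=10 and seven attempts, the total wait is roughly the 75 seconds the question asks for, while each individual attempt stays below the OS connect timeout.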
