Google crawl 503 service unavailable

I have a very strange problem when I crawl the Google search engine with wget, curl or Python on my servers. Google redirects me to an address starting with [ipv4|ipv6].google.fr/sorry/IndexRedirect... and finally sends a 503 error, Service Unavailable.
Over the course of a day the crawl sometimes works and sometimes does not, and I have tried almost everything: forcing IPv4/IPv6 instead of the hostname, setting the referer and user agent, a VPN, the .com and .fr domains, proxies and Tor.
I guess this is an error on Google's servers. Any ideas? Thanks!
wget "http://google.fr/search?q=test"
--2015-06-03 10:19:52-- http://google.fr/search?q=test
Resolving google.fr (google.fr)... 2a00:1450:400c:c05::5e, 173.194.67.94
Connecting to google.fr (google.fr)|2a00:1450:400c:c05::5e|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://ipv6.google.com/sorry/IndexRedirect?continue=http://google.fr/search%3Fq%3Dtest&q=CGMSECABQdAAUQABAAAAAAAAH1QYqPG6qwUiGQDxp4NLQuHgP_i-oiUu0ZShPumAZRF3u_0 [following]
--2015-06-03 10:19:53-- http://ipv6.google.com/sorry/IndexRedirect?continue=http://google.fr/search%3Fq%3Dtest&q=CGMSECABQdAAUQABAAAAAAAAH1QYqPG6qwUiGQDxp4NLQuHgP_i-oiUu0ZShPumAZRF3u_0
Resolving ipv6.google.com (ipv6.google.com)... 2a00:1450:400c:c05::64
Connecting to ipv6.google.com (ipv6.google.com)|2a00:1450:400c:c05::64|:80... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2015-06-03 10:19:53 ERROR 503: Service Unavailable.

Google has triggers to detect bots and other abuse of its Terms of Service, so it sets a limit (a "throttle") on the number of calls that the same IP address can make over a certain period of time. I believe it's something like 10 calls per minute. Case in point: if you paste your URL into a browser when it fails with a 503 error, you'll get a CAPTCHA challenge from Google to prove you are not a bot.
I am using the pattern.web module to do essentially the same thing as you (for harmless research purposes, of course!), and the documentation for that library shows the throttling limits for most popular APIs (Google, Bing, Twitter, Facebook...).
Try sending your requests 15 or more seconds apart to avoid tripping the throttle limit.
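
As a rough illustration of spacing requests out, here is a minimal sketch of a generator that enforces a minimum gap between consecutive items. The name throttled and its parameters are made up for this example, and the 15-second figure is only the guess from the answer above, not a documented Google limit.

```python
import time


def throttled(items, min_interval, clock=time.monotonic, sleep=time.sleep):
    """Yield items so that consecutive yields are at least min_interval seconds apart.

    clock and sleep are injectable so the pacing can be checked without waiting.
    """
    last = None
    for item in items:
        if last is not None:
            # Wait only for whatever part of the interval has not already elapsed
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

A loop such as `for query in throttled(queries, min_interval=15): ...` would then issue one search per iteration, no faster than one every 15 seconds. Spacing requests out lowers the chance of hitting the throttle but does not guarantee it.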

Related

python flask requests works on localhost but does not work on remote server

I'm trying to make a wallpaper page from the website "https://www.wallpaperflare.com".
When I run it on localhost it always works and displays the original page of the website.
But when I deploy to Heroku, the page shows "Error Get Request, Code 403" instead of the original page, which means the request to that URL fails.
This is my code:
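import requests
from flask import Flask

app = Flask(__name__)

@app.route("/wallpapers", methods=["GET"])
def wallpaper():
    # Fetch the upstream page and relay it, or report the failing status code
    page = requests.get("https://www.wallpaperflare.com")
    if page.status_code == 200:
        return page.text
    else:
        return "Error Get Request, Code {}".format(page.status_code)
Is there a way to solve it?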

HTTP status code 403 means Forbidden.
It means wallpaperflare.com is refusing your request, typically because websites do not want scripts making requests to them. Read the site's robots.txt to see its crawling policy.
It works on your local machine because that IP address has not yet been blacklisted by wallpaperflare.com.

Two things here:
The user agent: unless you spoof it, the requests module sends its own default string, which makes it very obvious that you are a bot.
The IP address: your server's IP address may be denied for various reasons, whereas your home IP address works just fine.
It is also possible that the remote site applies different policies depending on the client: if you are a bot you might be allowed to crawl a little, but rate-limiting measures could apply, for example.
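
To illustrate the user-agent point, here is a minimal sketch that sets an explicit User-Agent header. It assumes the standard-library urllib instead of requests (with requests, the same dictionary can be passed as headers=), and the USER_AGENT value and function names are placeholders invented for this example. It will not help if the server's IP address itself is blocked.

```python
import urllib.error
import urllib.request

# Placeholder value: substitute a string that honestly identifies your client.
USER_AGENT = "Mozilla/5.0 (compatible; example-client/0.1)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url):
    """Send the request and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; report their code instead
        return err.code
```

Calling fetch_status("https://www.wallpaperflare.com") from both the local machine and the server would show whether the header alone changes the 403.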

I always get the 429 status when making google queries through Python via proxy

I am using the requests library to make Google queries.
url = 'https://google.com/search?hl=en&q='
request_result = requests.get(url + query, headers=headers, proxies=proxies)
I always get the 429 status when I make a request via a proxy. I have tried several proxies, free ones as well as a paid one with a dynamically changing IP.
Does this mean that somebody else has made requests via these proxies, or is there something I can do to make it work? If I run it through my own IP it works fine.

Implementing WebSockets with Sony's Audio Control API in Python

Sony's website provides an example that uses WebSockets to work with their API in Node.js:
https://developer.sony.com/develop/audio-control-api/get-started/websocket-example#tutorial-step-3
It worked fine for me, but when I tried to implement it in Python it does not seem to work.
I use websocket_client:
import ssl
import websocket
ws = websocket.WebSocket()
ws.connect("ws://192.168.0.34:54480/sony/avContent", sslopt={"cert_reqs": ssl.CERT_NONE})
This gives:
websocket._exceptions.WebSocketBadStatusException: Handshake status 403 Forbidden
But their example code does not use any kind of authorization or authentication.

I recently had the same problem. Here is what I found out:
Normal HTTP responses can contain Access-Control-Allow-Origin headers to explicitly allow other websites to request data. Otherwise, web browsers block such "cross-origin" requests, because the user could be logged in there for example.
This "same-origin-policy" apparently does not apply to WebSockets and the handshakes can't have these headers. Therefore any website could connect to your Sony device. You probably wouldn't want some website to set your speaker/receiver volume to 100% or maybe upload a defective firmware, right?
That's why the audio control API checks the Origin header of the handshake. It always contains the website the request is coming from.
By default, the Python WebSocket client you use sends http://192.168.0.34:54480/sony/avContent as the origin in your case. However, the API does not seem to inspect the content of the Origin header; it just checks whether the header is present and rejects the handshake if it is.
The WebSocket#connect method has a parameter named suppress_origin which can be used to exclude the Origin header.
TL;DR
The Sony audio control API doesn't accept WebSocket handshakes that contain an Origin header.
You can fix it like this:
ws.connect("ws://192.168.0.34:54480/sony/avContent",
           sslopt={"cert_reqs": ssl.CERT_NONE},
           suppress_origin=True)

fitbit API HTTPS error

I'm trying to get my heart rate and sleep data through the Fitbit API. I'm using this library:
https://github.com/orcasgit/python-fitbit
to connect to the server and get the access and refresh tokens (I use gather_keys_oauth2 to get the tokens).
When I connect over HTTP I do manage to get the sleep data, but when I try to get the heart rate like this:
client.time_series("https://api.fitbit.com/1/user/-/activities/heart/date/today/1d.json", period="1d")
I get this error:
HTTPBadRequest: this request must use the HTTPS protocol
And for some reason I can't connect over HTTPS: when I try, the browser shows ERR_SSL_PROTOCOL_ERROR even before the Fitbit authorization page.
I tried to check and fix any settings that might cause the browser to fail, but they are all fine and the error still appears.
I've tried changing the callback URL and searched for other Fitbit OAuth2 connection guides, but I only manage to connect over HTTP and not HTTPS.
Does anyone know how to solve it?

Your code should be client.time_series('activities/heart', period='1d') to get heart rate.
The first parameter, resource, is not the resource URL; it expects a path that starts with one of these: activities, body, foods, heart, sleep.
Here is the link to the source code of python-fitbit:
http://python-fitbit.readthedocs.io/en/latest/_modules/fitbit/api.html#Fitbit.time_series
Added:
If you want to get the full heart rate data per minute (["activities-heart-intraday"] dataset), try client.intraday_time_series('activities/heart'). It will return data with the one-minute/one-second detail.
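
To see why passing a full URL as the resource goes wrong, here is a simplified, hypothetical sketch of how a client might assemble the request URL from a resource path. The helper time_series_url is invented for illustration and is not python-fitbit's actual code; the URL layout is copied from the URL in the question.

```python
API_BASE = "https://api.fitbit.com/1"  # taken from the URL in the question


def time_series_url(resource, base_date="today", period="1d", user_id="-"):
    """Hypothetical helper: assemble a time-series URL from a resource path."""
    return f"{API_BASE}/user/{user_id}/{resource}/date/{base_date}/{period}.json"


# A resource path produces the expected URL
good = time_series_url("activities/heart")

# A full URL passed as the resource ends up nested inside another URL
bad = time_series_url(
    "https://api.fitbit.com/1/user/-/activities/heart/date/today/1d.json"
)
```

Under this assumption, the second call yields a malformed address with the host embedded twice, which would fit a misleading error being returned for the request.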

OK, I've worked out the HTTPS issue in my case. It happened because I sent a request to:
https://api.fitbit.com//1/user/-/activities/recent.json
I removed the additional forward slash after .com and it worked:
https://api.fitbit.com/1/user/-/activities/recent.json
This is not the same mistake you made, but it returned the same message for me: this request must use the HTTPS protocol.
That suggests Fitbit returns this same error for any unhandled error caused by a malformed request, rather than one that gives you more of a clue as to what just happened.

504 in a request to Typeform API using request library

I am integrating the answers given by some users to a Typeform poll. I am using the requests library, and 50% of the time I get the response and 50% of the time I get a 504 code. Does anyone have advice or a link on how to solve this?

This may not be something that you can correct on the client side.
A 504 status code indicates that some gateway server or service attempted to forward your request to the origin server and either received a 504 from upstream or got no response within its timeout interval. For example, this is a code that is used by Amazon's CDN and load-balancing services. If the failure ratio is close to 50% with a significant sample size, there's a good chance that load balancing is involved.
As with any 5xx status code, the implication is that once the issue causing the failure is resolved upstream, you should be able to send the same request without modification and get a successful response. So any attempt to mitigate it on your side should focus on a retry/backoff strategy rather than on changing the content of your request.
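
A retry/backoff strategy of that kind can be sketched as follows. This is a minimal example under stated assumptions: retry_with_backoff, RETRYABLE and the default delays are invented for illustration, and send stands for any zero-argument callable that performs the request and returns its HTTP status code.

```python
import time

# Gateway-style failures that are usually worth retrying unchanged
RETRYABLE = {502, 503, 504}


def retry_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    Returns the last status code seen. The delay doubles after each failed
    attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE:
            return status
        if attempt < max_attempts - 1:
            # No point sleeping after the final attempt
            sleep(base_delay * 2 ** attempt)
    return status
```

With the requests library, send could be something like `lambda: requests.get(url, timeout=30).status_code`. Adding random jitter to the delay is a common refinement when many clients retry at once.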
