How to prevent web scrapers/crawlers/bots from hitting my API

I'm new to Stack Overflow. I have an API developed in Flask (Python 3), and I call it from the front end.
I need to answer only 3 requests from a given browser/IP (or whatever is the best way to identify a user), and on the 4th request return an error saying the limit has been exceeded.
How can I tell that the user is the same one and block them permanently until they sign up?
Front end: HTML and Ajax requests
Back end: Python 3 and Flask
I'm able to get the IP address with the method below:
from flask import Flask, jsonify, request
app = Flask(__name__)

@app.route('/api/givedata', methods=['GET'])
def givedata():
    return jsonify({'ip': request.remote_addr}), 200
But some scrapers may use a VPN or similar tools to get around this. How can I keep these data crawlers/bots out?
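The limit described above (3 answers, then an error) can be sketched as a small per-client counter. This is only a minimal sketch: the class name RequestLimiter and its methods are hypothetical, the counts live in memory and are lost on restart, and keying on the IP address has exactly the VPN weakness mentioned in the question.

```python
from collections import defaultdict


class RequestLimiter:
    """Hypothetical helper: allows a fixed number of requests per client key."""

    def __init__(self, limit=3):
        self.limit = limit
        self.counts = defaultdict(int)  # client key -> requests served so far

    def allow(self, client_key):
        """Return True and count the request, or False once the limit is reached."""
        if self.counts[client_key] >= self.limit:
            return False
        self.counts[client_key] += 1
        return True

    def reset(self, client_key):
        """Forget a client, for example after they sign up."""
        self.counts.pop(client_key, None)


limiter = RequestLimiter(limit=3)
results = [limiter.allow("203.0.113.7") for _ in range(4)]
print(results)
```

In the Flask view, one option would be to call limiter.allow(request.remote_addr) first and return an error response (for example with status 429) when it returns False.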

Related

python flask requests works on localhost but does not work on remote server

I'm trying to make a wallpaper page from the website "https://www.wallpaperflare.com".
When I run it on localhost it always works and displays the original page of the website.
But when I deploy to Heroku, the page shows "Error Get Request, Code 403" instead of the original page, which means the request to that URL fails.
This is my code:
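import requests
from flask import Flask

app = Flask(__name__)

@app.route("/wallpapers", methods=["GET"])
def wallpaper():
    # Fetch the remote page and pass it through when the request succeeds
    page = requests.get("https://www.wallpaperflare.com")
    if page.status_code == 200:
        return page.text
    else:
        return "Error Get Request, Code {}".format(page.status_code)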
Is there a way to solve it?

HTTP error code 403 means Forbidden.
It means wallpaperflare.com is not allowing you to make the request. Websites generally do not want scripts making requests to them. Make sure to read the site's robots.txt to see its crawling policies.
It works on your local machine because that machine has not yet been blacklisted by wallpaperflare.com.
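Checking a robots.txt policy can be done with the standard library's urllib.robotparser. The sketch below parses a made-up set of rules (SAMPLE_RULES and the example.com URLs are assumptions for illustration, not the real policy of any site); for a live site you would load the actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_RULES, "my-bot", "https://example.com/private/page"))
print(is_allowed(SAMPLE_RULES, "my-bot", "https://example.com/public"))
```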

Two things here:
the user agent: unless you spoof it, the requests module sends its own default string, which makes it very obvious you are a bot
the IP address: your server's IP address may be denied for various reasons, whereas your home IP address works just fine.
It is also possible that the remote site applies different policies depending on the client. If you are a bot you might be allowed to crawl a bit, but rate limiting measures could apply, for example.
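To illustrate the user agent point, here is a minimal sketch that attaches an explicit User-Agent header to a request object. It uses the standard library's urllib.request rather than requests so that it runs without extra packages, and it only builds the request without sending it. The user agent string is a made-up example. With requests, the same dictionary can be passed through the headers keyword argument of requests.get.

```python
import urllib.request

# Made-up user agent string for illustration
EXAMPLE_USER_AGENT = "Mozilla/5.0 (compatible; ExampleClient/1.0)"


def build_request(url, user_agent):
    """Build (but do not send) a GET request carrying an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.wallpaperflare.com", EXAMPLE_USER_AGENT)
# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
```

A different user agent does nothing about the second point: if the server's IP address is the one being refused, the request can still come back with 403.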

fitbit API HTTPS error

I'm trying to get my heart rate and sleep data through the Fitbit API. I'm using this library:
https://github.com/orcasgit/python-fitbit
to connect to the server and get the access and refresh tokens (I use gather_keys_oauth2 to get the tokens).
When I connect over HTTP I do manage to get the sleep data, but when I try to get the heart rate like this:
client.time_series("https://api.fitbit.com/1/user/-/activities/heart/date/today/1d.json", period="1d")
I get this error:
HTTPBadRequest: this request must use the HTTPS protocol
And for some reason I can't connect over HTTPS: when I try, the browser shows ERR_SSL_PROTOCOL_ERROR even before the Fitbit authorization page.
I checked every setting that might cause the browser to fail, but they're all fine and the error still appears.
I've tried changing the callback URL and searched for other Fitbit OAuth2 connection guides, but I only manage to connect over HTTP, not HTTPS.
Does anyone know how to solve it?

Your code should be client.time_series('activities/heart', period='1d') to get the heart rate.
The first parameter, resource, does not take the full resource URL; it takes a resource name based on one of these: activities, body, foods, heart, sleep.
Here is the link to the source code of python-fitbit:
http://python-fitbit.readthedocs.io/en/latest/_modules/fitbit/api.html#Fitbit.time_series
Added:
If you want the full per-minute heart rate data (the ["activities-heart-intraday"] dataset), try client.intraday_time_series('activities/heart'). It returns data at one-minute or one-second detail.

OK, I've worked out the HTTPS issue as it relates to my case. It happened because I sent a request to:
https://api.fitbit.com//1/user/-/activities/recent.json
I removed the extra forward slash after .com and it worked:
https://api.fitbit.com/1/user/-/activities/recent.json
However, this is not the same issue you had, even though it returned the same message for me: this request must use the HTTPS protocol.
This suggests that Fitbit returns this same error for any unhandled error caused by a malformed request, rather than one that gives more of a clue about what actually happened.
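One way to avoid the doubled slash described above is to join the base URL and the path through a small helper instead of concatenating strings by hand. This is a minimal sketch; join_url is a hypothetical helper, not part of python-fitbit.

```python
def join_url(base, path):
    """Join a base URL and a path with exactly one slash between them."""
    return base.rstrip("/") + "/" + path.lstrip("/")


url = join_url("https://api.fitbit.com/", "/1/user/-/activities/recent.json")
print(url)
```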

Instagram Client Side Authentication using Python

I am currently working on a Bottle project that will use the Instagram API. I was hoping to use client-side authentication, but I am having problems with the access token because it is not returned as a parameter.
I am making the request here:
https://api.instagram.com/oauth/authorize/?client_id=client_id&redirect_uri=redirect_uri&response_type=token&scope=basic+follower_list
The app is redirected to the token page correctly and I can even see the token in the URL. But when I try to parse it, it comes out empty.
@route('/oauth_callback')
def success_message():
    token = request.GET.get("access_token")
    print token.values()
    return "success"
The token.values() returns an empty list.
PS: Keep in mind that when I do the same operation with server-side authentication, I can successfully get the code and exchange it for a token.

After you make the request to the Instagram API, you should be receiving a redirect like the one below:
http://your-redirect-uri#access_token=ACCESS-TOKEN
The part after # is called the fragment; it is not a query string parameter, and there is no way to retrieve it on the server side in Bottle.
Bottle does expose the parsed URL as bottle.request.urlparts, but its documentation notes that the fragment is always empty:
urlparts
The url string as an urlparse.SplitResult tuple. The tuple
contains (scheme, host, path, query_string and fragment), but the
fragment is always empty because it is not visible to the server.
Use the SDK and preferably server-side operations -> https://github.com/facebookarchive/python-instagram
If you still want to go with this approach, you would need JavaScript that parses the access token from the fragment and then posts it to your Bottle API, which I don't recommend.
From https://instagram.com/developer/authentication/
Client-Side (Implicit) Authentication
If you are building an app that does not have a server component (a
purely javascript app, for instance), you will notice that it is
impossible to complete step three above to receive your access_token
without also having to store the secret on the client. You should
never pass or store your client_id secret onto a client. For these
situations there is the Implicit Authentication Flow.
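The query-versus-fragment distinction in this answer can be checked with the standard library. The sketch below splits a redirect URL of the shape shown above (the host is a placeholder) and shows that the token sits in the fragment while the query string is empty. A browser does not send the fragment to the server, which is why the server-side lookup finds nothing.

```python
from urllib.parse import parse_qs, urlsplit

# Placeholder redirect URL in the shape Instagram uses for the implicit flow
redirect_url = "http://your-redirect-uri/#access_token=ACCESS-TOKEN"

parts = urlsplit(redirect_url)
query_params = parse_qs(parts.query)        # what a server would receive
fragment_params = parse_qs(parts.fragment)  # only visible on the client

print(query_params)
print(fragment_params)
```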

Can I persist an http connection (or other data) across Flask requests?

I'm working on a Flask app which retrieves the user's XML from the myanimelist.net API (sample), processes it, and returns some data. The data returned can be different depending on the Flask page being viewed by the user, but the initial process (retrieve the XML, create a User object, etc.) done before each request is always the same.
Currently, retrieving the XML from myanimelist.net is the bottleneck for my app's performance and adds on a good 500-1000ms to each request. Since all of the app's requests are to the myanimelist server, I'd like to know if there's a way to persist the http connection so that once the first request is made, subsequent requests will not take as long to load. I don't want to cache the entire XML because the data is subject to frequent change.
Here's the general overview of my app:
from flask import Flask
from functools import wraps
import requests

app = Flask(__name__)

def get_xml(f):
    @wraps(f)
    def wrap():
        # Get the XML before each app function
        r = requests.get('page_from_MAL')  # Current bottleneck
        user = User(data_from_r)  # User object
        response = f(user)
        return response
    return wrap

@app.route('/one')
@get_xml
def page_one(user_object):
    return 'some data from user_object'

@app.route('/two')
@get_xml
def page_two(user_object):
    return 'some other data from user_object'

if __name__ == '__main__':
    app.run()
So is there a way to persist the connection like I mentioned? Please let me know if I'm approaching this from the right direction.

I don't think you are approaching this from the right direction, because your app acts too much as a proxy for myanimelist.net.
What happens when you have 2000 users? Your app ends up making tons of requests to myanimelist.net, and a malicious user could definitely DoS your app (or use it to DoS myanimelist.net).
This is a much cleaner way, IMHO:
Server side:
Create a websocket server (e.g. https://github.com/aaugustin/websockets/blob/master/example/server.py)
When a user connects to the websocket server, add the client to a list; remove it from the list on disconnect.
For every connected user, check myanimelist.net frequently to get the associated XML (maybe lower the frequency as the number of online users grows).
For every XML document, make a diff against your server's local version and send that diff to the client over the websocket channel (assuming there is a diff).
Client side:
On receiving a diff: update the local XML with the differences.
Disconnect from the websocket after n seconds of inactivity, and when disconnected show a button in the interface to reconnect.
I doubt you can do much better, assuming myanimelist.net doesn't provide a "push" API.
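The server-side diff step could look roughly like this. It is a minimal sketch using difflib from the standard library; xml_diff is a hypothetical helper, the XML snippets are made up, and a line-based text diff is only one possible choice (the answer does not say which diff format to use). The websocket sending itself is left out.

```python
import difflib


def xml_diff(old_xml, new_xml):
    """Return unified-diff lines between the cached and the freshly fetched XML.

    An empty list means nothing changed, so nothing needs to be sent.
    """
    return list(
        difflib.unified_diff(
            old_xml.splitlines(),
            new_xml.splitlines(),
            fromfile="cached",
            tofile="fetched",
            lineterm="",
        )
    )


# Made-up XML for illustration
cached = "<anime>\n<score>8</score>\n</anime>"
fetched = "<anime>\n<score>9</score>\n</anime>"

changes = xml_diff(cached, fetched)
if changes:
    print("\n".join(changes))  # this is what would go over the websocket
```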

Facebook graph api on appengine Invalid Request URL

I have a request to the fb graph api that goes like so:
https://graph.facebook.com/?access_token=<ACCESSTOKEN>&fields=id,name,email,installed&ids=<A LONG LONG LIST OF IDS>
If the number of ids in the request goes above 200 or so, the following happens:
in browser: works
in local tests with urllib: timeout
on the deployed App Engine application: "Invalid request URL (followed by url)"; this one doesn't hang at all, though
For fewer than 200 or so ids, it works fine in all of them.
Sure, I could just slice the id list up and fetch the pieces separately, but I would like to know why this is happening and what it means.

I didn't read your question through the first time around. I didn't scroll the embedded code to the right to realize that you were using a long URL.
There is usually a maximum URL length, which prevents you from sending a very long HTTP GET request. The way around that is to put the parameters in the body of a POST request.
It looks like FB's Graph API does support it, according to this question:
using POST request on Facebook Graph API
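Moving the parameters from the URL into a POST body might look like this with the standard library. This is a sketch only: it builds the request without sending it, the ids are made up, the access token is left out, and whether the Graph API accepts the parameters in this exact form is something to confirm against its documentation.

```python
import urllib.parse
import urllib.request


def build_post_request(base_url, params):
    """Build (but do not send) a POST request with params form-encoded in the body."""
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(base_url, data=body, method="POST")


# Made-up ids standing in for the long list from the question
ids = ",".join(str(i) for i in range(300))
req = build_post_request(
    "https://graph.facebook.com/",
    {"fields": "id,name,email,installed", "ids": ids},
)

print(len(req.full_url))  # the URL stays short
print(len(req.data))      # the long id list travels in the body
```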
