Facebook Graph API on App Engine: Invalid Request URL - Python

I have a request to the FB Graph API that goes like so:
https://graph.facebook.com/?access_token=<ACCESSTOKEN>&fields=id,name,email,installed&ids=<A LONG LONG LIST OF IDS>
If the number of IDs in the request goes above 200 or so, the following things happen:
in browser: works
in local tests with urllib: timeout
on the deployed App Engine application: "Invalid request URL (followed by url)"; this one doesn't hang at all, though
For numbers of IDs below 200 or so, it works fine in all cases.
Sure, I could just slice the ID list up and fetch the IDs in separate batches, but I would like to know why this is happening and what it means.

I didn't read your question thoroughly the first time around; I didn't scroll the embedded code to the right, so I didn't realize you were using a long URL.
There is usually a maximum URL length, which prevents you from making a long HTTP GET request. The way to get around that is to embed the parameters in the body of a POST request.
It looks like FB's Graph API does support it, according to this question:
using POST request on Facebook Graph API
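For illustration, a minimal sketch of that approach, assuming the method=GET override discussed in the linked question (the access token and IDs below are placeholders):

import requests

# Move the long parameter list out of the URL and into the POST body.
long_list_of_ids = ['4', '6']  # placeholder IDs
params = {
    'access_token': '<ACCESSTOKEN>',
    'fields': 'id,name,email,installed',
    'ids': ','.join(long_list_of_ids),
    'method': 'GET',  # ask the Graph API to treat this POST as a GET
}
response = requests.post('https://graph.facebook.com/', data=params)
print(response.json())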

How to prevent web scrapers/crawlers/bots from hitting my API

I'm new to Stack Overflow. I have an API developed in Flask (Python 3), and I am calling that API from the front end.
I need to serve only 3 requests per browser/IP (or whatever the best way to identify a user is), and on the 4th request I need to return an error saying the limit was exceeded.
How can I determine that it is the same user, and block them permanently until they sign up?
Front end: HTML and AJAX requests
Back end: Python 3 and Flask
I'm able to get the IP address with the method below:
@app.route('/api/givedata', methods=['GET'])
def givedata():
    return jsonify({'ip': request.remote_addr}), 200
But some scrapers may use a VPN or other tricks to get access. How can I stop these data crawlers/bots?
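For what it's worth, a minimal in-memory sketch of the 3-requests-per-IP limit described above (illustrative only; as noted, IP-based limits are easy to evade with a VPN, and real deployments usually reach for something like Flask-Limiter or a shared store such as Redis):

from collections import defaultdict
from flask import Flask, jsonify, request

app = Flask(__name__)
request_counts = defaultdict(int)  # in-memory; resets when the app restarts

@app.route('/api/givedata', methods=['GET'])
def givedata():
    ip = request.remote_addr
    request_counts[ip] += 1
    if request_counts[ip] > 3:
        return jsonify({'error': 'limit exceeded, please sign up'}), 429
    return jsonify({'ip': ip}), 200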

Unable to get complete source code of web page using Python [duplicate]

I tried sending requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but I still fail.
All other websites are fine, though.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
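For example, a quick way to see what requests sends by default:

import requests

# httpbin echoes the request back, so you can inspect exactly what was sent.
resp = requests.get('https://httpbin.org/headers')
print(resp.json())  # the headers requests set automatically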
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python. Setting the User-Agent header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
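If you do end up needing that fallback, a sketch of the requests-html approach (it downloads a Chromium build on first use):

from requests_html import HTMLSession

# Render the page in headless Chromium so its scripts actually run.
session = HTMLSession()
r = session.get('https://rent.591.com.tw')
r.html.render()
print(r.html.find('title', first=True).text)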
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
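A minimal sketch of that GET-then-POST order for a CSRF-protected form (the URL and the csrf_token field name here are hypothetical):

import re
import requests

session = requests.Session()  # keeps cookies across requests
page = session.get('https://example.com/login')
# Pull out the token the server embedded in the form.
match = re.search(r'name="csrf_token"\s+value="([^"]+)"', page.text)
token = match.group(1) if match else ''
resp = session.post('https://example.com/login',
                    data={'csrf_token': token, 'user': 'me', 'pass': 'secret'})
print(resp.status_code)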
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they'd rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some web scraping off links I was reading from a file. What I didn't realise was that each link had a trailing newline character (\n) when I read it from the file.
If you're getting multiple links from a file instead of from a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used:
import requests

with open("filepath") as file:
    links = file.read().splitlines()  # splitlines() strips the trailing \n

for link in links:
    response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)

Fitbit API HTTPS error

I'm trying to get my heart rate and sleep data through the Fitbit API. I'm using this:
https://github.com/orcasgit/python-fitbit
in order to connect to the server and get the access and refresh tokens (I use gather_keys_oauth2 to get the tokens).
When connecting over HTTP I do manage to get the sleep data, but when I try to get the HR like this:
client.time_series("https://api.fitbit.com/1/user/-/activities/heart/date/today/1d.json", period="1d")
I get this error:
HTTPBadRequest: this request must use the HTTPS protocol
And for some reason I can't connect over HTTPS; when I do try it, the browser pops up ERR_SSL_PROTOCOL_ERROR even before the Fitbit authorization page.
I tried to check and fix any settings that might cause the browser to fail, but they all look fine and the error still pops up.
I've tried changing the callback URL and searched for other Fitbit OAuth2 connection guides, but I only manage to connect over HTTP, not HTTPS.
Does anyone know how to solve it?
Your code should be client.time_series('activities/heart', period='1d') to get heart rate.
The first parameter, resource, doesn't take the full resource URL; it expects one of: activities, body, foods, heart, or sleep.
Here is the link of source code from python-fitbit:
http://python-fitbit.readthedocs.io/en/latest/_modules/fitbit/api.html#Fitbit.time_series
Added:
If you want the full heart rate data per minute (the ["activities-heart-intraday"] dataset), try client.intraday_time_series('activities/heart'). It will return the data in one-minute/one-second detail.
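Put together, a minimal sketch of both calls (assuming you already obtained OAuth2 tokens, e.g. via gather_keys_oauth2.py; the credential values below are placeholders):

import fitbit

client = fitbit.Fitbit('CLIENT_ID', 'CLIENT_SECRET',
                       access_token='ACCESS_TOKEN',
                       refresh_token='REFRESH_TOKEN')

daily = client.time_series('activities/heart', period='1d')
intraday = client.intraday_time_series('activities/heart', detail_level='1min')
print(intraday['activities-heart-intraday']['dataset'][:5])  # first few samples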
OK, I've worked out the HTTPS issue in relation to my need. It was because I sent a request to:
https://api.fitbit.com//1/user/-/activities/recent.json
I removed the additional forward slash after .com and it worked:
https://api.fitbit.com/1/user/-/activities/recent.json
However, this is not the same issue you had, even though it returned the same message for me: this request must use the HTTPS protocol.
This suggests that any unhandled error caused by a malformed request to Fitbit returns this same message, rather than one that gives you more of a clue as to what actually happened.

Python - Get timing property from network response

How can I get the properties shown in the image (Blocked, DNS resolution, Connecting, ...) after sending the request?
From firefox, the waiting time = ~650ms
From python, requests.Response.elapsed.total_seconds() = ~750ms
Since the results differ, I want a more detailed breakdown, like the one shown in Firefox developer mode.
You can only get the total time of the request, because the response itself doesn't know more.
More detailed information is only logged by the program that handles the request, starting and stopping a timer for each step.
You would need to track times inside your connection framework, or have a look at the Firefox APIs for "timings"; there are several such APIs, so maybe you will find something you can use for your case. The main point is that you can't do it directly from your script: the request and response are fired/caught, and the logging and data collection happen in other components in between.
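For illustration, one connection framework that does expose per-phase timings is pycurl (libcurl), named here as an alternative to requests; a minimal sketch:

import pycurl
from io import BytesIO

# libcurl records per-phase timings; pycurl exposes them via getinfo().
buf = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://example.com')
c.setopt(c.WRITEDATA, buf)
c.perform()

print('DNS lookup:    ', c.getinfo(pycurl.NAMELOOKUP_TIME))     # seconds
print('TCP connect:   ', c.getinfo(pycurl.CONNECT_TIME))
print('TLS handshake: ', c.getinfo(pycurl.APPCONNECT_TIME))
print('First byte:    ', c.getinfo(pycurl.STARTTRANSFER_TIME))
print('Total:         ', c.getinfo(pycurl.TOTAL_TIME))
c.close()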

Access HTTP Header Details Through Python Instagram API

I'm using the Instagram API to retrieve all photos from a list of hashtags. Starting today, I've been hitting the API rate limits (429 error, specifically). To debug this, I've been trying to see how to get the number of calls I have left per hour and incorporate that into my script.
The Instagram API says that you can access the number of calls left in the hour through the HTTP header, but I'm not sure how to access that in Python.
The following fields are provided in the header of each response and their values are related to the type of call that was made (authenticated or unauthenticated):
X-Ratelimit-Remaining: the remaining number of calls available to your app within the 1-hour window
X-Ratelimit-Limit: the total number of calls allowed within the 1-hour window
http://instagram.com/developer/limits
How would I access this data in Python?
I assumed it would be a much fancier solution, based on a few other answers on SO, but after some research I found a really simple one!
import requests

r = requests.get('URL of the API response here')
print(r.headers)  # a case-insensitive dict of all response headers
print(r.headers['X-Ratelimit-Remaining'])  # calls left in the 1-hour window
