Asynchronous JSON Requests in Python

I'm using an API over HTTP that returns JSON. Calling the API, however, requires a start page and an end page to be specified, like this:
import requests

def API_request(URL):
    while True:
        try:
            Response = requests.get(URL)
            Data = Response.json()
            return Data['data']
        except Exception as APIError:
            print(APIError)
            continue

def build_orglist(start_page, end_page):
    APILink = ("http://sc-api.com/?api_source=live&system=organizations&action="
               "all_organizations&source=rsi&start_page={0}&end_page={1}&items_"
               "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
               "ormat=json".format(start_page, end_page))
    return API_request(APILink)
The only way to know you're no longer on an existing page is that the returned JSON comes back null.
If I wanted to run multiple build_orglist calls, covering every single page asynchronously until I reach the end (null JSON), how could I do so?

I went with a mix of @LukasGraf's answer, using a session to unify all of my HTTP connections, and grequests to make the group of HTTP requests in parallel.
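For reference, a sketch of how that combination might look. The helper org_page_url, the batch size, and the stopping check are illustrative rather than the exact code used; it assumes grequests accepts a session keyword so every request reuses the same requests.Session:
import grequests
import requests

# One session shared by every request (connection pooling / keep-alive)
session = requests.Session()

# Mirror of the URL template in build_orglist above, one page per request
def org_page_url(page):
    return ("http://sc-api.com/?api_source=live&system=organizations&action="
            "all_organizations&source=rsi&start_page={0}&end_page={0}&items_"
            "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
            "ormat=json".format(page))

results = []
page, batch_size = 1, 10  # batch size is arbitrary
while True:
    reqs = (grequests.get(org_page_url(p), session=session)
            for p in range(page, page + batch_size))
    responses = grequests.map(reqs)  # fire the whole batch in parallel
    batch = [r.json().get('data') for r in responses if r is not None]
    results.extend(d for d in batch if d)
    if not batch or not all(batch):  # a null page means we ran past the last one
        break
    page += batch_size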

Related

Get Data from API Link

I have a link that returns some JSON data. I want to write a Python script that gets the id value (the value that I want).
Can someone help me? I have tried the code below, but it keeps saying { "response": "Too many requests" }.
My code:
import requests

response_API = requests.get('https://api.scratch.mit.edu/users/FlyPhoenix/')
data = response_API.text
print(data)
This solution might work for you:
import json
import requests
response_API = requests.get('https://api.scratch.mit.edu/users/FlyPhoenix/')
data = response_API.text
print(json.loads(data)['id'])
print() will display the ID you want.
About { "response": "Too many requests" }:
Obviously, your rate of requests has been too high and the server is not willing to accept this.
You should not try to "dodge" this, or circumvent the server's protection by spoofing your IP; simply respect the server's answer by not sending too many requests (as explained in How to avoid HTTP error 429 (Too Many Requests) python).
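For example, a minimal sketch of backing off and retrying when the server replies with "Too many requests" (the attempt count and wait times are arbitrary examples):
import time
import requests

URL = "https://api.scratch.mit.edu/users/FlyPhoenix/"

# Try a few times, waiting longer after each rejection.
for attempt in range(5):
    resp = requests.get(URL)
    if resp.status_code == 429 or resp.json().get("response") == "Too many requests":
        time.sleep(10 * (attempt + 1))  # back off before retrying
        continue
    print(resp.json()["id"])
    break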
Additional solution:
Check that your code is not stuck in a loop making the request over and over.
If you have a dynamic IP, restart your router. If not, wait until you can make requests again.
import requests

def get_id():
    URL = "https://api.scratch.mit.edu/users/FlyPhoenix/"
    resp = requests.get(URL)
    if resp.status_code == 200:
        data_dict = resp.json()
        return data_dict["id"]
    else:
        print("Error: request failed with status code", resp.status_code)

print(get_id())

request.urlopen(url) not return website response or timeout

I want to fetch some websites' source for a project. When I try to get the response, the program just gets stuck waiting for it. No matter how long I wait, there is no timeout and no response. Here is my code:
import urllib.request

link = "https://eu.mouser.com/"
linkResponse = urllib.request.urlopen(link)
readedResponse = linkResponse.readlines()
writer = open("html.txt", "w")
for line in readedResponse:
    writer.write(str(line))
    writer.write("\n")
writer.close()
When I try other websites, urlopen returns their response. But when I try to get "eu.mouser.com" and "uk.farnell.com", it never returns their response, and urlopen does not even raise a timeout. What is the problem there? Is there another way to get a website's source?
The urllib.request.urlopen docs claim that
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
without explaining how to find said default. I managed to provoke a timeout by directly providing 5 (seconds) as the timeout:
import urllib.request
url = "https://uk.farnell.com"
urllib.request.urlopen(url, timeout=5)
gives
socket.timeout: The read operation timed out
There are some sites that protect themselves from automated crawlers by implementing mechanisms that detect such bots. These can be very diverse and also change over time. If you really want to do everything you can to get the page crawled automatically, this usually means that you have to implement steps yourself to circumvent these protective barriers.
One example of this is the header information that is sent with every request. It can be changed before making the request, e.g. via requests' header customization. But there are probably more things to do here and there.
If you're interested in starting developing such a thing (leaving aside the question of whether this is allowed at all), you can take this as a starting point:
from collections import namedtuple
from contextlib import suppress

import requests
from requests import ReadTimeout

Link = namedtuple("Link", ["url", "filename"])

links = {
    Link("https://eu.mouser.com/", "mouser.com"),
    Link("https://example.com/", "example1.com"),
    Link("https://example.com/", "example2.com"),
}

for link in links:
    with suppress(ReadTimeout):
        response = requests.get(link.url, timeout=3)
        with open(f"html-{link.filename}.txt", "w", encoding="utf-8") as file:
            file.write(response.text)
Here, protected sites that lead to ReadTimeout errors are simply ignored, and there is room to go further, e.g. by giving requests.get(link.url, timeout=3) a suitable headers parameter. But as already mentioned, this is probably not the only customization that would be needed, and the legal aspects should also be clarified.
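As a starting point for the headers part, a minimal sketch of sending a custom User-Agent with requests (the UA string is just an example, and by itself this may not be enough to get past a given site's bot detection):
import requests

# Example only: a browser-like User-Agent string
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get("https://eu.mouser.com/", headers=headers, timeout=5)
print(response.status_code)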

While loop makes the browser freeze with Brython

I'm trying to get the response of an API request made with ajax.ajax(). The response is stored into ['apiResponse'] in HTML5 Local Storage, but the rest of the Python function carries on without waiting for it to be put into localStorage.
Because of this I need to wait for it before reading the response, and I thought I could do what I did below to make the program wait before proceeding.
Unfortunately, the browser seems to freeze every time I put in a while loop...
Does someone know how to stop Brython and the browser from freezing, or another way to do what I want to do?
(It would really help me, as it's the only step left before successfully getting Spotify API responses.)
from browser import ajax  # to make requests
from browser.local_storage import storage as localStorage  # to access HTML5 Local Storage
import json  # to convert a json-like string into a Python dict

# Request to the API
def on_complete(req):
    if req.status == 200 or req.status == 0:
        localStorage['apiResponse'] = req.text
    else:
        print("An error occurred while asking Spotify for data")

def apiRequest(requestUrl, requestMethod):
    req = ajax.ajax()
    req.bind('complete', on_complete)
    req.open(requestMethod, requestUrl, True)
    req.set_header('Authorization', localStorage['header'])
    req.send()

def response():
    while localStorage['apiResponse'] == '':
        continue
    print('done')
    return json.loads(localStorage['apiResponse'])
Thanks in advance!
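For reference, one way around the busy-wait is to continue the work from inside the complete callback itself rather than polling localStorage; a minimal sketch (processResponse is a hypothetical name for whatever should run next):
from browser import ajax
from browser.local_storage import storage as localStorage
import json

def processResponse(data):
    # hypothetical placeholder for whatever should happen with the API data
    print(data)

def on_complete(req):
    if req.status == 200 or req.status == 0:
        localStorage['apiResponse'] = req.text
        # carry on here, inside the callback, instead of busy-waiting elsewhere
        processResponse(json.loads(req.text))
    else:
        print("An error occurred while asking Spotify for data")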

urllib.request.urlopen is behaving strange. Not returning the data the next day. Why?

I am trying to read Twitter feeds using the URL. Yesterday I was able to pull some 80K tweets with this code, but due to some updates on my machine, my Mac terminal stopped responding before the Python code completed.
Today the same code is not returning any JSON data; it gives me empty results. Yet if I type the same URL into a browser, I get a JSON file full of data.
Here is my code:
Method 1:
try:
    urllib.request.urlcleanup()
    response = urllib.request.urlopen(url)
    print('URL to used: ', url)
    testURL = response.geturl()
    print('URL you used: ', testURL)
    jsonResponse = response.read()
    jsonResponse = urllib.request.urlopen(url).read()
This printed:
URL to used: https://twitter.com/i/search/timeline?f=tweets&q=%20since%3A2017-08-14%20until%3A2017-08-15%20USA&src=typd&max_position=
URL you used: https://twitter.com/i/search/timeline?f=tweets&q=%20since%3A2017-08-14%20until%3A2017-08-15%20USA&src=typd&max_position=
json: {'items_html': '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n', 'focused_refresh_interval': 30000, 'has_more_items': False, 'min_position': 'TWEET--', 'new_latent_count': 0}
Method 2:
try:
    request = urllib.request.Request(url, headers=headers)
except:
    print("That's the problem here:")

try:
    response = urllib.request.urlopen(request)
except:
    print("Exception while fetching response")

testURL = response.geturl()
print('URL you used: ', testURL)

try:
    jsonResponse = response.read()
except:
    print("Exception while reading response")
Same results in both cases.
Kindly help.
Based on my testing this behavior has nothing to do with urllib. The same thing happens with the requests library for example.
It appears Twitter detects automated scraping via repeated hits against the search URL, based on your IP address and user agent (UA) string. At some point, subsequent hits return empty results. This seems to happen after a day or so, probably as a result of delayed analysis on Twitter's part.
If you change the UA string in the search URL request header, you should once again receive valid results in the response. Twitter will probably block you again after a while, so you'll need to change your UA string frequently.
I assume Twitter expires these blocks after some timeout, but I don't know how long that will take.
By way of reference, the twitter-past-crawler project demonstrates using a semi-random UA string taken from a file containing multiple UA strings.
Also, the Twitter-Search-API-Python project uses a hardcoded UA string, which stopped working a day or so after my first test. Changing the string in the code (adding random characters) resulted in a resumption of prior functionality.
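As an illustration of rotating the UA string with urllib (the UA values below are only examples, and whether this keeps working depends entirely on Twitter's detection):
import random
import urllib.request

# A small illustrative pool of UA strings; in practice you would keep a larger list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

url = ("https://twitter.com/i/search/timeline?f=tweets&q=%20since%3A2017-08-14"
       "%20until%3A2017-08-15%20USA&src=typd&max_position=")
request = urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
response = urllib.request.urlopen(request, timeout=10)
jsonResponse = response.read()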

Python + requests + splinter: What's the fastest/best way to make multiple concurrent 'get' requests?

Currently taking a web scraping class with other students, and we are supposed to make ‘get’ requests to a dummy site, parse it, and visit another site.
The problem is, the content of the dummy site is only up for several minutes before it disappears, and it comes back up at a certain interval. During the time the content is available, everyone tries to make the 'get' requests at once, so mine just hangs until the others clear up, by which point the content has disappeared. So I end up not being able to successfully make the 'get' request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to make requests that don't hang, you can simply retry in a loop, for instance:
import time
import requests

while True:
    response = requests.get('http://dummysite.ca')
    if response.status_code == 200:
        break
    time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the above code, and get their response.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]

# Wait for all requests to complete
pool.join()

for greenlet in greenlets:
    # This will raise any exceptions raised by the request.
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`.
    response = greenlet.get()
    text_response = response.text
You could also use map and a response-handling function instead of get (see the sketch below).
See gevent documentation for more information.
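For instance, a minimal sketch of that map variant; fetch_text is just an illustrative handler name:
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)

def fetch_text(url):
    # handle each response as soon as its greenlet finishes
    return requests.get(url).text

texts = pool.map(fetch_text, ['http://dummysite.ca'] * 10)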
In this situation, concurrency will not help much, since the server seems to be the limiting factor. One solution is to send a request with a timeout; if the timeout is exceeded, try the request again after a few seconds, gradually increasing the time between retries until you get the data you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raise a Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise a generic exception if the request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw) or use one of hundreds of middleware plugins.
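As a bare Twisted sketch of that non-blocking delay (inside Scrapy you would not start or stop the reactor yourself; retry_request is a hypothetical callback):
from twisted.internet import reactor

def retry_request():
    # hypothetical retry callback, scheduled instead of blocking with time.sleep
    print("retrying...")
    reactor.stop()  # stop the demo loop once the callback has run

# Run retry_request 10 seconds from now without blocking the event loop
reactor.callLater(10, retry_request)
reactor.run()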
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests

# Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)

# Check the status code to see how the server is handling the request
print(r.status_code)
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned, while 503 means the server is overloaded or undergoing maintenance.
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests
which you can use to make concurrent requests endlessly until a 200 response:
import grequests

urls = [
    'http://python-requests.org',  # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a set of unsent Requests
    out = grequests.map(rs)                # Send them all at the same time
    for i in out:
        if i.status_code == 200:
            print(i.text)
            del urls[out.index(i)]  # If we have the content, delete the URL
            return

while urls:
    keep_going()
