GAE - How can I combine the results of several asynchronous url fetches? - python

I have a Google AppEngine (in Python) application where I need to perform 4 to 5 url fetches, and then combine the data before I print it out to the response.
I can do this without any problems using a synchronous workflow, but since the urls that I am fetching are not related or dependent on each other, performing this asynchronously would be the most ideal (and quickest).
I have read and re-read the documentation here, but I just can't figure out how to get read the contents for each url. I've also searched the web for a small example (which is really what I am in need of). I have seen this SO question, but again, here they don't mention anything about reading the contents of these individual asynchronous url fetches.
Does anyone have any simple examples of how to perform 4 or 5 asynchronous url fetches with AppEngine? And then combine the results before I print it to the response?
Here is what I have so far:
rpcs = []
for album in result_object['data']:
total_facebook_photo_count = total_facebook_photo_count + album['count']
facebook_albumid_array.append(album['id'])
#Get the photos in the photo album
facebook_photos_url = 'https://graph.facebook.com/%s/photos?access_token=%s&limit=1000' % (album['id'], access_token)
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, facebook_photos_url)
rpcs.append(rpc)
for rpc in rpcs:
result = rpc.get_result()
self.response.out.write(result.content)
However, it still looks like the line: result = rpc.get_result() is forcing it to wait for the first request to finish, then the second, then the third, and so forth. Is there a way to simply put the results in a variables as they are received?
Thanks!

In the example, text = result.content is where you get the content (body).
To do url fetches in parallell, you could set them up, add to a list and check results afterwards. Expanding on the example already mentioned, it could look something like:
from google.appengine.api import urlfetch
futures = []
for url in urls:
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, url)
futures.append(rpc)
contents = []
for rpc in futures:
try:
result = rpc.get_result()
if result.status_code == 200:
contents.append(result.content)
# ...
except urlfetch.DownloadError:
# Request timed out or failed.
# ...
concatenated_result = '\n'.join(contents)
In this example, we assemble the body of all the requests that returned status code 200, and concatenate with linebreak between them.
Or with ndb, my personal preference for anything async on GAE, something like:
#ndb.tasklet
def get_urls(urls):
ctx = ndb.get_context()
result = yield map(ctx.urlfetch, urls)
contents = [r.content for r in result if r.status_code==200]
raise ndb.Return('\n'.join(contents))

I use this code (implmented before I learned about ndb tasklets):
while rpcs:
rpc = UserRPC.wait_any(rpcs)
result = rpc.get_result()
# process result here
rpcs.remove(rpc)

Related

Multiple Unique Autonomous Sessions with Python Requests

I have searched through all the requests docs I can find, including requests-futures, and I can't find anything addressing this question. Wondering if I missed something or if anyone here can help answer:
Is it possible to create and manage multiple unique/autonomous sessions with requests.Session()?
In more detail, I'm asking if there is a way to create two sessions I could use "get" with separately (maybe via multiprocessing) that would retain their own unique set of headers, proxies, and server-assigned sticky data.
If I want Session_A to hit someimaginarysite.com/page1.html using specific headers and proxies, get cookied and everything else, and then hit someimaginarysite.com/page2.html with the same everything, I can just create one session object and do it.
But what if I want a Session_B, at the same time, to start on page3.html and then hit page4.html with totally different headers/proxies than Session_A, and be assigned its own cookies and whatnot? Crawl multiple pages consecutively with the same session, not just a single request followed by another request from a blank (new) session.
Is this as simple as saying:
import requests
Session_A = requests.Session()
Session_B = requests.Session()
headers_a = {A headers here}
proxies_a = {A proxies here}
headers_b = {B headers here}
proxies_b = {B proxies here}
response_a = Session_A.get('https://someimaginarysite.com/page1.html', headers=headers_a, proxies=proxies_a)
response_a = Session_A.get('https://someimaginarysite.com/page2.html', headers=headers_a, proxies=proxies_a)
# --- and on a separate thread/processor ---
response_b = Session_B.get('https://someimaginarysite.com/page3.html', headers=headers_b, proxies=proxies_b)
response_b = Session_B.get('https://someimaginarysite.com/page4.html', headers=headers_b, proxies=proxies_b)
Or will the above just create one session accessible by two names so the server will see the same cookies and session appearing with two ips and two sets of headers... which would seem more than a little odd.
Greatly appreciate any help with this, I've exhausted my research abilities.
I think that there is probably a better way to do this, but without more information about the pagination and exactly what you want, it is a bit hard to understand exactly what you need. The following will make 2 threads with sequential calls in each keeping the same headers and proxies. Once again, there may be a better way to approach but with the limited information it's a bit murky.
import requests
import concurrent.futures
def session_get_same(urls, headers, proxies):
lst = []
with requests.Session() as s:
for url in urls:
lst.append(s.get(url, headers=headers, proxies=proxies))
return lst
def main():
with concurrent.futures.ThreadPoolExecutor() as executor:
futures = [
executor.submit(
session_get_same,
urls=[
'https://someimaginarysite.com/page1.html',
'https://someimaginarysite.com/page2.html'
],
headers={'user-agent': 'curl/7.61.1'},
proxies={'https': 'https://10.10.1.11:1080'}
),
executor.submit(
session_get_same,
urls=[
'https://someimaginarysite.com/page3.html',
'https://someimaginarysite.com/page4.html'
],
headers={'user-agent': 'curl/7.61.2'},
proxies={'https': 'https://10.10.1.11:1080'}
),
]
flst = []
for future in concurrent.futures.as_completed(futures):
flst.append(future.result())
return flst
Not sure if this is better for the first function, mileage may vary
def session_get_same(urls, headers, proxies):
lst = []
with requests.Session() as s:
s.headers.update(headers)
s.proxies.update(proxies)
for url in urls:
lst.append(s.get(url))
return lst

Obtain all Woocommerce Orders via Python API

I'm looking to export all orders from the WooCommerce API via a python script.
I've followed the
authentication process
and I have been using method to obtain orders described
here. My code looks like the following:
wcapi = API(
url = "url",
consumer_key = consumerkey,
consumer_secret = consumersecret
)
r = wcapi.get('orders')
r = r.json()
r = r['orders']
print(len(r)) # output: 8
This outputs the most recent 8 orders, but I would like to access all of them. There are over 200 orders placed via woocommerce right now. How do I access all of the orders?
Please tell me there is something simple I am missing.
My ultimate goal is to pull these orders automatically, transform them, and then upload to a visualization tool. All input is appreciated.
First: Initialize your API (as you did).
wcapi = API(
url=eshop.url,
consumer_key=eshop.consumer_key,
consumer_secret=eshop.consumer_secret,
wp_api=True,
version="wc/v2",
query_string_auth=True,
verify_ssl = True,
timeout=10
)
Second: Fetch the orders from your request(as you did).
r=wcapi.get("orders")
Third: Fetch the total pages.
total_pages = int(r.headers['X-WP-TotalPages'])
Forth: For every page catch the json and access the data through the API.
for i in range(1,total_pages+1):
r=wcapi.get("orders?&page="+str(i)).json()
...
The relevant parameters found in the corresponding documentation are page and per_page. The per_page parameter defines how many orders should be retrieved at every request. The page parameter defines the current page of the order collection.
For example, the request sent by wcapi.get('orders/per_page=5&page=2') will return orders 5 to 10.
However, as the default of per_page is 10, it is not clear as to why you get only 8 orders.
I encountered the same problem with paginated response for products.
I built on the same approach described by #gtopal, whereby the X-WP-TotalPages header returned by WooCommerce is used to iterate through each page of results.
I knew that I would probably encounter the same issue for other WooCommerce API requests (such as orders), and I didn't want to have to confuse my code by repeatedly performing a loop when I requested a paginated set of results.
To avoid this I used a decorator to abstract the pagination logic, so that get_all_wc_orders can focus just on the request.
I hope the decorator below might be useful to someone else (gist)
from woocommerce import API
WC_MAX_API_RESULT_COUNT = 100
wcapi = API(
url=url,
consumer_key=key,
consumer_secret=secret,
version="wc/v3",
timeout=300,
)
def wcapi_aggregate_paginated_response(func):
"""
Decorator that repeat calls a decorated function to get
all pages of WooCommerce API response.
Combines the response data into a single list.
Decorated function must accept parameters:
- wcapi object
- page number
"""
def wrapper(wcapi, page=0, *args, **kwargs):
items = []
page = 0
num_pages = WC_MAX_API_RESULT_COUNT
while page < num_pages:
page += 1
log.debug(f"{page=}")
response = func(wcapi, page=page, *args, **kwargs)
items.extend(response.json())
num_pages = int(response.headers["X-WP-TotalPages"])
num_products = int(response.headers["X-WP-Total"])
log.debug(f"{num_products=}, {len(items)=}")
return items
return wrapper
#wcapi_aggregate_paginated_response
def get_all_wc_orders(wcapi, page=1):
"""
Query WooCommerce rest api for all products
"""
response = wcapi.get(
"orders",
params={
"per_page": WC_MAX_API_RESULT_COUNT,
"page": page,
},
)
response.raise_for_status()
return response
orders = get_all_wc_orders(wcapi)

Google BigQuery python - error paginating table

I have a large table in BigQuery, which i have to go through, get all data and process it in my GAE app. Since my table is going to be about 4m rows, i decided i have to get data via pagination mechanism implemeted in code examples here > https://cloud.google.com/bigquery/querying-data
def async_query(query):
client = bigquery.Client()
query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.use_legacy_sql = False
query_job.use_query_cache = False
query_job.begin()
wait_for_job(query_job)
query_results = query_job.results()
page_token = None
output_rows = []
while True:
rows, total_rows, page_token = query_results.fetch_data(max_results = 200, page_token = page_token)
output_rows = output_rows + rows
if not page_token:
break
def wait_for_job(job):
while True:
job.reload() # Refreshes the state via a GET request.
if job.state == 'DONE':
if job.error_result:
raise RuntimeError(job.errors)
return
time.sleep(1)
But when i execute it i receive an error:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
When max_results parameter > table size it works fine. When max_results < table size and pagination required - i get this error.
Am i missing something?
The error indicates your overall request handler processing takes too long. Very likely because of the multiple query_results.fetch_data iterations, due to pagination.
You may want to check:
Dealing with DeadlineExceededErrors
Deadline errors: 60 seconds or less in Google App Engine
You'll probably have to re-think your app a bit, maybe try to not get the whole result immediately and instead either:
get just a portion of the result
get the result on a separate request, later on, after obtaining it in the background, for example:
via a single task queue request (10 min instead or 60s deadline)
by assembling it from multiple chunks collected in separate task queue requests to really make it scalable (not sure if this actually works with bigquery, I only tried it with the datastore)

how to send muliptle requests and make sure the response comes back within a second in python

I am trying to validate what the throttle limit for an endpoint using python code.
Basically I have set Throttlelimit on the endpoint I am testing is 3calls/sec. The test does 4 calls and checks the status codes to have atleast 1 429 response.
The validation I have fails sometimes because it looks like the responses take more than a second to respond. The code I tried are:
Method1:
request = requests.Request(method='GET', url=GLOBALS["url"], params=context.payload, headers=context.headers)
context.upperlimit = int(GLOBALS["ThrottleLimit"]) + 1
reqs = [request for i in range(0, context.upperlimit)]
with BaseThrottler(name='base-throttler', reqs_over_time=(context.upperlimit, 1)) as bt:
throttled_requests = bt.multi_submit(reqs)
context.responses = [tr.response for tr in throttled_requests]
assert(429 in [ i.status_code for i in context.responses])
Method2:
request = requests.get(url=GLOBALS["url"], params=context.payload, headers=context.headers)
url = request.url
urls = set([])
for i in range(0, context.upperlimit):
urls.add(grequests.get(url))
context.responses = grequests.map(urls)
assert(429 in [ i.status_code for i in context.responses])
Is there a way that I can make sure all the responses came back in the same second and if not it should try again before failing the test.
I suppose you are using requests and grequests library. You can set a timeout as explained in the docs and also for grequests.
Plain requests
requests.get(url, timeout=1)
Using grequests
grequests.get(url, timeout=1)
Timeout value is "number of seconds"
Using timeout won't necessarily ensure the condition that you are looking for, which is that all 4 requests were received by the endpoint within one second (not that each individual response was received within one second of sending the request).
One quick and dirty way to solve this is to simply time the execution of the code, and ensure that all responses were received in less than a second (using the timeit module)
start_time = timeit.default_timer()
context.responses = grequests.map(urls)
elapsed = timeit.default_timer() - start_time
if elapsed < 1:
assert(429 in [ i.status_code for i in context.responses])
This is crude because it is checking round trip time, but will ensure that all requests were received within a second. If you need more specificity, or find that the condition is not met often enough, you could add a header to the response with the exact time the request was received by the endpoint, and then verify that all requests hit the endpoint within one second of each other.

Correct greenlet termination

I am using gevent to download some html pages.
Some websites are way too slow, some stop serving requests after period of time. That is why I had to limit total time for a group of requests I make. For that I use gevent "Timeout".
timeout = Timeout(10)
timeout.start()
def downloadSite():
# code to download site's url one by one
url1 = downloadUrl()
url2 = downloadUrl()
url3 = downloadUrl()
try:
gevent.spawn(downloadSite).join()
except Timeout:
print 'Lost state here'
But the problem with it is that i loose all the state when exception fires up.
Imagine I crawl site 'www.test.com'. I have managed to download 10 urls right before site admins decided to switch webserver for maintenance. In such case i will lose information about crawled pages when exception fires up.
The question is - how do I save state and process the data even if Timeout happens ?
Why not try something like:
timeout = Timeout(10)
def downloadSite(url):
with Timeout(10):
downloadUrl(url)
urls = ["url1", "url2", "url3"]
workers = []
limit = 5
counter = 0
for i in urls:
# limit to 5 URL requests at a time
if counter < limit:
workers.append(gevent.spawn(downloadSite, i))
counter += 1
else:
gevent.joinall(workers)
workers = [i,]
counter = 0
gevent.joinall(workers)
You could also save a status in a dict or something for every URL, or append the ones that fail in a different array, to retry later.
A self-contained example:
import gevent
from gevent import monkey
from gevent import Timeout
gevent.monkey.patch_all()
import urllib2
def get_source(url):
req = urllib2.Request(url)
data = None
with Timeout(2):
response = urllib2.urlopen(req)
data = response.read()
return data
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
print contents[5]
It implements one timeout for each request. In this example, contents contains 10 times the HTML source of google.com, each retrieved in an independent request. If one of the requests had timed out, the corresponding element in contents would be None. If you have questions about this code, don't hesitate to ask in the comments.
I saw your last comment. Defining one timeout per request definitely is not wrong from the programming point of view. If you need to throttle traffic to the website, then just don't spawn 100 greenlets simultaneously. Spawn 5, wait until they returned. Then, you can possibly wait for a given amount of time, and spawn the next 5 (already shown in the other answer by Gabriel Samfira, as I see now). For my code above, this would mean, that you would have to repeatedly call
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
whereas N should not be too high.

Categories

Resources