Google App Engine Task Queue Fetch Statistics fail - python

I have a pull queue being serviced by a backend and when the queue is empty I need to trigger another script.
At the moment I am using a very crude detection in the method that leases the tasks from the queue, so that if the task list returned is empty we presume that there are no more to lease and trigger the next step. However, while this works most of the time, occasionally a lease request seems to return an empty list even though there are tasks available.
Anyway, the better way to do it, I think, is to use the fetch_statistics method of the Queue. That way the script can monitor what's going on in the pull queue and know when there are no more items left in it. This is obviously available via the REST API for queues, but it seems rather backward to use that when I am working with the queues internally.
So I am making the Queue.fetch_statistics() call, but it throws an error. I've tried putting the stated error into Google, but it returns nothing. Same here on Stack Overflow.
It always throws:
AttributeError: type object 'QueueStatistics' has no attribute '_QueueStatistics__TranslateError'
My code is:
q = taskqueue.Queue('reporting-pull')
try:
    logging.debug(q.fetch_statistics())
except Exception, e:
    logging.exception(e)
Can anyone shed any light on this? Am I doing something really stupid here?

Just in case it is useful to anyone else, here is an example function to get you started fetching queue info from your app. It's only an example and could do with better error handling, but it should get you up and running. Previously we had used the Taskqueue client, but I thought that was a bit of overkill as we can lease and delete in the code anyway, so I used app identity, and it worked a treat.
from google.appengine.api import taskqueue
from google.appengine.api import app_identity
from google.appengine.api import urlfetch
try:
    import json
except ImportError:
    import simplejson as json
import logging

def get_queue_info(queue_name, stats=False):
    '''
    Uses the Queue REST API to fetch queue info

    Args:
        queue_name: string - the name of the queue
        stats: boolean - get the stats info too

    Returns:
        dict from the JSON response, or False on failure
    '''
    scope = 'https://www.googleapis.com/auth/taskqueue'
    authorization_token, _ = app_identity.get_access_token(scope)
    app_id = app_identity.get_application_id()
    # note the s~ prefix denoting HRD; it's not mentioned in the docs as far
    # as I can see, but it won't work without it
    uri = 'https://www.googleapis.com/taskqueue/v1beta1/projects/s~%s/taskqueues/%s?getStats=%s' % (app_id, queue_name, stats)
    # make the call to the API
    response = urlfetch.fetch(uri, method="GET", headers={"Authorization": "OAuth " + authorization_token})
    if response.status_code == 200:
        result = json.loads(response.content)
    else:
        logging.error('could not get queue')
        logging.error(response.status_code)
        logging.error(response.content)
        return False
    return result
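A quick usage sketch follows; the 'stats' and 'totalTasks' keys are assumptions about the shape of the REST response, so log the full dict once to confirm them before relying on this.
# Hypothetical usage of the function above; 'stats' and 'totalTasks' are
# assumed response keys -- inspect the logged dict to confirm.
info = get_queue_info('reporting-pull', stats=True)
if info:
    logging.info('queue info: %s', info)
    if info.get('stats', {}).get('totalTasks', 0) == 0:
        # queue looks drained; trigger the next step here
        pass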
Don't forget to update your queue.yaml with the acl for your app identity:
- name: queue_name
  mode: pull
  acl:
  - user_email: myappid@appspot.gserviceaccount.com
I hope someone finds this useful.
In the meantime I have posted a feature request so we can do this with the Queue object; please go and star it if you want it too: http://goo.gl/W8Pk1

The Task Queue Statistics API is now documented and publicly available. The error no longer occurs.
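For anyone landing here now, a minimal sketch of using the call directly; the attribute names (tasks, oldest_eta_usec) are taken from the QueueStatistics documentation, so double-check them against your SDK version.
import logging
from google.appengine.api import taskqueue

# Sketch only: 'tasks' and 'oldest_eta_usec' are the documented
# QueueStatistics attributes -- verify them against your SDK version.
def queue_is_empty(queue_name='reporting-pull'):
    stats = taskqueue.Queue(queue_name).fetch_statistics()
    logging.debug('queue %s: ~%s tasks, oldest eta %s',
                  queue_name, stats.tasks, stats.oldest_eta_usec)
    return stats.tasks == 0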

The immediate reason for the specific error you're getting seems to be a bug in the SDK code: Queue.fetch_statistics() calls QueueStatistics.fetch(), which calls QueueStatistics._FetchMultipleQueues(), which apparently encounters an apiproxy_errors.ApplicationError and then tries to call cls.__TranslateError(), but there is no such method on the QueueStatistics class.
I don't know the deeper reason for the ApplicationError, but it may mean that the feature is not yet supported by the production runtime.

Related

Python multiprocess queue can get() some objects but not others

In the code below when I call .get() in get_query_result I actually get an ImportError as described in the function comment. The actual result in do_query is a large dictionary. It might be in the 10s of MB range. I've verified that this large object can be pickled, and that when adding it to the output queue there are no errors. Simpler objects work just fine in place of result.
Of course I've distilled this code down from its actual form, but it seems like perhaps there's some issue with the size of result, and perhaps it's just not deserializable when trying to read it off the queue? I've even tried adding a sleep before calling .get() in case the internal mechanisms of SimpleQueue need time to get all the bytes into the queue, but it didn't make a difference. I've also tried using multiprocessing.Queue and pipes, with the same result each time. Also, the module name in the error message is slightly different each time, making me think this has something to do with the object in the queue being incorrectly serialized or something.
Your help is greatly appreciated.
import multiprocessing
from multiprocessing import queues
import pickle

def do_query(recv_queue, send_queue):
    while True:
        item = recv_queue.get()
        #result = long_running_function(item)
        try:
            test = pickle.loads(pickle.dumps(result))
            # this works and prints this large object out just fine
            print(test)
        except pickle.PicklingError:
            print('obj was not pickleable')
        try:
            send_queue.put([result])
        except Exception as e:
            # this also never happens
            print('Exception when calling .put')

class QueueWrapper:
    def __init__(self):
        self._recv_queue = queues.SimpleQueue()
        self._send_queue = queues.SimpleQueue()
        self._proc = None

    def send_query(self, query):
        self._send_queue.put(query)

    def get_query_result(self):
        '''
        When calling .get() I get "ImportError: No module named tmp_1YYBC\n"
        If I replace 'result' in do_query with something like ['test'] it works fine.
        '''
        return self._recv_queue.get()

    def init(self):
        self._proc = multiprocessing.Process(target=do_query, args=(self._send_queue, self._recv_queue))
        self._proc.start()

if __name__ == '__main__':
    queue_wrapper = QueueWrapper()
    queue_wrapper.init()
    queue_wrapper.send_query('test')
    # import error raised when calling below
    queue_wrapper.get_query_result()
Edit1: If I change the code to pickle result myself and then send that pickled result through the queue, I'm able to successfully call .get on the other end of the queue. However, when I go to unpickle that item I get the same error as before. To recap: I can pickle and unpickle the object in the process running do_query just fine, and if I pickle it myself I can send it between processes just fine, but when I go to unpickle it manually on the other end I get an error. It almost seems like I'm able to read off the queue before it's done being written to or something? That shouldn't be the case if I'm understanding .get and .put correctly.
Edit2: After some more digging I see that type(result) returns <class tmp_1YYBC._sensor_msgs__Image>, which is not correct; it should be just sensor_msgs.msg._Image.Image. It's interesting to note that this weird prefix appears in my error message. If I construct a new Image, copy all the data from result into it, and send that newly created object through the queue, I get the exact same error message... It seems like pickle, or the other process in general, has trouble knowing how to construct this image object?
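One way around this class of error (a sketch, not a confirmed fix for this exact case) is to avoid pickling the dynamically generated class at all: have the worker convert the result to plain built-in types, put those on the queue, and rebuild whatever you need on the other side. The field names below are illustrative for an image-like message, not taken from the question.
# Sketch: send only built-in types through the queue so the receiving
# process never has to import the sender's (temporary) message module.
# Field names (height, width, encoding, data) are illustrative assumptions.
def to_plain(result):
    return {
        'height': result.height,
        'width': result.width,
        'encoding': result.encoding,
        'data': bytes(result.data),
    }

# In do_query, instead of send_queue.put([result]):
#     send_queue.put(to_plain(result))
# The consumer then works with the plain dict (or rebuilds its own object).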

tda-api assigning streaming endpoints to variables

I am new to Python APIs. I cannot get the script to return a value. Could anyone point me in the right direction, please? I cannot get the lambda function to work properly. I am trying to save the streamed data into variables to use with a set of operations.
from tda.auth import easy_client
from tda.client import Client
from tda.streaming import StreamClient
import asyncio
import json
import config
import pathlib
import math
import pandas as pd

client = easy_client(
    api_key=config.API_KEY,
    redirect_uri=config.REDIRECT_URI,
    token_path=config.TOKEN_PATH)

stream_client = StreamClient(client, account_id=config.ACCOUNT_ID)

async def read_stream():
    login = asyncio.create_task(stream_client.login())
    await login

    service = asyncio.create_task(stream_client.quality_of_service(StreamClient.QOSLevel.EXPRESS))
    await service

    book_snapshots = {}

    def my_nasdaq_book_handler(msg):
        book_snapshots.update(msg)

    stream_client.add_nasdaq_book_handler(my_nasdaq_book_handler)
    stream = stream_client.nasdaq_book_subs(['GOOG', 'AAPL', 'FB'])
    await stream

    while True:
        await stream_client.handle_message()
        print(book_snapshots)

asyncio.run(read_stream())
Callbacks
This (wrong) assumption
stream_client.add_nasdaq_book_handler() contains all the trade data.
shows a misunderstanding of the callback concept. Typically, a naming pattern like add ... handler indicates that this concept is being used. There is also this comment in the boilerplate code from the Streaming Client docs:
# Always add handlers before subscribing because many streams start sending
# data immediately after success, and messages with no handlers are dropped.
which consistently talks about subscribing; this word, too, is a strong indicator.
The basic principle of a callback is that instead of pulling the information from a service (and being blocked until it's available), you enable that service to push the information to you when it becomes available. You typically do this by first registering one (or more) interests with the service and then waiting for things to come.
In the section Handling Messages they give an example of a handler function (to be provided by you):
def sample_handler(msg):
    print(json.dumps(msg, indent=4))
which takes a msg argument that is dumped in JSON format to the console. The lambda in your example does exactly the same.
Lambdas
it's not possible to return a value from a lambda function because it is anonymous
This is not correct. If lambda functions weren't able to return values, they wouldn't play such an important role. See 4.7.6. Lambda Expressions in the Python 3 docs.
The problem in your case is that neither function does anything you want; both just print to the console. Now you need to get into these functions and tell them what to do.
Control
Actually, your program runs within this loop
while True:
    await stream_client.handle_message()
Each stream_client.handle_message() call eventually results in a call to the function you registered via stream_client.add_nasdaq_book_handler. So that's the point: your script defines what to do when messages arrive before it starts waiting.
For example, your function could just collect the arriving messages:
book_snapshots = []

def my_nasdaq_book_handler(msg):
    book_snapshots.append(msg)
A global object book_snapshots is used in the implementation. You may expand or change this function at will (of course, translating the information into JSON format will help you access it in a structured way). This line registers your function:
stream_client.add_nasdaq_book_handler(my_nasdaq_book_handler)
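Going one step further, the handler itself is the place to unpack the streamed values into ordinary variables. A minimal sketch, assuming the message is a dict whose content list entries carry a key field for the symbol (these field names are assumptions; print one raw message first to confirm the real layout):
# Sketch: keep only the latest snapshot per symbol so later code can read
# plain variables. 'content' and 'key' are assumed field names in the
# tda-api nasdaq book message -- print a raw msg to confirm them.
latest_books = {}

def my_nasdaq_book_handler(msg):
    for entry in msg.get('content', []):      # assumed structure
        symbol = entry.get('key')             # assumed field name
        if symbol:
            latest_books[symbol] = entry      # e.g. latest_books['AAPL']
Registered the same way with stream_client.add_nasdaq_book_handler, this gives the rest of the script a plain dict to work with.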

Why does Firebase event return empty object on second and subsequent events?

I have a Python Firebase SDK on the server, which writes to Firebase real-time DB.
I have a Javascript Firebase client on the browser, which registers itself as a listener for "child_added" events.
Authentication is handled by the Python server.
With Firebase rules allowing reads, the client listener gets data on the first event (all data at that FB location), but only a key with empty data on subsequent child_added events.
Here's the listener registration:
firebaseRef.on(
    "child_added",
    function(snapshot, prevChildKey)
    {
        console.log("FIREBASE REF: ", firebaseRef);
        console.log("FIREBASE KEY: ", snapshot.key);
        console.log("FIREBASE VALUE: ", snapshot.val());
    }
);
"REF" is always good.
"KEY" is always good.
But "VALUE" is empty after the first full retrieval of that db location.
I tried instantiating the firebase reference each time anew inside the listen function. Same result.
I tried a "value" event instead of "child_added". No improvement.
The data on the Firebase side looks perfect in the FB console.
Here's how the data is being written by the Python admin to firebase:
def push_value(rootAddr, childAddr, data):
    try:
        ref = db.reference(rootAddr)
        posts_ref = ref.child(childAddr)
        new_post_ref = posts_ref.push()
        new_post_ref.set(data)
    except Exception:
        raise
And as I said, this works perfectly to put the data at the correct place in FB.
Why the empty event objects after the first download of the database, on subsequent events?
I found the answer. Like most things, it turned out to be simple, but took a couple of days to find. Maybe this will save someone else.
On the docs page:
http://firebase.google.com/docs/database/admin/save-data#section-push
"In JavaScript and Python, the pattern of calling push() and then
immediately calling set() is so common that the Firebase SDK lets you
combine them by passing the data to be set directly to push() as
follows..."
I suggest the wording should emphasize that you must do it that way.
The earlier Python example on the same page doesn't work:
new_post_ref = posts_ref.push()
new_post_ref.set({
    'author': 'gracehop',
    'title': 'Announcing COBOL, a New Programming Language'
})
A separate empty push() followed by set(data), as in this example, won't work in Python and JavaScript, because in those SDKs push() implicitly also does a set(): the empty push triggers unwanted event listeners with empty data, and the subsequent set(data) doesn't trigger an event with the data either.
In other words, the code in the question:
new_post_ref = posts_ref.push()
new_post_ref.set(data)
must be:
new_post_ref = posts_ref.push(data)
with set() not explicitly called.
Since this push() code happens only when new objects are written to FB, the initial download to the client wasn't affected.
Though the documentation may be trying to convey the evolution of the design, it fails to point out that only the last Python and JavaScript example given will work and the others shouldn't be used.
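For completeness, a sketch of the question's writer function rewritten the way described above (assuming the firebase_admin db module is what's imported as db):
def push_value(rootAddr, childAddr, data):
    # Pass the data straight to push(); it creates the child and sets its
    # value in one step, so listeners see a single child_added event that
    # already carries the data.
    ref = db.reference(rootAddr)
    posts_ref = ref.child(childAddr)
    new_post_ref = posts_ref.push(data)
    return new_post_ref.key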

django/python : error when get value from dictionary

I have Python/Django code hosted at dotCloud and Red Hat OpenShift. To handle different users, I use a token and save it in a dictionary. But when I get the value from the dict, it sometimes throws an error (KeyError).
import threading

thread_queue = {}

def download(request):
    dl_val = request.POST["input1"]
    client_token = str(request.POST["pagecookie"])
    # save client token as key and thread object as value in dictionary
    thread_queue[client_token] = DownloadThread(dl_val, client_token)
    thread_queue[client_token].start()
    return render_to_response("progress.html",
        { "dl_val" : dl_val, "token" : client_token })
The code below is executed at one-second intervals via a JavaScript XMLHttpRequest to the server.
It checks a variable inside another thread and returns the value to the user's page.
def downloadProgress(request, token):
    # sometimes I use this to check the content of the dict
    #resp = HttpResponse("thread_queue = "+str(thread_queue))
    #return resp
    prog, total = thread_queue[str(token)].getValue() # problematic line !
    if prog == 0:
        # prevent division by zero
        return HttpResponse("0")
    percent = float(prog) / float(total)
    percent = round(percent*100, 2)
    if percent >= 100:
        try:
            f_name = thread_queue[token].getFileName()[1]
        except:
            downloadProgress(request, token)
        resp = HttpResponse('<a href="http://'+request.META['HTTP_HOST']+
            '/dl/'+token+'/">'+f_name+'</a><br />')
        return resp
    else:
        return HttpResponse(str(percent))
After testing for several days, it sometimes returns:
thread_queue = {}
It sometimes succeeds :
thread_queue = {'wFVdMDF9a2qSQCAXi7za': , 'EVukb7QdNdDgCf2ZtVSw': , 'C7pkqYRvRadTfEce5j2b': , '2xPFhR6wm9bs9BEQNfdd': }
I never get this result when running Django locally via manage.py runserver and accessing it with Google Chrome, but when I upload it to dotCloud or OpenShift, it always gives the above problem.
My question:
How can I solve this problem?
Do dotCloud and OpenShift limit Python CPU usage?
Or is the problem inside the Python dictionary?
Thank you.
dotCloud has 4 worker processes by default for the Python service. When you run the dev server locally, you are only running one process. Like @Martijn said, your issue is related to the fact that your dict isn't going to be shared between these processes.
To fix this issue, you could use something like redis or memcached to store this information instead. If you need a more long term storage solution, then using a database is probably better suited.
dotCloud does not limit CPU usage. The CPU is shared among the other services on the same host and allows bursting, but in the end everyone gets the same amount of CPU.
Looking at your code, you should check to make sure there is a value in the dict before you access it, or at a minimum surround the code with a try/except block to handle the case where the data isn't there.
str_token = str(token)
if str_token in thread_queue:
    prog, total = thread_queue[str_token].getValue() # problematic line !
else:
    # value isn't there, do something else
    pass
Presumably dotCloud and OpenShift run multiple processes of your code; the dict is not going to be shared between these processes.
Note that this also means the extra processes will not have access to your extra thread either.
Use an external database for this kind of information instead. For long-running asynchronous jobs like these you also need to run them in a separate worker process. Look at Celery for an all-in-one solution for asynchronous job handling, for example.
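To make the "shared store instead of a module-level dict" advice concrete, here is a minimal sketch using Django's cache framework (backed by memcached or Redis). The key naming and the idea of the worker reporting its own progress are illustrative assumptions, not the poster's actual design:
from django.core.cache import cache
from django.http import HttpResponse

# The download worker periodically records its own progress,
# keyed by the client token (hypothetical key layout).
def report_progress(token, prog, total):
    cache.set('dl_progress_%s' % token, (prog, total), timeout=3600)

# The view reads progress from the shared cache instead of a per-process
# dict, so it works with any number of worker processes.
def downloadProgress(request, token):
    value = cache.get('dl_progress_%s' % str(token))
    if value is None:
        return HttpResponse("0")   # no progress recorded (yet)
    prog, total = value
    percent = round(float(prog) / float(total) * 100, 2)
    return HttpResponse(str(percent))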

Is get_result() a required call for put_async() in Google App Engine

With the new release of GAE 1.5.0, we now have an easy way to do async datastore calls. Are we required to call get_result() after calling put_async()?
For example, if I have a model called MyLogData, can I just call:
put_async(MyLogData(text="My Text"))
right before my handler returns without calling the matching get_result()?
Does GAE automatically block on any pending calls before sending the result to the client?
Note that I don't really care to handle error conditions. i.e. I don't mind if some of these puts fail.
I don't think there is any sure way to know if get_result() is required unless someone on the GAE team verifies this, but I think it's not needed. Here is how I tested it.
I wrote a simple handler:
class DB_TempTestModel(db.Model):
    data = db.BlobProperty()

class MyHandler(webapp.RequestHandler):
    def get(self):
        starttime = datetime.datetime.now()
        lots_of_data = ' '*500000
        if self.request.get('a') == '1':
            db.put(DB_TempTestModel(data=lots_of_data))
            db.put(DB_TempTestModel(data=lots_of_data))
            db.put(DB_TempTestModel(data=lots_of_data))
            db.put(DB_TempTestModel(data=lots_of_data))
        if self.request.get('a') == '2':
            db.put_async(DB_TempTestModel(data=lots_of_data))
            db.put_async(DB_TempTestModel(data=lots_of_data))
            db.put_async(DB_TempTestModel(data=lots_of_data))
            db.put_async(DB_TempTestModel(data=lots_of_data))
        self.response.out.write(str(datetime.datetime.now()-starttime))
I ran it a bunch of times on a High Replication Application.
The data was always there, making me believe that unless there is a failure in the datastore side of things (unlikely), it's gonna be written.
Here's the interesting part. When the data was written with put_async() (?a=2), the request was processed on average about 2 to 3 times as fast as with put() (?a=1) (not a very scientific test; just eyeballing it).
But the cpu_ms and api_cpu_ms were the same for both ?a=1 and ?a=2.
From the logs:
ms=440 cpu_ms=627 api_cpu_ms=580 cpm_usd=0.036244
vs
ms=149 cpu_ms=627 api_cpu_ms=580 cpm_usd=0.036244
On the client side, the network latency of the requests showed the same pattern, i.e. ?a=2 requests were at least 2 times faster. Definitely a win on the client side... but there seems to be no gain on the server side.
Anyone on the GAE team care to comment?
db.put_async works fine without get_result when deployed (in fire-and-forget style), but locally it won't take action until get_result gets called (more context).
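If you do want the write to be verified (and to behave the same on the dev server, per the point above), a minimal sketch is to hold on to the async handles and resolve them before the handler returns. MyLogData is the hypothetical model from the question:
class MyLogHandler(webapp.RequestHandler):
    def get(self):
        # Start the puts without blocking the request.
        futures = [db.put_async(MyLogData(text="My Text")),
                   db.put_async(MyLogData(text="More text"))]

        # ... do other work while the RPCs are in flight ...

        # Resolve them before returning so any datastore error surfaces in
        # this request and the dev server actually performs the writes.
        for future in futures:
            future.get_result()
        self.response.out.write('done')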
I dunno, but this works:
import datetime
from google.appengine.api import urlfetch

def main():
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, "some://artificially/slow.url")
    print "Content-type: text/plain"
    print
    print str(datetime.datetime.now())

if __name__ == '__main__':
    main()
The remote URL sleeps 3 seconds and then sends me an email. The App Engine handler returns immediately, and the remote URL completes as expected. Since both services abstract the same underlying RPC framework, I would guess the datastore behaves similarly.
Good question, though. Perhaps Nick or another Googler can answer definitively.
