Efficient API requests during iteration - python

So I'm looking for a way to speed up the output of the following code, calling google's natural language API:
tweets = json.load(input)
client = language.LanguageServiceClient()
sentiment_tweets = []
iterations = 1000
start = timeit.default_timer()
for i, text in enumerate(d['text'] for d in tweets):
document = types.Document(
content=text,
type=enums.Document.Type.PLAIN_TEXT)
sentiment = client.analyze_sentiment(document=document).document_sentiment
results = {'text': text, 'sentiment':sentiment.score, 'magnitude':sentiment.magnitude}
sentiment_tweets.append(results)
if (i % iterations) == 0:
print(i, " tweets processed")
sentiment_tweets_json = [json.dumps(sentiments) for sentiments in sentiment_tweets]
stop = timeit.default_timer()
The issue is the tweets list is around 100k entries, iterating and making calls one by one does not produce an output on a feasible timescale. I'm exploring potentially using asyncio for parallel calls, although as I'm still a beginner with Python and unfamiliar with the package, I'm not sure if you can make a function a coroutine with itself such that each instance of the function iterates through the list as expected, progressing sequentially. There is also the question of managing the total number of calls made by the app to be within the defined quota limits of the API. Just wanted to know if I was going in the right direction.

I use this method for concurrent calls:
from concurrent import futures as cf
def execute_all(mfs: list, max_workers: int = None):
"""Excecute concurrently and mfs list.
Parameters
----------
mfs : list
[mfs1, mfs2,...]
mfsN = {
tag: str,
fn: function,
kwargs: dict
}
.
max_workers : int
Description of parameter `max_workers`.
Returns
-------
dict
{status, result, error}
status = {tag1, tag2,..}
result = {tag1, tag2,..}
error = {tag1, tag2,..}
"""
result = {
'status': {},
'result': {},
'error': {}
}
max_workers = len(mfs)
with cf.ThreadPoolExecutor(max_workers=max_workers) as exec:
my_futures = {
exec.submit(x['fn'], **x['kwargs']): x['tag'] for x in mfs
}
for future in cf.as_completed(my_futures):
tag = my_futures[future]
try:
result['result'][tag] = future.result()
result['status'][tag] = 0
except Exception as err:
result['error'][tag] = err
result['result'][tag] = None
result['status'][tag] = 1
return result
Where each result returns indexed by a given tag (if matters to you identify the which call return which result) when:
mfs = [
{
'tag': 'tweet1',
'fn': process_tweet,
'kwargs': {
'tweet': tweet1
}
},
{
'tag': 'tweet2',
'fn': process_tweet,
'kwargs': {
'tweet': tweet2
}
},
]
results = execute_all(mfs, 2)

While async is one way you could go, another that might be easier is using the multiprocessing python functionalities.
from multiprocessing import Pool
def process_tweet(tweet):
pass # Fill in the blanks here
# Use five processes at once
with Pool(5) as p:
processes_tweets = p.map(process_tweet, tweets, 1)
In this case "tweets" is an iterator of some sort, and each element of that iterator will get passed to your function. The map function will make sure the results come back in the same order the arguments were supplied.

Related

asynchroneous error handling and response processing of an unbounded list of tasks using zeep

So here is my use case:
I read from a database rows containing information to make a complex SOAP call (I'm using zeep to do these calls).
One row from the database corresponds to a request to the service.
There can be up to 20 thousand lines, so I don't want to read everything in memory before making the calls.
I need to process the responses - when the
response is OK, I need to store some returned information back into
my database, and when there is an exception I need to process the
exception for that particular request/response pair.
I need also to capture some external information at the time of the request creation, so that I know where to store the response from the request. In my current code I'm using the delightful property of gather() that makes the results come in the same order.
I read the relevant PEPs and Python documentation but I'm still very confused, as there seems to be multiple ways to solve the same problem.
I also went through countless exercises on the web, but the examples are all trivial - it's either asyncio.sleep() or some webscraping with a finite list of urls.
The solution that I have come up so far kinda works - the asyncio.gather() method is very, very, useful, but I have not been able to 'feed' it from a generator. I'm currently just counting to an arbitrary size and then starting a .gather() operation. I've transcribed the code, with boring parts left out and I've tried to anonymise the code
I've tried solutions involving semaphores, queues, different event loops, but I'm failing every time. Ideally I'd like to be able to create Futures 'continuously' - I think I'm missing the logic of 'convert this awaitable call to a future'
I'd be grateful for any help!
import asyncio
from asyncio import Future
import zeep
from zeep.plugins import HistoryPlugin
history = HistoryPlugin()
max_concurrent_calls = 5
provoke_errors = True
def export_data_async(db_variant: str, order_nrs: set):
st = time.time()
results = []
loop = asyncio.get_event_loop()
def get_client1(service_name: str, system: Systems = Systems.ACME) -> Tuple[zeep.Client, zeep.client.Factory]:
client1 = zeep.Client(wsdl=system.wsdl_url(service_name=service_name),
transport=transport,
plugins=[history],
)
factory_ns2 = client1.type_factory(namespace='ns2')
return client1, factory_ns2
table = 'ZZZZ'
moveback_table = 'EEEEEE'
moveback_dict = create_default_empty_ordered_dict('attribute1 attribute2 attribute3 attribute3')
client, factory = get_client1(service_name='ACMEServiceName')
if log.isEnabledFor(logging.DEBUG):
client.wsdl.dump()
zeep_log = logging.getLogger('zeep.transports')
zeep_log.setLevel(logging.DEBUG)
with Db(db_variant) as db:
db.open_db(CON_STRING[db_variant])
db.init_table_for_read(table, order_list=order_nrs)
counter_failures = 0
tasks = []
sids = []
results = []
def handle_future(future: Future) -> None:
results.extend(future.result())
def process_tasks_concurrently() -> None:
nonlocal tasks, sids, counter_failures, results
futures = asyncio.gather(*tasks, return_exceptions=True)
futures.add_done_callback(handle_future)
loop.run_until_complete(futures)
for i, response_or_fault in enumerate(results):
if type(response_or_fault) in [zeep.exceptions.Fault, zeep.exceptions.TransportError]:
counter_failures += 1
log_webservice_fault(sid=sids[i], db=db, err=response_or_fault, object=table)
else:
db.write_dict_to_table(
moveback_table,
{'sid': sids[i],
'attribute1': response_or_fault['XXX']['XXX']['xxx'],
'attribute2': response_or_fault['XXX']['XXX']['XXXX']['XXX'],
'attribute3': response_or_fault['XXXX']['XXXX']['XXX'],
}
)
db.commit_db_con()
tasks = []
sids = []
results = []
return
for row in db.rows(table):
if int(row.id) % 2 == 0 and provoke_errors:
payload = faulty_message_payload(row=row,
factory=factory,
)
else:
payload = message_payload(row=row,
factory=factory,
)
tasks.append(client.service.myRequest(
MessageHeader=factory.MessageHeader(**message_header_arguments(row=row)),
myRequestPayload=payload,
_soapheaders=[security_soap_header],
))
sids.append(row.sid)
if len(tasks) == max_concurrent_calls:
process_tasks_concurrently()
if tasks: # this is the remainder of len(db.rows) % max_concurrent_calls
process_tasks_concurrently()
loop.run_until_complete(transport.session.close())
db.execute_this_statement(statement=update_sql)
db.commit_db_con()
log.info(db.activity_log)
if counter_failures:
log.info(f"{table :<25} Count failed: {counter_failures}")
print("time async: %.2f" % (time.time() - st))
return results
Failed attempt with Queue: (blocks at await client.service)
loop = asyncio.get_event_loop()
counter = 0
results = []
async def payload_generator(db_variant: str, order_nrs: set):
# code that generates the data for the request
yield counter, row, payload
async def service_call_worker(queue, results):
while True:
counter, row, payload = await queue.get()
results.append(await client.service.myServicename(
MessageHeader=calculate_message_header(row=row)),
myPayload=payload,
_soapheaders=[security_soap_header],
)
)
print(colorama.Fore.BLUE + f'after result returned {counter}')
# Here do the relevant processing of response or error
queue.task_done()
async def main_with_q():
n_workers = 3
queue = asyncio.Queue(n_workers)
e = pprint.pformat(queue)
p = payload_generator(DB_VARIANT, order_list_from_args())
results = []
workers = [asyncio.create_task(service_call_worker(queue, results))
for _ in range(n_workers)]
async for c in p:
await queue.put(c)
await queue.join() # wait for all tasks to be processed
for worker in workers:
worker.cancel()
if __name__ == '__main__':
try:
loop.run_until_complete(main_with_q())
loop.run_until_complete(transport.session.close())
finally:
loop.close()

While loop with two conditionals

Can't seem to figure out why this two conditional while loop is not working. It should iterate over JSON until it finds one where both 'producer' = producer given & transactions list is not empty.
It seems to be treating it as an OR, because to stops as soon as transactions are not empty.
# Performs a get info and looks for producer name, if found returns head_block num.
def get_testnetproducer_cpustats(producer):
# Get current headblock from testnet
current_headblock = headblock("testnet")
# Set producer to random name
blockproducer = "nobody"
transactions = []
while producer != blockproducer and len(transactions) == 0:
currentblock = getblock(current_headblock)
print(currentblock)
transactions = currentblock['transactions']
# Set producer of current block
blockproducer = currentblock['producer']
print(blockproducer)
# Deduct 1 from current block to check next block
current_headblock = current_headblock - 1
else:
return print(current_headblock)
Here is an example of the JSON:
{'timestamp': '2020-08-16T20:33:11.000', 'producer': 'waxtribetest', 'confirmed': 0, 'previous': '029abe7da7da6691681eb9e9c6310532ed313a93dca25026f7f3059d2765177f', 'transaction_mroot': '0000000000000000000000000000000000000000000000000000000000000000', 'action_mroot': '08bc38e7a4f1a832c3f9986e9c6553add7bbef8f0cfbe274772684b4f9d9a1a1', 'schedule_version': 4418, 'new_producers': None, 'producer_signature': 'SIG_K1_JzwAkDxatuPKBYVfP718JmNMTt5YGAR1E9Z5bBB4T3HtJxvDZq6SZU7HdjNK5SLigSAByWesPmPRjznWPF634czzRtbVwJ', 'transactions': [], 'id': '029abe7ec71e2af243c499b22036c350143d46054dec1659b4c3847650d7a3d2', 'block_num': 43695742, 'ref_block_prefix': 2996421699}
As #barny noted, use or instead of and:
while producer != blockproducer or len(transactions) == 0:

Fastest way to delete a Collection from Firestore?

I have an application that loads millions of documents to a collection, using 30-80 workers to simultaneously load the data. Sometimes, I find that the loading process didn't complete smoothly, and with other databases I can simply delete the table and start over, but not with Firestore collections. I have to list the documents and delete them, and I've not found a way to scale this with the same capacity as my loading process. What I'm doing now is that I have two AppEngine hosted Flask/Python methods, one to get a page of 1000 documents and pass to another method to delete them. This way the process to list documents is not blocked by the process to delete them. It's still taking days to complete which is too long.
Method to get list of documents and create a task to delete them, which is single threaded:
#app.route('/delete_collection/<collection_name>/<batch_size>', methods=['POST'])
def delete_collection(collection_name, batch_size):
batch_size = int(batch_size)
coll_ref = db.collection(collection_name)
print('Received request to delete collection {} {} docs at a time'.format(
collection_name,
batch_size
))
num_docs = batch_size
while num_docs >= batch_size:
docs = coll_ref.limit(batch_size).stream()
found = 0
deletion_request = {
'doc_ids': []
}
for doc in docs:
deletion_request['doc_ids'].append(doc.id)
found += 1
num_docs = found
print('Creating request to delete docs: {}'.format(
json.dumps(deletion_request)
))
# Add to task queue
queue = tasks_client.queue_path(PROJECT_ID, LOCATION, 'database-manager')
task_meet = {
'app_engine_http_request': { # Specify the type of request.
'http_method': 'POST',
'relative_uri': '/delete_documents/{}'.format(
collection_name
),
'body': json.dumps(deletion_request).encode(),
'headers': {
'Content-Type': 'application/json'
}
}
}
task_response_meet = tasks_client.create_task(queue, task_meet)
print('Created task to delete {} docs: {}'.format(
batch_size,
json.dumps(deletion_request)
))
Here is the method I use to delete the documents, which can scale. In effect it only processes 5-10 at a time, limited by the rate which the other method passes pages of doc_ids to delete. Separating the two helps, but not that much.
#app.route('/delete_documents/<collection_name>', methods=['POST'])
def delete_documents(collection_name):
# Validate we got a body in the POST
if flask.request.json:
print('Request received to delete docs from :{}'.format(collection_name))
else:
message = 'No json found in request: {}'.format(flask.request)
print(message)
return message, 400
# Validate that the payload includes a list of doc_ids
doc_ids = flask.request.json.get('doc_ids', None)
if doc_ids is None:
return 'No doc_ids specified in payload: {}'.format(flask.request.json), 400
print('Received request to delete docs: {}'.format(doc_ids))
for doc_id in doc_ids:
db.collection(collection_name).document(doc_id).delete()
return 'Finished'
if __name__ == '__main__':
# Set environment variables for running locally
app.run(host='127.0.0.1', port=8080, debug=True)
I've tried running multiple concurrent executions of delete_collection(), but am not certain that even helps, as I'm not sure if every time it calls limit(batch_size).stream() that it gets a distinct set of documents or possibly is getting duplicates.
How can I make this run faster?
This is what I came up with. It's not super fast (120-150 docs per second), but all the other examples I found in python didn't work at all:
now = datetime.now()
then = now - timedelta(days=DOCUMENT_EXPIRATION_DAYS)
doc_counter = 0
commit_counter = 0
limit = 5000
while True:
docs = []
print('Getting next doc handler')
docs = [snapshot for snapshot in db.collection(collection_name)
.where('id.time', '<=', then)
.limit(limit)
.order_by('id.time', direction=firestore.Query.ASCENDING
).stream()]
batch = db.batch()
for doc in docs:
doc_counter = doc_counter + 1
if doc_counter % 500 == 0:
commit_counter += 1
print('Committing batch {} from {}'.format(commit_counter, doc.to_dict()['id']['time']))
batch.commit()
batch.delete(doc.reference)
batch.commit()
if len(docs) == limit:
continue
break
print('Deleted {} documents in {} seconds.'.format(doc_counter, datetime.now() - now))
As mentioned in the other comments, .stream() has a 60 second deadline. This iterative structure sets a limit of 5000 after which .stream() is called again, which keeps it under the 60 second limit. If anybody knows how to speed this up, let me know.
Here is my simple Python script that I used to test batch deletes. Like #Chris32 said, the batch mode will delete thousands of documents per second if latency isn't too bad.
from time import time
from uuid import uuid4
from google.cloud import firestore
DB = firestore.Client()
def generate_user_data(entries = 10):
print('Creating {} documents'.format(entries))
now = time()
batch = DB.batch()
for counter in range(entries):
# Each transaction or batch of writes can write to a maximum of 500 documents.
# https://cloud.google.com/firestore/quotas#writes_and_transactions
if counter % 500 == 0 and counter > 0:
batch.commit()
user_id = str(uuid4())
data = {
"some_data": str(uuid4()),
"expires_at": int(now)
}
user_ref = DB.collection(u'users').document(user_id)
batch.set(user_ref, data)
batch.commit()
print('Wrote {} documents in {:.2f} seconds.'.format(entries, time() - now))
def delete_one_by_one():
print('Deleting documents one by one')
now = time()
docs = DB.collection(u'users').where(u'expires_at', u'<=', int(now)).stream()
counter = 0
for doc in docs:
doc.reference.delete()
counter = counter + 1
print('Deleted {} documents in {:.2f} seconds.'.format(counter, time() - now))
def delete_in_batch():
print('Deleting documents in batch')
now = time()
docs = DB.collection(u'users').where(u'expires_at', u'<=', int(now)).stream()
batch = DB.batch()
counter = 0
for doc in docs:
counter = counter + 1
if counter % 500 == 0:
batch.commit()
batch.delete(doc.reference)
batch.commit()
print('Deleted {} documents in {:.2f} seconds.'.format(counter, time() - now))
generate_user_data(10)
delete_one_by_one()
print('###')
generate_user_data(10)
delete_in_batch()
print('###')
generate_user_data(2000)
delete_in_batch()
In this public documentation is described how using a callable Cloud Function you can take advantage of the firestore delete command in the Firebase Command Line Interface deleting up to 4000 documents per second.

How to mock a redis client in Python?

I just found that a bunch of unit tests are failing, due a developer hasn't mocked out the dependency to a redis client within the test. I'm trying to give a hand in this matter but have difficulties myself.
The method writes to a redis client:
redis_client = get_redis_client()
redis_client.set('temp-facility-data', cPickle.dumps(df))
Later in the assert the result is retrieved:
res = cPickle.loads(get_redis_client().get('temp-facility-data'))
expected = pd.Series([set([1, 2, 3, 4, 5])], index=[1])
assert_series_equal(res.variation_pks, expected)
I managed to patch the redis client's get() and set() successfully.
#mock.patch('redis.StrictRedis.get')
#mock.patch('redis.StrictRedis.set')
def test_identical(self, mock_redis_set, mock_redis_get):
mock_redis_get.return_value = ???
f2 = deepcopy(self.f)
f3 = deepcopy(self.f)
f2.pk = 2
f3.pk = 3
self.one_row(f2, f3)
but I don't know how to set the return_value of get() to what the set() would set in the code, so that the test would pass.
Right now this line fails the test:
res = cPickle.loads(get_redis_client().get('temp-facility-data'))
TypeError: must be string, not MagicMock
Any advice please?
Think you can use side effect to set and get value in a local dict
data = {}
def set(key, val):
data[key] = val
def get(key):
return data[key]
mock_redis_set.side_effect = set
mock_redis_get.side_effect = get
not tested this but I think it should do what you want
If you want something more complete, you can try fakeredis
#patch("redis.Redis", return_value=fakeredis.FakeStrictRedis())
def test_something():
....
I think you can do something like this.
redis_cache = {
"key1": (b'\x80\x04\x95\x08\x00\x00\x00\x00\x00\x00\x00\x8c\x04test\x94.', "test"),
"key2": (None, None),
}
def get(redis_key):
if redis_key in redis_cache:
return redis_cache[redis_key][0]
else:
return None
mock = MagicMock()
mock.get = Mock(side_effect=get)
with patch('redis.StrictRedis', return_value=mock) as p:
for key in redis_cache:
result = self.MyClass.my_function(key)
self.assertEqual(result, redis_cache[key][1])

How to track the progress of individual tasks inside a group which forms the header to a chord in celery?

import celery
def temptask(n):
header=list(tempsubtask.si(i) for i in range(n))
callback=templink.si('printed at last?')
r = celery.chord(celery.group(header))(callback)
return r
#task()
def tempsubtask(i):
print i
for x in range(i):
time.sleep(2)
current_task.update_state(
state='PROGRESS', meta={'completed': x, 'total': i })
#task()
def templink(x):
print 'this should be run at last %s'%x
#executing temptask
r = temptask(100)
I want acccess to the progress status updated by tempsubtask. How can I go about achieving it?
I've had a similar question. Most examples on the net are outdated, the docs didn't help much, but the docs have links to sources, reading which did help me.
My objective was to organize parallel tasks in groups. The groups would have to be executed sequentially in order.
So I decided to generate the task ids before starting any tasks separately and only assigning them. I'm using Celery 4.3.0
Here's a brief example.
Firstly I needed a dummy task to make execution sequential and to be able to check the state of a certain group. As this is used a callback, it will complete only after all other tasks in the group.
#celery.task(bind=True, name="app.tasks.dummy_task")
def dummy_task( self, results=None, *args, **kwargs ):
return results
My comments here explain how I assign ids.
from celery.utils import uuid
from celery import group, chord, chain
# Generating task ids,
# which can be saved to a db, sent to the client and so on
#
# This is done before executing any tasks
task_id_1 = uuid()
task_id_2 = uuid()
chord_callback_id_1 = uuid()
chord_callback_id_2 = uuid()
workflow_id = None
# Generating goups, using signatures
# the group may contain any number of tasks
group_1 = group(
[
celery.signature(
'app.tasks.real_task',
args=(),
kwargs = { 'email': some_email, 'data':some_data },
options = ( {'task_id': task_id_1 } )
)
]
)
group_2 = group(
[
celery.signature(
'app.tasks.real_task',
args=(),
kwargs = { 'email': some_email, 'data':some_data },
options = ( {'task_id': task_id_2 } )
)
]
)
# Creating callback task which will simply rely the result
# Using the task id, which has been generated before
#
# The dummy task start after all tasks in this group are completed
# This way we know that the group is completed
chord_callback = celery.signature(
'app.tasks.dummy_task',
options=( {'task_id': chord_callback_id_1 } )
)
chord_callback_2 = celery.signature(
'app.tasks.dummy_task',
options=( {'task_id': chord_callback_id_2 } )
)
# we can monitor each step status
# by its chord callback id
# the id of the chord callback
step1 = chord( group_1, body=chord_callback )
# the id of the chord callback
step2 = chord( group_2, body=chord_callback_2 )
# start the workflow execution
# the steps will execute sequentially
workflow = chain( step1, step2 )()
# the id of the last cord callback
workflow_id = workflow.id
# return any ids you need
print( workflow_id )
That's how I can check the status of any task in my app.
# This is a simplified example
# some code is omitted
from celery.result import AsyncResult
def task_status( task_id=None ):
# PENDING
# RECEIVED
# STARTED
# SUCCESS
# FAILURE
# REVOKED
# RETRY
task = AsyncResult(task_id)
response = {
'state': task.state,
}
return jsonify(response), 200
After hours of googling I stumbled upon http://www.manasupo.com/2012/03/chord-progress-in-celery.html . Though the solution there didn't work for me out of the box, it did inspire me to try something similar.
from celery.utils import uuid
from celery import chord
class ProgressChord(chord):
def __call__(self, body=None, **kwargs):
_chord = self.type
body = (body or self.kwargs['body']).clone()
kwargs = dict(self.kwargs, body=body, **kwargs)
if _chord.app.conf.CELERY_ALWAYS_EAGER:
return self.apply((), kwargs)
callback_id = body.options.setdefault('task_id', uuid())
r= _chord(**kwargs)
return _chord.AsyncResult(callback_id), r
and instead of executing celery.chord I use ProgressChord as follows:
def temptask(n):
header=list(tempsubtask.si(i) for i in range(n))
callback=templink.si('printed at last?')
r = celery.Progresschord(celery.group(header))(callback)
return r
returned value of r contained a tuple having both, callback's asyncresult and a group result. So success looked something like this:
In [3]: r
Out[3]:
(<AsyncResult: bf87507c-14cb-4ac4-8070-d32e4ff326a6>,
<GroupResult: af69e131-5a93-492d-b985-267484651d95 [4672cbbb-8ec3-4a9e-971a-275807124fae, a236e55f-b312-485c-a816-499d39d7de41, e825a072-b23c-43f2-b920-350413fd5c9e, e3f8378d-fd02-4a34-934b-39a5a735871d, c4f7093b-9f1a-4e5e-b90d-66f83b9c97c4, d5c7dc2c-4e10-4e71-ba2b-055a33e15f02, 07b1c6f7-fe95-4c1f-b0ba-6bc82bceaa4e, 00966cb8-41c2-4e95-b5e7-d8604c000927, e039c78e-6647-4c8d-b59b-e9baf73171a0, 6cfdef0a-25a2-4905-a40e-fea9c7940044]>)
I inherited and overrode [celery.chord][1] instead of [celery.task.chords.Chord][2] because I couldn't find it's source anywhere.
Old problem and I wasted a several days to find a better and modern solution. In my current project I must to track group progress separately and release lock in final callback.
And current solution is much more simple (but harder to guess), subject lines commented at the end:
#celery_app.task(name="_scheduler", track_started=True, ignore_result=False)
def _scheduler():
lock = cache.lock("test_lock")
if not lock.acquire(blocking=False):
return {"Error": "Job already in progress"}
lock_code = lock.local.token.decode("utf-8")
tasks = []
for x in range(100):
tasks.append(calculator.s())
_group = group(*tasks)
_chord = chord(_group)(_get_results.s(token=lock_code))
group_results = _chord.parent # This is actual group inside chord
group_results.save() # I am saving it to usual results backend, and can track progress inside.
return _chord # can return anything, I need only chord.
I am working in Celery 5.1

Categories

Resources