On my local machine the script runs fine, but in the cloud it returns a 500 every time. This is a cron task, so I don't really mind if it takes 5 minutes...
< class 'google.appengine.runtime.DeadlineExceededError' >:
Any idea whether it's possible to increase the timeout?
Thanks,
rui
You cannot go beyond 30 seconds, but you can indirectly increase the timeout by employing task queues: write a task that gradually iterates through your data set and processes it. Each such task run must, of course, fit within the timeout limit.
EDIT
To be more specific, you can use datastore query cursors to resume processing in the same place:
http://code.google.com/intl/pl/appengine/docs/python/datastore/queriesandindexes.html#Query_Cursors
introduced first in SDK 1.3.1:
http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
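A minimal sketch of the idea above, written in plain Python so it can run anywhere: work until a time budget is nearly spent, then hand back a resume position for the next task run. The names `process_slice`, `run_to_completion` and `handle` are hypothetical, and an integer index stands in for a datastore query cursor; on App Engine the follow-up run would be enqueued through the task queue rather than called in a loop.

```python
import time


def process_slice(items, start, budget_seconds, handle, clock=time.monotonic):
    """Process items[start:] until the time budget runs out.

    Returns the index to resume from, or None when everything is done.
    The integer index is a stand-in for a datastore query cursor.
    """
    deadline = clock() + budget_seconds
    position = start
    while position < len(items):
        # Always handle at least one item per run so every run makes progress.
        if position > start and clock() >= deadline:
            return position  # out of time: the caller enqueues a follow-up run
        handle(items[position])
        position += 1
    return None  # nothing left to process


def run_to_completion(items, budget_seconds, handle, clock=time.monotonic):
    """Simulate the chain of task runs; returns how many runs were needed."""
    runs, position = 0, 0
    while position is not None:
        position = process_slice(items, position, budget_seconds, handle, clock)
        runs += 1
    return runs
```

Each call to `process_slice` corresponds to one task invocation; the budget should be set comfortably below the platform's request deadline.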
The exact rules for DB query timeouts are complicated, but it seems that a query cannot live more than about 2 mins, and a batch cannot live more than about 30 seconds. Here is some code that breaks a job into multiple queries, using cursors to avoid those timeouts.
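def make_query(start_cursor):
    # Foo is assumed to be a db.Model subclass; Foo.all() returns a Query.
    query = Foo.all()
    if start_cursor:
        query.with_cursor(start_cursor)
    return query

batch_size = 1000
start_cursor = None

while True:
    query = make_query(start_cursor)
    results_fetched = 0
    for resource in query.run(limit=batch_size):
        results_fetched += 1
        # Do something
        if results_fetched == batch_size:
            # Full batch: remember where we stopped and start a new query.
            start_cursor = query.cursor()
            break
    else:
        # The loop ended without filling a batch, so no results remain.
        break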
Below is the code I use to solve this problem, by breaking up a single large query into multiple small ones. I use the google.appengine.ext.ndb library -- I don't know if that is required for the code below to work.
(If you are not using ndb, consider switching to it. It is an improved version of the db library and migrating to it is easy. For more information, see https://developers.google.com/appengine/docs/python/ndb.)
from google.appengine.datastore.datastore_query import Cursor

def ProcessAll():
    curs = Cursor()
    while True:
        records, curs, more = MyEntity.query().fetch_page(5000, start_cursor=curs)
        for record in records:
            # Run your custom business logic on record.
            RunMyBusinessLogic(record)
        if more and curs:
            # There are more records; do nothing here so we enter the
            # loop again above and run the query one more time.
            pass
        else:
            # No more records to fetch; break out of the loop and finish.
            break
Related
I have a function that queries a large table for the purposes of indexing it... It creates a server-side cursor named "all_accounts".
def get_all_accounts(self):
    cursor = self.get_cursor('all_accounts')
    cursor.execute("SELECT * FROM account_summary LIMIT 20000;")
I then process those 2,000 or so at a time to insert into a NoSQL solution:
def index_docs(self, cursor):
    while True:
        # consume result over a series of iterations
        # with each iteration fetching 2000 records
        record_count = cursor.rowcount
        records = cursor.fetchmany(size=2000)
        if not records:
            break
        for r in records:
            pass  # do stuff
I'd like the index_docs function to be consuming the cursor fetchmany() calls in parallel x10 as my bottleneck is not caused by the target system, but rather the single threaded nature of my script. I have done a few async/worker things in the past, but the psycopg2 cursor seemed like it might be an issue. Thoughts?
I think you'll be safe if a single process/thread accesses the cursor and dishes out work to multiple worker processes that push to the other database. (At a quick glance, server-side cursors can't be shared between connections, but I could be wrong there.)
That is, something like this. Generally you'd use imap_unordered to iterate over a collection of single items (and use a higher chunksize than the default 1), but I think we can just as well use the batches here...
import multiprocessing

def get_batches(conn):
    cursor = conn.get_cursor('all_accounts')
    cursor.execute("SELECT * FROM account_summary LIMIT 20000;")
    while True:
        records = cursor.fetchmany(size=500)
        if not records:
            break
        yield list(records)

def process_batch(batch):
    # (this function is run in child processes)
    for r in batch:
        pass  # ...
    return "some arbitrary result"

def main():
    conn = connect...()
    with multiprocessing.Pool() as p:
        batch_generator = get_batches(conn)
        # Pass the generator object, not the get_batches function itself.
        for result in p.imap_unordered(process_batch, batch_generator):
            print(result)  # doesn't really matter
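If the per-batch work is mostly waiting on the target system, a thread pool may be enough. The following is a sketch of the same "single reader, many workers" layout using threads instead of processes; it is a swapped-in variant, not the code above. `FakeCursor`, `iter_batches`, `index_batch` and `index_all` are hypothetical names, and `FakeCursor` only imitates `fetchmany()` so the example runs without a database.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeCursor:
    """Stand-in for a database cursor; only fetchmany() is imitated."""

    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size):
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch


def iter_batches(cursor, size):
    """Yield lists of rows until the cursor is exhausted (single reader)."""
    while True:
        records = cursor.fetchmany(size)
        if not records:
            return
        yield list(records)


def index_batch(batch):
    """Placeholder for the push to the target system; returns rows handled."""
    return len(batch)


def index_all(cursor, batch_size=500, workers=10):
    # Only the calling thread touches the cursor; workers receive plain lists.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(index_batch, iter_batches(cursor, batch_size)))
```

Note that `Executor.map` collects all batches up front before handing them to the workers, so the whole result set is held in memory at once; that is acceptable for 20,000 rows but would need a bounded queue for much larger results.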
Background
I need to send out a large batch of notifications to around ~1 mil devices and I'm building it out using Google Cloud Functions.
In the current setup I enqueue each device token as a PubSub message that:
stores a pending notification in DataStore, used for keeping track of retries and success status
attempts to send the notification
marks the notification as either successful or failed if it's retried enough and hasn't gone through
This works more or less fine and I get decent performance out of it, something like 1.5K tokens processed per second.
Issue
I want to keep track of the current progress of the whole job. Given that I know how many notifications I'm expecting to process, I want to be able to report something like x/1_000_000 processed and then consider the job done when the sum of failures and successes equals the number I wanted to process.
The DataStore documentation suggests not running a count on the entities themselves because it won't be performant, which I can confirm. I implemented a counter following their example documentation of a sharded counter which I'm including at the end.
The issue I'm seeing is that it is both quite slow and very prone to returning 409 Contention errors. These make my function invocations retry, which is not ideal given that the count itself is not essential to the process and there's only a limited retry budget per notification. In practice the step that fails most often is incrementing the counter, which happens at the end of the process. That increases the load on notification reads (to check their status on retry) and means I end up with a counter that is lower than the actual number of successful notifications.
I ran a quick benchmark using wrk and seem to get around 400 RPS out of incrementing the counter, with an average latency of 250ms. This is quite slow compared to the notification logic itself, which does around 3 DataStore queries per notification and is presumably more complex than incrementing a counter. Combined with the contention errors, I end up with an implementation that I don't consider stable. I understand that Datastore usually auto-scales under continuous heavy usage, but this service is used very rarely and always for the whole batch of tokens at once, so there would not be any previous traffic to scale it up.
Questions
Is there something I'm missing about the counter implementation that could be improved to make it less slow?
Is there a different approach I should consider to get what I want?
Code
The code that interacts with datastore
import hashlib
import random

from google.cloud import datastore

# Assumed: a module-level client (its definition is not shown in the original).
client = datastore.Client()

DATASTORE_READ_BATCH_SIZE = 100

class Counter():
    kind = "counter"
    shards = 2000

    @staticmethod
    def _key(namespace, shard):
        return hashlib.sha1(":".join([str(namespace), str(shard)]).encode('utf-8')).hexdigest()

    @staticmethod
    def count(namespace):
        keys = []
        total = 0
        for shard in range(Counter.shards):
            if len(keys) == DATASTORE_READ_BATCH_SIZE:
                counters = client.get_multi(keys)
                total = total + sum([int(c["count"]) for c in counters])
                keys = []
            keys.append(client.key(Counter.kind, Counter._key(namespace, shard)))
        if len(keys) != 0:
            counters = client.get_multi(keys)
            total = total + sum([int(c["count"]) for c in counters])
        return total

    @staticmethod
    def increment(namespace):
        key = client.key(Counter.kind, Counter._key(namespace, random.randint(0, Counter.shards - 1)))
        with client.transaction():
            entity = client.get(key)
            if entity is None:
                entity = datastore.Entity(key=key)
                entity.update({
                    "count": 0,
                })
            entity.update({
                "count": entity["count"] + 1,
            })
            client.put(entity)
This is called from a Google Cloud Function like so
from flask import abort, jsonify, make_response
from src.notify import FCM, APNS
from src.lib.datastore import Counter

def counter(request):
    args = request.args
    if args.get("platform"):
        Counter.increment(args["platform"])
        return
    return jsonify({
        FCM: Counter.count(FCM),
        APNS: Counter.count(APNS)
    })
This is used both for incrementing and reading the counts and is split by platform for iOS and Android.
In the end I gave up on the counter and started also saving the status of the notifications in BigQuery. The pricing is still reasonable as it’s still per use and the streaming version of data inserting seems to be fast enough that it doesn’t cause me any issues in practice.
With this I can use a simple SQL query to count all the entities matching a batched job. This ends up taking around 3 seconds for all the entities, which, compared to the alternative, is acceptable performance for me given that this is only for internal use.
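To illustrate the kind of count query meant here, below is a small sketch that uses Python's built-in sqlite3 as a stand-in for BigQuery so that it runs locally. The table name `notifications`, its columns, and the helper names are all assumptions made for the example; the original post does not show its schema.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table with a few hypothetical notification rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notifications (job_id TEXT, token TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO notifications VALUES (?, ?, ?)",
        [
            ("job-1", "token-a", "success"),
            ("job-1", "token-b", "success"),
            ("job-1", "token-c", "failed"),
            ("job-2", "token-d", "success"),
        ],
    )
    return conn


def job_progress(conn, job_id):
    """Return {status: count} for one batched job."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM notifications "
        "WHERE job_id = ? GROUP BY status",
        (job_id,),
    ).fetchall()
    return dict(rows)
```

The job can be treated as finished once the counts for all final statuses add up to the number of notifications that were enqueued.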
I followed the below code in order to implement a parallel select query on a postgres database:
https://tech.geoblink.com/2017/07/06/parallelizing-queries-in-postgresql-with-python/
My basic problem is that I have ~6k queries that need to be executed, and I am trying to optimise the execution of these select queries. Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but I ran into issues with the query using up > 4GB of RAM on the machine it ran on, so I decided to split it out into 6k individual queries, which, when run synchronously, keep memory usage steady. However, it takes a lot longer to run, which is less of an issue for my use case. Even so, I am trying to reduce the time as much as possible.
This is what my code looks like:
class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.engine = self.init_connection()
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS)

    def init_connection(self):
        LOGGER.info('Creating Postgres engine')
        return create_engine(self.db_url)

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            self.pool.close()
            self.pool.join()
        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        con = psycopg2.connect(self.db_url)
        cur = con.cursor()
        cur.execute(query)
        records = cur.fetchall()
        con.close()
        return list(records)
However whenever this runs, I get the following error:
TypeError: can't pickle _thread.RLock objects
I've read lots of similar questions regarding the use of multiprocessing and pickleable objects, but I can't for the life of me figure out what I am doing wrong.
The pool is generally one per process (which I believe is the best practice) but shared per instance of the connector class, so that it's not creating a pool for each use of the parallel_query method.
The top answer to a similar question:
Accessing a MySQL connection pool from Python multiprocessing
Shows an almost identical implementation to my own, except using MySql instead of Postgres.
Am I doing something wrong?
Thanks!
EDIT:
I've found this answer:
Python Postgres psycopg2 ThreadedConnectionPool exhausted
which is incredibly detailed and looks as though I have misunderstood what multiprocessing.Pool gives me versus a connection pool such as ThreadedConnectionPool. However, the first link doesn't mention needing any connection pools etc. This solution seems good, but it seems like A LOT of code for what I think is a fairly simple problem?
EDIT 2:
So the above link solves another problem, which I would likely have run into anyway, so I'm glad I found it, but it doesn't solve the initial issue of not being able to use imap_unordered due to the pickling error. Very frustrating.
Lastly, I think it's probably worth noting that this runs on Heroku, on a worker dyno, using Redis rq for scheduling, background tasks etc., and a hosted instance of Postgres as the database.
To put it simply, Postgres connections and the SQLAlchemy connection pool are thread-safe; however, they are not fork-safe.
If you want to use multiprocessing, you should initialize the engine in each child process after the fork.
You should use multithreading instead if you want to share engines.
Refer to Thread and process safety in psycopg2 documentation:
libpq connections
shouldn’t be used by a forked processes, so when using a module such
as multiprocessing or a forking web deploy method such as FastCGI make
sure to create the connections after the fork.
If you are using multiprocessing.Pool, there is a keyword argument initializer which can be used to run code once on each child process. Try this:
class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS, initializer=self.init_connection(self.db_url))

    @classmethod
    def init_connection(cls, db_url):
        def _init_connection():
            LOGGER.info('Creating Postgres engine')
            cls.engine = create_engine(db_url)
        return _init_connection

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            pass
            #self.pool.close()
            #self.pool.join()
        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        with self.engine.connect() as conn:
            with conn.begin():
                result = conn.execute(query)
                return result.fetchall()

    def __getstate__(self):
        # this is a hack, if you want to remove this method, you should
        # remove self.pool and just pass pool explicitly
        self_dict = self.__dict__.copy()
        del self_dict['pool']
        return self_dict
Now, to address the XY problem.
Initially it was a single query with the where id in (...) contained
all 6k predicate IDs but I ran into issues with the query using up >
4GB of RAM on the machine it ran on, so I decided to split it out into
6k individual queries which when synchronously keeps a steady memory
usage.
What you may want to do instead is one of these options:
write a subquery that generates all 6000 IDs and use the subquery in your original bulk query.
as above, but write the subquery as a CTE
if your ID list comes from an external source (i.e. not from the database), then you can create a temporary table containing the 6000 IDs and then run your original bulk query against the temporary table
However, if you insist on running 6000 IDs through python, then the fastest query is likely neither to do all 6000 IDs in one go (which will run out of memory) nor to run 6000 individual queries. Instead, you may want to try to chunk the queries. Send 500 IDs at once for example. You will have to experiment with the chunk size to determine the largest number of IDs you can send at one time while still comfortably within your memory budget.
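A rough sketch of the chunking approach described above. The helper names (`chunked`, `fetch_in_chunks`), the table name `my_table` and the `execute` callable are all hypothetical; `execute(sql, params)` is assumed to run one query and return its rows. With psycopg2, a Python list parameter is adapted to a Postgres array, which is why the query uses `= ANY(%s)` rather than a hand-built `IN (...)` list.

```python
def chunked(ids, size):
    """Yield consecutive slices of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hypothetical table name; the list parameter is sent as an array.
QUERY = "SELECT * FROM my_table WHERE id = ANY(%s)"


def fetch_in_chunks(execute, ids, chunk_size=500):
    """Run one query per chunk and concatenate the rows.

    `execute(sql, params)` is an assumed callable that returns a list of rows.
    """
    rows = []
    for chunk in chunked(ids, chunk_size):
        rows.extend(execute(QUERY, (chunk,)))
    return rows
```

With 6000 IDs and a chunk size of 500 this issues 12 queries instead of 6000; the chunk size is the knob to tune against the memory budget.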
I am running the same simulation in a loop with different parameters. Each simulation makes use of a pandas DataFrame (data) which is only read, never modified. Using ipyparallel (IPython parallel), I can put this DataFrame into the global variable space of each engine in my view before simulations start:
view['data'] = data
The engines then have access to the DataFrame for all the simulations which get run on them. The process of copying the data (if pickled, data is 40MB) takes only a few seconds. However, it appears that as the number of simulations grows, memory usage grows very large. I imagine this shared data is getting copied for each task rather than just for each engine. What's the best practice for sharing static read-only data from a client with engines? Copying it once per engine is acceptable, but ideally it would only have to be copied once per host (I have 4 engines on host1 and 8 engines on host2).
Here's my code:
from ipyparallel import Client
import pandas as pd

rc = Client()
view = rc[:]  # use all engines
view.scatter('id', rc.ids, flatten=True)  # So we can track which engine performed what task

def do_simulation(tweaks):
    """ Run simulation with specified tweaks """
    # Do sim stuff using the global data DataFrame
    return results, id, tweaks

if __name__ == '__main__':
    data = pd.read_sql("SELECT * FROM my_table", engine)
    threads = []  # store list of tweaks dicts
    for i in range(4):
        for j in range(5):
            for k in range(6):
                threads.append(dict(i=i, j=j, k=k))
    # Set up globals for each engine. This is the read-only DataFrame
    view['data'] = data
    ar = view.map_async(do_simulation, threads)
    # Our async results should pop up over time. Let's measure our progress:
    for idx, (results, id, tweaks) in enumerate(ar):
        print('Progress: {}%: Simulation {} finished on engine {}'.format(100.0 * ar.progress / len(ar), idx, id))
        # Store results as a pickle for the future
        pfile = '{}_{}_{}.pickle'.format(tweaks['i'], tweaks['j'], tweaks['k'])
        # Save our results to a pickle file
        pd.to_pickle(results, out_file_path + pfile)
    print('Total execution time: {} (serial time: {})'.format(ar.wall_time, ar.serial_time))
If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks get assigned to the same engine and I don't see a response until all of the tasks assigned to that engine are completed. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
If simulation counts are large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements appear for a long time (~40 mins), and when I do see progress, it appears that a large block (>100) of tasks went to the same engine and waited for that one engine to complete before reporting any progress. When that one engine did complete, I saw the ar object provide new responses every 4 secs; this may have been the time delay to write the output pickle files.
Lastly, host1 also runs the ipycontroller task, and its memory usage goes up like crazy (a Python task shows >6GB RAM in use, a kernel task shows 3GB). The host2 engine doesn't really show much memory usage at all. What would cause this spike in memory?
I used this logic in some code a couple of years ago, and I got it working this way. My code was something like:
shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}
balancer = engines.load_balanced_view()
with engines[:].sync_imports():  # your 'view' variable
    import pandas as pd
    import ujson as json
engines[:].push(shared_dict)
results = balancer.map(lambda i: (i, my_func(i)), id)
results_data = results.get()
If simulation counts are small (~50), then it takes a while to get
started, but i start to see progress print statements. Strangely,
multiple tasks will get assigned to the same engine and I don't see a
response until all of those assigned tasks are completed for that
engine. I would expect to see a response from enumerate(ar) every time
a single simulation task completes.
In my case, my_func() was a complex method in which I put lots of logging messages written to a file, so I had my print statements.
As for the task assignment, since I used load_balanced_view(), I left it to the library to find its way, and it did great.
If simulation counts are large (~1000), it takes a long time to get
started, i see the CPUs throttle up on all engines, but no progress
print statements are seen until a long time (~40mins), and when I do
see progress, it appears a large block (>100) of tasks went to same
engine, and awaited completion from that one engine before providing
some progress. When that one engine did complete, i saw the ar object
provided new responses ever 4 secs - this may have been the time delay
to write the output pickle files.
As for the long startup time, I haven't experienced that, so I can't say anything.
I hope this casts some light on your problem.
PS: as I said in the comment, you could try multiprocessing.Pool. I haven't tried sharing big, read-only data as a global variable with it, but I would give it a try, because it seems to work.
Sometimes you need to scatter your data grouped by a category, so that you are sure that each subgroup will be entirely contained by a single cluster.
This is how I usually do it:
# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])

# Define the scatter_by function
def scatter_by(df, grouper, name='df'):
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids), grouper, core))
        client[core].push({name: df[df[grouper].isin(ids)]})

# Scatter the dataframe df grouping by `year`
scatter_by(df, 'year')
Notice that the function I'm suggesting makes sure each cluster hosts a similar number of observations, which is usually a good idea.
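The balancing comes from the stride slice `sz[core::CORES]`: groups are sorted by size and then dealt out like cards. Here is a dependency-free sketch of just that step, with a hypothetical helper name (`deal_round_robin`) and made-up group sizes, so the assignment can be inspected without a running cluster.

```python
def deal_round_robin(group_sizes, n_engines):
    """Assign each group to exactly one engine.

    group_sizes maps a group label to its row count. Groups are sorted by
    size and dealt out with a stride, mirroring sz[core::CORES] above.
    """
    ordered = sorted(group_sizes, key=group_sizes.get)
    return [ordered[engine::n_engines] for engine in range(n_engines)]


# Made-up example: five years with different numbers of rows, two engines.
sizes = {2001: 10, 2002: 40, 2003: 20, 2004: 30, 2005: 50}
assignment = deal_round_robin(sizes, 2)
loads = [sum(sizes[g] for g in groups) for groups in assignment]
```

The dealing keeps every group whole and gives each engine a mix of small and large groups; the resulting loads are comparable rather than exactly equal.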
I have to run some long-running tasks (2-3 days, I think) on my Django ORM data. I looked around and didn't find any good solutions.
django-tasks - http://code.google.com/p/django-tasks/ is not well documented, and I don't have any idea how to use it.
celery - http://ask.github.com/celery/ is excessive for my tasks. Is it good for long-running tasks?
So, what I need to do is get all the data, or parts of it, from my database, like:
Entry.objects.all()
And then I need to execute the same function for each item in the QuerySet.
I think it should run for around 2-3 days.
So, maybe someone can explain to me how to build it.
P.S.: at the moment I have only one idea: use cron and a database to store the process execution timeline.
Use Celery sub-tasks. This will allow you to start a long-running task (with many short-running subtasks underneath it) and keep good data on its execution status within Celery's task result store. As an added bonus, subtasks will be spread across worker processes, allowing you to take full advantage of multi-core servers, or even multiple servers, to reduce task runtime.
http://ask.github.com/celery/userguide/tasksets.html#task-sets
http://docs.celeryproject.org/en/latest/reference/celery.task.sets.html
EDIT: example:
import time, logging as log
from celery.task import task
from celery.task.sets import TaskSet
from app import Entry

@task(send_error_emails=True)
def long_running_analysis():
    # values_list(..., flat=True) yields plain ids rather than dicts
    entry_ids = list(Entry.objects.all().values_list('id', flat=True))
    num_entries = len(entry_ids)
    # subtask() takes its positional arguments as a tuple
    taskset = TaskSet(analyse_entry.subtask((entry_id,)) for entry_id in entry_ids)
    results = taskset.apply_async()
    while not results.ready():
        time.sleep(10000)
        log.info("long_running_analysis is %d%% complete",
                 results.completed_count() * 100 / num_entries)
    if results.failed():
        log.error("Analysis Failed!")
    result_set = results.join()  # brings back results in
                                 # the order of entries
    # perform collating or count or percentage calculations here
    log.info("Analysis Complete!")

@task
def analyse_entry(id):  # inputs must be serialisable
    logger = analyse_entry.get_logger()
    entry = Entry.objects.get(id=id)
    try:
        analysis = entry.analyse()
        logger.info("'%s' found to be %s.", entry, analysis['status'])
        return analysis  # must be a dict or serialisable.
    except Exception as e:
        logger.error("Could not process '%s': %s", entry, e)
        return None
If your calculations cannot be segregated into per-entry tasks, you can always set it up so that one subtask performs tallies and another subtask performs a different analysis type. This will still work, and will still allow you to benefit from parallelism.