Why is traceback.extract_stack() in Python so slow?

During tests I found out that calling traceback.extract_stack() is very slow. The price of getting a stack trace is comparable to executing a database query.
I'm wondering if I'm doing something wrong or missing something. What surprises me is that I assumed extract_stack() is an internal Python call, executed at runtime in memory, so it should be very fast if not instant. A database query, by contrast, involves an external service (network communication), etc.
Example code is below. You can try how much time it takes to retrieve the traceback in, let's say, 20,000 iterations, and how much faster it is to retrieve just the first few items from the stack trace - set the limit=None parameter to something else.
My tests showed varying results on different systems/configurations, but they all have in common that retrieving a stack trace is not orders of magnitude cheaper; it costs almost the same as an SQL insert.
           | 20k SQL inserts | 20k stack traces
Win        | 5.4 sec         | 14.4 sec
FreeBSD    | 5.0 sec         | 3.7 sec
Ubuntu GCP | 16.6 sec        | 2.4 sec
Windows: laptop, local SSD. FreeBSD: server, local SSD. Ubuntu: Google Cloud, shared SSD.
Am I doing something wrong, or is there an explanation for why traceback.extract_stack() is so slow? Can I retrieve a stack trace faster somehow?
Example code. Run $ pip install pytest and then $ pytest -s -v
import datetime
import unittest
import traceback

class TestStackTrace(unittest.TestCase):
    def test_stack_trace(self):
        start_time = datetime.datetime.now()
        iterations = 20000
        for i in range(0, iterations):
            stack_list = traceback.extract_stack(limit=None)  # set limit to 0, 1, 2...
            stack_len = len(stack_list)
        self.assertEqual(1, 1)
        finish_time = datetime.datetime.now()
        print('\nStack length: {}, iterations: {}'.format(stack_len, iterations))
        print('Trace elapsed time: {}'.format(finish_time - start_time))
You don't need it, but if you want the comparison with the SQL insert, here it is. Just add it as a second test method in the TestStackTrace class (you will also need import psycopg2 at the top of the file). First run CREATE DATABASE pytest1; and CREATE TABLE "test_table1" (num_value BIGINT, str_value VARCHAR(10));
    def test_sql_query(self):
        start_time = datetime.datetime.now()
        con_str = "host='127.0.0.1' port=5432 user='postgres' password='postgres' dbname='pytest1'"
        con = psycopg2.connect(con_str)
        con.autocommit = True
        con.set_session(isolation_level='READ COMMITTED')
        cur = con.cursor()
        for i in range(0, 20000):
            cur.execute('INSERT INTO test_table1 (num_value, str_value) VALUES (%s, %s)', (i, i))
        finish_time = datetime.datetime.now()
        print('\nSQL elapsed time: {}'.format(finish_time - start_time))

traceback.extract_stack() is not an internal Python call implemented in C. The entire traceback module is implemented in Python, which is why it is relatively slow. Since stack traces are typically only needed during debugging, their performance usually isn't a concern. If you really need a high-performance version, you may have to re-implement it yourself as a C/C++ extension.
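As a partial measure in pure Python: part of what extract_stack() does is look up and cache source lines via linecache for every frame. If all you need is filenames, line numbers and function names, a rough sketch like the following avoids that work (it is CPython-specific, since it relies on sys._getframe(), and is not a drop-in replacement for extract_stack()):

import sys

def fast_stack(limit=None):
    # Collect (filename, lineno, function) tuples by walking the raw frame
    # objects; unlike traceback.extract_stack() this never reads source
    # lines through linecache.
    frames = []
    frame = sys._getframe().f_back  # skip this helper's own frame
    while frame is not None and (limit is None or len(frames) < limit):
        code = frame.f_code
        frames.append((code.co_filename, frame.f_lineno, code.co_name))
        frame = frame.f_back
    return frames

Whether this is fast enough for your use case is something you would have to measure in the same benchmark.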

Related

What's causing so much overhead in Google BigQuery query?

I am running the following function to profile a BigQuery query:
# q = "SELECT * FROM bqtable LIMIT 1'''
def run_query(q):
t0 = time.time()
client = bigquery.Client()
t1 = time.time()
res = client.query(q)
t2 = time.time()
results = res.result()
t3 = time.time()
records = [_ for _ in results]
t4 = time.time()
print (records[0])
print ("Initialize BQClient: %.4f | ExecuteQuery: %.4f | FetchResults: %.4f | PrintRecords: %.4f | Total: %.4f | FromCache: %s" % (t1-t0, t2-t1, t3-t2, t4-t3, t4-t0, res.cache_hit))
And, I get something like the following:
Initialize BQClient: 0.0007 | ExecuteQuery: 0.2854 | FetchResults: 1.0659 | PrintRecords: 0.0958 | Total: 1.4478 | FromCache: True
I am running this on a GCP machine and it is only fetching ONE result in location US (same region, etc.), so the network transfer should (I hope?) be negligible. What's causing all the overhead here?
I tried this on the GCP console and it says the cache hit takes less than 0.1s to return, but in actuality, it's over a second. Here is an example video to illustrate: https://www.youtube.com/watch?v=dONZH1cCiJc.
Notice, for the first query for example, it says it returned in 0.253s from cache. However, if you watch the video, the query actually STARTED at 7 seconds and 3 frames and COMPLETED at 8 seconds and 13 frames. That is well over a second - almost a second and a half! That number is similar to what I get when I execute a query from the command line in Python.
So why does it report that it only took 0.253s when, in actuality, doing the query and returning the one result takes over five times that amount?
In other words, there seems to be about a second of overhead REGARDLESS of the query time (which is not noted at all in the execution details). Are there any ways to reduce this time?
The UI is reporting the query execution time, not the total time.
Query execution time is how long it takes BigQuery to actually scan the data and compute the result. If it's just reading from cache then it will be very quick and usually under 1 second, which reflects the timing you're seeing.
However that doesn't include downloading the result table and displaying it in the UI. You actually measured this in your Python script which shows the FetchResults step taking over 1 second, and this is the same thing that's happening in the browser console. For example, a cached query result containing millions of rows will be executed very quickly but might take 30 seconds to fully download.
BigQuery is a large-scale analytical (OLAP) system and is designed for throughput rather than latency. It uses a distributed design with an intensive planning process and writes all results to temporary tables. This allows it to process petabytes in seconds but the trade-off is that every query will take a few seconds to run, no matter how small.
You can look at the official documentation for more info on query planning and performance, but in this situation there is no way to reduce the latency any further. A few seconds is currently the best case scenario for BigQuery.
If you need lower response times for repeated queries then you can look into storing the results in your own caching layer (like Redis), or use BigQuery to aggregate data into a much smaller dataset and then store that in a traditional relational database (like Postgres or MySQL).
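For example, a rough sketch of such a caching layer, assuming a local Redis instance and the redis and google-cloud-bigquery packages (the names and the TTL are illustrative, and the SQL text simply doubles as the cache key):

import json

import redis
from google.cloud import bigquery

cache = redis.Redis(host="localhost", port=6379)  # assumed local Redis
client = bigquery.Client()

def cached_query(sql, ttl_seconds=300):
    # Serve repeated queries from Redis, falling back to BigQuery on a miss.
    hit = cache.get(sql)
    if hit is not None:
        return json.loads(hit)
    rows = [{k: v for k, v in row.items()} for row in client.query(sql).result()]
    cache.setex(sql, ttl_seconds, json.dumps(rows, default=str))
    return rows

This only helps for repeated queries, of course; the first execution still pays the full BigQuery round trip.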

multiprocessing / psycopg2 TypeError: can't pickle _thread.RLock objects

I followed the below code in order to implement a parallel select query on a postgres database:
https://tech.geoblink.com/2017/07/06/parallelizing-queries-in-postgresql-with-python/
My basic problem is that I have ~6k select queries that need to be executed, and I am trying to optimise their execution. Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but I ran into issues with that query using > 4GB of RAM on the machine it ran on, so I decided to split it out into 6k individual queries, which, when run synchronously, keep memory usage steady. However, it takes a lot longer to run, which is less of an issue for my use case. Even so, I am trying to reduce the time as much as possible.
This is what my code looks like:
import logging
import multiprocessing
from itertools import chain

import psycopg2
from sqlalchemy import create_engine

LOGGER = logging.getLogger(__name__)

class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.engine = self.init_connection()
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS)

    def init_connection(self):
        LOGGER.info('Creating Postgres engine')
        return create_engine(self.db_url)

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            self.pool.close()
            self.pool.join()
        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        con = psycopg2.connect(self.db_url)
        cur = con.cursor()
        cur.execute(query)
        records = cur.fetchall()
        con.close()
        return list(records)
However whenever this runs, I get the following error:
TypeError: can't pickle _thread.RLock objects
I've read lots of similar questions regarding the use of multiprocessing and pickleable objects, but I can't for the life of me figure out what I am doing wrong.
The pool is generally one per process (which I believe is best practice), but it is shared per instance of the connector class so that a new pool isn't created for each call to run_parallel_queries.
The top answer to a similar question:
Accessing a MySQL connection pool from Python multiprocessing
Shows an almost identical implementation to my own, except using MySql instead of Postgres.
Am I doing something wrong?
Thanks!
EDIT:
I've found this answer:
Python Postgres psycopg2 ThreadedConnectionPool exhausted
which is incredibly detailed, and it looks as though I have misunderstood what multiprocessing.Pool gives me versus a connection pool such as ThreadedConnectionPool. However, the first link doesn't mention needing any connection pools, etc. That solution seems good, but it seems like A LOT of code for what I think is a fairly simple problem?
EDIT 2:
So the above link solves another problem, which I would likely have run into anyway, so I'm glad I found that, but it doesn't solve the initial issue of not being able to use imap_unordered due to the pickling error. Very frustrating.
Lastly, I think it's probably worth noting that this runs on Heroku, on a worker dyno, using Redis RQ for scheduling, background tasks, etc., and a hosted instance of Postgres as the database.
To put it simply, Postgres connections and SQLAlchemy connection pools are thread-safe; however, they are not fork-safe.
If you want to use multiprocessing, you should initialize the engine in each child process after the fork.
You should use multithreading instead if you want to share engines.
Refer to Thread and process safety in the psycopg2 documentation:
libpq connections shouldn't be used by a forked processes, so when using a module such as multiprocessing or a forking web deploy method such as FastCGI make sure to create the connections after the fork.
If you are using multiprocessing.Pool, there is a keyword argument initializer which can be used to run code once on each child process. Try this:
class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS, initializer=self.init_connection(self.db_url))

    @classmethod
    def init_connection(cls, db_url):
        def _init_connection():
            LOGGER.info('Creating Postgres engine')
            cls.engine = create_engine(db_url)
        return _init_connection

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            pass
            # self.pool.close()
            # self.pool.join()
        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        with self.engine.connect() as conn:
            with conn.begin():
                result = conn.execute(query)
                return result.fetchall()

    def __getstate__(self):
        # this is a hack; if you want to remove this method, you should
        # remove self.pool and just pass the pool explicitly
        self_dict = self.__dict__.copy()
        del self_dict['pool']
        return self_dict
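If you would rather take the multithreading route mentioned above and share a single engine, a rough sketch using concurrent.futures might look like this (the URL, pool sizes and function names are placeholders, not part of the original code):

from concurrent.futures import ThreadPoolExecutor
from itertools import chain

from sqlalchemy import create_engine, text

db_url = 'postgresql://user:password@localhost:5432/mydb'  # placeholder
engine = create_engine(db_url, pool_size=8, max_overflow=0)

def execute_one(query):
    # Each worker thread checks a connection out of the shared engine's pool.
    with engine.connect() as conn:
        return conn.execute(text(query)).fetchall()

def run_parallel_queries_threaded(queries, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(chain.from_iterable(executor.map(execute_one, queries)))

Threads avoid the pickling problem entirely, because nothing has to be sent to a child process.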
Now, to address the XY problem.
Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but I ran into issues with that query using > 4GB of RAM on the machine it ran on, so I decided to split it out into 6k individual queries, which, when run synchronously, keep memory usage steady.
What you may want to do instead is one of these options:
write a subquery that generates all 6000 IDs and use the subquery in your original bulk query.
as above, but write the subquery as a CTE
if your ID list comes from an external source (i.e. not from the database), then you can create a temporary table containing the 6000 IDs and then run your original bulk query against the temporary table
However, if you insist on running 6000 IDs through python, then the fastest query is likely neither to do all 6000 IDs in one go (which will run out of memory) nor to run 6000 individual queries. Instead, you may want to try to chunk the queries. Send 500 IDs at once for example. You will have to experiment with the chunk size to determine the largest number of IDs you can send at one time while still comfortably within your memory budget.
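For example, a rough sketch of the chunked approach with plain psycopg2 (my_table, the connection string and the chunk size are illustrative placeholders):

import psycopg2

def fetch_in_chunks(con_str, ids, chunk_size=500):
    # Fetch rows for a large ID list in chunks; psycopg2 adapts each Python
    # list to a Postgres array, so ANY(%s) replaces a hand-built IN clause.
    con = psycopg2.connect(con_str)
    cur = con.cursor()
    results = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        cur.execute('SELECT * FROM my_table WHERE id = ANY(%s)', (chunk,))
        results.extend(cur.fetchall())
    con.close()
    return results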

Timing Code Execution Time

So, I am interested in timing some of the code I am setting up. Borrowing a timer function from the 4th edition of Learning Python, I tried:
import time

reps = 100
repslist = range(reps)

def timer(func):
    start = time.clock()
    for i in repslist:
        ret = func()
    elasped = time.clock() - start
    return elapsed
Then, I paste in whatever I want to time, and put:
print(timer(func)) #replace func with the function you want to time
When I run it on my code, I do get an answer, but it's nonsense. Suspecting something was wrong, I put a time.sleep(0.1) call in my code and got a result of 0.8231.
Does anybody know why this might be the case or how to fix it? I suspect that the time.clock() call might be at fault.
According to the help docs for clock:
Return the CPU time or real time since the start of the process or since the first call to clock(). This has as much precision as the system records.
The second call to clock already returns the elapsed time between it and the first clock call. You don't need to manually subtract start.
Change
elasped = time.clock()-start
to
elasped = time.clock()
If you want to time a function, perhaps give decorators a try (documentation here):
import time

def timeit(f):
    def timed(*args, **kw):
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print 'func:%r args:[%r, %r] took: %2.4f sec' % \
            (f.__name__, args, kw, te - ts)
        return result
    return timed
Then when you write a function, you just apply the decorator, like so:
@timeit
def my_example_function():
    for i in range(10000):
        print "x"
This will print out the time the function took to execute:
func:'my_example_function' args:[(), {}] took: 0.4220 sec
After fixing the typo in the first intended use of elapsed, your code works fine with either time.clock or time.time (or Py3's time.monotonic for that matter) on my Linux system.
The difference would be in the (OS specific) behavior for clock; on most UNIX-like OSes it will return the processor time used by the program since it launched (so time spent blocked, on I/O, locks, page faults, etc. wouldn't count), while on Windows it's a wall clock timer (so time spent blocked would count) that counts seconds since first call.
The UNIX-like version of time.clock is also fairly unreliable if used in a long running program when clock_t is only 32 bits; the value it returns will wrap roughly every 72 minutes of processor time.
Of course, time.time isn't perfect either; it follows the system clock, so an NTP time update (or any other change to the system clock) occurring between calls will give erroneous results (on Python 3.3+, you'd use time.monotonic to avoid this problem). It's also not guaranteed to have granularity finer than 1 second, so if your function doesn't take an awfully long time to run, on a system with low res time.time you won't get particularly useful results.
Really, you should be looking at the Python batteries designed for this (that also handle issues like garbage collection overhead and the like). The timeit module already has a function that does what you want, but handles all the edge cases and issues I mentioned. For example, to time some global function named foo for 100 reps, you'd just do:
import timeit

def foo():
    ...

print(timeit.timeit('foo()', 'from __main__ import foo', number=100))
It fixes most of the issues I mention by selecting the best timing function for the OS you're on (and also fixes other sources of jitter, e.g. cyclic garbage collection, which is disabled during the test and reenabled at the end).
Even if you don't want to use that for some reason, if you're using Python 3.3 or higher, take a look at the replacements for time.clock, e.g. time.perf_counter (includes time spent sleeping) or time.process_time (includes only CPU time), both of which are portable, reliable, fast, and high resolution for better accuracy.
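For example, a rough Python 3 sketch of the decorator shown earlier, rebuilt around time.perf_counter (the name timeit_pc is only illustrative):

import time
from functools import wraps

def timeit_pc(f):
    @wraps(f)
    def timed(*args, **kw):
        start = time.perf_counter()  # monotonic, high-resolution timer
        result = f(*args, **kw)
        elapsed = time.perf_counter() - start
        print('func:%r took: %.4f sec' % (f.__name__, elapsed))
        return result
    return timed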
time.sleep() will also terminate early if any signal is caught; read about it here: http://www.tutorialspoint.com/python/time_sleep.htm

avoid expensive setup in timeit.repeat() benchmark

I'm trying to measure the execution time of a small Python code snippet of mine and I'm wondering what's the best way to do so.
Ideally, I would like to run some sort of setup (which takes a loooong time), then run some test code a couple of times, and get the minimum time of these runs.
timeit() seemed appropriate, but I'm not sure how to obtain the minimum time without re-executing the setup. Small code snippet demonstrating the question:
import timeit
setup = 'a = 2.0' # expensive
stmt = 'b = a**2' # also takes significantly longer than timer resolution
# this executes setup and stmt 10 times and the minimum of these 10
# runs is returned:
timings1 = timeit.repeat(stmt = stmt, setup = setup, repeat = 10, number = 1)
# this executes setup once and stmt 10 times but the overall time of
# these 10 runs is returned (and I would like to have the minimum
# of the 10 runs):
timings2 = timeit.repeat(stmt = stmt, setup = setup, repeat = 1, number = 10)
Have you tried using datetime to do your timing for you?
import datetime

start = datetime.datetime.now()
# ... code to time goes here ...
print datetime.datetime.now() - start  # prints a datetime.timedelta object
That will give you the time elapsed, and you can control where it starts, with tiny overhead.
Edit: Here is a video of someone who also uses it for timing; it seems to be the easiest way to get the running time: http://www.youtube.com/watch?v=Iw9-GckD-gQ
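Applied to the snippet in the question, a rough sketch of that approach, with the setup executed only once and the minimum taken over the runs:

import datetime

a = 2.0          # the expensive setup, executed only once

timings = []
for _ in range(10):
    start = datetime.datetime.now()
    b = a ** 2   # the statement under test
    timings.append((datetime.datetime.now() - start).total_seconds())

print(min(timings))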

Is it possible to increase the response timeout in Google App Engine?

On my local machine the script runs fine, but in the cloud it returns a 500 every time. This is a cron task, so I don't really mind if it takes 5 min...
<class 'google.appengine.runtime.DeadlineExceededError'>:
Any idea whether it's possible to increase the timeout?
Thanks,
rui
You cannot go beyond 30 secs, but you can indirectly increase the timeout by employing task queues - writing tasks that gradually iterate through your data set and process it. Each such task run should of course fit within the timeout limit.
EDIT
To be more specific, you can use datastore query cursors to resume processing in the same place:
http://code.google.com/intl/pl/appengine/docs/python/datastore/queriesandindexes.html#Query_Cursors
first introduced in SDK 1.3.1:
http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
The exact rules for DB query timeouts are complicated, but it seems that a query cannot live more than about 2 mins, and a batch cannot live more than about 30 seconds. Here is some code that breaks a job into multiple queries, using cursors to avoid those timeouts.
def make_query(start_cursor):
    query = Foo.all()  # Foo is assumed to be a db.Model; all() returns a query
    if start_cursor:
        query.with_cursor(start_cursor)
    return query

batch_size = 1000
start_cursor = None

while True:
    query = make_query(start_cursor)
    results_fetched = 0
    for resource in query.run(limit=batch_size):
        results_fetched += 1

        # Do something

        if results_fetched == batch_size:
            start_cursor = query.cursor()
            break
    else:
        # The for loop finished without hitting batch_size, so there are
        # no more results to fetch.
        break
Below is the code I use to solve this problem, by breaking up a single large query into multiple small ones. I use the google.appengine.ext.ndb library -- I don't know if that is required for the code below to work.
(If you are not using ndb, consider switching to it. It is an improved version of the db library and migrating to it is easy. For more information, see https://developers.google.com/appengine/docs/python/ndb.)
from google.appengine.datastore.datastore_query import Cursor

def ProcessAll():
    curs = Cursor()
    while True:
        records, curs, more = MyEntity.query().fetch_page(5000, start_cursor=curs)
        for record in records:
            # Run your custom business logic on record.
            RunMyBusinessLogic(record)
        if more and curs:
            # There are more records; do nothing here so we enter the
            # loop again above and run the query one more time.
            pass
        else:
            # No more records to fetch; break out of the loop and finish.
            break
