ProcessPoolExecutor read-only variable speed discrepancy - python

I have two versions of a multiprocessing program using concurrent.futures.ProcessPoolExecutor (Python 3.6, Linux) with surprising speed discrepancies despite seemingly minor changes (one is ~3x slower than the other).
Each child process executes a simple function that reads from a large dict (it does not alter it) and returns a result.
The first version of the function passes the dict into executor.submit() as an argument.
The second version of the function reads from the global dict directly.
Code samples
Variable passed in:
#!/usr/bin/env python3
import concurrent.futures, pstats, sys, cProfile
BIG_DICT = {i: 2*i for i in range(10000)}
def foo(d):
    return d[0]
with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
    tasks = [executor.submit(foo, BIG_DICT) for _ in range(100000)]
    for task in concurrent.futures.as_completed(tasks):
        task.result()
Global variable read from:
#!/usr/bin/env python3
import concurrent.futures, pstats, sys, cProfile
BIG_DICT = {i: 2*i for i in range(10000)}
def foo():
    return BIG_DICT[0]
with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
    tasks = [executor.submit(foo) for _ in range(100000)]
    for task in concurrent.futures.as_completed(tasks):
        task.result()
Ideas
I've profiled both versions of the program using cProfile and the majority of execution time seems to be spent waiting for locks. The global version only waits for about 10 seconds, while the pass-in version waits for almost 80 seconds!
From what I understand, when a process is forked it should make a copy of its parent's memory. As the program is multiprocessed and BIG_DICT is never actually modified after creation, there shouldn't be any need for locking to maintain state consistency between submitting each process.
Since BIG_DICT needs to be copied into the memory space of each child process in both versions, why is there so much discrepancy in execution time?
A couple of ideas I have floating around:
Implementation detail of ProcessPoolExecutor
GIL quirk
Some sort of Python runtime/OS optimisation
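One thing that can be measured directly is how much data each call to submit() has to serialize. ProcessPoolExecutor pickles the arguments of every submitted task and sends them to a worker over a pipe, so a rough sketch like the one below compares the pickled size of the arguments in the two versions. The names payload_with_dict and payload_without are illustrative only and are not part of the original programs.

```python
import pickle

BIG_DICT = {i: 2 * i for i in range(10000)}

# Version 1 submits foo with BIG_DICT as a positional argument,
# so the argument tuple (BIG_DICT,) is pickled for every task.
payload_with_dict = len(pickle.dumps((BIG_DICT,)))

# Version 2 submits foo with no arguments, so only an empty tuple is pickled.
payload_without = len(pickle.dumps(()))

print("bytes per task, dict passed in:", payload_with_dict)
print("bytes per task, global read:   ", payload_without)
```

This only measures the arguments. The real work item also carries the function reference and some bookkeeping, and the sketch does not by itself show where the extra lock-wait time in the profile comes from.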
Profiling results
Variable passed in:
7672287 function calls in 92.434 seconds
Ordered by: internal time, cumulative time
List reduced from 247 to 12 due to restriction <0.05>
ncalls tottime percall cumtime percall filename:lineno(function)
460133 75.428 0.000 75.428 0.000 {method 'acquire' of '_thread.lock' objects}
100001 7.034 0.000 7.034 0.000 {built-in method posix.write}
100001 2.490 0.000 2.490 0.000 {method '__enter__' of '_multiprocessing.SemLock' objects}
100001 0.686 0.000 78.344 0.001 _base.py:196(as_completed)
90033 0.553 0.000 75.879 0.001 threading.py:263(wait)
100000 0.548 0.000 13.639 0.000 process.py:449(submit)
190033 0.366 0.000 0.713 0.000 _base.py:174(_yield_finished_futures)
100000 0.351 0.000 0.598 0.000 _base.py:312(__init__)
90033 0.327 0.000 76.335 0.001 threading.py:533(wait)
100001 0.261 0.000 7.617 0.000 connection.py:181(send_bytes)
480065 0.260 0.000 0.382 0.000 threading.py:239(__enter__)
100001 0.258 0.000 11.329 0.000 queues.py:339(put)
Ordered by: internal time, cumulative time
List reduced from 247 to 12 due to restriction <0.05>
Function was called by...
ncalls tottime cumtime
{method 'acquire' of '_thread.lock' objects} <- 90033 0.078 0.078 threading.py:251(_acquire_restore)
190033 0.391 0.391 threading.py:254(_is_owned)
180066 74.956 74.956 threading.py:263(wait)
1 0.003 0.003 threading.py:1062(_wait_for_tstate_lock)
{built-in method posix.write} <- 100001 7.034 7.034 connection.py:365(_send)
{method '__enter__' of '_multiprocessing.SemLock' objects} <- 100001 2.490 2.490 synchronize.py:95(__enter__)
_base.py:196(as_completed) <-
threading.py:263(wait) <- 90033 0.553 75.879 threading.py:533(wait)
process.py:449(submit) <- 100000 0.548 13.639 local.py:13(<listcomp>)
_base.py:174(_yield_finished_futures) <- 190033 0.366 0.713 _base.py:196(as_completed)
_base.py:312(__init__) <- 100000 0.351 0.598 process.py:449(submit)
threading.py:533(wait) <- 90032 0.327 76.334 _base.py:196(as_completed)
1 0.000 0.001 threading.py:828(start)
connection.py:181(send_bytes) <- 100001 0.261 7.617 queues.py:339(put)
threading.py:239(__enter__) <- 100000 0.070 0.116 _base.py:174(_yield_finished_futures)
100000 0.033 0.051 _base.py:405(result)
100000 0.083 0.108 queue.py:115(put)
90032 0.040 0.058 threading.py:523(clear)
90033 0.034 0.050 threading.py:533(wait)
queues.py:339(put) <- 100000 0.258 11.329 process.py:449(submit)
1 0.000 0.000 process.py:499(shutdown)
Global variable read from:
5949819 function calls in 27.158 seconds
Ordered by: internal time, cumulative time
List reduced from 247 to 12 due to restriction <0.05>
ncalls tottime percall cumtime percall filename:lineno(function)
160569 10.072 0.000 10.072 0.000 {method 'acquire' of '_thread.lock' objects}
100001 5.453 0.000 5.453 0.000 {method '__enter__' of '_multiprocessing.SemLock' objects}
100001 5.338 0.000 5.338 0.000 {built-in method posix.write}
100000 0.883 0.000 1.163 0.000 _base.py:312(__init__)
100000 0.477 0.000 15.671 0.000 process.py:449(submit)
100001 0.438 0.000 6.133 0.000 connection.py:181(send_bytes)
100001 0.304 0.000 12.921 0.000 queues.py:339(put)
100000 0.304 0.000 0.304 0.000 process.py:116(__init__)
100001 0.277 0.000 0.432 0.000 reduction.py:38(__init__)
100000 0.267 0.000 0.333 0.000 threading.py:334(notify)
100000 0.240 0.000 0.747 0.000 queue.py:115(put)
100006 0.238 0.000 0.280 0.000 threading.py:215(__init__)
Ordered by: internal time, cumulative time
List reduced from 247 to 12 due to restriction <0.05>
Function was called by...
ncalls tottime cumtime
{method 'acquire' of '_thread.lock' objects} <- 15142 0.007 0.007 threading.py:251(_acquire_restore)
115142 0.038 0.038 threading.py:254(_is_owned)
30284 10.022 10.022 threading.py:263(wait)
1 0.004 0.004 threading.py:1062(_wait_for_tstate_lock)
{method '__enter__' of '_multiprocessing.SemLock' objects} <- 100001 5.453 5.453 synchronize.py:95(__enter__)
{built-in method posix.write} <- 100001 5.338 5.338 connection.py:365(_send)
_base.py:312(__init__) <- 100000 0.883 1.163 process.py:449(submit)
process.py:449(submit) <- 100000 0.477 15.671 global.py:13(<listcomp>)
connection.py:181(send_bytes) <- 100001 0.438 6.133 queues.py:339(put)
queues.py:339(put) <- 100000 0.304 12.921 process.py:449(submit)
1 0.000 0.000 process.py:499(shutdown)
process.py:116(__init__) <- 100000 0.304 0.304 process.py:449(submit)
reduction.py:38(__init__) <- 100001 0.277 0.432 reduction.py:48(dumps)
threading.py:334(notify) <- 100000 0.267 0.333 queue.py:115(put)
queue.py:115(put) <- 100000 0.240 0.747 process.py:449(submit)
threading.py:215(__init__) <- 100000 0.238 0.280 _base.py:312(__init__)
3 0.000 0.000 queue.py:27(__init__)
1 0.000 0.000 queues.py:67(_after_fork)
2 0.000 0.000 threading.py:498(__init__)

Related

Numpy mean of flattened large array slower than mean of mean of all axes

Running Numpy version 1.19.2, I get better performance cumulating the mean of every individual axis of an array than by calculating the mean over an already flattened array.
shape = (10000,32,32,3)
mat = np.random.random(shape)
# Call this Method A.
%%timeit
mat_means = mat.mean(axis=0).mean(axis=0).mean(axis=0)
14.6 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
mat_reshaped = mat.reshape(-1,3)
# Call this Method B
%%timeit
mat_means = mat_reshaped.mean(axis=0)
135 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is odd, since taking the mean multiple times has the same bad access pattern as (perhaps even a worse one than) the mean on the reshaped array. We also do more operations this way. As a sanity check, I converted the array to Fortran order:
mat_reshaped_fortran = mat.reshape(-1,3, order='F')
%%timeit
mat_means = mat_reshaped_fortran.mean(axis=0)
12.2 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This yields the performance improvement I expected.
For Method A, prun gives:
36 function calls in 0.019 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
3 0.018 0.006 0.018 0.006 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.019 0.019 {built-in method builtins.exec}
3 0.000 0.000 0.019 0.006 _methods.py:143(_mean)
3 0.000 0.000 0.000 0.000 _methods.py:59(_count_reduce_items)
1 0.000 0.000 0.019 0.019 <string>:1(<module>)
3 0.000 0.000 0.019 0.006 {method 'mean' of 'numpy.ndarray' objects}
3 0.000 0.000 0.000 0.000 _asarray.py:86(asanyarray)
3 0.000 0.000 0.000 0.000 {built-in method numpy.array}
3 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
6 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
6 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
While for Method B:
14 function calls in 0.166 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.166 0.166 0.166 0.166 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.166 0.166 {built-in method builtins.exec}
1 0.000 0.000 0.166 0.166 _methods.py:143(_mean)
1 0.000 0.000 0.000 0.000 _methods.py:59(_count_reduce_items)
1 0.000 0.000 0.166 0.166 <string>:1(<module>)
1 0.000 0.000 0.166 0.166 {method 'mean' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 _asarray.py:86(asanyarray)
1 0.000 0.000 0.000 0.000 {built-in method numpy.array}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
2 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
2 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Note: np.setbufsize(1e7) doesn't seem to have any effect.
What is the reason for this performance difference?
Let's call your original matrix mat. mat.shape = (10000,32,32,3). Visually, this is like having a "stack" of 10,000 32x32x3 rectangular prisms (I think of them as LEGOs) of floats.
Now let's think about what you did in terms of floating point operations (flops):
In Method A, you do mat.mean(axis=0).mean(axis=0).mean(axis=0). Let's break this down:
You take the mean of each position (i,j,k) across all 10,000 LEGOs. This gives you back a single LEGO of size 32x32x3 which now contains the first set of means. This means you have performed 10,000 additions and 1 division per mean, of which there are 32*32*3 = 3072. In total, you've done 30,723,072 flops.
You then take the mean again, this time of each position (j,k), where i is now the number of the layer (vertical position) you are currently on. This gives you a piece of paper with 32x3 means written on it. You have performed 32 additions and 1 division per mean, of which there are 32*3 = 96. In total, you've done 3,168 flops.
Finally, you take the mean of each column k, where j is now the row you are currently on. This gives you a stub with 3 means written on it. You have performed 32 additions and 1 division per mean, of which there are 3. In total, you've done 99 flops.
The grand total of all this is 30,723,072 + 3,168 + 99 = 30,726,339 flops.
In Method B, you do mat_reshaped = mat.reshape(-1,3); mat_means = mat_reshaped.mean(axis=0). Let's break this down:
You reshaped everything, so mat is a long roll of paper of size 10,240,000x3. You take the mean of each column k, where j is now the row you are currently on. This gives you a stub with 3 means written on it. You have performed 10,240,000 additions and 1 division per mean, of which there are 3. In total, you've done 30,720,003 flops.
So now you're saying to yourself "What! All of that work, only to show that the slower method actually does less work?!" Here's the problem: although Method B has less work to do, it does not have a lot less work to do, meaning that, just from a flop standpoint, we would expect the runtimes to be similar.
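The counts above can be double-checked with a few lines of arithmetic. This is only a sketch that follows the same counting convention as this answer (one addition per value plus one division per mean); mean_flops is a hypothetical helper, not a NumPy function.

```python
# Shape of the original array: 10,000 blocks of 32 x 32 x 3 floats.
n, a, b, c = 10000, 32, 32, 3

def mean_flops(items_per_mean, number_of_means):
    # One addition per item plus one division, for each mean computed.
    return (items_per_mean + 1) * number_of_means

# Method A: three successive means over axis 0.
method_a = (
    mean_flops(n, a * b * c)   # 10,000 values per mean, 3,072 means
    + mean_flops(a, b * c)     # 32 values per mean, 96 means
    + mean_flops(b, c)         # 32 values per mean, 3 means
)

# Method B: one mean over the reshaped (10,240,000, 3) array.
method_b = mean_flops(n * a * b, c)

print(method_a)             # 30726339
print(method_b)             # 30720003
print(method_a / method_b)  # very close to 1
```

The two totals differ by only a few thousand operations out of roughly 30 million, which is why the flop count alone cannot account for a ~10x gap in runtime.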
You also have to consider the size of your reshaped array in Method B: a matrix with 10,240,000 rows is huge. It's really hard/inefficient for the computer to access all of that, and more memory accesses mean longer runtimes. In its original 10,000x32x32x3 shape, the matrix was already partitioned into convenient slices that the computer could access more efficiently. This is a common technique when handling giant matrices: Jaime's response to a similar question and this article both talk about how breaking up a big matrix into smaller slices helps your program be more memory efficient, therefore making it run faster.

How to profile flask endpoint?

I would like to profile a Flask app's endpoint to see where it slows down when executing the endpoint's functions. I have tried PyCharm's built-in profiler, but the output tells me that most of the time is spent in the wait function, i.e. waiting for user input. I have tried installing flask-profiler but was not able to set it up because my project structure differs from what the package expects. Any help is appreciated. Thank you!
Werkzeug has a built in application profiler based on cProfile.
With help from this gist I managed to set it up as follows:
from flask import Flask
from werkzeug.middleware.profiler import ProfilerMiddleware
from time import sleep
app = Flask(__name__)
# Wrap the WSGI app so every request is profiled with cProfile
app.wsgi_app = ProfilerMiddleware(app.wsgi_app)
@app.route('/')
def index():
    print('begin')
    sleep(3)
    print('end')
    return 'success'
A request to this endpoint results in the following summary in the terminal:
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
begin
end
--------------------------------------------------------------------------------
PATH: '/'
298 function calls in 2.992 seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.969 2.969 2.969 2.969 {built-in method time.sleep}
1 0.002 0.002 0.011 0.011 /usr/local/lib/python3.7/site-packages/flask/app.py:1955(finalize_request)
1 0.002 0.002 0.008 0.008 /usr/local/lib/python3.7/site-packages/werkzeug/wrappers/base_response.py:173(__init__)
35 0.002 0.000 0.002 0.000 {built-in method builtins.isinstance}
4 0.001 0.000 0.001 0.000 /usr/local/lib/python3.7/site-packages/werkzeug/datastructures.py:910(_unicodify_header_value)
2 0.001 0.000 0.003 0.002 /usr/local/lib/python3.7/site-packages/werkzeug/datastructures.py:1298(__setitem__)
1 0.001 0.001 0.001 0.001 /usr/local/lib/python3.7/site-packages/werkzeug/datastructures.py:960(__getitem__)
6 0.001 0.000 0.001 0.000 /usr/local/lib/python3.7/site-packages/werkzeug/_compat.py:210(to_unicode)
2 0.000 0.000 0.002 0.001 /usr/local/lib/python3.7/site-packages/werkzeug/datastructures.py:1212(set)
4 0.000 0.000 0.000 0.000 {method 'decode' of 'bytes' objects}
1 0.000 0.000 0.002 0.002 /usr/local/lib/python3.7/site-packages/werkzeug/wrappers/base_response.py:341(set_data)
10 0.000 0.000 0.001 0.000 /usr/local/lib/python3.7/site-packages/werkzeug/local.py:70(__getattr__)
8 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.008 0.008 /usr/local/lib/python3.7/site-packages/flask/app.py:2029(make_response)
1 0.000 0.000 0.004 0.004 /usr/local/lib/python3.7/site-packages/werkzeug/routing.py:1551(bind_to_environ)
1 0.000 0.000 0.000 0.000 /usr/local/lib/python3.7/site-packages/werkzeug/_internal.py:67(_get_environ)
1 0.000 0.000 0.001 0.001 /usr/local/lib/python3.7/site-packages/werkzeug/routing.py:1674(__init__)
[snipped for brevity]
You could limit the results down slightly by passing a restrictions argument:
restrictions (Iterable[Union[str, int, float]]) – A tuple of restrictions to filter stats by. See pstats.Stats.print_stats().
So, for example if you were interested in the python file living at /code/app.py specifically, you could instead define the profiler like:
app.wsgi_app = ProfilerMiddleware(app.wsgi_app, restrictions=('/code/app.py',))
Resulting in the output:
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
begin
end
--------------------------------------------------------------------------------
PATH: '/'
300 function calls in 3.016 seconds
Ordered by: internal time, call count
List reduced from 131 to 2 due to restriction <'/code/app.py'>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 3.007 3.007 /code/app.py:12(index)
1 0.000 0.000 2.002 2.002 /code/app.py:9(slower)
--------------------------------------------------------------------------------
With some tweaking this could prove useful to solve your issue.
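Since the restrictions are handed on to pstats.Stats.print_stats(), their effect can be tried out without Flask at all. The following is a minimal standard-library sketch, assuming a made-up handler() and slow_step() in place of a real endpoint; it is not how Werkzeug is implemented, only the same filtering step in isolation.

```python
import cProfile
import io
import pstats
import time

def slow_step():
    # Stand-in for the slow part of a request handler.
    time.sleep(0.05)

def handler():
    slow_step()
    return "success"

profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

# A string restriction is treated as a regular expression matched against
# the "filename:lineno(function)" column; here it keeps only slow_step.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("time", "calls").print_stats("slow_step")
report = stream.getvalue()
print(report)
```

An integer restriction keeps that many lines and a float between 0 and 1 keeps that fraction of the list, so restrictions=(10,) or restrictions=(0.1,) are other ways to trim the output.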

strange result from timeit

I tried to reproduce the functionality of IPython's %timeit, but for some strange reason the results for one function are horrific.
IPython:
In [11]: from random import shuffle
   ....: import numpy as np
   ....: def numpy_seq_el_rank(seq, el):
   ....:     return sum(seq < el)
   ....:
   ....: seq = np.array(xrange(10000))
   ....: shuffle(seq)
   ....:
In [12]: %timeit numpy_seq_el_rank(seq, 10000//2)
10000 loops, best of 3: 46.1 µs per loop
Python:
from timeit import timeit, repeat
def my_timeit(code, setup, rep, loops):
    result = repeat(code, setup=setup, repeat=rep, number=loops)
    return '%d loops, best of %d: %0.9f sec per loop'%(loops, rep, min(result))
np_setup = '''
from random import shuffle
import numpy as np
def numpy_seq_el_rank(seq, el):
    return sum(seq < el)
seq = np.array(xrange(10000))
shuffle(seq)
'''
np_code = 'numpy_seq_el_rank(seq, 10000//2)'
print 'Numpy seq_el_rank:\n\t%s'%my_timeit(code=np_code, setup=np_setup, rep=3, loops=100)
And its output:
Numpy seq_el_rank:
100 loops, best of 3: 1.655324947 sec per loop
As you can see, in Python I ran 100 loops instead of the 10000 used in IPython (and got a result 35000 times slower), because it takes a really long time. Can anybody explain why the result in Python is so slow?
UPD:
Here is cProfile.run('my_timeit(code=np_code, setup=np_setup, rep=3, loops=10000)') output:
30650 function calls in 4.987 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 4.987 4.987 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 <timeit-src>:2(<module>)
3 0.001 0.000 4.985 1.662 <timeit-src>:2(inner)
300 0.006 0.000 4.961 0.017 <timeit-src>:7(numpy_seq_el_rank)
1 0.000 0.000 4.987 4.987 Lab10.py:47(my_timeit)
3 0.019 0.006 0.021 0.007 random.py:277(shuffle)
1 0.000 0.000 0.002 0.002 timeit.py:121(__init__)
3 0.000 0.000 4.985 1.662 timeit.py:185(timeit)
1 0.000 0.000 4.985 4.985 timeit.py:208(repeat)
1 0.000 0.000 4.987 4.987 timeit.py:239(repeat)
2 0.000 0.000 0.000 0.000 timeit.py:90(reindent)
3 0.002 0.001 0.002 0.001 {compile}
3 0.000 0.000 0.000 0.000 {gc.disable}
3 0.000 0.000 0.000 0.000 {gc.enable}
3 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
3 0.000 0.000 0.000 0.000 {isinstance}
3 0.000 0.000 0.000 0.000 {len}
3 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
29997 0.001 0.000 0.001 0.000 {method 'random' of '_random.Random' objects}
2 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {min}
3 0.003 0.001 0.003 0.001 {numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {range}
300 4.955 0.017 4.955 0.017 {sum}
6 0.000 0.000 0.000 0.000 {time.clock}
Well, one issue is that you're misreading the results. ipython is telling you how long each of the 10,000 iterations took, for the set of 10,000 iterations with the lowest total time. The timeit.repeat function is reporting how long the whole round of 100 iterations took (again, for the shortest of three). So the real discrepancy is 46.1 µs per loop (ipython) vs. 16.5 ms per loop (python), still a factor of ~350x difference, but not 35,000x.
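To make the two numbers comparable, the total returned by timeit.repeat has to be divided by the loop count. A small Python 3 sketch of that conversion, using a throwaway statement (sum(range(1000))) in place of the question's function:

```python
from timeit import repeat

loops = 100
rounds = 3

# Each entry is the total time, in seconds, for `loops` executions.
totals = repeat("sum(range(1000))", repeat=rounds, number=loops)

best_total = min(totals)
best_per_loop = best_total / loops  # comparable to IPython's "per loop" figure

print("%d loops, best of %d: %.3f usec per loop"
      % (loops, rounds, best_per_loop * 1e6))
```

In the question's my_timeit, replacing min(result) with min(result) / loops would make the "sec per loop" label accurate.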
You didn't show profiling results for ipython. Is it possible that in your ipython session, you did either from numpy import sum or from numpy import *? If so, you'd have been timing the numpy.sum (which is optimized for numpy arrays and would run several orders of magnitude faster), while your python code (which isolated the globals in a way that ipython does not) ran the normal sum (that has to convert all the values to Python ints and sum them).
If you check your profiling output, virtually all of your work is being done in sum; if that part of your code was sped up by several orders of magnitude, the total time would reduce similarly. That would explain the "real" discrepancy; in the test case linked above, it was a 40x difference, and that was for a smaller array (the smaller the array, the less numpy can "show off") with more complex values (vs. summing 0s and 1s here I believe).
The remainder (if any) is probably an issue of how the code is being evaled slightly differently, or possibly weirdness with the random shuffle (for consistent tests, you'd want to seed random with a consistent seed to make the "randomness" repeatable) but I doubt that's a difference of more than a few percent.
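The difference between the two sum functions is easy to check for yourself. Here is a rough Python 3 sketch (it needs NumPy installed) that calls both explicitly, so nothing depends on which name happens to be bound to sum in the session. The timings it prints will vary by machine, so none are quoted here.

```python
import builtins
import timeit

import numpy as np

seq = np.arange(10000)
np.random.seed(0)          # fixed seed so the shuffle is repeatable
np.random.shuffle(seq)
mask = seq < 10000 // 2    # boolean array, as in numpy_seq_el_rank

# Both calls count the True entries, but by very different routes.
count_builtin = builtins.sum(mask)   # iterates element by element in Python
count_numpy = np.sum(mask)           # single vectorized reduction

t_builtin = timeit.timeit(lambda: builtins.sum(mask), number=20) / 20
t_numpy = timeit.timeit(lambda: np.sum(mask), number=20) / 20
print(count_builtin, count_numpy)
print("builtin sum: %.6f s per call, numpy sum: %.6f s per call"
      % (t_builtin, t_numpy))
```

Both calls return the same count, so a shadowed sum would not show up as a wrong answer, only as a large change in speed.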
There could be any number of reasons this code is running slower in one implementation of python than another. One may be optimized differently than another, one may pre-compile certain parts while the other is fully interpreted. The only way to figure out why is to profile your code.
https://docs.python.org/2/library/profile.html
import cProfile
cProfile.run('repeat(code, setup=setup, repeat=rep, number=loops)')
Will give a result similar to
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <stdin>:1(testing)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'upper' of 'str' objects}
Which shows you when function calls were made, how many times they were made and how long they took.

Most efficient way to create an array of cos and sin in Numpy

I need to store an array of size n with values of cos(x) and sin(x), let's say
array[[cos(0.9), sin(0.9)],
[cos(0.35),sin(0.35)],
...]
The arguments of each pair of cos and sin is given by random choice. My code as far as I have been improving it is like this:
def randvector():
    """ Generates random direction for n junctions in the unitary circle """
    x = np.empty([n,2])
    theta = 2 * np.pi * np.random.random_sample((n))
    x[:,0] = np.cos(theta)
    x[:,1] = np.sin(theta)
    return x
Is there a shorter way or more effective way to achieve this?
Your code is effective enough, and I think justhalf's answer is not bad.
For something both effective and short, how about this code?
def randvector(n):
    theta = 2 * np.pi * np.random.random_sample((n))
    return np.vstack((np.cos(theta), np.sin(theta))).T
UPDATE
Append cProfile result.
justhalf's
5 function calls in 4.707 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 4.707 4.707 <string>:1(<module>)
1 2.452 2.452 4.706 4.706 test.py:6(randvector1)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.010 0.010 0.010 0.010 {method 'random_sample' of 'mtrand.RandomState' objects}
1 2.244 2.244 2.244 2.244 {numpy.core.multiarray.array}
OP's
5 function calls in 0.088 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.088 0.088 <string>:1(<module>)
1 0.079 0.079 0.088 0.088 test.py:9(randvector2)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.009 0.009 0.009 0.009 {method 'random_sample' of 'mtrand.RandomState' objects}
1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.empty}
mine
21 function calls in 0.087 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.087 0.087 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 numeric.py:322(asanyarray)
1 0.000 0.000 0.002 0.002 shape_base.py:177(vstack)
2 0.000 0.000 0.000 0.000 shape_base.py:58(atleast_2d)
1 0.076 0.076 0.087 0.087 test.py:17(randvector3)
6 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {map}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.009 0.009 0.009 0.009 {method 'random_sample' of 'mtrand.RandomState' objects}
2 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}
1 0.002 0.002 0.002 0.002 {numpy.core.multiarray.concatenate}
Your code already looks fine to me, but here are a few more thoughts.
Here's a one-liner.
It is marginally slower than your version.
def randvector2(n):
    return np.exp((2.0j * np.pi) * np.random.rand(n, 1)).view(dtype=np.float64)
I get these timings for n=10000
Yours:
1000 loops, best of 3: 716 µs per loop
my shortened version:
1000 loops, best of 3: 834 µs per loop
Now if speed is a concern, your approach is really very good.
Another answer shows how to use hstack.
That works well.
Here is another version that is just a little different from yours and is marginally faster.
def randvector3(n):
    x = np.empty([n,2])
    theta = (2 * np.pi) * np.random.rand(n)
    np.cos(theta, out=x[:,0])
    np.sin(theta, out=x[:,1])
    return x
This gives me the timing:
1000 loops, best of 3: 698 µs per loop
If you have access to numexpr, the following is faster (at least on my machine).
import numexpr as ne
def randvector3(n):
    sample = np.random.rand(n, 1)
    c = 2.0j * np.pi
    return ne.evaluate('exp(c * sample)').view(dtype=np.float64)
This gives me the timing:
1000 loops, best of 3: 366 µs per loop
Honestly though, if I were writing this for anything that wasn't extremely performance intensive, I'd do pretty much the same thing you did.
It makes your intent pretty clear to the reader.
The version with hstack works well too.
Another quick note:
When I run timings for n=10, my one-line version is fastest.
When I do n=10000000, the fast pure-numpy version is fastest.
You can use list comprehension to make the code a little bit shorter:
def randvector(n):
    return np.array([(np.cos(theta), np.sin(theta)) for theta in 2*np.pi*np.random.random_sample(n)])
But, as IanH mentioned in comments, this is slower. In fact, through my experiment, this is 5x slower, because this doesn't take advantage of NumPy vectorization.
So to answer your question:
Is there a shorter way?
Yes, which is what I give in this answer, although it's only shorter by a few characters (but it saves many lines!)
Is there a more effective (I believe you meant "efficient") way?
I believe the answer to this question, without overly complicating the code, is no, since numpy already optimizes the vectorization (assigning of the cos and sin values to the array)
Timing
Comparing various methods:
OP's randvector: 0.002131 s
My randvector: 0.013218 s
mskimm's randvector: 0.003175 s
So it seems that mskimm's randvector looks good in terms of code length and efficiency =D
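Before comparing timings it is worth confirming that the different approaches really return the same array. The sketch below does that for three of them. The function names are made up for this check, and each one takes the angles as an argument (instead of drawing them internally, as the versions in the answers do) so the outputs can be compared directly.

```python
import numpy as np

def with_column_assignment(theta):
    # The question's approach: fill a preallocated (n, 2) array.
    x = np.empty((theta.size, 2))
    x[:, 0] = np.cos(theta)
    x[:, 1] = np.sin(theta)
    return x

def with_vstack(theta):
    # The vstack-and-transpose approach from one of the answers.
    return np.vstack((np.cos(theta), np.sin(theta))).T

def with_complex_exp(theta):
    # exp(i*theta) = cos(theta) + i*sin(theta); viewing the complex column
    # as float64 exposes the real and imaginary parts as two columns.
    return np.exp(1j * theta.reshape(-1, 1)).view(dtype=np.float64)

theta = 2 * np.pi * np.random.RandomState(0).random_sample(1000)
a = with_column_assignment(theta)
b = with_vstack(theta)
c = with_complex_exp(theta)
print(a.shape, b.shape, c.shape)
print(np.allclose(a, b), np.allclose(a, c))
```

All three give an (n, 2) array of unit vectors, so the choice between them comes down to speed and readability.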

Python getting meaningful results from cProfile

I have a Python script in a file which takes just over 30 seconds to run. I am trying to profile it as I would like to cut down this time dramatically.
I am trying to profile the script using cProfile, but essentially all it seems to be telling me is that yes, the main script took a long time to run, but doesn't give the kind of breakdown I was expecting. At the terminal, I type something like:
cat my_script_input.txt | python -m cProfile -s time my_script.py
The results I get are:
<my_script_output>
683121 function calls (682169 primitive calls) in 32.133 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 31.980 31.980 32.133 32.133 my_script.py:18(<module>)
121089 0.050 0.000 0.050 0.000 {method 'split' of 'str' objects}
121090 0.038 0.000 0.049 0.000 fileinput.py:243(next)
2 0.027 0.014 0.036 0.018 {method 'sort' of 'list' objects}
121089 0.009 0.000 0.009 0.000 {method 'strip' of 'str' objects}
201534 0.009 0.000 0.009 0.000 {method 'append' of 'list' objects}
100858 0.009 0.000 0.009 0.000 my_script.py:51(<lambda>)
952 0.008 0.000 0.008 0.000 {method 'readlines' of 'file' objects}
1904/952 0.003 0.000 0.011 0.000 fileinput.py:292(readline)
14412 0.001 0.000 0.001 0.000 {method 'add' of 'set' objects}
182 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.000 0.000 0.000 0.000 fileinput.py:80(<module>)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 fileinput.py:184(FileInput)
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
This doesn't seem to be telling me anything useful. The vast majority of the time is simply listed as:
ncalls tottime percall cumtime percall filename:lineno(function)
1 31.980 31.980 32.133 32.133 my_script.py:18(<module>)
In my_script.py, Line 18 is nothing more than the closing """ of the file's header block comment, so it's not that there is a whole load of work concentrated in Line 18. The script as a whole is mostly made up of line-based processing with mostly some string splitting, sorting and set work, so I was expecting to find the majority of time going to one or more of these activities. As it stands, seeing all the time grouped in cProfile's results as occurring on a comment line doesn't make any sense or at least does not shed any light on what is actually consuming all the time.
EDIT: I've constructed a minimum working example similar to my above case to demonstrate the same behavior:
mwe.py
import fileinput
for line in fileinput.input():
    for i in range(10):
        y = int(line.strip()) + int(line.strip())
And call it with:
perl -e 'for(1..1000000){print "$_\n"}' | python -m cProfile -s time mwe.py
To get the result:
22002536 function calls (22001694 primitive calls) in 9.433 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 8.004 8.004 9.433 9.433 mwe.py:1(<module>)
20000000 1.021 0.000 1.021 0.000 {method 'strip' of 'str' objects}
1000001 0.270 0.000 0.301 0.000 fileinput.py:243(next)
1000000 0.107 0.000 0.107 0.000 {range}
842 0.024 0.000 0.024 0.000 {method 'readlines' of 'file' objects}
1684/842 0.007 0.000 0.032 0.000 fileinput.py:292(readline)
1 0.000 0.000 0.000 0.000 fileinput.py:80(<module>)
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:184(FileInput)
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Am I using cProfile incorrectly somehow?
As I mentioned in a comment, when you can't get cProfile to work externally, you can often use it internally instead. It's not that hard.
For example, when I run with -m cProfile in my Python 2.7, I get effectively the same results you did. But when I manually instrument your example program:
import fileinput
import cProfile
pr = cProfile.Profile()
pr.enable()
for line in fileinput.input():
    for i in range(10):
        y = int(line.strip()) + int(line.strip())
pr.disable()
pr.print_stats(sort='time')
… here's what I get:
22002533 function calls (22001691 primitive calls) in 3.352 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
20000000 2.326 0.000 2.326 0.000 {method 'strip' of 'str' objects}
1000001 0.646 0.000 0.700 0.000 fileinput.py:243(next)
1000000 0.325 0.000 0.325 0.000 {range}
842 0.042 0.000 0.042 0.000 {method 'readlines' of 'file' objects}
1684/842 0.013 0.000 0.055 0.000 fileinput.py:292(readline)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
That's a lot more useful: It tells you what you probably already expected, that more than half your time is spent calling str.strip().
Also, note that if you can't edit the file containing code you wish to profile (mwe.py), you can always do this:
import cProfile
pr = cProfile.Profile()
pr.enable()
import mwe
pr.disable()
pr.print_stats(sort='time')
Even that doesn't always work. If your program calls exit(), for example, you'll have to use a try:/finally: wrapper and/or an atexit. And if it calls os._exit(), or segfaults, you're probably completely hosed. But that isn't very common.
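A minimal Python 3 sketch of the try:/finally: wrapper, assuming a made-up main() that ends by calling sys.exit(); work() is only a stand-in for the real program body:

```python
import cProfile
import io
import pstats
import sys

def work():
    # Stand-in for the real program body.
    return sum(int(str(i).strip()) for i in range(1000))

def main():
    work()
    sys.exit(0)  # raises SystemExit, which would skip a plain pr.disable()

pr = cProfile.Profile()
stream = io.StringIO()
pr.enable()
try:
    main()
except SystemExit:
    # Swallowed here only so the sketch can keep running; a real script
    # would normally let it propagate after the finally block has run.
    pass
finally:
    pr.disable()
    pstats.Stats(pr, stream=stream).sort_stats("time").print_stats()

report = stream.getvalue()
print(report)
```

The finally block runs even though main() never returns normally, so the stats are still printed. This does not help with os._exit() or a segfault, as noted above.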
However, something I discovered later: If you move all code out of the global scope, -m cProfile seems to work, at least for this case. For example:
import fileinput
def f():
    for line in fileinput.input():
        for i in range(10):
            y = int(line.strip()) + int(line.strip())
f()
Now the output from -m cProfile includes, among other things:
2000000 4.819 0.000 4.819 0.000 :0(strip)
100001 0.288 0.000 0.295 0.000 fileinput.py:243(next)
I have no idea why this also made it twice as slow… or maybe that's just a cache effect; it's been a few minutes since I last ran it, and I've done lots of web browsing in between. But that's not important, what's important is that most of the time is getting charged to reasonable places.
But if I change this to move the outer loop to the global level, and only its body into a function, most of the time disappears again.
Another alternative, which I wouldn't suggest except as a last resort…
I notice that if I use profile instead of cProfile, it works both internally and externally, charging time to the right calls. However, those calls are also about 5x slower. And there seems to be an additional 10 seconds of constant overhead (which gets charged to import profile if used internally, whatever's on line 1 if used externally). So, to find out that strip is using 70% of my time, instead of waiting 4 seconds and doing 2.326 / 3.352, I have to wait 27 seconds, and do 10.93 / (26.34 - 10.01). Not much fun…
One last thing: I get the same results with a CPython 3.4 dev build—correct results when used internally, everything charged to the first line of code when used externally. But PyPy 2.2/2.7.3 and PyPy3 2.1b1/3.2.3 both seem to give me correct results with -m cProfile. This may just mean that PyPy's cProfile is faked on top of profile because the pure-Python code is fast enough.
Anyway, if someone can figure out/explain why -m cProfile isn't working, that would be great… but otherwise, this is usually a perfectly good workaround.
