Why does python call builtins.compile when importing numpy?

I ran this code with Python 3.7 to see what happens when I call import numpy:
import cProfile, pstats
profiler = cProfile.Profile()
profiler.enable()
import numpy
profiler.disable()
# Get and print table of stats
stats = pstats.Stats(profiler).sort_stats('time')
stats.print_stats()
The first few lines of output look like this:
79557 function calls (76496 primitive calls) in 0.120 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
32/30 0.015 0.000 0.017 0.001 {built-in method _imp.create_dynamic}
318 0.015 0.000 0.015 0.000 {built-in method builtins.compile}
115 0.011 0.000 0.011 0.000 {built-in method marshal.loads}
648 0.006 0.000 0.006 0.000 {built-in method posix.stat}
119 0.004 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:914(get_data)
246/244 0.004 0.000 0.007 0.000 {built-in method builtins.__build_class__}
329 0.002 0.000 0.012 0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
59 0.002 0.000 0.002 0.000 {built-in method posix.getcwd}
It spends a lot of time on builtins.compile. Is it compiling the NumPy source to bytecode for __pycache__? Why would that happen on every import?
I'm on macOS. What I really want is to speed up the import, and it seems to me that compile should not be necessary.

User L3viathan pointed out in a comment that the code for numpy contains explicit calls to compile. This explains why builtins.compile is getting called. Thanks!
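If you want to confirm where the compile calls come from in your own run, pstats can list the callers of a function. A minimal sketch, reusing the profiling setup from the question (the 'compile' string is just a name filter passed to print_callers):
import cProfile, pstats

profiler = cProfile.Profile()
profiler.enable()
import numpy
profiler.disable()

# Instead of sorting by time, list the callers of builtins.compile;
# the caller entries point at the NumPy modules that call compile() during import.
stats = pstats.Stats(profiler)
stats.print_callers('compile')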

Related

How to improve "nt.stat" and "getmodule" performance in python?

I am running a Python script that uses a local package folder on my computer. It is significantly slow, so I tried to profile it with cProfile. Below is the result:
27163509262 function calls (26876957168 primitive calls) in 45242.287 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
62862676 9841.903 0.000 9841.904 0.000 {built-in method nt.stat}
23360775 1778.411 0.000 4344.513 0.000 mypath\AppData\Local\Programs\Python\Python310\lib\inspect.py:850(getmodule)
-1833258512/-1833258536 1667.529 -0.000 2230.835 -0.000 {built-in method builtins.isinstance}
42865521 1168.789 0.000 1373.451 0.000 mypath\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\dtypes\cast.py:468(maybe_promote)
-1058433517 1035.505 -0.000 1065.701 -0.000 {built-in method builtins.hasattr}
515005085 967.796 0.000 4666.300 0.000 mypath\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py:943(__getitem__)
-1322991622 676.388 -0.000 908.008 -0.000 mypath\AppData\Local\Programs\Python\Python310\lib\inspect.py:182(ismodule)
515005085 542.937 0.000 2657.906 0.000 mypath\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py:1052(_get_value)
23960548 479.766 0.000 646.542 0.000 {pandas._libs.lib.maybe_convert_objects}
562604434 475.813 0.000 1652.195 0.000 mypath\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexes\base.py:3577(get_loc)
3364611 475.311 0.000 475.311 0.000 {pandas._libs.tslibs.vectorized.ints_to_pydatetime}
...
The first two functions above, nt.stat and getmodule, cost me 33% of the time. What are they and how can I improve them?
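One way to see what is responsible is to ask pstats for the callers of those two entries. A minimal sketch, assuming the profile was saved to a stats file named restats.txt (adjust the filename to however you ran cProfile):
import pstats

# Load a previously saved profile, e.g. from cProfile.run('main()', 'restats.txt').
p = pstats.Stats('restats.txt')

# List the functions that call nt.stat and inspect.getmodule; the callers
# are usually what needs changing, not the built-ins themselves.
p.print_callers('nt.stat')
p.print_callers('getmodule')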

How to profile flask endpoint?

I would like to profile a Flask app's endpoints to see where execution slows down. I have tried using PyCharm's built-in profiler, but the output tells me that most time is spent in the wait function, i.e. waiting for user input. I have tried installing flask-profiler but was not able to set it up because my project structure differs from what the package expects. Any help is appreciated. Thank you!
Werkzeug has a built in application profiler based on cProfile.
With help from this gist I managed to set it up as follows:
from flask import Flask
from werkzeug.middleware.profiler import ProfilerMiddleware
from time import sleep
app = Flask(__name__)
app.wsgi_app = ProfilerMiddleware(app.wsgi_app)
@app.route('/')
def index():
    print('begin')
    sleep(3)
    print('end')
    return 'success'

# Run the dev server (matches the http://0.0.0.0:5000/ output below).
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
A request to this endpoint results in the following summary in the terminal:
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
begin
end
--------------------------------------------------------------------------------
PATH: '/'
298 function calls in 2.992 seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.969 2.969 2.969 2.969 {built-in method time.sleep}
1 0.002 0.002 0.011 0.011 /usr/local/lib/python3.7/site-packages/flask/app.py:1955(finalize_request)
1 0.002 0.002 0.008 0.008 /usr/local/lib/python3.7/site-packages/werkzeug/wrappers/base_response.py:173(__init__)
35 0.002 0.000 0.002 0.000 {built-in method builtins.isinstance}
4 0.001 0.000 0.001 0.000 /usr/local/lib/python3.7/site-packages/werkzeug/datastructures.py:910(_unicodify_header_value)
2 0.001 0.000 0.003 0.002 /usr/local/lib/python3.7/site-packages/werkzeug/datastructures.py:1298(__setitem__)
1 0.001 0.001 0.001 0.001 /usr/local/lib/python3.7/site-packages/werkzeug/datastructures.py:960(__getitem__)
6 0.001 0.000 0.001 0.000 /usr/local/lib/python3.7/site-packages/werkzeug/_compat.py:210(to_unicode)
2 0.000 0.000 0.002 0.001 /usr/local/lib/python3.7/site-packages/werkzeug/datastructures.py:1212(set)
4 0.000 0.000 0.000 0.000 {method 'decode' of 'bytes' objects}
1 0.000 0.000 0.002 0.002 /usr/local/lib/python3.7/site-packages/werkzeug/wrappers/base_response.py:341(set_data)
10 0.000 0.000 0.001 0.000 /usr/local/lib/python3.7/site-packages/werkzeug/local.py:70(__getattr__)
8 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.008 0.008 /usr/local/lib/python3.7/site-packages/flask/app.py:2029(make_response)
1 0.000 0.000 0.004 0.004 /usr/local/lib/python3.7/site-packages/werkzeug/routing.py:1551(bind_to_environ)
1 0.000 0.000 0.000 0.000 /usr/local/lib/python3.7/site-packages/werkzeug/_internal.py:67(_get_environ)
1 0.000 0.000 0.001 0.001 /usr/local/lib/python3.7/site-packages/werkzeug/routing.py:1674(__init__)
[snipped for brevity]
You can narrow the results down by passing a restrictions argument:
restrictions (Iterable[Union[str, int, float]]) – A tuple of restrictions to filter stats by. See pstats.Stats.print_stats().
So, for example if you were interested in the python file living at /code/app.py specifically, you could instead define the profiler like:
app.wsgi_app = ProfilerMiddleware(app.wsgi_app, restrictions=('/code/app.py',))
Resulting in the output:
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
begin
end
--------------------------------------------------------------------------------
PATH: '/'
300 function calls in 3.016 seconds
Ordered by: internal time, call count
List reduced from 131 to 2 due to restriction <'/code/app.py'>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 3.007 3.007 /code/app.py:12(index)
1 0.000 0.000 2.002 2.002 /code/app.py:9(slower)
--------------------------------------------------------------------------------
With some tweaking this could prove useful to solve your issue.
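If you want to dig further than the printed summary, ProfilerMiddleware also accepts a profile_dir argument, which writes one profile file per request that you can load later with pstats. The ./profiles directory below is just an example and should exist beforehand:
from flask import Flask
from werkzeug.middleware.profiler import ProfilerMiddleware

app = Flask(__name__)

# Dump a .prof file for every request into ./profiles, in addition to the
# terminal summary; each file can then be opened with pstats.
app.wsgi_app = ProfilerMiddleware(
    app.wsgi_app,
    restrictions=('/code/app.py',),
    profile_dir='./profiles',
)

# Later, in a separate session:
# pstats.Stats('profiles/<dumped file>.prof').sort_stats('time').print_stats(10)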

Python - Garbage collection very slow, can't disable gc

I am developing a program that uses pandas dataframes and large dictionaries. The dataframe is read from a CSV that is approx. 700MB.
I am using Python 3.7.3 on Windows
I noticed that the program I am running is extremely slow, and slows down after each loop of the algorithm.
The program reads every line of the dataframe, checks some conditions on every item of every line of the df, and if those conditions are met, it stores the item and its state in a dictionary. This dictionary can get pretty big.
I have tried profiling my code with cProfile and found that the garbage collector uses up about 90% of the execution time.
I have seen similar problems resolved by calling gc.disable() but this did nothing for me.
Weirdly (I have no idea if this is normal), if I print(len(gc.get_objects())) as the first line of the code I get 51053, which seems like a lot considering no function has been called yet.
My cProfile attempt (on a small chunk of the CSV, as it would take hours to complete on the full CSV):
import cProfile
cProfile.run('get_pfs_errors("Logs/L5/L5_2000.csv")', 'restats.txt')
import pstats
from pstats import SortKey
p = pstats.Stats('restats.txt')
p.sort_stats(SortKey.CUMULATIVE).print_stats(10)
p.sort_stats(SortKey.TIME).print_stats(10)
Here are the stats from cProfile:
Tue Jun 18 15:40:19 2019 restats.txt
1719320 function calls (1459451 primitive calls) in 7.569 seconds
Ordered by: cumulative time
List reduced from 819 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 7.569 7.569 {built-in method builtins.exec}
1 0.001 0.001 7.569 7.569 <string>:1(<module>)
1 0.000 0.000 7.568 7.568 C:/Users/BC744818/Documents/OPTISS_L1_5/test_profile.py:6(get_pfs_errors)
1 0.006 0.006 7.503 7.503 C:\Users\BC744818\Documents\OPTISS_L1_5\utils\compute_pfs_rules.py:416(compute_pfs_rules)
1 0.197 0.197 7.498 7.498 C:\Users\BC744818\Documents\OPTISS_L1_5\utils\compute_pfs_rules.py:323(test_logs)
264 0.001 0.000 6.532 0.025 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\series.py:982(__setitem__)
529 0.010 0.000 6.158 0.012 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\generic.py:3205(_check_setitem_copy)
528 6.125 0.012 6.125 0.012 {built-in method gc.collect}
264 0.004 0.000 3.430 0.013 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\series.py:985(setitem)
264 0.004 0.000 3.413 0.013 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\indexing.py:183(__setitem__)
Tue Jun 18 15:40:19 2019 restats.txt
1719320 function calls (1459451 primitive calls) in 7.569 seconds
Ordered by: internal time
List reduced from 819 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
528 6.125 0.012 6.125 0.012 {built-in method gc.collect}
264 0.405 0.002 0.405 0.002 {built-in method gc.get_objects}
1 0.197 0.197 7.498 7.498 C:\Users\BC744818\Documents\OPTISS_L1_5\utils\compute_pfs_rules.py:323(test_logs)
71280/33 0.048 0.000 0.091 0.003 C:\Users\BC744818\AppData\Local\Programs\Python\Python37\lib\copy.py:132(deepcopy)
159671 0.033 0.000 0.056 0.000 {built-in method builtins.isinstance}
289 0.026 0.000 0.026 0.000 {built-in method nt.stat}
167191/83791 0.024 0.000 0.040 0.000 C:\Users\BC744818\AppData\Local\Programs\Python\Python37\lib\json\encoder.py:333(_iterencode_dict)
8118/33 0.019 0.000 0.090 0.003 C:\Users\BC744818\AppData\Local\Programs\Python\Python37\lib\copy.py:236(_deepcopy_dict)
167263/83794 0.017 0.000 0.048 0.000 C:\Users\BC744818\AppData\Local\Programs\Python\Python37\lib\json\encoder.py:277(_iterencode_list)
1067/800 0.017 0.000 0.111 0.000 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\indexes\base.py:253(__new__)
Thank you @user9993950, I solved it thanks to you.
When I tested this program, I had a SettingWithCopyWarning, but I wanted to fix the speed of the program before fixing that warning.
It so happens that by fixing the warning I also greatly increased the speed of the program, and gc is no longer taking up all of the running time.
I don't know what caused this, though; if someone knows and wants to share the knowledge, please do.
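For anyone hitting the same thing: the trace above shows the gc.collect calls coming from pandas' _check_setitem_copy, which runs when you assign through chained indexing, the same pattern that raises SettingWithCopyWarning. A minimal sketch of the fix, with hypothetical column names, assuming the pandas version from the trace:
import pandas as pd

df = pd.DataFrame({'item': ['a', 'b', 'c'], 'value': [10, 20, 30]})

# Chained assignment: raises SettingWithCopyWarning, and the copy check in
# pandas may call gc.collect() on every such write.
# df[df['value'] > 15]['item'] = 'x'     # avoid this pattern

# Assign in a single .loc call instead: no copy check, no gc.collect().
df.loc[df['value'] > 15, 'item'] = 'x'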

Python getting meaningful results from cProfile

I have a Python script in a file which takes just over 30 seconds to run. I am trying to profile it as I would like to cut down this time dramatically.
I am trying to profile the script using cProfile, but essentially all it seems to be telling me is that yes, the main script took a long time to run, but doesn't give the kind of breakdown I was expecting. At the terminal, I type something like:
cat my_script_input.txt | python -m cProfile -s time my_script.py
The results I get are:
<my_script_output>
683121 function calls (682169 primitive calls) in 32.133 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 31.980 31.980 32.133 32.133 my_script.py:18(<module>)
121089 0.050 0.000 0.050 0.000 {method 'split' of 'str' objects}
121090 0.038 0.000 0.049 0.000 fileinput.py:243(next)
2 0.027 0.014 0.036 0.018 {method 'sort' of 'list' objects}
121089 0.009 0.000 0.009 0.000 {method 'strip' of 'str' objects}
201534 0.009 0.000 0.009 0.000 {method 'append' of 'list' objects}
100858 0.009 0.000 0.009 0.000 my_script.py:51(<lambda>)
952 0.008 0.000 0.008 0.000 {method 'readlines' of 'file' objects}
1904/952 0.003 0.000 0.011 0.000 fileinput.py:292(readline)
14412 0.001 0.000 0.001 0.000 {method 'add' of 'set' objects}
182 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.000 0.000 0.000 0.000 fileinput.py:80(<module>)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 fileinput.py:184(FileInput)
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
This doesn't seem to be telling me anything useful. The vast majority of the time is simply listed as:
ncalls tottime percall cumtime percall filename:lineno(function)
1 31.980 31.980 32.133 32.133 my_script.py:18(<module>)
In my_script.py, Line 18 is nothing more than the closing """ of the file's header block comment, so it's not that there is a whole load of work concentrated in Line 18. The script as a whole is mostly made up of line-based processing with mostly some string splitting, sorting and set work, so I was expecting to find the majority of time going to one or more of these activities. As it stands, seeing all the time grouped in cProfile's results as occurring on a comment line doesn't make any sense or at least does not shed any light on what is actually consuming all the time.
EDIT: I've constructed a minimum working example similar to my above case to demonstrate the same behavior:
mwe.py
import fileinput
for line in fileinput.input():
    for i in range(10):
        y = int(line.strip()) + int(line.strip())
And call it with:
perl -e 'for(1..1000000){print "$_\n"}' | python -m cProfile -s time mwe.py
To get the result:
22002536 function calls (22001694 primitive calls) in 9.433 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 8.004 8.004 9.433 9.433 mwe.py:1(<module>)
20000000 1.021 0.000 1.021 0.000 {method 'strip' of 'str' objects}
1000001 0.270 0.000 0.301 0.000 fileinput.py:243(next)
1000000 0.107 0.000 0.107 0.000 {range}
842 0.024 0.000 0.024 0.000 {method 'readlines' of 'file' objects}
1684/842 0.007 0.000 0.032 0.000 fileinput.py:292(readline)
1 0.000 0.000 0.000 0.000 fileinput.py:80(<module>)
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:184(FileInput)
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Am I using cProfile incorrectly somehow?
As I mentioned in a comment, when you can't get cProfile to work externally, you can often use it internally instead. It's not that hard.
For example, when I run with -m cProfile in my Python 2.7, I get effectively the same results you did. But when I manually instrument your example program:
import fileinput
import cProfile
pr = cProfile.Profile()
pr.enable()
for line in fileinput.input():
    for i in range(10):
        y = int(line.strip()) + int(line.strip())
pr.disable()
pr.print_stats(sort='time')
… here's what I get:
22002533 function calls (22001691 primitive calls) in 3.352 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
20000000 2.326 0.000 2.326 0.000 {method 'strip' of 'str' objects}
1000001 0.646 0.000 0.700 0.000 fileinput.py:243(next)
1000000 0.325 0.000 0.325 0.000 {range}
842 0.042 0.000 0.042 0.000 {method 'readlines' of 'file' objects}
1684/842 0.013 0.000 0.055 0.000 fileinput.py:292(readline)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
That's a lot more useful: It tells you what you probably already expected, that more than half your time is spent calling str.strip().
Also, note that if you can't edit the file containing code you wish to profile (mwe.py), you can always do this:
import cProfile
pr = cProfile.Profile()
pr.enable()
import mwe
pr.disable()
pr.print_stats(sort='time')
Even that doesn't always work. If your program calls exit(), for example, you'll have to use a try:/finally: wrapper and/or an atexit. And if it calls os._exit(), or segfaults, you're probably completely hosed. But that isn't very common.
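For completeness, a minimal sketch of that try:/finally: wrapper, still assuming the module to profile is called mwe:
import cProfile

pr = cProfile.Profile()
pr.enable()
try:
    import mwe   # may raise SystemExit if the module calls exit()
finally:
    # Runs even on SystemExit, so the stats still get printed.
    pr.disable()
    pr.print_stats(sort='time')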
However, something I discovered later: If you move all code out of the global scope, -m cProfile seems to work, at least for this case. For example:
import fileinput
def f():
    for line in fileinput.input():
        for i in range(10):
            y = int(line.strip()) + int(line.strip())
f()
Now the output from -m cProfile includes, among other things:
2000000 4.819 0.000 4.819 0.000 :0(strip)
100001 0.288 0.000 0.295 0.000 fileinput.py:243(next)
I have no idea why this also made it twice as slow… or maybe that's just a cache effect; it's been a few minutes since I last ran it, and I've done lots of web browsing in between. But that's not important, what's important is that most of the time is getting charged to reasonable places.
But if I change this to move the outer loop to the global level, and only its body into a function, most of the time disappears again.
Another alternative, which I wouldn't suggest except as a last resort…
I notice that if I use profile instead of cProfile, it works both internally and externally, charging time to the right calls. However, those calls are also about 5x slower. And there seems to be an additional 10 seconds of constant overhead (which gets charged to import profile if used internally, whatever's on line 1 if used externally). So, to find out that split is using 70% of my time, instead of waiting 4 seconds and doing 2.326 / 3.352, I have to wait 27 seconds, and do 10.93 / (26.34 - 10.01). Not much fun…
One last thing: I get the same results with a CPython 3.4 dev build—correct results when used internally, everything charged to the first line of code when used externally. But PyPy 2.2/2.7.3 and PyPy3 2.1b1/3.2.3 both seem to give me correct results with -m cProfile. This may just mean that PyPy's cProfile is faked on top of profile because the pure-Python code is fast enough.
Anyway, if someone can figure out/explain why -m cProfile isn't working, that would be great… but otherwise, this is usually a perfectly good workaround.

When profiling Cython Code, what is `stringsource`?

I have a heavy Cython function that I'm trying to optimize. I am profiling it per the following tutorial: http://docs.cython.org/src/tutorial/profiling_tutorial.html. My profile output looks like this:
ncalls tottime percall cumtime percall filename:lineno(function)
1 7.521 7.521 18.945 18.945 routing_cython_core.pyx:674(resolve_flat_regions_for_drainage)
6189250 4.964 0.000 4.964 0.000 stringsource:323(__cinit__)
6189250 2.978 0.000 7.942 0.000 stringsource:618(memoryview_cwrapper)
6009849 0.868 0.000 0.868 0.000 routing_cython_core.pyx:630(_is_flat)
6189250 0.838 0.000 0.838 0.000 stringsource:345(__dealloc__)
6189250 0.527 0.000 0.527 0.000 stringsource:624(memoryview_check)
1804189 0.507 0.000 0.683 0.000 routing_cython_core.pyx:646(_is_sink)
15141 0.378 0.000 0.378 0.000 {_gdal_array.BandRasterIONumPy}
3 0.066 0.022 0.086 0.029 /home/rpsharp/local/workspace/invest-natcap.invest-3/invest_natcap/raster_utils.py:235(new_raster_from_base_uri)
11763 0.048 0.000 0.395 0.000 /usr/lib/python2.7/dist-packages/osgeo/gdal_array.py:189(BandReadAsArray)
Specifically, I'm interested in lines 2 and 3, which call stringsource:323(__cinit__) and stringsource:618(memoryview_cwrapper) many times. A Google search turned up references to memory views, which I'm not using in that function, although I am statically typing numpy arrays. Any idea what these calls are and whether I can avoid/optimize them?
Okay, turns out I did have a memory view. I was calling an inline function that passed a statically typed numpy array to a memory view, thus invoking all those extra calls to stringsource. Replacing the memoryview type in the function call with a numpy type fixed this.
