Does effective Cython cProfiling imply writing many sub functions?

I am trying to optimize some code with Cython, but cProfile is not providing enough information.
To profile it properly, should I split the code into many sub-routines (func2, func3, ..., func40)?
Note below that I have a function func1 in mycython.pyx that contains many for loops and internal manipulations, but cProfile does not report any stats for those loops.
2009 function calls in 81.254 CPU seconds
Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   81.254   81.254 <string>:1(<module>)
        2    0.000    0.000    0.021    0.010 blah.py:1495(len)
     2000    0.000    0.000    0.000    0.000 blah.py:1498(__getitem__)
        1    0.214    0.214    0.214    0.214 mycython.pyx:718(func2)
        1   80.981   80.981   81.216   81.216 mycython.pyx:743(func1)
        1    0.038    0.038   81.254   81.254 {mycython.func1}
        2    0.021    0.010    0.021    0.010 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Yes, it does. The finest granularity available to cProfile is a function call. You must split up func1 into multiple functions. (Note that you can make them functions defined inside func1 and thus only available to func1.)
If you want finer-grained profiling (line-level), then you need a different profiler. Take a look at this line-level profiler, but I don't think it works for Cython.
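As a rough illustration of the point about function-level granularity, here is a minimal pure-Python sketch (not Cython) in which a monolithic function is split into nested helpers. The body of func1 and the helper names build_squares and sum_values are hypothetical stand-ins, not the code from the question. Each helper then shows up as its own row in the report.

```python
import cProfile
import io
import pstats


def func1(n):
    """Hypothetical stand-in for the monolithic func1 in the question."""

    # Nested helpers are only visible inside func1, yet each one gets
    # its own row in the cProfile report.
    def build_squares(count):
        return [i * i for i in range(count)]

    def sum_values(values):
        total = 0
        for v in values:
            total += v
        return total

    return sum_values(build_squares(n))


def profile_report(n=10_000):
    """Profile one call to func1 and return the text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    func1(n)
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats()
    return buffer.getvalue()


report = profile_report()
print(report)
```

The trade-off is that every extra call adds overhead of its own, so the split is usually most useful as a temporary measurement aid.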

You need to enable profiling support for your Cython code. Add this directive at the top of the .pyx file:
# cython: profile=True
See http://docs.cython.org/src/tutorial/profiling_tutorial.html for details.


Why does python call builtins.compile when importing numpy?

I ran this code with Python 3.7 to see what happens when I import numpy.
import cProfile, pstats
profiler = cProfile.Profile()
profiler.enable()
import numpy
profiler.disable()
# Get and print table of stats
stats = pstats.Stats(profiler).sort_stats('time')
stats.print_stats()
The first few lines of output look like this:
79557 function calls (76496 primitive calls) in 0.120 seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    32/30    0.015    0.000    0.017    0.001 {built-in method _imp.create_dynamic}
      318    0.015    0.000    0.015    0.000 {built-in method builtins.compile}
      115    0.011    0.000    0.011    0.000 {built-in method marshal.loads}
      648    0.006    0.000    0.006    0.000 {built-in method posix.stat}
      119    0.004    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:914(get_data)
  246/244    0.004    0.000    0.007    0.000 {built-in method builtins.__build_class__}
      329    0.002    0.000    0.012    0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
       59    0.002    0.000    0.002    0.000 {built-in method posix.getcwd}
It spends a lot of time on builtins.compile. Is it creating the bytecode for NumPy for pycache? Why would that happen every time?
I'm on Mac OS. What I really want is to speed up the import, and it seems to me that compile should not be necessary.
User L3viathan pointed out in a comment that the code for numpy contains explicit calls to compile. This explains why builtins.compile is getting called. Thanks!
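To see how an explicit call to compile surfaces in a profile, here is a small sketch that is independent of NumPy. The helper build_function_from_source is hypothetical and only mimics the general pattern of a library generating code from source text at run time.

```python
import cProfile
import io
import pstats


def build_function_from_source():
    """Hypothetical helper that compiles source text at run time."""
    source = "def add(a, b):\n    return a + b\n"
    # This explicit call is what appears as builtins.compile in the report.
    code = compile(source, "<generated>", "exec")
    namespace = {}
    exec(code, namespace)
    return namespace["add"]


profiler = cProfile.Profile()
profiler.enable()
add = build_function_from_source()
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("tottime").print_stats()
compile_report = buffer.getvalue()
print(compile_report)
```

Calls like this run on every import regardless of whether cached bytecode exists in __pycache__, which would be consistent with the behaviour described in the question.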

Python - Garbage collection very slow, can't disable gc

I am developing a program that uses pandas DataFrames and large dictionaries. The DataFrame is read from a CSV file of approximately 700 MB.
I am using Python 3.7.3 on Windows.
I noticed that the program is extremely slow, and that it gets slower after each loop of the algorithm.
The program reads every line of the DataFrame, checks some conditions on every item of every line, and, if those conditions are met, stores the item and its state in a dictionary. This dictionary can get pretty big.
I profiled my code with cProfile and found that the garbage collector accounts for about 90% of the execution time.
I have seen similar problems resolved by calling gc.disable(), but this did nothing for me.
Weirdly (I have no idea if this is normal), if I call print(len(gc.get_objects())) as the first line of the code, I get 51053, which seems like a lot considering no function has been called yet.
My cProfile attempt (on a small chunk of the CSV, as it would take hours to complete on the full file):
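import cProfile
import pstats
from pstats import SortKey
# Profile the call and save the raw stats to a file
cProfile.run('get_pfs_errors("Logs/L5/L5_2000.csv")', 'restats.txt')
# Load the saved stats and print the top 10 entries for each sort key
p = pstats.Stats('restats.txt')
p.sort_stats(SortKey.CUMULATIVE).print_stats(10)
p.sort_stats(SortKey.TIME).print_stats(10)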
Here are the stats from cProfile:
Tue Jun 18 15:40:19 2019    restats.txt
1719320 function calls (1459451 primitive calls) in 7.569 seconds
Ordered by: cumulative time
List reduced from 819 to 10 due to restriction <10>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.569    7.569 {built-in method builtins.exec}
        1    0.001    0.001    7.569    7.569 <string>:1(<module>)
        1    0.000    0.000    7.568    7.568 C:/Users/BC744818/Documents/OPTISS_L1_5/test_profile.py:6(get_pfs_errors)
        1    0.006    0.006    7.503    7.503 C:\Users\BC744818\Documents\OPTISS_L1_5\utils\compute_pfs_rules.py:416(compute_pfs_rules)
        1    0.197    0.197    7.498    7.498 C:\Users\BC744818\Documents\OPTISS_L1_5\utils\compute_pfs_rules.py:323(test_logs)
      264    0.001    0.000    6.532    0.025 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\series.py:982(__setitem__)
      529    0.010    0.000    6.158    0.012 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\generic.py:3205(_check_setitem_copy)
      528    6.125    0.012    6.125    0.012 {built-in method gc.collect}
      264    0.004    0.000    3.430    0.013 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\series.py:985(setitem)
      264    0.004    0.000    3.413    0.013 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\indexing.py:183(__setitem__)
Tue Jun 18 15:40:19 2019    restats.txt
1719320 function calls (1459451 primitive calls) in 7.569 seconds
Ordered by: internal time
List reduced from 819 to 10 due to restriction <10>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      528    6.125    0.012    6.125    0.012 {built-in method gc.collect}
      264    0.405    0.002    0.405    0.002 {built-in method gc.get_objects}
        1    0.197    0.197    7.498    7.498 C:\Users\BC744818\Documents\OPTISS_L1_5\utils\compute_pfs_rules.py:323(test_logs)
 71280/33    0.048    0.000    0.091    0.003 C:\Users\BC744818\AppData\Local\Programs\Python\Python37\lib\copy.py:132(deepcopy)
   159671    0.033    0.000    0.056    0.000 {built-in method builtins.isinstance}
      289    0.026    0.000    0.026    0.000 {built-in method nt.stat}
167191/83791    0.024    0.000    0.040    0.000 C:\Users\BC744818\AppData\Local\Programs\Python\Python37\lib\json\encoder.py:333(_iterencode_dict)
  8118/33    0.019    0.000    0.090    0.003 C:\Users\BC744818\AppData\Local\Programs\Python\Python37\lib\copy.py:236(_deepcopy_dict)
167263/83794    0.017    0.000    0.048    0.000 C:\Users\BC744818\AppData\Local\Programs\Python\Python37\lib\json\encoder.py:277(_iterencode_list)
 1067/800    0.017    0.000    0.111    0.000 C:\Users\BC744818\Documents\OPTISS_L1_5\venv\lib\site-packages\pandas\core\indexes\base.py:253(__new__)
Thank you @user9993950, I solved it thanks to you.
When I tested this program I got a SettingWithCopyWarning, but I wanted to fix the speed of the program before fixing this warning.
As it happens, fixing the warning also greatly increased the speed of the program, and gc no longer takes up all of the running time.
I don't know what caused this, though. If someone knows and wants to share the knowledge, please do.
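One plausible reading of the profile above is that the 528 gc.collect calls are explicit calls made from pandas' _check_setitem_copy (the two rows have almost identical cumulative times), not automatic collections. gc.disable() only switches off automatic collection, so explicit calls still run. The sketch below shows that behaviour with the standard library only. The function library_like_call is a hypothetical stand-in for library code that forces a collection.

```python
import cProfile
import gc
import io
import pstats


def library_like_call():
    """Hypothetical stand-in for library code that forces a full collection."""
    return gc.collect()


def run_loop(iterations=50):
    for _ in range(iterations):
        library_like_call()


was_enabled = gc.isenabled()
gc.disable()  # turns off automatic collection only
try:
    profiler = cProfile.Profile()
    profiler.enable()
    run_loop()
    profiler.disable()
finally:
    # Restore whatever state the collector was in before the experiment.
    if was_enabled:
        gc.enable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("tottime").print_stats(5)
gc_report = buffer.getvalue()
print(gc_report)
```

If this reading is right, removing the chained assignment that triggers SettingWithCopyWarning also removes the code path that calls gc.collect, which would explain the speed-up.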

How to time disassembled representation of a Python code source

So far, when I want to find out why one piece of code runs faster than a very similar one, I use the dis module. However, the next step of pinning down the cause basically comes down to adding and removing lines.
Is there a more sophisticated way of listing the worst offenders?
What kind of code do you want to analyze? If you want to analyze pure Python code, you can use a profiler such as cProfile. For example:
import cProfile
cProfile.run("x=1")
Or you can run a function: cProfile.run("function()")
Then it will show you something like the following:
4 function calls in 0.013 seconds
Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.013    0.013    0.013    0.013 <ipython-input-7-8201fb940887>:1(fun)
        1    0.000    0.000    0.013    0.013 <string>:1(<module>)
        1    0.000    0.000    0.013    0.013 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
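To list the worst offenders directly instead of reading an unsorted table, the stats can be sorted by internal time and truncated. The following is a minimal sketch. The functions slow_part, fast_part and work are hypothetical examples, not code from the question.

```python
import cProfile
import io
import pstats


def slow_part():
    # The explicit loop keeps the cost inside this function's own row.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


def fast_part():
    return sum(range(100))


def work():
    return slow_part() + fast_part()


profiler = cProfile.Profile()
profiler.runcall(work)

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# Sort by time spent inside each function itself and keep the top 3 rows.
stats.strip_dirs().sort_stats("tottime").print_stats(3)
top_offenders = buffer.getvalue()
print(top_offenders)
```

Sorting by "cumulative" instead shows which call trees are expensive overall, which is often the more useful starting point.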

Python vs IPython cProfile sort_order

When running cProfile in IPython I can't get the sort order option (-s) to work, whereas the equivalent command works in the system shell (where I redirect the output to a file so that I can see its first lines). What am I missing?
For example, running the following in IPython:
%run -m cProfile -s cumulative myscript.py
gives me the following output (Ordered by: standard name):
9885548 function calls (9856804 primitive calls) in 17.054 seconds
Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 <string>:1(DeprecatedOption)
        1    0.000    0.000    0.000    0.000 <string>:1(RegisteredOption)
        6    0.000    0.000    0.001    0.000 <string>:1(non_reentrant)
        1    0.000    0.000    0.000    0.000 <string>:2(<module>)
       32    0.000    0.000    0.000    0.000 <string>:8(__new__)
        1    0.000    0.000    0.000    0.000 ImageFilter.py:106(MinFilter)
        1    0.000    0.000    0.000    0.000 ImageFilter.py:122(MaxFilter)
        1    0.000    0.000    0.000    0.000 ImageFilter.py:140(ModeFilter)
... rest omitted
What I believe is the equivalent command, run from the system shell (Windows 7):
python -m cProfile -s cumulative myscript.py > outputfile.txt
gives me the following sorted output:
9997772 function calls (9966740 primitive calls) in 17.522 seconds
Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.116    0.116   17.531   17.531 reprep.py:1(<module>)
        6    0.077    0.013   11.700    1.950 reprep.py:837(add_biorep2treatment)
     9758    0.081    0.000    6.927    0.001 ops.py:538(wrapper)
    33592    0.100    0.000    4.209    0.000 frame.py:1635(__getitem__)
    23918    0.010    0.000    3.834    0.000 common.py:111(isnull)
    23918    0.041    0.000    3.823    0.000 common.py:128(_isnull_new)
... rest omitted
I also noticed that the number of function calls differs between the two runs. Why?
I'm running Python 2.7.6 64-bit (from Enthought) and have made sure that the exact same version of Python is used for both executions (though of course the first one has an additional IPython layer).
I know I've got a working solution, but the interactive version would be a time saver, and I would like to understand why there's a difference.
Thank you for your time and help!!
%run has some options for profiling. From the docs for %prun:
If you want to run complete programs under the profiler's control, use
%run -p [prof_opts] filename.py [args to program] where prof_opts
contains profiler specific options as described here.
That is probably a better way to do it. In your case it would be something like %run -p -s cumulative myscript.py.

When profiling Cython Code, what is `stringsource`?

I have a heavy Cython function that I'm trying to optimize. I am profiling it following this tutorial: http://docs.cython.org/src/tutorial/profiling_tutorial.html. My profile output looks like this:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    7.521    7.521   18.945   18.945 routing_cython_core.pyx:674(resolve_flat_regions_for_drainage)
  6189250    4.964    0.000    4.964    0.000 stringsource:323(__cinit__)
  6189250    2.978    0.000    7.942    0.000 stringsource:618(memoryview_cwrapper)
  6009849    0.868    0.000    0.868    0.000 routing_cython_core.pyx:630(_is_flat)
  6189250    0.838    0.000    0.838    0.000 stringsource:345(__dealloc__)
  6189250    0.527    0.000    0.527    0.000 stringsource:624(memoryview_check)
  1804189    0.507    0.000    0.683    0.000 routing_cython_core.pyx:646(_is_sink)
    15141    0.378    0.000    0.378    0.000 {_gdal_array.BandRasterIONumPy}
        3    0.066    0.022    0.086    0.029 /home/rpsharp/local/workspace/invest-natcap.invest-3/invest_natcap/raster_utils.py:235(new_raster_from_base_uri)
    11763    0.048    0.000    0.395    0.000 /usr/lib/python2.7/dist-packages/osgeo/gdal_array.py:189(BandReadAsArray)
Specifically, I'm interested in rows 2 and 3, stringsource:323(__cinit__) and stringsource:618(memoryview_cwrapper), which are called many times. A Google search turned up references to memory views, which I'm not using in that function, although I am statically typing NumPy arrays. Any idea what these calls are and whether I can avoid or optimize them?
Okay, turns out I did have a memory view. I was calling an inline function that passed a statically typed numpy array to a memory view, thus invoking all those extra calls to stringsource. Replacing the memoryview type in the function call with a numpy type fixed this.
