I designed a data warehousing application, but I am struggling with poor performance when fetching data from the source and saving it to the database: only approximately 150 kB/s.
Because of limitations imposed by the customer, I am forced to use Django on a 64-bit Windows machine and save the data to MS SQL Express (exact versions below). I am using the django-mssql (1.7) backend.
The original data are stored in a .dbf file (Visual FoxPro). dbfread returns each row from the file as a Python dict (this is not the bottleneck; I tested it by running just the reader and discarding the data). Each dictionary is then checked for data quality (function sanitize_value_for_db() below) and copied to a Django model object (the tables are wide, so each table/object has about 100 columns/attributes), and the objects are saved to the database using Django's objects.bulk_create() in batches of 50-100.
I ran the code through a profiler using the cProfile and pstats modules; the results are below. Most of the time is spent in pywin32 (the win32com calls), but I have no clue whether there is anything I can do about it. Any hints or opinions will be greatly appreciated. Thanks.
Configuration:
Xeon E5-2403 v2 @ 1.8 GHz, 30 GB RAM
Windows Server 2012 R2 (64bit)
MS SQL Express 64bit, v 11.02.2100.60
Python (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
pywin32-219.win-amd64-py3.4
Django (1.7.10)
django-mssql (1.7)
Profile (on smaller data sample):
ncalls tottime percall cumtime percall filename:lineno(function)
2306881 332.596 0.000 332.596 0.000 {built-in method compile}
4592205 186.028 0.000 186.028 0.000 {method 'InvokeTypes' of 'PyIDispatch' objects}
39558280 176.963 0.000 176.963 0.000 {method 'Bind' of 'PyITypeComp' objects}
9889570 173.905 0.000 712.091 0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:390(_LazyAddAttr_)
17464949 151.818 0.000 279.692 0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:151(_AddFunc_)
7203 142.584 0.020 2215.686 0.308 c:\Python34\Lib\site-packages\sqlserver_ado\dbapi.py:587(execute)
4 120.012 30.003 120.012 30.003 {built-in method sleep}
11484250 100.385 0.000 1269.561 0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:444(__getattr__)
12167658 67.229 0.000 67.229 0.000 {method 'Invoke' of 'PyIDispatch' objects}
22100535 50.127 0.000 53.982 0.000 <string>:12(__init__)
4599456 48.409 0.000 225.498 0.000 c:\Python34\Lib\site-packages\win32com\client\__init__.py:18(__WrapDispatch)
27390427 45.257 0.000 65.391 0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:420(_ResolveType)
7189 44.294 0.006 2296.538 0.319 c:\Python34\Lib\site-packages\django\db\models\query.py:911(_insert)
9889570 43.621 0.000 759.155 0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:381(__LazyMap__)
2306881 43.605 0.000 130.943 0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:303(MakeDispatchFuncMethod)
5275612 33.300 0.000 402.786 0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:524(__setattr__)
17464949 32.793 0.000 32.793 0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:58(__init__)
4599456 32.494 0.000 124.536 0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:120(Dispatch)
2292471 30.889 0.000 843.841 0.000 c:\Python34\Lib\site-packages\sqlserver_ado\dbapi.py:266(_configure_parameter)
7007079 27.918 0.000 53.176 0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:516(MakePublicAttributeName)
2306881 27.907 0.000 46.266 0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:483(_BuildArgList)
15201179 25.593 0.000 25.593 0.000 {method 'GetTypeAttr' of 'PyITypeInfo' objects}
68338972 24.271 0.000 24.758 0.000 {built-in method isinstance}
2306881 23.056 0.000 538.037 0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:314(_make_method_)
17464949 21.683 0.000 21.683 0.000 {method 'GetNames' of 'PyITypeInfo' objects}
7007079 20.396 0.000 20.396 0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:546(<listcomp>)
4599456 18.166 0.000 18.166 0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:172(__init__)
3422203 15.624 0.000 20.870 0.000 C:\MRI\mri\dwh\daq\daq_utils.py:152(sanitize_value_for_db)
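For reference, a listing like the one above can be produced with the cProfile and pstats modules mentioned in the question. The following is only a minimal sketch: load_batch() is a hypothetical stand-in for the real read/sanitize/bulk_create loop, which is not shown in the question.

```python
import cProfile
import io
import pstats


def load_batch(n_rows=1000, n_cols=100):
    """Hypothetical stand-in for the real read/sanitize/save loop."""
    rows = []
    for i in range(n_rows):
        # Build one wide "row", roughly like a 100-column record.
        rows.append({"col%d" % c: str(i + c).strip() for c in range(n_cols)})
    return len(rows)


profiler = cProfile.Profile()
profiler.enable()
row_count = load_batch()
profiler.disable()

# Sort by internal time (tottime), as in the listing above, and keep the top 10.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by "cumulative" instead of "tottime" would put the high-level callers (for example the execute and _insert entries in the listing) at the top.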
I tried to reproduce the functionality of IPython's %timeit, but for some strange reason the results for one function are horrific.
IPython:
In [11]: from random import shuffle
   ....: import numpy as np
   ....: def numpy_seq_el_rank(seq, el):
   ....:     return sum(seq < el)
   ....:
   ....: seq = np.array(xrange(10000))
   ....: shuffle(seq)
   ....:
In [12]: %timeit numpy_seq_el_rank(seq, 10000//2)
10000 loops, best of 3: 46.1 µs per loop
Python:
from timeit import timeit, repeat
def my_timeit(code, setup, rep, loops):
    result = repeat(code, setup=setup, repeat=rep, number=loops)
    return '%d loops, best of %d: %0.9f sec per loop'%(loops, rep, min(result))
np_setup = '''
from random import shuffle
import numpy as np
def numpy_seq_el_rank(seq, el):
    return sum(seq < el)
seq = np.array(xrange(10000))
shuffle(seq)
'''
np_code = 'numpy_seq_el_rank(seq, 10000//2)'
print 'Numpy seq_el_rank:\n\t%s'%my_timeit(code=np_code, setup=np_setup, rep=3, loops=100)
And its output:
Numpy seq_el_rank:
100 loops, best of 3: 1.655324947 sec per loop
As you can see, in Python I ran 100 loops instead of the 10000 used in IPython (and got a result roughly 35000 times slower), because it takes a really long time. Can anybody explain why the result in Python is so slow?
UPD:
Here is cProfile.run('my_timeit(code=np_code, setup=np_setup, rep=3, loops=10000)') output:
30650 function calls in 4.987 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 4.987 4.987 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 <timeit-src>:2(<module>)
3 0.001 0.000 4.985 1.662 <timeit-src>:2(inner)
300 0.006 0.000 4.961 0.017 <timeit-src>:7(numpy_seq_el_rank)
1 0.000 0.000 4.987 4.987 Lab10.py:47(my_timeit)
3 0.019 0.006 0.021 0.007 random.py:277(shuffle)
1 0.000 0.000 0.002 0.002 timeit.py:121(__init__)
3 0.000 0.000 4.985 1.662 timeit.py:185(timeit)
1 0.000 0.000 4.985 4.985 timeit.py:208(repeat)
1 0.000 0.000 4.987 4.987 timeit.py:239(repeat)
2 0.000 0.000 0.000 0.000 timeit.py:90(reindent)
3 0.002 0.001 0.002 0.001 {compile}
3 0.000 0.000 0.000 0.000 {gc.disable}
3 0.000 0.000 0.000 0.000 {gc.enable}
3 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
3 0.000 0.000 0.000 0.000 {isinstance}
3 0.000 0.000 0.000 0.000 {len}
3 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
29997 0.001 0.000 0.001 0.000 {method 'random' of '_random.Random' objects}
2 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {min}
3 0.003 0.001 0.003 0.001 {numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {range}
300 4.955 0.017 4.955 0.017 {sum}
6 0.000 0.000 0.000 0.000 {time.clock}
Well, one issue is that you're misreading the results. IPython is telling you how long each of the 10,000 iterations took, for the set of 10,000 iterations with the lowest total time. The timeit.repeat function reports how long the whole round of 100 iterations took (again, for the shortest of three). So the real discrepancy is 46.1 µs per loop (IPython) vs. 16.5 ms per loop (Python): still a factor of ~350x, but not 35,000x.
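To make the two numbers comparable, the total returned by timeit.repeat has to be divided by the number of loops. A minimal sketch of a corrected helper follows; it is written for Python 3, and the timed statement is a stdlib-only placeholder rather than the numpy function from the question.

```python
from timeit import repeat


def my_timeit(code, setup, rep, loops):
    # repeat() returns one total per round: the time for `loops` executions.
    totals = repeat(code, setup=setup, repeat=rep, number=loops)
    best_per_loop = min(totals) / loops  # convert the best total to per-loop
    return "%d loops, best of %d: %0.9f sec per loop" % (loops, rep, best_per_loop)


# Placeholder workload (an assumption, not the code from the question).
message = my_timeit("sum(seq)", "seq = list(range(1000))", rep=3, loops=100)
print(message)
```

With the figures from the question, 1.655324947 s divided by 100 loops is about 16.55 ms per loop.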
You didn't show profiling results for ipython. Is it possible that in your ipython session, you did either from numpy import sum or from numpy import *? If so, you'd have been timing the numpy.sum (which is optimized for numpy arrays and would run several orders of magnitude faster), while your python code (which isolated the globals in a way that ipython does not) ran the normal sum (that has to convert all the values to Python ints and sum them).
If you check your profiling output, virtually all of your work is being done in sum; if that part of your code was sped up by several orders of magnitude, the total time would reduce similarly. That would explain the "real" discrepancy; in the test case linked above, it was a 40x difference, and that was for a smaller array (the smaller the array, the less numpy can "show off") with more complex values (vs. summing 0s and 1s here I believe).
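The namespace point can be illustrated without numpy. In the sketch below (Python 3.5+, where timeit accepts a globals argument), counting_sum is a hypothetical stand-in for an imported numpy.sum: the timed statement only sees it when it is present in the namespace that timeit runs the statement in; otherwise the builtin sum is used.

```python
import timeit

calls = []


def counting_sum(values):
    """Hypothetical replacement for sum that records each call."""
    calls.append(1)
    return 0


stmt = "sum([1, 2, 3])"

# Default: timeit runs the statement in its own namespace, so the name
# `sum` resolves to the builtin and counting_sum is never called.
timeit.timeit(stmt, number=5)
calls_isolated = len(calls)

# Explicit namespace in which `sum` is shadowed, as it would be after
# `from numpy import sum` in an interactive session.
timeit.timeit(stmt, number=5, globals={"sum": counting_sum})
calls_shadowed = len(calls)

print(calls_isolated, calls_shadowed)
```

The same statement string therefore times two different functions depending on the namespace, which is the kind of mismatch suspected here.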
The remainder (if any) is probably an issue of how the code is being evaled slightly differently, or possibly weirdness with the random shuffle (for consistent tests, you'd want to seed random with a consistent seed to make the "randomness" repeatable) but I doubt that's a difference of more than a few percent.
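Making the shuffle repeatable only takes a fixed seed. A minimal sketch, using a plain list instead of the numpy array from the question and an arbitrary seed value:

```python
import random


def shuffled_sequence(n, seed):
    # Seeding the module-level generator makes the shuffle reproducible.
    random.seed(seed)
    seq = list(range(n))
    random.shuffle(seq)
    return seq


first = shuffled_sequence(10000, seed=12345)
second = shuffled_sequence(10000, seed=12345)
print(first == second)
```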
There could be any number of reasons this code is running slower in one implementation of python than another. One may be optimized differently than another, one may pre-compile certain parts while the other is fully interpreted. The only way to figure out why is to profile your code.
https://docs.python.org/2/library/profile.html
import cProfile
cProfile.run('repeat(code, setup=setup, repeat=rep, number=loops)')
Will give a result similar to
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <stdin>:1(testing)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'upper' of 'str' objects}
This shows you which functions were called, how many times they were called, and how long they took.
When running "cProfile" in "IPython" I can't get the "sort_order" option to work, in contrast to running the equivalent code in the system shell (which I've redirected to a file, to be able to see the first lines of the output). What am I missing?
E.g. running the following command:
%run -m cProfile -s cumulative myscript.py
gives me the following output (Ordered by: standard name):
9885548 function calls (9856804 primitive calls) in 17.054 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 <string>:1(DeprecatedOption)
1 0.000 0.000 0.000 0.000 <string>:1(RegisteredOption)
6 0.000 0.000 0.001 0.000 <string>:1(non_reentrant)
1 0.000 0.000 0.000 0.000 <string>:2(<module>)
32 0.000 0.000 0.000 0.000 <string>:8(__new__)
1 0.000 0.000 0.000 0.000 ImageFilter.py:106(MinFilter)
1 0.000 0.000 0.000 0.000 ImageFilter.py:122(MaxFilter)
1 0.000 0.000 0.000 0.000 ImageFilter.py:140(ModeFilter)
... rest omitted
The IMO equivalent code run from the system shell (Win7):
python -m cProfile -s cumulative myscript.py > outputfile.txt
gives me the following sorted output:
9997772 function calls (9966740 primitive calls) in 17.522 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.116 0.116 17.531 17.531 reprep.py:1(<module>)
6 0.077 0.013 11.700 1.950 reprep.py:837(add_biorep2treatment)
9758 0.081 0.000 6.927 0.001 ops.py:538(wrapper)
33592 0.100 0.000 4.209 0.000 frame.py:1635(__getitem__)
23918 0.010 0.000 3.834 0.000 common.py:111(isnull)
23918 0.041 0.000 3.823 0.000 common.py:128(_isnull_new)
... rest omitted
I also noticed that there is a difference in the number of function calls. Why?
I'm running Python 2.7.6 64bit (from Enthought) and have made sure that the exact same version of Python is used for both executions (though of course the first one has an additional "IPython" "layer").
I know I've got a working solution, but the interactive version would be a time saver and I would like to understand why there's a difference.
Thank you for your time and help!!
%run has its own options for profiling. From the docs for %prun:
If you want to run complete programs under the profiler's control, use
%run -p [prof_opts] filename.py [args to program] where prof_opts
contains profiler specific options as described here.
That is probably a better way to do it.
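Independently of IPython, another way to get a specific sort order is to save the raw profile and sort it afterwards with pstats. This is a sketch of an alternative, not a description of what %run does internally; work() and the temporary file name are placeholders.

```python
import cProfile
import io
import os
import pstats
import tempfile


def work():
    """Placeholder for the script being profiled."""
    return sorted(str(i) for i in range(50000))


# Save the raw profile data to a file (the equivalent of
# `python -m cProfile -o out.prof myscript.py`).
path = os.path.join(tempfile.mkdtemp(), "out.prof")
profiler = cProfile.Profile()
profiler.runcall(work)
profiler.dump_stats(path)

# Load it later and choose any sort order.
buffer = io.StringIO()
pstats.Stats(path, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Because the sorting happens after the run, the same saved file can be re-sorted by "tottime", "calls" and so on without profiling again.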
I have a Python script in a file which takes just over 30 seconds to run. I am trying to profile it as I would like to cut down this time dramatically.
I am trying to profile the script using cProfile, but essentially all it seems to be telling me is that yes, the main script took a long time to run, but doesn't give the kind of breakdown I was expecting. At the terminal, I type something like:
cat my_script_input.txt | python -m cProfile -s time my_script.py
The results I get are:
<my_script_output>
683121 function calls (682169 primitive calls) in 32.133 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 31.980 31.980 32.133 32.133 my_script.py:18(<module>)
121089 0.050 0.000 0.050 0.000 {method 'split' of 'str' objects}
121090 0.038 0.000 0.049 0.000 fileinput.py:243(next)
2 0.027 0.014 0.036 0.018 {method 'sort' of 'list' objects}
121089 0.009 0.000 0.009 0.000 {method 'strip' of 'str' objects}
201534 0.009 0.000 0.009 0.000 {method 'append' of 'list' objects}
100858 0.009 0.000 0.009 0.000 my_script.py:51(<lambda>)
952 0.008 0.000 0.008 0.000 {method 'readlines' of 'file' objects}
1904/952 0.003 0.000 0.011 0.000 fileinput.py:292(readline)
14412 0.001 0.000 0.001 0.000 {method 'add' of 'set' objects}
182 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.000 0.000 0.000 0.000 fileinput.py:80(<module>)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 fileinput.py:184(FileInput)
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
This doesn't seem to be telling me anything useful. The vast majority of the time is simply listed as:
ncalls tottime percall cumtime percall filename:lineno(function)
1 31.980 31.980 32.133 32.133 my_script.py:18(<module>)
In my_script.py, Line 18 is nothing more than the closing """ of the file's header block comment, so it's not that there is a whole load of work concentrated in Line 18. The script as a whole is mostly made up of line-based processing with mostly some string splitting, sorting and set work, so I was expecting to find the majority of time going to one or more of these activities. As it stands, seeing all the time grouped in cProfile's results as occurring on a comment line doesn't make any sense or at least does not shed any light on what is actually consuming all the time.
EDIT: I've constructed a minimum working example similar to my above case to demonstrate the same behavior:
mwe.py
import fileinput
for line in fileinput.input():
    for i in range(10):
        y = int(line.strip()) + int(line.strip())
And call it with:
perl -e 'for(1..1000000){print "$_\n"}' | python -m cProfile -s time mwe.py
To get the result:
22002536 function calls (22001694 primitive calls) in 9.433 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 8.004 8.004 9.433 9.433 mwe.py:1(<module>)
20000000 1.021 0.000 1.021 0.000 {method 'strip' of 'str' objects}
1000001 0.270 0.000 0.301 0.000 fileinput.py:243(next)
1000000 0.107 0.000 0.107 0.000 {range}
842 0.024 0.000 0.024 0.000 {method 'readlines' of 'file' objects}
1684/842 0.007 0.000 0.032 0.000 fileinput.py:292(readline)
1 0.000 0.000 0.000 0.000 fileinput.py:80(<module>)
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:184(FileInput)
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Am I using cProfile incorrectly somehow?
As I mentioned in a comment, when you can't get cProfile to work externally, you can often use it internally instead. It's not that hard.
For example, when I run with -m cProfile in my Python 2.7, I get effectively the same results you did. But when I manually instrument your example program:
import fileinput
import cProfile
pr = cProfile.Profile()
pr.enable()
for line in fileinput.input():
    for i in range(10):
        y = int(line.strip()) + int(line.strip())
pr.disable()
pr.print_stats(sort='time')
… here's what I get:
22002533 function calls (22001691 primitive calls) in 3.352 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
20000000 2.326 0.000 2.326 0.000 {method 'strip' of 'str' objects}
1000001 0.646 0.000 0.700 0.000 fileinput.py:243(next)
1000000 0.325 0.000 0.325 0.000 {range}
842 0.042 0.000 0.042 0.000 {method 'readlines' of 'file' objects}
1684/842 0.013 0.000 0.055 0.000 fileinput.py:292(readline)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
That's a lot more useful: It tells you what you probably already expected, that more than half your time is spent calling str.strip().
Also, note that if you can't edit the file containing code you wish to profile (mwe.py), you can always do this:
import cProfile
pr = cProfile.Profile()
pr.enable()
import mwe
pr.disable()
pr.print_stats(sort='time')
Even that doesn't always work. If your program calls exit(), for example, you'll have to use a try:/finally: wrapper and/or an atexit. And if it calls os._exit(), or segfaults, you're probably completely hosed. But that isn't very common.
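The try:/finally: wrapper mentioned above might look like the following sketch. exiting_program() is a hypothetical stand-in for a script that calls sys.exit(); the SystemExit is caught at the end only so that the example itself keeps running.

```python
import cProfile
import io
import pstats
import sys


def exiting_program():
    """Hypothetical program that does some work and then exits."""
    total = sum(int(str(i).strip()) for i in range(10000))
    sys.exit(0)


def run_profiled(func, stream):
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        # Runs even when func() raises SystemExit, so the stats still print.
        profiler.disable()
        pstats.Stats(profiler, stream=stream).sort_stats("time").print_stats(5)


buffer = io.StringIO()
exited = False
try:
    run_profiled(exiting_program, buffer)
except SystemExit:
    exited = True

report = buffer.getvalue()
print(report)
```

As the text notes, this does not help with os._exit() or a segfault, since neither unwinds through the finally: block.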
However, something I discovered later: If you move all code out of the global scope, -m cProfile seems to work, at least for this case. For example:
import fileinput
def f():
    for line in fileinput.input():
        for i in range(10):
            y = int(line.strip()) + int(line.strip())
f()
Now the output from -m cProfile includes, among other things:
2000000 4.819 0.000 4.819 0.000 :0(strip)
100001 0.288 0.000 0.295 0.000 fileinput.py:243(next)
I have no idea why this also made it twice as slow… or maybe that's just a cache effect; it's been a few minutes since I last ran it, and I've done lots of web browsing in between. But that's not important, what's important is that most of the time is getting charged to reasonable places.
But if I change this to move the outer loop to the global level, and only its body into a function, most of the time disappears again.
Another alternative, which I wouldn't suggest except as a last resort…
I notice that if I use profile instead of cProfile, it works both internally and externally, charging time to the right calls. However, those calls are also about 5x slower. And there seems to be an additional 10 seconds of constant overhead (which gets charged to import profile if used internally, or to whatever's on line 1 if used externally). So, to find out that strip is using 70% of my time, instead of waiting 4 seconds and doing 2.326 / 3.352, I have to wait 27 seconds and do 10.93 / (26.34 - 10.01). Not much fun…
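Spelled out, the two ratios quoted above are as follows (numbers taken from the text; the subtraction removes the roughly 10 seconds of constant overhead):

```python
# cProfile, used internally: time in str.strip() over total time.
cprofile_share = 2.326 / 3.352

# profile: subtract the constant overhead before taking the ratio.
profile_share = 10.93 / (26.34 - 10.01)

print("cProfile: %.1f%%" % (cprofile_share * 100))
print("profile:  %.1f%%" % (profile_share * 100))
```

Both come out just under 70%, so the two profilers agree on where the time goes.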
One last thing: I get the same results with a CPython 3.4 dev build—correct results when used internally, everything charged to the first line of code when used externally. But PyPy 2.2/2.7.3 and PyPy3 2.1b1/3.2.3 both seem to give me correct results with -m cProfile. This may just mean that PyPy's cProfile is faked on top of profile because the pure-Python code is fast enough.
Anyway, if someone can figure out/explain why -m cProfile isn't working, that would be great… but otherwise, this is usually a perfectly good workaround.
I have a heavy Cython function that I'm trying to optimize. I am profiling it following this tutorial: http://docs.cython.org/src/tutorial/profiling_tutorial.html. My profile output looks like this:
ncalls tottime percall cumtime percall filename:lineno(function)
1 7.521 7.521 18.945 18.945 routing_cython_core.pyx:674(resolve_flat_regions_for_drainage)
6189250 4.964 0.000 4.964 0.000 stringsource:323(__cinit__)
6189250 2.978 0.000 7.942 0.000 stringsource:618(memoryview_cwrapper)
6009849 0.868 0.000 0.868 0.000 routing_cython_core.pyx:630(_is_flat)
6189250 0.838 0.000 0.838 0.000 stringsource:345(__dealloc__)
6189250 0.527 0.000 0.527 0.000 stringsource:624(memoryview_check)
1804189 0.507 0.000 0.683 0.000 routing_cython_core.pyx:646(_is_sink)
15141 0.378 0.000 0.378 0.000 {_gdal_array.BandRasterIONumPy}
3 0.066 0.022 0.086 0.029 /home/rpsharp/local/workspace/invest-natcap.invest-3/invest_natcap/raster_utils.py:235(new_raster_from_base_uri)
11763 0.048 0.000 0.395 0.000 /usr/lib/python2.7/dist-packages/osgeo/gdal_array.py:189(BandReadAsArray)
Specifically, I'm interested in lines 2 and 3, which call stringsource:323(__cinit__) and stringsource:618(memoryview_cwrapper) many times. A Google search turned up references to memory views, which I'm not using in that function, although I am statically typing numpy arrays. Any idea what these calls are and whether I can avoid or optimize them?
Okay, turns out I did have a memory view. I was calling an inline function that passed a statically typed numpy array to a memory view, thus invoking all those extra calls to stringsource. Replacing the memoryview type in the function call with a numpy type fixed this.
When running:
./manage.py test appname
How can I disable all the stats/logging/output after "OK"?
I have already commented out the entire logging section - no luck.
I also commented out any print_stat calls - no luck.
My manage.py is pretty bare, so it likely isn't that.
I run many tests and constantly have to scroll up thousands of terminal lines to view results.
Clearly, I am new to Python/Django and its testing framework, so I would appreciate any help.
----------------------------------------------------------------------
Ran 2 tests in 2.133s
OK
1933736 function calls (1929454 primitive calls) in 2.133 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 2.133 2.133 <string>:1(<module>)
30 0.000 0.000 0.000 0.000 <string>:8(__new__)
4 0.000 0.000 0.000 0.000 Cookie.py:315(_quote)
26 0.000 0.000 0.000 0.000 Cookie.py:333(_unquote)
10 0.000 0.000 0.000 0.000 Cookie.py:432(__init__)
28 0.000 0.000 0.000 0.000 Cookie.py:441(__setitem__)
.
.
.
2 0.000 0.000 0.000 0.000 {time.gmtime}
18 0.000 0.000 0.000 0.000 {time.localtime}
18 0.000 0.000 0.000 0.000 {time.strftime}
295 0.000 0.000 0.000 0.000 {time.time}
556 0.000 0.000 0.000 0.000 {zip}
If it helps, I am importing:
from django.utils import unittest
class TestEmployeeAdd(unittest.TestCase):
    def setUp(self):
If you use a Unix-like shell (macOS has one), you can use the head command to do the trick:
python manage.py test appname | head -n 3
Replace 3 with however many lines you need so that the output is cut off right after the OK line.
You can also check whether you prefer the output produced with the command's verbosity set to its minimum, like this:
python manage.py test appname -v 0
Hope this helps!