I wrote the following script.
Basically, I'm just learning Python for machine learning and wanted to check how really computationally intensive tasks would perform. I observe that for 10**8 iterations, Python takes up a lot of RAM (around 3.8 GB) and also a lot of CPU time (it froze my system).
I want to know if there is any way to limit the time/memory consumption, either through code or some global settings.
Script:
import time

initial_start = time.clock()
for i in range(9):
    start = time.clock()
    # Empty loop body: only the cost of range() and the iteration is measured
    for j in range(10**i):
        pass
    stop = time.clock()
    print 'Looping exp(',i,') times takes', stop - start, 'seconds'
final_stop = time.clock()
print 'Overall program time is',final_stop - initial_start,'seconds'
In Python 2, range creates a list. Use xrange instead. For a more detailed explanation see Should you always favor xrange() over range()?
Note that a no-op for loop is a very poor benchmark that tells you pretty much nothing about Python.
Also note, as per gnibbler's comment, Python 3's range works like Python 2's xrange.
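The memory difference is easy to observe. The following is a minimal sketch in Python 3 syntax (where range is already lazy); the size 10**6 is an arbitrary choice made here to keep the demo light, not a value from the question.

```python
import sys

n = 10**6

lazy = range(n)          # Python 3 range (like Python 2 xrange): no list is built
eager = list(range(n))   # roughly what Python 2's range(n) builds up front

lazy_size = sys.getsizeof(lazy)    # small and independent of n
eager_size = sys.getsizeof(eager)  # the list's pointer array; the int objects cost extra

print("lazy range object:", lazy_size, "bytes")
print("materialized list:", eager_size, "bytes")
```

The exact byte counts depend on the interpreter build, so treat the printed numbers as indicative only.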
Look at this question: How to limit the heap size?
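On Unix-like systems, one possible approach is the standard library's resource module. The sketch below (Python 3 syntax) runs a snippet in a child interpreter whose address space is capped, so a runaway allocation fails with MemoryError instead of pushing the machine into swap. The helper name run_with_memory_cap and the 1 GiB cap are illustrative assumptions; the resource module is not available on Windows, and RLIMIT_AS is not enforced on every platform.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(code, max_bytes):
    """Run `code` in a child Python process limited to `max_bytes` of address space."""
    def set_limit():
        # Executed in the child just before the interpreter starts.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limit,
        capture_output=True,
        text=True,
    )


# The child tries to allocate 2 GiB while capped at 1 GiB.
child_code = """
try:
    data = bytearray(2 * 1024**3)
    print("allocated")
except MemoryError:
    print("MemoryError")
"""

result = run_with_memory_cap(child_code, 1024**3)
print(result.stdout.strip())
```

Because the limit is applied only to the child, the parent interpreter and the rest of the session are unaffected.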
To address your script, the timeit module measures the time it takes to perform an action more accurately:
>>> import timeit
>>> for i in range(9):
...     print timeit.timeit(stmt='pass', number=10**i)
...
0.0
0.0
0.0
0.0
0.0
0.015625
0.0625
0.468752861023
2.98439407349
Your example is taking most of its time dealing with the gigantic lists of numbers you're putting in memory. Using xrange instead of range will help fix that issue, but you're still using a terrible benchmark. The loop is going to execute over and over and not actually do anything, so the CPU is busy checking the condition and entering the loop.
As you can see, creating these lists is taking the majority of the time here:
>>> timeit.timeit(stmt='range(10**7)', number=1)
0.71875405311584473
>>> timeit.timeit(stmt='for i in range(10**7): pass', number=1)
1.093757152557373
Python takes RAM because you're creating a very large list of length 10 ** 8 with the range function. That's where iterators become useful.
Use xrange instead of range.
It will work the same way as range does, but instead of creating that large list in memory, xrange will just calculate the inner index (incrementing its value by 1 each iteration).
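To illustrate the idea of calculating the index on demand, here is a hand-written generator that behaves like a one-argument xrange. It is only a sketch of the concept (the name lazy_range is made up here); the real xrange is implemented in C and supports more features.

```python
def lazy_range(stop):
    """Yield 0, 1, ..., stop - 1 one at a time, keeping only the current index."""
    i = 0
    while i < stop:
        yield i
        i += 1


total = 0
for value in lazy_range(10**5):
    total += value   # only one integer from the sequence is alive at a time

print(total)  # 4999950000
```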
If you're considering Python for machine learning, take a look at numpy. Its philosophy is to implement all "inner loops" (matrix operations, linear algebra) in optimized C, and to use Python to manipulate input and output and to manage high-level algorithms - sort of like Matlab that uses Python. That gives you the best of both worlds: ease and readability of Python, and speed of C.
To get back to your question, benchmarking numpy operations will give you a more realistic assessment of Python's performance for machine learning.
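As a rough illustration of that point, the sketch below (Python 3 syntax, assuming numpy is installed) computes the same sum of squares once with an explicit Python loop and once with a single numpy call. The array size is an arbitrary choice, and the timings will differ from machine to machine.

```python
import time

import numpy as np

values = np.arange(200_000, dtype=np.float64)

# Explicit Python loop: every iteration goes through the interpreter.
start = time.perf_counter()
loop_total = 0.0
for v in values:
    loop_total += v * v
loop_seconds = time.perf_counter() - start

# Same computation with the inner loop running in compiled code.
start = time.perf_counter()
vector_total = float(np.dot(values, values))
vector_seconds = time.perf_counter() - start

print("python loop:", loop_seconds, "seconds")
print("numpy dot:  ", vector_seconds, "seconds")
```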
As regards CPU, you have a for loop running for on the order of 10**8 iterations without any sort of sleep or pause in between, so no wonder the process hogs the CPU completely (at least on a single-core computer).
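One way to keep a runaway loop from tying up the machine indefinitely is to run it in a child interpreter with a time limit. The sketch below (Python 3 syntax) uses subprocess.run with its timeout argument, which kills the child when the limit expires; the helper name run_with_timeout and the limits are illustrative choices.

```python
import subprocess
import sys


def run_with_timeout(code, seconds):
    """Run `code` in a child interpreter; return True if it finished in time."""
    try:
        subprocess.run([sys.executable, "-c", code], timeout=seconds, check=True)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return False


finished_quick = run_with_timeout("for j in range(10**3): pass", 30)
finished_endless = run_with_timeout("while True: pass", 1)

print(finished_quick, finished_endless)  # True False
```

This bounds wall-clock time only; the child still uses a full core while it runs.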
Related
I'm trying the following code:
import multiprocessing
import time
import random
def square(x):
    return x**2
pool = multiprocessing.Pool(4)
l = [random.random() for i in xrange(10**8)]
now = time.time()
pool.map(square, l)
print time.time() - now
now = time.time()
map(square, l)
print time.time() - now
and the pool.map version consistently runs several seconds more slowly than the normal map version (19 seconds vs 14 seconds).
I've looked at the questions: Why is multiprocessing.Pool.map slower than builtin map? and multiprocessing.Pool() slower than just using ordinary functions
and they seem to chalk it up to either IPC overhead or disk saturation, but I feel like in my example those aren't obviously the issue; I'm not writing/reading anything to/from disk, and the computation is long enough that it seems like IPC overhead should be small compared to the total time saved by the multiprocessing (I'm estimating that, since I'm doing work on 4 cores instead of 1, I should cut the computation time down from 14 seconds to about 3.5 seconds). I don't think I'm saturating my CPU; checking cat /proc/cpuinfo shows that I have 4 cores, but even when I multiprocess to only 2 processes it's still slower than just the normal map function (and even slower than 4 processes). What else could be slowing down the multiprocessed version? Am I misunderstanding how IPC overhead scales?
If it's relevant, this code is written in Python 2.7, and my OS is Linux Mint 17.2
pool.map chops the list into chunks and dispatches those to the worker processes as separate tasks. Every argument and every result has to be pickled and sent between processes.
The work a single process is doing is shown in your code:
def square(x):
    return x**2
This operation takes very little time on modern CPUs, no matter how big the number is.
In your example you're creating a huge list and performing a trivial operation on every single element. Of course the IPC overhead will be greater compared to the regular map function, which is optimized for fast looping.
In order to see your example working as you expect, just add a time.sleep(0.1) call to the square function. This simulates a long running task. Of course you might want to reduce the size of the list or it will take forever to complete.
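To get a feel for why the overhead matters here, the sketch below (Python 3 syntax) times the squaring itself and, separately, a pickle round trip of the inputs and the results. The round trip is only a rough stand-in for IPC: it ignores pipes, locks, and process scheduling, so the real cost of moving data to and from workers is higher. The list size is an arbitrary choice.

```python
import pickle
import random
import time


def square(x):
    return x**2


values = [random.random() for _ in range(200_000)]

# Cost of the actual work.
start = time.perf_counter()
squares = list(map(square, values))
compute_seconds = time.perf_counter() - start

# Cost of serializing the arguments and the results, as inter-process transfer requires.
start = time.perf_counter()
restored_values = pickle.loads(pickle.dumps(values))
restored_squares = pickle.loads(pickle.dumps(squares))
transfer_seconds = time.perf_counter() - start

print("computing squares:", compute_seconds, "seconds")
print("pickle round trip:", transfer_seconds, "seconds")
```

When the per-element work is this cheap, the two numbers are of comparable magnitude, which leaves little for extra processes to gain.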
Here's the code I wrote for comparing performance of numpy vs Matlab. It just measures the average time taken for matrix multiplication (1701x576 matrix M1 * 576x576 matrix M2).
Matlab version : (M1 is (1701x576) while M2 is (576x576) matrix)
function r = benchmark(M1,M2)
    total_time=0;
    for i=1:4
        for j=1:1500
            tic;
            a=M1*M2;
            tim=toc;
            total_time =total_time+tim;
        end
    end
    avg_time = total_time/4
    r=avg_time
end
Python version :
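import time

def benchmark():
    t_time = 0.0  # accumulated time spent in the multiplications
    iters = range(1500)
    for i in range(4):
        for j in iters:
            tic = time.time()
            a = M1.dot(M2)  # M1 and M2 are module-level numpy arrays
            toc = time.time() - tic
            t_time = t_time + toc
    return t_time/4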
The Matlab version takes ~18.2s, while Python takes ~19.3s. I've repeated this test multiple times, and Matlab always performed better than Python (even if by a small margin). My understanding is that Numpy uses efficient, compiled code for vector operations and is supposed to be faster than Matlab.
Then why could Matlab perform faster than Numpy? The test was done on a 32-core machine.
Where did I go wrong? Or is it expected for Numpy to be slower than Matlab?
Are there ways to improve the performance for Python?
Edit: Updated the Matlab code to fix the loop index/return value error. The error was the result of me trying to edit the names in the snippet to make it presentable just before posting (a bad idea every time :) ).
[edited to remove the mention of loops; that was my mistake]
Couple things--
First, the multicore nature of the machine doesn't really matter unless you're explicitly using those extra cores (or linking NumPy against a BLAS library that uses multiple cores; thanks @ali_m). If you're not, it'll run about as fast on a 32-core machine as it will on a 4-core machine (assuming the clock speeds of the cores themselves are roughly equal).
Second, using purely off-the-shelf Matlab vs off-the-shelf NumPy, Matlab generally beats out NumPy. This is a very general statement, though; YMMV. Also, speaking of Matlab, there does indeed appear to be a bug in the looping indices.
Third, this may not be the best benchmark for performance; there may be some unseen caching issues taking place under the hood that aren't obvious. A better one would be to randomly generate the matrices on-the-fly in each iteration and multiply them, but even this could be problematic depending on the random number generator.
There are two major issues I can see with the test.
The first is that you are using global variable lookup in Python while you are using local variable lookup in MATLAB. Global variable lookup in Python is relatively slow. Making sure the variables are local like they are in MATLAB will affect the performance.
The second is that you are re-doing the same calculation over and over. MATLAB has a JIT for loops and numpy has a cache for calculations, both of which can reduce the time for repeated calculations.
So to make the comparisons more equal and reliable, you should create new, random matrices each time through the loop. This will prevent caching and JIT from messing up your results, and will make sure the variables are all local.
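A minimal sketch of that suggestion, in Python 3 syntax and assuming numpy is installed: the matrices are created as local variables inside the function and regenerated on every pass, and only the multiplication is timed. The repeat count and the seed are arbitrary choices made here, and the generation of random data is deliberately kept outside the timed region.

```python
import time

import numpy as np


def benchmark(repeats=5, shape_a=(1701, 576), shape_b=(576, 576), seed=0):
    """Average time of one matrix product, using fresh local matrices each pass."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(repeats):
        m1 = rng.random(shape_a)   # new operands every iteration
        m2 = rng.random(shape_b)
        tic = time.perf_counter()
        product = m1.dot(m2)       # only the multiplication is timed
        total += time.perf_counter() - tic
    return total / repeats, product.shape


avg_seconds, result_shape = benchmark()
print("average seconds per multiplication:", avg_seconds)
```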
There is a bug in your Matlab code. It appears that you are using the same loop control variable in nested loops.
The outer loop actually only runs once.
Edit: The outer loop actually runs the correct number of times. The two loop control variables seem to be independent.
Beginner here, looked for an answer, but can't find one.
I know (or rather suspect) that part of the problem with the following code is how big the list of combinations gets.
(Maybe too, the last line seems like an error, in that, if I just run 'print ...' rather than 'comb += ...' it runs quickly and quits. Would 'append' be more graceful?)
I'm not 100% sure if the system hang is due to disk I/O (swapping?), CPU use, or memory... running it under Windows seems to result in a rather large disk I/O by 'System', while under Linux, top was showing high CPU and memory use before it was killed. In both cases though, the rest of the system was unusable while this operation was going (tried it in the Python interpreter directly, as well as in PyCharm).
So two part question: 1) is there some 'safe' way to test code like this that won't affect the rest of the system negatively, and 2) for this specific example, how should I rewrite it?
Trying this code (which I do not recommend!):
from itertools import combinations_with_replacement as cwr
comb = []
iterable = [1,2,3,4]
for x in xrange(4,100):
    comb += cwr(iterable, x)
Thanks!
EDIT: Should have specified, but it is python2.7 code here as well (guess the xrange makes it obvious it's not 3 anyways). The Windows machine that's hanging has 4 GB of RAM, but it looks like the hang is on disk I/O. The original problem I was (and still am) working on was a question at codewars.com, about how many ways to make change given a list of possible coins and an amount to make. The solution I'd come up with worked for small amounts, and not big ones. Obviously, I need to come up with a better algorithm to solve that problem... so this is non-essential code, certainly. However, I would like to know if there's something I can do to set the programming environment so that bugs in my code don't propagate and choke my system this way.
FURTHER EDIT:
I was working on the problem again tonight, and realized that I didn't need to append to a master list (as some of you hinted to me in the comments), but just work on the subset that was collected. I hadn't really given enough of the code to make that obvious, but my key problem here was the line:
comb += cwr(iterable, x)
which should have been
comb = cwr(iterable, x)
Since you are trying to compute combinations with replacement, the number of orderings that must be considered will be 4^n (4 because your iterable has 4 items).
More generally speaking, the number of orderings to be computed is the number of elements that can be at any spot in the list, raised to the power of how long the list is.
You are trying to compute 4^n for n between 4 and 99. 4^99 is 4.01734511064748 * 10^59.
I'm afraid not even a quantum computer would be much help computing that.
This isn't a very powerful laptop (3.7 GiB, Intel® Celeron(R) CPU N2820 @ 2.13GHz × 2, 64-bit Ubuntu), but it did it in 15s or so. It did slow noticeably: top showed 100% CPU (dual core) and 35% memory. It took about 15s to release the memory when it finished.
len(comb) was 4,421,240
I had to change your code to
from itertools import combinations_with_replacement as cwr
comb = []
iterable = [1,2,3,4]
for x in xrange(4,100):
    comb.extend(list(cwr(iterable, x)))
ED - just re-tried as per your original and it does run OK. My mistake. It looks as though it is the memory requirement. If you really need to do this you could write it to a file.
re-ED: Being curious about the back-of-an-envelope complexity calculation above not squaring with my experience, I tried plotting n (X axis) against the length of the list returned by combinations_with_replacement() (Y axis) for iterable lengths i = 2, 3, 4, 5. The result seems to be below n**(i-1), which ties in with the figure I got for 4,99 above. It's actually (i+n-1)! / n! / (i-1)!, which approximates to n**(i-1)/(i-1)! for n much bigger than i.
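The closed form can be checked without building any large list. The sketch below (Python 3.8+ for math.comb; the helper name cwr_count is made up here) compares the formula against a brute-force count for small n and then sums it over the lengths used in the question.

```python
from itertools import combinations_with_replacement as cwr
from math import comb


def cwr_count(i, n):
    """Number of length-n combinations with replacement from i distinct items.

    Equals (i + n - 1)! / (n! * (i - 1)!).
    """
    return comb(i + n - 1, n)


# Brute-force check against itertools for small lengths.
for n in range(1, 8):
    assert cwr_count(4, n) == sum(1 for _ in cwr([1, 2, 3, 4], n))

# Total number of tuples for lengths 4..99 with a 4-item iterable.
total = sum(cwr_count(4, n) for n in range(4, 100))
print(total)  # 4421240
```

The total matches the len(comb) reported above.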
Also, when running the plot I didn't keep the full comb list in memory, and this did improve computer performance quite a bit, so maybe that's a relevant point: rather than produce a giant list and then work on it afterwards, do the calculations in the loop.
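A sketch of doing the work inside the loop, in Python 3 syntax: each tuple is examined as it is produced and then discarded, so memory use stays flat. The target amount of 20 and the shortened length range are hypothetical values picked for illustration, not taken from the original problem.

```python
from itertools import combinations_with_replacement as cwr

coins = [1, 2, 3, 4]
target = 20   # hypothetical amount, for illustration only

matches = 0
for length in range(4, 12):
    # cwr() is lazy: it yields one tuple at a time and nothing is accumulated.
    for combo in cwr(coins, length):
        if sum(combo) == target:
            matches += 1

print(matches)
```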
I am trying to optimize some python code (to speed up some matrix operations), my code is something similar to this one (my real dataset is also similar to 'gps'),
import numpy as np
gps = [np.random.rand(50,50) for i in xrange(1000)]
ips = np.zeros( (len(gps),len(gps)), dtype='float32')
for i in xrange(len(gps)):
    for j in xrange(0,i+1):
        ips[i,j]= f.innerProd(gps[i],gps[j])
        ips[j,i]= ips[i,j]
    print "Inner product matrix: %3.0f %% done (%d of %d)"% \
        (((i+1)**2.)/(len(gps)**2.)*100, i, len(gps))
def innerProd(mat1,mat2):
    return float(np.sum(np.dot(np.dot(mat1,mat2),mat1)))
What I would like to understand is: why does the program run fast during the first iterations and then slow down as it iterates further? I know the question might be a bit naive, but I really want to have a clearer idea of what is happening before I attempt anything else. I already implemented my function in Fortran (keeping all the for loops within Fortran) and used f2py to create a dynamic lib to call the function from Python. This is the new code in Python:
import numpy as np
import myfortranInnProd as fip
gps = [np.random.rand(50,50) for i in xrange(1000)]
ips = np.zeros( (len(gps),len(gps)), dtype='float32')
ips = fip.innerProd(gps)
Unfortunately, I found out (surprisingly) that my Fortran-Python version runs 1.5 to 2 times slower than the first version (it is important to mention that I used MATMUL() in the Fortran implementation). I have been googling around for a while, and I believe that this "slow down" has something to do with memory bandwidth, memory allocation, or caching, given the large datasets, but I am not sure what is really happening behind the scenes or how I could improve the performance. I have run the code on both a small Intel Atom with 2GB RAM and a 4-core Intel Xeon with 8GB (of course with a correspondingly scaled dataset), and the "slow down" behavior is the same.
I just need to understand why this "slow down" happens. Would it do any good if I implemented the function in C, or tried to run it on a GPU? Any other ideas on how to improve it? Thanks in advance.
At the risk of stating the obvious, the number of executions of the inner loop will grow each time you complete an execution of the outer loop. When i is 0, the inner loop will only be executed once, but when i is 100, it will be executed 101 times. Could this explain your observations, or do you mean that each execution of the inner loop itself is getting slower over time?
The number of executions of the inner for loop depends on the value of i, the index of the outer for loop. Since you're displaying your debug each time the inner loop finishes, it gets displayed less and less often as i grows. (Note that the percentage increases regularly, however.)
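The growth is easy to confirm by counting. This sketch (Python 3 syntax) reproduces the loop structure from the question with the matrix work removed and records how many inner iterations each outer iteration performs.

```python
n = 1000   # same size as len(gps) in the question

calls_per_row = []
for i in range(n):
    calls = 0
    for j in range(0, i + 1):   # same bounds as the original inner loop
        calls += 1
    calls_per_row.append(calls)

print(calls_per_row[0], calls_per_row[100], calls_per_row[-1])  # 1 101 1000
print(sum(calls_per_row))  # 500500, i.e. n * (n + 1) / 2
```

Each outer iteration therefore takes longer than the previous one even though every individual inner-product call costs the same.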
I am going through a link about generators that someone posted. In the beginning he compares the two functions below. On his setup he showed a speed increase of 5% with the generator.
I'm running windows XP, python 3.1.1, and cannot seem to duplicate the results. I keep showing the "old way"(logs1) as being slightly faster when tested with the provided logs and up to 1GB of duplicated data.
Can someone help me understand what's happening differently?
Thanks!
def logs1():
    wwwlog = open("big-access-log")
    total = 0
    for line in wwwlog:
        bytestr = line.rsplit(None,1)[1]
        if bytestr != '-':
            total += int(bytestr)
    return total
def logs2():
    wwwlog = open("big-access-log")
    bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
    getbytes = (int(x) for x in bytecolumn if x != '-')
    return sum(getbytes)
*edit, spacing messed up in copy/paste
For what it's worth, the main purpose of the speed comparison in the presentation was to point out that using generators does not introduce a huge performance overhead. Many programmers, when first seeing generators, might start wondering about the hidden costs. For example, is there all sorts of fancy magic going on behind the scenes? Is using this feature going to make my program run twice as slow?
In general that's not the case. The example is meant to show that a generator solution can run at essentially the same speed, if not slightly faster in some cases (although it depends on the situation, version of Python, etc.). If you are observing huge differences in performance between the two versions though, then that would be something worth investigating.
In David Beazley's slides that you linked to, he states that all tests were run with "Python 2.5.1 on OS X 10.4.11," and you say you're running tests with Python 3.1 on Windows XP. So, realize you're doing some apples to oranges comparison. I suspect of the two variables, the Python version matters much more.
Python 3 is a different beast than Python 2. Many things have changed under the hood, (even within the Python 2 branch). This includes performance optimizations as well as performance regressions (see, for example, Beazley's own recent blog post on I/O in Python 3). For this reason, the Python Performance Tips page states explicitly,
"You should always test these tips with your application and the version of Python you intend to use and not just blindly accept that one method is faster than another."
I should mention that one area that you can count on generators helping is in reducing memory consumption, rather than CPU consumption. If you have a large amount of data where you calculate or extract something from each individual piece, and you don't need the data after, generators will shine. See generator comprehension for more details.
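To make the memory point concrete, here is a small sketch in Python 3 syntax. It runs the same byte-count calculation twice over a made-up three-line log held in memory (SAMPLE_LOG and both function names are invented for this example): once building intermediate lists, once with generator expressions that hold only one line at a time.

```python
import io

# Invented sample data; the last field is the byte count or '-'.
SAMPLE_LOG = (
    '1.2.3.4 - - [date] "GET /a HTTP/1.1" 200 100\n'
    '1.2.3.4 - - [date] "GET /b HTTP/1.1" 304 -\n'
    '1.2.3.4 - - [date] "GET /c HTTP/1.1" 200 250\n'
)


def total_bytes_with_lists(lines):
    # Every intermediate result is stored in full before the next step runs.
    byte_column = [line.rsplit(None, 1)[1] for line in lines]
    return sum([int(x) for x in byte_column if x != '-'])


def total_bytes_with_generators(lines):
    # Each line flows through the whole pipeline before the next one is read.
    byte_column = (line.rsplit(None, 1)[1] for line in lines)
    return sum(int(x) for x in byte_column if x != '-')


list_total = total_bytes_with_lists(io.StringIO(SAMPLE_LOG))
gen_total = total_bytes_with_generators(io.StringIO(SAMPLE_LOG))
print(list_total, gen_total)  # 350 350
```

On a three-line input the difference is invisible; with a multi-gigabyte log, the list version needs memory proportional to the file while the generator version does not.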
You don't have an answer after almost a half an hour. I'm posting something that makes sense to me, not necessarily the right answer. I figure that this is better than nothing after almost half an hour:
The first algorithm uses a generator. A generator functions by loading the first page of results from the list (into memory) and continually loads the successive pages (into memory) until there is nothing left to read from input.
The second algorithm uses two generators, each with an if statement for a total of two comparisons per loop as opposed to the first algorithm's one comparison.
Also the second algorithm calls the sum function at the end as opposed to the first algorithm that simply keeps adding relevant integers as it keeps encountering them.
As such, for sufficiently large inputs, the second algorithm has more comparisons and an extra function call than the first. This could possibly explain why it takes longer to finish than the first algorithm.
Hope this helps