Problem
The code performs geostatistical interpolation by applying kriging. For small datasets it works well; however, when the data is large, the computation time increases drastically.
Constants
c1, c2, c3, c4 are constants
matrx is a dataset of size 6000 x 6000
Variables
data is a 6000 x 3 array
gdata is a 10000 x 2 array
Code
An extract of the code where I am having the problem is below:
prediction = []
for i in range(10000):
    semivariance = []
    for j in range(len(data[:, 2])):
        distance = np.sqrt((gdata[i, 0] - data[j, 0])**2 + (gdata[i, 1] - data[j, 1])**2)
        semivariance.append(c1 + c2*(1 - np.exp(-(distance/c3)**c4)))
    semivariance.append(1)
    iweights = np.linalg.lstsq(matrx, semivariance, rcond=None)
    weights = iweights[:-3][0][:-1]
    prediction.append(np.sum(data[:, 2]*weights))
When I debug the code, I realize that the problem comes from the
iweights = np.linalg.lstsq(matrx, semivariance, rcond=None)
which runs very slowly for the large matrx and semivariance array that I am using.
Is there a pythonic way to help improve the computational speed or a way I could rewrite the entire block of code to improve the speed?
Are you using the MKL library binding for NumPy? If not, you can try it and check whether it affects the performance of your code.
Also, for the prediction and semivariance lists you create empty lists and then append a value on each iteration. Since the number of iterations is fixed in your code, the code might be faster if you created the containers at full size from the start instead of growing a list with every append. I don't know whether the interpreter is smart enough to detect that the size is fixed and avoid the reallocations without your help.
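Something like this, as a rough sketch (untested; it reuses data, gdata, matrx and c1..c4 from your question):

import numpy as np

n_pred = 10000                      # number of prediction points (rows of gdata)
n_obs = len(data[:, 2])             # number of observations (rows of data)

prediction = np.empty(n_pred)       # preallocated instead of an empty list
semivariance = np.empty(n_obs + 1)  # +1 for the trailing 1 you append
semivariance[-1] = 1                # set once, only the first n_obs entries change

for i in range(n_pred):
    for j in range(n_obs):
        distance = np.sqrt((gdata[i, 0] - data[j, 0])**2 +
                           (gdata[i, 1] - data[j, 1])**2)
        semivariance[j] = c1 + c2*(1 - np.exp(-(distance/c3)**c4))
    iweights = np.linalg.lstsq(matrx, semivariance, rcond=None)
    weights = iweights[0][:-1]      # same as your iweights[:-3][0][:-1]
    prediction[i] = np.sum(data[:, 2]*weights)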
Related
After some research on Stack Overflow, I didn't find a simple answer to my problem, so I am sharing my code here in the hope of getting some help.
S = np.random.random((495, 930, 495, 3, 3))
# The shape of S is (495, 930, 495, 3, 3)
# I want to calculate some features for each small (3, 3) array at (z, y, x)
for z in range(S.shape[0]):
    for y in range(S.shape[1]):
        for x in range(S.shape[2]):
            res[z, y, x, 0] = np.array(np.linalg.det(S[z, y, x]) / np.trace(S[z, y, x]))
            res[z, y, x, 1] = np.array(S[z, y, x].mean())
            res[z, y, x, 2:] = np.array(np.linalg.eigvals(S[z, y, x]))
Here is my problem. The size of the S array is huge. So I was wondering if it is possible to make this for loop faster.
I had to reduce the shape to (49,93,49,3,3) so that it runs in acceptable time on my hardware. I was able to shave off 5-10% by avoiding unnecessary work (not optimizing your algorithm). Unnecessary work includes, but is not limited to:
Performing (global) lookups
Calculating the same value several times
You might also want to try a different python runtime, such as PyPy instead of CPython.
Here is my updated version of your script:
#!/usr/bin/python
import numpy as np

def main():
    # avoid lookups
    array = np.array
    trace = np.trace
    eigvals = np.linalg.eigvals
    det = np.linalg.det

    # The shape of S is (495,930,495,3,3)
    shape = (49, 93, 49, 3, 3)  # so my computer can run it
    S = np.random.random(shape)
    res = np.ndarray(shape)  # missing from the question, I hope this is correct

    # I want to calculate some features for each small (3, 3) array at (z, y, x)
    # get shape only once, instead of z times for shape1 and z*y times for shape2
    shape1 = S.shape[1]
    shape2 = S.shape[2]
    for z in range(S.shape[0]):
        for y in range(shape1):
            for x in range(shape2):
                # get the value once instead of 4 times
                s = S[z, y, x]
                res[z, y, x, 0] = array(det(s) / trace(s))
                res[z, y, x, 1] = array(s.mean())
                res[z, y, x, 2:] = array(eigvals(s))

# function to have local (vs. global) lookups
main()
Runtime was reduced from 25 to 23 seconds (measured with hyperfine).
Useful references:
Why does Python code run faster in a function?
Python: Two simple functions, Why is the first one faster than the second one
Python import X or from X import Y? (performance)
Is there a performance cost putting python imports inside functions?
If I have ratios to split a dataset into training, validation, and test sets, what is the most orthodox and elegant way of doing this in Python?
For instance, I split my data into 60% training, 20% testing, and 20% validation. I have 1000 rows of data with 10 features each, and a label vector of size 1000. The training set matrix should be of size (600, 10), and so on.
If I create new matrices of features and lists of labels, it wouldn't be memory efficient, right? Let's say I did something like this:
TRAIN_PORTION = int(datasetSize * tr)
VALIDATION_PORTION = int(datasetSize * va)
# Whatever is left will be for testing
TEST_PORTION = datasetSize - TRAIN_PORTION - VALIDATION_PORTION

trainingSet = dataSet[0:TRAIN_PORTION]
validationSet = dataSet[TRAIN_PORTION:TRAIN_PORTION + VALIDATION_PORTION]
testSet = dataSet[TRAIN_PORTION + VALIDATION_PORTION:datasetSize]
That would leave me with double the amount of memory used, right?
Sorry for the incorrect Python syntax, and thank you for any help.
That's correct: you will double the memory usage that way. To avoid that, you need to do one of two things:
Release the memory from one sub-matrix before you create the next; this reduces your memory high-water mark to 1.6x the main matrix;
Write your processing routines to stop at the proper row, always working on the original matrix.
You can achieve the first one by passing list slices to your processing routines, such as
model_test(data_set[:TRAIN_PORTION])
Remember, when you refer to a slice, the interpreter will build a temporary object that results from the given limits.
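A quick way to see that a plain list slice really is a copy:

data_set = list(range(10))
chunk = data_set[:6]       # temporary object built from the slice limits
chunk[0] = 99
print(data_set[0])         # still 0: the slice copied the data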
RESPONSE TO OP COMMENT
The reference I gave you does create a new list. To avoid using more memory, pass the entire list and the desired limits, such as
process_function(data_set, 0, TRAIN_PORTION)
process_function(data_set, TRAIN_PORTION,
                 TRAIN_PORTION + VALIDATION_PORTION)
process_function(data_set,
                 TRAIN_PORTION + VALIDATION_PORTION,
                 len(data_set))
If you want to do this with just list slices, then please explain where you're having trouble, and why the various pieces of documentation and the tutorials aren't satisfying your needs.
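For illustration, such a process_function might look like this (the name and the body are only placeholders; the point is that it reads rows start..stop of the original list and never builds a sub-list):

def process_function(data_set, start, stop):
    # work directly on the original container, no sub-list is created
    total = 0.0
    for i in range(start, stop):
        row = data_set[i]
        total += sum(row)   # stand-in for the real training/validation/test work
    return total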
If you use NumPy arrays (your code actually looks like it does), you can use views (memory is shared). It's not always easy to tell which operations return a view and which do not; here are some hints.
Short example:
import numpy as np
a = np.random.normal(size=(1000, 10))
b = a[:600]
print (b.flags['OWNDATA'])
# False
print(b[3,2])
# 0.373994992467 (some random-val)
a[3,2] = 88888888
print(b[3,2])
# 88888888.0
print(a.shape)
# (1000, 10)
print(b.shape)
# (600, 10)
This will probably allow you to do an in-place shuffle at the beginning and then use linear segments of your data to obtain views for train, val and test.
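A minimal sketch of that idea for the 60/20/20 split from your question (the repeated seed is just one common trick to give both arrays the same in-place permutation):

import numpy as np

data = np.random.normal(size=(1000, 10))
labels = np.random.normal(size=1000)

# shuffle both in place; same seed and same length -> same permutation
rng = np.random.default_rng(42)
rng.shuffle(data)
rng = np.random.default_rng(42)
rng.shuffle(labels)

train, val, test = data[:600], data[600:800], data[800:]            # views, no copies
train_y, val_y, test_y = labels[:600], labels[600:800], labels[800:]
print(train.flags['OWNDATA'], val.flags['OWNDATA'], test.flags['OWNDATA'])
# False False False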
I recently came across this for-loop code in MATLAB that confused me, because the inverse loop does the same thing faster. Why does this happen?
clear all
a = rand(1000,1000);
b = rand(1000,1000);
for i=1:1000
    for j=1:1000
        c(i,j) = a(i,j) + b(i,j);
    end
end
and the same code with inverse loop:
clear all
a = rand(1000,1000);
b = rand(1000,1000);
for i=1000:-1:1
    for j=1000:-1:1
        c(i,j) = a(i,j) + b(i,j);
    end
end
I did the same in Python with range(1000, 1, -1) and found the same result (the inverse loop is still faster).
Since you did not preallocate your output variable c, when you go in reverse order c is effectively preallocated to a full 1000 x 1000 matrix by the first loop iteration. When you count up, c grows on each iteration, which requires reallocating memory every time and is therefore slower. MATLAB will show this as a warning if you have warnings turned on.
The inverse loop is faster because its first iteration (c(1000,1000) = ...) creates an array of size 1000 x 1000, while the first piece of code continuously increases the size of the variable c.
To avoid such problems, preallocate the variables you write to in loops. Insert c = zeros(1000,1000) and both versions run fast. The MATLAB editor shows you warnings (yellow lines) that indicate potential performance problems and other issues with your code. Read these messages!
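Since you mention doing the same thing in Python: the advice is identical there. A rough, untested sketch of the preallocated Python version:

import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

c = np.zeros((1000, 1000))   # preallocate once, the analogue of c = zeros(1000,1000)
for i in range(1000):
    for j in range(1000):
        c[i, j] = a[i, j] + b[i, j]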
I am rewriting an analysis code for Molecular Dynamics time series. Due to the huge number of time steps (150 000 per simulation run) that have to be analysed, it is very important that my code is as fast as possible.
The old code is very slow (it needs 300 to 500 times as long as mine) because it was written to analyse a few thousand PDB files, not a whole set of different simulations (around 60), each with 150 000 time steps. I know that C or Fortran would be the Swiss army knife in this case, but my experience with C is .....
Therefore I am trying to use NumPy/SciPy routines as much as possible in my Python code. Because I have a license for the MKL-accelerated Anaconda distribution, this already gives a significant speedup.
Now I am facing a problem, and I hope I can explain it clearly.
I have three arrays, each with a shape of (n, 3, 20). The first axis runs over the residues of my peptide (commonly around 23 to 31), the second axis holds the coordinates in the order x, y, z, and the third axis holds some specific time steps.
Now I am calculating the torsion for each residue at each time step. My code for the case of arrays with shape (n, 3, 1) is:
def fast_torsion(d1, d2, d3):
    tt = np.dot(d1, np.cross(d2, d3))
    tb = np.dot(d1, d1) * np.dot(d2, d2)
    torsion = np.zeros([len(d1), 1])
    for i in xrange(len(d1)):
        if tb[i] != 0:
            torsion[i] = tt[i] / tb[i]
    return torsion
Now I tried to use the same code for the arrays with the extended third axis, but the cross product function produces the wrong values compared to the original slow code, which uses a for loop.
I tried this code with my big arrays; it is around 10 to 20 times faster than a for-loop solution and around 200 times faster than the old code.
What I want is for np.cross() to compute the cross product only over the second (xyz) axis and iterate over the other two axes. With the short third axis it works fine, but with the big arrays it only works for the first time step. I also tried the axis settings, but without success.
I can also use Cython or numba if this is the only solution for my problem.
P.S. Sorry for my English, I hope you can understand everything.
np.cross has axisa, axisb and axisc keyword arguments to select along which axes of the input and output arrays the vectors to be cross-multiplied lie. I think you want to use:
np.cross(d2, d3, axisa=1, axisb=1, axisc=1)
If you don't include axisc=1, the resulting cross-product vectors will be placed along the last axis of the output array.
Also, you can avoid explicitly looping over your torsion array by doing:
torsion = np.zeros((len(d1), 1))
idx = (tb != 0)
torsion[idx] = tt[idx] / tb[idx]
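Putting the two pieces together, a fully vectorized version for your (n, 3, t) arrays could look roughly like this (untested sketch; the sums over axis 1 take the place of the dot products over the xyz axis):

import numpy as np

def fast_torsion_nd(d1, d2, d3):
    # cross product over the xyz axis (axis 1), result also on axis 1
    c = np.cross(d2, d3, axisa=1, axisb=1, axisc=1)
    tt = (d1 * c).sum(axis=1)                         # shape (n, t)
    tb = (d1 * d1).sum(axis=1) * (d2 * d2).sum(axis=1)
    torsion = np.zeros_like(tt)
    idx = (tb != 0)
    torsion[idx] = tt[idx] / tb[idx]
    return torsion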
I had a pretty compact way of computing the partition function of an Ising-like model using itertools, lambda functions, and large NumPy arrays. Given a network consisting of N nodes and Q "states"/node, I have two arrays, h-fields and J-couplings, of sizes (N,Q) and (N,N,Q,Q) respectively. J is upper-triangular, however. Using these arrays, I have been computing the partition function Z using the following method:
# Set up lambda functions and iteration tuples of the form (A_1, A_2, ..., A_n)
iters = itertools.product(range(Q), repeat=N)
hf = lambda s: h[range(N), s]
jf = lambda s: np.array([J[fi, fj, s[fi], s[fj]]
                         for fi, fj in itertools.combinations(range(N), 2)]).flatten()
# Initialize and populate partition function array
pf = np.zeros(tuple([Q for i in range(N)]))
for it in iters:
    hterms = np.exp(hf(it)).prod()
    jterms = np.exp(-jf(it)).prod()
    pf[it] = jterms * hterms
# Calculates partition function
Z = pf.sum()
This method works quickly for small N and Q, say (N,Q) = (5,2). However, for larger systems (N,Q) = (18,3), this method cannot even create the pf array due to memory issues because it has Q^N nontrivial elements. Any ideas on how to either overcome this memory issue or how to alter the code to work on subarrays?
Edit: Made a small mistake in the definition of jf. It has been corrected.
You can avoid the large array just by initializing Z to 0 and incrementing it by jterms * hterms in each iteration. This still won't get you out of calculating and summing Q^N numbers, however. To do that, you probably need to figure out a way to simplify the partition function algebraically.
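In terms of the code in your question the change is small; reusing your iters, hf and jf definitions, a sketch:

# accumulate the partition function directly; the Q**N-element pf array is never built
Z = 0.0
for it in iters:
    hterms = np.exp(hf(it)).prod()
    jterms = np.exp(-jf(it)).prod()
    Z += jterms * hterms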
Not sure what you are trying to compute, but I tested your code with ChrisB's suggestion and jf will not work for Q=3.
Perhaps you shouldn't use a dense numpy array to encode your function? You could try sparse arrays or just straight Python with Numba compilation. This blogpost shows using Numba on the simple Ising model with good performance.
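As a rough, hypothetical sketch of the Numba route (untested; it combines the incremental-Z idea from the answer above with an @njit-compiled loop that walks the Q**N states with a base-Q counter instead of itertools):

import numpy as np
from numba import njit

@njit
def partition_function(h, J, N, Q):
    Z = 0.0
    state = np.zeros(N, dtype=np.int64)     # current state tuple (s_1, ..., s_N)
    for _ in range(Q ** N):
        # energy of this state: sum of h terms minus sum of upper-triangular J terms
        e = 0.0
        for i in range(N):
            e += h[i, state[i]]
        for i in range(N):
            for j in range(i + 1, N):
                e -= J[i, j, state[i], state[j]]
        Z += np.exp(e)
        # advance state like a base-Q counter
        k = N - 1
        while k >= 0:
            state[k] += 1
            if state[k] < Q:
                break
            state[k] = 0
            k -= 1
    return Z

# usage: Z = partition_function(h, J, N, Q)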