I am rewriting an analysis code for molecular dynamics time series. Because a huge number of time steps (150 000 per simulation run) have to be analysed, it is very important that my code is as fast as possible.
The old code is very slow (it takes 300 to 500 times longer than mine) because it was written to analyse a few thousand PDB files, not a whole batch of simulations (around 60), each with 150 000 time steps. I know that C or Fortran would be the Swiss army knife in this case, but my experience with C is .....
Therefore I am trying to use numpy/scipy routines as much as possible in my Python code. Because I have a license for the accelerated Anaconda distribution with MKL, this gives a really significant speedup.
Now I am facing a problem, and I hope I can explain it clearly.
I have three arrays, each with a shape of (n, 3, 20). The first axis indexes the residues of my peptide, commonly around 23 to 31. The second axis holds the coordinates in xyz order, and the third axis holds some specific time steps.
Now I am calculating the torsion for each residue at each time step. My code for arrays with shape (n, 3, 1) is:
def fast_torsion(d1, d2, d3):
    tt = dot(d1, np.cross(d2, d3))
    tb = dot(d1, d1) * dot(d2, d2)
    torsion = np.zeros([len(d1), 1])
    for i in xrange(len(d1)):
        if tb[i] != 0:
            torsion[i] = tt[i]/tb[i]
    return torsion
Now I tried to use the same code for the arrays with the extended third axis, but the cross product function produces wrong values compared to the original slow code, which uses a for loop.
When I tried this code with the big arrays, it was around 10 to 20 times faster than a for-loop solution and around 200 times faster than the old code.
What I want is for np.cross() to compute the cross product only over the second (xyz) axis and to iterate over the other two axes. With the short third axis it works fine, but with the big arrays it only works for the first time step. I also tried the axis settings, but without success.
I can also use Cython or numba if this is the only solution for my problem.
P.S. Sorry for my English; I hope you can understand everything.
np.cross has axisa, axisb and axisc keyword arguments that select which axis of each input array, and of the output array, holds the vectors to be cross multiplied. I think you want to use:
np.cross(d2, d3, axisa=1, axisb=1, axisc=1)
If you don't include the axisc=1, the result of the multiplication will be at the end of the output array.
Also, you can avoid explicitly looping over your torsion array by doing:
torsion = np.zeros((len(d1), 1))
idx = (tb !=0)
torsion[idx] = tt[idx] / tb[idx]
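To show how the two suggestions combine for arrays with a longer third axis, here is a minimal sketch. It assumes the inputs all have shape (n, 3, T) and that the dot products in the question are meant to be taken over the xyz axis only; the function name fast_torsion_all_steps and the use of np.einsum for those dot products are my own choices, not something from the question.

```python
import numpy as np

def fast_torsion_all_steps(d1, d2, d3):
    """Torsion-like ratio for every residue and time step.

    Hypothetical helper: d1, d2, d3 are assumed to have shape (n, 3, T),
    with the xyz components on axis 1.
    """
    # Cross product over the xyz axis only; the result keeps shape (n, 3, T).
    cross = np.cross(d2, d3, axisa=1, axisb=1, axisc=1)
    # Dot products over the xyz axis, giving arrays of shape (n, T).
    tt = np.einsum('ijk,ijk->ik', d1, cross)
    tb = np.einsum('ijk,ijk->ik', d1, d1) * np.einsum('ijk,ijk->ik', d2, d2)
    # Divide only where the denominator is non-zero, as in the original loop.
    torsion = np.zeros_like(tt)
    idx = tb != 0
    torsion[idx] = tt[idx] / tb[idx]
    return torsion
```

The result has shape (n, T), one value per residue and time step, so no Python-level loop over either axis is needed.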
Problem
The code performs geostatistic interpolation by applying kriging. For small data size, it works great. However, when the data is large, the computational time increases drastically.
Constants
c1, c2, c3, c4 are constants
matrx is a dataset of size 6000 x 6000
Variables
data is a 6000 x 3 array
gdata is a 10000 x 2 array
Code
An extract of the code where I am having the problem is below:
prediction = []
for i in range(10000):
    semivariance = []
    for j in range(len(data[:, 2])):
        distance = np.sqrt((gdata[i, 0]-data[j, 0])**2 + (gdata[i, 1]-data[j, 1])**2)
        semivariance.append((c1 + c2*(1-np.exp(-(distance/c3)**c4))))
    semivariance.append(1)
    iweights = np.linalg.lstsq(matrx, semivariance, rcond=None)
    weights = iweights[:-3][0][:-1]
    prediction.append(np.sum(data[:, 2]*weights))
When I debug the code, I realize that the problem comes from the
iweights = np.linalg.lstsq(matrx, semivariance, rcond=None)
which runs very slow for the large matrx and semivariance array that I am using.
Is there a pythonic way to help improve the computational speed or a way I could rewrite the entire block of code to improve the speed?
Are you using a numpy build linked against the MKL library? If not, you can try it and check whether it affects the performance of your code.
Also, you create prediction and semivariance as empty lists and then append a value to them on each iteration. As far as I understand, the number of iterations is fixed in your code, so would the code be faster if you created the lists at full size from the start, to avoid growing the list with every append? I don't know whether the interpreter is smart enough to detect that the size of the list is fixed and avoid the reallocation without your help.
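Another thing that may be worth trying: matrx does not change inside the loop, and np.linalg.lstsq accepts a 2-D right-hand side, so all the semivariance vectors can be solved in one call instead of 10000 separate ones. The sketch below assumes matrx has shape (n+1, n+1) so that it matches the semivariance vector with the appended 1 (the question states 6000 x 6000, so check this against your data); the function name krige_batched is made up for the illustration.

```python
import numpy as np

def krige_batched(matrx, data, gdata, c1, c2, c3, c4):
    """Predict at every row of gdata with a single least-squares solve.

    Hypothetical helper. Assumed shapes: data (n, 3), gdata (m, 2),
    matrx (n + 1, n + 1).
    """
    # Pairwise distances between grid points and data points, shape (m, n).
    dx = gdata[:, 0][:, None] - data[:, 0][None, :]
    dy = gdata[:, 1][:, None] - data[:, 1][None, :]
    distance = np.hypot(dx, dy)
    semivar = c1 + c2 * (1.0 - np.exp(-(distance / c3) ** c4))
    # One column per grid point, with the trailing 1 appended: shape (n + 1, m).
    rhs = np.vstack([semivar.T, np.ones((1, gdata.shape[0]))])
    # lstsq solves for all columns of rhs at once.
    solution = np.linalg.lstsq(matrx, rhs, rcond=None)[0]
    weights = solution[:-1]          # drop the last row, shape (n, m)
    return data[:, 2] @ weights      # one prediction per grid point
```

For 10000 x 6000 points the intermediate (m, n) arrays take a few hundred megabytes each, so it may be necessary to process gdata in chunks of a few hundred rows.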
I've recently "taught" myself python in order to analyze data for my experiments. As such I'm pretty clueless on many aspects. I've managed to make my analysis work for certain files but in some cases it breaks down and I imagine it is a result of faulty programming.
Currently I export a file containing 3 numpy arrays. One of these arrays is my signal (float values from -10 to 10). What I wish to do is normalize every datum in this array to a range of values that precede it (i.e. the 30001st value must have the average of the preceding 3000 values subtracted from it, and the difference must then be divided by this very same average of the preceding 3000 values). My data is collected at a rate of 100 Hz, so to normalize to the last 30 s I must use the preceding 3000 values.
As it stand this is how I've managed to make it work:
this stores the signal into the variable photosignal
photosignal = np.array(seg.analogsignals[0], ndmin=1)
now this the part I use to get the delta F/F over a moving window of 30s
normalizedphotosignal = [(uu-(np.mean(photosignal[uu-3000:uu])))/abs(np.mean(photosignal[uu-3000:uu])) for uu in photosignal[3000:]]
The following adds 3000 values to the beginning to keep the array the same length since later on i must time lock it to another list that is the same length
holder =list(range(3000))
normalizedphotosignal = holder + normalizedphotosignal
What I have noticed is that in certain files this code gives me an error because it says that the "slice" is empty and therefore it cannot compute a mean.
I think maybe there is a better way to program this that could avoid this problem altogether. Or is this a correct way to approach the problem?
So I tried the solution, but it is quite slow and it still gives me the "empty slice" error.
I went over the moving average post and found this method:
def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
However, I'm having trouble adapting it to my desired output, namely (x - running average) / running average.
All right, so I finally figured it out thanks to your help and the posts you referred me to.
The calculation for my entire data (300 000 +) takes about a second!
I used the following code:
def runningmean(x,N):
    cumsum =np.cumsum(np.insert(x,0,0))
    return (cumsum[N:] -cumsum[:-N])/N
photosignal = np.array(seg.analogsignals[0], ndmin =1)
photosignalaverage = runningmean(photosignal, 3000)
holder = np.zeros(2999)
photosignalaverage = np.append(holder,photosignalaverage)
detalfsignal = (photosignal-photosignalaverage)/abs(photosignalaverage)
Photosignal stores my raw signal in a numpy array.
Photosignalaverage uses cumsum to calculate the running average of every datapoint in photosignal. I then add the first 2999 values as 0, to maintain the same list size as my photosignal.
I then use basic numpy calculations to get my delta F/F signal.
Thank you once more for the feedback, was truly helpful!
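One detail worth checking in the code above: runningmean(x, N)[k] is the mean of x[k:k+N], so after prepending 2999 zeros the average stored at index i covers x[i-2999:i+1], which includes the current sample, and the first 2999 outputs are divided by zero. If the baseline should be strictly the 3000 preceding samples, a variant along these lines might be closer to the original description. This is only a sketch: the function name delta_f_over_f and the choice to fill the undefined leading entries with NaN are my assumptions.

```python
import numpy as np

def delta_f_over_f(signal, window):
    """(x - baseline) / |baseline|, with baseline = mean of the preceding samples.

    Hypothetical helper: entries with fewer than `window` preceding
    samples are returned as NaN so the output keeps the input length.
    """
    signal = np.asarray(signal, dtype=float)
    # csum[k] is the sum of signal[:k].
    csum = np.cumsum(np.insert(signal, 0, 0.0))
    # baseline[j] is the mean of signal[j:j + window], i.e. the samples
    # preceding index j + window.
    baseline = (csum[window:-1] - csum[:-window - 1]) / window
    out = np.full(signal.shape, np.nan)
    out[window:] = (signal[window:] - baseline) / np.abs(baseline)
    return out
```

With window=3000 and a 100 Hz signal, out[i] is then defined for i >= 3000 and uses exactly the 30 s before sample i.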
Your approach goes in the right direction. However, you made a mistake in your list comprehension: you are using uu as your index whereas uu are the elements of your input data photosignal.
You want something like this:
normalizedphotosignal2 = np.zeros((photosignal.shape[0]-3000))
for i, uu in enumerate(photosignal[3000:]):
    # uu is photosignal[i+3000], so its 3000 preceding samples are photosignal[i:i+3000]
    baseline = np.mean(photosignal[i:i+3000])
    normalizedphotosignal2[i] = (uu - baseline) / abs(baseline)
Keep in mind that for-loops are relatively slow in python. If performance is an issue here, you could try avoiding the for loop and use numpy methods instead (e.g. have a look at Moving average or running mean).
Hope this helps.
First time posting, so I apologize for any confusion.
I have two numpy arrays which are time stamps for a signal.
chan1,chan2 looks like:
911.05, 7.7
1055.6, 455.0
1513.4, 1368.15
4604.6, 3004.4
4970.35, 3344.25
13998.25, 4029.9
15008.7, 6310.15
15757.35, 7309.75
16244.2, 8696.1
16554.65, 9940.0
..., ...
and so on (up to 65000 elements per channel, per file).
Edit: The lists are already sorted, but they are not always equally spaced. Gaps can show up and misalign them, so chan1[3] could be closest to chan2[23] rather than to chan2[2], chan2[3] or chan2[4], as it would be if the spacing were equal. End edit
For each elements in chan1, I am interested in finding the closest neighbor in chan2, which is done with:
np.min(np.abs(chan2-chan1[i]))
and to keep track of positive or neg. difference:
index=np.where( np.abs( chan2-chan1[i]) == res[i])[0][0]
if chan2[index]-chan1[i] <0.0 : res[i]=res[i]*(-1.0)
Lastly, I create a histogram of all the differences, in a range I am interested in.
My concern is that I do this in a for loop. I usually try to avoid for loops by using numpy arrays, since each operation can be performed on the entire array. However, in this case I am unable to find a solution or a built-in function (which I understand would run significantly faster than anything I can write).
The routine takes about 0.03 seconds per file. There are a few more things happening outside of the function but not a significant number, mostly plotting after everything is done, and a loop to read in files.
I was wondering if anyone has seen a similar problem, or is familiar enough with the Python libraries to suggest a solution (maybe a built-in function?) to obtain the data I am interested in. I have to go over hundreds of thousands of files, and currently my data analysis is about 10 times slower than data acquisition. We are also in the middle of upgrading our instruments so that we will be able to obtain data 10-100 times faster, so the analysis speed is going to become a serious issue.
I would prefer not to use a cluster to brute-force the problem, and I am not too familiar with parallel processing, although I would not mind dabbling in it. It would take me a while to write it in C, and I am not sure I would be able to make it faster.
Thank you in advance for your help.
def gen_hist(chan1,chan2):
    res=np.arange(1,len(chan1)+1,1)*0.0
    for i in range(len(chan1)):
        res[i]=np.min(np.abs(chan2-chan1[i]))
        index=np.where( np.abs( chan2-chan1[i]) == res[i])[0][0]
        if chan2[index]-chan1[i] <0.0 : res[i]=res[i]*(-1.0)
    return np.histogram(res,bins=np.arange(time_range[0]-interval,\
                                           time_range[-1]+interval,\
                                           interval))[0]
After all the files are cycled through I obtain a plot of the data:
[Figure: example of the histogram]
Your question is a little vague, but I'm assuming that, given two sorted arrays, you're trying to return an array containing the differences between each element of the first array and the closest value in the second array.
Your algorithm will have a worst case of O(n^2) (np.where() and np.min() are O(n)). I would tackle this by using two iterators instead of one. You store the previous (r_p) and current (r_c) value of the right array and the current (l_c) value of the left array. For each value of the left array, increment the right array until r_c > l_c. Then append min(abs(r_p - l_c), abs(r_c - l_c)) to your result.
In code:
l = [ ... ]
r = [ ... ]
i = 0
j = 0
result = []
r_p = r_c = r[0]
while i < len(l):
    l_c = l[i]
    # advance the right pointer until r_c >= l_c or the right array is exhausted
    while r_c < l_c and j < len(r) - 1:
        j += 1
        r_c = r[j]
        r_p = r[j-1]
    result.append(min(abs(r_c - l_c), abs(r_p - l_c)))
    i += 1
This runs in O(n). If you need additional speed out of it, try writing it in C or running it in Cython.
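Since both arrays are already sorted, the same idea can also be written without a Python-level loop using np.searchsorted, which does a binary search for every element of chan1 at once (O(n log n) overall, but executed in compiled code). The following is a sketch, not a drop-in replacement: the function name signed_nearest_diff is made up, and it assumes chan2 is sorted in ascending order and non-empty. It keeps the sign of the difference, as the question's loop does.

```python
import numpy as np

def signed_nearest_diff(chan1, chan2):
    """Signed difference chan2[nearest] - chan1[i] for every i.

    Hypothetical helper; chan2 is assumed sorted in ascending order.
    """
    chan1 = np.asarray(chan1, dtype=float)
    chan2 = np.asarray(chan2, dtype=float)
    # Index of the first chan2 element that is >= each chan1 element.
    pos = np.searchsorted(chan2, chan1)
    # Candidate neighbours on either side, clipped to valid indices.
    left = np.clip(pos - 1, 0, len(chan2) - 1)
    right = np.clip(pos, 0, len(chan2) - 1)
    d_left = chan2[left] - chan1
    d_right = chan2[right] - chan1
    # On a tie the left neighbour wins, matching np.where(...)[0][0].
    return np.where(np.abs(d_left) <= np.abs(d_right), d_left, d_right)
```

The returned array can be passed straight to np.histogram in place of res.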
I'm writing a program in Python that's processing some data generated during experiments, and it needs to estimate the slope of the data. I've written a piece of code that does this quite nicely, but it's horribly slow (and I'm not very patient). Let me explain how this code works:
1) It grabs a small piece of data of size dx (starting with 3 datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope using OLS regression. If the difference is too small, it will increase dx and redo the loop with this new dx
4) This continues for all the datapoints
[See updated code further down]
For a datasize of about 100k measurements, this takes about 40 minutes, whereas the rest of the program (it does more processing than just this bit) takes about 10 seconds. I am certain there is a much more efficient way of doing these operations, could you guys please help me out?
Thanks
EDIT:
Ok, so I've got the problem solved by using only binary searches, limiting the number of allowed steps by 200. I thank everyone for their input and I selected the answer that helped me most.
FINAL UPDATED CODE:
def slope(self, data, time):
    (wave1, wave2) = wt.dwt(data, "db3")
    std = 2*np.std(wave2)
    e = std/0.05
    de = 5*std
    N = len(data)
    slopes = np.ones(shape=(N,))
    data2 = np.concatenate((-data[::-1]+2*data[0], data, -data[::-1]+2*data[N-1]))
    time2 = np.concatenate((-time[::-1]+2*time[0], time, -time[::-1]+2*time[N-1]))
    for n in xrange(N+1, 2*N):
        left = N+1
        right = 2*N
        for i in xrange(200):
            mid = int(0.5*(left+right))
            diff = np.abs(data2[n-mid+N]-data2[n+mid-N])
            if diff >= e:
                if diff < e + de:
                    break
                right = mid - 1
                continue
            left = mid + 1
        leftlim = n - mid + N
        rightlim = n + mid - N
        y = data2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
        x = time2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
        xavg = np.average(x)
        yavg = np.average(y)
        xlen = len(x)
        slopes[n-N] = (np.dot(x,y)-xavg*yavg*xlen)/(np.dot(x,x)-xavg*xavg*xlen)
    return np.array(slopes)
Your comments suggest that you need to find a better method to estimate i_{k+1} given i_k. Without any knowledge of the values in data, this leads to the naive algorithm:
At each iteration for n, leave i at previous value, and see if the abs(data[start]-data[end]) value is less than e. If it is, leave i at its previous value, and find your new one by incrementing it by 1 as you do now. If it is greater, or equal, do a binary search on i to find the appropriate value. You can possibly do a binary search forwards, but finding a good candidate upper limit without knowledge of data can prove to be difficult. This algorithm won't perform worse than your current estimation method.
If you know that data is kind of smooth (no sudden jumps, and hence a smooth plot for all i values) and monotonically increasing, you can replace the binary search with a search backwards by decrementing its value by 1 instead.
How to optimize this will depend on some properties of your data, but here are some ideas:
Have you tried profiling the code? Using one of the Python profilers can give you some useful information about what's taking the most time. Often, a piece of code you've just written will have one biggest bottleneck, and it's not always obvious which piece it is; profiling lets you figure that out and attack the main bottleneck first.
Do you know what typical values of i are? If you have some idea, you can speed things up by starting with i greater than 0 (as @vhallac noted), or by increasing i by larger amounts: if you often see big values for i, increase i by 2 or 3 at a time; if the distribution of i values has a long tail, try doubling it each time; etc.
Do you need all the data when doing the least squares regression? If that function call is the bottleneck, you may be able to speed it up by using only some of the data in the range. Suppose, for instance, that at a particular point, you need i to be 200 to see a large enough (above-noise) change in the data. But you may not need all 400 points to get a good estimate of the slope — just using 10 or 20 points, evenly spaced in the start:end range, may be sufficient, and might speed up the code a lot.
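As a rough illustration of the last idea, the sketch below fits the slope from a handful of evenly spaced points in the start:end range instead of all of them. The function name subsampled_slope and the default of 20 points are assumptions for the example; the slope itself is the usual closed-form OLS estimate.

```python
import numpy as np

def subsampled_slope(x, y, start, end, n_points=20):
    """OLS slope of y against x using about n_points samples from [start, end).

    Hypothetical helper; x and y are 1-D numpy arrays of equal length,
    and the range must contain at least two distinct indices.
    """
    # Evenly spaced indices in the range, with duplicates removed
    # in case the range is shorter than n_points.
    idx = np.unique(np.linspace(start, end - 1, n_points).astype(int))
    xs, ys = x[idx], y[idx]
    xm, ym = xs.mean(), ys.mean()
    # Closed-form least-squares slope: cov(x, y) / var(x).
    return np.dot(xs - xm, ys - ym) / np.dot(xs - xm, xs - xm)
```

Whether 10 or 20 points are enough depends on the noise level, so it is worth comparing against the full fit on a few windows first.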
I work with Python for similar analyses, and have a few suggestions to make. I didn't look at the details of your code, just to your problem statement:
1) It grabs a small piece of data of size dx (starting with 3
datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is
larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope
using OLS regression. If the difference is too small, it will increase
dx and redo the loop with this new dx
4) This continues for all the datapoints
I think the more obvious reason for slow execution is the LOOPING nature of your code, when perhaps you could use the VECTORIZED (array-based operations) nature of Numpy.
For step 1, instead of taking pairs of points, you can compute data[3:] - data[:-3] directly and get all the differences in a single array operation;
For step 2, you can use the result from array-based tests like numpy.argwhere(data > threshold) instead of testing every element inside some loop;
Step 3 sounds conceptually wrong to me. You say that if the difference is too small, it will increase dx. But if the difference is small, the resulting slope would be small because it IS actually small. Then, getting a small value is the right result, and artificially increasing dx to get a "better" result might not be what you want. Well, it might actually be what you want, but you should consider this. I would suggest that you calculate the slope for a fixed dx across the whole data, and then take the resulting array of slopes to select your regions of interest (for example, using data_slope[numpy.argwhere(data_slope > minimum_slope)]).
Hope this helps!
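A minimal sketch of the fixed-dx suggestion, assuming data and time are 1-D arrays of equal length. Note that it uses a simple secant (difference quotient) over the window rather than an OLS fit, and the function name fixed_window_slope is made up for the example.

```python
import numpy as np

def fixed_window_slope(data, time, dx):
    """Secant slope (y(x+dx) - y(x-dx)) / (t(x+dx) - t(x-dx)) at every interior point.

    Hypothetical helper; dx must be >= 1, and the output has
    len(data) - 2*dx entries (the edges are dropped).
    """
    data = np.asarray(data, dtype=float)
    time = np.asarray(time, dtype=float)
    dy = data[2 * dx:] - data[:-2 * dx]
    dt = time[2 * dx:] - time[:-2 * dx]
    return dy / dt
```

Regions of interest can then be picked with a boolean mask, for example slopes[np.abs(slopes) > minimum_slope], where minimum_slope is a threshold you choose.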
I'm running kmeans on a large dataset and I'm always getting the error below:
Error using kmeans (line 145)
Some points have small relative magnitudes, making them effectively zero.
Either remove those points, or choose a distance other than 'cosine'.
Error in runkmeans (line 7)
[L, C]=kmeans(data, 10, 'Distance', 'cosine', 'EmptyAction', 'drop')
My problem is that even when I add 1 to all the vectors, I still get this error. I would expect it to pass then, but apparently there are still too many zeros (that is what is causing it, right?).
My question is this: what is the condition that makes Matlab decide that a point has "a small relative magnitude" and "is effectively zero"?
I want to remove all these points from my dataset using python, before I hand over the data to Matlab, because I need to compare my results with a gold standard that I process in python.
Thanks in advance!
EDIT-ANSWER
The correct answer was given below, but in case someone finds this question through Google, here's how you remove the "effectively zero-vectors" from your matrix in python. Every row (!) is a data point, so you want to transpose in python or Matlab if you're running kmeans:
def getxnorm(data):
    return np.sqrt(np.sum(data ** 2, axis=1))
def remove_zero_vector(data, startxnorm, excluded=None):
    if excluded is None:  # a mutable default list would be shared between calls
        excluded = []
    eps = 2.2204e-016
    xnorm = getxnorm(data)
    if np.min(xnorm) <= (eps * np.max(xnorm)):
        local_index=np.transpose(np.where(xnorm == np.min(xnorm)))[0][0]
        global_index=np.transpose(np.where(startxnorm == np.min(xnorm)))[0][0]
        data=np.delete(data, local_index, 0) # data with zero vector removed
        excluded.append(global_index) # add global index to list of excluded vectors
        return remove_zero_vector(data, startxnorm, excluded)
    else:
        return (data, excluded)
I'm sure there's a much more scipythonic way for doing this, but it'll do :-)
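For what it's worth, the same filtering can probably be done in one pass with a boolean mask, because removing the smallest rows does not change the maximum norm. This is a sketch under that assumption; the function name drop_effectively_zero_rows is made up, it uses np.finfo(float).eps for the machine epsilon, and it expects a non-empty 2-D array with one data point per row.

```python
import numpy as np

def drop_effectively_zero_rows(data):
    """Remove rows whose norm is <= eps * (largest row norm).

    Hypothetical helper; returns the filtered data and the original
    indices of the rows that were dropped.
    """
    data = np.asarray(data, dtype=float)
    xnorm = np.sqrt(np.sum(data ** 2, axis=1))
    eps = np.finfo(float).eps          # about 2.2204e-16 for float64
    keep = xnorm > eps * xnorm.max()
    excluded = np.flatnonzero(~keep)
    return data[keep], excluded
```

Unlike the recursive version, this does not need the starting norms to recover the global indices, and it cannot hit the recursion limit when many rows are dropped.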
If you're using this kmeans, then the relevant code that is throwing the error is:
case 'cosine'
    Xnorm = sqrt(sum(X.^2, 2));
    if any(min(Xnorm) <= eps * max(Xnorm))
        error(['Some points have small relative magnitudes, making them ', ...
               'effectively zero.\nEither remove those points, or choose a ', ...
               'distance other than ''cosine''.'], []);
    end
So there's your test.
As you can see, what's important is relative size, so adding one to everything only makes things worse (max(Xnorm) is getting larger too). A good fix might be to scale all the data by a constant.
In your other question it looked like your data was scalar. If your input vectors only have one feature/dimension, the cosine distance between them will always be undefined (or zero) because by definition they are pointing in the same direction (along the single axis). The cosine measure gives the angle between two vectors, which can only be non-zero if the vectors can point in different directions (i.e. dimension > 1).