I have two arrays as an output from a simulation script where one contains IDs and one times, i.e. something like:
ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
These arrays are always of the same size. Now I need to calculate the differences of the times, but only between times with the same id. Of course, I can simply loop over the different ids and do
for i in np.unique(ids):
    diffs = np.diff(times[ids == i])
    print(diffs)
    # do stuff with diffs
However, this is quite inefficient and the two arrays can be very large. Does anyone have a good idea on how to do that more efficiently?
You can use array.argsort() and ignore the values that correspond to a change in ids:
>>> id_ind = ids.argsort(kind='mergesort')
>>> times_diffs = np.diff(times[id_ind])
>>> times_diffs
array([ 0.2, -0.2,  0.3,  0.6, -1.1,  1.2])
To see which values you need to discard, you could use a Counter to count the number of occurrences per id (from collections import Counter),
or just sort ids and see where its diff is nonzero: these are the indices where the id changes, and where your time diffs are irrelevant, so keep only the others:
times_diffs[np.diff(ids[id_ind]) == 0]  # ids[id_ind] is the sorted ids sequence
and finally you can split this array with np.split and np.where:
np.split(times_diffs, np.where(np.diff(ids[id_ind]) != 0)[0])
As you mentioned in your comment, argsort()'s default algorithm (quicksort) might not preserve the order of equal elements, so the argsort(kind='mergesort') option must be used.
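Putting the pieces together, here is a minimal end-to-end sketch. Note that after the split, every group except the first still starts with the cross-boundary diff, which has to be dropped:
import numpy as np

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

order = ids.argsort(kind='mergesort')   # stable sort keeps time order per id
sorted_ids = ids[order]
time_diffs = np.diff(times[order])

# indices where the id changes; the diff at each of these crosses a boundary
cuts = np.where(np.diff(sorted_ids) != 0)[0]

groups = np.split(time_diffs, cuts)
# every group after the first starts with the cross-boundary diff: drop it
groups = [groups[0]] + [g[1:] for g in groups[1:]]

for uid, d in zip(np.unique(sorted_ids), groups):
    print(uid, d)   # 0 [0.2]   1 [0.3 0.6]   2 [1.2]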
Say you np.argsort by ids:
inds = np.argsort(ids, kind='mergesort')
>>> inds
array([1, 3, 2, 4, 5, 0, 6])
Now sort times by this, np.diff, and prepend a nan:
diffs = np.concatenate(([np.nan], np.diff(times[inds])))
>>> diffs
array([ nan, 0.2, -0.2, 0.3, 0.6, -1.1, 1.2])
These differences are correct except at the boundaries between different ids. Let's mark the positions that share their id with the previous (sorted) entry:
same_id = np.concatenate(([False], ids[inds][1:] == ids[inds][:-1]))
>>> same_id
array([False,  True, False,  True,  True, False,  True], dtype=bool)
Now we can just blank out the cross-id diffs:
diffs[~same_id] = np.nan
Let's see what we got:
>>> ids[inds]
array([0, 0, 1, 1, 1, 2, 2])
>>> times[inds]
array([ 0.3, 0.5, 0.3, 0.6, 1.2, 0.1, 1.3])
>>> diffs
array([ nan, 0.2, nan, 0.3, 0.6, nan, 1.2])
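For convenience, the steps above can be wrapped into a small helper (a minimal sketch; the name grouped_diffs is just illustrative):
import numpy as np

def grouped_diffs(ids, times):
    # illustrative helper wrapping the steps shown above
    inds = np.argsort(ids, kind='mergesort')
    diffs = np.concatenate(([np.nan], np.diff(times[inds])))
    # True where the sorted entry shares its id with its predecessor
    same_id = np.concatenate(([False], ids[inds][1:] == ids[inds][:-1]))
    diffs[~same_id] = np.nan   # blank out cross-id diffs
    return inds, diffs

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
inds, diffs = grouped_diffs(ids, times)
# diffs -> [nan 0.2 nan 0.3 0.6 nan 1.2], aligned with times[inds]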
I'm adding another answer, since, even though these things are possible in numpy, I think that the higher-level pandas is much more natural for them.
In pandas, you could do this in one step, after creating a DataFrame:
import pandas as pd

df = pd.DataFrame({'ids': ids, 'times': times})
df['diffs'] = df.groupby('ids')['times'].diff()
This gives:
>>> df
   ids  times  diffs
0    2    0.1    NaN
1    0    0.3    NaN
2    1    0.3    NaN
3    0    0.5    0.2
4    1    0.6    0.3
5    1    1.2    0.6
6    2    1.3    1.2
The numpy_indexed package (disclaimer: I am its author) contains efficient and flexible functionality for this kind of grouping operation:
import numpy_indexed as npi
unique_ids, diffed_time_groups = npi.group_by(keys=ids, values=times, reduction=np.diff)
Unlike pandas, it does not require a specialized data structure just to perform this kind of rather elementary operation.
I have a list (or some type of array) with nearly all values between 0 and 1, but I have the occasional value that is slightly negative or greater than 1.
list_values = [-0.01, 0, 0.5, 0.9, 1.0, 1.01]
I want to replace negatives with 0 and values greater than 1 with 1.
With only 1 condition, I would use np.where like this:
arr_values = np.where(pd.Series(list_values) < 0, 0, pd.Series(list_values))
To work with multiple conditions, I could define a function and then apply it using a lambda function:
def change_values(value):
    if value < 0:
        return 0
    elif value > 1:
        return 1
    else:
        return value
series_values = pd.Series(list_values).apply(lambda x: change_values(value=x))
Is there a faster way to accomplish this?
You want to use np.clip:
>>> import numpy as np
>>> list_values = [-0.01, 0, 0.5, 0.9, 1.0, 1.01]
>>> arr = np.array(list_values)
>>> np.clip(arr, 0.0, 1.0)
array([0. , 0. , 0.5, 0.9, 1. , 1. ])
This is likely the fastest approach if you can ignore the cost of converting to an array, and it should be a lot better for larger lists/arrays.
Involving pandas in this operation isn't the way to go unless you eventually want a pandas data structure.
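If you want to verify the speed claim on your own machine, here is a rough timeit sketch (the array size and the boolean-mask baseline are illustrative choices, not part of the question):
import timeit
import numpy as np

arr = np.random.uniform(-0.5, 1.5, size=1_000_000)

def clip_way():
    return np.clip(arr, 0.0, 1.0)

def mask_way():
    out = arr.copy()       # boolean-mask version for comparison
    out[out < 0] = 0
    out[out > 1] = 1
    return out

print(timeit.timeit(clip_way, number=100))
print(timeit.timeit(mask_way, number=100))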
This is a rather easy task with numpy.
import numpy as np
n = np.array([-0.01, 0, 0.5, 0.9, 1.0, 1.01])
n[n > 1] = 1   # cap everything above 1
n[n < 0] = 0   # floor everything below 0
>>> print(n)
[0. 0. 0.5 0.9 1. 1. ]
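If you prefer to stay close to the np.where style from the question, the two conditions can also be nested; this sketch is equivalent to np.clip for this input:
import numpy as np

arr = np.array([-0.01, 0, 0.5, 0.9, 1.0, 1.01])
# inner where handles the upper bound, outer where handles the lower bound
clipped = np.where(arr < 0, 0, np.where(arr > 1, 1, arr))
print(clipped)  # [0.  0.  0.5 0.9 1.  1. ]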
I think this is an easy question for experienced numpy users.
I have a score matrix. The row index corresponds to samples and the column index to items. For example,
score_matrix = np.array([[1. , 0.3, 0.4],
                         [0.2, 0.6, 0.8],
                         [0.1, 0.3, 0.5]])
I want to get the top-M indices of items for each sample, and also the top-M scores. For example,
top2_ind =
[[0, 2],
[2, 1],
[2, 1]]
top2_score =
[[1. , 0.4],
[0.8, 0.6],
[0.5, 0.3]]
What is the best way to do this using numpy?
Here's an approach using np.argpartition; partitioning on the last M positions puts the M largest values of each row, in ascending order, at the end, and reversing that tail yields them in descending order -
idx = np.argpartition(a, range(-M, 0))[:,:-M-1:-1] # topM_ind
out = a[np.arange(a.shape[0])[:,None], idx]        # topM_score
Sample run -
In [343]: a
Out[343]:
array([[ 1. , 0.3, 0.4],
[ 0.2, 0.6, 0.8],
[ 0.1, 0.3, 0.5]])
In [344]: M = 2
In [345]: idx = np.argpartition(a, range(-M, 0))[:,:-M-1:-1]
In [346]: idx
Out[346]:
array([[0, 2],
[2, 1],
[2, 1]])
In [347]: a[np.arange(a.shape[0])[:,None],idx]
Out[347]:
array([[ 1. , 0.4],
[ 0.8, 0.6],
[ 0.5, 0.3]])
Alternatively, possibly slower, but a bit shorter code to get idx would be with np.argsort -
idx = a.argsort(1)[:,:-M-1:-1]
Here's a post with some runtime tests comparing np.argsort and np.argpartition on a similar problem.
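As a rough way to reproduce such a comparison yourself (a sketch; the array size is arbitrary and absolute numbers depend on your machine and on M):
import timeit
import numpy as np

a = np.random.rand(1000, 1000)
M = 2

# argpartition only orders the top M; argsort orders every row fully
t_part = timeit.timeit(lambda: np.argpartition(a, range(-M, 0))[:, :-M-1:-1], number=10)
t_sort = timeit.timeit(lambda: a.argsort(1)[:, :-M-1:-1], number=10)
print(t_part, t_sort)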
I'd use argsort():
top2_ind = score_matrix.argsort()[:,::-1][:,:2]
That is, produce an array which contains the indices which would sort score_matrix:
array([[1, 2, 0],
[0, 1, 2],
[0, 1, 2]])
Then reverse the columns with ::-1, then take the first two columns with :2:
array([[0, 2],
[2, 1],
[2, 1]])
Then similar but with regular np.sort() to get the values:
top2_score = np.sort(score_matrix)[:,::-1][:,:2]
Which following the same mechanics as above, gives you:
array([[ 1. , 0.4],
[ 0.8, 0.6],
[ 0.5, 0.3]])
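If you want the scores gathered at exactly those indices (so values and indices are guaranteed to correspond row by row, even with ties), np.take_along_axis (NumPy 1.15+) pairs them up:
import numpy as np

score_matrix = np.array([[1. , 0.3, 0.4],
                         [0.2, 0.6, 0.8],
                         [0.1, 0.3, 0.5]])

top2_ind = score_matrix.argsort()[:, ::-1][:, :2]
# gather the scores at exactly those indices, row by row
top2_score = np.take_along_axis(score_matrix, top2_ind, axis=1)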
In case someone is interested in both the values and the corresponding indices without tampering with the order, the following simple approach will be helpful, though it could be computationally expensive for large data since we use a Python list to store (value, index) tuples.
import numpy as np
values = np.array([0.01, 0.6, 0.4, 0.0, 0.1, 0.7, 0.12])  # a simple array
values_indices = []  # empty list to collect (value, index) tuples

while values.shape[0] > 1:
    values_indices.append((values.max(), values.argmax()))
    # remove the maximum value from the array; later indices therefore
    # refer to the shrunken array, not the original one
    values = np.delete(values, values.argmax())
The final output as list of tuples:
values_indices
[(0.7, 5), (0.6, 1), (0.4, 1), (0.12, 3), (0.1, 2), (0.01, 0)]
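Note that those indices refer to the shrinking array, not the original one. If you want the original indices, a single descending argsort avoids both the loop and the index shift (a minimal sketch):
import numpy as np

values = np.array([0.01, 0.6, 0.4, 0.0, 0.1, 0.7, 0.12])

# one descending argsort; indices stay relative to the original array
order = np.argsort(values)[::-1]
values_indices = [(float(values[i]), int(i)) for i in order]
# [(0.7, 5), (0.6, 1), (0.4, 2), (0.12, 6), (0.1, 4), (0.01, 0), (0.0, 3)]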
An easy way would be:
To get top-2 indices
np.argsort(-score_matrix)[:, :2]
To get top-2 values
-np.sort(-score_matrix)[:, :2]
I need to test a boolean in an array and, based on the answer, apply an elementwise operation on a matrix. I seem to be getting the boolean answer for the whole ROW and not for each individual element. How do I test and get the answer for each individual element?
I have a matrix of probabilities
probs = np.array([[0.1, 0.2, 0.3, 0.3, 0.7],
[0.1, 0.2, 0.3, 0.3, 0.7],
[0.7, 0.2, 0.6, 0.1, 0.0]])
and a matrix of test arrays
tst = ([False, False, True, True, False],
[True, False, True, False, False],
)
t = np.asarray(tst).astype('bool')
and this segment of code I have written outputs an answer, but it obviously tests the entire row, treating everything as False:
for row in tst:
    mat = []
    for row1 in probs:
        temp = []
        if row == True:   # compares the whole list to True, so this is never taken
            temp.append(row1)
        else:
            temp.append(row1 - 1)
        mat.append(temp)
mat
Out[42]:
[[array([-0.9, -0.8, -0.7, -0.7, -0.3])],
[array([-0.9, -0.8, -0.7, -0.7, -0.3])],
[array([-0.3, -0.8, -0.4, -0.9, -1. ])]]
I need the new matrix to be
[[-0.9, -0.8, 0.3, 0.3, -0.3],
 [-0.9, -0.8, 0.3, 0.3, -0.3],
 [-0.3, -0.8, 0.6, 0.1, -1. ]]
for the 1st array in tst. Thanks very much for any assistance!
You need to keep the values as-is if the test is True and subtract 1 otherwise.
Your loop doesn't work because you're comparing a list with a boolean. After that, you're appending the whole row minus 1 (subtracting 1 from all elements).
My solution: subtract the boolean row from the values row, but with True and False inverted (if True, don't subtract; if False, subtract):
for row in tst:
    mat = []
    for row1 in probs:
        # `not v` flips each boolean: True becomes 0, False becomes 1
        mat.append(row1 - [not v for v in row])
    print(np.asarray(mat))
this prints (once per iteration; note that you get 2 result blocks since you're combining 2 truth rows with your matrix):
[[-0.9 -0.8 0.3 0.3 -0.3]
[-0.9 -0.8 0.3 0.3 -0.3]
[-0.3 -0.8 0.6 0.1 -1. ]]
[[ 0.1 -0.8 0.3 -0.7 -0.3]
[ 0.1 -0.8 0.3 -0.7 -0.3]
[ 0.7 -0.8 0.6 -0.9 -1. ]]
(I'm not a numpy expert at all, sorry if this is clumsy, comments welcome)
You don't need a loop here. You have an array and a corresponding mask array.
probs[np.invert(tst)] -= 1.
The mask gives you back the True positions. You want the False positions, so invert the tst array. (Note that this assumes the mask has the same shape as probs; for the 2-row tst in the question, apply one truth row at a time, or see the broadcasting sketch below.)
# This would be a longer version, if you are not familiar with the syntax above
probs[np.invert(tst)] = probs[np.invert(tst)] - 1.
If you want to create a new numpy array (your code created a list of numpy arrays), it works, for example, this way:
# copy the numpy array
mat=np.copy(probs)
mat[np.invert(tst)]=probs[np.invert(tst)]-1
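As a vectorized alternative, np.where can express the same keep-or-subtract rule directly; here is a sketch for a single truth row, using broadcasting so the one mask row applies to every row of probs:
import numpy as np

probs = np.array([[0.1, 0.2, 0.3, 0.3, 0.7],
                  [0.1, 0.2, 0.3, 0.3, 0.7],
                  [0.7, 0.2, 0.6, 0.1, 0.0]])
t = np.array([[False, False, True, True, False]])  # one truth row, shape (1, 5)

# keep the value where the mask is True, subtract 1 where it is False;
# the (1, 5) mask broadcasts against the (3, 5) probs
mat = np.where(t, probs, probs - 1)
print(mat)
# [[-0.9 -0.8  0.3  0.3 -0.3]
#  [-0.9 -0.8  0.3  0.3 -0.3]
#  [-0.3 -0.8  0.6  0.1 -1. ]]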
I would recommend taking a look at a few beginner tutorials first; programming will be much easier if you know, for example, the difference between lists and numpy arrays and how to handle them.
https://www.scipy.org/scipylib/faq.html#what-advantages-do-numpy-arrays-offer-over-nested-python-lists
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
or a short explanation
Python List vs. Array - when to use?
I have an M*N numpy array in which each element is a float between 0 and 1.
Input: for simplicity, let's consider a 3*4 array:
a=np.array([
[0.1, 0.2, 0.3, 0.6],
[0.3, 0.4, 0.8, 0.7],
[0.5, 0.6, 0.2, 0.1]
])
I want to consider 3 columns at a time (say columns 0, 1, 2 on the first iteration and 1, 2, 3 on the second) and get the maximum product over all combinations that take one value from each of the 3 columns, along with the row indices of the values involved.
In this case I should get the max value 0.5*0.6*0.8 = 0.24 and the row indices of the values that gave it: (2, 2, 1).
Output: [[0.24,(2,2,1)],[0.336,(2,1,1)]]
I can do this using loops, but I want to avoid them as they would hurt the running time. Is there any way I can do this in numpy?
Here's an approach using NumPy strides that should be very efficient for such sliding-window operations, as it creates a view into the array without actually making copies -
N = 3   # window size
m, n = a.strides
p, q = a.shape
a3D = np.lib.stride_tricks.as_strided(a, shape=(p, q - N + 1, N), strides=(m, n, n))
out1 = a3D.argmax(0)        # row index of the max in each window column
out2 = a3D.max(0).prod(1)   # product of the per-column maxima, per window
Sample run -
In [69]: a
Out[69]:
array([[ 0.1, 0.2, 0.3, 0.6],
[ 0.3, 0.4, 0.8, 0.7],
[ 0.5, 0.6, 0.2, 0.1]])
In [70]: out1
Out[70]:
array([[2, 2, 1],
[2, 1, 1]])
In [71]: out2
Out[71]: array([ 0.24 , 0.336])
We can zip those two outputs together if needed in that format (in Python 3, wrap the zip in list()) -
In [75]: zip(out2,map(tuple,out1))
Out[75]: [(0.23999999999999999, (2, 2, 1)), (0.33599999999999997, (2, 1, 1))]
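On NumPy 1.20+, numpy.lib.stride_tricks.sliding_window_view builds the same windowed view without computing strides by hand (a sketch):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([[0.1, 0.2, 0.3, 0.6],
              [0.3, 0.4, 0.8, 0.7],
              [0.5, 0.6, 0.2, 0.1]])

# same (rows, windows, window-size) view as the as_strided call above
a3D = sliding_window_view(a, 3, axis=1)
out1 = a3D.argmax(0)           # row index of each window column's max
out2 = a3D.max(0).prod(1)      # product of per-column maxima, per window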
I'm working with two arrays, trying to treat them like a 2-dimensional array. I'm using a lot of vectorized calculations with NumPy. Any idea how I would populate an array like this:
X = [1, 2, 3, 1, 2, 3, 1, 2, 3]
or:
X = [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8]
Ignore the first part of the message.
I had to populate two arrays in the form of a grid, but the grid dimensions varied with user input; that's why I needed a general form. I worked on it all morning and finally got what I wanted.
I apologize if I caused any confusion earlier. English is not my native language, and sometimes it is hard for me to explain things.
This is the code that did the job for me:
from numpy import linspace

myIter = linspace(1, N, N)   # 1, 2, ..., N
for x in myIter:
    for y in myIter:
        index = int((x - 1) * N + y) - 1   # linspace yields floats, so cast for indexing
        X[index] = x / (N + 1)
        Y[index] = y / (N + 1)
The user inputs N.
And the length of X, Y is N*N.
You can use the function tile. From the examples:
>>> a = np.array([0, 1, 2])
>>> np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
With this function, you can also reshape your array at once, like they do in the other answers with reshape (by giving the 'reps' argument more dimensions):
>>> np.tile(a, (2, 1))
array([[0, 1, 2],
[0, 1, 2]])
In addition, a little comparison of the speed difference between the built-in function tile and list multiplication:
In [3]: %timeit numpy.array([1, 2, 3]* 3)
100000 loops, best of 3: 16.3 us per loop
In [4]: %timeit numpy.tile(numpy.array([1, 2, 3]), 3)
10000 loops, best of 3: 37 us per loop
In [5]: %timeit numpy.array([1, 2, 3]* 1000)
1000 loops, best of 3: 1.85 ms per loop
In [6]: %timeit numpy.tile(numpy.array([1, 2, 3]), 1000)
10000 loops, best of 3: 122 us per loop
EDIT
The output of the code you gave in your question can also be achieved as follows:
arr = myIter / (N + 1)
X = numpy.repeat(arr, N)
Y = numpy.tile(arr, N)
This way you avoid looping over the arrays (which is one of the great advantages of using numpy). The resulting code is simpler (if you know the functions, of course; see the documentation for repeat and tile) and faster.
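For example, with N = 3 this produces the same grid as the double loop:
import numpy as np

N = 3
arr = np.arange(1, N + 1) / (N + 1)   # array([0.25, 0.5 , 0.75])

X = np.repeat(arr, N)   # [0.25 0.25 0.25 0.5  0.5  0.5  0.75 0.75 0.75]
Y = np.tile(arr, N)     # [0.25 0.5  0.75 0.25 0.5  0.75 0.25 0.5  0.75]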
import numpy
print(numpy.array(list(range(1, 4)) * 3))
print(numpy.array(list(range(1, 5)) * 4).astype(float) * 2 / 10)
If you want to create lists of repeating values, you could use list/tuple multiplication...
>>> import numpy
>>> numpy.array((1, 2, 3) * 3)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])
>>> numpy.array((0.2, 0.4, 0.6, 0.8) * 3).reshape((3, 4))
array([[ 0.2, 0.4, 0.6, 0.8],
[ 0.2, 0.4, 0.6, 0.8],
[ 0.2, 0.4, 0.6, 0.8]])
Thanks for updating your question -- it's much clearer now. Though I think joris's answer is the best one in this case (because it is more readable), I'll point out that the new code you posted could also be generalized like so:
>>> arr = numpy.arange(1, N + 1) / (N + 1.0)
>>> X = arr[numpy.indices((N, N))[0]].flatten()
>>> Y = arr[numpy.indices((N, N))[1]].flatten()
In many cases, when using numpy, one avoids while loops by using numpy's powerful indexing system. In general, when you use array I to index array A, the result is an array J of the same shape as I. For each index i in I, the value A[i] is assigned to the corresponding position in J. For example, say you have arr = numpy.arange(0, 9) / (9.0) and you want the values at indices 3, 5, and 8. All you have to do is use numpy.array([3, 5, 8]) as the index to arr:
>>> arr
array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889])
>>> arr[numpy.array([3, 5, 8])]
array([ 0.33333333, 0.55555556, 0.88888889])
What if you want a 2-d array? Just pass in a 2-d index:
>>> arr[numpy.array([[1,1,1],[2,2,2],[3,3,3]])]
array([[ 0.11111111, 0.11111111, 0.11111111],
[ 0.22222222, 0.22222222, 0.22222222],
[ 0.33333333, 0.33333333, 0.33333333]])
>>> arr[numpy.array([[1,2,3],[1,2,3],[1,2,3]])]
array([[ 0.11111111, 0.22222222, 0.33333333],
[ 0.11111111, 0.22222222, 0.33333333],
[ 0.11111111, 0.22222222, 0.33333333]])
Since you don't want to have to type indices like that out all the time, you can generate them automatically -- with numpy.indices:
>>> numpy.indices((3, 3))
array([[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]])
In a nutshell, that's how the above code works. (Also check out numpy.mgrid and numpy.ogrid -- which provide slightly more flexible index-generators.)
Since many numpy operations are vectorized (i.e. they are applied to each element in an array) you just have to find the right indices for the job -- no loops required.
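For completeness, numpy.meshgrid (a close cousin of mgrid and ogrid) can build both coordinate grids in one call; here is a sketch reproducing the loop's ordering:
import numpy as np

N = 3
arr = np.arange(1, N + 1) / (N + 1.0)

# 'ij' indexing reproduces the row-major order of the original loop:
# X varies slowly (like repeat), Y varies quickly (like tile)
Xg, Yg = np.meshgrid(arr, arr, indexing='ij')
X, Y = Xg.ravel(), Yg.ravel()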
import numpy as np
X = list(range(1, 4)) * 3
X = list(np.arange(0.2, 1.0, 0.2)) * 4   # the stop has to be past 0.8 to include it
These will make your two lists, respectively. Hope that's what you were asking.
I'm not exactly sure what you are trying to do, but as a guess: if you have a 1D array and you need to make it 2D, you can use the array class's reshape method.
>>> import numpy
>>> a = numpy.array([1,2,3,1,2,3])
>>> a.reshape((2,3))
array([[1, 2, 3],
[1, 2, 3]])