Jumping Multi-element slices in Numpy Arrays - python

So say I have an array:
arr = np.arange(12)
And at the end I want this array:
arr2 = [0,1,2,6,7,8]
So I want a jumping multiple slice, something like:
arr2 = arr[(0:2):-1:6]
where the second array is a slice of three elements that jumps by 6 every time.
Is this possible in numpy?
My actual example is more complex: one part of the math is applied to the slice (0:2) that jumps by 6, and other math is applied to the slice (3:5), with the goal of writing it in one line, i.e. without a for-loop.
Sorry if this question has been asked before. I'm having trouble finding documentation on this and I think I might just be googling the wrong thing. Thanks!

You can't do this with slice notation, at least not directly.
But with some reshaping:
In [74]: arr = np.arange(12)
In [75]: arr.reshape(-1,3)
Out[75]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [76]: arr.reshape(-1,3)[::2,:]
Out[76]:
array([[0, 1, 2],
       [6, 7, 8]])
In [77]: _.reshape(-1)
Out[77]: array([0, 1, 2, 6, 7, 8])
Slicing and reshaping individually make views, but at some point in this sequence a copy has to be made. So the timing advantage over the advanced indexing that Divakar suggests is, at best, modest:
In [86]: timeit arr.reshape(-1,3)[::2,:].reshape(-1)
3.99 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [87]: timeit arr[(np.arange(len(arr))%6)<3]
8.91 µs ± 89.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
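As for the follow-up in the question (applying one piece of math to the (0:2) block and different math to the (3:5) block of each 6-element period), the same reshape trick exposes both blocks as 2D views, so each can be updated in place without a for-loop. A minimal sketch, with the two operations as stand-ins:
import numpy as np
arr = np.arange(12, dtype=float)
blocks = arr.reshape(-1, 6)   # one row per 6-element period; a view, not a copy
blocks[:, :3] *= 2            # stand-in math for elements 0,1,2 of each period
blocks[:, 3:] += 100          # stand-in math for elements 3,4,5 of each period
# arr itself is updated through the view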

Related

Creating tumbling windows in Python

Just wondering if there is a way to construct a tumbling window in Python. For example, if I have a list/ndarray, listA = [3,2,5,9,4,6,3,8,7,9], how could I find the maximum of the first 3 items (3,2,5) -> 5, then the next 3 items (9,4,6) -> 9, and so on? Sort of like breaking it up into sections and finding the max. The final result would be the list [5,9,8,9].
Approach #1: One-liner for windowed-max using np.maximum.reduceat -
In [118]: np.maximum.reduceat(listA,np.arange(0,len(listA),3))
Out[118]: array([5, 9, 8, 9])
Becomes more compact with np.r_ -
np.maximum.reduceat(listA,np.r_[:len(listA):3])
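reduceat applies the ufunc over each span between consecutive start indices, with the last span running to the end of the array; a quick check of the group boundaries:
idx = np.arange(0, len(listA), 3)          # group starts: [0, 3, 6, 9]
# maxima over listA[0:3], listA[3:6], listA[6:9], listA[9:]
np.maximum.reduceat(listA, idx)            # array([5, 9, 8, 9])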
Approach #2: Generic ufunc way
Here's a function that works with generic ufuncs and takes the window length as a parameter -
def windowed_ufunc(a, ufunc, W):
    a = np.asarray(a)
    n = len(a)
    L = W*(n//W)
    out = ufunc(a[:L].reshape(-1,W), axis=1)
    if n > L:
        out = np.hstack((out, ufunc(a[L:])))
    return out
Sample run -
In [81]: a = [3,2,5,9,4,6,3,8,7,9]
In [82]: windowed_ufunc(a, ufunc=np.max, W=3)
Out[82]: array([5, 9, 8, 9])
On other ufuncs -
In [83]: windowed_ufunc(a, ufunc=np.min, W=3)
Out[83]: array([2, 4, 3, 9])
In [84]: windowed_ufunc(a, ufunc=np.sum, W=3)
Out[84]: array([10, 19, 18, 9])
In [85]: windowed_ufunc(a, ufunc=np.mean, W=3)
Out[85]: array([3.33333333, 6.33333333, 6.        , 9.        ])
Benchmarking
Timings on the NumPy solutions, with the sample data scaled up 10000x -
In [159]: a = [3,2,5,9,4,6,3,8,7,9]
In [160]: a = np.tile(a, 10000)
# @yatu's soln
In [162]: %timeit moving_maxima(a, w=3)
435 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# From this post - app#1
In [167]: %timeit np.maximum.reduceat(a,np.arange(0,len(a),3))
353 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# From this post - app#2
In [165]: %timeit windowed_ufunc(a, ufunc=np.max, W=3)
379 µs ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you want a one-liner, you can use list comprehension:
listA = [3,2,5,9,4,6,3,8,7,9]
listB=[max(listA[i:i+3]) for i in range(0,len(listA),3)]
print(listB)
it returns:
[5, 9, 8, 9]
Of course the code can be written more dynamically: if you want a different window size, just change 3 to any integer.
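For example, wrapped in a small (hypothetical) helper that takes the window size as a parameter:
def tumbling_max(seq, w):
    # same comprehension, with the window size as a parameter
    return [max(seq[i:i+w]) for i in range(0, len(seq), w)]

print(tumbling_max([3,2,5,9,4,6,3,8,7,9], 4))
# [9, 8, 9]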
Using numpy, you can extend the list with zeros so its length is divisible by the window size, then reshape and compute the max along the second axis:
def moving_maxima(a, w):
    mod = len(a)%w
    d = w if mod else mod
    x = np.r_[a, [0]*(d-mod)]
    return x.reshape(-1,w).max(1)
Some examples:
moving_maxima(listA,2)
# array([3., 9., 6., 8., 9.])
moving_maxima(listA,3)
#array([5, 9, 8, 9])
moving_maxima(listA,4)
#array([9, 8, 9])
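One caveat: padding with zeros assumes the data is non-negative, since a padded zero could otherwise win the max in the last window. A sketch of a more general variant (a hypothetical moving_maxima_safe) that pads with -inf instead, at the cost of a float result:
def moving_maxima_safe(a, w):
    # pad up to the next multiple of w with -inf, which can never win a max
    pad = (w - len(a) % w) % w
    x = np.r_[np.asarray(a, dtype=float), [-np.inf]*pad]
    return x.reshape(-1, w).max(1)

moving_maxima_safe(listA, 3)
# array([5., 9., 8., 9.])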

Filter numpy array of strings

I have a very large data set collected from Twitter. I am trying to figure out how to do the equivalent of the Python filtering below in NumPy. The environment is the Python interpreter:
>>> tweets = [['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'],
...           ['is nice man that buhari']]
>>> filter(lambda x: 'buhari' in x[0].lower(), tweets)
[['buhari si good'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']]
I tried boolean indexing like the below, but the array turned up empty
>>> tweet_arr = np.array([['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']])
>>> flat_tweets = tweet_arr[:, 0]
>>> flat_tweets
array(['buhari si good', 'atiku is great', 'buhari nfd sdfa atiku',
       'is nice man that buhari'], dtype='|S23')
>>> flat_tweets['buhari' in flat_tweets]
array([], shape=(0, 4), dtype='|S23')
I would like to know how to filter strings in a numpy array, the way I was easily able to filter even numbers here
>>> arr = np.arange(15).reshape((15,1))
>>> arr
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14]])
>>> arr[:][arr % 2 == 0]
array([ 0,  2,  4,  6,  8, 10, 12, 14])
Thanks
If you want to stick to a solution based entirely on NumPy, you could do
from numpy.core.defchararray import find, lower
tweet_arr[find(lower(tweet_arr), 'buhari') != -1]
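The same operations are also exposed via the np.char namespace, so an equivalent spelling that avoids the numpy.core import would be:
mask = np.char.find(np.char.lower(tweet_arr), 'buhari') != -1
tweet_arr[mask]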
You mention in a comment that what you're looking for here is performance, so it should be noted that this appears to be a good deal slower than the solution you came up with yourself:
In [33]: large_arr = np.repeat(tweet_arr, 10000)
In [36]: %timeit large_arr[find(lower(large_arr), 'buhari') != -1]
54.6 ms ± 765 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [43]: %timeit list(filter(lambda x: 'buhari' in x.lower(), large_arr))
21.2 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In fact, an ordinary list comprehension beats both approaches:
In [44]: %timeit [x for x in large_arr if 'buhari' in x.lower()]
18.5 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Fastest way to construct tuple from elements of list (Python)

I have 3 NumPy arrays, and I want to create tuples of the i-th element of each list. These tuples represent keys for a dictionary I had previously defined.
Ex:
List 1: [1, 2, 3, 4, 5]
List 2: [6, 7, 8, 9, 10]
List 3: [11, 12, 13, 14, 15]
Desired output: [mydict[(1,6,11)],mydict[(2,7,12)],mydict[(3,8,13)],mydict[(4,9,14)],mydict[(5,10,15)]]
These tuples represent keys of a dictionary I have previously defined (essentially, as input variables to a previously calculated function). I had read that this is the best way to store function values for lookup.
My current method of doing this is as follows:
[dict[x] for x in zip(l1, l2, l3)]
This works, but is obviously slow. Is there a way to vectorize this operation, or make it faster in any way? I'm open to changing the way I've stored the function values as well, if that is necessary.
EDIT: My apologies for the question being unclear. I do in fact, have NumPy arrays. My mistake for referring to them as lists and displaying them as such. They are of the same length.
Your question is a bit confusing, since you're calling these NumPy arrays, and asking for a way to vectorize things, but then showing lists, and labeling them as lists in your example, and using list in the title. I'm going to assume you do have arrays.
>>> l1 = np.array([1, 2, 3, 4, 5])
>>> l2 = np.array([6, 7, 8, 9, 10])
>>> l3 = np.array([11, 12, 13, 14, 15])
If so, you can stack these up in a 2D array:
>>> ll = np.stack((l1, l2, l3))
And then you can just transpose that:
>>> lt = ll.T
This is better than vectorized; it's constant-time. NumPy is just creating another view of the same data, with different striding so it reads in column order instead of row order.
>>> lt
array([[ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14],
       [ 5, 10, 15]])
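You can confirm that no data was copied; the transpose is a view with swapped strides (the exact stride values assume 8-byte integers):
>>> lt.base is ll
True
>>> ll.strides, lt.strides
((40, 8), (8, 40))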
As miradulo points out, you can do both of these in one step with column_stack:
>>> lt = np.column_stack((l1, l2, l3))
But I suspect you're actually going to want ll as a value in its own right. (Although I admit I'm just guessing here at what you're trying to do…)
And of course if you want to loop over these rows as 1D arrays instead of doing further vectorized work, you can:
>>> for row in lt:
...     print(row)
[ 1  6 11]
[ 2  7 12]
[ 3  8 13]
[ 4  9 14]
[ 5 10 15]
Of course, you can convert them from 1D arrays to tuples just by calling tuple on each row. Or… whatever that mydict is supposed to be (it doesn't look like a dictionary, since there are no key-value pairs, just values), you can do that.
>>> import collections
>>> mydict = collections.namedtuple('mydict', list('abc'))
>>> tups = [mydict(*row) for row in lt]
>>> tups
[mydict(a=1, b=6, c=11),
 mydict(a=2, b=7, c=12),
 mydict(a=3, b=8, c=13),
 mydict(a=4, b=9, c=14),
 mydict(a=5, b=10, c=15)]
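If plain tuples are all you need, the conversion mentioned above is just:
>>> [tuple(row) for row in lt]
[(1, 6, 11), (2, 7, 12), (3, 8, 13), (4, 9, 14), (5, 10, 15)]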
If you're worried about the time to look up a tuple of keys in a dict, itemgetter in the operator module has a C-accelerated version. If keys is a np.array, or a tuple, or whatever, you can do this:
import operator

for row in lt:
    myvals = operator.itemgetter(*row)(mydict)
    # do stuff with myvals
Meanwhile, I decided to slap together a C extension that should be as fast as possible (with no error handling, because I'm lazy; it should be a tiny bit faster that way, and this code will probably segfault if you give it anything but a dict and a tuple or list):
static PyObject *
itemget_itemget(PyObject *self, PyObject *args) {
    PyObject *d;
    PyObject *keys;
    PyArg_ParseTuple(args, "OO", &d, &keys);
    PyObject *seq = PySequence_Fast(keys, "keys must be an iterable");
    PyObject **arr = PySequence_Fast_ITEMS(seq);
    int seqlen = PySequence_Fast_GET_SIZE(seq);
    PyObject *result = PyTuple_New(seqlen);
    PyObject **resarr = PySequence_Fast_ITEMS(result);
    for (int i=0; i!=seqlen; ++i) {
        resarr[i] = PyDict_GetItem(d, arr[i]);
        Py_INCREF(resarr[i]);
    }
    return result;
}
Times for looking up 100 random keys out of a 10000-key dictionary on my laptop with python.org CPython 3.7 on macOS:
itemget.itemget: 1.6µs
operator.itemgetter: 1.8µs
comprehension: 3.4µs
pure-Python operator.itemgetter: 6.7µs
So, I'm pretty sure anything you do is going to be fast enough; that's only 34 ns/key that we're trying to optimize. But if that really is too slow, operator.itemgetter does a good enough job moving the loop to C and cuts it roughly in half, which is pretty close to the best possible result you could expect. (It's hard to imagine looking up a bunch of boxed-value keys in a hash table in much less than 16 ns/key, after all.)
Define your 3 lists. You mention 3 arrays, but show lists (and call them that as well):
In [112]: list1,list2,list3 = list(range(1,6)),list(range(6,11)),list(range(11,16))
Now create a dictionary with tuple keys:
In [114]: dd = {x:i for i,x in enumerate(zip(list1,list2,list3))}
In [115]: dd
Out[115]: {(1, 6, 11): 0, (2, 7, 12): 1, (3, 8, 13): 2, (4, 9, 14): 3, (5, 10, 15): 4}
Accessing elements from that dictionary with your code:
In [116]: [dd[x] for x in zip(list1,list2,list3)]
Out[116]: [0, 1, 2, 3, 4]
In [117]: timeit [dd[x] for x in zip(list1,list2,list3)]
1.62 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Now for an array equivalent - turn the lists into a 2d array:
In [118]: arr = np.array((list1,list2,list3))
In [119]: arr
Out[119]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])
Access the same dictionary elements. If I had used column_stack I could have omitted the .T, but that's slower (array transpose is fast):
In [120]: [dd[tuple(x)] for x in arr.T]
Out[120]: [0, 1, 2, 3, 4]
In [121]: timeit [dd[tuple(x)] for x in arr.T]
15.7 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Notice that this is substantially slower. Iteration over an array is slower than iteration over a list. You can't access elements of a dictionary in any sort of numpy 'vectorized' fashion - you have to use a Python iteration.
I can improve on the array iteration by first turning it into a list:
In [124]: arr.T.tolist()
Out[124]: [[1, 6, 11], [2, 7, 12], [3, 8, 13], [4, 9, 14], [5, 10, 15]]
In [125]: timeit [dd[tuple(x)] for x in arr.T.tolist()]
3.21 µs ± 9.67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Array construction times:
In [122]: timeit arr = np.array((list1,list2,list3))
3.54 µs ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [123]: timeit arr = np.column_stack((list1,list2,list3))
18.5 µs ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With the pure-Python itemgetter (from v3.6.3) there are no savings:
In [149]: timeit operator.itemgetter(*[tuple(x) for x in arr.T.tolist()])(dd)
3.51 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and if I move the getter definition out of the timed loop:
In [151]: %%timeit idx = operator.itemgetter(*[tuple(x) for x in arr.T.tolist()])
     ...: idx(dd)
482 ns ± 1.85 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Deleting diagonal elements of a numpy array

Given input
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
Needed output:
array([[2, 3],
       [4, 6],
       [7, 8]])
It is easy to do this with iteration or a loop, but there should be a neat way to do it without loops. Thanks
Approach #1
One approach with masking -
A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
Sample run -
In [395]: A
Out[395]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
In [396]: A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)
Out[396]:
array([[2, 3],
       [4, 6],
       [7, 8]])
Approach #2
Using the regular pattern of non-diagonal elements, which can be traced with broadcasted addition of range arrays -
m = A.shape[0]
idx = (np.arange(1,m+1) + (m+1)*np.arange(m-1)[:,None]).reshape(m,-1)
out = A.ravel()[idx]
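For intuition, with m = 3 the broadcasted sum builds exactly the flattened positions of the off-diagonal elements, skipping 0, 4 and 8, which hold the diagonal:
m = 3
idx = (np.arange(1,m+1) + (m+1)*np.arange(m-1)[:,None]).reshape(m,-1)
# idx:
# [[1 2]
#  [3 5]
#  [6 7]]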
Approach #3 (Strides Strikes!)
Abusing the regular pattern of non-diagonal elements from the previous approach, we can introduce np.lib.stride_tricks.as_strided and some slicing help, like so -
m = A.shape[0]
strided = np.lib.stride_tricks.as_strided
s0,s1 = A.strides
out = strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1)).reshape(m,-1)
Runtime test
Approaches as funcs:
def skip_diag_masking(A):
    return A[~np.eye(A.shape[0],dtype=bool)].reshape(A.shape[0],-1)

def skip_diag_broadcasting(A):
    m = A.shape[0]
    idx = (np.arange(1,m+1) + (m+1)*np.arange(m-1)[:,None]).reshape(m,-1)
    return A.ravel()[idx]

def skip_diag_strided(A):
    m = A.shape[0]
    strided = np.lib.stride_tricks.as_strided
    s0,s1 = A.strides
    return strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1)).reshape(m,-1)
Timings -
In [528]: A = np.random.randint(11,99,(5000,5000))
In [529]: %timeit skip_diag_masking(A)
...: %timeit skip_diag_broadcasting(A)
...: %timeit skip_diag_strided(A)
...:
10 loops, best of 3: 56.1 ms per loop
10 loops, best of 3: 82.1 ms per loop
10 loops, best of 3: 32.6 ms per loop
I know I'm late to this party, but I have what I believe is a simpler solution. So you want to remove the diagonal? Okay, cool:
replace it with NaN
filter out the NaNs (this flattens to one dimension, since NumPy can't assume the result will be square)
reset the dimensionality
arr = np.array([[1,2,3],[4,5,6],[7,8,9]]).astype(float)
np.fill_diagonal(arr, np.nan)
arr[~np.isnan(arr)].reshape(arr.shape[0], arr.shape[1] - 1)
Solution steps:
Flatten your array
Delete the diagonal elements, which sit at positions range(0, len(x_no_diag), len(x) + 1)
Reshape your array to (num_rows, num_columns - 1)
The function:
import numpy as np

def remove_diag(x):
    x_no_diag = np.ndarray.flatten(x)
    x_no_diag = np.delete(x_no_diag, range(0, len(x_no_diag), len(x) + 1), 0)
    x_no_diag = x_no_diag.reshape(len(x), len(x) - 1)
    return x_no_diag
Example:
>>> x = np.random.randint(5, size=(3,3))
>>> x
array([[0, 2, 3],
       [3, 4, 1],
       [2, 4, 0]])
>>> remove_diag(x)
array([[2, 3],
       [3, 1],
       [2, 4]])
Just with numpy, assuming a square matrix:
new_A = numpy.delete(A,range(0,A.shape[0]**2,(A.shape[0]+1))).reshape(A.shape[0],(A.shape[1]-1))
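The step of A.shape[0]+1 is what makes this hit the diagonal: in the flattened array, the diagonal entries of an m x m matrix sit at positions 0, m+1, 2(m+1), and so on. A quick check:
m = A.shape[0]
list(range(0, m*m, m+1))   # [0, 4, 8] -> flat positions of 1, 5, 9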
If you do not mind creating a new array, you can use a list comprehension. Note that this filters by value, so it can drop the wrong elements if a row contains duplicates of its diagonal value.
A = np.array([A[i][A[i] != A[i][i]] for i in range(len(A))])
Rerunning the same methods as @Divakar,
A = np.random.randint(11,99,(5000,5000))
skip_diag_masking
85.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_broadcasting
163 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_strided
52.5 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
skip_diag_list_comp
101 ms ± 347 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Perhaps the cleanest way, based on Divakar's first solution but using len(array) instead of array.shape[0], is:
array_without_diagonal = array[~np.eye(len(array), dtype=bool)].reshape(len(array), -1)
I love all the answers here, but would like to add one in case your numpy object has more than 2 dimensions. In that case, you can use the following adjustment of Divakar's approach #1:
def remove_diag(A):
    removed = A[~np.eye(A.shape[0], dtype=bool)].reshape(A.shape[0], int(A.shape[0])-1, -1)
    return np.squeeze(removed)
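For example, on a 3x3x3 array this keeps the two off-diagonal entries per row along the first two axes (a quick shape check):
A3 = np.arange(27).reshape(3, 3, 3)
remove_diag(A3).shape   # (3, 2, 3)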
The other approach is to use numpy.delete(). Assuming a square matrix, you can use:
numpy.delete(A, range(0, A.shape[0]**2, A.shape[0]+1)).reshape(A.shape[0], A.shape[1]-1)

How to create a numpy array of N numbers of the same value?

This is surely an easy question:
How does one create a numpy array of N values, all the same value?
For instance, numpy.arange(10) creates 10 values of integers from 0 to 9.
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
I would like to create a numpy array of 10 values of the same integer,
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
Use numpy.full():
import numpy as np
np.full(
    shape=10,
    fill_value=3,
    dtype=int
)
> array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
Very easy.
1) Use the arange function:
arr3 = np.arange(0,10)
# output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
2) Use the ones function, which gives an array of ones, and multiply by 3:
arr4 = np.ones(10)*3
# output: array([3., 3., 3., 3., 3., 3., 3., 3., 3., 3.])
An alternative (faster) way to do this would be with np.empty() and ndarray.fill():
import numpy as np
shape = 10
value = 3
myarray = np.empty(shape, dtype=int)
myarray.fill(value)
Time comparison
The above approach on my machine executes for:
951 ns ± 14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
vs using np.full(shape=shape, fill_value=value, dtype=int) executes for:
1.66 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
vs using np.repeat(value, shape) executes for:
2.77 µs ± 41.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
vs using np.ones(shape) * value executes for:
2.71 µs ± 56.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I find it to be consistently a little bit quicker.
I realize it's not using numpy, but this is very easy with base python.
data = [3]*10
print(data)
> [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
Try this. It worked well in my Python version (Python 3.9.6).
import numpy as np
np.array([3]*10)
output:
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
