Given a NumPy row (1-D array) containing numbers from range(n),
I want to apply the following transformation:
[1 0 1 2] --> [[0 1 0] [1 1 0] [1 2 0] [1 2 1]]
We just walk through the input array and bincount all elements up to and including the current one.
import numpy as np
n = 3
a = np.array([1, 0, 1, 2])
out = []
for i in range(a.shape[0]):
    out.append(np.bincount(a[:i+1], minlength=n))
out = np.array(out)
Is there any way to speed this up? I'm wondering if it's possible to get rid of that loop completely and use matrix magic only.
EDIT:
Thanks, lbragile, for mentioning list comprehensions. That's not what I meant, though (I'm not sure it's even asymptotically significant). I was thinking about something more involved, such as rewriting this based on how the bincount operation works under the hood.
You can use cumsum like so:
idx = [1,0,1,2]
np.identity(np.max(idx)+1,int)[idx].cumsum(0)
# array([[0, 1, 0],
# [1, 1, 0],
# [1, 2, 0],
# [1, 2, 1]])
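In other words: one-hot encode each element, then take a running sum down the rows, so row i holds the counts for idx[:i+1]. A minimal sketch in the question's own variables (using np.eye, since n is already known):
import numpy as np

n = 3
a = np.array([1, 0, 1, 2])

# one-hot encode every element of `a`, then cumulatively sum the rows;
# row i then equals np.bincount(a[:i+1], minlength=n)
out = np.eye(n, dtype=int)[a].cumsum(axis=0)
print(out)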
Using list comprehension:
fast_out = [np.bincount(a[:i+1], minlength=n) for i in range(a.shape[0])]
print(fast_out)
Output:
[array([0, 1, 0]), array([1, 1, 0]), array([1, 2, 0]), array([1, 2, 1])]
To time the code use the following:
import timeit
def timer(code_to_test):
    elapsed_time = timeit.timeit(code_to_test, number=100) / 100
    print(elapsed_time)
your_code = """
import numpy as np
n = 3
a = np.array([1, 0, 1, 2])
out = []
for i in range(a.shape[0]):
    out.append(np.bincount(a[:i+1], minlength=n))
out = np.array(out)
"""
list_comp_code = """
import numpy as np
n = 3
a = np.array([1, 0, 1, 2])
fast_out = [np.bincount(a[:i+1], minlength=n) for i in range(a.shape[0])]
"""
timer(your_code) # 0.001330663086846471
timer(list_comp_code) # 1.4601880684494972e-05
So the list comprehension method is over 91 times faster when averaged over 100 trials.
How can I obtain the value of an ndarray from a list containing the coordinates of an n-D point, as efficiently as possible?
Here is an implementation for 3D:
1 arr = np.array([[[0, 1]]])
2 points = [[0, 0, 1], [0, 0, 0]]
3 values = []
4 for point in points:
5     x, y, z = point
6     values.append(arr[x, y, z])
7 # values -> [1, 0]
If this is not possible, is there a way to generalize lines 5-6 to nD?
I am sure there is a way to achieve this using fancy indexing. Here is a way to do it without the for loop:
arr = np.array([[[0, 1]]])
points = np.array([[0, 0, 1], [0, 0, 0]])
x,y,z = np.split(points, 3, axis=1)
arr[x,y,z]
output (values):
array([[1],
[0]])
Alternatively, you could use tuple unpacking as suggested by the comment:
arr[(*points.T,)]
output:
array([1, 0])
Based on the Numpy documentation for indexing, you can easily do that, as long as you use tuples instead of lists:
arr = np.array([[[0, 1]]])
points = [(0, 0, 1), (0, 0, 0)]
values = []
for point in points:
    values.append(arr[point])
# values -> [1, 0]
This works regardless of the dimensionality of the NumPy array involved.
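For instance, a sketch of the same tuple indexing on a 4-D array (the array here is purely illustrative):
import numpy as np

arr4 = np.arange(16).reshape(2, 2, 2, 2)  # illustrative 4-D array
point = (1, 0, 1, 0)                      # one index per dimension, as a tuple
print(arr4[point])                        # a single element, here 10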
Bonus: In addition to appending to a list, you can also use the Python slice function to extract ranges directly:
arr = np.array([[[0, 1]]])
points = (0, 0, slice(2))
vals = arr[points]
# --> [0 1] (a Numpy array!)
I have a list with the values [5, 5, 5, 5, 5], and I also have a matrix filled with 1s and 0s.
I want to build a new list by this rule:
wherever the matrix has a 1, add 2 to the corresponding value of v if the 1 is in the first row, and add 3 if it is in the second row.
example:
list:
v = [5,5,5,5,5]
matrix:
m = [[0, 1, 1, 0, 0], [0, 0, 1, 1, 0]]
final result:
v1 = [5,7,10,8,5]
Create a function that adds two rows element-wise: let its parameters be two 1-D numeric arrays, loop through them, and return a result array holding the element-wise sums.
If your task requires it, add a check that the rows are of equal length and abort the function with an error if they are not.
Run this function on the weighted matrix rows, and then run it once more on that result and the input array, as sketched below.
Hope I managed to be comprehensive enough.
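A minimal sketch of that recipe (the add_rows helper and its name are my own illustration):
def add_rows(xs, ys):
    # abort if the rows are of unequal length
    if len(xs) != len(ys):
        raise ValueError("rows must have equal length")
    return [x + y for x, y in zip(xs, ys)]

v = [5, 5, 5, 5, 5]
m = [[0, 1, 1, 0, 0], [0, 0, 1, 1, 0]]

# weight the rows (2 for the first, 3 for the second), fold them together,
# then add the result onto v
weighted = add_rows([2 * x for x in m[0]], [3 * x for x in m[1]])
v1 = add_rows(v, weighted)
print(v1)  # [5, 7, 10, 8, 5]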
You can use the NumPy package for efficient code.
import numpy as np
v = [5,5,5,5,5]
matrix = [[0, 1, 1, 0, 0],
          [0, 0, 1, 1, 0]]
weights = np.array([2,3])
w_matrix = np.multiply(matrix, weights[:, np.newaxis]).sum(axis=0)
v1 = v + w_matrix
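Here weights[:, np.newaxis] turns the weight vector into a column so each weight scales its whole matrix row before the rows are summed; a quick check of the intermediates:
print(w_matrix)  # [0 2 5 3 0]
print(v1)        # [ 5  7 10  8  5]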
classical python:
You can use a list comprehension (here factors = [2, 3] holds the row weights):
factors = [2, 3]
to_add = [sum(A*B for A, B in zip(factors, x)) for x in zip(*m)]
[a+b for a, b in zip(v, to_add)]
output: [5, 7, 10, 8, 5]
numpy:
That said, this is a perfect use case for numpy, which is more efficient and less verbose:
import numpy as np
v = [5,5,5,5,5]
m = [[0, 1, 1, 0, 0], [0, 0, 1, 1, 0]]
factors = [2,3]
V = np.array(v)
M = np.array(m)
F = np.array(factors)
V+(M*F[:,None]).sum(0)
output: array([ 5, 7, 10, 8, 5])
I have an array like this and need to replace every 1 with 2, every 3 with 4, and every 4 with 1. Is there a way to do this with just np and no loops?
import numpy as np
np.random.seed(2)
arr=np.random.randint(1,5,(3,3),int)
arr
array([[1, 4, 2],
[1, 3, 4],
[3, 4, 1]])
If I apply the array masks sequentially, later replacements overwrite earlier ones, so I don't get the expected outcome, which is:
array([[2, 1, 2],
[2, 4, 1],
[4, 1, 2]])
It is based on conditional logic, not on a maths formula.
If the array values don't necessarily range between 1 and 4, you can use np.select:
import numpy as np
a = np.random.randint(1,5, (3,3))
condlist = [np.logical_or(a==1, a==2), a==3, a==4]
choicelist= [2, 4, 1]
b = np.select(condlist, choicelist)
which does not care about the order of the conditions
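One caveat worth adding (my addition, not part of the original answer): entries that match no condition fall back to np.select's default, which is 0. To pass such values through unchanged, append a catch-all condition whose choice is the original array:
# unmatched values stay as-is instead of being zeroed out
b = np.select(condlist + [np.ones(a.shape, bool)], choicelist + [a])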
Here's one with np.searchsorted for performance efficiency -
def map_values(arr, old_val, new_val):
    # argsort so searchsorted can binary-search old_val via `sorter`
    sidx = old_val.argsort()
    idx = np.searchsorted(old_val, arr, sorter=sidx)
    idx[idx == len(old_val)] = 0            # clamp hits past the end
    mapped = sidx[idx]                      # back to old_val's original order
    return np.where(old_val[mapped] == arr, new_val[mapped], arr)
Sample run -
In [40]: arr
Out[40]:
array([[1, 4, 2],
[1, 3, 4],
[3, 4, 1]])
In [41]: old_val = np.array([1,3,4])
...: new_val = np.array([2,4,1])
In [42]: map_values(arr, old_val, new_val)
Out[42]:
array([[2, 1, 2],
[2, 4, 1],
[4, 1, 2]])
Could do this with a lambda function and np.vectorize():
import numpy as np
np.random.seed(2)
arr=np.random.randint(1,5,(3,3),int)
f = lambda x: x%4 + 1 if x in [1,3,4] else x
vfunc = np.vectorize(f)
Usage:
>>> vfunc(arr)
array([[2, 1, 2],
[2, 4, 1],
[4, 1, 2]])
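Note (an addition): np.vectorize is essentially a Python-level for loop under the hood, so this reads nicely but will not match the mask-based answers on large arrays.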
You have to be careful about the order of assignments. For example, if you do
arr[arr == 4] = 1
arr[arr == 1] = 2
Now all of the elements that were originally 4 will be 2, not 1 as you intend.
One solution is to carefully craft the order of assignments:
arr[arr == 1] = 2
arr[arr == 4] = 1
However, this is very brittle and will fall apart as you introduce more replacements. It would be better to create the masks up front from the original array:
ones = arr == 1
fours = arr == 4
arr[ones] = 2
arr[fours] = 1
Now the order of the assignments won't matter because the masks are determined before modifying the array.
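The same idea scales to more replacements if the masks are driven by a mapping table; a small sketch (the mapping dict is my own illustration):
mapping = {1: 2, 3: 4, 4: 1}
masks = {old: arr == old for old in mapping}  # every mask comes from the unmodified array
for old, new in mapping.items():
    arr[masks[old]] = new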
You want arr % 4 + 1, except in the case of 2, which stays the same. So use np.where to find all the 2s. Then do arr % 4 + 1, then reset all the 2s.
import numpy as np
np.random.seed(2)
arr=np.random.randint(1,5,(3,3),int)
twos = np.where(arr == 2)
arr = arr % 4 + 1
arr[twos] = 2
print(arr)
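For the seeded array above, this prints:
[[2 1 2]
 [2 4 1]
 [4 1 2]]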
I have a python array in which I want to calculate the sum of every 5 elements. In my case I have the array c with ten elements. (In reality it has a lot more elements.)
c = [1, 0, 0, 0, 0, 2, 0, 0, 0, 0]
So finally I would like to have a new array (c_new) which shows the sum of the first 5 elements and the sum of the second 5 elements.
So the result should be:
1+0+0+0+0 = 1
2+0+0+0+0 = 2
c_new = [1, 2]
Thank you for your help
Markus
You can use np.add.reduceat by passing indices where you want to split and sum:
import numpy as np
c = [1, 0, 0, 0, 0, 2, 0, 0, 0, 0]
np.add.reduceat(c, np.arange(0, len(c), 5))
# array([1, 2])
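For reference, np.add.reduceat(c, idx) sums each slice c[idx[i]:idx[i+1]], with the last index running to the end of the array, so a length that is not a multiple of 5 is handled gracefully:
np.add.reduceat([1, 0, 0, 0, 0, 2, 0, 0], np.arange(0, 8, 5))
# array([1, 2]): the final group has only three elements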
Here's one way of doing it -
c = [1, 0, 0, 0, 0, 2, 0, 0, 0, 0]
print([sum(c[i:i+5]) for i in range(0, len(c), 5)])
Result -
[1, 2]
If five divides the length of your vector and it is contiguous, then
np.reshape(c, (-1, 5)).sum(axis=-1)
It also works if it is non-contiguous, but then it is typically less efficient.
Benchmark:
import numpy as np
from timeit import timeit

def aredat():
    return np.add.reduceat(c, np.arange(0, len(c), 5))

def reshp():
    return np.reshape(c, (-1, 5)).sum(axis=-1)
c = np.random.random(10_000_000)
timeit(aredat, number=100)
3.8516048429883085
timeit(reshp, number=100)
3.09542763303034
So where possible, reshaping seems a bit faster; reduceat has the advantage of gracefully handling vectors whose length is not a multiple of five.
Why don't you use this?
np.array([np.sum(chunk) for chunk in np.reshape(c, (len(c)//5, 5))])
There are various ways to achieve this. Below are two options using numpy built-in methods.
Option 1
Using numpy.sum and numpy.ndarray.reshape, as follows:
c_sum = np.sum(np.array(c).reshape(-1, 5), axis=1)
[Out]: array([1, 2])
Option 2
Using numpy.vectorize, a custom lambda function, and numpy.arange, as follows:
c_sum = np.vectorize(lambda x: sum(c[x:x+5]))(np.arange(0, len(c), 5))
[Out]: array([1, 2])
I have a big csr_matrix, and I want to sum rows together to obtain a new csr_matrix with the same number of columns but a reduced number of rows. (Context: the matrix is a document-term matrix obtained from sklearn's CountVectorizer, and I want to be able to quickly combine documents according to codes associated with these documents.)
For a minimal example, this is my matrix:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())
[[1 0 0 0 0]
[0 0 3 0 0]
[0 5 0 0 0]
[4 0 0 0 0]
[0 0 2 0 0]]
Now let's say I want a new matrix B in which rows (1, 4) and rows (2, 3, 5) are combined by summing them, which would look something like this:
[[5 0 0 0 0]
[0 5 5 0 0]]
And should be again in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack it:
idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))
But this gives me the summed-up values just for the non-zero columns in the slice, so I can't combine it with the other slices because the number of columns in the summed slices differs.
I feel like there must be an easy way to do this. But I couldn't find any discussion of this online or in the documentation. What am I missing?
Thank you for your help
Note that you can do this by carefully constructing another matrix. Here's how it would work for a dense matrix:
>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row:
col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
result = S * A
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
print(result.toarray())
Output:
<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
[0 5 5 0 0]]
You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.
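To make that concrete, here is a sketch that builds S directly from 0-based row groups, reusing A and csr_matrix from above (the groups list restates the asker's rows (1, 4) and (2, 3, 5)):
groups = [[0, 3], [1, 2, 4]]  # 0-based rows of A to merge into each output row
col = [r for group in groups for r in group]
row = [i for i, group in enumerate(groups) for _ in group]
dat = [1] * len(col)
S = csr_matrix((dat, (row, col)), shape=(len(groups), A.shape[0]))
B = S * A                     # B.toarray() -> [[5 0 0 0 0], [0 5 5 0 0]]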
The indexing should be:
idx1 = [0, 3] # rows 1 and 4
idx2 = [1, 2, 4] # rows 2,3 and 5
Then you need to keep A_sub1 and A_sub2 in sparse format and use axis=0:
A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))
B.toarray()
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
Note, I think the A[idx, :].sum(axis=0) operations involve conversion away from sparse matrices, so @Mr_E's answer is probably better.
Alternatively, it works when you use axis=0 and np.vstack (as opposed to scipy.sparse.vstack):
A_sub1 = A[idx1, :].sum(axis=0)
A_sub2 = A[idx2, :].sum(axis=0)
np.vstack((A_sub1, A_sub2))
Giving:
matrix([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])