I have a numpy array
src = np.random.rand(320,240)
and another numpy array idx of shape (2, 320*240). Each column of idx indexes an entry in a result array dst: e.g., idx[:, 20] = [3, 10] references row 3, column 10 of dst, and the column position 20 corresponds to the flattened index of src. In other words, idx establishes a mapping between the entries of src and dst. Assuming dst is initialized with all zeros, how can I copy the entries of src to their destinations in dst without a loop?
Here is the canonical way of doing it:
>>> import numpy as np
>>>
>>> src = np.random.rand(4, 3)
>>> src
array([[0.0309325 , 0.72261479, 0.98373595],
       [0.06357406, 0.44763809, 0.45116039],
       [0.63992938, 0.6445605 , 0.01267776],
       [0.76084312, 0.61888759, 0.2138713 ]])
>>>
>>> idx = np.indices(src.shape).reshape(2, -1)
>>> np.random.shuffle(idx.T)
>>> idx
array([[3, 3, 0, 1, 0, 3, 1, 1, 2, 2, 2, 0],
       [1, 2, 2, 0, 1, 0, 1, 2, 2, 1, 0, 0]])
>>>
>>> dst = np.empty_like(src)
>>> dst[tuple(idx)] = src.ravel()
>>> dst
array([[0.2138713 , 0.44763809, 0.98373595],
       [0.06357406, 0.63992938, 0.6445605 ],
       [0.61888759, 0.76084312, 0.01267776],
       [0.45116039, 0.0309325 , 0.72261479]])
If you can't be sure that idx is a proper shuffle, it's a bit safer to initialize dst with np.full, using a fill value that cannot appear in src, instead of np.empty.
>>> dst = np.full_like(src, np.nan)
>>> dst[tuple(idx)] = src.ravel()
>>>
>>> dst
array([[0.27020869, 0.71216066,        nan],
       [0.63812283, 0.69151451, 0.65843901],
       [       nan, 0.02406174, 0.47543061],
       [0.05650845,        nan,        nan]])
If you spot the fill value in dst, something is wrong with idx.
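If you'd rather detect that programmatically than by eye, a quick check on the NaN-filled variant could be:
>>> assert not np.isnan(dst).any()  # no fill values left, so idx covered every entry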
You can try:
dst[idx[0, :], idx[1, :]] = src.flat
In [33]: src = np.random.randn(2, 3)
In [34]: src
Out[34]:
array([[ 0.68636938,  0.60275041,  1.26078727],
       [ 1.17937849, -1.0369404 ,  0.42847611]])
In [35]: dst = np.zeros_like(src)
In [37]: idx = np.array([[0, 1, 0, 1, 0, 0], [1, 2, 0, 1, 2, 0]])
In [38]: dst[idx[0, :], idx[1, :]] = src.flat
In [39]: dst
Out[39]:
array([[ 0.42847611,  0.68636938, -1.0369404 ],
       [ 0.        ,  1.17937849,  0.60275041]])
dst[0, 1] is src[0, 0], etc.
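A quick way to spot-check the mapping on this example:
k = 0
assert dst[idx[0, k], idx[1, k]] == src.flat[k]
Note that columns 2 and 5 of idx both point at dst[0, 0]; when destinations collide like this, the last write wins, which is why dst[0, 0] equals src.flat[5] above.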
I have a 2D numpy array called arm_resets that holds positive integers. The first column contains only positive integers < 360. For all columns other than the first, I need to replace all values over 360 with the value in the same row in the first column. I thought this would be a relatively easy thing to do; here's what I have:
i = 300
over_360 = arm_resets[:, [i]] >= 360
print(arm_resets[:, [i]][over_360])
print(arm_resets[:, [0]][over_360])
arm_resets[:, [i]][over_360] = arm_resets[:, [0]][over_360]
print(arm_resets[:, [i]][over_360])
And here's what prints:
[3600 3609 3608 ... 3600 3611 3605]
[ 0 9 8 ... 0 11 5]
[3600 3609 3608 ... 3600 3611 3605]
Since all the numbers shown in the first print (first 3 and last 3) are above 360, the third print should show them replaced by the values from the second print. Why is this not working?
edit: reproducible example:
import numpy as np
import pandas as pd

df = pd.DataFrame({"start": [1, 2, 5, 6], "freq": [1, 5, 6, 9]})
periods = 6
arm_resets = df[["start"]].values
freq = df[["freq"]].values
arm_resets = np.pad(arm_resets, ((0, 0), (0, periods - 1)))
for i in range(1, periods):
    arm_resets[:, [i]] = arm_resets[:, [i - 1]] + freq
    #over_360 = arm_resets[:, [i]] >= periods
    #arm_resets[:, [i]][over_360] = arm_resets[:, [0]][over_360]
arm_resets
With the code commented out, here's what prints:
array([[ 1,  2,  3,  4,  5,  6],
       [ 2,  7, 12, 17, 22, 27],
       [ 3,  9, 15, 21, 27, 33],
       [ 4, 13, 22, 31, 40, 49]])
What I would expect:
array([[1, 2, 3, 4, 5, 1],
       [2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4]])
Now if it helps, the final 2d array I'm actually trying to create is a 1/0 array that indicates which are filled in, so in this example I'd want this:
array([[0, 1, 1, 1, 1, 1],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 0]])
The code I use to achieve this from the above arm_resets is this:
fin = np.zeros((len(arm_resets), periods), dtype=int)
for i in range(len(arm_resets)):
    fin[i, a[i]] = 1
Indexing with a list, as in arm_resets[:, [i]], is fancy indexing, and therefore makes a copy of the ith column of the data. arm_resets[:, [i]][over_360] = ... therefore calls __setitem__ on a temporary array that is discarded as soon as the statement executes. If you want to assign through the mask, call __setitem__ on arm_resets itself, with a 1D mask (i.e., over_360 = arm_resets[:, i] >= 360 rather than arm_resets[:, [i]] >= 360):
arm_resets[over_360, [i]] = ...
You also don't need to make the index into a list. It's generally better to use simple indices, especially when doing assignments, since they create views rather than copies:
arm_resets[over_360, i] = ...
With slicing, even the following should work, since it calls __setitem__ on a view:
arm_resets[:, i][over_360] = ...
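A quick way to see the copy-vs-view distinction (a minimal sketch on a small throwaway array):
a = np.arange(12).reshape(4, 3)
np.shares_memory(a, a[:, [1]])  # False: the fancy index made a copy
np.shares_memory(a, a[:, 1])    # True: the simple index is a view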
That said, this still processes the data one column at a time. You can handle the entire matrix in one step, without looping, if you use integer indices rather than a boolean mask: indices make it possible to match each out-of-range element with the first-column entry of its own row:
rows, cols = np.nonzero(arm_resets[:, 1:] >= 360)
arm_resets[rows, cols + 1] = arm_resets[rows, 0]  # cols + 1: cols is relative to the [:, 1:] slice
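For example, on a small hypothetical array (first column < 360, later columns possibly over):
arm_resets = np.array([[10, 370, 365],
                       [20, 100, 400]])
rows, cols = np.nonzero(arm_resets[:, 1:] >= 360)
arm_resets[rows, cols + 1] = arm_resets[rows, 0]
arm_resets
# array([[ 10,  10,  10],
#        [ 20, 100,  20]])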
You can use np.where():
first_col = arm_resets[:, 0]                      # first column
first_col = first_col.reshape(first_col.size, 1)  # reshape into a 2D column vector
arm_resets = np.where(arm_resets >= 360, first_col, arm_resets)
You can read how np.where works in detail in the numpy docs, but basically it evaluates arm_resets >= 360 element-wise: where the condition is true it takes the value from first_col (broadcast across the columns), and where it is false it keeps the value from arm_resets.
Edit: as suggested by Mad Physicist, you can use arm_resets[:, 0, None] directly instead of creating the first_col variable:
arm_resets = np.where(arm_resets >= 360,arm_resets[:,0,None],arm_resets)
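For instance, applied to the 4x6 array from the reproducible example above (with the threshold periods = 6 standing in for 360), this reproduces the expected output:
arm_resets = np.where(arm_resets >= periods, arm_resets[:, 0, None], arm_resets)
# array([[1, 2, 3, 4, 5, 1],
#        [2, 2, 2, 2, 2, 2],
#        [3, 3, 3, 3, 3, 3],
#        [4, 4, 4, 4, 4, 4]])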
I have an array like the following, but much larger:
array = np.random.randint(6, size=(5, 4))
array([[4, 3, 0, 2],
       [1, 4, 3, 1],
       [0, 3, 5, 2],
       [1, 0, 5, 3],
       [0, 5, 4, 4]])
I also have a dictionary which gives me the vector representation of each value in this array:
dict_ = {2: np.array([3.4, 2.6, -1.2]), 0: np.array([0, 0, 0]), 1: np.array([3.9, 2.6, -1.2]), 3: np.array([3.8, 6.6, -1.9]), 4: np.array([5.4, 2.6, -1.2]), 5: np.array([6.4, 2.6, -1.2])}
I want to calculate the average of the vector representations for each row in the array, but when the value is 0, ignore it when calculating average (dictionary shows it as a 0 vector).
For example, for the first row, it should average [5.4, 2.6, -1.2], [3.8, 6.6, -1.9], and [3.4, 2.6, -1.2], and give [4.2, 3.93, -1.43] as the first row of the output.
I want an output which keeps the same row structure, and has 3 columns (each vector in the dictionary has 3 values).
How can this be done in an efficient way? My actual dictionary has over 100000 entries and array is 100000 by 5000.
For efficiency I would transform the dict to an array and then use advanced indexing for lookup:
>>> import numpy as np
>>>
# create problem
>>> v = np.random.random((100_000, 3))
>>> dict_ = dict(enumerate(v))
>>> arr = np.random.randint(0, 100_000, (100_000, 100))
>>>
# solve
>>> from operator import itemgetter
>>> lookup = np.array(itemgetter(*range(100_000))(dict_))
>>> lookup[0] = np.nan
>>> result = np.nanmean(lookup[arr], axis=1)
Or applied to OP's example:
>>> arr = np.array([[4, 3, 0, 2],
... [1, 4, 3, 1],
... [0, 3, 5, 2],
... [1, 0, 5, 3],
... [0, 5, 4, 4]])
>>> dict_ = {2:np.array([3.4, 2.6, -1.2]), 0:np.array([0, 0, 0]), 1:np.array([3.9, 2.6, -1.2]), 3:np.array([3.8, 6.6, -1.9]), 4:np.array([5.4, 2.6, -1.2]),5:np.array([6.4, 2.6, -1.2])}
>>>
>>> lookup = np.array(itemgetter(*range(6))(dict_))
>>> lookup[0] = np.nan
>>> result = np.nanmean(lookup[arr], axis=1)
>>> result
array([[ 4.2       ,  3.93333333, -1.43333333],
       [ 4.25      ,  3.6       , -1.375     ],
       [ 4.53333333,  3.93333333, -1.43333333],
       [ 4.7       ,  3.93333333, -1.43333333],
       [ 5.73333333,  2.6       , -1.2       ]])
Timings against @jpp's method:
pp: 0.8046 seconds
jpp: 10.3449 seconds
results equal: True
Code to produce timings:
import numpy as np

# create problem
v = np.random.random((100_000, 3))
dict_ = dict(enumerate(v))
arr = np.random.randint(0, 100_000, (100_000, 100))

# solve
from operator import itemgetter

def f_pp(arr, dict_):
    lookup = np.array(itemgetter(*range(100_000))(dict_))
    lookup[0] = np.nan
    return np.nanmean(lookup[arr], axis=1)

def f_jpp(arr, dict_):
    def averager(x):
        lst = [dict_[i] for i in x if i]
        return np.mean(lst, axis=0) if lst else np.array([0, 0, 0])
    return np.apply_along_axis(averager, -1, arr)

from time import perf_counter

t = perf_counter()
r_pp = f_pp(arr, dict_)
s = perf_counter()
print(f'pp: {s-t:8.4f} seconds')

t = perf_counter()
r_jpp = f_jpp(arr, dict_)
s = perf_counter()
print(f'jpp: {s-t:8.4f} seconds')

print('results equal:', np.allclose(r_pp, r_jpp))
This is one solution using numpy.apply_along_axis.
You should test and benchmark to see if performance is adequate for your use case.
A = np.random.randint(6, size=(5, 4))
print(A)
[[3 5 2 4]
 [2 4 5 2]
 [0 3 1 1]
 [3 4 4 5]
 [2 5 0 2]]
zeros = {k for k, v in dict_.items() if (v == 0).all()}

def averager(x):
    lst = [dict_[i] for i in x if i not in zeros]
    return np.mean(lst, axis=0) if lst else np.array([0, 0, 0])

res = np.apply_along_axis(averager, -1, A)
array([[ 4.75      ,  3.6       , -1.375     ],
       [ 4.65      ,  2.6       , -1.2       ],
       [ 3.86666667,  3.93333333, -1.43333333],
       [ 5.25      ,  3.6       , -1.375     ],
       [ 4.4       ,  2.6       , -1.2       ]])
Suppose I have a 2D numpy array a = [[1,-2,1,0], [1,0,0,-1]] and I want to convert it to a 3D numpy array by element-wise multiplication with a vector t = [[x0,x0,x0,x0],[x1,x1,x1,x1]], where each xi is a 1D numpy array of size 3072. The result would be a*t = [[x0,-2x0,x0,0],[x1,0,0,-x1]] with shape (2, 4, 3072). How should I do that in numpy?
Code:
import numpy as np
# Example data taken from bendl's answer !!!
a = np.array([[1,-2,1,0], [1,0,0,-1]])
xi = np.array([1, 2, 3])
b = np.outer(a, xi).reshape(a.shape[0], -1, len(xi))
print('a:')
print(a)
print('b:')
print(b)
Output:
a:
[[ 1 -2  1  0]
 [ 1  0  0 -1]]
b:
[[[ 1  2  3]
  [-2 -4 -6]
  [ 1  2  3]
  [ 0  0  0]]

 [[ 1  2  3]
  [ 0  0  0]
  [ 0  0  0]
  [-1 -2 -3]]]
In other words: it's an outer product, and splitting/reshaping that one flattened dimension afterwards is easy.
You can use numpy broadcasting for this:
a = numpy.array([[1, -2, 1, 0], [1, 0, 0, -1]])
t = numpy.arange(3072 * 2).reshape(2, 3072)
# array([[ 0, 1, 2, ..., 3069, 3070, 3071], # = x0
# [3072, 3073, 3074, ..., 6141, 6142, 6143]]) # = x1
a.shape
# (2, 4)
t.shape
# (2, 3072)
c = (a.T[None, :, :] * t.T[:, None, :]).T
# array([[[ 0, 1, 2, ..., 3069, 3070, 3071], # = 1 * x0
# [ 0, -2, -4, ..., -6138, -6140, -6142], # = -2 * x0
# [ 0, 1, 2, ..., 3069, 3070, 3071], # = 1 * x0
# [ 0, 0, 0, ..., 0, 0, 0]], # = 0 * x0
#
# [[ 3072, 3073, 3074, ..., 6141, 6142, 6143], # = 1 * x1
# [ 0, 0, 0, ..., 0, 0, 0], # = 0 * x1
# [ 0, 0, 0, ..., 0, 0, 0], # = 0 * x1
# [-3072, -3073, -3074, ..., -6141, -6142, -6143]]]) # = -1 * x1
c.shape
# (2, 4, 3072)
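The same broadcast can also be written without the transposes, by inserting length-1 axes directly: a[:, :, None] has shape (2, 4, 1) and t[:, None, :] has shape (2, 1, 3072), which broadcast to (2, 4, 3072).
c2 = a[:, :, None] * t[:, None, :]
(c2 == c).all()
# True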
Does this do what you need?
import numpy as np
a = np.array([[1,-2,1,0], [1,0,0,-1]])
xi = np.array([1, 2, 3])
a = np.dstack([a * i for i in xi])
The docs for this are here:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.dstack.html
How do I get it to print just a list of the averages? I need it to be in exactly the same format as my np arrays so that I can compare them to see if they are the same or not.
Code:
import numpy as np
from pprint import pprint
centroids = np.array([[3,44],[4,15],[5,15]])
dataPoints = np.array([[2,4],[17,4],[45,2],[45,7],[16,32],[32,14],[20,56],[68,33]])

def size(vector):
    return np.sqrt(sum(x**2 for x in vector))

def distance(vector1, vector2):
    return size(vector1 - vector2)

def distances(array1, array2):
    lists = [[distance(vector1, vector2) for vector2 in array2] for vector1 in array1]
    #print lists.index(min, zip(*lists))
    smallest = [min(zip(l, range(len(l)))) for l in zip(*lists)]
    clusters = {}
    for j, (_, i) in enumerate(smallest):
        clusters.setdefault(i, []).append(dataPoints[j])
    pprint(clusters)
    print '\nAverage of Each Point'
    avgDict = {}
    for k, v in clusters.iteritems():
        avgDict[k] = sum(v) / (len(v))
    avgList = np.asarray(avgDict)
    pprint(avgList)

distances(centroids, dataPoints)
Current Output:
{0: [array([16, 32]), array([20, 56])],
 1: [array([2, 4])],
 2: [array([17,  4]),
     array([45,  2]),
     array([45,  7]),
     array([32, 14]),
     array([68, 33])]}
Average of Each Point
array({0: array([18, 44]), 1: array([2, 4]), 2: array([41, 12])}, dtype=object)
Desired Output:
[[18,44],[2,4],[41,12]]
Or whatever format is best for comparing my arrays/lists. I am aware I should have just stuck with one data type.
Are you trying to cluster the dataPoints by the index of the nearest centroid, and then find the average position of the points in each cluster? If so, I'd advise using numpy's broadcasting rules to get the output you need.
Consider this,
np.linalg.norm(centroids[None, :, :] - dataPoints[:, None, :], axis=-1)
It creates a matrix showing all distances between dataPoints and centroids,
array([[ 40.01249805,  11.18033989,  11.40175425],
       [ 42.3792402 ,  17.02938637,  16.2788206 ],
       [ 59.39696962,  43.01162634,  42.05948169],
       [ 55.97320788,  41.77319715,  40.79215611],
       [ 17.69180601,  20.80865205,  20.24845673],
       [ 41.72529209,  28.01785145,  27.01851217],
       [ 20.80865205,  44.01136217,  43.65775991],
       [ 65.9241989 ,  66.48308055,  65.520989  ]])
And you can compute the indices of the nearest centroids with this trick (split into 3 lines for readability),
In: t0 = centroids[None, :, :] - dataPoints[:, None, :]
In: t1 = np.linalg.norm(t0, axis=-1)
In: t2 = np.argmin(t1, axis=-1)
Now t2 has the indices,
array([1, 2, 2, 2, 0, 2, 0, 2])
To find the first cluster (index 0), use the boolean mask t2 == 0,
In: dataPoints[t2 == 0]
Out: array([[16, 32],
            [20, 56]])
In: dataPoints[t2 == 1]
Out: array([[2, 4]])
In: dataPoints[t2 == 2]
Out: array([[17,  4],
            [45,  2],
            [45,  7],
            [32, 14],
            [68, 33]])
Or just calculate the average in your case,
In: np.mean(dataPoints[t2 == 0], axis=0)
Out: array([ 18., 44.])
In: np.mean(dataPoints[t2 == 1], axis=0)
Out: array([ 2., 4.])
In: np.mean(dataPoints[t2 == 2], axis=0)
Out: array([ 41.4, 12. ])
Of course, the latter blocks can be rewritten with a for-loop or a comprehension if you want, as sketched below. In my opinion, though, it's good practice to formulate the solution using numpy's conventions.
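For instance, a compact way to collect all the cluster averages into a single array in the format you asked for (a sketch reusing t2 from above):
avgList = np.array([dataPoints[t2 == k].mean(axis=0) for k in range(len(centroids))])
avgList
# array([[18. , 44. ],
#        [ 2. ,  4. ],
#        [41.4, 12. ]])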
I have:
import numpy as np
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7, ..., 4])
x = (B/position**2)*dt
A = np.cumsum(x)
assert A[0] == 0 # I want this to be true.
Where B and dt are scalar constants. This is for a numerical integration problem with the initial condition A[0] = 0. Is there a way to set A[0] = 0 and then do a cumsum for everything else?
I don't understand exactly what your problem is, but here are two things you can do to get A[0] = 0.
You can make A one element longer, so that the zero becomes the first entry:
# initialize example data
import numpy as np
B = 1
dt = 1
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7])
# do calculation
A = np.zeros(len(position) + 1)
A[1:] = np.cumsum((B/position**2)*dt)
Result:
A = [ 0. 0.0625 0.11559096 0.16105356 0.20073547 0.23633533 0.26711403]
len(A) == len(position) + 1
Alternatively, you can manipulate the calculation to subtract the first entry of the result:
# initialize example data
import numpy as np
B = 1
dt = 1
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7])
# do calculation
A = np.cumsum((B/position**2)*dt)
A = A - A[0]
Result:
[ 0. 0.05309096 0.09855356 0.13823547 0.17383533 0.20461403]
len(A) == len(position)
As you see, the results have different lengths. Is one of them what you expect?
1D cumsum
A wrapper around np.cumsum that sets first element to 0:
def cumsum(pmf):
    cdf = np.empty(len(pmf) + 1, dtype=pmf.dtype)
    cdf[0] = 0
    np.cumsum(pmf, out=cdf[1:])
    return cdf
Example usage:
>>> np.arange(1, 11)
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> cumsum(np.arange(1, 11))
array([ 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55])
N-D cumsum
A wrapper around np.cumsum that sets first element to 0, and works with N-D arrays:
def cumsum(pmf, axis=None, dtype=None):
    if axis is None:
        pmf = pmf.reshape(-1)
        axis = 0
    if dtype is None:
        dtype = pmf.dtype
    idx = [slice(None)] * pmf.ndim
    # Create array with one extra element along the cumsummed axis.
    shape = list(pmf.shape)
    shape[axis] += 1
    cdf = np.empty(shape, dtype)
    # Set the first element to 0.
    idx[axis] = 0
    cdf[tuple(idx)] = 0
    # Perform the cumsum on the remaining elements.
    idx[axis] = slice(1, None)
    np.cumsum(pmf, axis=axis, dtype=dtype, out=cdf[tuple(idx)])
    return cdf
Example usage:
>>> np.arange(1, 11).reshape(2, 5)
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])
>>> cumsum(np.arange(1, 11).reshape(2, 5), axis=-1)
array([[ 0,  1,  3,  6, 10, 15],
       [ 0,  6, 13, 21, 30, 40]])
I totally understand your pain; I wonder why numpy doesn't allow this with np.cumsum. Anyway, though I'm really late and there's already another good answer, I prefer this one a bit more:
np.cumsum(np.pad(array, (1, 0), "constant"))
where array in your case is (B/position**2)*dt. You can change the order of np.pad and np.cumsum as well. I'm just adding a zero to the start of the array and calling np.cumsum.
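Both orders produce the same array; a minimal sketch:
x = (B/position**2)*dt
A1 = np.cumsum(np.pad(x, (1, 0), "constant"))  # pad a zero first, then cumsum
A2 = np.pad(np.cumsum(x), (1, 0), "constant")  # cumsum first, then pad a zero
# A1 and A2 are identical: both start with 0 and have len(x) + 1 entries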
You can also use np.roll (shift right by 1) on the cumsum result and then set the first entry to zero.
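A minimal sketch of that approach. Note that np.roll keeps the length unchanged, so unlike the padding solutions the grand total at the end is discarded (the result is an exclusive cumsum of the same length as the input):
x = np.array([1, 2, 3, 4])
A = np.roll(np.cumsum(x), 1)  # [10, 1, 3, 6]: the final total wraps around to the front
A[0] = 0                      # overwrite it with the initial condition
A
# array([0, 1, 3, 6])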