I have a multi-dimensional array of scores, and I need to get the sum of each column at the third level in Python. I am using NumPy to achieve this.
import numpy as np
Data is something like:
score_list = [
[[1,1,3], [1,2,5]],
[[2,7,5], [4,1,3]]
]
This should return:
[[3 8 8] [5 3 8]]
This works correctly using:
np_array = np.array(score_list)
sum_array = np_array.sum(axis=0)
print(sum_array)
However, if I have an irregular shape like this:
score_list = [
[[1,1], [1,2,5]],
[[2,7], [4,1,3]]
]
I expect it to return:
[[3 8] [5 3 8]]
However, it comes up with a warning and the return value is:
[list([1, 1, 2, 7]) list([1, 2, 5, 4, 1, 3])]
How can I get the expected result?
NumPy will try to cast it into an ndarray, which will fail for the ragged shape. Instead, consider passing each group of sublists individually using zip:
score_list = [
[[1,1], [1,2,5]],
[[2,7], [4,1,3]]
]
import numpy as np
res = [np.sum(x,axis=0) for x in zip(*score_list)]
print(res)
[array([3, 8]), array([5, 3, 8])]
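If plain Python lists are preferred over per-group arrays, a small optional follow-up (not part of the original answer) converts each array back:
res_lists = [r.tolist() for r in res]
print(res_lists)  # [[3, 8], [5, 3, 8]]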
Here is one solution for doing this. Keep in mind that it doesn't use NumPy and will be very inefficient for larger matrices (but for smaller matrices it runs just fine).
# Create matrix
score_list = [
[[1,1,3], [1,2,5]],
[[2,7,5], [4,1,3]]
]
# Get each row
for i in range(1, len(score_list)):
    # Get each list within the row
    for j in range(len(score_list[i])):
        # Get each value in each list
        for k in range(len(score_list[i][j])):
            # Add the current value to the same index
            # on the first row
            score_list[0][j][k] += score_list[i][j][k]
print(score_list[0])
There is bound to be a better solution, but this is a temporary fix for you :)
Edit: made it more efficient.
A possible solution:
a = np.vstack([np.array(score_list[x], dtype='object')
               for x in range(len(score_list))])
# a is a 2x2 object array holding the inner lists; np.add takes two
# operands, so this assumes score_list has exactly two rows
[np.add(*[x for x in a[:, i]]) for i in range(a.shape[1])]
Another possible solution:
a = sum(score_list, [])                   # flatten one level: all inner lists in order
b = [a[x] for x in range(0, len(a), 2)]   # first sublist of each row (assumes two sublists per row)
c = [a[x] for x in range(1, len(a), 2)]   # second sublist of each row
[np.add(x[0], x[1]) for x in [b, c]]      # assumes exactly two rows
Output:
[array([3, 8]), array([5, 3, 8])]
I have an array like the following, but much larger:
array = np.random.randint(6, size=(5, 4))
array([[4, 3, 0, 2],
[1, 4, 3, 1],
[0, 3, 5, 2],
[1, 0, 5, 3],
[0, 5, 4, 4]])
I also have a dictionary which gives me the vector representation of each value in this array:
dict_ = {2: np.array([3.4, 2.6, -1.2]), 0: np.array([0, 0, 0]), 1: np.array([3.9, 2.6, -1.2]), 3: np.array([3.8, 6.6, -1.9]), 4: np.array([5.4, 2.6, -1.2]), 5: np.array([6.4, 2.6, -1.2])}
I want to calculate the average of the vector representations for each row in the array, but when the value is 0, ignore it when calculating average (dictionary shows it as a 0 vector).
For example, for the first row, it should average [5.4, 2.6, -1.2], [3.8, 6.6, -1.9], and [3.4, 2.6, -1.2], and give [4.2, 3.93, -1.43] as the first row of the output.
I want an output which keeps the same row structure, and has 3 columns (each vector in the dictionary has 3 values).
How can this be done in an efficient way? My actual dictionary has over 100,000 entries and the array is 100,000 by 5,000.
For efficiency I would transform the dict to an array and then use advanced indexing for lookup:
>>> import numpy as np
>>>
# create problem
>>> v = np.random.random((100_000, 3))
>>> dict_ = dict(enumerate(v))
>>> arr = np.random.randint(0, 100_000, (100_000, 100))
>>>
# solve
>>> from operator import itemgetter
>>> lookup = np.array(itemgetter(*range(100_000))(dict_))
>>> lookup[0] = np.nan
>>> result = np.nanmean(lookup[arr], axis=1)
Or applied to OP's example:
>>> arr = np.array([[4, 3, 0, 2],
... [1, 4, 3, 1],
... [0, 3, 5, 2],
... [1, 0, 5, 3],
... [0, 5, 4, 4]])
>>> dict_ = {2:np.array([3.4, 2.6, -1.2]), 0:np.array([0, 0, 0]), 1:np.array([3.9, 2.6, -1.2]), 3:np.array([3.8, 6.6, -1.9]), 4:np.array([5.4, 2.6, -1.2]),5:np.array([6.4, 2.6, -1.2])}
>>>
>>> lookup = np.array(itemgetter(*range(6))(dict_))
>>> lookup[0] = np.nan
>>> result = np.nanmean(lookup[arr], axis=1)
>>> result
array([[ 4.2 , 3.93333333, -1.43333333],
[ 4.25 , 3.6 , -1.375 ],
[ 4.53333333, 3.93333333, -1.43333333],
[ 4.7 , 3.93333333, -1.43333333],
[ 5.73333333, 2.6 , -1.2 ]])
Timings against @jpp's method:
pp: 0.8046 seconds
jpp: 10.3449 seconds
results equal: True
Code to produce timings:
import numpy as np
# create problem
v = np.random.random((100_000, 3))
dict_ = dict(enumerate(v))
arr = np.random.randint(0, 100_000, (100_000, 100))
# solve
from operator import itemgetter
def f_pp(arr, dict_):
    lookup = np.array(itemgetter(*range(100_000))(dict_))
    lookup[0] = np.nan
    return np.nanmean(lookup[arr], axis=1)

def f_jpp(arr, dict_):
    def averager(x):
        lst = [dict_[i] for i in x if i]
        return np.mean(lst, axis=0) if lst else np.array([0, 0, 0])
    return np.apply_along_axis(averager, -1, arr)
from time import perf_counter
t = perf_counter()
r_pp = f_pp(arr, dict_)
s = perf_counter()
print(f'pp: {s-t:8.4f} seconds')
t = perf_counter()
r_jpp = f_jpp(arr, dict_)
s = perf_counter()
print(f'jpp: {s-t:8.4f} seconds')
print('results equal:', np.allclose(r_pp, r_jpp))
This is one solution using numpy.apply_along_axis.
You should test and benchmark to see if performance is adequate for your use case.
A = np.random.randint(6, size=(5, 4))
print(A)
[[3 5 2 4]
[2 4 5 2]
[0 3 1 1]
[3 4 4 5]
[2 5 0 2]]
zeros = {k for k, v in dict_.items() if (v==0).all()}
def averager(x):
    lst = [dict_[i] for i in x if i not in zeros]
    return np.mean(lst, axis=0) if lst else np.array([0, 0, 0])
res = np.apply_along_axis(averager, -1, A)
array([[ 4.75 , 3.6 , -1.375 ],
[ 4.65 , 2.6 , -1.2 ],
[ 3.86666667, 3.93333333, -1.43333333],
[ 5.25 , 3.6 , -1.375 ],
[ 4.4 , 2.6 , -1.2 ]])
I am trying to normalize each row vector of the NumPy array x, but I'm facing two problems.
I'm unable to update the row vectors of x (source code below).
Is it possible to avoid the for loop (line 6) with any NumPy functions?
import numpy as np
x = np.array([[0, 3, 4] , [1, 6, 4]])
c = x ** 2
for i in range(0, len(x)):
    print(x[i]/np.sqrt(c[i].sum()))  # prints [0. 0.6 0.8]
    x[i] = x[i]/np.sqrt(c[i].sum())
    print(x[i])  # prints [0 0 0]
print(x)  # prints [[0 0 0] [0 0 0]] and wasn't updated
I've just recently started out with numpy, so any assistance would be greatly appreciated!
I'm unable to update the row vectors of x (source code below)
Your np.array call has no dtype argument, so it defaults to an integer dtype such as numpy.int32. If you wish to store floats in the array, add a float dtype:
x = np.array([
    [0, 3, 4],
    [1, 6, 4]
], dtype=float)
To see this, compare
x = np.array([
    [0, 3, 4],
    [1, 6, 4]
], dtype=float)
print(type(x[0][0]))  # output: <class 'numpy.float64'>
to
x = np.array([
    [0, 3, 4],
    [1, 6, 4]
])
print(type(x[0][0]))  # output: <class 'numpy.int32'> (or numpy.int64, depending on platform)
Is it possible to avoid the for loop (line 6) with any NumPy functions?
This is how I would do it:
norm1, norm2 = np.linalg.norm(x[0]), np.linalg.norm(x[1])
print(x[0] / norm1)
print(x[1] / norm2)
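The same idea can also be applied to all rows at once; here is a small sketch (not part of the original answer) using the axis and keepdims arguments of np.linalg.norm:
norms = np.linalg.norm(x, axis=1, keepdims=True)  # one L2 norm per row, shape (2, 1)
print(x / norms)  # broadcasting divides each row by its own norm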
You can use:
x/np.sqrt((x*x).sum(axis=1))[:, None]
Example:
In [9]: x = np.array([[0, 3, 4] , [1, 6, 4]])
In [10]: x/np.sqrt((x*x).sum(axis=1))[:, None]
Out[10]:
array([[0. , 0.6 , 0.8 ],
[0.13736056, 0.82416338, 0.54944226]])
For the first question:
x = np.array([[0,3,4],[1,6,4]],dtype=np.float32)
For the second question:
x/np.sqrt(np.sum(x**2,axis=1).reshape((len(x),1)))
Given 2-dimensional array
x = np.array([[0, 3, 4] , [1, 6, 4]])
Row-wise L2 norm of that array can be calculated with:
norm = np.linalg.norm(x, axis = 1)
print(norm)
[5. 7.28010989]
You cannot divide the array x of shape (2, 3) by norm of shape (2,); the following trick enables that by adding an extra dimension to norm:
# Divide by adding extra dimension
x = x / norm[:, None]
print(x)
[[0. 0.6 0.8 ]
[0.13736056 0.82416338 0.54944226]]
This solves both of your questions.
Say you have two NumPy arrays: one, call it A = [x1,x2,x3,x4,x5], which has all the x coordinates, and another, call it B = [y1,y2,y3,y4,y5], which has the y coordinates. How would one "extract" a set of coordinates, e.g. (x1, y1), so that I could actually do something with it? Could I use a for loop or something similar? I can't seem to find any good examples, so if you could direct me to some or show me some I would be grateful.
Not sure if that's what you're looking for, but you can use numpy.concatenate. You just have to add a fake dimension beforehand with [:, None]:
import numpy as np
a = np.array([1,2,3,4,5])
b = np.array([6,7,8,9,10])
arr_2d = np.concatenate([a[:,None],b[:,None]], axis=1)
print(arr_2d)
# [[ 1 6] [ 2 7] [ 3 8] [ 4 9] [ 5 10]]
Once you have generated a 2D array you can just use arr_2d[i] to get the i-th set of coordinates.
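For instance, continuing with the arr_2d built above:
print(arr_2d[0])    # [1 6]
x0, y0 = arr_2d[0]  # unpack the first coordinate pair into separate values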
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9, 10])
print(np.hstack([a[:, np.newaxis], b[:, np.newaxis]]))
[[ 1 6]
[ 2 7]
[ 3 8]
[ 4 9]
[ 5 10]]
As @user2314737 said in a comment, you could manually do it by simply grabbing the same element from each array, like so:
a = np.array([1,2,3])
b = np.array([4,5,6])
index = 2 #completely arbitrary index choice
#as individual values
pointA = a[index]
pointB = b[index]
#or in tuple form
point = (a[index], b[index])
If you need all of them converted to coordinate form, then @Nuageux's answer is probably better.
Let's say you have x = np.array([ 0.48, 0.51, -0.43, 2.46, -0.91]) and y = np.array([ 0.97, -1.07, 0.62, -0.92, -1.25])
Then you can use the zip function
zip(x,y)
This will create an iterator (in Python 3). Turn it into a list and convert the result into a NumPy array:
np.array(list(zip(x,y)))
the result will look like this
array([[ 0.48, 0.97],
[ 0.51, -1.07],
[-0.43, 0.62],
[ 2.46, -0.92],
[-0.91, -1.25]])
I have:
import numpy as np
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7, ..., 4])
x = (B/position**2)*dt
A = np.cumsum(x)
assert A[0] == 0 # I want this to be true.
Where B and dt are scalar constants. This is for a numerical integration problem with the initial condition A[0] = 0. Is there a way to set A[0] = 0 and then do a cumsum for everything else?
I don't understand what exactly your problem is, but here are some things you can do to have A[0] = 0.
You can create A to be longer by one entry so that the zero is the first element:
# initialize example data
import numpy as np
B = 1
dt = 1
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7])
# do calculation
A = np.zeros(len(position) + 1)
A[1:] = np.cumsum((B/position**2)*dt)
Result:
A = [ 0. 0.0625 0.11559096 0.16105356 0.20073547 0.23633533 0.26711403]
len(A) == len(position) + 1
Alternatively, you can manipulate the calculation to subtract the first entry of the result:
# initialize example data
import numpy as np
B = 1
dt = 1
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7])
# do calculation
A = np.cumsum((B/position**2)*dt)
A = A - A[0]
Result:
[ 0. 0.05309096 0.09855356 0.13823547 0.17383533 0.20461403]
len(A) == len(position)
As you see, the results have different lengths. Is one of them what you expect?
1D cumsum
A wrapper around np.cumsum that sets the first element to 0:
def cumsum(pmf):
    cdf = np.empty(len(pmf) + 1, dtype=pmf.dtype)
    cdf[0] = 0
    np.cumsum(pmf, out=cdf[1:])
    return cdf
Example usage:
>>> np.arange(1, 11)
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> cumsum(np.arange(1, 11))
array([ 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55])
N-D cumsum
A wrapper around np.cumsum that sets the first element to 0 and works with N-D arrays:
def cumsum(pmf, axis=None, dtype=None):
    if axis is None:
        pmf = pmf.reshape(-1)
        axis = 0
    if dtype is None:
        dtype = pmf.dtype
    idx = [slice(None)] * pmf.ndim
    # Create array with extra element along cumsummed axis.
    shape = list(pmf.shape)
    shape[axis] += 1
    cdf = np.empty(shape, dtype)
    # Set first element to 0.
    idx[axis] = 0
    cdf[tuple(idx)] = 0
    # Perform cumsum on remaining elements.
    idx[axis] = slice(1, None)
    np.cumsum(pmf, axis=axis, dtype=dtype, out=cdf[tuple(idx)])
    return cdf
Example usage:
>>> np.arange(1, 11).reshape(2, 5)
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10]])
>>> cumsum(np.arange(1, 11).reshape(2, 5), axis=-1)
array([[ 0, 1, 3, 6, 10, 15],
[ 0, 6, 13, 21, 30, 40]])
I totally understand your pain; I wonder why NumPy doesn't allow this with np.cumsum. Anyway, though I'm really late and there's already another good answer, I prefer this one a bit more:
np.cumsum(np.pad(array, (1, 0), "constant"))
where array in your case is (B/position**2)*dt. You can change the order of np.pad and np.cumsum as well. I'm just adding a zero to the start of the array and calling np.cumsum.
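As a quick check of both orderings on a small array (an illustrative example, not from the original answer):
arr = np.arange(1, 5)
print(np.cumsum(np.pad(arr, (1, 0), "constant")))  # [ 0  1  3  6 10]
print(np.pad(np.cumsum(arr), (1, 0), "constant"))  # [ 0  1  3  6 10]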
You can use np.roll (shift right by one) and then set the first entry to zero.
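A minimal sketch of that idea, using B, position, and dt from the question (note that this keeps the original length, so the final cumulative total is dropped):
x = (B / position**2) * dt
A = np.roll(np.cumsum(x), 1)  # shift the cumulative sums right by one; the last total wraps to the front
A[0] = 0                      # overwrite the wrapped value with the initial condition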