I have the following array:
a = [[40.5, 23.4],
[175.9, 20.2],
[21.4, 24.0],
[130.3, 18.4],
[6.3, 25.7],
[73.4, 21.5],
[16.6, 25.7],
[125.9, 19.1],
[41.4, 24.7],
[180.6, 16.4],
[13.6, 24.4],
[103.2, 19.0],
[3.2, 24.7],
[55.9, 23.1],
[208.8, 20.4]]
I need to add rows with zeroes to have the final array as:
b = [[40.5, 23.4],
[175.9, 20.2],
[0., 0.],
[21.4, 24.0],
[130.3, 18.4],
[0., 0.],
[6.3, 25.7],
[73.4, 21.5],
[0., 0.],
[16.6, 25.7],
[125.9, 19.1],
[0., 0.],
[41.4, 24.7],
[0., 0.],
[0., 0.],
[180.6, 16.4],
[0., 0.],
[0., 0.],
[13.6, 24.4],
[103.2, 19.0],
[0., 0.],
[3.2, 24.7],
[55.9, 23.1],
[208.8, 20.4]]
In summary, I need to add rows at specific indexes, but the number of rows to add is not constant. In this example (please see image), I need to add enough rows that each key ends up with as many rows as the key that has the most. I don't care about the keys in my code, but I need to somehow "normalise" the array so that I have the same number of rows for each key.
[Image: Array Details]
Here's a sample of the list of indices: [[ 2 1] [ 4 1] [ 6 1] [ 8 2] [ 9 2] [10 1] [12 0]]
I've tried np.insert, np.concatenate, advanced indexing, etc., but could not come up with a solution.
Any ideas on how to solve this?
import numpy as np
# The indices you provided don't seem to match the data, so here is
# just an example list of [index, count] pairs. Substitute your own.
# First, transform this list into a form that can be fed to np.insert()
indeces = [[0, 2], [2, 3]]
tmp = [[_[0]] * _[1] for _ in indeces]
indeces_flat = []
for elem in tmp:
    for item in elem:
        indeces_flat.append(item)
print(indeces_flat)
# substitute with your array
a = np.array([[1, 2], [3, 4], [5, 6]])
# the main insertion
a_inserted = np.insert(a, indeces_flat, [0, 0], axis=0)
print(a_inserted)
prints:
[0, 0, 2, 2, 2]
[[0 0]
[0 0]
[1 2]
[3 4]
[0 0]
[0 0]
[0 0]
[5 6]]
Here's a NumPy-based approach:
def insert_n_zeros_at(a, i):
    # which indices have more than 1 row of zeros inserted?
    m = i[:,1]>1
    # empty nan array, filled on cols >1 with replicated values
    ix = np.full((i.shape[0], i[:,1].max()),np.nan)
    ix[:,:1] = i[:,:1]
    ix.ravel()[np.stack((m,m)).ravel('F')] = np.repeat(i[m,0], i[m,1])
    # columns' values cumulative sum (they are the real indices)
    ix += np.arange(ix.shape[1])
    # accounts for the index increase when prior rows are added
    cs = i[:,1].cumsum()
    ix[1:] += cs[:-1,None]
    # flattens to 1d of actual indices
    ix = ix[~np.isnan(ix)]
    # total number of zero rows to insert. Used to define out
    out = np.zeros((a.shape[0]+cs[-1], a.shape[1]))
    r = np.arange(out.shape[0])
    # assign a where we don't have indices of 0s
    out[r[~np.isin(r, ix)]] = a
    return out
For the shared example, we get:
i = np.array([[ 2, 1], [ 4, 1], [ 6, 1], [ 8, 2], [ 9, 2], [10, 1]])
insert_n_zeros_at(a, i)
array([[ 40.5, 23.4], # 0
[175.9, 20.2], # 1
[ 0. , 0. ], # <- 1 zero at 2
[ 21.4, 24. ], # 2
[130.3, 18.4], # 3
[ 0. , 0. ], # <- 1 zero at 4
[ 6.3, 25.7], # 4
[ 73.4, 21.5], # 5
[ 0. , 0. ], # <- 1 zero at 6
[ 16.6, 25.7], # 6
[125.9, 19.1], # 7
[ 0. , 0. ], # <- 2 zeros at 8
[ 0. , 0. ],
[ 41.4, 24.7], # 8
[ 0. , 0. ], # <- 2 zeros at 9
[ 0. , 0. ],
[180.6, 16.4], # 9
[ 0. , 0. ], # <- 1 zero at 10
[ 13.6, 24.4], # 10
[103.2, 19. ], # 11
[ 3.2, 24.7], # 12
[ 55.9, 23.1], # 13
[208.8, 20.4]])
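As a shorter alternative, the same result can probably be obtained with np.insert alone, by repeating each insertion index as many times as rows are needed. This is only a minimal sketch: it assumes every row of the index list is an [index, count] pair referring to positions in the original array, and the helper name insert_zero_rows is made up for illustration.

```python
import numpy as np

def insert_zero_rows(a, pairs):
    """Hypothetical helper: insert `count` zero rows before row `index`
    of `a` for every [index, count] pair (indices refer to the original a)."""
    a = np.asarray(a, dtype=float)
    pairs = np.asarray(pairs)
    # Repeat each index `count` times; pairs with count 0 simply vanish.
    positions = np.repeat(pairs[:, 0], pairs[:, 1])
    return np.insert(a, positions, 0.0, axis=0)

a = np.array([[40.5, 23.4], [175.9, 20.2], [21.4, 24.0], [130.3, 18.4],
              [6.3, 25.7], [73.4, 21.5], [16.6, 25.7], [125.9, 19.1],
              [41.4, 24.7], [180.6, 16.4], [13.6, 24.4], [103.2, 19.0],
              [3.2, 24.7], [55.9, 23.1], [208.8, 20.4]])
i = np.array([[2, 1], [4, 1], [6, 1], [8, 2], [9, 2], [10, 1], [12, 0]])

b = insert_zero_rows(a, i)
print(b.shape)
```

Because np.insert interprets all indices relative to the original array, no manual offset bookkeeping is needed, and the count is not limited to 2.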
Related
There is a list of lists of tuples:
[[(0, 0.5), (1, 0.6)], [(4, 0.01), (5, 0.005), (6, 0.002)], [(1,0.7)]]
I need to get an X x Y matrix:
x = number of sublists
y = max among the first elements throughout all pairs, plus 1
elem[x,y] = second element of the pair in sublist x whose first element == y (0 if there is none)
0     1     2     3     4      5       6
0.5   0.6   0     0     0      0       0
0     0     0     0     0.01   0.005   0.002
0     0.7   0     0     0      0       0
You can figure out the array's dimensions the following way. The Y dimension is the number of sublists:
>>> data = [[(0, 0.5), (1, 0.6)], [(4, 0.01), (5, 0.005), (6, 0.002)], [(1,0.7)]]
>>> dim_y = len(data)
>>> dim_y
3
The X dimension is the largest [0] index of all of the tuples, plus 1.
>>> dim_x = max(max(i for i,j in sub) for sub in data) + 1
>>> dim_x
7
So then initialize an array of all zeros with this size
>>> import numpy as np
>>> arr = np.zeros((dim_x, dim_y))
>>> arr
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])
Now to fill it, enumerate over your sublists to keep track of the y index. Then, for each sublist, use the [0] element as the x index and the [1] element as the value itself:
for y, sub in enumerate(data):
    for x, value in sub:
        arr[x,y] = value
Then the resulting array should be populated (you might want to transpose it to match your desired dimensions).
>>> arr.T
array([[0.5 , 0.6 , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.01 , 0.005, 0.002],
[0. , 0.7 , 0. , 0. , 0. , 0. , 0. ]])
As I commented in the accepted answer, data is 'ragged' and can't be made into an array.
Now if the data had a more regular form, a no-loop solution is possible. But conversion to such a form requires the same double looping!
In [814]: [(i,j,v) for i,row in enumerate(data) for j,v in row]
Out[814]:
[(0, 0, 0.5),
(0, 1, 0.6),
(1, 4, 0.01),
(1, 5, 0.005),
(1, 6, 0.002),
(2, 1, 0.7)]
'transpose' and separate into 3 variables:
In [815]: I,J,V=zip(*_)
In [816]: I,J,V
Out[816]: ((0, 0, 1, 1, 1, 2), (0, 1, 4, 5, 6, 1), (0.5, 0.6, 0.01, 0.005, 0.002, 0.7))
I stuck with the list transpose here so as to not convert the integer indices to floats. It may also be faster, since making an array from a list isn't a time-trivial task.
Now we can assign values via numpy magic:
In [819]: arr = np.zeros((3,7))
In [820]: arr[I,J]=V
In [821]: arr
Out[821]:
array([[0.5 , 0.6 , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.01 , 0.005, 0.002],
[0. , 0.7 , 0. , 0. , 0. , 0. , 0. ]])
I,J,V could also be used as input to a scipy.sparse.coo_matrix call, making a sparse matrix.
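A minimal sketch of that coo_matrix call, assuming SciPy is installed and rebuilding I, J, V from the same data list; the shape is derived from the data here rather than hard-coded:

```python
import numpy as np
from scipy import sparse

data = [[(0, 0.5), (1, 0.6)], [(4, 0.01), (5, 0.005), (6, 0.002)], [(1, 0.7)]]

# Flatten to (row, col, value) triples, then split into three tuples.
I, J, V = zip(*[(i, j, v) for i, row in enumerate(data) for j, v in row])

# One row per sublist; enough columns for the largest column index.
shape = (len(data), max(J) + 1)
M = sparse.coo_matrix((V, (I, J)), shape=shape)

dense = M.toarray()
print(dense)
```

Only the six stored values are kept in M; toarray() is used here just to check the result against the dense version.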
Speaking of a sparse matrix, here's what a sparse version of arr looks like:
In list-of-lists format:
In [822]: from scipy import sparse
In [823]: M = sparse.lil_matrix(arr)
In [824]: M
Out[824]:
<3x7 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in List of Lists format>
In [825]: M.A
Out[825]:
array([[0.5 , 0.6 , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.01 , 0.005, 0.002],
[0. , 0.7 , 0. , 0. , 0. , 0. , 0. ]])
In [826]: M.rows
Out[826]: array([list([0, 1]), list([4, 5, 6]), list([1])], dtype=object)
In [827]: M.data
Out[827]:
array([list([0.5, 0.6]), list([0.01, 0.005, 0.002]), list([0.7])],
dtype=object)
and the more common coo format:
In [828]: Mc=M.tocoo()
In [829]: Mc.row
Out[829]: array([0, 0, 1, 1, 1, 2], dtype=int32)
In [830]: Mc.col
Out[830]: array([0, 1, 4, 5, 6, 1], dtype=int32)
In [831]: Mc.data
Out[831]: array([0.5 , 0.6 , 0.01 , 0.005, 0.002, 0.7 ])
and the csr used for most calculations:
In [832]: Mr=M.tocsr()
In [833]: Mr.data
Out[833]: array([0.5 , 0.6 , 0.01 , 0.005, 0.002, 0.7 ])
In [834]: Mr.indices
Out[834]: array([0, 1, 4, 5, 6, 1], dtype=int32)
In [835]: Mr.indptr
Out[835]: array([0, 2, 5, 6], dtype=int32)
I have a binomial tree stored as an upper triangular matrix:
array([[400., 500., 625.],
[ 0., 320., 400.],
[ 0., 0., 256.]])
and I am trying to convert it to a matrix with all possible paths, like:
array([[400., 500., 625.],
[400., 500., 400.],
[400., 320., 400.],
[400., 320., 256.]])
I've written a snippet that does the job when there are only 2 steps:
def unstack_tree(tree):
    output_map = []
    for i in range(tree.shape[0] - 1):
        for j in range(tree.shape[1] - 1):
            output_map.append([tree[0,0], tree[i, 1], tree[i+j, 2]])
    return np.array(output_map)
But I am struggling with how to generalize it to N steps to handle, say, a 3-step tree:
array([[400. , 500. , 625. , 781.25],
[ 0. , 320. , 400. , 500. ],
[ 0. , 0. , 256. , 320. ],
[ 0. , 0. , 0. , 204.8 ]])
I think I need more loops but cannot formulate it
Each path can be represented by a binary code: the first by (0, 0), the second by (0, 1), the third by (1, 0), and so on. The actual row index into the array at each step is then the cumulative sum of that binary representation.
import numpy as np
from itertools import product
n = 2
b = np.array([[400., 500., 625.],
[ 0., 320., 400.],
[ 0., 0., 256.]])
a = np.array(list(product((0, 1), repeat=n)))
a = np.c_[[0] * 2 ** n, a]
print(a)
# [[0 0 0]
# [0 0 1]
# [0 1 0]
# [0 1 1]]
a = a.cumsum(axis=1)
print(a)
# [[0 0 0]
# [0 0 1]
# [0 1 1]
# [0 1 2]]
print(np.choose(a, b))
# [[400. 500. 625.]
# [400. 500. 400.]
# [400. 320. 400.]
# [400. 320. 256.]]
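The same idea can be wrapped in a function for a tree with any number of steps. This is a sketch under the assumption that the tree is a square upper triangular array; the name tree_paths is made up for illustration, and plain integer indexing is used in place of np.choose.

```python
import numpy as np
from itertools import product

def tree_paths(tree):
    """Hypothetical helper: list every path through a binomial tree
    stored as a square upper triangular matrix."""
    tree = np.asarray(tree)
    n = tree.shape[0] - 1  # number of steps
    # Every combination of n binary moves (0 = stay on row, 1 = go down a row).
    moves = np.array(list(product((0, 1), repeat=n)), dtype=int)
    # Prepend the start node (row 0) and accumulate moves into row indices.
    start = np.zeros((len(moves), 1), dtype=int)
    rows = np.concatenate([start, moves], axis=1).cumsum(axis=1)
    cols = np.arange(n + 1)
    # Pick tree[rows[p, k], k] for every path p and step k.
    return tree[rows, cols]

tree3 = np.array([[400., 500., 625., 781.25],
                  [0., 320., 400., 500.],
                  [0., 0., 256., 320.],
                  [0., 0., 0., 204.8]])
paths = tree_paths(tree3)
print(paths.shape)
```

The number of paths is 2**n, so the output grows quickly with the number of steps.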
In case we have
indice=[0 0 1 1 0 1];
and
X=[0 0 0;0 0 0;5 8 9;10 11 12; 0 0 0; 20 3 4],
I would like to use indice as a mask to drop the zero rows of X and get Xx=[5 8 9;10 11 12; 20 3 4], and then, from Xx, get back to the initial dimensions: newX=[0 0 0;0 0 0;5 8 9;10 11 12; 0 0 0; 20 3 4]
for i in range(3):
    a=X[:,i];
    Xx[:,i]=a[indice];
--and back to initial dimension:
for ii in range(3):
    aa=Xx[:,ii]
    bb[indice]=aa
    newX[:,ii]=bb
Could you please help me solve that with Python?
Using numpy.where makes life much easier.
import numpy as np
X=np.array([[0 ,0 ,0],[0, 0, 0],[5, 8, 9],[10, 11, 12],[ 0, 0 ,0],[ 20, 3, 4]])
index = np.where(X.any(axis=1))[0] # indices of rows that are not all 0s
print(X[index])
#array([[ 5, 8, 9],
# [10, 11, 12],
# [20, 3, 4]])
EDIT:
If you really want to reconstruct it, and based on the fact that you know that you have removed lines with all 0s, then:
Create a new matrix with all 0s:
X_new = np.zeros(X.shape)
and insert the values where they should be:
X_new[index] = X[index]
Now check the X_new:
X_new
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 5., 8., 9.],
[10., 11., 12.],
[ 0., 0., 0.],
[20., 3., 4.]])
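If the indice vector from the question is already available, it can probably be used directly as a boolean mask, both to extract the rows and to scatter them back. A minimal sketch, assuming indice holds one 0/1 flag per row of X:

```python
import numpy as np

indice = np.array([0, 0, 1, 1, 0, 1], dtype=bool)  # one flag per row of X
X = np.array([[0, 0, 0], [0, 0, 0], [5, 8, 9],
              [10, 11, 12], [0, 0, 0], [20, 3, 4]])

# Keep only the flagged rows.
Xx = X[indice]

# Scatter them back into a zero array of the original shape.
newX = np.zeros_like(X)
newX[indice] = Xx

print(Xx)
print(newX)
```

Boolean indexing works on whole rows at once, so the per-column loops from the question are not needed.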
I have a program that created a numpy array and the array is
array([[0.0543275 , 0.51249827, 0.43317423],
[0.07144389, 0.51152126, 0.41703486],
[0.0776112 , 0.48593384, 0.43645496]])
I used the following code to find the maximum in each row, but it is not working for float values:
for row in a:
    maxi = np.argmax(np.max(row, axis=0))
    float(maxi)
    print(maxi)
I want something like this
array([[0 , 1 , 0],
[0 , 1 , 0],
[0 , 1 , 0]])
Upd: it was originally wrong, now this is just the essence of the previous correct answer:
a = np.array([[0.0543275 , 0.51249827, 0.43317423],
[0.07144389, 0.51152126, 0.41703486],
[0.0776112 , 0.48593384, 0.43645496]])
b = np.zeros_like(a)
b[np.arange(a.shape[0]), np.argmax(a, axis=1)] = 1
Since np.argmax() gives us the indices of the max elements, we just use them for indexing directly. Now b contains the desired output:
array([[0., 1., 0.],
[0., 1., 0.],
[0., 1., 0.]])
You can also use b.astype(int) to convert the result to integers.
Here is an option that works:
for e, i in enumerate(a):
    row_max = max(i)  # compute once, before the row is overwritten
    for f, j in enumerate(i):
        if j == row_max:
            a[e][f] = 1
        else:
            a[e][f] = 0
This converts the array in place to the desired form:
<class 'numpy.ndarray'>
[[0. 1. 0.]
[0. 1. 0.]
[0. 1. 0.]]
In [41]: arr = np.array([[0.0543275 , 0.51249827, 0.43317423], [0.07144389, 0.51
...: 152126, 0.41703486], [0.0776112 , 0.48593384, 0.43645496]])
In [42]: arr
Out[42]:
array([[0.0543275 , 0.51249827, 0.43317423],
[0.07144389, 0.51152126, 0.41703486],
[0.0776112 , 0.48593384, 0.43645496]])
The maximum in each row is:
In [47]: np.max(arr, axis=1)
Out[47]: array([0.51249827, 0.51152126, 0.48593384])
Its index within each row is:
In [48]: np.argmax(arr, axis=1)
Out[48]: array([1, 1, 1])
We can map that argmax array onto an array with the same shape with:
In [52]: x = np.zeros(arr.shape, int)
In [53]: x[np.arange(3),_48] = 1
In [54]: x
Out[54]:
array([[0, 1, 0],
[0, 1, 0],
[0, 1, 0]])
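A comparison against the row maximum gives the same result for this data without building index arrays. This is a sketch of a different technique from the argmax one above: if a row contains its maximum more than once, every occurrence is marked, whereas argmax marks only the first.

```python
import numpy as np

arr = np.array([[0.0543275, 0.51249827, 0.43317423],
                [0.07144389, 0.51152126, 0.41703486],
                [0.0776112, 0.48593384, 0.43645496]])

# keepdims=True keeps the maxima as a column so they broadcast across each row.
one_hot = (arr == arr.max(axis=1, keepdims=True)).astype(int)
print(one_hot)
```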
I have a DataFrame with columns for the x, y, z coordinates and the value at this position and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").to_numpy()
However, this method can not be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting row/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at this point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1 m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the location. However, they need to be in the correct order.
EDIT: I just ran a quick performance test:
jakevdp's solution takes 1.598 s, Divikar's solution takes 7.405 s, JohnE's solution takes 7.867 s and Wen's solution takes 6.286 s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)
# fill it using NumPy's advanced indexing
# (MultiIndex.labels was renamed to .codes in pandas 0.24)
arr[tuple(grouped.index.codes)] = grouped.values.flat
print(arr)
# [[[ 1. 2. nan]
# [ 3. nan 4.]]
#
# [[ 5. 6. 7.]
# [ 8. 9. nan]]]
Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    arr = df[['z','y','x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0)+1
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    avgs = np.bincount(ids, val, minlength=L)/np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning: places with no x,y,z triplet have zero counts, so their average is 0/0 = NaN. Since that is the expected output for those places, you can ignore the warning there. To avoid the warning, we can employ indexing, as discussed in the second method (Alternative method).
Sample run -
In [106]: df
Out[106]:
value x y z
0 1 1 1 1 # <=== this is repeated
1 2 2 1 1
2 3 1 2 1
3 4 3 2 1
4 5 1 1 2
5 6 2 1 2
6 7 3 1 2
7 8 1 2 2
8 9 2 2 2
9 4 1 1 1 # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5, 2. , nan],
[ 3. , nan, 4. ]],
[[ 5. , 6. , 7. ],
[ 8. , 9. , nan]]])
Alternative method
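To avoid the warning, an alternative way would be like so (reusing out_shp, ids and val from the function above) -
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
# write averages only at the flat positions that actually occur
out.flat[unq_ids] = sums[unq_ids] / count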
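The warning-free variant can also be packaged as a standalone function. The following is a minimal sketch that takes plain integer coordinates and values instead of a DataFrame; the name grid_average and its signature are made up for illustration, and the sample data is the one from the sample run above, including the repeated (x=1, y=1, z=1) point.

```python
import numpy as np

def grid_average(coords, values):
    """Hypothetical helper: average `values` on an integer grid.
    `coords` has one row per sample and one column per axis (here z, y, x)."""
    coords = np.asarray(coords)
    coords = coords - coords.min(axis=0)       # shift so every axis starts at 0
    out_shp = tuple(coords.max(axis=0) + 1)
    ids = np.ravel_multi_index(tuple(coords.T), out_shp)
    sums = np.bincount(ids, weights=values)    # per-cell sums
    unq_ids, count = np.unique(ids, return_counts=True)
    out = np.full(out_shp, np.nan)             # cells without data stay NaN
    out.flat[unq_ids] = sums[unq_ids] / count  # divide only where count > 0
    return out

x = [1, 2, 1, 3, 1, 2, 3, 1, 2, 1]
y = [1, 1, 2, 2, 1, 1, 1, 2, 2, 1]
z = [1, 1, 1, 1, 2, 2, 2, 2, 2, 1]
value = [1, 2, 3, 4, 5, 6, 7, 8, 9, 4]

result = grid_average(np.column_stack([z, y, x]), value)
print(result)
```

Since the shape is computed from the coordinate columns, the same function should work for any number of dimensions, as long as the coordinates are integers.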
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
Note that the xrTensor object is very handy since xarray's DataArrays contain the labels, so you may just go on with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1., 5.],
[ 3., 8.]],
[[ 2., 6.],
[nan, 9.]],
[[nan, 7.],
[ 4., nan]]]])
Coordinates:
* dim_1 (dim_1) object 'value'
* x (x) int64 1 2 3
* y (y) int64 1 2
* z (z) int64 1 2
We can use stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))
Out[451]:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])