I have the following array:
a = [[40.5, 23.4],
[175.9, 20.2],
[21.4, 24.0],
[130.3, 18.4],
[6.3, 25.7],
[73.4, 21.5],
[16.6, 25.7],
[125.9, 19.1],
[41.4, 24.7],
[180.6, 16.4],
[13.6, 24.4],
[103.2, 19.0],
[3.2, 24.7],
[55.9, 23.1],
[208.8, 20.4]]
I need to add rows with zeroes to have the final array as:
b = [[40.5, 23.4],
[175.9, 20.2],
[0., 0.],
[21.4, 24.0],
[130.3, 18.4],
[0., 0.],
[6.3, 25.7],
[73.4, 21.5],
[0., 0.],
[16.6, 25.7],
[125.9, 19.1],
[0., 0.],
[41.4, 24.7],
[0., 0.],
[0., 0.],
[180.6, 16.4],
[0., 0.],
[0., 0.],
[13.6, 24.4],
[103.2, 19.0],
[0., 0.],
[3.2, 24.7],
[55.9, 23.1],
[208.8, 20.4]]
In summary, what I need is to add rows at specific indices, but the number of rows to add is not constant. In this example (please see image), I need to add enough rows so that each key matches the maximum number of rows per key. I don't care about the keys in my code, but I need to somehow "normalise" the array so that I'll have the same number of rows for each key.
Array Details
Here's a sample of the list of indices: [[ 2 1] [ 4 1] [ 6 1] [ 8 2] [ 9 2] [10 1] [12 0]]
I've tried np.insert, np.concatenate, advanced indexing, etc., but could not come up with a solution.
Any ideas how to solve this?
import numpy as np
# As it seems to me the indices you provided don't conform to the data,
# here is just an example list of [index, count] pairs. Substitute your own.
# First I transform this list to a form that can be fed to np.insert():
indices = [[0, 2], [2, 3]]
tmp = [[idx] * count for idx, count in indices]
indices_flat = []
for elem in tmp:
    for item in elem:
        indices_flat.append(item)
print(indices_flat)
# substitute with your array
a = np.array([[1, 2], [3, 4], [5, 6]])
# the main insertion: one zero row per entry in indices_flat
a_inserted = np.insert(a, indices_flat, [0, 0], axis=0)
print(a_inserted)
prints:
[0, 0, 2, 2, 2]
[[0 0]
[0 0]
[1 2]
[3 4]
[0 0]
[0 0]
[0 0]
[5 6]]
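As an aside, the flattening step can also be done in one call with np.repeat; a minimal sketch using the asker's sample index list (the [12, 0] pair contributes no insertions):
idx = np.array([[2, 1], [4, 1], [6, 1], [8, 2], [9, 2], [10, 1], [12, 0]])
indices_flat = np.repeat(idx[:, 0], idx[:, 1])
print(indices_flat)  # [ 2  4  6  8  8  9  9 10]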
Here's a NumPy-based approach
def insert_n_zeros_at(a, i):
    # which indices have more than one zero row inserted?
    m = i[:, 1] > 1
    # NaN-filled array; rows with count > 1 get their index value replicated
    # (this mask construction assumes counts of at most 2, as in the example)
    ix = np.full((i.shape[0], i[:, 1].max()), np.nan)
    ix[:, :1] = i[:, :1]
    ix.ravel()[np.stack((m, m)).ravel('F')] = np.repeat(i[m, 0], i[m, 1])
    # per-row offsets turn the replicated values into distinct target indices
    ix += np.arange(ix.shape[1])
    # account for the index shift caused by rows inserted earlier
    cs = i[:, 1].cumsum()
    ix[1:] += cs[:-1, None]
    # flatten to a 1-D array of actual indices
    ix = ix[~np.isnan(ix)]
    # cs[-1] is the total number of zero rows to insert; it sizes the output
    out = np.zeros((a.shape[0] + cs[-1], a.shape[1]))
    r = np.arange(out.shape[0])
    # assign a everywhere we don't have an index of a zero row
    out[r[~np.isin(r, ix)]] = a
    return out
For the shared example, we get:
i = np.array([[ 2, 1], [ 4, 1], [ 6, 1], [ 8, 2], [ 9, 2], [10, 1]])
insert_n_zeros_at(a, i)
array([[ 40.5, 23.4], # 0
[175.9, 20.2], # 1
[ 0. , 0. ], # <- 1 zero at 2
[ 21.4, 24. ], # 2
[130.3, 18.4], # 3
[ 0. , 0. ], # <- 1 zero at 4
[ 6.3, 25.7], # 4
[ 73.4, 21.5], # 5
[ 0. , 0. ], # <- 1 zero at 6
[ 16.6, 25.7], # 6
[125.9, 19.1], # 7
[ 0. , 0. ], # <- 2 zeros at 8
[ 0. , 0. ],
[ 41.4, 24.7], # 8
[ 0. , 0. ], # <- 2 zeros at 9
[ 0. , 0. ],
[180.6, 16.4], # 9
[ 0. , 0. ], # <- 1 zero at 10
[ 13.6, 24.4], # 10
[103.2, 19. ], # 11
[ 3.2, 24.7], # 12
[ 55.9, 23.1], # 13
[208.8, 20.4]])
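For what it's worth, np.insert also accepts repeated indices directly, so the result can be cross-checked against it; a quick sanity check, assuming the same a and i as above:
check = np.insert(a, np.repeat(i[:, 0], i[:, 1]), 0, axis=0)
print(np.array_equal(insert_n_zeros_at(a, i), check))  # True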
In case we have
indice = [0 0 1 1 0 1]
and
X = [0 0 0; 0 0 0; 5 8 9; 10 11 12; 0 0 0; 20 3 4],
I would like to use indice to mask the zero rows in X and get Xx = [5 8 9; 10 11 12; 20 3 4], and then from Xx go back to the initial dimensions: newX = [0 0 0; 0 0 0; 5 8 9; 10 11 12; 0 0 0; 20 3 4].
for i in range(3):
    a = X[:, i]
    Xx[:, i] = a[indice]
and back to the initial dimensions:
for ii in range(3):
    aa = Xx[:, ii]
    bb[indice] = aa
    newX[:, ii] = bb
Could you please help me solve this with Python?
Using numpy.where, life is much easier.
X = np.array([[0, 0, 0], [0, 0, 0], [5, 8, 9], [10, 11, 12], [0, 0, 0], [20, 3, 4]])
index = np.where(X.any(axis=1))[0]  # find rows that are not all zeros
print(X[index])
#array([[ 5, 8, 9],
# [10, 11, 12],
# [20, 3, 4]])
EDIT:
If you really want to reconstruct it, and given that you know you removed the rows that were all 0s, then:
Create a new matrix with all 0s:
X_new = np.zeros(X.shape)
and insert the values where they should be:
X_new[index] = X[index]
Now check X_new:
X_new
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 5., 8., 9.],
[10., 11., 12.],
[ 0., 0., 0.],
[20., 3., 4.]])
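Alternatively, you can drive both steps with the indice vector from the question instead of deriving the mask from X; a minimal sketch, assuming indice is the 0/1 vector shown there:
mask = np.array([0, 0, 1, 1, 0, 1], dtype=bool)  # the asker's indice
Xx = X[mask]                  # keep only the flagged rows
newX = np.zeros_like(X)       # back to the initial dimensions, filled with 0s
newX[mask] = Xx               # put the kept rows back in place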
Assume that I have two arrays
>>> import numpy as np
>>> a = np.random.randint(0, 10, size=(5, 4))
>>> a
array([[1, 6, 7, 4],
[2, 7, 4, 2],
[9, 3, 6, 4],
[9, 6, 8, 2],
[7, 2, 9, 5]])
>>> b = np.random.randint(0, 10, size=(5, 4)).astype(np.float32)
>>> b
array([[ 5., 8., 6., 5.],
[ 1., 8., 4., 8.],
[ 1., 4., 6., 3.],
[ 4., 8., 6., 4.],
[ 8., 7., 7., 5.]], dtype=float32)
Now I have a situation where I need to compare the elements of each array and replace them with known values. For example, my conditions are:
if a == 0 then replace with 0 (or) if b == 0 then replace with 0
if a > 4 and < 11 then replace with 1 (or) if b > 1 and < 3 then replace with 1
if a > 10 and < 18 then replace with 2 (or) if b > 2 and < 5 then replace with 2
.
.
.
and finally
if a > 40 then replace with 9 (or) if b > 9 then replace with 9.
The replaced values can be stored in a new array, which I need to use for another function.
The simplest form of element-wise comparison, like a[a > 2] = 1, works. But I am not aware of how to do multiple such comparisons (multiple times) with the same method.
I am sure there is an easy way to do this in numpy which I am unable to find. Any help is appreciated.
np.digitize should do what you want. The first argument is the values you want to replace and the second is the thresholds.
a_replace = np.digitize(a, [0, 4, 10, ..., 40], right=True)
b_replace = np.digitize(b, [0, 1, 2, ..., 9], right=True)
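For a concrete feel, here is a minimal runnable sketch with hypothetical, fully spelled-out edges standing in for the "..." above; with right=True, a value x maps to the index i for which bins[i-1] < x <= bins[i]:
import numpy as np

a = np.array([[1, 6, 7, 4],
              [2, 7, 4, 2]])
bins_a = [0, 4, 10, 17, 25, 40]  # hypothetical edges; substitute your real thresholds
print(np.digitize(a, bins_a, right=True))
# [[1 2 2 1]
#  [1 2 1 1]]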
I have a DataFrame with columns for the x, y, z coordinates and the value at this position and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").values
However, this method cannot be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting rows/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at this point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1 m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the locations, but they need to be in the correct order.
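A sketch of that rounding step, assuming hypothetical column names and a 0.1 m grid; dividing by the precision and converting to integers turns the coordinates into consecutive grid indices:
precision = 0.1  # metres; an assumption for illustration
for c in ['x', 'y', 'z']:
    df[c] = (df[c] / precision).round().astype(int)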
EDIT: I just ran a quick performance test:
The solution by jakevdp takes 1.598 s, Divakar's solution takes 7.405 s, JohnE's solution takes 7.867 s and Wen's solution takes 6.286 s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing
arr[grouped.index.labels] = grouped.values.flat
print(arr)
# [[[ 1. 2. nan]
# [ 3. nan 4.]]
#
# [[ 5. 6. 7.]
# [ 8. 9. nan]]]
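One caveat: on pandas 0.24 and later, MultiIndex.labels was renamed to MultiIndex.codes, so the advanced-indexing line becomes:
arr[tuple(grouped.index.codes)] = grouped.values.flat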
Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    # integer grid coordinates, shifted so they start at 0
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1
    L = np.prod(out_shp)
    val = df['value'].values
    # flat index of each (z, y, x) triplet into the output grid
    ids = np.ravel_multi_index(arr.T, out_shp)
    # sum of values per cell divided by count per cell gives the average
    avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning because places with no x, y, z triplets have zero counts, so the average there works out to 0/0 = NaN; since that's the expected output for those places, you can ignore the warning. To avoid it, we can employ indexing, as discussed in the second method (Alternative method) below.
Sample run -
In [106]: df
Out[106]:
value x y z
0 1 1 1 1 # <=== this is repeated
1 2 2 1 1
2 3 1 2 1
3 4 3 2 1
4 5 1 1 2
5 6 2 1 2
6 7 3 1 2
7 8 1 2 2
8 9 2 2 2
9 4 1 1 1 # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5, 2. , nan],
[ 3. , nan, 4. ]],
[[ 5. , 6. , 7. ],
[ 8. , 9. , nan]]])
Alternative method
To avoid the warning, an alternative way would be like so -
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
# write the averages only where data exists; empty cells stay NaN
out.flat[unq_ids] = sums[unq_ids] / count
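Wrapped up as a self-contained variant (a sketch; the _nowarn name is mine, and the index computation is reused from dataframe_to_array_averaged above):
def dataframe_to_array_averaged_nowarn(df):
    # hypothetical variant name; same indexing as the original function
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1
    ids = np.ravel_multi_index(arr.T, out_shp)
    val = df['value'].values
    out = np.full(out_shp, np.nan)
    sums = np.bincount(ids, val)
    unq_ids, count = np.unique(ids, return_counts=True)
    out.flat[unq_ids] = sums[unq_ids] / count  # average only where data exists
    return out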
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
Note that the xrTensor object is very handy since xarray's DataArrays contain the labels, so you may just go on with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1., 5.],
[ 3., 8.]],
[[ 2., 6.],
[nan, 9.]],
[[nan, 7.],
[ 4., nan]]]])
Coordinates:
* dim_1 (dim_1) object 'value'
* x (x) int64 1 2 3
* y (y) int64 1 2
* z (z) int64 1 2
We can use stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))
Out[451]:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
I am trying to use scipy.stats.binned_statistic_dd and I can't for the life of me figure out the outputs. Does anyone have any advice here?
Look at this simple sample program:
import scipy
import scipy.stats  # stats is not imported by "import scipy" alone
scipy.__version__
# '0.14.0'
import numpy as np
print(scipy.stats.binned_statistic_dd([np.ones(10), np.ones(10)], np.arange(10), 'count', bins=3))
#(array([[ 0., 0., 0.],
# [ 0., 10., 0.],
# [ 0., 0., 0.]]),
# [array([ 0.5 , 0.83333333, 1.16666667, 1.5 ]),
# array([ 0.5 , 0.83333333, 1.16666667, 1.5 ])],
# array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12]))
So the documentation claims the outputs are:
statistic : ndarray, shape(nx1, nx2, nx3,...)
    The values of the selected statistic in each two-dimensional bin
edges : list of ndarrays
    A list of D arrays describing the (nxi + 1) bin edges for each dimension
binnumber : 1-D ndarray of ints
    This assigns to each observation an integer that represents the bin in which this observation falls. Array has the same length as values.
In the example the statistic makes good sense: I asked for the 'count' and got 10, and there are 10 elements, all in that same bin. The edges make good sense too: the data to be binned was 2-dimensional and I wanted 3 bins, so I got 4 edges per dimension, which is reasonable.
Then the binnumber output makes no sense to me at all: array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12]). There are indeed 10 numbers, the same length as the data inputted (np.arange(10)), but the number 12 makes no sense. What am I missing? 12 is not an unraveled index over the bins turned into a multi-D array; since there are 3 bins in each dimension I could see numbers up to 9. What is 12 telling me?
The values in binnumbers are an unraveled index of bins that include an extra
set of "out of range" bins.
In this example,
In [40]: hst, edges, binnumbers = binned_statistic_dd([np.ones(10), np.ones(10)], None, 'count', bins=3)
In [41]: hst
Out[41]:
array([[ 0., 0., 0.],
[ 0., 10., 0.],
[ 0., 0., 0.]])
the bins are numbered as follows:
0 | 1 | 2 | 3 | 4
-----+-----+-----+-----+-----
5 | 6 | 7 | 8 | 9
-----+-----+-----+-----+-----
10 | 11 | 12 | 13 | 14
-----+-----+-----+-----+-----
15 | 16 | 17 | 18 | 19
-----+-----+-----+-----+-----
20 | 21 | 22 | 23 | 24
The "out of range" bins are not included in hst; the data in hst corresponds to bin numbers
6, 7, 8, 11, 12, 13, 16, 17 and 18. That's why all the values in bincounts are 12:
In [42]: binnumbers
Out[42]: array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12])
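If you want per-dimension bin indices instead of the raveled number, you can unravel them yourself (newer SciPy versions can also return them directly via expand_binnumbers=True); a small sketch:
# each dimension of the full grid has bins + 2 cells, because of the
# out-of-range bins on either side
rows, cols = np.unravel_index(binnumbers, (3 + 2, 3 + 2))
# rows and cols are all 2 here: row 2, column 2 of the 5x5 grid above
inner = (rows - 1, cols - 1)  # subtract 1 to index into the 3x3 hst array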
You can use the range argument to force the counts into the outer bins. For example,
by setting the ranges of the coordinates to [2, 3] and [0, 0.5], all the values in the
first coordinate lie to the left of their range and all the values in the second coordinate
lie to the right of their range, so all the points end up in the upper-right outer bin,
which is bin index 4:
In [51]: binned_statistic_dd([np.ones(10), np.ones(10)], None, 'count', bins=3, range=[[2,3],[0,0.5]])
Out[51]:
(array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]]),
[array([ 2. , 2.33333333, 2.66666667, 3. ]),
array([ 0. , 0.16666667, 0.33333333, 0.5 ])],
array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]))