Mask zero values in matrix and reconstruct original matrix using indices - python

Suppose we have
indice=[0 0 1 1 0 1];
and
X=[0 0 0;0 0 0;5 8 9;10 11 12; 0 0 0; 20 3 4],
I would like to use indice to mask the zero rows in X and get Xx=[5 8 9;10 11 12; 20 3 4], and then, from Xx, go back to the initial dimensions: newX=[0 0 0;0 0 0;5 8 9;10 11 12; 0 0 0; 20 3 4]
for i in range(3):
    a=X[:,i];
    Xx[:,i]=a[indice];
--and back to the initial dimensions:
for ii in range(3):
    aa=Xx[:,ii]
    bb[indice]=aa
    newX[:,ii]=bb
Could you please help me solve this with Python?

Using numpy.where makes this much easier.
import numpy as np
X=np.array([[0 ,0 ,0],[0, 0, 0],[5, 8, 9],[10, 11, 12],[ 0, 0 ,0],[ 20, 3, 4]])
index = np.where(X.any(axis=1))[0] # indices of rows with at least one non-zero value
print(X[index])
#array([[ 5, 8, 9],
#       [10, 11, 12],
#       [20, 3, 4]])
EDIT:
If you really want to reconstruct it, and given that you know you removed only the rows that were all 0s, then:
Create a new matrix with all 0s:
X_new = np.zeros(X.shape)
and insert the values where they should be:
X_new[index] = X[index]
Now check the X_new:
X_new
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 5., 8., 9.],
[10., 11., 12.],
[ 0., 0., 0.],
[20., 3., 4.]])
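If the mask is already available as a 0/1 vector, as in the question, the same round trip can be sketched with boolean indexing. This is a minimal sketch, assuming indice is converted to a boolean array and that the removed rows should come back as zeros; the names Xx and newX are taken from the question.

```python
import numpy as np

X = np.array([[0, 0, 0], [0, 0, 0], [5, 8, 9],
              [10, 11, 12], [0, 0, 0], [20, 3, 4]])

# Assumption: indice marks the rows to keep (1) and the rows to drop (0).
indice = np.array([0, 0, 1, 1, 0, 1], dtype=bool)

# Forward step: keep only the rows where the mask is True.
Xx = X[indice]

# Backward step: start from zeros and write the kept rows back in place.
newX = np.zeros_like(X)
newX[indice] = Xx

print(Xx)
print(np.array_equal(newX, X))
```

Using np.zeros_like keeps the integer dtype of X, whereas np.zeros(X.shape) gives a float array.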

Related

How to efficiently filter maximum elements of a matrix per row

Given a 2D array, I'm looking for a pythonic way to get an array of same shape, with only the maximum element per each row.
See max_row_filter function below
def max_row_filter(mat2d):
    m = np.zeros(mat2d.shape)
    for r in range(mat2d.shape[0]):
        c = np.argmax(mat2d[r])
        m[r,c]=mat2d[r,c]
    return m
p = np.array([[1,2,3],[5,4,3,],[9,10,3]])
max_row_filter(p)
Out: array([[ 0., 0., 3.],
       [ 5., 0., 0.],
       [ 0., 10., 0.]])
I'm looking for an efficient way to do this, suitable to be done on big arrays.
Alternative answer (this will keep duplicates):
p * (p==p.max(axis=1, keepdims=True))
If there are no duplicates, you could use numpy.argmax:
import numpy as np
p = np.array([[1, 2, 3],
[5, 4, 3, ],
[9, 10, 3]])
result = np.zeros_like(p)
rows, cols = zip(*enumerate(np.argmax(p, axis=1)))
result[rows, cols] = p[rows, cols]
print(result)
Output
[[ 0 0 3]
[ 5 0 0]
[ 0 10 0]]
Note that when the maximum occurs more than once in a row, argmax returns the index of the first occurrence.
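To make the difference between the two answers concrete, here is a small sketch on a row that contains a tied maximum. The input array is made up for illustration, and np.arange is used for the row indices instead of zip(*enumerate(...)); the result is the same.

```python
import numpy as np

# Hypothetical input: the first row has its maximum (3) twice.
p = np.array([[1, 3, 3],
              [5, 4, 3],
              [9, 10, 3]])

# Mask-based version: every element equal to the row maximum is kept.
keep_all = p * (p == p.max(axis=1, keepdims=True))

# argmax-based version: only the first maximum of each row is kept.
rows = np.arange(p.shape[0])
cols = p.argmax(axis=1)
keep_first = np.zeros_like(p)
keep_first[rows, cols] = p[rows, cols]

print(keep_all)
print(keep_first)
```

Both versions avoid the Python-level loop over rows, so they should scale better to big arrays than the original function.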

Python : Changing MxN array into NxM [duplicate]

matrix = []
n = int(input("n: "))
m = int(input("m: "))
for i in range(m):
    data = input()
    data_list = data.split()
    data_list = [int(i) for i in data_list]
    matrix.append(data_list)
I wrote this Python code to read integers into an MxN array.
I want to change it into an Nx(M+1) array:
change array[m][n] into array[n][m]
and put 0 in array[][m+1]
for example:
n : 4
m : 3
Input integers:
1 2 3 4
5 6 7 8
9 10 11 12
Turns into:
1 5 9 0
2 6 10 0
3 7 11 0
4 8 12 0
How can I change the code to do this?
I tried
for i in range(m):
    for j in range(n):
        matrix[i][j] = matrix[j][i]
but this is the wrong way to do it.
matrix = [
    [ 1, 2, 3, 4 ],
    [ 5, 6, 7, 8 ],
    [ 9, 10, 11, 12 ]
]
def change(matrix):
    m = len(matrix)
    n = len(matrix[0])
    result = [[] for i in range(n)]
    for i in range(m+1):
        for j in range(n):
            if i == m:
                result[j].append(0)
            else:
                result[j].append(matrix[i][j])
    return result
changed = change(matrix)
print(changed)
To solve your problem, get acquainted with NumPy.
import numpy as np
t1 = np.arange(1, 13).reshape(3, 4)
creates your source table:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
Then you should transpose it:
t2 = t1.T
which gives:
array([[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11],
[ 4, 8, 12]])
And finally:
np.c_[ t2, np.zeros(4) ]
adds a column of 4 zeroes, giving the final result:
array([[ 1., 5., 9., 0.],
[ 2., 6., 10., 0.],
[ 3., 7., 11., 0.],
[ 4., 8., 12., 0.]])
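For plain nested lists, the same result can also be sketched without NumPy. This is a minimal sketch, assuming every input row has the same length; it keeps the values as integers, unlike np.zeros(4), which turns the whole result into floats.

```python
matrix = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

# zip(*matrix) yields the columns of the original matrix as tuples;
# each column becomes a row of the result, with a trailing 0 appended.
changed = [list(column) + [0] for column in zip(*matrix)]

for row in changed:
    print(*row)
```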

Replace values based on multiple conditions of two array?

Assume that I have two arrays
>>> import numpy as np
>>> a = np.random.randint(0, 10, size=(5, 4))
>>> a
array([[1, 6, 7, 4],
[2, 7, 4, 2],
[9, 3, 6, 4],
[9, 6, 8, 2],
[7, 2, 9, 5]])
>>> b = np.random.randint(0, 10, size=(5, 4))
>>> b
array([[ 5., 8., 6., 5.],
[ 1., 8., 4., 8.],
[ 1., 4., 6., 3.],
[ 4., 8., 6., 4.],
[ 8., 7., 7., 5.]], dtype=float32)
Now I need to compare the elements of each array and replace them with known values. For example, my conditions are
if a == 0 then replace with 0 (or) if b == 0 then replace with 0
if a > 4 and < 11 then replace with 1 (or) if b > 1 and < 3 then replace with 1
if a > 10 and < 18 then replace with 2 (or) if b > 2 and < 5 then replace with 2
.
.
.
and finally
if a > 40 replace with 9 (or) if b > 9 then replace with 9.
The replaced values can be stored in a new array, which I need to use in another function.
The simplest form of element-wise comparison, like a[ a > 2 ] = 1, works. But I do not know how to apply multiple comparisons (multiple times) with the same method.
I am sure there is an easy way to do this in numpy, but I am unable to find it. Any help is appreciated.
np.digitize should do what you want. The first argument is the array of values you want to replace and the second is the list of thresholds.
a_replace = np.digitize(a, [0, 4, 10, ..., 40], right=True)
b_replace = np.digitize(b, [0, 1, 2, ..., 9], right=True)
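To see how right=True treats values that sit exactly on a threshold, here is a small sketch. The thresholds and input values below are made up for illustration and are not the full lists elided above.

```python
import numpy as np

# Hypothetical thresholds and values, chosen to hit every edge case.
thresholds = [0, 4, 10, 40]
a = np.array([0, 3, 4, 5, 10, 11, 40, 41])

# With right=True, a value x gets index i when
# thresholds[i-1] < x <= thresholds[i]; values above the last
# threshold get len(thresholds).
a_replace = np.digitize(a, thresholds, right=True)

print(a_replace)
```

So 0 maps to 0, values in (0, 4] map to 1, values in (4, 10] map to 2, and so on; the thresholds need to be adjusted if the class boundaries should fall elsewhere.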

Convert a Pandas DataFrame to a multidimensional ndarray

I have a DataFrame with columns for the x, y, z coordinates and the value at this position and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").values
However, this method can not be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting row/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at this point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the location. However, they need to be in the correct order.
EDIT: I just ran a quick performance test:
The solution of jakevdp takes 1.598s, Divikar's solution takes 7.405s, JohnE's solution takes 7.867s and Wen's solution takes 6.286s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)
# fill it using NumPy's advanced indexing
# (index.codes was named index.labels before pandas 0.24)
arr[tuple(grouped.index.codes)] = grouped.values
print(arr)
# [[[ 1. 2. nan]
# [ 3. nan 4.]]
#
# [[ 5. 6. 7.]
# [ 8. 9. nan]]]
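The groupby approach only creates positions for coordinate values that actually occur. For the bonus requirement of consecutive coordinates, one possible sketch is to reindex the grouped series against the full grid before reshaping. This assumes integer coordinates; the function name dataframe_to_dense and the small sample frame are hypothetical.

```python
import numpy as np
import pandas as pd

def dataframe_to_dense(df, dims=("z", "y", "x"), value="value"):
    """Average duplicates and return a dense array with NaN for gaps."""
    dims = list(dims)
    grouped = df.groupby(dims)[value].mean()
    # One consecutive integer axis per dimension, from min to max.
    axes = [range(int(df[d].min()), int(df[d].max()) + 1) for d in dims]
    full_index = pd.MultiIndex.from_product(axes, names=dims)
    shape = tuple(len(axis) for axis in axes)
    # reindex inserts NaN for every grid point that has no data.
    return grouped.reindex(full_index).to_numpy().reshape(shape)

# Hypothetical data: x=2 never occurs, and (x=1, y=1, z=1) occurs twice.
df = pd.DataFrame({"x": [1, 3, 1],
                   "y": [1, 1, 1],
                   "z": [1, 1, 1],
                   "value": [1.0, 2.0, 3.0]})
dense = dataframe_to_dense(df)
print(dense)
```

Because the grid is built from the dims list, the same function should work for more than three dimensions as long as the full grid fits in memory.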
Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    arr = df[['z','y','x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0)+1
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    avgs = np.bincount(ids, val, minlength=L)/np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning: places with no x, y, z triplet have zero counts, so their average is 0/0 = NaN. Since NaN is the expected output for those places, you can ignore the warning. To avoid the warning, we can employ indexing, as discussed in the second method (Alternative method).
Sample run -
In [106]: df
Out[106]:
value x y z
0 1 1 1 1 # <=== this is repeated
1 2 2 1 1
2 3 1 2 1
3 4 3 2 1
4 5 1 1 2
5 6 2 1 2
6 7 3 1 2
7 8 1 2 2
8 9 2 2 2
9 4 1 1 1 # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5, 2. , nan],
[ 3. , nan, 4. ]],
[[ 5. , 6. , 7. ],
[ 8. , 9. , nan]]])
Alternative method
To avoid the warning, an alternative is to fill only the positions that actually have data, like so -
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
out.flat[unq_ids] = sums[unq_ids] / count
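The alternative relies on out_shp, ids and val from inside the function above. A self-contained sketch of the same idea might look as follows; it assumes integer coordinates given as an (n, 3) array in z, y, x order, divides only at positions that have data, and the function name coords_to_array_averaged is hypothetical.

```python
import numpy as np

def coords_to_array_averaged(coords, values):
    """Average values per integer coordinate; NaN where no data exists."""
    coords = np.asarray(coords)
    coords = coords - coords.min(axis=0)      # shift so indices start at 0
    out_shape = tuple(coords.max(axis=0) + 1)
    # Flatten each (z, y, x) triplet into a single linear index.
    ids = np.ravel_multi_index(tuple(coords.T), out_shape)
    sums = np.bincount(ids, weights=values)
    unq_ids, counts = np.unique(ids, return_counts=True)
    out = np.full(out_shape, np.nan)
    # Divide only where at least one value exists, so no 0/0 occurs.
    out.flat[unq_ids] = sums[unq_ids] / counts
    return out

# Same data as the sample run: (z, y, x) = (1, 1, 1) appears twice.
zyx = [[1, 1, 1], [1, 1, 2], [1, 2, 1], [1, 2, 3], [2, 1, 1],
       [2, 1, 2], [2, 1, 3], [2, 2, 1], [2, 2, 2], [1, 1, 1]]
vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 4]
result = coords_to_array_averaged(zyx, vals)
print(result)
```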
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
Note that the xrTensor object is very handy, since xarray's DataArrays contain the labels, so you may just go on with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1., 5.],
[ 3., 8.]],
[[ 2., 6.],
[nan, 9.]],
[[nan, 7.],
[ 4., nan]]]])
Coordinates:
* dim_1 (dim_1) object 'value'
* x (x) int64 1 2 3
* y (y) int64 1 2
* z (z) int64 1 2
We can use stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))
Out[451]:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])

Output in scipy.stats.binned_statistic_dd()

I am trying to use scipy.stats.binned_statistic_dd and I can't for the life of me figure out the outputs. Does anyone have any advice here?
Look at this simple sample program:
import scipy.stats
scipy.__version__
# '0.14.0'
import numpy as np
print(scipy.stats.binned_statistic_dd([np.ones(10), np.ones(10)], np.arange(10), 'count', bins=3))
#(array([[ 0., 0., 0.],
# [ 0., 10., 0.],
# [ 0., 0., 0.]]),
# [array([ 0.5 , 0.83333333, 1.16666667, 1.5 ]),
# array([ 0.5 , 0.83333333, 1.16666667, 1.5 ])],
# array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12]))
So the documentation claims the outputs are:
statistic : ndarray, shape(nx1, nx2, nx3,...)
    The values of the selected statistic in each two-dimensional bin
edges : list of ndarrays
    A list of D arrays describing the (nxi + 1) bin edges for each dimension
binnumber : 1-D ndarray of ints
    This assigns to each observation an integer that represents the bin in which this observation falls. Array has the same length as values.
In the example, the statistic makes good sense: I asked for the 'count' and got 10, and there are 10 elements, all in the same bin. The edges make sense too: the data to be binned was 2-dimensional and I wanted 3 bins, so I got out 4 reasonable edges per dimension.
But the binnumber makes no sense to me at all: array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12]). There are indeed 10 numbers, the same length as the values passed in, np.arange(10), but the number 12 makes no sense. What am I missing? 12 is not a flat index into the bins viewed as a multi-dimensional array: since there are 3 bins in each dimension, I would expect numbers up to 9. What is 12 telling me?
The values in binnumber are raveled (flattened) indices into a grid of bins that includes an extra
set of "out of range" bins.
In this example,
In [40]: hst, edges, bincounts = binned_statistic_dd([np.ones(10), np.ones(10)], None, 'count', bins=3)
In [41]: hst
Out[41]:
array([[ 0., 0., 0.],
[ 0., 10., 0.],
[ 0., 0., 0.]])
the bins are numbered as follows:
0 | 1 | 2 | 3 | 4
-----+-----+-----+-----+-----
5 | 6 | 7 | 8 | 9
-----+-----+-----+-----+-----
10 | 11 | 12 | 13 | 14
-----+-----+-----+-----+-----
15 | 16 | 17 | 18 | 19
-----+-----+-----+-----+-----
20 | 21 | 22 | 23 | 24
The "out of range" bins are not included in hst; the data in hst corresponds to bin numbers
6, 7, 8, 11, 12, 13, 16, 17 and 18. That's why all the values in bincounts are 12:
In [42]: bincounts
Out[42]: array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12])
You can use the range argument to force the counts into the outer bins. For example,
by setting the ranges of the coordinates to be [2, 3] and [0, 0.5], so all the values in the
first coordinate are left of their range and all the values in the second coordinate are
to the right of their range, all the points end up in the upper right outer bin, which is
bin index 4:
In [51]: binned_statistic_dd([np.ones(10), np.ones(10)], None, 'count', bins=3, range=[[2,3],[0,0.5]])
Out[51]:
(array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]]),
[array([ 2. , 2.33333333, 2.66666667, 3. ]),
array([ 0. , 0.16666667, 0.33333333, 0.5 ])],
array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]))
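Based on the numbering shown above, a flat bin number can be converted back to per-dimension indices with np.unravel_index on the padded grid. This is a minimal sketch that only uses NumPy and assumes the row-major 5 x 5 layout from the diagram (3 bins plus one out-of-range bin on each side per dimension).

```python
import numpy as np

bins_per_dim = (3, 3)
# Each dimension gets two extra "out of range" bins, one on each side.
padded_shape = tuple(n + 2 for n in bins_per_dim)

# The two bin numbers seen in the examples above.
binnumber = np.array([12, 4])

# Per-dimension positions inside the padded 5 x 5 grid.
padded = np.unravel_index(binnumber, padded_shape)

# Subtracting 1 gives indices into hst; -1 or 3 means out of range.
inner = [idx - 1 for idx in padded]

print(padded)
print(inner)
```

Bin 12 maps to (1, 1), which is the cell of hst that holds the count of 10, while bin 4 maps to (-1, 3), which lies outside hst in both dimensions.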
