Output in scipy.stats.binned_statistic_dd()

I am trying to use scipy.stats.binned_statistic_dd and I can't for the life of me figure out the outputs. Does anyone have any advice here?
Look at this simple sample program:
import numpy as np
import scipy
import scipy.stats  # the stats subpackage has to be imported explicitly
scipy.__version__
# '0.14.0'
print(scipy.stats.binned_statistic_dd([np.ones(10), np.ones(10)], np.arange(10), 'count', bins=3))
#(array([[ 0., 0., 0.],
# [ 0., 10., 0.],
# [ 0., 0., 0.]]),
# [array([ 0.5 , 0.83333333, 1.16666667, 1.5 ]),
# array([ 0.5 , 0.83333333, 1.16666667, 1.5 ])],
# array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12]))
So the documentation claims the outputs are:
statistic : ndarray, shape(nx1, nx2, nx3,...)
    The values of the selected statistic in each two-dimensional bin
edges : list of ndarrays
    A list of D arrays describing the (nxi + 1) bin edges for each dimension
binnumber : 1-D ndarray of ints
    This assigns to each observation an integer that represents the bin in which this observation falls. Array has the same length as values.
In the example the statistic makes good sense: I asked for the 'count' and got 10, and there are 10 elements, all in the same bin. The edges make sense too: the data to be binned was 2-dimensional and I asked for 3 bins, so I got out 4 reasonable edges per dimension.
The binnumber, however, makes no sense to me at all: array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12]). There are indeed 10 numbers, the same length as the input values, np.arange(10), but the number 12 makes no sense. What am I missing? 12 is not a flat index over the bins viewed as a multi-dimensional array: since there are 3 bins in each dimension, I would only expect numbers up to 9. What is 12 telling me?

The values in binnumber are flat (raveled) indices into a grid of bins that includes an extra
set of "out of range" bins on each side of every dimension.
In this example,
In [40]: hst, edges, bincounts = binned_statistic_dd([np.ones(10), np.ones(10)], None, 'count', bins=3)
In [41]: hst
Out[41]:
array([[ 0., 0., 0.],
[ 0., 10., 0.],
[ 0., 0., 0.]])
the bins are numbered as follows:
0 | 1 | 2 | 3 | 4
-----+-----+-----+-----+-----
5 | 6 | 7 | 8 | 9
-----+-----+-----+-----+-----
10 | 11 | 12 | 13 | 14
-----+-----+-----+-----+-----
15 | 16 | 17 | 18 | 19
-----+-----+-----+-----+-----
20 | 21 | 22 | 23 | 24
The "out of range" bins are not included in hst; the data in hst corresponds to bin numbers
6, 7, 8, 11, 12, 13, 16, 17 and 18. That's why all the values in bincounts are 12:
In [42]: bincounts
Out[42]: array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12])
You can use the range argument to force the counts into the outer bins. For example,
by setting the ranges of the coordinates to be [2, 3] and [0, 0.5], so all the values in the
first coordinate are left of their range and all the values in the second coordinate are
to the right of their range, all the points end up in the upper right outer bin, which is
bin index 4:
In [51]: binned_statistic_dd([np.ones(10), np.ones(10)], None, 'count', bins=3, range=[[2,3],[0,0.5]])
Out[51]:
(array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]]),
[array([ 2. , 2.33333333, 2.66666667, 3. ]),
array([ 0. , 0.16666667, 0.33333333, 0.5 ])],
array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]))
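As a rough sketch of how these numbers can be decoded (the variable names below are illustrative and not part of SciPy), np.unravel_index can turn a flat bin number back into per-dimension indices, assuming the padded grid has bins + 2 entries along each dimension as in the numbering shown above:

```python
import numpy as np

bins_per_dim = (3, 3)                              # bins=3 in each of the 2 dimensions
padded_shape = tuple(n + 2 for n in bins_per_dim)  # one out-of-range bin on each side: (5, 5)

# Flat bin numbers of the kind returned in the two examples above.
binnumber = np.array([12, 12, 4])

# Position of each observation in the padded 5x5 grid, one array per dimension.
padded_idx = np.unravel_index(binnumber, padded_shape)

# Subtract 1 to get indices into the 3x3 statistic array;
# -1 or 3 means the observation fell outside the range in that dimension.
inner_idx = [idx - 1 for idx in padded_idx]

print([idx.tolist() for idx in padded_idx])  # [[2, 2, 0], [2, 2, 4]]
print([idx.tolist() for idx in inner_idx])   # [[1, 1, -1], [1, 1, 3]]
```

So 12 decodes to padded position (2, 2), which is the centre cell (1, 1) of the statistic array. If your SciPy version supports it, the expand_binnumbers=True argument of binned_statistic_dd should return the per-dimension bin numbers directly.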

Related

Mask zero values in matrix and reconstruct original matrix using indices

Suppose we have
indice=[0 0 1 1 0 1];
and
X=[0 0 0;0 0 0;5 8 9;10 11 12; 0 0 0; 20 3 4],
I would like to use indice to mask out the zero rows of X and get Xx=[5 8 9;10 11 12; 20 3 4], and then, from Xx, get back to the original dimensions: newX=[0 0 0;0 0 0;5 8 9;10 11 12; 0 0 0; 20 3 4]. My attempt, in pseudocode:
for i in range(3):
a=X[:,i];
Xx[:,i]=a[indice];
--and back to initial dimension:
for ii in range(3)
aa=Xx[:,ii]
bb[indice]=aa
newX[:,ii]=bb
Could you please help me solve this with Python?

Using numpy.where makes life much easier.
import numpy as np
X=np.array([[0 ,0 ,0],[0, 0, 0],[5, 8, 9],[10, 11, 12],[ 0, 0 ,0],[ 20, 3, 4]])
index = np.where(X.any(axis=1))[0] # indices of rows with at least one non-zero value
print(X[index])
#array([[ 5, 8, 9],
# [10, 11, 12],
# [20, 3, 4]])
EDIT:
If you really want to reconstruct it, and based on the fact that you know that you have removed lines with all 0s, then:
Create a new matrix with all 0s:
X_new = np.zeros(X.shape)
and insert the values where they should be:
X_new[index] = X[index]
Now check the X_new:
X_new
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 5., 8., 9.],
[10., 11., 12.],
[ 0., 0., 0.],
[20., 3., 4.]])
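If the mask from the question is already available, a minimal sketch of the same round trip with boolean indexing could look like this. It assumes indice marks the rows to keep with 1; the names Xx and newX follow the question:

```python
import numpy as np

X = np.array([[0, 0, 0], [0, 0, 0], [5, 8, 9],
              [10, 11, 12], [0, 0, 0], [20, 3, 4]])
# Mask from the question, converted to booleans (True = keep this row).
indice = np.array([0, 0, 1, 1, 0, 1], dtype=bool)

# Keep only the masked rows.
Xx = X[indice]

# Rebuild an array of the original shape; rows not covered by the mask stay 0.
newX = np.zeros_like(X)
newX[indice] = Xx

print(Xx.tolist())    # [[5, 8, 9], [10, 11, 12], [20, 3, 4]]
print(newX.tolist())  # [[0, 0, 0], [0, 0, 0], [5, 8, 9], [10, 11, 12], [0, 0, 0], [20, 3, 4]]
```

Using np.zeros_like keeps the integer dtype of X, whereas np.zeros(X.shape) produces floats.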

Merge multidimensional NumPy arrays based on first row

I have to work with sensor data (from ROS, specifically, but that should not be relevant). To this end, I have several 2-D numpy arrays, each with one row storing the timestamps and the following rows storing the corresponding sensor data. The problem is that these arrays do not have the same dimensions (different sampling times). I need to merge all of these arrays into a single big one. How can I do so based on the timestamp and, say, replace the missing numbers with 0 or NaN?
Example of my situation:
import numpy as np
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
print(a)
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
print(b)
Which returns output
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]]
[[ 1 3 5 7 9]
[ 35 188 114 153 36]]
What I am looking for is
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]
[ 35 0 188 0 114 0 153 0 36]]
Is there any way to achieve this in an efficient way? This is an example but I am working with thousands of samples. Thanks!

For simple case of one b-matrix
With first row of a storing all possible timestamps and both of those first rows in a and b being sorted, we can use np.searchsorted -
idx = np.searchsorted(a[0],b[0])
out_dtype = np.result_type(a.dtype, b.dtype)
b0 = np.zeros(a.shape[1],dtype=out_dtype)
b0[idx] = b[1]
out = np.vstack((a,b0))
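The question also mentions NaN as a fill value. A minimal sketch of that variant, using the sample values from the question and assuming that every timestamp in b also occurs in the sorted first row of a, could be:

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [51, 9, 117, 174, 164, 60, 95, 197, 30]])
b = np.array([[1, 3, 5, 7, 9],
              [35, 188, 114, 153, 36]])

# Column positions of b's timestamps within a's (sorted) timestamps.
idx = np.searchsorted(a[0], b[0])
# Sanity check for the assumption that all of b's timestamps exist in a.
assert np.array_equal(a[0][idx], b[0])

# NaN requires a floating-point row, so the merged result becomes float as well.
row = np.full(a.shape[1], np.nan)
row[idx] = b[1]
out = np.vstack((a, row))
print(out)
```

The cost of NaN as a marker is that the whole merged array is promoted to float.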
For several b-matrices
Approach #1
To extend to multiple b-matrices, we can follow a similar method with np.searchsorted within a loop, like so -
def merge_arrays(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    for b_i in B:
        idx = np.searchsorted(a[0],b_i[0])
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out
Sample run -
In [175]: a
Out[175]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
In [176]: b0
Out[176]:
array([[16, 22, 34, 56, 67, 91],
[20, 80, 69, 79, 47, 64],
[82, 88, 49, 29, 19, 19]])
In [177]: b1
Out[177]:
array([[ 4, 16, 34, 99],
[28, 34, 0, 0],
[36, 53, 5, 38],
[17, 79, 4, 42]])
In [178]: merge_arrays(a, [b0,b1])
Out[178]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[ 0, 0, 20, 80, 69, 79, 47, 0, 64, 0],
[ 0, 0, 82, 88, 49, 29, 19, 0, 19, 0],
[28, 0, 34, 0, 0, 0, 0, 0, 0, 0],
[36, 0, 53, 0, 5, 0, 0, 0, 0, 38],
[17, 0, 79, 0, 4, 0, 0, 0, 0, 42]])
Approach #2
If looping with np.searchsorted seems to be the bottleneck, we can vectorize that part -
def merge_arrays_v2(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    r0 = [i[0] for i in B]
    r0s = np.concatenate((r0))
    idxs = np.searchsorted(a[0],r0s)
    cols = np.array([i.shape[1] for i in B])
    sp = np.r_[0,cols.cumsum()]
    start,stop = sp[:-1],sp[1:]
    for (b_i,s0,s1) in zip(B,start,stop):
        idx = idxs[s0:s1]
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out

Here's an approach using np.searchsorted:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [118, 105, 86, 94, 69, 17, 142, 46, 54]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [70, 15, 4, 97, 57]])
out = np.vstack([a, np.zeros(a.shape[1])])
out[out.shape[0]-1, np.searchsorted(a[0], b[0])] = b[1]
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[118., 105., 86., 94., 69., 17., 142., 46., 54.],
[ 70., 0., 15., 0., 4., 0., 97., 0., 57.]])
Update - Merging many matrices
Here's an almost fully vectorised approach for a scenario with multiple b matrices. This approach does not require a priori knowledge of which is the largest list:
def merge_timestamps(*x):
    # infer which is the list with maximum length
    # as well as individual lengths
    concat = np.concatenate(*x, axis=1)[0]
    lens = np.r_[np.flatnonzero(np.diff(concat) < 0), len(concat)]
    max_len_list = np.r_[lens[0], np.diff(lens)].argmax()
    # define the output matrix: timestamps and data of the largest
    # array, followed by one row of zeros per remaining array
    A = x[0][max_len_list]
    out = np.vstack([A, np.zeros((len(*x)-1, len(A[0])))])
    others = np.flatnonzero(~np.in1d(np.arange(len(*x)), max_len_list))
    # Update the output matrix with the values of the smaller
    # arrays according to their index. This is of course assuming
    # all values are contained in the largest
    for ix, i in enumerate(others):
        out[-(ix+1), x[0][i][0]-A[0].min()] = x[0][i][1]
    return out
Let's check with the following example:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [107, 13, 123, 119, 137, 135, 65, 157, 83]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b = np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [ 81, 49, 83, 32, 179]])
time3=np.arange(1,4,2)
data3=np.random.randint(200, size=time3.shape)
c=np.array((time3,data3))
# array([[ 1, 3],
# [185, 117]])
merge_timestamps([a,b,c])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])
As mentioned this approach does not require a priori knowledge of which is the largest list, i.e. it would also work with:
merge_timestamps([b, c, a])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])

This is applicable only if the sensor captures data at a fixed interval.
First we create a dataframe with a fixed interval (15 min in this case), then merge this dataframe with the sensor's data.
Code to generate a dataframe with a 15 min interval (copied)
l = (pd.DataFrame(columns=['NULL'],
index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
freq='15T'))
.between_time('07:00','21:00')
.index.strftime('%Y-%m-%dT%H:%M:%SZ')
.tolist()
)
l = pd.DataFrame(l)
Assuming below data comes from sensor
m = (pd.DataFrame(columns=['NULL'],
index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
freq='30T'))
.between_time('07:00','21:00')
.index.strftime('%Y-%m-%dT%H:%M:%SZ')
.tolist()
)
m = pd.DataFrame(m)
m['SensorData'] = np.arange(8)
Merge the two dataframes above:
df = l.merge(m, left_on = 0, right_on= 0,how='left')
df['SensorData'] = df['SensorData'].fillna(0)
Output
0 SensorData
0 2016-09-02T17:30:00Z 0.0
1 2016-09-02T17:45:00Z 0.0
2 2016-09-02T18:00:00Z 1.0
3 2016-09-02T18:15:00Z 0.0
4 2016-09-02T18:30:00Z 2.0
5 2016-09-02T18:45:00Z 0.0
6 2016-09-02T19:00:00Z 3.0
7 2016-09-02T19:15:00Z 0.0
8 2016-09-02T19:30:00Z 4.0
9 2016-09-02T19:45:00Z 0.0
10 2016-09-02T20:00:00Z 5.0
11 2016-09-02T20:15:00Z 0.0
12 2016-09-02T20:30:00Z 6.0
13 2016-09-02T20:45:00Z 0.0
14 2016-09-02T21:00:00Z 7.0

Python : Changing MxN array into NxM [duplicate]

matrix = []
n = int(input("n: "))
m = int(input("m: "))
for i in range(m):
    data = input()
    data_list = data.split()
    data_list = [int(i) for i in data_list]
    matrix.append(data_list)
I wrote Python code that reads integers into an MxN array.
I want to turn it into an Nx(M+1) array:
change array[m][n] into array[n][m]
and put 0 into the last column, array[][m+1]
for example:
n : 4
m : 3
Input integers:
1 2 3 4
5 6 7 8
9 10 11 12
Turns into:
1 5 9 0
2 6 10 0
3 7 11 0
4 8 12 0
How can I change my code to do this?
I tried
for i in range(m):
    for j in range(n):
        matrix[i][j] = matrix[j][i]
but this is the wrong way to do it.

matrix = [
[ 1, 2, 3, 4 ],
[ 5, 6, 7, 8 ],
[ 9, 10, 11, 12 ]
]
def change(matrix):
    m = len(matrix)
    n = len(matrix[0])
    result = [[] for i in range(n)]
    for i in range(m+1):
        for j in range(n):
            if i == m:
                result[j].append(0)
            else:
                result[j].append(matrix[i][j])
    return result
changed = change(matrix)
print(changed)
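For comparison, here is a shorter sketch of the same idea using the built-in zip to transpose the rows. The function name transpose_with_zero_column is made up for this illustration:

```python
def transpose_with_zero_column(matrix):
    # zip(*matrix) yields the columns of the matrix as tuples;
    # each column becomes a row, followed by a trailing 0.
    return [list(column) + [0] for column in zip(*matrix)]

matrix = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
print(transpose_with_zero_column(matrix))
# [[1, 5, 9, 0], [2, 6, 10, 0], [3, 7, 11, 0], [4, 8, 12, 0]]
```

Like the loop version, this assumes all rows have the same length; zip would silently truncate to the shortest row otherwise.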

To solve your problem, get acquainted with NumPy.
import numpy as np
t1 = np.arange(1, 13).reshape(3, 4)
creates your source table:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
Then you should transpose it:
t2 = t1.T
which gives:
array([[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11],
[ 4, 8, 12]])
And finally:
np.c_[ t2, np.zeros(4) ]
adds a column of 4 zeroes, giving the final result:
array([[ 1., 5., 9., 0.],
[ 2., 6., 10., 0.],
[ 3., 7., 11., 0.],
[ 4., 8., 12., 0.]])
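Note that np.zeros(4) is a float array, which is why the result above is float. If integers should be preserved, one possible variation (a sketch, not the only way) is to build the zero column with the dtype of the source table:

```python
import numpy as np

t1 = np.arange(1, 13).reshape(3, 4)

# A column of zeros with one entry per row of the transposed table,
# created with the same integer dtype as t1.
zeros_col = np.zeros((t1.shape[1], 1), dtype=t1.dtype)

# Stack the transposed table and the zero column side by side.
result = np.hstack([t1.T, zeros_col])
print(result.tolist())
# [[1, 5, 9, 0], [2, 6, 10, 0], [3, 7, 11, 0], [4, 8, 12, 0]]
```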

Replace values based on multiple conditions of two array?

Assume that I have two arrays
>>> import numpy as np
>>> a = np.random.randint(0, 10, size=(5, 4))
>>> a
array([[1, 6, 7, 4],
[2, 7, 4, 2],
[9, 3, 6, 4],
[9, 6, 8, 2],
[7, 2, 9, 5]])
>>> b = np.random.randint(0, 10, size=(5, 4))
>>> b
array([[ 5., 8., 6., 5.],
[ 1., 8., 4., 8.],
[ 1., 4., 6., 3.],
[ 4., 8., 6., 4.],
[ 8., 7., 7., 5.]], dtype=float32)
Now I have a situation where I need to compare the elements of each array and replace them with known values. For example, my conditions are
if a == 0 then replace with 0 (or) if b == 0 then replace with 0
if a > 4 and < 11 then replace with 1 (or) if b > 1 and < 3 then replace with 1
if a > 10 and < 18 then replace with 2 (or) if b > 2 and < 5 then replace with 2
.
.
.
and finally
if a > 40 replace with 9 (or) if b > 9 then replace with 9.
The replaced values can be stored in a new array, which I need to use in another function.
Simple element-wise comparison like a[ a > 2 ] = 1 works, but I do not know how to apply multiple comparisons with the same method.
I am sure there is an easy way to do this in numpy that I am unable to find. Any help is appreciated.

np.digitize should do what you want. The first argument is the array of values you want to replace and the second is the list of thresholds.
a_replace = np.digitize(a, [0, 4, 10, ..., 40], right=True)
b_replace = np.digitize(b, [0, 1, 2, ..., 9], right=True)
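To make the behaviour of right=True concrete, here is a small sketch with made-up thresholds; the list [0, 4, 10, 17] is only an illustration and does not fill in the elided thresholds above. With right=True, a value x gets code i when thresholds[i-1] < x <= thresholds[i]:

```python
import numpy as np

a = np.array([0, 3, 4, 5, 10, 11, 17, 45])
thresholds = [0, 4, 10, 17]  # hypothetical thresholds, for illustration only

# right=True makes each interval closed on the right:
# x <= 0 -> 0, 0 < x <= 4 -> 1, 4 < x <= 10 -> 2, 10 < x <= 17 -> 3, x > 17 -> 4
codes = np.digitize(a, thresholds, right=True)
print(codes.tolist())  # [0, 1, 1, 2, 2, 3, 3, 4]
```

The result has the same shape as the input, so the same call works unchanged on the 2-D arrays from the question.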

Convert a Pandas DataFrame to a multidimensional ndarray

I have a DataFrame with columns for the x, y, z coordinates and the value at this position and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").values
However, this method cannot be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting row/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at this point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the location. However, they need to be in the correct order.
EDIT: I just ran a quick performance test:
The solution of jakevdp takes 1.598s, Divikar's solution takes 7.405s, JohnE's solution takes 7.867s and Wen's solution takes 6.286s to complete.

You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing
# (MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24)
arr[tuple(grouped.index.codes)] = grouped.values.flat
print(arr)
# [[[ 1. 2. nan]
# [ 3. nan 4.]]
#
# [[ 5. 6. 7.]
# [ 8. 9. nan]]]

Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    arr = df[['z','y','x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0)+1
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    avgs = np.bincount(ids, val, minlength=L)/np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning, because places with no x,y,z triplets have zero counts and hence their average values are 0/0 = NaN. Since that's the expected output for those places, you can ignore the warning there. To avoid this warning, we can employ indexing, as discussed in the second method (Alternative method).
Sample run -
In [106]: df
Out[106]:
value x y z
0 1 1 1 1 # <=== this is repeated
1 2 2 1 1
2 3 1 2 1
3 4 3 2 1
4 5 1 1 2
5 6 2 1 2
6 7 3 1 2
7 8 1 2 2
8 9 2 2 2
9 4 1 1 1 # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5, 2. , nan],
[ 3. , nan, 4. ]],
[[ 5. , 6. , 7. ],
[ 8. , 9. , nan]]])
Alternative method
To avoid the warning, an alternative way would be like so -
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
# fill only the occupied positions; empty ones stay NaN
out.flat[unq_ids] = sums[unq_ids] / count
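Putting the pieces together, a self-contained NumPy-only sketch of the warning-free averaging might look as follows. The function name coords_to_averaged_array and the plain coordinate array (instead of a DataFrame) are choices made for this illustration; the data mirror the sample run above, including the repeated (z, y, x) = (1, 1, 1) entry:

```python
import numpy as np

def coords_to_averaged_array(coords, values):
    # coords: (N, D) integer coordinates, values: (N,) measurements
    coords = np.asarray(coords)
    coords = coords - coords.min(axis=0)            # shift so indices start at 0
    shape = tuple(int(n) for n in coords.max(axis=0) + 1)
    size = int(np.prod(shape))
    ids = np.ravel_multi_index(tuple(coords.T), shape)
    sums = np.bincount(ids, weights=values, minlength=size)
    counts = np.bincount(ids, minlength=size)
    out = np.full(size, np.nan)
    filled = counts > 0                             # divide only where data exists
    out[filled] = sums[filled] / counts[filled]
    return out.reshape(shape)

# Columns are (z, y, x); the last row repeats the first coordinate.
zyx = [[1, 1, 1], [1, 1, 2], [1, 2, 1], [1, 2, 3], [2, 1, 1],
       [2, 1, 2], [2, 1, 3], [2, 2, 1], [2, 2, 2], [1, 1, 1]]
value = [1, 2, 3, 4, 5, 6, 7, 8, 9, 4]
result = coords_to_averaged_array(zyx, value)
print(result)
```

Masking with counts > 0 avoids the 0/0 division entirely, so no RuntimeWarning is raised and the empty cells simply keep their initial NaN.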

Another solution is to use the xarray package:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
Note that the xrTensor object is very handy, since xarray's DataArrays contain the labels, so you may just go on with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1., 5.],
[ 3., 8.]],
[[ 2., 6.],
[nan, 9.]],
[[nan, 7.],
[ 4., nan]]]])
Coordinates:
* dim_1 (dim_1) object 'value'
* x (x) int64 1 2 3
* y (y) int64 1 2
* z (z) int64 1 2

We can use unstack and stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))
Out[451]:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
