I have a DataFrame with columns for the x, y, z coordinates and the value at that position, and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").as_matrix()  # .as_matrix() was removed in newer pandas; use .to_numpy()
However, this method cannot be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average), and ensures that all x, y, z coordinates are consecutive (by inserting rows/columns of NaN when a coordinate is missing).
EDIT: Some more explanation:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at that point and frequency. I then round the coordinates to a specified precision (e.g. 0.1 m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the locations; however, they need to be in the correct order.
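For reference, the rounding step described here might look roughly like this (a sketch; the 0.1 m precision comes from the description above, and the divide/round/multiply trick is my assumption, not code from the question):

precision = 0.1  # grid spacing in metres (hypothetical value from the description)

# snap the raw coordinates to the grid so that equal (rounded) positions
# can be grouped and averaged afterwards
df[['x', 'y', 'z']] = (df[['x', 'y', 'z']] / precision).round() * precision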
EDIT: I just ran a quick performance test:
jakevdp's solution takes 1.598 s, Divakar's takes 7.405 s, JohnE's takes 7.867 s, and Wen's takes 6.286 s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()

# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)

# fill it using NumPy's advanced indexing
# (in newer pandas, grouped.index.labels is called grouped.index.codes)
arr[tuple(grouped.index.labels)] = grouped.values.flat
print(arr)
# [[[ 1.  2. nan]
#   [ 3. nan  4.]]
#
#  [[ 5.  6.  7.]
#   [ 8.  9. nan]]]
Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    # shift the coordinates so they start at 0 and derive the output shape
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1
    L = np.prod(out_shp)

    # map each (z, y, x) triplet to a flat index, then average per index
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning, because places with no x, y, z triplets have zero counts and hence their average values become 0/0 = NaN. Since that is the expected output for those places, you can ignore the warning. To avoid it entirely, we can employ indexing, as discussed in the second (alternative) method.
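If you would rather keep this first method and just silence the warning, the division can be wrapped in NumPy's errstate context manager (a sketch; ids, val and L are the locals of dataframe_to_array_averaged above):

# 0/0 still yields NaN for the empty cells, but no RuntimeWarning is emitted
with np.errstate(invalid='ignore'):
    avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)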
Sample run -
In [106]: df
Out[106]:
   value  x  y  z
0      1  1  1  1   # <=== this is repeated
1      2  2  1  1
2      3  1  2  1
3      4  3  2  1
4      5  1  1  2
5      6  2  1  2
6      7  3  1  2
7      8  1  2  2
8      9  2  2  2
9      4  1  1  1   # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5,  2. ,  nan],
        [ 3. ,  nan,  4. ]],

       [[ 5. ,  6. ,  7. ],
        [ 8. ,  9. ,  nan]]])
Alternative method
To avoid the warning, an alternative way is to write only into the cells that actually occur, so the empty cells keep their NaN and no 0/0 division ever happens -

out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
out.flat[unq_ids] = sums[unq_ids] / count
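Put together as a complete function (a sketch assembling the pieces above; the _v2 name is mine):

def dataframe_to_array_averaged_v2(df):
    # same index computation as in dataframe_to_array_averaged
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)

    # warning-free averaging: only cells that actually occur are written
    out = np.full(out_shp, np.nan)
    sums = np.bincount(ids, val)
    unq_ids, count = np.unique(ids, return_counts=True)
    out.flat[unq_ids] = sums[unq_ids] / count
    return out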
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr

df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
Note that the xrTensor object is very handy, since xarray's DataArrays carry their labels, so you may just keep working with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1.,  5.],
         [ 3.,  8.]],

        [[ 2.,  6.],
         [nan,  9.]],

        [[nan,  7.],
         [ 4., nan]]]])
Coordinates:
  * dim_1    (dim_1) object 'value'
  * x        (x) int64 1 2 3
  * y        (y) int64 1 2
  * z        (z) int64 1 2
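With the labels in place you can, for example, look a measurement up by coordinate rather than by array position (a small sketch using xarray's .sel):

# select by coordinate labels instead of positional indices
print(xrTensor.sel(x=2, y=1, z=2).values)
# [ 6.]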
We can use stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1, 2]).stack([0, 1], dropna=False).values, (2, 2, 3))
Out[451]:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
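For readability, the one-liner can be unpacked into steps (a sketch; df is the original DataFrame from the question, and the (2, 2, 3) shape is hard-coded from the example just as in the one-liner):

s = df.groupby(['z', 'y', 'x'])['value'].mean()  # Series with a (z, y, x) MultiIndex
wide = s.unstack([1, 2])                         # the y and x levels become columns
flat = wide.stack([0, 1], dropna=False)          # stack them back, keeping NaNs
arr = np.reshape(flat.values, (2, 2, 3))         # reshape to (z, y, x)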
Related
Beginner level at Python. I have a large matrix (MxN) that I want to process and an Mx1 matrix that contains some indices. What I want is to replace with NaN, in each row of the MxN matrix, the entries whose column index is less than the value listed for that row in the Mx1 indices matrix.
Say for example I have:
A = [1  2  3  4]
    [5  6  7  8]
    [9 10 11 12]
and
B = [0]
    [2]
    [1]
the resultant matrix should be
C = [  1   2  3  4]
    [NaN NaN  7  8]
    [NaN  10 11 12]
I am trying to avoid using for loops because the matrix I'm dealing with is large and this function will be called repeatedly. Is there an elegant, Pythonic way to implement this?
Check out this code. The logic of the first method is to build a boolean condition matrix for np.where, which is done as follows:
import numpy as np

A = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=float)
B = np.array([[0], [2], [1]])

# boolean matrix: False for the first B[i] columns of row i, True afterwards
B = np.array(list(map(lambda i: [False] * i[0] + [True] * (4 - i[0]), B)))
A = np.where(B, A, np.nan)
print(A)
Method 2: using basic Pythonic code

import numpy as np

A = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=float)
B = np.array([[0], [2], [1]])

# blank out the first B[i] entries of each row in place
for i, j in enumerate(A):
    j[:B[i][0]] = np.nan
print(A)
Your arrays - note that A is float, so it can hold np.nan:
In [348]: A = np.arange(1,13).reshape(3,4).astype(float); B = np.array([[0],[2],[1]])
In [349]: A
Out[349]:
array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12.]])
In [350]: B
Out[350]:
array([[0],
       [2],
       [1]])
A boolean mask where we want to change values:
In [351]: np.arange(4)<B
Out[351]:
array([[False, False, False, False],
       [ True,  True, False, False],
       [ True, False, False, False]])
apply it:
In [352]: A[np.arange(4)<B] = np.nan
In [353]: A
Out[353]:
array([[ 1.,  2.,  3.,  4.],
       [nan, nan,  7.,  8.],
       [nan, 10., 11., 12.]])
Assume that I have two arrays
>>> import numpy as np
>>> a = np.random.randint(0, 10, size=(5, 4))
>>> a
array([[1, 6, 7, 4],
       [2, 7, 4, 2],
       [9, 3, 6, 4],
       [9, 6, 8, 2],
       [7, 2, 9, 5]])
>>> b = np.random.randint(0, 10, size=(5, 4))
>>> b
array([[ 5.,  8.,  6.,  5.],
       [ 1.,  8.,  4.,  8.],
       [ 1.,  4.,  6.,  3.],
       [ 4.,  8.,  6.,  4.],
       [ 8.,  7.,  7.,  5.]], dtype=float32)
Now I have a situation where I need to compare the elements of each array and replace them with known values. For example, my conditions are:
if a == 0 then replace with 0 (or) if b == 0 then replace with 0
if a > 4 and < 11 then replace with 1 (or) if b > 1 and < 3 then replace with 1
if a > 10 and < 18 then replace with 2 (or) if b > 2 and < 5 then replace with 2
.
.
.
and finally
if a > 40 replace with 9 (or) if b > 9 then replace with 9.
Those replaced values can be stored in a new array which I need to use for another function.
The simplest form of element-wise comparison, like a[a > 2] = 1, works. But I am not aware of how to do multiple comparisons (multiple times) with the same method.
I am sure there is an easy way in numpy which I am unable to find. Any help is appreciated.
np.digitize should do what you want. The first argument is the values you want to replace and the second is the thresholds.
a_replace = np.digitize(a, [0, 4, 10, ..., 40], right=True)
b_replace = np.digitize(b, [0, 1, 2, ..., 9], right=True)
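A concrete sketch of the mechanics with made-up thresholds (the ... above stands for the remaining cut-offs from the question and is left elided on purpose):

import numpy as np

a = np.array([[1, 6, 7, 4],
              [2, 7, 4, 2]])

# with right=True, a value v falls into bin i when bins[i-1] < v <= bins[i]
bins = [0, 4, 10, 18]
print(np.digitize(a, bins, right=True))
# [[1 2 2 1]
#  [1 2 1 1]]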
My situation: I have a pandas dataframe where, for each row, I have to compute the following:
1) Get the first value, NaNs excluded (df.apply(lambda x: x.dropna().iloc[0]))
2) Get the last value, NaNs excluded (df.apply(lambda x: x.dropna().iloc[-1]))
3) Count the non-NaN values (df.apply(lambda x: len(x.dropna())))
Sample case and expected output:
x = np.array([[1,2,np.nan], [4,5,6], [np.nan, 8,9]])
1) [1, 4, 8]
2) [2, 6, 9]
3) [2, 3, 2]
And I need to keep it optimized. So I turned to numpy and looked for a way to apply y = x[~numpy.isnan(x)] on an NxK array as a first step. Then I would use what was shown here (Vectorized way of accessing row specific elements in a numpy array) for 1) and 2), but I am still empty-handed for 3).
Here's one way -
In [756]: x
Out[756]:
array([[  1.,   2.,  nan],
       [  4.,   5.,   6.],
       [ nan,   8.,   9.]])
In [768]: m = ~np.isnan(x)
In [769]: first_idx = m.argmax(1)
In [770]: last_idx = m.shape[1] - m[:,::-1].argmax(1) - 1
In [771]: x[np.arange(len(first_idx)), first_idx]
Out[771]: array([ 1., 4., 8.])
In [772]: x[np.arange(len(last_idx)), last_idx]
Out[772]: array([ 2., 6., 9.])
In [773]: m.sum(1)
Out[773]: array([2, 3, 2])
Alternatively, we could make use of cumulative summation to get those indices, like so -
In [787]: c = m.cumsum(1)
In [788]: first_idx = (c==1).argmax(1)
In [789]: last_idx = c.argmax(1)
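Wrapped up as a single helper (the function name is mine; it just bundles the steps above and assumes every row has at least one non-NaN value):

def first_last_count(x):
    # per-row first non-NaN value, last non-NaN value and non-NaN count
    m = ~np.isnan(x)
    first_idx = m.argmax(1)
    last_idx = m.shape[1] - m[:, ::-1].argmax(1) - 1
    rows = np.arange(x.shape[0])
    return x[rows, first_idx], x[rows, last_idx], m.sum(1)

first, last, count = first_last_count(x)
# (array([ 1.,  4.,  8.]), array([ 2.,  6.,  9.]), array([2, 3, 2]))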
I have two numpy arrays that contains NaNs:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([ 1 , 2, 3 , 4, np.nan])
Is there any smart way, using numpy, to remove the NaNs in both arrays and also remove whatever is at the corresponding index in the other array?
Making it look like this:
A = array([ 2.,  3.])
B = array([ 2.,  4.])
What you could do is add the two arrays together; this will produce NaN wherever either array has one. Then use that to generate a boolean mask index, and use the index to index into your original numpy arrays:
In [193]:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([ 1 , 2, 3 , 4, np.nan])
idx = np.where(~np.isnan(A+B))
idx
print(A[idx])
print(B[idx])
[ 2. 3.]
[ 2. 4.]
output from A+B:
In [194]:
A+B
Out[194]:
array([ nan, 4., nan, 7., nan])
EDIT
As @Oliver W. has correctly pointed out, the np.where is unnecessary, as np.isnan produces a boolean index that you can use directly to index into the arrays:
In [199]:
A = np.array([np.nan, 2, np.nan, 3, 4])
B = np.array([ 1 , 2, 3 , 4, np.nan])
idx = (~np.isnan(A+B))
print(A[idx])
print(B[idx])
[ 2. 3.]
[ 2. 4.]
A[~(np.isnan(A) | np.isnan(B))]
B[~(np.isnan(A) | np.isnan(B))]
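Computing the mask once avoids evaluating the isnan expressions twice (a minor variation of the same idea):

mask = ~(np.isnan(A) | np.isnan(B))
A, B = A[mask], B[mask]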
I have to use scikit-learn's KNeighborsClassifier to compare time series using a user-defined function in Python.
knn = KNeighborsClassifier(n_neighbors=1,weights='distance',metric='pyfunc',func=dtw_dist)
The problem is that KNeighborsClassifier doesn't seem to support my training data. They are time series, so they are lists of different sizes. KNeighborsClassifier gives me this error message when I try to use the fit method (knn.fit(X, Y)):
ValueError: data type not understood
It seems KNeighborsClassifier only supports same-size training sets (only time series of the same length would be accepted, but that is not my case), yet my teacher told me to use KNeighborsClassifier. So I don't know what to do... Any ideas?
Two (or one...) options as far as I can tell:
Precompute the distances (not directly supported by KNeighborsClassifier, it seems; other algorithms do support it, e.g., Spectral Clustering).
Convert your data to be square using NaNs, and handle these accordingly in your custom distance function.
'Square' your data using NaNs
So, option 2 it is.
Say we have the following data, where every row represents a time series:
import numpy as np
series = [
    [1, 2, 3, 4],
    [1, 2, 3],
    [1],
    [1, 2, 3, 4, 5, 6, 7, 8]
]
We simply make the data square by padding with NaNs:

def make_square(jagged):
    # Careful: this mutates the series list of lists
    max_cols = max(map(len, jagged))
    for row in jagged:
        row.extend([None] * (max_cols - len(row)))
    return np.array(jagged, dtype=float)  # the None entries become NaN

make_square(series)
array([[  1.,   2.,   3.,   4.,  nan,  nan,  nan,  nan],
       [  1.,   2.,   3.,  nan,  nan,  nan,  nan,  nan],
       [  1.,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
       [  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.]])
Now the data 'fits' into the algorithm. You just have to adapt your distance function to account for the NaNs.
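As a sketch of what such an adaptation could look like, here is a distance that only compares positions where both padded series have real values (my own example, not a drop-in DTW replacement):

def nan_euclidean(row1, row2):
    # compare only positions where neither padded series is NaN
    m = ~(np.isnan(row1) | np.isnan(row2))
    return np.sqrt(np.sum((row1[m] - row2[m]) ** 2))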
Precompute and use a cached distance matrix
Oh, we can probably do option 1 too (assuming you have N time series):
Precompute the distances into an (N, N) distance matrix D.
Create an (N, 1) matrix that is just the range [0, N) (i.e., the index of each series in the distance matrix).
Create a distance function wrapper.
Use this wrapper as the distance function.
The wrapper function:

def wrapper(row1, row2):
    # might have to fiddle a bit here, but I think this retrieves the indices
    i1, i2 = row1[0], row2[0]
    return D[i1, i2]
OK, I hope it's clear.
Complete example
#!/usr/bin/env python2.7
# encoding: utf-8

from mlpy import dtw_std  # I don't know if you are using this one: it doesn't matter.
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Example data
series = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3],
    [1],
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 5, 6, 7, 8],
    [1, 2, 4, 5, 6, 7, 8],
]

# I don't know.. these seemed to make sense to me!
y = np.array([0, 0, 0, 0, 1, 2, 2, 2])

# Compute the distance matrix
N = len(series)
D = np.zeros((N, N))
for i in range(N):
    for j in range(i + 1, N):
        D[i, j] = dtw_std(series[i], series[j])
        D[j, i] = D[i, j]

print D

# Create the fake data matrix: just the indices of the time series
X = np.arange(N).reshape((N, 1))

# Create the wrapper function that returns the correct distance
def wrapper(row1, row2):
    # cast to int to prevent warnings: sklearn converts our integer indices to floats
    i1, i2 = int(row1[0]), int(row2[0])
    return D[i1, i2]

# Only the ball_tree algorithm seems to accept a custom function
knn = KNeighborsClassifier(weights='distance', algorithm='ball_tree', metric='pyfunc', func=wrapper)
knn.fit(X, y)

print knn.kneighbors(X[0])
# (array([[ 0.,  0.,  0.,  1.,  6.]]), array([[1, 2, 0, 3, 4]]))

print knn.predict(X)
# [0 0 0 0 1 2 2 2]
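As an aside: in recent scikit-learn versions the metric='pyfunc' plus func= combination is gone, and a callable is passed as the metric directly, so the construction above would presumably become:

# newer scikit-learn: pass the callable itself as the metric
knn = KNeighborsClassifier(weights='distance', algorithm='ball_tree', metric=wrapper)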