Subtracting the median from each column - python

I have a dataframe, df with numbers, like so:
1 1 1
2 1 1
2 1 3
I'd like to subtract the median from each column so that the median of each column becomes 0.
-1 0 0
0 0 0
0 0 2
How do I do this in a pythonic way? I'm guessing it is possible without iterating over the values, computing the median, and then subtracting. I'd like to do it tersely, approximately like so:
from numpy import median
df -= median(df)  # does not work; subtracts the median of the whole dataframe

Just like this:
df -= df.median(axis=0)
NumPy's median computes the median of the flattened data by default. To accomplish the same with NumPy, pass axis=0:
df -= median(df, axis=0)
For more detail, see the documentation: http://docs.scipy.org/doc/numpy/reference/generated/numpy.median.html
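
For completeness, a minimal runnable sketch with the question's sample data (constructing the dataframe is my addition):
import pandas as pd
df = pd.DataFrame([[1, 1, 1],
                   [2, 1, 1],
                   [2, 1, 3]])
df -= df.median(axis=0)  # subtract each column's median
print(df)
# column medians are now 0; the values match the expected output above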

Some testing in ipython showed:
In [23]: A = numpy.arange(9)
In [24]: B = A.reshape((3,3))
In [25]: C = numpy.median(B,axis=0)
In [26]: D = B - C[None,:]
In [27]: B
Out[27]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [28]: D
Out[28]:
array([[-3., -3., -3.],
       [ 0.,  0.,  0.],
       [ 3.,  3.,  3.]])
In [29]: C
Out[29]: array([ 3., 4., 5.])
So the next line gets the median along the columns:
C = numpy.median(B,axis=0)
And the next line subtracts it from the matrix, column by column:
D = B - C[None,:]
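
An aside worth noting: because C has shape (3,), NumPy's broadcasting rules already align it against B's trailing axis, so the explicit None-indexing is optional:
D = B - C  # broadcasting gives the same result as B - C[None, :]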

Related

How to efficiently filter maximum elements of a matrix per row

Given a 2D array, I'm looking for a pythonic way to get an array of same shape, with only the maximum element per each row.
See the max_row_filter function below:
import numpy as np

def max_row_filter(mat2d):
    m = np.zeros(mat2d.shape)
    for r in range(mat2d.shape[0]):
        c = np.argmax(mat2d[r])  # column of the row maximum
        m[r, c] = mat2d[r, c]
    return m

p = np.array([[1, 2, 3], [5, 4, 3], [9, 10, 3]])
max_row_filter(p)
Out: array([[ 0.,  0.,  3.],
            [ 5.,  0.,  0.],
            [ 0., 10.,  0.]])
I'm looking for an efficient way to do this, suitable to be done on big arrays.
Alternative answer (this will keep duplicates):
p * (p==p.max(axis=1, keepdims=True))
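
For instance, a quick sketch of my own with a tied row maximum:
import numpy as np
q = np.array([[1, 3, 3],
              [5, 4, 3]])
print(q * (q == q.max(axis=1, keepdims=True)))
# [[0 3 3]   <- both tied maxima survive
#  [5 0 0]]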
If there are no duplicates, you could use numpy.argmax:
import numpy as np
p = np.array([[1, 2, 3],
              [5, 4, 3],
              [9, 10, 3]])
result = np.zeros_like(p)
rows, cols = zip(*enumerate(np.argmax(p, axis=1)))
result[rows, cols] = p[rows, cols]
print(result)
Output
[[ 0  0  3]
 [ 5  0  0]
 [ 0 10  0]]
Note that, for multiple occurrences, argmax returns the first occurrence.

Convert a Pandas DataFrame to a multidimensional ndarray

I have a DataFrame with columns for the x, y, z coordinates and the value at this position and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").as_matrix()
However, this method can not be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting row/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at that point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the locations; however, they need to be in the correct order.
EDIT: I just ran a quick performance test:
jakevdp's solution takes 1.598s, Divakar's solution takes 7.405s, JohnE's solution takes 7.867s, and Wen's solution takes 6.286s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing
arr[grouped.index.labels] = grouped.values.flat
print(arr)
# [[[ 1. 2. nan]
# [ 3. nan 4.]]
#
# [[ 5. 6. 7.]
# [ 8. 9. nan]]]
Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    arr = df[['z','y','x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0)+1
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    avgs = np.bincount(ids, val, minlength=L)/np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning: places with no x,y,z triplet get zero counts, so their average values come out as 0/0 = NaN. Since NaN is the expected output for those places, you can ignore the warning. To avoid the warning altogether, we can employ indexing, as discussed in the second (alternative) method.
Sample run -
In [106]: df
Out[106]:
   value  x  y  z
0      1  1  1  1   # <=== this is repeated
1      2  2  1  1
2      3  1  2  1
3      4  3  2  1
4      5  1  1  2
5      6  2  1  2
6      7  3  1  2
7      8  1  2  2
8      9  2  2  2
9      4  1  1  1   # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5,  2. ,  nan],
        [ 3. ,  nan,  4. ]],

       [[ 5. ,  6. ,  7. ],
        [ 8. ,  9. ,  nan]]])
Alternative method
To avoid the warning, an alternative way would be to assign only at the ids that actually occur, leaving the rest as NaN (reusing out_shp, ids and val from the function above) -
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
out.flat[unq_ids] = sums[unq_ids] / count  # fill averages only at occupied ids
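
Putting those pieces together into a standalone variant (a sketch; the function name is mine):
def dataframe_to_array_averaged_no_warn(df):
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)                      # shift coordinates to start at 0
    out_shp = arr.max(0) + 1
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    out = np.full(out_shp, np.nan)         # NaN everywhere by default
    sums = np.bincount(ids, val)           # per-id sums
    unq_ids, count = np.unique(ids, return_counts=True)
    out.flat[unq_ids] = sums[unq_ids] / count  # averages only where data exists
    return out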
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
Note that the xrTensor object is very handy, since xarray's DataArrays carry their labels, so you may just go on working with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1.,  5.],
         [ 3.,  8.]],

        [[ 2.,  6.],
         [nan,  9.]],

        [[nan,  7.],
         [ 4., nan]]]])
Coordinates:
  * dim_1    (dim_1) object 'value'
  * x        (x) int64 1 2 3
  * y        (y) int64 1 2
  * z        (z) int64 1 2
We can use stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))
Out[451]:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
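
A hedged aside: the (2, 2, 3) passed to reshape is hard-coded for this example; it can be derived from the grouped index levels instead:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
shape = tuple(len(level) for level in grouped.index.levels)  # (2, 2, 3) here
arr = np.reshape(grouped.unstack([1, 2]).stack([0, 1], dropna=False).values, shape)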

How to remove NaN and count values in NxK arrays in numpy in a vectorized way

My situation: I have a pandas dataframe where, for each row, I have to compute the following:
1) Get the first value, NaN excluded (df.apply(lambda x: x.dropna().iloc[0]))
2) Get the last value, NaN excluded (df.apply(lambda x: x.dropna().iloc[-1]))
3) Count the non-NaN values (df.apply(lambda x: len(x.dropna())))
Sample case and expected output :
x = np.array([[1,2,np.nan], [4,5,6], [np.nan, 8,9]])
1) [1, 4, 8]
2) [2, 6, 9]
3) [2, 3, 2]
And I need to keep it optimized, so I turned to numpy and looked for a way to apply y = x[~numpy.isnan(x)] on an NxK array as a first step. Then, I would use what was shown here (Vectorized way of accessing row specific elements in a numpy array) for 1) and 2), but I am still empty-handed for 3).
Here's one way -
In [756]: x
Out[756]:
array([[  1.,   2.,  nan],
       [  4.,   5.,   6.],
       [ nan,   8.,   9.]])
In [768]: m = ~np.isnan(x)
In [769]: first_idx = m.argmax(1)
In [770]: last_idx = m.shape[1] - m[:,::-1].argmax(1) - 1
In [771]: x[np.arange(len(first_idx)), first_idx]
Out[771]: array([ 1., 4., 8.])
In [772]: x[np.arange(len(last_idx)), last_idx]
Out[772]: array([ 2., 6., 9.])
In [773]: m.sum(1)
Out[773]: array([2, 3, 2])
Alternatively, we could make use of cumulative-summation to get those indices, like so -
In [787]: c = m.cumsum(1)
In [788]: first_idx = (c==1).argmax(1)
In [789]: last_idx = c.argmax(1)
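
Bundling the three results into one helper (a sketch assembled from the steps above; the function name is mine):
import numpy as np
def first_last_count(x):
    m = ~np.isnan(x)                                  # mask of valid entries
    rows = np.arange(x.shape[0])
    first_idx = m.argmax(1)                           # first True per row
    last_idx = m.shape[1] - m[:, ::-1].argmax(1) - 1  # last True per row
    return x[rows, first_idx], x[rows, last_idx], m.sum(1)
x = np.array([[1, 2, np.nan], [4, 5, 6], [np.nan, 8, 9]])
print(first_last_count(x))
# (array([1., 4., 8.]), array([2., 6., 9.]), array([2, 3, 2]))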

First occurrence in numpy logics

Let's say I have a numpy.ndarray:
a = np.array([0,4,10,0,11,10])
I compared this with 10.
a >= 10
# array([False, False, True, False, True, True], dtype=bool)
I would like to have a single True, i.e. True only at the first occurrence.
I would like to apply this to a given axis of an n-D numpy.ndarray (say, 1000*1000*10).
a_2d = np.array([[0,4,10],[0,11,10]])
#if axis == 1: array([[False, False, True], [False, True, False]])
What I have done:
For a 1-D array, I managed to do it using this:
b=np.zeros(a.size)
b[np.argmax(a>=10)]=True
#b=array([ 0., 0., 1., 0., 0., 0.])
However, I have no idea how to apply this to a large n-D array.
This one should work with no for loops, for 1D or 2D:
def firstByRow(a, f = lambda x: x >= 10):
    b = (np.cumsum(f(a), axis = -1) == 1).T
    b[1:] = b[1:] * np.equal(b[1:], np.diff((f(a)).astype(int), axis = -1).T)
    return b.T
Not sure if it would be faster than slightly loopier code though, as it does both a cumsum and a diff.
EDIT:
You can also do this, which is probably faster (leveraging that np.unique(return_index = True) picks the first occurrence):
def firstByAxis(a, f = lambda x: x >= 10, axis = 0):
    c = np.where(f(a))
    i = np.unique(c[axis], return_index = True)[1]
    b = np.zeros_like(a)
    b[tuple(np.take(c, i, axis = -1))] = 1
    return b
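
A quick usage check of my own against the 2-D example from the question:
a_2d = np.array([[0, 4, 10],
                 [0, 11, 10]])
print(firstByAxis(a_2d, axis=0))
# [[0 0 1]
#  [0 1 0]]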
You can try the following:
>>> import numpy as np
>>> a_2d = np.array([[0,4,10],[0,11,10]])
>>> r, c = np.where( a_2d >= 10 )
>>> mask = r+c == (r+c).min()
>>> highMask = np.zeros(np.shape(a_2d))
>>> highMask[r[mask], c[mask]] = 1
>>> highMask
array([[ 0.,  0.,  1.],
       [ 0.,  1.,  0.]])
There is no such thing as 'the first' occurrence in a 2D array: the candidate positions with minimal index sum form a line across the 2D array, all of them having equally minimal indices. For a 3D array, this would be a surface, etc.
Example of such a line would be:
0 0 0 0 0 1
0 0 0 0 1 0
0 0 0 1 0 0
0 0 1 0 0 0
0 1 0 0 0 0
1 0 0 0 0 0
All of which are equidistant from the [0,0] location ...
If you enumerate over the argmax, you can update your zeros array.
Code:
a = np.array([[0, 4, 10], [0, 11, 10]])
print(a)
b = np.zeros(a.shape)
for i, j in enumerate(np.argmax(a >= 10, axis=1)):
    b[i, j] = 1
print(b)
Results:
[[ 0  4 10]
 [ 0 11 10]]
[[ 0.  0.  1.]
 [ 0.  1.  0.]]
Using advanced indexing:
c = np.zeros(a.shape)
c[list(range(a.shape[0])), np.argmax(a >= 10, axis=1)] = 1
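
One caveat worth adding: np.argmax returns 0 when a row contains no True, so a row with no element >= 10 would wrongly get a 1 in its first column. A sketch that masks such rows first:
valid = (a >= 10).any(axis=1)  # rows that actually contain a match
c = np.zeros(a.shape)
c[np.flatnonzero(valid), np.argmax(a >= 10, axis=1)[valid]] = 1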

Save one-hot-encoded features into Pandas DataFrame the fastest way

I have a Pandas DataFrame with all my features and labels. One of my feature is categorical and needs to be one-hot-encoded.
The feature is an integer and can only have values from 0 to 4.
To save those arrays back into my DataFrame, I use the following code:
# enc is my OneHotEncoder object
df['mycol'] = df['mycol'].map(lambda x: enc.transform(x).toarray())
My DataFrame has more than 1 million rows, so the above code takes a while. Is there a faster way to assign the arrays to the DataFrame cells? Because I have just 5 categories, I don't need to call the transform() function 1 million times.
I already tried something like
num_categories = 5
i = 0
while (i < num_categories):
    df.loc[df['mycol'] == i, 'mycol'] = enc.transform(i).toarray()
    i += 1
Which yields this error:
ValueError: Must have equal len keys and value when setting with an ndarray
You can use pd.get_dummies:
>>> s
0    a
1    b
2    c
3    a
dtype: object
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
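
Applied to the question's integer feature, a minimal sketch (the column values here are made up, but stay within the stated 0-4 range):
import pandas as pd
df = pd.DataFrame({'mycol': [0, 1, 4, 2, 1]})
dummies = pd.get_dummies(df['mycol'], prefix='mycol')
df = pd.concat([df.drop(columns='mycol'), dummies], axis=1)
print(df.columns.tolist())
# ['mycol_0', 'mycol_1', 'mycol_2', 'mycol_4'] -- only values present get a column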
Alternatively:
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> a = np.array([1, 1, 3, 2, 2]).reshape(-1, 1)
>>> a
array([[1],
       [1],
       [3],
       [2],
       [2]])
>>> one_hot = enc.fit_transform(a)
>>> one_hot.toarray()
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.]])
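
To the question's point that only five categories exist, one hedged alternative is to encode the five possible values once and fancy-index into that small lookup table instead of transforming a million rows (the names below are mine):
import numpy as np
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
codes = np.arange(5).reshape(-1, 1)          # the five possible values 0..4
lookup = enc.fit_transform(codes).toarray()  # one encoded row per category
one_hot = lookup[df['mycol'].to_numpy()]     # one transform, then plain indexing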
