Pandas groupby throws an error when using sum()

Pandas groupby throws an error when using sum() - python

I am trying to calculate the between cluster scatter matrix. In order to do that, for each cluster (named "group" in the example below), I need to perform an operation which results in a matrix and subsequently perform an element-wise addition of the matrices from each cluster.
To do this I try the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'group': [1, 2, 1, 0, 0, 0, 1, 2],
'A': [1.5, 0.5, 2.5, 0.5, 1.5, 0.5, 1.5, 0.5],
'B': [3.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.5, 2.5]})
features = list(df.columns)
features.remove('group')
def g(x, mu):
y = np.array([np.mean(x) - mu])
print((y.T # y)*len(x))
print("")
return (y.T # y)*len(x)
m = len(df.index)
mu = df.groupby('group')[features].apply(lambda x: (np.multiply(x.count(), np.mean(x)))/m).sum()
print("mu:")
print(mu)
Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()
This example throws a TypeError: Series.name must be a hashable type error on the last line. The print statement in the g function shows the result as expected, see below, so I believe the error is due to the .sum() operation.
[[0.2553 0.1458]
[0.1458 0.0833]]
[[1.5052 1.0625]
[1.0625 0.75 ]]
[[0.7912 0.625 ]
[0.625 0.5 ]]
The result I was expecting by adding the .sum() operation was the element-wise addition of the three matrices above.
The expected output is:
[[2.5416 1.8333]
[1.8333 1.3333]]
Any ideas why this is giving me an error and what I can do to correct it?
Update 1:
Using:
Sb = df.groupby('group').apply(g, mu=(mu)).sum()
instead of
Sb = df.groupby('group')[features].apply(g, mu=(mu)).sum()
gives the correct matrix, padded with nans. Why does features cause an error?

Have you tried this ?
sb=df.groupby('group').apply(g, mu=(mu)).sum()
it gives the following result:
[[2.54166667 1.83333333 nan]
[1.83333333 1.33333333 nan]
[ nan nan nan]]
Is it what you want ?
You still have to deal with the nans though
Edit to answer your comments:
To answer you problem in the comments you could change your function as below:
def g(x, mu):
x=x[["A","B"]] #or x=x[features]
y = np.array([np.mean(x) - mu])
print((y.T # y)*(len(x)))
print("")
return (y.T # y)*(len(x))
and then:
sb=df.groupby(['group']).apply(g, mu=(mu)).sum()
print(sb)
which gives:
[[2.54166667 1.83333333]
[1.83333333 1.33333333]]

Related

Numpy/Torch insert smallest value in case of collision

I have an empty numpy array, a list of indices, and list of values associated with the indices. The issue is that there may be duplicates in the indices. In all these "collision" cases, I'd like the smallest value to be picked. Just wondering what is the best way to go about it.
Eg:
array = [0,0,0,0,0,0,0]
indices = [0, 0, 2, 3, 2, 4]
values = [1.0, 3.0, 3.5, 1.5, 2.5, 8.0]
Result:
out = [1.0, 0, 2.5, 1.5, 8.0, 0.0, 0.0]

You can always implement something manually like:
import numpy as np
def index_reduce(arr, indices, out, reducer=min):
touched = np.zeros_like(out, dtype=np.bool_)
for i, x in enumerate(indices):
if not touched[x]:
out[x] = arr[i]
touched[x] = True
else:
out[x] = reducer(out[x], arr[i])
return out
which essentially loops through the indices and assign the values of arr to out if not already touched (keeping track of this with the touched array) and reducing the output with the specified reducer.
NOTE: The reducer function needs to be such that the final result can only depend on the current and previous value.
The usage of this would be:
indices = [0, 0, 2, 3, 2, 4]
values = [1.0, 3.0, 3.5, 1.5, 2.5, 8.0]
array = np.zeros(7)
index_reduce(values, indices, array)
# array([1. , 0. , 2.5, 1.5, 8. , 0. , 0. ])
If performances are of concern, you can also accelerate the above code with Numba with a simple decoration provided that also the values and indices inputs are NumPy arrays:
import numba as nb
index_reduce_nb = nb.njit(index_reduce)
indices = np.array([0, 0, 2, 3, 2, 4])
values = np.array([1.0, 3.0, 3.5, 1.5, 2.5, 8.0])
array = np.zeros(7)
index_reduce_nb(values, indices, array)
# array([1. , 0. , 2.5, 1.5, 8. , 0. , 0. ])
Benchmarks
The above solutions can be compared to a Torch-based solution (reworked from #Shai's answer):
import torch
def index_reduce_torch(arr, indices, out, reduce_="amin"):
arr = torch.from_numpy(arr)
indices = torch.from_numpy(indices)
out = torch.from_numpy(out)
return out.index_reduce_(dim=0, index=indices, source=arr, reduce=reduce_, include_self=False).numpy()
or, with additional skipping of Torch gradients:
index_reduce_torch_ng = torch.no_grad()(index_reduce_torch)
index_reduce_torch_ng.__name__ = "index_reduce_torch_ng"
and a Pandas-based solution (reworked from #bpfrd's answer):
import pandas as pd
def index_reduce_pd(arr, indices, out, reducer=min):
df = pd.DataFrame(data=zip(indices, arr))
df1 = df.groupby(0, as_index=False).agg(reducer)
out[df1[0]] = df1[1]
return out
using the following code:
funcs = index_reduce, index_reduce_nb, index_reduce_pd, index_reduce_torch, index_reduce_torch_ng
timings = {}
for i in range(4, 18):
n = 2 ** i
print(f"n = {n}, i = {i}")
extrema = 0, 2 * n
indices = np.random.randint(*extrema, n)
values = np.random.random(n)
out = np.zeros(extrema[1] + 1)
timings[n] = []
base = funcs[0](values, indices, out)
for func in funcs:
res = func(values, indices, out)
is_good = np.allclose(base, res)
timed = %timeit -r 16 -n 16 -q -o func(values, indices, out)
timing = timed.best * 1e6
timings[n].append(timing if is_good else None)
print(f"{func.__name__:>24} {is_good} {timing:10.3f} µs")
to produce with the additional lines:
import matplotlib.pyplot as plt
df = pd.DataFrame(data=timings, index=[func.__name__ for func in funcs]).transpose()
df.plot(marker='o', xlabel='Input size / #', ylabel='Best timing / µs', figsize=(6, 4))
df.plot(marker='o', xlabel='Input size / #', ylabel='Best timing / µs', ylim=[0, 500], figsize=(6, 4))
fig = plt.gcf()
fig.patch.set_facecolor('white')
these plots:
(the second is a zoomed-in version of the first).
These indicate that the Numba accelerated solution could be the fastest, closely followed by the Torch-based solution while the Pandas approach could be the slowest, even slower than the explicit solution without acceleration.

You are looking for index_reduce_, which was introduced in PyTorch 1.12.
import torch
array = torch.zeros(7)
indices = torch.tensor([0, 0, 2, 3, 2, 4])
values = torch.tensor([1.0, 3.0, 3.5, 1.5, 2.5, 8.0])
out = array.index_reduce_(dim=0, index=indices, source=values, reduce='amin', include_self=False)
You'll get your desired output:
tensor([1.0000, 0.0000, 2.5000, 1.5000, 8.0000, 0.0000, 0.0000])
Note that this method is in "beta" and its API may change in future PyTorch versions.

You can use pandas groupby agg as the following:
indices = [0, 0, 2, 3, 2, 4]
values = [1.0, 3.0, 3.5, 1.5, 2.5, 8.0]
array = [0,0,0,0,0,0,0]
df = pd.DataFrame(zip(indices, values), columns=['indices','values'])
df1 = df.groupby('indices', as_index=False).agg(values=('values', min))
for i,j in zip(df1['indices'].tolist(), df1['values'].tolist()):
array[i] = j
output:
array
>[1.0, 0, 2.5, 1.5, 8.0, 0, 0]

How to compare two numpy arrays with multiple condition

I have 2 NumPy arrays like the below
array_1 = np.array([1.2, 2.3, -1.0, -0.5])
array_2 = np.array([-0.5, 1.3, 2.5, -0.9])
We can do the element-wise simple arithmetic calculation (addition, subtraction, division etc) easily using different np functions
array_sum = np.add(array_1, array_2)
print(array_sum) # [ 0.7 3.6 3.5 -0.4]
array_sign = np.sign(array_1 * array_2)
print(array_sign) # [-1. 1. 1. -1.]
However, I need to check element-wise multiple conditions for 2 arrays and want to save them in 2 new arrays (say X and Y).
For example, if both elements contain different sign (e.g.: 1st and 3rd element pairs of the given example)) then, X will contain 0 and Y will be the sum of the poitive element and abs(negative element)
X = [0]
Y = [1.7]
When both elements are positive (e.g.: 2nd element pair of the given example) then, X will contain the lower value and Y will contain the greater value
X = [1.3]
Y = [2.3]
If both elements are negative, then, X will be 0 and Y will be the sum of the abs(negative element) and abs(negative element)
So, the final X and Y will be something like
X = [0, 1.3, 0, 0]
Y = [1.7, 2.3, 3.5, 1.4]
I have gone through some posts (this, and this) that described, the comparison procedures between 2 arrays, but not getting idea for multiple conditions. Here, 2 arrays are very small but, my real arrays are very large (e.g.: contains 2097152 element per array).
Any ideas are highly appreciated.

Try with numpy.select:
conditions = [(array_1>0)&(array_2>0), (array_1<0)&(array_2<0)]
choiceX = [np.minimum(array_1, array_2), np.zeros(len(array_1))]
choiceY = [np.maximum(array_1, array_2), -np.add(array_1,array_2)]
X = np.select(conditions, choiceX)
Y = np.select(conditions, choiceY, np.add(np.abs(array_1), np.abs(array_2)))
>>> X
array([0. , 1.3, 0. , 0. ])
>>> Y
array([1.7, 2.3, 3.5, 1.4])

This will do it. It does require vertically stacking the two arrays. I'm sure someone will pipe up if there is a more efficient solution.
import numpy as np
array_1 = np.array([1.2, 2.3, -1.0, -0.5])
array_2 = np.array([-0.5, 1.3, 2.5, -0.9])
def pick(t):
if t[0] < 0 or t[1] < 0:
return (0,abs(t[0])+abs(t[1]))
return (t.min(), t.max())
print( np.apply_along_axis( pick, 0, np.vstack((array_1,array_2))))
Output:
[[0. 1.3 0. 0. ]
[1.7 2.3 3.5 1.4]]
The second line of the function can also be written:
return (0,np.abs(t).sum())
But since these will only be two-element arrays, I doubt that saves anything at all.

Apply logical and/or operations along an axis in numpy python [duplicate]

For machine learning, I'm appliying Parzen Window algorithm.
I have an array (m,n). I would like to check on each row if any of the values is > 0.5 and if each of them is, then I would return 0, otherwise 1.
I would like to know if there is a way to do this without a loop thanks to numpy.

You can use np.all with axis=1 on a boolean array.
import numpy as np
arr = np.array([[0.8, 0.9], [0.1, 0.6], [0.2, 0.3]])
print(np.all(arr>0.5, axis=1))
>> [True False False]

import numpy as np
# Value Initialization
a = np.array([0.75, 0.25, 0.50])
y_predict = np.zeros((1, a.shape[0]))
#If the value is greater than 0.5, the value is 1; otherwise 0
y_predict = (a > 0.5).astype(float)

I have an array (m,n). I would like to check on each row if any of the values is > 0.5
That will be stored in b:
import numpy as np
a = # some np.array of shape (m,n)
b = np.any(a > 0.5, axis=1)
and if each of them is, then I would return 0, otherwise 1.
I'm assuming you mean 'and if this is the case for all rows'. In this case:
c = 1 - 1 * np.all(b)
c contains your return value, either 0 or 1.

numpy, pass individual arguments to np.vectorized function

if i have an array of x values and want to multiply each x value with a different coefficients and sum the. although I want this operation to happen by passing a function that handles the summation and weighting. for example if i have x, coeffs and a function, custom_weight(x, a, b, c)
x = numpy.array([1, 2, 3, 4, 5, 6])
coeffs = numpy.array([[0.1, 0.2, 3.2], [4.5, 4.0, 0.005]]
def custom_weight(x, a, b, c):
return a*x**2 + (x+b)**3 + x*c
I want x to be broadcast for each inner array of coeffs. in this case the final result
should be an array with the shape (6, 2). for the first iteration of the custom_weight function
should look like this custom_weight(x[0], *(coeffs[0])) == custom_weight(1, 0.1, 0.2, 3.2). the same happens for all the other x's 2-6. then this happens again with the x's but now using the second set of coefficients.
I do realize that I could do this manually or numpy.vectorize in a certain way... but I specifically want to use a function in that form. what I want is some function that would look like so:
numpy.the_function(x, coeffs, axis=0, custom_weight)
# the_function should take each x value and pass it to custom_weight as the first arg.
# then pass the column of coeffs (because axis=0)
# to custom_weight but it should do this by unpacking the column into the args a, b, and c

The problem is more because your custome_weight function is not designed to be vectorized. You are looking for something like this:
def custom_weight(x, coeffs):
return coeffs # x**np.array([[2,3,1]]).T
Output:
array([[ 3.5 , 8.4 , 15.9 , 27.2 , 43.5 , 66. ],
[ 8.505, 50.01 , 148.515, 328.02 , 612.525, 1026.03 ]])

So after messing around, one solution I found was by vectorizing. transposing the coefficients when passing the arguments to custom_weight and then unpacking the coefficients and it will broadcasting and np.vectorize takes care of the rest.
import numpy as np
def custom_weight(x, a, b):
return a*x**2 + b
x = np.linspace(-1, 1, 100)
coeffs = np.array([[0.2, 0.6],
[1.2, 0.1]])
vec_custom_weight = np.vectorize(custom_weight)
results = vec_custom_weight(xs[:, np.newaxis], *coeffs.T).T

Calculate moving average in numpy array with NaNs

I am trying to calculate the moving average in a large numpy array that contains NaNs. Currently I am using:
import numpy as np
def moving_average(a,n=5):
ret = np.cumsum(a,dtype=float)
ret[n:] = ret[n:]-ret[:-n]
return ret[-1:]/n
When calculating with a masked array:
x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx).filled(np.nan)
print y
>>> array([3.8,3.8,3.6,nan,nan,nan,2,2.4,nan,nan,nan,2.8,2.6])
The result I am looking for (below) should ideally have NaNs only in the place where the original array, x, had NaNs and the averaging should be done over the number of non-NaN elements in the grouping (I need some way to change the size of n in the function.)
y = array([4.75,4.75,nan,4.4,3.75,2.33,3.33,4,nan,nan,3,3.5,nan,3.25,4,4.5,3])
I could loop over the entire array and check index by index but the array I am using is very large and that would take a long time. Is there a numpythonic way to do this?

Pandas has a lot of really nice functionality with this. For example:
x = np.array([np.nan, np.nan, 3, 3, 3, np.nan, 5, 7, 7])
# requires three valid values in a row or the resulting value is null
print(pd.Series(x).rolling(3).mean())
#output
nan,nan,nan, nan, 3, nan, nan, nan, 6.333
# only requires 2 valid values out of three for size=3 window
print(pd.Series(x).rolling(3, min_periods=2).mean())
#output
nan, nan, nan, 3, 3, 3, 4, 6, 6.3333
You can play around with the windows/min_periods and consider filling-in nulls all in one chained line of code.

I'll just add to the great answers before that you could still use cumsum to achieve this:
import numpy as np
def moving_average(a, n=5):
ret = np.cumsum(a.filled(0))
ret[n:] = ret[n:] - ret[:-n]
counts = np.cumsum(~a.mask)
counts[n:] = counts[n:] - counts[:-n]
ret[~a.mask] /= counts[~a.mask]
ret[a.mask] = np.nan
return ret
x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx)

You could create a temporary array and use np.nanmean() (new in version 1.8 if I'm not mistaken):
import numpy as np
temp = np.vstack([x[i:-(5-i)] for i in range(5)]) # stacks vertically the strided arrays
means = np.nanmean(temp, axis=0)
and put original nan back in place with means[np.isnan(x[:-5])] = np.nan
However this look redundant both in terms of memory (stacking the same array strided 5 times) and computation.

If I understand correctly, you want to create a moving average and then populate the resulting elements as nan if their index in the original array was nan.
import numpy as np
>>> inc = 5 #the moving avg increment
>>> x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
>>> mov_avg = np.array([np.nanmean(x[idx:idx+inc]) for idx in range(len(x))])
# Determine indices in x that are nans
>>> nan_idxs = np.where(np.isnan(x))[0]
# Populate output array with nans
>>> mov_avg[nan_idxs] = np.nan
>>> mov_avg
array([ 4.75, 4.75, nan, 4.4, 3.75, 2.33333333, 3.33333333, 4., nan, nan, 3., 3.5, nan, 3.25, 4., 4.5, 3.])

Here's an approach using strides -
w = 5 # Window size
n = x.strides[0]
avgs = np.nanmean(np.lib.stride_tricks.as_strided(x, \
shape=(x.size-w+1,w), strides=(n,n)),1)
x_rem = np.append(x[-w+1:],np.full(w-1,np.nan))
avgs_rem = np.nanmean(np.lib.stride_tricks.as_strided(x_rem, \
shape=(w-1,w), strides=(n,n)),1)
avgs = np.append(avgs,avgs_rem)
avgs[np.isnan(x)] = np.nan

Currently bottleneck package should do the trick quite reliably and quickly. Here is slightly adjusted example from https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.move_mean:
>>> import bottleneck as bn
>>> a = np.array([1.0, 2.0, 3.0, np.nan, 5.0])
>>> bn.move_mean(a, window=2)
array([ nan, 1.5, 2.5, nan, nan])
>>> bn.move_mean(a, window=2, min_count=1)
array([ 1. , 1.5, 2.5, 3. , 5. ])
Note that the resulting means correspond to the last index of the window.
The package is available from Ubuntu repos, pip etc. It can operate over arbitrary axis of numpy-array etc. Besides that, it is claimed to be faster than plain-numpy implementation in many cases.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas groupby throws an error when using sum() - python

Related

Numpy/Torch insert smallest value in case of collision

How to compare two numpy arrays with multiple condition

Apply logical and/or operations along an axis in numpy python [duplicate]

numpy, pass individual arguments to np.vectorized function

Calculate moving average in numpy array with NaNs

Categories

Resources