Pandas: aggregate when column contains numpy arrays - python

I'm using a pandas DataFrame in which one column contains numpy arrays. When trying to sum that column via aggregation I get an error stating 'Must produce aggregated value'.
e.g.
import pandas as pd
import numpy as np
DF = pd.DataFrame([[1, np.array([10, 20, 30])],
                   [1, np.array([40, 50, 60])],
                   [2, np.array([20, 30, 40])]], columns=['category', 'arraydata'])
This works the way I would expect it to:
DF.groupby('category').agg(sum)
output:

           arraydata
category
1         [50 70 90]
2         [20 30 40]
However, since my real data frame has multiple numeric columns, arraydata is not chosen as the default column to aggregate on, and I have to select it manually. Here is one approach I tried:
g=DF.groupby('category')
g.agg({'arraydata':sum})
Here is another:
g=DF.groupby('category')
g['arraydata'].agg(sum)
Both give the same output:
Exception: must produce aggregated value
However if I have a column that uses numeric rather than array data, it works fine. I can work around this, but it's confusing and I'm wondering if this is a bug, or if I'm doing something wrong. I feel like the use of arrays here might be a bit of an edge case and indeed wasn't sure if they were supported. Ideas?
Thanks

One, perhaps clunkier, way to do it would be to iterate over the GroupBy object (it generates (grouping_value, df_subgroup) tuples). For example, to achieve what you want here, you could do:
grouped = DF.groupby("category")
aggregate = list((k, v["arraydata"].sum()) for k, v in grouped)
new_df = pd.DataFrame(aggregate, columns=["category", "arraydata"]).set_index("category")
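Run on the example frame above, new_df should hold the same per-category array sums as the whole-frame agg(sum) call, something like:

            arraydata
category
1          [50 70 90]
2          [20 30 40]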
This is very similar to what pandas is doing under the hood anyways [groupby, then do some aggregation, then merge back in], so you aren't really losing out on much.
Diving into the Internals
The problem here is that pandas explicitly checks that the output is not an ndarray, because it wants to intelligently reshape your array, as you can see in this snippet from _aggregate_named where the error occurs.
def _aggregate_named(self, func, *args, **kwargs):
    result = {}

    for name, group in self:
        group.name = name
        output = func(group, *args, **kwargs)
        if isinstance(output, np.ndarray):
            raise Exception('Must produce aggregated value')
        result[name] = self._try_cast(output, group)

    return result
My guess is that this happens because groupby is explicitly set up to try to intelligently put back together a DataFrame with the same indexes and everything aligned nicely. Since it's rare to have nested arrays in a DataFrame like that, it checks for ndarrays to make sure that you are actually using an aggregate function. In my gut, this feels like a job for Panel, but I'm not sure how to transform it perfectly. As an aside, you can sidestep this problem by converting your output to a list, like this:
DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
Pandas doesn't complain, because now the column holds plain Python lists (object dtype) [but this is really just cheating around the typecheck]. And if you want to convert back to arrays, just apply np.array to the column.
result = DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
result["arraydata"] = result["arraydata"].apply(np.array)
How you want to resolve this issue really depends on why you have columns of ndarray and whether you want to aggregate anything else at the same time. That said, you can always iterate over GroupBy like I've shown above.

Pandas works much more efficiently if you don't do this (e.g. by keeping the data in ordinary numeric columns, as you suggest). Another alternative is to use a Panel object for this kind of multidimensional data.
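For example, here is a minimal sketch of keeping the data numeric by expanding each array into ordinary columns and aggregating those (the column names a0/a1/a2 are just illustrative):

expanded = pd.DataFrame(DF['arraydata'].tolist(), columns=['a0', 'a1', 'a2'])  # one column per array element
numeric_df = pd.concat([DF[['category']], expanded], axis=1)

numeric_df.groupby('category').sum()   # plain, fast numeric aggregation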
That said, this looks like a bug: the Exception is being raised purely because the result is an array:
Exception: Must produce aggregated value
In [11]: %debug
> /Users/234BroadWalk/pandas/pandas/core/groupby.py(1511)_aggregate_named()
   1510             if isinstance(output, np.ndarray):
-> 1511                 raise Exception('Must produce aggregated value')
   1512             result[name] = self._try_cast(output, group)

ipdb> output
array([50, 70, 90])
If you were to recklessly remove these two lines from the source code it works as expected:
In [99]: g.agg(sum)
Out[99]:
             arraydata
category
1         [50, 70, 90]
2         [20, 30, 40]
Note: They're almost certainly in there for a reason...

The sum function only iterates over rows, i.e. it only calculates the sum along the first axis.
You can define an aggregation function:
def mySum(dataframe):
    return np.sum(np.sum(dataframe))
And then pass this function into the agg():
DF.groupby('category').agg(mySum)

Related

When I use the apply function on a single column it returns an error

I'm learning Python and want to use the "apply" function. Reading around the manual I found that if I have a simple dataframe like this:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

   A  B
0  4  9
1  4  9
2  4  9
and then I use something like this:
df.apply(lambda x:x.sum(),axis=0)
the output works because, as described, x receives each column and the sum is applied to each, so the result is correctly this:
A    12
B    27
dtype: int64
When instead I issue something like:
df['A'].apply(lambda x:x.sum())
result is: 'int' object has no attribute 'sum'
My question is: why does this work on a DataFrame column by column, but not on a single column? In the end the logic should be the same: x should receive one column as input instead of two.
I know that for this simple example I should use other functions like df.agg or even df['A'].sum() but the question is to understand the logic of apply.
If you look at a specific column of a pandas.DataFrame object, you are working with a pandas.Series whose values are (in your case) plain integers, and apply passes those individual integers to your lambda. Integers don't have a sum() method.
(Run type(df['A']) to see that you are working with a Series and not a DataFrame anymore when slicing a single column.)
The irritating part is that if you apply over an actual pandas.DataFrame object, what gets passed to your function each time is a whole column, i.e. a pandas.Series, and Series do have a sum() method.
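A quick way to see the difference, using the example df above (a hedged illustration; the exact integer type may vary):

df.apply(lambda x: type(x), axis=0)    # each x is a whole column, i.e. a pandas Series
# A    <class 'pandas.core.series.Series'>
# B    <class 'pandas.core.series.Series'>

df['A'].apply(lambda x: type(x))       # each x is a single scalar value, which has no .sum()
# 0    <class 'numpy.int64'>
# ...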
So there are two ways to fix your problem:
Work with a pandas.DataFrame and not with a pandas.Series: df[['A']]. The additional brackets force pandas to return a pandas.DataFrame object (verify with type(df[['A']])), and you can use the lambda function just as you did before.
Use a function rather than a method in the lambda: df['A'].apply(lambda x: np.sum(x)) (assuming that you have imported numpy as np).
I would recommend going with the second option, as it seems to me the more generic and clearer way; a minimal sketch of both follows below.
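As a minimal sketch of both fixes (assuming import numpy as np):

df[['A']].apply(lambda x: x.sum(), axis=0)   # option 1: double brackets keep a DataFrame, so x is a Series
df['A'].apply(lambda x: np.sum(x))           # option 2: np.sum works on scalars as well as Series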
However, this is only relevant if you want to apply a certain function to every element in a pandas.Series or pandas.DataFrame. In your specific case there is no need for the detour you are currently taking. Just use df.sum(axis=0).
The approach with apply overcomplicates things. The reason it works column by column is that every column of a pandas.DataFrame is a pandas.Series, which has a sum() method. But a pandas.DataFrame has one too, so you can use it right away.
The only case where you would actually need to go via apply is if you had arrays in every cell of the pandas.DataFrame.

Best method for non-regular index-based interpolation on grouped dataframes

Problem statement
I had the following problem:
I have samples that ran independent tests. In my dataframe, tests that share the same "test name" but belong to different samples are also independent, so each (test, sample) couple is independent and unique.
Data are collected at non-regular sampling rates, so we're talking about unequally spaced indices. This "time series" index is called nonreg_idx in the example; for the sake of simplicity it is a float between 0 and 1.
I want to figure out the value at a specific index, e.g. nonreg_idx=0.5. If the value is missing, I just want a linear interpolation that depends on the index. If this would require extrapolating because the value lies beyond the extremes of the sorted nonreg_idx of the (test, sample) group, it can be left as NaN.
Note the following from the pandas documentation:
Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.
'linear': Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
The only solution I found is long, complex and slow. I am wondering if I am missing something, or if, on the contrary, something is missing from the pandas library. I believe having independent tests on various samples with non-regular indices is a typical situation in scientific and engineering fields.
What I tried
sample data set preparation
This part is just for making an example
import pandas as pd
import numpy as np

tests = (f'T{i}' for i in range(20))
samples = (chr(i) for i in range(97, 120))
idx = pd.MultiIndex.from_product((tests, samples), names=('tests', 'samples'))
idx

dfs = list()
for ids in idx:
    group_idx = pd.MultiIndex.from_product(
        ((ids[0],), (ids[1],), tuple(np.random.random_sample(size=(90,))))
    ).sort_values()
    dfs.append(pd.DataFrame(1000 * np.random.random_sample(size=(90,)), index=group_idx))

df = pd.concat(dfs)
df = df.rename_axis(index=('test', 'sample', 'nonreg_idx')).rename({0: 'value'}, axis=1)
The (bad) solution
add_missing = df.index.droplevel('nonreg_idx').unique().to_frame().reset_index(drop=True)
add_missing['nonreg_idx'] = .5
add_missing = pd.MultiIndex.from_frame(add_missing)
added_missing = df.reindex(add_missing)
df_full = pd.concat([added_missing.loc[~added_missing.index.isin(df.index)], df])
df_full.sort_index(inplace=True)

def interp_fnc(group):
    try:
        return (group
                .reset_index(['test', 'sample'])
                .interpolate(method='slinear')
                .set_index(['test', 'sample'], append=True)
                .reorder_levels(['test', 'sample', 'nonreg_idx'])
                .sort_index())
    except:
        return group

grouped = df_full.groupby(level=['test', 'sample'])
df_filled = grouped.apply(interp_fnc)
Here, the wanted values are in df_filled. So I can do df_filled.loc[(slice(None), slice(None), .5),'value'] to get what I need for each sample/test.
I would have expected to be able to do the same in one or at most two lines of code; I have 14 here. apply is also quite a slow method, and I can't even use numba.
Question
Can someone propose a better solution?
If you think there is no better alternative, please comment and I'll open an issue...

Quickly search through a numpy array and sum the corresponding values

I have an array with around 160k entries which I get from a CSV-file and it looks like this:
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ...
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])
I now get around 3k unique ID's from some database and what I have to do is look up each of these IDs in the array and sum the corresponding numbers.
So if I would need to look up "ID0524", the sum would be 7.7.
My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):
def sumValues(self, id):
    sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]
    sum_arr = sub_arr[0:sub_arr.size, 1]
    return sum_arr.sum()
And it takes around ~18s to do this for all 3k IDs.
I wondered if there is probably any faster way to this as the current runtime seems a bit too long for me. I would appreciate any guidance and hints on this. Thank you!
You could try using the built-in NumPy methods:
numpy.intersect1d to find the unique IDs
numpy.sum to sum them up
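A hedged sketch of what a fully vectorized NumPy version might look like (this uses np.unique plus np.bincount rather than intersect1d, and assumes the ID strings are in column 0 and the numbers in column 1):

import numpy as np

ids = data_arr[:, 0]
amounts = data_arr[:, 1].astype(float)

unique_ids, inverse = np.unique(ids, return_inverse=True)   # unique IDs and, for each row, its position
sums = np.bincount(inverse, weights=amounts)                # per-ID totals in one pass

totals = dict(zip(unique_ids, sums))                        # e.g. totals['ID0524'] -> 7.7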
A convenient tool to do your task is Pandas, with its grouping mechanism.
Start from the necessary import:
import pandas as pd
Then convert data_arr to a pandasonic DataFrame:
df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})
The reason for some complication in the above code is that the elements of your input array are all of a single type (in this case object), so it is necessary to convert the second column to float.
Then you can get the expected result in a single instruction:
result = df.groupby('Id').sum()
The result, for your data sample, is:
        Amount
Id
ID0324     3.0
ID0524     7.7
ID0965     2.5
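If you then only need the roughly 3k IDs that come from your database, you can pick them out of the aggregated result in one step (ids_from_db is a hypothetical list of those IDs):

subset = result.reindex(ids_from_db)   # rows for just those IDs; missing IDs become NaN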
Another approach is to read your CSV file directly into a DataFrame (see the read_csv method), so there is no need to use any Numpy array at all. The advantage is that read_csv is clever enough to recognize the data type of each column separately; at the very least it is able to tell numbers apart from strings.
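A minimal sketch of that variant (the file name and the assumption that the file has no header row are mine):

df = pd.read_csv('data.csv', names=['Id', 'Amount'])   # hypothetical file, headerless
result = df.groupby('Id')['Amount'].sum()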

Adding new dask column based on a vectorized function

I have what I thought would be a straightforward thing to do in python using dask. I have a dataframe with some records in it, and I want to add a new column based on calling a function with values from two other columns as parameters.
Here is what I mean (pretend ge exists and takes two parameters):
def gc(x, y):
    return ge(x, y)

def gdf(df):
    func1 = np.vectorize(gc)
    gh = da.from_array(func1(df.x, df.y))
    df['gh'] = gh
However, I seem to get one issue or another no matter what I try to do. Currently, in the above state, I get
Number of partitions do not match (2 != 33)
It feels like I'm either going about this all wrong (like maybe I need map_blocks or map_partitions or even gufunc), or I'm missing something easy where I can set the number of partitions on my array to match that of my dataframe.
Any help would be appreciated.
It should be possible to do this with assign or map_partitions:
func1 = np.vectorize(gc)
df = df.assign(gh=lambda df: func1(df.x, df.y))

# or try this
def myfunc(df):
    df['gh'] = func1(df.x, df.y)
    return df

df = df.map_partitions(myfunc)
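Depending on your dask version you may also want to pass meta to map_partitions so dask knows the output schema without sampling a partition; a sketch (the column names and dtypes here are assumptions, adjust them to your real frame):

import pandas as pd

meta = pd.DataFrame({'x': pd.Series(dtype='float64'),
                     'y': pd.Series(dtype='float64'),
                     'gh': pd.Series(dtype='float64')})
df = df.map_partitions(myfunc, meta=meta)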

Applying function whose parameters depend on values of a column

I have a dataframe containing a column type of categorical data, and I have a table (dictionary) of parameter values for each possible type, each entry of which looks like
type1: [x1,x2,x3]
I have working code looking like this:
def foo(df):
    [x1, x2, x3] = parameters[df.type]
    return (* formula depending on x1, x2, x3, df.A, df.B *)

df['new_variable'] = df.apply(lambda x: foo(x), axis=1)
Iterating through the rows like this (.apply(..., axis=1)) is of course very slow, and I'd like an efficient solution, but I don't know how to do the table-lookup in a neat manner. For instance, I can't just do
df['new_variable'] = (* formula depending on parameters[df.type][0:3],df.A,df.B *)
as that throws a TypeError: 'Series' objects are mutable, thus they cannot be hashed (I'm naively trying to use a Series as a key, which doesn't work).
I suppose I could make new columns for the parameter values, but that seems inelegant somehow, and I'm sure there is a better way. What's the best way to do this?
EDIT: I just realised I can get a column with the lists of parameters via
df.type.map(parameters)
but I can't access the entries of those lists, as the usual index-conventions don't seem to work. E.g. df.type.map(parameters).loc[:,2] gives an IndexingError: Too many indexers; basically pandas gets confused when having too many dimensions without sticking it all in a MultiIndex. Is there a way to get around this?
EDIT2: a minimal example:
df = pd.DataFrame([['dog', 4], ['dog', 6], ['cat', 1], ['cat', 4]], columns=['type', 'A'])
parameters = {'dog': [1, 2], 'cat': [3, -1]}

def foo(x):
    [a, b] = parameters[x.type]
    return a * x.A + b

df['new'] = df.apply(foo, axis=1)
produces the desired output
  type  A  new
0  dog  4    6
1  dog  6    8
2  cat  1    2
3  cat  4   11
For a vectorised solution you should split your series of lists, which is what df['type'].map(parameters) gives, into separate columns. You can then leverage efficient NumPy operations:
params = pd.DataFrame(df['type'].map(parameters).values.tolist(),
                      columns=['a', 'b'])

df['new'] = params['a'] * df['A'] + params['b']
As you note, pd.DataFrame.apply is a thinly veiled, and generally inefficient, loop. It should be avoided wherever possible.
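One caveat worth adding: params is built with a fresh RangeIndex, so the arithmetic above relies on df also having a default index. If it doesn't, you can align the two first, or stay in plain NumPy arrays (a sketch, assuming import numpy as np):

params.index = df.index                               # align before the element-wise arithmetic
df['new'] = params['a'] * df['A'] + params['b']

# or skip the intermediate DataFrame entirely
a, b = np.array(df['type'].map(parameters).tolist()).T
df['new'] = a * df['A'] + b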
