I am using an aggregation function that I have used in my work for a long time now. The idea is that if the Series passed to the function is of length 1 (i.e. the group only has one observation), then that observation is returned. If the length of the Series is greater than one, then the observations are returned in a list.
This may seem odd to some, but this is not an XY problem; I have good reason for wanting to do this that is not relevant to this question.
This is the function that I have been using:
def MakeList(x):
    """This function is used to aggregate data that needs to be kept distinct within multi-day
    observations for later use and transformation. It makes a list of the data; if the list is of length 1,
    then there is only one line/day observation in that group, so the single element of the list is returned.
    If the list is longer than one, then there are multiple line/day observations and the list itself is
    returned."""
    L = x.tolist()
    if len(L) > 1:
        return L
    else:
        return L[0]
Now, for some reason, with the current data set I am working on, I get a ValueError stating that the function does not reduce. Here is some test data and the remaining steps I am using:
import pandas as pd

DF = pd.DataFrame({'date': ['2013-04-02'] * 10,
                   'line_code': ['401101', '401101', '401102', '401103', '401104',
                                 '401105', '401105', '401106', '401106', '401107'],
                   's.m.v.': [7.760, 25.564, 25.564, 9.550, 4.870,
                              7.760, 25.564, 5.282, 25.564, 5.282]})
DFGrouped = DF.groupby(['date', 'line_code'], as_index = False)
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})
In trying to debug this, I put print statements to the effect of print(L) and print(x.index) into the function, and the output was as follows:
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
For some reason, it appears that agg is passing the Series twice to the function. As far as I know this is not normal at all, and it is presumably the reason why my function is not reducing.
For example if I write a function like this:
def test_func(x):
    print(x.index)
    return x.iloc[0]
DF_Agg = DFGrouped.agg({'s.m.v.' : test_func})
This runs without problem, and the print statements are:
Int64Index([0, 1], dtype='int64')
Int64Index([2], dtype='int64')
Int64Index([3], dtype='int64')
Int64Index([4], dtype='int64')
Int64Index([5, 6], dtype='int64')
Int64Index([7, 8], dtype='int64')
Int64Index([9], dtype='int64')
Which indicates that each group is only being passed once as a Series to the function.
Can anyone help me understand why this is failing? I have used this function successfully with many, many data sets I work with....
Thanks
I can't really explain why, but in my experience lists in a pandas.DataFrame don't work all that well.
I usually use a tuple instead.
That will work:
def MakeList(x):
    T = tuple(x)
    if len(T) > 1:
        return T
    else:
        return T[0]
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})
date line_code s.m.v.
0 2013-04-02 401101 (7.76, 25.564)
1 2013-04-02 401102 25.564
2 2013-04-02 401103 9.55
3 2013-04-02 401104 4.87
4 2013-04-02 401105 (7.76, 25.564)
5 2013-04-02 401106 (5.282, 25.564)
6 2013-04-02 401107 5.282
This is a misfeature in DataFrame. If the aggregator returns a list for the first group, it will fail with the error you mention; if it returns a non-list (non-Series) for the first group, it will work fine. The broken code is in groupby.py:
def _aggregate_series_pure_python(self, obj, func):
    group_index, _, ngroups = self.group_info
    counts = np.zeros(ngroups, dtype=int)
    result = None
    splitter = get_splitter(obj, group_index, ngroups, axis=self.axis)

    for label, group in splitter:
        res = func(group)
        if result is None:
            if (isinstance(res, (Series, Index, np.ndarray)) or
                    isinstance(res, list)):
                raise ValueError('Function does not reduce')
            result = np.empty(ngroups, dtype='O')
        counts[label] = group.shape[0]
        result[label] = res
Notice the combination of if result is None and isinstance(res, list): only the result for the first group is ever type-checked, which is why a list returned for the first group raises the error.
Your options are:
Fake out groupby().agg(), so it doesn't see a list for the first group, or
Do the aggregation yourself, using code like that above but without the erroneous test (see the sketch below).
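For the second option, here is a minimal sketch (reusing DF and MakeList from the question) that iterates over the groups directly, so agg's type check never runs:
rows = [(date, code, MakeList(grp['s.m.v.']))
        for (date, code), grp in DF.groupby(['date', 'line_code'])]
# Rebuild the aggregated frame by hand; MakeList may return a scalar or a list.
DF_Agg = pd.DataFrame(rows, columns=['date', 'line_code', 's.m.v.'])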
Related
An example dataset I'm working with
df = pd.DataFrame({"competitorname": ["3 Musketeers", "Almond Joy"], "winpercent": [67.602936, 50.347546] }, index = [1, 2])
I am trying to see whether 3 Musketeers or Almond Joy has a higher winpercent. The code I wrote is:
more_popular = ('3 Musketeers'
                if df.loc[df["competitorname"] == '3 Musketeers', 'winpercent'].values[0]
                > df.loc[df["competitorname"] == 'Almond Joy', 'winpercent'].values[0]
                else 'Almond Joy')
My question is
Can I select the values I am interested in without Python returning a Series? Is there a way to just do
df[df["competitorname"] == 'Almond Joy', 'winpercent']
and then it would return a simple
50.347546
?
I know this doesn't make my code significantly shorter but I feel like I am missing something about getting values from pandas that would help me avoid constantly adding
.values[0]
The underlying issue is that there could be multiple matches, so we will always need to extract the match(es) at some point in the pipeline:
Use Series.idxmax on the boolean mask
Since False is 0 and True is 1, using Series.idxmax on the boolean mask will give you the index of the first True:
df.loc[df['competitorname'].eq('Almond Joy').idxmax(), 'winpercent']
# 50.347546
This assumes there is at least 1 True match, otherwise it will return the first False.
Or use Series.item on the result
This is nearly an alias for Series.values[0], except that it also validates that the Series has exactly one element:
df.loc[df['competitorname'].eq('Almond Joy'), 'winpercent'].item()
# 50.347546
This assumes there is exactly 1 True match, otherwise it will throw a ValueError.
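If the number of matches is not guaranteed in advance, a small defensive sketch (the variable names here are illustrative, not from the original answers) covers all three cases:
matches = df.loc[df['competitorname'].eq('Almond Joy'), 'winpercent']
if matches.empty:
    value = None              # no match found
elif len(matches) == 1:
    value = matches.item()    # exactly one match: a scalar
else:
    value = matches.tolist()  # several matches: keep them all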
How about simply sorting the dataframe by "winpercent" and then taking the top row?
df.sort_values(by="winpercent", ascending=False, inplace=True)
then to see the winner's row
df.head(1)
or to get the values
df.iloc[0]["winpercent"]
If you're sure that the returned Series has a single element, you can simply use .item() to get it:
import pandas as pd
df = pd.DataFrame({
    "competitorname": ["3 Musketeers", "Almond Joy"],
    "winpercent": [67.602936, 50.347546]
}, index=[1, 2])
s = df.loc[df["competitorname"] == 'Almond Joy', 'winpercent'] # a pandas Series
print(s)
# output
# 2 50.347546
# Name: winpercent, dtype: float64
v = df.loc[df["competitorname"] == 'Almond Joy', 'winpercent'].item() # a scalar value
print(v)
# output
# 50.347546
I'm trying to use DataFrame.map_partitions() from Dask to apply a function to each partition. The function takes as input a list of values and has to return the rows of the dataframe partition that contain these values in a specific column (using loc and isin).
The issue is that I get this error:
"index = partition_info['number'] - 1
TypeError: 'NoneType' object is not subscriptable"
When I print partition_info, it prints None hundreds of times (but I only have 60 elements in the loop, so we would expect only 60 prints). Is it normal that it prints None because it's a child process, or am I missing something with partition_info? I cannot find useful information on that.
from typing import List
from dask.dataframe import from_pandas

def apply_f(df, barcodes_per_core: List[List[str]], partition_info=None):
    print(partition_info)
    index = partition_info['number'] - 1
    indexes = barcodes_per_core[index]
    return df.loc[df['barcode'].isin(indexes)]

df = from_pandas(df, npartitions=nb_cores)
dfs_per_core = df.map_partitions(apply_f, barcodes_per_core, meta=df)
dfs_per_core = dfs_per_core.compute(scheduler='processes')
=> Doc of partition_info at the end of this page.
It's not clear why things are not working on your end; one potential issue is that you are re-using df multiple times. Here's a MWE that works:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(range(10), columns=["a"])
ddf = dd.from_pandas(df, npartitions=3)

def my_func(d, x, partition_info=None):
    print(x, partition_info)

ddf.map_partitions(my_func, 3, meta=df.head()).compute(scheduler='processes')
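As a hedged aside (an assumption about dask internals, not something from the original answer): dask may also call the mapped function with partition_info=None while it builds metadata and the task graph, which could explain the extra None prints in the question. A sketch of the question's function with a guard for that case:
def apply_f(d, barcodes_per_core, partition_info=None):
    # partition_info can be None during meta/graph construction (assumption),
    # so return the input unchanged in that case.
    if partition_info is None:
        return d
    idx = partition_info['number']  # partition index, 0-based per the dask docs
    return d[d['barcode'].isin(barcodes_per_core[idx])]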
I'm seeing strange behaviour in the map function when applied to a DatetimeIndex, where the first thing passed to the function is the whole index, and then each element is processed individually (as expected).
Here's a way to reproduce the issue
(Have tried it on pandas 0.22.0, 0.23.0 and 0.24.0):
df = pd.DataFrame(data=np.random.randn(3, 1),
                  index=pd.DatetimeIndex(start='2018-05-03',
                                         periods=3,
                                         freq='D'))
df.index.map(lambda x: print(x))
yields:
DatetimeIndex(['2018-05-03', '2018-05-04', '2018-05-05'], dtype='datetime64[ns]', freq='D')
2018-05-03 00:00:00
2018-05-04 00:00:00
2018-05-05 00:00:00
Index([None, None, None], dtype='object')
EDIT: The very first line that the print is producing is what I find odd. If I use a RangeIndex this doesn't happen.
Surprising print behaviour
This unusual behaviour only affects a DatetimeIndex and not a Series. So to fix the bug, wrap your index in pd.Series() before mapping the lambda function:
pd.Series(df.index).map(lambda x: print(x))
Alternatively you can use the .to_series() method:
df.index.to_series().map(lambda x: print(x))
Note the return values of the pd.Series() version will be numerically indexed, while the return values of the .to_series() version will be datetime indexed.
Is this a bug?
Index.map(), like Series.map(), returns a container holding the return values of your lambda function; for an Index, the result is itself an Index.
In this case, print() just returns None, so you are correctly getting an Index of None values. The print behaviour is inconsistent with other types of pandas Indexes and Series, but this is an unusual application.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(3, 1),
                  index=pd.DatetimeIndex(start='2018-05-03',
                                         periods=3,
                                         freq='D'))
example = df.index.map(lambda x: print(x))
# DatetimeIndex(['2018-05-03', '2018-05-04', '2018-05-05'], dtype='datetime64[ns]', freq='D')
# 2018-05-03 00:00:00
# 2018-05-04 00:00:00
# 2018-05-05 00:00:00
print(example)
# Index([None, None, None], dtype='object')
As you can see, there's nothing wrong with the return value. Or for a clearer example, where we add one day to each item:
example2 = df.index.map(lambda x: x + 1)
print(example2)
# DatetimeIndex(['2018-05-04', '2018-05-05', '2018-05-06'], dtype='datetime64[ns]', freq='D')
So the print behaviour is inconsistent with similar classes in pandas, but the return values are correct.
I have a function that takes in some complex parameters and is expected to return a filter to be used on a pandas dataframe.
filters = build_filters(df, ...)
filtered_df = df[filters]
For example, if the dataframe has series Gender and Age, build_filters could return (df.Gender == 'M') & (df.Age == 100)
If, however, build_filters determines that there should be no filters applied, is there anything that I can return (i.e. the "identity filter") that will result in df not being filtered?
I've tried the obvious things like None, True, and even a generator that returns True for every call to next()
The closest I've come is
operator.ne(df.ix[:,0], nan)
which I think is silly, and likely going to cause bugs I can't yet foresee.
You can return slice(None). Here's a trivial demonstration:
df = pd.DataFrame([[1, 2, 3]])
df2 = df[slice(None)] # equivalent to df2 = df[:]
df2[0] = -1
assert df.equals(df2)
Alternatively, use pd.DataFrame.pipe and return df if no filters need to be applied:
def apply_filters(df):
    # some logic
    if not filter_flag:
        return df
    else:
        # mask = ....
        return df[mask]

filtered_df = df.pipe(apply_filters)
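One more option, not from the original answers: an all-True boolean mask aligned to df's index also acts as an identity filter, which keeps build_filters returning the same type in every case:
identity = pd.Series(True, index=df.index)
# Selecting with an all-True mask returns every row unchanged.
assert df[identity].equals(df)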
I want to pass a function to resample() on a pandas dataframe with certain parameters specified when it is passed (as opposed to defining several separate functions).
This is the function
import itertools
import numpy as np

def spell(X, kind='wet', how='mean', threshold=0.5):
    if kind == 'wet':
        condition = X > threshold
    else:
        condition = X <= threshold
    length = [sum(1 if x == True else np.nan for x in group)
              for key, group in itertools.groupby(condition)]
    if not length:
        res = 0
    elif how == 'mean':
        res = np.mean(length)
    else:
        res = np.max(length)
    return res
Here is a dataframe:
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
And here's roughly what I want to do with it:
df.resample('M', how=spell(kind='dry',how='max',threshold=0.7))
But I get the error TypeError: spell() takes at least 1 argument (3 given). I want to be able to pass this function with these parameters specified, except for the input array. Is there a way to do this?
EDIT:
X is the input array that is passed to the function when calling the resample method on a dataframe object like so df.resample('M', how=my_func) for a monthly resampling interval.
If I try df.resample('M', how=spell) I get:
0
1960-01-31 1.875000
1960-02-29 1.500000
1960-03-31 1.888889
1960-04-30 3.000000
which is exactly what I want for the default parameters, but I want to be able to specify the input parameters to the function before passing it. This might involve storing the definition in another variable, but I'm not sure how to do this with the default parameters changed.
I think this may be what you're looking for, though it's a little hard to tell. Let me know if this helps. First, the example dataframe:
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
EDIT: originally had a greater-than instead of a less-than-or-equal-to...
Next, the function:
def spell(df, column='', kind='wet', rule='M', how='mean', threshold=0.5):
    if kind == 'wet':
        df = df[df[column] > threshold]
    else:
        df = df[df[column] <= threshold]
    df = df.resample(rule=rule, how=how)
    return df
So, you would call it with:
spell(df, 0)
To get:
0
1960-01-31 0.721519
1960-02-29 0.754054
1960-03-31 0.746341
1960-04-30 0.654872
You can change around the parameters as well:
spell(df, 0, kind='something else', rule='W', how='max', threshold=0.7)
0
1960-01-03 0.570638
1960-01-10 0.529357
1960-01-17 0.565959
1960-01-24 0.682973
1960-01-31 0.676349
1960-02-07 0.379397
1960-02-14 0.680303
1960-02-21 0.654014
1960-02-28 0.546587
1960-03-06 0.699459
1960-03-13 0.626460
1960-03-20 0.611464
1960-03-27 0.685950
1960-04-03 0.688385
1960-04-10 0.697602
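For completeness, a hedged note not covered by the answer above: the standard-library way to pre-bind keyword arguments without rewriting the function is functools.partial, which should work with the original spell and the old how= resample API that the question uses (assuming a pandas version that still supports how=):
from functools import partial

# Bind everything except the input array X; resample supplies X for each bin.
df.resample('M', how=partial(spell, kind='dry', how='max', threshold=0.7))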