Pandas ValueError: Function does not reduce - python

I have been trying to use pandas groupby to analyze data, and I encountered an issue after updating pandas from version 0.15.0 to 0.18.1 that did not exist before.
I want to calculate the number of consecutive periods where the value of 'equality' is 1 (it can only take values of 0 or 1). I defined the following lambda function and used the groupby command as follows:
import numpy as np
import pandas as pd

E = lambda x: np.sum(x.diff()==1) + x.head(1)
grouped = df.groupby(['run_'])
agg_data = grouped[['equality','avg_payoff']].mean()
agg_data['E'] = grouped.equality.agg(E)  # number of "equality" epochs
but received the error message for the last line of code:
ValueError: Function does not reduce
It is weird that this code ran perfectly before the update. This is not the first time I have encountered an issue after updating scientific computing packages, which makes me a bit frustrated. Could anyone help solve the issue? Or do I have to roll back to the old versions...

x.head(1) returns a Series (with one row, but still a Series), so the whole expression evaluates to a Series rather than a scalar, and agg() complains that the function does not reduce.
You can make a silly workaround like this
E = lambda x: np.sum(x.diff()==1) + np.sum(x.head(1))
or a little bit smarter
E = lambda x: np.sum(x.diff()==1) + x.iloc[0]
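A minimal sketch (with made-up data, keeping the question's column names) shows the fixed version reducing cleanly:

import numpy as np
import pandas as pd

df = pd.DataFrame({'run_': [1, 1, 1, 2, 2],
                   'equality': [1, 0, 1, 0, 1]})
grouped = df.groupby('run_')

# x.iloc[0] is a scalar, so the aggregation reduces to one value per group
print(grouped.equality.agg(lambda x: np.sum(x.diff() == 1) + x.iloc[0]))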

Related

How to use .apply(lambda x: function) over all the columns of a dataframe

I'm trying to pass every column of a dataframe through a custom function by using apply(lambda x: function) in Python.
The custom function I have created works individually, but when I put it into the apply(lambda x: ...) structure it only returns NaN values into the selected dataframe.
First is the custom function:
def snr_pd(wavenumber_arr):
    intensity_arr = Zhangfit_output
    signal_low = 1650
    signal_high = 1750
    noise_low = 1750
    noise_high = 1850
    signal_mask = np.logical_and((wavenumber_arr >= signal_low), (wavenumber_arr < signal_high))
    noise_mask = np.logical_and((wavenumber_arr >= noise_low), (wavenumber_arr < noise_high))
    signal = np.max(intensity_arr[signal_mask])
    noise = np.std(intensity_arr[noise_mask])
    return signal / noise
And this is the setup of the lambda function:
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis=0)
Currently I believe this is taking the columns from df, passing them to snr_pd(), and appending the results to sd under the column ['s/n'], but the only answer produced is NaN.
I have also tried a couple of structural changes, like using applymap() instead of apply():
sd['s/n'] = df.applymap(lambda x: snr_pd(x), na_action='ignore')
However, this returns this error instead:
ValueError: zero-size array to reduction operation maximum which has no identity
I have even less understanding of that one.
Any help would be much appreciated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
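For example, a quick sketch (assuming the df and snr_pd from the question) that keeps only the numeric columns and restores the column labels afterwards:

import numpy as np
import pandas as pd

num = df.select_dtypes(include='number')  # drop any non-numeric columns first
snr = pd.Series(np.apply_along_axis(snr_pd, axis=0, arr=num),
                index=num.columns)        # label each result by its column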

Compute the rolling mean over the last n days in Dask

I am trying to compute the rolling mean over the last n days (with n = 30) on a large dataset.
In Pandas, I'd use the following command:
temp = chunk.groupby('id_code').apply(
    lambda x: x.set_index('entry_time_flat').resample('1D').first())
dd = temp.groupby(level=0)['duration'].apply(
    lambda x: x.shift().rolling(min_periods=1, window=n_days).mean()
).reset_index(name="avg_delay_" + str(n_days) + "_days")
chunk = pd.merge(chunk, dd, on=['entry_time_flat', 'id_code'], how='left'
).dropna(subset=["avg_delay_" + str(n_days) + "_days"])
Basically, the function groups by "id_code" and, for the last n days over "entry_time_flat" (a datetime object), computes the mean value of the "duration" feature.
However, in order to keep the code efficient, it would be great to reproduce this function on a Dask dataframe, without transforming it into a Pandas DF.
If I run the aforementioned code on a Dask DF, it raises the following error:
TypeError: __init__() got an unexpected keyword argument 'level'
Ultimately, how could I compute the mean of the "duration" column, over the last n-days on a Dask dataframe?
Ultimately, how could I compute the mean of the "duration" column, over the last n-days on a Dask dataframe?
The rolling API should give you this functionality
https://docs.dask.org/en/latest/dataframe-api.html#rolling
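For a time-based window, that might look roughly like this (a sketch, not tested: it assumes ddf is the Dask DataFrame and that entry_time_flat can become a sorted index with known divisions):

import dask.dataframe as dd

# give ddf a sorted datetime index so time-based windows work
ddf = ddf.set_index('entry_time_flat', sorted=True)

# 30-day rolling mean of 'duration' across the whole frame
avg_delay = ddf['duration'].rolling('30D').mean()

Note that this computes a global rolling mean; reproducing the per-id_code logic from the pandas snippet would still need a groupby-apply around it.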

Why can't I use the groupby function to calculate the average of another column here?

I am trying to find the average CTR for a set of emails, categorized by the time they were sent, in order to determine whether the CTR is affected by send time. But for some reason, pandas just doesn't want to let me find the mean of the CTR values.
As you'll see below, I have tried using the mean function to find the mean of the CTR for each of the times, but I continually get the error:
DataError: No numeric types to aggregate
This to me would imply that my CTR figures are not integers or floats, but are instead strings. However, though they came in as strings, I have already converted them to floats. I know this too because if I use the sum() function in lieu of the average function, it works just fine.
The line of code is very simple:
df.groupby("TIME SENT", as_index=False)['CTR'].mean()
I can't imagine why the sum function would work and the mean function would fail, especially if the error is the one described above. Anyone got any ideas?
EDIT: Code I used to turn CTR column from string percentage (85.8%) to float:
i = 0
for index, row in df.iterrows():
    df.loc[i, "CTR"] = float(row['CTR'].strip('%'))/100
    i += 1
Link to df.head() : https://ethercalc.org/zw6xmf2c7auw
df['CTR']= (df['CTR'].str.strip('%').astype('float'))/100
The above code strips the % from the CTR column, then changes its dtype to float; you can then do your groupby. The cell-by-cell loop in your edit leaves the column's dtype as object even though every element is a float, and groupby.mean() refuses to aggregate object columns, while sum() happens to work on them because object addition is defined. That is why you saw "No numeric types to aggregate".
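To confirm the fix before grouping, a quick check (using the column names from the question):

df['CTR'] = df['CTR'].str.strip('%').astype('float') / 100
print(df['CTR'].dtype)  # float64, so mean() can aggregate it
print(df.groupby("TIME SENT", as_index=False)['CTR'].mean())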

Rolling Standard Deviation in Pandas Returning Zeroes for One Column

Has anyone had issues with rolling standard deviations not working on only one column in a pandas dataframe?
I have a dataframe with a datetime index and associated financial data. When I run df.rolling().std() (pseudocode; see actual code below), I get correct data for all columns except one. That column returns 0's where there should be standard deviation values. I get the same problem when using .rolling_std(), and when I try to run df.rolling().skew() all the other columns work while this column gives NaN.
What's throwing me off about this error is that the other columns work correctly and for this column, df.rolling().mean() works. In addition, the column has dtype float64, which shouldn't be a problem. I also checked and don't see missing data. I'm using a rolling window of 30 days and if I try to get the last standard deviation value using series[-30:].std() I get a correct result. So it seems like something specifically about the rolling portion isn't working. I played around with the parameters of .rolling() but couldn't get anything to change.
# combine the return, volume and slope data
raw_factor_data = pd.concat([fut_rets, vol_factors, slope_factors], axis=1)
# create new dataframe for each factor type (mean,
# std dev, skew) and combine
mean_vals = raw_factor_data.rolling(window=past, min_periods=past).mean()
mean_vals.columns = [column + '_mean' for column in list(mean_vals)]
std_vals = raw_factor_data.rolling(window=past, min_periods=past).std()
std_vals.columns = [column + '_std' for column in list(std_vals)]
skew_vals = raw_factor_data.rolling(window=past, min_periods=past).skew()
skew_vals.columns = [column + '_skew' for column in list(skew_vals)]
fact_data = pd.concat([mean_vals, std_vals, skew_vals], axis=1)
The first line combines three dataframes together. Then I create separate dataframes with rolling mean, std and skew (past = 30), and then combine those into a single dataframe.
The name of the column I'm having trouble with is 'TY1_slope'. So I've run some code as follows to see where there is an error.
print raw_factor_data['TY1_slope'][-30:].std()
print raw_factor_data['TY1_slope'][-30:].mean()
print raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).std()
print raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).mean()
The first two lines of code output a correct standard deviation and mean (.08 and .14). However, the third line of code produces zeroes but the fourth line produces accurate mean values (the final values in those series are 0.0 and .14).
If anyone can help with how to look at the .rolling source code that would be helpful too. I'm new to doing that and tried the following, but just got a few lines that didn't seem very helpful.
import inspect
import pandas as pd
print inspect.getsourcelines(pd.rolling_std)
Quoting JohnE's comment since it worked (although I'm still not sure of the root cause of the issue). JohnE, feel free to change it to an answer and I'll upvote.
shot in the dark, but you could try rolling(30).apply( lambda x: np.std(x,ddof=1) ) in case it's some weird syntax bug with rolling + std – JohnE
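Written out against the code above, that workaround would be (np.std with ddof=1 matches pandas' sample standard deviation, so the values should line up with .std() for the columns that already work):

import numpy as np

# same rolling window, but computing std via numpy inside apply
std_vals = raw_factor_data.rolling(window=past, min_periods=past).apply(
    lambda x: np.std(x, ddof=1))
std_vals.columns = [column + '_std' for column in list(std_vals)]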

Use numpy.average with weights for resampling a pandas array

I need to resample some data with numpy's weighted average function, and it just doesn't work...
This is my test-case:
import datetime
import numpy as np
import pandas as pd

time_vec = [datetime.datetime(2007,1,1,0,0),
            datetime.datetime(2007,1,1,0,1),
            datetime.datetime(2007,1,1,0,5),
            datetime.datetime(2007,1,1,0,8),
            datetime.datetime(2007,1,1,0,10)]
df = pd.DataFrame([2,3,1,7,4], index=time_vec)
A normal resampling without weights works fine (using a lambda function as a parameter to how, as suggested here: Pandas resampling using numpy percentile? Thanks!):
df.resample('5min',how = lambda x: np.average(x[0]))
But if I try to use some weights, it always returns TypeError: Axis must be specified when shapes of a and weights differ:
df.resample('5min',how = lambda x: np.average(x[0],weights = [1,2,3,4,5]))
I tried this with many different numbers of weights, but it did not get better:
for i in xrange(20):
    try:
        print range(i)
        print df.resample('5min', how=lambda x: np.average(x[0], weights=range(i)))
        print i
        break
    except TypeError:
        print i, 'typeError'
I'd be glad about any suggestions.
The short answer here is that the weights in your lambda need to be created dynamically based on the length of the series that is being averaged. In addition, you need to be careful about the types of objects that you're manipulating.
The code that I got to compute what I think you're trying to do is as follows:
df.resample('5min', how=lambda x: np.average(x, weights=1+np.arange(len(x))))
There are two differences compared with the line that was giving you problems:
x[0] is now just x. The x object in the lambda is a pd.Series, and so x[0] gives just the first value in the series. This was working without raising an exception in the first example (without the weights) because np.average(c) just returns c when c is a scalar. But I think it was actually computing incorrect averages even in that case, because each of the sampled subsets was just returning its first value as the "average".
The weights are created dynamically based on the length of data in the Series being resampled. You need to do this because the x in your lambda might be a Series of different length for each time interval being computed.
The way I figured this out was through some simple type debugging, by replacing the lambda with a proper function definition:
def avg(x):
    print(type(x), x.shape, type(x[0]))
    return np.average(x, weights=np.arange(1, 1+len(x)))

df.resample('5Min', how=avg)
This let me have a look at what was happening with the x variable. Hope that helps!
