I've currently got a set of data, as you can see here:
I am trying to use the .std() and .mean() functions in pandas to find the standard deviation and mean in order to reject outliers. Unfortunately, I keep getting the error shown below the code. I have no idea why; it might be because the headers are not numerical? I am not sure.
def reject_outliers(new1, m=3):
    return new1[abs(new1 - np.mean(new1)) < m * np.std(new1)]
new2 = reject_outliers(new1, m=3)
new2.to_csv('final.csv')
ValueError: can only convert an array of size 1 to a Python scalar
Isolate the numeric columns and apply the transformation only to them:
# get a list of the numeric columns
numcols = list(new1.select_dtypes(include=['number']).columns.values)
# run the function only on the numeric columns
new1[numcols] = reject_outliers(new1[numcols], m=3)
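For completeness, here is a minimal end-to-end sketch of the fix. The input filename is a placeholder, and I've used the DataFrame's own .mean()/.std() so the comparison is per column; neither detail comes from the original post:
import numpy as np
import pandas as pd

def reject_outliers(new1, m=3):
    # with only numeric columns, mean/std are computed per column;
    # note new1.std() uses ddof=1 while np.std() defaults to ddof=0
    return new1[abs(new1 - new1.mean()) < m * new1.std()]

new1 = pd.read_csv('data.csv')   # hypothetical input file
numcols = list(new1.select_dtypes(include=['number']).columns.values)
new1[numcols] = reject_outliers(new1[numcols], m=3)
new1.to_csv('final.csv')         # rejected cells become NaN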
I am trying to modify a pandas dataframe column this way:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE["Var"]["Jan"] = 2678400*SLICE["Var"]["Jan"]
However, this does not work. The resulting column SLICE["Var"]["Jan"] is still the same as before the multiplication.
If I multiply by a factor two orders of magnitude smaller, the multiplication works, and a subsequent multiplication by 100, which produces the value intended in the first place, also works:
SLICE["Var"]["Jan"] = 26784*SLICE["Var"]["Jan"]
SLICE["Var"]["Jan"] = 100*SLICE["Var"]["Jan"]
It seems like the scalar is too large for the multiplication. Is this a Python thing or a pandas thing? How can I make sure that the multiplication by the 7-digit number works directly?
I am using Python 3.8. The numbers in the dataframe have float32 precision and lie in a range between 5.0e-5 and -5.0e-5, with some numbers having an absolute value smaller than 1e-11.
EDIT: It might have to do with the 2-level column indexing. When I delete the first level, the calculation works:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE=SLICE.droplevel(0, axis=1)
SLICE["Jan"] = 2678400*SLICE["Jan"]
Your first method probably raises a SettingWithCopyWarning, which basically means the changes are not being made to the actual dataframe. You can use a single .loc call instead:
SLICE.loc[:,('Var', 'Jan')] = SLICE.loc[:,('Var', 'Jan')]*2678400
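To see the difference, here is a small self-contained sketch with toy data; the 2-level column structure is my assumption based on the question:
import numpy as np
import pandas as pd

# toy frame with a 2-level column index like the one in the question
cols = pd.MultiIndex.from_product([['Var'], ['Jan', 'Feb']])
SLICE = pd.DataFrame(np.random.rand(4, 2).astype('float32') * 1e-5, columns=cols)

# chained assignment: SLICE['Var'] can return a copy, so this write may be lost
# SLICE['Var']['Jan'] = 2678400 * SLICE['Var']['Jan']

# a single .loc call addresses the MultiIndex column on the original frame
SLICE.loc[:, ('Var', 'Jan')] = SLICE.loc[:, ('Var', 'Jan')] * 2678400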
This should be very basic, but there seems to be no post about it here (well, I didn't find any).
I tried to apply a Box-Cox transformation to a column in pandas, but got this error:
ValueError: Length of values does not match length of index
This is what I've done:
from scipy import stats
df['boxcox_col_1'] = stats.boxcox(df['col_1'])
Shouldn't this work?
It's just a regular pandas column with numeric variables ranging from 0.005 to 39 and no missing values.
Try this instead:
a, b = stats.boxcox(df['col_1'])
df['boxcox_col_1'] = a
Read the documentation for scipy.stats.boxcox.
The code should be:
df['boxcox_col_1'] = stats.boxcox(df['col_1'])[0]
since stats.boxcox returns more than one value, which is what causes your error.
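Putting both answers together, a minimal runnable sketch; the toy data is made up to match the stated 0.005 to 39 range:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'col_1': np.random.uniform(0.005, 39, size=100)})

# stats.boxcox returns a tuple (transformed values, fitted lambda),
# so it cannot be assigned to a single column directly
transformed, fitted_lambda = stats.boxcox(df['col_1'])
df['boxcox_col_1'] = transformed
print(fitted_lambda)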
I am trying to find the average CTR for a set of emails which I would like to categorize by the time that they were sent in order to determine if the CTR is affected by the time they were sent. But for some reason, pandas just doesn't want to let me find the mean of the CTR values.
As you'll see below, I have tried using the mean function to find the mean of the CTR for each of the times, but I continually get the error:
DataError: No numeric types to aggregate
This to me would imply that my CTR figures are not integers or floats, but are instead strings. However, though they came in as strings, I have already converted them to floats. I know this too because if I use the sum() function in lieu of the average function, it works just fine.
The line of code is very simple:
df.groupby("TIME SENT", as_index=False)['CTR'].mean()
I can't imagine why the sum function would work and the mean function would fail, especially if the error is the one described above. Anyone got any ideas?
EDIT: Code I used to turn CTR column from string percentage (85.8%) to float:
i = 0
for index, row in df.iterrows():
    df.loc[i, "CTR"] = float(row['CTR'].strip('%')) / 100
    i += 1
Link to df.head() : https://ethercalc.org/zw6xmf2c7auw
df['CTR']= (df['CTR'].str.strip('%').astype('float'))/100
The above code strips the % from the CTR column, then converts its type to float. You can then do your groupby.
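A likely explanation for sum() working while mean() fails (my reading, not stated in the thread): the row-by-row loop stores float values but leaves the column's dtype as object, and groupby().mean() refuses to aggregate object columns. You can check and fix the dtype directly:
import pandas as pd

# after the iterrows() loop the values are floats, but the column dtype stays object
print(df['CTR'].dtype)              # -> object

# coerce the column itself to a numeric dtype
df['CTR'] = pd.to_numeric(df['CTR'])
print(df['CTR'].dtype)              # -> float64

df.groupby("TIME SENT", as_index=False)['CTR'].mean()   # now aggregates fine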
Has anyone had issues with rolling standard deviations not working on only one column in a pandas dataframe?
I have a dataframe with a datetime index and associated financial data. When I run df.rolling().std() (pseudo-code; actual code below), I get correct data for all columns except one. That column returns 0's where there should be standard deviation values. I get the same result when using .rolling_std(), and when I try df.rolling().skew() all the other columns work while this column gives NaN.
What's throwing me off about this error is that the other columns work correctly and for this column, df.rolling().mean() works. In addition, the column has dtype float64, which shouldn't be a problem. I also checked and don't see missing data. I'm using a rolling window of 30 days and if I try to get the last standard deviation value using series[-30:].std() I get a correct result. So it seems like something specifically about the rolling portion isn't working. I played around with the parameters of .rolling() but couldn't get anything to change.
# combine the return, volume and slope data
raw_factor_data = pd.concat([fut_rets, vol_factors, slope_factors], axis=1)
# create new dataframe for each factor type (mean,
# std dev, skew) and combine
mean_vals = raw_factor_data.rolling(window=past, min_periods=past).mean()
mean_vals.columns = [column + '_mean' for column in list(mean_vals)]
std_vals = raw_factor_data.rolling(window=past, min_periods=past).std()
std_vals.columns = [column + '_std' for column in list(std_vals)]
skew_vals = raw_factor_data.rolling(window=past, min_periods=past).skew()
skew_vals.columns = [column + '_skew' for column in list(skew_vals)]
fact_data = pd.concat([mean_vals, std_vals, skew_vals], axis=1)
The first line combines three dataframes together. Then I create separate dataframes with rolling mean, std and skew (past = 30), and then combine those into a single dataframe.
The name of the column I'm having trouble with is 'TY1_slope'. So I've run some code as follows to see where there is an error.
print raw_factor_data['TY1_slope'][-30:].std()
print raw_factor_data['TY1_slope'][-30:].mean()
print raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).std()
print raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).mean()
The first two lines of code output a correct standard deviation and mean (.08 and .14). However, the third line of code produces zeroes but the fourth line produces accurate mean values (the final values in those series are 0.0 and .14).
If anyone can help with how to look at the .rolling source code that would be helpful too. I'm new to doing that and tried the following, but just got a few lines that didn't seem very helpful.
import inspect
import pandas as pd
print inspect.getsourcelines(pd.rolling_std)
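Side note on inspecting the source: in newer pandas the implementation lives on the Rolling object rather than a module-level function, so a sketch like this may be more informative (assuming the same dataframe):
import inspect

# look up the std implementation on the Rolling object itself rather than
# the old module-level pd.rolling_std wrapper
roller = raw_factor_data['TY1_slope'].rolling(window=30)
print(inspect.getsource(type(roller).std))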
Quoting JohnE's comment since it worked (although I'm still not sure of the root cause of the issue). JohnE, feel free to change it to an answer and I'll upvote.
shot in the dark, but you could try rolling(30).apply( lambda x: np.std(x,ddof=1) ) in case it's some weird syntax bug with rolling + std – JohnE
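Spelled out on the problem column, the comment's workaround might look like this (a sketch, assuming the same raw_factor_data frame and window as above):
import numpy as np

# rolling std computed via apply, with ddof=1 to match the sample std of .std()
ty1_std = (raw_factor_data['TY1_slope']
           .rolling(window=30, min_periods=30)
           .apply(lambda x: np.std(x, ddof=1)))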
I wanted to take a pandas DataFrame that has two columns and calculate rolling covariance between two of the columns. The catch is that sometimes I want to assume the mean is zero and sometimes I will want the proper sample covariance that is demeaned.
To do this, I have the following function that I would like to use a rolling apply with - all this does is calculate covariance assuming zero mean if not centered and calculate the usual covariance when it is centered.
def real_cov(x, y, centered=True):
    return (((x - (x.mean() if centered else 0)) * (y - (y.mean() if centered else 0))).sum() / (len(x) - 1))

# Make this a binary function
real_cov_uncentered = lambda x, y: real_cov(x, y, False)
Now assume that df is a DataFrame with 2 columns and 100 rows of numbers.
I would like to use a pandas rolling_apply function to calculate the uncentered rolling covariance using my custom function, real_cov_uncentered.
i.e. I want this code to work:
rolling_cov=pd.rolling_apply(df, window=20, func=real_cov_uncentered)
This does not work because I obviously cannot convince pandas to treat the two columns in df as the two arguments to real_cov_uncentered.
Any suggestions? Let me know if I am not being clear and I will edit.
On Edit 1:
I should add that my attempts at some hackish cleverness (that would have been unsatisfying even if it worked) also failed:
def zipped_cov(two_tuple, centered=True):
    x = np.array([two_tuple[i][0] for i in np.arange(len(two_tuple))])
    y = np.array([two_tuple[i][1] for i in np.arange(len(two_tuple))])
    return real_cov(x, y, centered)
zipped_df=pd.Series(data=zip(df['col1'], df['col2']), index=df.index)
rolling_cov=pd.rolling_apply(zipped_df, window=20, func=lambda x: zipped_cov(x, False))
Here, what I was doing was forcing my covariance call to be a unary function by zipping my two columns into one column, and then calling the zipped_cov function, which would unzip and call the original covariance function. It seems that under the hood, pandas would rather vomit than do this for me:
Error Msg:
C:\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\stats\moments.pyc in _process_data_structure(arg, kill_inf)
330
331 if not issubclass(values.dtype.type, float):
--> 332 values = values.astype(float)
333
334 if kill_inf:
ValueError: setting an array element with a sequence.
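For what it's worth (not from the original thread), the uncentered case can sidestep rolling_apply entirely: under the question's definition it is just a rolling sum of the elementwise product, and the centered case is built into pandas. A sketch, assuming df has columns 'col1' and 'col2' as in the question:
n = 20

# uncentered: sum(x*y) / (n - 1) over each window, no apply needed
rolling_cov_uncentered = (df['col1'] * df['col2']).rolling(window=n).sum() / (n - 1)

# the centered (demeaned) covariance is available directly
rolling_cov_centered = df['col1'].rolling(window=n).cov(df['col2'])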