I have the following data (see attached - easier this way). I am trying to find the first occurrence of the value 0 for each customer ID. Then, I plan to use code similar to below to create a Kaplan-Meier curve:
from lifelines import KaplanMeierFitter
## Example Data
durations = [5,6,6,2.5,4,4]
event_observed = [1, 0, 0, 1, 1, 1]
## create a kmf object
kmf = KaplanMeierFitter()
## Fit the data into the model
kmf.fit(durations, event_observed,label='Kaplan Meier Estimate')
## Create an estimate
kmf.plot(ci_show=False) ## ci_show controls the confidence interval; it is hidden here because the data set is too small for it to be meaningful.
(this code is from here).
What's the simplest way to do this? Note that I want to ignore the NAs: I have plenty of them and there's no getting around that.
Thanks!
I'm gonna assume that all rows contain at least one non-NaN value.
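The first thing to do is make sure we only operate on rows that actually contain a zero; we can accomplish this with min (this assumes the data has no negative values, so that 0 is the smallest value a row can hold):
min_series = df.min(axis=1)
This will give us a series, and we just have to select the rows whose minimum is zero:
df_with_zero = df.loc[min_series == 0]
Then, we can use idxmin on the filtered dataframe:
df_with_zero.idxmin(axis=1, skipna=True)
This should return, for each row, the period in which the first 0 is encountered (we've guaranteed that all remaining rows contain a 0, and idxmin returns the first occurrence of the minimum).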
Then, this should give you what you're looking for!
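As a rough sketch of the same idea that does not rely on 0 being the row minimum, you can compare against 0 directly and take the first True per row. The dataframe below is made up for illustration (customer IDs as the index, periods p1..p3 as columns); the real layout of the attached data is an assumption here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per customer ID, one column per period.
df = pd.DataFrame(
    {"p1": [1, np.nan, 1], "p2": [0, 1, 1], "p3": [0, 0, np.nan]},
    index=["a", "b", "c"],
)

# True where the value is exactly 0; NaN compares as False, so NAs are ignored.
is_zero = df.eq(0)

# Keep only customers that have at least one 0.
has_zero = is_zero.any(axis=1)

# idxmax returns the label of the first maximum, i.e. the first True per row.
first_zero = is_zero.loc[has_zero].astype(int).idxmax(axis=1)
print(first_zero)
```

Customers without any 0 (customer "c" here) are left out of the result rather than given a misleading period.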
I am trying to normalize price at a certain point in time with respect to price 10 seconds later using this formula: ((price(t+10seconds) – price(t)) / price(t) ) / spread(t)
Both price and spread are columns in my dataframe. And I have indexed my dataframe by timestamp (pd.datetime object) because I figured that would make calculating price(t+10sec) easier.
What I've tried so far:
pos['timestamp'] = pd.to_datetime(pos['timestamp'])
pos.set_index('timestamp')
def normalize_data(pos):
    t0 = pd.to_datetime('2021-10-27 09:30:13.201')
    x = pos['mid_price']
    y = ((x[t0 + pd.Timedelta('10 sec')] - x)/x) / (spread)
    return y
pos['norm_price'] = normalize_data(pos)
This gives me an error because I'm indexing x[t0+pd.Timedelta('10sec')] but not the other x's in the equation. I also don't think I'm using pd.Timedelta or the x[t0+pd.Time...] correctly, and I'm unsure how to fix all this or define a better function.
Any input would be much appreciated
Your problem is here:
pos.set_index('timestamp')
This line of code will return a new dataframe, and leave your original dataframe unchanged. So, your function normalize_data is working on the original version of pos, which does not have the index you want, and neither will x. Change your code to this:
pos = pos.set_index('timestamp')
And that should get things working.
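For the formula itself, here is a minimal sketch of one way to look up price(t+10 seconds) for every row at once. The sample data, the one-second tolerance, and the choice to use the most recent quote at or before t+10s are all assumptions for illustration; it also assumes the timestamps are sorted and unique.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's dataframe.
pos = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-10-27 09:30:00", "2021-10-27 09:30:05",
        "2021-10-27 09:30:10", "2021-10-27 09:30:15",
    ]),
    "mid_price": [100.0, 101.0, 102.0, 103.0],
    "spread": [0.5, 0.5, 0.5, 0.5],
})

# set_index returns a new dataframe, so assign the result back.
pos = pos.set_index("timestamp")

# For each t, take the latest price observed at or before t + 10s,
# but only if it is at most 1 second old (otherwise NaN).
future_index = pos.index + pd.Timedelta(seconds=10)
future_price = pos["mid_price"].reindex(
    future_index, method="ffill", tolerance=pd.Timedelta(seconds=1)
).to_numpy()

pos["norm_price"] = (future_price - pos["mid_price"]) / pos["mid_price"] / pos["spread"]
print(pos)
```

Rows near the end of the data have no observation 10 seconds ahead, so their norm_price is NaN.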
I wrote some code to perform interpolation based on two criteria: the amount of insurance and the deductible amount %. I was struggling to do the interpolation all at once, so I split the filtering. The table hf contains the known data on which I am basing my interpolation results. Table df contains the new data, which needs the developed factors interpolated based on hf.
Right now my workaround is to first filter each table on the ded_amount percentage, then perform the interpolation into an empty data frame and append to it after each loop.
I feel like this is inefficient and that there is a better way to do it. I'd like to hear feedback on improvements I can make. Thanks
Test data provided below.
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
deduct_fact=pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
    deduct_table=hf[hf['Ded_amount']==deduct]
    aoi_table=df[df['Ded_amount']==deduct]
    x=deduct_table['AOI']
    y=deduct_table['factor']
    f=interpolate.interp1d(x,y,fill_value="extrapolate")
    xnew=aoi_table[['AOI']]
    ynew=f(xnew)
    append_frame=aoi_table
    append_frame['Factor']=ynew
    deduct_fact=deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and append them. Have a look at this code:
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
# Create this column now
df['Factor'] = None
# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())
for deduction_amount in deduction_amounts:
    # You can index a dataframe and call a column in one line
    x, y = hf[hf['Ded_amount']==deduction_amount]['AOI'], hf[hf['Ded_amount']==deduction_amount]['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    # This is the most important bit. Lambda function on the dataframe
    # (float() unwraps the 0-d array that interp1d returns for a scalar input)
    df['Factor'] = df.apply(lambda row: float(f(row['AOI'])) if row['Ded_amount']==deduction_amount else row['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through df and gives each row's 'Factor' a value based on conditions on the other columns.
It returns the interpolation of that row's AOI value (this is what you called xnew) if the deduction amount matches; otherwise it just returns the row's existing 'Factor' value back.
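If the row-by-row apply turns out to be slow on larger tables, one possible variation is to interpolate a whole group at a time and write the results back with a boolean mask. This is only a sketch using the question's test data; iterating over hf.groupby and the mask variable are choices made for this example, not part of the code above.

```python
import pandas as pd
from scipy import interpolate

known_data = {
    "AOI": [80000, 100000, 150000, 200000, 300000] * 2,
    "Ded_amount": ["2%"] * 5 + ["3%"] * 5,
    "factor": [0.797, 0.774, 0.739, 0.733, 0.719,
               0.745, 0.737, 0.715, 0.711, 0.709],
}
new_data = {
    "AOI": [85000, 120000, 130000, 250000, 310000] * 2,
    "Ded_amount": ["2%"] * 5 + ["3%"] * 5,
}
hf = pd.DataFrame(known_data)
df = pd.DataFrame(new_data)

df["Factor"] = float("nan")
for ded_amount, known in hf.groupby("Ded_amount"):
    # One interpolator per deductible, built from the known points.
    f = interpolate.interp1d(known["AOI"], known["factor"], fill_value="extrapolate")
    # Interpolate every matching row of df in a single call.
    mask = df["Ded_amount"] == ded_amount
    df.loc[mask, "Factor"] = f(df.loc[mask, "AOI"].to_numpy())

print(df)
```

Each pass through the loop handles all rows for one deductible, so the number of Python-level iterations depends on the number of deductibles rather than the number of rows.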
Has anyone had issues with rolling standard deviations not working on only one column in a pandas dataframe?
I have a dataframe with a datetime index and associated financial data. When I run df.rolling().std() (pseudocode, see actual code below), I get correct data for all columns except one. That column returns 0's where there should be standard deviation values. I get the same problem when using .rolling_std(), and df.rolling().skew() also misbehaves: all the other columns work and this column gives NaN.
What's throwing me off about this error is that the other columns work correctly and for this column, df.rolling().mean() works. In addition, the column has dtype float64, which shouldn't be a problem. I also checked and don't see missing data. I'm using a rolling window of 30 days and if I try to get the last standard deviation value using series[-30:].std() I get a correct result. So it seems like something specifically about the rolling portion isn't working. I played around with the parameters of .rolling() but couldn't get anything to change.
# combine the return, volume and slope data
raw_factor_data = pd.concat([fut_rets, vol_factors, slope_factors], axis=1)
# create new dataframe for each factor type (mean,
# std dev, skew) and combine
mean_vals = raw_factor_data.rolling(window=past, min_periods=past).mean()
mean_vals.columns = [column + '_mean' for column in list(mean_vals)]
std_vals = raw_factor_data.rolling(window=past, min_periods=past).std()
std_vals.columns = [column + '_std' for column in list(std_vals)]
skew_vals = raw_factor_data.rolling(window=past, min_periods=past).skew()
skew_vals.columns = [column + '_skew' for column in list(skew_vals)]
fact_data = pd.concat([mean_vals, std_vals, skew_vals], axis=1)
The first line combines three dataframes together. Then I create separate dataframes with rolling mean, std and skew (past = 30), and then combine those into a single dataframe.
The name of the column I'm having trouble with is 'TY1_slope'. So I've run some code as follows to see where there is an error.
print(raw_factor_data['TY1_slope'][-30:].std())
print(raw_factor_data['TY1_slope'][-30:].mean())
print(raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).std())
print(raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).mean())
The first two lines of code output a correct standard deviation and mean (.08 and .14). However, the third line of code produces zeroes but the fourth line produces accurate mean values (the final values in those series are 0.0 and .14).
If anyone can help with how to look at the .rolling source code that would be helpful too. I'm new to doing that and tried the following, but just got a few lines that didn't seem very helpful.
import inspect
import pandas as pd
print(inspect.getsourcelines(pd.rolling_std))
Quoting JohnE's comment since it worked (although I'm still not sure of the root cause of the issue). JohnE, feel free to change it to an answer and I'll upvote.
shot in the dark, but you could try rolling(30).apply( lambda x: np.std(x,ddof=1) ) in case it's some weird syntax bug with rolling + std – JohnE
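A small sketch of that workaround next to the built-in method, so the two can be compared on your own column. The series below is random data made up for the example (it does not reproduce the zeros), and the root cause of the original problem is not established here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raw_factor_data['TY1_slope'].
rng = np.random.default_rng(0)
s = pd.Series(
    rng.normal(size=100),
    index=pd.date_range("2020-01-01", periods=100, freq="D"),
)

# Built-in rolling standard deviation (sample std, ddof=1).
builtin = s.rolling(window=30, min_periods=30).std()

# Workaround from the comment: compute the std of each window with NumPy.
manual = s.rolling(window=30, min_periods=30).apply(
    lambda x: np.std(x, ddof=1), raw=True
)

# Largest absolute disagreement between the two results.
print((builtin - manual).abs().max())
```

If the two results disagree on the real column while agreeing on data like this, that points at something specific to that column's values rather than at the call syntax.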
I have a dataset from which I want a few averages of multiple variables I created.
I started off with:
data2['socialIdeology2'].mean()
data2['econIdeology'].mean()
^ that works perfectly, and gives me the averages I'm looking for.
Now, I'm trying to do a conditional mean, i.e. the mean only for a select group within the data set. (I want the ideologies broken down by whom each respondent voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'
Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.
This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:
def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count
print(mean())
Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.
So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?
And how can I generate means by category?
Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():
means = data2.groupby('voteChoice').mean()
or maybe, in your case, the following would be more efficient:
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
to drill down to the mean you're looking for. (The first case will calculate means for all columns.) This is assuming that voteChoice is the name of the column you want to condition on.
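Here is a minimal sketch with made-up data (the values and the missing entry are assumptions). It also shows one possible reason the hand-written loop ends in nan while .mean() does not: pandas skips missing values by default, whereas adding a single NaN to a running total makes the whole total NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with one missing ideology score.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump", "Clinton"],
    "socialIdeology2": [2.0, 6.0, np.nan, 4.0, 3.0],
})

# Mean per group; NaN values are skipped within each group.
means = data2.groupby("voteChoice")["socialIdeology2"].mean()
print(means)

# A manual running total is poisoned by the first NaN it meets.
total = 0.0
for value in data2["socialIdeology2"]:
    total += value
print(total)
```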
If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:
voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:
means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()
Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in, whereas if you place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) this computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
See the pandas documentation for more info on indexing DataFrames using .loc and on groupby.
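To make the comparison concrete, here is a short sketch on invented data (column values are assumptions) showing that the boolean mask and both groupby orderings give the same number for one group; they differ only in how much work is done.

```python
import pandas as pd

# Hypothetical data with two numeric columns and the grouping column.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump"],
    "socialIdeology2": [2.0, 6.0, 3.0, 4.0],
    "econIdeology": [1.0, 5.0, 2.0, 7.0],
})

# Single group: boolean mask, then mean of one column.
voted_for_clinton = data2["voteChoice"] == "Clinton"
mean_for_clinton_voters = data2.loc[voted_for_clinton, "socialIdeology2"].mean()

# All groups: select the column first, then take the mean.
select_then_mean = data2.groupby("voteChoice")["socialIdeology2"].mean()

# All groups: mean of every column, then select one (more work, same result).
mean_then_select = data2.groupby("voteChoice").mean()["socialIdeology2"]

print(mean_for_clinton_voters, select_then_mean["Clinton"])
```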