in-group time-to-event counter - python

I'm trying to work through the methodology for churn prediction I found here:
Let's say today is 1/6/2017. I have a pandas dataframe, df, that I want to add two columns to.
df = pd.DataFrame([
['a', '2017-01-01', 0],
['a', '2017-01-02', 0],
['a', '2017-01-03', 0],
['a', '2017-01-04', 1],
['a', '2017-01-05', 1],
['b', '2017-01-01', 0],
['b', '2017-01-02', 1],
['b', '2017-01-03', 0],
['b', '2017-01-04', 0],
['b', '2017-01-05', 0]
]
,columns=['id','date','is_event']
)
df['date'] = pd.to_datetime(df['date'])
One is time_to_next_event and the other is is_censored. time_to_next_event will, within each id, decrease towards zero as an event gets closer in time. If no event exists before today, time_to_next_event will decrease in value until the end of the group.
is_censored is a binary flag for the same situation: within each id, it marks the rows that fall between the most recent event and today. For id a, the most recent row contains an event, so is_censored is 0 for the whole group. For id b, three rows fall between the most recent event and today, so is_censored is 1 for each of them.
desired = pd.DataFrame([
['a', '2017-01-01', 0, 3, 0],
['a', '2017-01-02', 0, 2, 0],
['a', '2017-01-03', 0, 1, 0],
['a', '2017-01-04', 1, 0, 0],
['a', '2017-01-05', 1, 0, 0],
['b', '2017-01-01', 0, 1, 0],
['b', '2017-01-02', 1, 0, 0],
['b', '2017-01-03', 0, 3, 1],
['b', '2017-01-04', 0, 2, 1],
['b', '2017-01-05', 0, 1, 1]
]
,columns=['id','date','is_event','time_to_next_event', 'is_censored']
)
desired['date'] = pd.to_datetime(desired['date'])
For time_to_next_event, I found this SO question but had trouble getting it to fit my use case.
For is_censored, I'm stumped so far. I'm posting this question in the hopes that some benevolent Stack Overflower will take pity on me while I sleep (working in EU) and I'll take another stab at this tomorrow. Will update with anything I find. Many thanks in advance!

To get the days until the next event, we can add a column that backfills the date of the next event:
df['next_event'] = df['date'][df['is_event'] == 1]
df['next_event'] = df.groupby('id')['next_event'].bfill()
Rows with no later event get the day after the last date in the data as a stand-in, and then we can just subtract to get the days between the next event and each day:
df['next_event'] = df['next_event'].fillna(df['date'].iloc[-1] + pd.Timedelta(days=1))
df['time_to_next_event'] = (df['next_event']-df['date']).dt.days
To get the is_censored value for each day and each id, we can group by id and forward-fill the 'is_event' column within each group, so that every row after an event carries a 1. According to the definition above, 'is_censored' should be 0 on the day of the event itself, so we only want the rows where the 1 was filled in. Comparing the forward-filled column with the original 'is_event' column and setting 'is_censored' to 1 wherever they differ gives exactly those rows.
df['is_censored'] = (df.groupby('id')['is_event'].transform(lambda x: x.mask(x == 0).ffill().fillna(0)) != df['is_event']).astype(int)
df = df.drop('next_event', axis=1)
In [343]: df
Out[343]:
  id       date  is_event  time_to_next_event  is_censored
0  a 2017-01-01         0                   3            0
1  a 2017-01-02         0                   2            0
2  a 2017-01-03         0                   1            0
3  a 2017-01-04         1                   0            0
4  a 2017-01-05         1                   0            0
5  b 2017-01-01         0                   1            0
6  b 2017-01-02         1                   0            0
7  b 2017-01-03         0                   3            1
8  b 2017-01-04         0                   2            1
9  b 2017-01-05         0                   1            1
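The fallback date in the answer above is derived from the last row of the frame, which only equals "today" when the data runs right up to yesterday. A minimal sketch that passes the reference date explicitly is given here; the variable names `today`, `event_dates` and `next_event` are illustrative choices, and it assumes rows are sorted by date within each id.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a'] * 5 + ['b'] * 5,
    'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03',
                            '2017-01-04', '2017-01-05'] * 2),
    'is_event': [0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
})
today = pd.Timestamp('2017-01-06')  # assumption: the "today" stated in the question

# Keep the date only on event rows; every other row becomes NaT
event_dates = df['date'].where(df['is_event'] == 1)

# Back-fill within each id, then fall back to `today` where no later event exists
next_event = event_dates.groupby(df['id']).bfill().fillna(today)

df['time_to_next_event'] = (next_event - df['date']).dt.days
print(df)
```

Passing `today` in explicitly keeps the countdown for censored rows correct even when the most recent row in the data is older than yesterday.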

To generalize the method for is_censored to include cases where an event happens more than once within each id, I wrote this:
df['is_censored2'] = 1
max_dates = df[df['is_event'] == 1].groupby('id',as_index=False)['date'].max()
max_dates.columns = ['id','max_date']
df = pd.merge(df,max_dates,on=['id'],how='left')
df.loc[df['date'] <= df['max_date'], 'is_censored2'] = 0
It initializes the column at 1, then finds the latest event date within each id and sets is_censored2 to 0 on every row of that id whose date is on or before it.
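The same "no event on or after this row" rule can also be sketched without a merge, using a cumulative maximum taken from the end of each group. This is an alternative technique, not the code from the answers above; it assumes the rows are sorted by date within each id and that the index is unique. The id c rows are made-up data added to show a group with more than one event.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a'] * 5 + ['b'] * 5 + ['c'] * 4,
    'is_event': [0, 0, 0, 1, 1,
                 0, 1, 0, 0, 0,
                 1, 0, 1, 0],   # id c is hypothetical: two separate events
})

# Walk each id from its last row backwards; cummax becomes 1 once an event
# has been seen, so it marks every row with an event on or after it.
has_event_ahead = df.iloc[::-1].groupby('id')['is_event'].cummax()

# Assignment aligns on the index, so the reversed row order does not matter.
df['is_censored'] = 1 - has_event_ahead
print(df)
```

Only the rows after the last event of each id end up flagged, so rows sitting between two events stay at 0.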

Related

Return a value in Pandas by index row number and column name?

I have a DataFrame whose index labels are all the same string.
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
index=['a', 'a', 'a'], columns=['A', 'B', 'C'])
>>> df
    A   B   C
a   0   2   3
a   0   4   1
a  10  20  30
Let's say I am trying to access the value in col 'B' at the first row. I am using something like this:
>>> df.iloc[0]['B']
2
Reading the post here it seems .at is recommended to be used for efficiency. Is there any better way in my example to return the value by the index row number and column name?
Try iat together with get_indexer:
df.iat[0,df.columns.get_indexer(['B'])[0]]
Out[124]: 2
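For a single column, a slightly shorter variant is possible. The sketch below assumes the column labels are unique, in which case `Index.get_loc` returns a plain integer; `col_pos`, `by_position` and `by_column` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['a', 'a', 'a'], columns=['A', 'B', 'C'])

# get_loc returns an integer position when the column label is unique
col_pos = df.columns.get_loc('B')
by_position = df.iat[0, col_pos]

# Selecting the column by name first, then the row by position, gives the same value
by_column = df['B'].iat[0]
print(by_position, by_column)
```

Both forms avoid label-based row lookup, which matters here because the row label 'a' is repeated.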

Count occurrences of feature pairs for a given set of mappings

I'm trying to aggregate a DataFrame such that for each from, and each to given in the mappings table (e.g. .iloc[0] where a maps to b), we take the corresponding f# (feature) columns from the labels table, and find the number of times that that feature mapping occurred.
The expected output is given in the output table.
Example: in the output table we can see there are 4 times when a from element with an f1 feature mapped to a to element with an f2 feature. We can deduce these as being a->b, a->c, d->e, and d->g.
Mappings
  from to
0    a  b
1    a  c
2    d  e
3    d  f
4    d  g
Labels
  name  f1  f2  f3
0    a   1   0   0
1    b   0   1   0
2    c   0   1   0
3    d   1   1   0
4    e   0   1   0
5    f   0   0   1
6    g   1   1   0
Output
    f1  f2  f3
f1   1   4   1
f2   1   2   1
f3   0   0   0
Table construction code
# dataframe 1 - the mappings
mappings = pd.DataFrame({
'from': ['a', 'a', 'd', 'd', 'd'],
'to': ['b', 'c', 'e', 'f', 'g']
})
# dataframe 2 - the labels
labels = pd.DataFrame({
'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'f1': [1, 0, 0, 1, 0, 0, 1],
'f2': [0, 1, 1, 1, 1, 0, 1],
'f3': [0, 0, 0, 0, 0, 1, 0],
})
# dataframe 3 - the expected output
output = pd.DataFrame(
index = ['f1', 'f2', 'f3'],
data = {
'f1': [1, 1, 0],
'f2': [4, 2, 0],
'f3': [1, 1, 0],
})
First we melt your labels dataframe from columns to rows, so we can easily match on them. Then we merge these values to our mapping and finally use crosstab to get your final result:
labels = labels.set_index('name').where(lambda x: x > 0).melt(ignore_index=False).dropna()
df = (
mappings.merge(labels.add_suffix('_from'), left_on='from', right_on='name')
.merge(labels.add_suffix('_to'), left_on='to', right_on='name')
)
final = pd.crosstab(index=df['variable_from'], columns=df['variable_to'])
final = (
final.reindex(index=final.columns, fill_value=0)
.rename_axis(index=None, columns=None)
).convert_dtypes()
Output
    f1  f2  f3
f1   1   4   1
f2   1   2   1
f3   0   0   0
Note:
melt(ignore_index=False) requires pandas >= 1.1.0
convert_dtypes requires pandas >= 1.0.0
For pandas < 1.1.0 we can use stack instead of melt:
(
labels.set_index('name')
.where(lambda x: x > 0)
.stack()
.reset_index(level=1)
.rename(columns={'level_1': 'variable', 0: 'value'})
)
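Because the feature columns are 0/1 indicators, the same counts can also be obtained with a matrix product instead of melt and crosstab. This is a different technique from the answer above, sketched under the assumption that every name used in mappings appears exactly once in labels; `features`, `from_feats`, `to_feats` and `counts` are illustrative names.

```python
import pandas as pd

mappings = pd.DataFrame({
    'from': ['a', 'a', 'd', 'd', 'd'],
    'to': ['b', 'c', 'e', 'f', 'g']
})
labels = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'f1': [1, 0, 0, 1, 0, 0, 1],
    'f2': [0, 1, 1, 1, 1, 0, 1],
    'f3': [0, 0, 0, 0, 0, 1, 0],
})

features = labels.set_index('name')

# One row of 0/1 features per mapping, for the "from" side and the "to" side
from_feats = features.loc[mappings['from']].to_numpy()
to_feats = features.loc[mappings['to']].to_numpy()

# Entry (i, j) counts the mappings whose "from" has feature i and whose "to" has feature j
counts = pd.DataFrame(from_feats.T @ to_feats,
                      index=features.columns, columns=features.columns)
print(counts)
```

The product sums, over all mappings, the outer product of the two indicator rows, which is why a feature with no "from" rows (f3 here) shows up as a row of zeros without any reindexing.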

Sampling from within Pandas groups with defined probabilities

Consider the following Pandas dataframe,
df = pd.DataFrame(
[
['X', 0, 0.5],
['X', 1, 0.5],
['Y', 0, 0.25],
['Y', 1, 0.3],
['Y', 2, 0.45],
['Z', 0, 0.6],
['Z', 1, 0.1],
['Z', 2, 0.3]
], columns=['NAME', 'POSITION', 'PROB'])
Notice that df defines a discrete probability distribution for each unique NAME value i.e.
assert ((df.groupby('NAME')['PROB'].sum() - 1)**2 < 1e-10).all()
What I would like to do is sample from these probability distributions.
We can think of POSITION as being the values corresponding to the probabilities. So when considering X the sample will be 0 with probability 0.5 and 1 with probability 0.5.
I would like to create a new dataframe with columns ['NAME', 'POSITION', 'PROB', 'SAMPLE'] representing these samples. Each unique SAMPLE value represents a new sample. The PROB column is now always 0 or 1, representing whether the given row was selected in the given sample. For example, if I were to select 3 samples an example outcome is below,
df_samples = pd.DataFrame(
[
['X', 0, 1, 0],
['X', 1, 0, 0],
['X', 0, 0, 1],
['X', 1, 1, 1],
['X', 0, 1, 2],
['X', 1, 0, 2],
['Y', 0, 1, 0],
['Y', 1, 0, 0],
['Y', 2, 0, 0],
['Y', 0, 0, 1],
['Y', 1, 0, 1],
['Y', 2, 1, 1],
['Y', 0, 1, 2],
['Y', 1, 0, 2],
['Y', 2, 0, 2],
['Z', 0, 0, 0],
['Z', 1, 0, 0],
['Z', 2, 1, 0],
['Z', 0, 0, 1],
['Z', 1, 0, 1],
['Z', 2, 1, 1],
['Z', 0, 1, 2],
['Z', 1, 0, 2],
['Z', 2, 0, 2],
], columns=['NAME', 'POSITION', 'PROB', 'SAMPLE'])
Of course due to the randomness involved, this is just one of a number of possible outcomes.
A unit test for the program would be that, as the number of samples increases, the mean of PROB over the samples for each (NAME, POSITION) pair should tend to the actual probability, by the law of large numbers. One could calculate a confidence region based on the total number of samples used and then make sure the true probability lies within it. For example, using a normal approximation to binomial outcomes (requires the total number of samples n_samples to be 'large'), a (-4 sd, 4 sd) region test would be:
z = 4
p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
p_true = df.set_index(['NAME', 'POSITION'])['PROB']
CI_lower = p_est - z*np.sqrt(p_est*(1-p_est)/n_samples)
CI_upper = p_est + z*np.sqrt(p_est*(1-p_est)/n_samples)
assert (p_true < CI_upper).all()
assert (p_true > CI_lower).all()
What is the most efficient way to do this in Pandas? I feel like I want to apply some sample function to the df.groupby('NAME') object.
P.S.
To be even more explicit, here is a very long winded way of doing this using Numpy.
import numpy as np

n_samples = 3
df_list = []
for name in ['X', 'Y', 'Z']:
    idx = df['NAME'] == name
    # Draw n_samples positions for this name according to its probabilities
    position_samples = np.random.choice(df.loc[idx, 'POSITION'],
                                        n_samples,
                                        p=df.loc[idx, 'PROB'])
    # One column per sample, with a 1 in the row of the sampled position
    prob = np.zeros([idx.sum(), n_samples])
    prob[position_samples, np.arange(n_samples)] = 1
    position = np.tile(np.arange(idx.sum())[:, None], n_samples)
    sample = np.tile(np.arange(n_samples)[:,None], idx.sum()).T
    df_list.append(pd.DataFrame(
        [[name, prob.ravel()[i], position.ravel()[i],
          sample.ravel()[i]]
         for i in range(n_samples*idx.sum())],
        columns=['NAME', 'PROB', 'POSITION', 'SAMPLE']))
df_samples = pd.concat(df_list)
If I understand correctly, you're looking for groupby + sample, followed by a bit of indexing.
First, sample according to the probabilities:
n_samples = 3
df_samples = df.groupby('NAME').apply(lambda x: x[['NAME', 'POSITION']] \
                                      .sample(n_samples, replace=True,
                                              weights=x.PROB)) \
               .reset_index(drop=True)
Now add the extra columns:
df_samples['SAMPLE'] = df_samples.groupby('NAME').cumcount()
df_samples['PROB'] = 1
print(df_samples)
  NAME  POSITION  SAMPLE  PROB
0    X         1       0     1
1    X         0       1     1
2    X         1       2     1
3    Y         1       0     1
4    Y         1       1     1
5    Y         1       2     1
6    Z         2       0     1
7    Z         0       1     1
8    Z         0       2     1
Note that this doesn't include the 0 probability positions for each sample as requested in the initial question but it is a more concise way of storing the information.
If we want to also include the 0 probability positions we can merge in the other positions as follows:
domain = df[['NAME', 'POSITION']].drop_duplicates()
df_samples.drop('PROB', axis=1, inplace=True)
df_samples = pd.merge(df_samples, domain, on='NAME',
suffixes=['_sample', ''])
df_samples['PROB'] = (df_samples['POSITION'] ==
df_samples['POSITION_sample']).astype(int)
df_samples.drop('POSITION_sample', axis=1, inplace=True)
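On pandas 1.1 or later, the sampling step can also be written with DataFrameGroupBy.sample, which takes the weights directly and avoids the apply. This is a rough sketch rather than the answer's own code: it assumes PROB is passed as the weights (they are normalised within each group), and the `random_state`, the large `n_samples` and the 0.05 tolerance in the final check are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['X', 0, 0.5], ['X', 1, 0.5],
     ['Y', 0, 0.25], ['Y', 1, 0.3], ['Y', 2, 0.45],
     ['Z', 0, 0.6], ['Z', 1, 0.1], ['Z', 2, 0.3]],
    columns=['NAME', 'POSITION', 'PROB'])

n_samples = 10_000  # large, so the frequencies can be compared with PROB

# Draw n_samples rows per NAME, with replacement, weighted by PROB
picks = (df.groupby('NAME')
           .sample(n=n_samples, replace=True, weights=df['PROB'], random_state=0)
           [['NAME', 'POSITION']]
           .reset_index(drop=True))
picks['SAMPLE'] = picks.groupby('NAME').cumcount()

# Empirical frequency of each (NAME, POSITION) pair versus the true probability
p_true = df.set_index(['NAME', 'POSITION'])['PROB']
p_est = picks.groupby(['NAME', 'POSITION']).size() / n_samples
p_est = p_est.reindex(p_true.index, fill_value=0)
print((p_est - p_true).abs().max())
```

The resulting `picks` frame has the same compact shape as the one above (one row per selected position), so the merge against `domain` can be applied to it unchanged if the zero rows are needed.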

How to get back the index after groupby in pandas

I am trying to find the record with the maximum value among the first records of each group after a groupby, and delete that record from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first() #lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df
cost
item_id
d 5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would also easily work on the last). In fact, because of the general nature of split-aggregate-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}
df2 = df.loc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
cost item_id
0 1 a
5 1 c
2 1 b
6 5 d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
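For the specific case of the first row per group, `groupby(...).head(1)` keeps the original index labels, so the row can be located and dropped directly. A minimal sketch under that assumption; `first_rows`, `target` and `remaining` are illustrative names, and ties are resolved by `idxmax`, which returns the first maximum.

```python
import pandas as pd

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})

# head(1) returns the first row of every group with its original index label
first_rows = df.groupby('item_id').head(1)

# Index label of the first-row record with the largest cost
target = first_rows['cost'].idxmax()

# Drop that record from the original frame
remaining = df.drop(index=target)
print(target)
print(remaining)
```

To repeat the process, the same three steps can be run again on `remaining`.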
Try this:
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
t=df.drop_duplicates(subset=['item_id'],keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Or use a negated isin.
Consider this df with a few more rows:
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd','d'],
'cost': [1, 2, 1, 1, 3, 1, 5,1,7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
Overview: Create a dataframe from a dictionary. Group by item_id and find the max value. Enumerate over the grouped result and use the key, which is a numeric position, to look up the corresponding item_id label in the index. Create a result_df dataframe if you desire.
df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped=df_temp.groupby(['item_id'])['cost'].max()
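rows = []
for key, value in enumerate(grouped):
    # key is the position within grouped; look up the matching item_id label
    index = grouped.index[key]
    rows.append({'item_id': index, 'cost': value})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
result_df = pd.DataFrame(rows, columns=['item_id', 'cost'])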
print(result_df.head(5))

Correlation between a pandas Series and a whole DataFrame

I have a series of values and I'm looking to compute the Pearson correlation with every row of a given table.
How do I do that?
Example:
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
s = pd.Series(v)
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# Here I expected df.corrwith(s) to do this, but it doesn't work
Using Series.corr() to calculate, the expected output is
-0.1666666666666666 # correlation with the first row
0.83914639167827343 # correlation with the second row
-0.35355339059327379 # correlation with the third row
The Series needs the same index as the DataFrame's columns so that the two align; then pass axis=1 to corrwith for row-wise correlation:
s1 = pd.Series(s.values, index=df.columns)
print (s1)
a -1
b 5
c 0
d 0
e 10
f 0
g -7
dtype: int64
print (df.corrwith(s1, axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
print (df.corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
EDIT:
You can also select a subset of columns first:
cols = ['a','b','e']
print (df[cols])
a b e
0 1 0 0
1 0 1 1
2 1 1 0
print (df[cols].corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.891042
1 0.891042
2 -0.838628
dtype: float64
This might be useful to those concerned with performance.
I have found this runs in half the time compared to pandas corrwith.
Your data:
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
The solution (note that v is not transformed into a series):
from scipy.stats import pearsonr
s_corrs = df.apply(lambda x: pearsonr(x.values, v)[0], axis=1)
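If the per-row apply is still too slow, the Pearson formula can be written out directly with NumPy so that all rows are handled at once. This is a sketch of a different technique from the two answers above, assuming there are no missing values and no constant rows (a constant row would give a zero denominator); `X`, `y`, `Xc`, `yc` and `r` are illustrative names.

```python
import numpy as np
import pandas as pd

v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

X = df.to_numpy(dtype=float)
y = np.asarray(v, dtype=float)

# Center each row of X and the vector y
Xc = X - X.mean(axis=1, keepdims=True)
yc = y - y.mean()

# Pearson r per row: covariance term divided by the product of the norms
r = pd.Series((Xc @ yc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(yc)),
              index=df.index)
print(r)
```

The values should agree with the expected output listed in the question, up to floating-point rounding.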
