Sampling from within Pandas groups with defined probabilities

Consider the following Pandas dataframe,
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['X', 0, 0.5],
        ['X', 1, 0.5],
        ['Y', 0, 0.25],
        ['Y', 1, 0.3],
        ['Y', 2, 0.45],
        ['Z', 0, 0.6],
        ['Z', 1, 0.1],
        ['Z', 2, 0.3],
    ], columns=['NAME', 'POSITION', 'PROB'])
Notice that df defines a discrete probability distribution for each unique NAME value i.e.
assert ((df.groupby('NAME')['PROB'].sum() - 1)**2 < 1e-10).all()
What I would like to do is sample from these probability distributions.
We can think of POSITION as being the values corresponding to the probabilities. So when considering X the sample will be 0 with probability 0.5 and 1 with probability 0.5.
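For a single NAME this is just a weighted draw; a minimal sketch with numpy (assuming numpy is imported as np):
np.random.choice([0, 1], p=[0.5, 0.5])  # one draw for NAME == 'X'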
I would like to create a new dataframe with columns ['NAME', 'POSITION', 'PROB', 'SAMPLE'] representing these samples. Each unique SAMPLE value identifies one sample, and the PROB column is now always 0 or 1, indicating whether the given row was selected in that sample. For example, if I were to draw 3 samples, one possible outcome is below:
df_samples = pd.DataFrame(
    [
        ['X', 0, 1, 0],
        ['X', 1, 0, 0],
        ['X', 0, 0, 1],
        ['X', 1, 1, 1],
        ['X', 0, 1, 2],
        ['X', 1, 0, 2],
        ['Y', 0, 1, 0],
        ['Y', 1, 0, 0],
        ['Y', 2, 0, 0],
        ['Y', 0, 0, 1],
        ['Y', 1, 0, 1],
        ['Y', 2, 1, 1],
        ['Y', 0, 1, 2],
        ['Y', 1, 0, 2],
        ['Y', 2, 0, 2],
        ['Z', 0, 0, 0],
        ['Z', 1, 0, 0],
        ['Z', 2, 1, 0],
        ['Z', 0, 0, 1],
        ['Z', 1, 0, 1],
        ['Z', 2, 1, 1],
        ['Z', 0, 1, 2],
        ['Z', 1, 0, 2],
        ['Z', 2, 0, 2],
    ], columns=['NAME', 'POSITION', 'PROB', 'SAMPLE'])
Of course due to the randomness involved, this is just one of a number of possible outcomes.
A unit test for the program would be that, as the number of samples increases, the mean of PROB for each (NAME, POSITION) pair should, by the law of large numbers, tend to the true probability. One could calculate a confidence region based on the total number of samples used and then make sure the true probability lies within it. For example, using a normal approximation to binomial outcomes (which requires the total number of samples n_samples to be 'large'), a (-4 sd, +4 sd) region test would be:
z = 4
p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
p_true = df.set_index(['NAME', 'POSITION'])['PROB']
CI_lower = p_est - z * np.sqrt(p_est * (1 - p_est) / n_samples)
CI_upper = p_est + z * np.sqrt(p_est * (1 - p_est) / n_samples)
assert (p_true < CI_upper).all()
assert (p_true > CI_lower).all()
What is the most efficient way to do this in Pandas? I feel like I want to apply some sample function to the df.groupby('NAME') object.
P.S.
To be even more explicit, here is a very long-winded way of doing this using Numpy.
n_samples = 3
df_list = []
for name in ['X', 'Y', 'Z']:
    idx = df['NAME'] == name
    position_samples = np.random.choice(df.loc[idx, 'POSITION'],
                                        n_samples,
                                        p=df.loc[idx, 'PROB'])
    prob = np.zeros([idx.sum(), n_samples])
    prob[position_samples, np.arange(n_samples)] = 1
    position = np.tile(np.arange(idx.sum())[:, None], n_samples)
    sample = np.tile(np.arange(n_samples)[:, None], idx.sum()).T
    df_list.append(pd.DataFrame(
        [[name, prob.ravel()[i], position.ravel()[i], sample.ravel()[i]]
         for i in range(n_samples * idx.sum())],
        columns=['NAME', 'PROB', 'POSITION', 'SAMPLE']))
df_samples = pd.concat(df_list)

If I understand correctly, you're looking for groupby + sample and then some indexing stuff
First, sample by the probabilities:
n_samples = 3
df_samples = df.groupby('NAME').apply(
    lambda x: x[['NAME', 'POSITION']].sample(n_samples, replace=True,
                                             weights=x.PROB)
).reset_index(drop=True)
Now add the extra columns:
df_samples['SAMPLE'] = df_samples.groupby('NAME').cumcount()
df_samples['PROB'] = 1
print(df_samples)
  NAME  POSITION  SAMPLE  PROB
0    X         1       0     1
1    X         0       1     1
2    X         1       2     1
3    Y         1       0     1
4    Y         1       1     1
5    Y         1       2     1
6    Z         2       0     1
7    Z         0       1     1
8    Z         0       2     1
Note that this doesn't include the 0 probability positions for each sample as requested in the initial question but it is a more concise way of storing the information.
If we want to also include the 0 probability positions we can merge in the other positions as follows:
domain = df[['NAME', 'POSITION']].drop_duplicates()
df_samples.drop('PROB', axis=1, inplace=True)
df_samples = pd.merge(df_samples, domain, on='NAME',
                      suffixes=['_sample', ''])
df_samples['PROB'] = (df_samples['POSITION'] ==
                      df_samples['POSITION_sample']).astype(int)
df_samples.drop('POSITION_sample', axis=1, inplace=True)
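As a quick sanity check, the law-of-large-numbers test from the question can be run against this df_samples (a sketch; with only 3 samples the normal approximation is very rough, so in practice you would use a much larger n_samples):
n_samples = df_samples['SAMPLE'].nunique()
p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
p_true = df.set_index(['NAME', 'POSITION'])['PROB']
half_width = 4 * np.sqrt(p_est * (1 - p_est) / n_samples)
print(((p_true > p_est - half_width) & (p_true < p_est + half_width)).all())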

Related

Count occurrences of feature pairs for a given set of mappings

I'm trying to aggregate a DataFrame such that, for each from and each to given in the mappings table (e.g. .iloc[0], where a maps to b), we take the corresponding f# (feature) columns from the labels table and count the number of times that feature mapping occurred.
The expected output is given in the output table.
Example: in the output table, the (f1, f2) cell is 4 because there are 4 mappings in which the from element has the f1 feature and the to element has the f2 feature. These are a->b, a->c, d->e, and d->g.
Mappings
  from to
0    a  b
1    a  c
2    d  e
3    d  f
4    d  g
Labels
  name  f1  f2  f3
0    a   1   0   0
1    b   0   1   0
2    c   0   1   0
3    d   1   1   0
4    e   0   1   0
5    f   0   0   1
6    g   1   1   0
Output
    f1  f2  f3
f1   1   4   1
f2   1   2   1
f3   0   0   0
Table construction code
# dataframe 1 - the mappings
mappings = pd.DataFrame({
    'from': ['a', 'a', 'd', 'd', 'd'],
    'to': ['b', 'c', 'e', 'f', 'g']
})
# dataframe 2 - the labels
labels = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'f1': [1, 0, 0, 1, 0, 0, 1],
    'f2': [0, 1, 1, 1, 1, 0, 1],
    'f3': [0, 0, 0, 0, 0, 1, 0],
})
# dataframe 3 - the expected output
output = pd.DataFrame(
    index=['f1', 'f2', 'f3'],
    data={
        'f1': [1, 1, 0],
        'f2': [4, 2, 0],
        'f3': [1, 1, 0],
    })
First we melt your labels dataframe from columns to rows, so we can easily match on them. Then we merge these values to our mapping and finally use crosstab to get your final result:
labels = labels.set_index('name').where(lambda x: x > 0).melt(ignore_index=False).dropna()
df = (
mappings.merge(labels.add_suffix('_from'), left_on='from', right_on='name')
.merge(labels.add_suffix('_to'), left_on='to', right_on='name')
)
final = pd.crosstab(index=df['variable_from'], columns=df['variable_to'])
final = (
final.reindex(index=final.columns, fill_value=0)
.rename_axis(index=None, columns=None)
).convert_dtypes()
Output
    f1  f2  f3
f1   1   4   1
f2   1   2   1
f3   0   0   0
Note:
melt(ignore_index=False) requires pandas >= 1.1.0
convert_dtypes requires pandas >= 1.0.0
For pandas < 1.1.0 we can use stack instead of melt:
(
    labels.set_index('name')
    .where(lambda x: x > 0)
    .stack()
    .reset_index(level=1)
    .rename(columns={'level_1': 'variable', 0: 'value'})
)
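Assuming the stacked expression above is assigned back to labels, it has the same 'name' index and 'variable'/'value' columns as the melt output, so (sketch) the merge and crosstab steps can be reused unchanged:
df = (
    mappings.merge(labels.add_suffix('_from'), left_on='from', right_on='name')
    .merge(labels.add_suffix('_to'), left_on='to', right_on='name')
)
final = pd.crosstab(index=df['variable_from'], columns=df['variable_to'])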

Summarize data from one dataframe to another

I would like your help with the following.
In my work I have two DataFrames. The first, called df_card_features, has card features, and its card_id column holds the unique ID of each card. The second, called df_cart_historic, has historical data for the cards in the first dataframe; in this second dataframe the card_id column is not unique, but its values are the same IDs that appear in the card_id column of the first dataframe.
As a solution I thought about building a dictionary and then adding the columns to the dataframe, but this approach seems very costly in terms of performance, because the history csv file is about 5 GB.
# card features:
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e']
date_activation = ['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']
feature_1_1 = [0, 1, 1, 1, 0]
feature_1_2 = [1, 0, 0, 0, 1]
df_card_features = pd.DataFrame()
df_card_features['card_id'] = card_id
df_card_features['date_activation'] = date_activation
df_card_features['feature_1_1'] = feature_1_1
df_card_features['feature_1_2'] = feature_1_2;
df_card_features.head()
# card historic
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e']
denied_purchase = ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y']
purchase_date = ['2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01', '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']
installments = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0 ]
month_lag = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5]
df_cart_historic = pd.DataFrame()
df_cart_historic['card_id'] = card_id
df_cart_historic['denied_purchase'] = denied_purchase
df_cart_historic['purchase_date'] = purchase_date
df_cart_historic['installments'] = installments
df_cart_historic['month_lag'] = month_lag
What I need is to create the following columns in the df_card_features dataframe:
a 'denied_purchase?' column, whose value is 1 if there is at least one 'Y' in the denied_purchase column of df_cart_historic for that card_id, and 0 if there is no 'Y' occurrence
'oldest_Date' column, whose value is the oldest date in the purchase_date column of df_cart_historic
'max_installments', which is the maximum value of the installments column of df_cart_historic
'max_month_lag', which is the maximum value of the month_lag column of df_cart_historic.
You need to group df_cart_historic by the 'card_id' column so that each new column is built using only the rows where 'card_id' has the same value.
By calling groupby('card_id').apply(func) you can use a custom function func that does the job.
Here a working example:
import pandas as pd
# card features:
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e']
date_activation = ['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']
feature_1_1 = [0, 1, 1, 1, 0]
feature_1_2 = [1, 0, 0, 0, 1]
df_card_features = pd.DataFrame()
df_card_features['card_id'] = card_id
df_card_features['date_activation'] = pd.to_datetime(date_activation) #converting to datetime
df_card_features['feature_1_1'] = feature_1_1
df_card_features['feature_1_2'] = feature_1_2;
df_card_features.head()
# card historic
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e']
denied_purchase = ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y']
purchase_date = ['2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01', '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']
installments = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0 ]
month_lag = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5]
df_cart_historic = pd.DataFrame()
df_cart_historic['card_id'] = card_id
df_cart_historic['denied_purchase'] = denied_purchase
df_cart_historic['purchase_date'] = pd.to_datetime(purchase_date) #converting to datetime
df_cart_historic['installments'] = installments
df_cart_historic['month_lag'] = month_lag
df_card_features.set_index('card_id', inplace=True) #using card_id column as index
def getnewcols(x):
    res = pd.DataFrame()
    res['denied_purchase?'] = pd.Series(['Y' if 'Y' in x['denied_purchase'].unique() else 'N'])
    res['oldest_Date'] = x['purchase_date'].min()
    res['max_installments'] = x['installments'].max()
    res['max_month_lag'] = x['month_lag'].max()
    return res
newcols = df_cart_historic.groupby('card_id').apply(getnewcols)
newcols = newcols.reset_index().drop('level_1', axis=1).set_index('card_id')
df_card_features_final = pd.concat([df_card_features, newcols], axis=1)
Notice that the column with dates is parsed with pandas.to_datetime in order to have datetime objects instead of plain strings (very handy when working with dates).
newcols is the dataframe holding the new columns, df_card_features_final is the final dataframe with all the columns:
        date_activation  feature_1_1  feature_1_2 denied_purchase? oldest_Date  max_installments  max_month_lag
card_id
card_a       2019-02-01            0            1                N  2019-02-01                 0              0
card_b       2019-05-02            1            0                Y  2019-02-01                 0              0
card_c       2018-01-20            1            0                N  2018-04-01                 8              0
card_d       2015-07-23            1            0                Y  2016-02-01                 4              0
card_e       2013-07-23            0            1                Y  2013-12-01                 5              5
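For reference, here is a sketch of the same per-card aggregation using named aggregation with groupby().agg (pandas >= 0.25); the dict unpacking is only needed because 'denied_purchase?' is not a valid Python keyword:
newcols = df_cart_historic.groupby('card_id').agg(
    **{'denied_purchase?': ('denied_purchase', lambda s: 'Y' if (s == 'Y').any() else 'N')},
    oldest_Date=('purchase_date', 'min'),
    max_installments=('installments', 'max'),
    max_month_lag=('month_lag', 'max'),
)
df_card_features_final = df_card_features.join(newcols)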

in-group time-to event counter

I'm trying to work through the methodology for churn prediction I found here:
Let's say today is 1/6/2017. I have a pandas dataframe, df, that I want to add two columns to.
df = pd.DataFrame([
    ['a', '2017-01-01', 0],
    ['a', '2017-01-02', 0],
    ['a', '2017-01-03', 0],
    ['a', '2017-01-04', 1],
    ['a', '2017-01-05', 1],
    ['b', '2017-01-01', 0],
    ['b', '2017-01-02', 1],
    ['b', '2017-01-03', 0],
    ['b', '2017-01-04', 0],
    ['b', '2017-01-05', 0]
], columns=['id', 'date', 'is_event'])
df['date'] = pd.to_datetime(df['date'])
One is time_to_next_event and the other is is_censored. time_to_next_event will, within each id, decrease towards zero as an event gets closer in time. If no event exists before today, time_to_next_event will decrease in value until the end of the group.
is_censored is a binary flag related to this phenomenon and will indicate, within each id, the rows which have occurred between the most recent event and today. For id a, the most recent row contains the event so is_censored is zero for the whole group. For id b, there are three rows between the most recent event and today so each of their is_censored values are 1.
desired = pd.DataFrame([
    ['a', '2017-01-01', 0, 3, 0],
    ['a', '2017-01-02', 0, 2, 0],
    ['a', '2017-01-03', 0, 1, 0],
    ['a', '2017-01-04', 1, 0, 0],
    ['a', '2017-01-05', 1, 0, 0],
    ['b', '2017-01-01', 0, 1, 0],
    ['b', '2017-01-02', 1, 0, 0],
    ['b', '2017-01-03', 0, 3, 1],
    ['b', '2017-01-04', 0, 2, 1],
    ['b', '2017-01-05', 0, 1, 1]
], columns=['id', 'date', 'is_event', 'time_to_next_event', 'is_censored'])
desired['date'] = pd.to_datetime(desired['date'])
For time_to_next_event, I found this SO question but had trouble getting it to fit my use case.
For is_censored, I'm stumped so far. I'm posting this question in the hopes that some benevolent Stack Overflower will take pity on me while I sleep (working in EU) and I'll take another stab at this tomorrow. Will update with anything I find. Many thanks in advance!
To get the days until the next event, we can add a column that backfills the date of the next event:
df['next_event'] = df['date'][df['is_event'] == 1]
df['next_event'] = df.groupby('id')['next_event'].transform(lambda x: x.fillna(method='bfill'))
We can then just subtract to get the days between the next event and each day:
df['next_event'] = df['next_event'].fillna(df['date'].iloc[-1] + pd.Timedelta(days=1))
df['time_to_next_event'] = (df['next_event']-df['date']).dt.days
To get the is_censored value for each day and each id, we can group by id and forward-fill based on the 'is_event' column within each group. We only want the forward-filled values, because, by the definition above, 'is_censored' should be 0 on the day of the event itself. So we compare the 'is_event' column to its forward-filled version and set 'is_censored' to 1 wherever the forward-filled value differs from the original.
df['is_censored'] = (df.groupby('id')['is_event'].transform(lambda x: x.replace(0, method='ffill')) != df['is_event']).astype(int)
df = df.drop('next_event', axis=1)
In [343]: df
Out[343]:
  id       date  is_event  time_to_next_event  is_censored
0  a 2017-01-01         0                   3            0
1  a 2017-01-02         0                   2            0
2  a 2017-01-03         0                   1            0
3  a 2017-01-04         1                   0            0
4  a 2017-01-05         1                   0            0
5  b 2017-01-01         0                   1            0
6  b 2017-01-02         1                   0            0
7  b 2017-01-03         0                   3            1
8  b 2017-01-04         0                   2            1
9  b 2017-01-05         0                   1            1
To generalize the method for is_censored to include cases where an event happens more than once within each id, I wrote this:
df['is_censored2'] = 1
max_dates = df[df['is_event'] == 1].groupby('id', as_index=False)['date'].max()
max_dates.columns = ['id', 'max_date']
df = pd.merge(df, max_dates, on=['id'], how='left')
df.loc[df['date'] <= df['max_date'], 'is_censored2'] = 0
It initializes the column at 1 then grabs the max date associated with an event within each id and populates a 0 in is_censored2 if there are any dates in id that are less than or equal to it.
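A merge-free sketch of the same idea: compute each id's last event date with a grouped transform, then flag the rows after it (ids with no events at all end up fully censored):
last_event = (df['date'].where(df['is_event'] == 1)
                        .groupby(df['id']).transform('max'))
df['is_censored2'] = ((df['date'] > last_event) | last_event.isna()).astype(int)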

Numpy: how to convert observations to probabilities?

I have a feature matrix and corresponding targets, which are ones or zeros:
# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])
As you can see, the same feature row may correspond to both ones and zeros. I need to convert my raw observation matrix to a probability matrix, where each unique feature row corresponds to the probability of seeing one as a target:
[1 1 0] -> 0.5
[0 1 0] -> 0.67
[0 0 1] -> 0
I have constructed a quite straight-forward solution:
import numpy as np
# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])
from collections import Counter

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []
    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(
        np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)
    idx = idx[::-1]
    zeros = Counter()
    ones = Counter()
    # collect row-wise number of one and zero targets
    for i, row in enumerate(features[:]):
        if targets[i] == 0:
            zeros[tuple(row)] += 1
        else:
            ones[tuple(row)] += 1
    # iterate over unique features and compute probabilities
    for k in idx:
        unique_row = features[k]
        zero_count = zeros[tuple(unique_row)]
        one_count = ones[tuple(unique_row)]
        proba = float(one_count) / float(zero_count + one_count)
        features_.append(unique_row)
        targets_.append(proba)
    return np.array(features_), np.array(targets_)
features_, targets_ = convert_obs_to_proba(features, targets)
print(features_)
print(targets_)
which:
extracts unique features;
counts the number of zero and one targets for each unique feature;
computes probability and constructs the result.
Could it be solved in a prettier way using some advanced numpy magic?
Update. The previous code was pretty inefficient (O(n^2)); I converted it to something more performance-friendly. Old code:
import numpy as np
# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []
    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(
        np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)
    idx = idx[::-1]
    # calculate ZERO class occurrences and ONE class occurrences
    for k in idx:
        unique_row = features[k]
        zeros = 0
        ones = 0
        for i, row in enumerate(features[:]):
            if np.array_equal(row, unique_row):
                if targets[i] == 0:
                    zeros += 1
                else:
                    ones += 1
        proba = float(ones) / float(zeros + ones)
        features_.append(unique_row)
        targets_.append(proba)
    return np.array(features_), np.array(targets_)
features_, targets_ = convert_obs_to_proba(features, targets)
print(features_)
print(targets_)
It's easy using Pandas:
df = pd.DataFrame(features)
df['targets'] = targets
Now you have:
   0  1  2  targets
0  1  1  0        1
1  1  1  0        0
2  0  1  0        1
3  0  1  0        1
4  0  1  0        0
5  0  0  1        0
Now, the fancy part:
df.groupby([0,1,2]).targets.mean()
Gives you:
0  1  2
0  0  1    0.000000
   1  0    0.666667
1  1  0    0.500000
Name: targets, dtype: float64
Pandas doesn't print the 0 at the leftmost part of the 0.666 row, but if you inspect the value there, it is indeed 0.
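If you need plain arrays back, a small follow-up sketch: reset the index to turn the group keys into columns.
probas = df.groupby([0, 1, 2]).targets.mean().reset_index()
features_ = probas[[0, 1, 2]].values
targets_ = probas['targets'].values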
np.sum(np.reshape([targets[f] if tuple(features[f]) == tuple(i) else 0
                   for i in np.vstack(set(map(tuple, features)))
                   for f in range(features.shape[0])],
                  features.shape[::-1]), axis=1) \
    / np.sum(np.reshape([1 if tuple(features[f]) == tuple(i) else 0
                         for i in np.vstack(set(map(tuple, features)))
                         for f in range(features.shape[0])],
                        features.shape[::-1]), axis=1)
Here you go, numpy magic! Although unnecessarily so; this could probably be cleaned up using some boring variables ;)
(And this is probably far from optimal)
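For readability, one way to unpack the one-liner into "boring variables" (a sketch; same logic, still quadratic):
unique_rows = np.vstack(list(set(map(tuple, features))))
matches = np.array([[tuple(features[f]) == tuple(u)
                     for f in range(features.shape[0])]
                    for u in unique_rows])
probas = (matches * targets).sum(axis=1) / matches.sum(axis=1)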

Correlation between a pandas Series and a whole DataFrame

I have a series of values and I'm looking to compute the Pearson correlation with every row of a given table.
How do I do that?
Example:
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
s = pd.Series(v)
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# Here I expected to do df.corrwith(s) - but it won't work
Using Series.corr() to calculate, the expected output is
-0.1666666666666666 # correlation with the first row
0.83914639167827343 # correlation with the second row
-0.35355339059327379 # correlation with the third row
You need the Series index to match the DataFrame columns so that the Series aligns with the DataFrame, and you need axis=1 in corrwith for row-wise correlation:
s1 = pd.Series(s.values, index=df.columns)
print (s1)
a -1
b 5
c 0
d 0
e 10
f 0
g -7
dtype: int64
print (df.corrwith(s1, axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
print (df.corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
EDIT:
You can also compute the row-wise correlation on a subset of columns:
cols = ['a','b','e']
print (df[cols])
a b e
0 1 0 0
1 0 1 1
2 1 1 0
print (df[cols].corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.891042
1 0.891042
2 -0.838628
dtype: float64
This might be useful to those concerned with performance.
I have found this runs in half the time compared to pandas corrwith.
Your data:
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
The solution (note that v is not transformed into a series):
from scipy.stats import pearsonr
s_corrs = df.apply(lambda x: pearsonr(x.values, v)[0], axis=1)
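If performance matters even more, here is a fully vectorized sketch with plain numpy: z-score each row and the vector, then average the elementwise products (assumes no constant rows, which would have zero standard deviation):
import numpy as np
X = df.values.astype(float)
y = np.asarray(v, dtype=float)
Xz = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
yz = (y - y.mean()) / y.std()
row_corrs = (Xz * yz).mean(axis=1)  # Pearson correlation of each row with v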
