Summarize data from one dataframe to another - python

I would like your help with the following.
In my work I have two DataFrames. The first, called df_card_features, holds card features, and its card_id column contains the unique ID of each card. The second, called df_cart_historic, holds the purchase history for the cards in the first dataframe; in this second dataframe the card_id column is not unique, but its values match the card_id values of the first dataframe.
As a solution I thought about building a dictionary and then adding the columns to the dataframe, but that approach seems very costly in terms of performance, because the history CSV file is about 5 GB.
# card features:
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e']
date_activation = ['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']
feature_1_1 = [0, 1, 1, 1, 0]
feature_1_2 = [1, 0, 0, 0, 1]
df_card_features = pd.DataFrame()
df_card_features['card_id'] = card_id
df_card_features['date_activation'] = date_activation
df_card_features['feature_1_1'] = feature_1_1
df_card_features['feature_1_2'] = feature_1_2
df_card_features.head()
# card historic
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e']
denied_purchase = ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y']
purchase_date = ['2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01', '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']
installments = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0 ]
month_lag = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5]
df_cart_historic = pd.DataFrame()
df_cart_historic['card_id'] = card_id
df_cart_historic['denied_purchase'] = denied_purchase
df_cart_historic['purchase_date'] = purchase_date
df_cart_historic['installments'] = installments
df_cart_historic['month_lag'] = month_lag
What I need is to create the following columns in the df_card_features dataframe:
a 'denied_purchase?' column, whose value is 1 if there is at least one 'Y' in the denied_purchase column of df_cart_historic for that card_id, or 0 if there is no 'Y' occurrence
an 'oldest_Date' column, whose value is the oldest date in the purchase_date column of df_cart_historic for that card_id
'max_installments', the maximum value of the installments column of df_cart_historic for that card_id
'max_month_lag', the maximum value of the month_lag column of df_cart_historic for that card_id.

You need to use groupby on the 'card_id' column of df_cart_historic in order to build the new columns using only the rows that share the same 'card_id' value.
By calling groupby('card_id').apply(func) you can use a custom function func which does the job.
Here is a working example:
import pandas as pd
# card features:
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e']
date_activation = ['2019-02-01', '2019-05-02', '2018-01-20', '2015-07-23', '2013-07-23']
feature_1_1 = [0, 1, 1, 1, 0]
feature_1_2 = [1, 0, 0, 0, 1]
df_card_features = pd.DataFrame()
df_card_features['card_id'] = card_id
df_card_features['date_activation'] = pd.to_datetime(date_activation) #converting to datetime
df_card_features['feature_1_1'] = feature_1_1
df_card_features['feature_1_2'] = feature_1_2
df_card_features.head()
# card historic
card_id = ['card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e', 'card_a', 'card_b', 'card_c', 'card_d', 'card_e']
denied_purchase = ['N', 'Y', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y']
purchase_date = ['2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-01', '2019-02-10', '2019-02-11', '2019-02-21', '2019-03-01', '2019-03-01', '2019-03-01', '2019-03-31', '2018-04-01', '2016-02-01', '2013-12-01']
installments = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 8, 4, 0 ]
month_lag = [0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5]
df_cart_historic = pd.DataFrame()
df_cart_historic['card_id'] = card_id
df_cart_historic['denied_purchase'] = denied_purchase
df_cart_historic['purchase_date'] = pd.to_datetime(purchase_date) #converting to datetime
df_cart_historic['installments'] = installments
df_cart_historic['month_lag'] = month_lag
df_card_features.set_index('card_id', inplace=True) #using card_id column as index
def getnewcols(x):
    res = pd.DataFrame()
    res['denied_purchase?'] = pd.Series(['Y' if 'Y' in x['denied_purchase'].unique() else 'N'])
    res['oldest_Date'] = x['purchase_date'].min()
    res['max_installments'] = x['installments'].max()
    res['max_month_lag'] = x['month_lag'].max()
    return res
newcols = df_cart_historic.groupby('card_id').apply(getnewcols)
newcols = newcols.reset_index().drop('level_1', axis=1).set_index('card_id')
df_card_features_final = pd.concat([df_card_features, newcols], axis=1)
Notice that the date columns are parsed with pandas.to_datetime in order to have datetime objects instead of plain strings (very handy when working with dates).
newcols is the dataframe holding the new columns, and df_card_features_final is the final dataframe with all the columns:
date_activation feature_1_1 feature_1_2 denied_purchase? oldest_Date max_installments max_month_lag
card_id
card_a 2019-02-01 0 1 N 2019-02-01 0 0
card_b 2019-05-02 1 0 Y 2019-02-01 0 0
card_c 2018-01-20 1 0 N 2018-04-01 8 0
card_d 2015-07-23 1 0 Y 2016-02-01 4 0
card_e 2013-07-23 0 1 Y 2013-12-01 5 5
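As a side note (my own sketch, not part of the original answer): the same summary can be written with groupby().agg() using named aggregation, which avoids the per-group apply and the index cleanup, and also produces the 1/0 flag asked for in the question. If the 5 GB history CSV does not fit in memory, the same per-card aggregations can be computed chunk by chunk with pd.read_csv(..., chunksize=...) and then combined (min of the per-chunk minimums, max of the per-chunk maximums).
# Alternative sketch (assumes df_cart_historic and the card_id-indexed df_card_features from above;
# named aggregation needs pandas >= 0.25):
newcols_agg = df_cart_historic.groupby('card_id').agg(
    denied_purchase=('denied_purchase', lambda s: int((s == 'Y').any())),  # 1/0 flag
    oldest_Date=('purchase_date', 'min'),
    max_installments=('installments', 'max'),
    max_month_lag=('month_lag', 'max'),
).rename(columns={'denied_purchase': 'denied_purchase?'})
df_card_features_final = df_card_features.join(newcols_agg)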

Related

Pandas how to reorder my table or dataframe

I have a dataframe in pandas where each column has a different value range. For example:
My desired output is:
First, it sets a two-level index, then unstacks on the MultiIndex and, finally, renames the columns.
df = pd.DataFrame({'axis_x': [0, 1, 2, 0, 1, 2, 0, 1, 2], 'axis_y': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'data': ['diode', 'switch', 'coil', '$2.2', '$4.5', '$3.2', 'colombia', 'china', 'brazil']})
df = df.set_index(['axis_x', 'axis_y']).unstack().rename(columns={0: 'product', 1: 'price', 2: 'country'})
print(df)
Prints:
data
axis_y product price country
axis_x
0 diode $2.2 colombia
1 switch $4.5 china
2 coil $3.2 brazil

How to merge pandas.core.series.Series and numpy.int32 to pandas.core.frame.DataFrame?

I have a df. On this data I built a clustering model and found the labels; the labels come back as an array, and now I need to merge the data and the labels.
data = [['M', 10, 'red', 'apple'],
        ['F', 15, 'blue', 'orange'],
        ['M', 14, 'blue', 'apple'],
        ['M', 14, 'blue', 'apple'],
        ['F', 14, 'blue', 'apple'],
        ['M', 14, 'red', ''],
        ['M', 14, 'blue', 'banana'],
        ['', 14, 'blue', 'apple']]
df = pd.DataFrame(data, columns = ['Gender', 'Age', 'Color','Fruit'])
df is encoded as numbers; then the labels are obtained as
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans

df_encode = OneHotEncoder(df)  # encoding step shown schematically
kmeans = KMeans(n_clusters=2)
kmeans.fit(df_encode.values)
labels = kmeans.labels_
type(labels)
Out[120]: numpy.ndarray
labels
Out[122]: array([1, 0, 1, 0, 1, 1, 0, 0])
I view both of them as follows:
for i in range(len(df_encode)):
    print("coordinate:", df_encode.iloc[i], "label:", labels[i])
This gives output like
coordinate:
Gender 1.0
Age 10.0
Color 0.0
Fruit 1.0
label: 0
Here how should I merge label as a column in df_encode dataframe?
Turn it into a list and attach it to your dataframe:
kmf2labels = labels.tolist()
df_encode['labels'] = kmf2labels
Output:
df_encode['labels']
Out[39]:
0 1
1 0
2 0
3 0
4 0
5 0
6 0
7 0
Name: labels, dtype: int64
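A small addition on my part, not in the original answer: the tolist() step is optional, since pandas accepts a NumPy array of matching length directly as a new column.
df_encode['labels'] = labels  # assigning the ndarray directly gives the same result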

Sampling from within Pandas groups with defined probabilities

Consider the following Pandas dataframe,
df = pd.DataFrame(
    [
        ['X', 0, 0.5],
        ['X', 1, 0.5],
        ['Y', 0, 0.25],
        ['Y', 1, 0.3],
        ['Y', 2, 0.45],
        ['Z', 0, 0.6],
        ['Z', 1, 0.1],
        ['Z', 2, 0.3]
    ], columns=['NAME', 'POSITION', 'PROB'])
Notice that df defines a discrete probability distribution for each unique NAME value i.e.
assert ((df.groupby('NAME')['PROB'].sum() - 1)**2 < 1e-10).all()
What I would like to do is sample from these probability distributions.
We can think of POSITION as being the values corresponding to the probabilities. So when considering X the sample will be 0 with probability 0.5 and 1 with probability 0.5.
I would like to create a new dataframe with columns ['NAME', 'POSITION', 'PROB', 'SAMPLE'] representing these samples. Each unique SAMPLE value represents a new sample. The PROB column is now always 0 or 1, representing whether the given row was selected in the given sample. For example, if I were to select 3 samples an example outcome is below,
df_samples = pd.DataFrame(
[
['X', 0, 1, 0],
['X', 1, 0, 0],
['X', 0, 0, 1],
['X', 1, 1, 1],
['X', 0, 1, 2],
['X', 1, 0, 2],
['Y', 0, 1, 0],
['Y', 1, 0, 0],
['Y', 2, 0, 0],
['Y', 0, 0, 1],
['Y', 1, 0, 1],
['Y', 2, 1, 1],
['Y', 0, 1, 2],
['Y', 1, 0, 2],
['Y', 2, 0, 2],
['Z', 0, 0, 0],
['Z', 1, 0, 0],
['Z', 2, 1, 0],
['Z', 0, 0, 1],
['Z', 1, 0, 1],
['Z', 2, 1, 1],
['Z', 0, 1, 2],
['Z', 1, 0, 2],
['Z', 2, 0, 2],
], columns=['NAME', 'POSITION', 'PROB', 'SAMPLE'])
Of course due to the randomness involved, this is just one of a number of possible outcomes.
A unit test for the program would be that, as the number of samples increases, by the law of large numbers the mean selection frequency for each (NAME, POSITION) pair should tend to the true probability. One could calculate a confidence region based on the total number of samples used and then make sure the true probability lies within it. For example, using a normal approximation to binomial outcomes (which requires the total number of samples n_samples to be 'large'), a (-4 sd, 4 sd) region test would be:
import numpy as np

z = 4
p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
p_true = df.set_index(['NAME', 'POSITION'])['PROB']
CI_lower = p_est - z*np.sqrt(p_est*(1-p_est)/n_samples)
CI_upper = p_est + z*np.sqrt(p_est*(1-p_est)/n_samples)
assert (p_true < CI_upper).all()
assert (p_true > CI_lower).all()
What is the most efficient way to do this in Pandas? I feel like I want to apply some sample function to the df.groupby('NAME') object.
P.S.
To be even more explicit, here is a very long-winded way of doing this using NumPy.
n_samples = 3
df_list = []
for name in ['X', 'Y', 'Z']:
    idx = df['NAME'] == name
    position_samples = np.random.choice(df.loc[idx, 'POSITION'],
                                        n_samples,
                                        p=df.loc[idx, 'PROB'])
    prob = np.zeros([idx.sum(), n_samples])
    prob[position_samples, np.arange(n_samples)] = 1
    position = np.tile(np.arange(idx.sum())[:, None], n_samples)
    sample = np.tile(np.arange(n_samples)[:, None], idx.sum()).T
    df_list.append(pd.DataFrame(
        [[name, prob.ravel()[i], position.ravel()[i], sample.ravel()[i]]
         for i in range(n_samples*idx.sum())],
        columns=['NAME', 'PROB', 'POSITION', 'SAMPLE']))
df_samples = pd.concat(df_list)
If I understand correctly, you're looking for groupby + sample and then some indexing stuff.
First, sample by the probabilities:
n_samples = 3
df_samples = (df.groupby('NAME')
                .apply(lambda x: x[['NAME', 'POSITION']]
                       .sample(n_samples, replace=True, weights=x.PROB))
                .reset_index(drop=True))
Now add the extra columns:
df_samples['SAMPLE'] = df_samples.groupby('NAME').cumcount()
df_samples['PROB'] = 1
print(df_samples)
NAME POSITION SAMPLE PROB
0 X 1 0 1
1 X 0 1 1
2 X 1 2 1
3 Y 1 0 1
4 Y 1 1 1
5 Y 1 2 1
6 Z 2 0 1
7 Z 0 1 1
8 Z 0 2 1
Note that this doesn't include the 0-probability positions for each sample, as requested in the initial question, but it is a more concise way of storing the information.
If we want to also include the 0 probability positions we can merge in the other positions as follows:
domain = df[['NAME', 'POSITION']].drop_duplicates()
df_samples.drop('PROB', axis=1, inplace=True)
df_samples = pd.merge(df_samples, domain, on='NAME',
suffixes=['_sample', ''])
df_samples['PROB'] = (df_samples['POSITION'] ==
df_samples['POSITION_sample']).astype(int)
df_samples.drop('POSITION_sample', axis=1, inplace=True)
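To tie this back to the law-of-large-numbers unit test sketched in the question, here is a hedged, self-contained check of the groupby + sample approach (my own sketch; the number of samples is an arbitrary 'large' choice):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['X', 0, 0.5],
        ['X', 1, 0.5],
        ['Y', 0, 0.25],
        ['Y', 1, 0.3],
        ['Y', 2, 0.45],
        ['Z', 0, 0.6],
        ['Z', 1, 0.1],
        ['Z', 2, 0.3]
    ], columns=['NAME', 'POSITION', 'PROB'])

n_samples = 20000  # arbitrary "large" number of samples for the check
df_samples = (df.groupby('NAME')
                .apply(lambda x: x[['NAME', 'POSITION']]
                       .sample(n_samples, replace=True, weights=x.PROB))
                .reset_index(drop=True))
df_samples['SAMPLE'] = df_samples.groupby('NAME').cumcount()

# empirical selection frequency per (NAME, POSITION) vs. the true probability
p_true = df.set_index(['NAME', 'POSITION'])['PROB']
p_est = (df_samples.groupby(['NAME', 'POSITION']).size() / n_samples
         ).reindex(p_true.index, fill_value=0)

# 4-standard-deviation band from the normal approximation, as in the question
z = 4
sd = np.sqrt(p_est * (1 - p_est) / n_samples)
assert ((p_true > p_est - z * sd) & (p_true < p_est + z * sd)).all()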

In-group time-to-event counter

I'm trying to work through the methodology for churn prediction I found here:
Let's say today is 1/6/2017. I have a pandas dataframe, df, that I want to add two columns to.
df = pd.DataFrame([
['a', '2017-01-01', 0],
['a', '2017-01-02', 0],
['a', '2017-01-03', 0],
['a', '2017-01-04', 1],
['a', '2017-01-05', 1],
['b', '2017-01-01', 0],
['b', '2017-01-02', 1],
['b', '2017-01-03', 0],
['b', '2017-01-04', 0],
['b', '2017-01-05', 0]
]
,columns=['id','date','is_event']
)
df['date'] = pd.to_datetime(df['date'])
One is time_to_next_event and the other is is_censored. time_to_next_event will, within each id, decrease towards zero as an event gets closer in time. If no event exists before today, time_to_next_event will decrease in value until the end of the group.
is_censored is a binary flag related to this phenomenon and will indicate, within each id, the rows which have occurred between the most recent event and today. For id a, the most recent row contains the event so is_censored is zero for the whole group. For id b, there are three rows between the most recent event and today so each of their is_censored values are 1.
desired = pd.DataFrame([
['a', '2017-01-01', 0, 3, 0],
['a', '2017-01-02', 0, 2, 0],
['a', '2017-01-03', 0, 1, 0],
['a', '2017-01-04', 1, 0, 0],
['a', '2017-01-05', 1, 0, 0],
['b', '2017-01-01', 0, 1, 0],
['b', '2017-01-02', 1, 0, 0],
['b', '2017-01-03', 0, 3, 1],
['b', '2017-01-04', 0, 2, 1],
['b', '2017-01-05', 0, 1, 1]
]
,columns=['id','date','is_event','time_to_next_event', 'is_censored']
)
desired['date'] = pd.to_datetime(desired['date'])
For time_to_next_event, I found this SO question but had trouble getting it to fit my use case.
For is_censored, I'm stumped so far. I'm posting this question in the hopes that some benevolent Stack Overflower will take pity on me while I sleep (working in EU) and I'll take another stab at this tomorrow. Will update with anything I find. Many thanks in advance!
To get the days until the next event, we can add a column that backfills the date of the next event:
df['next_event'] = df['date'][df['is_event'] == 1]
df['next_event'] = df.groupby('id')['next_event'].transform(lambda x: x.fillna(method='bfill'))
We can then just subtract to get the days between the next event and each day:
df['next_event'] = df['next_event'].fillna(df['date'].iloc[-1] + pd.Timedelta(days=1))
df['time_to_next_event'] = (df['next_event']-df['date']).dt.days
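A small design note, my own and not part of the original answer: the fillna above anchors the groups with no remaining event on the day after the last date in the frame; if 'today' should be pinned explicitly (2017-01-06 in the question), it can be supplied directly:
today = pd.Timestamp('2017-01-06')                 # "Let's say today is 1/6/2017"
df['next_event'] = df['next_event'].fillna(today)  # alternative to df['date'].iloc[-1] + 1 day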
To get the is_censored value for each day and each id, we can group by id and forward-fill based on the 'is_event' column within each group. According to the definition above, is_censored should be 0 on the day of the event itself, so we only care about the values that were filled in. We can therefore compare the 'is_event' column to its forward-filled version and set 'is_censored' to 1 wherever the forward-filled value wasn't in the original.
df['is_censored'] = (df.groupby('id')['is_event'].transform(lambda x: x.replace(0, method='ffill')) != df['is_event']).astype(int)
df = df.drop('next_event', axis=1)
In [343]: df
Out[343]:
id date is_event time_to_next_event is_censored
0 a 2017-01-01 0 3 0
1 a 2017-01-02 0 2 0
2 a 2017-01-03 0 1 0
3 a 2017-01-04 1 0 0
4 a 2017-01-05 1 0 0
5 b 2017-01-01 0 1 0
6 b 2017-01-02 1 0 0
7 b 2017-01-03 0 3 1
8 b 2017-01-04 0 2 1
9 b 2017-01-05 0 1 1
To generalize the method for is_censored to include cases where an event happens more than once within each id, I wrote this:
df['is_censored2'] = 1
max_dates = df[df['is_event'] == 1].groupby('id', as_index=False)['date'].max()
max_dates.columns = ['id', 'max_date']
df = pd.merge(df, max_dates, on=['id'], how='left')
df.loc[df['date'] <= df['max_date'], 'is_censored2'] = 0
It initializes the column to 1, then grabs the latest event date within each id and writes a 0 to is_censored2 for every row in that id whose date is less than or equal to that date.
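An equivalent formulation, offered as my own sketch rather than part of the original answer, avoids the merge by computing the latest event date within each id with groupby().transform():
# latest event date per id (NaT for ids with no event at all)
last_event = (df['date'].where(df['is_event'] == 1)
                        .groupby(df['id'])
                        .transform('max'))
# rows after the latest event, or in ids with no event, are censored
df['is_censored2'] = (~(df['date'] <= last_event)).astype(int)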

Correlation between a pandas Series and a whole DataFrame

I have a series of values and I'm looking to compute the Pearson correlation with every row of a given table.
How do I do that?
Example:
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
s = pd.Series(v)
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# Here I expect to do df.corrwith(s) - but it won't work
Using Series.corr() to calculate it, the expected output is
-0.1666666666666666 # correlation with the first row
0.83914639167827343 # correlation with the second row
-0.35355339059327379 # correlation with the third row
You need the Series index to match the DataFrame's columns so that the Series aligns with the DataFrame, and you need axis=1 in corrwith for row-wise correlation:
s1 = pd.Series(s.values, index=df.columns)
print (s1)
a -1
b 5
c 0
d 0
e 10
f 0
g -7
dtype: int64
print (df.corrwith(s1, axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
print (df.corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
EDIT:
You can specify columns and use a subset:
cols = ['a','b','e']
print (df[cols])
a b e
0 1 0 0
1 0 1 1
2 1 1 0
print (df[cols].corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.891042
1 0.891042
2 -0.838628
dtype: float64
This might be useful to those concerned with performance.
I have found this runs in half the time compared to pandas corrwith.
Your data:
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
The solution (note that v is not transformed into a series):
from scipy.stats import pearsonr
s_corrs = df.apply(lambda x: pearsonr(x.values, v)[0], axis=1)
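For larger inputs still, a fully vectorized NumPy version is another option. This is my own sketch, assuming df and v as defined above, not something from the original answers:
import numpy as np

A = df.to_numpy(dtype=float)              # each row of the DataFrame
b = np.asarray(v, dtype=float)            # the comparison vector

A_c = A - A.mean(axis=1, keepdims=True)   # centre each row
b_c = b - b.mean()                        # centre the vector
corr = A_c @ b_c / (np.linalg.norm(A_c, axis=1) * np.linalg.norm(b_c))
row_corrs = pd.Series(corr, index=df.index)
print(row_corrs)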
