How to assign identical random IDs conditionally to "related" rows in pandas? - python
New to Python, I'm struggling with how to assign random IDs to "related" rows, where the relation is simply proximity in time: consecutive transactions within 14 days of each other, grouped by user. In this example I chose uuid without any specific intention; it could be any other random ID that uniquely identifies conceptually related rows.
import pandas as pd
import uuid
import numpy as np
Here is a dummy dataframe:
dummy_df = pd.DataFrame({
    "transactionid": [1, 2, 3, 4, 5, 6, 7, 8],
    "user": ["michael", "michael", "michael",
             "tom", "tom", "tom", "tom", "tom"],
    "transactiontime": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03",
                                       "2022-09-01", "2022-09-13", "2022-10-17",
                                       "2022-10-20", "2022-11-17"])
})
dummy_df.head(10)
transactionid user transactiontime
0 1 michael 2022-01-01
1 2 michael 2022-01-02
2 3 michael 2022-01-03
3 4 tom 2022-09-01
4 5 tom 2022-09-13
5 6 tom 2022-10-17
6 7 tom 2022-10-20
7 8 tom 2022-11-17
Here I sort transactions and calculate their difference in days:
dummy_df = dummy_df.assign(
    timediff=dummy_df
        .sort_values('transactiontime')
        .groupby(["user"])['transactiontime'].diff() / np.timedelta64(1, 'D')
).fillna(0)
dummy_df.head(10)
transactionid user transactiontime timediff
0 1 michael 2022-01-01 0.0
1 2 michael 2022-01-02 1.0
2 3 michael 2022-01-03 1.0
3 4 tom 2022-09-01 0.0
4 5 tom 2022-09-13 12.0
5 6 tom 2022-10-17 34.0
6 7 tom 2022-10-20 3.0
7 8 tom 2022-11-17 28.0
Here I create a new column with a random ID for each group of related transactions, though it does not work as expected:
dummy_df.assign(related_transaction = np.where((dummy_df.timediff >= 0) & (dummy_df.timediff < 15), uuid.uuid4(), dummy_df.transactionid))
transactionid user transactiontime timediff related_transaction
0 1 michael 2022-01-01 0.0 fd630f07-6564-4773-aff9-44ecb1e4211d
1 2 michael 2022-01-02 1.0 fd630f07-6564-4773-aff9-44ecb1e4211d
2 3 michael 2022-01-03 1.0 fd630f07-6564-4773-aff9-44ecb1e4211d
3 4 tom 2022-09-01 0.0 fd630f07-6564-4773-aff9-44ecb1e4211d
4 5 tom 2022-09-13 12.0 fd630f07-6564-4773-aff9-44ecb1e4211d
5 6 tom 2022-10-17 34.0 6
6 7 tom 2022-10-20 3.0 fd630f07-6564-4773-aff9-44ecb1e4211d
7 8 tom 2022-11-17 28.0 8
What I would expect is something like the following, where transactions of the same user that are within 14 days of each other share an ID:
transactionid user transactiontime timediff related_transaction
0 1 michael 2022-01-01 0.0 ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
1 2 michael 2022-01-02 1.0 ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
2 3 michael 2022-01-03 1.0 ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
3 4 tom 2022-09-01 0.0 b1da2251-7770-4756-8863-c82f90657542
4 5 tom 2022-09-13 12.0 b1da2251-7770-4756-8863-c82f90657542
5 6 tom 2022-10-17 34.0 485a8d97-80d1-4184-8fc8-99523f471527
6 7 tom 2022-10-20 3.0 485a8d97-80d1-4184-8fc8-99523f471527
7 8 tom 2022-11-17 28.0 8
Taking the idea from Luise, we start with an empty related_transaction column. Then we iterate through each row: for each date, we check whether it has already been assigned to a group; if so, we continue. Otherwise we assign a new ID to that date and to all other dates within the following 15 days for the same user:
import datetime
df = dummy_df
df['related_transaction'] = None
for i, row in dummy_df.iterrows():
    if df.loc[i].related_transaction is not None:
        # We already assigned that row
        continue
    df.loc[  # Select where:
        (df.transactiontime <= row.transactiontime + datetime.timedelta(days=15)) &  # Current row + 15 days
        (df.user == row.user) &              # Same user
        (pd.isna(df.related_transaction)),   # Don't overwrite anything already assigned
        'related_transaction'                # Set this column to:
    ] = uuid.uuid4()                         # Assign a new UUID
This gives the output:
transactionid user transactiontime related_transaction
0 1 michael 2022-01-01 82d28e10-149b-481e-ba41-f5833662ba99
1 2 michael 2022-01-02 82d28e10-149b-481e-ba41-f5833662ba99
2 3 michael 2022-01-03 82d28e10-149b-481e-ba41-f5833662ba99
3 4 tom 2022-09-01 fa253663-8615-419a-afda-7646906024f0
4 5 tom 2022-09-13 fa253663-8615-419a-afda-7646906024f0
5 6 tom 2022-10-17 d6152d4b-1560-40e0-8589-bd8e3da363db
6 7 tom 2022-10-20 d6152d4b-1560-40e0-8589-bd8e3da363db
7 8 tom 2022-11-17 2a93d78d-b6f6-4f0f-bb09-1bc18361aa21
In your example the dates are already sorted; that's an important assumption I'm making here!
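For larger data, a vectorized variant of the same idea is possible. This is only a sketch under the same sorted-dates assumption; note that it defines "related" by chained gaps of at most 14 days rather than by a fixed 15-day window from a group's first transaction, which happens to produce the same groups for this sample:
import uuid
import pandas as pd

# A new group starts at a user's first transaction, or whenever the gap to
# the previous transaction of the same user exceeds 14 days.
ordered = dummy_df.sort_values(['user', 'transactiontime'])
gap = ordered.groupby('user')['transactiontime'].diff()
new_group = gap.isna() | (gap > pd.Timedelta(days=14))

group_key = new_group.cumsum()  # one integer label per group
dummy_df['related_transaction'] = group_key.map(
    {key: uuid.uuid4() for key in group_key.unique()}  # one UUID per group
)
Because one UUID is drawn per group label, every row sharing a label receives the identical ID, and the assignment aligns back to dummy_df by index.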
The mismatch between your code and your desired result is that uuid.uuid4() is evaluated only once, so a single ID is assigned to all the relevant rows selected by np.where(). Instead, you need to generate the IDs in a vectorized way.
Try the following approach:
df.loc[ROW_CONDITIONS, COLUMNS] = VECTORIZED_ID_GENERATOR
which for your example would be
dummy_df.loc[(dummy_df['timediff'] >= 0) & (dummy_df['timediff'] < 15), 'related_transaction'] = dummy_df.apply(lambda _: uuid.uuid4(), axis=1)
Take into account that this only answers your question of how to assign random IDs conditionally with uuid in pandas. It looks to me that you also need to generate the same ID for the same user across transactions that fall within the same 15-day window. My advice for that would be to build a dataframe where every row is a combination of two transactions and add a condition that the users of both transactions must be the same.
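As a rough illustration of that pairing idea (only a sketch: the _a/_b suffixes are made-up names, and collapsing the resulting pairs into shared group IDs would still need an extra connected-components style step that is not shown here):
import pandas as pd

pairs = dummy_df.merge(dummy_df, on='user', suffixes=('_a', '_b'))

# Keep each unordered pair once, and only pairs at most 14 days apart.
pairs = pairs[
    (pairs['transactionid_a'] < pairs['transactionid_b']) &
    ((pairs['transactiontime_b'] - pairs['transactiontime_a']).abs()
     <= pd.Timedelta(days=14))
]
print(pairs[['user', 'transactionid_a', 'transactionid_b']])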
Related
Calculate the mean value using two columns in pandas
I have a deal dataframe with three columns and I have sorted it by type and date. It looks like:
type  date        price
A     2020-05-01  4
A     2020-06-04  6
A     2020-06-08  8
A     2020-07-03  5
B     2020-02-01  3
B     2020-04-02  4
There are many types (A, B, C, D, E, …). I want to calculate the previous mean price of the same type of product. For example, the pre_mean_price value of the third row of A is (4+6)/2 = 5. I want to get a dataframe like this:
type  date        price  pre_mean_price
A     2020-05-01  4      .
A     2020-06-04  6      4
A     2020-06-08  8      5
A     2020-07-03  5      6
B     2020-02-01  3      .
B     2020-04-02  4      3
How can I calculate the pre_mean_price? Thanks a lot!
You can use expanding().mean() after groupby for each group, then shift the values.
df['pre_mean_price'] = df.groupby("type")['price'].apply(lambda x: x.expanding().mean().shift())
print(df)
  type        date  price  pre_mean_price
0    A  2020-05-01      4             NaN
1    A  2020-06-04      6             4.0
2    A  2020-06-08      8             5.0
3    A  2020-07-03      5             6.0
4    B  2020-02-01      3             NaN
5    B  2020-04-02      4             3.0
Something like
df['pre_mean_price'] = df.groupby('type').expanding().mean().groupby('type').shift(1)['price'].values
which produces
  type        date  price  pre_mean_price
0    A  2020-05-01      4             NaN
1    A  2020-06-04      6             4.0
2    A  2020-06-08      8             5.0
3    A  2020-07-03      5             6.0
4    B  2020-02-01      3             NaN
5    B  2020-04-02      4             3.0
Short explanation. The idea is to:
1. First group by "type" with .groupby(). This must be done since we want to calculate the (incremental) means within the group "type".
2. Then calculate the incremental mean with expanding().mean(). The output at this point is
        price
type
A    0   4.00
     1   5.00
     2   6.00
     3   5.75
B    4   3.00
     5   3.50
3. Then group by "type" again and shift the elements inside the groups by one row with shift(1).
4. Then just extract the values of the price column (the incremental means).
Note: This assumes your data is sorted by date. If it is not, call df.sort_values('date', inplace=True) first.
How to skip cells in a lambda shift rolling function in pandas based off multiple column criteria
I have the following dataframe which is a list of athlete times:
Name  Time  Excuse      Injured  Margin
John  15    nan         0        1
John  18    nan         0        5
John  30    leg injury  1        11
John  16    nan         0        4
John  40    nan         0        18
John  15    nan         0        3
John  22    nan         0        6
I then am using a function to get the mean of the previous last 5 times, shifted:
df['last5'] = df.groupby(['Name']).Time.apply(
    lambda x: x.shift().rolling(5, min_periods=1).mean().fillna(.5))
This works, but I am hoping to perform the same calculation while ignoring the Time if there is an Excuse, Injured = 1, or Margin > 10. My expected output would be:
Name  Time  Excuse      Injured  Margin  last5
John  15                0        1       .5
John  18                0        5       15
John  30    leg injury  1        11      16.5
John  16                0        4       16.5
John  40                0        18      16.33
John  15                0        3       16.33
John  22                0        6       16
Can I just add a condition onto the end of the original function? Thanks in advance!
You can filter the dataframe according to criteria before applying the rolling calculation. Use bfill() to backwards fill the NaN values as required:
df['last5'] = (df[(df['Excuse'].isnull()) & (df['Injured'] != 1) & (df['Margin'] <= 10)]
               .groupby(['Name']).Time.apply(lambda x: x.shift().rolling(5, min_periods=1)
               .mean().fillna(.5)))
df['last5'] = df.groupby(['Name'])['last5'].bfill()
df
Out[1]:
   Name  Time      Excuse  Injured  Margin      last5
0  John    15         NaN        0       1   0.500000
1  John    18         NaN        0       5  15.000000
2  John    30  leg injury        1      11  16.500000
3  John    16         NaN        0       4  16.500000
4  John    40         NaN        0      18  16.333333
5  John    15         NaN        0       3  16.333333
6  John    22         NaN        0       6  16.000000
Setting subset of a pandas DataFrame by a DataFrame
I feel like this question has been asked a million times before, but I just can't seem to get it to work or find an SO post answering my question. I am selecting a subset of a pandas DataFrame and want to change these values individually. I am subselecting my DataFrame like this:
df.loc[df[key].isnull(), [keys]]
which works perfectly. If I try to set all values to the same value, such as
df.loc[df[key].isnull(), [keys]] = 5
it works as well. But if I try to set it to a DataFrame it does not; however, no error is produced either. So for example I have a DataFrame:
data = [['Alex',10,0,0,2],['Bob',12,0,0,1],['Clarke',13,0,0,4],['Dennis',64,2],['Jennifer',56,1],['Tom',95,5],['Ellen',42,2],['Heather',31,3]]
df1 = pd.DataFrame(data,columns=['Name','Age','Amount_of_cars','cars_per_year','some_other_value'])
       Name  Age  Amount_of_cars  cars_per_year  some_other_value
0      Alex   10               0            0.0               2.0
1       Bob   12               0            0.0               1.0
2    Clarke   13               0            0.0               4.0
3    Dennis   64               2            NaN               NaN
4  Jennifer   56               1            NaN               NaN
5       Tom   95               5            NaN               NaN
6     Ellen   42               2            NaN               NaN
7   Heather   31               3            NaN               NaN
and a second DataFrame:
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5],[3/31,7]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
   cars_per_year  some_other_value
0       0.031250                 5
1       0.017857                 1
2       0.052632                 7
3       0.047619                 5
4       0.096774                 7
and I would like to replace those NaNs with the second DataFrame:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
Unfortunately this does not work, as the index does not match. So how do I ignore the index when setting values? Any help would be appreciated. Sorry if this has been posted before.
It is possible only if the number of missing values is the same as the number of rows in df2; then assign an array to prevent index alignment:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
print (df1)
       Name  Age  Amount_of_cars  cars_per_year  some_other_value
0      Alex   10               0       0.000000               2.0
1       Bob   12               0       0.000000               1.0
2    Clarke   13               0       0.000000               4.0
3    Dennis   64               2       0.031250               5.0
4  Jennifer   56               1       0.017857               1.0
5       Tom   95               5       0.052632               7.0
6     Ellen   42               2       0.047619               5.0
7   Heather   31               3       0.096774               7.0
If not, you get errors like:
#4 rows assigned to 5 rows
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
ValueError: shape mismatch: value array of shape (4,) could not be broadcast to indexing result of shape (5,)
Another idea is to set the index of df2 from the index of the filtered rows in df1:
df2 = df2.set_index(df1.index[df1['cars_per_year'].isnull()])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
print (df1)
       Name  Age  Amount_of_cars  cars_per_year  some_other_value
0      Alex   10               0       0.000000               2.0
1       Bob   12               0       0.000000               1.0
2    Clarke   13               0       0.000000               4.0
3    Dennis   64               2       0.031250               5.0
4  Jennifer   56               1       0.017857               1.0
5       Tom   95               5       0.052632               7.0
6     Ellen   42               2       0.047619               5.0
7   Heather   31               3       0.096774               7.0
Just add .values, or .to_numpy() if using pandas v0.24+:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
       Name  Age  Amount_of_cars  cars_per_year  some_other_value
0      Alex   10               0       0.000000               2.0
1       Bob   12               0       0.000000               1.0
2    Clarke   13               0       0.000000               4.0
3    Dennis   64               2       0.031250               5.0
4  Jennifer   56               1       0.017857               1.0
5       Tom   95               5       0.052632               7.0
6     Ellen   42               2       0.047619               5.0
7   Heather   31               3       0.096774               7.0
Iterating and averaging pandas data frame
I have a database with a lot of rows such as:
timestamp  name  price  profit
           bob   5      4
           jim   3      2
           jim   2      6
           bob   6      7
           jim   4      1
           jim   6      3
           bob   3      1
The database is sorted by a timestamp. I would like to be able to add a new column where it would take the last 2 values in the price column before the current value and average them into a new column. So that the first three rows would look something like this with a new column:
timestamp  name  price  profit  new column
           bob   5      4       4.5
           jim   3      2       3
           jim   2      6       5
(6+3)/2 = 4.5
(2+4)/2 = 3
(4+6)/2 = 5
This isn't for a school project or anything; this is just something I'm working on on my own. I've tried asking a similar question to this but I don't think I was very clear. Thanks in advance!
def shift_n_roll(df):
    return df.shift(-1).rolling(2).mean().shift(-1)

df['new column'] = df.groupby('name').price.apply(shift_n_roll)
df
By looking at the result you want, I'm guessing you want the average of the two prices following the current one instead of the "2 values in the price column before the current value". I made up the timestamp values that you omitted to be clear.
print df
   timestamp name  price  profit
0 2016-01-01  bob      5       4
1 2016-01-02  jim      3       2
2 2016-01-03  jim      2       6
3 2016-01-04  bob      6       7
4 2016-01-05  jim      4       1
5 2016-01-06  jim      6       3
6 2016-01-07  bob      3       1
#No need to sort if you already did.
#df.sort_values(['name','timestamp'], inplace=True)
df['new column'] = (df.groupby('name')['price'].shift(-1) + df.groupby('name')['price'].shift(-2)) / 2
print df.dropna()
   timestamp name  price  profit  new column
0 2016-01-01  bob      5       4         4.5
1 2016-01-02  jim      3       2         3.0
2 2016-01-03  jim      2       6         5.0