I have a table with the following structure; the count column gets updated every time a user accesses the app again on that date.
user_id    date        count
1          1/1/2021    4
2          1/1/2021    7
1          1/2/2021    3
3          1/2/2021    10
2          1/3/2021    4
4          1/1/2021    12
I want to de-aggregate this data based on the count, so for example, user_id of 1 will have four records on 1/1/2021 without the count column. After that, I want to concatenate a random time to the date. My output would look like this:
user_id    date_time
1          1/1/2021 16:00:21
1          1/1/2021 7:23:55
1          1/1/2021 12:01:45
1          1/1/2021 21:21:07
I'm using pandas for this. Randomizing the timestamps should be straightforward, I think; it's de-aggregating the data based on a column that is a little tricky for me.
You can duplicate the index and add a random time between 0 and 24 hours:
import numpy as np
import pandas as pd

(df.loc[df.index.repeat(df['count'])]
   .assign(date=lambda d: pd.to_datetime(d['date'])
                          + pd.to_timedelta(np.random.randint(0, 24*3600, size=len(d)), unit='s'))
   .rename(columns={'date': 'date_time'})
   .drop('count', axis=1)
)
output:
   user_id           date_time
0 1 2021-01-01 03:32:40
0 1 2021-01-01 03:54:18
0 1 2021-01-01 00:57:49
0 1 2021-01-01 13:04:08
1 2 2021-01-01 00:34:03
1 2 2021-01-01 00:14:17
1 2 2021-01-01 03:57:20
1 2 2021-01-01 22:01:11
1 2 2021-01-01 22:09:55
1 2 2021-01-01 13:15:36
1 2 2021-01-01 12:26:39
2 1 2021-01-02 22:51:17
2 1 2021-01-02 13:44:12
2 1 2021-01-02 01:39:14
3 3 2021-01-02 09:22:16
3 3 2021-01-02 03:34:15
3 3 2021-01-02 23:05:49
3 3 2021-01-02 02:21:35
3 3 2021-01-02 19:51:41
3 3 2021-01-02 16:02:20
3 3 2021-01-02 18:14:05
3 3 2021-01-02 09:07:14
3 3 2021-01-02 22:43:44
3 3 2021-01-02 20:48:15
4 2 2021-01-03 19:25:04
4 2 2021-01-03 14:08:03
4 2 2021-01-03 21:23:58
4 2 2021-01-03 17:24:58
5 4 2021-01-01 23:37:41
5 4 2021-01-01 06:06:17
5 4 2021-01-01 19:23:29
5 4 2021-01-01 02:12:50
5 4 2021-01-01 08:09:59
5 4 2021-01-01 03:49:30
5 4 2021-01-01 08:00:42
5 4 2021-01-01 08:03:34
5 4 2021-01-01 15:36:12
5 4 2021-01-01 14:50:43
5 4 2021-01-01 14:54:04
5 4 2021-01-01 14:58:08
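If you need the random times to be reproducible, one option (a sketch, assuming an arbitrary seed of 42 and the same column names) is to draw them from a seeded NumPy generator instead of the global random state:
rng = np.random.default_rng(42)  # seeded generator so reruns produce the same times
out = (df.loc[df.index.repeat(df['count'])]
         .assign(date=lambda d: pd.to_datetime(d['date'])
                                + pd.to_timedelta(rng.integers(0, 24*3600, size=len(d)), unit='s'))
         .rename(columns={'date': 'date_time'})
         .drop('count', axis=1))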
My problem is a complex and confusing one; I haven't been able to find the answer anywhere.
I basically have two dataframes: one is the price history of certain products, and the other is an invoice dataframe that contains transaction data.
Sample Data:
Price History:
product_id updated price
id
1 1 2022-01-01 5.0
2 2 2022-01-01 5.5
3 3 2022-01-01 5.7
4 1 2022-01-15 6.0
5 2 2022-01-15 6.5
6 3 2022-01-15 6.7
7 1 2022-02-01 7.0
8 2 2022-02-01 7.5
9 3 2022-02-01 7.7
Invoice:
transaction_date product_id quantity
id
1 2022-01-02 1 2
2 2022-01-02 2 3
3 2022-01-02 3 4
4 2022-01-14 1 1
5 2022-01-14 2 4
6 2022-01-14 3 2
7 2022-01-15 1 3
8 2022-01-15 2 6
9 2022-01-15 3 5
10 2022-01-16 1 3
11 2022-01-16 2 2
12 2022-01-16 3 3
13 2022-02-05 1 1
14 2022-02-05 2 4
15 2022-02-05 3 7
16 2022-05-10 1 4
17 2022-05-10 2 2
18 2022-05-10 3 1
What I am looking to achieve is to add a price column to the Invoice dataframe, based on:
The product id
Comparing the updated and transaction dates such that updated date <= transaction date for that particular record, i.e. finding the most recent price update on or before the transaction (the MAX updated date that is <= transaction date)
I managed to do this:
invoice['price'] = invoice['product_id'].map(price_history.set_index('id')['price'])
but I need to incorporate the date condition now.
Expected result for the sample data: the Invoice table with a price column giving the price that was in effect on each transaction date.
Any guidance in the correct direction is appreciated, thanks
merge_asof is what you are looking for:
pd.merge_asof(
invoice,
price_history,
left_on="transaction_date",
right_on="updated",
by="product_id",
)[["transaction_date", "product_id", "quantity", "price"]]
merge_asof with the direction argument:
merged = pd.merge_asof(
left=invoice,
right=price_history,
left_on="transaction_date",
right_on="updated",
by="product_id",
direction="backward",
suffixes=("", "_y")
).drop(columns=["id_y", "updated"]).reset_index(drop=True)
print(merged)
id transaction_date product_id quantity price
0 1 2022-01-02 1 2 5.0
1 2 2022-01-02 2 3 5.5
2 3 2022-01-02 3 4 5.7
3 4 2022-01-14 1 1 5.0
4 5 2022-01-14 2 4 5.5
5 6 2022-01-14 3 2 5.7
6 7 2022-01-15 1 3 6.0
7 8 2022-01-15 2 6 6.5
8 9 2022-01-15 3 5 6.7
9 10 2022-01-16 1 3 6.0
10 11 2022-01-16 2 2 6.5
11 12 2022-01-16 3 3 6.7
12 13 2022-02-05 1 1 7.0
13 14 2022-02-05 2 4 7.5
14 15 2022-02-05 3 7 7.7
15 16 2022-05-10 1 4 7.0
16 17 2022-05-10 2 2 7.5
17 18 2022-05-10 3 1 7.7
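One caveat, relevant only if your full data has transactions dated before a product's first price update (the sample does not): with direction="backward" such rows find no match and get a NaN price, which is easy to check for afterwards:
# rows for which no price update exists on or before the transaction date
print(merged[merged['price'].isna()])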
I created a dataframe with pandas that looks like this:
HostName    Date
B           2021-01-01 12:30
A           2021-01-01 12:45
C           2021-01-01 12:46
A           2021-02-01 12:42
B           2021-02-01 12:43
A           2021-02-01 12:45
B           2021-02-25 12:46
C           2021-03-01 12:41
A           2021-03-01 12:42
A           2021-03-01 12:43
C           2021-03-01 12:45
For every day, it should count how many different HostName values there are from the beginning of the day (e.g. 2021-01-01 00:00) up to the specific row.
Example: let's take 2021-01-01:
HostName    Date
B           2021-01-01 12:30
A           2021-01-01 12:45
C           2021-01-01 12:46
There are three rows:
the first result would be 1, because it was the first row in the day (B);
the second result would be 2, because from the beginning of the day up to this line there are two different HostName values (B, A);
the third result would be 3, because from the beginning of the day up to this line there are three different HostName values (B, A, C).
The end result should look like this:
HostName    Date                Result
B           2021-01-01 12:30    1
A           2021-01-01 12:45    2
C           2021-01-01 12:46    3
A           2021-02-01 12:42    1
B           2021-02-01 12:43    2
A           2021-02-01 12:45    2
B           2021-02-25 12:46    1
C           2021-03-01 12:41    1
A           2021-03-01 12:42    2
A           2021-03-01 12:43    2
C           2021-03-01 12:45    2
What I tried to do but failed:
df.groupby(['HostName','Date'])['HostName'].cumcount() + 1
or
def f(x):
    one = x['HostName'].to_numpy()
    twe = x['Date'].to_numpy()
    both = x[['HostName','Date']].shift(1).to_numpy()
    x['Host_1D_CumCount_Conn'] = [np.sum((one == a) & (twe == b)) for a, b in both]
    return x

df.groupby('HostName').apply(f)
Use a lambda function in GroupBy.transform with Series.duplicated and a cumulative sum:
df['Result'] = (df.groupby(df['Date'].dt.date)['HostName']
.transform(lambda x: (~x.duplicated()).cumsum()))
print (df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
An alternative, faster solution is to create helper columns: d for the dates and new for rows not duplicated per d and HostName, then use GroupBy.cumsum:
df['Result'] = (df.assign(d = df['Date'].dt.date,
new = lambda x: ~x.duplicated(['d','HostName']))
.groupby('d')['new']
.cumsum())
print (df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
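Both variants assume Date already has a datetime dtype (the .dt accessor raises on plain strings); if it does not, convert it first:
df['Date'] = pd.to_datetime(df['Date'])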
You can groupby the Date and use expanding+nunique. The issue is that, currently, expanding only works with numerical values (I wish we could simply do expanding().nunique()).
Thus we have to cheat a bit and factorize the column to numbers before applying pd.Series.nunique.
df['Result'] = (df.groupby(pd.to_datetime(df['Date']).dt.date, group_keys=False)
['HostName']
.apply(lambda s: pd.Series(s.factorize()[0]).expanding().apply(pd.Series.nunique))
.astype(int)
.values
)
output:
HostName Date Result
0 B 2021-01-01 12:30 1
1 A 2021-01-01 12:45 2
2 C 2021-01-01 12:46 3
3 A 2021-02-01 12:42 1
4 B 2021-02-01 12:43 2
5 A 2021-02-01 12:45 2
6 B 2021-02-25 12:46 1
7 C 2021-03-01 12:41 1
8 A 2021-03-01 12:42 2
9 A 2021-03-01 12:43 2
10 C 2021-03-01 12:45 2
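A possible variant of the same idea (a sketch, not the author's exact code): building the inner Series with the group's original index lets the result align on assignment, so the positional .values step is not needed:
df['Result'] = (df.groupby(pd.to_datetime(df['Date']).dt.date, group_keys=False)['HostName']
                  .apply(lambda s: pd.Series(s.factorize()[0], index=s.index)
                                     .expanding()
                                     .apply(pd.Series.nunique))
                  .astype(int))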
I have the following data frame, where time_stamp is already sorted in ascending order:
time_stamp indicator
0 2021-01-01 00:00:00 1
1 2021-01-01 00:02:00 1
2 2021-01-01 00:03:00 NaN
3 2021-01-01 00:04:00 NaN
4 2021-01-01 00:09:00 NaN
5 2021-01-01 00:14:00 NaN
6 2021-01-01 00:19:00 NaN
7 2021-01-01 00:24:00 NaN
8 2021-01-01 00:27:00 1
9 2021-01-01 00:29:00 NaN
10 2021-01-01 00:32:00 2
11 2021-01-01 00:34:00 NaN
12 2021-01-01 00:37:00 2
13 2021-01-01 00:38:00 NaN
14 2021-01-01 00:39:00 NaN
I want to create a new column in the above data frame that shows the time difference between each row's time_stamp value and the time_stamp of the nearest row above it where indicator is not NaN.
Below is how the output should look (time_diff is a timedelta value, but I'll just show subtraction by indices to better illustrate; for example, (2 - 1) = df['time_stamp'][2] - df['time_stamp'][1]):
time_stamp indicator time_diff
0 2021-01-01 00:00:00 1 NaT # (or undefined)
1 2021-01-01 00:02:00 1 1 - 0
2 2021-01-01 00:03:00 NaN 2 - 1
3 2021-01-01 00:04:00 NaN 3 - 1
4 2021-01-01 00:09:00 NaN 4 - 1
5 2021-01-01 00:14:00 NaN 5 - 1
6 2021-01-01 00:19:00 NaN 6 - 1
7 2021-01-01 00:24:00 NaN 7 - 1
8 2021-01-01 00:27:00 1 8 - 1
9 2021-01-01 00:29:00 NaN 9 - 8
10 2021-01-01 00:32:00 2 10 - 8
11 2021-01-01 00:34:00 NaN 11 - 10
12 2021-01-01 00:37:00 2 12 - 10
13 2021-01-01 00:38:00 NaN 13 - 12
14 2021-01-01 00:39:00 NaN 14 - 12
We can use a for loop that keeps track of the last non-NaN entry, but I'm looking for a solution that does not use a for loop.
I've ended up doing this:
# create an intermediate column holding the time_stamp of each row where `indicator` is not NaN,
# then forward-fill it so every row carries the most recent such timestamp
df['tracking'] = df['time_stamp'].where(df['indicator'].notna()).ffill()
# use that to subtract the value from the `time_stamp`
df['time_diff'] = df['time_stamp'] - df['tracking']
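Note that, per the expected output above, a row whose own indicator is not NaN is compared against the previous non-NaN row (e.g. row 8 shows 8 - 1), while the where/ffill approach above yields a zero difference on those rows. A minimal sketch of one way to match the stated output exactly, under the same column names, is to shift before forward-filling:
# reference = time_stamp of the nearest row strictly above with a non-NaN indicator
reference = df['time_stamp'].where(df['indicator'].notna()).shift().ffill()
df['time_diff'] = df['time_stamp'] - reference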
Below is a sample of the dataframe (df):
   alpha  value
0      a      5
1      a      8
2      a      4
3      b      2
4      b      1
I know how to make the sequence (numbers) as per the group:
df["serial"] = df.groupby("alpha").cumcount()+1
   alpha  value  serial
0      a      5       1
1      a      8       2
2      a      4       3
3      b      2       1
4      b      1       2
But instead of numbers I need a date-time sequence with a 30-minute interval.
Expected result:
   alpha  value               serial
0      a      5  2021-01-01 23:30:00
1      a      8  2021-01-02 00:00:00
2      a      4  2021-01-02 00:30:00
3      b      2  2021-01-01 23:30:00
4      b      1  2021-01-02 00:00:00
You can simply multiply your result by a pd.Timedelta and add a start timestamp:
print ((df.groupby("alpha").cumcount()+1)*pd.Timedelta(minutes=30)+pd.Timestamp("2021-01-01 23:00:00"))
0 2021-01-01 23:30:00
1 2021-01-02 00:00:00
2 2021-01-02 00:30:00
3 2021-01-01 23:30:00
4 2021-01-02 00:00:00
dtype: datetime64[ns]
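To keep the result as the serial column (assuming the same 23:00 start timestamp as above), assign it back:
df['serial'] = (df.groupby("alpha").cumcount() + 1) * pd.Timedelta(minutes=30) + pd.Timestamp("2021-01-01 23:00:00")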
Try to_datetime and groupby with cumcount, then multiply by a pd.Timedelta of 30 minutes:
>>> df['serial'] = pd.to_datetime('2021-01-01 23:30:00') + df.groupby('alpha').cumcount() * pd.Timedelta(minutes=30)
>>> df
alpha value serial
0 a 5 2021-01-01 23:30:00
1 a 8 2021-01-02 00:00:00
2 a 4 2021-01-02 00:30:00
3 b 2 2021-01-01 23:30:00
4 b 1 2021-01-02 00:00:00
>>>
companies_id transaction_month count
0 2020-10-01 3
1 2020-10-01 5
1 2020-11-01 5
1 2020-12-01 18
1 2021-01-01 8
I want the result to be like this:
companies_id transaction_month count first_month
0 2020-10-01 3
1 2020-10-01 5 2020-10-01
1 2020-11-01 5 2020-10-01
1 2020-12-01 18 2020-10-01
1 2021-01-01 8 2020-10-01
This is my data set. I want to add a new column called "first_month" that should contain the value from the transaction_month column where the corresponding count is >= 5.
For example, in the case of companies_id 1:
the first count of 5 or more occurred on 2020-10-01, therefore the first_month column should contain 2020-10-01 throughout, i.e. for all rows with companies_id 1.
Use Series.where to replace transaction_month with NaN where count is not >= 5, then use GroupBy.transform with GroupBy.first to get the first non-missing value per group into the new column:
df['transaction_month'] = pd.to_datetime(df['transaction_month'])
print (df['transaction_month'].where(df['count'] >= 5))
0 NaT
1 2020-10-01
2 2020-11-01
3 2020-12-01
4 2021-01-01
Name: transaction_month, dtype: datetime64[ns]
df['first_month'] = (df['transaction_month'].where(df['count'] >= 5)
.groupby(df['companies_id'])
.transform('first'))
print (df)
companies_id transaction_month count first_month
0 0 2020-10-01 3 NaT
1 1 2020-10-01 5 2020-10-01
2 1 2020-11-01 5 2020-10-01
3 1 2020-12-01 18 2020-10-01
4 1 2021-01-01 8 2020-10-01
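As the companies_id 0 row shows, GroupBy.first skips missing values, so a company whose count never reaches 5 keeps NaT in first_month; if you need to handle those separately, you can filter on that afterwards:
# companies with no month where count >= 5 keep NaT in first_month
print(df.loc[df['first_month'].isna(), 'companies_id'].unique())   # [0] for the sample data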