I have a table with the following structure; the count column gets updated every time a user accesses the app again on that date.
user_id    date        count
1          1/1/2021    4
2          1/1/2021    7
1          1/2/2021    3
3          1/2/2021    10
2          1/3/2021    4
4          1/1/2021    12
I want to de-aggregate this data based on the count, so for example, user_id of 1 will have four records on 1/1/2021 without the count column. After that, I want to concatenate a random time to the date. My output would look like this:
user_id    date_time
1          1/1/2021 16:00:21
1          1/1/2021 7:23:55
1          1/1/2021 12:01:45
1          1/1/2021 21:21:07
I'm using pandas for this. Randomizing the timestamps should be straightforward, I think; it's de-aggregating the data based on a column that is a little tricky for me.
You can duplicate the index and add a random time between 0 and 24 hours:
import numpy as np
import pandas as pd

(df.loc[df.index.repeat(df['count'])]
   .assign(date=lambda d: pd.to_datetime(d['date'])
                          + pd.to_timedelta(np.random.randint(0, 24*3600, size=len(d)), unit='s'))
   .rename(columns={'date': 'date_time'})
   .drop('count', axis=1)
)
output:
   user_id           date_time
0 1 2021-01-01 03:32:40
0 1 2021-01-01 03:54:18
0 1 2021-01-01 00:57:49
0 1 2021-01-01 13:04:08
1 2 2021-01-01 00:34:03
1 2 2021-01-01 00:14:17
1 2 2021-01-01 03:57:20
1 2 2021-01-01 22:01:11
1 2 2021-01-01 22:09:55
1 2 2021-01-01 13:15:36
1 2 2021-01-01 12:26:39
2 1 2021-01-02 22:51:17
2 1 2021-01-02 13:44:12
2 1 2021-01-02 01:39:14
3 3 2021-01-02 09:22:16
3 3 2021-01-02 03:34:15
3 3 2021-01-02 23:05:49
3 3 2021-01-02 02:21:35
3 3 2021-01-02 19:51:41
3 3 2021-01-02 16:02:20
3 3 2021-01-02 18:14:05
3 3 2021-01-02 09:07:14
3 3 2021-01-02 22:43:44
3 3 2021-01-02 20:48:15
4 2 2021-01-03 19:25:04
4 2 2021-01-03 14:08:03
4 2 2021-01-03 21:23:58
4 2 2021-01-03 17:24:58
5 4 2021-01-01 23:37:41
5 4 2021-01-01 06:06:17
5 4 2021-01-01 19:23:29
5 4 2021-01-01 02:12:50
5 4 2021-01-01 08:09:59
5 4 2021-01-01 03:49:30
5 4 2021-01-01 08:00:42
5 4 2021-01-01 08:03:34
5 4 2021-01-01 15:36:12
5 4 2021-01-01 14:50:43
5 4 2021-01-01 14:54:04
5 4 2021-01-01 14:58:08
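If you need the random times to be reproducible, one option (a sketch, assuming an arbitrary seed of 42 and the same column names) is to draw them from a seeded NumPy generator instead of the global random state:
rng = np.random.default_rng(42)  # seeded generator so reruns produce the same times
out = (df.loc[df.index.repeat(df['count'])]
         .assign(date=lambda d: pd.to_datetime(d['date'])
                                + pd.to_timedelta(rng.integers(0, 24*3600, size=len(d)), unit='s'))
         .rename(columns={'date': 'date_time'})
         .drop('count', axis=1))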
My problem is a complex and confusing one; I haven't been able to find the answer anywhere.
I basically have two dataframes: one is the price history of certain products, and the other is an invoice dataframe that contains transaction data.
Sample Data:
Price History:
product_id updated price
id
1 1 2022-01-01 5.0
2 2 2022-01-01 5.5
3 3 2022-01-01 5.7
4 1 2022-01-15 6.0
5 2 2022-01-15 6.5
6 3 2022-01-15 6.7
7 1 2022-02-01 7.0
8 2 2022-02-01 7.5
9 3 2022-02-01 7.7
Invoice:
transaction_date product_id quantity
id
1 2022-01-02 1 2
2 2022-01-02 2 3
3 2022-01-02 3 4
4 2022-01-14 1 1
5 2022-01-14 2 4
6 2022-01-14 3 2
7 2022-01-15 1 3
8 2022-01-15 2 6
9 2022-01-15 3 5
10 2022-01-16 1 3
11 2022-01-16 2 2
12 2022-01-16 3 3
13 2022-02-05 1 1
14 2022-02-05 2 4
15 2022-02-05 3 7
16 2022-05-10 1 4
17 2022-05-10 2 2
18 2022-05-10 3 1
What I am looking to achieve is to add a price column to the Invoice dataframe, based on:
The product id
Comparing the updated and transaction dates such that updated date <= transaction date for that particular record, i.e. finding the most recent price update on or before the transaction (the MAX updated date that is <= transaction date)
I managed to do this:
invoice['price'] = invoice['product_id'].map(price_history.set_index('id')['price'])
but I need to incorporate the date condition now.
Expected result for the sample data: the Invoice table with a price column giving the price that was in effect on each transaction date.
Any guidance in the correct direction is appreciated, thanks
merge_asof is what you are looking for:
pd.merge_asof(
invoice,
price_history,
left_on="transaction_date",
right_on="updated",
by="product_id",
)[["transaction_date", "product_id", "quantity", "price"]]
merge_asof with the direction argument:
merged = pd.merge_asof(
left=invoice,
right=price_history,
left_on="transaction_date",
right_on="updated",
by="product_id",
direction="backward",
suffixes=("", "_y")
).drop(columns=["id_y", "updated"]).reset_index(drop=True)
print(merged)
id transaction_date product_id quantity price
0 1 2022-01-02 1 2 5.0
1 2 2022-01-02 2 3 5.5
2 3 2022-01-02 3 4 5.7
3 4 2022-01-14 1 1 5.0
4 5 2022-01-14 2 4 5.5
5 6 2022-01-14 3 2 5.7
6 7 2022-01-15 1 3 6.0
7 8 2022-01-15 2 6 6.5
8 9 2022-01-15 3 5 6.7
9 10 2022-01-16 1 3 6.0
10 11 2022-01-16 2 2 6.5
11 12 2022-01-16 3 3 6.7
12 13 2022-02-05 1 1 7.0
13 14 2022-02-05 2 4 7.5
14 15 2022-02-05 3 7 7.7
15 16 2022-05-10 1 4 7.0
16 17 2022-05-10 2 2 7.5
17 18 2022-05-10 3 1 7.7
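One caveat, relevant only if your full data has transactions dated before a product's first price update (the sample does not): with direction="backward" such rows find no match and get a NaN price, which is easy to check for afterwards:
# rows for which no price update exists on or before the transaction date
print(merged[merged['price'].isna()])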
I created a dataframe with pandas that looks like this:
HostName    Date
B           2021-01-01 12:30
A           2021-01-01 12:45
C           2021-01-01 12:46
A           2021-02-01 12:42
B           2021-02-01 12:43
A           2021-02-01 12:45
B           2021-02-25 12:46
C           2021-03-01 12:41
A           2021-03-01 12:42
A           2021-03-01 12:43
C           2021-03-01 12:45
For every day, it should count how many different HostName values there are from the beginning of the day (e.g. 2021-01-01 00:00) up to the specific row.
Example: let's take 2021-01-01:
HostName    Date
B           2021-01-01 12:30
A           2021-01-01 12:45
C           2021-01-01 12:46
There are three rows:
the first result would be 1, because it was the first row in the day (B);
the second result would be 2, because from the beginning of the day up to this line there are two different HostName values (B, A);
the third result would be 3, because from the beginning of the day up to this line there are three different HostName values (B, A, C).
The end result should look like this:
HostName    Date                Result
B           2021-01-01 12:30    1
A           2021-01-01 12:45    2
C           2021-01-01 12:46    3
A           2021-02-01 12:42    1
B           2021-02-01 12:43    2
A           2021-02-01 12:45    2
B           2021-02-25 12:46    1
C           2021-03-01 12:41    1
A           2021-03-01 12:42    2
A           2021-03-01 12:43    2
C           2021-03-01 12:45    2
What I tried to do but failed:
df.groupby(['HostName','Date'])['HostName'].cumcount() + 1
or
def f(x):
    one = x['HostName'].to_numpy()
    twe = x['Date'].to_numpy()
    both = x[['HostName','Date']].shift(1).to_numpy()
    x['Host_1D_CumCount_Conn'] = [np.sum((one == a) & (twe == b)) for a, b in both]
    return x

df.groupby('HostName').apply(f)
Use a lambda function in GroupBy.transform with Series.duplicated and a cumulative sum:
df['Result'] = (df.groupby(df['Date'].dt.date)['HostName']
.transform(lambda x: (~x.duplicated()).cumsum()))
print (df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
An alternative, faster solution is to create helper columns: d for the dates and new for rows not duplicated per d and HostName, then use GroupBy.cumsum:
df['Result'] = (df.assign(d = df['Date'].dt.date,
new = lambda x: ~x.duplicated(['d','HostName']))
.groupby('d')['new']
.cumsum())
print (df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
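Both variants assume Date already has a datetime dtype (the .dt accessor raises on plain strings); if it does not, convert it first:
df['Date'] = pd.to_datetime(df['Date'])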
You can groupby the Date and use expanding+nunique. The issue is that, currently, expanding only works with numerical values (I wish we could simply do expanding().nunique()).
Thus we have to cheat a bit and factorize the column to numbers before applying pd.Series.nunique.
df['Result'] = (df.groupby(pd.to_datetime(df['Date']).dt.date, group_keys=False)
['HostName']
.apply(lambda s: pd.Series(s.factorize()[0]).expanding().apply(pd.Series.nunique))
.astype(int)
.values
)
output:
HostName Date Result
0 B 2021-01-01 12:30 1
1 A 2021-01-01 12:45 2
2 C 2021-01-01 12:46 3
3 A 2021-02-01 12:42 1
4 B 2021-02-01 12:43 2
5 A 2021-02-01 12:45 2
6 B 2021-02-25 12:46 1
7 C 2021-03-01 12:41 1
8 A 2021-03-01 12:42 2
9 A 2021-03-01 12:43 2
10 C 2021-03-01 12:45 2
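A possible variant of the same idea (a sketch, not the author's exact code): building the inner Series with the group's original index lets the result align on assignment, so the positional .values step is not needed:
df['Result'] = (df.groupby(pd.to_datetime(df['Date']).dt.date, group_keys=False)['HostName']
                  .apply(lambda s: pd.Series(s.factorize()[0], index=s.index)
                                     .expanding()
                                     .apply(pd.Series.nunique))
                  .astype(int))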
I have the following data frame, where time_stamp is already sorted in ascending order:
time_stamp indicator
0 2021-01-01 00:00:00 1
1 2021-01-01 00:02:00 1
2 2021-01-01 00:03:00 NaN
3 2021-01-01 00:04:00 NaN
4 2021-01-01 00:09:00 NaN
5 2021-01-01 00:14:00 NaN
6 2021-01-01 00:19:00 NaN
7 2021-01-01 00:24:00 NaN
8 2021-01-01 00:27:00 1
9 2021-01-01 00:29:00 NaN
10 2021-01-01 00:32:00 2
11 2021-01-01 00:34:00 NaN
12 2021-01-01 00:37:00 2
13 2021-01-01 00:38:00 NaN
14 2021-01-01 00:39:00 NaN
I want to create a new column in the above data frame that shows the time difference between each row's time_stamp value and the time_stamp of the nearest row above it where indicator is not NaN.
Below is how the output should look (time_diff is a timedelta value, but I'll just show subtraction by indices to better illustrate; for example, (2 - 1) = df['time_stamp'][2] - df['time_stamp'][1]):
time_stamp indicator time_diff
0 2021-01-01 00:00:00 1 NaT # (or undefined)
1 2021-01-01 00:02:00 1 1 - 0
2 2021-01-01 00:03:00 NaN 2 - 1
3 2021-01-01 00:04:00 NaN 3 - 1
4 2021-01-01 00:09:00 NaN 4 - 1
5 2021-01-01 00:14:00 NaN 5 - 1
6 2021-01-01 00:19:00 NaN 6 - 1
7 2021-01-01 00:24:00 NaN 7 - 1
8 2021-01-01 00:27:00 1 8 - 1
9 2021-01-01 00:29:00 NaN 9 - 8
10 2021-01-01 00:32:00 2 10 - 8
11 2021-01-01 00:34:00 NaN 11 - 10
12 2021-01-01 00:37:00 2 12 - 10
13 2021-01-01 00:38:00 NaN 13 - 12
14 2021-01-01 00:39:00 NaN 14 - 12
We can use a for loop that keeps track of the last non-NaN entry, but I'm looking for a solution that does not use a for loop.
I've ended up doing this:
# create an intermediate column holding the time_stamp of each row where `indicator` is not NaN,
# then forward-fill it so every row carries the most recent such timestamp
df['tracking'] = df['time_stamp'].where(df['indicator'].notna()).ffill()
# use that to subtract the value from the `time_stamp`
df['time_diff'] = df['time_stamp'] - df['tracking']
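Note that, per the expected output above, a row whose own indicator is not NaN is compared against the previous non-NaN row (e.g. row 8 shows 8 - 1), while the where/ffill approach above yields a zero difference on those rows. A minimal sketch of one way to match the stated output exactly, under the same column names, is to shift before forward-filling:
# reference = time_stamp of the nearest row strictly above with a non-NaN indicator
reference = df['time_stamp'].where(df['indicator'].notna()).shift().ffill()
df['time_diff'] = df['time_stamp'] - reference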
Below is a sample of the dataframe (df):
   alpha  value
0      a      5
1      a      8
2      a      4
3      b      2
4      b      1
I know how to make the sequence (numbers) as per the group:
df["serial"] = df.groupby("alpha").cumcount()+1
   alpha  value  serial
0      a      5       1
1      a      8       2
2      a      4       3
3      b      2       1
4      b      1       2
But instead of numbers I need a date-time sequence with a 30-minute interval.
Expected result:
   alpha  value               serial
0      a      5  2021-01-01 23:30:00
1      a      8  2021-01-02 00:00:00
2      a      4  2021-01-02 00:30:00
3      b      2  2021-01-01 23:30:00
4      b      1  2021-01-02 00:00:00
You can simply multiply your result by a pd.Timedelta and add a start timestamp:
print ((df.groupby("alpha").cumcount()+1)*pd.Timedelta(minutes=30)+pd.Timestamp("2021-01-01 23:00:00"))
0 2021-01-01 23:30:00
1 2021-01-02 00:00:00
2 2021-01-02 00:30:00
3 2021-01-01 23:30:00
4 2021-01-02 00:00:00
dtype: datetime64[ns]
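To keep the result as the serial column (assuming the same 23:00 start timestamp as above), assign it back:
df['serial'] = (df.groupby("alpha").cumcount() + 1) * pd.Timedelta(minutes=30) + pd.Timestamp("2021-01-01 23:00:00")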
Try to_datetime and groupby with cumcount, then multiply by a pd.Timedelta of 30 minutes:
>>> df['serial'] = pd.to_datetime('2021-01-01 23:30:00') + df.groupby('alpha').cumcount() * pd.Timedelta(minutes=30)
>>> df
alpha value serial
0 a 5 2021-01-01 23:30:00
1 a 8 2021-01-02 00:00:00
2 a 4 2021-01-02 00:30:00
3 b 2 2021-01-01 23:30:00
4 b 1 2021-01-02 00:00:00
>>>
companies_id transaction_month count
0 2020-10-01 3
1 2020-10-01 5
1 2020-11-01 5
1 2020-12-01 18
1 2021-01-01 8
I want the result to be like this:
companies_id transaction_month count first_month
0 2020-10-01 3
1 2020-10-01 5 2020-10-01
1 2020-11-01 5 2020-10-01
1 2020-12-01 18 2020-10-01
1 2021-01-01 8 2020-10-01
This is my data set. I want to add a new column called "first_month" that should contain the value from the transaction_month column where the corresponding count is >= 5.
For example, in the case of companies_id 1:
the first count of 5 or more occurred on 2020-10-01, therefore the first_month column should contain 2020-10-01 throughout, i.e. for all rows with companies_id 1.
Use Series.where to replace transaction_month with NaN where count is not >= 5, then use GroupBy.transform with GroupBy.first to get the first non-missing value per group into the new column:
df['transaction_month'] = pd.to_datetime(df['transaction_month'])
print (df['transaction_month'].where(df['count'] >= 5))
0 NaT
1 2020-10-01
2 2020-11-01
3 2020-12-01
4 2021-01-01
Name: transaction_month, dtype: datetime64[ns]
df['first_month'] = (df['transaction_month'].where(df['count'] >= 5)
.groupby(df['companies_id'])
.transform('first'))
print (df)
companies_id transaction_month count first_month
0 0 2020-10-01 3 NaT
1 1 2020-10-01 5 2020-10-01
2 1 2020-11-01 5 2020-10-01
3 1 2020-12-01 18 2020-10-01
4 1 2021-01-01 8 2020-10-01
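As the companies_id 0 row shows, GroupBy.first skips missing values, so a company whose count never reaches 5 keeps NaT in first_month; if you need to handle those separately, you can filter on that afterwards:
# companies with no month where count >= 5 keep NaT in first_month
print(df.loc[df['first_month'].isna(), 'companies_id'].unique())   # [0] for the sample data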