groupby and apply multiple conditions - python

This is a bit complicated, but I will try to explain it as best as I can.
I have the following dataframe:
transaction_hash block_timestamp from_address to_address value data token_address
1 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0xe5d84152dd961e2eb0d6c202cf3396f579974983 0x1111111254eeb25477b68fb85ed929f73a960582 1.052e+20 trace ETH
2 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.052e+20 trace ETH
3 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xe5d84152dd961e2eb0d6c202cf3396f579974983 1.0652365814992255e+20 transfer stETH
4 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.0652365814992255e+20 transfer stETH
5 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 6.391691160717606e+19 transfer stETH
6 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0xdc24316b9ae028f1497c275eb9192a3ea0f67022 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 4.260674654274649e+19 transfer stETH
7 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 1.401493579633375e+20 trace ETH
8 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.401493579633375e+20 trace ETH
9 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.419e+20 transfer stETH
10 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0 4.257e+19 transfer stETH
11 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0xdc24316b9ae028f1497c275eb9192a3ea0f67022 9.933e+19 transfer stETH
These 11 transfers represent two swap transactions (2 unique hashes) between ETH and stETH. It would be very nice if there were two clean transactions, with ETH going from A to B and stETH going from B to A. But in decentralized exchanges things work via routers, which send multiple transfers through various addresses to complete one swap.
So there are two transactions here, each one hash made up of several transfers, and I want to work out which ETH amount corresponds to which stETH amount. ETH and stETH prices are almost 1:1, so the two legs of a swap should be quite close in value.
In the first transaction (rows 1-6) there is one ETH value (1.052e+20) but three different stETH values (1.0652365814992255e+20, 6.391691160717606e+19, and 4.260674654274649e+19). Clearly the stETH amount that corresponds to the 1.052e+20 ETH leg is 1.0652365814992255e+20, as it is the closest in value.
So, in order to filter out the right pair, I want to group by transaction_hash and, if there is more than one unique stETH value in a group, keep the one that is closest to the ETH value.
So the desired output will be:
transaction_hash block_timestamp from_address to_address value data token_address
1 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0xe5d84152dd961e2eb0d6c202cf3396f579974983 0x1111111254eeb25477b68fb85ed929f73a960582 1.052e+20 trace ETH
2 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.052e+20 trace ETH
3 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xe5d84152dd961e2eb0d6c202cf3396f579974983 1.0652365814992255e+20 transfer stETH
4 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.0652365814992255e+20 transfer stETH
5 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 1.401493579633375e+20 trace ETH
6 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.401493579633375e+20 trace ETH
7 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.419e+20 transfer stETH
Thanks!
EDIT!
I applied the code suggested below, and something strange is happening. My original df has a transaction like this:
transaction_hash block_timestamp from_address to_address value data token_address
60347 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x558247e365be655f9144e1a0140d793984372ef3 6917030000000000.0 trace ETH
61076 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0xb1720612d0131839dc489fcf20398ea925282fca 1220650000000000.0 trace ETH
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e+17 trace ETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
625132 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x4028daac072e492d34a3afdbef0ba7e35d8b55c4 8.800918863842962e+17 transfer stETH
When I apply the code, instead of getting what I want, which is:
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e+17 trace ETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
625132 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x4028daac072e492d34a3afdbef0ba7e35d8b55c4 8.800918863842962e+17 transfer stETH
I get more rows, and it doesn't even remove the rows it should:
transaction_hash block_timestamp from_address to_address value data token_address
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
60347 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x558247e365be655f9144e1a0140d793984372ef3 6917030000000000.0 trace ETH
61076 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0xb1720612d0131839dc489fcf20398ea925282fca 1220650000000000.0 trace ETH
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e+17 trace ETH
what is going on?

If you can sort your data by value, you can use .merge_asof() to find the "nearest neighbor".
The by argument allows you to specify grouping.
df = df.sort_values("value")

left = (
    df.loc[df["token_address"] == "ETH", ["transaction_hash", "value"]]
    .rename(columns={"value": "value_x"})
    .reset_index()
)
right = (
    df.loc[df["token_address"] != "ETH", ["transaction_hash", "value"]]
    .rename(columns={"value": "value_y"})
    .reset_index()
)

# Near matches
nearest = pd.merge_asof(
    left=left,
    right=right,
    by="transaction_hash",
    left_on="value_x",
    right_on="value_y",
    direction="nearest",
)

# Closest matches only
nearest["abs"] = (nearest["value_x"] - nearest["value_y"]).abs()
idxmin = nearest.groupby("value_y")["abs"].idxmin()

# index_x / index_y contain the row indexes of closest matches
minrows = nearest.loc[idxmin, ["index_x", "index_y"]].to_numpy().ravel()

# Extract their values used to filter rows
pairs = df.loc[minrows, ["transaction_hash", "value", "data", "token_address"]]
print(df.merge(pairs))
We take the distance to each "near neighbor" with (x - y).abs() and use .idxmin() to choose the closest one.
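If the final df.merge(pairs) step gives you duplicated rows (as in the EDIT above), another way to express the same "closest in value" idea is a per-hash groupby that picks the ETH/stETH pair with the smallest absolute difference and keeps only the rows carrying those two values. This is only a sketch under the assumptions visible in your sample data (one swap per hash, value already numeric, ETH vs. everything else split on token_address):
import numpy as np
import pandas as pd

def closest_pair(g):
    # g: all transfers sharing one transaction_hash
    eth_vals = g.loc[g["token_address"] == "ETH", "value"].unique()
    steth_vals = g.loc[g["token_address"] != "ETH", "value"].unique()
    if len(eth_vals) == 0 or len(steth_vals) == 0:
        return g
    # pairwise distance between every ETH amount and every stETH amount
    dist = np.abs(eth_vals[:, None] - steth_vals[None, :])
    i, j = np.unravel_index(dist.argmin(), dist.shape)
    # keep only rows whose value belongs to the closest pair
    return g[g["value"].isin({eth_vals[i], steth_vals[j]})]

out = (df.groupby("transaction_hash", group_keys=False)
         .apply(closest_pair)
         .sort_index())
For the data shown in the EDIT this should keep only the 8.4846e+17 ETH trace plus the two matching stETH transfers, and each original row appears at most once.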


Create a new DataFrame using pandas date_range

I have the following DataFrame:
date_start date_end
0 2023-01-01 16:00:00 2023-01-01 17:00:00
1 2023-01-02 16:00:00 2023-01-02 17:00:00
2 2023-01-03 16:00:00 2023-01-03 17:00:00
3 2023-01-04 17:00:00 2023-01-04 19:00:00
4 NaN NaN
and I want to create a new DataFrame which will contain values starting from the date_start and ending at the date_end of each row.
So, for the first row, using the code below:
new_df = pd.Series(pd.date_range(start=df['date_start'][0], end=df['date_end'][0], freq='15min'))
I get the following:
0 2023-01-01 16:00:00
1 2023-01-01 16:15:00
2 2023-01-01 16:30:00
3 2023-01-01 16:45:00
4 2023-01-01 17:00:00
How can I get the same result for all the rows of the df combined in a new df?
You can use a list comprehension and concat:
out = pd.concat([pd.DataFrame({'date': pd.date_range(start=start, end=end, freq='15min')})
                 for start, end in zip(df['date_start'], df['date_end'])],
                ignore_index=True)
Output:
date
0 2023-01-01 16:00:00
1 2023-01-01 16:15:00
2 2023-01-01 16:30:00
3 2023-01-01 16:45:00
4 2023-01-01 17:00:00
5 2023-01-02 16:00:00
6 2023-01-02 16:15:00
7 2023-01-02 16:30:00
8 2023-01-02 16:45:00
9 2023-01-02 17:00:00
10 2023-01-03 16:00:00
11 2023-01-03 16:15:00
12 2023-01-03 16:30:00
13 2023-01-03 16:45:00
14 2023-01-03 17:00:00
15 2023-01-04 17:00:00
16 2023-01-04 17:15:00
17 2023-01-04 17:30:00
18 2023-01-04 17:45:00
19 2023-01-04 18:00:00
20 2023-01-04 18:15:00
21 2023-01-04 18:30:00
22 2023-01-04 18:45:00
23 2023-01-04 19:00:00
handling NAs:
out = pd.concat([pd.DataFrame({'date': pd.date_range(start=start, end=end, freq='15min')})
                 for start, end in zip(df['date_start'], df['date_end'])
                 if pd.notna(start) and pd.notna(end)],
                ignore_index=True)
Adding to the previous answer: date_range has a to_series() method, so you could proceed like this as well:
pd.concat(
    [
        pd.date_range(start=row['date_start'], end=row['date_end'], freq='15min').to_series()
        for _, row in df.iterrows()
    ],
    ignore_index=True
)
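Yet another possible variant (just a sketch, under the same assumptions as the answers above, i.e. df has date_start/date_end columns): build one list of timestamps per row and let explode flatten it, skipping rows with missing bounds.
s = (df.dropna(subset=['date_start', 'date_end'])
       .apply(lambda r: list(pd.date_range(r['date_start'], r['date_end'], freq='15min')),
              axis=1)
       .explode())
out = pd.DataFrame({'date': pd.to_datetime(s)}).reset_index(drop=True)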

Finding datetimes since event with conditions in pandas

I have a dataframe in pandas which reflects work shifts for employees (the time they are actually working). A snippet of it is the following:
df = pd.DataFrame({'Worker': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
                   'Shift_start': ['2022-01-01 10:00:00', '2022-01-01 13:10:00', '2022-01-01 15:45:00',
                                   '2022-01-01 11:30:00', '2022-01-01 13:40:00', '2022-01-01 15:20:00'],
                   'Shift_end': ['2022-01-01 12:30:00', '2022-01-01 15:30:00', '2022-01-01 17:30:00',
                                 '2022-01-01 13:30:00', '2022-01-01 15:10:00', '2022-01-01 18:10:00']})
Worker  Shift_start          Shift_end
Alice   2022-01-01 10:00:00  2022-01-01 12:30:00
Alice   2022-01-01 13:10:00  2022-01-01 15:30:00
Alice   2022-01-01 15:45:00  2022-01-01 17:30:00
Bob     2022-01-01 11:30:00  2022-01-01 13:30:00
Bob     2022-01-01 13:40:00  2022-01-01 15:10:00
Bob     2022-01-01 15:20:00  2022-01-01 18:10:00
Now, I need to compute, for every row, the time since the last partial break, defined as a pause of more than 20 minutes, measured with respect to the start time of each shift. That is, if there is a pause of only 15 minutes, it should be treated as if the pause did not exist, and the time should be counted since the last >20 min pause. If no such pause exists, the time should be taken since the start of the worker's day.
So I would need something like:
Worker  Shift_start          Shift_end            Hours_since_break
Alice   2022-01-01 10:00:00  2022-01-01 12:30:00  0
Alice   2022-01-01 13:10:00  2022-01-01 15:30:00  0
Alice   2022-01-01 15:45:00  2022-01-01 17:30:00  2.58
Bob     2022-01-01 11:30:00  2022-01-01 13:30:00  0
Bob     2022-01-01 13:40:00  2022-01-01 15:10:00  2.17
Bob     2022-01-01 15:20:00  2022-01-01 18:10:00  3.83
For Alice, the first row is 0: as there is no previous break, it is taken as the time since the start of her day, and since it is her first shift the result is 0 hours. In the second row, she has just taken a 40-minute pause, so again 0 hours since the break. In the third row, she has only taken a 15-minute pause, but as the minimum break is 20 minutes, it is as if she hadn't taken any break. Therefore, the time since her last break is measured from 13:10:00, when her last break ended, so the result is 2 hours and 35 minutes, i.e., 2.58 hours.
The same logic applies to Bob. The first row is 0 (it is his first shift of the day). In the second row he has only taken a 10-minute break, which doesn't count, so the time since the last break is measured from the start of his day, i.e., 2h10m (2.17 hours). In the third row he has again taken a 10-minute break, so the time is again measured from the start of his day: 3h50m (3.83 hours).
To compute the breaks with the 20-minute constraint I did the following:
shifted_end = df.groupby("Worker")["Shift_end"].shift()
df["Partial_break"] = (df["Shift_start"] - shifted_end)
df['Partial_break_hours'] = df["Partial_break"].dt.total_seconds() / 3600
df.loc[(df['Partial_break_hours']<0.33), 'Partial_break_hours'] = 0
But I can't think of a way to implement the search logic to give the desired output. Any help is much appreciated!
You can try (assuming the DataFrame is sorted):
def fn(x):
    # assumes Shift_start/Shift_end are already datetimes and rows are sorted by Shift_start
    rv = []
    last_zero = 0
    for a, c in zip(
        x["Shift_start"],
        (x["Shift_start"] - x["Shift_end"].shift()) < "20 minutes",
    ):
        if c:
            rv.append(round((a - last_zero) / pd.to_timedelta(1, unit="hour"), 2))
        else:
            last_zero = a
            rv.append(0)
    return pd.Series(rv, index=x.index)

df["Hours_since_break"] = df.groupby("Worker").apply(fn).droplevel(0)
print(df)
Prints:
Worker Shift_start Shift_end Hours_since_break
0 Alice 2022-01-01 10:00:00 2022-01-01 12:30:00 0.00
1 Alice 2022-01-01 13:10:00 2022-01-01 15:30:00 0.00
2 Alice 2022-01-01 15:45:00 2022-01-01 17:30:00 2.58
3 Bob 2022-01-01 11:30:00 2022-01-01 13:30:00 0.00
4 Bob 2022-01-01 13:40:00 2022-01-01 15:10:00 2.17
5 Bob 2022-01-01 15:20:00 2022-01-01 18:10:00 3.83
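As a side note (this depends on your pandas version, so treat it as an assumption): on pandas 1.5+ the .droplevel(0) step can be avoided by telling groupby not to add the group keys to the index:
df["Hours_since_break"] = df.groupby("Worker", group_keys=False).apply(fn)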
You could calculate a "fullBreakAtStart" flag and, based on that, set a "lastShiftStart". If there is no full break at the start of a shift, just enter np.nan and then use fillna(method="ffill"). Here is the code:
df["Shift_end_prev"] = df.groupby("Worker")["Shift_end"].shift(1)
df["timeDiff"] = pd.to_datetime(df["Shift_start"]) - pd.to_datetime(df["Shift_end_prev"])
df["fullBreakAtStart"] = (df["timeDiff"]> "20 minutes") | (df["timeDiff"].isna())
df["lastShiftStart"] = np.where(df["fullBreakAtStart"], df["Shift_start"], np.nan)
df["lastShiftStart"] = df["lastShiftStart"].fillna(method="ffill")
df["Hours_since_break"] = pd.to_datetime(df["Shift_start"]) - pd.to_datetime(df["lastShiftStart"])
df["Hours_since_break"] = df["Hours_since_break"]/np.timedelta64(1, 'h')
df["Hours_since_break"] = np.where(df["fullBreakAtStart"],0,df["Hours_since_break"])

Simple Linear Regression Stock Price Prediction

This simple linear regression (LR) predicts the close price, but it doesn't go further than the end of the dataframe. I mean, I have the last closing price with its prediction alongside, but I want to know the next 10 closing prices, which of course I don't have yet because they are still coming. How do I see the next 10 predictions in the LR column without having the closing prices yet?
# get prices from the exchange
prices = SESSION_DATA.query_kline(
    symbol='BTCUSDT',
    interval=60,                                # timeframe (1 hour)
    limit=200,                                  # number of candles
    from_time=(TIMESTAMP() - (200 * 60) * 60))  # from now go back 200 candles, 1 hour each

# pull the data into a dataframe
df = pd.DataFrame(prices['result'])
df = df[['open_time', 'open', 'high', 'low', 'close']].astype(float)
df['open_time'] = pd.to_datetime(df['open_time'], unit='s')
# df['open_time'] = pd.to_datetime(df['open_time':]).strftime("%Y%m%d %I:%M:%S")
df.rename(columns={'open_time': 'Date'}, inplace=True)

# using TA-Lib
prediction = TAL.LINEARREG(df['close'], 10)
df['LR'] = prediction
print(df)
Date open high low close LR
0 2022-10-06 14:00:00 20099.0 20116.5 19871.5 20099.0 NaN
1 2022-10-06 15:00:00 20099.0 20115.5 19987.0 20002.5 NaN
2 2022-10-06 16:00:00 20002.5 20092.0 19932.5 20050.0 NaN
3 2022-10-06 17:00:00 20050.0 20270.0 20002.5 20105.5 NaN
4 2022-10-06 18:00:00 20105.5 20106.0 19979.0 20010.5 NaN
5 2022-10-06 19:00:00 20010.5 20063.0 19985.0 20004.5 NaN
6 2022-10-06 20:00:00 20004.5 20064.5 19995.5 20042.5 NaN
7 2022-10-06 21:00:00 20042.5 20043.0 19878.5 19905.0 NaN
8 2022-10-06 22:00:00 19905.0 19944.0 19836.5 19894.0 NaN
9 2022-10-06 23:00:00 19894.0 19965.0 19851.0 19954.5 19925.527273
10 2022-10-07 00:00:00 19954.5 20039.5 19937.5 19984.5 19936.263636
11 2022-10-07 01:00:00 19984.5 20010.0 19957.0 19988.5 19935.327273
. . . I want the df to end this way:
188 2022-10-14 10:00:00 19639.0 19733.5 19621.0 19680.0 19623.827273
189 2022-10-14 11:00:00 19680.0 19729.0 19576.5 NaN 19592.990909
190 2022-10-14 12:00:00 19586.5 19835.0 19535.5 NaN 19638.054545
191 2022-10-14 13:00:00 19785.5 19799.0 19612.0 NaN 19637.463636
192 2022-10-14 14:00:00 19656.5 19656.5 19334.5 NaN 19574.572727
193 2022-10-14 15:00:00 19455.0 19507.5 19303.5 NaN 19493.990909
194 2022-10-14 16:00:00 19351.0 19390.0 19220.0 NaN 19416.154545
195 2022-10-14 17:00:00 19296.5 19369.5 19284.5 NaN 19356.072727
196 2022-10-14 18:00:00 19358.0 19358.0 19127.5 NaN 19253.918182
197 2022-10-14 19:00:00 19208.5 19264.5 19100.0 NaN 19164.745455
198 2022-10-14 20:00:00 19164.0 19211.0 19114.0 NaN 19112.445455
199 2022-10-14 21:00:00 19172.0 19201.0 19125.0 NaN 19067.772727
Since linear regression is just ax + b, the 10 further predictions would repeat themselves, because you don't have any more input to alter the predictions besides the close price. I think you are looking for a Monte Carlo simulation, which would try to predict prices based on the random walk hypothesis for stock market prices.
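To make that suggestion concrete, here is a rough, purely illustrative sketch of such a random-walk Monte Carlo (the function name and parameters are my own assumptions, not an established API): it bootstraps historical hourly log returns from the close column and rolls them forward 10 steps.
import numpy as np
import pandas as pd

def simulate_paths(close: pd.Series, steps: int = 10, n_paths: int = 1000, seed: int = 0) -> np.ndarray:
    """Bootstrap hourly log returns from `close` and roll them forward."""
    rng = np.random.default_rng(seed)
    log_ret = np.log(close).diff().dropna().to_numpy()       # historical hourly log returns
    draws = rng.choice(log_ret, size=(n_paths, steps), replace=True)
    return close.iloc[-1] * np.exp(np.cumsum(draws, axis=1))  # simulated future closes

# hypothetical usage on the dataframe above:
# paths = simulate_paths(df['close'].dropna(), steps=10)
# next_10 = paths.mean(axis=0)   # one crude point estimate per future hour
This gives a distribution of possible future closes rather than a single deterministic line, which is closer to what the random walk hypothesis implies.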

Pandas groupby, melt and drop in one go

I want to add a column to the dataframe with values (comments) based on the Timestamp, grouped per day.
I made it work as in the example below, but... is there any other, more "pandonic" way? Maybe a one-liner, or at least close to it?
Example dataframe (the actual one has many more dates and more distinct values):
import pandas as pd

data = {"Values": ["absd", "abse", "dara", "absd", "abse", "dara"],
        "Date": ["2022-05-25", "2022-05-25", "2022-05-25", "2022-05-26", "2022-05-26", "2022-05-26"],
        "Timestamp": ["2022-05-25 08:00:00", "2022-05-25 11:30:00", "2022-05-25 20:25:00",
                      "2022-05-26 09:00:00", "2022-05-26 13:40:00", "2022-05-26 19:15:00"]}
df = pd.DataFrame(data)
df.Timestamp = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M:%S')
df.Date = pd.to_datetime(df.Date, format='%Y-%m-%d')
df out:
Values Date Timestamp
0 absd 2022-05-25 2022-05-25 08:00:00
1 abse 2022-05-25 2022-05-25 11:30:00
2 dara 2022-05-25 2022-05-25 20:25:00
3 absd 2022-05-26 2022-05-26 09:00:00
4 abse 2022-05-26 2022-05-26 13:40:00
5 dara 2022-05-26 2022-05-26 19:15:00
the end result I want is:
Values Date Period Datetime
0 absd 2022-05-25 Start 2022-05-25 08:00:00
1 abse 2022-05-25 Start 2022-05-25 08:00:00
2 dara 2022-05-25 Start 2022-05-25 08:00:00
3 dara 2022-05-25 Mid 2022-05-25 11:30:00
4 abse 2022-05-25 Mid 2022-05-25 11:30:00
5 absd 2022-05-25 Mid 2022-05-25 11:30:00
6 dara 2022-05-25 End 2022-05-25 20:25:00
7 abse 2022-05-25 End 2022-05-25 20:25:00
8 absd 2022-05-25 End 2022-05-25 20:25:00
9 dara 2022-05-26 Start 2022-05-26 09:00:00
10 abse 2022-05-26 Start 2022-05-26 09:00:00
11 absd 2022-05-26 Start 2022-05-26 09:00:00
12 absd 2022-05-26 Mid 2022-05-26 13:40:00
13 abse 2022-05-26 Mid 2022-05-26 13:40:00
14 dara 2022-05-26 Mid 2022-05-26 13:40:00
15 absd 2022-05-26 End 2022-05-26 19:15:00
16 abse 2022-05-26 End 2022-05-26 19:15:00
17 dara 2022-05-26 End 2022-05-26 19:15:00
my working approach is below:
df["Start"] = df["Timestamp"].groupby(df["Date"]).transform("min")
df["End"] = df["Timestamp"].groupby(df["Date"]).transform("max")
df["Mid"] = df["Timestamp"].groupby(df["Date"]).transform("median")
df1 = df.melt(id_vars=["Values", "Date"],
              var_name="Period", value_name="Datetime").sort_values("Datetime")
df1 = df1[df1.Period != "Timestamp"].reset_index(drop=True)
From the end-result dataframe, it looks like you need a combination of all the columns (well, a combination of the Values column and the ('Date', 'Timestamp') pair).
One option is with complete from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .assign(Period=['Start', 'Mid', 'End'] * 2)
 .complete(('Date', 'Timestamp', 'Period'), 'Values')
)
Values Date Timestamp Period
0 absd 2022-05-25 2022-05-25 08:00:00 Start
1 abse 2022-05-25 2022-05-25 08:00:00 Start
2 dara 2022-05-25 2022-05-25 08:00:00 Start
3 absd 2022-05-25 2022-05-25 11:30:00 Mid
4 abse 2022-05-25 2022-05-25 11:30:00 Mid
5 dara 2022-05-25 2022-05-25 11:30:00 Mid
6 absd 2022-05-25 2022-05-25 20:25:00 End
7 abse 2022-05-25 2022-05-25 20:25:00 End
8 dara 2022-05-25 2022-05-25 20:25:00 End
9 absd 2022-05-26 2022-05-26 09:00:00 Start
10 abse 2022-05-26 2022-05-26 09:00:00 Start
11 dara 2022-05-26 2022-05-26 09:00:00 Start
12 absd 2022-05-26 2022-05-26 13:40:00 Mid
13 abse 2022-05-26 2022-05-26 13:40:00 Mid
14 dara 2022-05-26 2022-05-26 13:40:00 Mid
15 absd 2022-05-26 2022-05-26 19:15:00 End
16 abse 2022-05-26 2022-05-26 19:15:00 End
17 dara 2022-05-26 2022-05-26 19:15:00 End
Using only pandas:
(
    df['Timestamp'].groupby(df['Date']).agg(['min', 'median', 'max']).merge(df, on='Date')
    .melt(id_vars=['Values', 'Date'], var_name='Period', value_name='Datetime')
    .query('Period != "Timestamp"')
    .sort_values('Datetime')
)
Output:
Values Date Period Datetime
0 absd 2022-05-25 min 2022-05-25 08:00:00
1 abse 2022-05-25 min 2022-05-25 08:00:00
2 dara 2022-05-25 min 2022-05-25 08:00:00
7 abse 2022-05-25 median 2022-05-25 11:30:00
6 absd 2022-05-25 median 2022-05-25 11:30:00
8 dara 2022-05-25 median 2022-05-25 11:30:00
12 absd 2022-05-25 max 2022-05-25 20:25:00
13 abse 2022-05-25 max 2022-05-25 20:25:00
14 dara 2022-05-25 max 2022-05-25 20:25:00
4 abse 2022-05-26 min 2022-05-26 09:00:00
3 absd 2022-05-26 min 2022-05-26 09:00:00
5 dara 2022-05-26 min 2022-05-26 09:00:00
9 absd 2022-05-26 median 2022-05-26 13:40:00
10 abse 2022-05-26 median 2022-05-26 13:40:00
11 dara 2022-05-26 median 2022-05-26 13:40:00
16 abse 2022-05-26 max 2022-05-26 19:15:00
15 absd 2022-05-26 max 2022-05-26 19:15:00
17 dara 2022-05-26 max 2022-05-26 19:15:00
Another pandas only method:
out = (df.groupby('Date')
         .agg({'Timestamp': ['min', 'median', 'max'], 'Values': list})
         .explode(('Values', 'list'))
         .droplevel(0, axis=1)
         .rename(columns={'list': 'Values'})
         .reset_index()
         .melt(['Values', 'Date'], var_name='Period', value_name='Datetime')
         .sort_values('Datetime', ignore_index=True))
print(out)
Output:
Values Date Period Datetime
0 absd 2022-05-25 min 2022-05-25 08:00:00
1 abse 2022-05-25 min 2022-05-25 08:00:00
2 dara 2022-05-25 min 2022-05-25 08:00:00
3 abse 2022-05-25 median 2022-05-25 11:30:00
4 absd 2022-05-25 median 2022-05-25 11:30:00
5 dara 2022-05-25 median 2022-05-25 11:30:00
6 absd 2022-05-25 max 2022-05-25 20:25:00
7 abse 2022-05-25 max 2022-05-25 20:25:00
8 dara 2022-05-25 max 2022-05-25 20:25:00
9 abse 2022-05-26 min 2022-05-26 09:00:00
10 absd 2022-05-26 min 2022-05-26 09:00:00
11 dara 2022-05-26 min 2022-05-26 09:00:00
12 absd 2022-05-26 median 2022-05-26 13:40:00
13 abse 2022-05-26 median 2022-05-26 13:40:00
14 dara 2022-05-26 median 2022-05-26 13:40:00
15 abse 2022-05-26 max 2022-05-26 19:15:00
16 absd 2022-05-26 max 2022-05-26 19:15:00
17 dara 2022-05-26 max 2022-05-26 19:15:00
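If the literal Start/Mid/End labels from the desired output are wanted instead of min/median/max, a small follow-up mapping could be added to either pandas-only result (a sketch, reusing the out frame from the last snippet):
out['Period'] = out['Period'].map({'min': 'Start', 'median': 'Mid', 'max': 'End'})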

Python: How to chronologically sort by date and find any gaps

I searched for an answer but couldn't find one!
I have a dataframe that looks like:
import pandas as pd
df = pd.DataFrame({'Cust_Name': ['APPT1', 'APPT1', 'APPT2', 'APPT2'],
                   'Move_In': ['2013-02-01', '2019-02-01', '2019-02-04', '2019-02-19'],
                   'Move_Out': ['2019-01-31', '', '2019-02-15', '']})
I am looking to find a way to calculate the vacancy.
APPT1 was occupied from 2013-02-01 to 2019-01-31 and again from the very next day, 2019-02-01. So the vacancy for APPT1 is 0, and it is currently occupied.
APPT2 was occupied from 2019-02-04 to 2019-02-15 and again from 2019-02-19. So the vacancy for APPT2 is 2 business days, and it is currently occupied.
An empty/NaT Move_Out means the unit is currently occupied.
TIA
df = pd.DataFrame({
    'Cust_Name': ['APPT1', 'APPT1', 'APPT2', 'APPT2'],
    'Move_In': ['2013-02-01', '2019-02-01', '2019-02-04', '2019-02-19'],
    'Move_Out': ['2019-01-31', '', '2019-02-15', '']
})
# note: recent pandas versions need an explicit unit or pd.to_datetime(...) instead of astype('datetime64')
df['Move_In'] = df['Move_In'].astype('datetime64')
df['Move_Out'] = df['Move_Out'].astype('datetime64')
df['Prev_Move_Out'] = df['Move_Out'].shift()
Cust_Name Move_In Move_Out Prev_Move_Out
0 APPT1 2013-02-01 2019-01-31 NaT
1 APPT1 2019-02-01 NaT 2019-01-31
2 APPT2 2019-02-04 2019-02-15 NaT
3 APPT2 2019-02-19 NaT 2019-02-15
def calculate_business_day_vacancy(df):
    try:
        return len(pd.date_range(start=df['Prev_Move_Out'], end=df['Move_In'], freq='B')) - 2
    except ValueError:
        # Consider instead running the function only on rows that do not contain NaT.
        return 0

df['Vacancy_BDays'] = df.apply(calculate_business_day_vacancy, axis=1)
Output
Cust_Name Move_In Move_Out Prev_Move_Out Vacancy_BDays
0 APPT1 2013-02-01 2019-01-31 NaT 0
1 APPT1 2019-02-01 NaT 2019-01-31 0
2 APPT2 2019-02-04 2019-02-15 NaT 0
3 APPT2 2019-02-19 NaT 2019-02-15 1
Note that there is only one Business Day vacancy between 15 Feb 2019 and 19 Feb 2019.
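One extra caution (my own note, not something the answer relies on): with many customers interleaved in the frame, a plain shift() would pair one customer's Move_Out with the next customer's Move_In, so a per-customer shift may be safer:
df['Prev_Move_Out'] = df.groupby('Cust_Name')['Move_Out'].shift()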
