Let's say I have purchase records with two fields, Buy and Time.
What I want is a third column holding the time elapsed since the first not-buy row, so it looks like:
buy | time | time difference
1 | 8:00 | NULL
0 | 9:01 | NULL
0 | 9:10 | NULL
0 | 9:21 | NULL
1 | 9:31 | 0:30
0 | 9:41 | NULL
0 | 9:42 | NULL
1 | 9:53 | 0:12
How can I achieve this? It seems to me that it's a mix of pd.groupby() and pd.shift(), but I can't seem to work that out in my head.
IIUC
df.time = pd.to_datetime(df.time)
df.loc[df.buy == 1, 'DIFF'] = (
    df.groupby(df.buy.cumsum().shift().fillna(0))
      .time.transform(lambda x: x.iloc[-1] - x.iloc[0])
)
df
Out[19]:
buy time timedifference DIFF
0 1 2018-02-26 08:00:00 NaN 00:00:00
1 0 2018-02-26 09:01:00 NaN NaT
2 0 2018-02-26 09:10:00 NaN NaT
3 0 2018-02-26 09:21:00 NaN NaT
4 1 2018-02-26 09:31:00 0:30 00:30:00
5 0 2018-02-26 09:41:00 NaN NaT
6 0 2018-02-26 09:42:00 NaN NaT
7 1 2018-02-26 09:53:00 0:12 00:12:00
# df.buy.cumsum().shift().fillna(0) creates the group key: each group runs from the row after a buy up to and including the next buy
# .time.transform(lambda x: x.iloc[-1] - x.iloc[0]) computes the time difference within each group
# df.loc[df.buy==1, 'DIFF'] assigns the result only at the positions where buy equals 1
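For reference, a minimal, self-contained construction of the example dataframe (column names taken from the question), so the snippet above can be run as-is:

import pandas as pd

df = pd.DataFrame({
    'buy':  [1, 0, 0, 0, 1, 0, 0, 1],
    'time': ['8:00', '9:01', '9:10', '9:21',
             '9:31', '9:41', '9:42', '9:53'],
})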
I have been fiddling about with pandas.DataFrame.rolling for some time now and I haven't been able to achieve the result that I am looking for, so before I write a custom windowing function I figured I would ask if I'm missing something.
I have postgresql data with a composite index of (time, node) that has been read into a pandas.DataFrame, where time is a certain hour on a certain date. I need to create windows that contain all entries within the last two calendar dates (or any arbitrary number of days), for example, beginning at 2022-12-26 00:00:00 and ending on 2022-12-27 23:00:00, and then perform operations on that window to return a new, resultant DataFrame. The window should then move forward an entire calendar date, which is where I am failing.
| time | node | value |
| --------------------- | ----- | ------ |
| 2022-12-26 00:00:00 | 123 | low |
| 2022-12-26 01:00:00 | 123 | med |
| 2022-12-26 02:00:00 | 123 | low |
| 2022-12-26 03:00:00 | 123 | high |
| ... | ... | ... |
| 2022-12-26 00:00:00 | 999 | low |
| 2022-12-26 01:00:00 | 999 | low |
| 2022-12-26 02:00:00 | 999 | low |
| 2022-12-26 03:00:00 | 999 | med |
| ... | ... | ... |
| 2022-12-27 00:00:00 | 123 | low |
| 2022-12-27 01:00:00 | 123 | med |
| 2022-12-27 02:00:00 | 123 | low |
| 2022-12-27 03:00:00 | 123 | high |
When I use something akin to df.rolling(window=pd.Timedelta('2days')), the windows move forward hour-by-hour, as opposed to beginning on the next calendar date.
I've played around with using min_periods, but it doesn't seem to work with my data, nor would it be acceptable in the long run because the number of expected observations per window is not fixed anyway. The step parameter also appears to be useless in this case, because I am using an offset rather than an integer for the window.
Is the behaviour I am looking for doable with pandas.DataFrame.rolling or must I look elsewhere/write my own windowing function?
Any guidance would be appreciated. Thanks!
So from what I understand, you want to create windows of length ndays, with each subsequent window starting on the next day.
Given some dataframe with 5 days in total at a frequency of 1H between indices:
import pandas as pd
import numpy as np

periods = 23 * 5
df = pd.DataFrame(
    {'value': list(range(periods))},
    index=pd.date_range('2022-12-16', periods=periods, freq='H')
)
d = np.random.choice(
    pd.date_range('2022-12-16', periods=periods, freq='H'),
    int(periods * 0.25)
)
df = df.drop(index=d)
df.head(5)
>>> value
2022-12-16 00:00:00 0
2022-12-16 01:00:00 1
2022-12-16 02:00:00 2
2022-12-16 04:00:00 4
2022-12-16 05:00:00 5
I randomly dropped some indices to simulate missing data.
We can use df.resample to group the data by day (regardless of missing data):
days = df.resample('1d')
print(days.get_group('2022-12-16'))
>>> value
2022-12-16 00:00:00 0
2022-12-16 01:00:00 1
2022-12-16 02:00:00 2
2022-12-16 04:00:00 4
2022-12-16 05:00:00 5
2022-12-16 06:00:00 6
2022-12-16 07:00:00 7
2022-12-16 08:00:00 8
2022-12-16 09:00:00 9
2022-12-16 11:00:00 11
2022-12-16 12:00:00 12
2022-12-16 13:00:00 13
2022-12-16 14:00:00 14
2022-12-16 15:00:00 15
2022-12-16 17:00:00 17
2022-12-16 18:00:00 18
2022-12-16 19:00:00 19
2022-12-16 21:00:00 21
2022-12-16 22:00:00 22
2022-12-16 23:00:00 23
Now, we only need to iterate over the days in a "sliding" manner. The package more-itertools has some helpful functions, such as windowed, which lets us easily control the size of the window (here with ndays):
from more_itertools import windowed

ndays = 2
windows = [
    pd.concat([w[1] for w in window])
    for window in windowed(days, ndays)
]
Printing the first and last index of each window returns:
for window in windows:
    print(window.iloc[[0, -1]])
>>> value
2022-12-16 00:00:00 0
2022-12-17 23:00:00 47
value
2022-12-17 00:00:00 24
2022-12-18 23:00:00 71
value
2022-12-18 00:00:00 48
2022-12-19 23:00:00 95
value
2022-12-19 01:00:00 73
2022-12-20 18:00:00 114
Furthermore, you can set step in windowed to control the step size between windows.
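For instance (a small sketch, still assuming ndays = 2 as above), step=2 yields non-overlapping windows; windowed pads an incomplete trailing window with None, so those entries are filtered out:

windows = [
    pd.concat([w[1] for w in window])
    for window in windowed(days, ndays, step=2)
    if None not in window
]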
I have a dataframe like this (the actual data has 70 timestamp columns), with column names A_Timestamp, BC_Timestamp, DA_Timestamp, CA_Timestamp, B_Values, C_Values, D_Values, Q_Values:
| A_Timestamp         | B_Values |
| ------------------- | -------- |
| 2020-11-08 11:15:00 | 1        |
| 2020-11-10 15:34:00 | 2        |

| BC_Timestamp        | C_Values |
| ------------------- | -------- |
| 2020-11-11 12:13:00 | 8        |
| 2020-11-15 02:47:00 | 4        |

| DA_Timestamp        | D_Values |
| ------------------- | -------- |
| 2020-1-13 14:47:00  | 3        |
| 2020-11-9 5:34:00   | 5        |

| CA_Timestamp        | Q_Values |
| ------------------- | -------- |
| 2020-7-18 01:04:00  | 7        |
| 2020-04-10 16:34:00 | 6        |
And I want it like this:
| Timestamp           | B_Values | C_Values | D_Values | Q_Values |
| ------------------- | -------- | -------- | -------- | -------- |
| 2020-11-08 11:15:00 | 1        | NaN      | NaN      | NaN      |
| 2020-11-10 15:34:00 | 2        | NaN      | NaN      | NaN      |
| 2020-11-11 12:13:00 | NaN      | 8        | NaN      | NaN      |
| 2020-11-15 02:47:00 | NaN      | 4        | NaN      | NaN      |
| 2020-1-13 14:47:00  | NaN      | NaN      | 3        | NaN      |
| 2020-11-9 05:34:00  | NaN      | NaN      | 5        | NaN      |
| 2020-7-18 01:04:00  | NaN      | NaN      | NaN      | 7        |
I want to merge all the columns ending with 'Timestamp' into one single column. And each timestamp with their respective value in the respective columns.
You can use a renamer for the Timestamp columns:
dfs = [df1, df2, df3, df4]
renamer = lambda x: 'Timestamp' if x.endswith('Timestamp') else x
out = pd.concat([d.rename(renamer, axis=1) for d in dfs])
Output:
Timestamp B_Values C_Values D_Values Q_Values
0 2020-11-08 11:15:00 1.0 NaN NaN NaN
1 2020-11-10 15:34:00 2.0 NaN NaN NaN
0 2020-11-11 12:13:00 NaN 8.0 NaN NaN
1 2020-11-15 02:47:00 NaN 4.0 NaN NaN
0 2020-1-13 14:47:00 NaN NaN 3.0 NaN
1 2020-11-9 5:34:00 NaN NaN 5.0 NaN
0 2020-7-18 01:04:00 NaN NaN NaN 7.0
1 2020-04-10 16:34:00 NaN NaN NaN 6.0
Alternative
Assuming you have a single DataFrame as input:
A_Timestamp B_Values BC_Timestamp C_Values DA_Timestamp D_Values CA_Timestamp Q_Values
0 2020-11-08 11:15:00 1 2020-11-11 12:13:00 8 2020-1-13 14:47:00 3 2020-7-18 01:04:00 7
1 2020-11-10 15:34:00 2 2020-11-15 02:47:00 4 2020-11-9 5:34:00 5 2020-04-10 16:34:00 6
You can then reshape with a MultiIndex:
m = df.columns.str.endswith('Timestamp')
s = df.columns.to_series().mask(m)
out = (df
    .set_axis(pd.MultiIndex.from_arrays(
        [s.bfill(), s.fillna('Timestamp')]), axis=1)
    .T.stack().unstack(-2).droplevel(0)
)
Output:
B_Values C_Values D_Values Q_Values Timestamp
0 1 NaN NaN NaN 2020-11-08 11:15:00
1 2 NaN NaN NaN 2020-11-10 15:34:00
0 NaN 8 NaN NaN 2020-11-11 12:13:00
1 NaN 4 NaN NaN 2020-11-15 02:47:00
0 NaN NaN 3 NaN 2020-1-13 14:47:00
1 NaN NaN 5 NaN 2020-11-9 5:34:00
0 NaN NaN NaN 7 2020-7-18 01:04:00
1 NaN NaN NaN 6 2020-04-10 16:34:00
Or, if order of the rows doesn't matter:
m = df.columns.str.endswith('Timestamp')
s = df.columns.to_series().mask(m)
(df.set_axis(pd.MultiIndex.from_arrays(
        [s.fillna('Timestamp'), s.bfill()]), axis=1)
   .stack()
)
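With either approach, you will likely want to parse the combined Timestamp column as datetimes and sort by it afterwards; a small follow-up sketch, assuming out from the first approach:

out['Timestamp'] = pd.to_datetime(out['Timestamp'])
out = out.sort_values('Timestamp').reset_index(drop=True)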
Let's say I have the following model:
class DateRange(models.Model):
    start = models.DateTimeField()
    end = models.DateTimeField()
Is there a way to get all pairs of DateRange instances with any overlap in their start to end range? For example, if I had:
id | start | end
-----+---------------------+---------------------
1 | 2020-01-02 12:00:00 | 2020-01-02 16:00:00 # overlap with 2, 3 and 5
2 | 2020-01-02 13:00:00 | 2020-01-02 14:00:00 # overlap with 1 and 3
3 | 2020-01-02 13:30:00 | 2020-01-02 17:00:00 # overlap with 1 and 2
4 | 2020-01-02 10:00:00 | 2020-01-02 10:30:00 # no overlap
5 | 2020-01-02 12:00:00 | 2020-01-02 12:30:00 # overlap with 1
I'd want:
id_1 | id_2
------+-----
1 | 2
1 | 3
1 | 5
2 | 3
Any thoughts on the best way to do this? The order of id_1 and id_2 doesn't matter, but I do need the pairs to be distinct (e.g. id_1=1, id_2=2 is the same as id_1=2, id_2=1 and should not be repeated).
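One straightforward sketch with the Django ORM (not the only way, and it issues one query per row, so it won't scale to huge tables): for each range, filter for other ranges whose interval overlaps, and use id__gt so each unordered pair appears only once. The overlap test here, start__lt and end__gt, assumes ranges that merely touch at an endpoint do not count as overlapping:

pairs = []
for a in DateRange.objects.all():
    overlapping = DateRange.objects.filter(
        id__gt=a.id,       # id_1 < id_2, so each pair is reported once
        start__lt=a.end,   # the other range starts before this one ends
        end__gt=a.start,   # and ends after this one starts
    )
    pairs.extend((a.id, b.id) for b in overlapping)

On the sample data this yields (1, 2), (1, 3), (1, 5) and (2, 3). A single self-join in SQL would avoid the per-row queries if performance matters.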
I have a pandas dataframe with three levels of row indexing. The last level is a datetime index. There are NaN values, and I am trying to fill them with the average of each row at the datetime level. How can I go about doing this?
data_df
Level 0 | Level 1 | Level 2 |
A 123 2019-01-28 17:00:00 | 3 | 1 | nan
2019-01-28 18:00:00 | 2 | nan | 1
2019-01-28 19:00:00 | nan | nan | 5
234 2019-01-28 05:00:00 | 1 | 1 | 3
2019-01-28 06:00:00 | nan | nan | nan
Some rows may be all NaN values; in this case I want to fill the row with 0's. Some rows may have all values filled in, so imputing with the average isn't needed.
I want the following result:
Level 0 | Level 1 | Level 2 |
A 123 2019-01-28 17:00:00 | 3 | 1 | 2
2019-01-28 18:00:00 | 2 | 1.5 | 1
2019-01-28 19:00:00 | 5 | 5 | 5
234 2019-01-28 05:00:00 | 1 | 1 | 3
2019-01-28 06:00:00 | 0 | 0 | 0
Use DataFrame.mask to fill each NaN with its row mean, then convert the remaining all-NaN rows with DataFrame.fillna:
df = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)
print (df)
a b c
Level 0 Level 1 Level 2
A 123 2019-01-28 17:00:00 3.0 1.0 2.0
2019-01-28 18:00:00 2.0 1.5 1.0
2019-01-28 19:00:00 5.0 5.0 5.0
234 2019-01-28 05:00:00 1.0 1.0 3.0
2019-01-28 06:00:00 0.0 0.0 0.0
Another solution is to use DataFrame.fillna for the replacement, but because df.fillna(df.mean(axis=1), axis=1) is not implemented, a double transpose is necessary:
df = df.T.fillna(df.mean(axis=1)).fillna(0).T
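A minimal reproduction of the example (the value column names a, b, c are assumptions, matching the printed output above):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([
    ('A', 123, '2019-01-28 17:00:00'),
    ('A', 123, '2019-01-28 18:00:00'),
    ('A', 123, '2019-01-28 19:00:00'),
    ('A', 234, '2019-01-28 05:00:00'),
    ('A', 234, '2019-01-28 06:00:00'),
], names=['Level 0', 'Level 1', 'Level 2'])
df = pd.DataFrame({
    'a': [3, 2, np.nan, 1, np.nan],
    'b': [1, np.nan, np.nan, 1, np.nan],
    'c': [np.nan, 1, 5, 3, np.nan],
}, index=idx)

# fill NaNs with the row mean, then the all-NaN row with 0
df = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)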
I have a pandas dataframe, shown below, with a Month-Year column. I need to get a continuous dataframe, with a count of 0 for any month that has no rows. The expected output is shown below.
Input dataframe
Month | Count
--------------
Jan-15 | 10
Feb-15 | 100
Mar-15 | 20
Jul-15 | 10
Sep-15 | 11
Oct-15 | 1
Dec-15 | 15
Expected Output
Month | Count
--------------
Jan-15 | 10
Feb-15 | 100
Mar-15 | 20
Apr-15 | 0
May-15 | 0
Jun-15 | 0
Jul-15 | 10
Aug-15 | 0
Sep-15 | 11
Oct-15 | 1
Nov-15 | 0
Dec-15 | 15
You can set the Month column as the index. The input looks like it came from Excel; if so, the months will already have been parsed as dates (e.g. Jan-15 as 01.01.2015), so you can resample it as follows:
df.set_index('Month').resample('MS').asfreq().fillna(0)
Out:
Count
Month
2015-01-01 10.0
2015-02-01 100.0
2015-03-01 20.0
2015-04-01 0.0
2015-05-01 0.0
2015-06-01 0.0
2015-07-01 10.0
2015-08-01 0.0
2015-09-01 11.0
2015-10-01 1.0
2015-11-01 0.0
2015-12-01 15.0
If the month column is not recognized as date, you need to convert it first:
df['Month'] = pd.to_datetime(df['Month'], format='%b-%y')
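Putting both steps together (the astype and the conversion back to Mon-YY labels are optional extras, not part of the original answer):

df['Month'] = pd.to_datetime(df['Month'], format='%b-%y')
out = df.set_index('Month').resample('MS').asfreq().fillna(0)
out['Count'] = out['Count'].astype(int)  # optional: restore integer counts
out.index = out.index.strftime('%b-%y')  # optional: back to Jan-15, Feb-15, ...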