Let's say I have the following model:
from django.db import models

class DateRange(models.Model):
    start = models.DateTimeField()
    end = models.DateTimeField()
Is there a way to get all pairs of DateRange instances with any overlap in their start to end range? For example, if I had:
id | start | end
-----+---------------------+---------------------
1 | 2020-01-02 12:00:00 | 2020-01-02 16:00:00 # overlap with 2, 3 and 5
2 | 2020-01-02 13:00:00 | 2020-01-02 14:00:00 # overlap with 1 and 3
3 | 2020-01-02 13:30:00 | 2020-01-02 17:00:00 # overlap with 1 and 2
4 | 2020-01-02 10:00:00 | 2020-01-02 10:30:00 # no overlap
5 | 2020-01-02 12:00:00 | 2020-01-02 12:30:00 # overlap with 1
I'd want:
id_1 | id_2
------+-----
1 | 2
1 | 3
1 | 5
2 | 3
Any thoughts on the best way to do this? The order of id_1 and id_2 doesn't matter, but I do need the pairs to be distinct (e.g., id_1=1, id_2=2 is the same as id_1=2, id_2=1 and should not be repeated).
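One simple sketch, assuming overlap means strict inequality on both ends (touching endpoints don't count), is to compare every pair in Python with itertools.combinations; for large tables you would want to push this into the database with a self-join instead:

from itertools import combinations

# O(n^2) sketch in application code, not a database-side solution.
ranges = DateRange.objects.all()
pairs = [
    (a.id, b.id)
    for a, b in combinations(ranges, 2)     # each unordered pair exactly once
    if a.start < b.end and b.start < a.end  # the intervals overlap
]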
I have been fiddling about with pandas.DataFrame.rolling for some time now and I haven't been able to achieve the result that I am looking for, so before I write a custom windowing function I figured I would ask if I'm missing something.
I have postgresql data with a composite index of (time, node) that has been read into a pandas.DataFrame, where time is a certain hour on a certain date. I need to create windows that contain all entries within the last two calendar dates (or any arbitrary number of days), for example, beginning at 2022-12-26 00:00:00 and ending on 2022-12-27 23:00:00, and then perform operations on that window to return a new, resultant DataFrame. The window should then move forward an entire calendar date, which is where I am failing.
| time | node | value |
| --------------------- | ----- | ------ |
| 2022-12-26 00:00:00 | 123 | low |
| 2022-12-26 01:00:00 | 123 | med |
| 2022-12-26 02:00:00 | 123 | low |
| 2022-12-26 03:00:00 | 123 | high |
| ... | ... | ... |
| 2022-12-26 00:00:00 | 999 | low |
| 2022-12-26 01:00:00 | 999 | low |
| 2022-12-26 02:00:00 | 999 | low |
| 2022-12-26 03:00:00 | 999 | med |
| ... | ... | ... |
| 2022-12-27 00:00:00 | 123 | low |
| 2022-12-27 01:00:00 | 123 | med |
| 2022-12-27 02:00:00 | 123 | low |
| 2022-12-27 03:00:00 | 123 | high |
When I use something akin to df.rolling(window=pd.Timedelta('2days')), the windows move forward hour by hour, as opposed to beginning on the next calendar date.
I've played around with using min_periods, but it doesn't work with my data, nor would it be acceptable in the long run, because the number of expected observations per window is not fixed. The step parameter also appears to be unusable here, since it only supports integer windows and I am using an offset.
Is the behaviour I am looking for doable with pandas.DataFrame.rolling or must I look elsewhere/write my own windowing function?
Any guidance would be appreciated. Thanks!
So from what I understand, you want to create windows of length ndays, with each subsequent window starting on the next calendar day.
Given a dataframe spanning five calendar days at an hourly frequency:
import pandas as pd
import numpy as np

periods = 23 * 5
df = pd.DataFrame(
    {'value': list(range(periods))},
    index=pd.date_range('2022-12-16', periods=periods, freq='H')
)
d = np.random.choice(
    pd.date_range('2022-12-16', periods=periods, freq='H'),
    int(periods * 0.25)
)
df = df.drop(index=d)
df.head(5)
>>> value
2022-12-16 00:00:00 0
2022-12-16 01:00:00 1
2022-12-16 02:00:00 2
2022-12-16 04:00:00 4
2022-12-16 05:00:00 5
I randomly dropped some indices to simulate missing data.
We can use df.resample (docs) to group the data by days (regardless of missing data):
days = df.resample('1d')
print(days.get_group('2022-12-16'))
>>> value
2022-12-16 00:00:00 0
2022-12-16 01:00:00 1
2022-12-16 02:00:00 2
2022-12-16 04:00:00 4
2022-12-16 05:00:00 5
2022-12-16 06:00:00 6
2022-12-16 07:00:00 7
2022-12-16 08:00:00 8
2022-12-16 09:00:00 9
2022-12-16 11:00:00 11
2022-12-16 12:00:00 12
2022-12-16 13:00:00 13
2022-12-16 14:00:00 14
2022-12-16 15:00:00 15
2022-12-16 17:00:00 17
2022-12-16 18:00:00 18
2022-12-16 19:00:00 19
2022-12-16 21:00:00 21
2022-12-16 22:00:00 22
2022-12-16 23:00:00 23
Now, we only need to iterate over the days in a "sliding" manner. The package more-itertools has some helpful functions, such as windowed, and we can easily control the size of the window (here with ndays):
from more_itertools import windowed

ndays = 2
# iterating the resampler yields (day, group) pairs, so keep w[1]
windows = [
    pd.concat([w[1] for w in window])
    for window in windowed(days, ndays)
]
Printing the first and last index of each window returns:
for window in windows:
    print(window.iloc[[0, -1]])
>>> value
2022-12-16 00:00:00 0
2022-12-17 23:00:00 47
value
2022-12-17 00:00:00 24
2022-12-18 23:00:00 71
value
2022-12-18 00:00:00 48
2022-12-19 23:00:00 95
value
2022-12-19 01:00:00 73
2022-12-20 18:00:00 114
Furthermore, you can set step in windowed to control the step size between windows.
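For example, a non-overlapping ("tumbling") variant might look like this sketch, reusing days and ndays from above:

# step=ndays advances a full window at a time; the last window may be
# padded with None (the default fillvalue), so filter those out.
tumbling = [
    pd.concat([w[1] for w in window if w is not None])
    for window in windowed(days, ndays, step=ndays)
]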
Suppose that I have a dataframe which can be created using the code below:
import pandas as pd

df = pd.DataFrame(data={'date': ['2021-01-01', '2021-01-02', '2021-01-05',
                                 '2021-01-02', '2021-01-03', '2021-01-05'],
                        'product': ['A', 'A', 'A', 'B', 'B', 'B'],
                        'price': [10, 20, 30, 40, 50, 60]})
df['date'] = pd.to_datetime(df['date'])
I want to create a dataframe, let's say main_df, which will contain all dates between df.date.min() and df.date.max() for each product; on days where the value is NaN I want to ffill, and bfill for any remaining gaps. The resulting dataframe would be as below:
+------------+---------+-------+
| date | product | price |
+------------+---------+-------+
| 2021-01-01 | A | 10 |
| 2021-01-02 | A | 20 |
| 2021-01-03 | A | 20 |
| 2021-01-04 | A | 20 |
| 2021-01-05 | A | 30 |
| 2021-01-01 | B | 40 |
| 2021-01-02 | B | 40 |
| 2021-01-03 | B | 50 |
| 2021-01-04 | B | 50 |
| 2021-01-05 | B | 60 |
+------------+---------+-------+
First
Make a pivot table, upsample with asfreq, and fill the nulls:
df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
output:
product A B
date
2021-01-01 10.0 40.0
2021-01-02 20.0 40.0
2021-01-03 20.0 50.0
2021-01-04 20.0 50.0
2021-01-05 30.0 60.0
Second
Stack the result back into long format (full code included):
(df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
   .stack().reset_index().rename(columns={0: 'price'})
   .sort_values('product').reset_index(drop=True))
output:
date product price
0 2021-01-01 A 10.0
1 2021-01-02 A 20.0
2 2021-01-03 A 20.0
3 2021-01-04 A 20.0
4 2021-01-05 A 30.0
5 2021-01-01 B 40.0
6 2021-01-02 B 40.0
7 2021-01-03 B 50.0
8 2021-01-04 B 50.0
9 2021-01-05 B 60.0
Using resample
df = pd.DataFrame(data={'date': ['2021-01-01', '2021-01-02', '2021-01-05',
                                 '2021-01-02', '2021-01-03', '2021-01-06'],
                        'product': ['A', 'A', 'A', 'B', 'B', 'B'],
                        'price': [10, 20, 30, 40, 50, 60]})
df['date'] = pd.to_datetime(df['date'])
df
# Out:
# date product price
# 0 2021-01-01 A 10
# 1 2021-01-02 A 20
# 2 2021-01-05 A 30
# 3 2021-01-02 B 40
# 4 2021-01-03 B 50
# 5 2021-01-06 B 60
df.set_index("date").groupby("product")["price"].resample("d").ffill().reset_index()
# Out:
# product date price
# 0 A 2021-01-01 10
# 1 A 2021-01-02 20
# 2 A 2021-01-03 20
# 3 A 2021-01-04 20
# 4 A 2021-01-05 30
# 5 B 2021-01-02 40
# 6 B 2021-01-03 50
# 7 B 2021-01-04 50
# 8 B 2021-01-05 50
# 9 B 2021-01-06 60
To see which rows were filled by ffill, aggregate with mean instead, which leaves the empty slots as NaN:
df.set_index("date").groupby("product")["price"].resample("d").mean()
# Out:
# product date
# A 2021-01-01 10.0
# 2021-01-02 20.0
# 2021-01-03 NaN
# 2021-01-04 NaN
# 2021-01-05 30.0
# B 2021-01-02 40.0
# 2021-01-03 50.0
# 2021-01-04 NaN
# 2021-01-05 NaN
# 2021-01-06 60.0
# Name: price, dtype: float64
Note that by grouping by product before resampling and filling the empty slots, you can have different ranges (from min to max) for each product (I modified the data to showcase this).
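If you instead want every product to cover the same global range, one possible variation (a sketch, not part of the answer above) is to reindex each group against a single shared date range:

# one shared calendar for all products, from the global min to the global max
full_range = pd.date_range(df['date'].min(), df['date'].max(),
                           freq='D', name='date')
out = (df.set_index('date')
         .groupby('product')['price']
         .apply(lambda s: s.reindex(full_range).ffill().bfill())
         .reset_index())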
I have a pandas dataframe where there are three levels of row indexing, the last level being a datetime index. There are NaN values, and I am trying to fill them with the average of each row at the datetime level. How can I go about doing this?
data_df
Level 0 Level 1 Level 2
A       123     2019-01-28 17:00:00     3     1   NaN
                2019-01-28 18:00:00     2   NaN     1
                2019-01-28 19:00:00   NaN   NaN     5
        234     2019-01-28 05:00:00     1     1     3
                2019-01-28 06:00:00   NaN   NaN   NaN
Some rows may be all NaN values; in this case I want to fill the row with 0's. Some rows may have all values filled in, so imputing with the average isn't needed.
I want the following result:
Level 0 Level 1 Level 2
A       123     2019-01-28 17:00:00     3     1     2
                2019-01-28 18:00:00     2   1.5     1
                2019-01-28 19:00:00     5     5     5
        234     2019-01-28 05:00:00     1     1     3
                2019-01-28 06:00:00     0     0     0
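For reference, a minimal construction of this frame (the value column names a, b and c are an assumption, borrowed from the output of the answer below):

import pandas as pd
import numpy as np

idx = pd.MultiIndex.from_tuples(
    [('A', 123, '2019-01-28 17:00:00'),
     ('A', 123, '2019-01-28 18:00:00'),
     ('A', 123, '2019-01-28 19:00:00'),
     ('A', 234, '2019-01-28 05:00:00'),
     ('A', 234, '2019-01-28 06:00:00')],
    names=['Level 0', 'Level 1', 'Level 2'])
df = pd.DataFrame({'a': [3, 2, np.nan, 1, np.nan],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [np.nan, 1, 5, 3, np.nan]},
                  index=idx)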
Use DataFrame.mask with the row-wise mean, and finally convert the all-NaN rows with DataFrame.fillna:
df = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)
print (df)
a b c
Level 0 Level 1 Level 2
A 123 2019-01-28 17:00:00 3.0 1.0 2.0
2019-01-28 18:00:00 2.0 1.5 1.0
2019-01-28 19:00:00 5.0 5.0 5.0
234 2019-01-28 05:00:00 1.0 1.0 3.0
2019-01-28 06:00:00 0.0 0.0 0.0
Another solution is to use DataFrame.fillna, but because df.fillna(df.mean(axis=1), axis=1) is not implemented, a double transpose is necessary:
df = df.T.fillna(df.mean(axis=1)).fillna(0).T
Let's say I have purchase records with two fields Buy and Time.
What I want to get is a third column with the time elapsed since the first not-buy, so it looks like:
buy | time | time difference
1 | 8:00 | NULL
0 | 9:01 | NULL
0 | 9:10 | NULL
0 | 9:21 | NULL
1 | 9:31 | 0:30
0 | 9:41 | NULL
0 | 9:42 | NULL
1 | 9:53 | 0:12
How can I achieve this? It seems to me that it's a mix of pd.groupby() and pd.shift(), but I can't seem to work that out in my head.
IIUC
df.time = pd.to_datetime(df.time)
df.loc[df.buy == 1, 'DIFF'] = (
    df.groupby(df.buy.cumsum().shift().fillna(0))
      .time.transform(lambda x: x.iloc[-1] - x.iloc[0])
)
df
Out[19]:
buy time timedifference DIFF
0 1 2018-02-26 08:00:00 NaN 00:00:00
1 0 2018-02-26 09:01:00 NaN NaT
2 0 2018-02-26 09:10:00 NaN NaT
3 0 2018-02-26 09:21:00 NaN NaT
4 1 2018-02-26 09:31:00 0:30 00:30:00
5 0 2018-02-26 09:41:00 NaN NaT
6 0 2018-02-26 09:42:00 NaN NaT
7 1 2018-02-26 09:53:00 0:12 00:12:00
# df.buy.cumsum().shift().fillna(0) creates the grouping key
# .time.transform(lambda x: x.iloc[-1] - x.iloc[0]) computes the difference within each group
# df.loc[df.buy == 1, 'DIFF'] assigns the values only at the positions where buy equals 1
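For completeness, a minimal frame reproducing the question's data (the 2018-02-26 date matches the output above; timedifference is the desired column from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'buy': [1, 0, 0, 0, 1, 0, 0, 1],
    'time': ['2018-02-26 ' + t for t in
             ['8:00', '9:01', '9:10', '9:21', '9:31', '9:41', '9:42', '9:53']],
    'timedifference': [np.nan] * 4 + ['0:30', np.nan, np.nan, '0:12'],
})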
I have a pandas dataframe, shown below, with a Month-Year column. I need to make it continuous, with a count of 0 for any month that has no rows. The expected output is shown below.
Input dataframe
Month | Count
--------------
Jan-15 | 10
Feb-15 | 100
Mar-15 | 20
Jul-15 | 10
Sep-15 | 11
Oct-15 | 1
Dec-15 | 15
Expected Output
Month | Count
--------------
Jan-15 | 10
Feb-15 | 100
Mar-15 | 20
Apr-15 | 0
May-15 | 0
Jun-15 | 0
Jul-15 | 10
Aug-15 | 0
Sep-15 | 11
Oct-15 | 1
Nov-15 | 0
Dec-15 | 15
You can set the Month column as the index. It looks like Excel input; if so, Jan-15 will be parsed as 2015-01-01, so you can resample it as follows:
df.set_index('Month').resample('MS').asfreq().fillna(0)
Out:
Count
Month
2015-01-01 10.0
2015-02-01 100.0
2015-03-01 20.0
2015-04-01 0.0
2015-05-01 0.0
2015-06-01 0.0
2015-07-01 10.0
2015-08-01 0.0
2015-09-01 11.0
2015-10-01 1.0
2015-11-01 0.0
2015-12-01 15.0
If the Month column is not recognized as a date, you need to convert it first:
df['Month'] = pd.to_datetime(df['Month'], format='%b-%y')
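Putting both steps together, a minimal end-to-end sketch (data reconstructed from the question's table):

import pandas as pd

df = pd.DataFrame({'Month': ['Jan-15', 'Feb-15', 'Mar-15', 'Jul-15',
                             'Sep-15', 'Oct-15', 'Dec-15'],
                   'Count': [10, 100, 20, 10, 11, 1, 15]})
df['Month'] = pd.to_datetime(df['Month'], format='%b-%y')
# 'MS' resamples to month-start frequency; asfreq leaves missing months
# as NaN, which fillna(0) then turns into the desired zero counts.
out = df.set_index('Month').resample('MS').asfreq().fillna(0)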