I have a pandas DataFrame with dates. Skipping the first date, the remaining dates form pairs, and I need to know whether each pair is consecutive.
2 1988-01-01
3 2015-01-31
4 2015-02-01
5 2015-05-31
6 2015-06-01
7 2021-11-16
11 2021-11-17
12 2022-10-05
8 2022-10-06
9 2022-10-12
10 2022-10-13
# How to build this example dataframe
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime([
    '1988-01-01', '2015-01-31', '2015-02-01', '2015-05-31', '2015-06-01',
    '2021-11-16', '2021-11-17', '2022-10-05', '2022-10-06', '2022-10-12',
    '2022-10-13'])})
Each pair should be consecutive. I have tried different sorting approaches, but everything I find relates to the entire series being consecutive; I need to compare each pair of dates after the first date.
cb_gap = cb_sorted.sort_values('dates').groupby('dates').diff() > pd.to_timedelta('1 day')
What I need to see is this...
2 1988-01-01 <- Ignore the start date
3 2015-01-31 <- these dates have no gap
4 2015-02-01
5 2015-05-31 <- these dates have no gap
6 2015-06-01
7 2021-11-16 <- these have a gap!!!!
11 2021-11-18
12 2022-10-05 <- these have no gap
8 2022-10-06
9 2022-10-12
One way is to use shift and compute differences.
# pair each date with the next one, take the difference, and keep every second row
pd.DataFrame({'date': df.date, 'diff': df.date.shift(-1) - df.date})[1::2]
returns
date diff
1 2015-01-31 1 days
3 2015-05-31 1 days
5 2021-11-16 1 days
7 2022-10-05 1 days
9 2022-10-12 1 days
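If you want an explicit True/False flag per pair instead of eyeballing the timedeltas, a small follow-up sketch (the names pairs and consecutive are my own):
pairs = pd.DataFrame({'date': df.date, 'diff': df.date.shift(-1) - df.date})[1::2]
pairs['consecutive'] = pairs['diff'].eq(pd.Timedelta('1 day'))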
It is also faster:

Method      timeit
Naveed's    4.23 ms
This one    0.93 ms
Here is one way to do it. By the way, what is your expected output? This answer gets you the difference between the consecutive dates, skipping the first row, and populates a diff column.
# make date into datetime
df['date'] = pd.to_datetime(df['date'])

# create two intermediate DataFrames by skipping the first row and taking
# alternate rows, then concatenate them along the column axis
# (assumes the frame has an 'id' column, as in the question's listing)
df2 = pd.concat([df.iloc[1:].iloc[::2].reset_index()[['id', 'date']],
                 df.iloc[2:].iloc[::2].reset_index()[['id', 'date']]
                 ], axis=1)
# take the difference of the second date from the first
df2['diff'] = df2.iloc[:, 3] - df2.iloc[:, 1]
df2
id date id date diff
0 3 2015-01-31 4 2015-02-01 1 days
1 5 2015-05-31 6 2015-06-01 1 days
2 7 2021-11-16 11 2021-11-17 1 days
3 12 2022-10-05 8 2022-10-06 1 days
4 9 2022-10-12 10 2022-10-13 1 days
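If you then want a single yes/no check that every pair is consecutive, one possible follow-up on top of df2 (the variable name is my own):
all_consecutive = df2['diff'].eq(pd.Timedelta('1 day')).all()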
Related
Currently trying to figure this out with code like the following, but not quite able to get it yet:
df[(df.index.day_of_week==0) & (df.index.day<15) & (df.shift(-4).index.day_of_week==4)]
This is what the data looks like (I've added the day_of_week column for convenience). Basically, I am trying to find the first day_of_week == 0 (Monday) in each month, then filter for the first day_of_week == 4 (Friday) after that.
close day_of_week
date
2022-07-01 3825.330078 4
2022-07-05 3831.389893 1
2022-07-06 3845.080078 2
2022-07-07 3902.620117 3
2022-07-08 3899.379883 4
2022-07-11 3854.429932 0
2022-07-12 3818.800049 1
2022-07-13 3801.780029 2
2022-07-14 3790.379883 3
2022-07-15 3863.159912 4
...
2022-08-01 4118.629883 0
2022-08-02 4091.189941 1
2022-08-03 4155.169922 2
2022-08-04 4151.939941 3
2022-08-05 4145.189941 4
2022-08-08 4140.060059 0
2022-08-09 4122.470215 1
2022-08-10 4210.240234 2
2022-08-11 4207.270020 3
2022-08-12 4280.149902 4
...
2022-09-01 3966.850098 3
2022-09-02 3924.260010 4
2022-09-06 3908.189941 1
2022-09-07 3979.870117 2
2022-09-08 4006.179932 3
2022-09-09 4067.360107 4
2022-09-12 4110.410156 0
2022-09-13 3932.689941 1
2022-09-14 3946.010010 2
2022-09-15 3901.350098 3
2022-09-16 3873.330078 4
...
2022-10-03 3678.429932 0
2022-10-04 3790.929932 1
2022-10-05 3783.280029 2
2022-10-06 3744.520020 3
2022-10-07 3639.659912 4
2022-10-10 3612.389893 0
2022-10-11 3588.840088 1
2022-10-12 3577.030029 2
...
2022-11-01 3856.100098 1
2022-11-02 3759.689941 2
2022-11-03 3719.889893 3
2022-11-04 3770.550049 4
2022-11-07 3806.800049 0
2022-11-08 3828.110107 1
This should return:
2022-07-15 3863.159912 4
2022-08-05 4145.189941 4
2022-09-16 3873.330078 4
2022-10-07 3639.659912 4
EDIT:
While I don't expect this to work, I'm curious why it returns no results. Is shifting not supported when filtering in this manner?
df[(df.index.day_of_week==0) & (df.index.day<15) & (df.shift(-4).index.day_of_week==4)]
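As an aside, the reason the filter above matches nothing: DataFrame.shift moves the values, not the index, so df.shift(-4).index is identical to df.index, and the condition ends up requiring the same date to be both a Monday and a Friday. A quick check:
# shift moves the data, not the index, so the two indexes are equal
df.shift(-4).index.equals(df.index)  # True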
You can group by [year, month, day_of_week] and use cumcount to assign each row the number of times its day_of_week has appeared so far that month.
Then grab the rows corresponding to the first Monday of each month using the filter day_of_week == 0 & cumcount == 0, and shift their index by 4 days to get the following Fridays. Finally, intersect with the original index to filter out shifted dates that do not exist in the frame (here the day_of_week column is named dow).
mask = df.groupby([df.index.year, df.index.month, df['dow']]).cumcount().eq(0) & df['dow'].eq(0)
wanted_indices = df.index[mask].shift(4, 'D')
df.loc[wanted_indices.intersection(df.index)]
Result on your example dataframe
close dow
date
2022-07-15 3863.159912 4
2022-08-05 4145.189941 4
2022-09-16 3873.330078 4
2022-10-07 3639.659912 4
I feel like the simplest way is to recognize that the Friday after the first Monday of a month can be no earlier than day 5 (the Monday is at least day 1, so that Friday is at least day 1 + 4 = 5). So first filter on that requirement, then keep the first Friday (day_of_week == 4 in the question's numbering) per month:
df_filtered = df[df.index.day >= 5]
fridays = df_filtered[df_filtered['day_of_week'] == 4]
first_fridays = fridays.groupby([fridays.index.year, fridays.index.month]).head(1)
I have a dataframe that has a datetime index and a time series of integer values 1 per day. From this, I want to identify occurrences where the timeseries is above a threshold for at least 2 consecutive days. For these events, I want to count how many of these occur over the entire span and the start date of each one.
One of the issues is making sure I don't over-count when an event lasts more than 2 days: as long as the values stay over the threshold, it should count as a single event, whether it lasts 2 days or 10 days.
I can do this using a function with lots of if statements but it's very kludgy. I want to learn a more pandas/pythonic way of doing it.
I started by looking at a masked version of the data of only the values that were above the threshold (arbitrary here) and using diff() which seemed promising but I'm still stuck. Any help is appreciated.
import numpy as np
import pandas as pd

arb_thr = 20  # arbitrary threshold

dates = pd.date_range('2012-01-01', periods=100, freq='D')
values = np.random.randint(100, size=len(dates))
df = pd.DataFrame({'timeseries': values}, index=dates)

df.loc[df['timeseries'] > arb_thr].index.to_series().diff().head(20)
You can use booleans to flag the rows that are below the threshold, then cumsum them to create artificial group ids (each run of above-threshold rows shares one id), and finally groupby on them:
arb_thr = 20
df = df.reset_index()
grps = df["timeseries"].lt(arb_thr).cumsum()
result = df.groupby(grps).agg(
    min_date=("index", "min"),
    max_date=("index", "max"),
    count=("timeseries", "count"),
).rename_axis(None, axis=0)
min_date max_date count
0 2012-01-01 2012-01-09 9
1 2012-01-10 2012-01-11 2
2 2012-01-12 2012-01-12 1
3 2012-01-13 2012-01-22 10
4 2012-01-23 2012-01-24 2
5 2012-01-25 2012-02-04 11
6 2012-02-05 2012-02-07 3
7 2012-02-08 2012-02-08 1
8 2012-02-09 2012-02-10 2
9 2012-02-11 2012-02-12 2
10 2012-02-13 2012-02-15 3
11 2012-02-16 2012-02-20 5
12 2012-02-21 2012-02-21 1
13 2012-02-22 2012-02-23 2
14 2012-02-24 2012-02-25 2
15 2012-02-26 2012-03-04 8
16 2012-03-05 2012-03-07 3
17 2012-03-08 2012-03-20 13
18 2012-03-21 2012-03-22 2
19 2012-03-23 2012-03-23 1
20 2012-03-24 2012-03-24 1
21 2012-03-25 2012-03-28 4
22 2012-03-29 2012-03-29 1
23 2012-03-30 2012-04-01 3
24 2012-04-02 2012-04-08 7
25 2012-04-09 2012-04-09 1
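Building on the same trick, if you only want the events that stay above the threshold for at least 2 consecutive days, plus their start dates and how many there are, here is a sketch using the reset_index frame from above (the names above, events, and n_events are my own):
above = df["timeseries"].gt(arb_thr)       # True where the value is over the threshold
grps = (~above).cumsum()                   # each run of True rows shares one group id
events = (df[above]
          .groupby(grps[above])
          .agg(start_date=("index", "min"), duration=("timeseries", "count")))
events = events[events["duration"] >= 2]   # keep only runs of at least 2 days
n_events = len(events)                     # number of qualifying events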
I have two different columns in my dataset,
start end
0 2015-01-01 2017-01-01
1 2015-01-02 2015-06-02
2 2015-01-03 2015-12-03
3 2015-01-04 2020-11-25
4 2015-01-05 2025-07-27
I want the difference between start and end in a specific way; here's my desired output.
year_diff month_diff
2 1
0 6
0 12
5 11
10 7
Here the day is not important to me, only the month and year. I've tried converting to monthly periods to get the diff, but it returns the difference in months only. How can I achieve my desired output?
df['end'].dt.to_period('M') - df['start'].dt.to_period('M')
Try:
df["year_diff"]=df["end"].dt.year.sub(df["start"].df.year)
df["month_diff"]=df["end"].dt.month.sub(df["start"].df.month)
This solution assumes that a year is a constant 365 days and a month a constant 30 days. If the datetimes are strings, convert them into datetime objects first; in a pandas DataFrame this can be done like so:
def to_datetime(dataframe):
    new_dataframe = pd.DataFrame()
    new_dataframe[0] = pd.to_datetime(dataframe[0], format="%Y-%m-%d")
    new_dataframe[1] = pd.to_datetime(dataframe[1], format="%Y-%m-%d")
    return new_dataframe
Next, column 1 can be subtracted from column 0 to give the difference in days. We can floor-divide this number by 365 with the // operator to get the number of whole years, take the remaining days with the % operator, and floor-divide those by 30 to get the number of whole months.
def get_time_diff(dataframe):
    dataframe[2] = dataframe[1] - dataframe[0]
    diff_dataframe = pd.DataFrame(columns=["year_diff", "month_diff"])
    for i in range(0, dataframe.index.stop):
        year_diff = dataframe[2][i].days // 365
        month_diff = (dataframe[2][i].days % 365) // 30
        diff_dataframe.loc[i] = [year_diff, month_diff]
    return diff_dataframe
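For instance, the two helpers can be chained like this (the input frame here is hypothetical):
raw = pd.DataFrame({0: ["2019-10-15", "2020-02-11"],
                    1: ["2021-08-11", "2022-10-13"]})
print(get_time_diff(to_datetime(raw)))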
An example output from using these functions would be
start end days_diff year_diff month_diff
0 2019-10-15 2021-08-11 666 days 1 10
1 2020-02-11 2022-10-13 975 days 2 8
2 2018-12-17 2020-09-16 639 days 1 9
3 2017-01-03 2017-01-28 25 days 0 0
4 2019-12-21 2022-03-10 810 days 2 2
5 2018-08-08 2019-05-07 272 days 0 9
6 2017-06-18 2020-08-01 1140 days 3 1
7 2017-11-14 2020-04-17 885 days 2 5
8 2019-08-19 2020-05-10 265 days 0 8
9 2018-05-05 2020-09-08 857 days 2 4
Note: this gives the number of whole years and months. Hence, if there is a remainder of 29 days, one day short of a month, it will not be counted.
I have a dataframe that looks like this
ID | START | END
1 |2016-12-31|2017-02-30
2 |2017-01-30|2017-10-30
3 |2016-12-21|2018-12-30
I want to know the number of active IDs on each possible day, i.e. count the number of overlapping time periods.
What I did was create a new data frame c_df with columns date and count. The date column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then, for every row in my original data frame, I calculated the date range between its start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment the corresponding count cell in c_df by one.
All these loops are neither efficient for big data sets nor pretty. Is there a more efficient way of doing this?
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
  .explode() \
  .value_counts() \
  .sort_index()
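If you prefer the result as a two-column frame rather than a Series, one possible wrap-up (the column names are my own choice):
counts = (df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1)
            .explode()
            .value_counts()
            .sort_index())
result = counts.rename_axis('Date').reset_index(name='Count')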
If your dataframe is large, take advantage of numpy broadcasting to improve performance. This works with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]  # shape (n_rows, 1) so it broadcasts against dates
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)  # (n_rows, n_dates) membership matrix
result = pd.DataFrame({
    'Date': dates,
    'Count': mask.sum(axis=0)  # number of intervals covering each date
})
Create an IntervalIndex and use a generator expression or list comprehension with contains to check each date against each interval (note: I made a smaller sample to test this solution on).
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
                         'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1
I want to apply an operation to the following data frame:
index date username count
0 2015-11-01 1 16
1 2015-11-01 2 1
2 2015-11-01 3 1
3 2015-10-01 1 2
4 2015-10-01 4 29
5 2015-10-01 5 1
6 2014-09-01 1 3
7 2014-09-01 3 1
8 2014-09-01 4 1
And apply an operation that will get it to this:
index date mean
0 2015-11-01 6
1 2015-10-01 10.7
2 2014-09-01 1.7
The calculation takes the sum of all counts for a given date (e.g. for 2015-11-01 it is 16+1+1 = 18), then divides by the number of unique usernames for that date (e.g. for 2015-10-01 there are 3, so its mean is 32/3 ≈ 10.7). The result is recorded in a new column, here called mean.
I have been trying to use the DataFrame 'apply' method, but without success so far. Help would be much appreciated. Thanks.
You can use GroupBy + sum divided by GroupBy + nunique:
g = df.groupby('date')
res = g['count'].sum().div(g['username'].nunique()) \
       .rename('mean').reset_index()
print(res)
date mean
0 2014-09-01 1.666667
1 2015-10-01 10.666667
2 2015-11-01 6.000000
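Since the question mentions trying apply, the same calculation can also be phrased with GroupBy.apply, as a sketch (usually slower than the vectorized version above):
res = (df.groupby('date')
         .apply(lambda g: g['count'].sum() / g['username'].nunique())
         .rename('mean')
         .reset_index())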