Problem
I want to calculate a diff per group, and I don't know how to sort the time column so that the results within each group are ordered and positive.
The original data:
In [37]: df
Out[37]:
id time
0 A 2016-11-25 16:32:17
1 A 2016-11-25 16:36:04
2 A 2016-11-25 16:35:29
3 B 2016-11-25 16:35:24
4 B 2016-11-25 16:35:46
The result I want:
Out[40]:
id time
0 A 00:35
1 A 03:12
2 B 00:22
Note: the dtype of the time column in the result is timedelta64[ns].
Trying
In [38]: df['time'].diff(1)
Out[38]:
0 NaT
1 00:03:47
2 -1 days +23:59:25
3 -1 days +23:59:55
4 00:00:22
Name: time, dtype: timedelta64[ns]
This doesn't give the desired result.
Hope
I need the code not only to solve the problem but also to run fast, because there are 50 million rows.
You can use sort_values with groupby, then aggregate with diff:
df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time diff
0 A 2016-11-25 16:32:17 NaT
1 A 2016-11-25 16:36:04 00:00:35
2 A 2016-11-25 16:35:29 00:03:12
3 B 2016-11-25 16:35:24 NaT
4 B 2016-11-25 16:35:46 00:00:22
If you need to remove rows with NaT in the diff column, use dropna:
df = df.dropna(subset=['diff'])
print (df)
id time diff
2 A 2016-11-25 16:35:29 00:03:12
1 A 2016-11-25 16:36:04 00:00:35
4 B 2016-11-25 16:35:46 00:00:22
You can also overwrite the time column:
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time
0 A NaT
1 A 00:00:35
2 A 00:03:12
3 B NaT
4 B 00:00:22
And to also drop the NaT rows:
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
df = df.dropna(subset=['time'])
print (df)
id time
1 A 00:00:35
2 A 00:03:12
4 B 00:00:22
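If you want to end up with exactly the compact result from the question (only the diff values, reindexed from 0), a sketch combining the two steps above:
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
df = df.dropna(subset=['time']).reset_index(drop=True)
print (df)
id time
0 A 00:00:35
1 A 00:03:12
2 B 00:00:22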
Related
As I am new to Python, I am probably asking about something basic for most of you. I have a df where 'Date' is the index, another column giving the month of each date, and one data column.
Mnth TSData
Date
2012-01-05 1 192.6257
2012-01-12 1 194.2714
2012-01-19 1 192.0086
2012-01-26 1 186.9729
2012-02-02 2 183.7700
2012-02-09 2 178.2343
2012-02-16 2 172.3429
2012-02-23 2 171.7800
2012-03-01 3 169.6300
2012-03-08 3 168.7386
2012-03-15 3 167.1700
2012-03-22 3 165.9543
2012-03-29 3 165.0771
2012-04-05 4 164.6371
2012-04-12 4 164.6500
2012-04-19 4 166.9171
2012-04-26 4 166.4514
2012-05-03 5 166.3657
2012-05-10 5 168.2543
2012-05-17 5 176.8271
2012-05-24 5 179.1971
2012-05-31 5 183.7120
2012-06-07 6 195.1286
I wish to calculate monthly changes in the data set that I can later use in a boxplot. So, from the table above, the results I seek are:
Mnth Chng
1 -8.9 (183.77 - 192.66)
2 -14.14 (169.63 - 183.77)
3 -5 (164.63 - 169.63)
4 1.73 (166.36 - 164.63)
5 28.77 (195.13 - 166.36)
and so on...
Any suggestions?
Thanks :)
IIUC, starting from this as df:
Date Mnth TSData
0 2012-01-05 1 192.6257
1 2012-01-12 1 194.2714
2 2012-01-19 1 192.0086
3 2012-01-26 1 186.9729
4 2012-02-02 2 183.7700
...
20 2012-05-24 5 179.1971
21 2012-05-31 5 183.7120
22 2012-06-07 6 195.1286
you can use:
df.groupby('Mnth')['TSData'].first().diff().shift(-1)
# or
# -df.groupby('Mnth')['TSData'].first().diff(-1)
NB: the data must be sorted by date so that the desired date is used in the computation as the first item of each group (df.sort_values(by=['Mnth', 'Date'])); see the sketch after the output below.
output:
Mnth
1 -8.8557
2 -14.1400
3 -4.9929
4 1.7286
5 28.7629
6 NaN
Name: TSData, dtype: float64
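Putting it together with the explicit sort, a sketch (the Chng name is taken from the asker's desired output):
out = (df.sort_values(['Mnth', 'Date'])       # earliest date first within each month
         .groupby('Mnth')['TSData'].first()   # first (i.e. earliest) value per month
         .diff()                              # change versus the previous month's first value
         .shift(-1)                           # align each change with its starting month
         .rename('Chng'))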
First, make sure the index is a DatetimeIndex:
df.index = pd.to_datetime(df.index)
Then it's simply a matter of using resample:
df['TSData'].resample('M').first().diff().shift(freq='-1M')
Output:
Date
2011-12-31 NaN
2012-01-31 -8.8557
2012-02-29 -14.1400
2012-03-31 -4.9929
2012-04-30 1.7286
2012-05-31 28.7629
Name: TSData, dtype: float64
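A version note (my addition, not part of the original answer): pandas >= 2.2 deprecates the 'M' offset alias in favour of 'ME' (month end), so on recent versions the equivalent call should be:
df['TSData'].resample('ME').first().diff().shift(freq='-1ME')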
I have two pandas data frames df1 and df2 like the following:
df1 contains all the data of type a, in increasing time order:
type Date
0 a 1970-01-01
1 a 2008-08-01
2 a 2009-07-24
3 a 2010-09-30
4 a 2011-09-29
5 a 2013-06-11
6 a 2013-12-17
7 a 2015-06-02
8 a 2016-06-14
9 a 2017-06-21
10 a 2018-11-26
11 a 2019-06-03
12 a 2019-12-16
df2 contains all the data of type b, in increasing time order:
type Date
0 b 2017-11-29
1 b 2018-05-30
2 b 2018-11-26
3 b 2019-06-03
4 b 2019-12-16
5 b 2020-06-18
6 b 2020-12-17
7 b 2021-06-28
A type a entry and a type b entry are considered matching if the date difference between them is within one year. One type a entry can match at most one type b entry, and vice versa. How can I time-efficiently find the maximum number of matching pairs, in increasing time order, like the following?
type1 Date1 type2 Date2
0 a 2017-06-21 b 2017-11-29
1 a 2018-11-26 b 2018-05-30
2 a 2019-06-03 b 2018-11-26
3 a 2019-12-16 b 2019-06-03
Use merge_asof, renaming the columns up front so that df1's become type1/Date1 and df2's become type2/Date2, as in the desired output:
df3 = pd.merge_asof(df1.rename(columns={'Date':'Date1', 'type':'type1'}),
                    df2.rename(columns={'Date':'Date2', 'type':'type2'}),
                    left_on='Date1',
                    right_on='Date2',
                    direction='nearest',
                    allow_exact_matches=False,
                    tolerance=pd.Timedelta('365 days')).dropna(subset=['Date2'])
print (df3)
type1 Date1 type2 Date2
9 a 2017-06-21 b 2017-11-29
10 a 2018-11-26 b 2018-05-30
11 a 2019-06-03 b 2018-11-26
12 a 2019-12-16 b 2020-06-18
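Note that merge_asof matches every left row independently, so it does not enforce the one-to-one constraint; that is why the last row pairs 2019-12-16 with its nearest neighbour 2020-06-18 rather than the asker's 2019-06-03. A sketch of a greedy two-pointer pass over the already-sorted frames that does enforce one-to-one pairing (my addition, not part of the original answer; assumes both Date columns are datetime64 and sorted):
tol = pd.Timedelta('365 days')
pairs, i, j = [], 0, 0
while i < len(df1) and j < len(df2):
    delta = df2['Date'].iloc[j] - df1['Date'].iloc[i]
    if abs(delta) <= tol:
        # close enough: pair them and consume both entries
        pairs.append((df1['Date'].iloc[i], df2['Date'].iloc[j]))
        i += 1
        j += 1
    elif delta > tol:
        # the type a entry is too far in the past to ever match, skip it
        i += 1
    else:
        # the type b entry is too far in the past to ever match, skip it
        j += 1
df3 = pd.DataFrame(pairs, columns=['Date1', 'Date2'])
df3.insert(0, 'type1', 'a')
df3.insert(2, 'type2', 'b')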
I have a dataframe that looks like this:
ID | START      | END
1  | 2016-12-31 | 2017-02-28
2  | 2017-01-30 | 2017-10-30
3  | 2016-12-21 | 2018-12-30
I want to know the number of active IDs on each possible day, i.e. basically count the number of overlapping time periods.
What I did to calculate this was to create a new data frame, c_df, with the columns date and count. The date column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then, for every row in my original data frame, I calculated a range between its start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment the corresponding count cells in c_df by one.
All these loops, though, are not very efficient for big data sets, and the code looks ugly. Is there a more efficient way of doing this?
If your dataframe is small enough that performance is not a concern, create a date range for each row, then explode them and count how many times each date appears in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
.explode() \
.value_counts() \
.sort_index()
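If you want the same two-column frame as the numpy answer below produces, a sketch (the column names are my choice):
counts = (df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1)
            .explode()
            .value_counts()
            .sort_index()
            .rename_axis('Date')
            .reset_index(name='Count'))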
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Works with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
'Date': dates,
'Count': mask.sum(axis=0)
})
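The mask above holds len(df) x number-of-days booleans, so it can get big. If memory is a concern, a sweep-line sketch (my addition, not one of the original answers) that only touches the interval endpoints:
# +1 active on each START, -1 active on the day after each END
starts = df['START'].value_counts()
ends = (df['END'] + pd.Timedelta(days=1)).value_counts()
deltas = starts.sub(ends, fill_value=0).sort_index()
# running total of active IDs, forward-filled onto a daily grid
# (the grid runs one day past the last END, where the count drops to 0)
result = deltas.cumsum().resample('D').ffill().astype(int)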
Create an IntervalIndex and use a generator expression or list comprehension with contains to check each date against each interval. (Note: I made a smaller sample to test this solution on.)
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1
I'm trying to count the frequency of two events by month, using two columns from my df. What I have done so far counts all events by their unique timestamps, which is not efficient enough, as there are too many results. I wish to create a graph from the results afterwards.
I've tried adapting my code following the answers to these SO questions:
How to groupby time series by 10 minutes using pandas?
Counting frequency of occurrence by month-year using python panda
Pandas Groupby using time frequency
but cannot seem to get the command working when I pass freq='day' to the groupby command.
My code is:
print(df.groupby(['Priority', 'Create Time']).Priority.count())
which initially produced something like 170,000 results, structured like the following:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
...
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
...
But now for some reason (I'm using Jupyter Notebook) it only produces:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
Name: Priority, dtype: int64
No idea why the output has changed to only 5 results (maybe I unknowingly changed something).
I would like the results to be in the following format:
Priority month Count
1.0 2011-01 a
2011-02 b
2011-03 c
...
2.0 2011-01 x
2011-02 y
2011-03 z
...
Top points for showing how to change the frequency correctly to other values as well, for example hour/day/month/year. With your answers, please could you explain what is going on in the code, as I am new to pandas and wish to understand the process. Thank you.
One possible solution is to convert the datetime column to monthly periods with Series.dt.to_period:
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Or use Grouper:
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Sample:
np.random.seed(123)
df = pd.DataFrame({'Create Time':pd.date_range('2019-01-01', freq='10D', periods=10),
'Priority':np.random.choice([0,1], size=10)})
print (df)
Create Time Priority
0 2019-01-01 0
1 2019-01-11 1
2 2019-01-21 0
3 2019-01-31 0
4 2019-02-10 0
5 2019-02-20 0
6 2019-03-02 0
7 2019-03-12 1
8 2019-03-22 1
9 2019-04-01 0
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Priority Create Time
0 2019-01 3
2019-02 2
2019-03 1
2019-04 1
1 2019-01 1
2019-03 2
Name: Priority, dtype: int64
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Priority Create Time
0 2019-01-01 3
2019-02-01 2
2019-03-01 1
2019-04-01 1
1 2019-01-01 1
2019-03-01 2
Name: Priority, dtype: int64
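For the bonus question about other frequencies, the same patterns generalise; a sketch using the classic offset/period aliases:
df.groupby(['Priority', df['Create Time'].dt.to_period('H')]).Priority.count()   # hourly
df.groupby(['Priority', df['Create Time'].dt.to_period('D')]).Priority.count()   # daily
df.groupby(['Priority', df['Create Time'].dt.to_period('Y')]).Priority.count()   # yearly
With Grouper, the matching freq values are 'H' (hour), 'D' (day), 'MS' (month start) and 'YS' (year start).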
I've got a dataframe that looks like this:
userid date count
a 2016-12-01 4
a 2016-12-03 5
a 2016-12-05 1
b 2016-11-17 14
b 2016-11-18 15
b 2016-11-23 4
The first column is a user id, the second column is a date (resulting from a groupby(pd.TimeGrouper('d'))), and the third column is a daily count. Per user, I would like to ensure that any days missing between that user's min and max date are filled in as 0. So if I start with a data frame like the above, I want to end up with a data frame like this:
userid date count
a 2016-12-01 4
a 2016-12-02 0
a 2016-12-03 5
a 2016-12-04 0
a 2016-12-05 1
b 2016-11-17 14
b 2016-11-18 15
b 2016-11-19 0
b 2016-11-20 0
b 2016-11-21 0
b 2016-11-22 0
b 2016-11-23 4
I know that there are various methods available with a pandas data frame to resample (with options to pick to interpolate forwards, backwards, or by averaging) but how would I do this in the sense above, where I want a continuous time series for each userid but where the dates of the time series are different per user?
Here's what I tried that hasn't worked:
grouped_users = user_daily_counts.groupby('user').set_index('timestamp').resample('d', fill_method = None)
However this throws an error AttributeError: Cannot access callable attribute 'set_index' of 'DataFrameGroupBy' objects, try using the 'apply' method. I'm not sure how I'd be able to use the apply method while bringing forward all columns as I'd like to do.
Thanks for any suggestions!
You can use groupby with resample, but you first need a DatetimeIndex, created by set_index (requires pandas 0.18.1 or higher). Then fill NaN with 0 using asfreq and fillna. Finally, remove the userid column and reset_index:
df = df.set_index('date') \
       .groupby('userid') \
       .resample('D') \
       .asfreq() \
       .fillna(0) \
       .drop('userid', axis=1) \
       .reset_index()
print (df)
userid date count
0 a 2016-12-01 4.0
1 a 2016-12-02 0.0
2 a 2016-12-03 5.0
3 a 2016-12-04 0.0
4 a 2016-12-05 1.0
5 b 2016-11-17 14.0
6 b 2016-11-18 15.0
7 b 2016-11-19 0.0
8 b 2016-11-20 0.0
9 b 2016-11-21 0.0
10 b 2016-11-22 0.0
11 b 2016-11-23 4.0
If you want the count column to have integer dtype, add astype:
df = df.set_index('date') \
.groupby('userid') \
.resample('D') \
.asfreq() \
.fillna(0) \
.drop('userid', axis=1) \
.astype(int) \
.reset_index()
print (df)
userid date count
0 a 2016-12-01 4
1 a 2016-12-02 0
2 a 2016-12-03 5
3 a 2016-12-04 0
4 a 2016-12-05 1
5 b 2016-11-17 14
6 b 2016-11-18 15
7 b 2016-11-19 0
8 b 2016-11-20 0
9 b 2016-11-21 0
10 b 2016-11-22 0
11 b 2016-11-23 4
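An alternative that keeps the count column as integers throughout, by reindexing each group onto its own daily range (a sketch; fill_days is a hypothetical helper name):
def fill_days(g):
    # one row per calendar day between this user's first and last date
    idx = pd.date_range(g['date'].min(), g['date'].max(), freq='D')
    return g.set_index('date')['count'].reindex(idx, fill_value=0).rename_axis('date')

df = df.groupby('userid').apply(fill_days).reset_index(name='count')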