How do you groupby on consecutive blocks of rows where each block is separated by a threshold value?
I have the following sample pandas dataframe, and I'm having difficulty splitting it into blocks of rows wherever the difference between consecutive dates is greater than 365 days.
Date        Data
2019-01-01  A
2019-05-01  B
2020-04-01  C
2021-07-01  D
2022-02-01  E
2024-05-01  F
The output I'm looking for is the following:
Min Date    Max Date    Data
2019-01-01  2020-04-01  ABC
2021-07-01  2022-02-01  DE
2024-05-01  2024-05-01  F
I was looking at pandas .diff() and .cumsum() for getting the number of days between two rows and filtering for differences greater than 365 days; however, that alone doesn't work when the dataframe has multiple blocks of rows.
I would also suggest .diff() and .cumsum():
import pandas as pd

df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"])

# a new block starts whenever the gap to the previous row exceeds 365 days
blocks = df["Date"].diff().gt("365D").cumsum()
out = df.groupby(blocks).agg({"Date": ["min", "max"], "Data": "sum"})
out:
            Date                  Data
             min         max       sum
Date
0     2019-01-01  2020-04-01       ABC
1     2021-07-01  2022-02-01        DE
2     2024-05-01  2024-05-01         F
after which you can replace the column labels (now a 2-level MultiIndex) as appropriate.
Note that 2020-04-01 ("C") is only 336 days after 2019-05-01 ("B"), so it stays in the first block, which matches your expected output.
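For example, one way to flatten those columns into the Min Date / Max Date / Data layout from the question (a sketch that inlines the question's sample data instead of using read_clipboard):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-05-01", "2020-04-01",
                            "2021-07-01", "2022-02-01", "2024-05-01"]),
    "Data": list("ABCDEF"),
})

# a new block starts whenever the gap to the previous row exceeds 365 days
blocks = df["Date"].diff().gt("365D").cumsum()
out = df.groupby(blocks).agg({"Date": ["min", "max"], "Data": "sum"})

# collapse the 2-level MultiIndex columns into flat labels
out.columns = ["Min Date", "Max Date", "Data"]
out = out.reset_index(drop=True)
print(out)
```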
Let's say I have a pandas df with a Date column (datetime64[ns]):
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
I want to get a column (rows_num in the above example) with the number of rows I need to go back to find the current row's date minus 365 days (one year before).
So, in the above example, for index 7 (date 2021-01-26) I want to know how many rows back I can find the date 2020-01-26.
If a perfect match is not available (as in the example df), I should reference the closest available date (or the closest smaller/larger date; it doesn't really matter in my case).
Any idea? Thanks
Edited to reflect OP's original question. I created a demo dataframe and a row_count column to hold the result. Then, for each row, I build a filter grabbing all rows between that row's date and 365 days later; the shape[0] of the filtered dataframe is the row count, which is written back into the appropriate field of the df.
# Import packages
import pandas as pd
from datetime import timedelta

# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2],
                   'date': ['2020-08-09', '2020-08-25', '2020-09-05',
                            '2020-09-12', '2020-09-29', '2020-10-15',
                            '2020-11-21', '2020-12-02', '2020-12-10',
                            '2020-12-18']})

# Create the column for the row count
df.insert(2, "row_count", 0)

# Convert the date to datetime64
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

for row in range(len(df['date'])):
    start_date = df['date'].iloc[row]
    end_date = start_date + timedelta(days=365)  # end date for the filter
    # Filter data between the two dates
    filtered_df = df.loc[(df['date'] >= start_date) & (df['date'] < end_date)]
    # Fill in row_count with the number of rows matched by the filter
    df.loc[df.index[row], 'row_count'] = filtered_df.shape[0]
You can use pd.merge_asof, which performs the exact nearest-match lookup you describe. You can even choose to use backward (smaller), forward (larger), or nearest search types.
# setup
import pandas as pd
from io import StringIO

text = StringIO(
"""
Date
2020-01-01
2020-02-25
2020-04-23
2020-06-28
2020-08-17
2020-10-11
2020-12-06
2021-01-26
2021-03-17
"""
)
data = pd.read_csv(text, delim_whitespace=True, parse_dates=["Date"])
# calculate the reference date from 1 year (365 days) ago
one_year_ago = data["Date"] - pd.Timedelta("365D")
# we only care about the index values for the original and matched dates
merged = pd.merge_asof(
one_year_ago.reset_index(),
data.reset_index(),
on="Date",
suffixes=("_original", "_matched"),
direction="backward",
)
data["rows_num"] = merged["index_original"] - merged["index_matched"]
Result:
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
I have a DataFrame (df1) with patients, where each patient (with unique id) has an admission timestamp:
admission_timestamp id
0 2020-03-31 12:00:00 1
1 2021-01-13 20:52:00 2
2 2020-04-02 07:36:00 3
3 2020-04-05 16:27:00 4
4 2020-03-21 18:51:00 5
I also have a DataFrame (df2) with for each patient (with unique id), data for a specific feature. For example:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
2 1 temperature 2020-04-03 13:04:33 36.51
3 2 temperature 2020-04-02 07:44:12 36.45
4 2 temperature 2020-04-08 08:36:00 36.50
Where effective_timestamp is of type: datetime64[ns], for both columns. The ids for both dataframes link to the same patients.
In reality there is a lot more data with +- 1 value per minute. What I want is for each patient, only the data for the first X (say 24) hours after the admission timestamp from df1. So the above would result in:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
3 2 temperature 2020-04-02 07:44:12 36.45
This would thus involve first looking up the admission timestamp, and then dropping all rows for that patient where the effective_timestamp is not within X hours of it. Here, X should be variable (could be 7, 24, 72, etc.). I could not find a similar question on SO. I tried pandas' date_range, but I don't know how to apply it per patient with a variable value for X. Any help is appreciated.
Edit: I could also merge the dataframes together so each row in df2 has the admission_timestamp, and then subtract the two columns to get the difference in time. And then drop all rows where difference > X. But this sounds very cumbersome.
Let's use pd.DateOffset
First get the value of admission_timestamp for a given patient id, and convert it to pandas datetime.
Let's say id = 1
>>> admissionTime = pd.to_datetime(df1[df1['id'] == 1]['admission_timestamp'].values[0])
>>> admissionTime
Timestamp('2020-03-31 12:00:00')
Now, you just need to use pd.DateOffset to add 24 hours to it.
>>> admissionTime += pd.DateOffset(hours=24)
Now, just look for the rows where id=1 and effective_timestamp < admissionTime
>>> df2[(df2['id'] == 1) & (df2['effective_timestamp'] < admissionTime)]
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
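To cover all patients at once, the merge-then-filter idea from the question's edit is actually quite idiomatic and keeps X variable. A sketch, assuming the column names shown above (note the result follows the admission timestamps literally: with the df1 shown, patient 2's admission is 2021-01-13, so both of their readings fall outside the window):

```python
import pandas as pd

df1 = pd.DataFrame({
    "admission_timestamp": pd.to_datetime(["2020-03-31 12:00:00",
                                           "2021-01-13 20:52:00"]),
    "id": [1, 2],
})
df2 = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "name": ["temperature"] * 5,
    "effective_timestamp": pd.to_datetime([
        "2020-03-31 13:00:00", "2020-03-31 13:04:33", "2020-04-03 13:04:33",
        "2020-04-02 07:44:12", "2020-04-08 08:36:00"]),
    "numerical_value": [36.47, 36.61, 36.51, 36.45, 36.50],
})

X = 24  # window size in hours; change to 7, 72, etc.

# bring each patient's admission timestamp onto their measurement rows
merged = df2.merge(df1, on="id")
within = merged["effective_timestamp"].between(
    merged["admission_timestamp"],
    merged["admission_timestamp"] + pd.Timedelta(hours=X),
)
result = merged.loc[within, df2.columns]
print(result)
```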
Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and it has an nsmallest counterpart: pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
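Since you want the 10 lowest hours per week rather than overall, nsmallest can also be applied per group. A sketch with synthetic hourly data (the column names are assumptions):

```python
import pandas as pd
import numpy as np

# four weeks of hourly event counts, starting on a Monday
rng = pd.date_range("2020-06-01", periods=24 * 28, freq="h")
df = pd.DataFrame({"datetime": rng,
                   "events": np.random.default_rng(0).integers(0, 50, size=len(rng))})

# the 10 quietest hours within each calendar week
lowest = (df.set_index("datetime")
            .groupby(pd.Grouper(freq="W"))["events"]
            .nsmallest(10))
print(lowest)  # MultiIndex of (week end, hour) -> event count
```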
My bad, so your DatetimeIndex is an hourly sampling, and you need the hour(s) with the fewest events weekly.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day:
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values (pandas.DataFrame.pivot_table).
You'll then have one datetime index and 24 hour columns, with values denoting the number of events.
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to weekly level aggregate using sum.
df.resample('w').sum()
The last part is a bit tricky to do on the dataframe. But fairly simple if you just need the output.
for row in df.itertuples():
    print(sorted(row[1:]))
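The steps above can be sketched end to end with synthetic data (the date and n_events column names are taken from the sample above):

```python
import pandas as pd
import numpy as np

# two weeks of hourly event counts, starting on a Monday
rng = pd.date_range("2020-06-01", periods=24 * 14, freq="h")
df = pd.DataFrame({"date": rng,
                   "n_events": np.random.default_rng(1).integers(0, 10, size=len(rng))})

df["hour"] = df["date"].dt.hour
wide = df.pivot_table(index=pd.Grouper(key="date", freq="W"),
                      columns="hour", values="n_events", aggfunc="sum")

# one row per week, one column per hour of day; pick the quietest hours per week
for week, row in wide.iterrows():
    print(week.date(), row.nsmallest(3).index.tolist())
```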
This forloop will take 3 days to complete. How can I increase the speed?
for i in range(df.shape[0]):
    df.loc[df['Creation date'] >= pd.to_datetime(str(df['Original conf GI dte'].iloc[i])), 'delivered'] += df['Sale order item'].iloc[i]
I think the forloop is enough to understand?
If Creation date is bigger than Original conf GI date, then add Sale order item value to delivered column.
Each row's date is "Date Accepted" ("Date Delivered" is a future date). Input is Order Quantity, Date Accepted and Date Delivered; output is the Delivered column.
Order Quantity Date Accepted Date Delivered Delivered
20 01-05-2010 01-02-2011 0
10 01-11-2010 01-03-2011 0
300 01-12-2010 01-09-2011 0
5 01-03-2011 01-03-2012 30
20 01-04-2012 01-11-2013 335
10 01-07-2013 01-12-2014 335
Convert values to numpy arrays by Series.to_numpy, compare them with broadcasting, match order values by numpy.where and last sum:
date1 = df['Date Accepted'].to_numpy()
date2 = df['Date Delivered'].to_numpy()
order = df['Order Quantity'].to_numpy()
# in older pandas versions use .values instead:
# date1 = df['Date Accepted'].values
# date2 = df['Date Delivered'].values
# order = df['Order Quantity'].values
df['Delivered1'] = np.where(date1[:, None] >= date2, order, 0).sum(axis=1)
print(df)
Order Quantity Date Accepted Date Delivered Delivered Delivered1
0 20 2010-01-05 2011-01-02 0 0
1 10 2010-01-11 2011-01-03 0 0
2 300 2010-01-12 2011-01-09 0 0
3 5 2011-01-03 2012-01-03 30 30
4 20 2012-01-04 2013-01-11 335 335
5 10 2013-01-07 2014-01-12 335 335
If I understand correctly, you can use np.where() for speed. Currently you are looping over the dataframe rows, whereas numpy operations are designed to act on whole arrays at once. Note that your loop compares each row's Creation date against every Original conf GI dte, so the vectorized version needs a pairwise (broadcast) comparison rather than a row-by-row one:
cond = df['Creation date'].to_numpy()[:, None] >= df['Original conf GI dte'].to_numpy()
df['delivered'] = np.where(cond, df['Sale order item'].to_numpy(), 0).sum(axis=1)
I am looking to determine the count of string values in a column across a 3-month data sample. Samples were taken at random times throughout each day. I can group the data by hour, but I require the fidelity of 30-minute intervals (e.g. 0500-0530, 0530-0600) on roughly 10k rows of data.
An example of the data:
datetime stringvalues
2018-06-06 17:00 A
2018-06-07 17:30 B
2018-06-07 17:33 A
2018-06-08 19:00 B
2018-06-09 05:27 A
I have tried setting the datetime column as the index, but I cannot figure how to group the data on anything other than 'hour' and I don't have fidelity on the string value count:
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df.groupby(df.index.hour).count()
Which returns an output similar to:
datetime stringvalues
datetime
5 0 0
6 2 2
7 5 5
8 1 1
...
I researched multi-indexing and resampling to some length the past two days but I have been unable to find a similar question. The desired result would look something like this:
datetime A B
0500 1 2
0530 3 5
0600 4 6
0630 2 0
....
There is no straightforward way to do a TimeGrouper on the time component, so we do this in two steps:
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
.size()
.unstack(fill_value=0))
v.groupby(v.index.time).sum()
stringvalues A B
05:00:00 1 0
17:00:00 1 0
17:30:00 1 1
19:00:00 0 1
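To match the 0500 / 0530 style labels from the question, the time index can then be reformatted. A self-contained sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2018-06-06 17:00", "2018-06-07 17:30",
                                "2018-06-07 17:33", "2018-06-08 19:00",
                                "2018-06-09 05:27"]),
    "stringvalues": ["A", "B", "A", "B", "A"],
})

# count values per 30-minute bin, then collapse onto time-of-day
v = (df.groupby([pd.Grouper(key="datetime", freq="30min"), "stringvalues"])
       .size()
       .unstack(fill_value=0))
out = v.groupby(v.index.time).sum()
out.index = [t.strftime("%H%M") for t in out.index]  # e.g. "0500", "1730"
print(out)
```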