I have a pandas dataframe that looks like this, whereby each row represents data collected on a different day (days 1 -> 5) for each participant (long form).
ID Heart_Rate
1 89
1 98
1 99
1 73
1 54
...
24 88
24 90
24 79
24 92
24 97
How can I aggregate the data over the first 3 days for each participant, so that I get a new data frame with one row per participant containing the mean heart rate over the first 72 hours?
We can set the index of the dataframe to ID, group on level=0, use head to select the first three rows for each user ID, then group on level=0 again and take the mean to get the average heart rate over the first 72 hours (mean(level=0) was removed in pandas 2.0, so the last step uses groupby):
out = df.set_index('ID').groupby(level=0).head(3).groupby(level=0).mean()
An alternative approach that is more efficient, but applicable only if every user ID has the same number of rows and the dataframe is sorted on the ID column:
import numpy as np
n_days = 5 # Number of rows present for each user ID
n_days_to_avg = 3 # First n rows/days to average
# True for rows whose position within each block of n_days is one of the first n_days_to_avg
m = np.isin(np.r_[:len(df)] % n_days, np.r_[:n_days_to_avg])
out = df[m].groupby('ID').mean()
>>> out
Heart_Rate
ID
1 95.333333
24 85.666667
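As a quick sanity check, here is a minimal self-contained sketch that runs both approaches on the ten sample rows from the question and compares them. The names out_head and out_mask are only illustrative, and the mask is written as a plain comparison, which should be equivalent to the np.isin form for a contiguous range:

```python
import numpy as np
import pandas as pd

# Sample data from the question: 5 daily readings for participants 1 and 24.
df = pd.DataFrame({
    'ID': [1] * 5 + [24] * 5,
    'Heart_Rate': [89, 98, 99, 73, 54, 88, 90, 79, 92, 97],
})

# Approach 1: keep the first 3 rows per ID, then average per ID.
out_head = df.groupby('ID').head(3).groupby('ID').mean()

# Approach 2: positional mask; assumes exactly n_days rows per ID, sorted by ID.
n_days, n_days_to_avg = 5, 3
mask = (np.arange(len(df)) % n_days) < n_days_to_avg
out_mask = df[mask].groupby('ID').mean()

print(out_head)
print(out_mask)
```

Both frames should hold the same two rows, one per ID.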
Related
I have a dataframe named data as shown:
Date     Value  X1   X2  X3
2019-05  15     23   65  98
2019-05  34     132  56  87
2019-06  23     66   90  44
The Date column is in a datetime format of Year-Month, starting from 2017-01 with the most recent being 2022-05. I want to write a piece of code that will extract the data into separate data frames. More specifically, I want one data frame to contain the rows of the current month and year (2022-05), another to contain the data from the previous month (2022-04), and one more that contains the data from 12 months ago (2021-05).
For my code I have the following:
import pandas as pd
from datetime import datetime as dt
data = pd.read_csv("results.csv")
current = data[data["Date"].dt.month == dt.now().month]
My results show the following:
Date     Value  X1   X2  X3
2019-05  15     23   65  98
2019-05  34     132  56  87
2020-05  23     66   90  44
So I get the rows that match the current month, but I need them to match the current year as well. I assumed I could just add multiple conditions to match both the current month and the current year, but that did not seem to work for me.
Also, is there a way to write the code so that I can extract the data from the previous month and the previous year based on what the current month-year is? My first thought was to just take the month and subtract 1, and do the same thing for the year; if the current month is January, I would write an exception to subtract 1 from both the month and the year for the previous-month analysis.
Split your DF into a dict of DFs and then access the one you want directly by the date (YYYY-MM).
index  Date     Value  X1   X2  X3
0      2017-04  15     23   65  98
1      2019-05  34     132  56  87
2      2021-06  23     66   90  44
dfs = {x:df[df.Date == x ] for x in df.Date.unique()}
dfs['2017-04']
index  Date     Value  X1  X2  X3
0      2017-04  15     23  65  98
You can do this with a groupby operation, which is a first-class kind of thing in tabular analysis (sql/pandas). In this case, you want to group by both year and month, creating dataframes:
dfs = []
for key, group_df in df.groupby([df.Date.dt.year, df.Date.dt.month]):
    dfs.append(group_df)
dfs will have the subgroups you want.
One thing worth noting: there is a performance cost to breaking a dataframe into list items. It's just as likely that you could do whatever processing comes next directly in the groupby statement, such as df.groupby(...).X1.transform('sum') for example.
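For the specific case in the question (current month, previous month, and the same month a year earlier), monthly periods may be simpler, since subtracting an integer from a pd.Period rolls over year boundaries on its own. The following is only a sketch: the sample rows are made up, and the reference month is pinned to 2022-05 so the result is reproducible (in practice it could come from pd.Timestamp.now().to_period('M')).

```python
import pandas as pd

# Made-up sample rows; only the Date and Value columns matter here.
data = pd.DataFrame({
    'Date': ['2021-05', '2022-04', '2022-05', '2022-05', '2019-05'],
    'Value': [1, 2, 3, 4, 5],
})

# Convert the Year-Month strings to monthly periods.
months = pd.to_datetime(data['Date'], format='%Y-%m').dt.to_period('M')

# Reference month, fixed here for reproducibility (an assumption of this sketch).
current = pd.Period('2022-05', freq='M')

current_df = data[months == current]        # 2022-05
previous_df = data[months == current - 1]   # 2022-04
year_ago_df = data[months == current - 12]  # 2021-05

print(current_df, previous_df, year_ago_df, sep='\n\n')
```

Because the arithmetic happens on periods, no special case is needed when the current month is January.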
I would like to calculate the daily sales from average sales using the following function:
def derive_daily_sales(avg_sales_series, period, first_day_sales):
    """
    derive the daily sales from previous_avg_sales start date to current_avg_sales end date
    for detail formula, please refer to README.md
    #avg_sales_series: an array of avg sales(e.g. 2020-08-04 to 2020-08-06)
    #period: the averaging period in days (e.g. 30 days, 90 days)
    #first_day_sales: the sales at the first day of previous_avg_sales
    """
    x_n1 = avg_sales_series[-1]*period - avg_sales_series[0]*period + first_day_sales
    return x_n1
The avg_sales_series is supposed to be a pandas series.
The dataframe looks like the following:
date, customer_id, avg_30_day_sales
12/08/2020, 1, 30
13/08/2020, 1, 40
14/08/2020, 1, 40
12/08/2020, 2, 20
13/08/2020, 2, 40
14/08/2020, 2, 30
I would like to first groupby customer_id and sort by date. Then, get the rolling window of size 2. And apply the custom function derive_daily_sales assuming that period=30 and first_day_sales equal to the first avg_30_day_sales.
I tried:
df_sales_grouped = df_sales.sort_values('date').groupby(['customer_id','date'])
df_daily_sales['daily_sales'] = df_sales_grouped['avg_30_day_sales'].rolling(2).apply(derive_daily_sales, axis=1, period=30, first_day_sales= df_sales['avg_30_day_sales'][0])
You should not group by the date since you want to roll over that column, so the grouping should be:
df_sales_grouped = df_sales.sort_values('date').groupby('customer_id')
Next, what you actually want to do is apply a rolling window on each group in the dataframe, so the custom function is applied in two steps: once per group and once per rolling window. Using transform for the per-group step keeps the result aligned with the original index, and raw=True passes each window to derive_daily_sales as a NumPy array, so that [0] and [-1] index by position. This can be done as follows:
rolling_arguments = {'period': 30, 'first_day_sales': df_sales['avg_30_day_sales'][0]}
df_sales['daily_sales'] = df_sales_grouped['avg_30_day_sales'].transform(
    lambda g: g.rolling(2).apply(derive_daily_sales, raw=True, kwargs=rolling_arguments))
For the given input data, the result is:
date customer_id avg_30_day_sales daily_sales
12/08/2020 1 30 NaN
13/08/2020 1 40 330.0
14/08/2020 1 40 30.0
12/08/2020 2 20 NaN
13/08/2020 2 40 630.0
14/08/2020 2 30 -270.0
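To reproduce that table end to end, here is a minimal runnable sketch built from the six sample rows in the question. It assumes a reasonably recent pandas, uses transform for the per-group step so the result keeps the original index, and passes raw=True so each window reaches the function as a NumPy array:

```python
import pandas as pd

def derive_daily_sales(avg_sales_series, period, first_day_sales):
    # Last average minus first average, scaled by the period, plus the first-day sales.
    return avg_sales_series[-1] * period - avg_sales_series[0] * period + first_day_sales

df_sales = pd.DataFrame({
    'date': ['12/08/2020', '13/08/2020', '14/08/2020'] * 2,
    'customer_id': [1, 1, 1, 2, 2, 2],
    'avg_30_day_sales': [30, 40, 40, 20, 40, 30],
})

rolling_arguments = {
    'period': 30,
    'first_day_sales': df_sales['avg_30_day_sales'].iloc[0],
}

# Group by customer only; the window of size 2 rolls over the date-sorted rows of each group.
grouped = df_sales.sort_values('date').groupby('customer_id')
df_sales['daily_sales'] = grouped['avg_30_day_sales'].transform(
    lambda g: g.rolling(2).apply(derive_daily_sales, raw=True, kwargs=rolling_arguments))

print(df_sales)
```

Note that the dates here are plain strings, so sort_values orders them lexicographically; converting them with pd.to_datetime first would be safer for data spanning several months.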
I want to make a separate dataframe for each Number (column B) where Main Date > Reported Date (see the image below). If this condition is true for a Number, I have to create another dataframe displaying that Number's data.
Example: take Number (column B) 223311. If any of its Main Dates > Reported Date, then display all the records of that Number.
Here is a simple solution with Pandas. You can separate out Dataframes very easily by column values of a particular column. From there, iterate the new Dataframe, resetting for index (if you want to keep the index, use dataframe.shape instead). I appended them to a list for convenience, which could be easily extracted into labeled dataframes, or combined. Long variable names are to help comprehension.
import pandas as pd

df = pd.read_csv('forstack.csv')
list_of_dataframes = [] #A place to store each dataframe. You could also name them as you go
checked_Numbers = [] #Simply to avoid multiple of same dataframe
for aNumber in df['Number']: #For every number in the column "Number"
    if(aNumber not in checked_Numbers): #If this number has not been processed yet
        checked_Numbers.append(aNumber) #Mark as checked
        df_forThisNumber = df[df.Number == aNumber].reset_index(drop=True) #"Make a different Dataframe" Per request, with new index
        for index in range(0,len(df_forThisNumber)): #Parse each element of this dataframe to see if it matches criteria
            if(df_forThisNumber.at[index,'Main Date'] > df_forThisNumber.at[index,'Reported Date']):
                list_of_dataframes.append(df_forThisNumber) #If it matches the criteria, append it
                break #Stop at the first match so the dataframe is appended only once
Outputs :
Main Date Number Reported Date Fee Amount Cost Name
0 1/1/2019 223311 1/1/2019 100 12 20 11
1 1/7/2019 223311 1/1/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/2/2019 111111 1/2/2019 100 12 20 11
1 1/6/2019 111111 1/2/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/3/2019 222222 1/3/2019 100 12 20 11
1 1/8/2019 222222 1/3/2019 100 12 20 11
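The same selection can probably be done without explicit loops. The sketch below is one possible vectorized variant, not the method above: it keeps only the three relevant columns, assumes the dates are written month/day/year, parses them so the comparison is on real dates rather than strings, and adds a hypothetical Number 333333 that never satisfies the condition to show that it is left out.

```python
import pandas as pd

df = pd.DataFrame({
    'Main Date': ['1/1/2019', '1/7/2019', '1/2/2019', '1/6/2019', '1/4/2019'],
    'Number': [223311, 223311, 111111, 111111, 333333],  # 333333 is a made-up control row
    'Reported Date': ['1/1/2019', '1/1/2019', '1/2/2019', '1/2/2019', '1/4/2019'],
})

# Parse the date columns (month/day/year is an assumption about the file).
for col in ['Main Date', 'Reported Date']:
    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')

# Row-level condition, then broadcast "any row matches" to every row of the same Number.
is_late = df['Main Date'] > df['Reported Date']
has_late = is_late.groupby(df['Number']).transform('any')

# One dataframe per qualifying Number, keyed by the Number itself.
frames = {number: group.reset_index(drop=True)
          for number, group in df[has_late].groupby('Number')}

for number, frame in frames.items():
    print(number)
    print(frame)
```

Keeping the frames in a dict keyed by Number makes it easy to look one up directly, e.g. frames[223311].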
I have a Pandas dataframe of the form:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/04/27 1 42
2019/04/28 1 41
2019/01/27 2 33
2019/08/27 2 23
What I need to do?
Select Rows which are at least 30 days old from their latest measurement for each id.
i.e. the latest date for Id = 2 is 2019/08/27, so for ID =2 I need to select rows which are at least 30 days older. So, the row with 2019/08/27 for ID=2 will itself be dropped.
Similarly, the latest date for ID = 1 is 2019/04/28. This means I can select rows for ID = 1 only if the date is earlier than 2019/03/29 (30 days before the latest date). So, the row 2019/04/27 with ID=1 will be dropped.
How can I do this in Pandas? Any help is greatly appreciated.
Thank you.
Final dataframe will be:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/01/27 2 33
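In your case, use groupby + transform('max') to get the latest date for each ID, then filter the original df. This assumes Date has already been converted to datetime ('max' does not depend on the row order, unlike 'last'):
Yourdf=df[df.Date<df.groupby('ID').Date.transform('max')-pd.Timedelta('30 days')].copy()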
Date ID Temp
0 2019-03-27 1 23
1 2019-04-27 2 32
4 2019-01-27 2 33
Notice I am adding .copy() at the end to prevent the SettingWithCopyWarning.
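For completeness, a small self-contained sketch of this filter on the six sample rows from the question. It assumes the Date strings are in year/month/day form and takes each ID's latest date with 'max'; the names latest and old_rows are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019/03/27', '2019/04/27', '2019/04/27',
             '2019/04/28', '2019/01/27', '2019/08/27'],
    'ID': [1, 2, 1, 1, 2, 2],
    'Temp': [23, 32, 42, 41, 33, 23],
})
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d')

# Latest measurement date per ID, repeated on every row of that ID.
latest = df.groupby('ID')['Date'].transform('max')

# Keep rows more than 30 days older than their ID's latest date.
old_rows = df[df['Date'] < latest - pd.Timedelta(days=30)].copy()

print(old_rows)
```

The rows that remain are the ones at index 0, 1 and 4, matching the expected frame in the question.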
I'm looping through a DataFrame of 200k rows. It's doing what I want but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames so I wonder if I'm doing this in a very inefficient way. It's quite simple, here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. The logic I'm employing to sum this number up is:
for each row, get these rows from the dataframe:
those that match on the group id, and...
those that have a beginning date within the 3 years before this row's beginning date, and...
those that have an ending date on or before this row's beginning date.
Sum up those rows' GAP numbers, add this row's GAP number, then append the result to a list.
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
import pandas as pd
from dateutil import parser
df = pd.DataFrame( columns = ['ID_NBR','GROUP_ID','BEG_DATE','END_DATE','THREE_YEAR_AGO','GAP'],
                   data = [['09','185',parser.parse('2008-08-13'),parser.parse('2009-07-01'),parser.parse('2005-08-13'),44],
                           ['10','185',parser.parse('2009-08-04'),parser.parse('2010-01-18'),parser.parse('2006-08-04'),35],
                           ['11','185',parser.parse('2010-01-18'),parser.parse('2011-01-18'),parser.parse('2007-01-18'),0],
                           ['12','185',parser.parse('2014-09-04'),parser.parse('2015-09-04'),parser.parse('2011-09-04'),0]])
and here's what I wrote at the top of the script, may help:
The purpose of this script is to extract gaps counts over the
last 3 year period. It uses gaps.sql as its source extract. this query
returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The python code then looks back at the previous 3 years (those
previous rows that have the same GROUP_ID but whose beginning dates
come on or after this row's THREE_YEAR_AGO and whose end dates come on
or before this row's beginning date). The GAP values of those rows are
added to the row's own GAP and a new column is made called GAP_THREE.
What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that the row with ID_NBR 11 has a value of 79 for the last 3 years, but ID_NBR 12 has 0, because the last gap was 35 in 2009, which is more than 3 years before 12's beginning date in 2014.
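One possible way to avoid the row-by-row loop is a self-merge on GROUP_ID followed by a vectorized filter and a groupby sum. This is only a sketch on the constructed dataset above (dates built with pd.to_datetime instead of dateutil). The helper names left, right, pairs and prior are made up, and the approach assumes the groups are small enough that pairing every row with every other row of its group fits in memory:

```python
import pandas as pd

df = pd.DataFrame(
    columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
    data=[['09', '185', '2008-08-13', '2009-07-01', '2005-08-13', 44],
          ['10', '185', '2009-08-04', '2010-01-18', '2006-08-04', 35],
          ['11', '185', '2010-01-18', '2011-01-18', '2007-01-18', 0],
          ['12', '185', '2014-09-04', '2015-09-04', '2011-09-04', 0]])
for col in ['BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO']:
    df[col] = pd.to_datetime(df[col])

# Left side: each row's own window; 'index' remembers which row it came from.
left = df.reset_index()[['index', 'GROUP_ID', 'BEG_DATE', 'THREE_YEAR_AGO']]
# Right side: candidate rows from the same group.
right = df[['GROUP_ID', 'BEG_DATE', 'END_DATE', 'GAP']]

# Pair every row with every row of the same GROUP_ID.
pairs = left.merge(right, on='GROUP_ID', suffixes=('', '_other'))

# Same conditions as the loop: candidate begins within the window and ends
# on or before this row's beginning date.
in_window = ((pairs['BEG_DATE_other'] >= pairs['THREE_YEAR_AGO']) &
             (pairs['END_DATE'] <= pairs['BEG_DATE']))

# Sum the qualifying GAP values per original row; rows with no match get 0.
prior = pairs[in_window].groupby('index')['GAP'].sum()
df['GAP_THREE'] = df['GAP'] + prior.reindex(df.index, fill_value=0)

print(df[['ID_NBR', 'GAP', 'GAP_THREE']])
```

On the sample data this gives 44, 79, 79 and 0, the same values as the loop. For 200k rows the cost depends on group sizes, since the merge grows with the square of each group's row count.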