Condition on all rows of a groupby - python

Consider this type of dataframe:
import pandas as pd
import datetime
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 3],
                   'Time': [datetime.date(2019, 12, 1), datetime.date(2019, 12, 5),
                            datetime.date(2019, 12, 8), datetime.date(2019, 8, 4),
                            datetime.date(2019, 11, 4), datetime.date(2019, 11, 4),
                            datetime.date(2019, 11, 3), datetime.date(2019, 12, 20)],
                   'Value': [2, 2, 2, 50, 7, 100, 7, 5]})
ID Time Value
0 1 2019-12-01 2
1 1 2019-12-05 2
2 1 2019-12-08 2
3 1 2019-08-04 50
4 2 2019-11-04 7
5 2 2019-11-04 100
6 2 2019-11-03 7
7 3 2019-12-20 5
I am interested only in the 3 latest values (by time), and I would like to keep only the IDs where these 3 values are all < 10.
So my desired output will look like this:
ID
0 1
Indeed, the value 50 for the first ID is only the fourth most recent value, so it is not taken into account.

You could use a combination of query and groupby+size:
ids = df.query('Value < 10').groupby('ID')['Time'].size().ge(3)
ids[ids].reset_index().drop('Time', axis=1)
output:
ID
0 1
Alternative using filter (slower):
df.groupby('ID').filter(lambda g: len(g[g['Value'].lt(10)]['Time'].nlargest(3))>2)
output:
ID Time Value
0 1 2019-12-01 2
1 1 2019-12-05 2
2 1 2019-12-08 2
3 1 2019-08-04 50
and to get only the IDs, append ['ID'].unique().
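For completeness, the full chained expression would look like this (a sketch, assuming Time has been converted with pd.to_datetime first, since nlargest needs a datetime or numeric dtype):
df['Time'] = pd.to_datetime(df['Time'])
ids = df.groupby('ID').filter(
    lambda g: len(g[g['Value'].lt(10)]['Time'].nlargest(3)) > 2)['ID'].unique()
ids  # array([1])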

Within a groupby, I:
- sort the group by Time
- use a boolean to check whether the condition < 10 is satisfied
- take the last 3 values only and sum the booleans defined above
- check if this sum is exactly 3
grp = df.groupby("ID")\
        .apply(lambda x: x.sort_values("Time")["Value"].lt(10)[-3:].sum() == 3)
grp[grp]
ID
1 True
dtype: bool
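A variant of the same idea without apply, as a sketch: sort by Time, take the last 3 rows per ID with tail, then keep only the IDs that have exactly 3 such rows, all below 10:
latest3 = df.sort_values('Time').groupby('ID').tail(3)
stats = latest3.groupby('ID')['Value'].agg(['size', 'max'])
ids = stats.index[(stats['size'] == 3) & (stats['max'] < 10)].tolist()
ids  # [1]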

Related

Summing values up to a column value change in pandas dataframe

I have a pandas data frame that looks like this:
Count Status
Date
2021-01-01 11 1
2021-01-02 13 1
2021-01-03 14 1
2021-01-04 8 0
2021-01-05 8 0
2021-01-06 5 0
2021-01-07 2 0
2021-01-08 6 1
2021-01-09 8 1
2021-01-10 10 0
I want to calculate the difference between the initial and final value of the "Count" column before the "Status" column changes from 0 to 1 or vice-versa (for every cycle) and make a new dataframe out of these values.
The output for this example would be:
Cycle Difference
1 3
2 -6
3 2
Use GroupBy.agg on consecutive groups, created by comparing Status with its shifted values and taking the cumulative sum, then subtract the first value from the last:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first', 'last'])
        .eval('last - first')
        .reset_index(name='Difference'))
print(df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
3 4 0
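To see what the grouping key looks like, here is a small sketch of the intermediate step, run on the original frame (before the reassignment above):
# the counter increases every time Status changes, giving one label per consecutive run
cycle = df['Status'].ne(df['Status'].shift()).cumsum()
print(cycle.tolist())  # [1, 1, 1, 2, 2, 2, 2, 3, 3, 4]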
If you need to filter out groups with only 1 row, you can add a GroupBy.size aggregation and then filter out those rows with DataFrame.loc:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first', 'last', 'size'])
        .loc[lambda x: x['size'] > 1]
        .eval('last - first')
        .reset_index(name='Difference'))
print(df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
You can use GroupBy.agg on the groups formed by the consecutive values, then take the last value minus the first (see below for variants):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1] - x.iloc[0])
       )
output:
Status
1 3
2 -6
3 2
4 0
Name: Count, dtype: int64
If you only want to do this for groups of more than one element:
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1] - x.iloc[0] if len(x) > 1 else pd.NA)
         .dropna()
       )
output:
Status
1 3
2 -6
3 2
Name: Count, dtype: object
output as DataFrame:
add .rename_axis('Cycle').reset_index(name='Difference'):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1] - x.iloc[0] if len(x) > 1 else pd.NA)
         .dropna()
         .rename_axis('Cycle').reset_index(name='Difference')
       )
output:
Cycle Difference
0 1 3
1 2 -6
2 3 2

Python Pandas - Selecting specific rows based on the max and min of two columns with the same group id

I am looking for a way to identify the row that is the 'master' row. The way I am defining the master row is, for each group_id, the row that has the minimum cust_hierarchy; if there is a tie, use the row with the most recent date.
I have supplied some sample tables below:
row_id  group_id  cust_hierarchy  most_recent_date  master (I am looking for)
1       0         2               2020-01-03        1
2       0         7               2019-01-01        0
3       1         7               2019-05-01        0
4       1         6               2019-04-01        0
5       1         6               2019-04-03        1
I was thinking of possibly ordering by the two columns (cust_hierarchy ascending, most_recent_date descending) and then adding a new column that places a 1 on the first row for each group_id?
Does anyone have any helpful code for this?
You can basically do a groupby with idxmin(), but with a little bit of sorting to ensure the most recent date is the one selected by the min operation:
import pandas as pd
import numpy as np
# example data
dates = ['2020-01-03', '2019-01-01', '2019-05-01',
         '2019-04-01', '2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id': [0, 0, 1, 1, 1],
                   'cust_hierarchy': [2, 7, 7, 6, 6],
                   'most_recent_date': dates})
# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
Use duplicated on sort_values:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int)
                    )
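As a quick sanity check, the master rows themselves can then be pulled out with boolean indexing (a small sketch, assuming the master column from either snippet above):
# keep only the master row of each group_id (works for both the boolean and the 0/1 column)
masters = df.loc[df['master'].astype(bool)]
print(masters[['group_id', 'cust_hierarchy', 'most_recent_date']])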

one year rolling count of unique values by group in pandas

So I have the following dataframe:
Period group ID
20130101 A 10
20130101 A 20
20130301 A 20
20140101 A 20
20140301 A 30
20140401 A 40
20130101 B 11
20130201 B 21
20130401 B 31
20140401 B 41
20140501 B 51
I need to count how many different IDs there are by group in the last year. So my desired output would look like this:
Period group num_ids_last_year
20130101 A 2 # ID 10 and 20 in the last year
20130301 A 2
20140101 A 2
20140301 A 2 # ID 30 enters, ID 10 leaves
20140401 A 3 # ID 40 enters
20130101 B 1
20130201 B 2
20130401 B 3
20140401 B 2 # ID 11 and 21 leave
20140501 B 2 # ID 31 leaves, ID 51 enters
Period is in datetime format. I tried many things along the lines of:
df.groupby(['group','Period'])['ID'].nunique() # Get number of IDs by group in a given period.
df.groupby(['group'])['ID'].nunique() # Get total number of IDs by group.
df.set_index('Period').groupby('group')['ID'].rolling(window=1, freq='Y').nunique()
But the last one isn't even possible. Is there any straightforward way to do this? I'm thinking maybe some kind of combination of cumcount() and pd.DateOffset, or maybe ge(df.Period - dt.timedelta(365)), but I can't find the answer.
Thanks.
Edit: added the fact that I can find more than one ID in a given Period
Looking at your data structure, I am guessing you have MANY duplicates, so start by dropping them; drop_duplicates tends to be fast.
I am assuming that the df['Period'] column is of dtype datetime64[ns].
from dateutil.relativedelta import relativedelta

df = df.drop_duplicates()
results = dict()
for start in df['Period'].drop_duplicates():
    # keep end as a Timestamp-like value so the comparison below works
    end = start - relativedelta(years=1)
    screen = (df.Period <= start) & (df.Period >= end)  # screen for 1 year of data
    singles = df.loc[screen, ['group', 'ID']].drop_duplicates()  # unique (group, ID) pairs in that year
    x = singles.groupby('group').count()
    results[start] = x
results = pd.concat(results, axis=0)
results
ID
group
2013-01-01 A 2
B 1
2013-02-01 A 2
B 2
2013-03-01 A 2
B 2
2013-04-01 A 2
B 3
2014-01-01 A 2
B 3
2014-03-01 A 2
B 1
2014-04-01 A 3
B 2
2014-05-01 A 3
B 2
is that any faster?
p.s. if df['Period'] is not a datetime:
df['Period'] = pd.to_datetime(df['Period'],format='%Y%m%d', errors='ignore')
Here is a solution using groupby and rolling. Note: your desired output counts a year from YYYY0101 to the next year's YYYY0101, so you need a rolling window of 366D instead of 365D:
import numpy as np

df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
df = df.set_index('Period')
df_final = (df.groupby('group')['ID'].rolling(window='366D')
              .apply(lambda x: np.unique(x).size, raw=True)
              .reset_index(name='ID_count')
              .drop_duplicates(['group', 'Period'], keep='last'))
Out[218]:
group Period ID_count
1 A 2013-01-01 2.0
2 A 2013-03-01 2.0
3 A 2014-01-01 2.0
4 A 2014-03-01 2.0
5 A 2014-04-01 3.0
6 B 2013-01-01 1.0
7 B 2013-02-01 2.0
8 B 2013-04-01 3.0
9 B 2014-04-01 2.0
10 B 2014-05-01 2.0
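To match the column name and order asked for in the question, one option is to rename afterwards (a small follow-up sketch):
df_final = (df_final.rename(columns={'ID_count': 'num_ids_last_year'})
                    [['Period', 'group', 'num_ids_last_year']])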
Note: on 18M+ rows, I don't think this solution will finish within 10 minutes; I expect it to take about 30 minutes.
from dateutil.relativedelta import relativedelta
df.sort_values(by=['Period'], inplace=True) # if not already sorted
# create new output df
df1 = (df.groupby(['Period', 'group'])['ID']
         .apply(lambda x: list(x))
         .reset_index())
df1['num_ids_last_year'] = df1.apply(
    lambda x: len(set(df1.loc[(df1['Period'] >= x['Period'] - relativedelta(years=1))
                              & (df1['Period'] <= x['Period'])
                              & (df1['group'] == x['group'])].ID.apply(pd.Series).stack())),
    axis=1)
df1.sort_values(by=['group'], inplace=True)
df1.drop('ID', axis=1, inplace=True)
df1 = df1.reset_index(drop=True)

Filtering a DataFrame for data of id's of which the values are decreasing over time

I have a large time series dataset of patient results. A single patient has one ID with various result values. The data is sorted by date and ID. I want to look only at patients whose values are strictly descending over time. For example, a patient with result values 5, 3, 2, 1 would be true; however, 5, 3, 6, 7, 1 would be false.
Example data:
import pandas as pd
df = pd.read_excel(...)
print(df.head())
PSA PSAdate PatientID ... datefirstinject ADTkey RT_PSAbin
0 2.40 2007-06-26 11448 ... 2006-08-05 00:00:00 1 14
1 0.04 2007-09-26 11448 ... 2006-08-05 00:00:00 1 15
2 2.30 2008-01-14 11448 ... 2006-08-05 00:00:00 1 17
3 4.03 2008-04-16 11448 ... 2006-08-05 00:00:00 1 18
4 6.70 2008-07-01 11448 ... 2006-08-05 00:00:00 1 19
So for this example, I want to only see lines with PatientIDs for which the PSA Value is decreasing over time.
groupID = df.groupby('PatientID')

def is_desc(d):
    for i in range(len(d) - 1):
        if d[i] > d[i+1]:
            return False
    return True

x = groupID.PSA.apply(is_desc)
df['is_desc'] = groupID.PSA.transform(is_desc)
# patients whose PSA values are decreasing over time
df1 = df[df['is_desc']]
I get:
KeyError: 0
I suppose the loop can't make its way through the grouped values, as it requires an array to find the 'range'.
Any ideas for editing the loop?
TL;DR
# (see is_desc function definition below)
df['is_desc'] = df.groupby('PatientID').PSA.transform(is_desc)
df[df['is_desc']]
Explanation
Let's use a very simple data set:
df = pd.DataFrame({'id': [1,2,1,3,3,1], 'res': [3,1,2,1,5,1]})
It only contains the id and one value column (and it has an index automatically assigned from pandas).
So if you just want to get a list of all ids whose values are descending, we can group the values by the id, then check if the values in the group are descending, then filter the list for just ids with descending values.
So first let's define a function that checks if the values are descending:
def is_desc(d):
    first = True
    for i in d:
        if first:
            first = False
        else:
            if i >= last:
                return False
        last = i
    return True
(yes, this could probably be done more elegantly, you can search online for a better implementation)
now we group by the id:
gb = df.groupby('id')
and apply the function:
x = gb.res.apply(is_desc)
x now holds this Series:
id
1 True
2 True
3 False
dtype: bool
so now if you want to filter this you can just do this:
x[x].index
which you can of course convert to a normal list like that:
list(x[x].index)
which would give you a list of all ids whose values are descending, in this case:
[1, 2]
But if you also want to keep all the original data for those chosen ids, do it like this:
df['is_desc'] = gb.res.transform(is_desc)
so now df has all the original data it had in the beginning, plus a column that tells for each line whether its id's values are descending:
id res is_desc
0 1 3 True
1 2 1 True
2 1 2 True
3 3 1 False
4 3 5 False
5 1 1 True
Now you can very easily filter this like that:
df[df['is_desc']]
which is:
id res is_desc
0 1 3 True
1 2 1 True
2 1 2 True
5 1 1 True
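As a side note, a more compact sketch uses the built-in is_monotonic_decreasing property (assuming, as in the question, that rows are already in time order within each id; the built-in check is non-strict, so a strictly decreasing check also needs uniqueness):
# non-strict: allows equal consecutive values
df['is_desc'] = df.groupby('id')['res'].transform(lambda s: s.is_monotonic_decreasing)
# strictly decreasing: non-increasing and no repeated values
df['is_strict_desc'] = df.groupby('id')['res'].transform(
    lambda s: s.is_monotonic_decreasing and s.is_unique)
df[df['is_desc']]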
Selecting and sorting your data is quite easy and objective. However, deciding whether or not a patient's data is declining can be subjective, so it is best to decide on a criterion beforehand for what counts as declining.
To sort and select:
import pandas as pd

data = [['pat_1', 10, 1],
        ['pat_1', 9, 2],
        ['pat_2', 11, 2],
        ['pat_1', 4, 5],
        ['pat_1', 2, 6],
        ['pat_2', 10, 1],
        ['pat_1', 7, 3],
        ['pat_1', 5, 4],
        ['pat_2', 20, 3]]
df = pd.DataFrame(data).rename(columns={0: 'Patient', 1: 'Result', 2: 'Day'})
print(df)
df_pat1 = df[df['Patient'] == 'pat_1']
print(df_pat1)
df_pat1_sorted = df_pat1.sort_values(['Day']).reset_index(drop=True)
print(df_pat1_sorted)
returns:
df:
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
2 pat_2 11 2
3 pat_1 4 5
4 pat_1 2 6
5 pat_2 10 1
6 pat_1 7 3
7 pat_1 5 4
8 pat_2 20 3
df_pat1
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
3 pat_1 4 5
4 pat_1 2 6
6 pat_1 7 3
7 pat_1 5 4
df_pat1_sorted
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
2 pat_1 7 3
3 pat_1 5 4
4 pat_1 4 5
5 pat_1 2 6
For the purposes of this answer, I am going to say that if the first value of the new DataFrame is larger than the last, then their values are declining:
if df_pat1_sorted['Result'].values[0] > df_pat1_sorted['Result'].values[-1]:
    print("Patient 1's values are declining")
This returns:
Patient 1's values are declining
There is a better way to iterate through your patients if you have many unique IDs (as I'm sure you do). I shall present an example using integers; however, you may need to use regex if your patient IDs include characters.
import pandas as pd
import numpy as np
min_ID = 1003
max_ID = 1005
patients = np.random.randint(min_ID, max_ID, size=10)
df = pd.DataFrame(patients).rename(columns={0:'Patients'})
print(df)
s = pd.Series(df['Patients']).unique()
print(s)
for i in range(len(s)):
    print(df[df['Patients'] == s[i]])
returns:
Patients
0 1004
1 1004
2 1004
3 1003
4 1003
5 1003
6 1003
7 1004
8 1003
9 1003
[1004 1003] # s (the unique values in the df['Patients'])
Patients
3 1003
4 1003
5 1003
6 1003
8 1003
9 1003
Patients
0 1004
1 1004
2 1004
7 1004
I hope this has helped!
This should solve your question, interpreting 'decreasing' as monotonic decreasing:
import pandas as pd
d = {"PatientID": [1,1,1,1,2,2,2,2],
"PSAdate": [2010,2011,2012,2013,2010,2011,2012,2013],
"PSA": [5,3,2,1,5,3,4,5]}
# Sorts by id and date
df = pd.DataFrame(data=d).sort_values(['PatientID', 'PSAdate'])
# Computes change and max(change) between sequential PSA's
df["change"] = df.groupby('PatientID')["PSA"].diff()
df["max_change"] = df.groupby('PatientID')['change'].transform('max')
# Considers only patients whose PSA are monotonic decreasing
df = df.loc[df["max_change"] <= 0]
print(df)
PatientID PSAdate PSA change max_change
0 1 2010 5 NaN -1.0
1 1 2011 3 -2.0 -1.0
2 1 2012 2 -1.0 -1.0
3 1 2013 1 -1.0 -1.0
Note: to consider only strictly monotonic decreasing PSA, change the final loc condition to < 0
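For example, the strictly decreasing variant is simply:
df = df.loc[df["max_change"] < 0]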

Pandas: per individual, find number of records that are near the current observation. Apply vs transform

Suppose I have several records for each person, each with a certain date. I want to construct a column that indicates, per person, the number of other records that are less than 2 months old. That is, I focus just on the records of, say, individual 'A', and I loop over his/her records to see whether there are other records of individual 'A' that are less than two months old (compared to the current row/record).
Let's see some test data to make it clearer
import pandas as pd
testdf = pd.DataFrame({
    'id_indiv': [1, 1, 1, 2, 2, 2],
    'id_record': [12, 13, 14, 19, 20, 23],
    'date': ['2017-04-28', '2017-04-05', '2017-08-05',
             '2016-02-01', '2016-02-05', '2017-10-05']})
testdf.date = pd.to_datetime(testdf.date)
I'll add the expected column of counts
testdf['expected'] = [1, 0, 0, 0, 1, 0]
#Gives:
date id_indiv id_record expected
0 2017-04-28 1 12 1
1 2017-04-05 1 13 0
2 2017-08-05 1 14 0
3 2016-02-01 2 19 0
4 2016-02-05 2 20 1
5 2017-10-05 2 23 0
My first thought was to group by id_indiv, then use apply or transform with a custom function. To make things easier, I'll first add a variable that subtracts two months from the record date, and then I'll write the count_months custom function for the apply or transform:
import numpy as np

testdf['2M_before'] = testdf['date'] - pd.Timedelta('{0}D'.format(30*2))

def count_months(chunk, month_var='2M_before'):
    counts = np.empty(len(chunk))
    for i, (ind, row) in enumerate(chunk.iterrows()):
        # Count records that fall within the two months before
        # the current record (strictly older than the current one)
        counts[i] = ((chunk.date > row[month_var])
                     & (chunk.date < row.date)).sum()
    return counts
I tried first with transform:
testdf.groupby('id_indiv').transform(count_months)
but it gives an AttributeError: ("'Series' object has no attribute 'iterrows'", 'occurred at index date') which I guess means that transform passes a Series object to the custom function, but I don't know how to fix that.
Then I tried with apply
testdf.groupby('id_indiv').apply(count_months)
#Gives
id_indiv
1 [1.0, 0.0, 0.0]
2 [0.0, 1.0, 0.0]
dtype: object
This almost works, but it gives the result as a list. To "unstack" that list, I followed an answer on this question:
# First sort, just in case the order gets messed up when pasting back:
testdf = testdf.sort_values(['id_indiv', 'id_record'])
counts = (testdf.groupby('id_indiv').apply(count_months)
                .apply(pd.Series).stack()
                .reset_index(level=1, drop=True))
# Now create the new column
testdf.set_index('id_indiv', inplace=True)
testdf['mycount'] = counts.astype('int')
assert (testdf.expected == testdf.mycount).all()
# df now looks like this
date id_record expected 2M_before mycount
id_indiv
1 2017-04-28 12 1 2017-02-27 1
1 2017-04-05 13 0 2017-02-04 0
1 2017-08-05 14 0 2017-06-06 0
2 2016-02-01 19 0 2015-12-03 0
2 2016-02-05 20 1 2015-12-07 1
2 2017-10-05 23 0 2017-08-06 0
This seems to work, but it seems like there should be a much easier way (maybe using transform?). Besides, pasting back the column like that doesn't seem very robust.
Thanks for your time!
Edited to count recent records per person
Here's one way to count, for each record, the other records of the same person that are less than 2 months old, using a lookback window of exactly two calendar months minus 1 day (as opposed to an approximate 2-month window of 60 days or so).
# imports and setup
import pandas as pd
testdf = pd.DataFrame({
    'id_indiv': [1, 1, 1, 2, 2, 2],
    'id_record': [12, 13, 14, 19, 20, 23],
    'date': ['2017-04-28', '2017-04-05', '2017-08-05',
             '2016-02-01', '2016-02-05', '2017-10-05']})
# more setup
testdf['date'] = pd.to_datetime(testdf['date'])
testdf.set_index('date', inplace=True)
testdf.sort_index(inplace=True) # required for the index-slicing below
# solution
count_recent_records = lambda x: [x.loc[d - pd.DateOffset(months=2, days=-1):d].count() - 1 for d in x.index]
testdf['mycount'] = testdf.groupby('id_indiv').transform(count_recent_records)
# output
testdf
id_indiv id_record mycount
date
2016-02-01 2 19 0
2016-02-05 2 20 1
2017-04-05 1 13 0
2017-04-28 1 12 1
2017-08-05 1 14 0
2017-10-05 2 23 0
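If the original row order matters afterwards, one option (a small sketch, assuming id_record reflects the original order) is to move the date index back into a column and sort by id_record:
testdf = testdf.reset_index().sort_values('id_record').reset_index(drop=True)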
testdf = testdf.sort_values('date')
out_df = pd.DataFrame()
for i in testdf.id_indiv.unique():
    for d in testdf.date:
        date_diff = (d - testdf.loc[testdf.id_indiv == i, 'date']).dt.days
        out_dict = {'person': i,
                    'entry_date': d,
                    'count': sum((date_diff > 0) & (date_diff <= 60))}
        out_df = out_df.append(out_dict, ignore_index=True)
out_df
count entry_date person
0 0.0 2016-02-01 2.0
1 1.0 2016-02-05 2.0
2 0.0 2017-04-05 2.0
3 0.0 2017-04-28 2.0
4 0.0 2017-08-05 2.0
5 0.0 2017-10-05 2.0
6 0.0 2016-02-01 1.0
7 0.0 2016-02-05 1.0
8 0.0 2017-04-05 1.0
9 1.0 2017-04-28 1.0
10 0.0 2017-08-05 1.0
11 0.0 2017-10-05 1.0
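A small caveat: DataFrame.append was removed in pandas 2.0, so on recent versions the loop above needs to collect the rows in a plain list and build the frame once at the end; a sketch of the same loop rewritten that way:
rows = []
for i in testdf.id_indiv.unique():
    for d in testdf.date:
        date_diff = (d - testdf.loc[testdf.id_indiv == i, 'date']).dt.days
        rows.append({'person': i,
                     'entry_date': d,
                     'count': ((date_diff > 0) & (date_diff <= 60)).sum()})
out_df = pd.DataFrame(rows)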
