I have a dataset in the following format:
Patient Date colA colB
1 1/3/2015 . 5
1 2/5/2015 3 10
1 3/5/2016 8 .
2 4/5/2014 2 .
2 etc
I am trying to define a function in pandas that treats each unique patient as a group and, within each group, keeps only the most recent observation per column (replacing all other values with missing/null). For example, for patient 1 the output would be:
Patient Date colA colB
1 1/3/2015 . .
1 2/5/2015 . 10
1 3/5/2016 8 .
I understand that I can use something like the following with .apply(), but this does not account for duplicate patient IDs...
def getrecentobs():
    for i in df['Patient']:
        # etc
Any help or direction is much appreciated.
There is a function in pandas called last which can be used with groupby to give you the last values for each group. I'm not sure why you require the blank rows, but if you need them you can join the groupby result back onto the original data frame (see the sketch after the output below). The sort is there because the date was not sorted in my sample data. Hope that helps.
Example:
DataFrame
id date amount code
0 3107 2010-10-20 136.4004 290
1 3001 2010-10-08 104.1800 290
2 3109 2010-10-08 276.0629 165
3 3001 2010-10-08 -177.9800 290
4 3002 2010-10-08 1871.1094 290
5 3109 2010-10-08 225.7038 155
6 3109 2010-10-08 98.5578 170
7 3107 2010-10-08 231.3949 165
8 3203 2010-10-08 333.6636 290
9 -9100 2010-10-08 3478.7500 290
If the previous rows are not needed:
b.sort_values("date").groupby(["id","date"]).last().reset_index()
The groupby aggregates the data with last, i.e. for each group it keeps the last value of each remaining column.
Output only latest rows with values:
id date amount code
0 -9100 2010-10-08 3478.7500 290
1 3001 2010-10-08 -177.9800 290
2 3002 2010-10-08 1871.1094 290
3 3107 2010-10-08 231.3949 165
4 3107 2010-10-20 136.4004 290
5 3109 2010-10-08 98.5578 170
6 3203 2010-10-08 333.6636 290
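If you do need the original rows alongside the latest values, a rough sketch of joining the groupby result back onto the original frame (using the b frame above; here grouping by id only, so there is one latest row per id) might look like:
# one "latest" row per id, after sorting by date
latest = b.sort_values("date").groupby("id").last().reset_index()
# join back onto the original frame; the _latest columns carry the most recent values
merged = b.merge(latest, on="id", suffixes=("", "_latest"))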
I think you can use to_numeric to convert the . values to NaN, then create a mask with groupby and rank, and finally apply the mask:
print(df)
Patient Date colA colB
0 1 1/3/2015 . 5
1 1 2/5/2015 3 10
2 1 3/5/2016 8 .
3 2 4/5/2014 2 .
4 2 5/5/2014 4 .
df['colA'] = pd.to_numeric(df['colA'], errors='coerce')
df['colB'] = pd.to_numeric(df['colB'], errors='coerce')
print(df)
Patient Date colA colB
0 1 1/3/2015 NaN 5
1 1 2/5/2015 3 10
2 1 3/5/2016 8 NaN
3 2 4/5/2014 2 NaN
4 2 5/5/2014 4 NaN
print(df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False))
colA colB
0 NaN 2
1 2 1
2 1 NaN
3 2 NaN
4 1 NaN
mask = df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False) == 1
print(mask)
colA colB
0 False False
1 False True
2 True False
3 False False
4 True False
df[['colA','colB']] = df[['colA','colB']][mask]
print(df)
Patient Date colA colB
0 1 1/3/2015 NaN NaN
1 1 2/5/2015 NaN 10
2 1 3/5/2016 8 NaN
3 2 4/5/2014 NaN NaN
4 2 5/5/2014 4 NaN
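For convenience, here is the same approach gathered into one runnable sketch (it rebuilds the sample frame from the question; note that the rank is taken over the column values, which in this sample are also the most recent ones):
import pandas as pd

df = pd.DataFrame({'Patient': [1, 1, 1, 2, 2],
                   'Date': ['1/3/2015', '2/5/2015', '3/5/2016', '4/5/2014', '5/5/2014'],
                   'colA': ['.', 3, 8, 2, 4],
                   'colB': [5, 10, '.', '.', '.']})

# convert the '.' placeholders to NaN
df[['colA', 'colB']] = df[['colA', 'colB']].apply(pd.to_numeric, errors='coerce')

# keep only the rank-1 value per patient and column, blank out the rest
mask = df.groupby('Patient')[['colA', 'colB']].rank(method='max', ascending=False) == 1
df[['colA', 'colB']] = df[['colA', 'colB']][mask]
print(df)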
I think you are looking for pandas groupby.
For example, df.groupby('Patient').last() will return a DataFrame with the last observation for each patient. If the patients' rows are not sorted by date, you can find the latest record using the max function instead.
df.groupby('Patient').last()
Date colA colB
Patient
1 3/5/2016 8 .
2 etc 2 .
You can make your own functions and then call the apply() function of groupby.
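As a rough illustration of that last point (a sketch only: keep_last_per_column is a name I made up, and it assumes colA/colB have already been converted to numeric with NaN for missing values, e.g. with pd.to_numeric as shown above), a custom function applied per patient group could look like:
def keep_last_per_column(g):
    # within one patient's rows, keep only the last non-missing value
    # in each of colA/colB and blank out the rest
    g = g.copy()
    for col in ['colA', 'colB']:
        g[col] = g[col].where(g.index == g[col].last_valid_index())
    return g

result = df.groupby('Patient', group_keys=False).apply(keep_last_per_column)
print(result)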
Related
How can I drop all rows after there is a change in a value in 1 column by group?
I have a data that looks like:
ID Date CD
0 1 1/1/2015 A
1 1 1/2/2015 A
2 1 1/3/2015 A
3 1 1/4/2015 A
4 1 1/5/2015 B
5 1 1/6/2015 B
6 1 1/7/2015 A
7 1 1/8/2015 A
8 1 1/9/2016 C
9 2 1/2/2015 A
10 2 1/3/2015 A
11 2 1/4/2015 A
12 2 1/5/2015 A
13 2 1/6/2015 A
14 2 1/7/2015 A
I need to drop the last 3 rows for ID 1 because CD goes back to A after it has already been changed. The result I am looking for keeps only the rows before that happens.
Since I am not dropping all duplicates, I couldn't use drop_duplicates, and since I'm not keeping all the "A" rows either, a simple loc filter won't work.
I tried using groupby and cumcount. Any help would be appreciated.
Thank you.
Let's create a boolean mask to identify the first row where the value in the CD column goes back to a value it has already had, then group this mask by ID and use cummax to flag all rows from that point onward:
# rows where CD switches back to a value this ID has already had before
m = df.duplicated(['ID', 'CD']) & (df['CD'] != df['CD'].shift())

# cummax within each ID flags that row and everything after it; keep the rest
df[~m.groupby(df['ID']).cummax()]
ID Date CD
0 1 1/1/2015 A
1 1 1/2/2015 A
2 1 1/3/2015 A
3 1 1/4/2015 A
4 1 1/5/2015 B
5 1 1/6/2015 B
9 2 1/2/2015 A
10 2 1/3/2015 A
11 2 1/4/2015 A
12 2 1/5/2015 A
13 2 1/6/2015 A
14 2 1/7/2015 A
I have a data frame that looks like this:
Identification  Date (day/month/year)    X    Y
           123             01/01/2022  NaN  abc
           123             02/01/2022  200  acb
           123             03/01/2022  200  ary
           124             01/01/2022  200  abc
           124             02/01/2022  NaN  abc
           124             03/01/2022  NaN  NaN
I am trying to create two separate 'change' columns, one for X and one for Y, each keeping a running count of how many times a given element has changed over time. I would like my output to look something like this, where NaN ---> NaN is not counted as a change but NaN ---> some element is counted:
Identification  Date (day/month/year)    X    Y  Change X  Change Y
           123             01/01/2022  NaN  abc         0         0
           123             02/01/2022  200  acb         1         1
           123             03/01/2022  200  ary         1         2
           124             01/01/2022  200  abc         0         0
           124             02/01/2022  NaN  abc         1         0
           124             03/01/2022  NaN  NaN         1         1
Thanks :)
You can use a classical comparison with the previous item (obtained with groupby.shift) combined with a groupby.cumsum. However, NaN never compares equal to NaN, so a NaN -> NaN step would be counted as a change. To overcome this, we can first fillna with a sentinel that is not part of the dataset. Here I chose the object builtin; it could be -1 if your data is strictly positive.
def change(s):
    s = s.fillna(object)
    return (s.ne(s.groupby(df['Identification']).shift())
             .groupby(df['Identification']).cumsum().sub(1)
            )
out = df.join(df[['X', 'Y']].apply(change).add_prefix('Change '))
print(out)
Output:
Identification Date (day/month/year) X Y Change X Change Y
0 123 01/01/2022 NaN abc 0 0
1 123 02/01/2022 200.0 acb 1 1
2 123 03/01/2022 200.0 ary 1 2
3 124 01/01/2022 200.0 abc 0 0
4 124 02/01/2022 NaN abc 1 0
5 124 03/01/2022 NaN NaN 1 1
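If you would rather avoid the sentinel fill, an equivalent sketch (my own variant, not part of the answer above, assuming df and pandas as defined here) treats a step as a change only when the values differ and are not both NaN, and forces the first row of each group to count once so the .sub(1) offset still gives 0:
def change_no_sentinel(s):
    # previous value within each Identification group
    prev = s.groupby(df['Identification']).shift()
    # first row of each group
    first = ~df['Identification'].duplicated()
    # a change: values differ and they are not both missing
    changed = (s.ne(prev) & ~(s.isna() & prev.isna())) | first
    return changed.groupby(df['Identification']).cumsum().sub(1)

out = df.join(df[['X', 'Y']].apply(change_no_sentinel).add_prefix('Change '))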
I have a pandas dataframe with several date columns, and for each row I would like to count how many of those columns hold a date after 2016-12-31. Here is an example:
ID  Bill      Date 1      Date 2      Date 3      Date 4  Bill 2
 4     6  2000-10-04  2000-11-05  1999-12-05  2001-05-04       8
 6     8  2016-05-03  2017-08-09  2018-07-14  2015-09-12      17
12    14  2016-11-16  2017-05-04  2017-07-04  2018-07-04      35
And I would like to get this column:
Count
    0
    2
    3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to filter the Date* columns + .sum(axis=1)
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is convenient when there are many such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
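All three answers assume the Date columns already compare correctly against '2016-12-31', i.e. they are datetime (or ISO-formatted strings). If they are strings in some other format, a hedged sketch of the extra conversion step, reusing df and pd from above, might be:
# assumption: Date 1..Date 4 are plain strings; convert them to datetime first
date_cols = df.filter(like='Date').columns
df[date_cols] = df[date_cols].apply(pd.to_datetime)
df['Count'] = (df[date_cols] > pd.Timestamp('2016-12-31')).sum(axis=1)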
I've got a date-ordered dataframe that can be grouped. What I am attempting to do is groupby a variable (Person), determine the maximum (weight) for each group (person), and then drop all rows that come after (date) the maximum.
Here's an example of the data:
df = pd.DataFrame({'Person': [1,1,1,1,1,2,2,2,2,2], 'Date': ['1/1/2015','2/1/2015','3/1/2015','4/1/2015','5/1/2015','6/1/2011','7/1/2011','8/1/2011','9/1/2011','10/1/2011'], 'MonthNo': [1,2,3,4,5,1,2,3,4,5], 'Weight': [100,110,115,112,108,205,210,211,215,206]})
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
3 4/1/2015 4 1 112
4 5/1/2015 5 1 108
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
9 10/1/2011 5 2 206
Here's what I want the result to look like:
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
I think it's worth noting that there can be disjoint start dates and that the maximum may appear at different times.
My idea was to find the maximum for each group, obtain the MonthNo that maximum occurred in, and then discard any rows with a MonthNo greater than the max-weight MonthNo. So far I've been able to obtain the max by group, but can't get past doing the comparison based on that.
Please let me know if I can edit/provide more information, haven't posted many questions here! Thanks for the help, sorry if my formatting/question isn't clear.
Using idxmax with groupby
df.groupby('Person',sort=False).apply(lambda x : x.reset_index(drop=True).iloc[:x.reset_index(drop=True).Weight.idxmax()+1,:])
Out[131]:
Date MonthNo Person Weight
Person
1 0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
2 0 6/1/2011 1 2 205
1 7/1/2011 2 2 210
2 8/1/2011 3 2 211
3 9/1/2011 4 2 215
You can use groupby.transform with idxmax. The first 2 steps may not be necessary depending on how your dataframe is structured.
# convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
# sort by Person and Date to make index usable for next step
df = df.sort_values(['Person', 'Date']).reset_index(drop=True)
# filter for index less than idxmax transformed by group
df = df[df.index <= df.groupby('Person')['Weight'].transform('idxmax')]
print(df)
Date MonthNo Person Weight
0 2015-01-01 1 1 100
1 2015-02-01 2 1 110
2 2015-03-01 3 1 115
5 2011-06-01 1 2 205
6 2011-07-01 2 2 210
7 2011-08-01 3 2 211
8 2011-09-01 4 2 215
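As another hedged sketch (my own variant, not taken from either answer, assuming df is sorted by Person and Date as above), you can avoid index arithmetic altogether by flagging, per person, the rows that come after the group maximum has been reached:
passed_max = (df.groupby('Person')['Weight']
                .transform(lambda s: s.eq(s.max()).cummax().shift(fill_value=False))
                .astype(bool))
# keep everything up to and including the row with the group's maximum weight
print(df[~passed_max])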
I have two dataframes that I want to compare against each other, adding a True/False column to DF2 (shown below) based on the comparison.
My data resembles:
DF1:
cat sub-cat low high
3 3 1 208 223
4 3 1 224 350
8 4 1 223 244
9 4 1 245 350
13 5 1 232 252
14 5 1 253 350
DF2:
Cat Sub-Cat Rating
0 5 1 246
1 5 2 239
2 8 1 203
3 8 2 218
4 K 1 149
5 K 2 165
6 K 1 171
7 K 2 185
8 K 1 157
9 K 2 171
The desired result is for DF2 to have an additional column containing True or False depending on whether, for the matching cat and sub-cat, the rating falls between low.min() and high.max(), or Null if no match is found to compare against.
I have been going around in circles with this for far too long with no results to speak of.
Thank you in advance for any assistance.
Update:
First row would look something like:
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
As it falls within the min low and the max high.
Example: there are two rows in DF1 for cat = 5 and sub-cat = 1. I need to get the minimum low and the maximum high from those 2 rows and then check whether the rating from row 0 in DF2 falls within that minimum low and maximum high.
join post groupby.agg
d2 = DF2.join(
DF1.groupby(
['cat', 'sub-cat']
).agg(dict(low='min', high='max')),
on=['Cat', 'Sub-Cat']
)
d2
Cat Sub-Cat Rating high low
0 5 1 246 350.0 232.0
1 5 2 239 NaN NaN
2 8 1 203 NaN NaN
3 8 2 218 NaN NaN
4 K 1 149 NaN NaN
5 K 2 165 NaN NaN
6 K 1 171 NaN NaN
7 K 2 185 NaN NaN
8 K 1 157 NaN NaN
9 K 2 171 NaN NaN
assign with .loc
DF2.loc[d2.eval('low <= Rating <= high'), 'In-Spec'] = True
DF2
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
1 5 2 239 NaN
2 8 1 203 NaN
3 8 2 218 NaN
4 K 1 149 NaN
5 K 2 165 NaN
6 K 1 171 NaN
7 K 2 185 NaN
8 K 1 157 NaN
9 K 2 171 NaN
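If you also want matched rows that fall outside the range to read False, leaving only the rows with no match in DF1 as NaN (as the question asks), a small hedged follow-up on the d2 frame from the join above might be:
# rows that found a match in DF1 at all
matched = d2['low'].notna()
# matched but out of range -> False; unmatched rows keep NaN
DF2.loc[matched & ~d2.eval('low <= Rating <= high'), 'In-Spec'] = False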
Adding a new column based on a boolean expression would involve something along the lines of:
temp = boolean code involving inequality
df2['new column name'] = temp
However, I'm not sure I understand: the first row in your DF2 table, for instance, has a rating of 246, which is true for row 13 of DF1 but false for row 14. What would you like it to return?
You can do it like this
df2['In-Spec'] = False
df2.loc[(df2['Rating'] > df1['low']) & (df2['Rating'] < df1['high']), 'In-Spec'] = True
But which rows should be compared with each other? Do you want to compare them by their index or by their cat & sub-cat names?