Finding the maximum value in a group with differentiation - python

I have a Pandas DataFrame that looks like this:
index  ID  value_1  value_2
0      1   200      126
1      1   200      127
2      1   200      128.1
3      1   200      125.7
4      2   300.1    85
5      2   289.4    0
6      2   0        76.9
7      2   199.7    0
My aim is to find, for each ID group (1 and 2 in this example), the row that has the maximum value in the value_1 column. If several rows in a group share that maximum, the row with the largest value in column value_2 should be taken.
So the target table should look like this:
index  ID  value_1  value_2
0      1   200      128.1
1      2   300.1    85

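Use DataFrame.sort_values on all 3 columns (ID ascending, value_1 and value_2 descending) so that the wanted row comes first in each group, then keep the first row per ID with DataFrame.drop_duplicates:
df1 = (df.sort_values(['ID', 'value_1', 'value_2'], ascending=[True, False, False])
         .drop_duplicates('ID'))
print (df1)
   ID  value_1  value_2
2   1    200.0    128.1
4   2    300.1     85.0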
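For ease of replication, the following is a minimal self-contained sketch of the approach above. The DataFrame is rebuilt from the table in the question; the name best and the final reset_index call (used to get the 0, 1 index shown in the target table) are additions for illustration.

```python
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "value_1": [200, 200, 200, 200, 300.1, 289.4, 0, 199.7],
    "value_2": [126, 127, 128.1, 125.7, 85, 0, 76.9, 0],
})

# Within each ID, put the largest value_1 first and break ties by the
# largest value_2; drop_duplicates then keeps that first row per ID.
best = (
    df.sort_values(["ID", "value_1", "value_2"], ascending=[True, False, False])
      .drop_duplicates("ID")
      .reset_index(drop=True)  # assumption: a fresh 0..n-1 index is wanted
)
print(best)
```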

Related

Pandas cumsum with keys

I have two DataFrames (first, second):
index_first  value_1  value_2
0            100      1
1            200      2
2            300      3
index_second  value_1  value_2
0             50       10
1             100      20
2             150      30
Next I concat the two DataFrames with keys:
z = pd.concat([first, second],keys=['x','y'])
My goal is to calculate the cumulative sum of value_1 and value_2 in z considering the keys.
So the final DataFrame should look like this:
index_z  value_1  value_2
x,0      100      1
x,1      300      3
x,2      600      6
y,0      50       10
y,1      150      30
y,2      300      60
Use GroupBy.cumsum, grouping by the first index level, which is created by the keys passed to concat:
df = z.groupby(level=0).cumsum()
print (df)
               value_1  value_2
  index_first
x 0                100        1
  1                300        3
  2                600        6
y 0                 50       10
  1                150       30
  2                300       60
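A runnable sketch of the same idea, with first and second rebuilt from the tables in the question (the index name index_first is left out, and result is a name chosen here):

```python
import pandas as pd

# Sample frames rebuilt from the question
first = pd.DataFrame({"value_1": [100, 200, 300], "value_2": [1, 2, 3]})
second = pd.DataFrame({"value_1": [50, 100, 150], "value_2": [10, 20, 30]})

# keys=... adds an outer index level ('x' / 'y') identifying the source frame
z = pd.concat([first, second], keys=["x", "y"])

# level=0 groups by that outer level, so the running sum restarts for 'y'
result = z.groupby(level=0).cumsum()
print(result)
```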

python pandas adding 2 dataframe with specific column

I have two DataFrames. The first one looks like this:
Date id name amount period
2011-06-30 1 A 10000 1
2011-06-30 2 B 10000 1
2011-06-30 3 C 10000 1
And the second one looks like this:
id amount period
1 10000 1
3 10000 0
And the result that I want looks like this:
id amount period
1 20000 2
2 10000 1
3 20000 1
How can I do that in pandas?
Use concat on the two DataFrames (selecting only the shared columns from the first), then group by id and aggregate with sum:
df = pd.concat([df1[['id','amount','period']], df2]).groupby('id', as_index=False).sum()
print (df)
id amount period
0 1 20000 2
1 2 10000 1
2 3 20000 1
EDIT:
If you need to subtract by id instead, set id as the index in both DataFrames and then use DataFrame.sub:
df11 = df1[['id','amount','period']].set_index('id')
df22 = df2.set_index('id')
df3 = df11.sub(df22, fill_value=0).reset_index()
print (df3)
id amount period
0 1 0.0 0.0
1 2 10000.0 1.0
2 3 0.0 1.0
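Both variants can be reproduced with the sketch below. The input frames are rebuilt from the tables in the question; the names cols, total and diff are chosen here for illustration.

```python
import pandas as pd

# Sample frames rebuilt from the question
df1 = pd.DataFrame({
    "Date": ["2011-06-30"] * 3,
    "id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "amount": [10000, 10000, 10000],
    "period": [1, 1, 1],
})
df2 = pd.DataFrame({"id": [1, 3], "amount": [10000, 10000], "period": [1, 0]})

cols = ["id", "amount", "period"]

# Addition: stack the rows, then sum amount and period per id
total = pd.concat([df1[cols], df2]).groupby("id", as_index=False).sum()

# Subtraction: align on id; fill_value=0 treats ids missing from df2 as zeros
diff = (
    df1[cols].set_index("id")
    .sub(df2.set_index("id"), fill_value=0)
    .reset_index()
)
print(total)
print(diff)
```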

Get consecutive occurrences of an event by group in pandas

I'm working with a DataFrame that has id, wage and date, like this:
id wage date
1 100 201212
1 100 201301
1 0 201302
1 0 201303
1 120 201304
1 0 201305
.
2 0 201302
2 0 201303
And I want to create an n_months_no_income column that counts how many consecutive months a given individual has had wage == 0, like this:
id wage date n_months_no_income
1 100 201212 0
1 100 201301 0
1 0 201302 1
1 0 201303 2
1 120 201304 0
1 0 201305 1
. .
2 0 201302 1
2 0 201303 2
I feel it's some sort of mix between groupby('id') , cumcount(), maybe diff() or apply() and then a fillna(0), but I'm not finding the right one.
Do you have any ideas?
Here's an example for the dataframe for ease of replication:
df = pd.DataFrame({'id':[1,1,1,1,1,1,2,2],'wage':[100,100,0,0,120,0,0,0],
                   'date':[201212,201301,201302,201303,201304,201305,201302,201303]})
Edit: Added code for ease of use.
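In your case, use two groupby calls with cumcount, creating the additional grouping key with cumsum. Note that cumcount starts at 0 in every block, so a group whose first rows already have wage 0 (id 2 here) is counted from 0 rather than 1: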
df.groupby('id').wage.apply(lambda x : x.groupby(x.ne(0).cumsum()).cumcount())
Out[333]:
0 0
1 0
2 1
3 2
4 0
5 1
Name: wage, dtype: int64
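A variant that also handles groups starting with a zero wage (such as id 2 in the sample) is sketched below. It is a different technique from the nested groupby above: it takes a cumulative sum of a 0/1 flag instead of cumcount. It assumes the rows are already sorted by id and date; is_zero and block are names chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 1, 2, 2],
    "wage": [100, 100, 0, 0, 120, 0, 0, 0],
    "date": [201212, 201301, 201302, 201303, 201304, 201305, 201302, 201303],
})

# True for months without income
is_zero = df["wage"].eq(0)

# Every non-zero wage starts a new block; grouping by (id, block) keeps
# individuals apart even when one id's zero run follows another's.
block = (~is_zero).cumsum()

# Running count of zero-wage months inside each block
df["n_months_no_income"] = is_zero.astype(int).groupby([df["id"], block]).cumsum()
print(df)
```

Because the zero flag itself is summed, a non-zero month contributes 0 and each following zero month adds 1, whether or not the run is preceded by a non-zero row.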

Iterating Through Pandas Dataframe to Calculate based on Conditions

For the DataFrame below, I need to create a new column 'unit_count' which is 'unit'/'count' for each year and month. However, because each year and month is not unique, for each entry, I only want to use the count for a given month from the B option.
key UID count month option unit year
0 1 100 1 A 10 2015
1 1 200 1 B 20 2015
2 1 300 2 A 30 2015
3 1 400 2 B 40 2015
Essentially, I need a function that does the following:
unit_count = df['unit'] / df['count']
for each value of unit, but using only the 'count' value of option 'B' in that given 'month'.
The end result would look like the table below, where unit_count is the number of units divided by the count of option 'B' for the given month.
key UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.05
1 1 200 1 B 20 2015 0.10
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.10
Here is the code I used to create the original DataFrame:
df = pd.DataFrame({'UID':[1,1,1,1],
                   'year':[2015,2015,2015,2015],
                   'month':[1,1,2,2],
                   'option':['A','B','A','B'],
                   'unit':[10,20,30,40],
                   'count':[100,200,300,400]
                   })
You can first fill unit_count with the count where option is B (leaving NaN elsewhere), then back-fill the NaN values and divide unit by the result:
Notice: the DataFrame has to be sorted by year, month and option first, so that the B row is the last row of each group.
#if necessary in real data
#df.sort_values(['year','month', 'option'], inplace=True)
df['unit_count'] = df.loc[df.option=='B', 'count']
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 NaN
1 1 200 1 B 20 2015 200.0
2 1 300 2 A 30 2015 NaN
3 1 400 2 B 40 2015 400.0
df['unit_count'] = df.unit.div(df['unit_count'].bfill())
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.050
1 1 200 1 B 20 2015 0.100
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.100
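If sorting the real data is inconvenient, a merge-based variant avoids relying on row order. This is a sketch of an alternative technique, not the back-fill method above; it assumes there is exactly one B row per year and month, and the names b_counts, count_B and out are chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    "UID": [1, 1, 1, 1],
    "year": [2015, 2015, 2015, 2015],
    "month": [1, 1, 2, 2],
    "option": ["A", "B", "A", "B"],
    "unit": [10, 20, 30, 40],
    "count": [100, 200, 300, 400],
})

# One row per (year, month) holding the count of option B
# (assumption: each year/month has exactly one B row)
b_counts = (
    df.loc[df["option"] == "B", ["year", "month", "count"]]
      .rename(columns={"count": "count_B"})
)

# Attach the B count to every row of the same year and month, then divide
out = df.merge(b_counts, on=["year", "month"], how="left")
out["unit_count"] = out["unit"] / out["count_B"]
print(out)
```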

iterate over unique values in PANDAS

I have a dataset in the following format:
Patient Date colA colB
1 1/3/2015 . 5
1 2/5/2015 3 10
1 3/5/2016 8 .
2 4/5/2014 2 .
2 etc
I am trying to define a function in pandas which treats each unique patient as an item and iterates over these patients to keep only the most recent observation per column (replacing all other values with missing or null). For example, for patient 1 the output would be:
Patient Date colA colB
1 1/3/2015 . .
1 2/5/2015 . 10
1 3/5/2016 8 .
I understand that I can use something like the following with .apply(), but this does not account for duplicate patient IDs...
def getrecentobs():
    for i in df['Patient']:
        etc
Any help or direction is much appreciated.
pandas has a function called last which can be used with groupby to give you the last values for each group. I'm not sure why you require the blank rows, but if you need them you can join the groupby result back onto the original DataFrame. The sort is only there because the dates were not sorted in my sample data.
Example:
DataFrame
id date amount code
0 3107 2010-10-20 136.4004 290
1 3001 2010-10-08 104.1800 290
2 3109 2010-10-08 276.0629 165
3 3001 2010-10-08 -177.9800 290
4 3002 2010-10-08 1871.1094 290
5 3109 2010-10-08 225.7038 155
6 3109 2010-10-08 98.5578 170
7 3107 2010-10-08 231.3949 165
8 3203 2010-10-08 333.6636 290
9 -9100 2010-10-08 3478.7500 290
If previous rows not needed:
b.sort_values("date").groupby(["id","date"]).last().reset_index()
The groupby aggregates the data with last, which returns the last non-null value of each column within each group.
Output only latest rows with values:
id date amount code
0 -9100 2010-10-08 3478.7500 290
1 3001 2010-10-08 -177.9800 290
2 3002 2010-10-08 1871.1094 290
3 3107 2010-10-08 231.3949 165
4 3107 2010-10-20 136.4004 290
5 3109 2010-10-08 98.5578 170
6 3203 2010-10-08 333.6636 290
I think you can use to_numeric to convert the . values to NaN, then create a mask with groupby and rank, and finally apply the mask:
print(df)
Patient Date colA colB
0 1 1/3/2015 . 5
1 1 2/5/2015 3 10
2 1 3/5/2016 8 .
3 2 4/5/2014 2 .
4 2 5/5/2014 4 .
df['colA'] = pd.to_numeric(df['colA'], errors='coerce')
df['colB'] = pd.to_numeric(df['colB'], errors='coerce')
print(df)
Patient Date colA colB
0 1 1/3/2015 NaN 5
1 1 2/5/2015 3 10
2 1 3/5/2016 8 NaN
3 2 4/5/2014 2 NaN
4 2 5/5/2014 4 NaN
print(df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False))
colA colB
0 NaN 2
1 2 1
2 1 NaN
3 2 NaN
4 1 NaN
mask = df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False) == 1
print(mask)
colA colB
0 False False
1 False True
2 True False
3 False False
4 True False
df[['colA','colB']] = df[['colA','colB']][mask]
print(df)
Patient Date colA colB
0 1 1/3/2015 NaN NaN
1 1 2/5/2015 NaN 10
2 1 3/5/2016 8 NaN
3 2 4/5/2014 NaN NaN
4 2 5/5/2014 4 NaN
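Note that rank orders by the values in colA and colB, not by Date, so the mask above keeps the largest value per patient; in this sample that happens to coincide with the most recent one. A date-based variant is sketched below. It assumes the dates are in month/day/year format (the question does not say), and the names cols, valid and last_idx are chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    "Patient": [1, 1, 1, 2, 2],
    "Date": ["1/3/2015", "2/5/2015", "3/5/2016", "4/5/2014", "5/5/2014"],
    "colA": [".", "3", "8", "2", "4"],
    "colB": ["5", "10", ".", ".", "."],
})

# Assumption: dates are month/day/year
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = df.sort_values(["Patient", "Date"])

cols = ["colA", "colB"]
for c in cols:
    # '.' and other non-numeric entries become NaN
    df[c] = pd.to_numeric(df[c], errors="coerce")
    # Index label of the most recent non-missing observation per patient
    valid = df.dropna(subset=[c])
    last_idx = valid.groupby("Patient").tail(1).index
    # Keep only those observations, blank out everything else
    df[c] = df[c].where(df.index.isin(last_idx))

print(df)
```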
I think you are looking for pandas groupby.
For example, df.groupby('Patient').last() will return a DataFrame with the last observation of each patient. If the patients are not sorted by date, you can find the latest record date using the max function.
df.groupby('Patient').last()
Date colA colB
Patient
1 3/5/2016 8 .
2 etc 2 .
You can make your own functions and then call the apply() function of groupby.
