How to get the initial row indexes from df.groupby? - python

I have a DataFrame df; print(df) gives:
   date  value  other_columns
0  1995      5
1  1995     13
2  1995    478
and so on...
After grouping them by date with df1 = df.groupby(by='date')['value'].min()
I wonder how to get the original row's index. In this case I want to get the integer 0, because that row had the lowest value in 1995. Thank you in advance.

You have to create a column holding the original index values before doing the groupby:
df['initialIndex'] = df.index.values
#do the groupby
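One way to finish that approach (a minimal sketch using the question's data; sort_values plus first is just one option) is:
import pandas as pd

df = pd.DataFrame({'date': [1995, 1995, 1995], 'value': [5, 13, 478]})
df['initialIndex'] = df.index.values

# sort so the smallest value comes first within each date, then keep that first row
min_rows = df.sort_values('value').groupby('date', as_index=False).first()
print(min_rows)
#    date  value  initialIndex
# 0  1995      5             0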

I think this is what you mean: you actually want the original DataFrame restricted to the rows holding the minimal value within each group.
For this you can use the pandas transform method:
>>> df = pd.DataFrame({'date' : [1995, 1995, 1995, 2000, 2000, 2000], 'value': [5, 13, 478, 7, 1, 8]})
>>> df
date value
0 1995 5
1 1995 13
2 1995 478
3 2000 7
4 2000 1
5 2000 8
>>> minimal_value = df.groupby(['date'])['value'].transform(min)
>>> minimal_value
0 5
1 5
2 5
3 1
4 1
5 1
Name: value, dtype: int64
Now you can use this to get only the relevant rows:
>>> df.loc[df['value'] == minimal_value]
date value
0 1995 5
4 2000 1
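If you only need the original index labels themselves (0 and 4 here, which is what the question asks for), take the index of that filtered frame:
>>> df.loc[df['value'] == minimal_value].index.tolist()
[0, 4]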

Related

How to calculate the average of specific values in a column in a pandas data frame?

My pandas data frame has 11 columns and 453 rows. I would like to calculate the average of the values in rows 450 to 453 in column 11. I would then like to add this 'average value' as a new column to my dataset.
I can use df['average'] = df['norm'].mean() to get the average of column 11 (here called norm). I'm not sure how to calculate the average of only specific rows within that column, though.
Here you go:
df["average"] = df["norm"][450:].mean()
Demo:
>>> df = pd.DataFrame({"a": [1, 2, 6, 2, 3]})
>>> df
a
0 1
1 2
2 6
3 2
4 3
>>> df["b"] = df["a"][2:].mean()
>>> df
a b
0 1 3.666667
1 2 3.666667
2 6 3.666667
3 2 3.666667
4 3 3.666667
Use loc?
df['average'] = df.loc[450:, 'norm'].mean()
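Note that .loc slices by index label and includes the start label, so 450: picks up every row labelled 450 onward. A quick check with the demo frame above:
>>> df.loc[2:, "a"].mean()
3.6666666666666665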

Drop dataframe rows with values that are an array of NaN

I have a dataframe where, in one column, I've ended up with some values that are not merely "NaN" but an array of NaNs (i.e., "[nan, nan, nan]").
I want to change those values to 0. If it were simply "nan" I would use:
df.fillna(0)
But that doesn't work in this instance.
For instance if:
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Version': [1, 1, 2, 2, 1, 2],
    'Cost': [17, np.nan, 24, [np.nan, np.nan, np.nan], 13, 8]})
Using df1.fillna(0) yields:
ID Version Cost
0 1 1 17
1 2 1 0
2 3 2 24
3 4 2 [nan, nan, nan]
4 5 1 13
5 6 2 8
When I'd like to get the output:
ID Version Cost
0 1 1 17
1 2 1 0
2 3 2 24
3 4 2 0
4 5 1 13
5 6 2 8
In your case the Cost column has dtype object, so you can first convert it to numeric and then fillna.
import pandas as pd
df = pd.DataFrame({"ID":list(range(1,7)),
"Version":[1,1,2,2,1,2],
"Cost": [17,0,24,['nan', 'nan', 'nan'], 13, 8]})
where df.dtypes gives:
ID int64
Version int64
Cost object
dtype: object
So you can convert this column with to_numeric using errors='coerce', which assigns np.nan wherever conversion is not possible.
df["Cost"] = pd.to_numeric(df["Cost"], errors='coerce')\
.fillna(0)
or, if you prefer, in two steps:
df["Cost"] = pd.to_numeric(df["Cost"], errors='coerce')
df["Cost"] = df["Cost"].fillna(0)

Pandas for loop to copy columns to separate dataframe, rename df accordingly

I'm trying to take a DataFrame, iterate over each column starting with the 2nd, and copy the first (constant) column plus each subsequent column, one by one, into a new DataFrame.
df = pd.DataFrame({'Year':[2001 ,2002, 2003, 2004, 2005], 'a': [1,2, 3, 4, 5], 'b': [10,20, 30, 40, 50], 'c': [0.1, 0.2, 0.3, 0.4,0.5]})
df
I want a result similar to what this outputs, but I need it to loop, since I can have up to 40 columns to run logic on.
df_a=pd.DataFrame()
df_a=df[['Year', 'a']].copy()
df_b=df[['Year', 'b']].copy()
df_c=df[['Year', 'c']].copy()
print(df_a)
print(df_b)
print(df_c)
It would also be nice to know how to name each result df_['name of the column it copies']. Thank you so much, and sorry if it's a duplicate.
I'd suggest splitting it through a dict comprehension, then you'll have a dictionary of your separate dataframes. For example:
dict_of_frames = {f'df_{col}':df[['Year', col]] for col in df.columns[1:]}
Gives you a dictionary of df_a, df_b and df_c, which you can access as you would any other dictionary:
>>> dict_of_frames['df_a']
Year a
0 2001 1
1 2002 2
2 2003 3
3 2004 4
4 2005 5
>>> dict_of_frames['df_b']
Year b
0 2001 10
1 2002 20
2 2003 30
3 2004 40
4 2005 50
You can make a dictionary of DataFrames, as below, with the column name as key and the sub-DataFrame as value.
df = df.set_index('Year')
dict_ = {col: df[[col]].reset_index() for col in df.columns}
You can then simply use the column name to access the dictionary and get the corresponding DataFrame.
dict_['a']
Output:
Year a
0 2001 1
1 2002 2
2 2003 3
3 2004 4
4 2005 5
You can iterate over the dict_ by:
for col, df in dict_.items():
    print("-" * 40)  # just for separation
    print(df)        # or print(dict_[col])
Output:
----------------------------------------
Year a
0 2001 1
1 2002 2
2 2003 3
3 2004 4
4 2005 5
----------------------------------------
Year b
0 2001 10
1 2002 20
2 2003 30
3 2004 40
4 2005 50
----------------------------------------
Year c
0 2001 0.1
1 2002 0.2
2 2003 0.3
3 2004 0.4
4 2005 0.5
You don't need to create a dictionary to copy and access the data you require. You can simply copy your dataframe (deep copy if you have mutable elements) and then use indexing to access a particular series:
dfs = df.set_index('Year').copy()
print(dfs['a'])
Year
2001 1
2002 2
2003 3
2004 4
2005 5
Name: a, dtype: int64
You can iterate over your columns via pd.DataFrame.items (formerly iteritems, which newer pandas versions removed):
for key, series in dfs.items():
    print(key, series)
Yes, this gives series, but they can easily be converted to dataframes via series.reset_index() or series.to_frame().
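For example, series.reset_index() turns the 'a' series from dfs above back into a two-column DataFrame (a quick sketch):
>>> dfs['a'].reset_index()
   Year  a
0  2001  1
1  2002  2
2  2003  3
3  2004  4
4  2005  5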

Filtering a DataFrame for IDs whose values are decreasing over time

I have a large time series dataset of patient results. A single patient has one ID with various result values. The data is sorted by date and ID. I want to look only at patients whose values are strictly descending over time. For example, a patient with result values 5, 3, 2, 1 would be true, whereas 5, 3, 6, 7, 1 would be false.
Example data:
import pandas as pd
df = pd.read_excel(...)
print(df.head())
PSA PSAdate PatientID ... datefirstinject ADTkey RT_PSAbin
0 2.40 2007-06-26 11448 ... 2006-08-05 00:00:00 1 14
1 0.04 2007-09-26 11448 ... 2006-08-05 00:00:00 1 15
2 2.30 2008-01-14 11448 ... 2006-08-05 00:00:00 1 17
3 4.03 2008-04-16 11448 ... 2006-08-05 00:00:00 1 18
4 6.70 2008-07-01 11448 ... 2006-08-05 00:00:00 1 19
So for this example, I want to only see lines with PatientIDs for which the PSA Value is decreasing over time.
groupID = df.groupby('PatientID')

def is_desc(d):
    for i in range(len(d) - 1):
        if d[i] > d[i+1]:
            return False
    return True

x = groupID.PSA.apply(is_desc)
df['is_desc'] = groupID.PSA.transform(is_desc)
# patients whose PSA values are decreasing over time
df1 = df[df['is_desc']]
I get:
KeyError: 0
I suppose the loop can't make its way through the grouped values, as it needs an array-like it can index with range. Any ideas for editing the loop?
TL;DR
# (see is_desc function definition below)
df['is_desc'] = df.groupby('PatientID').PSA.transform(is_desc)
df[df['is_desc']]
Explanation
Let's use a very simple data set:
df = pd.DataFrame({'id': [1,2,1,3,3,1], 'res': [3,1,2,1,5,1]})
It only contains the id and one value column (and it has an index automatically assigned from pandas).
So if you just want a list of all ids whose values are descending: group the values by id, check whether each group's values are descending, then filter for only the ids where they are.
So first let's define a function that checks if the values are descending:
def is_desc(d):
    first = True
    for i in d:
        if first:
            first = False
        else:
            if i >= last:
                return False
        last = i
    return True
(yes, this could probably be done more elegantly, you can search online for a better implementation)
now we group by the id:
gb = df.groupby('id')
and apply the function:
x = gb.res.apply(is_desc)
x now holds this Series:
id
1 True
2 True
3 False
dtype: bool
so now if you want to filter this you can just do this:
x[x].index
which you can of course convert to a normal list like that:
list(x[x].index)
which would give you a list of all ids whose values are descending; in this case:
[1, 2]
But if you want to also have all the original data for all those chosen ids do it like this:
df['is_desc'] = gb.res.transform(is_desc)
so now df has all the original data it had in the beginning, plus a column that tells, for each line, whether its id's values are descending:
id res is_desc
0 1 3 True
1 2 1 True
2 1 2 True
3 3 1 False
4 3 5 False
5 1 1 True
Now you can very easily filter this like that:
df[df['is_desc']]
which is:
id res is_desc
0 1 3 True
1 2 1 True
2 1 2 True
5 1 1 True
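As noted above, is_desc could probably be written more elegantly; one possible vectorized sketch (the same strict check, using the same toy frame) would be:
# each consecutive difference within a group must be negative
df['is_desc'] = df.groupby('id')['res'].transform(
    lambda s: s.diff().dropna().lt(0).all())
df[df['is_desc']]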
Selecting and sorting your data is quite easy and objective. However, deciding whether or not a patient's data is declining can be subjective, so it is best to decide on a criterion beforehand for judging whether their data is declining.
To sort and select:
import pandas as pd

data = [['pat_1', 10, 1],
        ['pat_1', 9, 2],
        ['pat_2', 11, 2],
        ['pat_1', 4, 5],
        ['pat_1', 2, 6],
        ['pat_2', 10, 1],
        ['pat_1', 7, 3],
        ['pat_1', 5, 4],
        ['pat_2', 20, 3]]

df = pd.DataFrame(data).rename(columns={0: 'Patient', 1: 'Result', 2: 'Day'})
print(df)

df_pat1 = df[df['Patient'] == 'pat_1']
print(df_pat1)

df_pat1_sorted = df_pat1.sort_values(['Day']).reset_index(drop=True)
print(df_pat1_sorted)
returns:
df:
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
2 pat_2 11 2
3 pat_1 4 5
4 pat_1 2 6
5 pat_2 10 1
6 pat_1 7 3
7 pat_1 5 4
8 pat_2 20 3
df_pat1
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
3 pat_1 4 5
4 pat_1 2 6
6 pat_1 7 3
7 pat_1 5 4
df_pat1_sorted
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
2 pat_1 7 3
3 pat_1 5 4
4 pat_1 4 5
5 pat_1 2 6
For the purposes of this answer, I am going to say that if the first value of the new DataFrame is larger than the last, then their values are declining:
if df_pat1_sorted['Result'].values[0] > df_pat1_sorted['Result'].values[-1]:
    print("Patient 1's values are declining")
This returns:
Patient 1's values are declining
There is a better way of iterating through your patients if you have many unique IDs (as I'm sure you do). I shall present an example using integers; however, you may need to use regex if your patient IDs include characters.
import pandas as pd
import numpy as np

min_ID = 1003
max_ID = 1005
patients = np.random.randint(min_ID, max_ID, size=10)
df = pd.DataFrame(patients).rename(columns={0: 'Patients'})
print(df)

s = pd.Series(df['Patients']).unique()
print(s)

for i in range(len(s)):
    print(df[df['Patients'] == s[i]])
returns:
Patients
0 1004
1 1004
2 1004
3 1003
4 1003
5 1003
6 1003
7 1004
8 1003
9 1003
[1004 1003] # s (the unique values in the df['Patients'])
Patients
3 1003
4 1003
5 1003
6 1003
8 1003
9 1003
Patients
0 1004
1 1004
2 1004
7 1004
I hope this has helped!
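Putting those two pieces together, a sketch that applies the same first-vs-last criterion to every patient in the first toy frame above (the one with Patient/Result/Day columns) might look like:
for pat in df['Patient'].unique():
    # sort each patient's rows by Day, then compare first and last result
    d = df[df['Patient'] == pat].sort_values('Day')
    if d['Result'].values[0] > d['Result'].values[-1]:
        print(pat, "values are declining")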
This should solve your question, interpreting 'decreasing' as monotonic decreasing:
import pandas as pd
d = {"PatientID": [1,1,1,1,2,2,2,2],
"PSAdate": [2010,2011,2012,2013,2010,2011,2012,2013],
"PSA": [5,3,2,1,5,3,4,5]}
# Sorts by id and date
df = pd.DataFrame(data=d).sort_values(['PatientID', 'PSAdate'])
# Computes change and max(change) between sequential PSA's
df["change"] = df.groupby('PatientID')["PSA"].diff()
df["max_change"] = df.groupby('PatientID')['change'].transform('max')
# Considers only patients whose PSA are monotonic decreasing
df = df.loc[df["max_change"] <= 0]
print(df)
PatientID PSAdate PSA change max_change
0 1 2010 5 NaN -1.0
1 1 2011 3 -2.0 -1.0
2 1 2012 2 -1.0 -1.0
3 1 2013 1 -1.0 -1.0
Note: to consider only strictly monotonic decreasing PSA, change the final loc condition to < 0
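That is, for strictly decreasing PSA only:
df = df.loc[df["max_change"] < 0]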

data manipulation example from wide to long in python

I've just posted a similar question here and got an answer, but realised that, after adding a new column to the DataFrame, the presented solution fails because the problem is a bit different.
I want to go from here:
import pandas as pd
df = pd.DataFrame({'ID': [1, 2],
                   'Value_2013': [100, 200],
                   'Value_2014': [245, 300],
                   'Value_2016': [200, float('NaN')]})
print(df)
ID Value_2013 Value_2014 Value_2016
0 1 100 245 200.0
1 2 200 300 NaN
to:
df_new = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                       'Year': [2013, 2014, 2016, 2013, 2014],
                       'Value': [100, 245, 200, 200, 300]})
print(df_new)
ID Value Year
0 1 100 2013
1 1 245 2014
2 1 200 2016
3 2 200 2013
4 2 300 2014
Any ideas how I can tackle this challenge?
The pandas.melt() method gets you halfway there. After that it's just some minor cleaning up.
df = pd.melt(df, id_vars='ID', var_name='Year', value_name='Value')
df['Year'] = df['Year'].map(lambda x: x.split('_')[1])
df = df.dropna().astype(int).sort_values(['ID', 'Year']).reset_index(drop=True)
df = df.reindex(columns=['ID', 'Value', 'Year'])  # reindex_axis in older pandas versions
print(df)
ID Value Year
0 1 100 2013
1 1 245 2014
2 1 200 2016
3 2 200 2013
4 2 300 2014
You need to add set_index first:
df = df.set_index('ID')
df.columns = df.columns.str.split('_', expand=True)
df = df.stack().rename_axis(['ID','Year']).reset_index()
df.Value = df.Value.astype(int)
#if order of columns is important
df = df.reindex(columns=['ID', 'Value', 'Year'])  # reindex_axis in older pandas versions
print (df)
ID Value Year
0 1 100 2013
1 1 245 2014
2 1 200 2016
3 2 200 2013
4 2 300 2014
Leveraging multi-indexing in pandas:
import numpy as np
import pandas as pd
from collections import OrderedDict
df = pd.DataFrame({'ID': [1, 2],
                   'Value_2013': [100, 200],
                   'Value_2014': [245, 300],
                   'Value_2016': [200, float('NaN')]})
# Set ID column as Index
df = df.set_index('ID')
# unstack all columns, swap the levels in the row index
# and convert series to df
df = df.unstack().swaplevel().to_frame().reset_index()
# Rename columns as desired
df.columns = ['ID', 'Year', 'Value']
# Transform the year values from Value_2013 --> 2013 and so on
df['Year'] = df['Year'].apply(lambda x: x.split('_')[1]).astype(int)  # np.int was removed from newer NumPy
# Sort by ID
df = df.sort_values(by='ID').reset_index(drop=True).dropna()
print(df)
ID Year Value
0 1 2013 100.0
1 1 2014 245.0
2 1 2016 200.0
3 2 2013 200.0
4 2 2014 300.0
Another option is pd.wide_to_long(). Admittedly it doesn't give you exactly the same output but you can clean up as needed.
pd.wide_to_long(df, ['Value_',], i='', j='Year')
ID Value_
Year
NaN 2013 1 100
2013 2 200
2014 1 245
2014 2 300
2016 1 200
2016 2 NaN
Yet another soution (two steps):
In [31]: x = df.set_index('ID').stack().astype(int).reset_index(name='Value')
In [32]: x
Out[32]:
ID level_1 Value
0 1 Value_2013 100
1 1 Value_2014 245
2 1 Value_2016 200
3 2 Value_2013 200
4 2 Value_2014 300
In [33]: x = x.assign(Year=x.pop('level_1').str.extract(r'(\d{4})', expand=False))
In [34]: x
Out[34]:
ID Value Year
0 1 100 2013
1 1 245 2014
2 1 200 2016
3 2 200 2013
4 2 300 2014
One option is with the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index="ID",
     names_to=(".value", "Year"),
     names_sep="_",
     sort_by_appearance=True)
 .dropna()
)
ID Year Value
0 1 2013 100.0
1 1 2014 245.0
2 1 2016 200.0
3 2 2013 200.0
4 2 2014 300.0
The .value placeholder keeps the part of the column name associated with it as the header, while the rest goes into the Year column.
