I have a two column data frame of the form:
Death HEALTH
0 other 0.0
1 other 1.0
2 vascular 0.0
3 other 0.0
4 other 0.0
5 vascular 0.0
6 NaN 0.0
7 NaN 0.0
8 NaN 0.0
9 vascular 1.0
I would like to create a new column following the steps:
wherever appears the value 'other', write a 'No'
wherever appears the NaN, leave it as it is
wherever appears the value 'vascular' in the first column and 1.0 in the second, write 'Yes'
wherever appears the value 'vascular' in the first column and 0.0 in the second, write 'No'
The output should be:
Death HEAlTH New
0 other 0.0 No
1 other 1.0 No
2 vascular 0.0 No
3 other 0.0 No
4 other 0.0 No
5 vascular 0.0 No
6 NaN 0.0 NaN
7 NaN 0.0 NaN
8 NaN 0.0 NaN
9 vascular 1.0 Yes
Is there a pythonic way to achieve this? I'm all lost between loops and conditionals.
You can create conditions for No and Yes and for all another values are created original value in numpy.select:
m1 = df['Death'].eq('other') | (df['Death'].eq('vascular') & df['HEALTH'].eq(0))
m2 = (df['Death'].eq('vascular') & df['HEALTH'].eq(1))
df['new'] = np.select([m1, m2], ['No','Yes'], default=df['Death'])
Another idea is test also missing values and if no match conditions is set original values:
m1 = df['Death'].eq('other') | (df['Death'].eq('vascular') & df['HEALTH'].eq(0))
m2 = (df['Death'].eq('vascular') & df['HEALTH'].eq(1))
m3 = df['Death'].isna()
df['new'] = np.select([m1, m2, m3], ['No','Yes', np.nan], default=df['Death'])
print (df)
print (df)
0 another val 0.0 another val
1 other 1.0 No
2 vascular 0.0 No
3 other 0.0 No
4 other 0.0 No
5 vascular 0.0 No
6 NaN 0.0 NaN
7 NaN 0.0 NaN
8 NaN 0.0 NaN
9 vascular 1.0 Yes
A simple way to do this is to implement your conditional logic using if/else inside a function, and apply this function row-wise to the dataframe.
def function(row):
if row['Death']=='other':
return 'No'
if row['Death']=='vascular':
if row['Health']==1:
return 'Yes'
elif row['Health']==0:
return 'No'
return np.nan
# axis = 1 to apply it row-wise
df['New'] = df.apply(function, axis=1)
It produces the following output as required:
Death Health New
0 other 0 No
1 other 1 No
2 vascular 0 No
3 other 0 No
4 other 0 No
5 vascular 0 No
6 NaN 0 NaN
7 NaN 0 NaN
8 NaN 0 NaN
9 vascular 1 Yes
Related
I have this code that will subtract, for each person (AAC or AAB), timepoint 1 from time point 2 data.
i.e this is the original data:
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 1 2.0 NaN 4.0
1 4 3 2.0 6.0 NaN
2 4 3 NaN 6.0 NaN
3 4 5 2.0 6.0 NaN
This is the code:
import sys
import numpy as np
from sklearn.metrics import auc
import pandas as pd
from numpy import trapz
#read in file
df = pd.DataFrame([[0,1,2,np.nan,4],[4,3,2,6,np.nan],[4,3,np.nan,6,np.nan],[4,5,2,6,np.nan]],columns=['pep_seq','AAC-T01','AAC-T02','AAB-T01','AAB-T02'])
#standardise the data by taking T0 away from each sample
df2 = df.drop(['pep_seq'],axis=1)
df2 = df2.apply(lambda x: x.sub(df2[x.name[:4]+"T01"]))
df2.insert(0,'pep_seq',df['pep_seq'])
print(df)
print(df2)
This is the output (i.e. df2)
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN NaN
1 4 0 -1.0 0.0 NaN
2 4 0 NaN 0.0 NaN
3 4 0 -3.0 0.0 NaN
...but what I actually wanted was to subtract the T01 columns from all the others EXCEPT for when the T01 value is NaN in which case keep the original value, so the desired output was (see the 4.0 in AAB-T02):
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN 4.0
1 4 0 -1.0 0 NaN
2 4 0 NaN 0 NaN
3 4 0 -3.0 0 NaN
Could someone show me where I went wrong? Note that in real life, there are ~100 timepoints per person, not just two.
You can fill the nan to 0 when doing subtraction
df2 = df2.apply(lambda x: x.sub(df2[x.name[:4]+"T01"].fillna(0)))
# ^^^^ Changes here
df2.insert(0,'pep_seq',df['pep_seq'])
print(df2)
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN 4.0
1 4 0 -1.0 0.0 NaN
2 4 0 NaN 0.0 NaN
3 4 0 -3.0 0.0 NaN
I hope that I understand you correctly but numpy.where() should do it for you.
Have a look here: condition based substraction
So I want to add/append data to a specific pandas dataFrame column but without it causing NaN values in the remaining columns
I.e.
DataFrame = pd.DataFrame(columns=["column1", "column2", "column3"])
for i in range():
DataFrame = DataFrame.append({"column1":int(i)}, ignore_index=True)
DataFrame = DataFrame.append({"column2":float(i*2)}, ignore_index=True)
DataFrame = DataFrame.append({"column3":int(i*5)}, ignore_index=True)
print(DataFrame)
This will return:
column1 column2 column3
0 0.0 NaN NaN
1 NaN 0.0 NaN
2 NaN NaN 0.0
3 1.0 NaN NaN
4 NaN 2.0 NaN
5 NaN NaN 5.0
6 2.0 NaN NaN
7 NaN 4.0 NaN
8 NaN NaN 10.0
What we want returned:
column1 column2 column3
0 0.0 0.0 0.0
1 1.0 2.0 5.0
2 2.0 4.0 10.0
I know I can in this case use one .append for all the different columns. But I have some cases where the data to be appended will vary based on multiple conditions. Hence I'd like to know if it's possible to append to single columns in a dataframe without producing NaN values in the remaining columns. So that I can avoid writing hundreds of if else statements.
Or if someone has any good idea regarding how to 'collapse' the NaN values (removing the NaN values without removing the entire row so that if there is a NaN value at index 0 in column 3 and there is a integer 5 at index 1 in the same column the integer 5 gets moved up to index 0)
Happy to hear any ideas.
IIUC for your current example you can try this:
DataFrame[['column2','column3']]=DataFrame[['column2','column3']].bfill()
Output:
column1 column2 column3
0 0.0 0.0 0.0
1 NaN 0.0 0.0
2 NaN 2.0 0.0
3 1.0 2.0 5.0
4 NaN 2.0 5.0
5 NaN 4.0 5.0
6 2.0 4.0 10.0
7 NaN 4.0 10.0
8 NaN 6.0 10.0
9 3.0 6.0 15.0
10 NaN 6.0 15.0
11 NaN 8.0 15.0
12 4.0 8.0 20.0
13 NaN 8.0 20.0
14 NaN NaN 20.0
then remove the NaN :
DataFrame.dropna(inplace=True)
Outpt:
column1 column2 column3
0 0.0 0.0 0.0
3 1.0 2.0 5.0
6 2.0 4.0 10.0
9 3.0 6.0 15.0
12 4.0 8.0 20.0
I'm having a bit of trouble with this. My dataframe looks like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 nan nan
1 nan nan
2 nan 0
2 50 0
2 20 1
2 nan nan
2 nan nan
So, what I need to do is, after the dummy gets value = 1, I need to fill the amount variable with zeroes for each id, like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 0 nan
1 0 nan
2 nan 0
2 50 0
2 20 1
2 0 nan
2 0 nan
I'm guessing I'll need some combination of groupby('id'), fillna(method='ffill'), maybe a .loc or a shift() , but everything I tried has had some problem or is very slow. Any suggestions?
The way I will use
s = df.groupby('id')['dummy'].ffill().eq(1)
df.loc[s&df.dummy.isna(),'amount']=0
You can do this much easier:
data[data['dummy'].isna()]['amount'] = 0
This will select all the rows where dummy is nan and fill the amount column with 0.
IIUC, ffill() and mask the still-nan:
s = df.groupby('id')['amount'].ffill().notnull()
df.loc[df['amount'].isna() & s, 'amount'] = 0
Output:
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
Could you please try following.
df.loc[df['dummy'].isnull(),'amount']=0
df
Output will be as follows.
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
I have a Pandas (0.23.4) DataFrame with several categorical columns.
df = pd.DataFrame(np.random.choice([True, False, np.nan], (6,4)), columns = ['a','b','c','d'])
a b c d
0 NaN 1.0 NaN NaN
1 NaN 1.0 NaN 0.0
2 1.0 NaN 1.0 NaN
3 0.0 NaN 0.0 1.0
4 NaN 1.0 NaN NaN
5 NaN 1.0 0.0 1.0
I have two sets of columns of interest:
cross_cols = ['a', 'b']
type_cols = ['c', 'd']
I would like to get a cross tab of counts of each cross_col variable with each type_col variable (a with c and d, and b with c and d), excluding NaN, all displayed side-by-side. The desired result is:
c d
0.0 1.0 All 0.0 1.0 All
a 0.0 0 0 0 1 1 2
1.0 2 1 3 1 0 1
All 2 1 3 2 1 3
b 0.0 0 0 0 0 1 1
1.0 2 1 3 2 0 2
All 2 1 3 2 1 3
Notice that I am not interested in counts for different combinations of a and b or of c and d, which is what I'm getting by changing the index and columns parameters of pd.crosstab.
Currently I'm using the following code:
cross_rows = []
for col in cross_cols:
cross_rows.append(pd.concat([pd.crosstab(df[col], df[type_var],margins=True) for type_var in type_cols],axis=1,keys = type_cols,sort=True))
results = pd.concat(cross_rows, keys = cross_cols,sort=True)
It gives the following result:
c d
c 0.0 1.0 All 0.0 1.0 All
a 1.0 2.0 1.0 3.0 1 0 1
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 1 1 2
b 1.0 2.0 1.0 3.0 2 0 2
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 0 1 1
The result is fine, but the code is slow and a bit ugly. I suspect that there's a faster and more Pythonic approach. Is there a single function call that would get the job done, or another faster solution?
I have a big dataframe with many columns (like 1000). I have a list of columns (generated by a script ~10). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
df[list_of_cols[0]].notnull() |
df[list_of_cols[1]].notnull() |
...
df[list_of_cols[6]].notnull() |
]
I can also iterate over the list of cols and create a mask which then I would apply to df, but his looks too tedious. Knowing how powerful is pandas with respect to dealing with nan, I would expect that there is a way easier way to achieve what I want.
Use the thresh parameter in the dropna() method. By setting thresh=1, you specify that if there is at least 1 non null item, don't drop it.
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
Starting with this:
data = {'a' : [np.nan,0,0,0,0,0,np.nan,0,0, 0,0,0, 9,9,],
'b' : [np.nan,np.nan,1,1,1,1,1,1,1, 2,2,2, 1,7],
'c' : [np.nan,np.nan,1,1,2,2,3,3,3, 1,1,1, 1,1],
'd' : [np.nan,np.nan,7,9,6,9,7,np.nan,6, 6,7,6, 9,6]}
df = pd.DataFrame(data, columns=['a','b','c','d'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are nulls. (Removing row index 0)
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
One can use boolean indexing
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression df[list_of_cols]).all(axis=1) returns a boolean array that is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask for the dataframe df[list_of_cols] with True values for the null elements in df[list_of_cols], False otherwise
all() returns True if all of the elements are True (row-wise axis=1)
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df=pd.DataFrame({'A':[11,22,33,np.NaN],
'B':['x',np.NaN,np.NaN,'w'],
'C':['2016-03-13',np.NaN,'2016-03-14','2016-03-15']})
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
isnull mask:
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1) row-wise:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15