Column operations in Pandas - python

Say I have a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
I would like to subtract the entries in column df.a from all other columns. In other words, I would like to get a dataframe whose columns are:
|col_b - col_a | col_c - col_a | col_d - col_a|
I have tried df - df.a but this yields something odd:
0 1 2 3 a b c d e
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
How can I do this type of column-wise operation in Pandas? Also, just wondering, what does df - df.a do?

You probably want
>>> df.sub(df.a, axis=0)
a b c d e
0 0 0.112285 0.267105 0.365407 -0.159907
1 0 0.380421 0.119536 0.356203 0.096637
2 0 -0.100310 -0.180927 0.112677 0.260202
3 0 0.653642 0.566408 0.086720 0.256536
df - df.a matches the index of the Series df.a (0, 1, 2, 3) against the column labels of df (a, b, c, d, e), so it is basically trying to do the subtraction along the other axis. When using binary operators like subtraction, "mismatched indices will be unioned together" (as the docs say). Since none of the labels match, you wind up with the columns
0 1 2 3 a b c d e
all filled with NaN.
You could also get to the same destination more indirectly by transposing things:
(df.T - df.a).T. Flipping df means that the default axis is now the right one.
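To make the alignment behaviour concrete, here is a minimal sketch. The seed and the variable names (`misaligned`, `diff`, `others`) are only illustrative and are not part of the original question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame(rng.random((4, 5)), columns=list("abcde"))

# Default alignment: the Series index (0..3) is matched against the
# DataFrame's column labels (a..e). Nothing overlaps, so everything is NaN.
misaligned = df - df["a"]

# axis=0 aligns the Series with the row index instead.
diff = df.sub(df["a"], axis=0)

# Keep only the "other" columns, as asked in the question.
others = diff.drop(columns="a")

# The transposed route gives the same numbers.
via_transpose = (df.T - df["a"]).T
```

Here `others` holds b - a, c - a, d - a and e - a, while `misaligned` has nine all-NaN columns.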

Related

Clean way to rearrange columns that are repeated and have nans in them

I have the following dataframe:
Subject Val1 Val1 Int Val1 Val1 Int2 Val1
A 1 2 3 NaN NaN Sp NaN
B NaN NaN NaN 2 3 NaN NaN
C NaN NaN 4 NaN NaN 0 3
D NaN NaN 3 NaN NaN 8 NaN
I want to end up with only 2 columns named Val1, because there are at most 2 non-NaN Val1 values for any given subject. Namely, the output would look like this:
Subject Val1 Val1 Int Int2
A 1 2 3 Sp
B 2 3 NaN NaN
C 3 NaN 4 0
D NaN NaN 3 8
Is there a function in pandas to do this in a clean way? Clean meaning only a few lines of code. One way would be to iterate through the rows with a for loop and bring all non-NaN values to the left, but I'd like something cleaner and more efficient.
The idea is to group the columns by their (duplicated) names and, within each group, use a lambda function to sort every row so that missing values come last. Columns that contain only missing values can then be removed in the last step:
df = df.set_index('Subject')
f = lambda x: pd.DataFrame(x.apply(sorted, key=pd.isna, axis=1).tolist(), index=x.index)
df = df.groupby(level=0, axis=1).apply(f).dropna(axis=1, how='all').droplevel(1, axis=1)
print (df)
Int Int2 Val1 Val1
Subject
A 3.0 Sp 1.0 2.0
B NaN NaN 2.0 3.0
C 4.0 0 3.0 NaN
D 3.0 8 NaN NaN
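As far as I know, groupby(..., axis=1) is deprecated in recent pandas releases (2.1 and later), so the snippet above may emit a warning or stop working. The following is a sketch of the same idea without a column-wise groupby. The helper name `push_left` is hypothetical, and the input frame is rebuilt by hand from the question's table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["A", 1, 2, 3, np.nan, np.nan, "Sp", np.nan],
     ["B", np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
     ["C", np.nan, np.nan, 4, np.nan, np.nan, 0, 3],
     ["D", np.nan, np.nan, 3, np.nan, np.nan, 8, np.nan]],
    columns=["Subject", "Val1", "Val1", "Int", "Val1", "Val1", "Int2", "Val1"],
).set_index("Subject")

def push_left(block):
    """Move the non-missing values of every row to the left (stable order)."""
    rows = [sorted(row, key=pd.isna) for row in block.to_numpy(dtype=object)]
    return pd.DataFrame(rows, index=block.index, columns=block.columns)

# Handle each distinct column name separately, in order of first appearance.
parts = [push_left(df.loc[:, df.columns == name]) for name in df.columns.unique()]
out = pd.concat(parts, axis=1)

# Drop the columns that ended up entirely empty.
keep = out.notna().any(axis=0).to_numpy()
out = out.loc[:, keep]
```

Unlike the groupby version, this keeps the columns in their original order of appearance (Val1, Val1, Int, Int2).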

Select rows with specific values in columns and include rows with NaN in pandas dataframe

I have a DataFrame df that looks something like this:
df
a b c
0 0.557894 -0.196294 -0.020490
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
5 -0.337374 NaN -0.771888
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
9 -2.345448 2.443669 -1.409422
I want to select the rows that have a value over some value, which I would normally do using:
new_df = df[df['c'] >= .5]
but that will return:
a b c
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
8 0.737413 NaN 0.679575
I want to get those rows, but also keep the rows that have nan values in column 'c'. I haven't been able to find a question asking the same thing, they usually ask for one or the other, but not both. I can hard code the rows that I want to drop since I know the specific values, but I was wondering if there is a better solution. The end result should look something like this:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
Only rows 0, 5 and 9 are dropped, since their values in column 'c' are less than .5.
You should use the | (or) operator to combine the two conditions.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.557894,1.138774,np.nan,-0.069319,1.040089,-0.337374,-1.813278,np.nan,0.737413,-2.345448],
'b': [-0.196294,-0.699224,2.384483,np.nan,-0.271777,np.nan,-1.564666,np.nan,np.nan,2.443669],
'c': [-0.020490,np.nan,0.554292,1.162941,np.nan,-0.771888,np.nan,np.nan,0.679575,-1.409422]})
df = df[(df['c'] >= .5) | (df['c'].isnull())]
print(df)
Output:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
You should be able to do this with
new_df = df[(df['c'] >= .5) | df['c'].isna()]
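Two details matter when building this kind of mask: NaN never compares equal to anything (so a test against the string 'NaN' finds nothing), and Python's `or` cannot combine two Series element-wise. A small sketch using only column 'c' from the question; the names `above`, `keep` and `alt` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c": [-0.020490, np.nan, 0.554292, 1.162941, np.nan,
                         -0.771888, np.nan, np.nan, 0.679575, -1.409422]})

# NaN compares False against everything, so >= alone drops the NaN rows.
above = df["c"] >= 0.5

# Combine element-wise with | and detect missing values with isna().
keep = above | df["c"].isna()
result = df[keep]

# Equivalent mask: negate "strictly below". NaN < 0.5 is False,
# so the NaN rows survive the negation.
alt = df[~(df["c"] < 0.5)]
```

Both `result` and `alt` keep rows 1, 2, 3, 4, 6, 7 and 8.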

Pandas shift values in a column over intervening rows

I have a pandas data frame as shown below. One column has values with intervening NaN cells. The values are to be shifted ahead by one so that each replaces the next value that follows, with the last one being lost. The intervening NaN cells have to remain. I tried using .shift(), but since I never know how many intervening NaN rows there are, it would mean a separate calculation for each shift. Is there a better approach?
IIUC, you can just group by whether the value is NaN and shift within each group.
df['y'] = df.y.groupby(pd.isnull(df.y)).shift()
x y
0 A NaN
1 A NaN
2 A NaN
3 B 5.0
4 B NaN
5 B NaN
6 B NaN
7 C 10.0
8 C NaN
9 C NaN
10 C NaN
Another way:
s = df['y'].notnull()
df.loc[s,'y'] = df.loc[s,'y'].shift()
It would be easier to test if you paste your text data instead of the picture.
Input:
df = pd.DataFrame({'x':list('AAABBBBCCCC'),
'y':[5,np.nan,np.nan,10, np.nan,np.nan,np.nan,
20, np.nan,np.nan,np.nan]})
output:
x y
0 A NaN
1 A NaN
2 A NaN
3 B 5.0
4 B NaN
5 B NaN
6 B NaN
7 C 10.0
8 C NaN
9 C NaN
10 C NaN
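The two approaches above can be checked against each other on the sample input. This is a sketch only; the names `shifted_groupby` and `shifted_mask` are illustrative, and the result is written to a copy instead of overwriting df['y'].

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": list("AAABBBBCCCC"),
                   "y": [5, np.nan, np.nan, 10, np.nan, np.nan, np.nan,
                         20, np.nan, np.nan, np.nan]})

# Approach 1: group rows by "is y missing?" and shift within each group.
shifted_groupby = df["y"].groupby(df["y"].isna()).shift()

# Approach 2: shift only the non-missing entries, leaving NaN rows untouched.
mask = df["y"].notna()
shifted_mask = df["y"].copy()
shifted_mask.loc[mask] = df.loc[mask, "y"].shift()
```

In both results the 5 moves from row 0 to row 3, the 10 moves from row 3 to row 7, and the 20 is lost.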

Pandas: replace column A with column B if B is not missing

I have a question similar to a previous post. I want to replace missing values in A with B if B is not missing. I've used a toy dataset.
#Create sample dataset
import pandas as pd
import numpy as np
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df
df[df < 0] = 'NaN'
print(df)
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
#Replace NaN in A with B if B is not NaN
df['A'] = np.where(pd.isnull(df['A']) & pd.notnull(df['B']) == 0, df['B']*1, df['A'])
print(df)
obs A B
0 0.478943 0.478943
1 NaN NaN
2 1.39341 1.39341
3 0.281746 0.281746
4 1.24643 1.24643
5 NaN NaN
6 0.228913 0.228913
7 0.886429 0.886429
8 NaN NaN
9 NaN NaN
This code does the job. But why do I need pd.notnull(df['B']) == 0? If I write:
pd.notnull(df['B'])
instead, the code does not work correctly. The output from that is:
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
I'm trying to understand the flaw in my logic. Any other simple intuitive code will be appreciated.
I basically need to do this simple operation for a very large dataset (100m obs+) so looking for a fast way (in terms of computer processing time) to do it. Thanks in advance.
Replace the string 'NaN' with np.nan, then apply fillna on column A using column B:
df = df.replace('NaN', np.nan)
df['A'] = df['A'].fillna(df['B'])
Output:
A B
0 0.478943 0.478943
1 NaN NaN
2 1.965781 1.393406
3 0.092908 0.281746
4 0.769023 1.246435
5 1.007189 NaN
6 0.274992 0.228913
7 1.352917 0.886429
8 NaN NaN
9 1.669025 NaN
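As for why the original condition behaved the way it did, two things seem to be at play: the string 'NaN' is an ordinary value rather than a missing one, and `&` binds tighter than `==`. The sketch below uses two small hypothetical Series, `a` and `b`, in place of the question's columns, and uses pd.to_numeric as an alternative to replace.

```python
import numpy as np
import pandas as pd

a = pd.Series(["NaN", 1.0, "NaN"], dtype=object)
b = pd.Series([0.5, 2.0, "NaN"], dtype=object)

# The string 'NaN' is not treated as missing, so isnull() finds nothing.
any_missing = a.isna().any()

# & binds tighter than ==, so the condition as written is parsed as
# (isnull(A) & notnull(B)) == 0.
cond_as_written = pd.isnull(a) & pd.notnull(b) == 0
cond_explicit = (pd.isnull(a) & pd.notnull(b)) == 0
# isnull(A) is False everywhere, so both conditions are True everywhere
# and np.where copies B over every row of A.

# With real NaN values, fillna does what the question asks for.
a_num = pd.to_numeric(a, errors="coerce")
b_num = pd.to_numeric(b, errors="coerce")
filled = a_num.fillna(b_num)
```

Once the columns hold real NaN values, `np.where(a_num.isna() & b_num.notna(), b_num, a_num)` would also work, with no `== 0` needed.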

Check for NaN values in some particular column in a dataframe

Suppose I have a dataframe:
a b c
0 1 2 NaN
1 2 NaN 4
3 NaN 4 NaN
I want to check for NaN in only some particular columns and want the resulting dataframe to be:
a b c
0 1 2 NaN
3 NaN 4 NaN
Here I want to check for NaN in only column 'a' and column 'c'.
How can this be done?
You could do that with the isnull and any methods:
In [264]: df
Out[264]:
a b c
0 1 2 NaN
1 2 NaN 4
2 NaN 4 NaN
In [265]: df[df.isnull().any(axis=1)]
Out[265]:
a b c
0 1 2 NaN
2 NaN 4 NaN
Note: if you just want to drop the rows that contain any NaN, you could use the dropna method.
EDIT
If you only want to check a subset of the columns, you could build the mask from those columns and apply it to the whole dataframe:
df_subset = df[['a', 'c']]
In [282]: df[df_subset.isnull().any(axis=1)]
Out[282]:
a b c
0 1 2 NaN
2 NaN 4 NaN
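A self-contained version of the subset check, with the complementary dropna call added for comparison. The frame is rebuilt from the answer's example, and the names `has_nan` and `complete` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan],
                   "b": [2, np.nan, 4],
                   "c": [np.nan, 4, np.nan]})

# Rows with a NaN in column 'a' or column 'c' (column 'b' is ignored).
has_nan = df[df[["a", "c"]].isna().any(axis=1)]

# The complement: rows where both 'a' and 'c' are present.
complete = df.dropna(subset=["a", "c"])
```

Row 1 has a NaN only in column 'b', so it is excluded from `has_nan` and kept in `complete`.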
