Check for NaN values in some particular column in a dataframe - python

Suppose I have a dataframe:
a b c
0 1 2 NaN
1 2 NaN 4
3 NaN 4 NaN
I want to check for NaN in only some particular columns and want the resulting dataframe to be:
a b c
0 1 2 NaN
3 NaN 4 NaN
Here I want to check for NaN in only column 'a' and column 'c'.
How can this be done?

You could do that with the isnull and any methods:
In [264]: df
Out[264]:
a b c
0 1 2 NaN
1 2 NaN 4
2 NaN 4 NaN
In [265]: df[df.isnull().any(axis=1)]
Out[265]:
a b c
0 1 2 NaN
2 NaN 4 NaN
Note: if you just want the rows without any NaN, you could use the dropna method.
EDIT
If you only want to check some of the columns, you could build a boolean mask from those columns and apply it to the whole dataframe:
df_subset = df[['a', 'c']]
In [282]: df[df_subset.isnull().any(axis=1)]
Out[282]:
a b c
0 1 2 NaN
2 NaN 4 NaN
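A minimal self-contained sketch of the subset approach above, assuming the three-row frame from the answer (the variable names mask, rows_with_nan and rows_without_nan are only illustrative). It also shows the complementary dropna call restricted to the same columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan],
    'b': [2, np.nan, 4],
    'c': [np.nan, 4, np.nan],
})

# True for rows where column 'a' or column 'c' holds a NaN; 'b' is ignored.
mask = df[['a', 'c']].isna().any(axis=1)
rows_with_nan = df[mask]

# The opposite selection: keep only rows where 'a' and 'c' are both present.
rows_without_nan = df.dropna(subset=['a', 'c'])

print(rows_with_nan)
print(rows_without_nan)
```

Use .all(axis=1) instead of .any(axis=1) if a row should only match when every listed column is NaN.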

Related

Create dataframe with hierarchical indices and extra columns from non-hierarchically indexed dataframe

Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
You can do it with MultiIndex.from_product:
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
x.columns = new_cols[::len(extra_level)] # keep every entry that holds the first element of extra_level
x = x.reindex(columns=new_cols)
print(x)
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
Very much like @Ben.T, I am using MultiIndex.from_product:
(x.assign(l='a')
  .set_index('l', append=True)
  .unstack()
  .reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a','b']]), axis=1))
Output:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
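For reference, the from_product idea can be wrapped in a small helper that leaves the original frame untouched. This is only a sketch; add_blank_level and its labels parameter are hypothetical names, not part of pandas:

```python
import numpy as np
import pandas as pd

def add_blank_level(frame, labels=('a', 'b')):
    """Return a copy with a second column level; only labels[0] carries data."""
    labels = list(labels)
    # Full set of target columns: every original column paired with every label.
    new_cols = pd.MultiIndex.from_product([frame.columns, labels])
    out = frame.copy()
    # Relabel the existing columns as (column, first label) ...
    out.columns = pd.MultiIndex.from_product([frame.columns, labels[:1]])
    # ... and let reindex create the remaining columns filled with NaN.
    return out.reindex(columns=new_cols)

x = pd.DataFrame(np.arange(10).reshape(5, 2))
wide = add_blank_level(x)
print(wide)
```

Because the relabelling uses labels[:1] rather than a slice step, it does not depend on how many columns or labels there are.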

How to get previous not NaN value of a pandas DataFrame, without apply, to calculate?

Without using apply (because the dataframe is too big), how can I get the previous non-NaN value of a specific column to use in a calculation?
For example, this dataframe:
df = pd.DataFrame([['A',1,100],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])
A B C
0 A 1 100.0
1 B 2 NaN
2 C 3 NaN
3 D 4 182.0
4 E 5 NaN
I need to calculate the difference in column 'C' between row 3 and row 0.
The number of NaN values between the valid values varies, so .shift() is probably not applicable here (I think).
I need something like: df['D'] = df.C - df.C[previous_not_nan] (in row 3 the result would be 82).
Use dropna + diff:
df['D'] = df['C'].dropna().diff()
A B C D
0 A 1 100.0 NaN
1 B 2 NaN NaN
2 C 3 NaN NaN
3 D 4 182.0 82.0
4 E 5 NaN NaN
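The one-liner works because the assignment aligns on the index: diff runs only over the surviving rows, and the rows that dropna removed come back as NaN. A short sketch of that, plus an equivalent ffill/shift variant (the column name D_alt is just an illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 1, 100], ['B', 2, None], ['C', 3, None], ['D', 4, 182], ['E', 5, None]],
    columns=['A', 'B', 'C'],
)

# diff() is computed on the non-NaN values only (index 0 and 3);
# assigning back aligns on the index, so the dropped rows get NaN.
df['D'] = df['C'].dropna().diff()

# Same result without dropping rows: carry the last valid value forward,
# shift it down one row, and subtract it from the current value.
previous_valid = df['C'].ffill().shift()
df['D_alt'] = df['C'] - previous_valid

print(df)
```

Both variants are vectorised, so neither needs apply.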

Pandas: replace column A with column B if B is not missing

I have a question similar to a previous post. I want to replace missing values in A with B if B is not missing. I've used a toy dataset.
#Create sample dataset
import pandas as pd
import numpy as np
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df
df[df < 0] = 'NaN'
print(df)
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
#Replace NaN in A with B if B is not NaN
df['A'] = np.where(pd.isnull(df['A']) & pd.notnull(df['B']) == 0, df['B']*1, df['A'])
print(df)
obs A B
0 0.478943 0.478943
1 NaN NaN
2 1.39341 1.39341
3 0.281746 0.281746
4 1.24643 1.24643
5 NaN NaN
6 0.228913 0.228913
7 0.886429 0.886429
8 NaN NaN
9 NaN NaN
This code does the job. But why do I need pd.notnull(df['B']) == 0? If I write:
pd.notnull(df['B'])
instead, the code does not work correctly. The output from that is:
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
I'm trying to understand the flaw in my logic. Any other simple intuitive code will be appreciated.
I basically need to do this simple operation for a very large dataset (100m obs+) so looking for a fast way (in terms of computer processing time) to do it. Thanks in advance.
Replace 'NaN' with np.nan and apply fillna on column A using column B:
df = df.replace('NaN', np.nan)
df['A'] = df['A'].fillna(df['B'])
Output:
A B
0 0.478943 0.478943
1 NaN NaN
2 1.965781 1.393406
3 0.092908 0.281746
4 0.769023 1.246435
5 1.007189 NaN
6 0.274992 0.228913
7 1.352917 0.886429
8 NaN NaN
9 1.669025 NaN
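As for the flaw in the original logic, two things appear to be going on. First, 'NaN' is a string, so pd.isnull(df['A']) is False everywhere. Second, & binds more tightly than ==, so the condition is read as (pd.isnull(df['A']) & pd.notnull(df['B'])) == 0, which is True on every row and copies B into A wholesale. A sketch with real NaN values and explicit parentheses, on a small made-up frame (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# The string 'NaN' is ordinary text, so pandas does not treat it as missing.
as_text = pd.Series(['NaN', 1.5], dtype=object)
print(as_text.isna().tolist())

df = pd.DataFrame({
    'A': [np.nan, np.nan, 1.96578, 1.00719],
    'B': [0.478943, np.nan, 1.39341, np.nan],
})

# Name the condition so operator precedence cannot change its meaning.
a_missing_b_present = df['A'].isna() & df['B'].notna()
df['A'] = np.where(a_missing_b_present, df['B'], df['A'])
print(df)
```

With real NaN values, df['A'].fillna(df['B']) gives the same result and is simpler.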

Find observations in which both columns are NaN and replace them with 0 in pandas DataFrame

Here is a dataframe
a b c d
nan nan 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
nan nan 2 3
I want to replace the values in columns 'a' and 'b' with 0 only where both of them are NaN. Rows 1 and 5 have NaN in both 'a' and 'b', so I want to replace only those rows with 0s in those two columns.
so my output must be
a b c d
0 0 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
0 0 2 3
There might be an easier built-in function in pandas, but this one should work.
mask = np.isnan(df.a) & np.isnan(df.b)
df.loc[mask, ['a', 'b']] = df.loc[mask, ['a', 'b']].fillna(0)
Actually, the solution from @Psidom is much easier to read.
You can create a boolean series based on the conditions on columns a/b, and then use loc to modify corresponding columns and rows:
df.loc[df[['a','b']].isnull().all(axis=1), ['a','b']] = 0
df
# a b c d
#0 0.0 0.0 3 5
#1 NaN 1.0 2 3
#2 1.0 NaN 4 5
#3 2.0 3.0 7 9
#4 0.0 0.0 2 3
Or:
df.loc[df.a.isnull() & df.b.isnull(), ['a','b']] = 0
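A self-contained version of the loc approach, rebuilding the frame from the question so it can be run directly (the variable name both_missing is only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [np.nan, np.nan, 1, 2, np.nan],
    'b': [np.nan, 1, np.nan, 3, np.nan],
    'c': [3, 2, 4, 7, 2],
    'd': [5, 3, 5, 9, 3],
})

# True only for rows where 'a' and 'b' are NaN at the same time.
both_missing = df[['a', 'b']].isna().all(axis=1)

# Overwrite just those rows, and just those two columns.
df.loc[both_missing, ['a', 'b']] = 0
print(df)
```

Rows where only one of the two columns is NaN are left as they are.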

Pandas, Using generated values while iterating through rows within grouped data

I'm pretty new to Pandas and programming in general but I've always been able to find the answer to any problem through google until now. Sorry about the not terribly descriptive question, hopefully someone can come up with something clearer.
I'm trying to group data together, perform functions on that data, update a column and then use the data from that column on the next group of data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random(9),columns=['A'])
df['B'] = [1,1,1,2,2,3,3,3,3]
df['C'] = np.nan
df['D'] = np.nan
df.loc[0:2,'C'] = 500
Giving me
A B C D
0 0.825828 1 500.0 NaN
1 0.218618 1 500.0 NaN
2 0.902476 1 500.0 NaN
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
The 500 in column C is the initial condition. I want to group the data by column B and perform the following function on the first group
def function1(row):
    return row['A']*row['C']/6
giving me
A B C D
0 0.825828 1 500.0 68.818971
1 0.218618 1 500.0 18.218145
2 0.902476 1 500.0 75.206313
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then want to sum the first three values in D, add that sum to the last value of C in group 1, and use the result as the C value for group 2
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 NaN
4 0.513505 2 662.243429 NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then perform function1 on group 2 and repeat until I end up with this
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 49.946896
4 0.513505 2 662.243429 56.677505
5 0.089975 3 768.867830 11.529874
6 0.282479 3 768.867830 36.198113
7 0.774286 3 768.867830 99.220591
8 0.408501 3 768.867830 52.347246
The dataframe will consist of hundreds of rows. I've been trying various groupby, apply combinations but I'm completely stumped.
Thanks
Here is a solution:
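df['D'] = df['A'] * df['C']/6
c0 = df['C'].iloc[0]  # initial condition (500)
for i in df['B'].unique()[1:]:
    group = df['B'] == i
    # previous C plus the previous group's D equals c0 plus every D computed so far
    df.loc[group, 'C'] = c0 + df['D'].sum()
    df.loc[group, 'D'] = df['A'] * df['C']/6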
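You can use numpy.unique() for the selection. In your code this might look something like this:
import numpy as np
# start position and size of each group (assumes the rows are sorted by B)
unique, indices, counts = np.unique(df['B'], return_index=True, return_counts=True)
prev_c, prev_d_sum = None, 0.0
for start, count in zip(indices, counts):
    rows = df.index[start:start + count]
    if df.loc[rows, 'C'].isna().all():
        # previous group's C plus the sum of its D values
        df.loc[rows, 'C'] = prev_c + prev_d_sum
    # then call your function on every row of the group
    df.loc[rows, 'D'] = df.loc[rows].apply(function1, axis=1)
    prev_c, prev_d_sum = df.loc[rows[-1], 'C'], df.loc[rows, 'D'].sum()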
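To check either approach against the numbers in the question, here is a rough sketch that iterates over the groups with groupby and carries C forward. It uses the rounded A values printed in the question, so results match only to a few decimals; propagate_groups and its c0 parameter are hypothetical names:

```python
import numpy as np
import pandas as pd

def propagate_groups(frame, c0):
    """Fill C and D group by group, feeding each group's D total into the next C."""
    out = frame.copy()
    out['C'] = np.nan
    out['D'] = np.nan
    c = c0
    # sort=False keeps the groups in order of first appearance.
    for _, group in out.groupby('B', sort=False):
        idx = group.index
        out.loc[idx, 'C'] = c
        out.loc[idx, 'D'] = out.loc[idx, 'A'] * c / 6
        # The next group's C is this group's C plus the sum of this group's D.
        c = c + out.loc[idx, 'D'].sum()
    return out

df = pd.DataFrame({
    'A': [0.825828, 0.218618, 0.902476, 0.452525, 0.513505,
          0.089975, 0.282479, 0.774286, 0.408501],
    'B': [1, 1, 1, 2, 2, 3, 3, 3, 3],
})
result = propagate_groups(df, c0=500)
print(result)
```

The loop runs once per group rather than once per row, so it should stay cheap for a few hundred rows.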
