Shift NaNs to the end of their respective rows - python

I have a DataFrame like :
0 1 2
0 0.0 1.0 2.0
1 NaN 1.0 2.0
2 NaN NaN 2.0
What I want to get is
Out[116]:
0 1 2
0 0.0 1.0 2.0
1 1.0 2.0 NaN
2 2.0 NaN NaN
This is my approach as of now.
df.apply(lambda x : (x[x.notnull()].values.tolist()+x[x.isnull()].values.tolist()),1)
Out[117]:
0 1 2
0 0.0 1.0 2.0
1 1.0 2.0 NaN
2 2.0 NaN NaN
Is there any efficient way to achieve this ? apply Here is way to slow .
Thank you for your assistant!:)
My real data size
df.shape
Out[117]: (54812040, 1522)

Here's a NumPy solution using justify -
In [455]: df
Out[455]:
0 1 2
0 0.0 1.0 2.0
1 NaN 1.0 2.0
2 NaN NaN 2.0
In [456]: pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left'))
Out[456]:
0 1 2
0 0.0 1.0 2.0
1 1.0 2.0 NaN
2 2.0 NaN NaN
If you want to save memory, assign it back instead -
df[:] = justify(df.values, invalid_val=np.nan, axis=1, side='left')

Your best easiest option is to use sorted on df.apply/df.transform and sort by nullity.
df = df.apply(lambda x: sorted(x, key=pd.isnull), 1)
df
0 1 2
0 0.0 1.0 2.0
1 1.0 2.0 NaN
2 2.0 NaN NaN
You may also pass np.isnan to the key argument.

Related

Pandas: Fillna with local average if a condition is met

Let's say I have data like this:
df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4], 'col2':[1,3,np.nan,np.nan,5,np.nan,4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace NaN values with the average of the prior and the succeeding value if both of them are not NaN ?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if they are all not NaN) ?
We can shift the dataframe forward and backwards. Then add these together and divide them by two and use that to fillna:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0

pivot dataframe with duplicate values

consider the below pd.DataFrame
temp = pd.DataFrame({'label_0':[1,1,1,2,2,2],'label_1':['a','b','c',np.nan,'c','b'], 'values':[0,2,4,np.nan,8,5]})
print(temp)
label_0 label_1 values
0 1 a 0.0
1 1 b 2.0
2 1 c 4.0
3 2 NaN NaN
4 2 c 8.0
5 2 b 5.0
my desired output is
label_1 1 2
0 a 0.0 NaN
1 b 2.0 5.0
2 c 4.0 8.0
3 NaN NaN NaN
I have tried pd.pivot and wrangling around with pd.gropuby but cannot get to the desired output due to duplicate entries. any help most appreciated.
d = {}
for _0, _1, v in zip(*map(temp.get, temp)):
d.setdefault(_1, {})[_0] = v
pd.DataFrame.from_dict(d, orient='index')
1 2
a 0.0 NaN
b 2.0 5.0
c 4.0 8.0
NaN NaN NaN
OR
pd.DataFrame.from_dict(d, orient='index').rename_axis('label_1').reset_index()
label_1 1 2
0 a 0.0 NaN
1 b 2.0 5.0
2 c 4.0 8.0
3 NaN NaN NaN
Another way is to use set_index and unstack:
temp.set_index(['label_0','label_1'])['values'].unstack(0)
Output:
label_0 1 2
label_1
NaN NaN NaN
a 0.0 NaN
b 2.0 5.0
c 4.0 8.0
You can do fillna then pivot
temp.fillna('NaN').pivot(*temp.columns).T
Out[251]:
label_0 1 2
label_1
NaN NaN NaN
a 0 NaN
b 2 5
c 4 8
Seems like a straightforward pivot works:
temp.pivot(columns='label_0', index='label_1', values='values')
Output:
label_0 1 2
label_1
NaN NaN NaN
a 0.0 NaN
b 2.0 5.0
c 4.0 8.0

Concatenating crosstabs of different variables

I have a Pandas (0.23.4) DataFrame with several categorical columns.
df = pd.DataFrame(np.random.choice([True, False, np.nan], (6,4)), columns = ['a','b','c','d'])
a b c d
0 NaN 1.0 NaN NaN
1 NaN 1.0 NaN 0.0
2 1.0 NaN 1.0 NaN
3 0.0 NaN 0.0 1.0
4 NaN 1.0 NaN NaN
5 NaN 1.0 0.0 1.0
I have two sets of columns of interest:
cross_cols = ['a', 'b']
type_cols = ['c', 'd']
I would like to get a cross tab of counts of each cross_col variable with each type_col variable (a with c and d, and b with c and d), excluding NaN, all displayed side-by-side. The desired result is:
c d
0.0 1.0 All 0.0 1.0 All
a 0.0 0 0 0 1 1 2
1.0 2 1 3 1 0 1
All 2 1 3 2 1 3
b 0.0 0 0 0 0 1 1
1.0 2 1 3 2 0 2
All 2 1 3 2 1 3
Notice that I am not interested in counts for different combinations of a and b or of c and d, which is what I'm getting by changing the index and columns parameters of pd.crosstab.
Currently I'm using the following code:
cross_rows = []
for col in cross_cols:
cross_rows.append(pd.concat([pd.crosstab(df[col], df[type_var],margins=True) for type_var in type_cols],axis=1,keys = type_cols,sort=True))
results = pd.concat(cross_rows, keys = cross_cols,sort=True)
It gives the following result:
c d
c 0.0 1.0 All 0.0 1.0 All
a 1.0 2.0 1.0 3.0 1 0 1
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 1 1 2
b 1.0 2.0 1.0 3.0 2 0 2
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 0 1 1
The result is fine, but the code is slow and a bit ugly. I suspect that there's a faster and more Pythonic approach. Is there a single function call that would get the job done, or another faster solution?

Generate New DataFrame without NaN Values

I've the following Dataframe:
a b c d e
0 NaN 2.0 NaN 4.0 5.0
1 NaN 2.0 3.0 NaN 5.0
2 1.0 NaN 3.0 4.0 NaN
3 1.0 2.0 NaN 4.0 NaN
4 NaN 2.0 NaN 4.0 5.0
What I try to to is to generate a new Dataframe without the NaN values.
There are always the same number of NaN Values in a row.
The final Dataframe should look like this:
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
Does someone know an easy way to do this?
Any help is appreciated.
Using array indexing:
pd.DataFrame(df.values[df.notnull().values].reshape(df.shape[0],3),
columns=list('xyz'),dtype=int)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If the dataframe has more inconsistance values across rows like 1st row with 4 values and from 2nd row if it has 3 values, Then this will do:
a b c d e g
0 NaN 2.0 NaN 4.0 5.0 6.0
1 NaN 2.0 3.0 NaN 5.0 NaN
2 1.0 NaN 3.0 4.0 NaN NaN
3 1.0 2.0 NaN 4.0 NaN NaN
4 NaN 2.0 NaN 4.0 5.0 NaN
pd.DataFrame(df.apply(lambda x: x.values[x.notnull()],axis=1).tolist())
0 1 2 3
0 2.0 4.0 5.0 6.0
1 2.0 3.0 5.0 NaN
2 1.0 3.0 4.0 NaN
3 1.0 2.0 4.0 NaN
4 2.0 4.0 5.0 NaN
Here we cannot remove NaN's in last column.
Use justify function and select first 3 columns:
df = pd.DataFrame(justify(df.values,invalid_val=np.nan)[:, :3].astype(int),
columns=list('xyz'),
index=df.index)
print (df)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If, as in your example, values increase across columns, you can sort over axis=1:
res = pd.DataFrame(np.sort(df.values, 1)[:, :3],
columns=list('xyz'), dtype=int)
print(res)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
You can use panda's method for dataframe df.fillna()
This method is used for converting the NaN or NA to your given parameter.
df.fillna(param to replace Nan)
import numpy as np
import pandas as pd
data = {
'A':[np.nan, 2.0, np.nan, 4.0, 5.0],
'B':[np.nan, 2.0, 3.0, np.nan, 5.0],
'C':[1.0 , np.nan, 3.0, 4.0, np.nan],
'D':[1.0 , 2.0, np.nan, 4.0, np.nan,],
'E':[np.nan, 2.0, np.nan, 4.0, 5.0]
}
df = pd.DataFrame(data)
print(df)
A B C D E
0 NaN NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 NaN 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
df = df.fillna(0) # Applying the method with parameter 0
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
If you want to apply this method to the particular column, the syntax would be like this
df[column_name] = df[column_name].fillna(param)
df['A'] = df['A'].fillna(0)
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
You can also use Python's replace() method to replace np.nan
df = df.replace(np.nan,0)
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
df['A'] = df['A'].replace() # Replacing only column A
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0

Select rows where at least one value from the list of columns is not null

I have a big dataframe with many columns (like 1000). I have a list of columns (generated by a script ~10). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
df[list_of_cols[0]].notnull() |
df[list_of_cols[1]].notnull() |
...
df[list_of_cols[6]].notnull() |
]
I can also iterate over the list of cols and create a mask which then I would apply to df, but his looks too tedious. Knowing how powerful is pandas with respect to dealing with nan, I would expect that there is a way easier way to achieve what I want.
Use the thresh parameter in the dropna() method. By setting thresh=1, you specify that if there is at least 1 non null item, don't drop it.
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
Starting with this:
data = {'a' : [np.nan,0,0,0,0,0,np.nan,0,0, 0,0,0, 9,9,],
'b' : [np.nan,np.nan,1,1,1,1,1,1,1, 2,2,2, 1,7],
'c' : [np.nan,np.nan,1,1,2,2,3,3,3, 1,1,1, 1,1],
'd' : [np.nan,np.nan,7,9,6,9,7,np.nan,6, 6,7,6, 9,6]}
df = pd.DataFrame(data, columns=['a','b','c','d'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are nulls. (Removing row index 0)
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
One can use boolean indexing
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression df[list_of_cols]).all(axis=1) returns a boolean array that is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask for the dataframe df[list_of_cols] with True values for the null elements in df[list_of_cols], False otherwise
all() returns True if all of the elements are True (row-wise axis=1)
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df=pd.DataFrame({'A':[11,22,33,np.NaN],
'B':['x',np.NaN,np.NaN,'w'],
'C':['2016-03-13',np.NaN,'2016-03-14','2016-03-15']})
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
isnull mask:
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1) row-wise:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15

Categories

Resources