Remove all values below a certain threshold and shift columns up in Pandas - python

I have growth data. I would like to calibrate all the columns to a certain (arbitrary) cutoff by removing all values below this threshold and "shift" the values up in each individual column.
To illustrate:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=list('AB'))
result:
A B
0 1 2
1 3 4
2 5 6
Removing all values of 3 and below:
df = df.where(df > 3, np.nan)
result:
A B
0 NaN NaN
1 NaN 4
2 5 6
What I'd finally want is the following dataframe (in a sense cutting and pasting values more than 3 to the top of the df):
A B
0 5 4
1 NaN 6
2 NaN NaN
Any idea how I would be able to do so?
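For reference, a minimal sketch of one way to get the final frame without a helper function (assuming NaN marks the removed values); the justify-based answer below is the one to use for performance on large frames:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=list('AB'))
df = df.where(df > 3, np.nan)

# Drop the NaNs in each column, rebuild the column from the top,
# and reindex so the frame keeps its original length (NaN-padded tail).
shifted = df.apply(lambda s: pd.Series(s.dropna().to_numpy())).reindex(df.index)
print(shifted)
#      A    B
# 0  5.0  4.0
# 1  NaN  6.0
# 2  NaN  NaN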

Use the justify function (defined below) for improved performance:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=list('AB'))
df = df.where(df > 3, np.nan)

arr = justify(df.to_numpy(), invalid_val=np.nan, axis=0, side='up')
# for older pandas versions use df.values instead of df.to_numpy()
# arr = justify(df.values, invalid_val=np.nan, axis=0, side='up')
df = pd.DataFrame(arr, columns=df.columns)
print(df)
A B
0 5.0 4.0
1 NaN 6.0
2 NaN NaN
Function by Divakar:
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'.
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
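As an aside (not part of the original answer), here is a tiny worked example of the mask trick on a single column, showing why sorting the boolean mask and flipping it moves the valid values to the top:
import numpy as np

col = np.array([np.nan, np.nan, 5.0])   # one column after thresholding
mask = ~np.isnan(col)                   # [False, False, True]
justified = np.flip(np.sort(mask))      # sort pushes True down, flip moves it up -> [True, False, False]

out = np.full(col.shape, np.nan)
out[justified] = col[mask]              # place the valid values at the top
print(out)                              # [ 5. nan nan]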

I would do it using the built-in Python sorted, in the following way:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 6],[5, 4]], columns=list('AB'))
df = df.where(df > 3, np.nan)
print(df)
Output:
A B
0 NaN NaN
1 NaN 6.0
2 5.0 4.0
Then just do:
for col in df.columns:
    df[col] = sorted(df[col], key=pd.isnull)
print(df)
Output:
A B
0 5.0 6.0
1 NaN 4.0
2 NaN NaN
I harness the fact that the built-in sorted is stable (note that I slightly changed the input, putting 6 before 4, to show that). The isnull function produces False for all non-NaN values, which is treated as 0 during sorting, and True for the rest, which is treated as 1, so non-NaN values keep their relative order and move to the front.
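A small standalone illustration of that stable-sort behaviour (an aside, not part of the original answer):
import numpy as np
import pandas as pd

values = [np.nan, 6.0, 4.0, np.nan]
# False (non-NaN) sorts before True (NaN); equal keys keep their original order
print(sorted(values, key=pd.isnull))  # [6.0, 4.0, nan, nan]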

Related

Is there a way to fill NaNs at the end of a column with values from another column at the beginning in Python?

So I have two columns, for example A and B, and they look like this:
A B
1 4
2 5
3 6
NaN NaN
NaN NaN
NaN NaN
and I want it like this:
A
1
2
3
4
5
6
Any ideas?
I'm assuming your data is in two columns of a DataFrame. You can append the B values to the end of the A values, then drop the NA values with the np.nan != np.nan trick. Here's an example:
import pandas as pd
import numpy as np
d = {
    'A': [1, 2, 3, np.nan, np.nan, np.nan],
    'B': [4, 5, 6, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(d)
>>> df
     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
# a value equals itself only if it is not NaN (np.nan != np.nan)
>>> df['A'] == df['A']
0 True
1 True
2 True
3 False
4 False
5 False
Name: A, dtype: bool
x = pd.concat([df['A'], df['B']])
>>> x
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 NaN
0 4.0
1 5.0
2 6.0
3 NaN
4 NaN
5 NaN
dtype: float64
x = x[x == x]
>>> x
0    1.0
1    2.0
2    3.0
0    4.0
1    5.0
2    6.0
dtype: float64
Using numpy, it could be something like:
import numpy as np

A = np.array([1, 2, 3, np.nan, np.nan, np.nan])
B = np.array([4, 5, 6, np.nan, np.nan, np.nan])
# NaN compares False against everything, so A < np.inf keeps only the non-NaN entries
C = np.hstack([A[A < np.inf], B[B < np.inf]])
print(C)  # [1. 2. 3. 4. 5. 6.]
What you might want is:
import pandas as pd
a = pd.Series([1, 2, 3, None, None, None])
b = pd.Series([4, 5, 6, None, None, None])
print(pd.concat([a.iloc[:3], b.iloc[:3]]))
And if you are just looking for the non-NaN values, feel free to use .dropna() on the Series.
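For instance, a minimal sketch combining concat with dropna (assuming the same a and b as above):
import pandas as pd

a = pd.Series([1, 2, 3, None, None, None])
b = pd.Series([4, 5, 6, None, None, None])

# stack the two columns, drop the NaN tail, and renumber the index
result = pd.concat([a, b]).dropna().reset_index(drop=True)
print(result)
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    5.0
# 5    6.0
# dtype: float64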

Compare two columns with NaNs in Pandas and get differences

I have a following dataframe:
case c1 c2
1 x x
2 NaN y
3 x NaN
4 y x
5 NaN NaN
I would like to get a column "match" which will show which records with values in "c1" and "c2" are equal or different:
case c1 c2 match
1 x x True
2 NaN y False
3 x NaN False
4 y x False
5 NaN NaN True
I tried the following based on another Stack Overflow question: Comparing two columns and keeping NaNs
However, I can't get both cases 4 and 5 correct.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'case': [1, 2, 3, 4, 5],
    'c1': ['x', np.nan, 'x', 'y', np.nan],
    'c2': ['x', 'y', np.nan, 'x', np.nan],
})
cond1 = df['c1'] == df['c2']
cond2 = (df['c1'].isnull()) == (df['c2'].isnull())
df['c3'] = np.select([cond1, cond2], [True, True], False)
df
Use eq with isna:
df.c1.eq(df.c2) | df.iloc[:, 1:].isna().all(1)
# or
df.c1.eq(df.c2) | df.loc[:, ['c1', 'c2']].isna().all(1)
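To store the result as the match column from the question, the boolean Series can be assigned directly (a sketch, assuming the df defined in the question):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'case': [1, 2, 3, 4, 5],
    'c1': ['x', np.nan, 'x', 'y', np.nan],
    'c2': ['x', 'y', np.nan, 'x', np.nan],
})

# True where the values are equal, or where both are NaN
df['match'] = df['c1'].eq(df['c2']) | (df['c1'].isna() & df['c2'].isna())
print(df)
#    case   c1   c2  match
# 0     1    x    x   True
# 1     2  NaN    y  False
# 2     3    x  NaN  False
# 3     4    y    x  False
# 4     5  NaN  NaN   True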
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'case': [1, 2, 3, 4, 5],
    'c1': ['x', np.nan, 'x', 'y', np.nan],
    'c2': ['x', 'y', np.nan, 'x', np.nan],
})
df['c3'] = df.apply(lambda row: str(row.c1) == str(row.c2), axis=1)
print(df)
Output
case c1 c2 c3
0 1 x x True
1 2 NaN y False
2 3 x NaN False
3 4 y x False
4 5 NaN NaN True
Use nunique with fillna:
import numpy as np
df.fillna(np.inf)[['c1', 'c2']].nunique(axis=1) < 2
Or nunique with the option dropna=False:
df[['c1', 'c2']].nunique(axis=1, dropna=False) < 2
Out[13]:
0 True
1 False
2 False
3 False
4 True
dtype: bool

Use a custom function to apply on a df column if a condition is satisfied

I have a DataFrame like
A B
1 2
2 -
5 -
4 5
I want to apply a function func() on column B (but the function gives an error if - is passed). I cannot modify the func() function. I need something like:
df['B'] = df['B'].apply(func)  # but only where the value is not equal to '-'
Coerce column B to numeric, build a mask of the valid rows, and apply the function only on those rows:
import pandas as pd
import numpy as np

def func(a):
    return a + 10

# new pandas DataFrame with four rows and 2 columns, the 3rd row having a NaN
df = pd.DataFrame([[1, 2], [3, 4], [5, np.nan], [7, 8]], columns=["A", "B"])
print(df)

# coerce column B to numeric
s = pd.to_numeric(df['B'], errors='coerce')

# the mask is True for numeric rows, False for non-numeric rows
mask = s.notna()
print(mask)

# run func across the valid part of column B only
df.loc[mask, 'B'] = s[mask].apply(func)
print(df)
Which prints:
A B
0 1 2.0
1 3 4.0
2 5 NaN
3 7 8.0
0 True
1 True
2 False
3 True
A B
0 1 12.0
1 3 14.0
2 5 NaN
3 7 18.0
Try:
df['B'] = df[df['B']!='-']['B'].apply(func)
Or when the - is actually NaN you can use:
df['B'] = df[pd.notnull(df['B'])]['B'].apply(func)
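Note that assigning the filtered result back to df['B'] replaces the '-' rows with NaN, because pandas aligns on the index. If you want to keep the '-' markers, a .loc-based assignment only touches the valid rows (a sketch; func here is an illustrative stand-in that adds 10, since the real func is not shown):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 5, 4], 'B': [2, '-', '-', 5]})

def func(a):          # illustrative stand-in for the real func()
    return a + 10

mask = df['B'] != '-'
df.loc[mask, 'B'] = df.loc[mask, 'B'].apply(func)  # '-' rows are left untouched
print(df)
#    A   B
# 0  1  12
# 1  2   -
# 2  5   -
# 3  4  15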

Pandas: How can I find the col, index where a NaN value exists?

In [3]: import numpy as np
In [4]: b = pd.DataFrame(np.array([
   ...:     [1, np.nan, 3, 4],
   ...:     [np.nan, 4, np.nan, 4]
   ...: ]))
In [13]: b
Out[13]:
0 1 2 3
0 1.0 NaN 3.0 4.0
1 NaN 4.0 NaN 4.0
I want to find the column name and index where a NaN value exists.
For example, "b has NaN values at index 0 col 1, index 1 col 0, and index 1 col 2".
What I've tried:
1
In [14]: b[b.isnull()]
Out[14]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
=> I don't know why it shows a DataFrame filled with NaN
2
In [15]: b[b[0].isnull()]
Out[15]:
0 1 2 3
1 NaN 4.0 NaN 4.0
=> It only shows the part of the DataFrame where a NaN value exists in column 0.
How can I find all of the (index, column) locations at once?
You could use np.where to find the indices where pd.isnull(b) is True:
import numpy as np
import pandas as pd
b = pd.DataFrame(np.array([
    [1, np.nan, 3, 4],
    [np.nan, 4, np.nan, 4]]))
idx, idy = np.where(pd.isnull(b))
result = np.column_stack([b.index[idx], b.columns[idy]])
print(result)
# [[0 1]
# [1 0]
# [1 2]]
or use DataFrame.stack to reshape the DataFrame by moving the column labels into the index.
This creates a Series which is True where b is null:
mask = pd.isnull(b).stack()
# 0 0 False
# 1 True
# 2 False
# 3 False
# 1 0 True
# 1 False
# 2 True
# 3 False
and then read off the row and column labels from the MultiIndex:
print(mask.loc[mask])
# 0 1 True
# 1 0 True
# 2 True
# dtype: bool
print(mask.loc[mask].index.tolist())
# [(0, 1), (1, 0), (1, 2)]
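The two steps can also be chained into a single expression (a compact sketch of the same approach, reusing b and the imports from above):
nan_locations = pd.isnull(b).stack().loc[lambda s: s].index.tolist()
print(nan_locations)  # [(0, 1), (1, 0), (1, 2)]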

Duplicating & modifying rows in pandas based on columns condition

As part of a classification problem, I work on a DataFrame containing multiple label columns.
My dataframe is of this form :
df = pd.DataFrame([['a', 1, 1],
                   ['b', 1, 0],
                   ['c', 0, 0]], columns=['col1', 'label1', 'label2'])
>>> col1 label1 label2
0 a 1 1
1 b 1 0
2 c 0 0
As I do not want to have more than one true label per row, I want to duplicate only those rows and regularize this condition as follows:
>>> col1 label1 label2
0 a 1 0 # Modified original row
1 a 0 1 # Duplicated & modified row
2 b 1 0
3 c 0 0
With only the row of value "a" being duplicated / regularized
At the moment I do that in a for loop, replicating the rows in a second DataFrame, appending it and dropping all the "invalid" rows.
Would there be a more clean/efficient way to do that ?
>>> cols = [x for x in df.columns if x != 'col1']
>>> res = pd.concat([df[['col1', x]] for x in cols])
>>> res = res.drop_duplicates()
>>> res.fillna(0, inplace=True)
>>> res.sort_values(by='col1', inplace=True)
>>> res.reset_index(drop=True, inplace=True)
>>> res
col1 label1 label2
0 a 1 0
1 a 0 1
2 b 1 0
3 b 0 0
4 c 0 0
You can also use df.iterrows(), as follows:
for index, row in df.iterrows():
    if row['label1'] + row['label2'] == 2:
        df = pd.concat((df, pd.DataFrame({'col1': [row['col1']], 'label1': [0], 'label2': [1]})), ignore_index=True)
        df = pd.concat((df, pd.DataFrame({'col1': [row['col1']], 'label1': [1], 'label2': [0]})), ignore_index=True)
        df.drop(index, inplace=True)
Result :
col1 label1 label2
1 b 1 0
2 c 0 0
3 a 0 1
4 a 1 0
Then you can sort by the values in col1, as shown below.
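For instance (an illustrative follow-up, reusing the df produced by the loop above):
df = df.sort_values('col1').reset_index(drop=True)
print(df)
#   col1  label1  label2
# 0    a       0       1
# 1    a       1       0
# 2    b       1       0
# 3    c       0       0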
Here is a somewhat intuitive way of thinking about the problem. First, filter for just the rows that have both labels equal to 1. Make two new dataframes by zeroing out one label column in each.
Then concatenate the rows that do not have both labels equal to one with the two new dataframes.
mask_ones = (df['label1'] == 1) & (df['label2'] == 1)
df_ones = df[mask_ones]
df_not_ones = df[~mask_ones]
df_final = pd.concat([df_not_ones,
                      df_ones.replace({'label2': {1: 0}}),
                      df_ones.replace({'label1': {1: 0}})]).sort_values('col1')
Split the frame into two DataFrames - unique rows and duplicated rows.
For the duplicates, take the col1 + label1 columns, concat them with col1 + label2, and fill NaN with 0.
Then concat the unique and duplicated DataFrames into one:
df = pd.DataFrame([['a', 1, 1],
                   ['b', 1, 0],
                   ['c', 0, 0]], columns=['col1', 'label1', 'label2'])
mask = (df['label1'] == 1) & (df['label2'] == 1)
df_dup, df_uq = df[mask], df[~mask]
df_dup = pd.concat([df_dup[['col1', x]] for x in df_dup.columns if x != 'col1']).fillna(0)
df = pd.concat([df_dup, df_uq], ignore_index=True)
print(df)
col1 label1 label2
0 a 1.0 0.0
1 a 0.0 1.0
2 b 1.0 0.0
3 c 0.0 0.0
Something like this:
df = pd.DataFrame([['a', 1, 1],
                   ['b', 1, 0],
                   ['c', 0, 0]], columns=['col1', 'label1', 'label2'])
df2 = pd.DataFrame()
df2["col1"] = df["col1"]
df2["label2"] = df["label2"]
df.drop(labels="label2", axis=1, inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
result = pd.concat([df, df2], ignore_index=True)
result.fillna(value=0, inplace=True)
result.sort_values(by="col1")
Result:
col1 label1 label2
0 a 1.000000 0.000000
3 a 0.000000 1.000000
1 b 1.000000 0.000000
4 b 0.000000 0.000000
2 c 0.000000 0.000000
5 c 0.000000 0.000000
Finally, you could drop duplicates
result.drop_duplicates()
