Pandas: Compare next column value with previous column value - python

I have the following DataFrame structure with example data:
Col1  Col2  Col3
   1     1     8
   5     4     7
   3     9     9
   1   NaN   NaN
Columns have a sequential ordering, meaning Col1 comes before Col2 and so on...
I want to compare whether two (or more) subsequent columns have the same value. If so, I want to drop the entire row. NaN values can appear, but they should not be treated as being equal to each other.
So for the rows above, I'd like rows 1 and 3 dropped (row 1: Col1 -> Col2 same value, row 3: Col2 -> Col3 same value) and rows 2 and 4 kept in the dataframe.
How can I accomplish this? Thanks!

Use DataFrame.diff along the columns, test that no row contains a 0 with DataFrame.ne (not equal) and DataFrame.all, and filter with boolean indexing:
df = df[df.diff(axis=1).ne(0).all(axis=1)]
print (df)
   Col1  Col2  Col3
1     5   4.0   7.0
3     1   NaN   NaN
Detail:
print (df.diff(axis=1))
   Col1  Col2  Col3
0   NaN   0.0   7.0
1   NaN  -1.0   3.0
2   NaN   6.0   0.0
3   NaN   NaN   NaN

print (df.diff(axis=1).ne(0))
   Col1   Col2   Col3
0  True  False   True
1  True   True   True
2  True   True  False
3  True   True   True

print (df.diff(axis=1).ne(0).all(axis=1))
0    False
1     True
2    False
3     True
dtype: bool
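For reference, here is a minimal self-contained sketch of the same filter, rebuilding the example data from the question (column names assumed as in the example). Note that NaN minus NaN is NaN rather than 0, so rows containing NaN are not treated as duplicates and survive the filter, as the question requires:
import numpy as np
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({'Col1': [1, 5, 3, 1],
                   'Col2': [1, 4, 9, np.nan],
                   'Col3': [8, 7, 9, np.nan]})

# diff(axis=1) compares each column with the previous one; a 0 marks two
# adjacent columns with equal values, and those rows are removed
df = df[df.diff(axis=1).ne(0).all(axis=1)]
print(df)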

Related

Pandas update columns with loc and apply

I'm trying to update some columns of a dataframe where a condition is met (only some rows meet the condition).
I'm using apply with loc. My function returns a pandas Series.
The problem is that the columns are updated with NaN.
Simplifying my problem, we can consider the following dataframe df_test:
  col1  col2  col3  col4
0    A     1     1     2
1    B     2     1     2
2    A     3     1     2
3    B     4     1     2
I now want to update col3 and col4 when col1=A. For that I'll use the apply method
df_test.loc[df_test['col1']=='A', ['col3', 'col4']] = df_test[df_test['col1']=='A'].apply(lambda row: pd.Series([10,20]), axis=1)
Doing that I get:
  col1  col2  col3  col4
0    A     1   NaN   NaN
1    B     2   1.0   2.0
2    A     3   NaN   NaN
3    B     4   1.0   2.0
If instead of pd.Series([10, 20]) I use np.array([10, 20]) or [10, 20] I get the following error
ValueError: shape mismatch: value array of shape (2,2) could not be broadcast to indexing result of shape (2,)
What do I need to return to obtain
  col1  col2  col3  col4
0    A     1    10    20
1    B     2     1     2
2    A     3    10    20
3    B     4     1     2
thanks!
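One detail worth noting when reading the question (a hedged sketch, not an answer taken from this thread): assignment through .loc aligns on labels, and pd.Series([10, 20]) returned from apply produces columns labelled 0 and 1, which do not match 'col3' and 'col4', so nothing lines up and NaN is written. Returning a Series indexed by the target column names would align:
import pandas as pd

df_test = pd.DataFrame({'col1': ['A', 'B', 'A', 'B'],
                        'col2': [1, 2, 3, 4],
                        'col3': [1, 1, 1, 1],
                        'col4': [2, 2, 2, 2]})

mask = df_test['col1'] == 'A'
# the returned Series carries the target column names, so .loc can align it
df_test.loc[mask, ['col3', 'col4']] = df_test[mask].apply(
    lambda row: pd.Series([10, 20], index=['col3', 'col4']), axis=1)
print(df_test)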

Create new column based on multiple column values

I have a question about how to create a new column based on several column values.
Input:
col1  col2  col3
   1     1     1
NULL  NULL  NULL
   2  NULL     2
NULL  NULL     3
   4  NULL  NULL
   5     5  NULL
Output:
col1  col2  col3  new
   1     1     1     1
NULL  NULL  NULL  NULL
   2  NULL     2     2
NULL  NULL     3     3
   4  NULL  NULL     4
   5     5  NULL     5
I am trying to use combine_first, but it does not seem to be a good choice since I have multiple columns to combine.
One option is to rename the columns to have the same name; then use groupby + first:
df['new'] = (df.rename(columns={col: 'col' for col in df.columns})
               .groupby(level=0, axis=1).first())
You could also iteratively use combine_first:
df['new'] = float('nan')
for c in df.columns:
    df['new'] = df['new'].combine_first(df[c])
Or you could apply a lambda that selects non-NaN values row-wise (works for Python>=3.8 since it uses the walrus operator; could write the same function differently if you have Python<3.8):
df['new'] = df.apply(lambda x: res[0] if (res:=x[x.notna()].tolist()) else float('nan'), axis=1)
Output:
   col1  col2  col3  new
0   1.0   1.0   1.0  1.0
1   NaN   NaN   NaN  NaN
2   2.0   NaN   2.0  2.0
3   NaN   NaN   3.0  3.0
4   4.0   NaN   NaN  4.0
5   5.0   5.0   NaN  5.0
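For reference, a minimal runnable sketch of the iterative combine_first option, rebuilding the example input with NULL read as NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, np.nan, 2, np.nan, 4, 5],
                   'col2': [1, np.nan, np.nan, np.nan, np.nan, 5],
                   'col3': [1, np.nan, 2, 3, np.nan, np.nan]})

df['new'] = float('nan')
for c in ['col1', 'col2', 'col3']:
    # combine_first only fills positions that are still NaN in 'new',
    # so the columns are consumed left to right
    df['new'] = df['new'].combine_first(df[c])
print(df)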

Get last non NaN value after groupby and aggregation

I have a data frame like this for example:
    col1  col2
0      A     3
1      B     4
2      A   NaN
3      B     5
4      A     5
5      A   NaN
6      B   NaN
..   ...   ...
47     B     8
48     A     9
49     B   NaN
50     A   NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index() it gives me this output:
  col1  col2
0    A   NaN
1    B   NaN
I want to get the last non-NaN value after groupby and agg. The desired output is below:
  col1  col2
0    A     9
1    B     8
Your solution works well for me, provided the NaN values are real missing values.
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to real missing values:
import numpy as np
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
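A small sketch of that point, on made-up miniature data: the groupby 'last' aggregation already skips real missing values, but a literal 'NaN' string is an ordinary value and is returned as-is until it is replaced:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B'],
                   'col2': [3, 4, np.nan, 'NaN']})

# group B ends with the string 'NaN', which 'last' does not skip
print(df.groupby('col1', sort=False).agg({'col2': 'last'}))

# after converting the strings to real NaN, group B returns 4
df['col2'] = df['col2'].replace('NaN', np.nan)
print(df.groupby('col1', sort=False).agg({'col2': 'last'}))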

Take n last rows of a dataframe with no NaN

Let's take this dataframe:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4,5,6], Col2=[4,np.nan,5,np.nan,1,5]))
   Col1  Col2
0   1.0   4.0
1   2.0   NaN
2   NaN   5.0
3   4.0   NaN
4   5.0   1.0
5   6.0   5.0
I would like to extract the n last rows of df with no NaN.
Could you please help me to get this expected result?
   Col1  Col2
0     5     1
1     6     5
EDIT: Let's say I don't know where the last NaN is.
Use DataFrame.dropna with DataFrame.tail and convert to integers:
N = 2
df1 = df.dropna().tail(N).astype(int)
#alternative
#df1 = df.dropna().iloc[-N:].astype(int)
print (df1)
   Col1  Col2
4     5     1
5     6     5
EDIT: To keep only the last group of rows with no missing values, flag missing values with DataFrame.isna and DataFrame.any, reverse the order and take the cumulative sum, so the last group has 0 values in the mask:
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
   Col1  Col2
4     5     1
5     6     5
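To make the chained mask easier to follow, here is the same expression split into named steps (same df as above):
has_nan = df.isna().any(axis=1)        # True for every row containing a NaN
rev_cum = has_nan.iloc[::-1].cumsum()  # count NaN rows from the bottom up
m = rev_cum.eq(0).sort_index()         # still 0 -> row is below the last NaN
print (m)
0    False
1    False
2    False
3    False
4     True
5     True
dtype: bool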
If no rows match, it correctly returns an empty DataFrame:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4], Col2=[np.nan,np.nan,5,np.nan]))
print (df)
   Col1  Col2
0   1.0   NaN
1   2.0   NaN
2   NaN   5.0
3   4.0   NaN
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Empty DataFrame
Columns: [Col1, Col2]
Index: []
Another way is to use isna with cumsum and drop_duplicates to get the max index, and then simply filter by position:
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
print(new_df)
   Col1  Col2
4   5.0   1.0
5   6.0   5.0

Compare each of the column values and return final value based on conditions

I currently have a dataframe which looks like this:
col1  col2  col3
   1     2     3
   2     3   NaN
   3     4   NaN
   2   NaN   NaN
   0     2   NaN
What I want to do is apply some condition to the column values and return the final result in a new column.
The condition is to assign values based on this order of priority, where 2 is the highest priority: [2, 1, 3, 0, 4]
I tried to define a function to append the final results but wasn't really getting anywhere... any thoughts?
The desired outcome would look something like:
col1  col2  col3  col4
   1     2     3     2
   2     3   NaN     2
   3     4   NaN     3
   2   NaN   NaN     2
   0     2   NaN     2
where col4 is the new column created.
Thanks
First you may want to get rid of the NaNs:
df = df.fillna(5)
and then apply a function to every row to find your value:
def func(x, l=[2, 1, 3, 0, 4, 5]):
    for j in l:
        if j in x:
            return j

df['new'] = df.apply(lambda x: func(list(x)), axis=1)
Output:
   col1  col2  col3  new
0     1     2     3    2
1     2     3     5    2
2     3     4     5    3
3     2     5     5    2
4     0     2     5    2
Maybe a little late, but here is an alternative:
import numpy as np

def f(x):
    for i in [2, 1, 3, 0, 4]:
        if i in x.tolist():
            return i
    return np.nan

df["col4"] = df.apply(f, axis=1)
and the output:
   col1  col2  col3  col4
0     1   2.0   3.0     2
1     2   3.0   NaN     2
2     3   4.0   NaN     3
3     2   NaN   NaN     2
4     0   2.0   NaN     2
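For completeness, a minimal self-contained version of the second approach, rebuilding the example frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 2, 0],
                   'col2': [2, 3, 4, np.nan, 2],
                   'col3': [3, np.nan, np.nan, np.nan, np.nan]})

def f(x):
    # walk the priority list and return the first value found in the row
    for i in [2, 1, 3, 0, 4]:
        if i in x.tolist():
            return i
    return np.nan

df["col4"] = df.apply(f, axis=1)
print(df)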
