Get last non NaN value after groupby and aggregation - python

I have a data frame like this for example:
col1 col2
0 A 3
1 B 4
2 A NaN
3 B 5
4 A 5
5 A NaN
6 B NaN
.
.
.
47 B 8
48 A 9
49 B NaN
50 A NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index(), it gives me this output:
col1 col2
0 A NaN
1 B NaN
I want to get the last non-NaN value per group after the groupby and agg. The desired output is:
col1 col2
0 A 9
1 B 8

Your solution works for me if the NaNs are real missing values, because GroupBy.last skips missing values.
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to missing values:
import numpy as np
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
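To see the difference between a string 'NaN' and a real missing value, here is a minimal sketch. The sample data is hypothetical, and it uses pd.to_numeric(errors="coerce") instead of replace, which is an assumption on my part and only fits if col2 is meant to be numeric:

```python
import pandas as pd

# Hypothetical sample; "NaN" here is a plain string, not a missing value.
df_str = pd.DataFrame({
    "col1": ["A", "B", "A", "B", "A", "B"],
    "col2": [3, 4, 9, 8, "NaN", "NaN"],
})

# With string "NaN", 'last' returns the literal string for each group.
as_strings = df_str.groupby("col1", sort=False)["col2"].last()

# Coerce the column to numbers so the strings become real missing values.
df_num = df_str.assign(col2=pd.to_numeric(df_str["col2"], errors="coerce"))

# Now 'last' skips the missing values and returns the last valid entry.
result = df_num.groupby("col1", sort=False).agg({"col2": "last"}).reset_index()
print(result)
```

If the aggregation still returns NaN after this, checking df['col2'].dtype is a quick way to confirm whether the column holds strings.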

Related

How to insert an empty column after each column in an existing dataframe

I have a dataframe that looks as follows
df = pd.DataFrame({"A":[1,2,3,4],
"B":[3,4,5,6],
"C":[2,3,4,5]})
I would like to insert an empty column (with type string) after each existing column in the dataframe, such that the output looks like
A col1 B col2 C col3
0 1 NaN 3 NaN 2 NaN
1 2 NaN 4 NaN 3 NaN
2 3 NaN 5 NaN 4 NaN
3 4 NaN 6 NaN 5 NaN
There's a much simpler way using reindex:
df.reindex([x for i, c in enumerate(df.columns, 1) for x in (c, f'col{i}')], axis=1)
Result:
A col1 B col2 C col3
0 1 NaN 3 NaN 2 NaN
1 2 NaN 4 NaN 3 NaN
2 3 NaN 5 NaN 4 NaN
3 4 NaN 6 NaN 5 NaN
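The one-liner above can be unpacked as follows. This is only a sketch of the same reindex idea; the final astype step is my own addition to address the "type string" part of the question and assumes pandas' "string" dtype is acceptable:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6], "C": [2, 3, 4, 5]})

# Build the interleaved order: A, col1, B, col2, C, col3.
new_order = []
empty_cols = []
for i, col in enumerate(df.columns, start=1):
    empty_name = f"col{i}"
    new_order.extend([col, empty_name])
    empty_cols.append(empty_name)

# reindex adds the labels that do not exist yet and fills them with NaN.
out = df.reindex(columns=new_order)

# Optional (assumption): store the empty columns with pandas' string dtype.
out = out.astype({name: "string" for name in empty_cols})
print(out)
```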
Here's another, more complicated way:
import numpy as np
df.join(pd.DataFrame(np.empty(df.shape, dtype=object), columns=df.columns + '_sep')).sort_index(axis=1)
A A_sep B B_sep C C_sep
0 1 None 3 None 2 None
1 2 None 4 None 3 None
2 3 None 5 None 4 None
3 4 None 6 None 5 None
This solution worked for me:
merged = pd.concat([myDataFrame, pd.DataFrame(columns=[' '])], axis=1)
This is what you can do:
for count in range(len(df.columns)):
    # Insert the new column right after each original column.
    df.insert(count*2+1, f'col{count+1}', 'NaN')
print(df)
Output:
A col1 B col2 C col3
0 1 NaN 3 NaN 2 NaN
1 2 NaN 4 NaN 3 NaN
2 3 NaN 5 NaN 4 NaN
3 4 NaN 6 NaN 5 NaN
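Note that the loop above inserts the string 'NaN'. A variant that inserts real missing values could look like this (a sketch, assuming np.nan is what is wanted in the empty columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6], "C": [2, 3, 4, 5]})

# Fix the number of original columns before the frame starts growing.
n_original = len(df.columns)

for count in range(n_original):
    # Position 1, 3, 5, ... is directly after each original column.
    df.insert(count * 2 + 1, f"col{count + 1}", np.nan)

print(df)
```

Unlike reindex, DataFrame.insert modifies df in place, so work on a copy if the original frame is still needed.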

Take n last rows of a dataframe with no NaN

Let's take this dataframe :
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4,5,6], Col2=[4,np.nan,5,np.nan,1,5]))
Col1 Col2
0 1.0 4.0
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
4 5.0 1.0
5 6.0 5.0
I would like to extract the last n rows of df that contain no NaN.
Could you please help me get this expected result?
Col1 Col2
0 5 1
1 6 5
EDIT: Let's say I don't know where the last NaN is.
Use DataFrame.dropna with DataFrame.tail, then convert to integers:
N = 2
df1 = df.dropna().tail(N).astype(int)
#alternative
#df1 = df.dropna().iloc[-N:].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
EDIT: To get the last group of consecutive rows with no missing values, flag rows with missing values using DataFrame.isna and DataFrame.any, then reverse the order and take the cumulative sum, so the last group has 0 in the mask:
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
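The two approaches above answer slightly different questions, which shows up once N is larger than the trailing block. A step-by-step sketch on the same data (the intermediate variable names are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2, np.nan, 4, 5, 6],
                   "Col2": [4, np.nan, 5, np.nan, 1, 5]})

# Approach 1: last N complete rows, wherever they are in the frame.
last_three_complete = df.dropna().tail(3)

# Approach 2: only the trailing block that comes after the last NaN row.
has_nan = df.isna().any(axis=1)               # True for rows 1, 2, 3
seen_from_end = has_nan.iloc[::-1].cumsum()   # NaN rows counted from the bottom
mask = seen_from_end.eq(0).sort_index()       # True only for rows 4 and 5
trailing_block = df[mask].astype(int)

print(last_three_complete)
print(trailing_block)
```

Here dropna().tail(3) also picks up row 0, while the mask keeps only rows 4 and 5.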
If no row matches, it correctly returns an empty DataFrame:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4], Col2=[np.nan,np.nan,5,np.nan]))
print (df)
Col1 Col2
0 1.0 NaN
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Empty DataFrame
Columns: [Col1, Col2]
Index: []
Another way is to use isna with cumsum and drop_duplicates to find the index of the last row that contains a NaN, then slice by position (this assumes the default integer index):
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
print(new_df)
Col1 Col2
4 5.0 1.0
5 6.0 5.0
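Because index.max() + 1 is used as a position, the snippet above relies on the default 0..n-1 index. A purely positional sketch that also works with other index labels (the letter index below is hypothetical, and np.flatnonzero is my substitution for the drop_duplicates step):

```python
import numpy as np
import pandas as pd

# Same data as above, but with a hypothetical non-integer index.
df = pd.DataFrame({"Col1": [1, 2, np.nan, 4, 5, 6],
                   "Col2": [4, np.nan, 5, np.nan, 1, 5]},
                  index=list("abcdef"))

# Positions (not labels) of rows that contain at least one NaN.
nan_positions = np.flatnonzero(df.isna().any(axis=1).to_numpy())

# Start right after the last NaN row, or at 0 if there is no NaN at all.
start = nan_positions[-1] + 1 if nan_positions.size else 0

new_df = df.iloc[start:]
print(new_df)
```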

Compare each of the column values and return final value based on conditions

I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply some condition to the column values and return the final result in a new column.
The condition is to assign values based on this order of priority, where 2 has the highest priority: [2,1,3,0,4]
I tried to define a function to append the final results but wasn't really getting anywhere. Any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
First, you may want to get rid of the NaNs:
df = df.fillna(5)
Then apply a function to every row to find your value:
def func(x, l=(2, 1, 3, 0, 4, 5)):
    # Return the first value in priority order that appears in the row.
    for j in l:
        if j in x:
            return j
df['new'] = df.apply(lambda x: func(list(x)), axis=1)
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
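Maybe a little late, but here is another option that leaves the NaNs in place:
import numpy as np
def f(x):
    # Return the first priority value found in the row, or NaN if none is present.
    for i in [2, 1, 3, 0, 4]:
        if i in x.tolist():
            return i
    return np.nan
df["col4"] = df.apply(f, axis=1)
Output: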
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2
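Both answers loop over rows with apply. For larger frames, a vectorized alternative is possible with np.select, which picks the first matching condition in list order. This is a sketch of a different technique than the answers above, and the PRIORITY and conditions names are mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 2, 0],
    "col2": [2, 3, 4, np.nan, 2],
    "col3": [3, np.nan, np.nan, np.nan, np.nan],
})

PRIORITY = [2, 1, 3, 0, 4]

# One boolean Series per priority value: does the row contain that value?
conditions = [df.eq(value).any(axis=1) for value in PRIORITY]

# np.select takes the choice of the first condition that is True per row,
# so listing the conditions in priority order implements the rule.
df["col4"] = np.select(conditions, PRIORITY, default=np.nan)
print(df)
```

The conditions are computed before col4 is added, so the new column does not affect the lookup.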

Re-occurrence count

I have a dataframe like the following:
Col1
0 C
1 A
3 D
4 A
5 A
I would like to count the step/index at which a certain value will re-occur, so that I get the following:
Col1 Col2
0 C NaN
1 A 2
3 D NaN
4 A 1
5 A NaN
Any ideas on how to do it? Thanks for the help!
Use GroupBy.cumcount and then replace 0 with NaN:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).replace(0,np.nan)
print (df)
Col1 Col2
0 C NaN
1 A 2.0
3 D NaN
4 A 1.0
5 A NaN
Alternative solution with mask:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).mask(lambda x: x == 0)
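To make the intermediate step visible, here is a small sketch on the question's data. With ascending=False, cumcount numbers each row by how many later rows share its value (the remaining name is mine):

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["C", "A", "D", "A", "A"]}, index=[0, 1, 3, 4, 5])

# Number of later rows in the same group: C -> 0, A -> 2, 1, 0, D -> 0.
remaining = df.groupby("Col1").cumcount(ascending=False)

# Rows with no later occurrence become NaN; the column turns into float.
df["Col2"] = remaining.mask(remaining.eq(0))
print(df)
```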

Check for NaN values in some particular column in a dataframe

Suppose I have a dataframe:
a b c
0 1 2 NaN
1 2 NaN 4
3 NaN 4 NaN
I want to check for NaN in only some particular columns and get the resulting dataframe as:
a b c
0 1 2 NaN
3 NaN 4 NaN
Here I want to check for NaN only in column 'a' and column 'c'.
How can this be done?
You can do that with the isnull and any methods:
In [264]: df
Out[264]:
a b c
0 1 2 NaN
1 2 NaN 4
2 NaN 4 NaN
In [265]: df[df.isnull().any(axis=1)]
Out[265]:
a b c
0 1 2 NaN
2 NaN 4 NaN
Note: if you just want to keep rows without any NaN, you could use the dropna method.
EDIT
If you want to check only a subset of columns, build the mask from those columns and apply it to the whole dataframe:
df_subset = df[['a', 'c']]
In [282]: df[df_subset.isnull().any(axis=1)]
Out[282]:
a b c
0 1 2 NaN
2 NaN 4 NaN
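As a self-contained version of the subset check, the sketch below rebuilds the example frame and also shows the all variant, in case rows should match only when every checked column is NaN (that variant is my addition, not part of the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan],
                   "b": [2, np.nan, 4],
                   "c": [np.nan, 4, np.nan]})

cols_to_check = ["a", "c"]

# Rows where at least one of the checked columns is NaN.
any_nan = df[df[cols_to_check].isna().any(axis=1)]

# Rows where every checked column is NaN.
all_nan = df[df[cols_to_check].isna().all(axis=1)]

print(any_nan)
print(all_nan)
```

Row 1 is excluded in both cases because its only NaN is in column 'b', which is not checked.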
