Re-occurrence count - python

I have a dataframe like the follow:
Col1
0 C
1 A
3 D
4 A
5 A
I would like to count the step/index that a certain value will re-occur so I would get the following:
Col1 Col2
0 C NaN
1 A 2
3 D NaN
4 A 1
5 A NaN
Any ideas on how to do it ? Thanks for help !

Use GroupBy.cumcount and then replace 0 to NaNs:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).replace(0,np.nan)
print (df)
Col1 Col2
0 C NaN
1 A 2.0
3 D NaN
4 A 1.0
5 A NaN
Alternative solution with mask:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).mask(lambda x: x == 0)

Related

Get last non NaN value after groupby and aggregation

I have a data frame like this for example:
col1 col2
0 A 3
1 B 4
2 A NaN
3 B 5
4 A 5
5 A NaN
6 B NaN
.
.
.
47 B 8
48 A 9
49 B NaN
50 A NaN
when i try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index() it gives me this output
col1 col2
0 A NaN
1 B NaN
I want to get the last non NaN value after groupby and agg. The desirable output is like below
col1 col2
0 A 9
1 B 8
For me your solution working well, if NaN are missing values.
Here is alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If NaNs are strings first convert them to missing values:
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()

Take n last rows of a dataframe with no NaN

Let's take this dataframe :
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4,5,6], Col2=[4,np.nan,5,np.nan,1,5]))
Col1 Col2
0 1.0 4.0
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
4 5.0 1.0
5 6.0 5.0
I would like to extract the n last rows of df with no NaN.
Could you please help me to get this expected result ?
Col1 Col2
0 5 1
1 6 5
EDIT : Let's say I don't know where is the last NaN
Use DataFrame.dropna with DataFrame.tail and converting to integers:
N = 2
df1 = df.dropna().tail(N).astype(int)
#alternative
#df1 = df.dropna().iloc[-N:].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
EDIT: For last group with no missing values compare misisng values with DataFrame.isna and DataFrame.any, then swap order with cumulative sum, so last group has 0 values in mask:
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
If no row match it return correct empty DataFrame:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4], Col2=[np.nan,np.nan,5,np.nan]))
print (df)
Col1 Col2
0 1.0 NaN
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Empty DataFrame
Columns: [Col1, Col2]
Index: []
another way is to use isna with drop_duplicates and cumsum to get the max index and then just use index filtering
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
print(new_df)
Col1 Col2
4 5.0 1.0
5 6.0 5.0

Pandas Summing Two Columns with Nan

I have three columns in pandas dataframes with Nan:
>>> d=pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3':[5,6]})
>>> d
col1 col2 col3
0 1 3 5
1 2 4 6
>>> d['col2'].iloc[0]=np.nan
>>> d
col1 col2 col3
0 1 NaN 5
1 2 4.0 6
>>> d['col1'].iloc[1]=np.nan
>>> d
col1 col2 col3
0 1.0 NaN 5
1 NaN 4.0 6
>>> d['col3'].iloc[1]=np.nan
>>> d
col1 col2 col3
0 1.0 NaN 5.0
1 NaN 4.0 NaN
Now, I would like the column addition to have the following output:
>>> d['col1']+d['col3']
0 6.0
1 NaN
>>> d['col1']+d['col2']
0 1.0
1 4.0
However, in reality, the output is instead:
>>> d['col1']+d['col3']
0 6.0
1 NaN
>>> d['col1']+d['col2']
0 NaN
1 NaN
Anyone knows how to achieve this?
You can use add to get your sums, with fill_value=0:
>>> d.col1.add(d.col2, fill_value=0)
0 1.0
1 4.0
dtype: float64
>>> d.col1.add(d.col3, fill_value=0)
0 6.0
1 NaN
dtype: float64
When adding columns one and two, use Series.add with fill_value=0.
>>> d
col1 col2 col3
0 1.0 NaN 5.0
1 NaN 4.0 NaN
>>>
>>> d['col1'].add(d['col2'], fill_value=0)
0 1.0
1 4.0
dtype: float64
Dataframes and series have methods like add, sub, ... in order to perform more sophisticated operations than the associated operators +, -, ... can provide.
The methods may take additional arguments that finetune the operation.

pandas groupby two columns and generate columns from second

I have a pandas dataframe like:
index col1 col2 col3 col4 col5
0 a c 1 2 f
1 a c 1 2 f
2 a d 1 2 f
3 b d 1 2 g
4 b e 1 2 g
5 b e 1 2 g
if i group by two columns, like the following:
df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
I get:
col3 col4
col1 col2
a c 2 4
d 1 2
b d 1 2
e 2 4
Is it possible to convert this to:
col1 c_col3 d_col3 c_col4 d_col4 e_col3 e_col4
a 2 1 4 2 Nan Nan
b Nan 1 Nan 2 2 4
in an efficient manner where col1 is the index?
Add unstack for MultiIndex in columns, so necessary flattening:
df1 = df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'}).unstack()
#python 3.6+
df1.columns = [f'{j}_{i}' for i, j in df1.columns]
#python bellow
#df1.columns = ['{}_{}'.format(j, i) for i, j in df1.columns]
print (df1)
c_col3 d_col3 e_col3 c_col4 d_col4 e_col4
col1
a 2.0 1.0 NaN 4.0 2.0 NaN
b NaN 1.0 2.0 NaN 2.0 4.0

Compare each of the column values and return final value based on conditions

I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply some condition to the column values and return the final result in a new column.
The condition is to assign values based on this order of priority where 2 being the first priority: [2,1,3,0,4]
I tried to define a function to append the final results but wasnt really getting anywhere...any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
first you may want to get ride of the NaNs:
df.fillna(5)
and then apply a function to every row to find your value:
def func(x,l=[2,1,3,0,4,5]):
for j in l:
if(j in x):
return j
df['new'] = df.apply(lambda x: func(list(x)),axis =1)
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
maybe a little later.
import numpy as np
def f(x):
for i in [2,1,3,0,4]:
if i in x.tolist():
return i
return np.nan
df["col4"] = df.apply(f, axis=1)
and the Output:
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2

Categories

Resources