I have a dataframe like the following:
Col1
0 C
1 A
3 D
4 A
5 A
I would like to count the number of steps (rows) until a given value re-occurs, so I would get the following:
Col1 Col2
0 C NaN
1 A 2
3 D NaN
4 A 1
5 A NaN
Any ideas on how to do it? Thanks for the help!
Use GroupBy.cumcount and then replace 0 with NaN:
import numpy as np

df['Col2'] = df.groupby('Col1').cumcount(ascending=False).replace(0, np.nan)
print (df)
Col1 Col2
0 C NaN
1 A 2.0
3 D NaN
4 A 1.0
5 A NaN
Alternative solution with mask:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).mask(lambda x: x == 0)
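To make the counting concrete, here is a small illustration (my own, reconstructing the sample frame from the question) of the intermediate cumcount values before the zeros are turned into NaN:
import pandas as pd

# Sample frame reconstructed from the question (note the index skips 2):
df = pd.DataFrame({'Col1': ['C', 'A', 'D', 'A', 'A']}, index=[0, 1, 3, 4, 5])

# The descending cumcount says how many occurrences of each value remain
# below the current row, so 0 marks the last occurrence of that value:
print(df.groupby('Col1').cumcount(ascending=False))
# 0    0
# 1    2
# 3    0
# 4    1
# 5    0
# dtype: int64

# replace(0, np.nan) (or the mask above) turns those zeros into NaN,
# which gives Col2 from the expected output.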
I have a data frame like this for example:
col1 col2
0 A 3
1 B 4
2 A NaN
3 B 5
4 A 5
5 A NaN
6 B NaN
.
.
.
47 B 8
48 A 9
49 B NaN
50 A NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index(), it gives me this output:
col1 col2
0 A NaN
1 B NaN
I want to get the last non-NaN value after the groupby and agg. The desired output is below:
col1 col2
0 A 9
1 B 8
For me your solution works well, provided the NaN values are real missing values.
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to missing values:
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
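As a quick sanity check, here is a minimal sketch (my own, on a cut-down frame with made-up values that only mimic the NaN pattern of the question) showing that 'last' already skips missing values, and what the drop_duplicates alternative returns:
import numpy as np
import pandas as pd

# Cut-down frame; the values are invented, only the NaN pattern matters:
df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'col2': [3, 4, np.nan, 8, 9, np.nan]})

# 'last' skips NaN by default, so this returns the last non-NaN value per group:
print(df.groupby('col1', sort=False).agg({'col2': 'last'}).reset_index())
#   col1  col2
# 0    A   9.0
# 1    B   8.0

# The drop_duplicates alternative keeps the original rows and index instead:
print(df.dropna(subset=['col2']).drop_duplicates('col1', keep='last'))
#   col1  col2
# 3    B   8.0
# 4    A   9.0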
Let's take this dataframe:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4,5,6], Col2=[4,np.nan,5,np.nan,1,5]))
Col1 Col2
0 1.0 4.0
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
4 5.0 1.0
5 6.0 5.0
I would like to extract the last n rows of df that contain no NaN.
Could you please help me get this expected result?
Col1 Col2
0 5 1
1 6 5
EDIT: Let's say I don't know where the last NaN is.
Use DataFrame.dropna with DataFrame.tail and convert to integers:
N = 2
df1 = df.dropna().tail(N).astype(int)
#alternative
#df1 = df.dropna().iloc[-N:].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
EDIT: To get the last block of rows with no missing values, flag missing values with DataFrame.isna and DataFrame.any, then reverse the order and take a cumulative sum, so the trailing rows with no missing values get 0 in the mask:
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
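To make the trick easier to follow, here is a step-by-step breakdown of the mask (my own illustration, on the same sample frame):
has_na = df.isna().any(axis=1)      # True for rows 1, 2 and 3
rev    = has_na.iloc[::-1]          # walk the frame from the bottom up
total  = rev.cumsum()               # running count of NaN rows seen so far
m      = total.eq(0).sort_index()   # True only for the trailing clean rows
print(m)
# 0    False
# 1    False
# 2    False
# 3    False
# 4     True
# 5     True
# dtype: bool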
If no rows match, it correctly returns an empty DataFrame:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4], Col2=[np.nan,np.nan,5,np.nan]))
print (df)
Col1 Col2
0 1.0 NaN
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Empty DataFrame
Columns: [Col1, Col2]
Index: []
Another way is to use isna with cumsum and drop_duplicates to find the index of the last row containing a NaN, and then just use index filtering:
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
print(new_df)
Col1 Col2
4 5.0 1.0
5 6.0 5.0
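A breakdown of the intermediate steps (again my own illustration, on the same sample frame) may help:
steps = df.isna().cumsum(axis=0)            # running NaN count per column
print(steps.drop_duplicates(keep='first'))  # first row of each distinct count
#    Col1  Col2
# 0     0     0
# 1     0     1
# 2     1     1
# 3     1     2

# The last kept label (3) is the last row that introduced a NaN, so every
# row after it is NaN-free:
last_na = steps.drop_duplicates(keep='first').index.max() + 1   # -> 4
print(df.iloc[last_na:])
#    Col1  Col2
# 4   5.0   1.0
# 5   6.0   5.0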
I have a pandas DataFrame with three columns, into which I introduce NaN values:
>>> d=pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3':[5,6]})
>>> d
col1 col2 col3
0 1 3 5
1 2 4 6
>>> d['col2'].iloc[0]=np.nan
>>> d
col1 col2 col3
0 1 NaN 5
1 2 4.0 6
>>> d['col1'].iloc[1]=np.nan
>>> d
col1 col2 col3
0 1.0 NaN 5
1 NaN 4.0 6
>>> d['col3'].iloc[1]=np.nan
>>> d
col1 col2 col3
0 1.0 NaN 5.0
1 NaN 4.0 NaN
Now, I would like the column addition to have the following output:
>>> d['col1']+d['col3']
0 6.0
1 NaN
>>> d['col1']+d['col2']
0 1.0
1 4.0
However, in reality, the output is instead:
>>> d['col1']+d['col3']
0 6.0
1 NaN
>>> d['col1']+d['col2']
0 NaN
1 NaN
Does anyone know how to achieve this?
You can use add to get your sums, with fill_value=0:
>>> d.col1.add(d.col2, fill_value=0)
0 1.0
1 4.0
dtype: float64
>>> d.col1.add(d.col3, fill_value=0)
0 6.0
1 NaN
dtype: float64
When adding columns one and two, use Series.add with fill_value=0.
>>> d
col1 col2 col3
0 1.0 NaN 5.0
1 NaN 4.0 NaN
>>>
>>> d['col1'].add(d['col2'], fill_value=0)
0 1.0
1 4.0
dtype: float64
DataFrames and Series have methods like add, sub, ... that perform more sophisticated operations than the associated operators +, -, ... can provide.
These methods take additional arguments that fine-tune the operation.
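A small self-contained illustration of what fill_value buys you (the numbers here are made up):
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([10.0, 20.0, np.nan])

# The + operator propagates NaN as soon as either operand is missing:
print(s1 + s2)                     # 11.0, NaN, NaN

# add() with fill_value=0 substitutes 0 for the missing operand, so NaN
# only survives when both sides are missing:
print(s1.add(s2, fill_value=0))    # 11.0, 20.0, 3.0

# sub() accepts the same argument:
print(s1.sub(s2, fill_value=0))    # -9.0, -20.0, 3.0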
I have a pandas dataframe like:
index col1 col2 col3 col4 col5
0 a c 1 2 f
1 a c 1 2 f
2 a d 1 2 f
3 b d 1 2 g
4 b e 1 2 g
5 b e 1 2 g
If I group by two columns, like the following:
df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
I get:
col3 col4
col1 col2
a c 2 4
d 1 2
b d 1 2
e 2 4
Is it possible to convert this to:
col1 c_col3 d_col3 c_col4 d_col4 e_col3 e_col4
a 2 1 4 2 NaN NaN
b NaN 1 NaN 2 2 4
in an efficient manner where col1 is the index?
Add unstack; this creates a MultiIndex in the columns, so flattening the column names is necessary:
df1 = df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'}).unstack()
#python 3.6+
df1.columns = [f'{j}_{i}' for i, j in df1.columns]
#python below 3.6
#df1.columns = ['{}_{}'.format(j, i) for i, j in df1.columns]
print (df1)
c_col3 d_col3 e_col3 c_col4 d_col4 e_col4
col1
a 2.0 1.0 NaN 4.0 2.0 NaN
b NaN 1.0 2.0 NaN 2.0 4.0
I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply a condition to the column values and return the final result in a new column.
The condition is to pick the value in each row according to this order of priority, with 2 being the highest priority: [2, 1, 3, 0, 4]
I tried to define a function to append the final results but wasn't really getting anywhere... any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
First you may want to get rid of the NaNs:
df = df.fillna(5)
and then apply a function to every row to find your value:
def func(x, l=[2,1,3,0,4,5]):
    for j in l:
        if j in x:
            return j

df['new'] = df.apply(lambda x: func(list(x)), axis=1)
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
Maybe a little late, but here is another approach.
import numpy as np
def f(x):
    for i in [2,1,3,0,4]:
        if i in x.tolist():
            return i
    return np.nan
df["col4"] = df.apply(f, axis=1)
and the output:
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2
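If row-wise apply turns out to be slow on a bigger frame, one possible alternative (my own sketch, not taken from the answers above) is to loop over the short priority list instead of over the rows, letting higher-priority matches overwrite lower-priority ones:
import numpy as np
import pandas as pd

# Frame reconstructed from the question:
df = pd.DataFrame({'col1': [1, 2, 3, 2, 0],
                   'col2': [2, 3, 4, np.nan, 2],
                   'col3': [3, np.nan, np.nan, np.nan, np.nan]})

cols = ['col1', 'col2', 'col3']
df['col4'] = np.nan
# Walk the priority list from lowest to highest priority; each pass marks the
# rows containing that value, and a later (higher-priority) pass overwrites it:
for value in reversed([2, 1, 3, 0, 4]):
    df.loc[df[cols].eq(value).any(axis=1), 'col4'] = value

print(df)
#    col1  col2  col3  col4
# 0     1   2.0   3.0   2.0
# 1     2   3.0   NaN   2.0
# 2     3   4.0   NaN   3.0
# 3     2   NaN   NaN   2.0
# 4     0   2.0   NaN   2.0
# (col4 comes out as float because it was initialised with NaN.)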