Pandas: Generate column on groupby and value_counts

The goal is to generate a column 'pct', grouped by 'id', where pct = (first value of 'pts' within the 'id' group * 100) / (number of rows with that 'id' where 'x' and 'y' are both NaN). For example, when id=1, pct = (5 * 100) / 2 = 250. This should be computed across the whole dataframe.
Sample df:
id pts x y
0 1 5 NaN NaN
1 1 5 1.0 NaN
2 1 5 NaN NaN
3 2 8 NaN NaN
4 2 8 2.0 1.0
5 3 7 NaN NaN
6 3 7 NaN 5.0
7 3 7 NaN NaN
8 3 7 NaN NaN
9 4 1 NaN NaN
Expected Output:
id pts x y pct
0 1 5 NaN NaN 250
1 1 5 1.0 NaN 250
2 1 5 NaN NaN 250
3 2 8 NaN NaN 800
4 2 8 2.0 1.0 800
5 3 7 NaN NaN 233
6 3 7 NaN 5.0 233
7 3 7 NaN NaN 233
8 3 7 NaN NaN 233
9 4 1 NaN NaN 100
I tried:
df['pct'] = df.groupby('id')['pts']/df.groupby('id')['x']['y'].count(axis=1)* 100

This works:
df['pct'] = df['id'].map(df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum()))
Output:
>>> df
id pts x y pct
0 1 5 NaN NaN 250
1 1 5 1.0 NaN 250
2 1 5 NaN NaN 250
3 2 8 NaN NaN 800
4 2 8 2.0 1.0 800
5 3 7 NaN NaN 233
6 3 7 NaN 5.0 233
7 3 7 NaN NaN 233
8 3 7 NaN NaN 233
9 4 1 NaN NaN 100
Explanation
>>> df[['x', 'y']]
x y
0 NaN NaN
1 1.0 NaN
2 NaN NaN
3 NaN NaN
4 2.0 1.0
5 NaN NaN
6 NaN 5.0
7 NaN NaN
8 NaN NaN
9 NaN NaN
First, we create a mask of the selected x and y columns, where each value is True if it is NaN and False otherwise:
>>> df[['x', 'y']].isna()
x y
0 True True
1 False True
2 True True
3 True True
4 False False
5 True True
6 True False
7 True True
8 True True
9 True True
Next, we count how many NaNs are in each row by summing horizontally. Since True is interpreted as 1 and False as 0, this works:
>>> df[['x', 'y']].isna().sum(axis=1)
0 2
1 1
2 2
3 2
4 0
5 2
6 1
7 2
8 2
9 2
Then, we check which rows had 2 NaN values (2 because x and y are 2 columns):
>>> df[['x', 'y']].isna().sum(axis=1).eq(2)
0 True
1 False
2 True
3 True
4 False
5 True
6 False
7 True
8 True
9 True
Finally, we count how many True values there were (a True value means that row contained only NaNs), by summing the True values again:
>>> df[['x', 'y']].isna().sum(axis=1).eq(2).sum()
7
Of course, we do this in a .groupby(...).apply(...) call, so this code gets executed for each group of id rather than across the whole dataframe as in this explanation. But the concepts are identical:
>>> df.groupby('id').apply(lambda x: x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1 2
2 1
3 3
4 1
dtype: int64
So for id = 1, 2 rows have x and y NaN. For id = 2, 1 row has x and y NaN. And so on...
The first part of the code inside the groupby call:
x['pts'].iloc[0] * 100
simply selects, for each group, the first value of pts and multiplies it by 100:
>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100)
id
1 500
2 800
3 700
4 100
dtype: int64
Combined with the other code just explained:
>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1 250
2 800
3 233
4 100
dtype: int64
Finally, we map the values of id to the values we've just computed (notice above that the results are indexed by the values of id):
>>> df['id']
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 3
8 3
9 4
Name: id, dtype: int64
>>> computed = df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
>>> computed
id
1 250
2 800
3 233
4 100
dtype: int64
>>> df['id'].map(computed)
0 250
1 250
2 250
3 800
4 800
5 233
6 233
7 233
8 233
9 100
Name: id, dtype: int64
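For reference, the same numbers can be computed without groupby.apply, by broadcasting per-group values back onto the rows with transform. A sketch under the same assumptions (floor division, as above):
# Rows where both x and y are NaN, counted per id and broadcast back to each row
all_nan = df[['x', 'y']].isna().all(axis=1)
denom = all_nan.groupby(df['id']).transform('sum')
# First pts value per id, also broadcast, then the same floor-divided percentage
df['pct'] = df.groupby('id')['pts'].transform('first') * 100 // denom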

Related

Pandas ffill for certain values in a column

I have a df like this:
time data
0 1
1 1
2 nan
3 nan
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 nan
Is there a way to use pd.Series.ffill() to forward fill only certain occurrences of values? Specifically, I want to forward fill only where the values in df.data are 1 or 4. It should look like this:
time data
0 1
1 1
2 1
3 1
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 4
One option would be to forward fill (ffill) the column, then populate only the rows where the filled values are 1 or 4, using isin and mask:
s = df['data'].ffill()
df['data'] = df['data'].mask(s.isin([1, 4]), s)
df:
time data
0 0 1.0
1 1 1.0
2 2 1.0
3 3 1.0
4 4 6.0
5 5 NaN
6 6 NaN
7 7 NaN
8 8 5.0
9 9 4.0
10 10 4.0
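As a self-contained sketch (the frame construction here is assumed from the sample above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': range(11),
    'data': [1, 1, np.nan, np.nan, 6, np.nan, np.nan, np.nan, 5, 4, np.nan],
})
s = df['data'].ffill()
# Take the forward-filled value only where it is 1 or 4; keep NaN elsewhere
df['data'] = df['data'].mask(s.isin([1, 4]), s)
print(df)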

How to count consecutive repetitions in a pandas series

Consider the following series, ser
date id
2000 NaN
2001 NaN
2001 1
2002 1
2000 2
2001 2
2002 2
2001 NaN
2010 NaN
2000 1
2001 1
2002 1
2010 NaN
How can I count the values such that every run of consecutive values is counted and returned?
Count
NaN 2
1 2
2 3
NaN 2
1 3
NaN 1
One approach is to use fillna with a sentinel so that consecutive NaN values compare equal:
import numpy as np
import pandas as pd

# Replace NaN with a sentinel string so consecutive NaNs form one run
s = df.id.fillna('nan')
# True at the first row of each new run
mask = s.ne(s.shift())
ids = s[mask].to_numpy()
# Label each run via the cumulative sum of run starts, then take each run's length
counts = s.groupby(mask.cumsum()).size().to_numpy()
# Convert the 'nan' sentinel back to NaN
ids[ids == 'nan'] = np.nan
ser_out = pd.Series(counts, index=ids, name='counts')
[out]
nan 2
1.0 2
2.0 3
nan 2
1.0 3
nan 1
Name: counts, dtype: int64
The cumsum trick is useful here. It's a little tricky with the NaNs, though, because NaN never equals NaN, so the NaN runs need to be handled separately. First, mark the rows where the current and previous values are both NaN:
In [11]: df.id.isnull() & df.id.shift().isnull()
Out[11]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 False
Name: id, dtype: bool
Then mark the rows where the current value equals the previous one (this is False for NaNs):
In [12]: df.id.eq(df.id.shift())
Out[12]:
0 False
1 False
2 False
3 True
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 False
Name: id, dtype: bool
The OR of the two is True wherever a row continues the previous run:
In [13]: (df.id.isnull() & df.id.shift().isnull()) | (df.id.eq(df.id.shift()))
Out[13]:
0 True
1 True
2 False
3 True
4 False
5 True
6 True
7 False
8 True
9 False
10 True
11 True
12 False
Name: id, dtype: bool
A new run starts wherever that is False, so negate it and take the cumulative sum to give each run its own label:
In [14]: (~((df.id.isnull() & df.id.shift().isnull()) | (df.id.eq(df.id.shift())))).cumsum()
Out[14]:
0 0
1 0
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 4
10 4
11 4
12 5
Name: id, dtype: int64
Now you can use this labelling in your groupby:
In [15]: g = df.groupby((~((df.id.isnull() & df.id.shift().isnull()) | (df.id.eq(df.id.shift())))).cumsum())
In [16]: pd.DataFrame({"count": g.id.size(), "id": g.id.nth(0)})
Out[16]:
count id
id
0 2 NaN
1 2 1.0
2 3 2.0
3 2 NaN
4 3 1.0
5 1 NaN
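Both approaches can be condensed into a single pass; a sketch, with the series rebuilt by hand from the question's data:
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, np.nan, 1, 1, 2, 2, 2,
                 np.nan, np.nan, 1, 1, 1, np.nan], name='id')
key = ser.fillna('nan')              # sentinel so consecutive NaNs compare equal
runs = key.ne(key.shift()).cumsum()  # one label per run
out = ser.groupby(runs).agg(['first', 'size'])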

How to sort data by completeness rate level in pandas

Here's my dataset
id feature_1 feature_2 feature_3 feature_4 feature_5
1 10 15 10 15 20
2 10 NaN 10 NaN 20
3 10 NaN 10 NaN 20
4 10 46 NaN 23 20
5 10 NaN 10 NaN 20
Here's what I need: I want to sort the rows by completeness level (the higher the percentage of non-NaN values, the higher the completeness level), ascending, so that it's easier to impute the missing values afterwards:
id feature_1 feature_2 feature_3 feature_4 feature_5
2 10 NaN 10 NaN 20
3 10 NaN 10 NaN 20
5 10 NaN 10 NaN 20
4 10 46 NaN 23 20
1 10 15 10 15 20
Try this:
import pandas as pd
import numpy as np
d = {
    'A': ['X', np.nan, np.nan, 'X', 'Y', np.nan, 'X', 'X', np.nan, 'X', 'X'],
    'B': ['Y', np.nan, 'X', 'Val', 'X', 'X', np.nan, 'X', 'X', 'X', 'X'],
    'C': ['Y', 'X', 'X', np.nan, 'X', 'X', 'Val', 'X', 'X', np.nan, np.nan],
}
df = pd.DataFrame(data=d)
df.T.isnull().sum()
Out[72]:
0 0
1 2
2 1
3 1
4 0
5 1
6 1
7 0
8 1
9 1
10 1
dtype: int64
df['is_null'] = df.T.isnull().sum()
df.sort_values('is_null', ascending=False)
Out[77]:
A B C is_null
1 NaN NaN X 2
2 NaN X X 1
3 X Val NaN 1
5 NaN X X 1
6 X NaN Val 1
8 NaN X X 1
9 X X NaN 1
10 X X NaN 1
0 X Y Y 0
4 Y X X 0
7 X X X 0
If you want to sort by the column with the maximal number of NaNs:
c = df.isnull().sum().idxmax()
print(c)
feature_2
df = df.sort_values(c, na_position='first', ascending=False)
print(df)
id feature_1 feature_2 feature_3 feature_4 feature_5
1 2 10 NaN 10.0 NaN 20
2 3 10 NaN 10.0 NaN 20
4 5 10 NaN 10.0 NaN 20
3 4 10 46.0 NaN 23.0 20
0 1 10 15.0 10.0 15.0 20
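For the original frame, a minimal sketch of the per-row idea, ordering rows from least to most complete (a stable sort keeps ties in their original order):
# Count NaNs per row, sort descending so the least complete rows come first
order = df.isnull().sum(axis=1).sort_values(ascending=False, kind='mergesort').index
df_sorted = df.loc[order]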

Count distinct strings in rolling window include NaN using pandas

I would like a rolling distinct count with a window of 36 that does not count NaN as a distinct value, so that the count starts at 0 when the first value is NaN. I have a dataframe that looks like this:
Input:
val
NaN
1
1
NaN
2
1
3
NaN
5
Code:
b = a.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
It gives me:
Val count
NaN 1
1 2
1 2
NaN 3
2 4
1 4
3 5
NaN 6
5 7
Expected Output:
Val count
NaN 0
1 1
1 1
NaN 1
2 2
1 2
3 3
NaN 3
5 4
You can just filter out the NaNs inside the lambda:
df.val.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x[~np.isnan(x)]))).fillna(0)
Out[35]:
0 0.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 3.0
7 3.0
8 4.0
Name: val, dtype: float64
The reason the NaNs inflated the count is that np.unique does not deduplicate NaN, because NaN never equals NaN:
np.unique([np.nan]*2)
Out[38]: array([nan, nan])
np.nan==np.nan
Out[39]: False
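A slightly shorter sketch of the same idea relies on pandas' nunique, which skips NaN by default (df and the val column are assumed from above):
import pandas as pd

# nunique ignores NaN by default, so no manual filtering is needed
b = df.val.rolling(36, min_periods=1).apply(
    lambda x: pd.Series(x).nunique()
).astype(int)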

replace each column of pandas dataframe with each value of array

df is like this,
A B C
0 NaN 150 -150
1 100 NaN 150
2 -100 -150 NaN
3 -100 -150 NaN
4 NaN 150 150
5 100 NaN -150
The other array is np.array([1, 2, 3]).
I want to replace each non-null value in each column with the corresponding value from the array, so that the result will be:
A B C
0 NaN 2 3
1 1 NaN 3
2 1 2 NaN
3 1 2 NaN
4 NaN 2 3
5 1 NaN 3
How can I achieve this in a simple way? I wrote something like
df[df.notnull()] = np.array([1,2,3])
df[df.notnull()].loc[:,] = np.array([1,2,3])
but neither works.
How about:
>>> arr = np.array([1, 2, 3])
>>> (df * 0 + 1) * arr
A B C
0 NaN 2 3
1 1 NaN 3
2 1 2 NaN
3 1 2 NaN
4 NaN 2 3
5 1 NaN 3
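The trick works because df * 0 + 1 preserves NaN while turning every other value into 1, and multiplying by arr then broadcasts it column-wise. A more explicit sketch of the same result, using where (df and arr as defined in the question):
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])
# Broadcast the array across every row, then blank out positions that were NaN
out = pd.DataFrame(np.broadcast_to(arr, df.shape),
                   index=df.index, columns=df.columns).where(df.notna())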
