df is like this,
A B C
0 NaN 150 -150
1 100 NaN 150
2 -100 -150 NaN
3 -100 -150 NaN
4 NaN 150 150
5 100 NaN -150
Another array is array([1, 2, 3])
I want to replace the non-null values in each column with the corresponding value from the array, so the result will be:
A B C
0 NaN 2 3
1 1 NaN 3
2 1 2 NaN
3 1 2 NaN
4 NaN 2 3
5 1 NaN 3
How can I achieve this in a simple way? I wrote something like:
df[df.notnull()] = np.array([1,2,3])
df[df.notnull()].loc[:,] = np.array([1,2,3])
but neither works.
How about:
>>> arr = np.array([1, 2, 3])
>>> (df * 0 + 1) * arr
A B C
0 NaN 2 3
1 1 NaN 3
2 1 2 NaN
3 1 2 NaN
4 NaN 2 3
5 1 NaN 3
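The trick: df * 0 + 1 turns every non-null value into 1 while leaving the NaNs in place, and multiplying by the 1-D array then broadcasts it across the columns. A minimal sketch reproducing the example (arr is assumed to be the array from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 100, -100, -100, np.nan, 100],
    'B': [150, np.nan, -150, -150, 150, np.nan],
    'C': [-150, 150, np.nan, np.nan, 150, -150],
})
arr = np.array([1, 2, 3])

result = (df * 0 + 1) * arr  # ones keep the NaN pattern, then arr broadcasts

# An equivalent alternative that makes the broadcasting explicit:
alt = pd.DataFrame(
    np.where(df.notna(), np.broadcast_to(arr, df.shape), np.nan),
    index=df.index, columns=df.columns,
)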
The goal is to generate a column pct, grouped by id, where pct = (first value of pts in the id group * 100) / (number of rows with that id where x and y are both NaN). For example, when id = 1, pct = (5 * 100) / 2 = 250. This should run over the whole dataframe.
Sample df:
id pts x y
0 1 5 NaN NaN
1 1 5 1.0 NaN
2 1 5 NaN NaN
3 2 8 NaN NaN
4 2 8 2.0 1.0
5 3 7 NaN NaN
6 3 7 NaN 5.0
7 3 7 NaN NaN
8 3 7 NaN NaN
9 4 1 NaN NaN
Expected Output:
id pts x y pct
0 1 5 NaN NaN 250
1 1 5 1.0 NaN 250
2 1 5 NaN NaN 250
3 2 8 NaN NaN 800
4 2 8 2.0 1.0 800
5 3 7 NaN NaN 233
6 3 7 NaN 5.0 233
7 3 7 NaN NaN 233
8 3 7 NaN NaN 233
9 4 1 NaN NaN 100
I tried:
df['pct'] = df.groupby('id')['pts']/df.groupby('id')['x']['y'].count(axis=1)* 100
This works:
df['pct'] = df['id'].map(df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum()))
Output:
>>> df
id pts x y pct
0 1 5 NaN NaN 250
1 1 5 1.0 NaN 250
2 1 5 NaN NaN 250
3 2 8 NaN NaN 800
4 2 8 2.0 1.0 800
5 3 7 NaN NaN 233
6 3 7 NaN 5.0 233
7 3 7 NaN NaN 233
8 3 7 NaN NaN 233
9 4 1 NaN NaN 100
Explanation
>>> df[['x', 'y']]
x y
0 NaN NaN
1 1.0 NaN
2 NaN NaN
3 NaN NaN
4 2.0 1.0
5 NaN NaN
6 NaN 5.0
7 NaN NaN
8 NaN NaN
9 NaN NaN
First, we create a mask of the selected x and y columns, where each value is True if it is NaN and False otherwise:
>>> df[['x', 'y']].isna()
x y
0 True True
1 False True
2 True True
3 True True
4 False False
5 True True
6 True False
7 True True
8 True True
9 True True
Next, we count how many NaNs are in each row by summing horizontally. Since True is interpreted as 1 and False as 0, this works:
>>> df[['x', 'y']].isna().sum(axis=1)
0 2
1 1
2 2
3 2
4 0
5 2
6 1
7 2
8 2
9 2
Then, we mark which rows had 2 NaN values (2 because x and y are 2 columns):
>>> df[['x', 'y']].isna().sum(axis=1).eq(2)
0 True
1 False
2 True
3 True
4 False
5 True
6 False
7 True
8 True
9 True
Finally, we count how many True values there were (a True value means that row contained only NaNs), by summing the True values again:
>>> df[['x', 'y']].isna().sum(axis=1).eq(2).sum()
7
Of course, we do this in a .groupby(...).apply(...) call, so this code gets executed for each group of id, not across the whole dataframe like this explanation has done. But the concepts are identical:
>>> df.groupby('id').apply(lambda x: x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1 2
2 1
3 3
4 1
dtype: int64
So for id = 1, 2 rows have x and y NaN. For id = 2, 1 row has x and y NaN. And so on...
The other (first) part of the code in the groupby call:
x['pts'].iloc[0] * 100
For each group, it selects the first value of pts and multiplies it by 100:
>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100)
id
1 500
2 800
3 700
4 100
dtype: int64
Combined with the other code just explained:
>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1 250
2 800
3 233
4 100
dtype: int64
(Floor division, //, is what makes each result an integer: for id = 3, 700 // 3 gives 233 rather than 233.33.) Finally, we map the values in id to the values we've just computed (notice in the above that the numbers are indexed by the values of id):
>>> df['id']
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 3
8 3
9 4
Name: id, dtype: int64
>>> computed = df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
>>> computed
id
1 250
2 800
3 233
4 100
dtype: int64
>>> df['id'].map(computed)
0 250
1 250
2 250
3 800
4 800
5 233
6 233
7 233
8 233
9 100
Name: id, dtype: int64
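As an aside, the same column can be built without apply by using grouped transforms; a sketch against the same df (floor division mirrors the // above):

first_pts = df.groupby('id')['pts'].transform('first')   # first pts per id, broadcast to every row
both_nan = df[['x', 'y']].isna().all(axis=1)              # True where x and y are both NaN
n_both_nan = both_nan.groupby(df['id']).transform('sum')  # per-id count, broadcast to every row
df['pct'] = first_pts * 100 // n_both_nan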
I have a df
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
I need to group by a and b, and then, if c or d contains one or more NaNs within a group, I want the entire group in that specific column to be NaN:
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 nan
1 3 1 nan
1 1 nan 3
1 1 nan 3
1 1 nan 4
and then combine c and d so that there are no NaNs anymore:
a b c d e
0 1 nan 1 1
0 2 2 nan 2
0 2 3 nan 3
1 3 1 nan 1
1 1 nan 3 3
1 1 nan 3 3
1 1 nan 4 4
You will want to check each group for whether it contains a NaN, set the appropriate value (NaN or the existing values), and then use combine_first() to combine the columns.
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_csv(StringIO("""
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
"""), sep=' ')
for col in ['c', 'd']:
    df[col] = df.groupby(['a', 'b'])[col].transform(lambda x: np.nan if x.isna().any() else x)
df['e'] = df['c'].combine_first(df['d'])
df
a b c d e
0 0 1 NaN 1.0 1.0
1 0 2 2.0 NaN 2.0
2 0 2 3.0 NaN 3.0
3 1 3 1.0 NaN 1.0
4 1 1 NaN 3.0 3.0
5 1 1 NaN 3.0 3.0
6 1 1 NaN 4.0 4.0
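If you'd rather avoid the Python-level lambda, the same nulling-out can be expressed with a grouped transform and mask; a sketch against the same df:

# True wherever the (a, b) group of that column contains any NaN
has_nan = df[['c', 'd']].isna().groupby([df['a'], df['b']]).transform('any')
df[['c', 'd']] = df[['c', 'd']].mask(has_nan)  # null out whole group-columns at once
df['e'] = df['c'].combine_first(df['d'])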
I would like a rolling count of unique values with a maximum window of 36, which needs to treat NaN as not counted, so the count starts at 0 if the first value is NaN. I have a dataframe that looks like this:
Input:
val
NaN
1
1
NaN
2
1
3
NaN
5
Code:
b = a.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
It gives me:
Val count
NaN 1
1 2
1 2
NaN 3
2 4
1 4
3 5
NaN 6
5 7
Expected Output:
Val count
NaN 0
1 1
1 1
NaN 1
2 2
1 2
3 3
NaN 3
5 4
You can just filter out the NaNs:
df.val.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x[~np.isnan(x)]))).fillna(0)
Out[35]:
0 0.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 3.0
7 3.0
8 4.0
Name: val, dtype: float64
The reason the original code over-counts is that np.unique does not collapse NaNs into one value, since NaN never compares equal to itself:
np.unique([np.nan]*2)
Out[38]: array([nan, nan])
np.nan==np.nan
Out[39]: False
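For reference, a self-contained version of the fix (the Series construction is assumed from the question's sample data):

import numpy as np
import pandas as pd

a = pd.Series([np.nan, 1, 1, np.nan, 2, 1, 3, np.nan, 5], name='val')

# Count unique non-NaN values per window; rolling.apply returns NaN for the
# leading all-NaN window (no valid observations), hence the fillna(0).
count = (a.rolling(36, min_periods=1)
          .apply(lambda x: len(np.unique(x[~np.isnan(x)])))
          .fillna(0)
          .astype(int))
print(count.tolist())  # [0, 1, 1, 1, 2, 2, 3, 3, 4]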
Suppose I have a pandas dataframe as follows,
data
id A B C D E
1 NaN 1 NaN 1 1
1 NaN 2 NaN 2 2
1 NaN 3 NaN NaN 3
1 NaN 4 NaN NaN 4
1 NaN 5 NaN NaN 5
2 NaN 6 NaN NaN 6
2 NaN 7 5 NaN 7
2 NaN 8 6 2 8
2 NaN 9 NaN NaN 9
2 NaN 10 NaN 4 10
3 NaN 11 NaN NaN 11
3 NaN 12 NaN NaN 12
3 NaN 13 NaN NaN 13
3 NaN 14 NaN NaN 14
3 NaN 15 NaN NaN 15
3 NaN 16 NaN NaN 16
I am using the following command,
pd.DataFrame(data.count().sort_values(ascending=False)).reset_index()
and get the following output,
index 0
0 E 16
1 B 16
2 id 16
3 D 4
4 C 2
5 A 0
I want the following output,
columns count unique(id) count
E 16 3
B 16 3
D 4 2
C 2 1
A 0 0
where count is the same as before, and unique(id) count is the number of unique ids for which each column has at least one non-null value. I want both as two separate fields.
Can anybody help me in doing this?
Thanks
Starting with:
In [7]: df
Out[7]:
id A B C D E
0 1 NaN 1 NaN 1.0 1
1 1 NaN 2 NaN 2.0 2
2 1 NaN 3 NaN NaN 3
3 1 NaN 4 NaN NaN 4
4 1 NaN 5 NaN NaN 5
5 2 NaN 6 NaN NaN 6
6 2 NaN 7 5.0 NaN 7
7 2 NaN 8 6.0 2.0 8
8 2 NaN 9 NaN NaN 9
9 2 NaN 10 NaN 4.0 10
10 3 NaN 11 NaN NaN 11
11 3 NaN 12 NaN NaN 12
12 3 NaN 13 NaN NaN 13
13 3 NaN 14 NaN NaN 14
14 3 NaN 15 NaN NaN 15
15 3 NaN 16 NaN NaN 16
Here is a rather inelegant way:
In [8]: (df.groupby('id').count() > 0).sum()
Out[8]:
A 0
B 3
C 1
D 2
E 3
dtype: int64
So, simply make your base DataFrame as you've specified:
In [9]: counts = (df[['A','B','C','D','E']].count().sort_values(ascending=False)).to_frame(name='counts')
In [10]: counts
Out[10]:
counts
E 16
B 16
D 4
C 2
A 0
And then simply:
In [11]: counts['unique(id) counts'] = (df.groupby('id').count() > 0).sum()
In [12]: counts
Out[12]:
counts unique(id) counts
E 16 3
B 16 3
D 4 2
C 2 1
A 0 0
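For what it's worth, both columns can also be assembled in one pass with pd.concat; a sketch reusing the labels from above:

value_cols = df.columns.drop('id')
counts = pd.concat({
    'counts': df[value_cols].count(),
    'unique(id) counts': df[value_cols].groupby(df['id']).count().gt(0).sum(),
}, axis=1).sort_values('counts', ascending=False)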
I have a data frame like this:
A B C D
0 1 0 nan nan
1 8 0 nan nan
2 8 1 nan nan
3 2 1 nan nan
4 0 0 nan nan
5 1 1 nan nan
and i have a dictionary like this:
dc = {'C': 5, 'D' : 10}
I want to fill the NaN values in the data frame with the dictionary, but only in the rows where the value in column B is 0. I want to obtain this:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 nan nan
3 2 1 nan nan
4 0 0 5 10
5 1 1 nan nan
I know how to subset the dataframe, but I can't find a way to fill the values with the dictionary; any ideas?
You could use fillna with loc and pass your dict to it:
In [13]: df.loc[df.B==0,:].fillna(dc)
Out[13]:
A B C D
0 1 0 5 10
1 8 0 5 10
4 0 0 5 10
To apply this to your dataframe, slice with the same mask and assign the result above back to it:
df.loc[df.B==0, :] = df.loc[df.B==0,:].fillna(dc)
In [15]: df
Out[15]:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 NaN NaN
3 2 1 NaN NaN
4 0 0 5 10
5 1 1 NaN NaN
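An alternative that avoids slicing twice is df.update, which writes the filled values back in place; a sketch (list(dc) restricts the slice to the dict's columns, here ['C', 'D']):

df.update(df.loc[df['B'].eq(0), list(dc)].fillna(dc))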