df is like this,
A B C
0 NaN 150 -150
1 100 NaN 150
2 -100 -150 NaN
3 -100 -150 NaN
4 NaN 150 150
5 100 NaN -150
Another array is array([1, 2, 3])
I want to replace the non-null values in each column with the corresponding value from the array, so the result will be:
A B C
0 NaN 2 3
1 1 NaN 3
2 1 2 NaN
3 1 2 NaN
4 NaN 2 3
5 1 NaN 3
How can I achieve this in a simple way? I wrote something like:
df[df.notnull()] = np.array([1,2,3])
df[df.notnull()].loc[:,] = np.array([1,2,3])
but neither works.
How about:
>>> arr = np.array([1, 2, 3])
>>> (df * 0 + 1) * arr
A B C
0 NaN 2 3
1 1 NaN 3
2 1 2 NaN
3 1 2 NaN
4 NaN 2 3
5 1 NaN 3
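The trick: df * 0 + 1 turns every non-null value into 1 while leaving the NaNs in place, and multiplying by the 1-D array then broadcasts it across the columns. A minimal sketch reproducing the example (arr is assumed to be the array from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 100, -100, -100, np.nan, 100],
    'B': [150, np.nan, -150, -150, 150, np.nan],
    'C': [-150, 150, np.nan, np.nan, 150, -150],
})
arr = np.array([1, 2, 3])

result = (df * 0 + 1) * arr  # ones keep the NaN pattern, then arr broadcasts

# An equivalent alternative that makes the broadcasting explicit:
alt = pd.DataFrame(
    np.where(df.notna(), np.broadcast_to(arr, df.shape), np.nan),
    index=df.index, columns=df.columns,
)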
The goal is to generate a column pct, grouped by id, where pct = (first value of pts in the id group * 100) / (number of rows with that id where x and y are both NaN). For example, when id = 1, pct = (5 * 100) / 2 = 250. This should run over the whole dataframe.
Sample df:
id pts x y
0 1 5 NaN NaN
1 1 5 1.0 NaN
2 1 5 NaN NaN
3 2 8 NaN NaN
4 2 8 2.0 1.0
5 3 7 NaN NaN
6 3 7 NaN 5.0
7 3 7 NaN NaN
8 3 7 NaN NaN
9 4 1 NaN NaN
Expected Output:
id pts x y pct
0 1 5 NaN NaN 250
1 1 5 1.0 NaN 250
2 1 5 NaN NaN 250
3 2 8 NaN NaN 800
4 2 8 2.0 1.0 800
5 3 7 NaN NaN 233
6 3 7 NaN 5.0 233
7 3 7 NaN NaN 233
8 3 7 NaN NaN 233
9 4 1 NaN NaN 100
I tried:
df['pct'] = df.groupby('id')['pts']/df.groupby('id')['x']['y'].count(axis=1)* 100
This works:
df['pct'] = df['id'].map(df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum()))
Output:
>>> df
id pts x y pct
0 1 5 NaN NaN 250
1 1 5 1.0 NaN 250
2 1 5 NaN NaN 250
3 2 8 NaN NaN 800
4 2 8 2.0 1.0 800
5 3 7 NaN NaN 233
6 3 7 NaN 5.0 233
7 3 7 NaN NaN 233
8 3 7 NaN NaN 233
9 4 1 NaN NaN 100
Explanation
>>> df[['x', 'y']]
x y
0 NaN NaN
1 1.0 NaN
2 NaN NaN
3 NaN NaN
4 2.0 1.0
5 NaN NaN
6 NaN 5.0
7 NaN NaN
8 NaN NaN
9 NaN NaN
First, we create a mask of the selected x and y columns, where each value is True if it is NaN and False otherwise:
>>> df[['x', 'y']].isna()
x y
0 True True
1 False True
2 True True
3 True True
4 False False
5 True True
6 True False
7 True True
8 True True
9 True True
Next, we count how many NaNs are in each row by summing horizontally. Since True is interpreted as 1 and False as 0, this works:
>>> df[['x', 'y']].isna().sum(axis=1)
0 2
1 1
2 2
3 2
4 0
5 2
6 1
7 2
8 2
9 2
Then, we mark which rows had 2 NaN values (2 because x and y are 2 columns):
>>> df[['x', 'y']].isna().sum(axis=1).eq(2)
0 True
1 False
2 True
3 True
4 False
5 True
6 False
7 True
8 True
9 True
Finally, we count how many True values there were (a True value means that row contained only NaNs), by summing the True values again:
>>> df[['x', 'y']].isna().sum(axis=1).eq(2).sum()
7
Of course, we do this in a .groupby(...).apply(...) call, so this code gets executed for each group of id, not across the whole dataframe like this explanation has done. But the concepts are identical:
>>> df.groupby('id').apply(lambda x: x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1 2
2 1
3 3
4 1
dtype: int64
So for id = 1, 2 rows have x and y NaN. For id = 2, 1 row has x and y NaN. And so on...
The other (first) part of the code in the groupby call:
x['pts'].iloc[0] * 100
For each group, it selects the first value of pts and multiplies it by 100:
>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100)
id
1 500
2 800
3 700
4 100
dtype: int64
Combined with the other code just explained:
>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1 250
2 800
3 233
4 100
dtype: int64
(Floor division, //, is what makes each result an integer: for id = 3, 700 // 3 gives 233 rather than 233.33.) Finally, we map the values in id to the values we've just computed (notice in the above that the numbers are indexed by the values of id):
>>> df['id']
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 3
8 3
9 4
Name: id, dtype: int64
>>> computed = df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
>>> computed
id
1 250
2 800
3 233
4 100
dtype: int64
>>> df['id'].map(computed)
0 250
1 250
2 250
3 800
4 800
5 233
6 233
7 233
8 233
9 100
Name: id, dtype: int64
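As an aside, the same column can be built without apply by using grouped transforms; a sketch against the same df (floor division mirrors the // above):

first_pts = df.groupby('id')['pts'].transform('first')   # first pts per id, broadcast to every row
both_nan = df[['x', 'y']].isna().all(axis=1)              # True where x and y are both NaN
n_both_nan = both_nan.groupby(df['id']).transform('sum')  # per-id count, broadcast to every row
df['pct'] = first_pts * 100 // n_both_nan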
I have a df
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
I need to group by a and b, and then, if c or d contains one or more NaNs within a group, I want the entire group in that specific column to be NaN:
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 nan
1 3 1 nan
1 1 nan 3
1 1 nan 3
1 1 nan 4
and then combine c and d so that there are no NaNs anymore:
a b c d e
0 1 nan 1 1
0 2 2 nan 2
0 2 3 nan 3
1 3 1 nan 1
1 1 nan 3 3
1 1 nan 3 3
1 1 nan 4 4
You will want to check each group for whether it contains a NaN, set the appropriate value (NaN or the existing values), and then use combine_first() to combine the columns.
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_csv(StringIO("""
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
"""), sep=' ')
for col in ['c', 'd']:
    df[col] = df.groupby(['a', 'b'])[col].transform(lambda x: np.nan if x.isna().any() else x)
df['e'] = df['c'].combine_first(df['d'])
df
a b c d e
0 0 1 NaN 1.0 1.0
1 0 2 2.0 NaN 2.0
2 0 2 3.0 NaN 3.0
3 1 3 1.0 NaN 1.0
4 1 1 NaN 3.0 3.0
5 1 1 NaN 3.0 3.0
6 1 1 NaN 4.0 4.0
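If you'd rather avoid the Python-level lambda, the same nulling-out can be expressed with a grouped transform and mask; a sketch against the same df:

# True wherever the (a, b) group of that column contains any NaN
has_nan = df[['c', 'd']].isna().groupby([df['a'], df['b']]).transform('any')
df[['c', 'd']] = df[['c', 'd']].mask(has_nan)  # null out whole group-columns at once
df['e'] = df['c'].combine_first(df['d'])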
I would like a rolling count of unique values with a maximum window of 36, which needs to treat NaN as not counted, so the count starts at 0 if the first value is NaN. I have a dataframe that looks like this:
Input:
val
NaN
1
1
NaN
2
1
3
NaN
5
Code:
b = a.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
It gives me:
Val count
NaN 1
1 2
1 2
NaN 3
2 4
1 4
3 5
NaN 6
5 7
Expected Output:
Val count
NaN 0
1 1
1 1
NaN 1
2 2
1 2
3 3
NaN 3
5 4
You can just filter out the NaNs:
df.val.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x[~np.isnan(x)]))).fillna(0)
Out[35]:
0 0.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 3.0
7 3.0
8 4.0
Name: val, dtype: float64
The reason the original code over-counts is that np.unique does not collapse NaNs into one value, since NaN never compares equal to itself:
np.unique([np.nan]*2)
Out[38]: array([nan, nan])
np.nan==np.nan
Out[39]: False
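For reference, a self-contained version of the fix (the Series construction is assumed from the question's sample data):

import numpy as np
import pandas as pd

a = pd.Series([np.nan, 1, 1, np.nan, 2, 1, 3, np.nan, 5], name='val')

# Count unique non-NaN values per window; rolling.apply returns NaN for the
# leading all-NaN window (no valid observations), hence the fillna(0).
count = (a.rolling(36, min_periods=1)
          .apply(lambda x: len(np.unique(x[~np.isnan(x)])))
          .fillna(0)
          .astype(int))
print(count.tolist())  # [0, 1, 1, 1, 2, 2, 3, 3, 4]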
Suppose I have a pandas dataframe as follows,
data
id A B C D E
1 NaN 1 NaN 1 1
1 NaN 2 NaN 2 2
1 NaN 3 NaN NaN 3
1 NaN 4 NaN NaN 4
1 NaN 5 NaN NaN 5
2 NaN 6 NaN NaN 6
2 NaN 7 5 NaN 7
2 NaN 8 6 2 8
2 NaN 9 NaN NaN 9
2 NaN 10 NaN 4 10
3 NaN 11 NaN NaN 11
3 NaN 12 NaN NaN 12
3 NaN 13 NaN NaN 13
3 NaN 14 NaN NaN 14
3 NaN 15 NaN NaN 15
3 NaN 16 NaN NaN 16
I am using the following command,
pd.DataFrame(data.count().sort_values(ascending=False)).reset_index()
and get the following output,
index 0
0 E 16
1 B 16
2 id 16
3 D 4
4 C 2
5 A 0
I want the following output,
columns count unique(id) count
E 16 3
B 16 3
D 4 2
C 2 1
A 0 0
where count is the same as before, and unique(id) count is the number of unique ids for which each column has at least one non-null value. I want both as two separate fields.
Can anybody help me in doing this?
Thanks
Starting with:
In [7]: df
Out[7]:
id A B C D E
0 1 NaN 1 NaN 1.0 1
1 1 NaN 2 NaN 2.0 2
2 1 NaN 3 NaN NaN 3
3 1 NaN 4 NaN NaN 4
4 1 NaN 5 NaN NaN 5
5 2 NaN 6 NaN NaN 6
6 2 NaN 7 5.0 NaN 7
7 2 NaN 8 6.0 2.0 8
8 2 NaN 9 NaN NaN 9
9 2 NaN 10 NaN 4.0 10
10 3 NaN 11 NaN NaN 11
11 3 NaN 12 NaN NaN 12
12 3 NaN 13 NaN NaN 13
13 3 NaN 14 NaN NaN 14
14 3 NaN 15 NaN NaN 15
15 3 NaN 16 NaN NaN 16
Here is a rather inelegant way:
In [8]: (df.groupby('id').count() > 0).sum()
Out[8]:
A 0
B 3
C 1
D 2
E 3
dtype: int64
So, simply make your base DataFrame as you've specified:
In [9]: counts = (df[['A','B','C','D','E']].count().sort_values(ascending=False)).to_frame(name='counts')
In [10]: counts
Out[10]:
counts
E 16
B 16
D 4
C 2
A 0
And then simply:
In [11]: counts['unique(id) counts'] = (df.groupby('id').count() > 0).sum()
In [12]: counts
Out[12]:
counts unique(id) counts
E 16 3
B 16 3
D 4 2
C 2 1
A 0 0
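For what it's worth, both columns can also be assembled in one pass with pd.concat; a sketch reusing the labels from above:

value_cols = df.columns.drop('id')
counts = pd.concat({
    'counts': df[value_cols].count(),
    'unique(id) counts': df[value_cols].groupby(df['id']).count().gt(0).sum(),
}, axis=1).sort_values('counts', ascending=False)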
I have a data frame like this:
A B C D
0 1 0 nan nan
1 8 0 nan nan
2 8 1 nan nan
3 2 1 nan nan
4 0 0 nan nan
5 1 1 nan nan
and i have a dictionary like this:
dc = {'C': 5, 'D' : 10}
I want to fill the NaN values in the data frame with the dictionary, but only in the rows where the value in column B is 0. I want to obtain this:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 nan nan
3 2 1 nan nan
4 0 0 5 10
5 1 1 nan nan
I know how to subset the dataframe, but I can't find a way to fill the values with the dictionary; any ideas?
You could use fillna with loc and pass your dict to it:
In [13]: df.loc[df.B==0,:].fillna(dc)
Out[13]:
A B C D
0 1 0 5 10
1 8 0 5 10
4 0 0 5 10
To apply this to your dataframe, slice with the same mask and assign the result above back to it:
df.loc[df.B==0, :] = df.loc[df.B==0,:].fillna(dc)
In [15]: df
Out[15]:
A B C D
0 1 0 5 10
1 8 0 5 10
2 8 1 NaN NaN
3 2 1 NaN NaN
4 0 0 5 10
5 1 1 NaN NaN
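An alternative that avoids slicing twice is df.update, which writes the filled values back in place; a sketch (list(dc) restricts the slice to the dict's columns, here ['C', 'D']):

df.update(df.loc[df['B'].eq(0), list(dc)].fillna(dc))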