Pandas set all values after first NaN to NaN - python

For each row I would like to set all values to NaN after the appearance of the first NaN. E.g.:
     a    b    c
1    2    3    4
2  nan    2  nan
3    3  nan   23
Should become this:
     a    b    c
1    2    3    4
2  nan  nan  nan
3    3  nan  nan
So far the only way I know is an apply with a for loop over each column of each row, and it's very slow!

Check notna with cumprod: the row-wise cumulative product of the not-NaN mask stays 1 only up to the first NaN, so where keeps exactly those cells.
df = df.where(df.notna().cumprod(axis=1).eq(1))
     a    b    c
1  2.0  3.0  4.0
2  NaN  NaN  NaN
3  3.0  NaN  NaN
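To see what the mask looks like, here is a minimal runnable sketch (the frame is rebuilt from the example above): notna() marks the valid cells, the row-wise cumulative product drops to 0 at the first NaN and stays 0 for the rest of the row, and where() keeps only the cells where the product is still 1.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [2, np.nan, 3],
                   'b': [3, 2, np.nan],
                   'c': [4, np.nan, 23]},
                  index=[1, 2, 3])

# True until the first NaN in each row, then 0 for the rest of that row
mask = df.notna().cumprod(axis=1).eq(1)

# where() keeps cells whose mask is True and sets the rest to NaN
print(df.where(mask))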

Use .assign() and a dictionary comprehension to create multiple lags of one column [duplicate]

This question already has answers here: What do lambda function closures capture? (7 answers)
I have a data frame df and I want to create multiple lags of column A.
I should be able to use the .assign() method and a dictionary comprehension, I think.
However, all lags are the longest lag with my solution below, even though the dictionary comprehension itself creates the correct lags.
Also, can someone explain why I need the ** just before my dictionary comprehension?
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.arange(5)})
df.assign(**{'lag_' + str(i): lambda x: x['A'].shift(i) for i in range(1, 5+1)})
A lag_1 lag_2 lag_3 lag_4 lag_5
0 0 NaN NaN NaN NaN NaN
1 1 NaN NaN NaN NaN NaN
2 2 NaN NaN NaN NaN NaN
3 3 NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN NaN
The dictionary comprehension itself creates the correct lags.
{'lag_' + str(i): df['A'].shift(i) for i in range(1, 5+1)}
{'lag_1': 0 NaN
1 0.0
2 1.0
3 2.0
4 3.0
Name: A, dtype: float64,
'lag_2': 0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
Name: A, dtype: float64,
'lag_3': 0 NaN
1 NaN
2 NaN
3 0.0
4 1.0
Name: A, dtype: float64,
'lag_4': 0 NaN
1 NaN
2 NaN
3 NaN
4 0.0
Name: A, dtype: float64,
'lag_5': 0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: A, dtype: float64}
Just pass what you built for the dict and remove the lambda. Each lambda is a closure over the loop variable i and is only called after the comprehension has finished, when i is 5, so every column ends up as shift(5) (see the linked question about closures). The ** is needed because .assign() takes the new columns as keyword arguments; ** unpacks the dictionary into those keyword arguments.
out = df.assign(**{'lag_' + str(i): df['A'].shift(i) for i in range(1, 5+1)})
A lag_1 lag_2 lag_3 lag_4 lag_5
0 0 NaN NaN NaN NaN NaN
1 1 0.0 NaN NaN NaN NaN
2 2 1.0 0.0 NaN NaN NaN
3 3 2.0 1.0 0.0 NaN NaN
4 4 3.0 2.0 1.0 0.0 NaN
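If you do want to keep a lambda, for example to chain .assign() on an intermediate result, the usual fix for the late-binding closure (per the linked question) is to freeze i in a default argument; a sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(5)})

# i=i binds the current value of i to each lambda; without it, every
# lambda looks up i when it is finally called, after the loop ended at i=5
out = df.assign(**{f'lag_{i}': (lambda x, i=i: x['A'].shift(i))
                   for i in range(1, 5 + 1)})
print(out)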

Transfer multiple columns string values to numbers in Pandas

I'm working with a data frame like this:
id type1 type2 type3
0 1 dog NaN NaN
1 2 cat NaN NaN
2 3 dog cat NaN
3 4 cow NaN NaN
4 5 dog NaN NaN
5 6 cat NaN NaN
6 7 cat dog cow
7 8 dog NaN NaN
How can I transform it into the following dataframe? Thank you.
id dog cat cow
0 1 1.0 NaN NaN
1 2 NaN 1.0 NaN
2 3 1.0 1.0 NaN
3 4 NaN NaN 1.0
4 5 1.0 NaN NaN
5 6 NaN 1.0 NaN
6 7 1.0 1.0 1.0
7 8 1.0 NaN NaN
First filter only the type columns with DataFrame.filter and reshape with DataFrame.stack, so it is possible to call Series.str.get_dummies. Then, for 0/1 output, take the max over the first level of the MultiIndex and change the 0s to NaN with DataFrame.mask. Last, add the first column back with DataFrame.join:
df1 = (df.filter(like='type').stack().str.get_dummies()
         .groupby(level=0).max()   # replaces max(level=0), removed in newer pandas
         .mask(lambda x: x == 0))
Or use get_dummies, take the max per column name, and last change the 0s to NaN:
df1 = (pd.get_dummies(df.filter(like='type'), prefix='', prefix_sep='', dtype=int)
         .T.groupby(level=0).max().T   # replaces max(level=0, axis=1), removed in newer pandas
         .mask(lambda x: x == 0))
df = df[['id']].join(df1)
print (df)
id cat cow dog
0 1 NaN NaN 1.0
1 2 1.0 NaN NaN
2 3 1.0 NaN 1.0
3 4 NaN 1.0 NaN
4 5 NaN NaN 1.0
5 6 1.0 NaN NaN
6 7 1.0 1.0 1.0
7 8 NaN NaN 1.0
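For a self-contained check, here is the stack/str.get_dummies route run end to end on the sample frame (rebuilt below from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(1, 9),
                   'type1': ['dog', 'cat', 'dog', 'cow', 'dog', 'cat', 'cat', 'dog'],
                   'type2': [np.nan, np.nan, 'cat', np.nan, np.nan, np.nan, 'dog', np.nan],
                   'type3': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'cow', np.nan]})

# stack drops the NaNs; str.get_dummies yields one 0/1 column per animal;
# grouping by the original row index collapses type1..type3 into one row
dummies = df.filter(like='type').stack().str.get_dummies()
df1 = dummies.groupby(level=0).max().mask(lambda x: x == 0)
print(df[['id']].join(df1))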

How do you shift each row in pandas data frame by a specific value?

If I have a pandas dataframe like this:
2 3 4 NaN NaN NaN
1 NaN NaN NaN NaN NaN
5 6 7 2 3 NaN
4 3 NaN NaN NaN NaN
and an array of the amounts by which I would like to shift each row:
array = [2, 4, 0, 3]
How do I iterate through each row to shift the columns by the number in my array to get something like this:
NaN NaN 2 3 4 NaN
NaN NaN NaN NaN 1 NaN
5 6 7 2 3 NaN
NaN NaN NaN 4 3 NaN
I was trying to do something like this but had no luck.
df = pd.DataFrame(values)
for rows in df.iterrows():
df[rows] = df.shift[change_in_bins[rows]]
Use a for loop with loc and shift:
for index, value in enumerate([2, 4, 0, 3]):
    df.loc[index, :] = df.loc[index, :].shift(value)
print(df)
0 1 2 3 4 5
0 NaN NaN 2.0 3.0 4.0 NaN
1 NaN NaN NaN NaN 1.0 NaN
2 5.0 6.0 7.0 2.0 3.0 NaN
3 NaN NaN NaN 4.0 3.0 NaN
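If the loop is too slow for many rows, a vectorized alternative is to compute, for every output cell, the source column it should be read from, and gather with NumPy fancy indexing in one shot. A sketch; shift_rows is a hypothetical helper name, not a pandas function:

import numpy as np
import pandas as pd

def shift_rows(df, shifts):
    # Shift each row to the right by its own amount, filling with NaN.
    a = df.to_numpy(dtype=float)
    n_rows, n_cols = a.shape
    # src[i, j] is the column that output cell (i, j) is read from
    src = np.arange(n_cols) - np.asarray(shifts)[:, None]
    valid = (src >= 0) & (src < n_cols)
    rows = np.arange(n_rows)[:, None]
    out = np.where(valid, a[rows, np.clip(src, 0, n_cols - 1)], np.nan)
    return pd.DataFrame(out, index=df.index, columns=df.columns)

df = pd.DataFrame([[2, 3, 4, np.nan, np.nan, np.nan],
                   [1, np.nan, np.nan, np.nan, np.nan, np.nan],
                   [5, 6, 7, 2, 3, np.nan],
                   [4, 3, np.nan, np.nan, np.nan, np.nan]])
print(shift_rows(df, [2, 4, 0, 3]))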

Combine multi columns to one column Pandas

Hi, I have the following dataframe:
z a b c
a 1 NaN NaN
ss NaN 2 NaN
cc 3 NaN NaN
aa NaN 4 NaN
ww NaN 5 NaN
ss NaN NaN 6
aa NaN NaN 7
g NaN NaN 8
j 9 NaN NaN
I would like to create a new column d to do something like this
z a b c d
a 1 NaN NaN 1
ss NaN 2 NaN 2
cc 3 NaN NaN 3
aa NaN 4 NaN 4
ww NaN 5 NaN 5
ss NaN NaN 6 6
aa NaN NaN 7 7
g NaN NaN 8 8
j 9 NaN NaN 9
The numbers are not integers; they are np.float64. The integers here are just for a clear example; you may assume values like 32065431243556.62 or 763835218962767.8. Thank you for your help.
We can replace the NaNs with 0 and sum across each row:
df['d'] = df[['a', 'b', 'c']].fillna(0).sum(axis=1)
In fact, fillna is not necessary: sum skips the NaN elements automatically.
I'm a Python newcomer as well, and I suggest maybe you should read the pandas cookbook first.
The code is:
df['Total'] = df[['a', 'b', 'c']].sum(axis=1).astype(int)
You can use pd.DataFrame.ffill over axis=1:
df['D'] = df[['a', 'b', 'c']].ffill(axis=1).iloc[:, -1].astype(int)
print(df)
a b c D
0 1.0 NaN NaN 1
1 NaN 2.0 NaN 2
2 3.0 NaN NaN 3
3 NaN 4.0 NaN 4
4 NaN 5.0 NaN 5
5 NaN NaN 6.0 6
6 NaN NaN 7.0 7
7 NaN NaN 8.0 8
8 9.0 NaN NaN 9
Of course, if you have float values, int conversion is not required.
If there is only one value per row, as in the given example, you can drop the NaNs in each row and assign the remaining value to column d:
df['d'] = df[['a', 'b', 'c']].apply(lambda row: row.dropna().iloc[0], axis=1)
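All of the answers above reduce to picking out the single non-NaN value per row. Another compact, vectorized option, sketched here on the question's data, is a row-wise back-fill, which pulls each row's value into the first column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'z': ['a', 'ss', 'cc', 'aa', 'ww', 'ss', 'aa', 'g', 'j'],
                   'a': [1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, np.nan, 9],
                   'b': [np.nan, 2, np.nan, 4, 5, np.nan, np.nan, np.nan, np.nan],
                   'c': [np.nan, np.nan, np.nan, np.nan, np.nan, 6, 7, 8, np.nan]})

# back-fill across the columns so each row's only value lands in column 'a',
# then take that first column as 'd'
df['d'] = df[['a', 'b', 'c']].bfill(axis=1).iloc[:, 0]
print(df)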

How to count occurrences of consecutive 1s by column and take mean by block

import pandas as pd

x = pd.DataFrame(index=pd.date_range(start="2017-1-1", end="2017-1-13"),
                 columns="a b c".split())
# .ix is removed in modern pandas; iloc addresses the same positional
# slices (columns a, b, c are positions 0, 1, 2)
x.iloc[0:2, 0] = 1
x.iloc[5:10, 0] = 1
x.iloc[9:12, 1] = 1
x.iloc[1:3, 2] = 1
x.iloc[5, 2] = 1
a b c
2017-01-01 1 NaN NaN
2017-01-02 1 NaN 1
2017-01-03 NaN NaN 1
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 1 NaN 1
2017-01-07 1 NaN NaN
2017-01-08 1 NaN NaN
2017-01-09 1 NaN NaN
2017-01-10 1 1 NaN
2017-01-11 NaN 1 NaN
2017-01-12 NaN 1 NaN
2017-01-13 NaN NaN NaN
Given the above dataframe, x, I want to return the average number of occurrences of 1s within each group of a, b, and c. The average for each column is taken over the number of blocks of consecutive 1s.
For example, column a should output the average of 2 and 5, which is 3.5: there are 2 consecutive 1s between Jan-01 and Jan-02, then 5 consecutive 1s between Jan-06 and Jan-10, so 2 blocks of 1s in total. Similarly, column b gives 3, because only one sequence of consecutive 1s occurs, between Jan-10 and Jan-12. Finally, column c gives the average of 2 and 1, which is 1.5.
Expected output of the toy example:
a b c
3.5 3 1.5
Use mask + apply with value_counts, and finally, find the mean of your counts -
x.eq(1)\
.ne(x.eq(1).shift())\
.cumsum(0)\
.mask(x.ne(1))\
.apply(pd.Series.value_counts)\
.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
Details
First, find a list of all consecutive values in your dataframe -
i = x.eq(1).ne(x.eq(1).shift()).cumsum(0)
i
a b c
2017-01-01 1 1 1
2017-01-02 1 1 2
2017-01-03 2 1 2
2017-01-04 2 1 3
2017-01-05 2 1 3
2017-01-06 3 1 4
2017-01-07 3 1 5
2017-01-08 3 1 5
2017-01-09 3 1 5
2017-01-10 3 2 5
2017-01-11 4 2 5
2017-01-12 4 2 5
2017-01-13 4 3 5
Now, keep only those group values whose cells were originally 1 in x -
j = i.mask(x.ne(1))
j
a b c
2017-01-01 1.0 NaN NaN
2017-01-02 1.0 NaN 2.0
2017-01-03 NaN NaN 2.0
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 3.0 NaN 4.0
2017-01-07 3.0 NaN NaN
2017-01-08 3.0 NaN NaN
2017-01-09 3.0 NaN NaN
2017-01-10 3.0 2.0 NaN
2017-01-11 NaN 2.0 NaN
2017-01-12 NaN 2.0 NaN
2017-01-13 NaN NaN NaN
Now, apply value_counts across each column -
k = j.apply(pd.Series.value_counts)
k
a b c
1.0 2.0 NaN NaN
2.0 NaN 3.0 2.0
3.0 5.0 NaN NaN
4.0 NaN NaN 1.0
And just find the column-wise mean -
k.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
As a handy note, if you want, for example, the mean over only runs of more than n consecutive 1s (say, n = 1 here), you can filter on k's values, which hold the run lengths (k's index holds the run labels, not the lengths) -
k[k > 1].mean(0)
a    3.5
b    3.0
c    2.0
dtype: float64
Let's try:
x.apply(lambda s: s.groupby(s.ne(1).cumsum()).sum(min_count=1).mean())
Output:
a 3.5
b 3.0
c 1.5
dtype: float64
Apply the lambda function to each column of the dataframe. s.ne(1).cumsum() starts a new group at every cell that is not 1, so each run of consecutive 1s falls into its own group, and each NaN between runs forms a group of its own. sum(min_count=1) returns the length of each run and NaN for the all-NaN groups (a plain sum() would turn those into 0 in newer pandas and drag the mean down), and mean() skips the NaNs, averaging only the run lengths.
This utilizes cumsum, shift, and an xor mask.
b = x.cumsum()
c = b.shift(-1)
b_masked = b[b.isnull() ^ c.isnull()]
b_masked.max() / b_masked.count()
a 3.5
b 3.0
c 1.5
dtype: float64
First do b = x.cumsum()
a b c
0 1.0 NaN NaN
1 2.0 NaN 1.0
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 3.0 NaN 3.0
6 4.0 NaN NaN
7 5.0 NaN NaN
8 6.0 NaN NaN
9 7.0 1.0 NaN
10 NaN 2.0 NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Then, shift b upward: c = b.shift(-1). Next, we create an xor mask with b.isnull() ^ c.isnull(). This mask keeps exactly one value per run of consecutive ones: the last cumsum value of the run. Note that the xor is also True one position before each run starts (where b is NaN and c is not), but since we index back into b, which is NaN there, no new elements are generated. We use an example to illustrate:
b    c    b.isnull() ^ c.isnull()    b[b.isnull() ^ c.isnull()]
NaN  1    True                       NaN
1    2    False                      NaN
2    NaN  True                       2
NaN  NaN  False                      NaN
The full b[b.isnull() ^ c.isnull()] looks like
a b c
0 NaN NaN NaN
1 2.0 NaN NaN
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN 3.0
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 7.0 NaN NaN
10 NaN NaN NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Because we did cumsum in the first place, we only need the maximum and the number of non-NaN values in each column to calculate the mean.
Thus, we do b[b.isnull() ^ c.isnull()].max() / b[b.isnull() ^ c.isnull()].count()
You could use a regex:
import re
import numpy as np

p = r'1+'
counts = {
    c: np.mean([len(m) for m in re.findall(p, ''.join(map(str, x[c].values)))])
    for c in ['a', 'b', 'c']
}
This method works because each column here can be thought of as a string over the alphabet {1, nan}: 1+ matches every run of adjacent 1s, re.findall returns those runs as strings, and the mean of their lengths is the average block size.
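One caveat, as a hedged aside: the ''.join(map(str, ...)) trick assumes each 1 prints exactly as '1'. If the column were float dtype, str(1.0) gives '1.0', and the '0' would break every run into length-1 pieces. Building the string from an explicit comparison sidesteps the dtype issue; a sketch (reusing the x frame built above):

import re
import numpy as np

# encode each cell as '1' or '0' regardless of dtype, then measure the runs
s = ''.join('1' if v == 1 else '0' for v in x['a'])
runs = re.findall(r'1+', s)             # ['11', '11111'] for column a
print(np.mean([len(r) for r in runs]))  # 3.5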
