Combine multiple columns into one column in pandas - Python

Hi, I have the following dataframe:
z a b c
a 1 NaN NaN
ss NaN 2 NaN
cc 3 NaN NaN
aa NaN 4 NaN
ww NaN 5 NaN
ss NaN NaN 6
aa NaN NaN 7
g NaN NaN 8
j 9 NaN NaN
I would like to create a new column d, like this:
z a b c d
a 1 NaN NaN 1
ss NaN 2 NaN 2
cc 3 NaN NaN 3
aa NaN 4 NaN 4
ww NaN 5 NaN 5
ss NaN NaN 6 6
aa NaN NaN 7 7
g NaN NaN 8 8
j 9 NaN NaN 9
The numbers are not integers; they are np.float64. The integers are just for a clear example; you may assume values like 32065431243556.62 or 763835218962767.8. Thank you for your help.
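For reproducibility, the frame can be rebuilt from the example above (a sketch using the integer placeholders):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'z': ['a', 'ss', 'cc', 'aa', 'ww', 'ss', 'aa', 'g', 'j'],
    'a': [1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, np.nan, 9],
    'b': [np.nan, 2, np.nan, 4, 5, np.nan, np.nan, np.nan, np.nan],
    'c': [np.nan, np.nan, np.nan, np.nan, np.nan, 6, 7, 8, np.nan],
})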

We can replace the NaN values with 0 and sum across the rows:
df['d'] = df[['a', 'b', 'c']].fillna(0).sum(axis=1)

In fact, it is not necessary to use fillna; sum skips NaN elements automatically (skipna=True by default).
I'm a Python newcomer as well, and I suggest reading the pandas cookbook first.
The code is:
df['Total'] = df[['a', 'b', 'c']].sum(axis=1).astype(int)
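As a quick check that sum really does skip NaN, here is a minimal sketch with made-up two-row data; min_count=1 is the escape hatch if an all-NaN row should stay NaN rather than become 0:
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, np.nan], 'c': [np.nan, np.nan]})
print(tmp.sum(axis=1))                # 1.0, 0.0  (NaN cells are ignored, skipna=True)
print(tmp.sum(axis=1, min_count=1))   # 1.0, NaN  (an all-NaN row stays NaN)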

You can use pd.DataFrame.ffill over axis=1:
df['D'] = df.ffill(axis=1).iloc[:, -1].astype(int)
print(df)
a b c D
0 1.0 NaN NaN 1
1 NaN 2.0 NaN 2
2 3.0 NaN NaN 3
3 NaN 4.0 NaN 4
4 NaN 5.0 NaN 5
5 NaN NaN 6.0 6
6 NaN NaN 7.0 7
7 NaN NaN 8.0 8
8 9.0 NaN NaN 9
Of course, if you have float values, int conversion is not required.
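If you apply this to the question's frame, which also contains the string column z, forward-filling only the numeric columns avoids dragging the string across the row; a sketch:
df['d'] = df[['a', 'b', 'c']].ffill(axis=1).iloc[:, -1]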

If there is only one value per row, as in the given example, you can drop the NaNs from each row and assign the remaining value to column d:
df['d'] = df[['a', 'b', 'c']].apply(lambda row: row.dropna().iloc[0], axis=1)

Related

Use .assign() and a dictionary comprehension to create multiple lags of one column [duplicate]

This question already has answers here:
What do lambda function closures capture?
I have a data frame df and I want to create multiple lags of column A.
I should be able to use the .assign() method and a dictionary comprehension, I think.
However, with my solution below every lag column ends up equal to the longest lag, even though the dictionary comprehension itself creates the correct lags.
Also, can someone explain why I need the ** just before my dictionary comprehension?
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.arange(5)})
df.assign(**{'lag_' + str(i): lambda x: x['A'].shift(i) for i in range(1, 5+1)})
A lag_1 lag_2 lag_3 lag_4 lag_5
0 0 NaN NaN NaN NaN NaN
1 1 NaN NaN NaN NaN NaN
2 2 NaN NaN NaN NaN NaN
3 3 NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN NaN
The dictionary comprehension itself creates the correct lags.
{'lag_' + str(i): df['A'].shift(i) for i in range(1, 5+1)}
{'lag_1': 0 NaN
1 0.0
2 1.0
3 2.0
4 3.0
Name: A, dtype: float64,
'lag_2': 0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
Name: A, dtype: float64,
'lag_3': 0 NaN
1 NaN
2 NaN
3 0.0
4 1.0
Name: A, dtype: float64,
'lag_4': 0 NaN
1 NaN
2 NaN
3 NaN
4 0.0
Name: A, dtype: float64,
'lag_5': 0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: A, dtype: float64}
Just pass the dictionary you already built and drop the lambda:
out = df.assign(**{'lag_' + str(i): df['A'].shift(i) for i in range(1, 5+1)})
Out[65]:
A lag_1 lag_2 lag_3 lag_4 lag_5
0 0 NaN NaN NaN NaN NaN
1 1 0.0 NaN NaN NaN NaN
2 2 1.0 0.0 NaN NaN NaN
3 3 2.0 1.0 0.0 NaN NaN
4 4 3.0 2.0 1.0 0.0 NaN
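The all-NaN columns in the question come from Python's late binding: each lambda looks up i only when assign finally calls it, and by then the loop has finished, so every lambda sees i == 5, and shift(5) on a 5-row frame is all NaN. The ** is needed because assign takes one keyword argument per new column, and ** unpacks the dictionary into those keyword arguments. If you do want callables (for example, so each new column can see columns assigned earlier in the same call), binding i as a default argument is a common workaround; a sketch:
out = df.assign(**{'lag_' + str(i): (lambda x, i=i: x['A'].shift(i)) for i in range(1, 5+1)})
print(out)   # matches the output of the dictionary-of-Series version above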

merge two rows in one row and convert to NA

Dataframe:
0 1 2 3 4 slicing
0 NaN Object 1 NaN NaN 0
6 NaN Object 2 NaN NaN 6
12 NaN Object 3 NaN NaN 12
18 NaN Object 4 NaN NaN 18
23 NaN Object 5 NaN NaN 23
desired output:
0 1 2 3 4 slicing
0 NaN Object1 NaN NaN NaN 0
6 NaN Object2 NaN NaN NaN 6
12 NaN Object3 NaN NaN NaN 12
18 NaN Object4 NaN NaN NaN 18
23 NaN Object5 NaN NaN NaN 23
Using the pandas library, I want to iterate through each row in the dataset (each row holds only NaNs plus the string 'Object' and its corresponding number string '1'-'10'), concatenate the two strings in the same row, and replace the now-redundant number cell with NaN.
My code so far:
df= df[df.apply(lambda row: row.astype(str).str.contains('Desk').any().df[row]+df[row], axis=1)]
Index 0 1 2 3 4
0 NaN Desk 1 NaN NaN
5 NaN Desk 2 NaN NaN
10 NaN Desk 3 NaN NaN
15 NaN Desk 4 NaN NaN
20 NaN Desk 5 NaN NaN
Here's what I did:
Using the following dataframe as an example:
0 1 2 3 4 slicing
index
0 NaN Object 1 NaN NaN 0
6 NaN Object 2 NaN A 6
12 NaN Object 3 NaN NaN 12
18 NaN NaN 4 NaN NaN 18
23 Stuff Object NaN 5 NaN 23
I perform 4 steps in the 4 lines of code below, for the rows where 'Object' appears in column 1: 1) replace the NaNs with empty strings; 2) cast everything to string; 3) join the row into column 1; 4) set all the other columns back to NaN.
df.loc[df['1']=='Object',['0', '2', '3','4']] = df.loc[df['1']=='Object',['0', '2', '3','4']].fillna('')
df.loc[df['1']=='Object',['0','1', '2', '3','4']] = df.loc[df['1']=='Object',['0','1', '2', '3','4']].astype(str)
df.loc[df['1']=='Object', ['1','0', '2', '3','4']] = df.loc[df['1']=='Object', ['1', '0', '2', '3','4']].agg(''.join, axis=1)
df.loc[df['1'].str.contains('Object', na = False), ['0', '2', '3','4']] = np.nan
df
0 1 2 3 4 slicing
index
0 NaN Object1 NaN NaN NaN 0
6 NaN Object2A NaN NaN NaN 6
12 NaN Object3 NaN NaN NaN 12
18 NaN NaN 4 NaN NaN 18
23 NaN ObjectStuff5 NaN NaN NaN 23
If I understand what you are trying to achieve, you should really work with columns instead of iterating; it is much faster. You can try something like this:
import numpy as np

columns = df.columns.tolist()
ix = df[df[columns[1]].str.contains('Object', na=False)].index
# append column 2's value to column 1, then blank column 2
df.loc[ix, columns[1]] = df.loc[ix, columns[1]] + df.loc[ix, columns[2]].astype(str)
df.loc[ix, columns[2]] = np.nan

Pandas set all values after first NaN to NaN

For each row I would like to set all values to NaN after the appearance of the first NaN. E.g.:
a b c
1 2 3 4
2 nan 2 nan
3 3 nan 23
Should become this:
a b c
1 2 3 4
2 nan nan nan
3 3 nan nan
So far I only know how to do this with an apply with a for loop over each column per row - it's very slow!
Check with cumprod: the row-wise cumulative product of the notna flags stays at 1 only up to the first NaN in each row, and where keeps just those cells.
df=df.where(df.notna().cumprod(axis=1).eq(1))
a b c
1 2.0 3.0 4.0
2 NaN NaN NaN
3 3.0 NaN NaN
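The intermediate mask on the sample data (a small sketch) looks like this:
mask = df.notna().cumprod(axis=1).eq(1)
print(mask)
#        a      b      c
# 1   True   True   True
# 2  False  False  False
# 3   True  False  False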

How do you shift each row in pandas data frame by a specific value?

If I have a pandas dataframe like this:
2 3 4 NaN NaN NaN
1 NaN NaN NaN NaN NaN
5 6 7 2 3 NaN
4 3 NaN NaN NaN NaN
and an array for the number I would like to shift:
array = [2, 4, 0, 3]
How do I iterate through each row to shift the columns by the number in my array to get something like this:
NaN NaN 2 3 4 NaN
NaN NaN NaN NaN 1 NaN
5 6 7 2 3 NaN
NaN NaN NaN 3 4 NaN
I was trying to do something like this but had no luck.
df = pd.DataFrame(values)
for rows in df.iterrows():
df[rows] = df.shift[change_in_bins[rows]]
Use for loop with loc and shift:
for index, value in enumerate([2, 4, 0, 3]):
    df.loc[index, :] = df.loc[index, :].shift(value)
print(df)
0 1 2 3 4 5
0 NaN NaN 2.0 3.0 4.0 NaN
1 NaN NaN NaN NaN 1.0 NaN
2 5.0 6.0 7.0 2.0 3.0 NaN
3 NaN NaN NaN 4.0 3.0 NaN
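For a self-contained run, the frame from the question can be built explicitly (a sketch; the values are copied from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 3, 4, np.nan, np.nan, np.nan],
                   [1, np.nan, np.nan, np.nan, np.nan, np.nan],
                   [5, 6, 7, 2, 3, np.nan],
                   [4, 3, np.nan, np.nan, np.nan, np.nan]])

for index, value in enumerate([2, 4, 0, 3]):
    df.loc[index, :] = df.loc[index, :].shift(value)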

How to count occurrences of consecutive 1s by column and take mean by block

import pandas as pd

x = pd.DataFrame(index=pd.date_range(start="2017-1-1", end="2017-1-13"),
                 columns="a b c".split())
# mark the runs of 1s by position
x.iloc[0:2, 0] = 1    # column a
x.iloc[5:10, 0] = 1
x.iloc[9:12, 1] = 1   # column b
x.iloc[1:3, 2] = 1    # column c
x.iloc[5, 2] = 1
a b c
2017-01-01 1 NaN NaN
2017-01-02 1 NaN 1
2017-01-03 NaN NaN 1
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 1 NaN 1
2017-01-07 1 NaN NaN
2017-01-08 1 NaN NaN
2017-01-09 1 NaN NaN
2017-01-10 1 1 NaN
2017-01-11 NaN 1 NaN
2017-01-12 NaN 1 NaN
2017-01-13 NaN NaN NaN
Given the above dataframe, x, I want to return the average number of occurrences of 1s within each group of a, b, and c. The average for each column is taken over the number of blocks that contain consecutive 1s.
For example, column a will output the average of 2 and 5, which is 3.5: there are 2 consecutive 1s between Jan-01 and Jan-02, then 5 consecutive 1s between Jan-06 and Jan-10, so 2 blocks of 1s in total. Similarly, column b gives 3, because there is only one run of consecutive 1s, between Jan-10 and Jan-12. Finally, column c gives the average of 2 and 1, which is 1.5.
Expected output of the toy example:
a b c
3.5 3 1.5
Use mask + apply with value_counts, and finally, find the mean of your counts -
x.eq(1)\
.ne(x.eq(1).shift())\
.cumsum(0)\
.mask(x.ne(1))\
.apply(pd.Series.value_counts)\
.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
Details
First, find a list of all consecutive values in your dataframe -
i = x.eq(1).ne(x.eq(1).shift()).cumsum(0)
i
a b c
2017-01-01 1 1 1
2017-01-02 1 1 2
2017-01-03 2 1 2
2017-01-04 2 1 3
2017-01-05 2 1 3
2017-01-06 3 1 4
2017-01-07 3 1 5
2017-01-08 3 1 5
2017-01-09 3 1 5
2017-01-10 3 2 5
2017-01-11 4 2 5
2017-01-12 4 2 5
2017-01-13 4 3 5
Now, keep only those group values whose cells were originally 1 in x -
j = i.mask(x.ne(1))
j
a b c
2017-01-01 1.0 NaN NaN
2017-01-02 1.0 NaN 2.0
2017-01-03 NaN NaN 2.0
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 3.0 NaN 4.0
2017-01-07 3.0 NaN NaN
2017-01-08 3.0 NaN NaN
2017-01-09 3.0 NaN NaN
2017-01-10 3.0 2.0 NaN
2017-01-11 NaN 2.0 NaN
2017-01-12 NaN 2.0 NaN
2017-01-13 NaN NaN NaN
Now, apply value_counts across each column -
k = j.apply(pd.Series.value_counts)
k
a b c
1.0 2.0 NaN NaN
2.0 NaN 3.0 2.0
3.0 5.0 NaN NaN
4.0 NaN NaN 1.0
And just find the column-wise mean -
k.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
As a handy note, if you want to, for example, average only runs of more than n consecutive 1s (say, n = 1 here), you can filter on the counts in k quite easily -
k[k > 1].mean(0)
a    3.5
b    3.0
c    2.0
dtype: float64
Let's try:
x.apply(lambda s: s.groupby(s.ne(1).cumsum()).sum(min_count=1).mean())
Output:
a 3.5
b 3.0
c 1.5
dtype: float64
The lambda function is applied to each column of the dataframe. Grouping by the running count of non-1 values puts every run of consecutive 1s into its own group; sum() then gives the length of each run (min_count=1 keeps groups that contain no 1s as NaN rather than 0), and mean() averages those run lengths.
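To see the grouping for column a (a small sketch on the same data), note that the running count of non-1 cells is constant across each run of 1s, so every run lands in its own group:
s = x['a']
groups = s.ne(1).cumsum()                   # constant within each run of 1s
print(s.groupby(groups).sum(min_count=1))   # run lengths 2.0 and 5.0; all-NaN groups stay NaN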
This utilizes cumsum, shift, and an xor mask.
b = x.cumsum()
c = b.shift(-1)
b_masked = b[b.isnull() ^ c.isnull()]
b_masked.max() / b_masked.count()
a 3.5
b 3.0
c 1.5
dtype: float64
First do b = x.cumsum()
a b c
0 1.0 NaN NaN
1 2.0 NaN 1.0
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 3.0 NaN 3.0
6 4.0 NaN NaN
7 5.0 NaN NaN
8 6.0 NaN NaN
9 7.0 1.0 NaN
10 NaN 2.0 NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Then shift b upward: c = b.shift(-1). Next, we create an xor mask with b.isnull() ^ c.isnull(). This mask keeps exactly one value per run of consecutive 1s (the last cumulative sum of each run). It may also flag an extra position at the end of a run of NaNs, but because the mask is applied back to b, where that position is NaN anyway, no new values appear. A small example illustrates this:
b c b.isnull() ^ c.isnull() b[b.isnull() ^ c.isnull()]
NaN 1 True NaN
1 2 False NaN
2 NaN True 2
NaN NaN False NaN
The full b[b.isnull() ^ c.isnull()] looks like:
a b c
0 NaN NaN NaN
1 2.0 NaN NaN
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN 3.0
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 7.0 NaN NaN
10 NaN NaN NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Because we did cumsum in the first place, we only need the maximum and the number of non-NaN in each column to calculate the mean.
Thus, we do b[b.isnull() ^ c.isnull()].max() / b[b.isnull() ^ c.isnull()].count()
You could use a regex:
import re
import numpy as np

p = r'1+'
counts = {
    c: np.mean(
        [len(m) for m in re.findall(p, ''.join(map(str, x[c].values)))]
    )
    for c in ['a', 'b', 'c']
}
This method works because each column can be read as a string over the alphabet {1, nan}: 1+ matches every run of adjacent 1s, re.findall returns those runs as strings, and we then take the mean of their lengths.
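A quick check of the result (a sketch; it relies on the cells holding plain integer 1s, so that str() yields '1' rather than '1.0', and on 'nan' containing no digit 1):
print(pd.Series(counts))
# a    3.5
# b    3.0
# c    1.5
# dtype: float64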
