Groupby two columns pandas dataframe and shift().rolling() - python

I want to group by two columns, 'Y' and 'A', then apply shift().rolling() to column 'ValueA' within each group.
I tried this code, but the result is not correct.
Code
df = pd.DataFrame({
'Y' : [0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1],
'A' : ['b','c','a','c','a','c','b','c','a', 'a', 'b', 'b','c','a','a','b'],
'B': ['a', 'a', 'b', 'b','c','a','a','b','b','c','a','c','a','c','b','c'],
'ValueA':[1,2,2,1,2,4,7,1,3,2,4,3,1,2,4,5],
'ValueB':[3,2,4,3,1,2,4,5,1,2,2,1,2,4,7,1]
})
df['ValueX'] = df.groupby(['Y','A'])['ValueA'].shift().rolling(3, min_periods=3).sum()
Output for 'A' == a
Y A B ValueA ValueB ValueX
2 0 a b 2 4 NaN
4 1 a c 2 1 NaN
8 1 a b 3 1 NaN
9 1 a c 2 2 9.0
13 1 a c 2 4 7.0
14 1 a b 4 7 5.0
Expected Output
Y A B ValueA ValueB ValueX
2 0 a b 2 4 NaN
4 1 a c 2 1 NaN
8 1 a b 3 1 NaN
9 1 a c 2 2 NaN
13 1 a c 2 4 7.0
14 1 a b 4 7 7.0

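Both the shift and the rolling operation need to be performed per group. Your code performs the shift per group but then applies the rolling sum across the entire column, which produces the incorrect output. Passing group_keys=False keeps the original index so the result can be assigned back; pandas 2.0 and later would otherwise prepend the group keys to the index.
df['ValueX'] = df.groupby(['Y', 'A'], group_keys=False)['ValueA']\
.apply(lambda v: v.shift().rolling(3).sum())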
print(df)
Y A B ValueA ValueB ValueX
0 0 b a 1 3 NaN
1 0 c a 2 2 NaN
2 0 a b 2 4 NaN
3 1 c b 1 3 NaN
4 1 a c 2 1 NaN
5 1 c a 4 2 NaN
6 1 b a 7 4 NaN
7 1 c b 1 5 NaN
8 1 a b 3 1 NaN
9 1 a c 2 2 NaN
10 1 b a 4 2 NaN
11 1 b c 3 1 NaN
12 1 c a 1 2 6.0
13 1 a c 2 4 7.0
14 1 a b 4 7 7.0
15 1 b c 5 1 14.0
As a side note, you don't have to specify the optional min_periods argument explicitly; for a fixed-size window it defaults to the window size.
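To see the difference side by side, here is a minimal sketch that rebuilds the question's frame (only the columns needed) and computes both versions. It uses transform rather than the apply call shown in the answer, which is a substitution on my part: transform always returns a result aligned to the original index. The column names ValueX_global and ValueX_group are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Y': [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'A': ['b', 'c', 'a', 'c', 'a', 'c', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b'],
    'ValueA': [1, 2, 2, 1, 2, 4, 7, 1, 3, 2, 4, 3, 1, 2, 4, 5],
})

grouped = df.groupby(['Y', 'A'])['ValueA']

# Shift per group, but roll over the whole column (the question's attempt).
df['ValueX_global'] = grouped.shift().rolling(3, min_periods=3).sum()

# Shift and roll inside each group.
df['ValueX_group'] = grouped.transform(lambda v: v.shift().rolling(3).sum())

print(df[df['A'] == 'a'])
```

For row 9 the global version mixes in shifted values from neighbouring groups, while the per-group version still has fewer than three prior values and stays NaN.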

Related

How do I insert a column at a specific column index in a pandas data frame? (Change column order in pandas data frame)

I have a pandas data frame and I want to move the "F" column to after the "B" column. Is there a way to do that?
A B C D E F
0 7 1 8 1 6
1 8 2 5 8 5 8
2 9 3 6 8 5
3 1 8 1 3 4
4 6 8 2 5 0 9
5 2 N/A 1 3 8
df2
A B F C D E
0 7 1 6 8 1
1 8 2 8 5 8 5
2 9 3 5 6 8
3 1 4 8 1 3
4 6 8 9 2 5 0
5 2 8 N/A 1 3
So it should finally look like df2.
Thanks in advance.
You can use df.insert together with df.pop, after getting the location of B with get_loc:
df.insert(df.columns.get_loc("B")+1,"F",df.pop("F"))
print(df)
A B F C D E
0 7.0 1 6.0 NaN 8 1.0
1 8.0 2 8.0 5.0 8 5.0
2 9.0 3 5.0 6.0 8 NaN
3 1.0 8 NaN 1.0 3 4.0
4 6.0 8 9.0 2.0 5 0.0
5 NaN 2 8.0 NaN 1 3.0
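The same idea can be wrapped in a small helper. This is only a sketch: move_column_after is a hypothetical name, not a pandas function, and the toy frame stands in for the question's data.

```python
import pandas as pd

def move_column_after(df, col, after):
    """Move `col` so it sits immediately to the right of `after` (in place)."""
    values = df.pop(col)                      # remove the column, keep its data
    position = df.columns.get_loc(after) + 1  # looked up after the pop
    df.insert(position, col, values)
    return df

df = pd.DataFrame({c: range(3) for c in 'ABCDEF'})
move_column_after(df, 'F', 'B')
print(list(df.columns))
```

Looking up the position after the pop matters when the moved column starts out to the left of the target. In the one-liner, get_loc is evaluated before pop, which is fine for moving F after B but would land one position too far right when, say, moving A after C.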
Another minimalist, (and very specific!) approach:
df = df[list('ABFCDE')]
Here is a very simple, one-line answer. It gives a little more explanation of the answer from @warped.
You can do this after adding the 'n' column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if your column names are words rather than single letters, calling list() on one string would split it into characters. Pass the column names themselves as a list (or a tuple wrapped in list(), as here) instead.
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
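A quick sketch of why the single-string form only works for one-letter names. The frame below is a cut-down stand-in, and new_order is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({'Upper': ['a', 'b'], 'Lower': [1, 2], 'Mid': [2, 2]})

# list() on a string yields its characters, not the column name.
print(list('Mid'))

# For multi-character names, pass the names themselves in a list.
new_order = ['Mid', 'Upper', 'Lower']
df = df[new_order]
print(list(df.columns))
```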

set entire group to NaN if containing a single NaN and combine columns

I have a df
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
I need to group by a and b, and then, if c or d contains one or more NaNs within a group, I want the entire group in that column to be NaN:
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 nan
1 3 1 nan
1 1 nan 3
1 1 nan 3
1 1 nan 4
and then combine c and d so that there are no NaNs left:
a b c d e
0 1 nan 1 1
0 2 2 nan 2
0 2 3 nan 3
1 3 1 nan 1
1 1 nan 3 3
1 1 nan 3 3
1 1 nan 4 4
Check each group for NaN values and set the whole group to NaN where any are found (otherwise keep the existing values), then use combine_first() to combine the columns.
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_csv(StringIO("""
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
"""), sep=' ')
for col in ['c', 'd']:
    df[col] = df.groupby(['a','b'])[col].transform(lambda x: np.nan if any(x.isna()) else x)
df['e'] = df['c'].combine_first(df['d'])
df
a b c d e
0 0 1 NaN 1.0 1.0
1 0 2 2.0 NaN 2.0
2 0 2 3.0 NaN 3.0
3 1 3 1.0 NaN 1.0
4 1 1 NaN 3.0 3.0
5 1 1 NaN 3.0 3.0
6 1 1 NaN 4.0 4.0
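As an alternative sketch (not the answer's method), the per-group lambda can be replaced with a boolean mask built by transform('any'). The frame is typed out directly here instead of being parsed from text, and group_has_nan is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [0, 0, 0, 1, 1, 1, 1],
    'b': [1, 2, 2, 3, 1, 1, 1],
    'c': [np.nan, 2, 3, 1, np.nan, 2, 2],
    'd': [1, np.nan, 4, np.nan, 3, 3, 4],
})

for col in ['c', 'd']:
    # True for every row whose (a, b) group has at least one NaN in this column.
    group_has_nan = df[col].isna().groupby([df['a'], df['b']]).transform('any')
    df[col] = df[col].mask(group_has_nan)

# Take c where it is present, otherwise fall back to d.
df['e'] = df['c'].combine_first(df['d'])
print(df)
```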

fill NaN values with mean based on another column specific value

I want to fill the NaN values in column c of my dataframe with the mean, but only for rows whose Category is B, ignoring the others.
print (df)
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 NaN
4 A 2 1.0
5 B 2 NaN
6 C 1 3.0
7 C 1 2.0
8 B 1 NaN
So what I'm doing at the moment is:
df.c = df.c.fillna(df.c.mean())
But this fills all the NaN values, whereas I only want to fill rows 3, 5 and 8, whose Category is B.
Combine fillna with slicing assignment
df.loc[df.Category.eq('B'), 'c'] = (df.loc[df.Category.eq('B'), 'c'].
fillna(df.c.mean()))
Out[736]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
Or use a direct assignment with two masks.
pandas.Series.eq is the element-wise equality operator.
df.loc[df.Category.eq('B') & df.c.isna(), 'c'] = df.c.mean()
Out[745]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
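For a copy-and-run version of the two-mask approach, here is a small sketch that rebuilds the question's frame. The names fill_value and to_fill are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Category': list('ACABABCCB'),
    'b': [1, 1, 1, 2, 2, 2, 1, 1, 1],
    'c': [5.0, np.nan, 4.0, np.nan, 1.0, np.nan, 3.0, 2.0, np.nan],
})

fill_value = df['c'].mean()  # NaN is skipped: (5 + 4 + 1 + 3 + 2) / 5 = 3.0
to_fill = df['Category'].eq('B') & df['c'].isna()
df.loc[to_fill, 'c'] = fill_value
print(df)
```

Note that in this sample every B row is NaN in c, so a mean taken over the B rows alone would itself be NaN; the column-wide mean is what gives 3.0.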
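A row-wise apply also works, though it is slower than the masked assignments above. Inside the lambda, row['c'] is a scalar, so test it with pd.isna rather than calling fillna on it:
df.c = df.apply(
lambda row: df.c.mean() if row['Category'] == 'B' and pd.isna(row['c']) else row['c'], axis=1)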

Cut dataframe at row

I have a dataframe that I want to cut at a specific row, and then I want to add the cut-off part to the right of the data frame.
I hope my example clarifies what I mean.
I appreciate your help.
Example:
Column_name1 Column_name2 column_name3 Column_name4
0
1
2
3
4
5------------------------------------------------------< cut here
6
7
8
9
10
Column_name1 Column_name2 column_name3 column_name4 column_name5
0 5
1 6
2 7 add cut here
3 8
4 9
Use:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
n = 3
df = pd.concat([df.iloc[:n].reset_index(drop=True),
df.iloc[n:].add_prefix('cutted_').reset_index(drop=True)], axis=1)
print (df)
A B C D E F cutted_A cutted_B cutted_C cutted_D cutted_E cutted_F
0 a 4 7 1 5 a d 5 4 7 9 b
1 b 5 8 3 3 a e 5 2 1 2 b
2 c 4 9 5 6 a f 4 3 0 4 b
# Re-create the original df first, then cut at a different row:
n = 5
df = pd.concat([df.iloc[:n].reset_index(drop=True),
df.iloc[n:].add_prefix('cutted_').reset_index(drop=True)], axis=1)
print (df)
A B C D E F cutted_A cutted_B cutted_C cutted_D cutted_E cutted_F
0 a 4 7 1 5 a f 4.0 3.0 0.0 4.0 b
1 b 5 8 3 3 a NaN NaN NaN NaN NaN NaN
2 c 4 9 5 6 a NaN NaN NaN NaN NaN NaN
3 d 5 4 7 9 b NaN NaN NaN NaN NaN NaN
4 e 5 2 1 2 b NaN NaN NaN NaN NaN NaN
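The two runs above can be sketched as one reusable function that leaves the input untouched, so different cut points can be tried without rebuilding the frame. cut_and_append is a hypothetical helper name, and the frame is a shortened version of the answer's.

```python
import pandas as pd

def cut_and_append(df, n, prefix='cutted_'):
    """Return a new frame: rows [:n] on the left, rows [n:] moved to the right."""
    top = df.iloc[:n].reset_index(drop=True)
    bottom = df.iloc[n:].add_prefix(prefix).reset_index(drop=True)
    return pd.concat([top, bottom], axis=1)

df = pd.DataFrame({'A': list('abcdef'), 'B': [4, 5, 4, 5, 5, 4]})
print(cut_and_append(df, 3))
print(cut_and_append(df, 5))  # df itself is unchanged, so n can be varied
```

When the two parts have different lengths, the shorter side is padded with NaN, as in the n = 5 output.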

How can I merge two dataframes of dissimilar size and preserve their column order?

Consider the dataframes
A:
g N a
1 3 5
2 4 6
and B:
g N a e
3 3 4 7
4 9 1 8
Is there some way to merge these such that the resultant dataframe is:
g N a e
1 3 5 NaN
2 4 6 NaN
3 3 4 7
4 9 1 8
In other words, is there some way to preserve the column order rather than re-sort lexicographically?
Use reindex (reindex_axis was deprecated and has been removed from current pandas):
pd.concat([A,B]).reindex(columns=B.columns)
Output:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
0 3 3 4 7.0
1 4 9 1 8.0
When merging, specify sort=False.
In [1251]: A.merge(B, how='outer', sort=False)
Out[1251]:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
2 3 3 4 7.0
3 4 9 1 8.0
The following should do the trick: pd.concat([a, b])[b.columns]
Full test code:
import pandas as pd
from io import StringIO
a = pd.read_csv(StringIO("""
g N a
1 3 5
2 4 6
"""), sep=r"\s+")
b = pd.read_csv(StringIO("""
g N a e
3 3 4 7
4 9 1 8
"""), sep=r"\s+")
pd.concat([a, b])[b.columns]
This produces:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
0 3 3 4 7.0
1 4 9 1 8.0
You might also want to reset the index:
pd.concat([a, b])[b.columns].reset_index(drop=True)
... in order to remove index duplicates. This gives:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
2 3 3 4 7.0
3 4 9 1 8.0
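Putting the pieces together, a compact sketch that builds the two frames directly and does the concatenation, index reset, and column ordering in one expression. The variable name merged is illustrative.

```python
import pandas as pd

A = pd.DataFrame({'g': [1, 2], 'N': [3, 4], 'a': [5, 6]})
B = pd.DataFrame({'g': [3, 4], 'N': [3, 9], 'a': [4, 1], 'e': [7, 8]})

# Stack the rows, renumber the index, then force B's column order.
merged = pd.concat([A, B], ignore_index=True).reindex(columns=B.columns)
print(merged)
```

Reindexing on B.columns works here because B holds every column of the result; with frames whose columns only partly overlap, the desired order would have to be spelled out explicitly.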
