I want to group by two columns, 'Y' and 'A', and then apply shift().rolling() to column 'ValueA'.
I tried the code below, but the result is not correct.
Code
df = pd.DataFrame({
'Y' : [0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1],
'A' : ['b','c','a','c','a','c','b','c','a', 'a', 'b', 'b','c','a','a','b'],
'B': ['a', 'a', 'b', 'b','c','a','a','b','b','c','a','c','a','c','b','c'],
'ValueA':[1,2,2,1,2,4,7,1,3,2,4,3,1,2,4,5],
'ValueB':[3,2,4,3,1,2,4,5,1,2,2,1,2,4,7,1]
})
df['ValueX'] = df.groupby(['Y','A'])['ValueA'].shift().rolling(3, min_periods=3).sum()
Output for 'A' == 'a':
Y A B ValueA ValueB ValueX
2 0 a b 2 4 NaN
4 1 a c 2 1 NaN
8 1 a b 3 1 NaN
9 1 a c 2 2 9.0
13 1 a c 2 4 7.0
14 1 a b 4 7 5.0
Expected Output
Y A B ValueA ValueB ValueX
2 0 a b 2 4 NaN
4 1 a c 2 1 NaN
8 1 a b 3 1 NaN
9 1 a c 2 2 NaN
13 1 a c 2 4 7.0
14 1 a b 4 7 7.0
We need to perform both the shift and the rolling operation per group. Your code performs the shift per group but then applies the rolling sum over the entire column, which is what produces the incorrect output.
df['ValueX'] = df.groupby(['Y', 'A'])['ValueA']\
.apply(lambda v: v.shift().rolling(3).sum())
print(df)
Y A B ValueA ValueB ValueX
0 0 b a 1 3 NaN
1 0 c a 2 2 NaN
2 0 a b 2 4 NaN
3 1 c b 1 3 NaN
4 1 a c 2 1 NaN
5 1 c a 4 2 NaN
6 1 b a 7 4 NaN
7 1 c b 1 5 NaN
8 1 a b 3 1 NaN
9 1 a c 2 2 NaN
10 1 b a 4 2 NaN
11 1 b c 3 1 NaN
12 1 c a 1 2 6.0
13 1 a c 2 4 7.0
14 1 a b 4 7 7.0
15 1 b c 5 1 14.0
As a side note, you don't have to specify the optional min_periods argument explicitly; it defaults to the window size.
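As an alternative (a minimal sketch using the same df as above), GroupBy.transform also accepts a function that returns a same-length result, which keeps the output aligned with df's index so it can be assigned directly:
df['ValueX'] = df.groupby(['Y', 'A'])['ValueA'] \
    .transform(lambda v: v.shift().rolling(3).sum())
# transform returns values aligned to df's original index, so no
# reindexing is needed before the assignment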
I have a dataframe similar to the one below:
id A B C D E
1 2 3 4 5 5
1 NaN 4 NaN 6 7
2 3 4 5 6 6
2 NaN NaN 5 4 1
I want to impute the null values in columns A, B and C by forward filling, but per group. That is, I want the forward fill applied within each id. How can I do that?
Use GroupBy.ffill for forward filling per group across all the columns. If the first values in a group are NaN there is nothing to fill them from, so you can chain fillna and finally cast to integers:
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 NaN 4.0 NaN 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 NaN NaN 5.0 4 1.0
cols = ['A','B','C']
df[cols] = df.groupby('id')[cols].ffill().fillna(0).astype(int)
print (df)
id A B C D E
0 1 2 3 4 5 NaN
1 1 2 4 4 6 NaN
2 2 3 4 5 6 6.0
3 2 3 4 5 4 1.0
Detail:
print (df.groupby('id')[cols].ffill().fillna(0).astype(int))
   A  B  C
0  2  3  4
1  2  4  4
2  3  4  5
3  3  4  5
Or:
cols = ['A','B','C']
df.update(df.groupby('id')[cols].ffill().fillna(0))
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 2.0 4.0 4.0 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 3.0 4.0 5.0 4 1.0
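If you'd rather keep the leading NaNs instead of inventing a 0 for groups that start with missing values, a sketch using the nullable Int64 dtype (available since pandas 1.0) stays integer without the fillna step:
cols = ['A','B','C']
# Int64 (capital I) is pandas' nullable integer dtype; missing values
# survive as <NA> instead of forcing a float upcast
df[cols] = df.groupby('id')[cols].ffill().astype('Int64')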
df:
id cond1 a b c d
0 Q b 1 1 nan 1
1 R b 8 3 nan 3
2 Q a 12 4 8 nan
3 Q b 8 3 nan 1
4 R b 1 2 nan 3
5 Q a 7 9 8 nan
6 Q b 4 4 nan 1
7 R b 9 8 nan 3
8 Q a 0 10 8 nan
Group by id and cond1, then take a rolling(2).sum() of the column whose name matches the group's cond1 value:
# x.name holds the group key tuple (id, cond1); x.name[1] picks the
# column named by cond1 for that group
df.groupby(['id','cond1']).apply(lambda x: x[x.name[1]].rolling(2).sum())
Output:
id cond1
Q a 2 nan
5 19.00000
8 7.00000
b 0 nan
3 4.00000
6 7.00000
R b 1 nan
4 5.00000
7 10.00000
dtype: float64
Why is the output in a table form? Can it be in a series form and its index reset?
You can use reset_index() to turn the grouped result back into a flat object.
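For example, a sketch using the groupby above (the names out, flat and tidy are mine):
out = df.groupby(['id','cond1']).apply(lambda x: x[x.name[1]].rolling(2).sum())
# Drop the (id, cond1) key levels, leaving a plain Series indexed by
# the original row labels
flat = out.reset_index(level=['id','cond1'], drop=True).sort_index()
# Or flatten everything into columns of a regular DataFrame
tidy = out.reset_index(name='roll2')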
Basically, I want to group by all the ids from the two columns (['id1','id2']) and get the rolling sum (over the past 2 rows) of their respective values from columns ['value1','value2'].
df:
id1 id2 value1 value2
-----------------------------------
a b 10 5
c a 5 10
b c 0 0
c d 2 1
d a 10 20
a c 5 10
b a 10 5
a b 5 2
c a 2 5
d b 5 2
df (filtered to id 'a') -- just to simplify, I'm showing only the rows involving id 'a':
id1 id2 value1 value2 a.rolling.sum(2)
-----------------------------------------------------
a b 10 5 NaN
c a 5 10 20
d a 10 20 30
a c 5 10 25
b a 10 5 10
a b 5 2 10
c a 2 5 10
expected df (including all ids from columns ['id1','id2']):
id1 id2 value1 value2 a.rolling.sum(2) b.rolling.sum(2) c.rolling.sum(2)
---------------------------------------------------------------------------------------------
a b 10 5 NaN NaN NaN
c a 5 10 20 NaN NaN
b c 0 0 NaN 5 5
c d 2 1 NaN NaN 2
d a 10 20 30 NaN NaN
a c 5 10 25 NaN 12
b a 10 5 10 10 NaN
a b 5 2 10 12 NaN
c a 2 5 10 NaN 12
d b 5 2 NaN 4 NaN
Preferably I need a groupby-style approach that computes the rolling(2) sum for every id involved, since the original dataset has hundreds of ids.
Reconfigure
# Split off the 'id*' and 'value*' columns and turn the trailing digit
# into a second column level, e.g. ('id', '1'), ('id', '2')
i = df.filter(like='id')
i.columns = [i.columns.str[:2], i.columns.str[2:]]
v = df.filter(like='va')
v.columns = [v.columns.str[:5], v.columns.str[5:]]
d = i.join(v)
d
id value
1 2 1 2
0 a b 10 5
1 c a 5 10
2 b c 0 0
3 c d 2 1
4 d a 10 20
5 a c 5 10
6 b a 10 5
7 a b 5 2
8 c a 2 5
9 d b 5 2
Shuffle Stuff About
def modified_roll(x):
    # Only roll over the rows where this id actually appears
    return x.dropna().rolling(2).sum()

# Pair each id with its value, pivot so each id gets its own column,
# roll per id, then collapse back to one row per original row label
extra_bit = d.stack().set_index('id', append=True).unstack().value \
    .apply(modified_roll).groupby(level=0).first()
extra_bit
id a b c d
0 NaN NaN NaN NaN
1 20.0 NaN NaN NaN
2 NaN 5.0 5.0 NaN
3 NaN NaN 2.0 NaN
4 30.0 NaN NaN 11.0
5 25.0 NaN 12.0 NaN
6 10.0 10.0 NaN NaN
7 10.0 12.0 NaN NaN
8 10.0 NaN 12.0 NaN
9 NaN 4.0 NaN 15.0
join
df.join(extra_bit)
id1 id2 value1 value2 a b c d
0 a b 10 5 NaN NaN NaN NaN
1 c a 5 10 20.0 NaN NaN NaN
2 b c 0 0 NaN 5.0 5.0 NaN
3 c d 2 1 NaN NaN 2.0 NaN
4 d a 10 20 30.0 NaN NaN 11.0
5 a c 5 10 25.0 NaN 12.0 NaN
6 b a 10 5 10.0 10.0 NaN NaN
7 a b 5 2 10.0 12.0 NaN NaN
8 c a 2 5 10.0 NaN 12.0 NaN
9 d b 5 2 NaN 4.0 NaN 15.0
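If the reshaping above feels dense, an equivalent long-form sketch arrives at the same extra columns (the names long and rolled are mine, using the same df):
# One (id, value) row per id occurrence, keeping the original row label
long = pd.concat([
    df[['id1', 'value1']].set_axis(['id', 'value'], axis=1),
    df[['id2', 'value2']].set_axis(['id', 'value'], axis=1),
]).sort_index(kind='stable')
# Rolling sum of each id's last two values, indexed by (id, original row)
rolled = long.groupby('id')['value'].apply(lambda s: s.rolling(2).sum())
# An id appears at most once per row, so (id, row) pairs are unique and
# unstacking id into columns is safe before joining back
df.join(rolled.unstack('id'))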
I have two dataframes that I am trying to combine, but I'm not getting the result I want using pandas.concat.
I have a database of data that I want to add new data to, but only where the column names match.
Let's say df1 is:
A B C D
1 1 2 2
3 3 4 4
5 5 6 6
and df2 is:
A E D F
7 7 8 8
9 9 0 0
the result I would like to get is:
A B C D
1 1 2 2
3 3 4 4
5 5 6 6
7 - - 8
9 - - 0
The blank data doesn't have to be - it can be anything.
When I use:
results = pandas.concat([df1, df2], axis=0, join='outer')
it gives me a new dataframe with all of the columns A through F, instead of what I want. Any ideas for how I can accomplish this? Thanks!
You want to use the pd.DataFrame.align method, specifying that you align to the left argument's labels and only along the columns (axis=1).
d1, d2 = df1.align(df2, join='left', axis=1)
Then you can use pd.concat (pd.DataFrame.append also worked, but it was deprecated and removed in pandas 2.0):
pd.concat([d1, d2], ignore_index=True)
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
3 7 NaN NaN 8
4 9 NaN NaN 0
Or
d1.append(d2, ignore_index=True)
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
3 7 NaN NaN 8
4 9 NaN NaN 0
My preferred way would be to skip the reassignment to names
pd.concat(df1.align(df2, join='left', axis=1), ignore_index=True)
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
3 7 NaN NaN 8
4 9 NaN NaN 0
You can find the intersection of df1's columns with df2's and then concat or append:
pd.concat(
[df1, df2[df1.columns.intersection(df2.columns)]]
)
Or, with append (removed in pandas 2.0):
df1.append(df2[df1.columns.intersection(df2.columns)])
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
0 7 NaN NaN 8
1 9 NaN NaN 0
You can also use reindex and concat:
pd.concat([df1,df2.reindex(columns=df1.columns)])
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
0 7 NaN NaN 8
1 9 NaN NaN 0
Transpose first before merging.
df1.T.merge(df2.T, how="left", left_index=True, right_index=True).T
A B C D
0_x 1.0 1.0 2.0 2.0
1_x 3.0 3.0 4.0 4.0
2 5.0 5.0 6.0 6.0
0_y 7.0 NaN NaN 8.0
1_y 9.0 NaN NaN 0.0
df1.T:
   0  1  2
A  1  3  5
B  1  3  5
C  2  4  6
D  2  4  6
df2.T:
   0  1
A  7  9
E  7  9
D  8  0
F  8  0
Now the result can be obtained with a left merge, using the indices as the join key by passing left_index=True and right_index=True.
df1.T.merge(df2.T, how="left", left_index=True, right_index=True)
0_x 1_x 2 0_y 1_y
A 1 3 5 7.0 9.0
B 1 3 5 NaN NaN
C 2 4 6 NaN NaN
D 2 4 6 8.0 0.0
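The round trip through .T leaves suffixed row labels (0_x, 1_x, ...); a small cleanup sketch (the name out is mine):
out = df1.T.merge(df2.T, how="left", left_index=True, right_index=True).T
out = out.reset_index(drop=True)  # restore a plain 0..n-1 index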