Rolling sum of all values associated with the ids from two columns - python

Basically, I want to group by all ids across the two columns (['id1', 'id2']) and get the rolling sum (over the past 2 rows) of their respective values from columns ['value1', 'value2'].
df:
id1  id2  value1  value2
------------------------
a    b    10      5
c    a    5       10
b    c    0       0
c    d    2       1
d    a    10      20
a    c    5       10
b    a    10      5
a    b    5       2
c    a    2       5
d    b    5       2
df filtered for id 'a' (just to simplify, I'm showing only id 'a'):
id1  id2  value1  value2  a.rolling.sum(2)
------------------------------------------
a    b    10      5       NaN
c    a    5       10      20
d    a    10      20      30
a    c    5       10      25
b    a    10      5       10
a    b    5       2       10
c    a    2       5       10
expected df (including all ids in columns ['id1', 'id2']):
id1  id2  value1  value2  a.rolling.sum(2)  b.rolling.sum(2)  c.rolling.sum(2)
------------------------------------------------------------------------------
a    b    10      5       NaN               NaN               NaN
c    a    5       10      20                NaN               NaN
b    c    0       0       NaN               5                 5
c    d    2       1       NaN               NaN               2
d    a    10      20      30                NaN               NaN
a    c    5       10      25                NaN               12
b    a    10      5       10                10                NaN
a    b    5       2       10                12                NaN
c    a    2       5       10                NaN               12
d    b    5       2       NaN               4                 NaN
Preferably I need a groupby function that computes x.rolling(2) for every id involved, since the original dataset has hundreds of ids to compute.

Reconfigure
# Split the 'id*' and 'value*' column labels into two levels,
# e.g. 'id1' -> ('id', '1'), so ids and values can be stacked in pairs.
i = df.filter(like='id')
i.columns = [i.columns.str[:2], i.columns.str[2:]]
v = df.filter(like='va')
v.columns = [v.columns.str[:5], v.columns.str[5:]]
d = i.join(v)
d
  id    value
   1  2     1   2
0  a  b    10   5
1  c  a     5  10
2  b  c     0   0
3  c  d     2   1
4  d  a    10  20
5  a  c     5  10
6  b  a    10   5
7  a  b     5   2
8  c  a     2   5
9  d  b     5   2
Shuffle Stuff About
def modified_roll(x):
    return x.dropna().rolling(2).sum()

# Stack the (id, value) pairs into long form, pivot the ids into columns,
# take a per-id rolling(2) sum, then collapse back to one row per original
# index with groupby(level=0).first().
extra_bit = d.stack().set_index('id', append=True).unstack().value \
             .apply(modified_roll).groupby(level=0).first()
extra_bit
id     a     b     c     d
0    NaN   NaN   NaN   NaN
1   20.0   NaN   NaN   NaN
2    NaN   5.0   5.0   NaN
3    NaN   NaN   2.0   NaN
4   30.0   NaN   NaN  11.0
5   25.0   NaN  12.0   NaN
6   10.0  10.0   NaN   NaN
7   10.0  12.0   NaN   NaN
8   10.0   NaN  12.0   NaN
9    NaN   4.0   NaN  15.0
join
df.join(extra_bit)
  id1 id2  value1  value2     a     b     c     d
0   a   b      10       5   NaN   NaN   NaN   NaN
1   c   a       5      10  20.0   NaN   NaN   NaN
2   b   c       0       0   NaN   5.0   5.0   NaN
3   c   d       2       1   NaN   NaN   2.0   NaN
4   d   a      10      20  30.0   NaN   NaN  11.0
5   a   c       5      10  25.0   NaN  12.0   NaN
6   b   a      10       5  10.0  10.0   NaN   NaN
7   a   b       5       2  10.0  12.0   NaN   NaN
8   c   a       2       5  10.0   NaN  12.0   NaN
9   d   b       5       2   NaN   4.0   NaN  15.0
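For comparison, extra_bit can also be built with a melt-style reshape. This is a sketch, not the answer's method: the names long, rolled, and wide are illustrative, and it assumes an id never appears in both id1 and id2 of the same row.
import pandas as pd

# Stack the (id, value) pairs into long form, one id per row,
# keeping the original row labels as the index.
long = pd.concat([
    df[['id1', 'value1']].rename(columns={'id1': 'id', 'value1': 'value'}),
    df[['id2', 'value2']].rename(columns={'id2': 'id', 'value2': 'value'}),
]).sort_index()

# Rolling(2) sum per id, taken over the original row order.
rolled = long.groupby('id')['value'].rolling(2).sum().reset_index(level=0)

# Pivot the ids back into columns, indexed by the original row labels.
wide = rolled.pivot(columns='id', values='value')
df.join(wide)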

Related

How to assign a value from the last row of a preceding group to the next group?

The goal is to put the digit from the last row of the previous letter group into the new column "last_digit_prev_group". The expected, correct value was entered manually by me in the column "col_ok". I tried shift(), but the effect was far from what I expected. Maybe there is some other way?
Forgive me the inconsistency of my post; I'm not an IT specialist and English is not my first language. Thanks in advance for your support.
df = pd.read_csv('C:/Users/.../a.csv',
                 names=['group_letter', 'digit', 'col_ok'],
                 index_col=0)
df['last_digit_prev_group'] = df.groupby('group_letter')['digit'].shift(1)
print(df)
group_letter  digit col_ok  last_digit_prev_group
A                 1      n                    NaN
A                 3      n                    1.0
A                 2      n                    3.0
A                 5      n                    2.0
A                 1      n                    5.0
B                 1      1                    NaN
B                 2      1                    1.0
B                 1      1                    2.0
B                 1      1                    1.0
B                 3      1                    1.0
C                 5      3                    NaN
C                 6      3                    5.0
C                 1      3                    6.0
C                 2      3                    1.0
C                 3      3                    2.0
D                 4      3                    NaN
D                 3      3                    4.0
D                 2      3                    3.0
D                 5      3                    2.0
D                 7      3                    5.0
Use Series.mask with DataFrame.duplicated to keep only the last value of digit in each group, then Series.shift, and finally ffill:
df['last_digit_prev_group'] = (df['digit'].mask(df.duplicated('group_letter', keep='last'))
                                          .shift()
                                          .ffill())
print (df)
   group_letter  digit col_ok  last_digit_prev_group
0             A      1      n                    NaN
1             A      3      n                    NaN
2             A      2      n                    NaN
3             A      5      n                    NaN
4             A      1      n                    NaN
5             B      1      1                    1.0
6             B      2      1                    1.0
7             B      1      1                    1.0
8             B      1      1                    1.0
9             B      3      1                    1.0
10            C      5      3                    3.0
11            C      6      3                    3.0
12            C      1      3                    3.0
13            C      2      3                    3.0
14            C      3      3                    3.0
15            D      4      3                    3.0
16            D      3      3                    3.0
17            D      2      3                    3.0
18            D      5      3                    3.0
19            D      7      3                    3.0
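To see why this works, here are the intermediate steps on the same data (a sketch; the names last_only and shifted are illustrative):
last_only = df['digit'].mask(df.duplicated('group_letter', keep='last'))
# NaN everywhere except the last row of each group: A -> 1, B -> 3, C -> 3, D -> 7
shifted = last_only.shift()
# the last digit of each group now sits on the first row of the next group
df['last_digit_prev_group'] = shifted.ffill()
# forward filling broadcasts that value down through the rest of each group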
If it is possible that some last value is NaN, forward fill per group instead, so the missing value does not leak into the following group:
df['last_digit_prev_group'] = (df['digit'].mask(df.duplicated('group_letter', keep='last'))
                                          .shift()
                                          .groupby(df['group_letter']).ffill())
print (df)
   group_letter  digit col_ok  last_digit_prev_group
0             A    1.0      n                    NaN
1             A    3.0      n                    NaN
2             A    2.0      n                    NaN
3             A    5.0      n                    NaN
4             A    1.0      n                    NaN
5             B    1.0      1                    1.0
6             B    2.0      1                    1.0
7             B    1.0      1                    1.0
8             B    1.0      1                    1.0
9             B    3.0      1                    1.0
10            C    5.0      3                    3.0
11            C    6.0      3                    3.0
12            C    1.0      3                    3.0
13            C    2.0      3                    3.0
14            C    NaN      3                    3.0
15            D    4.0      3                    NaN
16            D    3.0      3                    NaN
17            D    2.0      3                    NaN
18            D    5.0      3                    NaN
19            D    7.0      3                    NaN

How to do forward filling for each group in pandas

I have a dataframe similar to the one below:
id   A    B    C    D  E
1    2    3    4    5  5
1    NaN  4    NaN  6  7
2    3    4    5    6  6
2    NaN  NaN  5    4  1
I want to impute the null values in columns A, B, C by forward filling, but within each group. That is, the forward fill should be applied per id. How can I do that?
Use GroupBy.ffill to forward fill within each group for all columns. If the first values of a group are NaN there is nothing to fill from, so you can also chain fillna and finally cast to integers:
print (df)
   id    A    B    C  D    E
0   1  2.0  3.0  4.0  5  NaN
1   1  NaN  4.0  NaN  6  NaN
2   2  3.0  4.0  5.0  6  6.0
3   2  NaN  NaN  5.0  4  1.0
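For reproducibility, the frame above can be built like this (a sketch, with the values copied from the printout):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'A': [2, np.nan, 3, np.nan],
    'B': [3, 4, 4, np.nan],
    'C': [4, np.nan, 5, 5],
    'D': [5, 6, 6, 4],
    'E': [np.nan, np.nan, 6, 1],
})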
cols = ['A','B','C']
df[cols] = df.groupby('id')[cols].ffill().fillna(0).astype(int)
print (df)
   id  A  B  C  D    E
0   1  2  3  4  5  NaN
1   1  2  4  4  6  NaN
2   2  3  4  5  6  6.0
3   2  3  4  5  4  1.0
Detail:
print (df.groupby('id')[cols].ffill().fillna(0).astype(int))
   A  B  C
0  2  3  4
1  2  4  4
2  3  4  5
3  3  4  5
Or:
cols = ['A','B','C']
df.update(df.groupby('id')[cols].ffill().fillna(0))
print (df)
   id    A    B    C  D    E
0   1  2.0  3.0  4.0  5  NaN
1   1  2.0  4.0  4.0  6  NaN
2   2  3.0  4.0  5.0  6  6.0
3   2  3.0  4.0  5.0  4  1.0

Cut dataframe at row

I have a dataframe that I want to cut at a specific row, and then append the cut-off part to the right of the dataframe.
I hope my example clarifies what I mean.
Appreciate your help.
Example:
Column_name1  Column_name2  Column_name3  Column_name4
 0
 1
 2
 3
 4
 5  --------------------------------------------------< cut here
 6
 7
 8
 9
10

Column_name1  Column_name2  Column_name3  Column_name4  Column_name5
 0                                                       5
 1                                                       6
 2                                                       7   <- add cut here
 3                                                       8
 4                                                       9
Use:
df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':[1,3,5,7,1,0],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
n = 3
df = pd.concat([df.iloc[:n].reset_index(drop=True),
                df.iloc[n:].add_prefix('cutted_').reset_index(drop=True)], axis=1)
print (df)
   A  B  C  D  E  F cutted_A  cutted_B  cutted_C  cutted_D  cutted_E cutted_F
0  a  4  7  1  5  a        d         5         4         7         9        b
1  b  5  8  3  3  a        e         5         2         1         2        b
2  c  4  9  5  6  a        f         4         3         0         4        b
With n = 5, again starting from the original df:
n = 5
df = pd.concat([df.iloc[:n].reset_index(drop=True),
                df.iloc[n:].add_prefix('cutted_').reset_index(drop=True)], axis=1)
print (df)
   A  B  C  D  E  F cutted_A  cutted_B  cutted_C  cutted_D  cutted_E cutted_F
0  a  4  7  1  5  a        f       4.0       3.0       0.0       4.0        b
1  b  5  8  3  3  a      NaN       NaN       NaN       NaN       NaN      NaN
2  c  4  9  5  6  a      NaN       NaN       NaN       NaN       NaN      NaN
3  d  5  4  7  9  b      NaN       NaN       NaN       NaN       NaN      NaN
4  e  5  2  1  2  b      NaN       NaN       NaN       NaN       NaN      NaN
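The same idea wrapped in a small reusable helper (a sketch; the name cut_and_join and the prefix default are illustrative, not part of the answer):
import pandas as pd

def cut_and_join(df, n, prefix='cutted_'):
    # Split at row n and place the bottom part to the right of the top part.
    top = df.iloc[:n].reset_index(drop=True)
    bottom = df.iloc[n:].add_prefix(prefix).reset_index(drop=True)
    return pd.concat([top, bottom], axis=1)

cut_and_join(df, 3)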

Add new dataframe to existing database but only add if column name matches

I have two dataframes that I am trying to combine, but I'm not getting the result I want using pandas.concat.
I have an existing dataframe of data to which I want to add new data, but only where the column names match.
Let's say df1 is:
A  B  C  D
1  1  2  2
3  3  4  4
5  5  6  6
and df2 is:
A  E  D  F
7  7  8  8
9  9  0  0
the result I would like to get is:
A  B  C  D
1  1  2  2
3  3  4  4
5  5  6  6
7  -  -  8
9  -  -  0
The blank data doesn't have to be '-'; it can be anything.
When I use:
results = pandas.concat([df1, df2], axis=0, join='outer')
it gives me a new dataframe with all of the columns A through F, instead of what I want. Any ideas for how I can accomplish this? Thanks!
You want to use the pd.DataFrame.align method and specify that you want to align on the left argument's columns (axis=1):
d1, d2 = df1.align(df2, join='left', axis=1)
Then you can use pd.concat (or pd.DataFrame.append, which was deprecated in pandas 1.4 and removed in 2.0):
pd.concat([d1, d2], ignore_index=True)
   A    B    C  D
0  1  1.0  2.0  2
1  3  3.0  4.0  4
2  5  5.0  6.0  6
3  7  NaN  NaN  8
4  9  NaN  NaN  0
Or
d1.append(d2, ignore_index=True)
   A    B    C  D
0  1  1.0  2.0  2
1  3  3.0  4.0  4
2  5  5.0  6.0  6
3  7  NaN  NaN  8
4  9  NaN  NaN  0
My preferred way is to skip the intermediate names:
pd.concat(df1.align(df2, join='left', axis=1), ignore_index=True)
   A    B    C  D
0  1  1.0  2.0  2
1  3  3.0  4.0  4
2  5  5.0  6.0  6
3  7  NaN  NaN  8
4  9  NaN  NaN  0
You can find the intersection of df1's columns with df2's and use concat or append:
pd.concat(
    [df1, df2[df1.columns.intersection(df2.columns)]]
)
Or,
df1.append(df2[df1.columns.intersection(df2.columns)])
   A    B    C  D
0  1  1.0  2.0  2
1  3  3.0  4.0  4
2  5  5.0  6.0  6
0  7  NaN  NaN  8
1  9  NaN  NaN  0
You can also use reindex and concat:
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[81]:
   A    B    C  D
0  1  1.0  2.0  2
1  3  3.0  4.0  4
2  5  5.0  6.0  6
0  7  NaN  NaN  8
1  9  NaN  NaN  0
Transpose first before merging.
df1.T.merge(df2.T, how="left", left_index=True, right_index=True).T
       A    B    C    D
0_x  1.0  1.0  2.0  2.0
1_x  3.0  3.0  4.0  4.0
2    5.0  5.0  6.0  6.0
0_y  7.0  NaN  NaN  8.0
1_y  9.0  NaN  NaN  0.0
df1.T:
   0  1  2
A  1  3  5
B  1  3  5
C  2  4  6
D  2  4  6

df2.T:
   0  1
A  7  9
E  7  9
D  8  0
F  8  0
Now the result can be obtained with a merge with how="left", using the indices as the join key by passing left_index=True and right_index=True.
df1.T.merge(df2.T, how="left", left_index=True, right_index=True)
   0_x  1_x  2  0_y  1_y
A    1    3  5  7.0  9.0
B    1    3  5  NaN  NaN
C    2    4  6  NaN  NaN
D    2    4  6  8.0  0.0
Note that transposing a frame with mixed dtypes upcasts values to a common type (often object), so this trick is best reserved for small frames.

Get the counts and unique counts of every column in python

Suppose I have a pandas dataframe as follows,
data
id    A   B    C    D   E
1   NaN   1  NaN    1   1
1   NaN   2  NaN    2   2
1   NaN   3  NaN  NaN   3
1   NaN   4  NaN  NaN   4
1   NaN   5  NaN  NaN   5
2   NaN   6  NaN  NaN   6
2   NaN   7    5  NaN   7
2   NaN   8    6    2   8
2   NaN   9  NaN  NaN   9
2   NaN  10  NaN    4  10
3   NaN  11  NaN  NaN  11
3   NaN  12  NaN  NaN  12
3   NaN  13  NaN  NaN  13
3   NaN  14  NaN  NaN  14
3   NaN  15  NaN  NaN  15
3   NaN  16  NaN  NaN  16
I am using the following command,
pd.DataFrame(data.count().sort_values(ascending=False)).reset_index()
and get the following output,
  index   0
0     E  16
1     B  16
2    id  16
3     D   4
4     C   2
5     A   0
I want the following output:
columns  count  unique(id) count
E        16     3
B        16     3
D        4      2
C        2      1
A        0      0
where count is the same as before, and unique(id) count is the number of unique ids with at least one value present in that column. I want both as two fields.
Can anybody help me in doing this?
Thanks
Starting with:
In [7]: df
Out[7]:
    id   A   B    C    D   E
0    1 NaN   1  NaN  1.0   1
1    1 NaN   2  NaN  2.0   2
2    1 NaN   3  NaN  NaN   3
3    1 NaN   4  NaN  NaN   4
4    1 NaN   5  NaN  NaN   5
5    2 NaN   6  NaN  NaN   6
6    2 NaN   7  5.0  NaN   7
7    2 NaN   8  6.0  2.0   8
8    2 NaN   9  NaN  NaN   9
9    2 NaN  10  NaN  4.0  10
10   3 NaN  11  NaN  NaN  11
11   3 NaN  12  NaN  NaN  12
12   3 NaN  13  NaN  NaN  13
13   3 NaN  14  NaN  NaN  14
14   3 NaN  15  NaN  NaN  15
15   3 NaN  16  NaN  NaN  16
Here is a rather inelegant way:
In [8]: (df.groupby('id').count() > 0).sum()
Out[8]:
A    0
B    3
C    1
D    2
E    3
dtype: int64
So, simply make your base DataFrame as you've specified:
In [9]: counts = (df[['A','B','C','D','E']].count().sort_values(ascending=False)).to_frame(name='counts')
In [10]: counts
Out[10]:
   counts
E      16
B      16
D       4
C       2
A       0
And then simply:
In [11]: counts['unique(id) counts'] = (df.groupby('id').count() > 0).sum()
In [12]: counts
Out[12]:
   counts  unique(id) counts
E      16                  3
B      16                  3
D       4                  2
C       2                  1
A       0                  0
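The two pieces can also be assembled in one expression (a sketch; result and the column labels are illustrative):
import pandas as pd

cols = ['A', 'B', 'C', 'D', 'E']
result = pd.DataFrame({
    # non-null count per column
    'counts': df[cols].count(),
    # number of id groups with at least one non-null value per column
    'unique(id) counts': df.groupby('id')[cols].count().gt(0).sum(),
}).sort_values('counts', ascending=False)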
