Python, pandas, cumulative sum in new column on matching groups - python

If I have these columns in a dataframe:
a b
1 5
1 7
2 3
1,2 3
2 5
How do I create column c where column b is summed using groupings of column a (string), keeping the existing dataframe. Some rows can belong to more than one group.
a b c
1 5 15
1 7 15
2 3 11
1,2 3 26
2 5 11
Is there an easy and efficient solution as the dataframe I have is very large.

First split column a and join the result back to the original DataFrame:
print (df.a.str.split(',', expand=True)
.stack()
.reset_index(level=1, drop=True)
.rename('a'))
0 1
1 1
2 2
3 1
3 2
4 2
Name: a, dtype: object
# wrap the chain in parentheses so it can span several lines
df1 = (df.drop('a', axis=1)
         .join(df.a.str.split(',', expand=True)
                   .stack()
                   .reset_index(level=1, drop=True)
                   .rename('a')))
print (df1)
b a
0 5 1
1 7 1
2 3 2
3 3 1
3 3 2
4 5 2
Then use transform with sum, which returns the group totals without aggregating the rows:
df1['c'] = df1.groupby(['a'])['b'].transform('sum')
# cast to str so that ','.join works in the final aggregation
df1['a'] = df1.a.astype(str)
print (df1)
b a c
0 5 1 15
1 7 1 15
2 3 2 11
3 3 1 15
3 3 2 11
4 5 2 11
Finally, group by the index and aggregate each column with agg:
print (df1.groupby(level=0)
.agg({'a':','.join,'b':'first' ,'c':sum})
[['a','b','c']] )
a b c
0 1 5 15
1 1 7 15
2 2 3 11
3 1,2 3 26
4 2 5 11
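The steps above can also be sketched with Series.explode (available since pandas 0.25) in place of split/stack. This is only an illustration under that assumption, not the answerer's code; the name long_df is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "1", "2", "1,2", "2"], "b": [5, 7, 3, 3, 5]})

# One row per (original row, group) pair; the original index is repeated
# for rows that belong to more than one group.
long_df = df.assign(a=df["a"].str.split(",")).explode("a")

# Total of b for each group, broadcast back to every row of long_df.
long_df["c"] = long_df.groupby("a")["b"].transform("sum")

# Collapse back to the original rows: a row in several groups gets the
# sum of those groups' totals. Assignment aligns on the index.
df["c"] = long_df.groupby(level=0)["c"].sum()
print(df)
```

Because the result is assigned back by index, the original dataframe and its column a stay unchanged apart from the new column c.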

Related

How to identify one column with continuous number and same value of another column?

I have a DataFrame with two columns A and B.
I want to create a new column named C to identify the continuous A with the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Let's create a sequential counter using groupby, diff and cumsum, then use factorize to re-encode the counter:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
Use DataFrameGroupBy.diff, test the result for not being equal to 1, take Series.cumsum, and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
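To see why this works, the chain can be broken into its intermediate steps. The sketch below uses the question's data; the names step and new_run are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 10, 11, 12, 13, 18],
                   'B': [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]})

# Difference to the previous A within the same B group;
# the first row of each group has no predecessor and becomes NaN.
step = df.groupby('B')['A'].diff()

# A new run starts wherever the step is not exactly 1 (NaN counts as "not 1").
new_run = step.ne(1)

# Counting the run starts gives a label per run; subtract 1 to start at 0.
df['C'] = new_run.cumsum().sub(1)
print(df)
```

The result keeps the index of df, so it can be assigned as a column directly, unlike the groupby(...).apply(...) attempt in the question.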

Group identical consecutive values in pandas DataFrame

I have the following pandas dataframe :
a
0 0
1 0
2 1
3 2
4 2
5 2
6 3
7 2
8 2
9 1
I want to store the values in another dataframe such that every group of consecutive identical values makes a labeled group, like this:
A B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
Column A represents the value of the group and B represents the number of occurrences.
This is what I've done so far:
df = pd.DataFrame({'a':[0,0,1,2,2,2,3,2,2,1]})
df2 = pd.DataFrame()
for i,g in df.groupby([(df.a != df.a.shift()).cumsum()]):
    vc = g.a.value_counts()
    df2 = df2.append({'A':vc.index[0], 'B': vc.iloc[0]}, ignore_index=True).astype(int)
It works, but it's a bit messy.
Can you think of a shorter/better way of doing this?
Use GroupBy.agg with named aggregation in pandas >= 0.25.0:
new_df= ( df.groupby(df['a'].ne(df['a'].shift()).cumsum(),as_index=False)
.agg(A=('a','first'),B=('a','count')) )
print(new_df)
A B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
For pandas < 0.25.0:
new_df= ( df.groupby(df['a'].ne(df['a'].shift()).cumsum(),as_index=False)
.a
.agg({'A':'first','B':'count'}) )
I would try:
df['blocks'] = df['a'].ne(df['a'].shift()).cumsum()
(df.groupby(['a','blocks'],
as_index=False,
sort=False)
.count()
.drop('blocks', axis=1)
)
Output:
a B
0 0 2
1 1 1
2 2 3
3 3 1
4 2 2
5 1 1
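The run-labelling idea shared by the answers above can be sketched in a self-contained form. The names run_id and runs are illustrative, and named aggregation is assumed to be available (pandas 0.25 or later).

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 2, 2, 2, 3, 2, 2, 1]})

# A value that differs from its predecessor starts a new run;
# the cumulative sum of those starts gives each run its own id.
run_id = df["a"].ne(df["a"].shift()).cumsum()

# One row per run: the run's value and its length.
runs = (df.groupby(run_id)["a"]
          .agg(A="first", B="size")
          .reset_index(drop=True))
print(runs)
```

Grouping by run_id rather than by the value itself is what keeps the two separate runs of 2 apart.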

difference between two dataframes in Pandas

I am trying to find the difference between two dataframes; the resulting df should contain the rows matching the first dataframe. Since ids 6 and 7 are not in df2, their count values stay as they are.
My Two Dataframes
Resulting Dataframe:
Use sub with set_index to align the DataFrames on the id column, then add reindex to keep only the ids from df1.id:
df = (df1.set_index('id')
.sub(df2.set_index('id'), fill_value=0)
.reindex(df1['id'])
.astype(int)
.reset_index())
print (df)
id count
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 9
6 7 4
Another solution uses merge with a left join, then subtracts with sub, extracting the count_ column with pop:
df = df1.merge(df2, on='id', how='left', suffixes=('','_'))
df['count'] = df['count'].sub(df.pop('count_'), fill_value=0).astype(int)
print (df)
id count
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 9
6 7 4
Setup:
df1 = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'count':[3,5,6,7,2,9,4]})
print (df1)
id count
0 1 3
1 2 5
2 3 6
3 4 7
4 5 2
5 6 9
6 7 4
df2 = pd.DataFrame({'id':[1,2,3,4,5,8,9],
'count':[3,5,6,7,2,4,2]})
print (df2)
id count
0 1 3
1 2 5
2 3 6
3 4 7
4 5 2
5 8 4
6 9 2
Use:
temp = pd.merge(df1, df2, how='left', on='id').fillna(0)
temp['count'] = temp['count_x'] - temp['count_y']
temp[['id', 'count']]
id count
0 1 0.0
1 2 0.0
2 3 0.0
3 4 0.0
4 5 0.0
5 6 9.0
6 7 4.0
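As a further variant, the same result can be sketched with Series.map as a lookup instead of sub or merge. This is a swapped-in technique, not one of the answers above, and it assumes the ids in df2 are unique; the names other and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7],
                    "count": [3, 5, 6, 7, 2, 9, 4]})
df2 = pd.DataFrame({"id": [1, 2, 3, 4, 5, 8, 9],
                    "count": [3, 5, 6, 7, 2, 4, 2]})

# Look up df2's count for every id in df1; ids missing from df2 give NaN,
# which is treated as 0 so that df1's count is left unchanged.
other = df1["id"].map(df2.set_index("id")["count"]).fillna(0).astype(int)

# Subtract row by row; ids that exist only in df2 (8 and 9) never appear.
result = df1.assign(count=df1["count"] - other)
print(result)
```

Casting back with astype(int) avoids the float output (0.0, 9.0, ...) that the plain fillna approach produces.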

How do I multiply a pandas column with a part of a multi index dataframe

I have a data frame with a multi index and one column.
Index fields are type and amount, the column is called count
I would like to add a column that multiplies amount and count
df2 = df.groupby(['type','amount']).count().copy()
# I then dropped all columns but one and renamed it to "count"
df2['total_amount'] = df2['count'].multiply(df2['amount'], axis='index')
doesn't work. I get a key error on amount.
How do I access a part of the multi index to use it in calculations?
Use GroupBy.transform to get a Series of aggregated values with the same size as the original df, so the two can be multiplied:
count = df.groupby(['type','amount'])['type'].transform('count')
df['total_amount'] = df['amount'].multiply(count, axis='index')
print (df)
A amount C D E type total_amount
0 a 4 7 1 5 a 8
1 b 5 8 3 3 a 5
2 c 4 9 5 6 a 8
3 d 5 4 7 9 b 10
4 e 5 2 1 2 b 10
5 f 4 3 0 4 b 4
Or aggregate first and take amount from the index level (sample data shown first):
df = pd.DataFrame({'A':list('abcdef'),
'amount':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'type':list('aaabbb')})
print (df)
A amount C D E type
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df2 = df.groupby(['type','amount'])['type'].count().to_frame('count')
df2['total_amount'] = df2['count'].mul(df2.index.get_level_values('amount'))
print (df2)
count total_amount
type amount
a 4 2 8
5 1 5
b 4 1 4
5 2 10
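The KeyError in the question arises because, after the groupby, amount is an index level rather than a column. Besides get_level_values, one possible way around this is to turn the index levels back into columns with reset_index. A minimal sketch, using only the type and amount columns of the sample data (counts and flat are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"type": list("aaabbb"),
                   "amount": [4, 5, 4, 5, 5, 4]})

# Count rows per (type, amount); both keys end up in the MultiIndex.
counts = df.groupby(["type", "amount"]).size().to_frame("count")

# reset_index moves the index levels back into ordinary columns,
# so amount can be used like any other column.
flat = counts.reset_index()
flat["total_amount"] = flat["count"] * flat["amount"]
print(flat)
```

This gives a flat frame; if the multi-index should be kept, the get_level_values version above is the more direct choice.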

How to add rows into existing dataframe in pandas? - python

df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12]})
How can I insert a new row of zeros at index 0 in one single line?
I tried pd.concat([pd.DataFrame([[0,0,0]]),df) but it did not work.
The desired output:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
You can concat the temp df with the original df, but you need to pass the same column names so that the columns align in the concatenated df. Additionally, to get the index you want, call reset_index with drop=True.
In [87]:
pd.concat([pd.DataFrame([[0,0,0]], columns=df.columns),df]).reset_index(drop=True)
Out[87]:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
Alternatively to EdChum's solution, you can do this (note that DataFrame.append was removed in pandas 2.0, so this only works on older versions):
In [163]: pd.DataFrame([[0,0,0]], columns=df.columns).append(df, ignore_index=True)
Out[163]:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
An answer more specific to the dataframe being prepended to:
pd.concat([df.iloc[[0], :] * 0, df]).reset_index(drop=True)
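For reference, a sketch of the concat approach that builds the zero row from a scalar and lets concat renumber the index itself. The name zero_row is illustrative, and the example assumes all columns are numeric.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]})

# A one-row frame of zeros with the same column names as df,
# so the columns line up when concatenating.
zero_row = pd.DataFrame(0, index=[0], columns=df.columns)

# ignore_index=True renumbers the rows 0..n, which replaces
# the separate reset_index(drop=True) call.
out = pd.concat([zero_row, df], ignore_index=True)
print(out)
```

Because it relies only on pd.concat, this form does not depend on DataFrame.append.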
