Pandas: sum values in some column - python

I need to group rows and sum the values of one column.
member_id event_path event_duration
0 111 vk.com 1
1 111 twitter.com 4
2 111 facebook.com 56
3 111 vk.com 23
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 56
7 333 avito.ru 8
8 333 avito.ru 4
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
12 111 vk.com 10
13 222 vk.com 20
And I want to unify member_id and event_path and sum event_duration.
Desired output
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 76
7 333 avito.ru 12
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
I use
df['event_duration'] = df.groupby(['member_id', 'event_path'])['event_duration'].transform('sum')
but I get
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 111 vk.com 34
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 76
7 333 avito.ru 12
8 333 avito.ru 12
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
12 111 vk.com 34
13 222 vk.com 76
What am I doing wrong?

You need groupby with the parameters sort=False and as_index=False, aggregating with sum:
df = df.groupby(['member_id','event_path'],sort=False,as_index=False)['event_duration'].sum()
print (df)
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 222 vesti.ru 6
4 222 facebook.com 23
5 222 vk.com 76
6 333 avito.ru 12
7 444 mail.ru 7
8 444 vk.com 20
9 444 yandex.ru 40
Another possible solution is to add reset_index:
df = df.groupby(['member_id', 'event_path'],sort=False)['event_duration'].sum().reset_index()
print (df)
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 222 vesti.ru 6
4 222 facebook.com 23
5 222 vk.com 76
6 333 avito.ru 12
7 444 mail.ru 7
8 444 vk.com 20
9 444 yandex.ru 40
The transform function is used to add an aggregated calculation back to the original df as a new column.

What you are doing wrong is assigning the result back to a column of the original dataframe. transform returns a Series with the same number of rows as the original, so every row in a group receives that group's sum and nothing is collapsed. Use an aggregating sum, as above, if you want one row per group.
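For reference, a minimal sketch contrasting the two operations (the frame is rebuilt from the sample data above; names match the question):
import pandas as pd

df = pd.DataFrame({
    'member_id': [111, 111, 111, 111, 222, 222, 222, 333, 333, 444, 444, 444, 111, 222],
    'event_path': ['vk.com', 'twitter.com', 'facebook.com', 'vk.com', 'vesti.ru',
                   'facebook.com', 'vk.com', 'avito.ru', 'avito.ru', 'mail.ru',
                   'vk.com', 'yandex.ru', 'vk.com', 'vk.com'],
    'event_duration': [1, 4, 56, 23, 6, 23, 56, 8, 4, 7, 20, 40, 10, 20]})

# sum() collapses each (member_id, event_path) pair into a single row
collapsed = df.groupby(['member_id', 'event_path'], sort=False, as_index=False)['event_duration'].sum()

# transform('sum') keeps all 14 rows and broadcasts each group's sum back onto them
broadcast = df.groupby(['member_id', 'event_path'])['event_duration'].transform('sum')

print(collapsed)  # 10 rows, one per group
print(broadcast)  # 14 values, same index as df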

Related

Python Return the First Occurrence in a Group

I have been looking for a way to find the first occurrence in a series of rows based on a group.
First I went through and applied a 'group' counter to each group. Then I want to return the ID of the first occurrence of 'sold' under status as a new column and apply it to the whole group.
Example below. Final_ID is the new column to be created.
group ID status Final_ID
1 100 view 103
1 101 show 103
1 102 offer 103
1 103 sold 103
1 104 view 103
2 105 view 106
2 106 sold 106
2 107 sold 106
3 108 pending 109
3 109 sold 109
3 110 view 109
4 111 sold 111
4 112 sold 111
4 113 sold 111
4 114 sold 111
I have tried using
df = pd.DataFrame ({'group':['1','1','1','1','1','2','2','2','3','3','3','4','4','4','4'],
'ID':['100','101','102','103','104','105','106','107','108','109','110','111','112','113','114'],
'status':['view','show','offer','sold','view','view','sold','sold','pending','sold','view','sold','sold','sold','sold']
})
df2 = df[(df.status=='sold')][['group','ID']].groupby('group')['ID'].apply(min).reset_index()
df2=df.merge(df2, on='group' , how='left')
but I am not sure that is the proper way to go about it. Any other thoughts?
Mask your ID series wherever status is not sold, then group by your groups and transform with first, which picks the first non-NaN value in each group; in this case that is the first occurrence of sold:
df['ID'].mask(df['status'] != 'sold').groupby(df['group']).transform('first').astype(int)
0 103
1 103
2 103
3 103
4 103
5 106
6 106
7 106
8 109
9 109
10 109
11 111
12 111
13 111
14 111
Name: Final_ID, dtype: int32
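A usage sketch for attaching the result as the new column (Final_ID is the column name from the question):
df['Final_ID'] = (df['ID'].mask(df['status'] != 'sold')
                          .groupby(df['group'])
                          .transform('first')
                          .astype(int))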
Assuming the ID column is already sorted, you can do:
(
df.set_index('group')
.assign(Final_ID=df.loc[df.status=='sold'].groupby(by='group').ID.first())
.reset_index()
)
group ID status Final_ID
0 1 100 view 103
1 1 101 show 103
2 1 102 offer 103
3 1 103 sold 103
4 1 104 view 103
5 2 105 view 106
6 2 106 sold 106
7 2 107 sold 106
8 3 108 pending 109
9 3 109 sold 109
10 3 110 view 109
11 4 111 sold 111
12 4 112 sold 111
13 4 113 sold 111
14 4 114 sold 111
You need to look at the sold rows, drop the status column, group by group (not by ID), and take the min:
df.merge(df.loc[df.status=='sold'].drop('status', axis=1).groupby(['group'], as_index=False).min()
         .rename(columns={'ID': 'Final_ID'}))
Output:
group ID status Final_ID
0 1 100 view 103
1 1 101 show 103
2 1 102 offer 103
3 1 103 sold 103
4 1 104 view 103
5 2 105 view 106
6 2 106 sold 106
7 2 107 sold 106
8 3 108 pending 109
9 3 109 sold 109
10 3 110 view 109
11 4 111 sold 111
12 4 112 sold 111
13 4 113 sold 111
14 4 114 sold 111

How to calculate cumulative groupby counts in Pandas with point in time?

I have a df that contains multiple weekly snapshots of JIRA tickets. I want to calculate the YTD counts of tickets.
the df looks like this:
pointInTime ticketId
2008-01-01 111
2008-01-01 222
2008-01-01 333
2008-01-07 444
2008-01-07 555
2008-01-07 666
2008-01-14 777
2008-01-14 888
2008-01-14 999
So with df.groupby(['pointInTime'])['ticketId'].count() I can get the count of IDs in every snapshot. But what I want to achieve is the cumulative sum,
and to have a df that looks like this:
pointInTime ticketId cumCount
2008-01-01 111 3
2008-01-01 222 3
2008-01-01 333 3
2008-01-07 444 6
2008-01-07 555 6
2008-01-07 666 6
2008-01-14 777 9
2008-01-14 888 9
2008-01-14 999 9
so for 2008-01-07 the number of tickets would be the count for 2008-01-07 plus the count for 2008-01-01.
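For anyone who wants to run the answers below, a minimal sketch that rebuilds the sample frame (dates kept as plain strings; parse them with pd.to_datetime if you need real dates):
import numpy as np   # used by the later answers
import pandas as pd

df = pd.DataFrame({
    'pointInTime': ['2008-01-01'] * 3 + ['2008-01-07'] * 3 + ['2008-01-14'] * 3,
    'ticketId': [111, 222, 333, 444, 555, 666, 777, 888, 999]})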
Use GroupBy.count and cumsum, then map the result back to "pointInTime":
df['cumCount'] = (
    df['pointInTime'].map(df.groupby('pointInTime')['ticketId'].count().cumsum()))
df
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
I am using value_counts (sort_index is needed because value_counts orders by frequency, not by date):
df.pointInTime.map(df.pointInTime.value_counts().sort_index().cumsum())
Out[207]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
Name: pointInTime, dtype: int64
Or, assuming the frame is sorted by pointInTime:
pd.Series(np.arange(len(df))+1,index=df.index).groupby(df['pointInTime']).transform('last')
Out[216]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
dtype: int32
Here's an approach transforming with the size and multiplying by the result of pd.factorize on pointInTime (this works here because every snapshot contains the same number of tickets):
df['cumCount'] = (df.groupby('pointInTime').ticketId
                    .transform('size')
                    .mul(pd.factorize(df.pointInTime)[0] + 1))
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9

Pandas: groupby neighboring identical elements

I need to group a dataframe
df = pd.DataFrame({'id': [111, 111, 111, 111, 111, 222, 222], 'domain': ['vk.com', 'facebook.com', 'facebook.com', 'twitter.com', 'vk.com', 'facebook.com', 'twitter.com'], 'time': ['2017-01-12', '2017-01-12', '2017-01-12', '2017-01-13', '2017-01-12', '2017-01-14', '2017-01-14'], 'duration': [10, 20, 5, 12, 34, 12, 4]})
I use
df.groupby([df.id, df.domain]).agg({'duration':'sum', 'time': 'first'}).reset_index().reindex(columns=df.columns)
And get
domain duration id time
0 facebook.com 25 111 2017-01-12
1 twitter.com 12 111 2017-01-13
2 vk.com 44 111 2017-01-12
3 facebook.com 12 222 2017-01-14
4 twitter.com 4 222 2017-01-14
But the desired output is:
domain duration id time
vk.com 10 111 2017-01-12
facebook.com 25 111 2017-01-12
vk.com 34 111 2017-01-12
twitter.com 12 111 2017-01-13
facebook.com 12 222 2017-01-14
twitter.com 4 222 2017-01-14
How can I fix that?
Here's an alternative without an extra column:
i = df.domain.ne(df.domain.shift()).cumsum()
m = dict(zip(i, df.domain))
df = df.groupby(['id', i], sort=False)\
       .agg({'duration':'sum', 'time': 'first'})\
       .reset_index()
df.domain = df.domain.map(m)
df
id domain time duration
0 111 vk.com 2017-01-12 10
1 111 facebook.com 2017-01-12 25
2 111 twitter.com 2017-01-13 12
3 111 vk.com 2017-01-12 34
4 222 facebook.com 2017-01-14 12
5 222 twitter.com 2017-01-14 4
We can make use of an extra column that marks whether the next domain is equal to the current one:
df['new'] = (df.domain == df.domain.shift(-1)).cumsum()
ndf = df.groupby([df.domain, df.id, df.new]).agg({'duration':'sum', 'time': 'first'}).reset_index()\
        .sort_values('id').reindex(columns=df.columns).drop(['new'], axis=1)
domain duration id time
0 facebook.com 25 111 2017-01-12
2 twitter.com 12 111 2017-01-13
4 vk.com 10 111 2017-01-12
5 vk.com 34 111 2017-01-12
1 facebook.com 12 222 2017-01-14
3 twitter.com 4 222 2017-01-14

Pandas: divide dataframe to some parts

I have dataframe
ID url
111 vk.com
111 facebook.com
111 twitter.com
111 avito.ru
111 apple.com
111 tiffany.com
111 pikabu.ru
111 stackoverflow.com
222 vk.com
222 facebook.com
222 vc.ru
222 twitter.com
I need to add a new column part: group the dataframe by ID and split each group into 4 parts.
Desired output
ID url part
111 vk.com 1
111 facebook.com 1
111 twitter.com 2
111 avito.ru 2
111 apple.com 3
111 tiffany.com 3
111 pikabu.ru 4
111 stackoverflow.com 4
222 vk.com 1
222 facebook.com 2
222 vc.ru 3
222 twitter.com 4
I tried
df.groupby(['ID']).agg({'ID': np.sum / 4}).rename(columns={'ID': 'part'}).reset_index()
But I don't get the desired result with it.
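For reproducibility, a sketch that rebuilds the sample frame used by the answers below:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [111] * 8 + [222] * 4,
    'url': ['vk.com', 'facebook.com', 'twitter.com', 'avito.ru', 'apple.com',
            'tiffany.com', 'pikabu.ru', 'stackoverflow.com',
            'vk.com', 'facebook.com', 'vc.ru', 'twitter.com']})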
You can use groupby with numpy.repeat:
df['part'] = (df.groupby('ID')['ID']
                .apply(lambda x: pd.Series(np.repeat(np.arange(1, 5), len(x.index) // 4)))
                .reset_index(drop=True))
print (df)
ID url part
0 111 vk.com 1
1 111 facebook.com 1
2 111 twitter.com 2
3 111 avito.ru 2
4 111 apple.com 3
5 111 tiffany.com 3
6 111 pikabu.ru 4
7 111 stackoverflow.com 4
8 222 vk.com 1
9 222 facebook.com 2
10 222 vc.ru 3
11 222 twitter.com 4
Another solution with custom function:
def f(x):
    #print (x)
    x['part'] = np.repeat(np.arange(1, 5), len(x.index) // 4)
    return x
df = df.groupby('ID').apply(f)
print (df)
ID url part
0 111 vk.com 1
1 111 facebook.com 1
2 111 twitter.com 2
3 111 avito.ru 2
4 111 apple.com 3
5 111 tiffany.com 3
6 111 pikabu.ru 4
7 111 stackoverflow.com 4
8 222 vk.com 1
9 222 facebook.com 2
10 222 vc.ru 3
11 222 twitter.com 4
If the group lengths are not divisible by 4, you get an error:
ValueError: Length of values does not match length of index
One possible solution is to pad each group so its length is divisible by 4 and then remove the padding rows with dropna:
print (df)
ID url
0 111 vk.com
1 111 avito.ru
2 111 apple.com
3 111 tiffany.com
4 111 pikabu.ru
5 222 vk.com
6 222 facebook.com
7 222 twitter.com
def f(x):
    a = len(x.index) % 4
    if a != 0:
        x = pd.concat([x, pd.DataFrame(index=np.arange(4 - a))])
    x['part'] = np.repeat(np.arange(1, 5), len(x.index) // 4)
    return x

df = df.groupby('ID').apply(f).dropna(subset=['ID']).reset_index(drop=True)
#if necessary convert to int
df.ID = df.ID.astype(int)
print (df)
ID url part
0 111 vk.com 1
1 111 avito.ru 1
2 111 apple.com 2
3 111 tiffany.com 2
4 111 pikabu.ru 3
5 222 vk.com 1
6 222 facebook.com 2
7 222 twitter.com 3

Pandas: union duplicate strings

I have dataframe
ID url date active_seconds
111 vk.com 12.01.2016 5
111 facebook.com 12.01.2016 4
111 facebook.com 12.01.2016 3
111 twitter.com 12.01.2016 12
222 vk.com 12.01.2016 8
222 twitter.com 12.01.2016 34
111 facebook.com 12.01.2016 5
and I need to get
ID url date active_seconds
111 vk.com 12.01.2016 5
111 facebook.com 12.01.2016 7
111 twitter.com 12.01.2016 12
222 vk.com 12.01.2016 8
222 twitter.com 12.01.2016 34
111 facebook.com 12.01.2016 5
If I try
df.groupby(['ID', 'url'])['active_seconds'].sum()
it merges all the rows together. What should I do to get the desired result?
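A sketch rebuilding the sample frame so the snippets below can be run (dates kept as plain strings, as in the question):
import pandas as pd

df = pd.DataFrame({
    'ID': [111, 111, 111, 111, 222, 222, 111],
    'url': ['vk.com', 'facebook.com', 'facebook.com', 'twitter.com',
            'vk.com', 'twitter.com', 'facebook.com'],
    'date': ['12.01.2016'] * 7,
    'active_seconds': [5, 4, 3, 12, 8, 34, 5]})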
(s != s.shift()).cumsum() is a typical way to identify groups of contiguous identifiers
pd.DataFrame.assign is a convenient way to add a new column to a copy of a dataframe and chain more methods
pivot_table allows us to reconfigure our table and aggregate
args - this is a style preference of mine to keep code cleaner looking. I'll pass these arguments to pivot_table via *args
reset_index twice to clean up and get to the final result
args = ('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum')
df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table(*args) \
  .reset_index([1, 2, 3]).reset_index(drop=True)
ID url date active_seconds
0 111 facebook.com 12.01.2016 7
1 111 twitter.com 12.01.2016 12
2 111 vk.com 12.01.2016 5
3 222 twitter.com 12.01.2016 34
4 222 vk.com 12.01.2016 8
5 111 facebook.com 12.01.2016 5
Solution 1 - cumsum by column url only:
You need to group by a custom Series created by a cumsum of a boolean mask, but then the url column needs to be aggregated with first. Then remove the url level with reset_index and finally reorder the columns with reindex:
g = (df.url != df.url.shift()).cumsum()
print (g)
0 1
1 2
2 2
3 3
4 4
5 5
6 6
Name: url, dtype: int32
g = (df.url != df.url.shift()).cumsum()
#another solution with ne
#g = df.url.ne(df.url.shift()).cumsum()
print (df.groupby([df.ID, df.date, g], sort=False).agg({'active_seconds':'sum', 'url':'first'})
         .reset_index(level='url', drop=True)
         .reset_index()
         .reindex(columns=df.columns))
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
g = (df.url != df.url.shift()).cumsum().rename('tmp')
print (g)
0 1
1 2
2 2
3 3
4 4
5 5
6 6
Name: tmp, dtype: int32
print (df.groupby([df.ID, df.url, df.date, g], sort=False)['active_seconds']
         .sum()
         .reset_index(level='tmp', drop=True)
         .reset_index())
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
Solution 2 - cumsum by columns ID and url:
g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
print (g)
ID url
0 1 1
1 1 2
2 1 2
3 1 3
4 2 4
5 2 5
6 3 6
print (df.groupby([g.ID, df.date, g.url], sort=False)
         .agg({'active_seconds':'sum', 'url':'first'})
         .reset_index(level='url', drop=True)
         .reset_index()
         .reindex(columns=df.columns))
ID url date active_seconds
0 1 vk.com 12.01.2016 5
1 1 facebook.com 12.01.2016 7
2 1 twitter.com 12.01.2016 12
3 2 vk.com 12.01.2016 8
4 2 twitter.com 12.01.2016 34
5 3 facebook.com 12.01.2016 5
And a solution that adds the column df.url as a group key, but then it is necessary to rename the columns in the helper df:
g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
g.columns = g.columns + '1'
print (g)
ID1 url1
0 1 1
1 1 2
2 1 2
3 1 3
4 2 4
5 2 5
6 3 6
print (df.groupby([df.ID, df.url, df.date, g.ID1, g.url1], sort=False)['active_seconds']
         .sum()
         .reset_index(level=['ID1','url1'], drop=True)
         .reset_index())
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
Timings:
The solutions are similar, but pivot_table is slower than groupby:
In [180]: %timeit (df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum').reset_index([1, 2, 3]).reset_index(drop=True))
100 loops, best of 3: 5.02 ms per loop
In [181]: %timeit (df.groupby([df.ID, df.url, df.date, (df.url != df.url.shift()).cumsum().rename('tmp')], sort=False)['active_seconds'].sum().reset_index(level='tmp', drop=True).reset_index())
100 loops, best of 3: 3.62 ms per loop
It looks like you want a cumsum():
In [195]: df.groupby(['ID', 'url'])['active_seconds'].cumsum()
Out[195]:
0 5
1 4
2 7
3 12
4 8
5 34
6 12
Name: active_seconds, dtype: int64
