I need to group a dataframe:
df = pd.DataFrame({'id': [111, 111, 111, 111, 111, 222, 222], 'domain': ['vk.com', 'facebook.com', 'facebook.com', 'twitter.com', 'vk.com', 'facebook.com', 'twitter.com'], 'time': ['2017-01-12', '2017-01-12', '2017-01-12', '2017-01-13', '2017-01-12', '2017-01-14', '2017-01-14'], 'duration': [10, 20, 5, 12, 34, 12, 4]})
I use
df.groupby([df.id, df.domain]).agg({'duration':'sum', 'time': 'first'}).reset_index().reindex(columns=df.columns)
And get
domain duration id time
0 facebook.com 25 111 2017-01-12
1 twitter.com 12 111 2017-01-13
2 vk.com 44 111 2017-01-12
3 facebook.com 12 222 2017-01-14
4 twitter.com 4 222 2017-01-14
But the desired output is:
domain duration id time
vk.com 10 111 2017-01-12
facebook.com 25 111 2017-01-12
vk.com 34 111 2017-01-12
twitter.com 12 111 2017-01-13
facebook.com 12 222 2017-01-14
twitter.com 4 222 2017-01-14
How can I fix that?
Here's an alternative without an extra column -
i = df.domain.ne(df.domain.shift()).cumsum()
m = dict(zip(i, df.domain))
df = df.groupby(['id', i], sort=False)\
.agg({'duration':'sum', 'time': 'first'})\
.reset_index()
df.domain = df.domain.map(m)
df
id domain time duration
0 111 vk.com 2017-01-12 10
1 111 facebook.com 2017-01-12 25
2 111 twitter.com 2017-01-13 12
3 111 vk.com 2017-01-12 34
4 222 facebook.com 2017-01-14 12
5 222 twitter.com 2017-01-14 4
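For reference, here is what the helper i and the mapping m look like on the sample data (dtype may differ by platform):
print(i)
0    1
1    2
2    2
3    3
4    4
5    5
6    6
Name: domain, dtype: int64
print(m)
{1: 'vk.com', 2: 'facebook.com', 3: 'twitter.com', 4: 'vk.com', 5: 'facebook.com', 6: 'twitter.com'}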
We can make use of an extra column which denotes whether the next domain is equal to the current domain:
df['new'] = (df.domain == df.domain.shift(-1)).cumsum()
ndf = df.groupby([df.domain,df.id,df.new]).agg({'duration':'sum', 'time': 'first'}).reset_index()\
.sort_values('id').reindex(columns=df.columns).drop(columns='new')
domain duration id time
0 facebook.com 25 111 2017-01-12
2 twitter.com 12 111 2017-01-13
4 vk.com 10 111 2017-01-12
5 vk.com 34 111 2017-01-12
1 facebook.com 12 222 2017-01-14
3 twitter.com 4 222 2017-01-14
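For reference, the helper column new on the sample data is:
print(df['new'].tolist())
[0, 1, 1, 1, 1, 1, 1]
The first vk.com row gets 0 and the later vk.com row gets 1, so grouping by (domain, id, new) keeps the two vk.com visits of id 111 separate while still merging the two consecutive facebook.com rows.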
I have a df that contains multiple weekly snapshots of JIRA tickets. I want to calculate the YTD counts of tickets.
The df looks like this:
pointInTime ticketId
2008-01-01 111
2008-01-01 222
2008-01-01 333
2008-01-07 444
2008-01-07 555
2008-01-07 666
2008-01-14 777
2008-01-14 888
2008-01-14 999
So with df.groupby(['pointInTime'])['ticketId'].count() I can get the count of IDs in every snapshot. But what I want to achieve is to calculate the cumulative sum,
and have a df that looks like this:
pointInTime ticketId cumCount
2008-01-01 111 3
2008-01-01 222 3
2008-01-01 333 3
2008-01-07 444 6
2008-01-07 555 6
2008-01-07 666 6
2008-01-14 777 9
2008-01-14 888 9
2008-01-14 999 9
So for 2008-01-07 the number of tickets would be the count for 2008-01-07 plus the count for 2008-01-01.
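For reference, the plain per-snapshot count I can already get looks like this:
df.groupby(['pointInTime'])['ticketId'].count()
pointInTime
2008-01-01    3
2008-01-07    3
2008-01-14    3
Name: ticketId, dtype: int64
which is the per-snapshot count (3, 3, 3), not the running total (3, 6, 9) that I'm after.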
Use GroupBy.count and cumsum, then map the result back to "pointInTime":
df['cumCount'] = (
    df['pointInTime'].map(df.groupby('pointInTime')['ticketId'].count().cumsum()))
df
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
I am using value_counts:
df.pointInTime.map(df.pointInTime.value_counts().sort_index().cumsum())
Out[207]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
Name: pointInTime, dtype: int64
Or
pd.Series(np.arange(len(df))+1,index=df.index).groupby(df['pointInTime']).transform('last')
Out[216]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
dtype: int32
Here's an approach transforming with the size and multiplying by the result of pd.factorize on pointInTime (this gives the running total here because every snapshot has the same number of rows; for unequal snapshot sizes the map/cumsum approaches above are safer):
df['cumCount'] = (df.groupby('pointInTime').ticketId
                    .transform('size')
                    .mul(pd.factorize(df.pointInTime)[0] + 1))
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
I have a file that is dynamically generated (i.e., the file headers remain the same but the values change). For instance, let the file be of this form:
ID,CLASS,DATE,MRK
1,321,02/12/2016,30
2,321,05/12/2016,40
3,321,06/12/2016,0
4,321,07/12/2016,60
5,321,10/12/2016,70
6,876,5/12/2016,100
7,876,7/12/2016,80
Notice that for CLASS 321 there are some missing dates, namely 03/12/2016, 04/12/2016, 08/12/2016 and 09/12/2016. I'm trying to insert the missing dates in the appropriate places, with the corresponding MRK value set to 0. The expected output would be like so:
ID,CLASS,DATE,MRK
1,321,02/12/2016,30
2,321,03/12/2016,0
3,321,04/12/2016,0
4,321,05/12/2016,40
5,321,06/12/2016,0
6,321,07/12/2016,60
7,321,08/12/2016,0
8,321,09/12/2016,0
9,321,10/12/2016,70
10,876,5/12/2016,100
11,876,6/12/2016,0
12,876,7/12/2016,80
This is what I came up with so far:
import pandas as pd
df = pd.read_csv('In.txt')
resampled_df = df.resample('D').mean()
print(resampled_df)
But I'm getting this exception:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could somebody help out a python newbie here?
Read your CSV like this -
import pandas as pd
import numpy as np

df = pd.read_csv('file.csv',
                 sep=',',
                 parse_dates=['DATE'],
                 dayfirst=True,   # this is important since you have days first
                 index_col=['DATE'])
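As a quick sanity check (not part of the original code), the parsed frame should look roughly like this, with DATE as a DatetimeIndex:
print(df.head())
            ID  CLASS  MRK
DATE
2016-12-02   1    321   30
2016-12-05   2    321   40
2016-12-06   3    321    0
2016-12-07   4    321   60
2016-12-10   5    321   70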
Now, call groupby + resample + first, and tie up loose ends -
df = df.groupby('CLASS').resample('1D')[['ID', 'MRK']].first()
df['ID'] = np.arange(1, len(df) + 1)
df['MRK'] = df['MRK'].fillna(0).astype(int)
df.reset_index()
CLASS DATE ID MRK
0 321 2016-12-02 1 30
1 321 2016-12-03 2 0
2 321 2016-12-04 3 0
3 321 2016-12-05 4 40
4 321 2016-12-06 5 0
5 321 2016-12-07 6 60
6 321 2016-12-08 7 0
7 321 2016-12-09 8 0
8 321 2016-12-10 9 70
9 876 2016-12-05 10 100
10 876 2016-12-06 11 0
11 876 2016-12-07 12 80
In particular, MRK needs the fillna with 0; ID is simply renumbered afterwards.
If the order of columns is important, here's another version.
df = pd.read_csv('file.csv',
                 sep=',',
                 parse_dates=['DATE'],
                 dayfirst=True)
c = df.columns
df = df.set_index('DATE').groupby('CLASS').resample('1D')[['MRK']].first()
df['MRK'] = df.MRK.fillna(0).astype(int)
df['ID'] = np.arange(1, len(df) + 1)
df = df.reset_index().reindex(columns=c)
df['DATE'] = df['DATE'].dt.strftime('%d/%m/%Y')
df
ID CLASS DATE MRK
0 1 321 02/12/2016 30
1 2 321 03/12/2016 0
2 3 321 04/12/2016 0
3 4 321 05/12/2016 40
4 5 321 06/12/2016 0
5 6 321 07/12/2016 60
6 7 321 08/12/2016 0
7 8 321 09/12/2016 0
8 9 321 10/12/2016 70
9 10 876 05/12/2016 100
10 11 876 06/12/2016 0
11 12 876 07/12/2016 80
First convert DATE to datetime, then groupby CLASS and resample, and finally add the ID column with insert:
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)
df = (df.set_index('DATE')
        .groupby('CLASS')
        .resample('d')['MRK']
        .asfreq()
        .fillna(0)
        .astype(int)
        .reset_index())
df.insert(0, 'ID', range(1, len(df) + 1))
print (df)
ID CLASS DATE MRK
0 1 321 2016-12-02 30
1 2 321 2016-12-03 0
2 3 321 2016-12-04 0
3 4 321 2016-12-05 40
4 5 321 2016-12-06 0
5 6 321 2016-12-07 60
6 7 321 2016-12-08 0
7 8 321 2016-12-09 0
8 9 321 2016-12-10 70
9 10 876 2016-12-05 100
10 11 876 2016-12-06 0
11 12 876 2016-12-07 80
Alternative solution:
df = (df.set_index('DATE')
        .groupby('CLASS')
        .resample('d')['MRK']
        .first()
        .fillna(0)
        .astype(int)
        .reset_index())
df.insert(0, 'ID', range(1, len(df) + 1))
print (df)
ID CLASS DATE MRK
0 1 321 2016-12-02 30
1 2 321 2016-12-03 0
2 3 321 2016-12-04 0
3 4 321 2016-12-05 40
4 5 321 2016-12-06 0
5 6 321 2016-12-07 60
6 7 321 2016-12-08 0
7 8 321 2016-12-09 0
8 9 321 2016-12-10 70
9 10 876 2016-12-05 100
10 11 876 2016-12-06 0
11 12 876 2016-12-07 80
Finally, for the same format as the input, use strftime:
df['DATE'] = df['DATE'].dt.strftime('%d/%m/%Y')
print (df)
ID CLASS DATE MRK
0 1 321 02/12/2016 30
1 2 321 03/12/2016 0
2 3 321 04/12/2016 0
3 4 321 05/12/2016 40
4 5 321 06/12/2016 0
5 6 321 07/12/2016 60
6 7 321 08/12/2016 0
7 8 321 09/12/2016 0
8 9 321 10/12/2016 70
9 10 876 05/12/2016 100
10 11 876 06/12/2016 0
11 12 876 07/12/2016 80
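Side note (not from the original answer): .asfreq() and .first() give the same result here because each (CLASS, DATE) pair occurs at most once in the input, so every daily bin contains at most one row. A quick check on the freshly parsed input frame:
print(df.duplicated(subset=['CLASS', 'DATE']).any())
False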
I have a dataframe
ID url
111 vk.com
111 facebook.com
111 twitter.com
111 avito.ru
111 apple.com
111 tiffany.com
111 pikabu.ru
111 stackoverflow.com
222 vk.com
222 facebook.com
222 vc.ru
222 twitter.com
I need to add a new column part, where I group the dataframe by ID and then divide each group into 4 parts.
Desired output:
ID url part
111 vk.com 1
111 facebook.com 1
111 twitter.com 2
111 avito.ru 2
111 apple.com 3
111 tiffany.com 3
111 pikabu.ru 4
111 stackoverflow.com 4
222 vk.com 1
222 facebook.com 2
222 vc.ru 3
222 twitter.com 4
I tried
df.groupby(['ID']).agg({'ID': np.sum / 4}).rename(columns={'ID': 'part'}).reset_index()
But I don't get the desired result with it.
You can use groupby with numpy.repeat:
df['part'] = (df.groupby('ID')['ID']
                .apply(lambda x: pd.Series(np.repeat(np.arange(1, 5), len(x.index) // 4)))
                .reset_index(drop=True))
print (df)
ID url part
0 111 vk.com 1
1 111 facebook.com 1
2 111 twitter.com 2
3 111 avito.ru 2
4 111 apple.com 3
5 111 tiffany.com 3
6 111 pikabu.ru 4
7 111 stackoverflow.com 4
8 222 vk.com 1
9 222 facebook.com 2
10 222 vc.ru 3
11 222 twitter.com 4
Another solution with custom function:
def f(x):
    #print (x)
    x['part'] = np.repeat(np.arange(1, 5), len(x.index) // 4)
    return x
df = df.groupby('ID').apply(f)
print (df)
ID url part
0 111 vk.com 1
1 111 facebook.com 1
2 111 twitter.com 2
3 111 avito.ru 2
4 111 apple.com 3
5 111 tiffany.com 3
6 111 pikabu.ru 4
7 111 stackoverflow.com 4
8 222 vk.com 1
9 222 facebook.com 2
10 222 vc.ru 3
11 222 twitter.com 4
If the group sizes are not divisible by 4, you get an error:
ValueError: Length of values does not match length of index
One possible solution is to append rows so that each group's length is divisible by 4 and then remove them at the end with dropna:
print (df)
ID url
0 111 vk.com
1 111 avito.ru
2 111 apple.com
3 111 tiffany.com
4 111 pikabu.ru
5 222 vk.com
6 222 facebook.com
7 222 twitter.com
def f(x):
    a = len(x.index) % 4
    if a != 0:
        x = pd.concat([x, pd.DataFrame(index=np.arange(4 - a))])
    x['part'] = np.repeat(np.arange(1, 5), len(x.index) // 4)
    return x
df = df.groupby('ID').apply(f).dropna(subset=['ID']).reset_index(drop=True)
#if necessary convert to int
df.ID = df.ID.astype(int)
print (df)
ID url part
0 111 vk.com 1
1 111 avito.ru 1
2 111 apple.com 2
3 111 tiffany.com 2
4 111 pikabu.ru 3
5 222 vk.com 1
6 222 facebook.com 2
7 222 twitter.com 3
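As a side note (not from the original answers), a shorter sketch that also copes with group sizes not divisible by 4 is to cut each group's positional index into four quantile bins with pd.qcut. This assumes every ID has at least two rows (pd.qcut needs distinct bin edges), and the split of a non-divisible group may differ slightly from the padding approach above:
import numpy as np
import pandas as pd

# label each row 1-4 according to which quarter of its ID group it falls into
df['part'] = (df.groupby('ID')['ID']
                .transform(lambda s: pd.qcut(np.arange(len(s)), 4, labels=False) + 1))
For the 8-row and 4-row groups of the original sample this reproduces the desired part column exactly.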
I need to group rows and sum one column.
member_id event_path event_duration
0 111 vk.com 1
1 111 twitter.com 4
2 111 facebook.com 56
3 111 vk.com 23
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 56
7 333 avito.ru 8
8 333 avito.ru 4
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
12 111 vk.com 10
13 222 vk.com 20
And I want to unify member_id and event_path and sum event_duration.
Desired output:
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 76
7 333 avito.ru 12
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
I use
df['event_duration'] = df.groupby(['member_id', 'event_path'])['event_duration'].transform('sum')
but I get
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 111 vk.com 34
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 76
7 333 avito.ru 12
8 333 avito.ru 12
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
12 111 vk.com 34
13 222 vk.com 76
What am I doing wrong?
You need groupby with the parameters sort=False and as_index=False, plus the aggregation sum:
df = df.groupby(['member_id','event_path'],sort=False,as_index=False)['event_duration'].sum()
print (df)
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 222 vesti.ru 6
4 222 facebook.com 23
5 222 vk.com 76
6 333 avito.ru 12
7 444 mail.ru 7
8 444 vk.com 20
9 444 yandex.ru 40
Another possible solution is to add reset_index:
df = df.groupby(['member_id', 'event_path'],sort=False)['event_duration'].sum().reset_index()
print (df)
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 222 vesti.ru 6
4 222 facebook.com 23
5 222 vk.com 76
6 333 avito.ru 12
7 444 mail.ru 7
8 444 vk.com 20
9 444 yandex.ru 40
Function transform is used to add an aggregated calculation back to the original df as a new column, aligned with the original index: every row receives its group's total.
That is why your result is not collapsed: the duplicate (member_id, event_path) rows are all kept, each carrying the same group sum. To merge those rows into one per group you need an actual aggregation (sum on the groupby, as in the answers above), not transform.
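A quick illustration of the difference on the original sample frame (only the shapes matter here):
print(df.groupby(['member_id', 'event_path'])['event_duration'].transform('sum').shape)
(14,)
print(df.groupby(['member_id', 'event_path'])['event_duration'].sum().shape)
(10,)
transform returns one value per original row (14 rows), while sum returns one row per group (10 rows).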
I have a dataframe
ID url date active_seconds
111 vk.com 12.01.2016 5
111 facebook.com 12.01.2016 4
111 facebook.com 12.01.2016 3
111 twitter.com 12.01.2016 12
222 vk.com 12.01.2016 8
222 twitter.com 12.01.2016 34
111 facebook.com 12.01.2016 5
and I need to get
ID url date active_seconds
111 vk.com 12.01.2016 5
111 facebook.com 12.01.2016 7
111 twitter.com 12.01.2016 12
222 vk.com 12.01.2016 8
222 twitter.com 12.01.2016 34
111 facebook.com 12.01.2016 5
If I try
df.groupby(['ID', 'url'])['active_seconds'].sum()
it merges all rows with the same ID and url regardless of their order. What should I do to get the desired output?
(s != s.shift()).cumsum() is a typical way to identify groups of contiguous identifiers
pd.DataFrame.assign is a convenient way to add a new column to a copy of a dataframe and chain more methods
pivot_table allows us to reconfigure our table and aggregate
args - this is a style preference of mine to keep code cleaner looking. I'll pass these arguments to pivot_table via *args
reset_index * 2 to clean up and get to final result
args = ('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum')
df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table(*args) \
.reset_index([1, 2, 3]).reset_index(drop=True)
ID url date active_seconds
0 111 facebook.com 12.01.2016 7
1 111 twitter.com 12.01.2016 12
2 111 vk.com 12.01.2016 5
3 222 twitter.com 12.01.2016 34
4 222 vk.com 12.01.2016 8
5 111 facebook.com 12.01.2016 5
Solution 1 - cumsum by column url only:
You need to groupby a custom Series created by a cumsum of a boolean mask, and the column url then needs to be aggregated with first. Then remove the url level with reset_index and finally reorder the columns with reindex:
g = (df.url != df.url.shift()).cumsum()
print (g)
0 1
1 2
2 2
3 3
4 4
5 5
6 6
Name: url, dtype: int32
g = (df.url != df.url.shift()).cumsum()
#another solution with ne
#g = df.url.ne(df.url.shift()).cumsum()
print (df.groupby([df.ID, df.date, g], sort=False).agg({'active_seconds':'sum', 'url':'first'})
         .reset_index(level='url', drop=True)
         .reset_index()
         .reindex(columns=df.columns))
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
g = (df.url != df.url.shift()).cumsum().rename('tmp')
print (g)
0 1
1 2
2 2
3 3
4 4
5 5
6 6
Name: tmp, dtype: int32
print (df.groupby([df.ID, df.url, df.date, g], sort=False)['active_seconds']
         .sum()
         .reset_index(level='tmp', drop=True)
         .reset_index())
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
Solution 2 - cumsum by columns ID and url:
g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
print (g)
ID url
0 1 1
1 1 2
2 1 2
3 1 3
4 2 4
5 2 5
6 3 6
print (df.groupby([g.ID, df.date, g.url], sort=False)
         .agg({'active_seconds':'sum', 'url':'first'})
         .reset_index(level='url', drop=True)
         .reset_index()
         .reindex(columns=df.columns))
ID url date active_seconds
0 1 vk.com 12.01.2016 5
1 1 facebook.com 12.01.2016 7
2 1 twitter.com 12.01.2016 12
3 2 vk.com 12.01.2016 8
4 2 twitter.com 12.01.2016 34
5 3 facebook.com 12.01.2016 5
And a solution where we also add the original df columns to the grouping, but then it is necessary to rename the columns in the helper df to avoid name clashes:
g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
g.columns = g.columns + '1'
print (g)
ID1 url1
0 1 1
1 1 2
2 1 2
3 1 3
4 2 4
5 2 5
6 3 6
print (df.groupby([df.ID, df.url, df.date, g.ID1, g.url1], sort=False)['active_seconds']
         .sum()
         .reset_index(level=['ID1','url1'], drop=True)
         .reset_index())
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
Timings:
Similar solutions, but pivot_table is slower than groupby:
In [180]: %timeit (df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum').reset_index([1, 2, 3]).reset_index(drop=True))
100 loops, best of 3: 5.02 ms per loop
In [181]: %timeit (df.groupby([df.ID, df.url, df.date, (df.url != df.url.shift()).cumsum().rename('tmp')], sort=False)['active_seconds'].sum().reset_index(level='tmp', drop=True).reset_index())
100 loops, best of 3: 3.62 ms per loop
It looks like you want a cumsum():
In [195]: df.groupby(['ID', 'url'])['active_seconds'].cumsum()
Out[195]:
0 5
1 4
2 7
3 12
4 8
5 34
6 12
Name: active_seconds, dtype: int64