Pandas: divide dataframe into parts - python

I have a dataframe
ID url
111 vk.com
111 facebook.com
111 twitter.com
111 avito.ru
111 apple.com
111 tiffany.com
111 pikabu.ru
111 stackoverflow.com
222 vk.com
222 facebook.com
222 vc.ru
222 twitter.com
I need to add a new column part: group the dataframe by ID and divide each group into 4 parts.
Desired output
ID url part
111 vk.com 1
111 facebook.com 1
111 twitter.com 2
111 avito.ru 2
111 apple.com 3
111 tiffany.com 3
111 pikabu.ru 4
111 stackoverflow.com 4
222 vk.com 1
222 facebook.com 2
222 vc.ru 3
222 twitter.com 4
I tried
df.groupby(['ID']).agg({'ID': np.sum / 4}).rename(columns={'ID': 'part'}).reset_index()
But I don't get the desired result with it.

You can use groupby with numpy.repeat:
df['part'] = df.groupby('ID')['ID'] \
               .apply(lambda x: pd.Series(np.repeat(np.arange(1, 5), len(x.index) // 4))) \
               .reset_index(drop=True)
print (df)
ID url part
0 111 vk.com 1
1 111 facebook.com 1
2 111 twitter.com 2
3 111 avito.ru 2
4 111 apple.com 3
5 111 tiffany.com 3
6 111 pikabu.ru 4
7 111 stackoverflow.com 4
8 222 vk.com 1
9 222 facebook.com 2
10 222 vc.ru 3
11 222 twitter.com 4
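For reference, np.repeat repeats each element of the first array the given number of times; for a group of 8 rows, len(x.index) // 4 is 2 and each part label appears twice:
np.repeat(np.arange(1, 5), 2)
# array([1, 1, 2, 2, 3, 3, 4, 4])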
Another solution with custom function:
def f(x):
    x['part'] = np.repeat(np.arange(1, 5), len(x.index) // 4)
    return x

df = df.groupby('ID').apply(f)
print (df)
ID url part
0 111 vk.com 1
1 111 facebook.com 1
2 111 twitter.com 2
3 111 avito.ru 2
4 111 apple.com 3
5 111 tiffany.com 3
6 111 pikabu.ru 4
7 111 stackoverflow.com 4
8 222 vk.com 1
9 222 facebook.com 2
10 222 vc.ru 3
11 222 twitter.com 4
If the group lengths are not divisible by 4, you get an error:
ValueError: Length of values does not match length of index
One possible solution is to append padding rows so each group length is divisible by 4, and finally remove them with dropna:
print (df)
ID url
0 111 vk.com
1 111 avito.ru
2 111 apple.com
3 111 tiffany.com
4 111 pikabu.ru
5 222 vk.com
6 222 facebook.com
7 222 twitter.com
def f(x):
    a = len(x.index) % 4
    if a != 0:
        x = pd.concat([x, pd.DataFrame(index=np.arange(4 - a))])
    x['part'] = np.repeat(np.arange(1, 5), len(x.index) // 4)
    return x
df = df.groupby('ID').apply(f).dropna(subset=['ID']).reset_index(drop=True)
#if necessary convert to int
df.ID = df.ID.astype(int)
print (df)
ID url part
0 111 vk.com 1
1 111 avito.ru 1
2 111 apple.com 2
3 111 tiffany.com 2
4 111 pikabu.ru 3
5 222 vk.com 1
6 222 facebook.com 2
7 222 twitter.com 3
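An alternative sketch (not from the answers above) that avoids the padding entirely: number the rows within each group with cumcount and map each position onto 4 roughly equal bins with integer arithmetic, which works for any group length:
# df has columns 'ID' and 'url' as in the question
pos = df.groupby('ID').cumcount()                  # 0-based position within each group
size = df.groupby('ID')['ID'].transform('size')    # length of each group
df['part'] = pos * 4 // size + 1                   # labels 1..4, roughly equal bins
For a group of 5 rows this gives parts 1, 1, 2, 3, 4 with no rows to drop afterwards.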

Related

Selecting Items in dataframe

Using Python 3
I have a dataframe sort of like this:
productCode productType storeCode salesAmount moreInfo
111 1 111 111 info
111 1 112 112 info
456 4 456 456 info
and so on for thousands of rows
I want to select (and get a list of the codes for) the X best-selling unique products for each store.
How would I accomplish that?
Data:
df = pd.DataFrame({'productCode': [111, 111, 456, 123, 125],
                   'productType': [1, 1, 4, 3, 3],
                   'storeCode': [111, 112, 112, 456, 456],
                   'salesAmount': [111, 112, 34, 456, 1235]})
productCode productType storeCode salesAmount
0 111 1 111 111
1 111 1 112 112
2 456 4 112 34
3 123 3 456 456
4 125 3 456 1235
It sounds like you want the best selling product at each storeCode? In which case:
df.sort_values('salesAmount', ascending=False).groupby('storeCode').head(1)
productCode productType storeCode salesAmount
4 125 3 456 1235
1 111 1 112 112
0 111 1 111 111
Instead, if you want the best selling of each productType at each storeCode, then:
df.sort_values('salesAmount', ascending=False).groupby(['storeCode', 'productType']).head(1)
productCode productType storeCode salesAmount
4 125 3 456 1235
1 111 1 112 112
0 111 1 111 111
2 456 4 112 34
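Since the question asks for the X best-selling products per store and a list of their codes, a minimal sketch extending the same idea (the value of X and the variable names here are illustrative):
X = 2  # hypothetical: number of top products wanted per store
top = df.sort_values('salesAmount', ascending=False).groupby('storeCode').head(X)
codes = top.groupby('storeCode')['productCode'].apply(list)  # product codes per store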

Pandas Dataframe iteration and selecting the rows based on condition - Change in Requirements

I have a sorted data frame as shown below (Input DataFrame), and I need to iterate over the rows and retrieve them into an output data frame based on the conditions below.
• Condition 1: For a given R1, R2, W - if there are two records, with TYPE 'A' and TYPE 'B':
a) If (amount1 & amount2) of TYPE 'A' > (amount1 & amount2) of TYPE 'B', bring the TYPE 'A' record into the output
b) If (amount1 & amount2) of TYPE 'B' > (amount1 & amount2) of TYPE 'A', bring the TYPE 'B' record into the output
c) If (amount1 & amount2) of TYPE 'A' = (amount1 & amount2) of TYPE 'B', bring the TYPE 'A' record into the output
• Condition 2: For a given R1, R2, W - if there is only a record with TYPE 'A', bring the TYPE 'A' record into the output
• Condition 3: For a given R1, R2, W - if there is only a record with TYPE 'B', bring the TYPE 'B' record into the output
Input Dataframe
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
1 123 12 1 B 111 222
2 123 12 2 A 222 222
3 123 12 2 B 333 333
4 123 12 3 A 444 444
5 123 12 3 B 333 333
6 123 34 1 A 111 222
7 123 34 2 A 333 444
8 123 34 2 B 333 444
9 123 34 3 B 444 555
10 123 34 4 A 555 666
11 123 34 4 B 666 777
Output dataframe
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
3 123 12 2 B 333 333
4 123 12 3 A 444 444
6 123 34 1 A 111 222
7 123 34 2 A 333 444
9 123 34 3 B 444 555
11 123 34 4 B 666 777
Selection based on your criteria:
def my_selection(idf):
    # if both 'A' and 'B' are present in 'TYPE', keep the row with 'A'
    if idf['TYPE'].unique().shape[0] == 2:
        return idf[idf['TYPE'] == 'A']
    else:
        return idf

df2 = df.groupby(['R1', 'R2', 'W'], as_index=False).apply(lambda idf: my_selection(idf))
df2.index = df2.index.droplevel(-1)
# R1 R2 W TYPE amount1 amount2
# 0 123 12 1 A 111 222
# 1 123 12 2 A 222 222
# 2 123 12 3 A 444 444
# 3 123 34 1 A 111 222
# 4 123 34 2 A 333 444
# 5 123 34 3 B 444 555
# 6 123 34 4 A 555 666
All you have to do is groupby R1, R2, W and operate on the TYPE column as follows:
data.groupby(['R1','R2','W']).apply(lambda x: 'A' if 'A' in x['TYPE'].values else 'B').reset_index()
You can merge this output with the original DataFrame on the obtained columns to get the corresponding 'amount1' and 'amount2' values, as sketched below.
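A minimal sketch of that merge (assuming the frame is named data, as above):
keys = (data.groupby(['R1', 'R2', 'W'])['TYPE']
            .apply(lambda x: 'A' if 'A' in x.values else 'B')
            .reset_index())
out = keys.merge(data, on=['R1', 'R2', 'W', 'TYPE'])  # recovers amount1/amount2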
This is what I would do:
categories = ['B','A'] #create a list of categories in ascending order of precedence
d = {i: e for e, i in enumerate(categories)} #create a dictionary: {'B': 0, 'A': 1}
s = df['TYPE'].map(d) #map to df['TYPE'] and create a helper series
Then assign this series to the dataframe, groupby + transform with max, and keep the rows where the transformed maximum matches the helper series:
out = df[s.eq(df.assign(TYPE=s).groupby(['R1','R2','W'])['TYPE'].transform('max'))]
print(out)
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
2 123 12 2 A 222 222
4 123 12 3 A 444 444
6 123 34 1 A 111 222
7 123 34 2 A 333 444
9 123 34 3 B 444 555
10 123 34 4 A 555 666
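Note that the answers above keep TYPE 'A' whenever both types are present, while the stated conditions compare the amounts. A minimal sketch that follows the conditions literally (assuming "greater" means both amount1 and amount2 are strictly greater, with ties going to 'A'):
def pick(g):
    # conditions 2 and 3: only one TYPE present, keep it
    if g['TYPE'].nunique() == 1:
        return g
    a = g[g['TYPE'] == 'A'].iloc[0]
    b = g[g['TYPE'] == 'B'].iloc[0]
    # condition 1b: keep 'B' only when both amounts are strictly greater
    if b['amount1'] > a['amount1'] and b['amount2'] > a['amount2']:
        return g[g['TYPE'] == 'B']
    # conditions 1a and 1c: otherwise keep 'A'
    return g[g['TYPE'] == 'A']

out = df.groupby(['R1', 'R2', 'W'], group_keys=False).apply(pick)
On the input above this reproduces the expected output, including the TYPE 'B' rows at indices 3 and 11.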

Pandas: groupby neighboring identical elements

I need to group a dataframe:
df = pd.DataFrame({'id': [111, 111, 111, 111, 111, 222, 222],
                   'domain': ['vk.com', 'facebook.com', 'facebook.com', 'twitter.com', 'vk.com', 'facebook.com', 'twitter.com'],
                   'time': ['2017-01-12', '2017-01-12', '2017-01-12', '2017-01-13', '2017-01-12', '2017-01-14', '2017-01-14'],
                   'duration': [10, 20, 5, 12, 34, 12, 4]})
I use
df.groupby([df.id, df.domain]).agg({'duration':'sum', 'time': 'first'}).reset_index().reindex(columns=df.columns)
And get
domain duration id time
0 facebook.com 25 111 2017-01-12
1 twitter.com 12 111 2017-01-13
2 vk.com 44 111 2017-01-12
3 facebook.com 12 222 2017-01-14
4 twitter.com 4 222 2017-01-14
But the desired output is:
domain duration id time
vk.com 10 111 2017-01-12
facebook.com 25 111 2017-01-12
vk.com 34 111 2017-01-12
twitter.com 12 111 2017-01-13
facebook.com 12 222 2017-01-14
twitter.com 4 222 2017-01-14
How can I fix that?
Here's an alternative without an extra column -
i = df.domain.ne(df.domain.shift()).cumsum()
m = dict(zip(i, df.domain))
df = df.groupby(['id', i], sort=False)\
.agg({'duration':'sum', 'time': 'first'})\
.reset_index()
df.domain = df.domain.map(m)
df
id domain time duration
0 111 vk.com 2017-01-12 10
1 111 facebook.com 2017-01-12 25
2 111 twitter.com 2017-01-13 12
3 111 vk.com 2017-01-12 34
4 222 facebook.com 2017-01-14 12
5 222 twitter.com 2017-01-14 4
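For clarity, here is what the helpers contain for the sample frame (a quick check, not part of the original answer):
i = df.domain.ne(df.domain.shift()).cumsum()
print(i.tolist())            # [1, 2, 2, 3, 4, 5, 6] - one id per consecutive run
m = dict(zip(i, df.domain))  # {1: 'vk.com', 2: 'facebook.com', 3: 'twitter.com',
                             #  4: 'vk.com', 5: 'facebook.com', 6: 'twitter.com'}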
We can make use of an extra column that marks where the next domain equals the current domain:
df['new'] = (df.domain == df.domain.shift(-1)).cumsum()
ndf = df.groupby([df.domain, df.id, df.new]).agg({'duration':'sum', 'time': 'first'}).reset_index()\
        .sort_values('id').reindex(columns=df.columns).drop(columns='new')
domain duration id time
0 facebook.com 25 111 2017-01-12
2 twitter.com 12 111 2017-01-13
4 vk.com 10 111 2017-01-12
5 vk.com 34 111 2017-01-12
1 facebook.com 12 222 2017-01-14
3 twitter.com 4 222 2017-01-14
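For reference, the helper column for the sample frame (again just an illustrative check) - only the consecutive facebook.com pair compares equal to its successor, and together with domain and id in the groupby keys that is enough to separate the runs:
print((df.domain == df.domain.shift(-1)).cumsum().tolist())
# [0, 1, 1, 1, 1, 1, 1]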

Pandas: sum values in some column

I need to group rows and sum one column.
member_id event_path event_duration
0 111 vk.com 1
1 111 twitter.com 4
2 111 facebook.com 56
3 111 vk.com 23
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 56
7 333 avito.ru 8
8 333 avito.ru 4
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
12 111 vk.com 10
13 222 vk.com 20
And I want to unify member_id and event_path and sum event_duration.
Desired output
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 76
7 333 avito.ru 12
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
I use
df['event_duration'] = df.groupby(['member_id', 'event_path'])['event_duration'].transform('sum')
but I get
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 111 vk.com 34
4 222 vesti.ru 6
5 222 facebook.com 23
6 222 vk.com 76
7 333 avito.ru 12
8 333 avito.ru 12
9 444 mail.ru 7
10 444 vk.com 20
11 444 yandex.ru 40
12 111 vk.com 34
13 222 vk.com 76
What am I doing wrong?
You need groupby with the parameters sort=False and as_index=False, and aggregate with sum:
df = df.groupby(['member_id','event_path'],sort=False,as_index=False)['event_duration'].sum()
print (df)
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 222 vesti.ru 6
4 222 facebook.com 23
5 222 vk.com 76
6 333 avito.ru 12
7 444 mail.ru 7
8 444 vk.com 20
9 444 yandex.ru 40
Another possible solution is to add reset_index:
df = df.groupby(['member_id', 'event_path'],sort=False)['event_duration'].sum().reset_index()
print (df)
member_id event_path event_duration
0 111 vk.com 34
1 111 twitter.com 4
2 111 facebook.com 56
3 222 vesti.ru 6
4 222 facebook.com 23
5 222 vk.com 76
6 333 avito.ru 12
7 444 mail.ru 7
8 444 vk.com 20
9 444 yandex.ru 40
The transform function is used to add an aggregated calculation back to the original df as a new column: it broadcasts each group's sum to every row of the group, so the result has the same length as the original dataframe. That is why, when you assign it back, every row shows its group total and the duplicate rows are not collapsed; to collapse them you need the aggregation itself (groupby with sum), not transform.
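A quick way to see the difference, using the df from the question (the variable names here are just for illustration):
agg = df.groupby(['member_id', 'event_path'], sort=False)['event_duration'].sum()
bcast = df.groupby(['member_id', 'event_path'])['event_duration'].transform('sum')
print(len(df), len(agg), len(bcast))  # 14 10 14 - transform keeps one value per row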

Pandas: union duplicate strings

I have a dataframe
ID url date active_seconds
111 vk.com 12.01.2016 5
111 facebook.com 12.01.2016 4
111 facebook.com 12.01.2016 3
111 twitter.com 12.01.2016 12
222 vk.com 12.01.2016 8
222 twitter.com 12.01.2016 34
111 facebook.com 12.01.2016 5
and I need to get
ID url date active_seconds
111 vk.com 12.01.2016 5
111 facebook.com 12.01.2016 7
111 twitter.com 12.01.2016 12
222 vk.com 12.01.2016 8
222 twitter.com 12.01.2016 34
111 facebook.com 12.01.2016 5
If I try
df.groupby(['ID', 'url'])['active_seconds'].sum()
it merges all rows with the same ID and url, not only consecutive ones. What should I do to get the desired output?
(s != s.shift()).cumsum() is a typical way to identify groups of contiguous identifiers
pd.DataFrame.assign is a convenient way to add a new column to a copy of a dataframe and chain more methods
pivot_table allows us to reconfigure our table and aggregate
args - this is a style preference of mine to keep code cleaner looking. I'll pass these arguments to pivot_table via *args
reset_index twice to clean up and get the final result
args = ('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum')
df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table(*args) \
.reset_index([1, 2, 3]).reset_index(drop=True)
ID url date active_seconds
0 111 facebook.com 12.01.2016 7
1 111 twitter.com 12.01.2016 12
2 111 vk.com 12.01.2016 5
3 222 twitter.com 12.01.2016 34
4 222 vk.com 12.01.2016 8
5 111 facebook.com 12.01.2016 5
Solution 1 - cumsum by column url only:
You need to groupby a custom Series created by cumsum of a boolean mask; the column url then needs to be aggregated with first. Then remove the level url with reset_index and finally reorder the columns with reindex:
g = (df.url != df.url.shift()).cumsum()
print (g)
0 1
1 2
2 2
3 3
4 4
5 5
6 6
Name: url, dtype: int32
g = (df.url != df.url.shift()).cumsum()
#another solution with ne
#g = df.url.ne(df.url.shift()).cumsum()
print (df.groupby([df.ID,df.date,g], sort=False).agg({'active_seconds':'sum', 'url':'first'})
.reset_index(level='url', drop=True)
.reset_index()
.reindex(columns=df.columns))
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
g = (df.url != df.url.shift()).cumsum().rename('tmp')
print (g)
0 1
1 2
2 2
3 3
4 4
5 5
6 6
Name: tmp, dtype: int32
print (df.groupby([df.ID, df.url, df.date, g], sort=False)['active_seconds']
.sum()
.reset_index(level='tmp', drop=True)
.reset_index())
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
Solution 2 - cumsum by columns ID and url:
g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
print (g)
ID url
0 1 1
1 1 2
2 1 2
3 1 3
4 2 4
5 2 5
6 3 6
print (df.groupby([g.ID, df.date, g.url], sort=False)
.agg({'active_seconds':'sum', 'url':'first'})
.reset_index(level='url', drop=True)
.reset_index()
.reindex(columns=df.columns))
ID url date active_seconds
0 1 vk.com 12.01.2016 5
1 1 facebook.com 12.01.2016 7
2 1 twitter.com 12.01.2016 12
3 2 vk.com 12.01.2016 8
4 2 twitter.com 12.01.2016 34
5 3 facebook.com 12.01.2016 5
And a solution which keeps the original columns, but it is necessary to rename the columns in the helper df:
g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
g.columns = g.columns + '1'
print (g)
ID1 url1
0 1 1
1 1 2
2 1 2
3 1 3
4 2 4
5 2 5
6 3 6
print (df.groupby([df.ID, df.url, df.date, g.ID1, g.url1], sort=False)['active_seconds']
.sum()
.reset_index(level=['ID1','url1'], drop=True)
.reset_index())
ID url date active_seconds
0 111 vk.com 12.01.2016 5
1 111 facebook.com 12.01.2016 7
2 111 twitter.com 12.01.2016 12
3 222 vk.com 12.01.2016 8
4 222 twitter.com 12.01.2016 34
5 111 facebook.com 12.01.2016 5
Timings:
Similar solutions, but pivot_table is slower than groupby:
In [180]: %timeit (df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum').reset_index([1, 2, 3]).reset_index(drop=True))
100 loops, best of 3: 5.02 ms per loop
In [181]: %timeit (df.groupby([df.ID, df.url, df.date, (df.url != df.url.shift()).cumsum().rename('tmp')], sort=False)['active_seconds'].sum().reset_index(level='tmp', drop=True).reset_index())
100 loops, best of 3: 3.62 ms per loop
It looks like you want a cumsum():
In [195]: df.groupby(['ID', 'url'])['active_seconds'].cumsum()
Out[195]:
0 5
1 4
2 7
3 12
4 8
5 34
6 12
Name: active_seconds, dtype: int64
