Pandas custom groupby - python

Is there any way to use a custom groupby function in Pandas? For example, suppose I have the data below.
a  b  c
-------
1  2  3
1  2  4
1  3  7
1  4  3
1  4  5
2  1  0
2  3  5
2  4  6
2  3  6
3  1  0
4  1  0
4  2  3
Is it possible to group my data by a and b when a is not in [2, 4], and by a alone otherwise?
In the example above I'd like to get the following groups:
1 2 3
1 2 4

1 3 7

1 4 3
1 4 5

2 1 0
2 3 5
2 4 6
2 3 6

3 1 0

4 1 0
4 2 3
Column b is an open set (it can take arbitrary values), so I would ideally like a solution that does not depend on the specific values in b.

You can mask column b wherever a meets your condition (with isin), replacing it by any constant value (like 1), then use that masked Series in the groupby. The replacement value is arbitrary; it only has to make the masked rows indistinguishable within each value of a.
for _, dfg in df.groupby(['a',
                          df['b'].mask(df['a'].isin([2, 4]),  # condition
                                       1)]):                  # replacement value
    print('new group')
    print(dfg)
new group
a b c
0 1 2 3
1 1 2 4
new group
a b c
2 1 3 7
new group
a b c
3 1 4 3
4 1 4 5
new group
a b c
5 2 1 0
6 2 3 5
7 2 4 6
8 2 3 6
new group
a b c
9 3 1 0
new group
a b c
10 4 1 0
11 4 2 3

IIUC, you can also try the following: when the value of a is in [2, 4], it ignores the value in column b and groups those rows together.
import numpy as np

for _, k in df.groupby([df.a.values, np.where(df.a.isin([2, 4]), 0, df.b)]):
    print(k)
OUTPUT:
a b c
0 1 2 3
1 1 2 4
a b c
2 1 3 7
a b c
3 1 4 3
4 1 4 5
a b c
5 2 1 0
6 2 3 5
7 2 4 6
8 2 3 6
a b c
9 3 1 0
a b c
10 4 1 0
11 4 2 3

You can create a temporary Series of tuples, containing either (a,) or (a, b), and then group by that:
a = df[['a']].apply(tuple, axis=1)
ab = df[['a', 'b']].apply(tuple, axis=1)
df['group'] = np.where(df['a'].isin([2, 4]), a, ab)
Output
> df.sort_values('group')
    a  b  c   group
0   1  2  3  (1, 2)
1   1  2  4  (1, 2)
2   1  3  7  (1, 3)
3   1  4  3  (1, 4)
4   1  4  5  (1, 4)
5   2  1  0    (2,)
6   2  3  5    (2,)
7   2  4  6    (2,)
8   2  3  6    (2,)
9   3  1  0  (3, 1)
10  4  1  0    (4,)
11  4  2  3    (4,)
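From there it is a plain single-key groupby on the helper column; a minimal sketch using the df['group'] column built above:

for key, dfg in df.groupby('group'):
    print(key)
    print(dfg)

Dropping the helper afterwards with df.drop(columns='group') restores the original columns.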

You can do this indirectly. First define a function that maps each row to a group key (note the two branches must stay distinct, so the a-only branch still includes the value of a):
def grouping(row):
    if row.a in [2, 4]:
        return f"{row.a}"          # group by a alone
    else:
        return f"{row.a}_{row.b}"  # group by a and b
Then use apply row-wise (axis=1) to build the grouping column:
df['grouping'] = df.apply(grouping, axis=1)
Then group by that column:
groups = df.groupby('grouping')
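With the sample data, this should produce keys like the following, which makes it easy to sanity-check that a=2 and a=4 end up in separate groups:

df['grouping'].tolist()
['1_2', '1_2', '1_3', '1_4', '1_4', '2', '2', '2', '2', '3_1', '4', '4']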

Related

How to create a new column that increments within a subgroup of a group in Python?

I have a problem where I need to group the data by two columns and attach a column that numbers the subgroups within each group.
Example dataframe looks like this:
colA colB
1 a
1 a
1 c
1 c
1 f
1 z
1 z
1 z
2 a
2 b
2 b
2 b
3 c
3 d
3 k
3 k
3 m
3 m
3 m
Expected output after attaching the new column is as follows:
colA colB colC
1 a 1
1 a 1
1 c 2
1 c 2
1 f 3
1 z 4
1 z 4
1 z 4
2 a 1
2 b 2
2 b 2
2 b 2
3 c 1
3 d 2
3 k 3
3 k 3
3 m 4
3 m 4
3 m 4
I tried the following, but I cannot get this trivial-looking problem solved:
Solution 1, which does not give what I am looking for:
df['ONES'] = 1
df['colC'] = df.groupby(['colA', 'colB'])['ONES'].cumcount() + 1
df.drop(columns='ONES', inplace=True)
I also played with transform, and cumsum functions, and apply, but I cannot seem to solve this. Any help is appreciated.
Edit: minor error on dataframes.
Edit 2: For simplicity purposes, I showed similar values for column B, but the problem is within a larger group (indicated by colA), colB may be different and therefore, it needs to be grouped by both at the same time.
Edit 3: Updated dataframes to emphasize what I meant by my second edit. Hope this makes it more clear and reproducible.
You could use groupby + ngroup: cumcount numbers the rows within each (colA, colB) group, whereas you want to number the colB subgroups themselves, which is exactly what ngroup does.
df['colC'] = df.groupby('colA').apply(lambda x: x.groupby('colB').ngroup() + 1).droplevel(0)
Output:
colA colB colC
0 1 a 1
1 1 a 1
2 1 c 2
3 1 c 2
4 1 f 3
5 1 z 4
6 1 z 4
7 1 z 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 c 1
13 3 d 2
14 3 k 3
15 3 k 3
16 3 m 4
17 3 m 4
18 3 m 4
Categorically, factorize, but within each colA group. A global df['colB'].astype('category').cat.codes + 1 numbers the categories over the whole column, so it only matches the expected output when every group happens to see the same labels; factorizing per group (codes follow order of appearance) gives the subgroup numbering the question asks for:
df['colC'] = df.groupby('colA')['colB'].transform(lambda s: pd.factorize(s)[0] + 1)
    colA colB  colC
0      1    a     1
1      1    a     1
2      1    c     2
3      1    c     2
4      1    f     3
5      1    z     4
6      1    z     4
7      1    z     4
8      2    a     1
9      2    b     2
10     2    b     2
11     2    b     2
12     3    c     1
13     3    d     2
14     3    k     3
15     3    k     3
16     3    m     4
17     3    m     4
18     3    m     4

Pandas, Validating data, check if all groups are with the same length

Having the following DF:
A B
0 1 2
1 1 2
2 4 3
3 4 3
4 5 6
5 5 6
6 5 6
After grouping with column A I get 3 groups
(1, A B
0 1 2
1 1 2)
(4, A B
2 4 3
3 4 3)
(5, A B
4 5 6
5 5 6
6 5 6)
I would like to count the groups whose length differs from a specific row count. For example, an input of 2 results in an output of 1, because only one group has a different length (3 rows), whereas an input of 3 outputs 2 for the other two groups.
What is the Pandas solution for such a task?
I think you need Series.value_counts, a not-equal comparison with Series.ne, and then counting the number of True values with sum:
N = 2
a = df['A'].value_counts().ne(N).sum()
print (a)
1
You can count values, then filter on B:
counts = df.groupby('A').count()
count_input = 2
print(len(counts[counts['B'] != count_input]))
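As a variant, df.groupby('A').size() returns the group lengths directly as a Series, so the two answers above collapse into one line (reusing count_input):

print(df.groupby('A').size().ne(count_input).sum())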

How to groupby one column and do nothing to other columns in pandas?

I have a dataframe like this:
a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3
How can I groupby 'a', do nothing to columns b, c and d, and split into several dataframes? Like this:
First groupby column 'a':
a b c d
0 1 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 2 1 1 1
5 2 2 2
6 3 3 3
And then split into different dataframes based on numbers in 'a':
dataframe 1:
a b c d
0 1 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
dataframe 2:
a b c d
0 2 1 1 1
1 2 2 2
2 3 3 3
:
:
:
dataframe n:
a b c d
0 n 1 1 1
1 2 2 2
2 3 3 3
Iterate over each group that df.groupby returns.
for _, g in df.groupby('a'):
    print(g, '\n')
a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4
a b c d
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3
If you want a dict of dataframes, I'd recommend:
df_dict = {idx : g for idx, g in df.groupby('a')}
Here, idx is the unique a value.
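Looking up a key then gives the corresponding piece; for example:

df_dict[1]

   a  b  c  d
0  1  1  1  1
1  1  2  2  2
2  1  3  3  3
3  1  4  4  4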
A couple of nifty techniques, courtesy of Root:
df_dict = dict(list(df.groupby('a'))) # for a dictionary
And,
idxs, dfs = zip(*df.groupby('a')) # separate lists
idxs
(1, 2)
dfs
( a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4, a b c d
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3)
This is a way using np.split; note it assumes equal values of a are contiguous (the frame is sorted by a), since it splits wherever a changes:
import numpy as np

idx = np.flatnonzero(df.a.diff().fillna(0))  # row positions where 'a' changes; replaces Series.nonzero(), removed in pandas 1.0
dfs = np.split(df, idx, axis=0)
dfs
dfs
Out[210]:
[ a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4, a b c d
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3]
dfs[0]
Out[211]:
a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4

Pandas DataFrame drop tuple or list of columns

When using the drop method of a pandas.DataFrame, it accepts lists of column names but not tuples, despite the documentation saying that "list-like" arguments are acceptable. Am I reading the documentation incorrectly? I would expect my MWE to work.
MWE
import pandas as pd
df = pd.DataFrame({k: range(5) for k in list('abcd')})
df.drop(['a', 'c'], axis=1) # Works
df.drop(('a', 'c'), axis=1) # Errors
Versions - Using Python 2.7.12, Pandas 0.20.3.
The problem is that tuples are used for selecting from a MultiIndex:
np.random.seed(345)
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
df = pd.DataFrame(np.random.randint(10, size=(4,5)), columns=mux)
print (df)
a b c d e
c d e f g
0 8 0 3 9 8
1 4 3 4 1 7
2 4 0 9 6 3
3 8 0 3 1 5
df = df.drop(('a', 'c'), axis=1)
print (df)
b c d e
d e f g
0 0 3 9 8
1 3 4 1 7
2 0 9 6 3
3 0 3 1 5
Same as:
df = df[('a', 'c')]
print (df)
0 8
1 4
2 4
3 8
Name: (a, c), dtype: int32
Pandas treats tuples as multi-index values, so try this instead:
In [330]: df.drop(list(('a', 'c')), axis=1)
Out[330]:
b d
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Here is an example of deleting rows (axis=0, the default) in the multi-index DF:
In [342]: x = df.set_index(np.arange(len(df), 0, -1), append=True)
In [343]: x
Out[343]:
a b c d
0 5 0 0 0 0
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [344]: x.drop((0,5))
Out[344]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [345]: x.drop([(0,5), (4,1)])
Out[345]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
So when you specify a tuple, Pandas treats it as a multi-index label.
I used this to delete a column whose name is a tuple:
del df3[('val1', 'val2')]
and it got deleted.
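On newer pandas (0.21+) the explicit columns= keyword is another way to sidestep the ambiguity; a short sketch against the frames above (wrapping the tuple in a list keeps it from being read as separate labels):

df.drop(columns=['a', 'c'])      # flat columns: drop two labels
df.drop(columns=[('a', 'c')])    # MultiIndex: drop the single ('a', 'c') column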

Add multiple columns to DataFrame and set them equal to an existing column

I want to add multiple columns to a pandas DataFrame and set them equal to an existing column. Is there a simple way of doing this? In R I would do:
df <- data.frame(a=1:5)
df[c('b','c')] <- df$a
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
In pandas this results in KeyError: "['b' 'c'] not in index":
df = pd.DataFrame({'a': np.arange(1,6)})
df[['b','c']] = df.a
You can use the .assign() method (note that it returns a new DataFrame rather than modifying df in place):
In [31]: df.assign(b=df['a'], c=df['a'])
Out[31]:
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
or a slightly more creative approach:
In [41]: cols = list('bcdefg')
In [42]: df.assign(**{col:df['a'] for col in cols})
Out[42]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
another solution:
In [60]: pd.DataFrame(np.repeat(df.values, len(cols)+1, axis=1), columns=['a']+cols)
Out[60]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
NOTE: as #Cpt_Jauchefuerst mentioned in the comment, on Python < 3.6 DataFrame.assign(z=1, a=1) adds columns in alphabetical order (first a is added to the existing columns, then z) because keyword-argument order is not preserved there; on Python 3.6+ the insertion order is kept.
A pd.concat approach:
df = pd.DataFrame(dict(a=range(5)))
pd.concat([df.a] * 5, axis=1, keys=list('abcde'))
a b c d e
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
Turns out you can use a loop to do this:
for i in ['b', 'c']:
    df[i] = df.a
You can set them individually if you're only dealing with a few columns:
df['b'] = df['a']
df['c'] = df['a']
or you can use a loop as you discovered.
