Pandas custom groupby - python

Is there any way to use a custom groupby function in Pandas? For example, suppose I have the data below.
a  b  c
-------
1  2  3
1  2  4
1  3  7
1  4  3
1  4  5
2  1  0
2  3  5
2  4  6
2  3  6
3  1  0
4  1  0
4  2  3
Is it possible to group my data by a and b when a is not in [2, 4], and by a alone otherwise?
In the example above I'd like to get the following groups:
1 2 3
1 2 4

1 3 7

1 4 3
1 4 5

2 1 0
2 3 5
2 4 6
2 3 6

3 1 0

4 1 0
4 2 3
Column b is an open set (it can take arbitrary values), so I would ideally like a solution that does not depend on the specific values in b.

You can mask column b wherever a meets your condition (with isin), replacing it by any constant value (like 1), then use that masked Series in the groupby. The replacement value is arbitrary; it only has to make the masked rows indistinguishable within each value of a.
for _, dfg in df.groupby(['a',
                          df['b'].mask(df['a'].isin([2, 4]),  # condition
                                       1)]):                  # replacement value
    print('new group')
    print(dfg)
new group
a b c
0 1 2 3
1 1 2 4
new group
a b c
2 1 3 7
new group
a b c
3 1 4 3
4 1 4 5
new group
a b c
5 2 1 0
6 2 3 5
7 2 4 6
8 2 3 6
new group
a b c
9 3 1 0
new group
a b c
10 4 1 0
11 4 2 3

IIUC, you can also try the following: when the value of a is in [2, 4], it ignores the value in column b and groups those rows together.
import numpy as np

for _, k in df.groupby([df.a.values, np.where(df.a.isin([2, 4]), 0, df.b)]):
    print(k)
OUTPUT:
a b c
0 1 2 3
1 1 2 4
a b c
2 1 3 7
a b c
3 1 4 3
4 1 4 5
a b c
5 2 1 0
6 2 3 5
7 2 4 6
8 2 3 6
a b c
9 3 1 0
a b c
10 4 1 0
11 4 2 3

You can create a temporary Series of tuples, containing either (a,) or (a, b), and then group by that:
a = df[['a']].apply(tuple, axis=1)
ab = df[['a', 'b']].apply(tuple, axis=1)
df['group'] = np.where(df['a'].isin([2, 4]), a, ab)
Output
> df.sort_values('group')
    a  b  c   group
0   1  2  3  (1, 2)
1   1  2  4  (1, 2)
2   1  3  7  (1, 3)
3   1  4  3  (1, 4)
4   1  4  5  (1, 4)
5   2  1  0    (2,)
6   2  3  5    (2,)
7   2  4  6    (2,)
8   2  3  6    (2,)
9   3  1  0  (3, 1)
10  4  1  0    (4,)
11  4  2  3    (4,)
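From there it is a plain single-key groupby on the helper column; a minimal sketch using the df['group'] column built above:

for key, dfg in df.groupby('group'):
    print(key)
    print(dfg)

Dropping the helper afterwards with df.drop(columns='group') restores the original columns.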

You can do this indirectly. First define a function that maps each row to a group key (note the two branches must stay distinct, so the a-only branch still includes the value of a):
def grouping(row):
    if row.a in [2, 4]:
        return f"{row.a}"          # group by a alone
    else:
        return f"{row.a}_{row.b}"  # group by a and b
Then use apply row-wise (axis=1) to build the grouping column:
df['grouping'] = df.apply(grouping, axis=1)
Then group by that column:
groups = df.groupby('grouping')
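With the sample data, this should produce keys like the following, which makes it easy to sanity-check that a=2 and a=4 end up in separate groups:

df['grouping'].tolist()
['1_2', '1_2', '1_3', '1_4', '1_4', '2', '2', '2', '2', '3_1', '4', '4']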

Related

How to create a new column that increments within a subgroup of a group in Python?

I have a problem where I need to group the data by two columns and attach a column that numbers the subgroups within each group.
Example dataframe looks like this:
colA colB
1 a
1 a
1 c
1 c
1 f
1 z
1 z
1 z
2 a
2 b
2 b
2 b
3 c
3 d
3 k
3 k
3 m
3 m
3 m
Expected output after attaching the new column is as follows:
colA colB colC
1 a 1
1 a 1
1 c 2
1 c 2
1 f 3
1 z 4
1 z 4
1 z 4
2 a 1
2 b 2
2 b 2
2 b 2
3 c 1
3 d 2
3 k 3
3 k 3
3 m 4
3 m 4
3 m 4
I tried the following, but I cannot get this trivial-looking problem solved:
Solution 1, which does not give what I am looking for:
df['ONES'] = 1
df['colC'] = df.groupby(['colA', 'colB'])['ONES'].cumcount() + 1
df.drop(columns='ONES', inplace=True)
I also played with transform, and cumsum functions, and apply, but I cannot seem to solve this. Any help is appreciated.
Edit: minor error on dataframes.
Edit 2: For simplicity purposes, I showed similar values for column B, but the problem is within a larger group (indicated by colA), colB may be different and therefore, it needs to be grouped by both at the same time.
Edit 3: Updated dataframes to emphasize what I meant by my second edit. Hope this makes it more clear and reproducible.
You could use groupby + ngroup: cumcount numbers the rows within each (colA, colB) group, whereas you want to number the colB subgroups themselves, which is exactly what ngroup does.
df['colC'] = df.groupby('colA').apply(lambda x: x.groupby('colB').ngroup() + 1).droplevel(0)
Output:
colA colB colC
0 1 a 1
1 1 a 1
2 1 c 2
3 1 c 2
4 1 f 3
5 1 z 4
6 1 z 4
7 1 z 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 c 1
13 3 d 2
14 3 k 3
15 3 k 3
16 3 m 4
17 3 m 4
18 3 m 4
Categorically, factorize, but within each colA group. A global df['colB'].astype('category').cat.codes + 1 numbers the categories over the whole column, so it only matches the expected output when every group happens to see the same labels; factorizing per group (codes follow order of appearance) gives the subgroup numbering the question asks for:
df['colC'] = df.groupby('colA')['colB'].transform(lambda s: pd.factorize(s)[0] + 1)
    colA colB  colC
0      1    a     1
1      1    a     1
2      1    c     2
3      1    c     2
4      1    f     3
5      1    z     4
6      1    z     4
7      1    z     4
8      2    a     1
9      2    b     2
10     2    b     2
11     2    b     2
12     3    c     1
13     3    d     2
14     3    k     3
15     3    k     3
16     3    m     4
17     3    m     4
18     3    m     4

Pandas, Validating data, check if all groups are with the same length

Having the following DF:
A B
0 1 2
1 1 2
2 4 3
3 4 3
4 5 6
5 5 6
6 5 6
After grouping with column A I get 3 groups
(1, A B
0 1 2
1 1 2)
(4, A B
2 4 3
3 4 3)
(5, A B
4 5 6
5 5 6
6 5 6)
I would like to count the groups whose length differs from a specific row count. For example, an input of 2 results in an output of 1, because only one group has a different length (3 rows), whereas an input of 3 outputs 2 for the other two groups.
What is the Pandas solution for such a task?
I think you need Series.value_counts, a not-equal comparison with Series.ne, and then counting the number of True values with sum:
N = 2
a = df['A'].value_counts().ne(N).sum()
print (a)
1
You can count values, then filter on B:
counts = df.groupby('A').count()
count_input = 2
print(len(counts[counts['B'] != count_input]))
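As a variant, df.groupby('A').size() returns the group lengths directly as a Series, so the two answers above collapse into one line (reusing count_input):

print(df.groupby('A').size().ne(count_input).sum())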

How to groupby one column and do nothing to other columns in pandas?

I have a dataframe like this:
a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3
How can I groupby 'a', do nothing to columns b, c and d, and split into several dataframes? Like this:
First groupby column 'a':
a b c d
0 1 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 2 1 1 1
5 2 2 2
6 3 3 3
And then split into different dataframes based on numbers in 'a':
dataframe 1:
a b c d
0 1 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
dataframe 2:
a b c d
0 2 1 1 1
1 2 2 2
2 3 3 3
:
:
:
dataframe n:
a b c d
0 n 1 1 1
1 2 2 2
2 3 3 3
Iterate over each group that df.groupby returns.
for _, g in df.groupby('a'):
    print(g, '\n')
a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4
a b c d
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3
If you want a dict of dataframes, I'd recommend:
df_dict = {idx : g for idx, g in df.groupby('a')}
Here, idx is the unique a value.
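Looking up a key then gives the corresponding piece; for example:

df_dict[1]

   a  b  c  d
0  1  1  1  1
1  1  2  2  2
2  1  3  3  3
3  1  4  4  4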
A couple of nifty techniques, courtesy of Root:
df_dict = dict(list(df.groupby('a'))) # for a dictionary
And,
idxs, dfs = zip(*df.groupby('a')) # separate lists
idxs
(1, 2)
dfs
( a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4, a b c d
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3)
This is a way using np.split; note it assumes equal values of a are contiguous (the frame is sorted by a), since it splits wherever a changes:
import numpy as np

idx = np.flatnonzero(df.a.diff().fillna(0))  # row positions where 'a' changes; replaces Series.nonzero(), removed in pandas 1.0
dfs = np.split(df, idx, axis=0)
dfs
dfs
Out[210]:
[ a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4, a b c d
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3]
dfs[0]
Out[211]:
a b c d
0 1 1 1 1
1 1 2 2 2
2 1 3 3 3
3 1 4 4 4

Pandas DataFrame drop tuple or list of columns

When using the drop method of a pandas.DataFrame, it accepts lists of column names but not tuples, despite the documentation saying that "list-like" arguments are acceptable. Am I reading the documentation incorrectly? I would expect my MWE to work.
MWE
import pandas as pd
df = pd.DataFrame({k: range(5) for k in list('abcd')})
df.drop(['a', 'c'], axis=1) # Works
df.drop(('a', 'c'), axis=1) # Errors
Versions - Using Python 2.7.12, Pandas 0.20.3.
The problem is that tuples are used for selecting from a MultiIndex:
np.random.seed(345)
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
df = pd.DataFrame(np.random.randint(10, size=(4,5)), columns=mux)
print (df)
a b c d e
c d e f g
0 8 0 3 9 8
1 4 3 4 1 7
2 4 0 9 6 3
3 8 0 3 1 5
df = df.drop(('a', 'c'), axis=1)
print (df)
b c d e
d e f g
0 0 3 9 8
1 3 4 1 7
2 0 9 6 3
3 0 3 1 5
Same as:
df = df[('a', 'c')]
print (df)
0 8
1 4
2 4
3 8
Name: (a, c), dtype: int32
Pandas treats tuples as multi-index values, so try this instead:
In [330]: df.drop(list(('a', 'c')), axis=1)
Out[330]:
b d
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Here is an example of deleting rows (axis=0, the default) in the multi-index DF:
In [342]: x = df.set_index(np.arange(len(df), 0, -1), append=True)
In [343]: x
Out[343]:
a b c d
0 5 0 0 0 0
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [344]: x.drop((0,5))
Out[344]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [345]: x.drop([(0,5), (4,1)])
Out[345]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
So when you specify a tuple, Pandas treats it as a multi-index label.
I used this to delete a column whose name is a tuple:
del df3[('val1', 'val2')]
and it got deleted.
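On newer pandas (0.21+) the explicit columns= keyword is another way to sidestep the ambiguity; a short sketch against the frames above (wrapping the tuple in a list keeps it from being read as separate labels):

df.drop(columns=['a', 'c'])      # flat columns: drop two labels
df.drop(columns=[('a', 'c')])    # MultiIndex: drop the single ('a', 'c') column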

Add multiple columns to DataFrame and set them equal to an existing column

I want to add multiple columns to a pandas DataFrame and set them equal to an existing column. Is there a simple way of doing this? In R I would do:
df <- data.frame(a=1:5)
df[c('b','c')] <- df$a
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
In pandas this results in KeyError: "['b' 'c'] not in index":
df = pd.DataFrame({'a': np.arange(1,6)})
df[['b','c']] = df.a
You can use the .assign() method (note that it returns a new DataFrame rather than modifying df in place):
In [31]: df.assign(b=df['a'], c=df['a'])
Out[31]:
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
or a slightly more creative approach:
In [41]: cols = list('bcdefg')
In [42]: df.assign(**{col:df['a'] for col in cols})
Out[42]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
another solution:
In [60]: pd.DataFrame(np.repeat(df.values, len(cols)+1, axis=1), columns=['a']+cols)
Out[60]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
NOTE: as #Cpt_Jauchefuerst mentioned in the comment, on Python < 3.6 DataFrame.assign(z=1, a=1) adds columns in alphabetical order (first a is added to the existing columns, then z) because keyword-argument order is not preserved there; on Python 3.6+ the insertion order is kept.
A pd.concat approach:
df = pd.DataFrame(dict(a=range(5)))
pd.concat([df.a] * 5, axis=1, keys=list('abcde'))
a b c d e
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
Turns out you can use a loop to do this:
for i in ['b', 'c']:
    df[i] = df.a
You can set them individually if you're only dealing with a few columns:
df['b'] = df['a']
df['c'] = df['a']
or you can use a loop as you discovered.
