Get group id back into pandas dataframe - python

For dataframe
In [2]: df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
...: 'Rank': np.random.randint(0,3,6),
...: 'Val': np.random.rand(6)})
...: df
Out[2]:
Name Rank Val
0 foo 0 0.299397
1 bar 0 0.909228
2 foo 0 0.517700
3 bar 0 0.929863
4 foo 1 0.209324
5 bar 2 0.381515
I'm interested in grouping by Name and Rank and possibly getting aggregate values
In [3]: group = df.groupby(['Name', 'Rank'])
In [4]: agg = group.agg(sum)
In [5]: agg
Out[5]:
Val
Name Rank
bar 0 1.839091
2 0.381515
foo 0 0.817097
1 0.209324
But I would like to get a field in the original df that contains the group number for that row, like
In [13]: df['Group_id'] = [2, 0, 2, 0, 3, 1]
In [14]: df
Out[14]:
Name Rank Val Group_id
0 foo 0 0.299397 2
1 bar 0 0.909228 0
2 foo 0 0.517700 2
3 bar 0 0.929863 0
4 foo 1 0.209324 3
5 bar 2 0.381515 1
Is there a good way to do this in pandas?
I can get it with python,
In [16]: from itertools import count
In [17]: c = count()
In [22]: group.transform(lambda x: next(c))
Out[22]:
Val
0 2
1 0
2 2
3 0
4 3
5 1
but it's pretty slow on a large dataframe, so I figured there may be a better built in pandas way to do this.

A lot of handy things are stored in the DataFrameGroupBy.grouper object. For example:
>>> df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
...                    'Rank': np.random.randint(0, 3, 6),
...                    'Val': np.random.rand(6)})
>>> grouped = df.groupby(["Name", "Rank"])
>>> grouped.grouper.
grouped.grouper.agg_series grouped.grouper.indices
grouped.grouper.aggregate grouped.grouper.labels
grouped.grouper.apply grouped.grouper.levels
grouped.grouper.axis grouped.grouper.names
grouped.grouper.compressed grouped.grouper.ngroups
grouped.grouper.get_group_levels grouped.grouper.nkeys
grouped.grouper.get_iterator grouped.grouper.result_index
grouped.grouper.group_info grouped.grouper.shape
grouped.grouper.group_keys grouped.grouper.size
grouped.grouper.groupings grouped.grouper.sort
grouped.grouper.groups
and so:
>>> df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.group_info[0]
>>> df
Name Rank Val GroupId
0 foo 0 0.302482 2
1 bar 0 0.375193 0
2 foo 2 0.965763 4
3 bar 2 0.166417 1
4 foo 1 0.495124 3
5 bar 2 0.728776 1
There may be a nicer alias for grouper.group_info[0] lurking around somewhere, but this should work, anyway.

Use GroupBy.ngroup from pandas 0.20.2+:
df["GroupId"] = df.groupby(["Name", "Rank"]).ngroup()
print (df)
Name Rank Val GroupId
0 foo 2 0.451724 4
1 bar 0 0.944676 0
2 foo 0 0.822390 2
3 bar 2 0.063603 1
4 foo 1 0.938892 3
5 bar 2 0.332454 1
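As a usage note, ngroup numbers groups in the order they are iterated over, so passing sort=False to the groupby yields ids in order of first appearance rather than sorted key order (GroupId_appearance is just an illustrative column name):
df["GroupId_appearance"] = df.groupby(["Name", "Rank"], sort=False).ngroup()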

The correct solution is to use grouper.label_info:
df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.label_info
It automatically associates each row of the df dataframe with the corresponding group label.
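Note that grouper is an internal attribute and is deprecated in recent pandas versions. If it is unavailable, roughly the same ids can be built from the public API alone, for example by enumerating the sorted unique key pairs and merging back (a minimal sketch; keys and GroupId are just illustrative names):
keys = (df[['Name', 'Rank']].drop_duplicates()
        .sort_values(['Name', 'Rank'])
        .reset_index(drop=True))
keys['GroupId'] = range(len(keys))
df = df.merge(keys, on=['Name', 'Rank'], how='left')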

Previous answers do not mention how the id is assigned within a group, or whether it is reproducible across multiple calls or across systems, so the ranking of items is not under the user's control.
To address this, I use the following function to assign a rank to the individual elements within each group. The sorter argument lets me control precisely how the rank is assigned.
def group_rank_id(df, grouper, sorter):
    # function to apply to each group
    def group_fun(x):
        return x[sorter].reset_index(drop=True).reset_index().rename(columns={'index': 'rank'})
    # apply and merge back onto the original frame
    out = df.groupby(grouper).apply(group_fun).reset_index(drop=True)
    return df.merge(out, on=sorter)
Example data:
df
action quantity ticker date price
0 buy 3.0 SXRV.DE 1.584662e+09 0.519707
1 buy 7.0 MSF.DE 1.599696e+09 0.998484
2 buy 1.0 ABEA.DE 1.600387e+09 0.538107
3 buy 1.0 AMZ.F 1.606349e+09 0.446594
4 buy 9.0 09KE.BE 1.610669e+09 0.383777
5 buy 11.0 09KF.BE 1.610669e+09 0.987921
6 buy 3.0 FB2A.MU 1.620173e+09 0.696381
7 buy 3.0 FB2A.MU 1.636070e+09 0.700757
will result in:
group_rank_id(df, 'ticker',['ticker','date'])
action quantity ticker date price rank
0 buy 3.0 SXRV.DE 1.584662e+09 0.519707 0
1 buy 7.0 MSF.DE 1.599696e+09 0.998484 0
2 buy 1.0 ABEA.DE 1.600387e+09 0.538107 0
3 buy 1.0 AMZ.F 1.606349e+09 0.446594 0
4 buy 9.0 09KE.BE 1.610669e+09 0.383777 0
5 buy 11.0 09KF.BE 1.610669e+09 0.987921 0
6 buy 3.0 FB2A.MU 1.620173e+09 0.696381 0
7 buy 3.0 FB2A.MU 1.636070e+09 0.700757 1
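If the within-group rank only needs to follow a sort order, a simpler route (a sketch, not from the original answer) is sort_values followed by GroupBy.cumcount; the result aligns back to df by index on assignment:
df['rank'] = (df.sort_values(['ticker', 'date'])
                .groupby('ticker')
                .cumcount())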

Related

How can I groupby a DataFrame at the same time I count the values and put in different columns?

I have a DataFrame that looks like the one below
Index Category Class
0 1 A
1 1 A
2 1 B
3 2 A
4 3 B
5 3 B
And I would like to get an output data frame that groups by Category and has one column for each class, containing the count of occurrences of that class in each category, such as the one below
Index Category A B
0 1 2 1
1 2 1 0
2 3 0 2
So far I've tried various combinations of the groupby and agg methods, but I still can't get what I want. I've also tried df.pivot_table(index='Category', columns='Class', aggfunc='count'), but that returns a DataFrame without columns. Any ideas of what could work in this case?
You can use aggfunc="size" to achieve your desired result:
>>> df.pivot_table(index='Category', columns='Class', aggfunc='size', fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
Alternatively, you can use .groupby(...).size() to get the counts, and then unstack to reshape your data as well:
>>> df.groupby(["Category", "Class"]).size().unstack(fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
Assign a dummy value to count:
out = df.assign(val=1).pivot_table('val', 'Category', 'Class',
                                   aggfunc='count', fill_value=0).reset_index()
print(out)
# Output
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2
import pandas as pd

df = pd.DataFrame({'Index': [0, 1, 2, 3, 4, 5],
                   'Category': [1, 1, 1, 2, 3, 3],
                   'Class': ['A', 'A', 'B', 'A', 'B', 'B']})
df = df.groupby(['Category', 'Class']).count()
df = df.pivot_table(index='Category', columns='Class')
print(df)
output:
Index
Class A B
Category
1 2.0 1.0
2 1.0 NaN
3 NaN 2.0
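If integer counts with zeros are preferred over the NaN floats above, a small follow-up sketch (not part of the original answer) cleans up the result:
df = df.fillna(0).astype(int)
df.columns = df.columns.droplevel(0)   # drop the leftover 'Index' level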
Use crosstab:
pd.crosstab(df['Category'], df['Class']).reset_index()
output:
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2
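As a usage note, crosstab can also append row and column totals through its margins arguments:
pd.crosstab(df['Category'], df['Class'], margins=True, margins_name='Total')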

Replacing null value in Python with next available value by group

df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2],
    'value': [None, None, 'A', None, 'B', None]
})
I would like to replace missing values by the first next non missing value by group. The desired result is:
df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2],
    'value': ['A', 'A', 'A', 'B', 'B', None]
})
You can try this:
df['value'] = df.groupby(by=['group'])['value'].backfill()
print(df)
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 NaN
The easiest way, as @Erfan mentioned, is the backfill method DataFrameGroupBy.bfill.
Solution 1)
>>> df['value'] = df.groupby('group')['value'].bfill()
>>> df
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 NaN
Solution 2)
DataFrameGroupBy.bfill with the limit parameter also works well here.
The section of the pandas documentation on limiting the amount of filling is worth reading: if we only want consecutive gaps filled up to a certain number of data points, we can use the limit keyword.
>>> df['value'] = df.groupby(['group']).bfill(limit=2)
# >>> df['value'] = df.groupby('group').bfill(limit=2)
>>> df
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 NaN
Solution 3)
With groupby() we can also combine fillna(method='bfill') with the limit parameter.
>>> df.groupby('group').fillna(method='bfill',limit=2)
value
0 A
1 A
2 A
3 B
4 B
5 None
Solution 4)
Another way is to use the DataFrame.transform function to fill the value column after the groupby, calling bfill inside the transform.
>>> df['value'] = df.groupby('group')['value'].transform(lambda v: v.bfill())
>>> df
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 None
Solution 5)
You can use DataFrame.set_index to move the group column into the index, do a simple bfill() via groupby() on that level, and then reset the index back to its original state.
>>> df.set_index('group', append=True).groupby(level=1).bfill().reset_index(level=1)
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 NaN
Solution 6)
If you are not strictly set on groupby(), the below would be the easiest:
>>> df['value'] = df['value'].bfill()
>>> df
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 None

pandas dataframe apply using additional arguments

With the below example:
df = pd.DataFrame({'signal': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
                   'product': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'price': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7],
                   'price2': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]})
I have a function "fill_price" to create a new column 'Price_B' based on 'signal' and 'price'. For every 'product' subgroup, Price_B equals Price if 'signal' is 1, and equals the previous row's Price_B if 'signal' is 0. If the subgroup starts with a 0 'signal', then 'Price_B' is kept at 0 until 'signal' turns 1.
Currently I have:
def fill_price(df, signal, price_A):
    p = df[price_A].where(df[signal] == 1)
    return p.ffill().fillna(0).astype(df[price_A].dtype)
this is then applied using:
df['Price_B'] = fill_price(df,'signal','price')
However, I want to use df.groupby('product').apply() to apply this fill_price function to the two 'product' subgroups separately, and also apply it to both the 'price' and 'price2' columns. Could someone help with that?
I basically want to do:
df.groupby('product', group_keys=False).apply(fill_price, 'signal', 'price2')
IIUC, you can use this syntax:
df['Price_B'] = df.groupby('product').apply(lambda x: fill_price(x,'signal','price2')).reset_index(level=0, drop=True)
Output:
price price2 product signal Price_B
0 1 1 A 1 1
1 2 2 A 0 1
2 3 1 A 0 1
3 4 2 A 1 2
4 5 1 A 0 2
5 6 2 A 0 2
6 7 1 A 0 2
7 1 2 B 0 0
8 2 1 B 1 1
9 3 2 B 0 1
10 4 1 B 0 1
11 5 2 B 1 2
12 6 1 B 0 2
13 7 2 B 0 2
You can write this much more simply without the extra function.
df['Price_B'] = (df.groupby('product', as_index=False)
                   .apply(lambda x: x['price2'].where(x.signal == 1).ffill().fillna(0))
                   .reset_index(level=0, drop=True))
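Another option, a sketch of the same idea that is not from the original answers: mask the price column on the signal and forward-fill within each product group, without apply at all:
df['Price_B'] = (df['price2'].where(df['signal'] == 1)
                 .groupby(df['product']).ffill()
                 .fillna(0)
                 .astype(df['price2'].dtype))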

Pandas - Return first item in dataframe, grouped by user

I have a lot of user/item/timestamp data. I want to know which items were consumed first, second, third, etc. by all users.
My questions are: if I have a dataframe that is already sorted by time (descending), will it stay sorted through the groupby process by default? And how can I pull out the first two items consumed by each user, even if a user has not consumed two items?
import pandas as pd

df = pd.DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'user_id': [1, 2, 1, 1, 3, 1],
                   'time': range(6)})
print(df)
pd.get_dummies(df['item_id'])
gp = df.groupby('user_id').head()
print(gp)
# Return item_id of first one installed in each case ??
This gives:
item_id time user_id
0 b 0 1
1 b 1 2
2 a 2 1
3 c 3 1
4 a 4 3
5 b 5 1
item_id time user_id
user_id
1 0 b 0 1
2 a 2 1
3 c 3 1
5 b 5 1
2 1 b 1 2
3 4 a 4 3
Now, I need to pull out the top two item_id values, something like this (but retaining the user_id column is not essential):
user_id order item_id
1 0 b
1 1 a
2 0 b
3 0 a
Here's a hack:
In [75]: def nth_order(x, n):
   ....:     xn = x[:n]
   ....:     return xn.join(pd.Series(np.arange(len(xn)), name='order', index=xn.index))
   ....:
In [76]: df.groupby('user_id').apply(lambda x: nth_order(x, 2))
Out[76]:
item_id time user_id order
user_id
1 0 b 0 1 0
2 a 2 1 1
2 1 b 1 2 0
3 4 a 4 3 0
Note that you can't just use n, because you may have a group where len(group) < 2, and therefore len(x[:n]) != n in every case (as per your question).
This is a feature of this particular kind of slicing in pandas: if you slice past the end you just get every row that exists (and there may be fewer than n of them), whereas a single positional lookup past the end with iloc raises an exception.
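To make that concrete, a tiny sketch using the question's df (g1 is just an illustrative name):
g1 = df[df.user_id == 2]   # a group with a single row
g1[:2]                     # slicing past the end just returns that one row
g1.iloc[2]                 # a positional lookup past the end raises IndexError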
You can do this directly with head (which gets the top n results):
In [11]: g = df.groupby('user_id')
In [12]: g.head(2)
Out[12]:
item_id time user_id
user_id
1 0 b 0 1
2 a 2 1
2 1 b 1 2
3 4 a 4 3
As of 0.13, IIRC, this is much faster than any apply-based solution (calling head used to be a fall-through to .apply(lambda x: x.head())).
The implementation uses cumcount so is similar in spirit to PhilipCloud's solution.
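Since head alone does not produce the order column from the question, cumcount can add it afterwards; a sketch (top2 is just an illustrative name):
top2 = df.groupby('user_id').head(2).copy()
top2['order'] = top2.groupby('user_id').cumcount()
top2[['user_id', 'order', 'item_id']]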

Pandas: Create new dataframe that averages duplicates from another dataframe

Say I have a dataframe my_df with column duplicates, e.g.
foo bar foo hello
0 1 1 5
1 1 2 5
2 1 3 5
I would like to create another dataframe that averages the duplicates:
foo bar hello
0.5 1 5
1.5 1 5
2.5 1 5
How can I do this in Pandas?
So far I have managed to identify duplicates:
my_columns = my_df.columns
my_duplicates = [x for x, y in collections.Counter(my_columns).items() if y > 1]
print(my_duplicates)
But I don't know how to ask Pandas to average them.
You can groupby the column index and take the mean:
In [11]: df.groupby(level=0, axis=1).mean()
Out[11]:
bar foo hello
0 1 0.5 5
1 1 1.5 5
2 1 2.5 5
A somewhat trickier example is if there is a non-numeric column:
In [21]: df
Out[21]:
foo bar foo hello
0 0 1 1 a
1 1 1 2 a
2 2 1 3 a
The above will raise: DataError: No numeric types to aggregate. It's definitely not going to win any prizes for efficiency, but here's a generic method for this case:
In [22]: dupes = df.columns.get_duplicates()
In [23]: dupes
Out[23]: ['foo']
In [24]: pd.DataFrame({d: df[d] for d in df.columns if d not in dupes})
Out[24]:
bar hello
0 1 a
1 1 a
2 1 a
In [25]: pd.concat(df.xs(d, axis=1) for d in dupes).groupby(level=0, axis=1).mean()
Out[25]:
foo
0 0.5
1 1.5
2 2.5
In [26]: pd.concat([Out[24], Out[25]], axis=1)
Out[26]:
foo bar hello
0 0.5 1 a
1 1.5 1 a
2 2.5 1 a
I think the thing to take away is avoid column duplicates... or perhaps that I don't know what I'm doing.
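A side note: recent pandas versions deprecate groupby(..., axis=1). For the all-numeric frame at the top of this question, the same averaging can be done by transposing, grouping on the column names, and transposing back (a sketch):
df.T.groupby(level=0).mean().T
   bar  foo  hello
0  1.0  0.5    5.0
1  1.0  1.5    5.0
2  1.0  2.5    5.0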
