Pandas: Create new dataframe that averages duplicates from another dataframe - python

Say I have a dataframe my_df with duplicate column names, e.g.
foo  bar  foo  hello
  0    1    1      5
  1    1    2      5
  2    1    3      5
I would like to create another dataframe that averages the duplicates:
foo  bar  hello
0.5    1      5
1.5    1      5
2.5    1      5
How can I do this in Pandas?
So far I have managed to identify duplicates:
import collections

my_columns = my_df.columns
my_duplicates = [x for x, y in collections.Counter(my_columns).items() if y > 1]
But I don't know how to ask Pandas to average them.

You can groupby the column index and take the mean:
In [11]: df.groupby(level=0, axis=1).mean()
Out[11]:
   bar  foo  hello
0    1  0.5      5
1    1  1.5      5
2    1  2.5      5
A somewhat trickier example is when there is a non-numeric column:
In [21]: df
Out[21]:
   foo  bar  foo hello
0    0    1    1     a
1    1    1    2     a
2    2    1    3     a
The above will raise DataError: No numeric types to aggregate. It's definitely not going to win any prizes for efficiency, but here's a generic method for this case:
In [22]: dupes = df.columns.get_duplicates()
In [23]: dupes
Out[23]: ['foo']
In [24]: pd.DataFrame({d: df[d] for d in df.columns if d not in dupes})
Out[24]:
bar hello
0 1 a
1 1 a
2 1 a
In [25]: pd.concat(df.xs(d, axis=1) for d in dupes).groupby(level=0, axis=1).mean()
Out[25]:
foo
0 0.5
1 1.5
2 2.5
In [26]: pd.concat([Out[24], Out[25]], axis=1)
Out[26]:
foo bar hello
0 0.5 1 a
1 1.5 1 a
2 2.5 1 a
I think the thing to take away is to avoid column duplicates... or perhaps that I don't know what I'm doing.
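As an aside: in more recent pandas versions, Index.get_duplicates() has been removed and grouping along axis=1 is deprecated, so the snippets above may no longer run as written. A rough sketch of one way to get the same result there, assuming the mixed-type frame from In [21]:

import pandas as pd

df = pd.DataFrame([[0, 1, 1, 'a'],
                   [1, 1, 2, 'a'],
                   [2, 1, 3, 'a']],
                  columns=['foo', 'bar', 'foo', 'hello'])

# Column names that appear more than once
dupes = df.columns[df.columns.duplicated()].unique()

# Transpose the duplicated columns so the repeated labels become the row
# index, average on that index, and transpose back.
averaged = df[dupes].T.groupby(level=0).mean().T

# Everything that isn't duplicated stays untouched.
rest = df.loc[:, ~df.columns.duplicated(keep=False)]

result = pd.concat([averaged, rest], axis=1)
print(result)
#    foo  bar hello
# 0  0.5    1     a
# 1  1.5    1     a
# 2  2.5    1     a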

Related

How can I groupby a DataFrame and, at the same time, count the values and put them in different columns?

I have a DataFrame that looks like the one below
Index  Category  Class
0      1         A
1      1         A
2      1         B
3      2         A
4      3         B
5      3         B
I would like an output dataframe that groups by Category and has one column per Class with the count of occurrences of that Class in each Category, such as the one below:
Index  Category  A  B
0      1         2  1
1      2         1  0
2      3         0  2
So far I've tried various combinations of the groupby and agg methods, but I still can't get what I want. I've also tried df.pivot_table(index='Category', columns='Class', aggfunc='count'), but that returns a DataFrame without columns. Any ideas of what could work in this case?
You can use aggfunc="size" to achieve your desired result:
>>> df.pivot_table(index='Category', columns='Class', aggfunc='size', fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
Alternatively, you can use .groupby(...).size() to get the counts, and then unstack to reshape your data as well:
>>> df.groupby(["Category", "Class"]).size().unstack(fill_value=0)
Class A B
Category
1 2 1
2 1 0
3 0 2
Assign a dummy value to count:
out = df.assign(val=1).pivot_table('val', 'Category', 'Class',
                                   aggfunc='count', fill_value=0).reset_index()
print(out)
# Output
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2
import pandas as pd

df = pd.DataFrame({'Index': [0, 1, 2, 3, 4, 5],
                   'Category': [1, 1, 1, 2, 3, 3],
                   'Class': ['A', 'A', 'B', 'A', 'B', 'B'],
                   })
df = df.groupby(['Category', 'Class']).count()
df = df.pivot_table(index='Category', columns='Class')
print(df)
output:
Index
Class A B
Category
1 2.0 1.0
2 1.0 NaN
3 NaN 2.0
Use crosstab:
pd.crosstab(df['Category'], df['Class']).reset_index()
output:
Class Category A B
0 1 2 1
1 2 1 0
2 3 0 2
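For completeness (this variant is not in the answers above): counting each Class per group with value_counts and unstacking gives the same table. A small self-contained sketch:

import pandas as pd

df = pd.DataFrame({'Category': [1, 1, 1, 2, 3, 3],
                   'Class': ['A', 'A', 'B', 'A', 'B', 'B']})

# Count each Class within each Category, then pivot the Class level into
# columns; combinations that never occur become 0.
out = (df.groupby('Category')['Class']
         .value_counts()
         .unstack(fill_value=0)
         .reset_index())
print(out)
# Class  Category  A  B
# 0             1  2  1
# 1             2  1  0
# 2             3  0  2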

Replacing null value in Python with next available value by group

df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2],
    'value': [None, None, 'A', None, 'B', None]
})
I would like to replace missing values with the next non-missing value within each group. The desired result is:
df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2],
    'value': ['A', 'A', 'A', 'B', 'B', None]
})
You can try this:
df['value'] = df.groupby(by=['group'])['value'].backfill()
print(df)
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 NaN
The easiest way, as @Erfan mentioned, is to use the backfill method DataFrameGroupBy.bfill.
Solution 1)
>>> df['value'] = df.groupby('group')['value'].bfill()
>>> df
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 NaN
Solution 2)
DataFrameGroupBy.bfill with limit parameter works perfectly as well here.
The section of the pandas documentation on limiting the amount of filling is worth reading; per the docs, if we only want consecutive gaps filled up to a certain number of data points, we can use the limit keyword.
>>> df['value'] = df.groupby(['group']).bfill(limit=2)
# >>> df['value'] = df.groupby('group').bfill(limit=2)
>>> df
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 NaN
Solution 3)
With groupby() we can also use fillna() with method='bfill' along with the limit parameter.
>>> df.groupby('group').fillna(method='bfill',limit=2)
value
0 A
1 A
2 A
3 B
4 B
5 None
Solution 4)
Another way is to use the DataFrame.transform function to fill the value column after the groupby, backfilling within each group.
>>> df['value'] = df.groupby('group')['value'].transform(lambda v: v.bfill())
>>> df
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 None
Solution 5)
You can use DataFrame.set_index to add the group column to the index, do a simple bfill() via groupby() on that level, and then reset the index to restore its original state.
>>> df.set_index('group', append=True).groupby(level=1).bfill().reset_index(level=1)
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 NaN
Solution 6)
If you are strictly not going for groupby(), then the below would be the easiest:
>>> df['value'] = df['value'].bfill()
>>> df
group value
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 None
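One caveat with Solution 6 that is easy to miss: without the groupby, bfill can pull a value across a group boundary. A small sketch with made-up data where this matters:

import pandas as pd

df2 = pd.DataFrame({'group': [1, 1, 2, 2],
                    'value': ['A', None, 'B', None]})

# Plain bfill ignores groups: row 1 (group 1) is filled with 'B' from group 2.
print(df2['value'].bfill().tolist())                    # ['A', 'B', 'B', None]

# Grouped bfill keeps the gap, since group 1 has no later non-missing value.
print(df2.groupby('group')['value'].bfill().tolist())   # ['A', None, 'B', None]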

find column value in dataframe

Have 2 dataframes.
First has 1 column.
test1: 1,2,3,4,5
Second has 2 columns.
test2: 0,1,1,1,1 and test3: 2,2,3,3,4
I need to create a new column in the first dataframe that shows whether each row's value exists anywhere in the second dataframe (a simple Ctrl+F).
As a result I need to get
test1: 1,2,3,4,5
check: yes,yes,yes,yes,no
UPD
Below is code I found, but it gives the right result only for the first row; I don't know if that makes sense:
first['check'] = second.eq(first['test1'], 0).any(1).astype(int)
You can check with isin after flattening the values:
test1['col2']=test1['col1'].isin(test2.values.ravel())
In [1]: import pandas as pd
...: df1 = pd.DataFrame({'test1': [1,2,3,4,5]})
...: df2 = pd.DataFrame({'test2': [0,1,1,1,1], 'test3': [2,2,3,3,4]})
In [2]: df1
Out[2]:
test1
0 1
1 2
2 3
3 4
4 5
In [3]: df2
Out[3]:
test2 test3
0 0 2
1 1 2
2 1 3
3 1 3
4 1 4
In [4]: df1['check'] = df1['test1'].isin(df2['test2']) \
...: | df1['test1'].isin(df2['test3'])
...: df1
Out[4]:
test1 check
0 1 True
1 2 True
2 3 True
3 4 True
4 5 False
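Putting the two answers together, here is a sketch that also maps the boolean to the asker's yes/no format (same data as in In [1]):

import pandas as pd

df1 = pd.DataFrame({'test1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'test2': [0, 1, 1, 1, 1], 'test3': [2, 2, 3, 3, 4]})

# Flatten every value of df2 into one 1-D array, test membership in one pass,
# then translate True/False into 'yes'/'no'.
df1['check'] = df1['test1'].isin(df2.values.ravel()).map({True: 'yes', False: 'no'})
print(df1)
#    test1 check
# 0      1   yes
# 1      2   yes
# 2      3   yes
# 3      4   yes
# 4      5    no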

Start counting at zero by group

Consider the following dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('aaabbabc')})
>>> df
group
0 a
1 a
2 a
3 b
4 b
5 a
6 b
7 c
I want to count the cumulative number of times each group has occurred. My desired output looks like this:
>>> df
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
My initial approach was to do something like this:
df['n'] = df.groupby('group').apply(lambda x: list(range(x.shape[0])))
Basically, this assigns a length-n, zero-indexed array to each group. But that has proven difficult to transpose and join.
You can use groupby + cumcount, and horizontally concat the new column:
>>> pd.concat([df, df.group.groupby(df.group).cumcount()], axis=1).rename(columns={0: 'n'})
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Simply use groupby on the column name, in this case group, then apply cumcount, and finally add a column to the dataframe with the result.
df['n']=df.groupby('group').cumcount()
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
You can use the apply method, passing a lambda expression as the parameter.
The idea is that the count for a row is the number of appearances of its group among the previous rows.
df['n'] = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']), axis=1)
Output
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Note: the cumcount method is built with the help of the apply function.
You can read this in pandas documentation.
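As a quick sanity check, the built-in cumcount and the row-wise apply version above give the same result on the example; note that the apply version re-scans all previous rows for every row, so it gets slow quickly on large frames:

import pandas as pd

df = pd.DataFrame({'group': list('aaabbabc')})

n_cumcount = df.groupby('group').cumcount()
n_apply = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']), axis=1)

print((n_cumcount == n_apply).all())   # True: both give 0,1,2,0,1,3,2,0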

Get group id back into pandas dataframe

For dataframe
In [2]: df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
...: 'Rank': np.random.randint(0,3,6),
...: 'Val': np.random.rand(6)})
...: df
Out[2]:
Name Rank Val
0 foo 0 0.299397
1 bar 0 0.909228
2 foo 0 0.517700
3 bar 0 0.929863
4 foo 1 0.209324
5 bar 2 0.381515
I'm interested in grouping by Name and Rank and possibly getting aggregate values
In [3]: group = df.groupby(['Name', 'Rank'])
In [4]: agg = group.agg(sum)
In [5]: agg
Out[5]:
Val
Name Rank
bar 0 1.839091
2 0.381515
foo 0 0.817097
1 0.209324
But I would like to get a field in the original df that contains the group number for that row, like
In [13]: df['Group_id'] = [2, 0, 2, 0, 3, 1]
In [14]: df
Out[14]:
Name Rank Val Group_id
0 foo 0 0.299397 2
1 bar 0 0.909228 0
2 foo 0 0.517700 2
3 bar 0 0.929863 0
4 foo 1 0.209324 3
5 bar 2 0.381515 1
Is there a good way to do this in pandas?
I can get it with python,
In [16]: from itertools import count
In [17]: c = count()
In [22]: group.transform(lambda x: c.next())
Out[22]:
Val
0 2
1 0
2 2
3 0
4 3
5 1
but it's pretty slow on a large dataframe, so I figured there may be a better built in pandas way to do this.
A lot of handy things are stored in the DataFrameGroupBy.grouper object. For example:
>>> df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
'Rank': np.random.randint(0,3,6),
'Val': np.random.rand(6)})
>>> grouped = df.groupby(["Name", "Rank"])
>>> grouped.grouper.
grouped.grouper.agg_series grouped.grouper.indices
grouped.grouper.aggregate grouped.grouper.labels
grouped.grouper.apply grouped.grouper.levels
grouped.grouper.axis grouped.grouper.names
grouped.grouper.compressed grouped.grouper.ngroups
grouped.grouper.get_group_levels grouped.grouper.nkeys
grouped.grouper.get_iterator grouped.grouper.result_index
grouped.grouper.group_info grouped.grouper.shape
grouped.grouper.group_keys grouped.grouper.size
grouped.grouper.groupings grouped.grouper.sort
grouped.grouper.groups
and so:
>>> df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.group_info[0]
>>> df
Name Rank Val GroupId
0 foo 0 0.302482 2
1 bar 0 0.375193 0
2 foo 2 0.965763 4
3 bar 2 0.166417 1
4 foo 1 0.495124 3
5 bar 2 0.728776 1
There may be a nicer alias for grouper.group_info[0] lurking around somewhere, but this should work, anyway.
Use GroupBy.ngroup from pandas 0.20.2+:
df["GroupId"] = df.groupby(["Name", "Rank"]).ngroup()
print (df)
Name Rank Val GroupId
0 foo 2 0.451724 4
1 bar 0 0.944676 0
2 foo 0 0.822390 2
3 bar 2 0.063603 1
4 foo 1 0.938892 3
5 bar 2 0.332454 1
The correct solution is to use grouper.label_info:
df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.label_info
It automatically associates each row in the df dataframe to the corresponding group label.
Previous answers do not mention how the id within a group is assigned, or whether it is reproducible across multiple calls or across systems. Hence the ranking of items is not controlled by the user.
To address this issue, I use the following function to assign a rank to individual elements within each group. The sorter argument lets me control precisely how the rank is assigned.
def group_rank_id(df, grouper, sorter):
    # function to apply to each group
    def group_fun(x):
        return x[sorter].reset_index(drop=True).reset_index().rename(columns={'index': 'rank'})
    # apply and merge to itself
    out = df.groupby(grouper).apply(group_fun).reset_index(drop=True)
    return df.merge(out, on=sorter)
Example data:
df
action quantity ticker date price
0 buy 3.0 SXRV.DE 1.584662e+09 0.519707
1 buy 7.0 MSF.DE 1.599696e+09 0.998484
2 buy 1.0 ABEA.DE 1.600387e+09 0.538107
3 buy 1.0 AMZ.F 1.606349e+09 0.446594
4 buy 9.0 09KE.BE 1.610669e+09 0.383777
5 buy 11.0 09KF.BE 1.610669e+09 0.987921
6 buy 3.0 FB2A.MU 1.620173e+09 0.696381
7 buy 3.0 FB2A.MU 1.636070e+09 0.700757
will result in:
group_rank_id(df, 'ticker',['ticker','date'])
action quantity ticker date price rank
0 buy 3.0 SXRV.DE 1.584662e+09 0.519707 0
1 buy 7.0 MSF.DE 1.599696e+09 0.998484 0
2 buy 1.0 ABEA.DE 1.600387e+09 0.538107 0
3 buy 1.0 AMZ.F 1.606349e+09 0.446594 0
4 buy 9.0 09KE.BE 1.610669e+09 0.383777 0
5 buy 11.0 09KF.BE 1.610669e+09 0.987921 0
6 buy 3.0 FB2A.MU 1.620173e+09 0.696381 0
7 buy 3.0 FB2A.MU 1.636070e+09 0.700757 1
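If the only goal is reproducible ids and ranks, the built-ins can also get you there: with the default sort=True, ngroup numbers groups in the sorted order of the keys, and a sort_values followed by cumcount gives a deterministic within-group rank. A sketch on a cut-down version of the example data (values made up):

import pandas as pd

df = pd.DataFrame({'ticker': ['SXRV.DE', 'FB2A.MU', 'FB2A.MU'],
                   'date': [1.584662e+09, 1.636070e+09, 1.620173e+09]})

# Stable group id: groups are numbered in sorted key order, so the mapping
# does not depend on row order or on the run.
df['group_id'] = df.groupby('ticker', sort=True).ngroup()

# Deterministic rank within each ticker: earliest date gets rank 0.
df['rank'] = df.sort_values(['ticker', 'date']).groupby('ticker').cumcount()
print(df)
#     ticker          date  group_id  rank
# 0  SXRV.DE  1.584662e+09         1     0
# 1  FB2A.MU  1.636070e+09         0     1
# 2  FB2A.MU  1.620173e+09         0     0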
