I have a DataFrame of the following form:
>>> import numpy as np
>>> import pandas as pd
>>> sales = pd.DataFrame({'seller_id': list('AAAABBBB'), 'buyer_id': list('CCDECDEF'),
...                       'amount': np.random.randint(10, 20, size=(8,))})
>>> sales = sales[['seller_id','buyer_id','amount']]
>>> sales
seller_id buyer_id amount
0 A C 18
1 A C 15
2 A D 11
3 A E 12
4 B C 16
5 B D 18
6 B E 16
7 B F 19
Now what I would like to do is, for each seller, calculate the share of the total sale amount taken up by its largest buyer. I have code that does this, but I have to keep resetting the index and regrouping, which is wasteful. There has to be a better way. I would like a solution where I can aggregate one index level at a time and keep the others grouped.
Here's my current code:
>>> gr2 = sales.groupby(['buyer_id','seller_id'])
>>> seller_buyer_level = gr2['amount'].sum() # sum over different purchases
>>> seller_buyer_level_reset = seller_buyer_level.reset_index('buyer_id')
>>> gr3 = seller_buyer_level_reset.groupby(seller_buyer_level_reset.index)
>>> result = gr3['amount'].max() / gr3['amount'].sum()
>>> result
seller_id
A 0.589286
B 0.275362
I simplified a bit. In reality I also have a time-period column, so I want to do this at the seller and time-period level; that's why in gr3 I'm grouping by the multi-index (which in this example appears as a single index).
I thought there would be a solution where instead of reducing and regrouping I would be able to aggregate only one index out of the group, leaving the others grouped, but couldn't find it in the documentation or online. Any ideas?
Here's a one-liner, but it resets the index once, too:
sales.groupby(['seller_id','buyer_id']).sum().\
reset_index(level=1).groupby(level=0).\
apply(lambda x: x.amount.max()/x.amount.sum())
#seller_id
#A 0.509091
#B 0.316667
#dtype: float64
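A variant of the same idea that avoids the reset entirely: group the summed Series by its seller_id index level directly (a sketch; per_pair and grouped are just illustrative names, and the numbers again depend on the random data):
per_pair = sales.groupby(['seller_id', 'buyer_id'])['amount'].sum()   # sum over different purchases
grouped = per_pair.groupby(level='seller_id')                         # group by the seller level only
result = grouped.max() / grouped.sum()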
I would do this using pivot_table and then broadcasting (see What does the term "broadcasting" mean in Pandas documentation?).
First, pivot the data with seller_id in the index and buyer_id in the columns:
sales_pivot = sales.pivot_table(index='seller_id', columns='buyer_id', values='amount', aggfunc='sum')
Then, divide the values in each row by the sum of said row:
result = sales_pivot.div(sales_pivot.sum(axis=1), axis=0)
Lastly, you can call result.max(axis=1) to see the top share for each seller.
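For completeness, a small sketch of that last step (top_share is just an illustrative name; the values depend on the random amounts above):
top_share = result.max(axis=1)   # largest buyer's share of each seller's total sales
print(top_share)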
I have the following DataFrame:
import pandas as pd
df = {'Country': ['A','A','B','B','B'], 'MY_Product': ['NS_1','SY_1','BMX_3','NS_5','NK'],
      'Cost': [5, 35, 34, 45, 9], 'Competidor_Country_2': ['A','A','B','B','B'],
      'Competidor_Product_2': ['BMX_2','TM_0','NS_6','SY_8','NA'], 'Competidor_Cost_2': [35, 20, 65, 67, 90]}
df_new = pd.DataFrame(df, columns=['Country', 'MY_Product', 'Cost',
                                   'Competidor_Country_2', 'Competidor_Product_2', 'Competidor_Cost_2'])
print(df_new)
Information:
My products must start with "NS", "SY", "NK" or "NA";
The first three columns hold information about my products, and the last three hold the competitor's;
I did not include every case, to keep the example simple.
Problem:
As you can see in the third row, there is a product that is not mine ("BMX_3") while the competitor's product is one of mine. So I would like to swap not only the product but the other competitor columns as well, leaving the first three columns with my product and the last three with the competitor's.
Considerations:
If the two products in the line are both mine (the last row, for example), I don't need to do anything (but if possible, leave a commented-out line of code to delete this comparison, just in case).
If I understand you right, you want to swap values of the 3 columns if the product in MY_Product isn't yours:
# create a mask
mask = ~df_new.MY_Product.str.contains(r"^(?:NS|SY|NK|NA)")

# swap the values of the three columns:
vals = df_new.loc[mask, ["Country", "MY_Product", "Cost"]].values
df_new.loc[mask, ["Country", "MY_Product", "Cost"]] = df_new.loc[
    mask, ["Competidor_Country_2", "Competidor_Product_2", "Competidor_Cost_2"]
].values
df_new.loc[
    mask, ["Competidor_Country_2", "Competidor_Product_2", "Competidor_Cost_2"]
] = vals

# print the dataframe
print(df_new)
Prints:
Country MY_Product Cost Competidor_Country_2 Competidor_Product_2 Competidor_Cost_2
0 A NS_1 5 A BMX_2 35
1 A SY_1 35 A TM_0 20
2 B NS_6 65 B BMX_3 34
3 B NS_5 45 B SY_8 67
4 B NK 9 B NA 90
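To address the consideration about rows where both products are already yours (the last row, for example), here is a small hedged sketch; both_mine is just an illustrative name:
# after the swap, MY_Product always matches the "mine" pattern, so a row where
# Competidor_Product_2 also matches compares two of my own products
both_mine = df_new.Competidor_Product_2.str.contains(r"^(?:NS|SY|NK|NA)")
# df_new = df_new[~both_mine]  # uncomment to drop these comparisons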
I have a DataFrame which I want to group by a few columns. I know how to aggregate the data after that, or view each index tuple. However, I am unsure of the best way to simply append the "group number" of each group as a column on the original dataframe:
For example, I have a dataframe, a, with two key columns (a_id and b_id) which I want to use for grouping with groupby.
import pandas as pd
a = pd.DataFrame({'a_id':['q','q','q','q','q','r','r','r','r','r'],
'b_id':['m','m','j','j','j','g','g','f','f','f'],
'val': [1,2,3,4,5,6,7,8,9,8]})
# Output:
a_id b_id val
0 q m 1
1 q m 2
2 q j 3
3 q j 4
4 q j 5
5 r g 6
6 r g 7
7 r f 8
8 r f 9
9 r f 8
When I do the groupby, rather than aggregate everything, I just want to add a column group_id that has an integer representing the group. However, I am not sure if there is a simple way to do this. My current solution involves inverting the GroupBy.indices dictionary, turning that into a series, and appending it to the dataframe as follows:
gb = a.groupby(['a_id','b_id'])
dict_g = dict(enumerate(gb.indices.values()))
dict_g_reversed = {x:k for k,v in dict_g.items() for x in v}
group_ids = pd.Series(dict_g_reversed)
a['group_id'] = group_ids
This gives me sort of what I want, although the group_id values are not in the order I'd like. This seems like it should be a simple operation, but I haven't found a built-in way to do it. I know that MATLAB, for example, has findgroups, which does exactly what I would like. So far I haven't been able to find an equivalent in pandas. How can this be done with a pandas DataFrame?
You can use ngroup; it assigns each group an integer label (by default numbered in the sorted order of the group keys):
a.groupby(['a_id','b_id']).ngroup()
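A minimal usage sketch, assigning this back as the requested group_id column:
a['group_id'] = a.groupby(['a_id', 'b_id']).ngroup()
print(a)  # rows sharing the same (a_id, b_id) pair now share one integer id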
Or use factorize (this numbers the groups in order of first appearance, starting from 1):
a['group_id'] = pd.factorize(list(map(tuple, a[['a_id','b_id']].values.tolist())))[0] + 1
I'm just getting into pandas and I am trying to add a new column to an existing dataframe.
I have two dataframes where the index of one dataframe links to a column in the other. Where these values are equal, I need to put the value of another column from the source dataframe into a new column of the destination dataframe.
The code section below illustrates what I mean. The commented part is what I need as an output.
I guess I need the .loc[] function.
Another, minor, question: is it bad practice to have a non-unique index?
import pandas as pd
d = {'key':['a', 'b', 'c'],
'bar':[1, 2, 3]}
d2 = {'key':['a', 'a', 'b'],
'other_data':['10', '20', '30']}
df = pd.DataFrame(d)
df2 = pd.DataFrame(data = d2)
df2 = df2.set_index('key')
print(df2)
## other_data new_col
##key
##a 10 1
##a 20 1
##b 30 2
Use rename with a Series as the index mapper:
df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
Or map:
df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
If you want better performance, it is best to avoid duplicates in the index. Some functions, such as reindex, also fail with a duplicated index.
You can use join
df2.join(df.set_index('key'))
other_data bar
key
a 10 1
a 20 1
b 30 2
One way to rename the column in the process
df2.join(df.set_index('key').bar.rename('new'))
other_data new
key
a 10 1
a 20 1
b 30 2
Another, minor, question: is it bad practice to have a non-unique index?
It is not great practice, but depends on your needs and can be okay in some circumstances.
Issue 1: join operations
A good place to start is to think about what makes an Index different from a standard DataFrame column. This engenders the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DBMS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a cartesian product -- probably not what you're looking for.
For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 2),
                  index=2 * list('abcde'))
df2 = df.rename(columns={0: 'a', 1 : 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
0 1 a b
a 0.73737 1.49073 0.73737 1.49073
a 0.73737 1.49073 -0.25562 -2.79859
a -0.25562 -2.79859 0.73737 1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583 1.17583 -0.93583 1.17583
b -0.93583 1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583 1.17583
Issue 2: performance
Unique-valued indices make certain operations efficient, as explained in this post.
When the index is unique, pandas uses a hash table to map key to value, O(1). When the index is non-unique and sorted, pandas uses binary search, O(log N). When the index is non-unique and unsorted, pandas needs to check all the keys in the index, O(N).
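A quick way to check which of these cases applies to a given index, as a small sketch using standard Index attributes:
idx = pd.Index(list('aabce'))        # a non-unique but sorted index
print(idx.is_unique)                 # False: duplicate labels, so no O(1) hash lookup
print(idx.is_monotonic_increasing)   # True: sorted, so lookups can use binary search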
A word on .loc
Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,
df = pd.DataFrame(np.random.randn(10, 2),
                  index=2 * list('abcde'))
print(df.loc['a'])
0 1
a 0.73737 1.49073
a -0.25562 -2.79859
With the help of .loc
df2['new'] = df.set_index('key').loc[df2.index]
Output:
other_data new
key
a 10 1
a 20 1
b 30 2
Using combine_first
In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
bar other_data
key
a 1.0 10
a 1.0 20
b 2.0 30
Or, using map
In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
other_data bar
key
a 10 1
a 20 1
b 30 2
I am new to pandas. I'm trying to sort a column within each group. So far, I was able to group the first and second column values together and calculate the mean of the third column. But I am still struggling to sort the third column within each group.
This is my input dataframe
This is my dataframe after applying the groupby and mean functions
I used the following line of code to group input dataframe,
df_o=df.groupby(by=['Organization Group','Department']).agg({'Total Compensation':np.mean})
Please let me know how to sort the last column within each group of the first column using pandas.
It seems you need sort_values:
# to return a DataFrame, add the parameter as_index=False
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group':['a','b','a','a'],
'Department':['d','f','a','a'],
'Total Compensation':[1,8,9,1]})
print (df)
Department Organization Group Total Compensation
0 d a 1
1 f b 8
2 a a 9
3 a a 1
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
print (df_o)
Organization Group Department Total Compensation
0 a a 5
1 a d 1
2 b f 8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
Organization Group Department Total Compensation
1 a d 1
0 a a 5
2 b f 8
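If the goal is to order rows by Total Compensation within each Organization Group (rather than primarily by compensation), a hedged variant simply reverses the sort keys:
df_o = df_o.sort_values(['Organization Group', 'Total Compensation'])   # sort compensation within each group
print(df_o)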
Is there a more efficient way to use pandas groupby, or a pandas.core.groupby.DataFrameGroupBy object, to create a unique list, Series or DataFrame where I want unique combinations of 2 of N columns? E.g., if I have the columns Date, Name, Item Purchased and I just want to know the unique Name and Date combinations, this works fine:
y = x.groupby(['Date','Name']).count()
y = y.reset_index()[['Date', 'Name']]
but I feel like there should be a cleaner way using
y = x.groupby(['Date','Name'])
but y.index gives me an error, although y.keys works. This actually leads me to the general question: what are pandas.core.groupby.DataFrameGroupBy objects convenient for?
Thanks!
You don't need to use -- and in fact shouldn't use -- groupby here. You could use drop_duplicates to get unique rows instead:
x.drop_duplicates(['Date','Name'])
Demo:
In [156]: x = pd.DataFrame({'Date':[0,1,2]*2, 'Name':list('ABC')*2})
In [158]: x
Out[158]:
Date Name
0 0 A
1 1 B
2 2 C
3 0 A
4 1 B
5 2 C
In [160]: x.drop_duplicates(['Date','Name'])
Out[160]:
Date Name
0 0 A
1 1 B
2 2 C
You shouldn't use groupby here, for a few reasons. x.groupby(['Date','Name']).count() counts the number of elements in each group, but that count is never used -- it's a wasted computation. x.groupby(['Date','Name']).count() also raises an AttributeError if x has only the Date and Name columns. And drop_duplicates is much, much faster for this purpose.
Use groupby when you want to perform some operation on each group, such as counting the number of elements in each group, or computing some statistic (e.g. a sum or mean, etc.) per group.
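For contrast, a small sketch of a case where groupby is the right tool, reusing the x frame from the demo above:
counts = x.groupby(['Date', 'Name']).size()   # the per-group count is actually used here
print(counts)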