Python: Group and count unique variables based on multiple grouping without recounting

I have a Dataframe with 3 columns:
ID Round Investor
X 1 A
X 2 A
X 2 B
X 3 A
Y 1 A
Y 1 B
Y 1 C
Y 2 B
Y 2 D
I want to count the number of unique investors for each round for each ID, but I don't want it to recount an investor who has already appeared in a previous round. The code I am using is:
print(df.groupby(['ID', 'Round'])['Investor'].nunique())
Which results in:
ID Round Unique Investor
X 1 1
2 2
2 2
3 1
Y 1 3
1 3
1 3
2 2
2 2
But I don't want it to count an investor who has already invested in an earlier round for the same ID:
ID Round Unique Investor
X 1 1
2 1
2 1
3 0
Y 1 3
1 3
1 3
2 1
2 1
Any help is greatly appreciated!

You can define a helper column Investor2 by grouping on ID and dropping duplicate investors within each ID with Series.drop_duplicates.
Then group by ID and Round as you did before, but on this Investor2 column, with .transform() and nunique, as follows:
df['Unique Investor'] = (
    df.assign(Investor2=df.groupby('ID')['Investor']
                          .apply(pd.Series.drop_duplicates)
                          .droplevel(0))
      .groupby(['ID', 'Round'])['Investor2']
      .transform('nunique')
)
Result:
print(df)
ID Round Investor Unique Investor
0 X 1 A 1
1 X 2 A 1
2 X 2 B 1
3 X 3 A 0
4 Y 1 A 3
5 Y 1 B 3
6 Y 1 C 3
7 Y 2 B 1
8 Y 2 D 1
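For reference, a shorter route to the same per-row counts is to flag each investor's first appearance within an ID using duplicated and then sum those flags per (ID, Round). This is a sketch of an alternative, not the answer above, with the example frame rebuilt inline so it runs on its own:
import pandas as pd

# Example data from the question
df = pd.DataFrame({'ID':       ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
                   'Round':    [1, 2, 2, 3, 1, 1, 1, 2, 2],
                   'Investor': ['A', 'A', 'B', 'A', 'A', 'B', 'C', 'B', 'D']})

# True only the first time an investor appears for a given ID
first_time = ~df.duplicated(subset=['ID', 'Investor'])

# Sum those flags within each (ID, Round) and broadcast the count back to every row
df['Unique Investor'] = first_time.groupby([df['ID'], df['Round']]).transform('sum')
print(df)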

One possible solution is to drop the duplicates based on ID and Investor, group by ID and Round to get the number of unique investors, and merge the result back onto the main dataframe:
dups = ['ID', 'Investor']
group = ['ID', 'Round']
mapping = (df.drop_duplicates(subset=dups)
             .groupby(group)
             .Investor
             .nunique())

(df.filter(group)
   .merge(mapping, left_on=group, right_index=True, how='left')
   .fillna(0, downcast='infer'))
ID Round Investor
0 X 1 1
1 X 2 1
2 X 2 1
3 X 3 0
4 Y 1 3
5 Y 1 3
6 Y 1 3
7 Y 2 1
8 Y 2 1

Related

How do I subtract the first value in each group from every other value in the group to calculate change over time?

In R I can calculate the change over time for each group in a data set like this:
df %>%
group_by(z) %>%
mutate(diff = y - y[x == 0])
What is the equivalent in pandas?
I know that using pandas you can subtract the first value of a column like this:
df['diff'] = df.y-df.y.iloc[0]
But how do you group by variable z?
Example data:
x y z
0 2 A
5 4 A
10 6 A
0 1 B
5 3 B
10 9 B
Expected output:
x y z diff
0 2 A 0
5 4 A 2
10 6 A 4
0 1 B 0
5 5 B 4
10 9 B 8
You can try this.
temp = (df.groupby('z')
          .apply(lambda g: g.y - g.y.iloc[0])
          .reset_index()
          .rename(columns={'y': 'diff'})
          .drop('z', axis=1))

df.merge(temp, how='inner', left_index=True, right_on='level_1').drop('level_1', axis=1)
Return:
x y z diff
0 2 A 0
5 4 A 2
10 6 A 4
0 1 B 0
5 5 B 4
10 9 B 8
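A simpler pandas equivalent, offered as a sketch alongside the approach above, is to subtract each group's first y with transform. It assumes, as in the example data, that the x == 0 row is the first row of every z group:
import pandas as pd

# Example data from the question
df = pd.DataFrame({'x': [0, 5, 10, 0, 5, 10],
                   'y': [2, 4, 6, 1, 3, 9],
                   'z': ['A', 'A', 'A', 'B', 'B', 'B']})

# Subtract each z group's first y (the x == 0 row here) from every y in that group
df['diff'] = df['y'] - df.groupby('z')['y'].transform('first')
print(df)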

Groupby selected rows by a condition on a column value and then transform another column

This seems easy, but I couldn't find a working solution for it:
I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0, 0, 2, 2, 2],
                   'B': [1, 1, 2, 2, 3],
                   'C': [1, 1, 2, 3, 4]})
A B C
0 0 1 1
1 0 1 1
2 2 2 2
3 2 2 3
4 2 3 4
I want to select rows based on the values of column A, then group by the values of column B, and finally transform the values of column C into their sum, something along the lines of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
desired output for above example is:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
I also know how to split the dataframe and do it that way. I am looking for a solution that does it without split+concat/merge. Thank you.
Is it just
s = df['A'].isin([2])

pd.concat((df[s].groupby(['A', 'B'])['C'].sum().reset_index(),
           df[~s]))
Output:
A B C
0 2 2 5
1 2 3 4
0 0 1 1
1 0 1 1
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
   .assign(D=(~df['A'].isin([2])).cumsum())
   .groupby(['D', 'A', 'B'])['C'].sum()
   .reset_index('D', drop=True)
   .reset_index())
Output:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
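If you prefer to stay close to the loc-based assignment attempted in the question, a possible sketch (assuming, as in the desired output, that duplicate (A, B) pairs among the selected rows should collapse to one row) is to transform in place and then drop the redundant rows:
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 2, 2, 2],
                   'B': [1, 1, 2, 2, 3],
                   'C': [1, 1, 2, 3, 4]})

# Sum C within B, but only for the rows where A == 2
mask = df['A'].eq(2)
df.loc[mask, 'C'] = df.loc[mask].groupby('B')['C'].transform('sum')

# Keep all non-selected rows, and only the first of each (A, B) pair among the selected rows
out = df[~(mask & df.duplicated(subset=['A', 'B']))].reset_index(drop=True)
print(out)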

dataframe pivot with pandas

I am trying to pivot df111 into df222:
ID1 ID2 Type Value
0 1 a X 1
1 1 a Y 2
2 1 b X 3
3 1 b Y 4
4 2 a X 5
5 2 a Y 6
6 2 b X 7
7 2 b Y 8
ID1 ID2 X Value Y Value
0 1 a 1 2
1 1 b 3 4
2 2 a 5 6
3 2 b 7 8
I tried with df111.pivot() and df111.groupby() but no luck. Can someone throw me a one-liner? Thanks
You can do it by first calling set_index on the first three columns and then unstack. To match the exact output, rename the columns keeping only the second level, and reset_index, as follows:
df222 = df111.set_index(['ID1', 'ID2','Type']).unstack()
df222.columns = [col[1] + ' Value' for col in df222.columns]
df222 = df222.reset_index()
print (df222)
ID1 ID2 X Value Y Value
0 1 a 1 2
1 1 b 3 4
2 2 a 5 6
3 2 b 7 8
and if you want to do it with method chaining:
df222 = (df111.set_index(['ID1', 'ID2', 'Type']).Value.unstack()
              .rename(columns={'X': 'X Value', 'Y': 'Y Value'})
              .rename_axis(None, axis="columns")
              .reset_index())
If there is a pivot_table function, why the hell does pandas also provide pivot? This is just confusing...
df333 = pd.pivot_table(df111, index=['ID1','ID2'], columns=['Type'], values='Value')
df333.reset_index()
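To make the pivot_table result match the 'X Value' / 'Y Value' headers of df222, one possible continuation (not part of the original answer) is to add a suffix and flatten the columns:
df333 = (pd.pivot_table(df111, index=['ID1', 'ID2'], columns='Type', values='Value')
           .add_suffix(' Value')              # X -> 'X Value', Y -> 'Y Value'
           .rename_axis(None, axis='columns')
           .reset_index())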
df222 = (df111.set_index(['ID1', 'ID2', 'Type']).unstack()
              .add_suffix(' Value'))
df222.columns = [lev[1] for lev in df222.columns]
df222.reset_index(inplace=True)

pandas dataframe apply using additional arguments

With the example below:
df = pd.DataFrame({'signal':  [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
                   'product': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'price':   [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7],
                   'price2':  [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]})
I have a function "fill_price" to create a new column 'Price_B' based on 'signal' and 'price'. For every 'product' subgroup, Price_B equals price if 'signal' is 1, and equals the previous row's Price_B if 'signal' is 0. If the subgroup starts with a 'signal' of 0, 'Price_B' is kept at 0 until 'signal' turns 1.
Currently I have:
def fill_price(df, signal, price_A):
    p = df[price_A].where(df[signal] == 1)
    return p.ffill().fillna(0).astype(df[price_A].dtype)
this is then applied using:
df['Price_B'] = fill_price(df,'signal','price')
However, I want to use df.groupby('product').apply() to apply this fill_price function to the two 'product' subgroups separately, and also apply it to both the 'price' and 'price2' columns. Could someone help with that?
I basically want to do:
df.groupby('product', group_keys=False).apply(fill_price, 'signal', 'price2')
IIUC, you can use this syntax:
df['Price_B'] = df.groupby('product').apply(lambda x: fill_price(x,'signal','price2')).reset_index(level=0, drop=True)
Output:
price price2 product signal Price_B
0 1 1 A 1 1
1 2 2 A 0 1
2 3 1 A 0 1
3 4 2 A 1 2
4 5 1 A 0 2
5 6 2 A 0 2
6 7 1 A 0 2
7 1 2 B 0 0
8 2 1 B 1 1
9 3 2 B 0 1
10 4 1 B 0 1
11 5 2 B 1 2
12 6 1 B 0 2
13 7 2 B 0 2
You can write this much more simply without the extra function.
df['Price_B'] = (df.groupby('product', as_index=False)
                   .apply(lambda x: x['price2'].where(x.signal == 1).ffill().fillna(0))
                   .reset_index(level=0, drop=True))
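If you also need to run it over both price columns, one hedged sketch is to loop over them and reuse fill_price from the question on each product group; the output column names below are made up for illustration:
# Assumes df and fill_price are defined as in the question
for col, new_col in [('price', 'Price_B_from_price'), ('price2', 'Price_B_from_price2')]:
    df[new_col] = (df.groupby('product', group_keys=False)
                     .apply(lambda g, c=col: fill_price(g, 'signal', c)))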

Update in pandas on specific columns

I want to update values in one pandas dataframe based on the values in another dataframe, but I want to specify which column to update by (i.e., which column should be the "key" for looking up matching rows). Right now it seems to treat the first column as the key. Is there a way to pass it a specific column name?
Example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame()
df_a['x'] = range(5)
df_a['y'] = range(4, -1, -1)
df_a['z'] = np.random.rand(5)
df_b = pd.DataFrame()
df_b['x'] = range(5)
df_b['y'] = range(5)
df_b['z'] = range(5)
print('df_b:')
print(df_b.head())
print('\nold df_a:')
print(df_a.head(10))
df_a.update(df_b)
print('\nnew df_a:')
print(df_a.head())
Out:
df_b:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
old df_a:
x y z
0 0 4 0.333648
1 1 3 0.683656
2 2 2 0.605688
3 3 1 0.816556
4 4 0 0.360798
new df_a:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
You see, what it did was replace y and z in df_a with the respective columns from df_b, based on matches of x between df_a and df_b.
What if I wanted to keep y the same? What if I want it to replace based on y and not x? Also, what if there are multiple columns on which I'd like to do the replacement (in the real problem, I have to update a dataset with a new dataset, where there is a match in two or three columns between the two, using the values from a fourth column)?
Basically, I want to do some sort of a merge-replace action, where I specify which columns I am merging/replacing on and which column should be replaced.
Hope this makes things clearer. If this cannot be accomplished with update in pandas, I am wondering if there is another way (short of writing a separate function with for loops for it).
This is my current solution, but it seems somewhat inelegant:
df_merge = df_a.merge(df_b, on='y', how='left', suffixes=('_a', '_b'))
print(df_merge.head())
df_merge['x'] = df_merge.x_b
df_merge['z'] = df_merge.z_b
df_update = df_a.copy()
df_update.update(df_merge)
print(df_update)
Out:
x_a y z_a x_b z_b
0 0 0 0.505949 0 0
1 1 1 0.231265 1 1
2 2 2 0.241109 2 2
3 3 3 0.579765 NaN NaN
4 4 4 0.172409 NaN NaN
x y z
0 0 0 0.000000
1 1 1 1.000000
2 2 2 2.000000
3 3 3 0.579765
4 4 4 0.172409
5 5 5 0.893562
6 6 6 0.638034
7 7 7 0.940911
8 8 8 0.998453
9 9 9 0.965866
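Since DataFrame.update aligns on the index, one way to choose the key column explicitly is to set that column as the index on both frames before updating. A sketch, assuming you want to match on 'y' and replace only 'z':
# Match rows on 'y' and replace only 'z'; rows of df_a with no match in df_b stay unchanged
df_a_keyed = df_a.set_index('y')
df_a_keyed.update(df_b.set_index('y')[['z']])
df_a = df_a_keyed.reset_index()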
