I want to group my DataFrame on a categorical column, compute quantiles of a second column within each group, and store the result in a third column. For simplicity, let's just compute the P50 (median). Example below:
Original DF:
Col1 Col2
A 2
B 4
C 2
A 6
B 12
C 10
Desired DF:
Col1 Col2 Col3_P50
A 2 4
B 4 8
C 2 6
A 6 4
B 12 8
C 10 6
One easy way would be to create a small DataFrame for each category (A, B, C), compute the quantile, and merge it back into the existing DF, but my actual dataset has hundreds of categories, so this isn't an option. Any suggestions would be much appreciated!
You can use transform with quantile:
df['Col3_P50'] = df.groupby("Col1")['Col2'].transform('quantile',0.5)
print(df)
Col1 Col2 Col3_P50
0 A 2 4
1 B 4 8
2 C 2 6
3 A 6 4
4 B 12 8
5 C 10 6
If you need several quantiles, one way is to create a dictionary whose keys are the new column names and whose values are the quantiles, then loop over it with the same groupby:
d = {'P_50': 0.5, 'P_90': 0.9}
for k, v in d.items():
    df[k] = df.groupby("Col1")['Col2'].transform('quantile', v)
print(df)
Col1 Col2 P_50 P_90
0 A 2 4 5.6
1 B 4 8 11.2
2 C 2 6 9.2
3 A 6 4 5.6
4 B 12 8 11.2
5 C 10 6 9.2
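As an alternative to the loop, here is a minimal sketch that computes all quantiles in one groupby call and joins them back onto the original rows. The column names P_50 and P_90 are carried over from the example above; the intermediate name q and the result name out are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Col1": list("ABCABC"), "Col2": [2, 4, 2, 6, 12, 10]})

# One row per category, one column per requested quantile (0.5 and 0.9)
q = df.groupby("Col1")["Col2"].quantile([0.5, 0.9]).unstack()
q.columns = ["P_50", "P_90"]

# Broadcast the per-category quantiles back to every original row
out = df.join(q, on="Col1")
print(out)
```

This avoids one groupby pass per quantile, which may matter when there are many quantiles or many categories.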
Related
I have a data frame and need to convert all the columns into rows listing each column's unique values:
A B C
1 2 2
1 2 3
5 2 9
Desired output
X1 V1
A 1
A 5
B 2
C 2
C 3
C 9
I can get the unique values with the unique() function, but I don't know how to get the desired output in pandas.
You can use melt and drop_duplicates:
df.melt(var_name='X1', value_name='V1').drop_duplicates()
Output:
X1 V1
0 A 1
2 A 5
3 B 2
6 C 2
7 C 3
8 C 9
P.S. And you can add .reset_index(drop=True) if you want to have sequential integers for index
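For reference, a self-contained sketch of the same melt and drop_duplicates chain, with the sample frame rebuilt from the question and the optional reset_index applied. The variable name out is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 5], "B": [2, 2, 2], "C": [2, 3, 9]})

out = (
    df.melt(var_name="X1", value_name="V1")  # long format: one (column, value) pair per row
      .drop_duplicates()                     # keep each (column, value) pair once
      .reset_index(drop=True)                # sequential integer index
)
print(out)
```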
I saw a primitive version of this question here,
but my dataframe has different names and I want to calculate separately for each of them:
A B C
0 a 3 5
1 a 6 9
2 b 3 8
3 b 11 19
I want to group by A and then find the difference between alternate B and C values, something like this:
A B C dA
0 a 3 5 6
1 a 6 9 NaN
2 b 3 8 16
3 b 11 19 NaN
I tried:
df['dA']=df.groupby('A')(['C']-['B'])
df['dA']=df.groupby('A')['C']-df.groupby('A')['B']
Neither of them worked.
What mistake am I making?
IIUC, here is one way to perform the calculation:
# create the data frame
from io import StringIO
import pandas as pd
data = '''idx A B C
0 a 3 5
1 a 6 9
2 b 3 8
3 b 11 19
'''
df = pd.read_csv(StringIO(data), sep=r'\s+', engine='python').set_index('idx')
Now, compute dA. I take the last value of C minus the first value of B, grouped by A. (Is this right? Or is it max(C) minus min(B)?) If you're guaranteed to have the A values in pairs, then #BenT's shift() would be more concise.
dA = (
(df.groupby('A')['C'].transform('last') -
df.groupby('A')['B'].transform('first'))
.drop_duplicates()
.rename('dA'))
print(pd.concat([df, dA], axis=1))
A B C dA
idx
0 a 3 5 6.0
1 a 6 9 NaN
2 b 3 8 16.0
3 b 11 19 NaN
I used groupby().transform() to preserve index values, to support the concat operation.
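One caveat: drop_duplicates() on a Series drops repeated values regardless of group, so two groups that happen to share the same difference would leave the second group without a value. A minimal sketch of a variant that avoids this, assuming the result should always sit on the first row of each group, masks by row position within the group instead (the names g and diff are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabb"), "B": [3, 6, 3, 11], "C": [5, 9, 8, 19]})

g = df.groupby("A")
# last C minus first B, broadcast to every row of the group
diff = g["C"].transform("last") - g["B"].transform("first")
# keep the value only on the first row of each group, NaN elsewhere
df["dA"] = diff.where(g.cumcount().eq(0))
print(df)
```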
Since pandas can't work in multi-dimensions, I usually stack the data row-wise and use a dummy column to mark the data dimensions. Now, I need to divide one dimension by another.
For example, given this dataframe, where key defines the dimension:
index key value
0 a 10
1 b 12
2 a 20
3 b 15
4 a 8
5 b 9
I want to achieve this:
index key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.33333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
Is there a way to do it using groupby?
You don't really need (and should not use) groupby for this:
# keep only the b values and back-fill them onto the rows above
s = df['value'].where(df['key'].eq('b')).bfill()
# mask the a values and divide
# change to df['key'].ne('b') if you have many values of a
df['ratio'] = df['value'].where(df['key'].eq('a')).div(s)
Output:
index key value ratio
0 0 a 10 0.833333
1 1 b 12 NaN
2 2 a 20 1.333333
3 3 b 15 NaN
4 4 a 8 0.888889
5 5 b 9 NaN
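Using eq, cumsum and GroupBy.apply with shift.
We use .eq to get a boolean mask that is True where key is a, then cumsum to build a unique identifier for each a, b pair.
Then we group by that identifier and divide each value by the value one row below, using shift. Passing group_keys=False keeps the original index so the result can be assigned back as a column (pandas 2.0 and later otherwise prepend the group key to the index):
s = df['key'].eq('a').cumsum()
df['ratio_a_b'] = df.groupby(s, group_keys=False)['value'].apply(lambda x: x.div(x.shift(-1)))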
Output
key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
This is what s returns, our unique identifier for each a,b pair:
print(s)
0 1
1 1
2 2
3 2
4 3
5 3
Name: key, dtype: int32
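The same pairing idea can also be sketched without apply, using GroupBy.shift directly. This is a self-contained version under the same assumption that every a row is immediately followed by its b row; the sample frame is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame({"key": list("ababab"), "value": [10, 12, 20, 15, 8, 9]})

# Each a row starts a new pair, so the running count of a rows labels the pairs
s = df["key"].eq("a").cumsum()

# Within each pair, shift(-1) pulls the b value up onto the a row;
# the b row itself has nothing below it in the pair and gets NaN
df["ratio_a_b"] = df["value"] / df.groupby(s)["value"].shift(-1)
print(df)
```

Because GroupBy.shift returns a Series aligned to the original index, the result can be assigned as a column without any index handling.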
I have the following two dataframes. Please note that 'amt' is grouped by 'id' in both dataframes.
df1
id code amt
0 A 1 5
1 A 2 5
2 B 3 10
3 C 4 6
4 D 5 8
5 E 6 11
df2
id code amt
0 B 1 9
1 C 12 10
I want to add a row to df2 for every id in df1 that is not contained in df2. For example, since ids A, D and E are not in df2, I want to add a row for each of them. Each appended row should contain the missing id, a null value for code, and the amt value stored in df1.
The result should be something like this:
id code name
0 B 1 9
1 C 12 10
2 A nan 5
3 D nan 8
4 E nan 11
I would highly appreciate some guidance on this.
By using pd.concat:
df = df1.drop(columns='code').drop_duplicates()
df[~df.id.isin(df2.id)]
pd.concat([df2,df[~df.id.isin(df2.id)]],axis=0).rename(columns={'amt':'name'}).reset_index(drop=True)
Out[481]:
name code id
0 9 1.0 B
1 10 12.0 C
2 5 NaN A
3 8 NaN D
4 11 NaN E
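Drop dups from df1, concatenate df2, drop every id that now appears twice, then concatenate the remainder onto df2. (DataFrame.append was removed in pandas 2.0, so pd.concat is used here.)
pd.concat([
    df2,
    pd.concat([df1.drop_duplicates('id'), df2])
      .drop_duplicates('id', keep=False).assign(code=np.nan),
], ignore_index=True)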
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
Slight variation
m = ~np.isin(df1.id.values, df2.id.values)  # ids of df1 that are not in df2
d = ~df1.duplicated('id').values            # first occurrence of each id in df1
pd.concat([df2, df1[m & d].assign(code=np.nan)], ignore_index=True)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
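For completeness, a self-contained sketch of the same idea that runs on current pandas, with both sample frames rebuilt from the question. The names missing and out are only illustrative, and the amt column is left unrenamed.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"id": list("AABCDE"),
                    "code": [1, 2, 3, 4, 5, 6],
                    "amt": [5, 5, 10, 6, 8, 11]})
df2 = pd.DataFrame({"id": ["B", "C"], "code": [1, 12], "amt": [9, 10]})

# Rows of df1 whose id is absent from df2, one row per id, with code blanked out
missing = (
    df1.loc[~df1["id"].isin(df2["id"])]
       .drop_duplicates("id")
       .assign(code=np.nan)
)

out = pd.concat([df2, missing], ignore_index=True)
print(out)
```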
I am dealing with a data frame with 6 columns. Below is an example df:
a b c d e f
1 2 3 4 5 6
7 8 9 10 11 12
Following is the new data frame which I expect:
col1 col2 col3
1 2 3
4 5 6
7 8 9
10 11 12
Please note the order of the row elements: the first row of the original df becomes the first two rows of the new df, and the second row of the original df becomes the next two.
Please advise me on how to achieve the required new df.
Thanks in advance.
You can reshape the values (a NumPy array) into 3 columns and construct a new data frame from the result:
pd.DataFrame(df.values.reshape(-1, 3), columns=["Col"+str(i) for i in range(1,4)])
#Col1 Col2 Col3
#0 1 2 3
#1 4 5 6
#2 7 8 9
#3 10 11 12
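A runnable sketch of the reshape, with the sample frame rebuilt from the question. It assumes the number of columns is a multiple of 3 and uses the lowercase column names from the desired output; to_numpy() is used here in place of .values.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [7, 8, 9, 10, 11, 12]],
                  columns=list("abcdef"))

# reshape(-1, 3) reads the values row by row, so each original row
# is split into consecutive chunks of 3
out = pd.DataFrame(df.to_numpy().reshape(-1, 3),
                   columns=["col1", "col2", "col3"])
print(out)
```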