I'm trying to find a way to do a nested ranking (row number) in Python that is equivalent to the following in T-SQL:
I have a table that looks like this:
import pandas as pd

data = {
    'col1': [11, 11, 11, 22, 22, 33, 33],
    'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b']
}
df = pd.DataFrame(data)
# col1 col2
# 11 a
# 11 b
# 11 c
# 22 a
# 22 b
# 33 a
# 33 b
Looking for Python equivalent to:
SELECT
col1
,col2
,row_number() over(Partition by col1 order by col2) as rnk
FROM df
group by col1, col2
The output to be:
# col1 col2 rnk
# 11 a 1
# 11 b 2
# 11 c 3
# 22 a 1
# 22 b 2
# 33 a 1
# 33 b 2
I've tried rank() and groupby(), but I keep running into a "No numeric types to aggregate" error. Is there a way to rank non-numeric columns and give them row numbers?
Use cumcount():
df['rnk'] = df.groupby('col1')['col2'].cumcount() + 1
col1 col2 rnk
0 11 a 1
1 11 b 2
2 11 c 3
3 22 a 1
4 22 b 2
5 33 a 1
6 33 b 2
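Note that cumcount() numbers rows in their existing order, while the SQL ROW_NUMBER() orders by col2 within each col1 partition. The sample data happens to be sorted already, so the results agree; for unsorted data, a minimal sketch that sorts first:
# Sort to replicate "partition by col1 order by col2",
# then number rows within each partition:
df = df.sort_values(['col1', 'col2'])
df['rnk'] = df.groupby('col1').cumcount() + 1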
What I'm looking to do is group my DataFrame on a categorical column, compute quantiles using a second column, and store the result in a third column. For simplicity, let's just do the P50. Example below:
Original DF:
Col1 Col2
A 2
B 4
C 2
A 6
B 12
C 10
Desired DF:
Col1 Col2 Col3_P50
A 2 4
B 4 8
C 2 6
A 6 4
B 12 8
C 10 6
One easy way would be to create a small dataframe for each category (A, B, C), compute the quantile, and merge back to the existing DF, but my actual dataset has hundreds of categories, so this isn't an option. Any suggestions would be much appreciated!
You can use transform with quantile:
df['Col3_P50'] = df.groupby("Col1")['Col2'].transform('quantile', 0.5)
print(df)
Col1 Col2 Col3_P50
0 A 2 4
1 B 4 8
2 C 2 6
3 A 6 4
4 B 12 8
5 C 10 6
If you have multiple quantiles, one way is to create a dictionary with the new column names as keys and the quantiles as values, then loop over it:
d = {'P_50': 0.5, 'P_90': 0.9}
for k, v in d.items():
    df[k] = df.groupby("Col1")['Col2'].transform('quantile', v)
print(df)
Col1 Col2 P_50 P_90
0 A 2 4 5.6
1 B 4 8 11.2
2 C 2 6 9.2
3 A 6 4 5.6
4 B 12 8 11.2
5 C 10 6 9.2
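If the repeated transform calls become slow, a single-pass alternative (starting from the original frame, before the loop above) is to aggregate all the quantiles at once and join the result back; a sketch using the same P_50/P_90 names:
# One groupby pass computing both quantiles, broadcast back via join:
q = df.groupby('Col1')['Col2'].agg(P_50=lambda s: s.quantile(0.5),
                                   P_90=lambda s: s.quantile(0.9))
df = df.join(q, on='Col1')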
I have a data frame like this,
df
col1 col2 col3
1 2 3
2 5 6
7 8 9
10 11 12
11 12 13
13 14 15
14 15 16
Now I want to create multiple data frames from the above whenever the col1 difference between two consecutive rows is more than 1.
So the result data frames will look like,
df1
col1 col2 col3
1 2 3
2 5 6
df2
col1 col2 col3
7 8 9
df3
col1 col2 col3
10 11 12
11 12 13
df4
col1 col2 col3
13 14 15
14 15 16
I can do this using a for loop and storing the indices, but that will increase execution time. I'm looking for some pandas shortcuts or a Pythonic way to do this most efficiently.
You could define a custom grouper by taking the diff, checking where it is greater than 1, and taking the cumsum of the resulting boolean series. Then group by the result and build a dictionary from the groupby object:
d = dict(tuple(df.groupby(df.col1.diff().gt(1).cumsum())))
print(d[0])
col1 col2 col3
0 1 2 3
1 2 5 6
print(d[1])
col1 col2 col3
2 7 8 9
A more detailed break-down:
df.assign(difference=(diff := df.col1.diff()),
          condition=(gt1 := diff.gt(1)),
          grouper=gt1.cumsum())
col1 col2 col3 difference condition grouper
0 1 2 3 NaN False 0
1 2 5 6 1.0 False 0
2 7 8 9 5.0 True 1
3 10 11 12 3.0 True 2
4 11 12 13 1.0 False 2
5 13 14 15 2.0 True 3
6 14 15 16 1.0 False 3
You can also peel off the target column and work with it as a Series, rather than as in the answer above. That keeps everything smaller. It runs faster on this example, but I don't know how the two approaches will scale, depending on how many times you're splitting.
import numpy as np

row_bool = df['col1'].diff() > 1
split_inds, = np.where(row_bool)
# prepend 0 and append len(df) so the slices cover the whole frame
split_inds = np.insert(arr=split_inds, obj=[0, len(split_inds)], values=[0, len(df)])

df_list = []
for n in range(len(split_inds) - 1):
    tempdf = df.iloc[split_inds[n]:split_inds[n + 1], :]
    df_list.append(tempdf)
(Just collecting the pieces in a list of dataframes afterward, but the dictionary approach might be better?)
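For comparison, the groupby used in the dictionary answer above can just as easily yield a plain list, which avoids the index bookkeeping entirely; a sketch with the same grouper:
# Same boolean-cumsum grouper, collected into a list of frames:
grouper = df['col1'].diff().gt(1).cumsum()
pieces = [group for _, group in df.groupby(grouper)]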
I have a data frame like this,
col1 col2 col3
1 2 3
2 3 4
4 2 3
7 2 8
8 3 4
9 3 3
15 1 12
Now I want to group the rows where the difference between two consecutive col1 values is less than 3, sum the other columns' values, and create another column (col4) holding the last col1 value of each group.
So the final data frame will look like,
col1 col2 col3 col4
1 7 10 4
7 8 15 9
Using a for loop to do this is tedious; I'm looking for some pandas shortcuts to do it most efficiently.
You can do a named aggregation on groupby:
(df.groupby(df.col1.diff().ge(3).cumsum(), as_index=False)
   .agg(col1=('col1', 'first'),
        col2=('col2', 'sum'),
        col3=('col3', 'sum'),
        col4=('col1', 'last'))
)
Output:
col1 col2 col3 col4
0 1 7 10 4
1 7 8 15 9
2 15 1 12 15
Update: without named aggregation you can do something like this:
groups = df.groupby(df.col1.diff().ge(3).cumsum())
new_df = groups.agg({'col1':'first', 'col2':'sum','col3':'sum'})
new_df['col4'] = groups['col1'].last()
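Since as_index=False isn't passed here, the cumsum grouper labels end up as the index; to match the output of the named aggregation above, drop them:
# Replace the grouper labels with a default RangeIndex:
new_df = new_df.reset_index(drop=True)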
I've been reading through pd.stack, pd.unstack and pd.pivot, but I can't wrap my head around getting what I want done.
Given a dataframe as follows
id1 id2 id3 vals vals1
0 1 a -1 10 20
1 1 a -2 11 21
2 1 a -3 12 22
3 1 a -4 13 23
4 1 b -1 14 24
5 1 b -2 15 25
6 1 b -3 16 26
7 1 b -4 17 27
I'd like to get the following result
id1 id2 -1_vals -2_vals ... -1_vals1 -2_vals1 -3_vals1 -4_vals1
0 1 a 10 11 ... 20 21 22 23
1 1 b 14 15 ... 24 25 26 27
It's kind of a groupby with a pivot: the column id3 is spread across the columns, where each new column name is the concatenation of the id3 value and the original column name.
EDIT: It is guaranteed that id3 will be unique per id1 + id2 pair, but some id1 + id2 groups will have different id3 values; in that case it is OK to put NaNs there.
Use DataFrame.set_index with DataFrame.unstack, then DataFrame.sort_index to order the MultiIndex in the columns, and finally flatten the column names with a list comprehension using f-strings:
df1 = (df.set_index(['id1','id2','id3'])
         .unstack()
         .sort_index(level=[0,1], ascending=[True, False], axis=1))
# Python 3.6+
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
# Python below 3.6
# df1.columns = ['{}_{}'.format(b, a) for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
id1 id2 -1_vals -2_vals -3_vals -4_vals -1_vals1 -2_vals1 -3_vals1 \
0 1 a 10 11 12 13 20 21 22
1 1 b 14 15 16 17 24 25 26
-4_vals1
0 23
1 27
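An equivalent route, if pivoting feels more natural, is pivot_table; a sketch that relies on id3 being unique per id1 + id2 pair, as the question guarantees (the column order may differ from the sort_index version above):
# 'first' is safe here because each (id1, id2, id3) combination is unique:
df2 = df.pivot_table(index=['id1', 'id2'], columns='id3',
                     values=['vals', 'vals1'], aggfunc='first')
df2.columns = [f'{i3}_{name}' for name, i3 in df2.columns]
df2 = df2.reset_index()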
I have a dataframe which has 4 columns and 4 rows. I need to reshape it into 2 columns and 4 rows, where the 2 new columns are the result of adding col1 + col3 and col2 + col4. I do not wish to create any other object in memory for it.
I am trying:
df['A','B'] = df['A']+df['C'],df['B']+df['D']
Can it be achieved by using the drop function only? Is there any other, simpler method for this?
The dynamic way of summing two columns at a time is to use groupby:
df.groupby(np.arange(len(df.columns)) % 2, axis=1).sum()
Out[11]:
0 1
0 2 4
1 10 12
2 18 20
3 26 28
You can use rename afterwards if you want to change the column names, but that would require some logic.
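Note that groupby(..., axis=1) is deprecated in recent pandas; transposing is the future-proof equivalent, and it also gives a natural spot to restore the names (a sketch, reusing the ABCD frame defined in the next answer):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))
# Labels [0, 1, 0, 1] pair column A with C and B with D:
out = df.T.groupby(np.arange(len(df.columns)) % 2).sum().T
out.columns = df.columns[:2]  # rename 0/1 back to A/B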
Consider the sample dataframe df
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
One line of code
pd.DataFrame(
df.values.reshape(4, 2, 2).transpose(0, 2, 1).sum(2),
columns=df.columns[:2]
)
A B
0 2 4
1 10 12
2 18 20
3 26 28
Another line of code
df.iloc[:, :2] + df.iloc[:, 2:4].values
A B
0 2 4
1 10 12
2 18 20
3 26 28
Yet another
df.assign(A=df.A + df.C, B=df.B + df.D).drop(['C', 'D'], axis=1)
A B
0 2 4
1 10 12
2 18 20
3 26 28
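The iloc pattern above generalizes to any even number of columns; a minimal sketch, assuming the right half of the frame is to be added onto the left half:
# Split the columns in half and add positionally (.values drops the
# column labels so the addition aligns by position, not by name):
n = df.shape[1] // 2
out = df.iloc[:, :n] + df.iloc[:, n:].values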
This works for me (note that drop returns a new frame, so assign it back):
df['A'], df['B'] = df['A'] + df['C'], df['B'] + df['D']
df = df.drop(['C','D'], axis=1)