I am trying to rank a large dataset using Python. I do not want duplicate ranks, and rather than using the 'first' method, I would like ties to be broken by the value in another column.
It should only look at the second column if the rank in the first column has duplicates.
Name CountA CountB
Alpha 15 3
Beta 20 52
Delta 20 31
Gamma 45 43
I would like the ranking to end up like this:
Name CountA CountB Rank
Alpha 15 3 4
Beta 20 52 2
Delta 20 31 3
Gamma 45 43 1
Currently, I am using df.rank(ascending=False, method='first')
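For context, here is a minimal sketch of what that produces on the sample data (assuming the rank is taken on CountA). method='first' breaks the Beta/Delta tie by row position, which only happens to match the desired CountB order here because Beta appears first:
import pandas as pd
df = pd.DataFrame({'Name': ['Alpha', 'Beta', 'Delta', 'Gamma'],
                   'CountA': [15, 20, 20, 45],
                   'CountB': [3, 52, 31, 43]})
# ties in CountA are broken by row order, not by CountB
df['Rank'] = df['CountA'].rank(ascending=False, method='first').astype(int)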
Maybe use sort and pull out the index:
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'CountA': [15, 20, 20, 45], 'CountB': [3, 52, 31, 43]})
# argsort of the sorted index gives each row's position in the sorted order
# (assumes the default RangeIndex); a plain .index + 1 is only correct when
# that permutation happens to be its own inverse, as it is in this example
df['rank'] = df.sort_values(['CountA', 'CountB'], ascending=False).index.argsort() + 1
Name CountA CountB rank
0 A 15 3 4
1 B 20 52 2
2 C 20 31 3
3 D 45 43 1
You can take the value counts of CountA and build a single sort key: where a CountA value occurs more than once, take CountB, otherwise take CountA. Note that this replaces the sort key outright rather than breaking ties within CountA, so the resulting order below (52, 45, 31, 15) differs from the expected output above, where Gamma (45) should rank first.
import pandas as pd
df = pd.DataFrame([[15, 3], [20, 52], [20, 31], [45, 43]], columns=['CountA', 'CountB'])
colAcount = df['CountA'].value_counts()
# take the CountA values that occur more than once and use them in a where:
# keep CountA unless it is duplicated, in which case fall back to CountB
df['final'] = df['CountA'].where(~df['CountA'].isin(colAcount[colAcount > 1].index), df['CountB'])
df = df.sort_values(by='final', ascending=False).reset_index(drop=True)
# the rank is the zero-based index plus one
CountA CountB final
0 20 52 52
1 45 43 45
2 20 31 31
3 15 3 15
Here's my data -
ID,Pay1,Pay2,Pay3,Low,High,expected_output
1,12,21,23,1,2,21
2,21,34,54,1,3,54
3,74,56,76,1,1,74
The goal is to calculate, for each row, the max of the Pay columns whose positions fall between the values in the Low and High columns.
For example, for row 1, calculate the max of the Pay1 and Pay2 columns, as Low and High are 1 and 2.
I have tried building a dynamic expression string and running it through the eval function, but it performs poorly.
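For reference, the sample data can be loaded like this (a sketch):
import io
import pandas as pd
csv = """ID,Pay1,Pay2,Pay3,Low,High,expected_output
1,12,21,23,1,2,21
2,21,34,54,1,3,54
3,74,56,76,1,1,74"""
df = pd.read_csv(io.StringIO(csv))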
The idea is to filter only the Pay columns, then use numpy broadcasting to build a boolean mask selecting the columns between Low and High, pass it to DataFrame.where, and finally take the row-wise max:
import numpy as np

df1 = df.filter(like='Pay')
# mask of column positions >= Low - 1 ...
m1 = np.arange(len(df1.columns)) >= df['Low'].to_numpy()[:, None] - 1
# ... and <= High - 1
m2 = np.arange(len(df1.columns)) <= df['High'].to_numpy()[:, None] - 1
# zero out everything outside the Low..High window, then take the max per row
df['expected_output'] = df1.where(m1 & m2, 0).max(axis=1)
print(df)
ID Pay1 Pay2 Pay3 Low High expected_output
0 1 12 21 23 1 2 21
1 2 21 34 54 1 3 54
2 3 74 56 76 1 1 74
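To see which cells the combined mask keeps for the sample rows, a quick check:
print(m1 & m2)
# [[ True  True False]   row 1: Pay1..Pay2 (Low=1, High=2)
#  [ True  True  True]   row 2: Pay1..Pay3 (Low=1, High=3)
#  [ True False False]]  row 3: Pay1 only  (Low=1, High=1)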
An alternative; I expect @jezrael's solution to be faster, as it stays within numpy and pd.wide_to_long is not particularly fast:
grouping = (
pd.wide_to_long(df.filter(regex="^Pay|Low|High"),
i=["Low", "High"],
stubnames="Pay",
j="num")
    .query("Low <= num <= High")  # keep the whole Low..High range, not just the endpoints
.groupby(level=["Low", "High"])
.Pay.max()
)
grouping
Low High
1 1 74
2 21
3 54
Name: Pay, dtype: int64
df.join(grouping.rename("expected_output"), on=["Low", "High"])
ID Pay1 Pay2 Pay3 Low High expected_output
0 1 12 21 23 1 2 21
1 2 21 34 54 1 3 54
2 3 74 56 76 1 1 74
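For comparison, a plain row-wise apply is far slower on large data but easy to read (a sketch; it builds the Pay labels from Low and High):
df['expected_output'] = df.apply(
    lambda r: max(r[f'Pay{i}'] for i in range(r['Low'], r['High'] + 1)),
    axis=1)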
I have this dataframe :
id start end
1 1 2
1 13 27
1 30 35
1 36 40
2 2 5
2 8 10
2 25 30
I want to group by id and aggregate rows where the difference between the end of row n-1 and the start of row n is less than 10, for example. I already found a way using a loop, but it is far too slow with over a million rows.
So the expected outcome would be :
id start end
1 1 2
1 13 40
2 2 10
2 25 30
First, I can get the required difference by using df['diff'] = df['start'].shift(-1) - df['end']. How can I gather the rows to aggregate, based on this condition, within each id?
Thanks!
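For reference, the sample frame can be built like this (a sketch):
import pandas as pd
df = pd.DataFrame({'id':    [1, 1, 1, 1, 2, 2, 2],
                   'start': [1, 13, 30, 36, 2, 8, 25],
                   'end':   [2, 27, 35, 40, 5, 10, 30]})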
I believe you can create the groups by subtracting the per-id shifted end (DataFrameGroupBy.shift), flagging gaps greater than 10, taking the cumulative sum, and passing the result to GroupBy.agg:
# True where the gap to the previous interval within the same id exceeds 10;
# the cumulative sum then gives each run of close intervals its own label
g = df['start'].sub(df.groupby('id')['end'].shift()).gt(10).cumsum()
df = (df.groupby(['id', g])
        .agg({'start': 'first', 'end': 'last'})
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
id start end
0 1 1 2
1 1 13 40
2 2 2 10
3 2 25 30
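To see why this works, here is the intermediate grouper for the sample data (values worked out by hand):
print(g)
# 0    0    id 1: no previous interval, gt(10) is False
# 1    1    gap 13 - 2 = 11 > 10, new group
# 2    1    gap 30 - 27 = 3, same group
# 3    1    gap 36 - 35 = 1, same group
# 4    1    id 2: shifted end is NaN, comparison is False, label carries over
# 5    1    gap 8 - 5 = 3, same group
# 6    2    gap 25 - 10 = 15 > 10, new group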
I have the dataset below. I want to calculate the aggregated sum of "notes" for each school, except school "B", where I want it to be zero or missing.
student school notes nbr_of_student_per_school
1 A 12 45
1 A 13 45
2 A 10 45
3 B 13 -
4 C 16 46
5 A 10 45
6 C 20 46
7 C 10 46
8 B 11 -
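For reference, the sample data can be reconstructed like this (a sketch; the '-' entries are treated as missing):
import numpy as np
import pandas as pd
df = pd.DataFrame({'student': [1, 1, 2, 3, 4, 5, 6, 7, 8],
                   'school':  ['A', 'A', 'A', 'B', 'C', 'A', 'C', 'C', 'B'],
                   'notes':   [12, 13, 10, 13, 16, 10, 20, 10, 11],
                   'nbr_of_student_per_school': [45, 45, 45, np.nan, 46, 45, 46, 46, np.nan]})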
This is what I have tried so far:
df.groupby(['school'])['notes'].sum()
Try this:
df.query('school != "B"').groupby('school')['notes'].sum()
This selects only the subset of the dataframe where the school is not "B". Note that school B is then missing from the result entirely rather than reported as zero or NaN; the edit below keeps it as missing instead.
EDIT:
Another approach re: comments:
import numpy as np

# calculate the per-school sum and broadcast it back to every row
df['new_col'] = df.groupby('school')['notes'].transform('sum')
# now set school B's sum to missing
df.loc[df['school'] == 'B', 'new_col'] = np.nan
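With the sample data above, this gives (worked out by hand):
print(df[['school', 'notes', 'new_col']])
# school A rows get 45 (12 + 13 + 10 + 10), school C rows get 46 (16 + 20 + 10),
# and school B rows get NaN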
I have a dataframe that uses MultiIndex for both index and columns.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.MultiIndex.from_product([[1, 2], [1, 2, 3], [4, 5]], names=['i', 'j', 'k']),
                  columns=pd.MultiIndex.from_product([[1, 2], [1, 2]], names=['x', 'y']))
for c in df.columns:
    df[c] = np.random.randint(100, size=(12, 1))
x 1 2
y 1 2 1 2
i j k
1 1 4 10 13 0 76
5 92 37 52 40
2 4 88 77 50 22
5 75 31 19 1
3 4 61 23 5 47
5 43 68 10 21
2 1 4 23 15 17 5
5 47 68 6 94
2 4 0 12 24 54
5 83 27 46 19
3 4 7 22 5 15
5 7 10 89 79
I want to group the values by one name in the index and one name in the columns.
Each such group is then a 2D array of numbers (rather than a Series), and I want to aggregate the std() over all entries in that 2D array.
For example, say I group by ['i', 'x']; one group would be the values with i=1 and x=1. I want to compute the std of each of these 2D arrays and produce a DataFrame with the i values as index and the x values as columns.
What is the best way to achieve this?
If I do stack() to get x into the index, I will still be computing several std() values instead of one, as there will still be multiple columns.
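One direct way to get exactly that shape (a sketch, using the df above): move both column levels into the row index so that each (i, x) group becomes a flat set of scalars, then group once and unstack:
long = df.stack(['x', 'y'])    # Series indexed by (i, j, k, x, y)
result = long.groupby(['i', 'x']).std().unstack('x')    # i as rows, x as columns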
You can use nested list comprehensions. For your example, with the given kind of DataFrame (not identical, as the values are random; you may want to fix a seed so that results are comparable) and i and x as the indices of interest, it works like this:
# get values of the top level row index
rows = set(df.index.get_level_values(0))
# get values of the top level column index
columns = set(df.columns.get_level_values(0))
# for every sub-dataframe (every combination of top-level indices)
# compute the sample standard deviation (1 degree of freedom) across all values
df_groupSD = pd.DataFrame([[df.loc[(row, )][(col, )].values.std(ddof=1)
for col in columns] for row in rows],
index = rows, columns = columns)
# show result
display(df_groupSD)
Output:
1 2
1 31.455115 25.433812
2 29.421699 33.748962
There may be better ways, of course.
You can use stack to move the 'y' level of the columns into the index and then group by i only:
print(df.stack(level='y').groupby(['i']).std())
x 1 2
i
1 32.966811 23.933462
2 28.668825 28.541835
Try the following code:
df.groupby(level=0).apply(lambda grp: grp.stack().std())
Within each i group, grp.stack() moves the innermost column level ('y') into the index, leaving only 'x' as columns, so .std() returns one value per x.
I have the following dataframe
df:
group people value value_50
1 5 100 1
2 2 90 1
1 10 80 1
2 20 40 0
1 7 10 0
2 23 30 0
And I am trying to apply sklearn's MinMaxScaler to one of the columns, restricted by a condition on the dataset, and then want to join the result back into my original data by its pandas index.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
After copying the above data
data = pd.read_clipboard()
minmax = MinMaxScaler(feature_range=(0,10))
''' Apply a filter on "group", then apply minmax only to those values '''
val = pd.DataFrame(minmax.fit_transform(data[data['group'] == 1][['value']]),
                   columns=['val_minmax'])
But it looks like we lose the original index after the minmax transform:
val
val_minmax
0 10.000000
1 7.777778
2 0.000000
whereas the index in my original dataset for this filter is:
data[data['group'] == 1]['value']
output:
0 100
2 80
4 10
Desired dataset:
df_out:
group people value value_50 val_minmax
1 5 100 1 10
2 2 90 1 na
1 10 80 1 7.78
2 20 40 0 na
1 7 10 0 0
2 23 30 0 na
Now, how do I join this back to the matching rows of the original data, so that I get the above output?
You just need to assign it back:
df.loc[df['group'] == 1, 'val_minmax'] = minmax.fit_transform(df[df['group'] == 1][['value']])
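Putting it together, a minimal end-to-end sketch (assuming the frame from the question; .ravel() flattens the scaler's (n, 1) output so it assigns cleanly):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'group':    [1, 2, 1, 2, 1, 2],
                   'people':   [5, 2, 10, 20, 7, 23],
                   'value':    [100, 90, 80, 40, 10, 30],
                   'value_50': [1, 1, 1, 0, 0, 0]})
minmax = MinMaxScaler(feature_range=(0, 10))
mask = df['group'] == 1
# fit on the group-1 rows only; rows outside the mask stay NaN
df.loc[mask, 'val_minmax'] = minmax.fit_transform(df.loc[mask, ['value']]).ravel()
# group-1 values 100, 80, 10 map to 10.0, 7.777..., 0.0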