Calculating rank for a particular column in pandas - python

I have this dataframe that looks like this
data = {'col1': ['a', 'b', 'c'],
        'col2': [10, 5, 20]}
df_sample = pd.DataFrame(data=data)
I want to calculate the rank of col2. I wrote this function
def rank_by(df):
    if df.shape[0] >= 10:
        df.sort_values(by=['col2'])
        l = []
        for val in df['col2']:
            rank = (val/df['col2'].max())*10
            l.append(rank)
        df['rank'] = l
    return df
Please assume col2 has more than 10 values. I want to know if there is a more pythonic way of applying the function defined above.

It looks like you just want the ratio to the max value multiplied by 10:
df_sample['rank'] = df_sample['col2'].div(df_sample['col2'].max()).mul(10)
print(df_sample.sort_values(by='col2'))
Output:
col1 col2 rank
4 e 2 0.8
8 i 2 0.8
3 d 4 1.6
1 b 5 2.0
6 g 6 2.4
9 j 9 3.6
0 a 10 4.0
7 h 12 4.8
2 c 20 8.0
5 f 25 10.0
Used input:
data = {'col1': list('abcdefghij'),
        'col2': [10, 5, 20, 4, 2, 25, 6, 12, 2, 9]}
df_sample = pd.DataFrame(data=data)

Use pd.Series.rank:
df_sample['rank'] = df_sample['col2'].rank()
Output:
col1 col2 rank
0 a 10 2.0
1 b 5 1.0
2 c 20 3.0
Note that there are different methods to handle ties.
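For example, a small sketch of how the method parameter changes tie handling (the values below are made up to include a tie):
s = pd.Series([10, 5, 20, 5])
print(s.rank())                 # average (default): ties share the mean rank -> [3.0, 1.5, 4.0, 1.5]
print(s.rank(method='min'))     # ties get the lowest rank in the tied group  -> [3.0, 1.0, 4.0, 1.0]
print(s.rank(method='dense'))   # like 'min', but ranks stay consecutive      -> [2.0, 1.0, 3.0, 1.0]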

Related

Replace specific values in a dataframe by column mean in pandas

I'm a python beginner and I'm trying to do some operations with dataframes that I usually do with R language.
I have a large dataframe with 2592 rows and 205 columns and I want to replace the 0.0 values with half the minimum value of their column.
An example with a random dataframe would be:
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(1)
>>> df = pd.DataFrame(np.random.randint(0,10, size=(3,5)), columns = ['A', 'B', 'C', 'D', 'E'])
>>> print(df)
A B C D E
0 5 8 9 5 0
1 0 1 7 6 9
2 2 4 5 2 4
And the result I'm looking for is:
A B C D E
0 5 8 9 5 2
1 1 1 7 6 9
2 2 4 5 2 4
Intuitively I would do it like this:
>>> for column in df:
...     for element in column:
...         if element == 0:
...             element = df[column].min()/2
But it doesn't work... any help?
Thank you!
Use DataFrame.mask, replacing the zeros with each column's minimum (computed after replacing 0 with NaN) divided by 2:
df1 = df.mask(df.eq(0), df.replace(0, np.nan).min().div(2), axis=1)
print(df1)
A B C D E
0 5 8 9 5 2
1 1 1 7 6 9
2 2 4 5 2 4
A more efficient solution is possible (thanks @mozway):
m = df.eq(0)
df1 = df.mask(m, df[~m].min().div(2), axis=1)
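For clarity, a small sketch of what the per-column fill values look like with the question's example data (df as defined above, seed 1):
m = df.eq(0)
fill = df[~m].min().div(2)  # per-column minimum ignoring the zeros, halved
print(fill)
A    1.0
B    0.5
C    2.5
D    1.0
E    2.0
dtype: float64
The zeros in columns A and E are then replaced by 1.0 and 2.0 respectively, matching the expected result.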
To make your "intuitive" way work, this is how to do it.
Use a function to perform the fancy logic you need.
Pandas' .apply function is optimised, so it should be sufficiently fast anyway.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,10, size=(3,5)), columns = ['A', 'B', 'C', 'D', 'E'])
def make_half_minimum(value, dataseries):
    if value == 0:
        dataseries_ = dataseries[dataseries != 0]
        return dataseries_.min()/2
    else:
        return value

for column_name in df.columns:
    df[column_name] = df[column_name].apply(lambda x: make_half_minimum(x, df[column_name]))
print(df)
A B C D E
0 5.0 8 9 5 2.0
1 1.0 1 7 6 9.0
2 2.0 4 5 2 4.0
[Finished in 521ms]

How to create df1 from df2 with different row and column indices?

I want to fill df1 using df2's values. I could achieve it with a nested loop, but it is very time-consuming.
Is there any smart way to do this?
P.S. The size of df is around 8000 rows, 8000 columns.
df1 initially is like this
A B C D
A 0 0 0 0
B 0 0 0 0
C 0 0 0 0
D 0 0 0 0
df2 is like this
P Q R S T
P 1 5 7 5 3
Q 5 6 2 8 5
R 3 5 4 9 3
S 9 4 5 0 8
T 2 9 4 2 1
Now there is a correspondence list between the indices of df1 and df2:
df1 df2
A P
B Q
C R
D S
B T
df1 should be filled like this
A B C D
A 1 8 7 5
B 7 21 6 10
C 3 8 4 9
D 9 12 5 0
Here, as 'B' occurs twice in the list, the values of 'Q' and 'T' are added together.
Thank you in advance.
You could try changing the row and column names in df1 (based on the correspondence with df2); for the cases of multiple correspondence (like B) you could first name them B1, B2, etc., and then sum them together:
> di
{'Q': 'B1', 'P': 'A', 'S': 'D', 'R': 'C', 'T': 'B2'}
> df1 = df2.copy()
> df1.columns = [di[c] for c in df2.columns]
> df1.index = [di[c] for c in df2.index]
> ## sum B1,B2 column wise
> df1['B'] = df1.B1 + df1.B2
> ## sum B1,B2 row wise
> df1.loc["B", :] = df1.loc["B1"] + df1.loc["B2"]
> ## subset with original index and column names
> df1[["A", "B", "C", "D"]].loc[["A", "B", "C", "D"]]
##output
A B C D
A 1.0 8.0 7.0 5.0
B 7.0 21.0 6.0 10.0
C 3.0 8.0 4.0 9.0
D 9.0 12.0 5.0 0.0
You can also stack df2 into a Series, so the columns become an inner index (level_1) of the Series.
Then replace the indices with {'P': 'A', 'Q': 'B', 'R': 'C', 'S': 'D', 'T': 'B'}.
Use groupby with sum to add values with the same indices, then unstack to turn the inner index level back into columns.
amap = {'P': 'A', 'Q': 'B', 'R': 'C', 'S': 'D', 'T': 'B'}
obj2 = df2.stack().reset_index()
for col in ['level_0', 'level_1']:
    obj2[col] = obj2[col].map(amap)
df1 = obj2.groupby(['level_0', 'level_1'])[0].sum().unstack()
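A sketch of an equivalent approach (assuming df2 is reconstructed from the question's table): rename both axes with the same mapping, then let groupby sum the duplicated labels along each axis.
import pandas as pd

df2 = pd.DataFrame(
    [[1, 5, 7, 5, 3],
     [5, 6, 2, 8, 5],
     [3, 5, 4, 9, 3],
     [9, 4, 5, 0, 8],
     [2, 9, 4, 2, 1]],
    index=list('PQRST'), columns=list('PQRST'))

amap = {'P': 'A', 'Q': 'B', 'R': 'C', 'S': 'D', 'T': 'B'}

renamed = df2.rename(index=amap, columns=amap)
# sum duplicate row labels, then duplicate column labels
df1 = renamed.groupby(level=0).sum().T.groupby(level=0).sum().T
This reproduces the expected df1 shown in the question.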

Subset df on value and subsequent row - pandas

I know this is on SO somewhere but I can't seem to find it. I want to subset a df on a specific value and include the following unique rows. Using the code below, I can return the rows equal to A, but I'm hoping to also return the next unique value, which is B.
Note: The subsequent unique value may not be B and may have a varying number of rows, so I need a function that finds and returns all subsequent unique values.
import pandas as pd
df = pd.DataFrame({
    'Time' : [1,1,1,1,1,1,2,2,2,2,2,2],
    'ID' : ['A','A','B','B','C','C','A','A','B','B','C','C'],
    'Val' : [2.0,5.0,2.5,2.0,2.0,1.0,1.0,6.0,4.0,2.0,5.0,1.0],
})
df = df[df['ID'] == 'A']
intended output:
Time ID Val
0 1 A 2.0
1 1 A 5.0
2 1 B 2.5
3 1 B 2.0
4 2 A 1.0
5 2 A 6.0
6 2 B 4.0
7 2 B 2.0
OK OP, let me do this again: you want to find all the rows which are "A" (base condition) and all the rows whose ID follows an "A" row at some point, right?
Then,
is_A = df["ID"] == "A"
not_A_follows_from_A = (df["ID"] != "A") & (df["ID"].shift() == "A")
candidates = df["ID"].loc[is_A | not_A_follows_from_A].unique()
df.loc[df["ID"].isin(candidates)]
This should work as intended.
Edit: example
df = pd.DataFrame({
    'Time': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    'ID': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'A', 'E', 'E', 'E', 'A', 'F'],
    'Val': [7, 2, 7, 5, 1, 6, 7, 3, 2, 4, 7, 8, 2]})
is_A = df["ID"] == "A"
not_A_follows_from_A = (df["ID"] != "A") & (df["ID"].shift() == "A")
candidates = df["ID"].loc[is_A | not_A_follows_from_A].unique()
df.loc[df["ID"].isin(candidates)]
outputs this:
Time ID Val
0 1 A 7
1 1 A 2
2 1 B 7
3 0 B 5
7 1 A 3
8 0 E 2
9 0 E 4
10 1 E 7
11 1 A 8
12 1 F 2
Let us try drop_duplicates, then groupby with head to select the number of unique IDs we would like to keep per Time, and merge:
out = df.merge(df[['Time','ID']].drop_duplicates().groupby('Time').head(2))
Time ID Val
0 1 A 2.0
1 1 A 5.0
2 1 B 2.5
3 1 B 2.0
4 2 A 1.0
5 2 A 6.0
6 2 B 4.0
7 2 B 2.0
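For clarity, a sketch of the intermediate key frame that the merge uses (df as defined in the question):
keys = df[['Time', 'ID']].drop_duplicates().groupby('Time').head(2)
print(keys)
   Time ID
0     1  A
2     1  B
6     2  A
8     2  B
The merge then keeps every row of df whose (Time, ID) pair appears in keys.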

How to apply function to ALL columns of dataframe GROUPWISELY? (In python pandas)

How to apply a function to each column of a dataframe "groupwisely"?
I.e. group by the values of one column and calculate, e.g., the mean of every other column for each group. The expected output is a dataframe whose index holds the names of the different groups and whose values are the means for each group and column.
E.g. consider:
df = pd.DataFrame(np.arange(16).reshape(4,4), columns=['A', 'B', 'C', 'D'])
df['group'] = ['a', 'a', 'b','b']
A B C D group
0 0 1 2 3 a
1 4 5 6 7 a
2 8 9 10 11 b
3 12 13 14 15 b
I want to calculate e.g. np.mean for each column, but "groupwisely"; in this particular example it can be done by:
t = df.groupby('group').agg({'A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean })
A B C D
group
a 2 3 4 5
b 10 11 12 13
However, it requires explicit use of the column names 'A': np.mean, 'B': np.mean, 'C': np.mean, 'D': np.mean,
which is unacceptable for my task, since they can change.
As MaxU commented, simpler is groupby + GroupBy.mean:
df1 = df.groupby('group').mean()
print (df1)
A B C D
group
a 2 3 4 5
b 10 11 12 13
If you need the group labels as a regular column instead of the index:
df1 = df.groupby('group', as_index=False).mean()
print (df1)
group A B C D
0 a 2 3 4 5
1 b 10 11 12 13
You don't need to explicitly name the columns.
df.groupby('group').agg('mean')
This will produce the mean for each group, for each column, as requested:
A B C D
group
a 2 3 4 5
b 10 11 12 13
The below does the job:
df.groupby('group').apply(np.mean, axis=0)
giving back
A B C D
group
a 2.0 3.0 4.0 5.0
b 10.0 11.0 12.0 13.0
apply takes axis={0,1} as an additional argument, which specifies whether the function is applied row-wise or column-wise.
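If you need per-column functions but don't want to hard-code the column names, one option is a small sketch like this: build the aggregation mapping programmatically from whatever columns are present (assuming here that every non-group column gets the same function).
agg_map = {col: np.mean for col in df.columns if col != 'group'}
t = df.groupby('group').agg(agg_map)
This produces the same table as the explicit dict above, but survives column renames and additions.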

Python pandas - select rows based on groupby

I have a sample table like this:
Dataframe: df
Col1 Col2 Col3 Col4
A 1 10 i
A 1 11 k
A 1 12 a
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
B 2 21 w
B 2 25 e
B 2 36 q
C 1 23 a
C 1 24 b
I'm trying to get all records/rows of the groups (Col1, Col2) that have the smaller number of records, while skipping over those groups that have only 1 record (in this example Col1 = 'C'). So, the output would be as follows:
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
since group (A,2) has 2 records compared to group (A,1) which has 3 records.
I tried to approach this issue from different angles but just can't seem to get the result that I need. I am able to find the groups that I need using a combination of groupby, filter and agg but how do I now use this as a select filter on df? After spending a lot of time on this, I wasn't even sure that the approach was correct as it looked overly complicated. I am sure that there is an elegant solution but I just can't see it.
Any advice on how to approach this would be greatly appreciated.
I had this to get the groups for which I wanted the rows displayed:
groups = df.groupby(["Col1", "Col2"])["Col2"].agg({'no': 'count'})
filteredGroups = groups.groupby(level=0).filter(lambda group: group.size > 1)
print filteredGroups.groupby(level=0).agg('idxmin')
The second line was to account for groups that may have only one record as those I don't want to consider. Honestly, I tried so many variations and approaches that eventually did not give me the result that I wanted. I see that all answers are not one-liners so that at least I don't feel like I was over thinking the problem.
df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")
df['rnk'] = df.groupby('Col1')['sz'].rank(method='min')
df['rnk_rev'] = df.groupby('Col1')['sz'].rank(method='min',ascending=False)
df.loc[ (df['rnk'] == 1.0) & (df['rnk_rev'] != 1.0) ]
Col1 Col2 Col3 Col4 sz rnk rnk_rev
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
Edit: changed "count" to "size" (as in @Marco Spinaci's answer), which doesn't matter in this example but might if there were missing values.
And for clarity, here's what the df looks like before dropping the selected rows.
Col1 Col2 Col3 Col4 sz rnk rnk_rev
0 A 1 10 i 3 3.0 1.0
1 A 1 11 k 3 3.0 1.0
2 A 1 12 a 3 3.0 1.0
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
7 B 2 21 w 3 3.0 1.0
8 B 2 25 e 3 3.0 1.0
9 B 2 36 q 3 3.0 1.0
10 C 1 23 a 2 1.0 1.0
11 C 1 24 b 2 1.0 1.0
Definitely not a nice answer, but it should work:
tmp = df[['Col1','Col2']].groupby(['Col1','Col2']).size()
df['occurrencies'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]][df.Col2[i]])
df['min_occurrencies'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]].min())
df[df.occurrencies == df.min_occurrencies]
But there must be a more clever way to use groupby than creating an auxiliary data frame...
The following is a solution based on the groupby.apply methodology. Other, simpler methods are available that create helper Series, as in JohnE's answer, which I would say is superior.
The solution works by grouping the dataframe at the Col1 level and then passing a function to apply that further groups the data by Col2. Each sub_group is then assessed to yield the smallest group. Note that ties in size will be determined by whichever is evaluated first. This may not be desirable.
#create data
import pandas as pd
df = pd.DataFrame({
    "Col1" : ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "Col2" : [1, 1, 1, 2, 2, 1, 1, 2, 2, 2],
    "Col3" : [10, 11, 12, 10, 11, 15, 16, 21, 25, 36],
    "Col4" : ["i", "k", "a", "w", "e", "s", "d", "w", "e", "q"]
})
Grouped = df.groupby("Col1")
def transFunc(x):
    smallest = [None, None]
    sub_groups = x.groupby("Col2")
    for group, data in sub_groups:
        if not smallest[1] or len(data) < smallest[1]:
            smallest[0] = group
            smallest[1] = len(data)
    return sub_groups.get_group(smallest[0])
Grouped.apply(transFunc).reset_index(drop = True)
Edit: to assign the result:
result = Grouped.apply(transFunc).reset_index(drop = True)
print(result)
I would like to add a shorter yet readable version of JohnE's solution (boolean indexing on the two ranks, so no helper rank columns are needed):
df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")
grp = df.groupby('Col1')['sz']
df[grp.rank(method='min').eq(1) & grp.rank(method='min', ascending=False).ne(1)]
