Let's say I have the following dataframe:
Priority Color Risk
1 1 Green 8
2 9 Red 10
3 5 Orange 4
I would like to add a column 'Score' which calculates a score for each row based on multiple conditions related to the other columns. For example, the conditions and scoring could be:
If 'Priority' column > 5, add 1 point, otherwise 0 points
If 'Color' column == 'Red', add 1 point, otherwise 0 points
If 'Risk' column > 7, add 1 point, otherwise 0 points
In this case, row 1 would get 1 point, row 2 would get 3 points and row 3 would get 0 points.
Does anyone know how I could achieve this?
You can sum boolean conditions converted to ints with .astype:
df['score'] = ((df['Priority'] > 5).astype(int)
               + (df['Color'] == 'Red').astype(int)
               + (df['Risk'] > 7).astype(int))
Priority Color Risk score
1 1 Green 8 1
2 9 Red 10 3
3 5 Orange 4 0
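If there are many conditions, they can also be collected in a boolean frame and summed across columns. A minimal self-contained sketch rebuilding the example data (the condition names are illustrative only):
import pandas as pd

# example frame from the question
df = pd.DataFrame(
    {'Priority': [1, 9, 5],
     'Color': ['Green', 'Red', 'Orange'],
     'Risk': [8, 10, 4]},
    index=[1, 2, 3],
)

# one boolean column per condition; summing across columns counts the points
conditions = pd.DataFrame({
    'high_priority': df['Priority'] > 5,
    'is_red': df['Color'] == 'Red',
    'high_risk': df['Risk'] > 7,
})
df['score'] = conditions.sum(axis=1)
print(df)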
I am dealing with a dataframe that has some empty cells. For example:
df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3], 'id2': [1,1,1,2,2,2], 'color': ["red", "", "green", "yellow", "", "blue"], 'rate':["good","","good","average","","good"]})
id id2 color rate
0 1 1 red good
1 2 1
2 3 1 green good
3 1 2 yellow average
4 2 2
5 3 2 blue good
For both the columns "color" and "rate", I would like to replace the empty rows with values from another row where id is 1. Therefore, I would like the dataframe to look like this in the end:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
I would rather not fill the empty cells with values from the previous row. Instead, I would like to specify an id and fill the empty cells from the row that has that specific id.
IIUC you can groupby and transform with first, and finally assign to empty values:
df.loc[df["color"].eq(""), ["color", "rate"]] = df.groupby(df["id"].eq(1).cumsum())[["color","rate"]].transform("first")
print (df)
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
This only works if the number of rows with empty values in [color, rate] equals the number of rows with id == 1. Please expand the question if that is not the intention.
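If the counts do not line up, one hedged alternative (assuming the id2 column identifies each block and the id == 1 row of a block holds the fill values) is to build a lookup from the id == 1 rows and map it onto the empty cells:
import pandas as pd

df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3],
                        'id2': [1, 1, 1, 2, 2, 2],
                        'color': ["red", "", "green", "yellow", "", "blue"],
                        'rate': ["good", "", "good", "average", "", "good"]})

# lookup of fill values: one row per id2 block, taken from the row where id == 1
ref = df[df['id'] == 1].set_index('id2')[['color', 'rate']]

for col in ['color', 'rate']:
    mask = df[col].eq('')
    # map each empty cell to the value of the id == 1 row in the same id2 block
    df.loc[mask, col] = df.loc[mask, 'id2'].map(ref[col])

print(df)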
If I understood you correctly:
empty_rows = df.loc[df['color']=='',['color','rate']].index
df.loc[empty_rows, ['color','rate']] = df.loc[df['id']==1,['color','rate']].values
Result df (identical to the desired output shown in the question).
Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample on column a with a sample size of 1, but the only condition is that in the resulting sample the values of column b must be unique, so that it gives me a result something like:
a b
2 red 0
3 blue 1
5 black 4
The below result is not acceptable:
a b
2 red 1
3 blue 1
5 black 4
Right now I am using the below code with the pandas sample function, but it returns duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects one row at random from each group of column 'a'.
You can use df.groupby('a').first(), which chooses the first row of each group. Note that this is deterministic rather than random, and it does not guarantee that the b values are unique (on this data it would give 0, 0 and 4).
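If the random aspect matters and uniqueness must be enforced, one hedged option (not from the answer above) is to keep resampling until the b values happen to be unique. A minimal self-contained sketch:
import pandas as pd

df = pd.DataFrame({'a': ['red', 'red', 'red', 'blue', 'blue', 'blue', 'black', 'black'],
                   'b': [0, 1, 3, 0, 1, 3, 4, 2]})

# resample one row per group until column b contains no duplicates
# note: this loops forever if no duplicate-free combination exists
while True:
    sample = df.groupby('a').sample(n=1)
    if sample['b'].is_unique:
        break

print(sample)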
ID group categories
1 0 red
2 1 blue
3 1 green
4 1 green
1 0 blue
1 0 blue
2 1 red
3 0 red
4 0 red
4 1 red
Hi, I am new to Python. I am trying to get the count of duplicates in the ID column based on multiple conditions on the other 2 columns. So I am filtering for red and 0, and then I want the IDs that repeat more than once.
df1 = df[(df['categories']=='red')& (df['group'] == 0)]
df1['ID'].value_counts()[df1['ID'].value_counts()>1]
There are almost 10 categories in the categories column, so I was wondering whether there is a way to write a function or a for loop instead of repeating the same steps. The final goal is to see how many duplicate IDs there are in each group, given the category is 'red'/'blue'/'green'. Thanks in advance.
P.S.: the group values don't change; it is a binary (0/1) variable.
Desired output:
ID count
1 3
2 2
3 2
4 3
I think you can use groupby with SeriesGroupBy.value_counts:
s = df.groupby(['ID','group'])['categories'].value_counts()
print (s)
ID group categories
1 0 blue 2
red 1
2 1 blue 1
red 1
3 0 red 1
1 green 1
4 0 red 1
1 green 1
red 1
Name: categories, dtype: int64
out = s[s > 1].reset_index(name='count')
print (out)
ID group categories count
0 1 0 blue 2
Another solution is to get the duplicates first by filtering with duplicated and then count:
df = df[df.duplicated(['ID','group','categories'], keep=False)]
print (df)
ID group categories
4 1 0 blue
5 1 0 blue
df1 = df.groupby(['ID','group'])['categories'].value_counts().reset_index(name='count')
print (df1)
ID group categories count
0 1 0 blue 2
EDIT: To count categories (all rows) per ID, use GroupBy.size:
df1 = df.groupby('ID').size().reset_index(name='count')
print (df1)
ID count
0 1 3
1 2 2
2 3 2
3 4 3
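To avoid repeating the filter for each category, as asked in the question, a hedged sketch of a simple loop (the category list is an assumption; adjust it to the real data):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 1, 1, 2, 3, 4, 4],
                   'group': [0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
                   'categories': ['red', 'blue', 'green', 'green', 'blue',
                                  'blue', 'red', 'red', 'red', 'red']})

# count rows per (ID, group) for each category and keep the IDs that repeat
for cat in ['red', 'blue', 'green']:
    counts = df[df['categories'] == cat].groupby(['ID', 'group']).size()
    dupes = counts[counts > 1].reset_index(name='count')
    print(cat)
    print(dupes)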
In a pandas dataframe, how can I drop a random subset of rows that obey a condition?
In other words, if I have a Pandas dataframe with a Label column, I'd like to drop 50% (or some other percentage) of rows where Label == 1, but keep all of the rest:
Label   A          Label   A
    0   1              0   1
    0   2              0   2
    0   3      ->      0   3
    1  10              1  11
    1  11              1  12
    1  12
    1  13
I'd love to know the simplest and most pythonic/panda-ish way of doing this!
Edit: This question provides part of an answer, but it only talks about dropping rows by index, disregarding the row values. I'd still like to know how to drop only from rows that are labeled a certain way.
Use the frac argument
df.sample(frac=.5)
If you define the amount you want to drop in a variable n
n = .5
df.sample(frac=1 - n)
To include the condition, use drop
df.drop(df.query('Label == 1').sample(frac=.5).index)
Label A
0 0 1
1 0 2
2 0 3
4 1 11
6 1 13
Using drop with sample
df.drop(df[df.Label.eq(1)].sample(2).index)
Label A
0 0 1
1 0 2
2 0 3
3 1 10
5 1 12
I want to group a DataFrame by some criteria, and then find the integer index in the group (not the DataFrame) of the first row satisfying some predicate. If there is no such row, I want to get NaN.
For example, I group by column a divided by 5 and then in each group, find the index of the first row where column b is "red":
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': range(12), 'b': ['red', 'green', 'blue'] * 4})
a b
0 0 red
1 1 green
2 2 blue
3 3 red
4 4 green
5 5 blue
6 6 red
7 7 green
8 8 blue
9 9 red
10 10 green
11 11 blue
df.groupby(df.a // 5).apply(lambda g: next((idx for idx, row in g.reset_index(drop=True).iterrows() if row.b == "red"), None))
a
0 0
1 1
2 NaN
dtype: float64
(I guess I'm assuming rows stay in the same order as in the original DataFrame, but I can sort the group if needed.) Is there a more concise, efficient way to do this?
This is a bit longer, but IMHO it is more understandable / customizable.
In [126]: df2 = df.copy()
This is your group metric
In [127]: g = df.a//5
A reference to the created groups
In [128]: grp = df.groupby(g)
Create columns for the generated group and the cumulative count within the group
In [129]: df2['group'] = g
In [130]: df2['count'] = grp.cumcount()
In [131]: df2
Out[131]:
a b group count
0 0 red 0 0
1 1 green 0 1
2 2 blue 0 2
3 3 red 0 3
4 4 green 0 4
5 5 blue 1 0
6 6 red 1 1
7 7 green 1 2
8 8 blue 1 3
9 9 red 1 4
10 10 green 2 0
11 11 blue 2 1
Filtering and grouping gives you back the first element that you want. The count is the within-group count.
In [132]: df2[df2.b=='red'].groupby('group').first()
Out[132]:
a b count
group
0 0 red 0
1 6 red 1
You can generate all of the group keys this way (e.g. for groups where nothing came back from your filter):
In [133]: df2[df2.b=='red'].groupby('group').first().reindex(grp.groups.keys())
Out[133]:
a b count
0 0 red 0
1 6 red 1
2 NaN NaN NaN
Best I could do:
import itertools as it
df.groupby(df.a // 5).apply(lambda group: next(it.chain(np.where(group.values == "red")[0], [None])))
The only real difference is using np.where on the values (so I'd expect this to be faster usually), but you may even want to just write your own first_where function and use that.
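For a more concise alternative (not from the answers above, assuming the same a // 5 grouping), the within-group position can be computed with cumcount, restricted to the rows matching the predicate, and reduced with min; groups with no match come back as NaN after reindexing. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'a': range(12), 'b': ['red', 'green', 'blue'] * 4})

key = df.a // 5
pos = df.groupby(key).cumcount()   # integer position of each row within its group

mask = df.b == 'red'
# first within-group position of a 'red' row per group; groups without one become NaN
result = pos[mask].groupby(key[mask]).min().reindex(key.unique())
print(result)
# 0    0.0
# 1    1.0
# 2    NaN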