I am dealing with a dataframe that has some empty cells. For example:
df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3], 'id2': [1,1,1,2,2,2], 'color': ["red", "", "green", "yellow", "", "blue"], 'rate':["good","","good","average","","good"]})
   id  id2   color     rate
0   1    1     red     good
1   2    1
2   3    1   green     good
3   1    2  yellow  average
4   2    2
5   3    2    blue     good
For both the columns "color" and "rate", I would like to replace the empty rows with values from another row where id is 1. Therefore, I would like the dataframe to look like this in the end:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
I would rather not fill the empty cells from the previous row. Instead, I want to specify an id and fill the empty cells with the values from the rows that have that specific id.
IIUC, you can groupby and transform with first, and finally assign to the empty values:
df.loc[df["color"].eq(""), ["color", "rate"]] = df.groupby(df["id"].eq(1).cumsum())[["color","rate"]].transform("first")
print (df)
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
This only works if the number of rows with empty values in [color, rate] equals the number of rows with id == 1. If that is not the intention, please expand further on the question.
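If the intention is instead to pull the fill values from the id == 1 row within each id2 group (so the counts do not have to line up row by row), a minimal sketch could look like this; the fill lookup table and the column loop are my own illustration, not part of the answers above:
import pandas as pd

df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3],
                        'id2': [1, 1, 1, 2, 2, 2],
                        'color': ["red", "", "green", "yellow", "", "blue"],
                        'rate': ["good", "", "good", "average", "", "good"]})

# One row per id2 group: the color/rate values of the row whose id == 1
fill = df[df['id'] == 1].set_index('id2')[['color', 'rate']]

# Fill every empty cell from the id == 1 row of its own id2 group
for col in ['color', 'rate']:
    empty = df[col].eq('')
    df.loc[empty, col] = df.loc[empty, 'id2'].map(fill[col])

print (df)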
If I understood you correctly:
empty_rows = df.loc[df['color']=='',['color','rate']].index
df.loc[empty_rows, ['color','rate']] = df.loc[df['id']==1,['color','rate']].values
Result df (identical to the desired output shown above):
Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample one row per group in column a. The only condition is that, in the sample created, the values of column b should be unique, giving me a result something like:
a b
2 red 0
3 blue 1
5 black 4
The below result is not acceptable:
a b
2 red 1
3 blue 1
5 black 4
Right now I am using the code below with the pandas sample function, but it returns duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects one row at random from each group in column 'a'.
You can use df.groupby('a').first(), which takes the first row of each group and would return the output you are looking for.
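If the first rows of the groups do not happen to have distinct b values, one brute-force alternative (my own sketch, not from the answer above; the helper name sample_unique_b is hypothetical) is rejection sampling: keep drawing one row per group until the sampled b values come out unique.
import pandas as pd

df = pd.DataFrame({'a': ['red'] * 3 + ['blue'] * 3 + ['black'] * 2,
                   'b': [0, 1, 3, 0, 1, 3, 4, 2]})

def sample_unique_b(df, max_tries=1000):
    # Draw one random row per group in 'a' and accept the draw only if
    # the sampled 'b' values are all different.
    for _ in range(max_tries):
        s = df.groupby('a').sample(n=1)
        if s['b'].is_unique:
            return s
    raise ValueError("no sample with unique 'b' values found")

print (sample_unique_b(df))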
I have two data frames that I want to concatenate in Python. However, I want to add another column, type, in order to distinguish between the two data frames.
Here is my sample data:
import pandas as pd
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']},
columns=['numbers', 'colors'])
df1 = pd.DataFrame({'numbers': [7, 9, 9], 'colors': ['yellow', 'brown', 'blue']},
columns=['numbers', 'colors'])
pd.concat([df,df1])
This code will give me the following result:
numbers colors
0 1 red
1 2 white
2 3 blue
0 7 yellow
1 9 brown
2 9 blue
but what I would like to get is as follows:
numbers colors type
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
type column is going to help me to differentiate between the values of the two data frames.
Can anyone help me with this please?
Use DataFrame.assign for new columns:
df = pd.concat([df.assign(typ='first'),df1.assign(typ='second')])
print (df)
numbers colors typ
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
Using a list comprehension:
df = pd.concat([d.assign(typ=f'id{i}') for i, d in enumerate([df, df1])], ignore_index=True)
numbers colors typ
0 1 red id0
1 2 white id0
2 3 blue id0
3 7 yellow id1
4 9 brown id1
5 9 blue id1
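A related pattern (not from the answers above, just an additional sketch) lets pd.concat label the frames via its keys argument and then promotes the resulting index level to an ordinary column:
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})
df1 = pd.DataFrame({'numbers': [7, 9, 9], 'colors': ['yellow', 'brown', 'blue']})

# keys= tags each frame; names= labels that index level so reset_index
# can turn it into a regular 'typ' column
out = (pd.concat([df, df1], keys=['first', 'second'], names=['typ', None])
         .reset_index(level='typ')
         .reset_index(drop=True))
print (out)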
Let's say I have the following dataframe:
   Priority   Color  Risk
1         1   Green     8
2         9     Red    10
3         5  Orange     4
I would like to add a column 'Score' which calculates a score for each row based on multiple conditions related to the other columns. For example, the conditions and scoring could be:
If 'Priority' column > 5, add 1 point, otherwise 0 points
If 'Color' column == 'Red', add 1 point, otherwise 0 points
If 'Risk' column > 7, add 1 point, otherwise 0 points
In this case, row 1 would get 1 point, row 2 would get 3 points and row 3 would get 0 points.
Does anyone know how I could achieve this?
You can sum the boolean conditions after converting them to int with .astype:
df['score'] = ((df['Priority'] > 5).astype(int)
               + (df['Color'] == 'Red').astype(int)
               + (df['Risk'] > 7).astype(int))
   Priority   Color  Risk  score
1         1   Green     8      1
2         9     Red    10      3
3         5  Orange     4      0
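A self-contained variant of the same idea (the frame construction and the conditions list are just my reconstruction of the example) collects the conditions in a list, which scales better as more rules are added:
import pandas as pd

df = pd.DataFrame({'Priority': [1, 9, 5],
                   'Color': ['Green', 'Red', 'Orange'],
                   'Risk': [8, 10, 4]}, index=[1, 2, 3])

# Each condition is a boolean Series; summing them counts how many are True per row
conditions = [df['Priority'] > 5,
              df['Color'] == 'Red',
              df['Risk'] > 7]
df['score'] = sum(cond.astype(int) for cond in conditions)
print (df)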
ID group categories
1 0 red
2 1 blue
3 1 green
4 1 green
1 0 blue
1 0 blue
2 1 red
3 0 red
4 0 red
4 1 red
Hi, I am new to Python. I am trying to get the count of duplicate IDs based on multiple conditions on the other 2 columns. So I am filtering on red and group 0, and then I want the IDs that repeat more than once.
df1 = df[(df['categories']=='red')& (df['group'] == 0)]
df1['ID'].value_counts()[df1['ID'].value_counts()>1]
There are almost 10 categories in the categories column, so I was wondering if there is a way to write a function or for loop instead of repeating the same steps. The final goal is to see how many duplicate IDs there are in each group, given that the category is 'red'/'blue'/'green'. Thanks in advance.
P.S.: the group values don't change; it is a binary variable.
Desired output:
ID count
1 3
2 2
3 2
4 3
I think you can use groupby with SeriesGroupBy.value_counts:
s = df.groupby(['ID','group'])['categories'].value_counts()
print (s)
ID  group  categories
1   0      blue          2
           red           1
2   1      blue          1
           red           1
3   0      red           1
    1      green         1
4   0      red           1
    1      green         1
           red           1
Name: categories, dtype: int64
out = s[s > 1].reset_index(name='count')
print (out)
ID group categories count
0 1 0 blue 2
Another solution is to get the duplicates first by filtering with duplicated and then count:
df = df[df.duplicated(['ID','group','categories'], keep=False)]
print (df)
ID group categories
4 1 0 blue
5 1 0 blue
df1 = df.groupby(['ID','group'])['categories'].value_counts().reset_index(name='count')
print (df1)
ID group categories count
0 1 0 blue 2
EDIT: To count categories (all rows) per ID, use GroupBy.size:
df1 = df.groupby('ID').size().reset_index(name='count')
print (df1)
ID count
0 1 3
1 2 2
2 3 2
3 4 3
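To address the "function or for loop" part of the question, one sketch of my own (the helper name duplicated_ids is hypothetical, and the frame is rebuilt from the table in the question) generalises the manual filtering by looping over every category and counting the IDs that repeat within the chosen group:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 1, 1, 2, 3, 4, 4],
                   'group': [0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
                   'categories': ['red', 'blue', 'green', 'green', 'blue',
                                  'blue', 'red', 'red', 'red', 'red']})

def duplicated_ids(df, group=0):
    # For each category, count the IDs that occur more than once
    # within the given group value.
    out = {}
    for cat in df['categories'].unique():
        sub = df[(df['categories'] == cat) & (df['group'] == group)]
        counts = sub['ID'].value_counts()
        out[cat] = counts[counts > 1]
    return out

for cat, dups in duplicated_ids(df, group=0).items():
    print (cat, dict(dups))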
I want to group a DataFrame by some criteria, and then find the integer index in the group (not the DataFrame) of the first row satisfying some predicate. If there is no such row, I want to get NaN.
For example, I group by column a divided by 5 and then in each group, find the index of the first row where column b is "red":
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': range(12), 'b': ['red', 'green', 'blue'] * 4})
a b
0 0 red
1 1 green
2 2 blue
3 3 red
4 4 green
5 5 blue
6 6 red
7 7 green
8 8 blue
9 9 red
10 10 green
11 11 blue
df.groupby(df.a // 5).apply(lambda g: next((idx for idx, row in g.reset_index(drop=True).iterrows() if row.b == "red"), None))
a
0 0
1 1
2 NaN
dtype: float64
(I guess I'm assuming rows stay in the same order as the in original DataFrame, but I can sort the group if needed.) Is there a more concise, efficient way to do this?
This is a bit longer, but IMHO it is more understandable / customizable:
In [126]: df2 = df.copy()
This is your group metric
In [127]: g = df.a//5
A reference to the created groups
In [128]: grp = df.groupby(g)
Create columns for the generated group and the cumulative count within the group
In [129]: df2['group'] = g
In [130]: df2['count'] = grp.cumcount()
In [131]: df2
Out[131]:
a b group count
0 0 red 0 0
1 1 green 0 1
2 2 blue 0 2
3 3 red 0 3
4 4 green 0 4
5 5 blue 1 0
6 6 red 1 1
7 7 green 1 2
8 8 blue 1 3
9 9 red 1 4
10 10 green 2 0
11 11 blue 2 1
Filtering and grouping gives you back the first element that you want. The count is the within-group count.
In [132]: df2[df2.b=='red'].groupby('group').first()
Out[132]:
a b count
group
0 0 red 0
1 6 red 1
You can generate all of the group keys (in case nothing came back from your filter for some group) this way:
In [133]: df2[df2.b=='red'].groupby('group').first().reindex(grp.groups.keys())
Out[133]:
a b count
0 0 red 0
1 6 red 1
2 NaN NaN NaN
Best I could do:
import itertools as it
df.groupby(df.a // 5).apply(lambda group: next(it.chain(np.where(group.to_numpy() == "red")[0], [None])))
The only real difference is using np.where on the values (so I'd expect this to be faster usually), but you may even want to just write your own first_where function and use that.
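Following that suggestion, a minimal sketch of such a first_where helper (the name comes from the sentence above; the implementation is my own illustration) could be:
import numpy as np
import pandas as pd

def first_where(series, value):
    # Position of the first element equal to `value` within the group,
    # or NaN if the group has no such element.
    hits = np.flatnonzero(series.to_numpy() == value)
    return hits[0] if len(hits) else np.nan

df = pd.DataFrame({'a': range(12), 'b': ['red', 'green', 'blue'] * 4})
print (df.groupby(df.a // 5)['b'].apply(first_where, 'red'))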