Create column in pandas based on two other columns and table


import pandas as pd
table = pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]],
                     columns=['High','Middle','Low'],
                     index=['Blue','Green','Red'])
df = pd.DataFrame(data=[['High','Blue'],
                        ['High','Green'],
                        ['Low','Red'],
                        ['Middle','Blue'],
                        ['Low','Blue'],
                        ['Low','Red']],
                  columns=['A','B'])
>>> df
A B
0 High Blue
1 High Green
2 Low Red
3 Middle Blue
4 Low Blue
5 Low Red
>>> table
High Middle Low
Blue 1 2 3
Green 4 5 6
Red 7 8 9
I'm trying to add a third column 'C' whose values are looked up in the table: column A selects the table column and column B selects the table row. So the first row would get a value of 1, the second a value of 4, etc.
If this were a one-dimensional lookup, I would convert the table to a dictionary and use df['C'] = df['A'].map(table). However, since this is two-dimensional, I can't figure out how to use map or apply.
Ideally I would convert the table to dictionary format so that I can save it together with other dictionaries in a JSON file, but this is not essential.

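Use pandas lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this needs an older version):
table.lookup(df.B, df.A)
Out[248]: array([1, 4, 9, 2, 3, 9], dtype=int64)
# df['C'] = table.lookup(df.B, df.A)
Or use df.apply(lambda x: table.loc[x['B'], x['A']], axis=1), although I personally do not like apply.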
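On pandas versions where lookup is no longer available, the same two-dimensional lookup can be sketched with Index.get_indexer and NumPy integer indexing. This is only one possible replacement, not the answerer's method; the names row_pos and col_pos are made up for the example, and it assumes every label in df.A and df.B exists in the table.

```python
import pandas as pd

table = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     columns=['High', 'Middle', 'Low'],
                     index=['Blue', 'Green', 'Red'])
df = pd.DataFrame(data=[['High', 'Blue'], ['High', 'Green'], ['Low', 'Red'],
                        ['Middle', 'Blue'], ['Low', 'Blue'], ['Low', 'Red']],
                  columns=['A', 'B'])

# Integer position of each B label in table's index and of each A label
# in table's columns. get_indexer returns -1 for labels it cannot find,
# so guard against that before indexing (assumption: all labels exist).
row_pos = table.index.get_indexer(df['B'])
col_pos = table.columns.get_indexer(df['A'])
if (row_pos < 0).any() or (col_pos < 0).any():
    raise KeyError("df contains labels that are missing from table")

# Pairwise (row, column) lookup on the underlying array.
df['C'] = table.to_numpy()[row_pos, col_pos]
print(df)
```

The resulting column should hold the same values as the lookup output above.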

You can use a merge for this:
df2 = (df.merge(table.stack().reset_index(),
                left_on=['A','B'], right_on=['level_1', 'level_0'])
         .drop(columns=['level_0', 'level_1'])
         .rename(columns={0:'C'}))
>>> df2
A B C
0 High Blue 1
1 High Green 4
2 Low Red 9
3 Low Red 9
4 Middle Blue 2
5 Low Blue 3
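For the wish in the question to keep the table as a dictionary inside a JSON file, a rough sketch could round-trip the table through JSON and then do the lookup on the nested dictionary. The variable names as_json and lookup are illustrative only, and the nesting {column: {index: value}} relies on the default orient of DataFrame.to_json.

```python
import json

import pandas as pd

table = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     columns=['High', 'Middle', 'Low'],
                     index=['Blue', 'Green', 'Red'])
df = pd.DataFrame(data=[['High', 'Blue'], ['High', 'Green'], ['Low', 'Red'],
                        ['Middle', 'Blue'], ['Low', 'Blue'], ['Low', 'Red']],
                  columns=['A', 'B'])

# Serialize the table; this string could be stored next to other dicts.
as_json = table.to_json()

# Later: load it back as a plain nested dict {column: {index: value}}.
lookup = json.loads(as_json)

# Two-key lookup per row: A picks the outer key, B the inner key.
df['C'] = [lookup[a][b] for a, b in zip(df['A'], df['B'])]
print(df)
```

A list comprehension is used here instead of map because map only takes a single key per row; for large frames the merge shown above is likely to scale better.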

Related

How to replace empty value with value from another row

I am dealing with a dataframe that has some empty cells. For example:
df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3], 'id2': [1,1,1,2,2,2], 'color': ["red", "", "green", "yellow", "", "blue"], 'rate':["good","","good","average","","good"]})
id id2 color rate
0 1 1 red good
1 2 1
2 3 1 green good
3 1 2 yellow average
4 2 2
5 3 2 blue good
For both the columns "color" and "rate", I would like to replace the empty rows with values from another row where id is 1. Therefore, I would like the dataframe to look like this in the end:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
I prefer not to replace the empty cells with values from the previous row. I would like to indicate the id and replace the empty cells with rows that have the specific id.

IIUC you can groupby and transform with first, and finally assign to empty values:
df.loc[df["color"].eq(""), ["color", "rate"]] = df.groupby(df["id"].eq(1).cumsum())[["color","rate"]].transform("first")
print (df)
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
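To see why this works, it can help to look at the intermediate group key. The sketch below repeats the same steps with named intermediates (key, donor and empty are names chosen for the example); it assumes, as in the sample data, that each block of rows starts with its id == 1 row.

```python
import pandas as pd

df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3], 'id2': [1, 1, 1, 2, 2, 2],
                        'color': ["red", "", "green", "yellow", "", "blue"],
                        'rate': ["good", "", "good", "average", "", "good"]})

# A new group starts at every row where id == 1, so the running sum
# labels each block of rows with the same integer.
key = df["id"].eq(1).cumsum()
print(key.tolist())

# For every row, the color/rate of the first row of its block,
# which is the id == 1 row under the assumption above.
donor = df.groupby(key)[["color", "rate"]].transform("first")

# Overwrite only the rows whose color is the empty string.
empty = df["color"].eq("")
df.loc[empty, ["color", "rate"]] = donor.loc[empty]
print(df)
```

Note that transform("first") skips missing values (NaN) but not empty strings, which is why the id == 1 row has to come first in each block for this to pick the intended donor.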

This only works if the number of rows with empty values in [color, rate] equals the number of rows with id == 1. If that is not what you intend, please expand on the question.
If I understood you correctly:
empty_rows = df.loc[df['color']=='',['color','rate']].index
df.loc[empty_rows, ['color','rate']] = df.loc[df['id']==1,['color','rate']].values
The resulting df matches the desired output shown in the question.

Sampling of a pandas dataframe with groupby on one column such that the sample has no duplicates in another column

Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample on column a, with a sample size of 1. But the only condition is that in the sample created the values of column b should be unique, such that it gives me a result something like:
a b
2 red 0
3 blue 1
5 black 4
The below result is not acceptable:
a b
2 red 1
3 blue 1
5 black 4
Right now I am using below code with pandas sample function, but it is returning duplicate values in column b
df.groupby('a').sample(n=1)

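sample(n=1) draws one row at random from each group of column 'a' independently, so nothing stops the same value of b from being picked in two groups.
You can use df.groupby('a').first(), which deterministically takes the first row of each group. Be aware, though, that on the sample data this returns b = 0 for both red and blue, so it does not by itself guarantee unique values in b.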
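If the sample has to stay random, one simple (and not particularly efficient) sketch is rejection sampling: draw one row per group, and redraw until the values of b happen to be unique. The helper name sample_unique_b and its parameters max_tries and seed are made up for this example, and it assumes a pandas version that provides groupby(...).sample.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['red'] * 3 + ['blue'] * 3 + ['black'] * 2,
                   'b': [0, 1, 3, 0, 1, 3, 4, 2]})


def sample_unique_b(frame, max_tries=1000, seed=0):
    """Draw one row per value of 'a', retrying until 'b' has no duplicates."""
    rng = np.random.RandomState(seed)  # shared state, so each try differs
    for _ in range(max_tries):
        picked = frame.groupby('a').sample(n=1, random_state=rng)
        if picked['b'].is_unique:
            return picked
    raise ValueError("no sample with unique 'b' found within max_tries")


result = sample_unique_b(df)
print(result)
```

This is fine when collisions are rare, as in the dummy frame. When many groups share few values of b, a unique assignment may be hard or impossible to hit by chance, and a matching-based approach would be more appropriate.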

How to assign a new descriptive column while concatenating dataframes

I have two data frames that I want to concatenate in Python. However, I want to add another column, type, in order to tell which data frame each row came from.
Here is my sample data:
import pandas as pd
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']},
columns=['numbers', 'colors'])
df1 = pd.DataFrame({'numbers': [7, 9, 9], 'colors': ['yellow', 'brown', 'blue']},
columns=['numbers', 'colors'])
pd.concat([df,df1])
This code will give me the following result:
numbers colors
0 1 red
1 2 white
2 3 blue
0 7 yellow
1 9 brown
2 9 blue
but what I would like to get is as follows:
numbers colors type
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
The type column is going to help me differentiate between the values of the two data frames.
Can anyone help me with this please?

Use DataFrame.assign for new columns:
df = pd.concat([df.assign(typ='first'),df1.assign(typ='second')])
print (df)
numbers colors typ
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
Or use a list comprehension, which scales to any number of frames:
df = pd.concat([d.assign(typ=f'id{i}') for i, d in enumerate([df, df1])], ignore_index=True)
numbers colors typ
0 1 red id0
1 2 white id0
2 3 blue id0
3 7 yellow id1
4 9 brown id1
5 9 blue id1
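Another option, shown here only as a sketch, is to let pd.concat attach the labels itself through its keys argument and then move that index level into a column. The level name 'row' is arbitrary and only there so that both index levels are named.

```python
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})
df1 = pd.DataFrame({'numbers': [7, 9, 9], 'colors': ['yellow', 'brown', 'blue']})

# keys adds an outer index level holding 'first' / 'second';
# reset_index(level='type') turns that level into a regular column.
out = (pd.concat([df, df1], keys=['first', 'second'], names=['type', 'row'])
         .reset_index(level='type'))
print(out)
```

Compared with assign, the new column ends up first rather than last, and the original row labels (0, 1, 2, 0, 1, 2) are kept as the index.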

pandas getting highest frequency value for each group in another column

I have a Pandas dataframe like this:
id color size test
0 0 blue medium 1
1 1 blue small 2
2 5 blue small 4
3 2 blue big 3
4 3 red small 4
5 4 red small 5
My desired output is this:
color size
blue small
red small
I've tried:
df = df[['id', 'color', 'size']]
df = df.groupby(['color'])['size'].value_counts()
and get this:
color size
blue small 2
big 1
medium 1
red small 2
Name: size, dtype: int64
but it turns into a series and the columns seem all messed up.
Basically, for each of the groups of 'color', I want the 'size' with the highest frequency. I'm really having a lot of trouble with this. Any suggestions? Thanks so much.

We can use sort_values, then groupby with tail to keep the most frequent size per color:
s=df.groupby(['color','size']).size().sort_values().groupby(level=0).tail(1).reset_index()
color size 0
0 blue small 2
1 red small 2
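If only the color and size columns are wanted, without the count column named 0, a possible variant is to aggregate each group with value_counts().idxmax(). This is a sketch rather than the answerer's code; the frame below is rebuilt from the question, and top is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 5, 2, 3, 4],
                   'color': ['blue', 'blue', 'blue', 'blue', 'red', 'red'],
                   'size': ['medium', 'small', 'small', 'big', 'small', 'small'],
                   'test': [1, 2, 4, 3, 4, 5]})

# For each color, count the sizes and keep the label with the highest count.
top = (df.groupby('color')['size']
         .agg(lambda s: s.value_counts().idxmax())
         .reset_index())
print(top)
```

When two sizes are tied within a color, idxmax returns only one of them, so ties need separate handling if they matter.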

Pandas: find group index of first row matching a predicate in a group, if any

I want to group a DataFrame by some criteria, and then find the integer index in the group (not the DataFrame) of the first row satisfying some predicate. If there is no such row, I want to get NaN.
For example, I group by column a divided by 5 and then in each group, find the index of the first row where column b is "red":
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': range(12), 'b': ['red', 'green', 'blue'] * 4})
a b
0 0 red
1 1 green
2 2 blue
3 3 red
4 4 green
5 5 blue
6 6 red
7 7 green
8 8 blue
9 9 red
10 10 green
11 11 blue
df.groupby(df.a // 5).apply(lambda g: next((idx for idx, row in g.reset_index(drop=True).iterrows() if row.b == "red"), None))
a
0 0
1 1
2 NaN
dtype: float64
(I guess I'm assuming rows stay in the same order as the in original DataFrame, but I can sort the group if needed.) Is there a more concise, efficient way to do this?

This is a bit longer, but IMHO is more understandable / customizable
In [126]: df2 = df.copy()
This is your group metric
In [127]: g = df.a//5
A reference to the created groups
In [128]: grp = df.groupby(g)
Create a column for the generated group and one for the cumulative count within the group
In [129]: df2['group'] = g
In [130]: df2['count'] = grp.cumcount()
In [131]: df2
Out[131]:
a b group count
0 0 red 0 0
1 1 green 0 1
2 2 blue 0 2
3 3 red 0 3
4 4 green 0 4
5 5 blue 1 0
6 6 red 1 1
7 7 green 1 2
8 8 blue 1 3
9 9 red 1 4
10 10 green 2 0
11 11 blue 2 1
Filtering and grouping gives you back the first element that you want. The count column is the position within the group
In [132]: df2[df2.b=='red'].groupby('group').first()
Out[132]:
a b count
group
0 0 red 0
1 6 red 1
You can bring back all of the group keys, including those for which nothing came back from your filter, by reindexing this way.
In [133]: df2[df2.b=='red'].groupby('group').first().reindex(grp.groups.keys())
Out[133]:
a b count
0 0 red 0
1 6 red 1
2 NaN NaN NaN
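The steps above can be condensed into a few lines that return just the within-group position. This is a sketch of the same cumcount idea; key, pos, is_red and first_red are names chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'a': range(12), 'b': ['red', 'green', 'blue'] * 4})

key = df['a'] // 5                 # group metric
pos = df.groupby(key).cumcount()   # position of each row inside its group
is_red = df['b'].eq('red')         # the predicate

# First matching position per group; reindex restores groups with no
# match, which then show up as NaN.
first_red = pos[is_red].groupby(key[is_red]).first().reindex(key.unique())
print(first_red)
```

Because the missing group is filled with NaN, the result is a float Series, just like the output in the question.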

Best I could do:
import itertools as it
df.groupby(df.a // 5).apply(lambda group: next(it.chain(np.where(group.to_numpy() == "red")[0], [None])))
The only real difference is using np.where on the values (so I'd usually expect this to be faster), but you may even want to just write your own first_where function and use that.
