Filling NaN values with merge from another dataframe - Python

I have a column with NaN values that I want to fill with values from another data frame, matched on a key. I was wondering if there is any simple way to do so.
Example:
I have a data frame of objects and their colors like this:
object color
0 chair black
1 ball yellow
2 door brown
3 ball **NaN**
4 chair white
5 chair **NaN**
6 ball grey
I want to fill the NaN values in the color column with the default color from the following data frame:
object default_color
0 chair brown
1 ball blue
2 door grey
So the result will be this:
object color
0 chair black
1 ball yellow
2 door brown
3 ball **blue**
4 chair white
5 chair **brown**
6 ball grey
Is there any "easy" way to do this?
Thanks :)
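For reference, a minimal sketch reproducing the two frames above (the names df and df2 are assumptions; one answer below refers to the first frame as df1):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'object': ['chair', 'ball', 'door', 'ball', 'chair', 'chair', 'ball'],
    'color': ['black', 'yellow', 'brown', np.nan, 'white', np.nan, 'grey'],
})
df2 = pd.DataFrame({
    'object': ['chair', 'ball', 'door'],
    'default_color': ['brown', 'blue', 'grey'],
})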

Use np.where and map after setting a column as the index, i.e.
df['color'] = np.where(df['color'].isnull(), df['object'].map(df2.set_index('object')['default_color']), df['color'])
or df.where:
df['color'] = df['color'].where(df['color'].notnull(), df['object'].map(df2.set_index('object')['default_color']))
object color
0 chair black
1 ball yellow
2 door brown
3 ball blue
4 chair white
5 chair brown
6 ball grey

First create a Series of default colors aligned to df1, then replace the NaNs:
s = df1['object'].map(df2.set_index('object')['default_color'])
print (s)
0 brown
1 blue
2 grey
3 blue
4 brown
5 brown
6 blue
Name: object, dtype: object
df1['color'] = df1['color'].mask(df1['color'].isnull(), s)
Or:
df1.loc[df1['color'].isnull(), 'color'] = s
Or:
df1['color'] = df1['color'].combine_first(s)
Or:
df1['color'] = df1['color'].fillna(s)
print (df1)
object color
0 chair black
1 ball yellow
2 door brown
3 ball blue
4 chair white
5 chair brown
6 ball grey
If the values in object are unique:
df = (df1.set_index('object')['color']
         .combine_first(df2.set_index('object')['default_color'])
         .reset_index())
Or:
df = (df1.set_index('object')['color']
         .fillna(df2.set_index('object')['default_color'])
         .reset_index())

Using loc + map:
m = df.color.isnull()
df.loc[m, 'color'] = df.loc[m, 'object'].map(df2.set_index('object').default_color)
df
object color
0 chair black
1 ball yellow
2 door brown
3 ball blue
4 chair white
5 chair brown
6 ball grey
If you're going to be doing a lot of these replacements, you should call set_index on df2 just once and save its result.
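For example, a sketch of that reuse (default_map is an illustrative name):
# Build the lookup Series once and reuse it for every replacement.
default_map = df2.set_index('object')['default_color']

m = df['color'].isnull()
df.loc[m, 'color'] = df.loc[m, 'object'].map(default_map)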

Related

Finding combinations of values across multiple rows and columns in pandas dataframe

I am trying to create all possible sets of values from a pandas dataframe, taking 3 values from each row.
For example, if I consider the dataframe below:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Amber
Chair Magenta Turquoise White Orange Violet Pink
Window Indigo Yellow Cerulean Grey Peach Aqua
I want to generate all possible solution sets, taking 3 values from each row.
This is what I tried:
from itertools import product
uniques = [df[i].unique().tolist() for i in df.columns ]
pd.DataFrame(product(*uniques), columns = df.columns)
But this generates all combinations with all 6 columns like this:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Aqua
Table Black Brown Red Blue Green Violet
Here, all values of Row 1 remain the same except for the last value, and all combinations are generated like this.
But what I need is this:
0 1 2 3 4 5 6 7 8 9
Table Black Red Blue Magenta White Orange Yellow Peach Aqua
Here, the first three values are from Row 1, the second 3 values are from Row 2, and the last 3 values are from Row 3.
Similarly, I want to display all such sets, and store them in a new dataframe.
Any help will be appreciated.
df
###
0 1 2 3 4 5 6
0 Table Black Brown Red Blue Green Amber
1 Chair Magenta Turquoise White Orange Violet Pink
2 Window Indigo Yellow Cerulean Grey Peach Aqua
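(For reproducibility, a sketch of how this frame could be built; the values are copied from the display above and the default integer column labels 0-6 are assumed.)
import pandas as pd

df = pd.DataFrame([
    ['Table', 'Black', 'Brown', 'Red', 'Blue', 'Green', 'Amber'],
    ['Chair', 'Magenta', 'Turquoise', 'White', 'Orange', 'Violet', 'Pink'],
    ['Window', 'Indigo', 'Yellow', 'Cerulean', 'Grey', 'Peach', 'Aqua'],
])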
from itertools import product
import random
uniques = [df[i].unique().tolist() for i in df.columns]
rl = list(product(*uniques))
pd.DataFrame(random.choices(rl))
product() generates all the combinations from sets, but you want a random result from the combinations list.
0 1 2 3 4 5 6
0 Table Black Brown White Orange Peach Aqua
Supplement: combinations with 3 sets.
If you select one element from each set, how many combinations are there?
2 * 2 * 3 = 12
Let's check whether the total number of combinations really is 12:
list_of_lists = [['Yellow','Blue'],['Cat','Dog'],['Swim','Run','Sleep']]
combination = product(*list_of_lists)
combination_list = list(combination)
pd.DataFrame(combination_list)
###
0 1 2
0 Yellow Cat Swim
1 Yellow Cat Run
2 Yellow Cat Sleep
3 Yellow Dog Swim
4 Yellow Dog Run
5 Yellow Dog Sleep
6 Blue Cat Swim
7 Blue Cat Run
8 Blue Cat Sleep
9 Blue Dog Swim
10 Blue Dog Run
11 Blue Dog Sleep
Choosing one row from the above at random then gives you a single set from the combinations, as shown below.
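A minimal sketch, reusing the combination_list built above:
import random

# Pick a single combination at random from the 12 rows above.
random.choice(combination_list)
# e.g. ('Blue', 'Dog', 'Run')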

Pandas count size of groupby groups idiomatically

I often want a dataframe of counts of how many members are in each group after a groupby operation in pandas. I have a verbose way of doing it with size, reset_index, and rename, but I'm sure there is a better way.
Here's an example of what I'd like to do:
import pandas as pd
import numpy as np
np.random.seed(0)
colors = ['red','green','blue']
cdf = pd.DataFrame({
    'color1': np.random.choice(colors, 10),
    'color2': np.random.choice(colors, 10),
})
print(cdf)
#better way to do next line? (somehow use agg?)
gb_count = cdf.groupby(['color1','color2']).size().reset_index().rename(columns={0:'num'})
print(gb_count)
#cdf.groupby(['color1','color2']).count() #<-- this doesn't work
Final output:
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
To avoid getting a MultiIndex, use as_index=False:
cdf.groupby(['color1','color2'], as_index=False).size()
color1 color2 size
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
If you explicitly want to name your new column num, you can use reset_index with name=..., since groupby(...).size() returns a Series:
cdf.groupby(['color1','color2']).size().reset_index(name='num')
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
Another way is to aggregate with agg('size'), send the result to_frame with the preferred column name, and then reset the index.
gb_count = cdf.groupby(['color1','color2']).agg('size').to_frame('num').reset_index()
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
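If your pandas version has DataFrame.value_counts (added in pandas 1.1), that is one more option; a sketch assuming the cdf from the question (value_counts sorts by count, so sort_index restores the order shown above):
gb_count = (cdf.value_counts(['color1', 'color2'])
               .sort_index()
               .reset_index(name='num'))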

Split (Explode) String into Multiple columns and rows - Python

Good afternoon, I am trying to split text in a column into a specific format.
Here is my table below:
UserId Application
1 Grey Blue::Black Orange;White:Green
2 Yellow Purple::Orange Grey;Blue Pink::Red
I would like it to read the following:
UserId Application Role
1 Grey Blue Black Orange
1 White Green
2 Yellow Purple Orange Grey
2 Blue Pink Red
So far my code is:
def unnesting(df, explode):
idx=df.index.repeat(df[explode[0]].str.len())
df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
df1.index=idx
return df1.join(df.drop(explode,1),how='left')
df['Application']=df.Roles.str.split(';|::|:').map(lambda x : x[0::2])
unnesting(df.drop('Roles',1),['Application'])
The output currently reads:
UserId Application
1 Grey Blue
1 White
2 Yellow Purple
2 Blue Pink
I do not know how to add the second column (Role) to the code for the second split after ::
Given this dataframe:
UserId Application
0 1 Grey Blue::Black Orange;White::Green
1 2 Yellow Purple::Orange Grey;Blue Pink::Red
you could at least achieve the last two columns directly via
df.Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_0', 'level_1'])
which results in
0 1
0 Grey Blue Black Orange
1 White Green
2 Yellow Purple Orange Grey
3 Blue Pink Red
However, setting UserId as the index beforehand also provides the proper UserId column:
result = df.set_index('UserId').Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_1'])
result.columns = ['UserId', 'Application', 'Role']
UserId Application Role
0 1 Grey Blue Black Orange
1 1 White Green
2 2 Yellow Purple Orange Grey
3 2 Blue Pink Red
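An alternative sketch using Series.explode (available since pandas 0.25), with the frame rebuilt from the display above:
import pandas as pd

# Frame as shown at the top of this answer (assumed construction).
df = pd.DataFrame({
    'UserId': [1, 2],
    'Application': ['Grey Blue::Black Orange;White::Green',
                    'Yellow Purple::Orange Grey;Blue Pink::Red'],
})

# Split on ';' into lists, explode to one row per list element,
# then split each element on '::' into the two target columns.
exploded = df.set_index('UserId')['Application'].str.split(';').explode()
result = exploded.str.split('::', expand=True).reset_index()
result.columns = ['UserId', 'Application', 'Role']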

Convert text table to pandas dataframe

Many times when I'm trying to answer questions on Stack Overflow, the question contains a table, which I have to convert to a pandas dataframe in order to process it. For example, in this question:
http://stackoverflow.com/questions/43172116/pandas-count-some-value-in-all-columns
My question is, is there a faster way to convert it to a dataframe rather than writing:
df = pd.DataFrame({'graph': [1, 2, 3, 4, 5, 6],
                   0: ['blue', 'blue', 'red', 'red', 'blue', 'blue'],
                   1: ['blue', 'blue', 'red', 'blue', 'red', 'blue'],
                   2: ['blue', 'blue', 'blue', 'red', 'blue', 'blue'],
                   3: ['blue', 'blue', 'blue', 'red', 'red', 'blue'],
                   4: ['blue', 'blue', 'red', 'blue', 'red', 'blue']})
given that I can copy the text:
graph 0 1 2 3 4
1 blue blue blue blue blue
2 blue blue blue blue blue
3 blue red blue blue red
4 red blue red red blue
5 red red blue red red
6 blue blue blue blue blue
Make sure the desired data set is in the clipboard and use the pd.read_clipboard() method.
Step by step:
mark desired data set
press Ctrl+C (for MS Windows)
execute: df = pd.read_clipboard()
In [40]: df = pd.read_clipboard()
In [41]: df
Out[41]:
graph 0 1 2 3 4
0 1 blue blue blue blue blue
1 2 blue blue blue blue blue
2 3 blue red blue blue red
3 4 red blue red red blue
4 5 red red blue red red
5 6 blue blue blue blue blue
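If the table is pasted into a script rather than sitting on the clipboard, the same idea works with read_csv on a string buffer; a minimal sketch assuming whitespace-separated text:
import io
import pandas as pd

text = """graph 0 1 2 3 4
1 blue blue blue blue blue
2 blue blue blue blue blue
3 blue red blue blue red
4 red blue red red blue
5 red red blue red red
6 blue blue blue blue blue"""

# Whitespace-separated columns, the same default separator read_clipboard uses.
df = pd.read_csv(io.StringIO(text), sep=r'\s+')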

Pandas: How do I check for value match between columns in same dataframe?

I am a complete coding novice and have been experimenting with Pandas. This is my first post. Thank you in advance for your help!
I would like to remove any rows where cat1 does not match either dog1 or dog2. It does not have to match both, just one or the other.
cat1 dog1 dog2
0 red red blue
1 red green blue
2 blue red blue
3 blue blue green
4 red green blue
I would like the end result to be as follows:
cat1 dog1 dog2
0 red red blue
2 blue red blue
3 blue blue green
How do I do this?
This is really simple:
df.query('cat1 == dog1 or cat1 == dog2')
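Equivalently, with a plain boolean mask (a sketch of the same filter without query):
# Keep rows where cat1 matches either dog1 or dog2.
mask = (df['cat1'] == df['dog1']) | (df['cat1'] == df['dog2'])
df[mask]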
