Fill dataframe nan values from a join

I am trying to map owners to an IP address through the use of two tables, df1 & df2. df1 contains the IP list to be mapped and df2 contains an IP, an alias, and the owner. After running a join on the IP column, it gives me a half joined dataframe. Most of the remaining data can be joined by replacing the NaN values with a join on the Alias column, but I can’t figure out how to do it.
My initial thoughts were to try nesting pd.merge inside fillna(), but it won't accept a dataframe. Any help would be greatly appreciated.
import pandas as pd
df1 = pd.DataFrame({'IP': ['192.18.0.100', '192.18.0.101', '192.18.0.102', '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({'IP': ['192.18.0.100', '192.18.0.101', '192.18.1.206', '192.18.1.218', '192.18.1.118'],
                    'Alias': ['192.18.1.214', '192.18.1.243', '192.18.0.102', '192.18.0.103', '192.18.1.180'],
                    'Owner': ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny', 'Hale, Fred', 'Harris, Robert']})
new_df = pd.merge(df1, df2[['IP', 'Owner']], on='IP', how='left')
Expected output is:
IP            Owner
192.18.0.100  Smith, Jim
192.18.0.101  Bates, Andrew
192.18.0.102  Kline, Jenny
192.18.0.103  Hale, Fred
192.18.0.104  nan

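There is no need to merge; just pull the owner wherever the condition is satisfied. This is faster and simpler than a merge, but it compares df1 and df2 row by row, so it only works when both frames have the same length and row order, as they do in the sample data.
import numpy as np
condition = (df1['IP'] == df2['IP']) | (df1['IP'] == df2['Alias'])
df1['Owner'] = np.where(condition, df2['Owner'], np.nan)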
print(df1)
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN
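
The element-wise comparison pairs row i of df1 with row i of df2. A minimal sketch of what that means in practice, using the question's data with df2 reversed (the name df2_reversed is only for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'IP': ['192.18.0.100', '192.18.0.101', '192.18.0.102',
                           '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({
    'IP': ['192.18.0.100', '192.18.0.101', '192.18.1.206', '192.18.1.218', '192.18.1.118'],
    'Alias': ['192.18.1.214', '192.18.1.243', '192.18.0.102', '192.18.0.103', '192.18.1.180'],
    'Owner': ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny', 'Hale, Fred', 'Harris, Robert'],
})

# Same rows as df2, in reverse order, with a fresh 0..4 index.
df2_reversed = df2.iloc[::-1].reset_index(drop=True)

# Row i of df1 is compared only with row i of df2_reversed.
condition = (df1['IP'] == df2_reversed['IP']) | (df1['IP'] == df2_reversed['Alias'])
owners = pd.Series(np.where(condition, df2_reversed['Owner'], np.nan))

# Count how many rows were filled after the reorder.
print(owners.notna().sum())
```

Only the middle row still lines up after the reversal, so a single owner is filled. When the row order of the two frames is not guaranteed, a key-based lookup (merge or map) is the safer choice.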

Try this one:
lookup = pd.concat([df2[['IP', 'Owner']],
                    df2[['Alias', 'Owner']].rename(columns={'Alias': 'IP'})]).drop_duplicates()
new_df = pd.merge(df1, lookup, on='IP', how='left')
The result:
>>> new_df
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN
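
The two-step idea from the question (join on IP, then fill the gaps with a join on Alias) can also be written directly. This is a minimal sketch, assuming each address appears at most once in df2['IP'] and in df2['Alias'], so both left merges keep df1's row order and index. The names by_ip and by_alias are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'IP': ['192.18.0.100', '192.18.0.101', '192.18.0.102',
                           '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({
    'IP': ['192.18.0.100', '192.18.0.101', '192.18.1.206', '192.18.1.218', '192.18.1.118'],
    'Alias': ['192.18.1.214', '192.18.1.243', '192.18.0.102', '192.18.0.103', '192.18.1.180'],
    'Owner': ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny', 'Hale, Fred', 'Harris, Robert'],
})

# First pass: match df1's IP against df2's IP column.
by_ip = df1.merge(df2[['IP', 'Owner']], on='IP', how='left')

# Second pass: match df1's IP against df2's Alias column.
by_alias = df1.merge(df2[['Alias', 'Owner']].rename(columns={'Alias': 'IP'}),
                     on='IP', how='left')

# fillna accepts a Series and aligns it on the index, so only the gaps are filled.
by_ip['Owner'] = by_ip['Owner'].fillna(by_alias['Owner'])
print(by_ip)
```

fillna does not accept a DataFrame produced by a merge for a single column, but it does accept a Series, which is why the second merge's Owner column is passed in.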

Let's melt then use map:
df1['IP'].map(df2.melt('Owner').set_index('value')['Owner'])
Output:
0 Smith, Jim
1 Bates, Andrew
2 Kline, Jenny
3 Hale, Fred
4 NaN
Name: IP, dtype: object
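
The one-liner above can be unpacked into steps. This is a sketch of the same melt-then-map idea, with an added drop_duplicates as a precaution, because map needs a lookup Series with a unique index. The names long_df and lookup are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'IP': ['192.18.0.100', '192.18.0.101', '192.18.0.102',
                           '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({
    'IP': ['192.18.0.100', '192.18.0.101', '192.18.1.206', '192.18.1.218', '192.18.1.118'],
    'Alias': ['192.18.1.214', '192.18.1.243', '192.18.0.102', '192.18.0.103', '192.18.1.180'],
    'Owner': ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny', 'Hale, Fred', 'Harris, Robert'],
})

# Stack the IP and Alias columns into one 'value' column, keeping Owner per row.
long_df = df2.melt(id_vars='Owner', value_vars=['IP', 'Alias'])

# One entry per address: address -> owner.
lookup = long_df.drop_duplicates('value').set_index('value')['Owner']

# Look up each IP in df1; addresses not in the lookup become NaN.
df1['Owner'] = df1['IP'].map(lookup)
print(df1)
```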

Related

Check if pandas column contains text in another dataframe and replace values

I have two df's, one for user names and another for real names. I'd like to know how I can check if I have a real name in my first df using the data of the other, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545' since these names appear in my df2. I tried using pd.contains, but I can only verify if a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!

You can use loc[row, column] together with the Series.str.contains method to select the user names that need to be replaced with the real names. In my opinion, this solution is clear in terms of readability.
for real_name in df2['realName'].to_list():
    df1.loc[df1['userName'].str.contains(real_name), 'userName'] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
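
A loop-free variant is also possible. This is a minimal sketch, not the answer's method: it builds one regular expression from all real names and extracts the first match per row. re.escape is used on the assumption that the names should be matched literally, and the names pattern and found are illustrative.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'userName': ['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName': ['alice', 'peter', 'john', 'francis', 'joe', 'carol']})

# One capture group holding every real name as a literal alternative.
pattern = '(' + '|'.join(re.escape(name) for name in df2['realName']) + ')'

# expand=False returns a Series; rows without a match are NaN.
found = df1['userName'].str.extract(pattern, expand=False)

# Keep the original user name where no real name was found.
df1['userName'] = found.fillna(df1['userName'])
print(df1)
```

If one real name is a prefix of another, the alternative listed first in the pattern wins, so the order of df2 matters in that case.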

Python: Sum values in DataFrame if other values match between DataFrames

I have two dataframes of different lengths, like these:
DataFrame A:
FirstName LastName
Adam Smith
John Johnson
DataFrame B:
First Last Value
Adam Smith 1.2
Adam Smith 1.5
Adam Smith 3.0
John Johnson 2.5
Imagine that what I want to do is to create a new column in "DataFrame A" summing all the values with matching last names, so the output in "A" would be:
FirstName LastName Sums
Adam Smith 5.7
John Johnson 2.5
If I were in Excel, I'd use
=SUMIF(dfB!B:B, B2, dfB!C:C)
In Python I've tried multiple approaches using np.where, df.sum(), dropping indexes, etc., but I'm lost. The code below raises "ValueError: Can only compare identically-labeled Series objects", and I don't think it's written correctly anyway.
df_a['Sums'] = df_a[df_a['LastName'] == df_b['Last']].sum()['Value']
Huge thanks in advance for any help.

Use boolean indexing with Series.isin for filtering and then aggregate with sum:
df = (df_b[df_b['Last'].isin(df_a['LastName'])]
      .groupby(['First','Last'], as_index=False)['Value']
      .sum())
If you want to match on both first and last name:
df = (df_b.merge(df_a, left_on=['First','Last'], right_on=['FirstName','LastName'])
      .groupby(['First','Last'], as_index=False)['Value']
      .sum())

df_b_a = (pd.merge(df_b, df_a, left_on=['First', 'Last'], right_on=['FirstName', 'LastName'], how='left')
          .groupby(by=['First', 'Last'], as_index=False)['Value'].sum())
print(df_b_a)
First Last Value
0 Adam Smith 5.7
1 John Johnson 2.5

Use DataFrame.merge + DataFrame.groupby:
new_df = (dfa.merge(dfb.groupby(['First','Last'], as_index=False).Value.sum(),
                    left_on='LastName', right_on='Last', how='left')
             .drop('Last', axis=1))
print(new_df)
To join on both columns:
new_df = (dfa.merge(dfb.groupby(['First','Last'], as_index=False).Value.sum(),
                    left_on=['FirstName','LastName'], right_on=['First','Last'], how='left')
             .drop(['First','Last'], axis=1))
print(new_df)
Output:
FirstName LastName Value
0 Adam Smith 5.7
1 John Johnson 2.5
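
For the literal SUMIF behaviour from the question (match on last name only), a groupby followed by map is a compact option. This is a minimal sketch that rebuilds the question's two frames. The name sums is illustrative.

```python
import pandas as pd

df_a = pd.DataFrame({'FirstName': ['Adam', 'John'],
                     'LastName': ['Smith', 'Johnson']})
df_b = pd.DataFrame({'First': ['Adam', 'Adam', 'Adam', 'John'],
                     'Last': ['Smith', 'Smith', 'Smith', 'Johnson'],
                     'Value': [1.2, 1.5, 3.0, 2.5]})

# Total Value per last name: a Series indexed by 'Last'.
sums = df_b.groupby('Last')['Value'].sum()

# Look up each LastName in that Series; unmatched names would become NaN.
df_a['Sums'] = df_a['LastName'].map(sums)
print(df_a)
```

Because map aligns on values rather than on row labels, the two frames can have different lengths, which avoids the "identically-labeled Series" error from the question.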

pandas: Group by splitting string value in all rows (a column) and aggregation function

If I have a dataset like this:
id person_name salary
0 [alexander, william, smith] 45000
1 [smith, robert, gates] 65000
2 [bob, alexander] 56000
3 [robert, william] 80000
4 [alexander, gates] 70000
If we sum the salary column, we get 316000.
I want to know how much each distinct name ('alexander', 'smith', etc.) makes in total, summing the salary of every row whose person_name list contains that name.
output:
group sum_salary
alexander 171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william 125000 #sum from id 0 + 3
smith 110000 #sum from id 0 + 1
robert 145000 #sum from id 1 + 3
gates 135000 #sum from id 1 + 4
bob 56000 #sum from id 2
As we can see, the total of the sum_salary column is not the same as in the initial dataset, because each salary is counted once for every name in its row.
It seems similar to counting strings, but what confuses me is how to apply the aggregation function. I've tried creating a new list of distinct values from the person_name column, but then I got stuck.
Any help is appreciated. Thank you very much.

Solutions working with lists in column person_name:
#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')
print (type(df.loc[0, 'person_name']))
<class 'list'>
The first idea is to use a defaultdict to store the summed values in a loop:
from collections import defaultdict
d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
    for x in p:
        d[x] += int(s)
print (d)
defaultdict(<class 'int'>, {'alexander': 171000,
'william': 125000,
'smith': 110000,
'robert': 145000,
'gates': 135000,
'bob': 56000})
And then:
df1 = pd.DataFrame({'group':list(d.keys()),
'sum_salary':list(d.values())})
print (df1)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another solution with repeating values by length of lists and aggregate sum:
from itertools import chain
df1 = pd.DataFrame({
'group' : list(chain.from_iterable(df['person_name'].tolist())),
'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})
df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000

Another solution:
import numpy as np
df_new = pd.DataFrame({'person_name': np.concatenate(df.person_name.values),
                       'salary': df.salary.repeat(df.person_name.str.len())})
print(df_new.groupby('person_name')['salary'].sum().reset_index())
person_name salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000

Can be done concisely with dummies though performance will suffer due to all of the .str methods:
df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()
#alexander 171000
#bob 56000
#gates 135000
#robert 145000
#smith 110000
#william 125000
#dtype: int64

I parsed this as strings of lists by copying the OP's data and using pandas.read_clipboard(). If the column really is a series of strings rather than lists, this solution works:
df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars='salary').drop(columns='variable')
# Some cleaning up (literal replacements, not regex), then a simple groupby
df.value = df.value.str.replace('[', '', regex=False)
df.value = df.value.str.replace(']', '', regex=False)
df.value = df.value.str.replace(' ', '', regex=False)
df.groupby('value')['salary'].sum()
Output:
value
alexander 171000
bob 56000
gates 135000
robert 145000
smith 110000
william 125000

Another way to do this is with iterrows(). It will not be as fast as jezrael's solution, but it works:
ids = []
names = []
salarys = []
# Iterate over the rows and extract the names from the lists in person_name column
for ix, row in df.iterrows():
    for name in row['person_name']:
        ids.append(row['id'])
        names.append(name)
        salarys.append(row['salary'])
# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id':ids,
'names':names,
'salary':salarys})
# Groupby on person_name and get the sum
print(df_new.groupby('names').salary.sum().reset_index())
Output
names salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
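
The manual unnesting above can also be done with DataFrame.explode, which is available in pandas 0.25 and later. This is a minimal sketch, assuming person_name holds real Python lists as in the first answer. The names exploded and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'person_name': [['alexander', 'william', 'smith'],
                    ['smith', 'robert', 'gates'],
                    ['bob', 'alexander'],
                    ['robert', 'william'],
                    ['alexander', 'gates']],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# One row per (name, salary) pair; the salary is repeated for each name in the list.
exploded = df.explode('person_name')

# Sum per name, keeping the order of first appearance.
result = (exploded.groupby('person_name', sort=False)['salary']
          .sum()
          .rename_axis('group')
          .reset_index(name='sum_salary'))
print(result)
```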

Conditionally align two dataframes in order to derive a column passed in as a condition in numpy where

I come from a SQL background and am new to Python. I have been trying to figure out how to solve this particular problem for a while now and am unable to come up with anything.
Here are my dataframes
from pandas import DataFrame
import numpy as np
Names1 = {'First_name': ['Jon','Bill','Billing','Maria','Martha','Emma']}
df = DataFrame(Names1,columns=['First_name'])
print(df)
names2 = {'name': ['Jo', 'Bi', 'Ma']}
df_2 = DataFrame(names2,columns=['name'])
print(df_2)
Results to this:
First_name
0 Jon
1 Bill
2 Billing
3 Maria
4 Martha
5 Emma
name
0 Jo
1 Bi
2 Ma
This code helps me identify in df which First_name starts with a tuple from df_2
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), 'true', df['First_name'])
results to this:
First_name like_flg
0 Jon true
1 Bill true
2 Billing true
3 Maria true
4 Martha true
5 Emma Emma
I would like the final output of the dataframe to set the like_flg to the value of the tuple in which the First_name field is being conditionally compared against. See below for final desired output:
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
Here's what I've tried so far
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), tuple(list(df_2['name'])), df['First_name'])
results to this error:
`ValueError: operands could not be broadcast together with shapes (6,) (3,) (6,)`
I've also tried aligning both dataframes, however, that won't work for the use case that I'm trying to achieve.
Is there a way to conditionally align dataframes to fill in the columns that start with the tuple?
I believe the issue I'm facing is that the tuple or dataframe that I'm using as a comparison is not the same size as the dataframe that I want to append the tuple to. Please see above for the desired output.
Thank you all in advance!

If your starting strings differ in length, you can use .str.extract
df['like_flag'] = df['First_name'].str.extract('^('+'|'.join(df_2.name)+')')
df['like_flag'] = df['like_flag'].fillna(df.First_name) # Fill non matches.
I modified df_2 to be
name
0 Jo
1 Bi
2 Mar
which leads to:
First_name like_flag
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Mar
4 Martha Mar
5 Emma Emma

You can use np.where, provided every prefix in df_2 is exactly two characters long:
df['like_flg'] = np.where(df.First_name.str[:2].isin(df_2.name), df.First_name.str[:2], df.First_name)
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma

Do it with NumPy's char.find:
v=df.First_name.values.astype(str)
s=df_2.name.values.astype(str)
df_2.name.dot((np.char.find(v,s[:,None])==0))
array(['Jo', 'Bi', 'Bi', 'Ma', 'Ma', ''], dtype=object)
Then we just assign it back
df['New']=df_2.name.dot((np.char.find(v,s[:,None])==0))
df.loc[df['New']=='','New']=df.First_name
df
First_name New
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
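
When the prefix list is short, a plain Python helper is easy to read and works for prefixes of any length. This is a minimal sketch, not one of the approaches above. The helper first_prefix and the list prefixes are hypothetical names introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'First_name': ['Jon', 'Bill', 'Billing', 'Maria', 'Martha', 'Emma']})
df_2 = pd.DataFrame({'name': ['Jo', 'Bi', 'Ma']})

prefixes = df_2['name'].tolist()

def first_prefix(name):
    """Return the first prefix that `name` starts with, or `name` itself if none match."""
    return next((p for p in prefixes if name.startswith(p)), name)

# Apply the helper to every first name.
df['like_flg'] = df['First_name'].map(first_prefix)
print(df)
```

If several prefixes match the same name, the one listed first in df_2 is returned, so the order of df_2 decides ties.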

Adding Dates (Series) column from one DataFrame to the other Pandas, Python

I am trying to 'broadcast' a date column from df1 to df2.
In df1 I have the names of all the users and their basic information.
In df2 I have a list of purchases made by the users.
Assuming I have a much bigger dataset (the one in the full code below is only a sample), how can I add just(!) the df1['DoB'] column to df2?
I have tried both concat() and merge(), but neither of them seems to work; the failing lines are marked in the full code below.
The only way it seems to work is only if I merge both df1 and df2 together and then just delete the columns I don't need. But if I have tens of unwanted columns, it is going to be very problematic.
The full code (including the lines that throw an error):
import pandas as pd
df1 = pd.DataFrame(columns=['Name','Age','DoB','HomeTown'])
df1['Name'] = ['John', 'Jack', 'Wendy','Paul']
df1['Age'] = [25,23,30,31]
df1['DoB'] = pd.to_datetime(['04-01-2012', '03-02-1991', '04-10-1986', '06-03-1985'], dayfirst=True)
df1['HomeTown'] = ['London', 'Brighton', 'Manchester', 'Jersey']
df2 = pd.DataFrame(columns=['Name','Purchase'])
df2['Name'] = ['John','Wendy','John','Jack','Wendy','Jack','John','John']
df2['Purchase'] = ['fridge','coffee','washingmachine','tickets','iPhone','stove','notebook','laptop']
df2 = df2.concat(df1) # error
df2 = df2.merge(df1['DoB'], on='Name', how='left') #error
df2 = df2.merge(df1, on='Name', how='left')
del df2['Age'], df2['HomeTown']
df2 #that's how i want it to look like
Any help would be much appreciated. Thank you :)

I think you need to merge with the subset [['Name','DoB']]; the Name column is needed for matching:
print (df1[['Name','DoB']])
Name DoB
0 John 2012-01-04
1 Jack 1991-02-03
2 Wendy 1986-10-04
3 Paul 1985-03-06
df2 = df2.merge(df1[['Name','DoB']], on='Name', how='left')
print (df2)
Name Purchase DoB
0 John fridge 2012-01-04
1 Wendy coffee 1986-10-04
2 John washingmachine 2012-01-04
3 Jack tickets 1991-02-03
4 Wendy iPhone 1986-10-04
5 Jack stove 1991-02-03
6 John notebook 2012-01-04
7 John laptop 2012-01-04
Another solution with map by Series s:
s = df1.set_index('Name')['DoB']
print (s)
Name
John 2012-01-04
Jack 1991-02-03
Wendy 1986-10-04
Paul 1985-03-06
Name: DoB, dtype: datetime64[ns]
df2['DoB'] = df2.Name.map(s)
print (df2)
Name Purchase DoB
0 John fridge 2012-01-04
1 Wendy coffee 1986-10-04
2 John washingmachine 2012-01-04
3 Jack tickets 1991-02-03
4 Wendy iPhone 1986-10-04
5 Jack stove 1991-02-03
6 John notebook 2012-01-04
7 John laptop 2012-01-04
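
Both approaches assume each name appears only once in df1. A hedged sketch of how that assumption could be checked during the merge, using the validate parameter of DataFrame.merge, which raises an error if the right-hand keys are not unique. The frames are rebuilt from the question's data and the name result is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['John', 'Jack', 'Wendy', 'Paul'],
    'Age': [25, 23, 30, 31],
    'DoB': pd.to_datetime(['04-01-2012', '03-02-1991', '04-10-1986', '06-03-1985'],
                          dayfirst=True),
    'HomeTown': ['London', 'Brighton', 'Manchester', 'Jersey'],
})
df2 = pd.DataFrame({
    'Name': ['John', 'Wendy', 'John', 'Jack', 'Wendy', 'Jack', 'John', 'John'],
    'Purchase': ['fridge', 'coffee', 'washingmachine', 'tickets',
                 'iPhone', 'stove', 'notebook', 'laptop'],
})

# Bring in only Name (the key) and DoB; many purchases may map to one user.
result = df2.merge(df1[['Name', 'DoB']], on='Name', how='left',
                   validate='many_to_one')
print(result)
```

Without the check, a duplicated name in df1 would silently duplicate purchase rows in the merged frame.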
