Joining two dfs based on different col names? [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two dataframes such as:
dfa:
Name | ID   | Amount
Bob  | V434 | 50.00
Jill | E333 | 22.11
Hank | B442 | 11.11
dfb:
Name  | ID_First | ID_Second | ID_Third
Bob   | V434     | E333      | B442
Karen | V434     | E333      | B442
Jill  | V434     | E333      | B442
Hank  | V434     | E333      | B442
I want to join dfa to dfb, but the ID in dfa corresponds to only one of the ID columns in dfb.
Is there a way I can join dfa to dfb so that, if dfa's ID matches any of the ID columns in dfb, the Amount from dfa is carried over?
Required output would just be:
Name | ID_First | ID_Second | ID_Third | Amount
Bob  | V434     | E333      | B442     | 50.00
Jill | V434     | E333      | B442     | 22.11
Hank | V434     | E333      | B442     | 11.11
In other words: join on Name, which exists in both tables; the ID from dfa appears in dfb under only one of ID_First, ID_Second, or ID_Third, so the Amount should be matched where both the Name and that single ID value agree.
Thanks

You could attempt a merge on each of the three ID columns, though I'm not sure how efficient that would be. This wouldn't account for cases where you have multiple matches across IDs, if such a thing is possible. The following might work:
new_df = pd.DataFrame()
for col in ['ID_First', 'ID_Second', 'ID_Third']:
    df = pd.merge(dfa, dfb, left_on='ID', right_on=col, how='left')
    # DataFrame.append was removed in pandas 2.0; use pd.concat on newer versions
    new_df = df if new_df.empty else new_df.append(df)
I don't think you can have an 'OR' condition in pd.merge.
This is another possibility:
Python Pandas: How to merge based on an "OR" condition?
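A different way to express the same OR-style match, as a sketch (using the dfa/dfb names and column names from the question): reshape dfb's ID columns into long form with melt so a single merge on Name and ID covers all three, then attach the matched amounts back to the wide dfb.
# Sketch: melt dfb's ID columns so one merge on ['Name', 'ID'] checks all three.
long_b = dfb.melt(id_vars='Name',
                  value_vars=['ID_First', 'ID_Second', 'ID_Third'],
                  var_name='id_col', value_name='ID')

# Rows where dfa's (Name, ID) pair appears in any of dfb's ID columns
amounts = dfa.merge(long_b, on=['Name', 'ID'], how='inner')[['Name', 'Amount']]

# Attach the matched Amount back to the wide dfb
result = dfb.merge(amounts, on='Name', how='inner')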

You can do three inner joins, one on each of your ID columns, and concatenate them:
df1 = pd.DataFrame([['Bob', 'V434', 50.00], ['Jill', 'E333', 22.11], ['Hank', 'B442', 11.11]],
                   columns=['Name', 'ID', 'Amount'])
df2 = pd.DataFrame([['Bob', 'V434', 'E333', 'B442'],
                    ['Karen', 'V434', 'E333', 'B442'],
                    ['Jill', 'V434', 'E333', 'B442'],
                    ['Hank', 'V434', 'E333', 'B442']],
                   columns=['Name', 'ID_First', 'ID_Second', 'ID_Third'])
print(pd.concat([df1.merge(df2, left_on=['ID', 'Name'], right_on=['ID_First', 'Name']),
                 df1.merge(df2, left_on=['ID', 'Name'], right_on=['ID_Second', 'Name']),
                 df1.merge(df2, left_on=['ID', 'Name'], right_on=['ID_Third', 'Name'])])[['Name', 'ID', 'Amount']])
Output:
Name ID Amount
0 Bob V434 50.00
0 Jill E333 22.11
0 Hank B442 11.11
Improvising on @Ian's answer to get the desired output:
new_df = pd.DataFrame()
for col in ['ID_First', 'ID_Second', 'ID_Third']:
    df = pd.merge(df1, df2, left_on=['ID', 'Name'], right_on=[col, 'Name'], how='inner')
    new_df = df if new_df.empty else new_df.append(df)
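In recent pandas versions DataFrame.append has been removed, so an equivalent sketch collects the merges in a list and concatenates once:
frames = []
for col in ['ID_First', 'ID_Second', 'ID_Third']:
    frames.append(pd.merge(df1, df2, left_on=['ID', 'Name'],
                           right_on=[col, 'Name'], how='inner'))
new_df = pd.concat(frames, ignore_index=True)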

Solution
You can do this with a simple merge statement as follows.
pd.merge(dfa[['Name', 'Amount']], dfb, how='inner', on='Name')
Note: When merging dfa and dfb, the ID columns (dfa.ID and dfb's ID_* columns) do not act as primary keys, nor are their values unique. The only thing that matters here is the inner join of dfa and dfb on the "Name" column.
Output:
   Name  Amount ID_First ID_Second ID_Third
0   Bob   50.00     V434      E333     B442
1  Jill   22.11     V434      E333     B442
2  Hank   11.11     V434      E333     B442
For Reproducibility
You may load the data and test the solution given above using the following code block:
import numpy as np
import pandas as pd
from io import StringIO
# Example Data
dfa = """
Name | ID | Amount
Bob | V434 | 50.00
Jill | E333 | 22.11
Hank | B442 | 11.11
"""
dfb = """
Name | ID_First | ID_Second | ID_Third
Bob | V434 | E333 | B442
Karen | V434 | E333 | B442
Jill | V434 | E333 | B442
Hank | V434 | E333 | B442
"""
# Load Data and Clean up empty spaces
# in headers and columns
dfa = pd.read_csv(StringIO(dfa), sep='|')
dfb = pd.read_csv(StringIO(dfb), sep='|')
dfa.columns = dfa.columns.str.strip()
dfb.columns = dfb.columns.str.strip()
for col in dfa.columns:
    if col == 'Amount':
        dfa[col] = dfa[col].astype(str).str.strip().astype(float)
    else:
        dfa[col] = dfa[col].str.strip()
for col in dfb.columns:
    dfb[col] = dfb[col].str.strip()
# merge dfa and dfb: Note that dfa.ID and dfb.ID do not act
# like primary keys, neither are their values unique.
# The only thing that matters here is to inner join dfa
# and dfb using the "Name" column.
pd.merge(dfa[['Name', 'Amount']], dfb, how='inner', on='Name')

Related

Convert multiple rows into one row with multiple columns in pyspark?

I have something like this (I've simplified the number of columns for brevity; there are about 10 other attributes):
id  name  foods    foods_eaten  color  continent
1   john  apples   2            red    Europe
1   john  oranges  3            red    Europe
2   jack  apples   1            blue   North America
I want to convert it to:
id  name  apples  oranges  color  continent
1   john  2       3        red    Europe
2   jack  1       0        blue   North America
Edit:
(1) I updated the data to show a few more of the columns.
(3) I've done
df_piv = df.groupBy(['id', 'name', 'color', 'continent', ...]).pivot('foods').avg('foods_eaten')
Is there a simpler way to do this sort of thing? As far as I can tell, I'll need to groupby almost every attribute to get my result.
Extending from what you have done so far and leveraging the approach linked here:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import collect_list
>>> data = [{'id': 1, 'name': 'john', 'foods': "apples"}, {'id': 1, 'name': 'john', 'foods': "oranges"}, {'id': 2, 'name': 'jack', 'foods': "banana"}]
>>> dataframe = spark.createDataFrame(data)
>>> dataframe.show()
+-------+---+----+
| foods| id|name|
+-------+---+----+
| apples| 1|john|
|oranges| 1|john|
| banana| 2|jack|
+-------+---+----+
>>> grouping_cols = ["id", "name"]
>>> other_cols = [c for c in dataframe.columns if c not in grouping_cols]
>>> df = dataframe.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols])
>>> df.show()
+---+----+-----------------+
| id|name| foods|
+---+----+-----------------+
| 1|john|[apples, oranges]|
| 2|jack| [banana]|
+---+----+-----------------+
>>> df_sizes = df.select(*[F.size(col).alias(col) for col in other_cols])
>>> df_max = df_sizes.agg(*[F.max(col).alias(col) for col in other_cols])
>>> max_dict = df_max.collect()[0].asDict()
>>> df_result = df.select('id', 'name', *[df[col][i] for col in other_cols for i in range(max_dict[col])])
>>> df_result.show()
+---+----+--------+--------+
| id|name|foods[0]|foods[1]|
+---+----+--------+--------+
| 1|john| apples| oranges|
| 2|jack| banana| null|
+---+----+--------+--------+
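If the goal is the wide apples/oranges layout from the question without listing every attribute in the groupBy, a possible sketch (untested here; it assumes df is the questioner's original dataframe with the id, name, foods, foods_eaten, color and continent columns) is to pivot on the key only and join the remaining attributes back:
from pyspark.sql import functions as F

# Pivot foods_eaten by food, grouping on the key column only
pivoted = (df.groupBy("id")
             .pivot("foods")
             .agg(F.sum("foods_eaten"))
             .na.fill(0))

# Carry the other attributes separately (one row per id) and join them back
attrs = df.select("id", "name", "color", "continent").dropDuplicates(["id"])
result = attrs.join(pivoted, on="id", how="inner")
result.show()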

How to fill data from similar columns into a particular column (pandas)?

I have a script that converts files.
import pandas as pd
df = pd.read_csv("sample1.csv")
final_df = df.reindex(['id','name','email'],axis=1)
final_df.to_csv("output.csv", index = False)
sample1.csv
|name|email|id|
|--| -- | -- |
output.csv
|id|name|email|
|--| -- | -- |
Now, if the other sample files are in the format like below, how to arrange them in the format same as output.csv
sample2.csv
|id|first name |email address|
|--| -- | -- |
|1 | Sta |sta#example.com|
|2 |Danny|dany#example.com|
|3 |Elle |elle#example.com|
sample3.csv
|id|initial name |email id|
|--| -- | -- |
|1 | Ricky|ricky#example.com|
|2 |Sham|sham#example.com|
|3 |Mark|#example.com|
sample4.csv
| id |alias|contact|
|-- | -- | -- |
| 1 | Ricky|ricky#example.com|
|2 |Sham|sham#example.com|
|3 |Mark|#example.com|
I want to convert these files and place their data in the columns of the output file. For example, first name, initial name, and alias all refer to name (they mean the same thing), and email address, email id, and contact all refer to email. The order of columns can be random in the sample files.
The basic illustration for this case is:
switch (headerfields[i])
{
    case "firstname":
    case "initial name":
    case "alias":
        name = i;
}
Any ideas to do this in Pandas?
Select the target columns, then append to the target DataFrame.
dfn = pd.DataFrame(columns=['id', 'name', 'email'])
for df in [df1, df2, df3]:
    # select the id, name-like and email-like columns
    cond_list = [
        df.columns == 'id',
        df.columns.str.contains('name|alias', na=False),
        df.columns.str.contains('email|contact', na=False)
    ]
    cols = [df.columns[cond][0] for cond in cond_list]
    print(cols)
    # DataFrame.append is deprecated in recent pandas; pd.concat is the replacement
    dfn = dfn.append(pd.DataFrame(df[cols].values, columns=dfn.columns))
output:
['id', 'first name', 'email address']
['id', 'initial name', 'email id']
['id', 'alias', 'contact']
dfn:
id name email
0 1 Sta sta#example.com
1 2 Danny dany#example.com
2 3 Elle elle#example.com
0 1 Ricky ricky#example.com
1 2 Sham sham#example.com
2 3 Mark #example.com
0 1 Ricky ricky#example.com
1 2 Sham sham#example.com
2 3 Mark #example.com
Testing data:
import io

df_str = '''
id "first name" "email address"
1 Sta sta#example.com
2 Danny dany#example.com
3 Elle elle#example.com
'''
df1 = pd.read_csv(io.StringIO(df_str.strip()), sep=r'\s+', index_col=False)
df_str = '''
id "initial name" "email id"
1 Ricky ricky#example.com
2 Sham sham#example.com
3 Mark #example.com
'''
df2 = pd.read_csv(io.StringIO(df_str.strip()), sep=r'\s+', index_col=False)
df_str = '''
id alias contact
1 Ricky ricky#example.com
2 Sham sham#example.com
3 Mark #example.com
'''
df3 = pd.read_csv(io.StringIO(df_str.strip()), sep=r'\s+', index_col=False)
df1['1'] = 1
df2['2'] = 2
df3['3'] = 3
df1.sort_index(axis=1, inplace=True)
df2.sort_index(axis=1, inplace=True)
df3.sort_index(axis=1, inplace=True)
Not the cleanest solution, but you could test the first 5 rows of each dataframe for certain strings/numbers and assume that's your target column.
import numpy as np
import pandas as pd

def rename_and_merge_dfs(dfs: list) -> pd.DataFrame:
    new_dfs = []
    for frame in dfs:
        # assume the first numeric column is the id
        id_col = frame.head(5).select_dtypes(np.number).columns[0]
        # a column whose first rows reduce to a single '#' is taken as the email column
        email = frame.columns[frame.head(5).replace('[^#]', '', regex=True).eq('#').all()][0]
        # whatever column is left is the name
        name = list(set(frame.columns) - set([id_col, email]))[0]
        frame = frame.rename(columns={id_col: 'id', email: 'email', name: 'name'})
        new_dfs.append(frame)
    return pd.concat(new_dfs)

final = rename_and_merge_dfs([df3, df2, df1])
print(final)
id name email
1 1 Ricky ricky#example.com
2 2 Sham sham#example.com
3 3 Mark #example.com
0 1 Ricky ricky#example.com
1 2 Sham sham#example.com
2 3 Mark #example.com
0 1 Sta sta#example.com
1 2 Danny dany#example.com
2 3 Elle elle#example.com
This solved my problem.
import pandas as pd

sample1 = pd.read_csv('sample1.csv')

def mapping(df):
    # note: the renames do not depend on the loop variable, so iterating over the columns is redundant
    for column_name, column in df.transpose().iterrows():
        df.rename(columns={'first name': 'FNAME', 'email address': 'EMAIL'}, inplace=True)
        df.rename(columns={'alias': 'FNAME', 'contact': 'EMAIL'}, inplace=True)
        df.rename(columns={'initial name': 'FNAME', 'emailid': 'EMAIL'}, inplace=True)

mapping(sample1)
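A more compact variant of the same idea, as a sketch (the alias dictionary below is an assumption based on the sample headers, and the files are assumed to be plain CSVs): one rename mapping covering all known aliases, followed by reindex to force the output column order.
import pandas as pd

# Assumed alias map; extend it as new sample headers appear
ALIASES = {
    'first name': 'name', 'initial name': 'name', 'alias': 'name',
    'email address': 'email', 'email id': 'email', 'contact': 'email',
}

def normalise(path):
    df = pd.read_csv(path)
    df.columns = df.columns.str.strip()      # guard against stray spaces in headers
    df = df.rename(columns=ALIASES)
    return df.reindex(['id', 'name', 'email'], axis=1)

# normalise('sample2.csv').to_csv('output.csv', index=False)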

How to enrich a dataframe by adding columns under a specific condition

I have two different datasets:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How do I get a dataset in the following format, so I can build each user's taste profile and compare different users by a similarity score?
+-------+------+---------+---------+--------+-----+
|user_id|Action|Adventure|Animation|Children|Drama|
+-------+------+---------+---------+--------+-----+
|  100  |   0  |    1    |    1    |    1   |  0  |
|  101  |   1  |    1    |    0    |    1   |  0  |
+-------+------+---------+---------+--------+-----+
Where df is the movies dataframe and dfu is the users dataframe:
Split the 'genre' column into lists with pandas.Series.str.split, then use pandas.DataFrame.explode to turn each list element into its own row, replicating index values.
pandas.merge the two dataframes on 'movie_id'.
Use pandas.DataFrame.groupby on 'user_id' and 'genre' and aggregate by count.
Shape the final result:
.unstack converts the grouped dataframe from long to wide format
.fillna replaces NaN with 0
.astype changes the numeric values from float to int
Tested in python 3.10, pandas 1.4.3
import pandas as pd
# data
movies = {'movie_id': [1000, 1001, 1002],
'title': ['Toy Story', 'Jumanji', 'Iron Man'],
'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy', 'Action|Adventure|Sci-Fi']}
users = {'user_id': [100, 101, 101],
'movie_id': [1000, 1001, 1002],
'timestep': [20200728, 20200727, 20200726]}
# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)
# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')
# explode the lists in genre
df = df.explode('genre', ignore_index=True)
# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')
# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)
# display(final)
genre Action Adventure Animation Children Fantasy Sci-Fi
user_id
100 0 1 1 1 0 0
101 1 2 0 1 1 1
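Since the stated goal is comparing users by a similarity score, a short follow-up sketch (my addition, not part of the original answer) computes cosine similarity between the rows of final; it assumes no user ends up with an all-zero genre row:
import numpy as np
import pandas as pd

profiles = final.to_numpy(dtype=float)
# Normalise each user's genre-count vector, then take pairwise dot products
norms = np.linalg.norm(profiles, axis=1, keepdims=True)
unit = profiles / norms
similarity = pd.DataFrame(unit @ unit.T, index=final.index, columns=final.index)
print(similarity)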

Make Pandas Dataframe column equal to value in another Dataframe based on index

I have 3 dataframes as below
df1
id first_name surname state
1
88
190
2509
....
df2
id given_name surname state street_num
17 John Doe NY 5
88 Tom Murphy CA 423
190 Dave Casey KY 250
....
df3
id first_name family_name state car
1 John Woods NY ford
74 Tom Kite FL vw
2509 Mike Johnson KY toyota
Some id's from df1 are in df2 and others are in df3. There are also id's in df2 and df3 that are not in df1.
EDIT: there are also some id's in df1 that are not in either df2 or df3.
I want to fill the columns in df1 with the values from the dataframe containing the id. However, I do not want all columns (so I think merge is not suitable). I have tried to use the isin function, but that way I could not update records individually and got an error. This was my attempt using isin:
df1.loc[df1.index.isin(df2.index), 'first_name'] = df2.given_name
Is there an easy way to do this without iterating through the dataframes checking if index matches?
I think you first need to rename your columns to align the DataFrames in concat and then reindex to filter by df1.index and df1.columns:
df21 = df2.rename(columns={'given_name':'first_name'})
df31 = df3.rename(columns={'family_name':'surname'})
df = pd.concat([df21, df31]).reindex(index=df1.index, columns=df1.columns)
print (df)
first_name surname state
id
1 John Woods NY
88 Tom Murphy CA
190 Dave Casey KY
2509 Mike Johnson KY
EDIT: If need intersection of indices only:
df4 = pd.concat([df21, df31])
df = df4.reindex(index=df1.index.intersection(df4.index), columns=df1.columns)
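If you would rather fill df1 in place instead of building a new frame, DataFrame.update is an option, as a sketch (assuming id is the index of all three frames, df1 already has the first_name, surname and state columns, and no id appears in both df2 and df3):
filler = pd.concat([df21, df31])
# update() aligns on the index and overwrites df1 where `filler` has non-NaN values;
# columns in `filler` that df1 lacks (street_num, car) are simply ignored
df1.update(filler)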

Add UUID's to pandas DF

Say I have a pandas DataFrame like so:
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
Name
0 John Doe
1 Jane Smith
2 John Doe
3 Jane Smith
4 Jack Dawson
5 John Doe
And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:
df:
Name UUID
0 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
1 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
2 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
3 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
4 Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
5 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
The uuid's should be generated from the uuid.uuid4() function.
My current idea is to use a groupby("Name").cumcount() to identify which rows have the same name and which are different. Then I'd create a dictionary with a key of the cumcount and a value of the uuid and use that to add the uuids to the DF.
While that would work, I'm wondering if there's a more efficient way to do this?
Grouping the data frame and applying uuid.uuid4 will be more efficient than looping through the groups. Since you want to keep the original shape of your data frame you should use pandas function transform.
Using your sample data frame, we'll add a column in order to have a series to apply transform to. Since uuid.uuid4 doesn't take any argument it really doesn't matter what the column is.
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1
Now to use transform:
import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())
+----+--------------+--------------------------------------+
| | Name | UUID |
+----+--------------+--------------------------------------+
| 0 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 1 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 3 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4 | Jack Dawson | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+
uuid.uuid4 will be called as many times as there are distinct groups.
How about this:
names = df['Name'].unique()
for name in names:
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
You could shorten it to:
for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
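A vectorised alternative, as a sketch: build one UUID per distinct name up front and map it onto the column, which avoids both transform and the Python-level loop.
import uuid
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe',
                            'Jane Smith', 'Jack Dawson', 'John Doe']})

# One UUID per distinct name, then broadcast it to every matching row
name_to_uuid = {name: uuid.uuid4() for name in df['Name'].unique()}
df['UUID'] = df['Name'].map(name_to_uuid)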
