How to link data from one Dataframe to another? [duplicate] - python

How do I merge two data frames on a common column so that the merged data frame contains only the rows whose value in that column appears in both?
I have 5000 rows of df1 in this format:
director_name actor_1_name actor_2_name actor_3_name movie_title
0 James Cameron CCH Pounder Joel David Moore Wes Studi Avatar
1 Gore Verbinski Johnny Depp Orlando Bloom Jack Davenport Pirates of the Caribbean: At World's End
2 Sam Mendes Christoph Waltz Rory Kinnear Stephanie Sigman Spectre
and 10000 rows of df2 as
movieId genres movie_title
1 Adventure|Animation|Children|Comedy|Fantasy Toy Story
2 Adventure|Children|Fantasy Jumanji
3 Comedy|Romance Grumpier Old Men
4 Comedy|Drama|Romance Waiting to Exhale
The common column 'movie_title' shares values between the two frames, and based on those I want to keep all rows where 'movie_title' matches. The other rows should be deleted.
Any help/suggestion would be appreciated.
Note: I already tried
pd.merge(dfinal, df1, on='movie_title')
and the output comes out as just the header row:
director_name actor_1_name actor_2_name actor_3_name movie_title movieId title genres
With how="outer"/"left"/"right" I tried them all and didn't get any rows after dropping NaN, although many common values do exist.

You can use pd.merge:
import pandas as pd
pd.merge(df1, df2, on="movie_title")
Only rows whose key appears in both data frames are kept. In case you want to keep all rows from the left data frame and only add values from df2 where a matching key is available, you can use how="left":
pd.merge(df1, df2, on="movie_title", how="left")
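To make the difference concrete, here is a minimal self-contained sketch (the frames and titles below are made up for illustration, not taken from the question's data):
import pandas as pd
df1 = pd.DataFrame({'movie_title': ['Avatar', 'Spectre', 'Unlisted Film'],
                    'director_name': ['James Cameron', 'Sam Mendes', 'Jane Doe']})
df2 = pd.DataFrame({'movie_title': ['Avatar', 'Spectre', 'Toy Story'],
                    'movieId': [10, 20, 1]})
print(pd.merge(df1, df2, on='movie_title'))              # 2 rows: Avatar, Spectre
print(pd.merge(df1, df2, on='movie_title', how='left'))  # 3 rows: 'Unlisted Film' gets NaN movieId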

We can merge two data frames in several ways; the most common way in Python is the merge operation in pandas.
import pandas as pd
dfinal = df1.merge(df2, on="movie_title", how='inner')
To merge on columns with different names in the two dataframes, specify the left and right column names explicitly; this is useful when the same column goes by two different names, say 'movie_title' in one frame and 'movie_name' in the other.
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
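One caveat worth adding (my note, not from the original answer): after a left_on/right_on merge both key columns survive in the result, so you will usually drop one of them, e.g.:
dfinal = dfinal.drop(columns='movie_name')  # 'movie_title' and 'movie_name' carry the same key after the merge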
If you want to be even more specific, you may read the documentation of the pandas merge operation.

If you want a merged DataFrame in which only the common values from both data frames appear, do an inner merge:
import pandas as pd
merged_frame = pd.merge(df1, df2, on='movie_title', how='inner')

Related

Append a pandas row with data from another dataframe if a certain column matches

I have two dataframes with the same columns (they represent different years of a sporting season). If a player played in both seasons, I'd like to append certain information from the following season to that season's dataframe.
DF1
Name            PPG
Michael Jordan  31.7
DF2
Name            PPG
Michael Jordan  28.4
What I'd like to do is combine them (either into DF1 or a new DF3) and have three columns:
Name            PPG   PPG Next Season
Michael Jordan  31.7  28.4
Not all players played in both seasons. How can I check all the players in DF1, see if they played in DF2, and if so add a new column to DF1 tracking those players' DF2 PPG?
import pandas as pd
df1 = pd.DataFrame({'Name': ['Michael Jordan'], 'PPG': [31.7]})
df2 = pd.DataFrame({'Name': ['Michael Jordan'], 'PPG': [28.4]})
df3 = df1.merge(df2, on='Name', suffixes=('', ' Next Season'))
print(df3)
The suffixes parameter is used to add a suffix to the columns in df2 to avoid duplicate column names in the merged dataframe.
             Name   PPG  PPG Next Season
0  Michael Jordan  31.7             28.4
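One caveat (my addition): merge defaults to an inner join, so players who appear only in df1 are dropped from the result. Since not every player played both seasons, a left join keeps every df1 player, with NaN in the new column for those missing from df2; a sketch:
df3 = df1.merge(df2, on='Name', how='left', suffixes=('', ' Next Season'))
# players absent from df2 keep their row, with NaN under 'PPG Next Season'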

How to drop rows in one DataFrame based on one similar column in another Dataframe that has a different number of rows

I have two DataFrames that are completely dissimilar except for certain values in one particular column:
df
First Last Email Age
0 Adam Smith email1#email.com 30
1 John Brown email2#email.com 35
2 Joe Max email3#email.com 40
3 Will Bill email4#email.com 25
4 Johnny Jacks email5#email.com 50
df2
ID Location Contact
0 5435 Austin email5#email.com
1 4234 Atlanta email1#email.com
2 7896 Toronto email3#email.com
How would I go about finding the matching values in the Email column of df and the Contact column of df2, and then dropping the whole row in df based on that match?
Output I'm looking for (index numbering doesn't matter):
df1
First Last Email Age
1 John Brown email2#email.com 35
3 Will Bill email4#email.com 25
I've been able to identify matches using a few different methods like:
Changing the column names to be identical
common = df.merge(df2,on=['Email'])
df3 = df[(~df['Email'].isin(common['Email']))]
But df3 still shows all the rows from df.
I've also tried:
common = df['Email'].isin(df2['Contact'])
df.drop(df[common].index, inplace = True)
And again, identifies the matches but df still contains all original rows.
So the main thing I'm having difficulty with is updating df with the matches dropped or creating a new DataFrame that contains only the rows with dissimilar values when comparing the Email column from df and the Contact column in df2. Appreciate any suggestions.
As mentioned in the comments (@Arkadiusz), it is enough to filter your data as follows:
df3 = df[(~df['Email'].isin(df2.Contact))].copy()
print(df3)
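A quick self-contained check of that one-liner, rebuilding the question's frames by hand, returns exactly the two desired rows:
import pandas as pd
df = pd.DataFrame({'First': ['Adam', 'John', 'Joe', 'Will', 'Johnny'],
                   'Last': ['Smith', 'Brown', 'Max', 'Bill', 'Jacks'],
                   'Email': ['email1#email.com', 'email2#email.com', 'email3#email.com',
                             'email4#email.com', 'email5#email.com'],
                   'Age': [30, 35, 40, 25, 50]})
df2 = pd.DataFrame({'ID': [5435, 4234, 7896],
                    'Location': ['Austin', 'Atlanta', 'Toronto'],
                    'Contact': ['email5#email.com', 'email1#email.com', 'email3#email.com']})
df3 = df[~df['Email'].isin(df2.Contact)].copy()
print(df3)  # John Brown and Will Bill remain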

Efficient pandas operation for columnwise functions on two dataframes

I have two numerical dataframes (df1 and df2), each with a common index but with different column headers. I want a function that, for the ith column of df1 and the jth column of df2, applies the Pearson correlation (or cosine similarity, or a similar user-defined function) and returns the number.
I want to collect the numbers into a dataframe, df3, where the columns of df1 are the index of df3, the columns of df2 are the columns of df3, and each cell holds the value of the correlation between the two vectors (columns) from df1 and df2.
*Not all of the values are populated. Where the populated entries differ, match only on the inner join of the two vectors (this can be done in the user-defined function). Assume df1 and df2 have a different length/number of columns from each other.
Example: I have a dataframe (df1) of male dating profiles, where the columns are the names of the men, and the row index is their interest in a certain topic, between 0 and 100.
I have a second dataframe (df2) of female dating profiles in the same way.
I want to return a matrix of Males along the side, Females across the top, and the number corresponds to the similarity coefficient between the two profiles, for each man/woman pair.
eg:
df1
bob joe carlos
movies 50 45 90
sports 10 NaN 10
walking 20 NaN 50
skiing NaN 80 40
df2
mary anne sally
movies 40 70 NaN
sports 50 0 30
walking 80 10 50
skiing 30 NaN 40
Desired output, df3:
mary anne sally
bob 4.53 19.3 77.4
joe 81.8 75.7 91.0
carlos 45.8 12.2 18.8
I tried this with the classic double for loop, but even I know this is the work of satan in Pandas world. The tables are relatively large, so reasonable efficiency is important (which the below obviously isn't). Thanks in advance.
df3 = pd.DataFrame(index=df1.columns, columns=df2.columns)
for usera in df1:
    for userb in df2:
        df3.loc[usera, userb] = myfunc(df1[usera], df2[userb])
I've experimented with a few alternatives of your code and this one is the fastest as of now:
df3 = pd.DataFrame(([myfunc_np(col_a, col_b) for col_b in df2.values.T] for col_a in df1.values.T),
                   index=df1.columns, columns=df2.columns)
Here myfunc_np is a numpy version of myfunc that acts on numpy arrays directly rather than pandas series.
Further performance improvement would likely require to vectorize myfunc_np, i.e. having a myfunc_np_vec that takes one column u1 in df1 and the entire df2, and returns a vector of similarity values of u1 with all columns in df2 at the same time.
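For concreteness, here is a minimal sketch of such a vectorized function, assuming the metric is Pearson correlation and glossing over the NaN/inner-join handling the question mentions (pearson_vec and its signature are my own illustration, not from the answer):
import numpy as np
import pandas as pd

def pearson_vec(u1, m2):
    # u1: shape (n,) column from df1; m2: shape (n, k) matrix of all df2 columns
    u1c = u1 - u1.mean()
    m2c = m2 - m2.mean(axis=0)
    num = m2c.T @ u1c  # shape (k,): one dot product per df2 column
    den = np.sqrt((u1c ** 2).sum() * (m2c ** 2).sum(axis=0))
    return num / den

df3 = pd.DataFrame([pearson_vec(df1[c].to_numpy(float), df2.to_numpy(float)) for c in df1.columns],
                   index=df1.columns, columns=df2.columns)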

Transforming dataframe by making column using unique row values python pandas

I have the following dataframe
Name Activities
Eric Soccer,Baseball,Swimming
Natasha Soccer
Mike Basketball,Baseball
I need to transform it into the following dataframe
Activities Name
Soccer Eric,Natasha
Swimming Eric
Baseball Eric,Mike
Basketball Mike
How should I do it?
Using pd.get_dummies
First, use get_dummies:
tmp = df.set_index('Name').Activities.str.get_dummies(sep=',')
Now using stack and agg:
tmp.mask(tmp.eq(0)).stack().reset_index('Name').groupby(level=0).agg(', '.join)
Name
Baseball Eric, Mike
Basketball Mike
Soccer Eric, Natasha
Swimming Eric
Using str.split and melt
(df.set_index('Name').Activities.str.split(',', expand=True)
   .reset_index().melt(id_vars='Name').groupby('value').Name.agg(', '.join))
You can separate the Activities by performing a split and then converting the resulting list to a Series.
Then melt from wide to long format, and groupby the resulting value column (which is Activities).
In your grouped data frame, join the Name fields associated with each Activity.
Like this:
(df.Activities.str.split(",")
   .apply(pd.Series)
   .merge(df, right_index=True, left_index=True)
   .melt(id_vars="Name", value_vars=[0, 1, 2])
   .groupby("value")
   .agg({'Name': lambda x: ','.join(x)})
   .reset_index()
   .rename(columns={"value": "Activities"})
)
Output:
Activities Name
0 Baseball Eric,Mike
1 Basketball Mike
2 Soccer Eric,Natasha
3 Swimming Eric
Note: The reset_index() and rename() methods at the end of the chain are just cosmetic; the main operations are complete after the groupby aggregation.
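As an aside (my addition, not part of either answer): on pandas 0.25+ the same reshape can be written more directly with Series.explode:
(df.assign(Activities=df.Activities.str.split(','))
   .explode('Activities')
   .groupby('Activities').Name.agg(','.join)
   .reset_index())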

Performing the appropriate join operation between two pandas DataFrame

nocity.head()
user_id business_id stars
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5
withcity.head()
business_id city
0 YDf95gJZaq05wvo7hTQbbQ Richmond Heights
1 mLwM-h2YhXl2NCgdS84_Bw Charlotte
2 v2WhjAB3PIBA8J8VxG3wEg Toronto
3 CVtCbSB1zUcUWg-9TNGTuQ Scottsdale
4 duHFBe87uNSXImQmvBh87Q Phoenix
The nocity dataframe has business_id values (which may repeat, since it also holds the rating each user_id gave for each business_id).
The withcity dataframe has the city associated with each business_id
The result I want is:
This is going to be super hard to word:
I want to look up the city associated with each business_id from the withcity dataframe and create a new column in nocity called cityname, which now has the city name associated with that business_id
Why I gave up trying and came here
I know this can be done with some sort of join operation, but I don't understand which one exactly. I looked them up online and got confused about what happens if some business_id isn't available in both dataframes when performing the join.
For example:
withcity has some business_id with a city value, but when performing whichever join is appropriate with nocity, that particular business_id is not found in nocity.
So I came here for help.
What other alternative did I try?
area_dict = dict(zip(withcity.business_id, withcity.city))
emptylist = []
for rows in nocity['business_id']:
    for key, value in area_dict.items():
        if key == rows:
            emptylist.append(value)
I created a dictionary which held the business_id and the city from the withcity dataframe, and performed some sort of matching comparison with the nocity dataframe.
But my method will probably take a lot of time, since there are 4.7 million records, to be exact.
IIUC, merge:
nocity.merge(withcity, on='business_id', how='left')
Out[855]:
user_id business_id stars city
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5 NaN
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3 NaN
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3 NaN
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4 NaN
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5 NaN
In general, whenever you have a situation like this, you want to avoid loops and iterations and instead perform a merge; afterwards, you massage the data to fit your needs. For example, Wen's solution is the most apt way to do this.
However, there are a few things I would add. Take the two dfs above, and call them nocity and withcity respectively.
You want to do:
nocity.merge(withcity, on='business_id', how='left')
However, if you end up getting NaN values as Wen did above, check the datatypes of your keys.
If the business_id field in nocity were int (for some reason) while the business_id field in withcity were str, pandas would fail to match the keys and you would get NaN values instead of the desired city name.
To check, you would do:
#for all datatypes in the nocity df
print(nocity.dtypes)
#or just for the field's dtypes
print(nocity.business_id.dtypes)
Then you would convert to a common datatype like str if they were different...
#example conversion of pandas column (series) to different datatype
nocity.business_id = nocity.business_id.astype(str)
withcity.business_id = withcity.business_id.astype(str)
#then perform merge as usual
nocity = nocity.merge(withcity, on='business_id', how='left')
Hope this helps. Also, don't forget to rename the column from 'city' to 'cityname' if that is what you prefer:
nocity = nocity.rename(columns={'city': 'cityname'})
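One more option (my addition, not part of the original answer): since the question already built area_dict, an equivalent vectorized lookup is Series.map, which avoids both the Python loops and the merge:
nocity['cityname'] = nocity['business_id'].map(area_dict)
# business_ids missing from area_dict get NaN, just like the left merge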
