I have two numerical dataframes (df1 and df2), each with a common index but different column headers. I want a function that, for the ith column of df1 and the jth column of df2, applies the Pearson correlation (or cosine similarity, or a similar user-defined function) and returns the resulting number.
I want to collect those numbers into a dataframe, df3, where the columns of df1 form the index of df3, the columns of df2 form the columns of df3, and each cell holds the correlation between the two vectors (columns) from df1 and df2.
*Not all of the values are populated. Where the populated entries differ, match only on the inner join of the two vectors (this can be done in the user-defined function). Assume df1 and df2 have different numbers of columns from each other.
Example: I have a dataframe (df1) of male dating profiles, where the columns are the names of the men and each row is their interest in a certain topic, scored between 0 and 100.
I have a second dataframe (df2) of female dating profiles in the same way.
I want to return a matrix of Males along the side, Females across the top, and the number corresponds to the similarity coefficient between the two profiles, for each man/woman pair.
eg:
df1
         bob  joe  carlos
movies    50   45      90
sports    10  NaN      10
walking   20  NaN      50
skiing   NaN   80      40
df2
         mary  anne  sally
movies     40    70    NaN
sports     50     0     30
walking    80    10     50
skiing     30   NaN     40
Desired output, df3:
        mary  anne  sally
bob     4.53  19.3   77.4
joe     81.8  75.7   91.0
carlos  45.8  12.2   18.8
I tried this with the classic double for loop, but even I know this is the work of satan in Pandas world. The tables are relatively large, so reasonable efficiency is important (which the below obviously isn't). Thanks in advance.
df3 = pd.DataFrame(index=df1.columns, columns=df2.columns)
# one Python-level call to myfunc per (df1 column, df2 column) pair
for usera in df1:
    for userb in df2:
        df3.loc[usera, userb] = myfunc(df1[usera], df2[userb])
I've experimented with a few alternatives to your code, and this one is the fastest so far:
df3 = pd.DataFrame(([myfunc_np(col_a, col_b) for col_b in df2.values.T]
                    for col_a in df1.values.T),
                   index=df1.columns, columns=df2.columns)
Here myfunc_np is a numpy version of myfunc that acts on numpy arrays directly rather than pandas series.
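Since myfunc itself isn't shown in the question, here is one plausible NaN-aware Pearson version of myfunc_np on raw arrays, as a sketch (the mask implements the inner-join requirement):
import numpy as np

def myfunc_np(u, v):
    # keep only the positions where both vectors are populated
    mask = ~np.isnan(u) & ~np.isnan(v)
    return np.corrcoef(u[mask], v[mask])[0, 1]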
Further performance improvement would likely require vectorizing myfunc_np, i.e. a myfunc_np_vec that takes one column u1 of df1 and the entire df2, and returns the vector of similarity values of u1 with all columns of df2 at once.
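For Pearson specifically, one candidate for such a vectorized path is pandas' own corrwith, which correlates one series against every column of a frame at once and drops NaN pairs per column pair (so it matches the inner-join requirement). A sketch, not a benchmark:
# each df1 column against all df2 columns; transpose to get df3's orientation
df3 = df1.apply(lambda col: df2.corrwith(col)).T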
Related
I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
   name       job  value
0   bob  business  100.0
1   NaN   dentist    NaN
2  jack       NaN    NaN
I am trying to get the following output:
       name               job  value
0  bob jack  business dentist    100
I am trying to group across all columns; I don't care if the value column is converted to dtype object (string).
I'm just trying to collapse all the rows across all columns.
I've tried groupby(index=0) but did not get good results.
You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
Output:
       name               job  value
0  bob jack  business dentist  100.0
Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
Output:
>>> new_df
       name               job  value
0  bob jack  business dentist  100.0
I have two dataframes:
df1
Name Age State Postcode AveAge_State_PC
John 40 PA 1000 35
Janet 40 LV 1050 30
Jake 30 PA 1000 35
Jess 20 LV 1050 30
df2
State Postcode AveAge_State_PC
PA 1000 ???
LV 1050 ???
How do I get the values into the second table? They should all be the same, so I'm happy to take the first value that appears.
I have tried:
df2 = df2.merge(df1[['State', 'Postcode', 'AveAge_State_PC']], how='left',
                left_on=['State', 'Postcode'],
                right_on=['State', 'Postcode']).drop(columns=['State', 'Postcode'])
but I am getting:
ValueError: You are trying to merge on int64 and object columns.
Edit
I'm also getting duplicate rows when merging, rather than keeping the same number of rows as df2. I assume this is because there are multiple matching rows in df1? Any help would be much appreciated, thanks!
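A sketch that would address both symptoms (which frame holds the object-dtype key is an assumption; the error only says the key dtypes differ):
# align the key dtypes first; the ValueError means a key column such as
# Postcode is int64 in one frame and object (str) in the other
df1['Postcode'] = df1['Postcode'].astype(str)
df2['Postcode'] = df2['Postcode'].astype(str)

# collapse df1 to one row per key so the merge can't duplicate df2's rows
lookup = (df1[['State', 'Postcode', 'AveAge_State_PC']]
          .drop_duplicates(subset=['State', 'Postcode']))

df2 = df2.drop(columns='AveAge_State_PC').merge(lookup, on=['State', 'Postcode'], how='left')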
I have two DataFrames that are completely dissimilar except for certain values in one particular column:
df
First Last Email Age
0 Adam Smith email1#email.com 30
1 John Brown email2#email.com 35
2 Joe Max email3#email.com 40
3 Will Bill email4#email.com 25
4 Johnny Jacks email5#email.com 50
df2
ID Location Contact
0 5435 Austin email5#email.com
1 4234 Atlanta email1#email.com
2 7896 Toronto email3#email.com
How would I go about finding the matching values in the Email column of df and the Contact column of df2, and then dropping the whole row in df based on that match?
Output I'm looking for (index numbering doesn't matter):
df1
First Last Email Age
1 John Brown email2#email.com 35
3 Will Bill email4#email.com 25
I've been able to identify matches using a few different methods like:
Changing the column names to be identical
common = df.merge(df2,on=['Email'])
df3 = df[(~df['Email'].isin(common['Email']))]
But df3 still shows all the rows from df.
I've also tried:
common = df['Email'].isin(df2['Contact'])
df.drop(df[common].index, inplace = True)
And again, identifies the matches but df still contains all original rows.
So the main thing I'm having difficulty with is updating df with the matches dropped, or creating a new DataFrame containing only the rows whose Email value in df has no match in the Contact column of df2. Appreciate any suggestions.
As mentioned in the comments (@Arkadiusz), it is enough to filter your data as follows:
df3 = df[(~df['Email'].isin(df2.Contact))].copy()
print(df3)
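If you'd rather stay with the merge route from the question, the indicator flag yields the same anti-join; a sketch:
# a left merge marks each df row as 'both' (matched) or 'left_only'
keys = df2[['Contact']].drop_duplicates().rename(columns={'Contact': 'Email'})
merged = df.merge(keys, on='Email', how='left', indicator=True)
df3 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')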
How do I merge two data frames on a common column so that the merged frame contains only the rows whose value in that column appears in both, with all other rows deleted?
I have 5000 rows in df1 in this format:
  director_name   actor_1_name     actor_2_name      actor_3_name      movie_title
0 James Cameron   CCH Pounder      Joel David Moore  Wes Studi         Avatar
1 Gore Verbinski  Johnny Depp      Orlando Bloom     Jack Davenport    Pirates of the Caribbean: At World's End
2 Sam Mendes      Christoph Waltz  Rory Kinnear      Stephanie Sigman  Spectre
and 10000 rows of df2 as
movieId genres movie_title
1 Adventure|Animation|Children|Comedy|Fantasy Toy Story
2 Adventure|Children|Fantasy Jumanji
3 Comedy|Romance Grumpier Old Men
4 Comedy|Drama|Romance Waiting to Exhale
The common column 'movie_title' shares values between the two frames; based on those, I want to keep all rows where 'movie_title' matches and delete the others.
Any help/suggestion would be appreciated.
Note: I already tried
pd.merge(dfinal, df1, on='movie_title')
and the output comes out as just this one header row:
director_name actor_1_name actor_2_name actor_3_name movie_title movieId title genres
With how="outer"/"left"/"right" I tried them all, and didn't get any rows after dropping NaN, although many common values do exist.
You can use pd.merge:
import pandas as pd
pd.merge(df1, df2, on="movie_title")
Only rows are kept for which common keys are found in both data frames. In case you want to keep all rows from the left data frame and only add values from df2 where a matching key is available, you can use how="left":
pd.merge(df1, df2, on="movie_title", how="left")
We can merge two DataFrames in several ways. The most common way in Python is the merge operation in pandas.
import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')
For merging on columns that are named differently in the two dataframes (say one calls the column 'movie_name' instead of 'movie_title'), specify the left and right column names explicitly:
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
If you want to be even more specific, you may read the documentation of pandas merge operation.
If you want to merge two DataFrames such that only the values common to both appear in the result, do an inner merge.
import pandas as pd
merged_Frame = pd.merge(df1, df2, on='movie_title', how='inner')
nocity.head()
user_id business_id stars
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5
withcity.head()
business_id city
0 YDf95gJZaq05wvo7hTQbbQ Richmond Heights
1 mLwM-h2YhXl2NCgdS84_Bw Charlotte
2 v2WhjAB3PIBA8J8VxG3wEg Toronto
3 CVtCbSB1zUcUWg-9TNGTuQ Scottsdale
4 duHFBe87uNSXImQmvBh87Q Phoenix
The nocity dataframe has business_id values (they may repeat, since it also has the rating each user_id gave each business_id).
The withcity dataframe has the city associated with each business_id
The result I want (this is going to be hard to word): I want to look up the city associated with each business_id in the withcity dataframe and create a new column in nocity called cityname, holding the city associated with that business_id.
Why I gave up trying and came here
I know this can be done with some sort of join operation, but I don't understand which one exactly. I looked them up online and got confused about what would happen if some business_id wasn't available in both dataframes when performing the join.
For example:
withcity has some business_id with a city value, but whichever join I perform with nocity does not find that particular business_id in it.
So I came here for help.
What other alternative did I try?
area_dict = dict(zip(withcity.business_id, withcity.city))
emptylist = []
for rows in nocity['business_id']:
    for key, value in area_dict.items():
        if key == rows:
            emptylist.append(value)
I created a dictionary which held the business_id and the city from the withcity dataframe, and performed some sort of matching comparison with the nocity dataframe.
But my method will probably take a lot of time, since there are 4.7 million records to be exact.
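For what it's worth, the area_dict already built above gets you most of the way there; Series.map does the same lookup without a Python-level loop:
# vectorized dict lookup; ids missing from area_dict become NaN
nocity['cityname'] = nocity['business_id'].map(area_dict)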
IIUC, merge:
nocity.merge(withcity,on='business_id',how='left')
Out[855]:
user_id business_id stars city
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5 NaN
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3 NaN
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3 NaN
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4 NaN
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5 NaN
In general, whenever you have a situation like this, you want to avoid loops and iteration and instead perform a merge, then massage the data afterwards to fit your needs. For example, Wen's solution is the most apt way to do this.
However, there are a few things I would add. Let's call the first and second dfs nocity and withcity respectively.
You want to do:
nocity.merge(withcity, on='business_id', how='left')
However, if you end up getting NaN values as Wen did above, check the datatypes of your keys.
That is, if the business_id field in nocity were int (for some reason) while the business_id field in withcity were str, pandas would have trouble merging the dataframes and you would get NaN values instead of the desired city name.
To check, you would do:
# dtypes of all columns in the nocity df
print(nocity.dtypes)

# or just that one field's dtype
print(nocity.business_id.dtypes)
Then you would convert to a common datatype like str if they were different:
# example conversion of a pandas column (Series) to a different datatype
nocity.business_id = nocity.business_id.astype(str)
withcity.business_id = withcity.business_id.astype(str)

# then perform the merge as usual
nocity = nocity.merge(withcity, on='business_id', how='left')
Hope this helps. Also, don't forget to rename the column from 'city' to 'cityname' if that is what you prefer; note that rename returns a new frame unless you assign it back:
nocity = nocity.rename(columns={'city': 'cityname'})