Lookup matched value from one dataframe and send to a different dataframe - python

I have two dataframes:
df1
Name Age State Postcode AveAge_State_PC
John 40 PA 1000 35
Janet 40 LV 1050 30
Jake 30 PA 1000 35
Jess 20 LV 1050 30
df2
State Postcode AveAge_State_PC
PA 1000 ???
LV 1050 ???
How do I get the values into the 2nd table? They should all be the same, so I'm happy to take the first value that appears.
I have tried:
df2 = df2.merge(df1[['State', 'Postcode', 'AveAge_State_PC']], how = 'left',
left_on = ['State', 'Postcode'], right_on = ['State', 'Postcode']).drop(columns= ['State', 'Postcode'])
but I am getting a
ValueError: You are trying to merge on int64 and object columns.
Edit
I'm also getting duplicate rows when merging, rather than just keeping the same number of rows in df2. I assume this is because there are multiple rows with the same values in df1? Any help would be much appreciated, thanks!
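The error suggests that one of the merge keys (most likely Postcode, which is an assumption here) is stored as int64 in one frame and as a string in the other, and the duplicate rows come from df1 containing several rows per (State, Postcode) pair. A minimal sketch of one way to address both, reconstructing the sample data and assuming the Postcode dtype mismatch:
import pandas as pd

# Sample data mirroring the question; Postcode is int in df1 but str in df2,
# which reproduces the "merge on int64 and object columns" error (an assumption).
df1 = pd.DataFrame({
    'Name': ['John', 'Janet', 'Jake', 'Jess'],
    'Age': [40, 40, 30, 20],
    'State': ['PA', 'LV', 'PA', 'LV'],
    'Postcode': [1000, 1050, 1000, 1050],
    'AveAge_State_PC': [35, 30, 35, 30],
})
df2 = pd.DataFrame({'State': ['PA', 'LV'], 'Postcode': ['1000', '1050']})

# 1) Align the key dtypes so the merge columns match.
df2['Postcode'] = df2['Postcode'].astype(df1['Postcode'].dtype)

# 2) Deduplicate the lookup table so each (State, Postcode) appears once,
#    which keeps df2 at its original number of rows after the merge.
lookup = df1[['State', 'Postcode', 'AveAge_State_PC']].drop_duplicates(
    subset=['State', 'Postcode'])

df2 = df2.merge(lookup, on=['State', 'Postcode'], how='left')
print(df2)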

Related

Update an existing column in one dataframe based on the value of a column in another dataframe

I have two csv files as my raw data to read into different dataframes. One is called 'employee' and another is called 'origin'. However, I cannot upload the files here so I hardcoded the data into the dataframes below. The task I'm trying to solve is to update the 'Eligible' column in employee_details with 'Yes' or 'No' based on the value of the 'Country' column in origin_details. If Country = UK, then put 'Yes' in the Eligible column for that Personal_ID. Else, put 'No'.
import pandas as pd
import numpy as np
employee = {
    'Personal_ID': ['1000123', '1100258', '1104682', '1020943'],
    'Name': ['Tom', 'Joseph', 'Krish', 'John'],
    'Age': ['40', '35', '43', '51'],
    'Eligible': ' '}
origin = {
    'Personal_ID': ['1000123', '1100258', '1104682', '1020943', '1573482', '1739526'],
    'Country': ['UK', 'USA', 'FRA', 'SWE', 'UK', 'AU']}
employee_details = pd.DataFrame(employee)
origin_details = pd.DataFrame(origin)
employee_details['Eligible'] = np.where((origin_details['Country']) == 'UK', 'Yes', 'No')
print(employee_details)
print(origin_details)
The above code produces the following error message:
ValueError: Length of values (6) does not match length of index (4)
However, I am expecting to see the below as my output.
Personal_ID Name Age Eligible
0 1000123 Tom 40 Yes
1 1100258 Joseph 35 No
2 1104682 Krish 43 No
3 1020943 John 51 No
I also don't want to delete anything in my dataframes to match the size specified in the ValueError message because I may need the extra Personal_IDs in the origin_details later. Alternatively, I can keep all the existing Personal_ID's in the raw data (employee_details, origin_details) and create a new dataframe from those to extract the records which have the same Personal_ID's and determine the np.where() condition from there.
Please advise! Any help is appreciated, thank you!
You can merge the two dataframes on Personal_ID and then use np.where.
Merge with how='outer' to keep all Personal_IDs:
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='outer')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
4 1573482 NaN NaN Yes UK
5 1739526 NaN NaN No AU
If you don't want to keep all Personal_IDs, you can merge with how='inner' instead and you won't see the NaNs:
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='inner')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
You are using a Pandas Series object inside a Numpy method, np.where((origin_details['Country'])). I believe this is the problem.
try:
employee_details['Eligible'] = origin_details['Country'].apply(lambda x:"Yes" if x=='UK' else "No")
It is always much easier and faster to use the pandas library to analyze dataframes than to convert them back to numpy arrays.
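That said, the assignment above only works because assigning a Series aligns on the default integer index, so it still assumes the two frames list the same people in the same order. Keying the lookup on Personal_ID avoids that assumption; a minimal sketch using Series.map (a variant, not part of the original answer):
# Look Country up by Personal_ID instead of relying on row order,
# so the differing lengths of the two frames no longer matter.
country = origin_details.set_index('Personal_ID')['Country']
employee_details['Eligible'] = (
    employee_details['Personal_ID'].map(country)
    .eq('UK')
    .map({True: 'Yes', False: 'No'})
)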
Well, the first thing I want to address is the exception, and how lucky you are that it was raised: if your tables had been the same length, your code would have run without complaint.
But there is an assumption in that code that I don't think you considered: the IDs may not be in the same order, or, as in your example, one table may contain more IDs than the other. With same-length tables in a different order, you would have gotten incorrect Eligible values for each row. The correct way to do this is as follows.
First, join the tables into one on Personal_ID, using a left join so you don't lose employees that have no origin info for their Personal_ID:
combine_df = pd.merge(employee_details, origin_details, on='Personal_ID', how='left')
Then use the apply function to fill the new column:
combine_df['Eligible'] = combine_df['Country'].apply(lambda x:'Yes' if x=='UK' else 'No')

How to drop rows in one DataFrame based on one similar column in another Dataframe that has a different number of rows

I have two DataFrames that are completely dissimilar except for certain values in one particular column:
df
First Last Email Age
0 Adam Smith email1#email.com 30
1 John Brown email2#email.com 35
2 Joe Max email3#email.com 40
3 Will Bill email4#email.com 25
4 Johnny Jacks email5#email.com 50
df2
ID Location Contact
0 5435 Austin email5#email.com
1 4234 Atlanta email1#email.com
2 7896 Toronto email3#email.com
How would I go about finding the matching values in the Email column of df and the Contact column of df2, and then dropping the whole row in df based on that match?
Output I'm looking for (index numbering doesn't matter):
df1
First Last Email Age
1 John Brown email2#email.com 35
3 Will Bill email4#email.com 25
I've been able to identify matches using a few different methods like:
Changing the column names to be identical
common = df.merge(df2,on=['Email'])
df3 = df[(~df['Email'].isin(common['Email']))]
But df3 still shows all the rows from df.
I've also tried:
common = df['Email'].isin(df2['Contact'])
df.drop(df[common].index, inplace = True)
And again, identifies the matches but df still contains all original rows.
So the main thing I'm having difficulty with is updating df with the matches dropped or creating a new DataFrame that contains only the rows with dissimilar values when comparing the Email column from df and the Contact column in df2. Appreciate any suggestions.
As mentioned in the comments (@Arkadiusz), it is enough to filter your data using the following:
df3 = df[(~df['Email'].isin(df2.Contact))].copy()
print(df3)
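For reference, run against the sample frames from the question, that filter keeps only the unmatched rows (a self-contained sketch, with the emails written exactly as in the post):
import pandas as pd

df = pd.DataFrame({
    'First': ['Adam', 'John', 'Joe', 'Will', 'Johnny'],
    'Last': ['Smith', 'Brown', 'Max', 'Bill', 'Jacks'],
    'Email': ['email1#email.com', 'email2#email.com', 'email3#email.com',
              'email4#email.com', 'email5#email.com'],
    'Age': [30, 35, 40, 25, 50],
})
df2 = pd.DataFrame({
    'ID': [5435, 4234, 7896],
    'Location': ['Austin', 'Atlanta', 'Toronto'],
    'Contact': ['email5#email.com', 'email1#email.com', 'email3#email.com'],
})

# Keep only rows of df whose Email does not appear in df2's Contact column.
df3 = df[~df['Email'].isin(df2['Contact'])].copy()
print(df3)
#   First   Last             Email  Age
# 1  John  Brown  email2#email.com   35
# 3  Will   Bill  email4#email.com   25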

Efficient pandas operation for columnwise functions on two dataframes

I have two numerical dataframes (df1 and df2), each with a common index but with different column headers. I want to apply a function that, for the ith column of df1 and the jth column of df2, applies the Pearson correlation function (or cosine similarity, or a similar user-defined function) and returns the number.
I want to return the number into a dataframe, df3, where the columns of df1 are the index of df3, the columns of df2 are the columns of df3, and the cells represent the value of the correlation between the two vectors (columns) from df1 and df2.
*Not all of the values are populated. Where there's a difference, match only on the inner join of the two vectors (this can be done in the user-defined function). Assume df1 and df2 have a different number of columns from each other.
Example: I have a dataframe (df1) of male dating profiles, where the columns are the names of the men, and the row index is their interest in a certain topic, between 0 and 100.
I have a second dataframe (df2) of female dating profiles in the same way.
I want to return a matrix of Males along the side, Females across the top, and the number corresponds to the similarity coefficient between the two profiles, for each man/woman pair.
eg:
df1
bob joe carlos
movies 50 45 90
sports 10 NaN 10
walking 20 NaN 50
skiing NaN 80 40
df2
mary anne sally
movies 40 70 NaN
sports 50 0 30
walking 80 10 50
skiing 30 NaN 40
Desired output, df3:
mary anne sally
bob 4.53 19.3 77.4
joe 81.8 75.7 91.0
carlos 45.8 12.2 18.8
I tried this with the classic double for loop, but even I know this is the work of satan in Pandas world. The tables are relatively large, so reasonable efficiency is important (which the below obviously isn't). Thanks in advance.
df3 = pd.DataFrame(index=df1.columns, columns=df2.columns)
for usera in df1:
    for userb in df2:
        df3.loc[usera, userb] = myfunc(df1[usera], df2[userb])
I've experimented with a few alternatives of your code and this one is the fastest as of now:
df3 = pd.DataFrame(([myfunc_np(col_a, col_b) for col_b in df2.values.T] for col_a in df1.values.T),
index=df1.columns, columns=df2.columns)
Here myfunc_np is a numpy version of myfunc that acts on numpy arrays directly rather than on pandas Series.
Further performance improvement would likely require vectorizing myfunc_np, i.e. having a myfunc_np_vec that takes one column u1 of df1 and the entire df2, and returns a vector of similarity values of u1 with all columns of df2 at the same time.
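For example, if myfunc is the Pearson correlation, a vectorized myfunc_np_vec along those lines might look like the sketch below (assuming NaNs should be ignored pairwise; this is one possible implementation, not from the original answer):
import numpy as np
import pandas as pd

def myfunc_np_vec(u, M):
    # Pearson correlation of 1-D array u with every column of 2-D array M,
    # using only the rows where both values are present (pairwise-complete).
    valid = ~np.isnan(u)[:, None] & ~np.isnan(M)
    n = valid.sum(axis=0)                        # shared observations per column
    u_ = np.where(valid, u[:, None], 0.0)
    m_ = np.where(valid, M, 0.0)
    su, sm = u_.sum(axis=0), m_.sum(axis=0)
    cov = (u_ * m_).sum(axis=0) - su * sm / n
    var_u = (u_ ** 2).sum(axis=0) - su ** 2 / n
    var_m = (m_ ** 2).sum(axis=0) - sm ** 2 / n
    return cov / np.sqrt(var_u * var_m)          # NaN if no overlap or zero variance

df3 = pd.DataFrame(
    [myfunc_np_vec(col, df2.values) for col in df1.values.T],
    index=df1.columns, columns=df2.columns)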

Make Pandas Dataframe column equal to value in another Dataframe based on index

I have 3 dataframes as below
df1
id first_name surname state
1
88
190
2509
....
df2
id given_name surname state street_num
17 John Doe NY 5
88 Tom Murphy CA 423
190 Dave Casey KY 250
....
df3
id first_name family_name state car
1 John Woods NY ford
74 Tom Kite FL vw
2509 Mike Johnson KY toyota
Some id's from df1 are in df2 and others are in df3. There are also id's in df2 and df3 that are not in df1.
EDIT: there are also some id's in df1 that are not in either df2 or df3.
I want to fill the columns in df1 with the values from the dataframe containing the id. However, I do not want all columns (so I think merge is not suitable). I have tried to use the isin function, but that way I could not update records individually and got an error. This was my attempt using isin:
df1.loc[df1.index.isin(df2.index), 'first_name'] = df2.given_name
Is there an easy way to do this without iterating through the dataframes checking if index matches?
I think you first need to rename your columns to align the DataFrames in concat and then reindex to filter by df1.index and df1.columns:
df21 = df2.rename(columns={'given_name':'first_name'})
df31 = df3.rename(columns={'family_name':'surname'})
df = pd.concat([df21, df31]).reindex(index=df1.index, columns=df1.columns)
print (df)
first_name surname state
id
1 John Woods NY
88 Tom Murphy CA
190 Dave Casey KY
2509 Mike Johnson KY
EDIT: If you need the intersection of indices only:
df4 = pd.concat([df21, df31])
df = df4.reindex(index=df1.index.intersection(df4.index), columns=df1.columns)
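If you would rather write the matching values back into df1 in place instead of building a new frame, DataFrame.update is one option (a sketch, assuming id is the index of all three frames, as above):
df1.update(df4)   # overwrites df1 cells wherever df4 has a non-NaN value for a matching id and column
print(df1)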

Pandas sum by groupby, but exclude certain columns

What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:
Code Country Item_Code Item Ele_Code Unit Y1961 Y1962 Y1963
2 Afghanistan 15 Wheat 5312 Ha 10 20 30
2 Afghanistan 25 Maize 5312 Ha 10 20 30
4 Angola 15 Wheat 7312 Ha 30 40 50
4 Angola 25 Maize 7312 Ha 30 40 50
I want to groupby the column Country and Item_Code and only compute the sum of the rows falling under the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:
Code Country Item_Code Item Ele_Code Unit Y1961 Y1962 Y1963
2 Afghanistan 15 C3 5312 Ha 20 40 60
4 Angola 25 C4 7312 Ha 60 80 100
Right now I am doing this:
df.groupby('Country').sum()
However this adds up the values in the Item_Code column as well. Is there any way I can specify which columns to include in the sum() operation and which ones to exclude?
You can select the columns of a groupby:
In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
Y1961 Y1962 Y1963
Country Item_Code
Afghanistan 15 10 20 30
25 10 20 30
Angola 15 30 40 50
25 30 40 50
Note that the list passed must be a subset of the columns otherwise you'll see a KeyError.
The agg function will do this for you. Pass the columns and functions as a dict of {column: function}:
df.groupby(['Country', 'Item_Code']).agg({'Y1961': np.sum, 'Y1962': [np.sum, np.mean]}) # Added example for two output columns from a single input column
This will display only the group by columns, and the specified aggregate columns. In this example I included two agg functions applied to 'Y1962'.
To get exactly what you hoped to see, include the other columns in the groupby and apply sums to the Y variables in the frame:
df.groupby(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit']).agg({'Y1961': np.sum, 'Y1962': np.sum, 'Y1963': np.sum})
If you are looking for a more generalized way to apply this to many columns, you can build a list of column names and pass it as the column selection on the grouped dataframe. In your case, for example:
columns = ['Y' + str(year) for year in range(1967, 2011)]
df.groupby('Country')[columns].agg('sum')
If you want to add a suffix/prefix to the aggregated column names, use add_suffix() / add_prefix().
df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().add_suffix("_total")
If you want to retain Code and Country as columns after aggregation, set as_index=False in groupby() or use reset_index().
df.groupby(["Code", "Country"], as_index=False)[["Y1961", "Y1962", "Y1963"]].sum()
# df.groupby(["Code", "Country"])[["Y1961", "Y1962", "Y1963"]].sum().reset_index()
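Putting the last variant together with the sample data from the question gives a self-contained check of the expected totals (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'Code': [2, 2, 4, 4],
    'Country': ['Afghanistan', 'Afghanistan', 'Angola', 'Angola'],
    'Item_Code': [15, 25, 15, 25],
    'Item': ['Wheat', 'Maize', 'Wheat', 'Maize'],
    'Ele_Code': [5312, 5312, 7312, 7312],
    'Unit': ['Ha', 'Ha', 'Ha', 'Ha'],
    'Y1961': [10, 10, 30, 30],
    'Y1962': [20, 20, 40, 40],
    'Y1963': [30, 30, 50, 50],
})

# Group on Code and Country, summing only the year columns.
print(df.groupby(['Code', 'Country'], as_index=False)[['Y1961', 'Y1962', 'Y1963']].sum())
#    Code      Country  Y1961  Y1962  Y1963
# 0     2  Afghanistan     20     40     60
# 1     4       Angola     60     80    100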
