Joining dataframes based on values, pandas - python

I have two data frames, let's say A and B. A has the columns ['Name', 'Age', 'Mobile_number'] and B has the columns ['Cell_number', 'Blood_Group', 'Location'], with 'Mobile_number' and 'Cell_number' sharing common values. I want to join only the 'Location' column onto A, based on the common values in 'Mobile_number' and 'Cell_number', so the final DataFrame would have the columns ['Name', 'Age', 'Mobile_number', 'Location'].
a = {'Name': ['Jake', 'Paul', 'Logan', 'King'], 'Age': [33,43,22,45], 'Mobile_number':[332,554,234, 832]}
A = pd.DataFrame(a)
b = {'Cell_number': [832,554,123,333], 'Blood_group': ['O', 'A', 'AB', 'AB'], 'Location': ['TX', 'AZ', 'MO', 'MN']}
B = pd.DataFrame(b)
Please suggest an approach. A colleague suggested using pd.Join, but I don't understand how.
Thank you for your time.

The way I see it, you want to merge a DataFrame with part of another DataFrame, based on a common column.
First, make sure the common columns share the same name:
B['Mobile_number'] = B['Cell_number']
Then create a DataFrame that contains only the relevant columns (the key column and the data column you want to bring over):
B1 = B[['Mobile_number', 'Location']]
Finally, you can merge them:
merged_df = pd.merge(A, B1, on='Mobile_number')
Note that this usage of pd.merge keeps only rows whose Mobile_number value exists in both DataFrames (an inner join, which is the default).
You can look at the documentation of pd.merge to change how exactly the merge is done, which rows to include, etc.
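For reference, here is a complete, runnable sketch of the same idea. pd.merge also accepts left_on/right_on, which avoids the rename step; how='left' (an assumption about the desired behaviour) keeps every row of A even when a number has no match in B:

import pandas as pd

a = {'Name': ['Jake', 'Paul', 'Logan', 'King'],
     'Age': [33, 43, 22, 45],
     'Mobile_number': [332, 554, 234, 832]}
A = pd.DataFrame(a)
b = {'Cell_number': [832, 554, 123, 333],
     'Blood_group': ['O', 'A', 'AB', 'AB'],
     'Location': ['TX', 'AZ', 'MO', 'MN']}
B = pd.DataFrame(b)

# Bring only 'Location' over; left_on/right_on handles the differing names.
result = (A.merge(B[['Cell_number', 'Location']],
                  left_on='Mobile_number', right_on='Cell_number',
                  how='left')
           .drop(columns='Cell_number'))
print(result)  # Logan's 234 has no match in B, so his Location is NaN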

Related

Create new column based on conditions in other dataframes

I have two input dataframes I'm attempting to use to create a column in another dataframe (target column = 'Country' in df1). The issue is that one of the input dataframes (i.e., df2) has fewer rows than my target dataframe (i.e., df1). I want to iterate through all possible countries in df2 but when that is exhausted (i.e., after 'KOR'), I want the column to revert to df3 and return the countries starting from the beginning, hence the target result in df1['Country'].
I thought I had found a way to do this using numpy.where, but I ended up with 'ESP' instead of 'US' when reverting to df3, which is not what I want.
Also, these are just examples, the actual dataframes in question are much larger.
Hopefully someone can help me figure this out. Appreciate any help. Thanks.
import pandas as pd
data1 = {'Name': ['tom', 'nick', 'krish', 'jack'],
         'Age': [20, 21, 19, 18],
         'Country': ['IND', 'KOR', 'US', 'UK']}
data2 = {'Country2': ['IND', 'KOR']}
data3 = {'Country': ['US', 'UK', 'ESP', 'MEX']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
EDIT: The easiest way I can think of to explain my goal more clearly is Excel: suppose df1 begins in A1, df2 begins in E1, and df3 begins in H1. The formula in C1 (the start of the country column in df1) would be something like =IF(LEN(E1)>0,E1,H1). If I pull that formula down the cells, I'd basically have "if df2 has a country available, provide that country; else, revert to the beginning of df3 and provide countries starting from the top of that list". A sketch of this logic in pandas follows below.
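Reading the Excel analogy literally (positions, not values), a minimal sketch, assuming the intent is to take everything df2 has and then top up from the start of df3 until df1's length is reached; this reproduces the target df1['Country'] above:

import pandas as pd

df1 = pd.DataFrame({'Name': ['tom', 'nick', 'krish', 'jack'],
                    'Age': [20, 21, 19, 18]})
df2 = pd.DataFrame({'Country2': ['IND', 'KOR']})
df3 = pd.DataFrame({'Country': ['US', 'UK', 'ESP', 'MEX']})

# Use df2's countries while they last, then fall back to df3 from its top.
n = len(df2)
df1['Country'] = df2['Country2'].tolist() + df3['Country'].iloc[:len(df1) - n].tolist()
print(df1['Country'].tolist())  # ['IND', 'KOR', 'US', 'UK']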

Flatten a Dataframe that is pivoted

I have the following code that is taking a single column and pivoting it into multiple columns. There are blanks in my result that I am trying to remove but I am running into issues with the wrong values being applied to rows.
# pivot_cols: list of identifier columns, defined earlier (not shown)
task_df = task_df.pivot(index=pivot_cols, columns='Field')['Value'].reset_index()
task_df[['Color','Class']] = task_df[['Color','Class']].bfill()
task_df[['Color','Class']] = task_df[['Color','Class']].ffill()
task_df = task_df.drop_duplicates()
(The Start, Current, and Desired tables were attached as screenshots.)
This is basically merging together all rows that share the same name or ID. You can do it like this:
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
task_df = (task_df.groupby('Name', as_index=False)
                  .aggregate(mergers)
                  .reindex(columns=task_df.columns)
                  .sort_values(by=['ID']))
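The reason 'sum' works here: on object (string) columns, summing concatenates the pieces, so the blanks left behind by the pivot collapse away. A toy illustration; the values are made up, since the Start/Current/Desired tables were only screenshots:

import pandas as pd

task_df = pd.DataFrame({'Name': ['a', 'a'],
                        'ID': [1, 1],
                        'Color': ['red', ''],
                        'Class': ['', 'x']})
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
out = (task_df.groupby('Name', as_index=False)
              .aggregate(mergers)
              .reindex(columns=task_df.columns)
              .sort_values(by=['ID']))
print(out)  # one row per Name: ID=1, Color='red', Class='x'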

Python Pandas - How to find rows where a column value is different between two data frames

I am trying to get the rows where the values in a column differ between two data frames.
For example, let say we have these two dataframe below:
import pandas as pd
data1 = {'date': [20210701, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0]}
data2 = {'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0, 0]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
As you can see, Dave has different values in column 'a' on 20210704, and Sue has different values in column 'a' on 20210705. Therefore, my desired output should be something like:
import pandas as pd
output = {'date': [20210704, 20210705],
          'name': ['Dave', 'Sue'],
          'a_from_old': [0, 1]}
df_output = pd.DataFrame(output)
If I am not mistaken, what I am asking for is pretty much the same thing as the MINUS (EXCEPT) statement in SQL, unless I am missing some edge cases.
How can I find those rows with the same date and name but different values in a column?
Update
I found an edge case where some data is not even in the other data frame; I want to find rows that appear in both data frames but whose value in column 'a' differs.
I edited the sample data set to take the edge case into account.
(Notice that Dave on 20210702 does not appear in the final output because that row is not in the first data frame.)
Another option: an inner merge, keeping only rows where the 'a' from df1 does not equal the 'a' from df2:
df3 = (
    df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', '_df2'))
       .query('a_from_old != a_df2')  # df1 `a` != df2 `a`
       .drop('a_df2', axis=1)         # remove the column with df2 `a` values
)
df3:
       date  name  a_from_old
2  20210704  Dave           0
4  20210705   Sue           1
Try a left merge() with indicator=True, then filter the result with query(), drop the extra column with drop(), and rename 'a' to 'a_from_old' with rename():
out = (df1.merge(df2, on=['date', 'name', 'a'], how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop(columns='_merge')
          .rename(columns={'a': 'a_from_old'}))
Output of out:
       date  name  a_from_old
2  20210704  Dave           0
4  20210705   Sue           1
Note: if there are many more columns that you want to rename, pass suffixes=('_from_old', '') to the merge() call as a parameter, as sketched below.
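A quick sketch of that suffixes idea, using the inner-merge approach from the first answer: every overlapping column coming from df1 gets the '_from_old' suffix, while df2's copies keep their bare names, so nothing has to be renamed one by one. (The frames below are the question's own sample data.)

import pandas as pd

df1 = pd.DataFrame({'date': [20210701, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0]})
df2 = pd.DataFrame({'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0, 0]})

out = (df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', ''))
          .query('a_from_old != a')   # keep rows whose 'a' changed
          .drop(columns='a'))
print(out)  # 20210704/Dave -> a_from_old=0, 20210705/Sue -> a_from_old=1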

What is the most efficient way to compare two DataFrame columns and match similar rows based on a function?

Assume we have two DataFrames, each containing a column of somewhat similar string-based values. What is the most efficient and/or effective way to match rows with similar values, based on a comparison function such as textdistance's implementation of Jaro-Winkler?
Example DataFrames:
first_df = pd.DataFrame( ['Cars and cats', 'Spaceship', 'Captain Marvel', 'Dune','Bucks in 6'], columns=['Title'])
second_df = pd.DataFrame( ['Captain Harlock', 'Cats and dogs', 'Buccuneers', 'Dune buggy','Milwaukee Bucks'], columns=['Title'])
What I'm thinking is:
Create a Cartesian product of each DataFrame's column of interest
Apply the comparison function and store the result in a new column. Let's call it similarity_score
Sort the new DataFrame from best value to worst (depending on the algorithm)
Drop the duplicates of the column we are most interested in
Implementation:
import textdistance

comparison_df = first_df.merge(second_df, how='cross')  # the shared 'Title' column becomes Title_x / Title_y
comparison_df['similarity_score'] = comparison_df.apply(
    lambda row: textdistance.jaro_winkler.normalized_similarity(row['Title_x'], row['Title_y']), axis=1)
display(comparison_df)
comparison_df = (comparison_df.sort_values('similarity_score', ascending=False)
                              .drop_duplicates(subset=['Title_x'], keep='first'))
Any suggestions are welcome. Thank you in advance.

Use results from two queries with common key to create a dataframe without having to use merge

data set:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'], index=['abcd', 'efgh', 'abcd', 'abc123', 'efgh']).reset_index()
s = pd.Series(data=[True, True, False], index=['abcd', 'efgh', 'abc123'], name='availability').reset_index()
(Feel free to remove the reset_index bits above; they are simply there to make it easy to try a different approach to the problem. However, the result sets from the queries I'm running most closely resemble the above.)
I have two separate queries that return data similar to the above. One query pulls a field from a DB table that has a column of information the other table lacks. The 'index' column is the common key across both tables.
My result set needs to have the 2nd query's result series injected into the first query's resulting dataframe at a specific column index.
I know that I can simply run:
df = df.merge(s, how='left', on='index')
Then to enforce column order:
df = df[['index', 'A', 'B', 'availability', 'C', 'D']]
I saw that you can do df.insert, but that requires the series to be the same length as the df.
I'm wondering if there is a way to do this without having to run merge and then enforce column order. With my actual dataset, the number of columns is significantly larger. I'd imagine the best solution relies on list manipulation, but I'd much rather do something clever with how the dataframe is created in the first place.
df.set_index(['index','id']).index.map(s['availability'])
is returning:
TypeError: 'Series' object is not callable
s is a dataframe with a multi-index and one column, which is a boolean.
df is a dataframe with columns that make up s's multi-index.
IIUC:
In [260]: df.insert(3, 'availability',
                    df['index'].map(s.set_index('index')['availability']))
In [261]: df
Out[261]:
    index         A         B  availability         C         D
0    abcd  1.867270  0.517894          True  0.584115 -0.162361
1    efgh -0.036696  1.155110          True -1.112075  2.005678
2    abcd  0.693795 -0.843335          True -1.003202  1.001791
3  abc123 -1.466148 -0.848055         False -0.373293  0.360091
4    efgh -0.436618 -0.625454          True -0.285795 -0.220717
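Worth noting about the design: DataFrame.insert places the new column at an exact position in a single step and modifies the frame in place, so there is no merge followed by manual column reordering; and since map aligns on the key, the series does not need to match the frame's length. A minimal self-contained sketch of the same pattern (toy values, hypothetical, since the original frames used random data):

import pandas as pd

df = pd.DataFrame({'index': ['abcd', 'efgh', 'abcd'],
                   'A': [1, 2, 3], 'B': [4, 5, 6]})
s = pd.DataFrame({'index': ['abcd', 'efgh'],
                  'availability': [True, False]})

# Look up each row's key in s, then splice the result in as column 2.
df.insert(2, 'availability', df['index'].map(s.set_index('index')['availability']))
print(df.columns.tolist())  # ['index', 'A', 'availability', 'B']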
