I'm very new to Python and I have two dataframes. I'm trying to match the "Names" (the columns of dataframe 1) with the rows of dataframe 2, and collect the value for the year 2022, with the hoped-for output looking like Dataframe 3. I've tried looking through other questions but haven't found anything to help; any help would be greatly appreciated!
Dataframe 1 - Money

Date  Alex  Rob  Kev  Ben
2022    29   45   65   12
2021    11   32   11   19
2019    45   12   22   76

Dataframe 2

Name
James
Alex
Carl
Rob
Kev

Dataframe 3 (desired output)

Name   Amount
James
Alex       29
Carl
Rob        45
Kev        65
There are many different ways to achieve this.
One option is using map:
# take the 2022 row of df1 as a Series indexed by name
s = df1.set_index('Date').loc[2022]
# look each name up in that Series; names with no 2022 value get NaN
df2['Amount'] = df2['Name'].map(s)
output:
Name Amount
0 James NaN
1 Alex 29.0
2 Carl NaN
3 Rob 45.0
4 Kev 65.0
Another option is using merge:
# same lookup Series as above, merged in as a new 'Amount' column
s = df1.set_index('Date').loc[2022]
df3 = df2.merge(s.rename('Amount'), left_on='Name', right_index=True, how='left')
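For reference, a minimal self-contained sketch (sample frames reconstructed from the tables in the question) that either snippet runs against:

import pandas as pd

df1 = pd.DataFrame({'Date': [2022, 2021, 2019],
                    'Alex': [29, 11, 45],
                    'Rob':  [45, 32, 12],
                    'Kev':  [65, 11, 22],
                    'Ben':  [12, 19, 76]})
df2 = pd.DataFrame({'Name': ['James', 'Alex', 'Carl', 'Rob', 'Kev']})

s = df1.set_index('Date').loc[2022]   # Series mapping name -> 2022 amount
df2['Amount'] = df2['Name'].map(s)
print(df2)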
I have two Dataframes
df1

  fname  lname  age
0  Jack    Lee   45
1   Joy    Kay   34
2  Jeff    Kim   54
3  Josh  Chris   29

df2

   fname lname  Position
0   Jack   Ray        25
1  Chris   Kay        34
2    Kim    Xi        34
3   Josh   McC        24
4  David   Lee        56
5   Aron   Dev        41
6   Jack   Lee        45
7  Shane   Gab        43
8    Joy   Kay        34
9   Jack   Lee        45
I want to compare fname and lname across the two dfs and append the matches to a list, since an entry from df1 may be repeated multiple times in df2 (e.g. the data in row 0 of df1 appears in rows 6 and 9 of df2). I'm not clear on how to fetch one row from df1 and compare it with all the rows of df2 (a one-to-many comparison). Please assist me with this.
Using pd.merge() with indicator=True will return a clear comparison between the two dataframes based on the columns 'fname' and 'lname':
df = pd.merge(df2,
              df1[['fname', 'lname']],
              on=['fname', 'lname'],
              how='left',
              indicator=True)
print(df) gives:
fname lname Position _merge
0 Jack Ray 25 left_only
1 Chris Kay 34 left_only
2 Kim Xi 34 left_only
3 Josh McC 24 left_only
4 David Lee 56 left_only
5 Aron Dev 41 left_only
6 Jack Lee 45 both
7 Shane Gab 43 left_only
8 Joy Kay 34 both
9 Jack Lee 45 both
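If you then only want the df2 rows that occur in df1, collected into a list as the question asks, a small follow-up sketch:

matches = df[df['_merge'] == 'both'].drop(columns='_merge')
match_list = matches[['fname', 'lname']].values.tolist()
print(match_list)   # [['Jack', 'Lee'], ['Joy', 'Kay'], ['Jack', 'Lee']]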
I have a df with columns:
Student_id subject marks
1 English 70
1 math 90
1 science 60
1 social 80
2 English 90
2 math 50
2 science 70
2 social 40
I have another df1 with columns
Student_id Year_of_join column_with_info
1 2020 some_info1
1 2020 some_info2
1 2020 some_info3
2 2019 some_info4
2 2019 some_info5
I want to combine the two dataframes above (.csv files) into something like the res_df below:
Student_id subject marks year_of_join column_with_info
1 English 70 2020 some_info1
1 math 90 2020 some_info2
1 science 60 2020 some_info3
1 social 80 NaN NaN
2 English 90 2019 some_info4
2 math 50 2019 some_info5
2 science 70 NaN NaN
2 social 40 NaN NaN
Note: I want to join the datasets on Student_id. Both contain the same unique Student_ids, but the shape of the data differs between the two datasets.
P.S.: The resulting res_df is just an example of how the data might look after combining the two dataframes; it could also be like this:
Student_id subject marks year_of_join column_with_info
1 English 70 NaN NaN
1 math 90 2020 some_info1
1 science 60 2020 some_info2
1 social 80 2020 some_info3
2 English 90 NaN NaN
2 math 50 NaN NaN
2 science 70 2019 some_info4
2 social 40 2019 some_info5
Thanks in advance for the help!
Use GroupBy.cumcount to create a helper column, then merge with a left join:
# number each student's rows in order of appearance (0, 1, 2, ...)
df['g'] = df.groupby('Student_id').cumcount()
df1['g'] = df1.groupby('Student_id').cumcount()
# pair the n-th row per student in df with the n-th row per student in df1
df = df.merge(df1, on=['Student_id', 'g'], how='left').drop('g', axis=1)
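A quick self-contained check (sample frames rebuilt from the question; the pairing is positional within each Student_id, so which subject rows end up with NaN depends only on row order):

import pandas as pd

df = pd.DataFrame({'Student_id': [1, 1, 1, 1, 2, 2, 2, 2],
                   'subject': ['English', 'math', 'science', 'social'] * 2,
                   'marks': [70, 90, 60, 80, 90, 50, 70, 40]})
df1 = pd.DataFrame({'Student_id': [1, 1, 1, 2, 2],
                    'Year_of_join': [2020, 2020, 2020, 2019, 2019],
                    'column_with_info': ['some_info1', 'some_info2',
                                         'some_info3', 'some_info4',
                                         'some_info5']})

df['g'] = df.groupby('Student_id').cumcount()
df1['g'] = df1.groupby('Student_id').cumcount()
res_df = df.merge(df1, on=['Student_id', 'g'], how='left').drop('g', axis=1)
print(res_df)   # matches the first res_df layout shown in the question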
I have two dataframes
df1
Name 2010 2011
0 Jack 25 35
1 Jill 15 20
df2
Name 2010 2011
0 Berry 45 25
1 Jack 5 10
I want to create a third dataframe by adding the values in these dataframes
Desired Output
df3
Name 2010 2011
0 Jack 30 45 #add the values from df1 and df2
1 Jill 15 20
2 Berry 45 25
I have used this code:
df1.add(df2)
but it doesn't give the desired result, since add aligns on the row index rather than on Name.
Concat both dfs and do a groupby and sum:
print (pd.concat([df1, df2]).groupby("Name", as_index=False).sum())
Name 2010 2011
0 Berry 45 25
1 Jack 30 45
2 Jill 15 20
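An equivalent sketch closer to your original add attempt, assuming Name is unique within each frame: index by Name so that add aligns on names instead of row position.

df3 = (df1.set_index('Name')
          .add(df2.set_index('Name'), fill_value=0)   # missing names count as 0
          .reset_index())
# note: columns become float where fill_value was used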
I have multiple dataframes that I need to merge into a single dataset based on a unique identifier (uid), and on the timedelta between dates in each dataframe.
Here's a simplified example of the dataframes:
df1
uid tx_date last_name first_name meas_1
0 60 2004-01-11 John Smith 1.3
1 60 2016-12-24 John Smith 2.4
2 61 1994-05-05 Betty Jones 1.2
3 63 2006-07-19 James Wood NaN
4 63 2008-01-03 James Wood 2.9
5 65 1998-10-08 Tom Plant 4.2
6 66 2000-02-01 Helen Kerr 1.1
df2
uid rx_date last_name first_name meas_2
0 60 2004-01-14 John Smith A
1 60 2017-01-05 John Smith AB
2 60 2017-03-31 John Smith NaN
3 63 2006-07-21 James Wood A
4 64 2002-04-18 Bill Jackson B
5 65 1998-10-08 Tom Plant AA
6 65 2005-12-01 Tom Plant B
7 66 2013-12-14 Helen Kerr C
Basically I am trying to merge records for the same person from two separate sources, where the link between records for unique individuals is the 'uid', and the link between rows (where it exists) for each individual is a fuzzy relationship between 'tx_date' and 'rx_date' that can (usually) be accommodated by a specific time delta. There won't always be an exact or fuzzy match between dates, data could be missing from any column except 'uid', and each dataframe will contain a different but intersecting subset of 'uid's.
I need to be able to concatenate rows where the 'uid' columns match, and where the absolute time delta between 'tx_date' and 'rx_date' is within a given range (e.g. max delta of 14 days). Where the time delta is outside that range, or one of either 'tx_date' or 'rx_date' is missing, or where the 'uid' exists in only one of the dataframes, I still need to retain the data in that row. The end result should be something like:
uid tx_date rx_date first_name last_name meas_1 meas_2
0 60 2004-01-11 2004-01-14 John Smith 1.3 A
1 60 2016-12-24 2017-01-05 John Smith 2.4 AB
2 60 NaT 2017-03-31 John Smith NaN NaN
3 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood NaN NaN
6 64 2002-04-18 NaT Bill Jackson NaN B
7 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
8 65 NaT 2005-12-01 Tom Plant NaN B
9 66 2000-02-01 NaT Helen Kerr 1.1 NaN
10 66 NaT 2013-12-14 Helen Kerr NaN C
Seems like pandas.merge_asof should be useful here, but I've not been able to get it to do quite what I need.
Trying merge_asof on two of the real dataframes I have raised ValueError: left keys must be sorted
As per this question the problem there was actually due to there being NaT values in the 'date' column for some rows. I dropped the rows with NaT values, and sorted the 'date' columns in each dataframe, but the result still isn't quite what I need.
The code below shows the steps taken.
import pandas as pd

# build a common 'date' key column for merge_asof
df1['date'] = pd.to_datetime(df1['tx_date'])
df1 = df1.dropna(subset=['date'])   # merge_asof can't handle NaT in the key
df1 = df1.sort_values('date')       # ...and the key must be sorted

df2['date'] = pd.to_datetime(df2['rx_date'])
df2 = df2.dropna(subset=['date'])
df2 = df2.sort_values('date')

df_merged = (pd.merge_asof(df1, df2, on='date', by='uid',
                           tolerance=pd.Timedelta('14 days'))
             .sort_values('uid'))
Result:
uid tx_date rx_date last_name_x first_name_x meas_1 meas_2
3 60 2004-01-11 2004-01-14 John Smith 1.3 A
6 60 2016-12-24 2017-01-05 John Smith 2.4 AB
0 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood 2.9 NaN
1 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
2 66 2000-02-01 NaT Helen Kerr 1.1 NaN
It looks like a left join rather than a full outer join, so any row in df2 without a match on 'uid' and 'date' in df1 is lost (and, though it isn't clear from this simplified example, I also need to add back the rows where the date was NaT).
Is there some way to achieve a lossless merge, either by somehow doing an outer join with merge_asof, or using some other approach?
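One possible direction (a sketch under the assumptions above, not a verified solution): run merge_asof as before, then find the df2 rows it dropped via an anti-join on (uid, rx_date) and concatenate them back in.

merged = pd.merge_asof(df1, df2, on='date', by='uid',
                       tolerance=pd.Timedelta('14 days'),
                       suffixes=('', '_y'))

# (uid, rx_date) pairs that found a partner in df1
matched = {(u, d) for u, d in zip(merged['uid'], merged['rx_date'])
           if pd.notna(d)}

# df2 rows that merge_asof dropped entirely
leftover = df2[[(u, d) not in matched
                for u, d in zip(df2['uid'], df2['rx_date'])]]

result = (pd.concat([merged, leftover], ignore_index=True)
          .sort_values(['uid', 'date'])
          .reset_index(drop=True))
# caveats: leftover's name columns fill the unsuffixed columns that came from
# df1 (fine here, since names agree within a uid), and any NaT-dated rows set
# aside before the merge would need to be concatenated back in the same way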
Let's say I have the following dataframe:
df = pd.DataFrame({'name':['john','mary','peter','jeff','bill'], 'matched_name':['mary','john','jeff','lisa','jose'], 'ratio':[78, 78, 22, 19, 45]})
print(df)
name matched_name ratio
0 john mary 78
1 mary john 78
2 peter jeff 22
3 jeff lisa 19
4 bill jose 45
I want to remove duplicated rows based on this condition: if the name and matched_name values are the same after exchanging their places, and the ratio is also the same, then those rows are considered duplicates.
Under the above rule, row 0 and row 1 are duplicates, so I will keep only row 0. How could I do this using pandas? Thanks.
This is the expected result:
name matched_name ratio
0 john mary 78
1 peter jeff 22
2 jeff lisa 19
3 bill jose 45
Use np.sort to sort the values within each row, add the ratio column back, test for duplicates with DataFrame.duplicated, and finally filter with the inverted mask (~) via boolean indexing:
import numpy as np

# sort each name pair into a canonical order so swapped pairs compare equal
m = (pd.DataFrame(np.sort(df[['name', 'matched_name']], axis=1), index=df.index)
       .assign(ratio=df['ratio'])
       .duplicated())
df = df[~m]
print (df)
name matched_name ratio
0 john mary 78
2 peter jeff 22
3 jeff lisa 19
4 bill jose 45
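For illustration, this is the intermediate frame that duplicated() sees (computed on the original df, before filtering): the names are sorted within each row, with ratio appended.

print(pd.DataFrame(np.sort(df[['name', 'matched_name']], axis=1), index=df.index)
        .assign(ratio=df['ratio']))
#       0      1  ratio
# 0  john   mary     78
# 1  john   mary     78   <- now identical to row 0, so flagged as a duplicate
# 2  jeff  peter     22
# 3  jeff   lisa     19
# 4  bill   jose     45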
Try the below:
# sort every row's values (cast to str) so swapped name pairs collide,
# then keep the index of the first occurrence of each combination
m = pd.DataFrame(np.sort(df.astype(str).values, axis=1)).drop_duplicates().index
df = df.loc[df.index.isin(m)].reset_index()
print(df)
index name matched_name ratio
0 0 john mary 78
1 2 peter jeff 22
2 3 jeff lisa 19
3 4 bill jose 45
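If you don't need the original row positions preserved as a column, a small variation drops them:

df = df.loc[df.index.isin(m)].reset_index(drop=True)   # no extra 'index' column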