How to take data from another dataframe based on 2 columns - python

Here is my code
A['period_id'] = A['period_number','Session'].map(B.set_index(['period_number','Session'])['period_id'])
So I want to take data from the period_id column of B and assign it to A, based on the criterion that the two columns (period_number and Session) match. However, it gives me an error. What can I do?

You can use pd.merge:
A_columns = list(A.columns) + ["period_id"]   # columns of A plus the looked-up column
# merge based on period_number and Session
merged_df = pd.merge(A, B, how='left', left_on=['period_number','Session'], right_on = ['period_number','Session'])
final_df = merged_df[A_columns] # filter for only columns in A + `period_id` from B
Note that if A's column names for period_number and Session are different, you'll have to adjust left_on accordingly, and likewise right_on for B's names. To be explicit, A is the left dataframe here and B is the right dataframe.
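If you prefer to stay closer to the original map-based attempt, a two-column lookup can also be done by mapping a zipped key against a MultiIndex lookup Series. A minimal sketch with made-up values (only the column names come from the question):
import pandas as pd

# hypothetical frames; the real contents of A and B are not shown in the question
A = pd.DataFrame({'period_number': [1, 1, 2], 'Session': ['AM', 'PM', 'AM']})
B = pd.DataFrame({'period_number': [1, 1, 2],
                  'Session': ['AM', 'PM', 'AM'],
                  'period_id': [10, 11, 20]})

lookup = B.set_index(['period_number', 'Session'])['period_id']
keys = pd.Series(list(zip(A['period_number'], A['Session'])), index=A.index)
A['period_id'] = keys.map(lookup)   # tuples are looked up against the MultiIndex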


How to get Python function to update original df [duplicate]

I have a large data frame df and a small data frame df_right with 2 columns a and b. I want to do a simple left join / lookup on a without copying df.
I came up with this code, but I am not sure how robust it is:
dtmp = pd.merge(df[['a']], df_right, on = 'a', how = "left") #one col left join
df['b'] = dtmp['b'].values
I know it certainly fails when there are duplicated keys: pandas left join - why more results?
Is there better way to do this?
Related:
Outer merging two data frames in place in pandas
What are the exact downsides of copy=False in DataFrame.merge()?
You are almost there.
There are 4 cases to consider:
Both df and df_right do not have duplicated keys
Only df has duplicated keys
Only df_right has duplicated keys
Both df and df_right have duplicated keys
Your code fails in cases 3 and 4, since the merge increases the row count relative to df. To make it work, you need to decide what information to drop from df_right prior to merging, so that any merge falls into case 1 or 2.
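To see the failure concretely, here is a minimal sketch with made-up data for case 3:
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df_right = pd.DataFrame({'a': [2, 2], 'b': [20, 21]})   # duplicated key only in df_right (case 3)

dtmp = pd.merge(df[['a']], df_right, on='a', how='left')
print(len(df), len(dtmp))   # 2 3 -> dtmp no longer lines up with df row for row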
For example, if you wish to keep "first" values for each duplicated key in df_right, the following code works for all 4 cases above.
dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left')
df['b'] = dtmp['b'].values
Alternatively, if column 'b' of df_right consists of numeric values and you wish to use a summary statistic (here the mean) instead:
dtmp = pd.merge(df[['a']], df_right.groupby('a').mean().reset_index(drop=False), on='a', how='left')
df['b'] = dtmp['b'].values
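A quick sanity check is to assert that the row count is unchanged before writing the column back; a small sketch with made-up data covering case 4:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2]})                           # duplicated key in df
df_right = pd.DataFrame({'a': [2, 2, 3], 'b': [20, 21, 30]})  # duplicated key in df_right

dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left')
assert len(dtmp) == len(df)   # row count preserved, so positional assignment is safe
df['b'] = dtmp['b'].values
print(df)                     # b is NaN for a=1 and 20 for both a=2 rows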

Joining two dataframes on subvalue of the key column

I am currently trying to join / merge two dataframes on the column Key, where the key is a standalone value such as 5 in one dataframe, but can consist of multiple values such as [5,6,13] in the other.
For example like this:
df1 = pd.DataFrame({'key': [["5","6","13"],["10","7"],["6","8"]]})
df2 = pd.DataFrame({'sub_key': ["5","10","6"]})
However, my dataframes are a lot bigger and consist of many columns, so an efficient solution would be great.
As a result I would like to have a table like this:
Key1   Key2
5      5,6,13
10     10,7
and so on ...
I already tried to apply this approach to my code, but it didn't work:
df1['join'] = 1
df2['join'] = 1
merged= df1.merge(df2, on='join').drop('join', axis=1)
df2.drop('join', axis=1, inplace=True)
merged['match'] = merged.apply(lambda x: x.key(x.sub_key), axis=1).ge(0)
I also tried to split and explode the column and join on single values; however, the problem there was that not all column values were split correctly, and I would need to combine everything back into one cell once joined.
Help would be much appreciated!
If you only want to match the first key:
df1['sub_key'] = df1.key.str[0]
df1.merge(df2)
If you want to match ANY key:
df3 = df1.explode('key').rename(columns={'key':'sub_key'})
df3 = df3.join(df1)
df3.merge(df2)
Edit: First version had a small bug, fixed it.
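Putting the match-ANY-key version together with the example frames from the question (a sketch; with the real data, the extra columns of df1 and df2 come along automatically):
import pandas as pd

df1 = pd.DataFrame({'key': [["5", "6", "13"], ["10", "7"], ["6", "8"]]})
df2 = pd.DataFrame({'sub_key': ["5", "10", "6"]})

df3 = df1.explode('key').rename(columns={'key': 'sub_key'})  # one row per list element
df3 = df3.join(df1)                                          # bring the original list column back
matched = df3.merge(df2, on='sub_key')
print(matched)   # a row of df1 appears once per matching element (row 0 matches both "5" and "6")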

Join in Pandas Dataframe using conditional join statement

I am trying to join two dataframes, df1 and df2 (their contents were shown as table images in the original post).
I want to join these two dataframes on the condition that if 'col2' of df2 is blank/NULL, then the join should occur only on 'column1' of df1 and 'col1' of df2; but if it is not NULL/blank, then the join should occur on two conditions, i.e. 'column1' and 'column2' of df1 with 'col1' and 'col2' of df2 respectively.
For reference, the final dataframe that I wish to obtain was also shown as an image.
My current approach is to slice these 2 dataframes into 4 and then join them separately based on the condition. Is there any way to do this without slicing them, or maybe a better way that I'm missing?
The idea is to rename the columns and left-join on both columns first, then fill the remaining missing values by matching on column1 alone. Here it is necessary to remove duplicates with DataFrame.drop_duplicates before Series.map, so that the values in col1 are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
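Since the frames in the question were only shown as images, here is a minimal sketch with made-up data illustrating the two steps (an exact two-column match first, then a col1-only fallback for rows where col2 of df2 is blank):
import pandas as pd
import numpy as np

# hypothetical data standing in for the screenshots in the question
df1 = pd.DataFrame({'column1': ['x', 'y'], 'column2': ['a', 'b']})
df2 = pd.DataFrame({'col1': ['x', 'y'], 'col2': ['a', np.nan], 'col3': [1, 2]})

df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')   # ('y', 'b') finds no exact match -> NaN
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))          # filled from the col1-only lookup
print(df)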
EDIT: A general solution that works with multiple columns - the first part is the same (a left join on both columns); in the second part, a merge on the single column is combined with DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
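The same made-up frames with one extra column in df2 illustrate the general version; the suffixed helper columns only serve to fill the gaps and are dropped at the end (again a sketch, not the question's actual data):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'column1': ['x', 'y'], 'column2': ['a', 'b']})
df2 = pd.DataFrame({'col1': ['x', 'y'], 'col2': ['a', np.nan],
                    'col3': [1, 2], 'col4': [10, 20]})

df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('', '_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
print(df)   # col3 and col4 for the ('y', 'b') row are filled from the col1-only match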

What is the difference between 'pd.concat([df1, df2], join='outer')', 'df1.combine_first(df2)', 'pd.merge(df1, df2)' and 'df1.join(df2, how='outer')'? [duplicate]

This question already has answers here: Difference(s) between merge() and concat() in pandas
Say I have the following 2 pandas dataframes:
import pandas as pd
A = [174,-155,-931,301]
B = [943,847,510,16]
C = [325,914,501,884]
D = [-956,318,319,-83]
E = [767,814,43,-116]
F = [110,-784,-726,37]
G = [-41,964,-67,-207]
H = [-555,787,764,-788]
df1 = pd.DataFrame({"A": A, "B": B, "C": C, "D": D})
df2 = pd.DataFrame({"E": E, "B": F, "C": G, "D": H})
If I do concat with join=outer, I get the following resulting dataframe:
pd.concat([df1, df2], join='outer')
If I do df1.combine_first(df2), I get the following:
df1.set_index('B').combine_first(df2.set_index('B')).reset_index()
If I do pd.merge(df1, df2), I get the following which is identical to the result produced by concat:
pd.merge(df1, df2, on=['B','C','D'], how='outer')
And finally, if I do df1.join(df2, how='outer'), I get the following:
df1.join(df2, how='outer', on='B', lsuffix='_left', rsuffix='_right')
I don't fully understand how and why each produces different results.
concat: appends one dataframe to another along the given axis (default axis=0, i.e. concatenate along the index, putting the other dataframe below the given one). Data are aligned on the other axis (for the default setting, the columns are aligned). This is why we get NaNs in the non-matching columns 'A' and 'E'.
combine_first: replaces NaNs in the dataframe with existing values from the other dataframe, where rows and columns are pooled (the union of rows and columns from both dataframes). In your example there are no missing values to begin with, but they emerge from the union operation because your indices have no common entries. The order of the rows results from the sorted combined index (df1.B and df2.B).
So if there are no missing values in your dataframe, you wouldn't normally use combine_first.
merge is a database-style combination of two dataframes that offers more options for how to merge (left, right, on specific columns) than concat. In your example, the data in the result are identical, but there's a difference in the index between concat and merge: when merging on columns, the dataframe indices are ignored and a new index is created.
join merges df1 and df2 on the given column of df1 (in the example 'B') and the index of df2. In your example this is the same as pd.merge(df1, df2, left_on='B', right_index=True, how='outer', suffixes=('_left', '_right')). As there is no match between column 'B' of df1 and the index of df2, there will be a lot of NaNs due to the outer join.
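To make the differences tangible, here is a small sketch that prints only the shape of each result for the df1/df2 defined above (structure, not values):
import pandas as pd

df1 = pd.DataFrame({"A": [174, -155, -931, 301], "B": [943, 847, 510, 16],
                    "C": [325, 914, 501, 884], "D": [-956, 318, 319, -83]})
df2 = pd.DataFrame({"E": [767, 814, 43, -116], "B": [110, -784, -726, 37],
                    "C": [-41, 964, -67, -207], "D": [-555, 787, 764, -788]})

print(pd.concat([df1, df2], join='outer').shape)                   # (8, 5): rows stacked, columns unioned, index repeats 0-3
print(pd.merge(df1, df2, on=['B', 'C', 'D'], how='outer').shape)   # (8, 5): same data, but a fresh 0-7 index
print(df1.set_index('B').combine_first(df2.set_index('B')).shape)  # (8, 4): union of the two 'B' indices, sorted
print(df1.join(df2, how='outer', on='B',
               lsuffix='_left', rsuffix='_right').shape)           # no matches between df1['B'] and df2's index, so all rows kept and overlapping columns suffixed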

How can I compare columns between two pandas dataframes and check each cell?

I have two files.
a.txt has the data below.
Zone,Aliase1,Aliase2
VNX7600SPB3_8B3_H1,VNX7600SPB3,8B3_H1
VNX7600SPBA_8B4_H1,VNX7600SPA3,8B4_H1
CX480SPA1_11B3_H1,CX480SPA1,11B3_H1
CX480SPB1_11B4_H1,CX480SPB1,11B4_H1
b.txt has the data below.
Zone,Aliase1,Aliase2
VNX7600SPB3_8B3_H1,VNX7600SPB3,8B3_H1
CX480SPA1_11B3_H1,CX480SPA1,11B3_H1
I want to compare the Zone columns of the two files and produce a result like the one below.
Zone,Aliase1,Aliase2,Status
VNX7600SPB3_8B3_H1,VNX7600SPB3,8B3_H1,Active
VNX7600SPBA_8B4_H1,VNX7600SPA3,8B4_H1,Not used
CX480SPA1_11B3_H1,CX480SPA1,11B3_H1,Active
CX480SPB1_11B4_H1,CX480SPB1,11B4_H1,Not used
How can I produce this result? I tried using pandas, but I couldn't get it to work. Please help me.
I think you need merge with an outer join and the parameter indicator=True, then rename the indicator column and map its 3 possible values (both, left_only and right_only):
# with no 'on' parameter, the merge uses all common columns
df = pd.merge(df1, df2, how='outer', indicator=True)
df = df.rename(columns={'_merge':'status'})
d = {'left_only':'Not used', 'both':'Active', 'right_only':'b_file_only'}
df['status'] = df['status'].map(d)
print (df)
Zone Aliase1 Aliase2 status
0 VNX7600SPB3_8B3_H1 VNX7600SPB3 8B3_H1 Active
1 VNX7600SPBA_8B4_H1 VNX7600SPA3 8B4_H1 Not used
2 CX480SPA1_11B3_H1 CX480SPA1 11B3_H1 Active
3 CX480SPB1_11B4_H1 CX480SPB1 11B4_H1 Not used
If you want to compare only by the Zone column, add the on parameter and select just that column from df2 with a subset ([['Zone']]):
df = pd.merge(df1, df2[['Zone']], how='outer', indicator=True, on='Zone')
df = df.rename(columns={'_merge':'status'})
d = {'left_only':'Not used', 'both':'Active', 'right_only':'b_file_only'}
df['status'] = df['status'].map(d)
print (df)
Zone Aliase1 Aliase2 status
0 VNX7600SPB3_8B3_H1 VNX7600SPB3 8B3_H1 Active
1 VNX7600SPBA_8B4_H1 VNX7600SPA3 8B4_H1 Not used
2 CX480SPA1_11B3_H1 CX480SPA1 11B3_H1 Active
3 CX480SPB1_11B4_H1 CX480SPB1 11B4_H1 Not used
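The answer assumes df1 and df2 have already been read into DataFrames; a minimal sketch of the full round-trip for the two comma-separated files shown above (file names as in the question, the output file name is made up):
import pandas as pd

df1 = pd.read_csv('a.txt')   # columns: Zone, Aliase1, Aliase2
df2 = pd.read_csv('b.txt')

df = pd.merge(df1, df2[['Zone']], how='outer', indicator=True, on='Zone')
df = df.rename(columns={'_merge': 'Status'})
df['Status'] = df['Status'].map({'left_only': 'Not used', 'both': 'Active', 'right_only': 'b_file_only'})
df.to_csv('result.txt', index=False)   # hypothetical output file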
