How can I compare pandas columns and check each cell? - python

I have two files.
a.txt contains the data below:
Zone,Aliase1,Aliase2
VNX7600SPB3_8B3_H1,VNX7600SPB3,8B3_H1
VNX7600SPBA_8B4_H1,VNX7600SPA3,8B4_H1
CX480SPA1_11B3_H1,CX480SPA1,11B3_H1
CX480SPB1_11B4_H1,CX480SPB1,11B4_H1
b.txt contains the data below:
Zone,Aliase1,Aliase2
VNX7600SPB3_8B3_H1,VNX7600SPB3,8B3_H1
CX480SPA1_11B3_H1,CX480SPA1,11B3_H1
I want to produce a result that compares the Zone columns of the two files, like below:
Zone,Aliase1,Aliase2,Status
VNX7600SPB3_8B3_H1,VNX7600SPB3,8B3_H1,Active
VNX7600SPBA_8B4_H1,VNX7600SPA3,8B4_H1,Not used
CX480SPA1_11B3_H1,CX480SPA1,11B3_H1,Active
CX480SPB1_11B4_H1,CX480SPB1,11B4_H1,Not used
How can I produce this result?
I tried using pandas, but I couldn't get the result. Please help me.

I think you need merge with an outer join and the parameter indicator=True, then rename the column and map the 3 possible values (both, left_only and right_only):
#if no 'on' parameter, merge all columns
df = pd.merge(df1, df2, how='outer', indicator=True)
df = df.rename(columns={'_merge':'status'})
d = {'left_only':'Not used', 'both':'Active', 'right_only':'b_file_only'}
df['status'] = df['status'].map(d)
print(df)
Zone Aliase1 Aliase2 status
0 VNX7600SPB3_8B3_H1 VNX7600SPB3 8B3_H1 Active
1 VNX7600SPBA_8B4_H1 VNX7600SPA3 8B4_H1 Not used
2 CX480SPA1_11B3_H1 CX480SPA1 11B3_H1 Active
3 CX480SPB1_11B4_H1 CX480SPB1 11B4_H1 Not used
If you want to compare only by the Zone column, add the parameter on and filter df2 to a one-column subset ([['Zone']]):
df = pd.merge(df1, df2[['Zone']], how='outer', indicator=True, on='Zone')
df = df.rename(columns={'_merge':'status'})
d = {'left_only':'Not used', 'both':'Active', 'right_only':'b_file_only'}
df['status'] = df['status'].map(d)
print(df)
Zone Aliase1 Aliase2 status
0 VNX7600SPB3_8B3_H1 VNX7600SPB3 8B3_H1 Active
1 VNX7600SPBA_8B4_H1 VNX7600SPA3 8B4_H1 Not used
2 CX480SPA1_11B3_H1 CX480SPA1 11B3_H1 Active
3 CX480SPB1_11B4_H1 CX480SPB1 11B4_H1 Not used
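Putting the answer together with the sample data from the question, here is a self-contained sketch. The files are replaced by inline CSV strings so it runs anywhere; in practice you would use pd.read_csv('a.txt') and pd.read_csv('b.txt').

```python
from io import StringIO
import pandas as pd

a_txt = """Zone,Aliase1,Aliase2
VNX7600SPB3_8B3_H1,VNX7600SPB3,8B3_H1
VNX7600SPBA_8B4_H1,VNX7600SPA3,8B4_H1
CX480SPA1_11B3_H1,CX480SPA1,11B3_H1
CX480SPB1_11B4_H1,CX480SPB1,11B4_H1"""
b_txt = """Zone,Aliase1,Aliase2
VNX7600SPB3_8B3_H1,VNX7600SPB3,8B3_H1
CX480SPA1_11B3_H1,CX480SPA1,11B3_H1"""

df1 = pd.read_csv(StringIO(a_txt))  # stands in for pd.read_csv('a.txt')
df2 = pd.read_csv(StringIO(b_txt))  # stands in for pd.read_csv('b.txt')

# Outer merge on Zone only; the indicator column records which file(s)
# each Zone came from, which we then translate into a readable status.
df = pd.merge(df1, df2[['Zone']], how='outer', indicator=True, on='Zone')
df = df.rename(columns={'_merge': 'status'})
d = {'left_only': 'Not used', 'both': 'Active', 'right_only': 'b_file_only'}
df['status'] = df['status'].map(d)
print(df)
```

Zones present in both files come out as Active; zones present only in a.txt come out as Not used.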


How to get Python function to update original df [duplicate]

I have a large data frame df and a small data frame df_right with 2 columns a and b. I want to do a simple left join / lookup on a without copying df.
I came up with this code, but I am not sure how robust it is:
dtmp = pd.merge(df[['a']], df_right, on = 'a', how = "left") #one col left join
df['b'] = dtmp['b'].values
I know it certainly fails when there are duplicated keys: pandas left join - why more results?
Is there better way to do this?
Related:
Outer merging two data frames in place in pandas
What are the exact downsides of copy=False in DataFrame.merge()?
You are almost there.
There are 4 cases to consider:
Both df and df_right do not have duplicated keys
Only df has duplicated keys
Only df_right has duplicated keys
Both df and df_right have duplicated keys
Your code fails in cases 3 & 4, since the merge increases the row count of df. To make it work, you need to choose what information to drop from df_right prior to merging; the purpose is to force any merge into either case 1 or 2.
For example, if you wish to keep the "first" value for each duplicated key in df_right, the following code works for all 4 cases above.
dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left')
df['b'] = dtmp['b'].values
Alternatively, if column 'b' of df_right consists of numeric values and you wish to use a summary statistic:
dtmp = pd.merge(df[['a']], df_right.groupby('a').mean().reset_index(drop=False), on='a', how='left')
df['b'] = dtmp['b'].values
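To see why the de-duplication makes the positional assignment safe, here is a runnable sketch with small made-up frames (not from the question), where df_right has a duplicated key:

```python
import pandas as pd

# Illustrative data: 'x' is duplicated in the lookup table df_right
df = pd.DataFrame({'a': ['x', 'y', 'z'], 'c': [1, 2, 3]})
df_right = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [10, 11, 20]})

# De-duplicate the lookup table first, so the left join cannot add rows;
# the merge result then has exactly one row per row of df, in df's order.
dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'),
                on='a', how='left')
df['b'] = dtmp['b'].values  # row count unchanged, so this lines up
print(df)
```

Keys missing from df_right ('z' here) simply get NaN in the new column.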

Python Pandas - How to find rows where a column value is different from two data frame

I am trying to get rows where values in a column are different from two data frames.
For example, let say we have these two dataframe below:
import pandas as pd
data1 = {'date': [20210701, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0]}
data2 = {'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0, 0]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
As you can see, Dave has different values in column 'a' on 20210704, and Sue has different values in column 'a' on 20210705. Therefore, my desired output should be something like:
import pandas as pd
output = {'date': [20210704, 20210705],
          'name': ['Dave', 'Sue'],
          'a_from_old': [0, 1]}
df_output = pd.DataFrame(output)
If I am not mistaken, what I am asking for is pretty much the same as the MINUS statement in SQL, unless I am missing some edge cases.
How can I find those rows with the same date and name but different values in a column?
Update
I found an edge case: some rows do not appear in the other data frame at all. I want to find the rows that are in both data frames but whose value in column 'a' differs.
I edited the sample data set to take the edge case into account.
(Notice that Dave on 20210702 does not appear in the final output because that row is not in the first data frame.)
Another option, with an inner merge: keep only rows where the a from df1 does not equal the a from df2:
df3 = (
    df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', '_df2'))
       .query('a_from_old != a_df2')  # df1 `a` != df2 `a`
       .drop('a_df2', axis=1)         # Remove column with df2 `a` values
)
df3:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
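Using the sample data from the question, the inner-merge approach is runnable end-to-end:

```python
import pandas as pd

df1 = pd.DataFrame({'date': [20210701, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0]})
df2 = pd.DataFrame({'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0, 0]})

# Inner merge keeps only (date, name) pairs present in both frames,
# which handles the edge case; then keep rows where the two 'a' values disagree.
df3 = (df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', '_df2'))
          .query('a_from_old != a_df2')
          .drop('a_df2', axis=1))
print(df3)
```

The row for Dave on 20210702 never enters the result, because the inner merge drops keys that exist in only one frame.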
Try a left merge() with indicator=True, then filter the result with query(), drop the extra column with drop(), and rename 'a' to 'a_from_old' with rename():
out = (df1.merge(df2, on=['date', 'name', 'a'], how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop('_merge', axis=1)
          .rename(columns={'a': 'a_from_old'}))
output of out:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Note: if there are many more columns that you want to rename, pass suffixes=('_from_old', '') to merge() instead.
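The indicator-based answer can likewise be checked against the question's data; merging on all three columns flags df1 rows that have no exact counterpart in df2:

```python
import pandas as pd

df1 = pd.DataFrame({'date': [20210701, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0]})
df2 = pd.DataFrame({'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
                    'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
                    'a': [1, 0, 1, 1, 0, 0]})

# Left merge on all three columns: rows marked 'left_only' exist in df1
# but have no (date, name, a) match in df2.
out = (df1.merge(df2, on=['date', 'name', 'a'], how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop('_merge', axis=1)
          .rename(columns={'a': 'a_from_old'}))
print(out)
```

One caveat: unlike the inner-merge approach, this would also return a df1 row whose (date, name) key is entirely absent from df2; with this particular data the two results coincide.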

How to add a column from df1 to df2 if it not present in df2, else do nothing

I have 2 data frames from a basic web scrape using Pandas (below). The second table has fewer columns than the first, and I need to concat the dataframes. I have been manually inserting columns for a while, but since they change frequently I would like a function that can check the columns of df1, see whether they are all present in df2, and if not, add the missing columns before concatenating.
import pandas as pd
link = 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_French_presidential_election'
df = pd.read_html(link,header=0)
df1 = df[1]
df1 = df1.drop([0])
df1 = df1.drop('Abs.',axis=1)
df2 = df[2]
df2 = df2.drop([0])
df2 = df2.drop(['Abs.'],axis=1)
Many thanks,
#divingTobi's answer:
pd.concat([df1, df2]) does the trick.
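This works because pd.concat aligns columns by name and fills missing columns with NaN, so no manual column insertion is needed. A minimal sketch with toy frames (stand-ins for the scraped Wikipedia tables, where df2 lacks one column):

```python
import pandas as pd

# Toy stand-ins for the scraped tables: df2 is missing column 'C'
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
df2 = pd.DataFrame({'A': [7], 'B': [8]})

# concat takes the union of the columns; rows from df2 get NaN in 'C'
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

ignore_index=True renumbers the rows, which is usually what you want when stacking scraped tables.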

Join in Pandas Dataframe using conditional join statement

I am trying to join two dataframes with the following data:
df1
df2
I want to join these two dataframes on the condition that if 'col2' of df2 is blank/NULL, the join should occur only on 'column1' of df1 and 'col1' of df2; but if it is not NULL/blank, the join should occur on two conditions, i.e. 'column1' and 'column2' of df1 with 'col1' and 'col2' of df2 respectively.
For reference the final dataframe that I wish to obtain is:
My current approach is to slice these 2 dataframes into 4 and then join them separately based on the condition. Is there any way to do this without slicing them, or maybe a better way that I'm missing?
The idea is to rename the columns and left join on both columns first, then fill the missing values by matching on column1 alone. Here it is necessary to remove duplicates with DataFrame.drop_duplicates before Series.map, so that the values in col1 are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
EDIT: A general solution that works with multiple columns - the first part is the same (a left join); in the second part, merge by one column and use DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
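Since the question's dataframes were posted as images, here is the first (single-column) solution run on small hypothetical stand-in data, just to show the two-step mechanics:

```python
import pandas as pd

# Hypothetical stand-ins for the frames shown as images in the question.
# df2 maps (col1, col2) to col3; one row has a blank col2.
df1 = pd.DataFrame({'column1': ['a', 'a', 'b'],
                    'column2': ['x', 'y', 'z']})
df2 = pd.DataFrame({'col1': ['a', 'b'],
                    'col2': ['x', None],
                    'col3': [1, 2]})

# Step 1: exact match on both columns
df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')

# Step 2: where step 1 found nothing, fall back to matching on column1 alone
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
print(df)
```

Row ('a', 'x') matches exactly in step 1; rows ('a', 'y') and ('b', 'z') get their col3 filled by the column1-only fallback in step 2.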

How to take data from another dataframe based on 2 columns

Here is my code
A['period_id'] = A['period_number','Session'].map(B.set_index(['period_number','Session'])['period_id'])
So I want to take data from column period_id of B and add it to A, based on the criterion that the 2 columns (period_number and Session) match. However, it gave me an error. What can I do?
You can use pd.merge:
A_columns = list(A.columns) + ["period_id"]
# merge based on period_number and Session
merged_df = pd.merge(A, B, how='left', left_on=['period_number','Session'], right_on = ['period_number','Session'])
final_df = merged_df[A_columns] # filter for only columns in A + `period_id` from B
Note that if A's column names for period_number and Session are different, you'll have to adjust left_on, and vice versa for B. To be explicit: A is the left dataframe here, and B is the right dataframe.
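A runnable sketch of this answer, with small hypothetical A and B frames (the question did not show its data):

```python
import pandas as pd

# Hypothetical frames: B maps (period_number, Session) to period_id
A = pd.DataFrame({'period_number': [1, 2, 2],
                  'Session': ['AM', 'AM', 'PM'],
                  'value': [10, 20, 30]})
B = pd.DataFrame({'period_number': [1, 2, 2],
                  'Session': ['AM', 'AM', 'PM'],
                  'period_id': [101, 102, 103]})

# Left merge keeps every row of A and pulls period_id from B
# wherever both period_number and Session match.
merged_df = pd.merge(A, B, how='left',
                     left_on=['period_number', 'Session'],
                     right_on=['period_number', 'Session'])
print(merged_df)
```

Rows of A with no matching (period_number, Session) pair in B would simply get NaN in period_id.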
