Drop columns that aren't common between two dataframes? - python

I have two dataframes that have many columns in common but a few that do not exist in both. I would like to create a dataframe that only has the columns that are in common between both dataframes. So for example:
list(df1)
['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Captain']
list(df2)
['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Countess']
And I would like to go to:
['Survived', 'Age', 'Title_Mr', 'Title_Mrs']
Since Title_Mr and Title_Mrs are in both df1 and df2. I've figured out how to do it by manually entering the column names like so:
df1 = df1.drop(['Title_Captain'], axis=1)
But I'd like to find a more robust solution where I don't have to manually enter the column names. Suggestions?

Using the comments of @linuxfan and @PadraicCunningham we can get a list of common columns:
common_cols = list(set(df1.columns).intersection(df2.columns))
Edit: @AdamHughes' answer made me consider preserving the column order. If that is important you could do this instead, since iterating over a set does not preserve order:
common_cols = [col for col in df1.columns if col in df2.columns]
To get another DataFrame with just those columns you use that list to select only those columns from df1:
df3 = df1[common_cols]
According to http://pandas.pydata.org/pandas-docs/stable/indexing.html:
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be
raised.
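Putting it together, here is a minimal runnable sketch (the column names come from the question; the row values are made up for illustration):
import pandas as pd

# hypothetical row data; only the column names come from the question
df1 = pd.DataFrame([[1, 22, 1, 0, 0]],
                   columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Captain'])
df2 = pd.DataFrame([[0, 35, 0, 1, 0]],
                   columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Countess'])

# intersect while keeping df1's column order
common_cols = [col for col in df1.columns if col in df2.columns]
df3 = df1[common_cols]
print(list(df3))  # ['Survived', 'Age', 'Title_Mr', 'Title_Mrs']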

Alternatively, drop from df1 every column that is not also in df2:
df1 = df1.drop([col for col in df1.columns if col not in df2.columns], axis=1)

You don't necessarily need to drop the columns, just select the columns of interest:
In [204]:
df1 = pd.DataFrame(columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Captain'])
df2 = pd.DataFrame(columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Countess'])
# create a list of the common columns using set and intersection
common_cols = list(set.intersection(set(df1), set(df2)))
common_cols
['Title_Mr', 'Age', 'Survived', 'Title_Mrs']
# use this list to perform column selection
df1[common_cols]
Out[204]:
Empty DataFrame
Columns: [Title_Mr, Age, Survived, Title_Mrs]
Index: []

Related

Populating a column based off of values in another column

Hi, I am working with pandas to manipulate some lab data. I currently have a dataframe with 5 columns.
The first three columns (Analyte, CAS NO(1), and Value(1)) are in the correct order.
The last two columns (CAS NO(2) and Value(2)) are not.
Is there a way to align CAS NO(2) and Value(2) with the first three columns based on matching CAS numbers (i.e. CAS NO(2) = CAS NO(1))?
I am new to python and pandas. Thank you for your help.
You can reorder the columns by reassigning the df variable as a slice of itself, indexed on a list whose entries are the column names in question.
colidx = ['Analyte', 'CAS NO(1)', 'CAS NO(2)']
df = df[colidx]
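For instance, a quick sketch with made-up lab data (colidx would list all five columns in the desired order):
import pandas as pd

# made-up frame standing in for the lab data
df = pd.DataFrame({'Value(1)': [1.0, 2.0],
                   'CAS NO(1)': ['71-43-2', '100-41-4'],
                   'Analyte': ['Benzene', 'Ethylbenzene']})

# selecting with a list of names returns the columns in exactly that order
colidx = ['Analyte', 'CAS NO(1)', 'Value(1)']
df = df[colidx]
print(df.columns.tolist())  # ['Analyte', 'CAS NO(1)', 'Value(1)']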
Better to provide input data in text format so we can copy-paste it. I understand your question like this: you need to sort the two last columns together, so that CAS NO(2) matches CAS NO(1).
Since CAS NO(2) = CAS NO(1), you then do not need the duplicated CAS NO(2) column, right?
Split off the two last columns and make a Series from them, then convert that Series to a dict, and use that dict to map new values.
# Split off the 2 last columns and index them by CAS number.
df_tmp = df[['CAS NO(2)', 'Value(2)']]
df_tmp = df_tmp.set_index('CAS NO(2)')
# Keep only the 3 first columns of the original dataframe
df = df[['Analyte', 'CAS NO(1)', 'Value(1)']]
# Now copy CAS NO(1) to CAS NO(2)
df['CAS NO(2)'] = df['CAS NO(1)']
# Now create the Value(2) column on the original dataframe
df['Value(2)'] = df['CAS NO(1)'].map(df_tmp.to_dict()['Value(2)'])
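To see what the mapping step buys you, here is a small self-contained run of the same idea (the CAS numbers and values are invented for illustration):
import pandas as pd
import numpy as np

# invented sample where the second pair of columns is misaligned
df = pd.DataFrame({
    'Analyte': ['Benzene', 'Ethylbenzene'],
    'CAS NO(1)': ['71-43-2', '100-41-4'],
    'Value(1)': [np.nan, np.nan],
    'CAS NO(2)': ['100-41-4', '71-43-2'],
    'Value(2)': ['18', '5'],
})

# build a CAS-number -> value lookup from the misaligned pair
lookup = df.set_index('CAS NO(2)')['Value(2)'].to_dict()
# realign Value(2) so it matches CAS NO(1) row by row
df['Value(2)'] = df['CAS NO(1)'].map(lookup)
print(df[['Analyte', 'CAS NO(1)', 'Value(2)']])
# Benzene gets '5', Ethylbenzene gets '18'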
Try the following:
import pandas as pd
import numpy as np
#create an example of your table
list_CASNo1 = ['71-43-2', '100-41-4', np.nan, '1634-04-4']
list_Val1 = [np.nan]*len(list_CASNo1)
list_CASNo2 = [np.nan, np.nan, np.nan, '100-41-4']
list_Val2 = [np.nan, np.nan, np.nan, '18']
df = pd.DataFrame(list(zip(list_CASNo1, list_Val1, list_CASNo2, list_Val2)), columns=['CASNo(1)', 'Value(1)', 'CAS NO(2)', 'Value(2)'], index=['Benzene', 'Ethylbenzene', 'Gasoline Range Organics', 'Methyl-tert-butyl ether'])
#split the data to two dataframes
df1 = df[['CASNo(1)','Value(1)']]
df2 = df[['CAS NO(2)','Value(2)']]
#merge df2 to df1 based on the specified columns
#reset_index and set_index will take care
#that df_adjusted will have the same index names as df1
df_adjusted = df1.reset_index().merge(df2.dropna(),
                                      how='left',
                                      left_on='CASNo(1)',
                                      right_on='CAS NO(2)').set_index('index')
But be careful with duplicates in your columns; those will cause the merge to fail.
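If duplicate CAS numbers are possible on the right-hand side, a defensive sketch (assuming each CAS number should match at most one row) is to deduplicate before merging:
# drop duplicate CAS numbers so each left row matches at most one right row
df2_unique = df2.dropna().drop_duplicates(subset='CAS NO(2)')
df_adjusted = df1.reset_index().merge(df2_unique,
                                      how='left',
                                      left_on='CASNo(1)',
                                      right_on='CAS NO(2)').set_index('index')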

Join in Pandas Dataframe using conditional join statement

I am trying to join two dataframes with the following data:
(df1 and df2 were shown as screenshots in the original post.)
I want to join these two dataframes on the condition that if 'col2' of df2 is blank/NULL, the join should occur only on 'column1' of df1 and 'col1' of df2; if it is not NULL/blank, the join should occur on two conditions, i.e. 'column1' and 'column2' of df1 with 'col1' and 'col2' of df2 respectively.
For reference, the final dataframe that I wish to obtain was also shown as a screenshot.
My current approach is to slice these two dataframes into four and then join them separately based on the condition. Is there any way to do this without slicing them, or maybe a better way that I'm missing?
The idea is to rename the columns and left join on both columns first, then replace the missing values by matching on column1 alone. It is necessary to remove duplicates with DataFrame.drop_duplicates before Series.map, so that the values in col1 are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
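Since the original frames were posted as images, here is a made-up two-row example showing how the fallback works (column names follow the answer; the data is assumed):
import pandas as pd

df1 = pd.DataFrame({'column1': ['A', 'B'], 'column2': ['x', 'y']})
df2 = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['x', None], 'col3': [10, 20]})

df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')  # row B misses here
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))  # row B falls back to column1
print(df)  # A matched on both columns (10); B matched on column1 alone (20)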
EDIT: A general solution working with multiple columns. The first part is the same left join; in the second part, a merge on one column is combined with DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
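A sketch of the suffix trick on the same made-up frames as above, extended with an extra payload column col4 (an invented column, just to show several columns being filled at once):
import pandas as pd

df1 = pd.DataFrame({'column1': ['A', 'B'], 'column2': ['x', 'y']})
df2 = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['x', None],
                    'col3': [10, 20], 'col4': ['u', 'v']})

df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('', '_'))
cols = df.columns[df.columns.str.endswith('_')]
# fill the gaps left by the two-column join from the one-column join
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
print(df)  # col3 and col4 are filled for row B, which missed the two-column join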

How to rank (in percent) each column in a dataframe in place?

The df is as shown below (a screenshot in the original post).
The code below can only rank one column at a time. I would like to rank all columns and put the rank values in a separate df.
df['rank_2020-06-23'] = df['2020-06-23'].rank(pct=True)
print(df)
Something like this should work:
df_ranks=pd.concat([pd.DataFrame(df[col].rank(pct=True)) for col in df.columns], axis=1)
It's simply using your function in a list comprehension, storing the results in dataframes to get a list of dataframes:
list_df_ranks=[pd.DataFrame(df[col].rank(pct=True)) for col in df.columns]
Then merging into one:
df_ranks=pd.concat(list_df_ranks, axis=1)
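Note that DataFrame.rank already works column-by-column, so if every column should be ranked the same way, a one-liner does it (a sketch on a made-up frame):
import pandas as pd

df = pd.DataFrame({'2020-06-22': [3, 1, 2], '2020-06-23': [10, 30, 20]})

# rank every column at once, as a fraction of rows
df_ranks = df.rank(pct=True)
print(df_ranks)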

Select columns based on != condition

I have a dataframe and I have a list of some column names that correspond to the dataframe. How do I filter the dataframe so that it excludes that list of column names, i.e. keeps only the dataframe columns outside the specified list?
I tried the following:
quant_vair = X != true_binary_cols
but get the output error of: Unable to coerce to Series, length must be 545: given 155
Been battling for hours, any help will be appreciated.
This will help:
df.drop(columns = ["col1", "col2"])
You can either drop the columns from the dataframe, or create a list that does not contain all these columns:
df_filtered = df.drop(columns=true_binary_cols)
Or:
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
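A third option is Index.difference, which computes the complement directly; note that it returns the surviving names sorted alphabetically (sketch with made-up data):
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
true_binary_cols = ['b']  # assumed example list

# keep every column whose name is not in the list
df_filtered = df[df.columns.difference(true_binary_cols)]
print(df_filtered.columns.tolist())  # ['a', 'c']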

add selected columns from two pandas dfs

I have two pandas dataframes a_df and b_df. a_df has columns ID, atext, and var1-var25, while b_df has columns ID, atext, and var1-var25.
I want to add ONLY the corresponding vars from a_df and b_df and leave ID, and atext alone.
The code below adds ALL the corresponding columns. Is there a way to get it to add just the columns of interest?
absum_df=a_df.add(b_df)
What could I do to achieve this?
Use filter:
absum_df = a_df.filter(like='var').add(b_df.filter(like='var'))
If you want to keep additional columns as-is, use concat after summing:
absum_df = pd.concat([a_df[['ID', 'atext']], absum_df], axis=1)
Alternatively, instead of subselecting columns from a_df, you could drop the columns already in absum_df, if you want to add all columns from a_df that are not in absum_df:
absum_df = pd.concat([a_df.drop(absum_df.columns, axis=1), absum_df], axis=1)
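A runnable sketch of the filter-and-concat flow on tiny made-up frames:
import pandas as pd

a_df = pd.DataFrame({'ID': [1], 'atext': ['a'], 'var1': [10], 'var2': [20]})
b_df = pd.DataFrame({'ID': [2], 'atext': ['b'], 'var1': [1], 'var2': [2]})

# add only the var columns, then put ID/atext from a_df back in front
absum_df = a_df.filter(like='var').add(b_df.filter(like='var'))
absum_df = pd.concat([a_df[['ID', 'atext']], absum_df], axis=1)
print(absum_df)  # ID=1, atext='a', var1=11, var2=22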
You can subset a dataframe to particular columns:
var_columns = ['var{}'.format(i) for i in range(1, 26)]
absum_df=a_df[var_columns].add(b_df[var_columns])
Note that this will result in a dataframe with only the var columns. If you want a dataframe with the non-var columns from a_df, and the var columns being the sum of a_df and b_df, you can do
absum_df = a_df.copy()
absum_df[var_columns] = a_df[var_columns].add(b_df[var_columns])
