add selected columns from two pandas dfs - python

I have two pandas dataframes, a_df and b_df. Both have the columns ID, atext, and var1-var25.
I want to add ONLY the corresponding var columns from a_df and b_df and leave ID and atext alone.
The code below adds ALL the corresponding columns. Is there a way to get it to add just the columns of interest?
absum_df=a_df.add(b_df)
What could I do to achieve this?

Use filter:
absum_df = a_df.filter(like='var').add(b_df.filter(like='var'))
If you want to keep additional columns as-is, use concat after summing:
absum_df = pd.concat([a_df[['ID', 'atext']], absum_df], axis=1)
Alternatively, instead of subselecting columns from a_df, you can drop the already-summed columns from a_df, which keeps every column of a_df that is not in absum_df:
absum_df = pd.concat([a_df.drop(absum_df.columns, axis=1), absum_df], axis=1)
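For example, a minimal runnable sketch (the small frames and their values are invented here; the real data has var1 through var25):
import pandas as pd

a_df = pd.DataFrame({'ID': [1, 2], 'atext': ['a', 'b'], 'var1': [1, 2], 'var2': [3, 4]})
b_df = pd.DataFrame({'ID': [1, 2], 'atext': ['a', 'b'], 'var1': [10, 20], 'var2': [30, 40]})

# sum only the var columns, then re-attach everything that was not summed
absum_df = a_df.filter(like='var').add(b_df.filter(like='var'))
absum_df = pd.concat([a_df.drop(absum_df.columns, axis=1), absum_df], axis=1)
print(absum_df)
#    ID atext  var1  var2
# 0   1     a    11    33
# 1   2     b    22    44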

You can subset a dataframe to particular columns:
var_columns = ['var{}'.format(i) for i in range(1, 26)]
absum_df=a_df[var_columns].add(b_df[var_columns])
Note that this will result in a dataframe with only the var columns. If you want a dataframe with the non-var columns from a_df, and the var columns being the sum of a_df and b_df, you can do
absum_df = a_df.copy()
absum_df[var_columns] = a_df[var_columns].add(b_df[var_columns])
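As a rough check, here is a sketch assuming columns named var1 through var25 as in the question (only two shown, with invented values):
import pandas as pd

a_df = pd.DataFrame({'ID': [1, 2], 'atext': ['x', 'y'], 'var1': [1, 2], 'var2': [3, 4]})
b_df = pd.DataFrame({'ID': [1, 2], 'atext': ['x', 'y'], 'var1': [5, 6], 'var2': [7, 8]})

var_columns = ['var{}'.format(i) for i in range(1, 3)]  # use range(1, 26) for the real data
absum_df = a_df.copy()
absum_df[var_columns] = a_df[var_columns].add(b_df[var_columns])
print(absum_df)
#    ID atext  var1  var2
# 0   1     x     6    10
# 1   2     y     8    12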

Related

How can I solve this Pandas groupby problem?

I have a dataframe 'df1' containing 1226 rows × 13 columns. I want to group it by the 'Region' column, but it is not working.
Try this out for grouping based on a column:
blockedGroup = df1.groupby('Region')
blocking_df = {}
for x in blockedGroup.groups:
    temp_df = blockedGroup.get_group(x)
    blocking_df.update({x: temp_df})  # key each group by its 'Region' value
This collects the groups into a dict, where the keys are the unique values of 'Region' and the values are DataFrames, e.g. {"USA": DataFrame}.
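If all you need is that mapping, a shorter variant (a sketch, not taken from the answer above) is a dict comprehension over the groupby object, since iterating a groupby yields (key, group) pairs:
import pandas as pd

# hypothetical sample data; the real df1 has 1226 rows x 13 columns
df1 = pd.DataFrame({'Region': ['USA', 'USA', 'EU'], 'customerProfile': ['a', 'b', 'c']})
blocking_df = {region: group for region, group in df1.groupby('Region')}
print(list(blocking_df))  # ['EU', 'USA']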
df.groupby does not modify the dataframe in place. Instead, assign the result to a new variable:
df2 = df1.groupby('Region')
df2

Distinct aggregation over multiple columns in Pandas based on column names

I have a large Pandas dataframe whose columns I want to aggregate differently. There are 24 columns (hours of the day) that I would like to sum, and for all other columns I just want to take the maximum.
I know that I can write manually the required conditions like this:
df_agg = df.groupby('user_id').agg({'hour_0': 'sum',
                                    'hour_1': 'sum',
                                    ...
                                    'hour_24': 'sum',
                                    'all other columns': 'max'})
but I was wondering whether an elegant solution exists along the lines of:
df_agg = df.groupby('user_id').agg({'hour_*': 'sum',
                                    'all other columns != hour_*': 'max'})
You can build a dictionary from all columns starting with 'hour', build another dictionary for all remaining columns, merge the two, and finally pass the result to agg:
c1 = df.columns[df.columns.str.startswith('hour')].tolist()
# also exclude the user_id column, to avoid a `max` aggregation on it
c2 = df.columns.difference(c1 + ['user_id'])
#https://stackoverflow.com/a/26853961
d = {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')}
df_agg = df.groupby('user_id').agg(d)
Or you can call groupby twice and concatenate the results:
df_agg = pd.concat([df.groupby('user_id')[c1].sum(),
                    df.groupby('user_id')[c2].max()], axis=1)
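A minimal runnable sketch of the dictionary approach, with invented column names (two hour columns and one other column) standing in for the real data:
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2],
                   'hour_0': [1, 2, 3],
                   'hour_1': [4, 5, 6],
                   'country': ['US', 'US', 'DE']})

c1 = df.columns[df.columns.str.startswith('hour')].tolist()
c2 = df.columns.difference(c1 + ['user_id'])
d = {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')}
df_agg = df.groupby('user_id').agg(d)
print(df_agg)
#          hour_0  hour_1 country
# user_id
# 1             3       9      US
# 2             3       6      DE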

Join in Pandas Dataframe using conditional join statement

I am trying to join two dataframes, df1 and df2.
I want to join these two dataframes on the condition that if 'col2' of df2 is blank/NULL, the join should occur only on 'column1' of df1 and 'col1' of df2; but if it is not NULL/blank, the join should occur on two conditions, i.e. 'column1' and 'column2' of df1 with 'col1' and 'col2' of df2 respectively.
For reference, the desired final dataframe is shown in the original question.
My current approach is to slice these 2 dataframes into 4 and then join them separately based on the condition. Is there any way to do this without slicing them, or maybe a better way that I'm missing?
The idea is to rename the columns and do a left join on both columns first, and then fill the remaining missing values by matching on column1 only. Here it is necessary to remove duplicates with DataFrame.drop_duplicates before Series.map so that the col1 values are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
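A minimal end-to-end sketch of this approach (the df1/df2 values are invented, since the question's data is not reproduced here):
import pandas as pd

df1 = pd.DataFrame({'column1': ['A', 'B'], 'column2': ['x', 'y']})
df2 = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['x', None], 'col3': [1, 2]})

df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')

# fall back to matching on column1 alone where the two-column join found nothing
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
print(df)
#   column1 column2  col3
# 0       A       x   1.0
# 1       B       y   2.0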
EDIT: A general solution that works with multiple columns - the first part is the same (a left join); in the second part, merge on one column and use DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)

How to rank (in percent) each column in a dataframe in place?

The df is as shown below...
The code below ranks only one column in place. I would like to rank all columns and put the rank values in a separate df.
df['rank_2020-06-23'] = df['2020-06-23'].rank(pct=True)
print(df)
Something like this should work:
df_ranks=pd.concat([pd.DataFrame(df[col].rank(pct=True)) for col in df.columns], axis=1)
It simply uses your function in a list comprehension, storing each result in a dataframe to get a list of dataframes:
list_df_ranks=[pd.DataFrame(df[col].rank(pct=True)) for col in df.columns]
Then merging into one:
df_ranks=pd.concat(list_df_ranks, axis=1)
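Note that if every column is numeric, a single call gives the same result, since rank operates column-wise by default (a sketch with invented date-named columns):
import pandas as pd

df = pd.DataFrame({'2020-06-23': [10, 20, 30], '2020-06-24': [3, 1, 2]})
df_ranks = df.rank(pct=True)  # percent-ranks every column at once
print(df_ranks)
#    2020-06-23  2020-06-24
# 0    0.333333    1.000000
# 1    0.666667    0.333333
# 2    1.000000    0.666667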

Drop columns that aren't common between two dataframes?

I have two dataframes that have many columns in common, but a few that do not exist in both. I would like to create a dataframe that only has the columns that are in common between both dataframes. So, for example:
list(df1)
['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Captain']
list(df2)
['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Countess']
And I would like to go to:
['Survived', 'Age', 'Title_Mr', 'Title_Mrs']
since these columns are in both df1 and df2. I've figured out how to do it by manually entering the column names, like so:
df1 = df1.drop(['Title_Captain'], axis=1)
But I'd like to find a more robust solution where I don't have to manually enter the column names. Suggestions?
Using the comments of @linuxfan and @PadraicCunningham, we can get a list of common columns:
common_cols = list(set(df1.columns).intersection(df2.columns))
Edit: @AdamHughes' answer made me consider preserving the column order. If that is important, you could do this instead:
common_cols = [col for col in df1.columns if col in df2.columns]
To get another DataFrame with just those columns you use that list to select only those columns from df1:
df3 = df1[common_cols]
According to http://pandas.pydata.org/pandas-docs/stable/indexing.html:
You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised.
df1 = df1.drop([col for col in df1.columns if col not in df2.columns], axis=1)
You don't necessarily need to drop the columns, just select the columns of interest:
In [204]:
df1 = pd.DataFrame(columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Captain'])
df2 = pd.DataFrame(columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Countess'])
# create a list of the common columns using set and intersection
common_cols = list(set.intersection(set(df1), set(df2)))
# common_cols is now ['Title_Mr', 'Age', 'Survived', 'Title_Mrs'] (set order is arbitrary)
# use this list to perform column selection
df1[common_cols]
Out[204]:
Empty DataFrame
Columns: [Title_Mr, Age, Survived, Title_Mrs]
Index: []
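Continuing the example above, if column order matters, an alternative sketch (an assumption on my part, not from the answers above) uses Index.intersection, which keeps df1's ordering in recent pandas versions:
# Index.intersection preserves df1's column order (behaviour may differ in very old pandas)
common_cols = df1.columns.intersection(df2.columns)
df3 = df1[common_cols]
print(list(common_cols))  # ['Survived', 'Age', 'Title_Mr', 'Title_Mrs']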
