How to replace values with values from an adjacent column using pandas - python

I have a dataframe, df1, with 4 columns ['A','B','C','D']. After an outer join the dataframe looks like this:
ID,A,B,C,D
1,NaN,NaN,c,d
1,a,b,c,d
I need to replace the NaN values in df['A'] with the values from df['C'], and the NaN values in df['B'] with the values from df['D'].
The expected output is below:
ID,A,B,C,D
1,c,d,c,d
1,a,b,c,d
In the first row, df['A'] is replaced with df['C'] because df['A'] is NaN; if df['A'] has a value, it must keep df['A'].
Likewise, df['B'] is replaced with df['D'] only where df['B'] is NaN; otherwise df['B'] is kept.

You need to fill each column with the column two positions after it. One way is to use fillna, specifying the value parameter:
df.A.fillna(value=df.C, inplace=True)
df.B.fillna(value=df.D, inplace=True)
If you have many columns and want to keep filling NaN values from the column two positions after, use a for loop over the first n-2 columns:
columns = ['A', 'B', 'C', 'D']
for i in range(len(columns) - 2):
    df[columns[i]].fillna(df[columns[i + 2]], inplace=True)
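For completeness, a minimal self-contained sketch of the same idea using plain assignment instead of inplace=True (sample data reconstructed from the question):
import pandas as pd
import numpy as np

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 1],
    'A': [np.nan, 'a'],
    'B': [np.nan, 'b'],
    'C': ['c', 'c'],
    'D': ['d', 'd'],
})

# Fill A from C and B from D; existing non-NaN values are kept as-is
df['A'] = df['A'].fillna(df['C'])
df['B'] = df['B'].fillna(df['D'])

print(df)
#    ID  A  B  C  D
# 0   1  c  d  c  d
# 1   1  a  b  c  d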

Related

Assign specific value from a column to specific number of rows

I would like to assign agent_code to a specific number of rows in df2.
[The df1, df2, and expected-output df3 tables are not reproduced here.]
Thank you.
First make sure both DataFrames have the default index by calling DataFrame.reset_index with drop=True, then repeat agent_code, reset its index as well, and finally use concat:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)
df3 = pd.concat([df2, s], axis=1)
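The original tables are not shown, so here is a hypothetical illustration of how the pieces fit together; the column names agent_code and number come from the answer, while the sample values are invented:
import pandas as pd

# Hypothetical input: each agent_code should be repeated 'number' times
df1 = pd.DataFrame({'agent_code': ['A1', 'A2'], 'number': [2, 3]})
df2 = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# Repeat each agent_code by its count, then realign to a default index
s = df1['agent_code'].repeat(df1['number']).reset_index(drop=True)

df3 = pd.concat([df2, s], axis=1)
print(df3)
#    value agent_code
# 0     10         A1
# 1     20         A1
# 2     30         A2
# 3     40         A2
# 4     50         A2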

Merge two dataframes on multiple columns but only merge on columns if both not NaN

I'm looking to merge two dataframes across multiple columns but with some additional conditions.
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl']
})

df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno']
})
I would like to always join on col1 but then try to also join on optional_col2 and optional_col3. In df1, the value can be NaN for both columns but it is always populated in df2. I would like the join to be valid when the col1 + one of optional_col2 or optional_col3 match.
This would result in ['a', 'b', 'c'] joining: 'a' via the optional_col2 match, 'b' via the optional_col3 match, and 'c' via an exact match on both.
In SQL I suppose you could write the join as this, if it helps explain further:
select
    *
from
    df1
    inner join df2
        on df1.col1 = df2.col1
        and (df1.optional_col2 = df2.optional_col2
             or df1.optional_col3 = df2.optional_col3)
I've messed around with pd.merge but can't figure out how to do a complex operation like this. I think I can do a merge on ['col1', 'optional_col2'], then a second merge on ['col1', 'optional_col3'], then union and drop duplicates?
Expected DataFrame would be something like:
merged_df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'optional_col2': ['X', 'Y', 'Z'],
    'optional_col3': ['abc', 'def', 'ghi']
})
This solution works by creating an extra column called "temp" in both dataframes. In df1 it will be a column of True values. In df2 the values will be True if there is a match in either of the optional columns. It is not clear whether you consider a NaN value to be matchable; if so, you need to fill the NaNs in df1's columns with values from df2 before comparing, to fulfil your criteria around missing values (this is what is done below). If that is not required, drop the fillna calls in the example below.
df1["temp"] = True
optional_col2_match = df1["optional_col2"].fillna(df2["optional_col2"]).eq(df2["optional_col2"])
optional_col3_match = df1["optional_col3"].fillna(df2["optional_col3"]).eq(df2["optional_col3"])
df2["temp"] = optional_col2_match | optional_col3_match
Then use the "temp" column in the merge and drop it afterwards, since it has served its purpose:
pd.merge(df1, df2, on=["col1", "temp"]).drop(columns="temp")
This gives the following result:
col1 optional_col2_x optional_col3_x optional_col2_y optional_col3_y
0 a X abc X abc
1 b Y def Y def
2 c Z ghi Z ghi
You will need to decide what to do here. In the example you gave there are no rows that match on just one of optional_col2 and optional_col3, which is why a three-column result looks reasonable. This won't generally be the case.
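As an alternative sketch of the SQL from the question: merge on col1 alone and apply the OR condition as a post-filter. This assumes fresh copies of df1 and df2, as defined in the question and before the temp columns were added; the _1/_2 suffixes are arbitrary names chosen here:
# Join on col1 alone, then apply the OR condition as a filter
m = df1.merge(df2, on='col1', suffixes=('_1', '_2'))
keep = (
    m['optional_col2_1'].eq(m['optional_col2_2'])
    | m['optional_col3_1'].eq(m['optional_col3_2'])
)
result = m[keep]
# Keeps rows 'a', 'b' and 'c'; NaNs in df1 compare as non-matching,
# mirroring SQL NULL semantics in the OR condition
Note that if col1 is not unique this join fans out to one row per key pair, so a deduplication step may be needed after the filter.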

How to drop columns from the target data frame when the column(s) are required for the join in merge

I have two dataframes, df1 and df2:
df1.columns
['id','a','b']
df2.columns
['id','ab','cd','ab_test','mn_test']
The expected output columns are ['id','a','b','ab_test','mn_test'].
How do I get all the columns from df1, plus the df2 columns that contain 'test' in the name?
Pseudocode: pd.merge(df1, df2, on='id')
You can merge and use filter on the second dataframe to keep the columns of interest:
df1.merge(df2.filter(regex=r'^id$|test'), on='id')
Or similarly through bitwise operations:
df1.merge(df2.loc[:,(df2.columns=='id')|df2.columns.str.contains('test')], on='id')
df1 = pd.DataFrame(columns=['id','a','b'])
df2 = pd.DataFrame(columns=['id','ab','cd','ab_test','mn_test'])
df1.merge(df2.filter(regex=r'^id$|test'), on='id').columns
# Index(['a', 'b', 'id', 'ab_test', 'mn_test'], dtype='object')
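If you prefer to merge everything first and prune afterwards, a minimal sketch with the same df1 and df2 gives the same columns:
merged = df1.merge(df2, on='id')
# Keep df1's columns plus any df2 column whose name contains 'test'
keep = list(df1.columns) + [c for c in df2.columns if 'test' in c]
merged[keep].columns
# Index(['id', 'a', 'b', 'ab_test', 'mn_test'], dtype='object')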

Replacing nulls by fillna still returns nulls in pandas

I'm trying to replace nulls in a few columns of a pandas dataframe using fillna:
df["A"].fillna("Global", inplace=True)
df["B"].fillna("Global", inplace=True)
df["C"].fillna("Global", inplace=True)
df["D"].fillna("Global", inplace = True)
The nulls don't seem to be replaced completely, as df.isnull().sum() still returns non-zero values for columns A, B, C and D.
I have also tried the following; it doesn't seem to make a difference.
df["A"] = df["A"].fillna("Global", inplace=True)
df["B"] = df["B"].fillna("Global", inplace=True)
df["C"] = df["C"].fillna("Global", inplace=True)
df["D"] = df["D"].fillna("Global", inplace=True)
Following is my sample data, which contains NaN values:
id A B D
630940 NaN NaN ... NaN
630941 NaN NaN ... NaN
An inplace fillna does not work here because selecting a column like df["A"] can return a copy; the copy is modified, leaving the original untouched.
Why not just do -
df.loc[:, 'A':'D'] = df.loc[:, 'A':'D'].fillna('Global')
Or,
df.loc[:, ['A', 'B', 'C', 'D']] = df.loc[:, ['A', 'B', 'C', 'D']].fillna('Global')
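To see why the second attempt above makes things worse: fillna(..., inplace=True) returns None, so assigning that result back replaces the whole column with None. A minimal sketch of the per-column assignment pattern that does work:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['x', np.nan], 'B': [np.nan, 'y']})

# Wrong: fillna with inplace=True returns None, so this nulls out the column
# df['A'] = df['A'].fillna('Global', inplace=True)

# Right: assign the returned (filled) copy back to the column
for col in ['A', 'B']:
    df[col] = df[col].fillna('Global')

print(df.isnull().sum())  # zero nulls in every column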

Assign a Series to several Rows of a Pandas DataFrame

I have a pandas DataFrame prepared with an index and columns; all values are NaN.
Now I have computed a result which applies to more than one row of the DataFrame, and I would like to assign it to all those rows at once. This could be done with a loop, but I am pretty sure the assignment can be done in one step.
Here is a scenario:
import pandas as pd
df = pd.DataFrame(index=['A', 'B', 'C'], columns=['C1', 'C2']) # original df
s = pd.Series({'C1': 1, 'C2': 'ham'}) # a computed result
index = pd.Index(['A', 'C']) # result is valid for rows 'A' and 'C'
The naive approach is
df.loc[index, :] = s
But this does not change the DataFrame at all. It remains as
C1 C2
A NaN NaN
B NaN NaN
C NaN NaN
How can this assignment be done?
It seems we can use the underlying array data to assign -
df.loc[index, :] = s.values
Now, this assumes that the order of the index in s is the same as the order of the columns of df. If that's not the case, as suggested by @Nras, we could use s[df.columns].values on the right-hand side of the assignment.
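A quick check of the result on the scenario above, using the column-aligned variant to be safe about ordering (output shown as comments):
import pandas as pd

df = pd.DataFrame(index=['A', 'B', 'C'], columns=['C1', 'C2'])
s = pd.Series({'C1': 1, 'C2': 'ham'})
index = pd.Index(['A', 'C'])

# Align s to df's column order, then assign the raw values to the selected rows
df.loc[index, :] = s[df.columns].values

print(df)
#     C1   C2
# A    1  ham
# B  NaN  NaN
# C    1  ham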
