I'm trying to get data from df1 where it doesn't exist in df2; col1 in df1 should be aligned with col3 in df2 (and likewise col2 with col4).
Df1:
  col1 col2
0    2    2
1    1  NaN
2  NaN    1
Df2:
  col3 col4
0  NaN    1
1    1  NaN
2  NaN    1
Final_Df:
  col1 col2
0    2    1
1    1  NaN
2  NaN    1
Just use pandas.DataFrame.update(other). The docs describe the overwrite parameter as follows:
overwrite : bool, default True
How to handle non-NA values for overlapping keys:
True: overwrite the original DataFrame's values with values from other.
False: only update values that are NA in the original DataFrame.
Note that df.update(other) modifies df in place, using non-NA values from the other DataFrame on matching column labels.
df2.update(df1.set_axis(df2.columns, axis=1))
print(df2)
  col3 col4
0    2    2
1    1  NaN
2  NaN    1
Make the columns the same, replace the 'Nan' strings with np.nan, then update the DataFrame:
df1.columns = df2.columns
df2 = df2.replace('Nan', np.nan)
df2.update(df1, overwrite=False)  # only fills the NaN values
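Put together, a minimal runnable sketch of this approach, reconstructing the question's two frames with the 'Nan' entries as real np.nan values:

```python
import numpy as np
import pandas as pd

# reconstruct the question's frames with real NaN values
df1 = pd.DataFrame({'col1': [2, 1, np.nan], 'col2': [2, np.nan, 1]})
df2 = pd.DataFrame({'col3': [np.nan, 1, np.nan], 'col4': [1, np.nan, 1]})

df1.columns = df2.columns          # align col1 -> col3, col2 -> col4
df2.update(df1, overwrite=False)   # fill only df2's NaN cells from df1

print(df2)
#    col3  col4
# 0   2.0   1.0
# 1   1.0   NaN
# 2   NaN   1.0
```

With overwrite=False, df2's existing values win and only its missing cells are filled, which matches the Final_Df in the question.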
I'm trying to create a new column by merging the non-NaN values from two other columns.
I'm sure something similar has been asked, and I've looked at many questions, but most of them seem to check the value and return a hard-coded result.
Here is my sample code:
test_df = pd.DataFrame({
    'col1': ['a', 'b', 'c', np.nan, np.nan],
    'col2': [np.nan, 'b', 'c', 'd', np.nan]
})
print(test_df)
col1 col2
0 a NaN
1 b b
2 c c
3 NaN d
4 NaN NaN
What I need is to add col3 based on these checks:
if col1 is not NaN, use col1
if col1 is NaN and col2 is not NaN, use col2
if col1 and col2 are both NaN, use NaN
col1 col2 col3
0 a NaN a
1 b b b
2 c c c
3 NaN d d
4 NaN NaN NaN
test_df['col3'] = [x1 if pd.notna(x1) else x2 for x1, x2 in zip(test_df['col1'], test_df['col2'])]
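A vectorized alternative, assuming the same test_df as above: combine_first takes col1's value where it is present and falls back to col2, leaving NaN where both are missing.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({
    'col1': ['a', 'b', 'c', np.nan, np.nan],
    'col2': [np.nan, 'b', 'c', 'd', np.nan]
})

# col1 where col1 is non-NaN, otherwise col2; NaN if both are NaN
test_df['col3'] = test_df['col1'].combine_first(test_df['col2'])
print(test_df['col3'].tolist())
# ['a', 'b', 'c', 'd', nan]
```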
I am trying to update Col1 with values from Col2, Col3, ... if values are found in any of them. A row has at most one value, but it can contain "-", which should be treated as NaN.
df = pd.DataFrame(
    [
        ['A', np.nan, np.nan, np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, 'C', np.nan, np.nan],
        [np.nan, np.nan, '-', np.nan, 'B', np.nan],
        [np.nan, np.nan, '-', np.nan, np.nan, np.nan]
    ],
    columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6']
)
print(df)
  Col1 Col2 Col3 Col4 Col5 Col6
0    A  NaN  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN    C  NaN  NaN
2  NaN  NaN    -  NaN    B  NaN
3  NaN  NaN    -  NaN  NaN  NaN
I want the output to be:
Col1
0 A
1 C
2 B
3 NaN
I tried to use the update function:
for col in df.columns[1:]:
    df['Col1'].update(df[col])
It works on this small DataFrame, but when I run it on a larger DataFrame with many more rows and columns, I lose a lot of values in between. Is there a better function to do this, preferably without a loop? I tried many other methods, including using .loc, but no joy.
Here is one way to go about it:
# treat "-" as missing, then sort each row's values; NaN moves to the end
df2 = df.replace('-', np.nan).apply(lambda x: x.sort_values(ignore_index=True), axis=1)
# rename df2's columns to match df's
df2.columns = df.columns
# drop columns where every value is null
df2.dropna(axis=1, how='all', inplace=True)
print(df2)
Col1
0 A
1 C
2 B
3 NaN
You can use combine_first, after replacing "-" with NaN so it is treated as missing:
from functools import reduce

df_ = df.replace('-', np.nan)
reduce(
    lambda x, y: x.combine_first(df_[y]),
    df_.columns[1:],
    df_[df_.columns[0]]
).to_frame()
The following DataFrame is the result of the previous code:
Col1
0 A
1 C
2 B
3 NaN
Python's next with a generator expression gives a one-liner for this type of use case:
# next((x for x in list if condition), None)
df["Col1"] = df.apply(lambda row: next((x for x in row if not pd.isnull(x) and x != "-"), None), axis=1)
[Out]:
0 A
1 C
2 B
3 None
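If avoiding a Python-level loop is the priority, another sketch (assuming the same df as in the question): replace "-" with NaN and backfill along each row, so Col1 picks up the first non-missing value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['A', np.nan, np.nan, np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, 'C', np.nan, np.nan],
        [np.nan, np.nan, '-', np.nan, 'B', np.nan],
        [np.nan, np.nan, '-', np.nan, np.nan, np.nan]
    ],
    columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6']
)

# "-" counts as missing; bfill(axis=1) pulls each row's first value into Col1
out = df.replace('-', np.nan).bfill(axis=1)[['Col1']]
print(out['Col1'].tolist())
```

Rows with no value at all stay NaN, matching the expected output.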
Let's take this dataframe :
pd.DataFrame(dict(Col1=["a","c"],Col2=["b","d"],Col3=[1,3],Col4=[2,4]))
Col1 Col2 Col3 Col4
0 a b 1 2
1 c d 3 4
I would like to have one row per value in columns Col1 and Col2 (n=2 columns and r=2 rows, so the expected dataframe has 2*2 = 4 rows).
Expected result :
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
How could I do this, please?
Pandas melt does the job here; the rest is just repositioning and renaming the columns.
Use melt to transform the dataframe, with Col3 and Col4 as the id variables; melt converts from wide to long.
Next, reindex the columns so that variable and value come first.
Finally, rename the columns appropriately.
(df.melt(id_vars=['Col3', 'Col4'])
   .reindex(['variable', 'value', 'Col3', 'Col4'], axis=1)
   .rename({'variable': 'Ind', 'value': 'Value'}, axis=1)
)
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
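The reindex/rename steps can also be folded into melt itself via its var_name and value_name parameters; a sketch on the same dataframe:

```python
import pandas as pd

df = pd.DataFrame(dict(Col1=["a", "c"], Col2=["b", "d"], Col3=[1, 3], Col4=[2, 4]))

# melt names the columns directly; then just reorder them
out = (df.melt(id_vars=['Col3', 'Col4'], var_name='Ind', value_name='Value')
         [['Ind', 'Value', 'Col3', 'Col4']])
print(out)
#     Ind Value  Col3  Col4
# 0  Col1     a     1     2
# 1  Col1     c     3     4
# 2  Col2     b     1     2
# 3  Col2     d     3     4
```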
I have a pandas dataframe:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
And I want to add a new row summing over two columns [Col1,Col2] like:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
Total 3 5 NaN
Ignoring Col3. What should I do? Thanks in advance.
You can build the Total row with pandas.DataFrame.sum and add it with pandas.concat (pandas.DataFrame.append was removed in pandas 2.0):
total = df[['Col1', 'Col2']].sum().rename('Total')
df2 = pd.concat([df, total.to_frame().T])  # Col3 is NaN in the Total row
You can use pd.DataFrame.loc. Note that the numeric columns are upcast to float, since NaN is itself a float; Col1 and Col2 are cast back to int afterwards:
import numpy as np
df.loc['Total'] = [df['Col1'].sum(), df['Col2'].sum(), np.nan]
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(int)
print(df)
Col1 Col2 Col3
0 1 2 3.0
1 2 3 4.0
Total 3 5 NaN
I have a pandas dataframe with many columns, most of them null, but each row has exactly one column containing a string value.
I am creating a new column in the dataframe that selects the only non-null value:
data['label'] = data.iloc[:, 0]
for col in range(1, 100):
    data['label'] = data['label'].fillna(data.iloc[:, col])
This works fine; however, I would also like to keep track of which of these columns was the non-null one for each entry, so that the label column carries that information as well. How do I know which column was non-empty?
Ex.
Ex.
   col0 col1    col2
0   NaN  red     NaN
1  blue  NaN     NaN
2   NaN  NaN  yellow
The new column label is:
        label
0    red/col1
1   blue/col0
2 yellow/col2
You can first use notnull to mark where the values are, get the column names with idxmax, and then use lookup to fetch the values:
cols = df.notnull().idxmax(axis=1)
df['a'] = df.lookup(df.index, cols) + '/' + cols
print (df)
col0 col1 col2 a
0 NaN red NaN red/col1
1 blue NaN NaN blue/col0
2 NaN NaN yellow yellow/col2
Another solution with fillna and sum:
cols = df.notnull().idxmax(axis=1)
df['a'] = df.fillna('').sum(axis=1) + '/' + cols
print (df)
col0 col1 col2 a
0 NaN red NaN red/col1
1 blue NaN NaN blue/col0
2 NaN NaN yellow yellow/col2
Another solution, thanks Jon Clements - use first_valid_index:
cols = df.apply(pd.Series.first_valid_index, axis=1)
df['a'] = df.lookup(cols.index, cols) + '/' + cols
print (df)
col0 col1 col2 a
0 NaN red NaN red/col1
1 blue NaN NaN blue/col0
2 NaN NaN yellow yellow/col2
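Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a sketch of an equivalent using NumPy fancy indexing on the same example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col0': [np.nan, 'blue', np.nan],
                   'col1': ['red', np.nan, np.nan],
                   'col2': [np.nan, np.nan, 'yellow']})

# column name of the first non-null value in each row
cols = df.notna().idxmax(axis=1)
# row-wise pick: the value at (row i, column cols[i])
vals = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(cols)]
df['a'] = pd.Series(vals, index=df.index) + '/' + cols
print(df['a'].tolist())
# ['red/col1', 'blue/col0', 'yellow/col2']
```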