Pandas - create new column from values in other columns on condition [duplicate] - python

This question already has answers here:
Python Pandas replace NaN in one column with value from corresponding row of second column
(7 answers)
replace nan in one column with the value from another column in pandas: what's wrong with my code
(2 answers)
Closed 2 months ago.
I'm trying to create a new column by merging the non-NaN values from two other columns.
I'm sure something similar has been asked, and I've looked at many questions, but most of them seem to check the value and return a hard-coded value.
Here is my sample code:
import numpy as np
import pandas as pd
test_df = pd.DataFrame({
    'col1': ['a', 'b', 'c', np.nan, np.nan],
    'col2': [np.nan, 'b', 'c', 'd', np.nan]
})
print(test_df)
  col1 col2
0    a  NaN
1    b    b
2    c    c
3  NaN    d
4  NaN  NaN
What I need is to add col3 based on these checks:
if col1 is not NaN, then col1
if col1 is NaN and col2 is not NaN, then col2
if col1 is NaN and col2 is NaN, then NaN
  col1 col2 col3
0    a  NaN    a
1    b    b    b
2    c    c    c
3  NaN    d    d
4  NaN  NaN  NaN

test_df['col3'] = [x1 if pd.notna(x1) else x2 if pd.notna(x2) else np.nan for x1, x2 in zip(test_df['col1'], test_df['col2'])]
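A vectorized alternative to the list comprehension above is Series.combine_first, which keeps the caller's values and falls back to the other Series only where the caller is missing. This is a minimal sketch that rebuilds the sample frame from the question:

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({
    "col1": ["a", "b", "c", np.nan, np.nan],
    "col2": [np.nan, "b", "c", "d", np.nan],
})

# Take col1 where it is present, otherwise fall back to col2.
# Rows where both are missing stay NaN.
test_df["col3"] = test_df["col1"].combine_first(test_df["col2"])
print(test_df)
```

test_df["col1"].fillna(test_df["col2"]) should give the same result here, since both approaches align the two columns on the row index.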

Related

Updating values of a column from multiple columns if the values are present in those columns

I am trying to update Col1 with values from Col2, Col3, ... if a value is found in any of them. Each row has at most one real value; a row may also contain "-", which should be treated as NaN.
df = pd.DataFrame(
[
['A',np.nan,np.nan,np.nan,np.nan,np.nan],
[np.nan,np.nan,np.nan,'C',np.nan,np.nan],
[np.nan,np.nan,"-",np.nan,'B',np.nan],
[np.nan,np.nan,"-",np.nan,np.nan,np.nan]
],
columns = ['Col1','Col2','Col3','Col4','Col5','Col6']
)
print(df)
  Col1  Col2 Col3 Col4 Col5  Col6
0    A   NaN  NaN  NaN  NaN   NaN
1  NaN   NaN  NaN    C  NaN   NaN
2  NaN   NaN    -  NaN    B   NaN
3  NaN   NaN    -  NaN  NaN   NaN
I want the output to be:
Col1
0 A
1 C
2 B
3 NaN
I tried to use the update function:
for col in df.columns[1:]:
    df['Col1'].update(df[col])
It works on this small DataFrame, but when I run it on a larger DataFrame with many more rows and columns, I lose a lot of values in between. Is there a better way to do this, preferably without a loop? I have tried many other methods, including .loc, but with no luck.
Here is one way to go about it:
# treat "-" as missing first, otherwise it would sort ahead of the letters
df = df.replace('-', np.nan)
# sort the values in each row; NaN moves to the end
df2 = df.apply(lambda x: pd.Series(x).sort_values(ignore_index=True), axis=1)
# give df2 the same column names as df
df2.columns = df.columns
# drop the columns in which every value is null
df2.dropna(axis=1, how='all', inplace=True)
print(df2)
Col1
0 A
1 C
2 B
3 NaN
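You can use combine_first (replace "-" with NaN first, for example with df = df.replace('-', np.nan), so that it is not picked up as a value):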
from functools import reduce
reduce(
lambda x, y: x.combine_first(df[y]),
df.columns[1:],
df[df.columns[0]]
).to_frame()
The following DataFrame is the result of the previous code:
Col1
0 A
1 C
2 B
3 NaN
Python has a one-liner generator for this type of use case:
# next((x for x in list if condition), None)
df["Col1"] = df.apply(lambda row: next((x for x in row if not pd.isnull(x) and x != "-"), None), axis=1)
[Out]:
0 A
1 C
2 B
3 None
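If a row-wise apply is too slow on a large frame, a vectorized sketch is to blank out the "-" placeholders and back-fill along each row, so that the first column ends up holding the first non-missing value. The frame below is rebuilt from the question; dtype=object is an assumption made here to keep every column the same type.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["A", np.nan, np.nan, np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, "C", np.nan, np.nan],
        [np.nan, np.nan, "-", np.nan, "B", np.nan],
        [np.nan, np.nan, "-", np.nan, np.nan, np.nan],
    ],
    columns=["Col1", "Col2", "Col3", "Col4", "Col5", "Col6"],
    dtype=object,
)

# Replace the "-" placeholders with NaN so they count as missing.
cleaned = df.mask(df.eq("-"))

# Back-fill along each row, then keep only the first column:
# it now holds the first non-missing value of every row.
result = cleaned.bfill(axis=1).iloc[:, 0].to_frame("Col1")
print(result)
```

Rows with no real value at all, such as the last one, stay NaN.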

Getting data from df1 if it doesn't exist in df2

I'm trying to take data from df1 where it doesn't exist in df2. col1 in df1 should be aligned with col3 in df2 (and col2 with col4).
Df1:
col1 col2
2 2
1 Nan
Nan 1
Df2:
col3 col4
Nan 1
1 Nan
Nan 1
Final_Df:
col1 col2
2 1
1 Nan
Nan 1
Just use pandas.DataFrame.update(other). Its overwrite parameter is documented as follows:
overwrite bool, default True
How to handle non-NA values for overlapping keys:
True: overwrite original DataFrame’s values with values from other.
False: only update values that are NA in the original DataFrame.
Note that df.update(other) modifies in place using non-NA values from another DataFrame on matching column label.
df2.update(df1.set_axis(df2.columns, axis=1))
print(df2)
col3 col4
0 2 2
1 1 Nan
2 Nan 1
Make the columns the same, replace 'Nan' with np.nan, then update the DataFrame:
df1.columns = df2.columns
df1 = df1.replace('Nan', np.nan)
df2 = df2.replace('Nan', np.nan)
df2.update(df1, overwrite=False)  # only fills the NaN values in df2
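For comparison, here is a self-contained sketch that reproduces Final_Df without modifying either frame in place. It assumes the missing entries are real NaN values rather than the string 'Nan', and it uses DataFrame.combine_first, which keeps df2's values and falls back to df1 only where df2 is missing.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": [2, 1, np.nan], "col2": [2, np.nan, 1]})
df2 = pd.DataFrame({"col3": [np.nan, 1, np.nan], "col4": [1, np.nan, 1]})

# Relabel df1 by position so col1 pairs with col3 and col2 with col4.
df1_aligned = df1.set_axis(df2.columns, axis=1)

# Keep df2's values; use df1 only where df2 has NaN.
final_df = df2.combine_first(df1_aligned)

# Restore df1's column names for the final result.
final_df = final_df.set_axis(df1.columns, axis=1)
print(final_df)
```

This returns a new DataFrame, whereas update changes df2 in place.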

How to divide rows in pandas dataframe pairwise

I have a pandas dataframe, test, that looks like this:
Col1  Col2  Col3
A        4     6
A        8    36
B        1     4
B        6     8
Now I want to divide the rows pairwise, resulting in:
Col1  Col2  Col3
A        2     6
B        6     2
That is, I want to divide the second row of each pair by the first. I am trying to use groupby, but without success.
Does anyone have a solution?
If you always have a pair of rows, you can just try iloc:
(df.iloc[1::2, 1:]
   .div(df.iloc[::2, 1:].to_numpy())
   .assign(Col1=df.iloc[1::2, 0])
)
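If each Col1 value belongs to only one pair, you can group by it:
def divide(group):
    # You could also use head(1)/tail(1) or first()/last().
    return group.iloc[-1] / group.iloc[0]
# select the numeric columns so the Col1 labels are not divided
df_ = df.groupby('Col1')[['Col2', 'Col3']].apply(divide).reset_index()
print(df_)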
Col1 Col2 Col3
0 A 2.0 6.0
1 B 6.0 2.0
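Another option is to group on the first column and divide the last row of each group by the first. Since pandas 2.0, nth acts as a filter that keeps the original row index, so first and last are used here:
g = df.groupby("Col1")
out = g.last().div(g.first()).reset_index()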
print(out)
Col1 Col2 Col3
0 A 2.0 6.0
1 B 6.0 2.0
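The groupby answers above rely on each Col1 value appearing in exactly one pair. If the same label could show up in several pairs, one possible sketch is to group by row position instead. The pair_id helper below is a name introduced only for this illustration, and the code assumes the rows always come in consecutive pairs.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "Col1": ["A", "A", "B", "B"],
    "Col2": [4, 8, 1, 6],
    "Col3": [6, 36, 4, 8],
})

# Rows 0-1 form pair 0, rows 2-3 form pair 1, and so on.
pair_id = np.arange(len(test)) // 2
grouped = test.groupby(pair_id)

# Divide the second row of each pair by the first, numeric columns only.
values = grouped[["Col2", "Col3"]].last() / grouped[["Col2", "Col3"]].first()

# Put the label of each pair back in front of the ratios.
out = pd.concat([grouped["Col1"].first(), values], axis=1).reset_index(drop=True)
print(out)
```

Note that first() and last() skip missing values, so a NaN inside a pair would need separate handling.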

How to make a sum row for two columns python dataframe

I have a pandas dataframe:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
And I want to add a new row summing over two columns [Col1,Col2] like:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
Total 3 5 NaN
Ignoring Col3. What should I do? Thanks in advance.
You can use pandas.concat together with pandas.DataFrame.sum (DataFrame.append was removed in pandas 2.0):
total = df[['Col1', 'Col2']].sum().to_frame().T  # Col3 is left out, so it becomes NaN
df2 = pd.concat([df, total], ignore_index=True)
You can use pd.DataFrame.loc. Note the final column will be converted to float since NaN is considered float:
import numpy as np
df.loc['Total'] = [df['Col1'].sum(), df['Col2'].sum(), np.nan]
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(int)
print(df)
Col1 Col2 Col3
0 1 2 3.0
1 2 3 4.0
Total 3 5 NaN
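If the integer columns should keep their dtype without the extra astype step, a possible variant is to build the Total row as its own one-row frame and concatenate it. This is a sketch using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2], "Col2": [2, 3], "Col3": [3, 4]})

# Sum only the two columns of interest and turn the result into a
# one-row frame labelled "Total".
total = df[["Col1", "Col2"]].sum().rename("Total").to_frame().T

# Col3 is absent from the Total row, so concat fills it with NaN.
out = pd.concat([df, total])
print(out)
```

Col3 still becomes float, because NaN cannot be stored in an integer column.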

Pandas: move row (index and values) from last to first [duplicate]

This question already has answers here:
add a row at top in pandas dataframe [duplicate]
(6 answers)
Closed 4 years ago.
I would like to move an entire row (index and values) from the last row to the first row of a DataFrame. Every other example I can find either uses an ordered row index (to be specific - my row index is not a numerical sequence - so I cannot simply add at -1 and then reindex with +1) or moves the values while maintaining the original index. My DF has descriptions as the index and the values are discrete to the index description.
I'm adding a row and then would like to move that into row 1. Here is the setup:
df = pd.DataFrame({
'col1' : ['A', 'A', 'B', 'F', 'D', 'C'],
'col2' : [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
}).set_index('col1')
#output
In [7]: df
Out[7]:
col2 col3
col1
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
I then add a new row as follows:
df.loc["Deferred Description"] = pd.Series([''])
In [9]: df
Out[9]:
col2 col3
col1
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
Deferred Description NaN NaN
I would like the resulting output to be:
In [9]: df
Out[9]:
col2 col3
col1
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
I've tried using df.shift() but only the values shift. I've also tried df.sort_index() but that requires the index to be ordered (there are several SO examples using df.loc[-1] = ... and then reindexing with df.index = df.index + 1). In my case I need the Deferred Description to be the first row.
Your problem is not one of cyclic shifting, but a simpler one—one of insertion (which is why I've chosen to mark this question as duplicate).
Construct an empty DataFrame and then concatenate the two using pd.concat.
pd.concat([pd.DataFrame(columns=df.columns, index=['Deferred Description']), df])
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
If this were columns, it'd have been easier. Funnily enough, pandas has a DataFrame.insert function that works for columns, but not rows.
Generalized Cyclic Shifting
If you are curious how you'd cyclically shift a DataFrame, you can use np.roll.
# apply this fix to your existing DataFrame
pd.DataFrame(np.roll(df.values, 1, axis=0),
index=np.roll(df.index, 1), columns=df.columns
)
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
This, thankfully, also works when you have duplicate index values. If the index or columns aren't important, then pd.DataFrame(np.roll(df.values, 1, axis=0)) works well enough.
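A related sketch, assuming the new row has already been appended as the last row: roll the row positions rather than the values and reorder with iloc. Because iloc works by position, duplicate index labels are not a problem, and the column dtypes are left alone.

```python
import numpy as np
import pandas as pd

# The frame from the question, with the new row already at the end.
df = pd.DataFrame(
    {
        "col2": [2, 1, 9, 8, 7, 4, np.nan],
        "col3": [0, 1, 9, 4, 2, 3, np.nan],
    },
    index=["A", "A", "B", "F", "D", "C", "Deferred Description"],
)

# Positions 0..n-1 rolled by one: the last position comes first.
order = np.roll(np.arange(len(df)), 1)

# Reorder the rows by position; index labels travel with their rows.
moved = df.iloc[order]
print(moved)
```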
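You can also build the new row as its own DataFrame and use pd.concat (DataFrame.append was removed in pandas 2.0):
pd.concat([pd.DataFrame({'col2': [np.nan], 'col3': [np.nan]}, index=["Deferred Description"]), df])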
Out[294]:
col2 col3
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
