Create new column by manipulating existing columns - python

I have many columns in a dataframe, and I want to fill one column by manipulating two other columns in the same dataframe:
col1   col2   col3   col4
nan    1      2      4
2      2      2      3
3      nan    1      2
I want to fill the value of col1, col2, or col3 wherever a NaN exists, based on the values of the other two columns.
My code is as follows:
indices_of_nan_cell = [(index, col1, col2, col3)
                       for index, (col1, col2, col3) in enumerate(zip(col1, col2, col3))
                       if str(col1) == 'nan' or str(col2) == 'nan' or str(col3) == 'nan']

for nan_values in indices_of_nan_cell:
    if np.isnan(nan_values[1]):
        read4['col1'][nan_values[0]] = float(nan_values[2]) * float(nan_values[3])
    if np.isnan(nan_values[2]):
        read4['col2'][nan_values[0]] = float(nan_values[1]) / float(nan_values[3])
    if np.isnan(nan_values[3]):
        read4['col3'][nan_values[0]] = float(nan_values[1]) * float(nan_values[2])
It works fine, but it takes too much time since my dataframe has thousands of rows. Is there a more efficient way to do this?

I believe you need fillna to replace the NaNs, combined with mul and div and their fill_value parameter to handle NaNs inside the multiplication and division:
df['col1'] = df['col1'].fillna(df['col2'].mul(df['col3'], fill_value=1))
df['col2'] = df['col2'].fillna(df['col1'].div(df['col3'], fill_value=1))
df['col3'] = df['col3'].fillna(df['col1'].mul(df['col2'], fill_value=1))
print (df)
   col1  col2  col3  col4
0   2.0   1.0     2     4
1   2.0   2.0     2     3
2   3.0   3.0     1     2
Another approach works only with the rows that contain NaNs:
m1 = df['col1'].isna()
m2 = df['col2'].isna()
m3 = df['col3'].isna()
# older versions of pandas
# m1 = df['col1'].isnull()
# m2 = df['col2'].isnull()
# m3 = df['col3'].isnull()
df.loc[m1, 'col1'] = df.loc[m1, 'col2'].mul(df.loc[m1, 'col3'], fill_value=1)
df.loc[m2, 'col2'] = df.loc[m2, 'col1'].div(df.loc[m2, 'col3'], fill_value=1)
df.loc[m3, 'col3'] = df.loc[m3, 'col1'].mul(df.loc[m3, 'col2'], fill_value=1)
Explanation:
Filter each column with isna to get 3 separate boolean masks.
For each mask, first filter the relevant rows, e.g. df.loc[m1, 'col2'], and multiply or divide them.
Finally, assign back; only the NaNs are replaced, because the assignment target is filtered again by df.loc[m1, 'col1'].
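For reference, a minimal self-contained run of the first approach; the frame is rebuilt from the question's sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, 3],
                   'col2': [1, 2, np.nan],
                   'col3': [2, 2, 1],
                   'col4': [4, 3, 2]})

df['col1'] = df['col1'].fillna(df['col2'].mul(df['col3'], fill_value=1))
df['col2'] = df['col2'].fillna(df['col1'].div(df['col3'], fill_value=1))
df['col3'] = df['col3'].fillna(df['col1'].mul(df['col2'], fill_value=1))
print(df)
#    col1  col2  col3  col4
# 0   2.0   1.0     2     4
# 1   2.0   2.0     2     3
# 2   3.0   3.0     1     2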

Related

How to drop duplicate columns from a pandas dataframe, based on the columns' values (the columns don't have the same name)?

I want to drop columns if the values inside them are the same as in other columns. Starting from DF, it should yield DF_new:
DF = pd.DataFrame(index=[1, 2, 3, 4], columns=['col1', 'col2', 'col3', 'col4', 'col5'])
x = np.random.uniform(size=4)
DF['col1'] = x
DF['col2'] = x + 2
DF['col3'] = x
DF['col4'] = x + 2
DF['col5'] = [5, 6, 7, 8]
display(DF)
DF_new = DF[['col1', 'col2', 'col5']]
display(DF_new)
The above is a simple example of what I can't manage to do.
Note that the column names are not the same, so I can't use
DF_new = DF.loc[:, ~DF.columns.duplicated()].copy()
which drops columns based on their names.
You can use:
df = df.T.drop_duplicates().T
Step by step:
df2 = df.T  # .T transposes (rows become columns)
'''
             1         2         3         4
col1   0.67075  0.707864  0.206923  0.168023
col2   2.67075  2.707864  2.206923  2.168023
col3   0.67075  0.707864  0.206923  0.168023
col4   2.67075  2.707864  2.206923  2.168023
col5   5.00000  6.000000  7.000000  8.000000
'''
# now we can use drop_duplicates
df2 = df2.drop_duplicates()
'''
             1         2         3         4
col1   0.67075  0.707864  0.206923  0.168023
col2   2.67075  2.707864  2.206923  2.168023
col5   5.00000  6.000000  7.000000  8.000000
'''
# then transpose back
df2 = df2.T
'''
       col1      col2  col5
1  0.670750  2.670750   5.0
2  0.707864  2.707864   6.0
3  0.206923  2.206923   7.0
4  0.168023  2.168023   8.0
'''
This should also do what you need:
df = df.loc[:, ~df.apply(lambda x: x.duplicated(), axis=1).all()].copy()
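For completeness, a minimal sketch tying the one-liner to the question's setup (the random seed is an addition here, purely for reproducibility). One caveat worth knowing: transposing a mixed-dtype frame upcasts to object, so you may need to restore dtypes afterwards:
import numpy as np
import pandas as pd

np.random.seed(0)  # added for reproducibility; not part of the original question
DF = pd.DataFrame(index=[1, 2, 3, 4], columns=['col1', 'col2', 'col3', 'col4', 'col5'])
x = np.random.uniform(size=4)
DF['col1'] = x
DF['col2'] = x + 2
DF['col3'] = x
DF['col4'] = x + 2
DF['col5'] = [5, 6, 7, 8]

DF_new = DF.T.drop_duplicates().T.infer_objects()  # infer_objects restores numeric dtypes
print(DF_new.columns.tolist())  # ['col1', 'col2', 'col5']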

Pandas: all cells whose length < 2

I want to acquire all rows in a DataFrame where the length of the value in any column is shorter than 2.
For example:
df = pd.DataFrame({"col1": ["a", "ab", ""], "col2": ["bc", "abc", "a"]})
  col1 col2
0    a   bc
1   ab  abc
2        a
How do I get this output?
  col1 col2
0    a   bc
2        a
Let's try stack to reshape, then use str.len to compute the lengths and create a boolean mask with lt + any:
df[df.stack().str.len().lt(2).any(level=0)]
  col1 col2
0    a   bc
2        a
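Note: the level argument of Series.any was deprecated in pandas 1.x and removed in 2.0, so on recent versions the same logic needs a groupby on the index level; an equivalent sketch:
mask = df.stack().str.len().lt(2).groupby(level=0).any()
df[mask]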
You can use str.len() on each pandas Series:
for col in df.columns:
    df[col] = df[col][df[col].str.len() < 3]
df = df.dropna()
A list comprehension could help here:
df.loc[[not any(len(word) > 2 for word in entry)
        for entry in df.to_numpy()]]
  col1 col2
0    a   bc
2        a
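Another vectorized option, assuming every cell is a string (a sketch; note that applymap was renamed to DataFrame.map in pandas 2.1):
df[df.applymap(len).lt(2).any(axis=1)]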

pandas copy value from one column to another if condition is met

I have a dataframe:
df =
col1  col2  col3
1     2     3
1     4     6
3     7     2
I want to edit df such that when the value of col1 is smaller than 2, it takes the value from col3.
So I will get:
new_df =
col1  col2  col3
3     2     3
6     4     6
3     7     2
I tried to use assign and df.loc, but it didn't work. What is the best way to do this?
df['col1'] = df.apply(lambda x: x['col3'] if x['col1'] < x['col2'] else x['col1'], axis=1)
The most efficient way is to use the loc operator:
mask = df["col1"] < df["col2"]
df.loc[mask, "col1"] = df.loc[mask, "col3"]
df.loc[df["col1"] < 2, "col1"] = df["col3"]
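A quick check of this one-liner against the frame rebuilt from the question reproduces the expected new_df exactly:
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 3], 'col2': [2, 4, 7], 'col3': [3, 6, 2]})
df.loc[df['col1'] < 2, 'col1'] = df['col3']
print(df)
#    col1  col2  col3
# 0     3     2     3
# 1     6     4     6
# 2     3     7     2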
As mentioned by @anky_91, use np.where to update the 'col1' values:
df['col1'] = np.where(df['col1'] < df['col2'], df['col3'], df['col1'])
You could look at using the apply function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
df['col1'] = df.apply(lambda c: c['col3'] if c['col1'] < 2 else c['col1'], axis=1)
Edit: Sorry, I see from your mock outcome you are referring to col2 rather than an int of 2. Eric Yang's answer will solve your problem.

Replace value of one column based on condition

I have two data frames named df and df_reference, which contain the following information:
df              df_reference
col1  col2      col1  col2
A     10        A     15
B     25        B     33
C     30        C     20
A     12
I want to compare both data frames based on col1, and replace the value of df.col2 with df_reference.col2 whenever the value in df_reference is greater than the value in df.col2.
The expected output is:
df
col1  col2
A     15
B     33
C     30
A     15
I have tried:
dict1 = {'a':'15'}
df.loc[df['col1'].isin(dict1.keys()), 'col2'] = sams['col1'].map(dict1)
Use Series.map with a Series created by DataFrame.set_index; NaNs for any unmatched values are replaced by Series.fillna:
s = df['col1'].map(df_reference.set_index('col1')['col2']).fillna(df['col2'])
df.loc[s > df['col2'], 'col2'] = s
print (df)
  col1  col2
0    A    15
1    B    33
2    C    30
3    A    15
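Unpacking the mapping step as a commented sketch:
lookup = df_reference.set_index('col1')['col2']  # lookup Series: A -> 15, B -> 33, C -> 20
s = df['col1'].map(lookup).fillna(df['col2'])    # candidate values; keep the original where unmatched
df.loc[s > df['col2'], 'col2'] = s               # only replace where the reference value is greater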
I suggest first doing a merge on 'col1', then applying a function that generates a new column holding the greater of the two 'col2' values. Then just drop the unneeded columns!
def greaterValue(row):
    if row['col2_x'] > row['col2_y']:
        return row['col2_x']
    else:
        return row['col2_y']

df = df.merge(df_reference, left_on='col1', right_on='col1')
df['col2'] = df.apply(greaterValue, axis=1)
df = df.loc[:, ['col1', 'col2']]
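As a side note, the row-wise apply can be replaced by a vectorized column-wise max over the merged frame; a sketch under the same merge (the _x/_y suffixes are pandas' merge defaults):
df = df.merge(df_reference, on='col1')             # many-to-one merge keeps df's row order
df['col2'] = df[['col2_x', 'col2_y']].max(axis=1)  # greater of the two col2 values per row
df = df[['col1', 'col2']]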

Assign column with conditional values based on strings contained in other columns

I am trying to assign a column based on strings that may be contained in other columns. For example
var1 = 67
columns = {'col1': ['string1', 'thang2', 'code3', 'string2'],
           'col2': [1, 2, np.nan, 3],
           'col3': ['I', 'cant', 'think', 'what']}
df = pd.DataFrame(data=columns)
How do I then make a fourth column col4 that is col3 + var1 + col1 most of the time, but is np.nan whenever col2 is NaN (in the same row), and has a -W appended to its value whenever there is an 'in' in any string in col1 (again, in the same row)?
I know all about assign, but I don't know how to do all of that conditional logic inside assign, and I'm not sure whether there is a way to do it after creating the column either.
You can try this using np.where:
df['col4'] = np.where(df['col2'].notnull(),
                      df['col3'] + str(var1) + np.where(df['col1'].str.contains('in'),
                                                        df['col1'] + '-w',
                                                        df['col1']),
                      np.nan)
Output:
      col1  col2   col3             col4
0  string1   1.0      I     I67string1-w
1   thang2   2.0   cant     cant67thang2
2    code3   NaN  think              NaN
3  string2   3.0   what  what67string2-w
Or if you want to do it with assign:
df.assign(col5=np.where(df['col2'].notnull(),
                        df['col3'] + str(var1) + np.where(df['col1'].str.contains('in'),
                                                          df['col1'] + '-w',
                                                          df['col1']),
                        np.nan))
Output:
      col1  col2   col3             col4             col5
0  string1   1.0      I     I67string1-w     I67string1-w
1   thang2   2.0   cant     cant67thang2     cant67thang2
2    code3   NaN  think              NaN              NaN
3  string2   3.0   what  what67string2-w  what67string2-w
Update: since you mentioned speed, I think I'd remove the .str accessor and use a list comprehension instead.
df['col4'] = np.where(df['col2'].notnull(),
                      df['col3'] + str(var1) + np.where(['in' in i for i in df['col1']],
                                                        df['col1'] + '-w',
                                                        df['col1']),
                      np.nan)
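For reference, a minimal self-contained run of this final version (the frame is rebuilt from the question):
import numpy as np
import pandas as pd

var1 = 67
df = pd.DataFrame({'col1': ['string1', 'thang2', 'code3', 'string2'],
                   'col2': [1, 2, np.nan, 3],
                   'col3': ['I', 'cant', 'think', 'what']})

df['col4'] = np.where(df['col2'].notnull(),
                      df['col3'] + str(var1) + np.where(['in' in i for i in df['col1']],
                                                        df['col1'] + '-w',
                                                        df['col1']),
                      np.nan)
print(df)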
