I have many columns in a dataframe, and I want to fill one column by manipulating two other columns in the same dataframe:
col1 | col2 | col3 | col4
 nan |    1 |    2 |    4
   2 |    2 |    2 |    3
   3 |  nan |    1 |    2
If a NaN exists in col1, col2, or col3, I want to fill it on the basis of the other two columns' values. I have code as follows:
indices_of_nan_cell = [(index, c1, c2, c3)
                       for index, (c1, c2, c3) in enumerate(zip(col1, col2, col3))
                       if str(c1) == 'nan' or str(c2) == 'nan' or str(c3) == 'nan']
for nan_values in indices_of_nan_cell:
    if np.isnan(nan_values[1]):
        read4['col1'][nan_values[0]] = float(nan_values[2]) * float(nan_values[3])
    if np.isnan(nan_values[2]):
        read4['col2'][nan_values[0]] = float(nan_values[1]) / float(nan_values[3])
    if np.isnan(nan_values[3]):
        read4['col3'][nan_values[0]] = float(nan_values[1]) * float(nan_values[2])
It's working fine for me, but it takes too much time because my dataframe has thousands of rows. Is there a more efficient way to do this?
I believe you need fillna to replace the NaNs, combined with mul and div using the fill_value parameter to replace NaNs inside the multiplication and division:
df['col1'] = df['col1'].fillna(df['col2'].mul(df['col3'], fill_value=1))
df['col2'] = df['col2'].fillna(df['col1'].div(df['col3'], fill_value=1))
df['col3'] = df['col3'].fillna(df['col1'].mul(df['col2'], fill_value=1))
print (df)
   col1  col2  col3  col4
0   2.0   1.0     2     4
1   2.0   2.0     2     3
2   3.0   3.0     1     2
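To illustrate what fill_value does here (a minimal sketch): a NaN on either side of the operation is replaced by 1 before computing, so a missing operand acts as the multiplicative identity.

import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.0])
b = pd.Series([3.0, np.nan])

# NaN operands are replaced by 1 first: [1 * 3, 2 * 1] -> [3.0, 2.0]
print(a.mul(b, fill_value=1))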
Another approach is to work only with the rows that contain NaNs:
m1 = df['col1'].isna()
m2 = df['col2'].isna()
m3 = df['col3'].isna()
#older versions of pandas
#m1 = df['col1'].isnull()
#m2 = df['col2'].isnull()
#m3 = df['col3'].isnull()
df.loc[m1, 'col1'] = df.loc[m1, 'col2'].mul(df.loc[m1, 'col3'], fill_value=1)
df.loc[m2, 'col2'] = df.loc[m2, 'col1'].div(df.loc[m2, 'col3'], fill_value=1)
df.loc[m3, 'col3'] = df.loc[m3, 'col1'].mul(df.loc[m3, 'col2'], fill_value=1)
Explanation:
Filter each column with isna to build 3 separate boolean masks.
For each mask, first filter the relevant rows, e.g. df.loc[m1, 'col2'], and multiply or divide them.
Last, assign back: only the NaNs are replaced, because the left-hand side filters again with df.loc[m1, 'col1'].
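Putting the mask approach together, a minimal end-to-end sketch with the sample data from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, 3],
                   'col2': [1, 2, np.nan],
                   'col3': [2, 2, 1],
                   'col4': [4, 3, 2]})

m1 = df['col1'].isna()
m2 = df['col2'].isna()
m3 = df['col3'].isna()

# fill each missing value from the other two columns in the same row
df.loc[m1, 'col1'] = df.loc[m1, 'col2'].mul(df.loc[m1, 'col3'], fill_value=1)
df.loc[m2, 'col2'] = df.loc[m2, 'col1'].div(df.loc[m2, 'col3'], fill_value=1)
df.loc[m3, 'col3'] = df.loc[m3, 'col1'].mul(df.loc[m3, 'col2'], fill_value=1)

print(df)
#    col1  col2  col3  col4
# 0   2.0   1.0     2     4
# 1   2.0   2.0     2     3
# 2   3.0   3.0     1     2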
Related
I want to drop columns if the values inside them are the same as in other columns. From DF below, it should yield DF_new:
import numpy as np
import pandas as pd

DF = pd.DataFrame(index=[1, 2, 3, 4], columns=['col1', 'col2', 'col3', 'col4', 'col5'])
x = np.random.uniform(size=4)
DF['col1'] = x
DF['col2'] = x + 2
DF['col3'] = x
DF['col4'] = x + 2
DF['col5'] = [5,6,7,8]
display(DF)
DF_new = DF[['col1', 'col2', 'col5']]
display(DF_new)
A simple example of what I can't manage to do. Note that the column names are not the same, so I can't use:
DF_new = DF.loc[:, ~DF.columns.duplicated()].copy()
which drops columns based on their names.
You can use:
df = df.T.drop_duplicates().T
Step by step:
df2 = df.T  # T = transpose (convert rows to columns)
'''
          1         2         3         4
col1  0.67075  0.707864  0.206923  0.168023
col2  2.67075  2.707864  2.206923  2.168023
col3  0.67075  0.707864  0.206923  0.168023
col4  2.67075  2.707864  2.206923  2.168023
col5  5.00000  6.000000  7.000000  8.000000
'''
# now we can use drop_duplicates
df2 = df2.drop_duplicates()
'''
          1         2         3         4
col1  0.67075  0.707864  0.206923  0.168023
col2  2.67075  2.707864  2.206923  2.168023
col5  5.00000  6.000000  7.000000  8.000000
'''
# then transpose again
df2 = df2.T
'''
       col1      col2  col5
1  0.670750  2.670750   5.0
2  0.707864  2.707864   6.0
3  0.206923  2.206923   7.0
4  0.168023  2.168023   8.0
'''
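One caveat with the double transpose: it can change dtypes (note how col5 became 5.0 above, since transposing mixes the columns' types). A sketch that avoids transposing back, assuming you only need to drop the duplicated columns:

# duplicated() on the transpose flags each column whose values repeat
# an earlier column; ~ keeps the first occurrence of each
df_new = df.loc[:, ~df.T.duplicated()]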
This should do what you need:
df = df.loc[:, ~df.apply(lambda x: x.duplicated(), axis=1).all()].copy()
I want to get all rows in a DataFrame where the length of the string in any column is shorter than 2.
For example:
df = pd.DataFrame({"col1":["a","ab",""],"col2":["bc","abc", "a"]})
  col1 col2
0    a   bc
1   ab  abc
2         a
How do I get this output:
  col1 col2
0    a   bc
2         a
Let's try stack to reshape, then compute the lengths with str.len and build a boolean mask with lt + any:
df[df.stack().str.len().lt(2).any(level=0)]
  col1 col2
0    a   bc
2         a
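Note that the level argument of any was deprecated in pandas 1.3 and removed in 2.0; on newer versions an equivalent is to group the stacked index explicitly:

# group the (row, column) MultiIndex by row label (level 0) and reduce with any
df[df.stack().str.len().lt(2).groupby(level=0).any()]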
You can use the str.len() method of a pandas Series:
for col in df.columns:
    df[col] = df[col][df[col].str.len() < 3]
df = df.dropna()
A list comprehension could help here :
df.loc[[not any(len(word) > 2 for word in entry)
        for entry in df.to_numpy()]]

  col1 col2
0    a   bc
2         a
I have a dataframe:
df =
col1  col2  col3
   1     2     3
   1     4     6
   3     7     2
I want to edit df such that when the value of col1 is smaller than 2, it takes the value from col3.
So I will get:
new_df =
col1  col2  col3
   3     2     3
   6     4     6
   3     7     2
I tried to use assign and df.loc, but it didn't work.
What is the best way to do so?
df['col1'] = df.apply(lambda x: x['col3'] if x['col1'] < x['col2'] else x['col1'], axis=1)
The most efficient way is to use the loc operator:
mask = df["col1"] < df["col2"]
df.loc[mask, "col1"] = df.loc[mask, "col3"]
df.loc[df["col1"] < 2, "col1"] = df["col3"]
As mentioned by @anky_91, use np.where to update the 'col1' values:
df['col1'] = np.where(df['col1'] < df['col2'], df['col3'], df['col1'])
You could look at using the apply function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
df['col1'] = df.apply(lambda c: c['col3'] if c['col1'] < 2 else c['col1'], axis=1)
Edit: Sorry, I see from your mock outcome that you are referring to col2 rather than the integer 2. Eric Yang's answer will solve your problem.
I have two data frames named df and df_reference, which contain the following information:
df                  df_reference
col1  col2          col1  col2
A     10            A     15
B     25            B     33
C     30            C     20
A     12
I want to compare both data frames based on col1, and replace the value of df.col2 with df_reference.col2 if the value in df_reference is greater than the value in df.col2.
The expected output is:
df
col1  col2
A     15
B     33
C     30
A     15
I have tried:
dict1 = {'a':'15'}
df.loc[df['col1'].isin(dict1.keys()), 'col2'] = df['col1'].map(dict1)
Use Series.map with a Series created by DataFrame.set_index; NaNs where values are not matched are replaced by Series.fillna:
s = df['col1'].map(df_reference.set_index('col1')['col2']).fillna(df['col2'])
df.loc[s > df['col2'], 'col2'] = s
print (df)
  col1  col2
0    A    15
1    B    33
2    C    30
3    A    15
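A possible one-step alternative (a sketch, assuming col2 is numeric): since the goal is the row-wise maximum of the current value and the mapped reference value, Series.clip with a lower bound expresses that directly.

s = df['col1'].map(df_reference.set_index('col1')['col2'])
# raise each col2 value up to the mapped reference value; per the pandas
# docs, a missing (NaN) threshold leaves the original value unclipped
df['col2'] = df['col2'].clip(lower=s)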
I suggest first doing a merge on 'col1', then applying a function that generates a new column holding the greater of the two 'col2' values, and finally dropping the useless columns:
def greaterValue(row):
    if row['col2_x'] > row['col2_y']:
        return row['col2_x']
    else:
        return row['col2_y']

df = df.merge(df_reference, left_on='col1', right_on='col1')
df['col2'] = df.apply(greaterValue, axis=1)
df = df.loc[:, ['col1', 'col2']]
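The apply call can also be avoided entirely, which tends to be faster on large frames. A sketch, assuming df_reference has one row per key and df has a default integer index (a left merge preserves the row order of df):

merged = df.merge(df_reference, on='col1', how='left')
# take the larger of the original (col2_x) and reference (col2_y) values;
# max skips NaN, so unmatched keys keep the original value
df['col2'] = merged[['col2_x', 'col2_y']].max(axis=1)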
I am trying to assign a column based on strings that may be contained in other columns. For example:
import numpy as np
import pandas as pd

var1 = 67
columns = {'col1': ['string1', 'thang2', 'code3', 'string2'],
           'col2': [1, 2, np.nan, 3],
           'col3': ['I', 'cant', 'think', 'what']}
df = pd.DataFrame(data=columns)
How do I then make a fourth column col4 that is col3 + var1 + col1 most of the time, but is np.nan whenever col2 is NaN (in the same row), and has '-w' appended to its value whenever the string in col1 contains 'in' (again, in the same row)?
I know about assign, but I don't know how to do all that conditional logic inside assign, or whether there is a way to do it after creating the column.
You can try this using np.where:
df['col4'] = np.where(df['col2'].notnull(),
                      df['col3'] + str(var1) + np.where(df['col1'].str.contains('in'),
                                                        df['col1'] + '-w',
                                                        df['col1']),
                      np.nan)
Output:
      col1  col2   col3             col4
0  string1   1.0      I     I67string1-w
1   thang2   2.0   cant     cant67thang2
2    code3   NaN  think              NaN
3  string2   3.0   what  what67string2-w
Or if you want to do it with assign:
df.assign(col5=np.where(df['col2'].notnull(),
                        df['col3'] + str(var1) + np.where(df['col1'].str.contains('in'),
                                                          df['col1'] + '-w',
                                                          df['col1']),
                        np.nan))
Output:
      col1  col2   col3             col4             col5
0  string1   1.0      I     I67string1-w     I67string1-w
1   thang2   2.0   cant     cant67thang2     cant67thang2
2    code3   NaN  think              NaN              NaN
3  string2   3.0   what  what67string2-w  what67string2-w
Update: since you mentioned speed, I'd remove the .str accessor and use a list comprehension instead:
df['col4'] = np.where(df['col2'].notnull(),
                      df['col3'] + str(var1) + np.where(['in' in i for i in df['col1']],
                                                        df['col1'] + '-w',
                                                        df['col1']),
                      np.nan)
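One detail worth noting: np.where returns a plain object array here, because the result mixes strings with np.nan. If you prefer to stay within pandas, Series.where expresses the same logic (a sketch):

# append '-w' where col1 contains 'in', then blank out rows where col2 is NaN
base = df['col3'] + str(var1) + df['col1'].where(~df['col1'].str.contains('in'),
                                                 df['col1'] + '-w')
df['col4'] = base.where(df['col2'].notnull())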