pandas copy value from one column to another if condition is met - python

I have a dataframe:
df =
col1 col2 col3
1 2 3
1 4 6
3 7 2
I want to edit df so that when the value of col1 is smaller than 2, col1 takes the value from col3.
So I will get:
new_df =
col1 col2 col3
3 2 3
6 4 6
3 7 2
I tried to use assign and df.loc but it didn't work.
What is the best way to do so?

df['col1'] = df.apply(lambda x: x['col3'] if x['col1'] < x['col2'] else x['col1'], axis=1)

An efficient way is to use loc with a boolean mask:
mask = df["col1"] < df["col2"]
df.loc[mask, "col1"] = df.loc[mask, "col3"]

df.loc[df["col1"] < 2, "col1"] = df["col3"]
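As a quick check against the expected output in the question, here is a minimal sketch that rebuilds the sample frame and applies the literal threshold 2. The names new_df, mask, via_mask and by_col2 are only illustrative. Note that comparing against col2 instead of the literal 2 would also change the third row, which the expected output leaves at 3.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"col1": [1, 1, 3], "col2": [2, 4, 7], "col3": [3, 6, 2]})

# Rows where col1 is smaller than the literal value 2
mask = df["col1"] < 2

# Work on a copy so the original frame stays available for comparison
new_df = df.copy()
new_df.loc[mask, "col1"] = new_df.loc[mask, "col3"]
print(new_df)

# Series.mask replaces values where the condition is True; same result
via_mask = df["col1"].mask(df["col1"] < 2, df["col3"])

# Comparing with col2 instead of 2 also rewrites the last row (3 < 7)
by_col2 = df["col1"].mask(df["col1"] < df["col2"], df["col3"])
```

With the threshold 2, col1 becomes 3, 6, 3 as in the question; with the col2 comparison it becomes 3, 6, 2.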

As mentioned by @anky_91, use np.where to update the 'col1' values:
import numpy as np
df['col1'] = np.where(df['col1'] < df['col2'], df['col3'], df['col1'])

You could look at using the apply function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
df['col1'] = df.apply(lambda c: c['col3'] if c['col1'] < 2 else c['col1'], axis=1)
Edit: Sorry, I see from your mock outcome you are referring to col2 rather than an int of 2. Eric Yang's answer will solve your problem.

Related

How to join column in pandas ignoring the value of Zero with mixed datatypes

My data looks like this: (I have 28 columns)
col1 col2 col3 col4 col5
AA 0 0 B 0
0 CC 0 D 0
0 0 E F G
I am trying to merge these columns to get an output like this:
col1 col2 col3 col4 col5 col6
AA 0 0 B 0 AA;B
0 C 0 DD 0 C;DD
0 0 E F G E;F;G
I want to merge only the non-numeric characters into the new column.
I tried like this:
cols=['col1','col2', 'col3', 'col4', 'col5']
df2["col6"] = df2[cols].apply(lambda x: ';'.join(x.dropna()), axis=1)
But it doesn't take out the zeros. I am aware it is a small change but couldn't figure it out.
Thanks
Try the where() method together with apply():
df2["col6"]=df2.where((df2!='0')&(df2!=0)).apply(lambda x: ';'.join(x.dropna()), axis=1)
If the columns can contain other numbers as well as 0, use:
df2["col6"]=(df2.where(df2.apply(lambda x:x.str.isalpha(),1))
.apply(lambda x: ';'.join(x.dropna()), axis=1))
With your shown samples, please try the following. It fixes the OP's attempt with one change: inside the join call, the row is filtered with the boolean mask x[x!=0], so the zeros are dropped before joining.
df2['col6'] = df2[cols].apply(lambda x: ';'.join(x[x!=0]), axis=1)
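If the frame may hold zeros both as integers and as the string '0', one option is to keep only values that are real strings. The sketch below rebuilds the question's sample input under that assumption; the generator inside join is an illustrative variation, not taken from any of the answers.

```python
import pandas as pd

# Sample input from the question, with zeros stored as integers
df2 = pd.DataFrame({
    "col1": ["AA", 0, 0],
    "col2": [0, "CC", 0],
    "col3": [0, 0, "E"],
    "col4": ["B", "D", "F"],
    "col5": [0, 0, "G"],
})
cols = ["col1", "col2", "col3", "col4", "col5"]

# Keep only string values (and skip a literal "0" string) before joining
df2["col6"] = df2[cols].apply(
    lambda row: ";".join(v for v in row if isinstance(v, str) and v != "0"),
    axis=1,
)
print(df2)
```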

Pandas all cells which length <2

I want to get all rows of a DataFrame in which the value in any column is shorter than 2 characters.
For example:
df = pd.DataFrame({"col1":["a","ab",""],"col2":["bc","abc", "a"]})
  col1 col2
0    a   bc
1   ab  abc
2         a
How to get this output:
  col1 col2
0    a   bc
2         a
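Let's try stack to reshape, then use str.len to compute the lengths and build a per-row boolean mask with lt + any (any(level=0) was removed in pandas 2.0, so group by the index level instead):
df[df.stack().str.len().lt(2).groupby(level=0).any()]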
  col1 col2
0    a   bc
2         a
You can use the str.len() method of the pandas Series:
for col in df.columns:
    df[col] = df[col][df[col].str.len() < 3]
df = df.dropna()
A list comprehension could help here :
df.loc[[not any(len(word) > 2 for word in entry)
for entry in df.to_numpy()]
]
  col1 col2
0    a   bc
2         a
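For comparison, here is a minimal sketch that applies the condition exactly as the question states it (any value shorter than 2 characters) without reshaping. It assumes every column holds strings; the names lengths and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "ab", ""], "col2": ["bc", "abc", "a"]})

# Character count of every cell, computed column by column
lengths = df.apply(lambda s: s.str.len())

# Keep rows where at least one cell is shorter than 2 characters
result = df[lengths.lt(2).any(axis=1)]
print(result)
```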

Pandas DataFrame.assign() doesn't work properly for multiple columns

I am trying to reassign multiple columns in DataFrame with modifications.
The below is a simplified example.
import pandas as pd
d = {'col1':[1,2], 'col2':[3,4]}
df = pd.DataFrame(d)
print(df)
col1 col2
0 1 3
1 2 4
I use assign() method to add 1 to both 'col1' and 'col2'.
However, the result is to add 1 only to 'col2' and copy the result to 'col1'.
df2 = df.assign(**{c: lambda x: x[c] + 1 for c in ['col1','col2']})
print(df2)
col1 col2
0 4 4
1 5 5
Can someone explain why this is happening, and also suggest a correct way to apply assign() to multiple columns?
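The lambdas all close over the same loop variable c, and c is looked up when a lambda is called, not when it is created. By the time assign calls them, the loop has finished and c is 'col2' for every lambda. Pass the computed columns directly instead: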
df.assign(**{c: df[c] + 1 for c in ['col1','col2']})
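If the callable form is still wanted, for example so that each expression sees the frame being assigned to, a common workaround is to bind the current value of c as a default argument. This is a minimal sketch of that idea using the question's data:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# c=c freezes the loop variable's current value for each lambda,
# so every lambda refers to its own column name
df2 = df.assign(**{c: (lambda x, c=c: x[c] + 1) for c in ["col1", "col2"]})
print(df2)
```

Here col1 becomes 2, 3 and col2 becomes 4, 5, because each lambda now reads its own column.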

Create new column by manipulating existing columns

I have many columns in a dataframe, and I want to fill one column by combining two other columns in the same dataframe.
col1 | col2 | col3 | col4
nan 1 2 4
2 2 2 3
3 nan 1 2
I want to fill the NaN values in col1, col2 and col3, computing each missing value from the other two of those columns.
I have code as follows:
indices_of_nan_cell = [(index,col1,col2,col3) for index,(col1,col2,col3) in enumerate(zip(col1,col2,col3)) if str(col1)=='nan' or str(col2)=='nan' or str(col3)=='nan']
for nan_values in indices_of_nan_cell:
    if np.isnan(nan_values[1]) or nan_values[1] == 'nan':
        read4['col1'][nan_values[0]]=float(nan_values[2])*float(nan_values[3])
    if np.isnan(nan_values[2]) or nan_values[2] == 'nan':
        read4['col2'][nan_values[0]]=float(nan_values[1])/float(nan_values[3])
    if np.isnan(nan_values[3]) or nan_values[3] == 'nan':
        read4['col3'][nan_values[0]]=float(nan_values[1])*float(nan_values[2])
It works fine for me, but it takes too much time because my dataframe has thousands of rows. Is there a more efficient way to do this?
I believe you need fillna to replace only the NaNs, combined with mul and div; their fill_value parameter substitutes 1 for a NaN operand in the multiplication or division:
df['col1'] = df['col1'].fillna(df['col2'].mul(df['col3'], fill_value=1))
df['col2'] = df['col2'].fillna(df['col1'].div(df['col3'], fill_value=1))
df['col3'] = df['col3'].fillna(df['col1'].mul(df['col2'], fill_value=1))
print (df)
col1 col2 col3 col4
0 2.0 1.0 2 4
1 2.0 2.0 2 3
2 3.0 3.0 1 2
Another approach is working only with NaNs rows:
m1 = df['col1'].isna()
m2 = df['col2'].isna()
m3 = df['col3'].isna()
# older versions of pandas
#m1 = df['col1'].isnull()
#m2 = df['col2'].isnull()
#m3 = df['col3'].isnull()
df.loc[m1, 'col1'] = df.loc[m1, 'col2'].mul(df.loc[m1, 'col3'], fill_value=1)
df.loc[m2, 'col2'] = df.loc[m2, 'col1'].div(df.loc[m2, 'col3'], fill_value=1)
df.loc[m3, 'col3'] = df.loc[m3, 'col1'].mul(df.loc[m3, 'col2'], fill_value=1)
Explanation:
Build three separate boolean masks, one per column, with isna.
For each mask, select the matching rows of the other two columns, e.g. df.loc[m1, 'col2'], and multiply or divide them.
Finally, assign the result back through the same mask, e.g. df.loc[m1, 'col1'], so that only the NaN cells are replaced.
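The steps above can be sketched end to end on the sample table from the question. The values are copied from that table; col3 has no NaN there, so its mask is left out of this sketch.

```python
import numpy as np
import pandas as pd

# Sample table from the question
df = pd.DataFrame({
    "col1": [np.nan, 2.0, 3.0],
    "col2": [1.0, 2.0, np.nan],
    "col3": [2, 2, 1],
    "col4": [4, 3, 2],
})

# One boolean mask per column that contains NaN
m1 = df["col1"].isna()
m2 = df["col2"].isna()

# Compute only on the masked rows and assign back through the same mask
df.loc[m1, "col1"] = df.loc[m1, "col2"] * df.loc[m1, "col3"]
df.loc[m2, "col2"] = df.loc[m2, "col1"] / df.loc[m2, "col3"]
print(df)
```

Only row 0 of col1 and row 2 of col2 are recomputed; every other cell is left untouched.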

Count unique symbols per column in Pandas

I was wondering how to calculate the number of unique symbols that occur in a single column in a dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})
df
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3       ggggggg
It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.
My code so far (but this might be wrong):
unique_symbols = [0]*203
i = 0
for col in df.columns:
    observed_symbols = []
    df_temp = df[[col]]
    df_temp = df_temp.astype('str')
    #This part is where I am not so sure
    for index, row in df_temp.iterrows():
        pass
        if symbol not in observed_symbols:
            observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)
    i += 1
Thanks in advance
Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to python, because it's so much faster.
{c : len(set(''.join(df[c]))) for c in df.columns}
{'col1': 3, 'col2': 4}
Option 2
agg
If you want to stay in pandas space.
df.agg(lambda x: set(''.join(x)), axis=0).str.len()
Or,
df.agg(lambda x: len(set(''.join(x))), axis=0)
col1 3
col2 4
dtype: int64
Here is one way:
df.apply(lambda x: len(set(''.join(x.astype(str)))))
col1 3
col2 4
Maybe
df.sum().apply(set).str.len()
Out[673]:
col1 3
col2 4
dtype: int64
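One more option (this sums the per-cell counts, so it matches the per-column count only when no symbol repeats across rows):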
In [38]: df.applymap(lambda x: len(set(x))).sum()
Out[38]:
col1 3
col2 4
dtype: int64
