I have a dataframe:
df =
col1  col2  col3
   1     2     3
   1     4     6
   3     7     2
I want to edit df so that, when the value of col1 is smaller than 2, col1 takes the value from col3.
So I will get:
new_df =
col1  col2  col3
   3     2     3
   6     4     6
   3     7     2
I tried to use assign and df.loc but it didn't work.
What is the best way to do so?
df['col1'] = df.apply(lambda x: x['col3'] if x['col1'] < x['col2'] else x['col1'], axis=1)
The most efficient way is to use the loc indexer:
mask = df["col1"] < df["col2"]
df.loc[mask, "col1"] = df.loc[mask, "col3"]
df.loc[df["col1"] < 2, "col1"] = df["col3"]
As mentioned by @anky_91, use np.where to update the 'col1' values (this needs import numpy as np):
df['col1'] = np.where(df['col1'] < df['col2'], df['col3'], df['col1'])
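If you prefer to stay numpy-free, an equivalent pandas-only sketch uses Series.mask, which replaces values wherever the condition holds:
# Replace col1 with col3 wherever col1 < col2 (pandas-only equivalent of np.where)
df['col1'] = df['col1'].mask(df['col1'] < df['col2'], df['col3'])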
You could look at using the apply function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
df['col1'] = df.apply(lambda c: c['col3'] if c['col1'] < 2 else c['col1'], axis=1)
Edit: Sorry, I see from your mock outcome you are referring to col2 rather than an int of 2. Eric Yang's answer will solve your problem.
Related
My data looks like this: (I have 28 columns)
col1 col2 col3 col4 col5
  AA    0    0    B    0
   0   CC    0    D    0
   0    0    E    F    G
I am trying to merge these columns to get an output like this:
col1 col2 col3 col4 col5 col6
  AA    0    0    B    0 AA;B
   0   CC    0    D    0 CC;D
   0    0    E    F    G E;F;G
I want to merge only the non-numeric characters into the new column.
I tried like this:
cols=['col1','col2', 'col3', 'col4', 'col5']
df2["col6"] = df2[cols].apply(lambda x: ';'.join(x.dropna()), axis=1)
But it doesn't take out the zeros. I am aware it is a small change but couldn't figure it out.
Thanks
Try the where() method together with apply():
df2["col6"]=df2.where((df2!='0')&(df2!=0)).apply(lambda x: ';'.join(x.dropna()), axis=1)
If there are numbers other than 0 (including 0), then use:
df2["col6"]=(df2.where(df2.apply(lambda x:x.str.isalpha(),1))
.apply(lambda x: ';'.join(x.dropna()), axis=1))
With your shown samples, please try the following; it fixes the OP's attempt. The key change is using the condition x[x != 0] as a boolean mask inside the join:
df2['col6'] = df2[cols].apply(lambda x: ';'.join(x[x!=0]), axis=1)
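A self-contained run of this fix, assuming the zeros are stored as the integer 0 (if they are the string '0', compare against '0' instead):
import pandas as pd

# Hypothetical reconstruction of the sample, with zeros stored as ints
df2 = pd.DataFrame({'col1': ['AA', 0, 0],
                    'col2': [0, 'CC', 0],
                    'col3': [0, 0, 'E'],
                    'col4': ['B', 'D', 'F'],
                    'col5': [0, 0, 'G']})
cols = ['col1', 'col2', 'col3', 'col4', 'col5']
# Keep only the non-zero cells in each row, then join the survivors with ';'
df2['col6'] = df2[cols].apply(lambda x: ';'.join(x[x != 0]), axis=1)
print(df2['col6'])
# 0     AA;B
# 1     CC;D
# 2    E;F;G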
I want to select all rows of the DataFrame where the string in any column is shorter than 2 characters.
For example:
df = pd.DataFrame({"col1":["a","ab",""],"col2":["bc","abc", "a"]})
  col1 col2
0    a   bc
1   ab  abc
2        a
How to get this output:
  col1 col2
0    a   bc
2        a
Let's try stack to reshape, then compute the lengths with str.len and create a boolean mask with lt + any:
df[df.stack().str.len().lt(2).any(level=0)]
  col1 col2
0    a   bc
2        a
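Note: the level argument of Series.any was deprecated and later removed in pandas 2.0, so on newer versions group by the original row index instead:
# Same idea on pandas >= 2.0: group the lengths by the original row index
df[df.stack().str.len().lt(2).groupby(level=0).any()]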
You can use the str.len() method of the pandas Series (note this keeps rows where every column is shorter than 3 characters, which happens to match the sample output):
for col in df.columns:
    df[col] = df[col][df[col].str.len() < 3]
df = df.dropna()
A list comprehension could help here:
df.loc[[not any(len(word) > 2 for word in entry)
        for entry in df.to_numpy()]]
  col1 col2
0    a   bc
2        a
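For completeness (my addition, not from the answers above): a variant that follows the stated "any column shorter than 2" condition directly:
# Length of every cell, then keep rows where any length is < 2
df[df.applymap(len).lt(2).any(axis=1)]
On pandas >= 2.1, DataFrame.map replaces the deprecated applymap: df[df.map(len).lt(2).any(axis=1)].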
I am trying to reassign multiple columns in DataFrame with modifications.
The below is a simplified example.
import pandas as pd
d = {'col1':[1,2], 'col2':[3,4]}
df = pd.DataFrame(d)
print(df)
col1 col2
0 1 3
1 2 4
I use the assign() method to add 1 to both 'col1' and 'col2'.
However, the result adds 1 only to 'col2' and copies that result into 'col1'.
df2 = df.assign(**{c: lambda x: x[c] + 1 for c in ['col1','col2']})
print(df2)
col1 col2
0 4 4
1 5 5
Can someone explain why this is happening, and also suggest a correct way to apply assign() to multiple columns?
The lambdas can't be used inside the dict comprehension like that: every lambda closes over the same loop variable c, and assign only calls them after the comprehension has finished, when c is 'col2', so both columns get col2 + 1. Precompute the Series instead:
df.assign(**{c: df[c] + 1 for c in ['col1','col2']})
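If you specifically want callables (e.g., when chaining and x is an intermediate frame), a common sketch is to bind c as a default argument so each lambda keeps its own column name:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
# c=c freezes the current loop value inside each lambda
df2 = df.assign(**{c: (lambda x, c=c: x[c] + 1) for c in ['col1', 'col2']})
print(df2)
#    col1  col2
# 0     2     4
# 1     3     5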
I have many columns in a dataframe, and I want to fill one column by manipulating two other columns in the same dataframe.
col1 | col2 | col3 | col4
 nan |    1 |    2 |    4
   2 |    2 |    2 |    3
   3 |  nan |    1 |    2
I want to fill col1, col2, or col3 wherever a NaN exists, based on the values of the other two columns.
I have code as follows:
indices_of_nan_cell = [(index, col1, col2, col3)
                       for index, (col1, col2, col3) in enumerate(zip(col1, col2, col3))
                       if str(col1) == 'nan' or str(col2) == 'nan' or str(col3) == 'nan']
for nan_values in indices_of_nan_cell:
    if np.isnan(nan_values[1]) or nan_values[1] == 'nan':
        read4['col1'][nan_values[0]] = float(nan_values[2]) * float(nan_values[3])
    if np.isnan(nan_values[2]) or nan_values[2] == 'nan':
        read4['col2'][nan_values[0]] = float(nan_values[1]) / float(nan_values[3])
    if np.isnan(nan_values[3]) or nan_values[3] == 'nan':
        read4['col3'][nan_values[0]] = float(nan_values[1]) * float(nan_values[2])
It's working fine for me, but it takes too much time since my dataframe has thousands of rows. Is there a more efficient way to do this?
I believe you need fillna to replace the NaNs, using mul and div with the fill_value parameter so that a NaN inside the multiplication or division is treated as 1. Note the three statements run in order, so the col2 fill already sees the filled col1:
df['col1'] = df['col1'].fillna(df['col2'].mul(df['col3'], fill_value=1))
df['col2'] = df['col2'].fillna(df['col1'].div(df['col3'], fill_value=1))
df['col3'] = df['col3'].fillna(df['col1'].mul(df['col2'], fill_value=1))
print(df)
   col1  col2  col3  col4
0   2.0   1.0     2     4
1   2.0   2.0     2     3
2   3.0   3.0     1     2
Another approach works only on the NaN rows:
m1 = df['col1'].isna()
m2 = df['col2'].isna()
m3 = df['col3'].isna()
# older versions of pandas:
# m1 = df['col1'].isnull()
# m2 = df['col2'].isnull()
# m3 = df['col3'].isnull()
df.loc[m1, 'col1'] = df.loc[m1, 'col2'].mul(df.loc[m1, 'col3'], fill_value=1)
df.loc[m2, 'col2'] = df.loc[m2, 'col1'].div(df.loc[m2, 'col3'], fill_value=1)
df.loc[m3, 'col3'] = df.loc[m3, 'col1'].mul(df.loc[m3, 'col2'], fill_value=1)
Explanation:
Filter each column with isna to get 3 separate boolean masks.
For each mask, first filter the rows, e.g. df.loc[m1, 'col2'], and multiply or divide.
Last, assign back; only the NaNs are replaced, because the left side filters again, e.g. df.loc[m1, 'col1'].
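A quick end-to-end check of this masked variant, assuming the sample frame above (col4 carried along unchanged):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, 3],
                   'col2': [1, 2, np.nan],
                   'col3': [2, 2, 1],
                   'col4': [4, 3, 2]})
m1, m2, m3 = df['col1'].isna(), df['col2'].isna(), df['col3'].isna()
df.loc[m1, 'col1'] = df.loc[m1, 'col2'].mul(df.loc[m1, 'col3'], fill_value=1)
df.loc[m2, 'col2'] = df.loc[m2, 'col1'].div(df.loc[m2, 'col3'], fill_value=1)
df.loc[m3, 'col3'] = df.loc[m3, 'col1'].mul(df.loc[m3, 'col2'], fill_value=1)
print(df)  # matches the output shown above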
I was wondering how to calculate the number of unique symbols that occur in a single column in a dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})
df
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3       ggggggg
It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.
My code so far (but this might be wrong):
unique_symbols = [0] * 203
i = 0
for col in df.columns:
    observed_symbols = []
    df_temp = df[[col]]
    df_temp = df_temp.astype('str')
    # This part is where I am not so sure
    for index, row in df_temp.iterrows():
        for symbol in row[col]:
            if symbol not in observed_symbols:
                observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)
    i += 1
Thanks in advance
Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to plain Python, because it's so much faster.
{c: len(set(''.join(df[c]))) for c in df.columns}
{'col1': 3, 'col2': 4}
Option 2
agg
If you want to stay in pandas space.
df.agg(lambda x: set(''.join(x)), axis=0).str.len()
Or,
df.agg(lambda x: len(set(''.join(x))), axis=0)
col1 3
col2 4
dtype: int64
Here is one way:
df.apply(lambda x: len(set(''.join(x.astype(str)))))
col1 3
col2 4
Maybe
df.sum().apply(set).str.len()
Out[673]:
col1 3
col2 4
dtype: int64
One more option (note: this counts the unique symbols within each cell and sums the counts, which matches here only because no symbol appears in more than one cell of a column):
In [38]: df.applymap(lambda x: len(set(x))).sum()
Out[38]:
col1 3
col2 4
dtype: int64
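To see where that per-cell counting diverges, a hypothetical counter-example (my addition, not from the thread):
import pandas as pd

df_dup = pd.DataFrame({'col1': ['ab', 'ab']})
print(df_dup.applymap(lambda x: len(set(x))).sum())  # col1    4
print(len(set(''.join(df_dup['col1']))))             # 2, the true unique count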