I have a df as below
I want to make this df binary, with 1 where a value is positive and 0 otherwise.
I tried
df[:]=np.where(df>0, 1, 0)
but with this I am losing my df index.
I could apply this to each column one by one or use a loop, but I think there should be some easy and quick way to do this.
You can create a boolean mask with DataFrame.gt and convert it to integers:
df1 = df.gt(0).astype(int)
Or use DataFrame.clip if the values are integers with no negatives:
df1 = df.clip(upper=1)
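For example, a minimal sketch with a hypothetical frame standing in for yours (the names df, a, b are placeholders):

import pandas as pd

# hypothetical non-negative data, so both variants apply
df = pd.DataFrame({'a': [0, 2, 5], 'b': [3, 0, 1]}, index=['x', 'y', 'z'])

df1 = df.gt(0).astype(int)   # boolean mask cast to 0/1
df2 = df.clip(upper=1)       # caps every value at 1
print(df1.equals(df2))       # True, and the original index is kept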
Your solution should work with DataFrame.loc:

df.loc[:] = np.where(df > 0, 1, 0)
Of course it is possible with a function, but it can also be done with just an operator:
(df > 0) * 1
Without using numpy:
df[df > 0] = 1
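Note this mutates df in place and only touches the positive entries, so it yields a fully binary frame only when the remaining values are already 0. A quick sketch with hypothetical data:

import pandas as pd

# hypothetical frame of non-negative counts
df = pd.DataFrame({'a': [0, 2, 5], 'b': [3, 0, 1]})

df[df > 0] = 1
print(df)   # positives become 1, zeros stay 0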
I am performing math operations on a dataframe. This is the operation that works for me.
df['delta'] = np.where(df['peak'] == 1, df['value'] - df['value'].shift(1), np.NaN)
Now I want to do the EXACTLY same thing with index. Indexes are integers, but they have gaps.
My DataFrame looks like this:
     value  peak
40    5878     1
90    8091     1
98    9091     1
101  10987     1
So, how should I write my line? I mean something like this:
df['i'] = np.where(df['peak'] == 1, df.index - df.index.shift(1), np.NaN)
So, I want to get column with values 50, 8, 3 and so on...
Since df.index.shift() is only implemented for time-related indexes, use NumPy's diff with prepend as a replacement:
df['i'] = np.where(df.peak == 1, np.diff(df.index, prepend=np.nan), np.nan)
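A quick check with the sample frame from the question:

import numpy as np
import pandas as pd

# the sample data above, with its gapped integer index
df = pd.DataFrame({'value': [5878, 8091, 9091, 10987],
                   'peak': [1, 1, 1, 1]},
                  index=[40, 90, 98, 101])

df['i'] = np.where(df.peak == 1, np.diff(df.index, prepend=np.nan), np.nan)
print(df['i'].tolist())   # [nan, 50.0, 8.0, 3.0]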
I wish to create a column of sequential labels like abc_01, abc_02, and so on. In Excel this can be done easily by auto-filling the remaining rows. How can I achieve the same in Python using pandas?
Use f-strings with zero-filled values:
df['new'] = [f'abc_{n+1:02}' for n in range(len(df))]
If a default RangeIndex is possible, add 1, convert to strings and apply str.zfill:
df['new'] = 'abc_' + (df.index + 1).astype(str).str.zfill(2)
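Both give the same labels; a minimal sketch on a hypothetical three-row frame with the default index:

import pandas as pd

# hypothetical frame; only its length and default index matter here
df = pd.DataFrame({'value': [10, 20, 30]})

df['new'] = [f'abc_{n+1:02}' for n in range(len(df))]
df['new2'] = 'abc_' + (df.index + 1).astype(str).str.zfill(2)
print(df['new'].tolist())   # ['abc_01', 'abc_02', 'abc_03']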
I'm trying to create a new column based on other columns existing in my df.
My new column, col, should be 1 if there is at least one 1 in columns A ~ E.
If all values in columns A ~ E are 0, then the value of col should be 0.
I've attached an image for better understanding.
What is the most efficient way to do this in Python without using a loop? Thanks.
If you need to test all columns, use DataFrame.max, or DataFrame.any with a cast to integers to map True/False to 1/0:
df['col'] = df.max(axis=1)
df['col'] = df.any(axis=1).astype(int)
Or if you need to test only the columns between A and E, add DataFrame.loc:
df['col'] = df.loc[:, 'A':'E'].max(axis=1)
df['col'] = df.loc[:, 'A':'E'].any(axis=1).astype(int)
If you need to specify the columns by a list, use a subset:
cols = ['A','B','C','D','E']
df['col'] = df[cols].max(axis=1)
df['col'] = df[cols].any(axis=1).astype(int)
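A minimal sketch with a hypothetical 0/1 frame using columns A through E:

import pandas as pd

# hypothetical data: only the middle row has a 1 somewhere in A:E
df = pd.DataFrame({'A': [0, 1, 0], 'B': [0, 0, 0], 'C': [0, 1, 0],
                   'D': [0, 0, 0], 'E': [0, 0, 0]})

df['col'] = df.loc[:, 'A':'E'].max(axis=1)
print(df['col'].tolist())   # [0, 1, 0]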
If I want to apply a lambda with multiple conditions, how should I do it?
df.train.age.apply(lambda x:0 (if x>=0 and x<500))
Or is there a much better method?
Create a mask and select from your Series with the mask, then apply only to the result:
mask = (df.train.age >=0) & (df.train.age < 500)
df.train.age[mask].apply(something)
If you just need to set the ones that don't match to zero, that's even easier:
df.train.age[~mask] = 0
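A minimal sketch of the mask idea on a plain Series (the ages and the 0–500 range follow the question; the data is hypothetical):

import pandas as pd

# hypothetical ages, two of them out of range
ages = pd.Series([-10, 25, 480, 900])

mask = (ages >= 0) & (ages < 500)
ages[~mask] = 0
print(ages.tolist())   # [0, 25, 480, 0]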
Your syntax needs to have an else:
df.train.age.apply(lambda x: 0 if x >= 0 and x < 500 else x)
This is a good way to do it, but the same result can be obtained without apply by using np.where, as below.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'age': [-10, 10, 20, 30, 40, 100, 110]
})
df['age'] = np.where((df['age'] >= 100) | (df['age'] < 0), 0, df['age'])
df
If you have any confusion using the above code, please post your sample dataframe and I'll update my answer.
This has been killing me!
Any idea how to convert this to a list comprehension?
for x in dataframe:
    if dataframe[x].value_counts().sum() <= 1:
        dataframe.drop(x, axis=1, inplace=True)
[dataframe.drop(x, axis=1, inplace=True) for x in dataframe if dataframe[x].value_counts().sum() <= 1]
I have not used pandas yet, but the documentation on dataframe.drop says it returns a new object, so I assume it will work.
I would probably suggest going the other way and filtering instead. I don't know your dataframe, but something like this should work:
counts_valid = df.T.apply(pd.value_counts).sum() > 1
df = df[counts_valid]
Or, if I see what you are doing correctly, you may be better off with:
counts_valid = df.T.nunique() > 1
df = df[counts_valid]
That will just keep rows that have more than one unique value.
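A minimal sketch of that row filter, on a hypothetical frame whose middle row holds a single repeated value:

import pandas as pd

# hypothetical data: row 1 contains only the value 5
df = pd.DataFrame({'a': [1, 5, 9], 'b': [2, 5, 6], 'c': [3, 5, 7]})

counts_valid = df.T.nunique() > 1   # one boolean per row of df
print(df[counts_valid])             # the all-5s row is dropped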