Pandas: remove rows with value lower than a threshold but keep NaNs - python

I am working with Pandas (python) on a dataset containing some fMRI results.
I am trying to drop rows when the value of a specific column is lower than a threshold I set. The thing is, I would also like to keep NaN values.
df = df[(df['in_mm3'] > 270) or (np.isnan(df['in_mm3']) == True)]
Obviously this doesn't work, but it is for you to better understand what I'm trying to achieve.
Any help would be appreciated.
Thanks.

You are almost there. This should work:
import numpy as np
df = df[np.logical_or(df['in_mm3'] > 270, np.isnan(df['in_mm3']))]

df = df[(df['in_mm3'] > 270) | (pd.isna(df['in_mm3']))]
Here we keep the rows where either df['in_mm3'] > 270 or pd.isna(df['in_mm3']) is True; | is the element-wise OR for pandas Series.
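For example, a minimal sketch on made-up data (the column name and the 270 threshold follow the question):
import numpy as np
import pandas as pd
# Hypothetical values; the real fMRI data comes from the question's dataset.
df = pd.DataFrame({'in_mm3': [100.0, 300.0, np.nan, 250.0, 500.0]})
# Keep rows above the threshold OR rows where the value is missing.
filtered = df[(df['in_mm3'] > 270) | df['in_mm3'].isna()]
print(filtered)  # the 300.0, NaN and 500.0 rows remain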

Related

Multi-column filter on a VAEX dataframe, apply expression and save result

I want to use VAEX for lazy work with my dataframe. After a quick start with exporting a big csv and applying some simple filters and extract(), I have an initial df for my work with 3 main columns: cid1, cid2, cval1. Each combination of cid1 and cid2 is a workset with some rows where cval1 differs. My df contains only valid cid1 and cid2. I want to keep in df only the rows with the minimum cval1 and drop the others. cval1 is float, cid1 and cid2 are int.
I tried one filter:
df = df.filter(df.cid1 == 36 & df.cid2 == 182 & df.cval1 == df.min(df.cval1))
I expect to get a df with only one row as a result, but it does not work properly (the screenshot of the output is omitted here).
That's the first problem. Next, I need to find the minimum cval1 for each valid combination of cid1 and cid2.
I have a list of tuples with the cid1 and cid2 values:
cart_prod=[(2, 5), (3, 9), ...]
I thought I'd try:
df_finally = vaex.DataFrame()
for x in cart_prod:
    df2 = df.filter(df.cid1 == x[0] & df.cid2 == x[1] & df.cval1 == df.min(df.cval1))
    df_finally = vaex.concat([df_finally, df2])
But the filter is not valid, and VAEX cannot concat: it raises an error saying DataFrame has no attribute concat, even though I really did call vaex.concat(list_of_dataframes).
I think maybe I can use:
df.select(df.cid1 == x[0] & df.cid2 == x[1] & df.cval1 == df.min(df.cval1), name = "selection")
But I can't figure out how to make df take and use that selection.
df = df.filter((df.cid1, df.cid2) in cart_prod)
This code gives no result either.
Hmmm.. Help me please!
How do I choose the minimum df.cval1 for each combination of df.cid1 and df.cid2 and save the result to a dataframe in VAEX?
Maybe groupby? But I don't understand how it works.
I've not used VAEX, but the documentation says its groupby syntax is virtually the same as pandas'. So, here is what I would do in pandas:
import pandas as pd
import numpy as np
df["min_cid3"] = df.groupby(['cid1', 'cid2'])['cid3'].transform(np.min)
Then filter your df wherever cid3==min_cid3.
EDIT:
As per OP's comment, the above pandas solution works but fails for VAEX. So, based on the VAEX docs, I believe this would work there:
df.groupby(by=['cid1', 'cid2']).agg({'cval1': 'min'})
PS: I haven't installed VAEX, so if this doesn't work and you figure out the change needed, feel free to suggest edit.
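For reference, a minimal, self-contained pandas sketch of that transform-then-filter idea on invented cid1/cid2/cval1 data (the column names follow the question; the values are made up):
import pandas as pd
# Hypothetical data; the real values come from the OP's dataframe.
df = pd.DataFrame({
    'cid1':  [2, 2, 3, 3, 3],
    'cid2':  [5, 5, 9, 9, 9],
    'cval1': [1.5, 0.7, 2.0, 0.4, 0.9],
})
# Per-group minimum, broadcast back to every row of the group.
df['min_cval1'] = df.groupby(['cid1', 'cid2'])['cval1'].transform('min')
# Keep only the rows that attain the group minimum.
result = df[df['cval1'] == df['min_cval1']]
print(result)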

Calculate the mean in pandas while a column has a string

I am currently learning pandas and I am using an IMDb movies database, in which one of the columns is the duration of the movies. However, one of the values is "None", so I can't calculate the mean because this string is in the middle. I thought of changing the "None" to 0, but that would skew the results, as can be seen with the code below.
dur_temp = duration.replace("None", 0)
dur_temp = dur_temp.astype(float)
descricao_duration = dur_temp.mean()
Any ideas on what I should do in order to not skew the data? I also graphed it and it becomes more clear how it skews it.
You can replace "None" with numpy.nan instead of using 0.
Something like this should do the trick:
import numpy as np
dur_temp = duration.replace("None", np.nan)
descricao_duration = dur_temp.mean()
If you want it to work for any string in your pandas Series, you could use pd.to_numeric:
pd.to_numeric(dur_temp, errors='coerce').mean()
This way, every value that cannot be converted to float is replaced by NaN, whatever the string is.
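For instance, a minimal sketch on invented durations:
import pandas as pd
# Hypothetical values; "None" is the literal string from the question.
duration = pd.Series(['90', '120', 'None', '105'])
mean_duration = pd.to_numeric(duration, errors='coerce').mean()
print(mean_duration)  # 105.0; the "None" entry becomes NaN and is ignored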
Just filter by condition like this
df[df['a']!='None'] #assuming your mean values are in column a
Make them np.nan values.
I am writing this as an answer because I can't comment: df = df.replace('None', np.nan) or df.replace('None', np.nan, inplace=True)
You can use fillna(value=np.nan) as shown below:
descricao_duration = dur_temp.fillna(value=np.nan).mean()
Demo:
import pandas as pd
import numpy as np
dur_temp = pd.DataFrame({'duration': [10, 20, None, 15, None]})
descricao_duration = dur_temp.fillna(value=np.nan).mean()
print(descricao_duration)
Output:
duration 15.0
dtype: float64

Python Pandas dataframe, counting the rows where 2 columns are True

So I have a really big dataframe with the following information:
There are 2 columns, "eethuis" and "caternaar", which are True or False depending on whether the place has it or not. Now I need to find the number of places where they have both eethuis and caternaar. So this means that I need to count the rows where both eethuis and caternaar are True, but I can't really find a way, even after searching for some time.
This is what I have. I merged the 2 columns that I need together, but now I still need to select and count the rows where both are True:
In the picture you will not see a row where both are true, but there are some. It's a really long table with over 800 columns.
Would be nice if someone could help me!
If I understand your question correctly, you can use '&', here is an example on random data:
import pandas as pd
import random
# create random data
df = pd.DataFrame()
df['col1'] = [random.randint(0,1) for x in range(10000)]
df['col2'] = [random.randint(0,1) for x in range(10000)]
df = df.astype(bool)
# filter it:
df1 = df[(df['col1']==True) & (df['col2']==True)]
# check size:
df1.shape
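Since the columns are already boolean, the same count can also be taken directly (same invented column names as above):
count = (df['col1'] & df['col2']).sum()
print(count)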
Thanks to Ezer K I found the solution! Here is the code:
totaal = df_resto[(df_resto['eethuis']==True) & (df_resto['cateraar']==True)]
This is the output (screenshot omitted), so you see it works, and the count is 41!

How do I delete every row that has a value for a specific column smaller than a certain limit?

I have this dataset and would like to .drop() every row whose "Cell_Area" value is smaller than 100.
This is the dataset (screenshot omitted).
Thank you very much in advance.
First select all the indexes where your condition is met, then use the drop method to remove them in place:
df.drop(df.loc[df['Cell_Area'] < 100].index, inplace=True)
Use boolean indexing to select rows
df = df[~(df['Cell_Area'] < 100)]
# Or
df = df[df['Cell_Area'] >= 100]
This can work
df.drop(df[df['Cell_Area'] < 100].index, inplace = True)
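A minimal sketch on invented values (the column name follows the question):
import pandas as pd
# Hypothetical values; the real dataset comes from the question's screenshot.
df = pd.DataFrame({'Cell_Area': [50, 150, 99, 230]})
kept = df[df['Cell_Area'] >= 100]
print(kept)  # the 150 and 230 rows remain
# The drop variant gives the same result in place:
df.drop(df[df['Cell_Area'] < 100].index, inplace=True)
print(df)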

pandas apply function with multiple conditions?

If I want to apply a lambda with multiple conditions, how should I do it?
df.train.age.apply(lambda x:0 (if x>=0 and x<500))
Or are there much better methods?
Create a mask and select from your Series with the mask; only apply to the result:
mask = (df.train.age >=0) & (df.train.age < 500)
df.train.age[mask].apply(something)
If you just need to set the ones that don't match to zero, that's even easier:
df.train.age[~mask] = 0
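Put together as a runnable sketch (the data and the something function are invented for illustration):
import pandas as pd
# Hypothetical stand-ins for df.train.age and the function to apply.
age = pd.Series([-10, 10, 250, 600, 499])
something = lambda x: x + 1
mask = (age >= 0) & (age < 500)
age[~mask] = 0                          # zero out values outside [0, 500)
age[mask] = age[mask].apply(something)  # apply only where the mask holds
print(age.tolist())  # [0, 11, 251, 0, 500]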
Your syntax needs to have an else, e.g. keep in-range values and zero out the rest:
df.train.age.apply(lambda x: x if x >= 0 and x < 500 else 0)
This is a good way to do it.
The same can be obtained without using apply, but using np.where as below.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'age': [-10, 10, 20, 30, 40, 100, 110]
})
df['age'] = np.where((df['age'] >= 100) | (df['age'] < 0), 0, df['age'])
df
If you have any confusion using the above code, please post your sample dataframe and I'll update my answer.
