Filtering a DataFrame with a mean threshold - python

I have a DataFrame, and I want to keep only the columns whose mean is over a certain threshold.
My code looks like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((20, 20)))
mean_keep = (df.mean() > 0.5)
mean_keep = mean_keep[mean_keep == True]
df_new = df[mean_keep.index]
and it works. However, I wonder if there is a function like "TAKE_ONLY_COLUMNS" that could reduce this to one line, like
df_new = df[TAKE_ONLY_COLUMNS(df.mean() > 0.5)]

Use df.loc[] here:
df_new = df.loc[:, df.mean() > 0.5]
print(df_new)
This will automatically keep the columns where the condition is True.
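For reference, here is the whole thing as a self-contained snippet (the random data and the 0.5 threshold are from the question, so the columns kept will vary from run to run):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((20, 20)))
# boolean Series over columns; .loc keeps the columns where it is True
df_new = df.loc[:, df.mean() > 0.5]
print(df_new.shape)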

Related

Is there a way to use a method/function as an expression for .loc() in pandas?

I've been going crazy trying to figure this out. I'm trying to avoid using df.iterrows() to iterate through the rows of a dataframe, as it's quite time-consuming, and .loc() is better from what I've seen.
I know this works:
df = df.loc[df.number == 3, :]
And that'll basically set df to be each row where the "number" column is equal to 3.
But, I get an error when I try something like this:
df = df.loc[someFunction(df.number), :]
What I want is to get every row where someFunction() returns True whenever the "number" value of said row is set as the parameter.
For some reason, it's passing the entire column (the dataframe's entire "number" column, in this example) instead of each row's value as it iterates through the rows, like the previous example does.
Again, I know I can just use a for loop and .iterrows(), but I'm working with around 280,000 rows and it just takes longer than I'd like. Also have tried using a lambda function among other things.
Apply is slow. If you can, put the complex vectorization logic into the function itself by taking Series as arguments:
import pandas as pd
df = pd.DataFrame()
df['a'] = [7, 6, 5, 4, 3, 2]
df['b'] = [1, 2, 3, 4, 5, 6]
def my_func(series1, series2):
    # operates on whole Series at once, no per-row Python loop
    return (series2 > 3) | (series1 == series2)
df.loc[my_func(df.b, df.a), 'new_column_name'] = True
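For reference, the result of the snippet above: the mask is True for the first four rows, so only those get the new column; the others are left as NaN.
print(df)
#    a  b new_column_name
# 0  7  1            True
# 1  6  2            True
# 2  5  3            True
# 3  4  4            True
# 4  3  5             NaN
# 5  2  6             NaN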
I think this is what you need:
import pandas as pd
df = pd.DataFrame({"number": [x for x in range(10)]})
def someFunction(row):
    if row > 5:
        return True
    else:
        return False
df = df.loc[df.number.apply(someFunction)]
print(df)
Output:
   number
6       6
7       7
8       8
9       9
You can use an anonymous function with .loc; x refers to the dataframe you are indexing:
df.loc[lambda x: x.number > 5, :]
Two options I can think of (a sketch of both follows below):
1. Create a new column using the pandas apply() method and a function that returns True or False depending on someFunction(). Then use loc to filter on the new column you just created.
2. Use a for loop with df.itertuples(), as it is way faster than iterrows. Make sure to look up the documentation, as the syntax is slightly different for itertuples.
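A minimal sketch of both options (someFunction and the > 5 rule here are stand-ins for your real logic):
import pandas as pd
df = pd.DataFrame({"number": range(10)})
def someFunction(value):
    return value > 5  # stand-in for the real per-value check
# option 1: boolean helper column via apply, filter with loc, drop the helper
df["keep"] = df["number"].apply(someFunction)
option1 = df.loc[df["keep"]].drop(columns="keep")
# option 2: itertuples is much faster than iterrows; fields are attributes
option2 = pd.DataFrame(
    {"number": [row.number for row in df.itertuples(index=False)
                if someFunction(row.number)]}
)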
Something like this will also work:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['number'] = np.arange(10)
display(df[df['number'] > 5])  # display() assumes an IPython/Jupyter session
display(df[(df['number'] > 2) & (df['number'] < 7)])  # combined mask avoids the reindexing warning from chained df[...][...] indexing

pandas apply function with multiple conditions?

If I want to apply a lambda with multiple conditions, how should I do it?
df.train.age.apply(lambda x:0 (if x>=0 and x<500))
Or is there a much better method?
Create a mask and select from your array with the mask; only apply to the result:
mask = (df.train.age >=0) & (df.train.age < 500)
df.train.age[mask].apply(something)
If you just need to set the ones that don't match to zero, that's even easier:
df.train.age[~mask] = 0
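A self-contained illustration of the same pattern (the column name and the 0/500 bounds are from the question; the + 1 transform is just a stand-in), written with .loc to avoid chained-assignment warnings:
import pandas as pd
df = pd.DataFrame({"age": [-10, 10, 250, 600]})
mask = (df["age"] >= 0) & (df["age"] < 500)
# only the in-range values reach the (possibly expensive) function
df.loc[mask, "age"] = df.loc[mask, "age"].apply(lambda x: x + 1)
df.loc[~mask, "age"] = 0  # zero out everything that doesn't match
print(df)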
Your syntax needs to have an else:
df.train.age.apply(lambda x: 0 if x >= 0 and x < 500 else x)  # put whatever you want for the non-matching case in the else
This is a good way to do it. The same result can be obtained without apply by using np.where, as below.
import numpy as np
import pandas as pd
df = pd.DataFrame({'age': [-10, 10, 20, 30, 40, 100, 110]})
df['age'] = np.where((df['age'] >= 100) | (df['age'] < 0), 0, df['age'])
df
If anything in the code above is unclear, please post your sample dataframe and I'll update my answer.

Is there an equivalent Python function similar to complete.cases in R

I am removing a number of records in a pandas data frame which contain diverse combinations of NaN within its four columns. I have created a function called complete_cases to provide the indexes of rows that meet the following condition: all columns in the row are NaN.
This is the function I have tried:
def complete_cases(dataframe):
    indx = [x for x in list(dataframe.index)
            if dataframe.loc[x, :].isna().sum() == len(dataframe.columns)]
    return indx
I am wondering whether this is optimal enough or whether there is a better way to do it.
Absolutely. All you need to do is
df.dropna(axis = 0, how = 'any', inplace = True)
This will remove all rows that have at least one missing value and update the data frame in place.
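Note that how='any' drops rows with at least one NaN. To drop only the rows where every column is NaN, which is the condition complete_cases in the question actually checks, use how='all':
df.dropna(axis=0, how='all', inplace=True)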
I'd recommend using loc, isna, and any with the 'columns' axis, like this:
df.loc[df.isna().any(axis='columns')]
This way you'll filter only the rows that contain NaN.
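Note that the line above keeps the rows with missing values. If you want the opposite, i.e. only the fully observed rows (what complete.cases selects in R), negate the mask:
df.loc[~df.isna().any(axis='columns')]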
A possible solution:
1. Count the number of columns with NA, saving the count in a new column.
2. Based on this new column, filter the rows of the data frame as you wish.
3. Remove the (now) unnecessary column.
It is possible to do this with a lambda function. For example, if you want to remove rows that have 10 NA values:
df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['count'] != 0]  # note the bracket access: df.count is a DataFrame method
del df['count']

Get all previous values for every row

I'm about to write a backtesting tool, so for every row I'd like to have access to all of the dataframe up to that row. In the following example I do it from a fixed index using a loop. I'm wondering if there is a better solution.
import numpy as np
import pandas as pd
N = 10  # example length
df = pd.DataFrame({"a": np.arange(N)})
for i in range(3, N):
    print(df["a"][:i].values)
UPDATE (toy example)
I need to apply a custom function to all the previous values. As a toy example I will use the sum of the squares of all previous values.
def toyFun(v):
    return np.sum(v**2)

res = np.empty(N)
res[:] = np.nan
for i in range(3, N):
    res[i] = toyFun(df["a"][:i].values)
df["res"] = res
If you are indexing rows for a particular column, say 'a', you can use the .iloc indexer (integer-location based) to index into the column.
df = pd.DataFrame({'a': [1,2,3,4]})
print(df.a.iloc[:2]) # get first two values
So, you can do:
for i in range(3, 10):
    print(df.a.iloc[:i])
The best way is to use a temporary series with the cumulative results, so that you are not re-calculating everything:
df["a"].apply(lambda x: x**2).cumsum()
Then re-index as you wish:
res[3:] = df["a"].apply(lambda x: x**2).cumsum()[2:N-1].values
or assign it directly to the dataframe.
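Putting that together, a minimal end-to-end sketch (N = 10 is just an example length; df["a"] ** 2 is the vectorized equivalent of the apply above, and the slice shift makes res[i] depend only on rows before i):
import numpy as np
import pandas as pd
N = 10  # example length
df = pd.DataFrame({"a": np.arange(N)})
csum = (df["a"] ** 2).cumsum()  # running sum of squares, inclusive of each row
res = np.full(N, np.nan)
res[3:] = csum.values[2:N - 1]  # res[i] = sum of squares of rows 0..i-1
df["res"] = res
print(df)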

Pandas: return dataframe where one column's values are greater than another

Apologies if this is a duplicate but I can't seem to find a working example in the pandas docs, SO or google.
How do you return a dataframe where the values of one column are greater than the values of another?
Should be something like this: df['A'].where(df['A'] > df['B']), but this returns only a vector. I need the full filtered dataframe.
Try using query:
df.query('A > B')
Consider df:
import numpy as np
import pandas as pd
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(10, 2), columns=list('AB'))
df
Option 1:
df.query('A > B')
Option 2:
df[df.A.gt(df.B)]
df['A'].where(df['A'] > df['B']) is essentially a mask in pandas syntax. Instead of where, take a subset of the dataframe directly:
df[df['A'] > df['B']]
