Drop entire row subject to value in column - python

I am facing a problem with a rather simple command. I have a DataFrame and want to delete a row if the value in column1 for that row exceeds, e.g., 5.
First step, the if-condition:
if df['column1'] > 5:
Using this command, I always get the following ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
Do you have an idea what this could be about?
Second step (drop row):
How do I tell Python to delete the entire row? Do I have to work with a loop, or is there a simple solution such as df.drop(df.index[?])?
I am still rather inexperienced with Python and would appreciate any support and suggestions!

The reason you're getting the error is that df['column1'] > 5 returns a Series of booleans, equal in length to column1, and a Series can't be reduced to a single true or false, i.e. "The truth value of a Series is ambiguous".
That said, if you just need to select the rows fulfilling a specific condition, you can use the returned Series as a boolean index, for example:
>>> from numpy.random import randn
>>> from pandas import DataFrame
# Create a DataFrame of 10 rows by 5 columns
>>> D = DataFrame(randn(10,5))
>>> D
          0         1         2         3         4
0  0.686901  1.714871  0.809863 -1.162436  1.757198
1 -0.071436 -0.898714  0.062620  1.443304 -0.784341
2  0.597807 -0.705585 -0.019233 -0.552494 -1.881875
3  1.313344 -1.146257  1.189182  0.169836 -0.186611
4  0.081255 -0.168989  1.181580  0.366820  2.999468
5 -0.221144  1.222413  1.199573  0.988437  0.378026
6  1.481952 -2.143201 -0.747700 -0.597314  0.428769
7  0.006805  0.876228  0.884723 -0.899379 -0.270513
8 -0.222297  1.695049  0.638627 -1.500652 -1.088818
9 -0.646145 -0.188199 -1.363282 -1.386130  1.065585
# Making a comparison test against a whole column yields a boolean Series
>>> D[2] >= 0
0     True
1     True
2    False
3     True
4     True
5     True
6    False
7     True
8     True
9    False
Name: 2, dtype: bool
# Which can be used directly to select rows, like so
>>> D[D[2] >= 0]
# note rows 2, 6 and 9 are now missing
          0         1         2         3         4
0  0.686901  1.714871  0.809863 -1.162436  1.757198
1 -0.071436 -0.898714  0.062620  1.443304 -0.784341
3  1.313344 -1.146257  1.189182  0.169836 -0.186611
4  0.081255 -0.168989  1.181580  0.366820  2.999468
5 -0.221144  1.222413  1.199573  0.988437  0.378026
7  0.006805  0.876228  0.884723 -0.899379 -0.270513
8 -0.222297  1.695049  0.638627 -1.500652 -1.088818
# If you want, you can make a new DataFrame out of the result
>>> N = D[D[2] >= 0]
>>> N
          0         1         2         3         4
0  0.686901  1.714871  0.809863 -1.162436  1.757198
1 -0.071436 -0.898714  0.062620  1.443304 -0.784341
3  1.313344 -1.146257  1.189182  0.169836 -0.186611
4  0.081255 -0.168989  1.181580  0.366820  2.999468
5 -0.221144  1.222413  1.199573  0.988437  0.378026
7  0.006805  0.876228  0.884723 -0.899379 -0.270513
8 -0.222297  1.695049  0.638627 -1.500652 -1.088818
For more, see the pandas docs on boolean indexing. Note that the dot syntax for column selection used in the docs only works for column names that are valid Python identifiers, so in the example above D[D.2 >= 0] wouldn't work.
If you actually need to remove rows rather than just select them, bear in mind that pandas tries its level best to do most things by reference, to avoid copying huge chunks of memory around, so the result of a selection may share data with the original; take an explicit copy if you need an independent DataFrame.
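For the original question (dropping rows where column1 exceeds 5), a minimal sketch of the two usual approaches, assuming a DataFrame df with a column named 'column1':
import pandas as pd
df = pd.DataFrame({'column1': [1, 7, 3, 9, 2]})
# Option 1: keep only the rows you want; .copy() gives an independent frame
kept = df[df['column1'] <= 5].copy()
# Option 2: drop the offending rows by index, as the question suggested
dropped = df.drop(df[df['column1'] > 5].index)
Both leave only the rows with column1 <= 5.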

Related

Python: using rolling + apply with a function that requires 2 columns as arguments in pandas

I have a dataframe (df) with 2 columns:
Out[2]:
    0   1
0   1   2
1   4   5
2   3   6
3  10  12
4   1   2
5   4   5
6   3   6
7  10  12
I would like to calculate, for every element of df[0], a function of itself and of the df[1] column:
def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res
I get the following error:
TypeError: ("'numpy.float64' object is not callable", u'occurred at index 0')
Here is the full code:
from __future__ import division
import pandas as pd
import sys
from scipy import stats
def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res
df= pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
df['perc']=df.rolling(3).apply(custom_fct_2(df[0],df[1]))
Can someone help me with that? (I am new to Python.)
Out[2]:
    0   1
...
5   4   5
6   3   6
7  10  12
I want the percentile ranking of [10] in [12,6,5]
I want the percentile ranking of [3] in [6,5,2]
I want the percentile ranking of [4] in [5,2,12]
...
The problem here is that the rolling().apply() function cannot give you a window of 3 rows across all the columns at once. Instead, it hands your function a Series for column 0 first, then another for column 1.
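A quick way to see this column-at-a-time behaviour (a small sketch, not part of the original post):
import pandas as pd
df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12]])
# The lambda is called once per window per column: it receives a
# one-dimensional Series holding a single column's rolling window.
print(df.rolling(3).apply(lambda s: s.iloc[-1]))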
Maybe there are better solutions, but here is one that at least works.
df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])

def custom_fct_2(s):
    score = df[0][s.index.values[1]]  # use .values[-1] if you want the last element
    a = s.values
    return stats.percentileofscore(a, score)
I'm using the same data you provided, but I modified your custom_fct_2() function. Here s is a Series of 3 rolling values from column 1. Fortunately, this Series carries its index, so we can get the score from column 0 via the "middle" index of the window. BTW, in Python [-1] means the last element of a collection, but from your explanation I believe you actually want the middle one.
Then, apply the function.
# remove the shift() function if you want the value align to the last value of the rolling scores
df['prec'] = df[1].rolling(3).apply(custom_fct_2).shift(periods=-1)
The shift function is optional. Whether you need it depends on whether prec should be aligned with the middle row of the window (whose column 0 value supplies the score) or with the last row of the rolling window over column 1. I would assume you need it.
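Putting the pieces together, a complete runnable version of this approach (same data and function as above) would look roughly like:
from scipy import stats
import pandas as pd
df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
def custom_fct_2(s):
    # the score comes from column 0, at the middle index of the rolling window
    score = df[0][s.index.values[1]]
    return stats.percentileofscore(s.values, score)
# shift(-1) aligns each result with the middle row of its window
df['prec'] = df[1].rolling(3).apply(custom_fct_2).shift(periods=-1)
print(df)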

Check if dataframe value +/- 1 exists anywhere else in a given column

Let's say I have a dataframe df that looks like this:
   irrelevant  location
0           1         0
1           2         0
2           3         1
3           4         3
How do I create a new true/false column "neighbor" that indicates whether the value in "location", plus or minus 1, exists anywhere else in the "location" column? Such that:
   irrelevant  location  neighbor
0           1         0      True
1           2         0      True
2           3         1      True
3           4         3     False
The last row would be False, because neither 2 nor 4 appears anywhere in the df.location column.
I've tried these:
>>> df['neighbor']=np.where((df.location+1 in df.location.unique())|(df.location-1 in df.x.unique()), True, False)
ValueError: Lengths must match to compare
>>> df['tmp']=np.where((df.x+1 in df.x.tolist())|(df.x-1 in df.x.tolist()), 'true', 'false')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Alternatively, thanks in advance for help directing me to earlier instances of this question being asked (I don't seem to have the right vocabulary to find them).
To find a neighbor anywhere in the column, create an array of all neighbor values and then check membership with isin.
import numpy as np
vals = np.unique([df.location+1, df.location-1])
#array([-1, 0, 1, 2, 4], dtype=int64)
df['neighbor'] = df['location'].isin(vals)
#    irrelevant  location  neighbor
# 0           1         0      True
# 1           2         0      True
# 2           3         1      True
# 3           4         3     False
Just because, this is also possible with pd.merge_asof, setting a tolerance to find the neighbors. We assign a value of True, which is brought in by the merge if a neighbor exists; otherwise the column is left NaN, which we fill with False after the merge. (Note that merge_asof requires both inputs to be sorted on the merge key, which is already the case here.)
(pd.merge_asof(df,
               df[['location']].assign(neighbor=True),
               on='location',
               allow_exact_matches=False,  # Don't match with same value
               direction='nearest',        # Either direction
               tolerance=1)                # Within 1, inclusive
 .fillna(False))
You just need a little fix:
df['neighbor'] = np.where((df['location']+1).isin(df['location'].unique())
                          | (df['location']-1).isin(df['location'].unique()), True, False)
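Since the isin comparisons already produce a boolean Series, the np.where wrapper is optional; an equivalent, slightly shorter form would be:
df['neighbor'] = ((df['location'] + 1).isin(df['location'])
                  | (df['location'] - 1).isin(df['location']))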

Why is .loc slicing in pandas inclusive of stop, contrary to typical python slicing?

I am slicing a pandas dataframe and I seem to be getting unexpected slices using .loc, at least as compared to numpy and ordinary python slicing. See the example below.
>>> import pandas as pd
>>> a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
>>> a
    0   1   2
0   0   1   2
1   3   4   5
2   4   5   6
3   9  10  11
4  34   2   1
>>> a.loc[1:3, :]
   0   1   2
1  3   4   5
2  4   5   6
3  9  10  11
>>> a.values[1:3, :]
array([[3, 4, 5],
       [4, 5, 6]])
Interestingly, this only happens with .loc, not .iloc.
>>> a.iloc[1:3, :]
   0  1  2
1  3  4  5
2  4  5  6
Thus, .loc appears to be inclusive of the terminating index, but numpy and .iloc are not.
By the comments, it seems this is not a bug and we are well warned. But why is it the case?
Remember .loc is primarily label-based indexing. The decision to include the stop endpoint becomes far more obvious when working with a non-RangeIndex:
df = pd.DataFrame([1,2,3,4], index=list('achz'))
#    0
# a  1
# c  2
# h  3
# z  4
If I want to select all rows between 'a' and 'h' (inclusive), I only know about 'a' and 'h'. To be consistent with other Python slicing, I would also need to know which index follows 'h'; here that is 'z', but it could have been anything.
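For example, with the frame above, slicing by label picks up everything from 'a' through 'h' without needing to know what comes after 'h':
df.loc['a':'h']
#    0
# a  1
# c  2
# h  3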
There's also a section of the documentation, somewhat hidden away, that explains this design choice: Endpoints are inclusive.
In addition to the point in the docs, pandas slice indexing with .loc is not cell-index based. It is in fact value-based indexing (the pandas docs call it "label based", but for numerical data I prefer the term "value based"), whereas .iloc does traditional numpy-style cell indexing. Furthermore, value-based indexing is right-inclusive, whereas cell indexing is not. Just try the following:
a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
a.index = [0, 1, 2, 3.1, 4]  # add a float index
# value-based slicing: outputs all values up to and including the slice value
a.loc[1:3.1]
# Out:
#      0   1   2
# 1.0  3   4   5
# 2.0  4   5   6
# 3.1  9  10  11
# cell-based slicing: raises an error, since only integers are allowed
a.iloc[1:3.1]
# Out: TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Float64Index'> with these indexers [3.1] of <class 'float'>
To give an explicit answer to the question of why it is right-inclusive: when using values or labels as indices, it is, at least in my opinion, intuitive that the last index is included. As far as I know, this is a deliberate design decision of how the function is meant to work.
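A common case where this inclusivity feels natural is slicing a DatetimeIndex by label (a small illustrative sketch, not from the original answer):
import pandas as pd
s = pd.Series(range(5), index=pd.date_range('2021-01-01', periods=5, freq='D'))
# both endpoint labels are included in the result
s.loc['2021-01-02':'2021-01-04']
# 2021-01-02    1
# 2021-01-03    2
# 2021-01-04    3
# Freq: D, dtype: int64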

Pandas Vectorization with Function on Parts of Column

So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2,3], [5,7,8], [2,5,4]])
   0  1  2
0  1  2  3
1  5  7  8
2  2  5  4
I then have a function called add5 that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization, not apply, as this will be expanded to a dataset with hundreds of thousands of entries and speed will be important. Without the greater-than-3 constraint I can do it like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
   0  1  2   3
0  1  2  3   3
1  5  7  8  13
2  2  5  4   9
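A third option along the same lines is pd.Series.where, which keeps values where the condition holds and substitutes the alternative elsewhere:
# keep values <= 3 unchanged, add 5 to the rest
df1[3] = df1[2].where(df1[2] <= 3, df1[2] + 5)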

Python dataframe check if a value in a column dataframe is within a range of values reported in another dataframe

Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.
I have two dataframes, and I need to add a column to the first dataframe that is True if a certain value of the first dataframe lies between two values reported in the second dataframe, and False otherwise.
for example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
   code1  code2
0      1     10
1      1     22
2      2     15
3      2     15
4      3      7
5      1    130
6      1      2
second_df
   code1  code2_end  code2_start
0      1         15            5
1      1         25           20
2      2         20           11
3      2         20           11
4      3         10            5
5      1        120          110
6      1        230          220
For each row in the first dataframe I need to check whether the value in the code2 column falls within one of the ranges identified by the rows of second_df. For example:
in row 1 of first_df code1=1 and code2=22
checking second_df, I have 4 rows with code1=1 (rows 0, 1, 5 and 6); the value code2=22 is in the interval identified by code2_start=20 and code2_end=25, so the function should return True.
Considering an example where the function should return False: in row 5 of first_df, code1=1 and code2=130, but there is no interval containing 130 where code1=1.
I have tried to use this function
def check(first_df, second_df):
    for i in range(len(first_df)):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy with any input you could provide.
Thanks,
A.
As a practical example:
first_df.code1[0] = 1
therefore I need to search on second_df all the istances where
second_df.code1 == first_df.code1[0]
0     True
1     True
2    False
3    False
4    False
5     True
6     True
for the instances 0,1,5,6 where the status is True I need to check if the value
first_df.code2[0]
10
is between one of the range identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
   code2_start  code2_end
0            5         15
1           20         25
5          110        120
6          220        230
since the value of first_df.code2[0] is 10, it falls between 5 and 15, the range identified by row 0, so my function should return True. In the case of first_df.code1[6] the value would still be 1, so the range table would be the same as above, but first_df.code2[6] is 2, and there is no interval containing 2, so the result should be False.
first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
This works because when you do something like second_df.code2_start <= first_df.code2, you get a boolean Series. If you then take the logical AND of two of these boolean Series, you get a Series that is True where both Series were True and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
   1  2   3  output
0  2  4   6    True
1  3  6   9    True
2  4  8  10    True
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd

# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])

# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()

# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange which takes as input x, a single row of df_1, and code_range, the entire df_2 DataFrame. It first finds the rows of code_range that have the same c1 value as the given row, x.c1; the non-matching rows are then discarded. This is done in the first two lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care whether x.c2 falls within at least one of the ranges, we return the value of check.any(). Calling any() on a boolean Series returns True if any of the values in the Series are True.
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
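Running the sketch above on the sample data, the expected result would be:
print(df_1)
#    c1  c2  output
# 0   1   5    True
# 1   1  10    True
# 2   1  20   False
# 3   2   8    True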
