check if string is contained within cell value

I have a pandas DataFrame:
df = pd.DataFrame({"Code": [77000581079, 77000458432, 77000458433, 77000458434, 77000691973], "Description": ['16/06/2009ø01/08/2009', 'ø16/06/2009:Date Breakpoint', '16/06/2009ø01/08/2009:Date Breakpoint', '01/08/2009ø:Date Breakpoint', '01/08/2009ø:Date Breakpoint']})
I want to check whether Description contains the string 16/06/2009ø01/08/2009:Date Breakpoint.
If it does, I want to append -A to the Code.
Expected output:
            Code                            Description
0  77000581079-A  16/06/2009ø01/08/2009:Date Breakpoint
1    77000458432            ø16/06/2009:Date Breakpoint
2  77000458433-A  16/06/2009ø01/08/2009:Date Breakpoint
3    77000458434            01/08/2009ø:Date Breakpoint
4    77000691973            01/08/2009ø:Date Breakpoint
Using:
for row in df['Description']:
    if df['Description'].str.contains('16/06/2009ø01/08/2009:Date Breakpoint'):
        print(row)
    else:
        pass
I get ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
I've tried:
for row in df['Description']:
    if df['Description'].str.contains('16/06/2009ø01/08/2009:Date Breakpoint').all():
        print(row)
    else:
        pass
But still no joy. I've read some docs on this error, but I'm a bit confused about its meaning.
Is there a better way to achieve my desired outcome?

Let us try str.contains
df.Code = df.Code.astype(str)
df.loc[df.Description.str.contains('16/06/2009ø01/08/2009:Date Breakpoint'),'Code'] += '-A'
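For reference, here is a self-contained sketch of the same approach applied to the frame from the question. The `target` and `mask` names and the `regex=False` argument are additions for illustration (the answer relies on the default regex matching, which behaves the same for this particular string).

```python
import pandas as pd

df = pd.DataFrame({
    "Code": [77000581079, 77000458432, 77000458433, 77000458434, 77000691973],
    "Description": [
        "16/06/2009ø01/08/2009",
        "ø16/06/2009:Date Breakpoint",
        "16/06/2009ø01/08/2009:Date Breakpoint",
        "01/08/2009ø:Date Breakpoint",
        "01/08/2009ø:Date Breakpoint",
    ],
})

target = "16/06/2009ø01/08/2009:Date Breakpoint"

# Code holds integers, so convert to strings before appending a suffix.
df["Code"] = df["Code"].astype(str)

# regex=False (an addition here) treats the target as a literal substring.
mask = df["Description"].str.contains(target, regex=False)

# Append the suffix only on rows where the mask is True.
df.loc[mask, "Code"] += "-A"
print(df)
```

With the sample frame exactly as posted, only row 2 contains the full target string, so only that code gains the suffix.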

You need to write your condition like this:
if df['Description'].str.contains('16/06/2009ø01/08/2009:Date Breakpoint').sum() > 0:
    print(row)
Without sum, your condition returns a boolean Series (one True or False per row), so its truth value cannot be evaluated directly. Summing it counts the True values, so the result is positive (> 0) as soon as a single True is present. This is what you get without sum:
>>> df['Description'].str.contains('16/06/2009ø01/08/2009:Date Breakpoint')
0    False
1    False
2     True
3    False
4    False
Name: Description, dtype: bool
>>> df['Description'].str.contains('16/06/2009ø01/08/2009:Date Breakpoint').sum()
1
or just use any:
if df['Description'].str.contains('16/06/2009ø01/08/2009:Date Breakpoint').any():
    print(row)
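To make the difference between the reductions concrete, the following sketch compares `.any()`, `.all()`, and using the boolean Series as a row filter. The variable names are illustrative, and `regex=False` is an addition; the filter at the end is one possible way to print only the matching rows, not something the answer above prescribes.

```python
import pandas as pd

descriptions = pd.Series(
    [
        "16/06/2009ø01/08/2009",
        "ø16/06/2009:Date Breakpoint",
        "16/06/2009ø01/08/2009:Date Breakpoint",
        "01/08/2009ø:Date Breakpoint",
        "01/08/2009ø:Date Breakpoint",
    ],
    name="Description",
)
target = "16/06/2009ø01/08/2009:Date Breakpoint"

# One boolean per row; this is what `if` cannot evaluate on its own.
mask = descriptions.str.contains(target, regex=False)

has_any = bool(mask.any())  # at least one row matches
has_all = bool(mask.all())  # every row matches

# Boolean indexing keeps only the rows where the mask is True.
matching_rows = descriptions[mask].tolist()
for row in matching_rows:
    print(row)
```

Note that `.any()` inside the original loop would print every row whenever at least one row matches, whereas the filter prints only the rows that actually contain the string.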

Related

Python, comparing dataframe rows, adding new column - Truth Value Error?

I am quite new to Python, and somewhat stuck here.
I just want to compare floats with a previous or forward row in a dataframe, and mark a new column accordingly. FWIW, I have 6000 rows to compare. I need to output a string or int as the result.
My Python code:
for row in df_an:
    x = df_an['mid_h'].shift(1)
    y = df_an['mid_h']
    if y > x:
        df_an['bar_type'] = 1
I get the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The x and y variables are generated, but apparently things go wrong with the if y > x: statement.
Any advice much appreciated.
A different approach...
I managed to implement the suggested .gt operator.
df_an.loc[
    df_an.mid_h.gt(df_an.mid_h.shift()) & df_an.mid_l.gt(df_an.mid_l.shift()),
    "bar_type",
] = UP

Instead of looping over rows, shift the whole column and compare it with the original; try this:
df_an = pd.DataFrame({"mid_h":[1,3,5,7,7,5,3]})
df_an['bar_type'] = df_an.mid_h.gt(df_an.mid_h.shift())
# Output
   mid_h  bar_type
0      1     False
1      3      True
2      5      True
3      7      True
4      7     False
5      5     False
6      3     False
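Since the question asks for a string or int result rather than booleans, one possible follow-up is to convert the comparison with `np.where`. This is a sketch; the labels 1 and 0 are assumptions, as the question only shows the value 1.

```python
import numpy as np
import pandas as pd

df_an = pd.DataFrame({"mid_h": [1, 3, 5, 7, 7, 5, 3]})

# Compare each value with the previous row; the first row compares
# against NaN, which evaluates to False.
is_higher = df_an["mid_h"].gt(df_an["mid_h"].shift())

# Assumed labels: 1 where the value rose, 0 otherwise.
df_an["bar_type"] = np.where(is_higher, 1, 0)
print(df_an)
```

String labels work the same way, for example `np.where(is_higher, "up", "not_up")`.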

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). - rolling sum, using & does not resolve it

I have a problem with this line:
if (value[1]['Longs']==1.0) & (self.df['Long_Market'].rolling(20).sum()==0):
    self.long_market=1
I want it to prevent the code from opening too many long positions, but I get the error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I would love to know whether there is a function that checks the previous 20 cells for a specified value (1) and, if there is one, moves on to another.
Here's the whole function:
def generate_signals(self):
    self.df['Z-Score']=(self.df['Spread']-self.df['Spread'].rolling(window=self.ma).mean())/self.df['Spread'].rolling(window=self.ma).std()
    self.df['Prior Z-Score']=self.df['Z-Score'].shift(1)
    self.df['Longs']=(self.df['Z-Score']<=self.floor)*1.0
    self.df['Shorts']=(self.df['Z-Score']>=self.ceiling)*1.0
    self.df['Exit']=(np.abs(self.df['Z-Score'])<=self.exit_zscore)*1.0
    self.df['Long_Market']=0.0
    self.df['Short_Market']=0.0
    self.long_market=0
    self.short_market=0
    for i,value in enumerate(self.df.iterrows()):
        if (value[1]['Longs']==1.0) & (self.df['Long_Market'].rolling(20).sum()==0):
            self.long_market=1

Problem
Your problem seems to be your if-statement. An if-statement requires an expression that can be evaluated as either True or False. Instead, what you are supplying is a Series of True or False values.
The expression you are using consists of two parts:
part1 = (value[1]['Longs']==1.0): here everything seems fine, as it produces a single True or False.
part2 = (self.df['Long_Market'].rolling(20).sum()==0): here is the problem.
Explanation
To see why part2 is the problem, note that self.df['Long_Market'].rolling(20).sum() produces a series. With the following dummy data, it looks like this:
df = pd.DataFrame({'a': [1,2,3,4,5,6,7], 'b': [4,1,5,6,2,5,56]})
test = df['a'].rolling(3).sum()
>>> test
# 0 NaN
# 1 NaN
# 2 6.0
# 3 9.0
# 4 12.0
# 5 15.0
# 6 18.0
If you compare this to a single value, you get this:
>>> test == 12
# 0 False
# 1 False
# 2 False
# 3 False
# 4 True
# 5 False
# 6 False
If part1 is a single value, and part2 is a series of values, then part1 & part2 will be a series of values again:
>>> part1 & (test == 12)
# 0 False
# 1 False
# 2 False
# 3 False
# 4 True
# 5 False
# 6 False
And now we get to the actual problem that I briefly introduced at the beginning of this post: you are supplying a pandas Series as the if-clause, but the if-clause requires a single value! What the error message is telling you is that Python is trying to evaluate your Series as a single value (as required by the if-clause), but doesn't know how to do that without further instructions.
Solution
If you are asking how to check if there are any zeroes in your rolling sum, then I recommend you replace part2 with:
(self.df['Long_Market'].rolling(20).sum() == 0).any()
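Note that `.any()` reduces over the whole column. If the intent is instead a per-row look-back ("open a long only if none was opened in the previous N rows"), a scalar check inside the loop may be closer. The sketch below rests on that assumed intent: the sample data, the shortened `window` of 3 (the question uses 20), and writing 1.0 into `Long_Market` are all illustrative and not taken from the original function.

```python
import pandas as pd

window = 3  # the question uses 20; shortened to suit the tiny sample

df = pd.DataFrame({
    "Longs": [1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    "Long_Market": 0.0,
})

for i in range(len(df)):
    # Slice only the rows before the current one (at most `window` of them).
    recent = df["Long_Market"].iloc[max(0, i - window):i]

    # Summing the slice gives one number, so the comparison is a single
    # True/False that `if` can evaluate.
    no_recent_long = recent.sum() == 0

    if df["Longs"].iloc[i] == 1.0 and no_recent_long:
        df.loc[df.index[i], "Long_Market"] = 1.0

print(df)
```

Because both sides of the condition are scalars here, plain `and` can be used instead of `&`.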

Python-pandas: the truth value of a series is ambiguous

I am currently trying to compare values from a JSON file (which I can already work with) to values from a CSV file (which might be the issue). My current code looks like this:
for data in trades['timestamp']:
    data = pd.to_datetime(data)
    print(data)
    if data == ask_minute['lastUpdated']:
        #....'do something'
Which gives:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
My current print(data) looks like this:
2018-10-03 18:03:38.067000
2018-10-03 18:03:38.109000
2018-10-03 18:04:28
2018-10-03 18:04:28.685000
However, I am still unable to compare the timestamps from my CSV file to those of my JSON file. Does someone have an idea?

Let's reduce it to a simpler example. By doing for instance the following comparison:
3 == pd.Series([3,2,4,1])
0 True
1 False
2 False
3 False
dtype: bool
The result you get is a Series of booleans, equal in size to the pd.Series on the right-hand side of the expression. So really what's happening here is that the integer is being broadcast across the Series, and then they are compared. So when you do:
if 3 == pd.Series([3,2,4,1]):
    pass
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You get an error. The problem here is that you are comparing a pd.Series with a value, so the result can contain both True and False values, as in the case above. This of course is ambiguous, since the condition as a whole is neither True nor False.
So you need to further aggregate the result so that a single boolean value results from the operation. For that you'll have to use either any or all depending on whether you want at least one (any) or all values to satisfy the condition.
(3 == pd.Series([3,2,4,1])).all()
# False
or
(3 == pd.Series([3,2,4,1])).any()
# True
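Applied to the timestamps in the question, the same idea might look like the sketch below. The sample values in `trades` and `ask_minute` are made up for illustration, and the vectorised `isin` variant at the end is an alternative that the answer above does not mention.

```python
import pandas as pd

# Hypothetical sample data standing in for the JSON and CSV contents.
trades = pd.DataFrame({
    "timestamp": ["2018-10-03 18:03:38.067000", "2018-10-03 18:04:28.685000"],
})
ask_minute = pd.DataFrame({
    "lastUpdated": pd.to_datetime(
        ["2018-10-03 18:04:28.685000", "2018-10-03 18:05:00.000000"]
    ),
})

matches = []
for raw in trades["timestamp"]:
    ts = pd.to_datetime(raw)
    # Series == scalar gives a boolean Series; .any() reduces it to one bool.
    if (ask_minute["lastUpdated"] == ts).any():
        matches.append(ts)

# Loop-free alternative: one boolean per trade timestamp.
trade_times = pd.to_datetime(trades["timestamp"])
in_ask = trade_times.isin(ask_minute["lastUpdated"])
print(matches, in_ask.tolist())
```

Both variants only find exact matches, so timestamps that differ by a fraction of a second will not be paired.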

The problem I see is that even when you are evaluating a single row of a DataFrame, pandas does not assume that you want the only row that exists, because a DataFrame can hold many rows. You have to select it explicitly. The way I solved it was like this:
if data.iloc[0] == ask_minute['lastUpdated']:
so that the code knows you are selecting the one row that exists.

How to change values in a column based on a function applied on two other columns

I have a DataFrame DF that looks like this:
DF:
match_id  team  teamA_Win  Outcome
       1     A       True     None
       2     B       True     None
       3     A      False     None
The Outcome column in this DataFrame is filled with the string 'None'.
What I want is to change the string in Outcome to either 'Win' or 'Loss' based on the values in team and teamA_Win.
For example, if team == A and teamA_Win is True, then the outcome should be Win. However, if team == A and teamA_Win is False, then the outcome is Loss. Similarly, if team == B and teamA_Win is True, then the outcome should be Loss.
I created the following function:
def win(x):
    if (x['team']=='A')& (x['teamA_win']==True):
        x['outcome']='Win'
    elif ((x['team']=='A')& (x['teamA_win']==False)):
        x['outcome']='Loss'
    elif ((x['team']=='B')& (x['teamA_win']==True)):
        x['outcome']='Loss'
    elif ((x['team']=='B')& (x['teamA_win']==False)):
        x['outcome']='Win'
However, when I invoke win(DF), I get the error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any idea how to fix this? Or is there a simpler way to approach this situation?

Or, as a two-liner: set the 'Outcome' column to False, then use loc to set it to True wherever the 'team' column equals 'teamA_Win' mapped to a team name (True becomes 'A', False becomes 'B'). Note that this yields True/False rather than the strings 'Win'/'Loss':
df['Outcome']=False
df.loc[df['team']==df['teamA_Win'].map({True:'A',False:'B'}),'Outcome']=True
Output:
   match_id  team  teamA_Win  Outcome
0         1     A       True     True
1         2     B       True    False
2         3     A      False    False

You can use np.select, which lets you define your conditions and the value to use for each of them, like this:
import pandas as pd
import numpy as np
def win(x):
    conditions = [
        (x['team']=='A') & (x['teamA_win']==True),
        (x['team']=='A') & (x['teamA_win']==False),
        (x['team']=='B') & (x['teamA_win']==True),
        (x['team']=='B') & (x['teamA_win']==False)]
    choices = ['Win', 'Loss', 'Loss', 'Win']
    # a string default keeps the result dtype consistent with the string choices
    x['outcome'] = np.select(conditions, choices, default='None')
Hope it helps.
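A runnable version of this idea against the table from the question could look like the sketch below. It uses the column names as they appear in the table (`teamA_Win`, `Outcome`), which differ in case from the names used in the function above; the intermediate variable names are illustrative.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({
    "match_id": [1, 2, 3],
    "team": ["A", "B", "A"],
    "teamA_Win": [True, True, False],
})

team_is_a = DF["team"] == "A"
team_is_b = DF["team"] == "B"
a_won = DF["teamA_Win"]

# Each condition is a boolean Series; np.select picks, per row, the
# choice belonging to the first condition that is True.
conditions = [
    team_is_a & a_won,
    team_is_a & ~a_won,
    team_is_b & a_won,
    team_is_b & ~a_won,
]
choices = ["Win", "Loss", "Loss", "Win"]

# Rows matching none of the conditions keep the placeholder string.
DF["Outcome"] = np.select(conditions, choices, default="None")
print(DF)
```

No `if` is involved, so the Series are never asked for a single truth value.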

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()?

I'm new to pandas, and I recently got this ValueError when trying to modify columns according to some rules:
csv_input = pd.read_csv(fn, error_bad_lines=False)
if csv_input['ip.src'] == '192.168.1.100':
    csv_input['flow_dir'] = 1
    csv_input['ip.src'] = 1
    csv_input['ip.dst'] = 0
else:
    if csv_input['ip.dst'] == '192.168.1.100':
        csv_input['flow_dir'] = 0
        csv_input['ip.src'] = 0
        csv_input['ip.dst'] = 1
I searched for this error and I guess it is caused by the 'if' statement and the '==' operator, but I don't know how to fix it.
Thanks!

So Andrew L's comment is correct, but I'm going to expand on it a bit for your benefit.
When you call, e.g.
csv_input['ip.dst'] == '192.168.1.100'
What this returns is a Series, with the same index as csv_input, but all the values in that series are boolean, and represent whether the value in csv_input['ip.dst'] for that row is equal to '192.168.1.100'.
So, when you call
if csv_input['ip.dst'] == '192.168.1.100':
You're asking whether that Series evaluates to True or False. Hopefully that explains what is meant by "The truth value of a Series is ambiguous": it's a Series, so it can't be boiled down to a single boolean.
Now, what it looks like you're trying to do is set the values in the flow_dir, ip.src and ip.dst columns, based on the values in the ip.src and ip.dst columns.
The correct way to do this would be with .loc[], something like this:
# equivalent to the first if statement
csv_input.loc[
    csv_input['ip.src'] == '192.168.1.100',
    ('ip.src','ip.dst','flow_dir')
] = (1,0,1)
# equivalent to the second if statement
csv_input.loc[
    csv_input['ip.dst'] == '192.168.1.100',
    ('ip.src','ip.dst','flow_dir')
] = (0,1,0)
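The following sketch runs the same `.loc[]` pattern on a small, made-up frame. The sample addresses, the `LOCAL_IP` constant, the precomputed masks, and the up-front creation of `flow_dir` are assumptions added for illustration; columns are given as a list rather than a tuple.

```python
import pandas as pd

LOCAL_IP = "192.168.1.100"

# Hypothetical rows standing in for the CSV contents.
csv_input = pd.DataFrame({
    "ip.src": pd.Series(["192.168.1.100", "10.0.0.5", "10.0.0.7"], dtype=object),
    "ip.dst": pd.Series(["10.0.0.5", "192.168.1.100", "10.0.0.9"], dtype=object),
})

# Compute both masks before overwriting anything; the second excludes
# rows already handled by the first, mirroring the original if/else.
src_is_local = csv_input["ip.src"] == LOCAL_IP
dst_is_local = (csv_input["ip.dst"] == LOCAL_IP) & ~src_is_local

# Create the target column up front so rows matching neither mask stay empty.
csv_input["flow_dir"] = None

cols = ["ip.src", "ip.dst", "flow_dir"]
csv_input.loc[src_is_local, cols] = [1, 0, 1]
csv_input.loc[dst_is_local, cols] = [0, 1, 0]
print(csv_input)
```

Rows where neither address matches are left untouched, which matches the behaviour of the nested `if` in the question.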
