I'm trying to create a dataframe of stock prices, and append a True/False column for each row based on certain conditions.
ind = [0,1,2,3,4,5,6,7,8]
close = [10,20,30,40,30,20,30,40,50]
open = [11,21,31,41,31,21,31,41,51]
upper = [11,21,31,41,31,21,31,41,51]
mid = [11,21,31,41,31,21,31,41,51]
cond1 = [True,True,True,False,False,True,False,True,True]
cond2 = [True,True,False,False,False,False,False,False,False]
cond3 = [True,True,False,False,False,False,False,False,False]
cond4 = [True,True,False,False,False,False,False,False,False]
cond5 = [True,True,False,False,False,False,False,False,False]
def check_conds(df, latest_price):
    '''1st set of INT for early breakout of bollinger upper'''
    df.loc[:, 'cond1'] = df.close.shift(1) > df.upper.shift(1)
    df.loc[:, 'cond2'] = df.open.shift(1) < df.mid.shift(1).rolling(6).min()
    df.loc[:, 'cond3'] = df.close.shift(1).rolling(7).min() <= 21
    df.loc[:, 'cond4'] = df.upper.shift(1) < df.upper.shift(2)
    df.loc[:, 'cond5'] = df.mid.tail(3).max() < 30
    df.loc[:, 'Overall'] = all([df.cond1, df.cond2, df.cond3, df.cond4, df.cond5])
    return df
The original dataframe (9 rows by 4 columns) contains only the close / open / upper / mid columns.
The check_conds function returns the df nicely, with the new cond1–cond5 columns of True / False appended for each row, resulting in a dataframe of 9 rows by 9 columns.
However, when I try to apply further logic to produce an 'Overall' True / False based on cond1–5 for each row, I receive "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
df.loc[:, ('Overall')] = all([df.cond1,df.cond2,df.cond3,df.cond4,df.cond5])
So I tried pulling out each of cond1–5; those are indeed Series of True / False. How do I write that last line in the function so that it checks each row's cond1–5 and returns True only if all of cond1–5 are True for that row?
I just can't wrap my head around why the cond1–5 lines in the function work fine, comparing the values within each row, yet this last line (written in a similar style) complains about an entire Series.
Please advise!
The error tells you to use pd.DataFrame.all. To check that all values are true per row for all conditions you have to specify the argument axis=1:
df.loc[:, df.columns.str.startswith('cond')].all(axis=1)
Note that df.columns.str.startswith('cond') is just a lazy way of selecting all columns that start with 'cond'. Of course you can achieve the same with df[['cond1', 'cond2', 'cond3', 'cond4', 'cond5']].
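Putting it together, the last line of check_conds could be rewritten, for example, as:
# For each row, True only when cond1–cond5 are all True
df.loc[:, 'Overall'] = df.loc[:, df.columns.str.startswith('cond')].all(axis=1)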
I want to check whether each value in a column exists in another dataframe (df2) and whether its ticket date there falls within 3 days after the date in the first dataframe, or whether other conditions are met.
The code I've written works, but I want to know if there's a better solution to this problem, or a more efficient way to write it.
Example:
def check_answer(df):
    if df.ticket_count == 1:
        return 'Yes'
    elif (df.ticket_count > 0) and (df.occurrences == 1):
        return 'Yes'
    elif any(
        df2[df2.partnumber == df.partnumber]['ticket_date'] >= df['date']
    ) and any(
        df2[df2.partnumber == df.partnumber]['ticket_date'] <= df['date'] + pd.DateOffset(days=3)
    ):
        return 'Yes'
    else:
        return 'No'

df['result'] = df.apply(check_answer, axis=1)
You could try using a list comprehension.
Here's an example:
list comprehension in pandas
And if you need to create a copy of your dataframe with new columns containing the result of your conditions, you can check this example: Pandas DataFrame Comprehensions
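For instance, a list comprehension mirroring the logic of check_answer might look like this (a sketch; row_answer is a hypothetical helper, and it assumes the partnumber, date, ticket_count and occurrences columns from the question):
def row_answer(pn, d, tc, occ):
    # mirror the three 'Yes' branches of check_answer
    if tc == 1 or (tc > 0 and occ == 1):
        return 'Yes'
    dates = df2.loc[df2.partnumber == pn, 'ticket_date']
    if (dates >= d).any() and (dates <= d + pd.DateOffset(days=3)).any():
        return 'Yes'
    return 'No'

df['result'] = [
    row_answer(pn, d, tc, occ)
    for pn, d, tc, occ in zip(df.partnumber, df.date, df.ticket_count, df.occurrences)
]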
I hope this helps.
Best regards.
Hey everyone, I am getting a warning about chained indexing / returning a copy instead of a view in Pandas. Here is my code:
import pandas as pd
duplicates = pd.read_csv("dupes_test.csv", parse_dates=["Updated At"])
dupe_df = pd.DataFrame(duplicates)
dupe_sorted = dupe_df.sort_values(['Email Address', 'Updated At'], ascending=False)
cols_to_change = list(dupe_sorted.columns)
opt_out_count = 0
unchanged_count = 0
error = 0
Here is my for loop:
for dupe in range(0, dupe_sorted.shape[0]):
    try:
        if dupe_sorted["Email Address"].iloc[dupe] == dupe_sorted["Email Address"].iloc[dupe + 1]:
            for col in cols_to_change:
                if dupe_sorted[col].iloc[dupe + 1] == 'Opt Out':
                    dupe_sorted[col].iloc[dupe] = "Opt Out"
                    opt_out_count += 1
                else:
                    unchanged_count += 1
    except:
        error += 1

print("We're Done")
The warning message:
//anaconda3/envs/DtownPlayground/lib/python3.6/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
My output seems to vary, sometimes it works as it should and I get the following output:
print(f"Opted Out: {opt_out_count}, Unchanged: {unchanged_count}, Error: {error}")
Opted Out: 22, Unchanged: 8, Error: 1
Other times it doesn't update any of the values. I guess I'm confused because I'm not using chained indexing and I don't know why Pandas is giving me this warning message. Also, let me know if I should paste the dataframe in a different format for ease of use!
Essentially, you are attempting to assign a single cell value across a slice of your data frame and hence the warning:
dupe_sorted[col].iloc[dupe] = "Opt Out"
Usually in Pandas you want to assign an entire Series (i.e., a column) in one call, known as a vectorized operation, rather than cell by cell. Similarly, in NumPy you want to assign whole ndarrays in one call and not assign by cells.
Therefore, consider joining a DataFrame.shift version of the same data to capture the next row's values, then re-assigning each column with conditional logic via numpy.where. The functional form of the == operator, Series.eq, is used below, but both forms are valid:
import numpy as np

cols_to_change = dupe_sorted.columns.tolist()

# MERGE ON SHIFT FORWARD
dupe_sorted = dupe_sorted.join(dupe_sorted.shift(-1), how='left', rsuffix='_')

# ITERATE ACROSS COLUMNS FOR SERIES ASSIGNMENT
for col in cols_to_change:
    dupe_sorted[col] = np.where((dupe_sorted['Email Address'].eq(dupe_sorted['Email Address_'])) &
                                (dupe_sorted[col + '_'].eq('Opt Out')),
                                'Opt Out',
                                dupe_sorted[col])

# REMOVE SHIFTED COLUMNS
final_df = dupe_sorted.reindex(cols_to_change, axis='columns')
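If you prefer, the shifted helper columns can equivalently be dropped by name, since the join suffixes every overlapping column with '_' (a small equivalent sketch):
# drop the '_'-suffixed columns created by the join
final_df = dupe_sorted.drop(columns=[c + '_' for c in cols_to_change])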
I have a df with these columns:
   start     end       strand
3  90290834  90290905  +
3  90290834  90291149  +
3  90291019  90291149  +
3  90291239  90291381  +
5  33977824  33984550  -
5  33983577  33984550  -
5  33984631  33986386  -
What I am trying to do is add new columns (5ss and 3ss) based on the strand column:
f = pd.read_clipboard()
f

def addcolumns(row):
    if row['strand'] == "+":
        row["5ss"] == row["start"]
        row["3ss"] == row["end"]
    else:
        row["5ss"] == row["end"]
        row["3ss"] == row["start"]
    return row

f = f.apply(addcolumns, axis=1)
KeyError: ('5ss', u'occurred at index 0')
Which part of the code is wrong? Or is there an easier way to do this?
Instead of using .apply(), I'd suggest np.where():
import numpy as np

f.loc[:, '5ss'] = np.where(f.strand == '+', f.start, f.end)
f.loc[:, '3ss'] = np.where(f.strand == '+', f.end, f.start)
np.where() creates a new array based on three arguments:
A logical condition (in this case f.strand == '+')
A value to take when the condition is true
A value to take when the condition is false
As for the error: apply() with axis=1 does pass each row to your function, but the body uses == (comparison) where it needs = (assignment). The expression row["5ss"] == row["start"] tries to read the nonexistent '5ss' key, which raises the KeyError. Even with the assignments fixed, np.where() is simpler and faster for this kind of conditional column assignment.
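For completeness, the apply() version also works once the comparisons become assignments (a sketch of the minimal fix):
def addcolumns(row):
    # '=' assigns; '==' only compares, and reading the missing key raised the KeyError
    if row['strand'] == '+':
        row['5ss'] = row['start']
        row['3ss'] = row['end']
    else:
        row['5ss'] = row['end']
        row['3ss'] = row['start']
    return row

f = f.apply(addcolumns, axis=1)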
I am seeking to drop some rows from a DataFrame based on two conditions needing to be met in the same row. I have 5 columns; if two columns have equal values (code1 and code2) AND another column (count) is greater than 1, then when these two conditions are met in the same row, the row is dropped.
I could alternatively keep rows that meet the condition:
count == 1 OR (as opposed to AND) df_1.code1 != df_1.code2
In terms of the first idea what I am thinking is:
df_1 = '''drop row if''' [df_1.count == 1 & df_1.code1 == df_1.code2]
Here is what I have so far in terms of the second idea:
df_1 = df_1[df_1.count == 1 or df_1.code1 != df_1.code2]
You can use .loc to specify multiple conditions. To keep the rows you want, combine the two keep-conditions with | (element-wise OR). Note that a column named count clashes with the DataFrame.count method, so use bracket access rather than df_1.count:
df_new = df_1.loc[(df_1['count'] == 1) | (df_1.code1 != df_1.code2), :]
df.drop(df[(df['code1'] == df['code2']) & (df['count'] > 1)].index, inplace=True)
Breaking it into steps:
df[(df['code1'] == df['code2']) & (df['count'] > 1)] returns a subset of rows from df where the value in code1 equals to the value in code2 and the value in count is greater than 1.
.index returns the indexes of those rows.
The last step is calling df.drop(), which expects the indexes to be dropped from the dataframe, with inplace=True so we don't need to re-assign, i.e.
df = df.drop(...).
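Equivalently, you can keep the complement with a boolean mask instead of dropping by index (a sketch using the same column names):
# keep rows where NOT (code1 equals code2 AND count > 1)
df = df[~((df['code1'] == df['code2']) & (df['count'] > 1))]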