I'm trying to create a dataframe of stock prices, and append a True/False column for each row based on certain conditions.
ind = [0,1,2,3,4,5,6,7,8]
close = [10,20,30,40,30,20,30,40,50]
open = [11,21,31,41,31,21,31,41,51]
upper = [11,21,31,41,31,21,31,41,51]
mid = [11,21,31,41,31,21,31,41,51]
cond1 = [True,True,True,False,False,True,False,True,True]
cond2 = [True,True,False,False,False,False,False,False,False]
cond3 = [True,True,False,False,False,False,False,False,False]
cond4 = [True,True,False,False,False,False,False,False,False]
cond5 = [True,True,False,False,False,False,False,False,False]
def check_conds(df, latest_price):
    '''1st set of INT for early breakout of bollinger upper'''
    df.loc[:, 'cond1'] = df.close.shift(1) > df.upper.shift(1)
    df.loc[:, 'cond2'] = df.open.shift(1) < df.mid.shift(1).rolling(6).min()
    df.loc[:, 'cond3'] = df.close.shift(1).rolling(7).min() <= 21
    df.loc[:, 'cond4'] = df.upper.shift(1) < df.upper.shift(2)
    df.loc[:, 'cond5'] = df.mid.tail(3).max() < 30
    df.loc[:, 'Overall'] = all([df.cond1, df.cond2, df.cond3, df.cond4, df.cond5])
    return df
The original dataframe (9 rows by 4 columns) contains only the close / open / upper / mid columns.
The check_conds function returns the df nicely, with the new cond1–cond5 columns of True / False appended for each row, resulting in a dataframe of 9 rows by 9 columns.
However, when I try to apply further logic to produce an 'Overall' True / False based on cond1–5 for each row, I receive "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
df.loc[:, ('Overall')] = all([df.cond1,df.cond2,df.cond3,df.cond4,df.cond5])
So I tried pulling out each of cond1–5; those are indeed Series of True / False. How do I write that last line in the function so that it checks each row's cond1–5 and returns True only if all of cond1–5 are True for that row?
I just can't wrap my head around why the cond1–5 lines in the function work fine, comparing the values within each row, yet this last line (written in a similar style) complains about an entire Series.
Please advise!
The error tells you to use pd.DataFrame.all. To check that all values are true per row for all conditions you have to specify the argument axis=1:
df.loc[:, df.columns.str.startswith('cond')].all(axis=1)
Note that df.columns.str.startswith('cond') is just a lazy way of selecting all columns that start with 'cond'. Of course you can achieve the same with df[['cond1', 'cond2', 'cond3', 'cond4', 'cond5']].
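Putting it together, the last line of check_conds could be rewritten, for example, as:
# For each row, True only when cond1–cond5 are all True
df.loc[:, 'Overall'] = df.loc[:, df.columns.str.startswith('cond')].all(axis=1)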
I want to check whether each value in a column exists in another dataframe (df2) and whether its ticket date there falls within 3 days after the date in the first dataframe, or whether other conditions are met.
The code I've written works, but I want to know if there's a better solution to this problem, or a more efficient way to write it.
Example:
def check_answer(df):
    if df.ticket_count == 1:
        return 'Yes'
    elif (df.ticket_count > 0) and (df.occurrences == 1):
        return 'Yes'
    elif any(
        df2[df2.partnumber == df.partnumber]['ticket_date'] >= df['date']
    ) and any(
        df2[df2.partnumber == df.partnumber]['ticket_date'] <= df['date'] + pd.DateOffset(days=3)
    ):
        return 'Yes'
    else:
        return 'No'

df['result'] = df.apply(check_answer, axis=1)
You could try using a list comprehension.
Here's an example:
list comprehension in pandas
And if you need to create a copy of your dataframe with new columns containing the result of your conditions, you can check this example: Pandas DataFrame Comprehensions
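For instance, a list comprehension mirroring the logic of check_answer might look like this (a sketch; row_answer is a hypothetical helper, and it assumes the partnumber, date, ticket_count and occurrences columns from the question):
def row_answer(pn, d, tc, occ):
    # mirror the three 'Yes' branches of check_answer
    if tc == 1 or (tc > 0 and occ == 1):
        return 'Yes'
    dates = df2.loc[df2.partnumber == pn, 'ticket_date']
    if (dates >= d).any() and (dates <= d + pd.DateOffset(days=3)).any():
        return 'Yes'
    return 'No'

df['result'] = [
    row_answer(pn, d, tc, occ)
    for pn, d, tc, occ in zip(df.partnumber, df.date, df.ticket_count, df.occurrences)
]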
I hope this helps.
Best regards.
Hey everyone, I am getting a warning about chained indexing / returning a copy instead of a view in Pandas. Here is my code:
import pandas as pd
duplicates = pd.read_csv("dupes_test.csv", parse_dates=["Updated At"])
dupe_df = pd.DataFrame(duplicates)
dupe_sorted = dupe_df.sort_values(['Email Address', 'Updated At'], ascending=False)
cols_to_change = list(dupe_sorted.columns)
opt_out_count = 0
unchanged_count = 0
error = 0
Here is my for loop:
for dupe in range(0, dupe_sorted.shape[0]):
    try:
        if dupe_sorted["Email Address"].iloc[dupe] == dupe_sorted["Email Address"].iloc[dupe + 1]:
            for col in cols_to_change:
                if dupe_sorted[col].iloc[dupe + 1] == 'Opt Out':
                    dupe_sorted[col].iloc[dupe] = "Opt Out"
                    opt_out_count += 1
                else:
                    unchanged_count += 1
    except:
        error += 1

print("We're Done")
The warning message:
//anaconda3/envs/DtownPlayground/lib/python3.6/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
My output seems to vary, sometimes it works as it should and I get the following output:
print(f"Opted Out: {opt_out_count}, Unchanged: {unchanged_count}, Error: {error}")
Opted Out: 22, Unchanged: 8, Error: 1
Other times it doesn't update any of the values. I guess I'm confused because I'm not using chained indexing and I don't know why Pandas is giving me this warning message. Also, let me know if I should paste the dataframe in a different format for ease of use!
Essentially, you are attempting to assign a single cell value across a slice of your data frame and hence the warning:
dupe_sorted[col].iloc[dupe] = "Opt Out"
Usually in Pandas you want to assign an entire Series (i.e., a column) in one call, known as a vectorized operation, rather than cell by cell. Similarly, in NumPy you want to assign whole ndarrays in one call and not assign by cells.
Therefore, consider joining a DataFrame.shift version of the same data to capture the next row's values, then re-assigning each column with conditional logic via numpy.where. The functional form of the == operator, Series.eq, is used below, but both forms are valid:
import numpy as np

cols_to_change = dupe_sorted.columns.tolist()

# MERGE ON SHIFT FORWARD
dupe_sorted = dupe_sorted.join(dupe_sorted.shift(-1), how='left', rsuffix='_')

# ITERATE ACROSS COLUMNS FOR SERIES ASSIGNMENT
for col in cols_to_change:
    dupe_sorted[col] = np.where((dupe_sorted['Email Address'].eq(dupe_sorted['Email Address_'])) &
                                (dupe_sorted[col + '_'].eq('Opt Out')),
                                'Opt Out',
                                dupe_sorted[col])

# REMOVE SHIFTED COLUMNS
final_df = dupe_sorted.reindex(cols_to_change, axis='columns')
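If you prefer, the shifted helper columns can equivalently be dropped by name, since the join suffixes every overlapping column with '_' (a small equivalent sketch):
# drop the '_'-suffixed columns created by the join
final_df = dupe_sorted.drop(columns=[c + '_' for c in cols_to_change])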
I have a df with these columns:
   start     end       strand
3  90290834  90290905  +
3  90290834  90291149  +
3  90291019  90291149  +
3  90291239  90291381  +
5  33977824  33984550  -
5  33983577  33984550  -
5  33984631  33986386  -
What I am trying to do is add new columns (5ss and 3ss) based on the strand column:
f = pd.read_clipboard()
f

def addcolumns(row):
    if row['strand'] == "+":
        row["5ss"] == row["start"]
        row["3ss"] == row["end"]
    else:
        row["5ss"] == row["end"]
        row["3ss"] == row["start"]
    return row

f = f.apply(addcolumns, axis=1)
KeyError: ('5ss', u'occurred at index 0')
Which part of the code is wrong? Or is there an easier way to do this?
Instead of using .apply(), I'd suggest np.where():
import numpy as np

f.loc[:, '5ss'] = np.where(f.strand == '+', f.start, f.end)
f.loc[:, '3ss'] = np.where(f.strand == '+', f.end, f.start)
np.where() creates a new array based on three arguments:
A logical condition (in this case f.strand == '+')
A value to take when the condition is true
A value to take when the condition is false
As for the error: apply() with axis=1 does pass each row to your function, but the body uses == (comparison) where it needs = (assignment). The expression row["5ss"] == row["start"] tries to read the nonexistent '5ss' key, which raises the KeyError. Even with the assignments fixed, np.where() is simpler and faster for this kind of conditional column assignment.
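For completeness, the apply() version also works once the comparisons become assignments (a sketch of the minimal fix):
def addcolumns(row):
    # '=' assigns; '==' only compares, and reading the missing key raised the KeyError
    if row['strand'] == '+':
        row['5ss'] = row['start']
        row['3ss'] = row['end']
    else:
        row['5ss'] = row['end']
        row['3ss'] = row['start']
    return row

f = f.apply(addcolumns, axis=1)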
I am seeking to drop some rows from a DataFrame based on two conditions needing to be met in the same row. I have 5 columns; if two columns have equal values (code1 and code2) AND another column (count) is greater than 1, then when these two conditions are met in the same row, the row is dropped.
I could alternatively keep rows that meet the condition:
count == 1 OR (as opposed to AND) df_1.code1 != df_1.code2
In terms of the first idea what I am thinking is:
df_1 = '''drop row if''' [df_1.count == 1 & df_1.code1 == df_1.code2]
Here is what I have so far in terms of the second idea:
df_1 = df_1[df_1.count == 1 or df_1.code1 != df_1.code2]
You can use .loc to specify multiple conditions. To keep the rows you want, combine the two keep-conditions with | (element-wise OR). Note that a column named count clashes with the DataFrame.count method, so use bracket access rather than df_1.count:
df_new = df_1.loc[(df_1['count'] == 1) | (df_1.code1 != df_1.code2), :]
df.drop(df[(df['code1'] == df['code2']) & (df['count'] > 1)].index, inplace=True)
Breaking it into steps:
df[(df['code1'] == df['code2']) & (df['count'] > 1)] returns a subset of rows from df where the value in code1 equals to the value in code2 and the value in count is greater than 1.
.index returns the indexes of those rows.
The last step is calling df.drop(), which expects the indexes to be dropped from the dataframe, with inplace=True so we don't need to re-assign, i.e.
df = df.drop(...).
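Equivalently, you can keep the complement with a boolean mask instead of dropping by index (a sketch using the same column names):
# keep rows where NOT (code1 equals code2 AND count > 1)
df = df[~((df['code1'] == df['code2']) & (df['count'] > 1))]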