I am trying to get the rows where brand and manufacturer hold the same value (e.g. brand == J.R. Watkins and manufacturer == J.R.Watkins) in the last elif block, but it gives this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
My code is:
import csv
import pandas as pd
import sys
class sample:
    def create_df(self, f):
        self.z=pd.read_csv(f)
    def get_resultant_df(self, list_cols):
        self.data_frame = self.z[list_cols[:]]
    def process_df(self, df, conditions):
        resultant_df = self.data_frame
        if conditions[2] == 'equals':
            new_df=resultant_df[resultant_df[conditions[1]] == conditions[3]]
            return new_df
        elif conditions[2] == 'contains':
            new_df = resultant_df[resultant_df[conditions[1]].str.contains(conditions[3])]
            return new_df
        elif conditions[2] == 'not equals':
            new_df = resultant_df[resultant_df[conditions[1]] != conditions[3]]
            return new_df
        elif conditions[2] == 'startswith':
            new_df = resultant_df[resultant_df[conditions[1]].str.startswith(conditions[3])]
            return new_df
        elif conditions[2] == 'in':
            new_df = resultant_df[resultant_df[conditions[1]].isin(resultant_df[conditions[3]])]
            return new_df
        elif conditions[2] == 'not in':
            new_df = resultant_df[~resultant_df[conditions[1]].isin(resultant_df[conditions[3]])]
            return new_df
        elif conditions[2]=='group':
            new_df=list(resultant_df.groupby(conditions[0])[conditions[1]])
            return new_df
        elif conditions[2]=='specific':
            new_df=resultant_df.loc[resultant_df[conditions[0]]==conditions[8]]
            return new_df
        elif conditions[2]=='same':
            if(resultant_df.loc[(resultant_df[conditions[0]]==conditions[8]) & (resultant_df[conditions[1]]==conditions[8])]).all():
                new_df=resultant_df
            return new_df
if __name__ == '__main__':
    sample = sample()
    sample.create_df("/home/purpletalk/GrammarandProductReviews.csv")
    df = sample.get_resultant_df(['brand', 'reviews.id','manufacturer','reviews.title','reviews.username'])
    new_df = sample.process_df(df, ['brand','manufacturer','same','manufacturer', 'size', 'equal',8,700,'J.R. Watkins'])
    print new_df['brand']
I am trying to get the values that are related to brand and
manufacturer which are same (e.g brand==J.R. Watkins and
manufacturer==J.R.Watkins)
Your logic is overcomplicated. Just apply a filter:
df = df[(df['brand'] == 'J.R. Watkins') & (df['manufacturer'] == 'J.R.Watkins')]
You don't need pd.DataFrame.all(), which appears to be what you are attempting. Nor do you need an inner if statement: if there's no match, you will have an empty dataframe.
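To illustrate the difference, here is a minimal sketch on a small made-up frame (the rows and the name target are invented for illustration and are not taken from the CSV in the question). Combining two comparisons with & gives a boolean Series that selects rows, whereas handing the filtered DataFrame to an if statement is what raises the ValueError:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    'brand': ['J.R. Watkins', 'J.R. Watkins', 'Lysol'],
    'manufacturer': ['J.R. Watkins', 'Other Co', 'Lysol'],
})

target = 'J.R. Watkins'

# Element-wise comparisons combined with & give one boolean per row.
mask = (df['brand'] == target) & (df['manufacturer'] == target)
matches = df[mask]

# Rows where the two columns agree, whatever the value is.
same = df[df['brand'] == df['manufacturer']]

# Using a DataFrame as an if condition reproduces the error from the question.
try:
    if df[mask]:
        pass
except ValueError as exc:
    error_message = str(exc)
```

If nothing matches, matches is simply an empty DataFrame, which can be checked with matches.empty.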
Related
I have a pandas dataframe and would like to create a new column based on the below condition:
def confidence_level(row):
    if (row['ctry_one'] == row['ctry_two']) and (row['Market'] == 'yes'):
        return 'H'
    if (row['ctry_one'] == row['ctry_two']) and (row['Market'] == 'no'):
        return 'M'
    if (row['ctry_one'] != row['ctry_two']) and (row['Market'] == 'yes'):
        return 'M'
    if (row['ctry_one'] != row['ctry_two']) and (row['Market'] == 'no'):
        return 'L'
df['status'] = confidence_level(df)
This is the error I receive:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
func_test['Confidence'].value_counts()
Has anyone experienced this before? I tried applying .all() at the end of each argument like below, but this just returns 'None' for everything:
def confidence_level(row):
    if (row['ctry_one'] == row['ctry_two']).all() and (row['Market'] == 'yes').all():
        return 'H'
    if (row['ctry_one'] == row['ctry_two']).all() and (row['Market'] == 'no').all():
        return 'M'
    if (row['ctry_one'] != row['ctry_two']).all() and (row['Market'] == 'yes').all():
        return 'M'
    if (row['ctry_one'] != row['ctry_two']).all() and (row['Market'] == 'no').all():
        return 'L'
You need to call your function for each row, rather than for the whole dataframe, like this:
df['status'] = df.apply(confidence_level, axis=1)
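As a quick check of the row-wise call, here is a small self-contained sketch. The four sample rows are made up for illustration; the function follows the same rules as the one in the question. With axis=1 each row arrives as a Series of scalars, so the comparisons yield single booleans and the plain if statements work:

```python
import pandas as pd

# Hypothetical sample data covering the four cases.
df = pd.DataFrame({
    'ctry_one': ['US', 'US', 'FR', 'FR'],
    'ctry_two': ['US', 'US', 'DE', 'DE'],
    'Market': ['yes', 'no', 'yes', 'no'],
})

def confidence_level(row):
    # row holds the scalar values of a single row.
    same = row['ctry_one'] == row['ctry_two']
    market = row['Market']
    if same and market == 'yes':
        return 'H'
    if same and market == 'no':
        return 'M'
    if not same and market == 'yes':
        return 'M'
    if not same and market == 'no':
        return 'L'

# axis=1 calls the function once per row.
df['status'] = df.apply(confidence_level, axis=1)
```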
That said, using np.select like Mayank's solution or using .loc like this will run faster:
def confidence_level(df):
    # Work on a copy so the caller's dataframe is left untouched
    new_df = df.copy()
    new_df.loc[(df['ctry_one'] == df['ctry_two']) & (df['Market'] == 'yes'), 'status'] = 'H'
    new_df.loc[(df['ctry_one'] == df['ctry_two']) & (df['Market'] == 'no'), 'status'] = 'M'
    new_df.loc[(df['ctry_one'] != df['ctry_two']) & (df['Market'] == 'yes'), 'status'] = 'M'
    new_df.loc[(df['ctry_one'] != df['ctry_two']) & (df['Market'] == 'no'), 'status'] = 'L'
    # Return the modified copy, not the original
    return new_df
df = confidence_level(df)
Use numpy.select instead, which is more performant and readable:
import numpy as np
conditions = [
    (df['ctry_one'] == df['ctry_two']) & (df['Market'] == 'yes'),
    (df['ctry_one'] == df['ctry_two']) & (df['Market'] == 'no'),
    (df['ctry_one'] != df['ctry_two']) & (df['Market'] == 'yes'),
    (df['ctry_one'] != df['ctry_two']) & (df['Market'] == 'no'),
]
choices = ['H', 'M', 'M', 'L']
df['status'] = np.select(conditions, choices)
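One detail worth knowing: np.select falls back to its default argument (0 unless you pass one) for rows that match none of the conditions. With string choices, mixing in the integer default can give a mixed or rejected dtype depending on the NumPy version, so passing a string default is the safer option. A minimal sketch, with made-up rows and a made-up 'unknown' label:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row matches none of the conditions.
df = pd.DataFrame({
    'ctry_one': ['US', 'US', 'FR', 'FR', 'US'],
    'ctry_two': ['US', 'US', 'DE', 'DE', 'US'],
    'Market': ['yes', 'no', 'yes', 'no', 'maybe'],
})

same = df['ctry_one'] == df['ctry_two']
conditions = [
    same & (df['Market'] == 'yes'),
    same & (df['Market'] == 'no'),
    ~same & (df['Market'] == 'yes'),
    ~same & (df['Market'] == 'no'),
]
choices = ['H', 'M', 'M', 'L']

# 'unknown' is an illustrative fallback label, not part of the original answer.
df['status'] = np.select(conditions, choices, default='unknown')
```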
I'm trying to apply this function to fill the Age column based on Pclass and Sex columns. But I'm unable to do so. How can I make it work?
def fill_age():
    Age = train['Age']
    Pclass = train['Pclass']
    Sex = train['Sex']
    if pd.isnull(Age):
        if Pclass == 1:
            return 34.61
        elif (Pclass == 1) and (Sex == 'male'):
            return 41.2813
        elif (Pclass == 2) and (Sex == 'female'):
            return 28.72
        elif (Pclass == 2) and (Sex == 'male'):
            return 30.74
        elif (Pclass == 3) and (Sex == 'female'):
            return 21.75
        elif (Pclass == 3) and (Sex == 'male'):
            return 26.51
        else:
            pass
    else:
        return Age
train['Age'] = train['Age'].apply(fill_age(),axis=1)
I'm getting the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You should consider using parentheses to separate the conditions (which you already did) and replacing the boolean operator "and" with the bitwise operator & to avoid this type of error. Also, keep in mind that if you want to use apply, the function should take a parameter x, which will be passed in by a lambda in the apply call:
def fill_age(x):
    Age = x['Age']
    Pclass = x['Pclass']
    Sex = x['Sex']
    if pd.isnull(Age):
        # Assumed: 34.61 is the first-class female value. The original tested
        # only Pclass == 1, which made the male branch below unreachable.
        if (Pclass == 1) & (Sex == 'female'):
            return 34.61
        elif (Pclass == 1) & (Sex == 'male'):
            return 41.2813
        elif (Pclass == 2) & (Sex == 'female'):
            return 28.72
        elif (Pclass == 2) & (Sex == 'male'):
            return 30.74
        elif (Pclass == 3) & (Sex == 'female'):
            return 21.75
        elif (Pclass == 3) & (Sex == 'male'):
            return 26.51
        else:
            pass
    else:
        return Age
Now, using apply with the lambda:
train['Age'] = train.apply(lambda x: fill_age(x), axis=1)
In a sample dataframe:
df = pd.DataFrame({'Age':[1,np.nan,3,np.nan,5,6],
                   'Pclass':[1,2,3,3,2,1],
                   'Sex':['male','female','male','female','male','female']})
Using the answer provided above:
df['Age'] = df.apply(lambda x: fill_age(x),axis=1)
Output:
     Age  Pclass     Sex
0   1.00       1    male
1  28.72       2  female
2   3.00       3    male
3  21.75       3  female
4   5.00       2    male
5   6.00       1  female
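For larger frames, a lookup table combined with fillna avoids the long if/elif chain. The following is only a sketch of that alternative: the names fill_values and fallback are hypothetical, and the (1, 'female') entry assumes that 34.61 in the question is the first-class female value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [1, np.nan, 3, np.nan, 5, 6],
                   'Pclass': [1, 2, 3, 3, 2, 1],
                   'Sex': ['male', 'female', 'male', 'female', 'male', 'female']})

# Replacement ages keyed by (Pclass, Sex), copied from the function above.
# Assumption: 34.61 belongs to first-class females.
fill_values = {
    (1, 'female'): 34.61, (1, 'male'): 41.2813,
    (2, 'female'): 28.72, (2, 'male'): 30.74,
    (3, 'female'): 21.75, (3, 'male'): 26.51,
}

# One candidate value per row, aligned on the dataframe's index.
fallback = pd.Series(
    [fill_values.get(key, np.nan) for key in zip(df['Pclass'], df['Sex'])],
    index=df.index,
)

# Only the missing ages are replaced; existing ages are kept.
df['Age'] = df['Age'].fillna(fallback)
```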
I need to check whether the status in column 'bad_loan' is 'Yes'. If it is, I need to look for the same client in 'name_client', and if that client has other loans, set 'Yes' for all of his loans.
def bad_loan(df):
    for row in df:
        status = row['bad_loan']
        name = row['name_client']
        if status == 'Yes':
            for n in df:
                name_client = df['name_client']
                if name == name_client:
                    df['bad_loan'] = 'Yes'
                else:
                    df['bad_loan'] = 'No'
bad_loan(df)
It returns TypeError: string indices must be integers
for row in df:
You can't iterate over rows this way with pandas: looping over a DataFrame gives you its column labels, not its rows.
Try iloc like this:
def bad_loan(df):
    for i in range(len(df)):
        # Fetch row i by position
        row = df.iloc[i]
        status = row['bad_loan']
        name = row['name_client']
        if status == 'Yes':
            # Mark every loan of the same client as bad
            df.loc[df['name_client'] == name, 'bad_loan'] = 'Yes'
bad_loan(df)
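To see where the TypeError in the question comes from, this small sketch (with made-up client names) compares what a plain for loop over a DataFrame yields with what iterrows yields:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    'name_client': ['Ann', 'Bob', 'Ann'],
    'bad_loan': ['No', 'No', 'Yes'],
})

# A plain loop yields the column labels, which are strings.
column_labels = [item for item in df]

# Indexing one of those strings with another string is the reported TypeError.
try:
    column_labels[0]['bad_loan']
    raised = None
except TypeError as exc:
    raised = exc

# iterrows yields (index, row) pairs, where row is a Series.
rows = [(row['name_client'], row['bad_loan']) for _, row in df.iterrows()]
```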
import numpy as np
import pandas as pd
#make a list of all the names with bad loans:
names = df[df["bad_loan"]=="Yes"]["name_client"]
#set bad loan to yes if name in list
df["bad_loan"] = np.where(df["name_client"].isin(names),"Yes","No")
df = pd.read_csv('./test22.csv')
df.head(5)
df = df.replace(np.nan, None)
for index,col in df.iterrows():
    # Extract only if date1 happened earlier than date2
    load = 'No'
    if col['date1'] == None or col['date2'] == None:
        load = 'yes'
    elif int(str(col['date1'])[:4]) >= int(str(col['date2'])[:4]) and \
        (len(str(col['date1'])) == 4 or len(str(col['date2'])) == 4):
        load = 'yes'
    elif int(str(col['date1'])[:6]) >= int(str(col['date2'])[:6]) and \
        (len(str(col['date1'])) == 6 or len(str(col['date2'])) == 6):
        load = 'yes'
    elif int(str(col['date1'])[:8]) >= int(str(col['date2'])[:8]):
        load = 'yes'
df.head(5)
I preprocess the dataset with iterrows as in the code above, but the result is not reflected in the actual dataset. I want the result to show up in the actual dataset.
How can I apply it to the actual dataset?
Replace your for loop with a function that returns a boolean, then you can use df.apply to apply it to all rows, and then filter your dataframe by that value:
def should_load(x):
    if x['date1'] == None or x['date2'] == None:
        return True
    elif int(str(x['date1'])[:4]) >= int(str(x['date2'])[:4]) and \
        (len(str(x['date1'])) == 4 or len(str(x['date2'])) == 4):
        return True
    elif int(str(x['date1'])[:6]) >= int(str(x['date2'])[:6]) and \
        (len(str(x['date1'])) == 6 or len(str(x['date2'])) == 6):
        return True
    elif int(str(x['date1'])[:8]) >= int(str(x['date2'])[:8]):
        return True
    return False
df[df.apply(should_load, axis=1)].head(5)
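If the flag itself should be stored in the dataset, the result of apply can be assigned to a new column. The sketch below uses made-up dates and a deliberately simplified stand-in for should_load (it compares only the year), so treat it as an illustration of the pattern rather than a replacement for the function above:

```python
import pandas as pd

# Hypothetical sample data; dates are kept as strings, None marks a missing value.
df = pd.DataFrame({
    'date1': ['2020', '201905', None],
    'date2': ['2019', '202001', '20200101'],
})

def should_load(row):
    # Simplified stand-in: a missing date, or year(date1) >= year(date2).
    if pd.isnull(row['date1']) or pd.isnull(row['date2']):
        return True
    return int(str(row['date1'])[:4]) >= int(str(row['date2'])[:4])

# Assigning the result stores it in the dataframe, unlike a local loop variable.
df['load'] = df.apply(should_load, axis=1)

# The stored column can then be used as a filter.
filtered = df[df['load']]
```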
Let's start with a Pandas DataFrame df with numerical columns pS, pS0 and pE:
import pandas as pd
df = pd.DataFrame([[0.1,0.2,0.7],[0.3,0.6,0.1],[0.9,0.1,0.0]],
                  columns=['pS','pE','pS0'])
We want to build a column indicating which of the three columns dominates. I achieved it this way:
def class_morph(x):
    y = [x['pE'],x['pS'],x['pS0']]
    y.sort(reverse=True)
    if (y[0] == y[1]):
        return 'U'
    elif (x['pE'] == y[0]):
        return 'E'
    elif (x['pS'] == y[0]):
        return 'S'
    elif (x['pS0'] == y[0]):
        return 'S0'
df['Morph'] = df.apply(class_morph, axis=1)
Which gives the correct result:
But my initial try was the following:
def class_morph(x):
    if (x['pE'] > np.max(x['pS'],x['pS0'])):
        return 'E'
    elif (x['pS'] > np.max(x['pE'],x['pS0'])):
        return 'S'
    elif (x['pS0'] > np.max(x['pS'],x['pE'])):
        return 'S0'
    else:
        return 'U'
Which returned something wrong:
Could somebody explain what my mistake is in my initial try?
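One likely cause, offered as a sketch rather than a confirmed diagnosis: numpy.max takes an array as its first argument, and its second positional parameter is axis, so np.max(a, b) does not compare a with b. Assuming the intent was the larger of two scalars, Python's built-in max does that comparison:

```python
import pandas as pd

df = pd.DataFrame([[0.1, 0.2, 0.7], [0.3, 0.6, 0.1], [0.9, 0.1, 0.0]],
                  columns=['pS', 'pE', 'pS0'])

def class_morph(x):
    # Built-in max compares its arguments; np.max(a, b) would treat b as axis.
    if x['pE'] > max(x['pS'], x['pS0']):
        return 'E'
    elif x['pS'] > max(x['pE'], x['pS0']):
        return 'S'
    elif x['pS0'] > max(x['pS'], x['pE']):
        return 'S0'
    else:
        return 'U'

df['Morph'] = df.apply(class_morph, axis=1)
```

np.maximum(a, b) is the element-wise NumPy counterpart if the comparison needs to run on whole columns.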