I have this resultsdf DataFrame in Pandas:
Antecedent Consequent confidence lift support
0 (3623,) (2568,) 0.829517 13.964925 0.0326
1 (4304,) (4305,) 0.808362 24.348264 0.0232
2 (3623, 3970) (2568,) 0.922581 15.531661 0.0286
and this dictionary df:
key name
0 1001 Boombox Ipod Classic
1 1002 USB Office Mirror Ball
I was trying to map Antecedent to names using the dictionary by adding
resultsdf['Antecedent_name'] = resultsdf['Antecedent'].astype(str).map(df)
I'm getting this error:
The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
It seems to me that you are trying to pass a DataFrame to map; you should pass a dict.
Try this:
resultsdf['Antecedent_name'] = resultsdf['Antecedent'].astype(str).map(df.to_dict())
I'm not sure if the to_dict() default output is enough. You can check the orient parameter to change the output format here.
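For reference, the default to_dict() output is nested per column, which map() can't use directly; a flat key -> name mapping is usually what map() needs. A minimal sketch (name_map is just an illustrative variable name):
df.to_dict()
# {'key': {0: 1001, 1: 1002}, 'name': {0: 'Boombox Ipod Classic', 1: 'USB Office Mirror Ball'}}

name_map = df.set_index('key')['name'].to_dict()
# {1001: 'Boombox Ipod Classic', 1002: 'USB Office Mirror Ball'}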
Update:
Considering df1 your main df and df2 your key:name df, you can do something like this:
def check(x):
    # build the plain lists once instead of converting df2 for every element
    keys = df2.to_dict("list")["key"]
    names = df2.to_dict("list")["name"]
    lst = []
    for elem in x:
        if elem in keys:
            lst.append(names[keys.index(elem)])
    return tuple(lst)
df1['Antecedent_name'] = df1['Antecedent'].map(check)
It's not that beautiful, but I think it works. Maybe instead of a lambda you can just create a separate function and pass it to map, as I did here.
This is the solution I used
resultsdf['Antecedent_name'] = resultsdf.Antecedent.map(lambda a: [df[int(id)] for id in a])
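For this to work, df in that lambda has to be a flat key -> name dict rather than the original DataFrame (indexing a DataFrame with an integer such as df[3623] would look for a column with that label). Roughly, assuming the name_map mapping sketched above:
name_map = df.set_index('key')['name'].to_dict()
resultsdf['Antecedent_name'] = resultsdf.Antecedent.map(lambda a: [name_map[int(i)] for i in a])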
Related
I am struggling with the following:
Row1 Row2
A 10
B 10
C 10
D 11
F 12
I have a large dataset and want to create a JSON file depending on the value in Row2 (it's an object dtype).
if df['Row2'] == '10':
    df.to_json(filelocation)
else:
    df.to_json(diff_filelocation)
The error I receive is: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I used bool() and still got the same error message. When I tried any(), only the first file got created. I have checked multiple posts, but nothing seems to work.
I have tried the following methods as well:
if df[df['Row2'] == '10']
or
if df.loc[(df.Row2=='10')]
but those aren't working either.
I am also confused, as something like df[df["Row2"] == '10'] works on its own, but not in an if statement.
Thanks in advance.
You need to separate df into two different segments based on a boolean mask:
m = df['Row2'].eq('10')  # Row2 is object dtype, so compare against the string '10'
df[m].to_json(filelocation)
df[~m].to_json(diff_filelocation)
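To see why the if version fails, it helps to look at what the mask m actually is; a minimal sketch with the sample data from the question:
print(m)
# roughly:
# 0     True
# 1     True
# 2     True
# 3    False
# 4    False
# Name: Row2, dtype: bool
# An if statement needs a single True/False, but a boolean Series holds one value per
# row and has no single truth value -- hence the "truth value is ambiguous" error.
# Indexing with the mask, df[m], simply keeps the rows where it is True, which is why
# df[df['Row2'] == '10'] works on its own but not inside an if.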
I am trying to create portfolios in dataframes depending on the variable 'scope': for each time period and industry, the rows with the highest 33% of the scope values go into the first portfolio, the middle 34% into the second, and the bottom 33% into the third.
So far, I have grouped the data by date and industry
group_first = data_clean.groupby(['date','industry'])
and used a lambda function afterwards to get the rows of the first tercile of 'scope' for every date and industry; for instance:
port = group_first.apply(lambda x: x[x['scope'] <= x.scope.quantile(0.33)]).reset_index(drop=True)
This works for the first and third terciles, but not for the middle one, because I get
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
when putting two conditions in the lambda function, like this:
group_middle = data_clean.groupby(['date','industry'])
port_middle = group_middle.apply(lambda x: (x[x['scope'] > x.scope.quantile(0.67)]) and (x[x['scope'] < x.scope.quantile(0.33)])).reset_index(drop=True)
In other words, how can I get the rows of a dataframe containing the values in 'scope' between the 33rd and 67th percentile after grouping for date and industry?
Any idea how to solve this?
I will guess, since I don't have data to test it.
You have the < and > reversed: you check scope < quantile(0.33) and scope > quantile(0.67), which selects 0...33 and 67...100 (and their intersection may well be empty), but you need scope > quantile(0.33) and scope < quantile(0.67) to get 33...67.
You also need x[(scope > q33) & (scope < q67)] instead of x[scope > q33] and x[scope < q67]: the & combines the two boolean masks element-wise, while Python's and is what triggers the ambiguous-truth-value error.
port_middle = group_middle.apply(lambda x:
    x[
        (x['scope'] > x.scope.quantile(0.33)) & (x['scope'] < x.scope.quantile(0.67))
    ]
).reset_index(drop=True)
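The same filter can also be written with Series.between; a sketch of that variant (inclusive='neither' mirrors the strict < / > above and needs a reasonably recent pandas):
port_middle = group_middle.apply(
    lambda x: x[x['scope'].between(x.scope.quantile(0.33), x.scope.quantile(0.67), inclusive='neither')]
).reset_index(drop=True)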
I have a dataframe test whose elements I would like to alter. Specifically, I want to change the values in the scale column, iterating over each row except for the one with the largest mPower value. I want the scale values to become the largest mPower value divided by the current row's mPower value. Below is a small example dataframe:
test = pd.DataFrame({'psnum':[0,1],'scale':[1,1],'mPower':[4.89842,5.67239]})
My code looks like this:
for index, row in test.iterrows():
    if(test['psnum'][row] != bigps):
        s = morepower/test['mPower'][row]
        test.at[:,'scale'][row] = round(s,2)
where bigps = 1 (i.e. the value of the psnum column with the largest mPower value) and morepower = 5.67239 (i.e. the largest value in the mPower column of the test dataframe).
When I run this code, I get the error: "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." I've tried a few different methods now, most of which end in errors or in nothing being changed in the dataframe at all.
So in the end, I need the test dataframe to be as such:
test = pd.DataFrame({'psnum':[0,1],'scale':[1.16,1],'mPower':[4.89842,5.67239]})
Any insight on this matter is greatly appreciated!
This is exactly the sort of vectorized operation pandas is so great for. Rather than looping at all, you can use a mathematical operation and broadcast it to the whole column at once.
test = pd.DataFrame({'psnum':[0,1],'scale':[1,1],'mPower':[4.89842,5.67239]})
test
psnum scale mPower
0 0 1 4.89842
1 1 1 5.67239
test['scale'] = test['scale'] * (test['mPower'].max() / test['mPower']).round(2)
test
psnum scale mPower
0 0 1.16 4.89842
1 1 1.00 5.67239
Please try this:
for index, row in test.iterrows():
    if(test.iloc[index]['psnum'] != bigps):
        s = morepower/test.iloc[index]['mPower']
        test.at[index,'scale'] = round(s,2)
Don't use a loop. Use a vectorized operation. In this case, I use Series.where:
test['scale'] = test['scale'].where(test.mPower == test.mPower.max(), test.mPower.max()/test.mPower)
Out[652]:
mPower psnum scale
0 4.89842 0 1.158004
1 5.67239 1 1.000000
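If you want the rounded values from the expected output (1.16), you could round the replacement values in the same expression, for example:
test['scale'] = test['scale'].where(test.mPower == test.mPower.max(), (test.mPower.max()/test.mPower).round(2))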
I'm a noob with pandas, and recently I got that ValueError when trying to modify columns that follow some rules, as in:
csv_input = pd.read_csv(fn, error_bad_lines=False)
if csv_input['ip.src'] == '192.168.1.100':
    csv_input['flow_dir'] = 1
    csv_input['ip.src'] = 1
    csv_input['ip.dst'] = 0
else:
    if csv_input['ip.dst'] == '192.168.1.100':
        csv_input['flow_dir'] = 0
        csv_input['ip.src'] = 0
        csv_input['ip.dst'] = 1
I searched about this error, and I guess it's because of the 'if' statement and the '==' operator, but I don't know how to fix it.
Thanks!
So Andrew L's comment is correct, but I'm going to expand on it a bit for your benefit.
When you call, e.g.
csv_input['ip.dst'] == '192.168.1.100'
This returns a Series with the same index as csv_input, but all the values in that Series are booleans, each representing whether the value of csv_input['ip.dst'] in that row is equal to '192.168.1.100'.
So, when you call
if csv_input['ip.dst'] == '192.168.1.100':
You're asking whether that Series evaluates to True or False. Hopefully that explains what is meant by "The truth value of a Series is ambiguous": it's a Series, and it can't be boiled down to a single boolean.
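You can see this for yourself with a tiny, made-up example:
import pandas as pd

s = pd.Series(['192.168.1.100', '10.0.0.1']) == '192.168.1.100'
# s is a boolean Series: [True, False]
bool(s)   # raises ValueError: The truth value of a Series is ambiguous...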
Now, what it looks like you're trying to do is set the values in the flow_dir, ip.src & ip.dst columns, based on the values in the ip.src and ip.dst columns.
The correct way to do this would be with .loc[], something like this:
# equivalent to the first if statement
csv_input.loc[
    csv_input['ip.src'] == '192.168.1.100',
    ['ip.src', 'ip.dst', 'flow_dir']
] = (1, 0, 1)

# equivalent to the second if statement
csv_input.loc[
    csv_input['ip.dst'] == '192.168.1.100',
    ['ip.src', 'ip.dst', 'flow_dir']
] = (0, 1, 0)
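As a side note, the same two-branch logic can be expressed with numpy.select by computing both masks up front, before any column is overwritten. A sketch for flow_dir only, where -1 is just a placeholder I chose for rows that match neither condition:
import numpy as np

src_match = csv_input['ip.src'] == '192.168.1.100'
dst_match = csv_input['ip.dst'] == '192.168.1.100'

# 1 where the source matches, 0 where the destination matches, -1 otherwise
csv_input['flow_dir'] = np.select([src_match, dst_match], [1, 0], default=-1)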
I am having trouble "applying" a custom function in Pandas. When I test the function by directly passing the values, it works and correctly returns the response. However, when I attempt to pass the column values this way:
def feez(rides, plan):
    pmt4 = 200
    inc4 = 50  # number of rides included
    min_rate4 = 4
    if plan == "4 Plan":
        if rides > inc4:
            fee = ((rides - inc4) * min_rate4) + pmt4
        else:
            fee = pmt4
        return fee
    else:
        return 0.1
df['fee'].apply(feez(df.total_rides, df.plan_name))
I receive the error:
"The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
Passing the values directly works, i.e. feez(800, "4 Plan") returns 3200.
However, I receive errors when I try to apply the function above.
I am a newbie and suspect my syntax is poorly written. Any ideas much appreciated. TIA. Eli
apply is meant to work on one row at a time, so passing entire columns as you are doing will not work. In these instances, it's best to use a lambda:
df['fee'] = df.apply(lambda x: feez(x['total_rides'], x['plan_name']), axis=1)
However, there are possibly faster ways to do this. One way is using np.vectorize. The other is using np.where.
Option 1
np.vectorize
import numpy as np

v = np.vectorize(feez)
df['fee'] = v(df.total_rides, df.plan_name)
Option 2
Nested np.where
# pmt4, inc4 and min_rate4 are the constants defined inside feez
df['fee'] = np.where(
    df.plan_name == "4 Plan",
    np.where(df.total_rides > inc4, (df.total_rides - inc4) * min_rate4 + pmt4, pmt4),
    0.1
)
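To make the first option concrete, here is a small usage sketch with made-up data ('Other Plan' is just an illustrative plan name), assuming feez from the question is defined:
import pandas as pd

df = pd.DataFrame({'total_rides': [800, 30, 120],
                   'plan_name': ['4 Plan', '4 Plan', 'Other Plan']})

df['fee'] = df.apply(lambda x: feez(x['total_rides'], x['plan_name']), axis=1)
# fee column: 3200.0, 200.0, 0.1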