How to make a loop that replaces pandas dataframe elements by position - python

I have a dataframe test whose elements I would like to alter. Specifically, I want to change the values in the scale column, iterating over each row except for the one with the largest mPower value. I want each scale value to become the largest mPower value divided by the current row's mPower value. Below is a small example dataframe:
test = pd.DataFrame({'psnum':[0,1],'scale':[1,1],'mPower':[4.89842,5.67239]})
My code looks like this:
for index, row in test.iterrows():
    if(test['psnum'][row] != bigps):
        s = morepower/test['mPower'][row]
        test.at[:,'scale'][row] = round(s,2)
where bigps = 1 (i.e. the value of the psnum column with the largest mPower value) and morepower = 5.67239 (i.e. the largest value in the mPower column of the test dataframe).
When I run this code, I get the error: "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." I've tried a few different methods on this now, most of which ended in errors or with nothing being changed in the dataframe at all.
So in the end, I need the test dataframe to be as such:
test = pd.DataFrame({'psnum':[0,1],'scale':[1.16,1],'mPower':[4.89842,5.67239]})
Any insight on this matter is greatly appreciated!

This is exactly the sort of vectorized operation pandas is so great for. Rather than looping at all, you can use a mathematical operation and broadcast it to the whole column at once.
test = pd.DataFrame({'psnum':[0,1],'scale':[1,1],'mPower':[4.89842,5.67239]})
test
psnum scale mPower
0 0 1 4.89842
1 1 1 5.67239
test['scale'] = test['scale'] * (test['mPower'].max() / test['mPower']).round(2)
test
psnum scale mPower
0 0 1.16 4.89842
1 1 1.00 5.67239

Please try this:
for index, row in test.iterrows():
    if test.iloc[index]['psnum'] != bigps:
        s = morepower / test.iloc[index]['mPower']
        test.at[index, 'scale'] = round(s, 2)
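For reference, a self-contained version of this fix, with bigps and morepower derived from the frame rather than hard-coded, and scale initialized as floats so the assignment doesn't fight the column dtype:

```python
import pandas as pd

test = pd.DataFrame({'psnum': [0, 1], 'scale': [1.0, 1.0], 'mPower': [4.89842, 5.67239]})

# Derive the helper values from the frame instead of hard-coding them
morepower = test['mPower'].max()                    # largest mPower value: 5.67239
bigps = test.loc[test['mPower'].idxmax(), 'psnum']  # psnum of the row holding it: 1

for index, row in test.iterrows():
    if row['psnum'] != bigps:
        test.at[index, 'scale'] = round(morepower / row['mPower'], 2)

print(test)
#    psnum  scale   mPower
# 0      0   1.16  4.89842
# 1      1   1.00  5.67239
```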

Don't use a loop. Use a vectorized operation. In this case, I use Series.where:
test['scale'] = test['scale'].where(test.mPower == test.mPower.max(), test.mPower.max()/test.mPower)
Out[652]:
mPower psnum scale
0 4.89842 0 1.158004
1 5.67239 1 1.000000

Related

Python, comparing dataframe rows, adding new column - Truth Value Error?

I am quite new to Python, and somewhat stuck here.
I just want to compare floats with a previous or forward row in a dataframe, and mark a new column accordingly. FWIW, I have 6000 rows to compare. I need to output a string or int as the result.
My Python code:
for row in df_an:
    x = df_an['mid_h'].shift(1)
    y = df_an['mid_h']
    if y > x:
        df_an['bar_type'] = 1
I get the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The x and y variables are generated, but apparently things go wrong with the if y > x: statement.
Any advice much appreciated.
A different approach...
I managed to implement the suggested .gt operator.
df_an.loc[df_an.mid_h.gt(df_an.mid_h.shift()) &
          df_an.mid_l.gt(df_an.mid_l.shift()), "bar_type"] = UP
Instead of going row by row, you basically shift the whole column and compare it against itself; try this:
df_an = pd.DataFrame({"mid_h":[1,3,5,7,7,5,3]})
df_an['bar_type'] = df_an.mid_h.gt(df_an.mid_h.shift())
# Output
mid_h bar_type
0 1 False
1 3 True
2 5 True
3 7 True
4 7 False
5 5 False
6 3 False

How to create a dataframe on two conditions in a lambda function using apply after groupby()?

I am trying to create portfolios in dataframes based on the variable 'scope': for each time period and industry, the rows with the highest 33% of scope values go in the first portfolio, the middle 34% in the second, and the bottom 33% in the third.
So far, I grouped the data on date and industry
group_first = data_clean.groupby(['date','industry'])
and used a lambda function afterwards to get the rows of the first tercile of 'scope' for every date and industry; for instance:
port = group_first.apply(lambda x: x[x['scope'] <= x.scope.quantile(0.33)]).reset_index(drop=True)
This works for the first and third terciles, however not for the middle one, because I get
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
when putting two conditions in the lambda function, like this:
group_middle = data_clean.groupby(['date','industry'])
port_middle = group_middle.apply(lambda x: (x[x['scope'] > x.scope.quantile(0.67)]) and (x[x['scope'] < x.scope.quantile(0.33)])).reset_index(drop=True)
In other words, how can I get the rows of a dataframe containing the values in 'scope' between the 33rd and 67th percentile after grouping for date and industry?
Any idea how to solve this?
I will guess - I don't have data to test it.
You have < and > the wrong way round: you check scope > quantile(0.67) and scope < quantile(0.33), which selects the ranges 67...100 and 0...33 (and their intersection may give empty data), but you need scope > quantile(0.33) and scope < quantile(0.67) to get 33...67.
You should also use a single boolean mask, x[(scope > ...) & (scope < ...)], instead of combining x[scope > ...] and x[scope < ...] with and:
port_middle = group_middle.apply(lambda x:
    x[
        (x['scope'] > x.scope.quantile(0.33)) & (x['scope'] < x.scope.quantile(0.67))
    ]
).reset_index(drop=True)
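Series.between can express the same middle-tercile filter as a single condition. A sketch on made-up data, since the real data_clean isn't shown in the question:

```python
import pandas as pd

# Toy stand-in for data_clean; column names follow the question
data_clean = pd.DataFrame({
    'date': ['2020-01'] * 6,
    'industry': ['tech'] * 6,
    'scope': [10, 20, 30, 40, 50, 60],
})

group_middle = data_clean.groupby(['date', 'industry'])
port_middle = group_middle.apply(
    # strictly between the 33rd and 67th percentiles of each group
    lambda x: x[x['scope'].between(x['scope'].quantile(0.33),
                                   x['scope'].quantile(0.67),
                                   inclusive='neither')]
).reset_index(drop=True)

print(port_middle['scope'].tolist())  # [30, 40]
```

Here the 33rd and 67th percentiles of [10..60] are 26.5 and 43.5, so only 30 and 40 survive the filter.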

Pandas: Divide column data by number if row of next column contains certain value

I have a dataframe that consists of three columns
qty unit_of_measure qty_cal
3 nodes nan
4 nodes nan
5 nodes nan
6 cores nan
7 nodes nan
10 cores nan
3 nodes nan
I would like to populate qty_cal conditionally: if unit_of_measure is equal to "nodes", copy that row's qty value into qty_cal; if it's "cores", divide the qty value by 16 and populate qty_cal with the result.
The code I have tried is,
if ppn_df['unit_of_measure'] == 'Nodes':
    ppn_df['qty']
elif ppn_df['unit_of_measure'] == 'Cores':
    ppn_df['qty'] / 16
I'm getting an error of
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I'm not sure why I'm getting this ValueError; I don't understand why the if statement is ambiguous. Would anyone care to explain?
Use np.where:
df['qty_cal'] = np.where(df['unit_of_measure'] == 'nodes', df['qty'], df['qty']/16)
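A quick check of the np.where line against the sample column values from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'qty': [3, 4, 5, 6, 7, 10, 3],
                   'unit_of_measure': ['nodes', 'nodes', 'nodes', 'cores',
                                       'nodes', 'cores', 'nodes']})

# One vectorized choice per row: qty where 'nodes', qty/16 where 'cores'
df['qty_cal'] = np.where(df['unit_of_measure'] == 'nodes', df['qty'], df['qty'] / 16)

print(df['qty_cal'].tolist())  # [3.0, 4.0, 5.0, 0.375, 7.0, 0.625, 3.0]
```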
The statement ppn_df['unit_of_measure'] returns a Series (a whole column) with all the values in it, not a single item. One way to do this is with an apply or a map.
Try this
ppn_df.qty_cal = ppn_df.apply(lambda x: x['qty'] if x['unit_of_measure'] == 'nodes' else x['qty'] / 16, axis=1)
This function will execute the lambda function for each row in the series

Looping through iterrows and replacing null values with predicted values from Model

I am filling in some missing values (NaN) using a predicted value built with a KNN Regressor Model. Now, I'd like to input the predicted values as a new column in the original data frame, keeping the original values for those rows that weren't NaN. This will be a brand new column in my data frame which I'll use to build a feature.
I'm using iterrows to loop through the values to build a new column, but I'm getting an error. I've used 2 different ways to isolate the NaN values; however, I'm running into problems with each method.
sticker_price_preds = []
features = ['region_x', 'barrons', 'type_x', 'tier_x', 'iclevel_x',
            'exp_instr_pc_2013']
for index, row in data.iterrows():
    val = row['sticker_price_2013']
    if data[data['sticker_price_2013'].isnull()]:
        f = row['region_x', 'barrons', 'type_x', 'tier_x', 'iclevel_x',
                'exp_instr_pc_2013']
        val = knn.predict(f)
    sticker_price_preds.append(val)
data['sticker_price_preds'] = sticker_price_preds
AND
sticker_price_preds = []
features = ['region_x', 'barrons', 'type_x', 'tier_x', 'iclevel_x',
            'exp_instr_pc_2013']
for index, row in data.iterrows():
    val = row['sticker_price_2013']
    if not val:
        f = row['region_x', 'barrons', 'type_x', 'tier_x', 'iclevel_x',
                'exp_instr_pc_2013']
        val = knn.predict(f)
    sticker_price_preds.append(val)
data['sticker_price_preds'] = sticker_price_preds
I'm returning the following error message for the first method:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
For the second method, the NaN rows remain empty.
A little tough without data to try it out, but if you want a vectorized solution this might work: make a column that holds the knn.predict values, and then fill the rows that are np.nan from it.
data['predict'] = knn.predict(data[features])

data.loc[data['sticker_price_2013'].isna(), 'sticker_price_2013'] = data.loc[data['sticker_price_2013'].isna(), 'predict']
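The same mask-and-fill pattern on toy data, with a made-up predict column standing in for the KNN output (the real model and features aren't shown), writing to a new sticker_price_preds column as the question asked:

```python
import numpy as np
import pandas as pd

# Toy stand-in: 'predict' plays the role of the KNN model's output column
data = pd.DataFrame({'sticker_price_2013': [100.0, np.nan, 300.0, np.nan],
                     'predict': [110.0, 220.0, 330.0, 440.0]})

# Keep originals; fill only the NaN rows from the prediction column
mask = data['sticker_price_2013'].isna()
data['sticker_price_preds'] = data['sticker_price_2013']
data.loc[mask, 'sticker_price_preds'] = data.loc[mask, 'predict']

print(data['sticker_price_preds'].tolist())  # [100.0, 220.0, 300.0, 440.0]
```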

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()?

I'm a noob with pandas, and recently I got that 'ValueError' when trying to modify columns following some rules, as in:
csv_input = pd.read_csv(fn, error_bad_lines=False)
if csv_input['ip.src'] == '192.168.1.100':
    csv_input['flow_dir'] = 1
    csv_input['ip.src'] = 1
    csv_input['ip.dst'] = 0
else:
    if csv_input['ip.dst'] == '192.168.1.100':
        csv_input['flow_dir'] = 0
        csv_input['ip.src'] = 0
        csv_input['ip.dst'] = 1
I searched for this error and I guess it's caused by using the '==' comparison in the 'if' statement, but I don't know how to fix it.
Thanks!
So Andrew L's comment is correct, but I'm going to expand on it a bit for your benefit.
When you call, e.g.
csv_input['ip.dst'] == '192.168.1.100'
What this returns is a Series, with the same index as csv_input, but all the values in that series are boolean, and represent whether the value in csv_input['ip.dst'] for that row is equal to '192.168.1.100'.
So, when you call
if csv_input['ip.dst'] == '192.168.1.100':
You're asking whether that Series evaluates to True or False. Hopefully that explains what is meant by "The truth value of a Series is ambiguous.": it's a Series, so it can't be boiled down to a single boolean.
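To see this concretely, here is a small reproduction of the ambiguity:

```python
import pandas as pd

csv_input = pd.DataFrame({'ip.src': ['192.168.1.100', '10.0.0.5']})

comparison = csv_input['ip.src'] == '192.168.1.100'
print(comparison.tolist())  # [True, False] -- one boolean per row, not one boolean

try:
    if comparison:  # ambiguous: should [True, False] count as True or False?
        pass
except ValueError as err:
    print(err)      # "The truth value of a Series is ambiguous. ..."
```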
Now, what it looks like you're trying to do is set the values in the flow_dir, ip.src and ip.dst columns, based on the values in the ip.src and ip.dst columns.
The correct way to do this is would be with .loc[], something like this:
# equivalent to first if statement
csv_input.loc[
    csv_input['ip.src'] == '192.168.1.100',
    ['ip.src', 'ip.dst', 'flow_dir']
] = (1, 0, 1)
# equivalent to second if statement
csv_input.loc[
    csv_input['ip.dst'] == '192.168.1.100',
    ['ip.src', 'ip.dst', 'flow_dir']
] = (0, 1, 0)
