I am filling in missing values (NaN) using predictions from a KNN regressor model. Now I'd like to insert the predicted values as a new column in the original data frame, keeping the original values for the rows that weren't NaN. This will be a brand-new column in my data frame, which I'll use to build a feature.
I'm using iterrows to loop through the values and build the new column, but I'm getting an error. I've used two different ways to isolate the NaN values; however, I'm running into problems with each method.
sticker_price_preds = []
features = ['region_x', 'barrons', 'type_x', 'tier_x', 'iclevel_x',
            'exp_instr_pc_2013']

for index, row in data.iterrows():
    val = row['sticker_price_2013']
    if data[data['sticker_price_2013'].isnull()]:
        f = row['region_x', 'barrons', 'type_x', 'tier_x', 'iclevel_x',
                'exp_instr_pc_2013']
        val = knn.predict(f)
    sticker_price_preds.append(val)

data['sticker_price_preds'] = sticker_price_preds
AND
sticker_price_preds = []
features = ['region_x', 'barrons', 'type_x', 'tier_x', 'iclevel_x',
            'exp_instr_pc_2013']

for index, row in data.iterrows():
    val = row['sticker_price_2013']
    if not val:
        f = row['region_x', 'barrons', 'type_x', 'tier_x', 'iclevel_x',
                'exp_instr_pc_2013']
        val = knn.predict(f)
    sticker_price_preds.append(val)

data['sticker_price_preds'] = sticker_price_preds
The first method returns the following error message:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
With the second method, the NaN rows remain empty.
A little tough without data to try it out, but if you want a vectorized solution this might work: make a column that holds the knn.predict values, and then fill the np.nan rows from it.
data['predict'] = knn.predict(data[features])

data.loc[data['sticker_price_2013'].isna(), 'sticker_price_2013'] = data.loc[data['sticker_price_2013'].isna(), 'predict']
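As an aside, the second loop leaves the NaN rows unchanged because not float('nan') evaluates to False in Python; pd.isna(val) is the reliable per-value test. For completeness, here is a minimal sketch of the masked approach that keeps the original column intact and puts the combined values in the new column, assuming knn is an already-fitted regressor trained on the listed feature columns and that data contains them:

features = ['region_x', 'barrons', 'type_x', 'tier_x', 'iclevel_x',
            'exp_instr_pc_2013']

# Start the new column as a copy of the original values.
data['sticker_price_preds'] = data['sticker_price_2013']

# Boolean mask of the rows whose sticker price is missing.
mask = data['sticker_price_2013'].isna()

# Predict only for the missing rows and write the results back in place.
data.loc[mask, 'sticker_price_preds'] = knn.predict(data.loc[mask, features])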
I want to use a value from a specific column in my Pandas dataframe as the Y-axis label. The reason for this is that the label could change depending on the Unit of Measure (UoM) - it could be kg, number of bags etc.
#create function using plant and material input to chart planned and actual manufactured quantities
def filter_df(df, plant: str = "", material: str = ""):
    output_df = df.loc[(df['Plant'] == plant) & (df['Material'].str.contains(material))].reset_index()
    return output_df['Planned_Qty_Cumsum'].plot.area(label='Planned Quantity'),\
           output_df['Goods_Receipted_Qty_Cumsum'].plot.line(label='Delivered Quantity'),\
           plt.title('Planned and Delivered Quantities'),\
           plt.legend(),\
           plt.xlabel('Number of Process Orders'),\
           plt.ylabel(output_df['UoM (of GR)']),\
           plt.show()

#run function
filter_df(df_yield_data_formatted, '*plant*', '*material*')
When running the function I get the following error message:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Yes, you can, but the way you are doing it you are passing the whole column of the DataFrame, so you should indicate which row and column you want for the label. Use iloc, for instance, and it will work:
plt.ylabel(df.iloc[2,1])
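Inside the function above, since every row of the filtered frame should share the same unit, a reasonable variant (a sketch, assuming output_df is non-empty) is to take the first value of that column:

# All rows of the filtered frame are assumed to share one unit of measure,
# so the first value is enough for the axis label.
plt.ylabel(output_df['UoM (of GR)'].iloc[0])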
I have this resultsdf in pandas:
Antecedent Consequent confidence lift support
0 (3623,) (2568,) 0.829517 13.964925 0.0326
1 (4304,) (4305,) 0.808362 24.348264 0.0232
2 (3623, 3970) (2568,) 0.922581 15.531661 0.0286
and this dictionary df:
key name
0 1001 Boombox Ipod Classic
1 1002 USB Office Mirror Ball
I was trying to interpret Antecedent with the dictionary by adding:
resultsdf['Antecedent_name'] = resultsdf['Antecedent'].astype(str).map(df)
I'm getting this error:
The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
It seems to me that you are trying to pass a DataFrame to map; you should pass a dict.
Try this:
resultsdf['Antecedent_name'] = resultsdf['Antecedent'].astype(str).map(df.to_dict())
I'm not sure if the to_dict() default output is enough; you can check its orient parameter in the pandas documentation to change the output format.
Update:
Considering df1 your main df and df2 your key:name df, you can do something like this:
def check(x):
    lst = []
    for elem in x:
        if elem in df2.to_dict("list")["key"]:
            lst.append(df2.to_dict("list")["name"][df2.to_dict("list")["key"].index(elem)])
    return tuple(lst)

df1['Antecedent_name'] = df1['Antecedent'].map(check)
It's not that beautiful, but I think it works.
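A small refinement of the same idea (a sketch under the same assumptions): build the key-to-name dict once, rather than rebuilding it with to_dict() for every element:

# Build the lookup once; the version above rebuilds it per element.
key_to_name = dict(zip(df2['key'], df2['name']))

def check(x):
    # Keep the names of the elements that have a known key, as a tuple.
    return tuple(key_to_name[elem] for elem in x if elem in key_to_name)

df1['Antecedent_name'] = df1['Antecedent'].map(check)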
This is the solution I used:
resultsdf['Antecedent_name'] = resultsdf.Antecedent.map(lambda a: [df[int(id)] for id in a])
I am trying to create portfolios in dataframes depending on the variable 'scope', leaving the rows with the highest 33% of the scope values in the first portfolio, the middle 34% in the second, and the bottom 33% in the third, for each time period and industry.
So far, I have grouped the data on date and industry:
group_first = data_clean.groupby(['date','industry'])
and used a lambda function afterwards to get the rows of the first tercile of 'scope' for every date and industry; for instance:
port = group_first.apply(lambda x: x[x['scope'] <= x.scope.quantile(0.33)]).reset_index(drop=True)
This works for the first and third terciles, however not for the middle one, because I get
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
when putting two conditions in the lambda function, like this:
group_middle = data_clean.groupby(['date','industry'])
port_middle = group_middle.apply(lambda x: (x[x['scope'] > x.scope.quantile(0.67)]) and (x[x['scope'] < x.scope.quantile(0.33)])).reset_index(drop=True)
In other words, how can I get the rows of a dataframe containing the values in 'scope' between the 33rd and 67th percentile after grouping for date and industry?
Any idea how to solve this?
I will guess, since I don't have data to test it.
You have < and > the wrong way around: you check scope < 0.33 and scope > 0.67, which selects 0...33 and 67...100 (and may give empty data), but you need scope > 0.33 and scope < 0.67 to get 33...67.
You should also use a single mask, x[(scope > 0.33) & (scope < 0.67)], instead of combining x[scope > 0.33] and x[scope < 0.67] with and:
port_middle = group_middle.apply(
    lambda x: x[
        (x['scope'] > x.scope.quantile(0.33)) & (x['scope'] < x.scope.quantile(0.67))
    ]
).reset_index(drop=True)
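If you need all three portfolios anyway, another option (a sketch, assuming data_clean has the columns used above and enough distinct scope values per group for pd.qcut) is to label each row's tercile once and then filter:

import pandas as pd

# Label each row with its within-group tercile:
# 0 = bottom 33%, 1 = middle 34%, 2 = top 33%.
data_clean['tercile'] = (
    data_clean.groupby(['date', 'industry'])['scope']
    .transform(lambda s: pd.qcut(s, [0, 0.33, 0.67, 1.0], labels=False))
)

port_bottom = data_clean[data_clean['tercile'] == 0].reset_index(drop=True)
port_middle = data_clean[data_clean['tercile'] == 1].reset_index(drop=True)
port_top = data_clean[data_clean['tercile'] == 2].reset_index(drop=True)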
I have a dataframe test whose elements I would like to alter. Specifically, I want to change the values in the scale column, iterating over each row except for the one with the largest mPower value. I want each scale value to become the largest mPower value divided by the current row's mPower value. Below is a small example dataframe:
test = pd.DataFrame({'psnum':[0,1],'scale':[1,1],'mPower':[4.89842,5.67239]})
My code looks like this:
for index, row in test.iterrows():
    if(test['psnum'][row] != bigps):
        s = morepower/test['mPower'][row]
        test.at[:,'scale'][row] = round(s,2)
where bigps = 1 (i.e. the value of the psnum column with the largest mPower value) and morepower = 5.67239 (i.e. the largest value in the mPower column of the test dataframe).
When I run this code, I get the error: "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." I've tried a few different approaches now, most of which ended in errors or in nothing being changed in the dataframe at all.
So in the end, I need the test dataframe to look like this:
test = pd.DataFrame({'psnum':[0,1],'scale':[1.16,1],'mPower':[4.89842,5.67239]})
Any insight on this matter is greatly appreciated!
This is exactly the sort of vectorized operation pandas is so great for. Rather than looping at all, you can use a mathematical operation and broadcast it to the whole column at once.
test = pd.DataFrame({'psnum':[0,1],'scale':[1,1],'mPower':[4.89842,5.67239]})
test
psnum scale mPower
0 0 1 4.89842
1 1 1 5.67239
test['scale'] = test['scale'] * (test['mPower'].max() / test['mPower']).round(2)
test
psnum scale mPower
0 0 1.16 4.89842
1 1 1.00 5.67239
Please try this:
for index, row in test.iterrows():
    if test.iloc[index]['psnum'] != bigps:
        s = morepower / test.iloc[index]['mPower']
        test.at[index, 'scale'] = round(s, 2)
Don't use a loop; use a vectorized operation. In this case, I use Series.where:
test['scale'] = test['scale'].where(test.mPower == test.mPower.max(), test.mPower.max()/test.mPower)
Out[652]:
mPower psnum scale
0 4.89842 0 1.158004
1 5.67239 1 1.000000
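For completeness, the same idea with a boolean mask and .loc, which also applies the rounding from the question (a sketch over the same test frame):

# Rescale every row except the one holding the largest mPower.
mask = test['mPower'] != test['mPower'].max()
test.loc[mask, 'scale'] = (test['mPower'].max() / test.loc[mask, 'mPower']).round(2)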
I am researching/backtesting a trading system.
I have a Pandas dataframe containing OHLC data and have added several calculated columns which identify price patterns that I will use as signals to initiate positions.
I would now like to add a further column that will keep track of the current net position. I have tried using df.apply(), but passing the dataframe itself as the argument instead of the row object, as with the latter I seem to be unable to look back at previous rows to determine whether they resulted in any price patterns:
open_campaigns = []
Campaign = namedtuple('Campaign', 'open position stop')

def calc_position(df):
    # sum of current positions + any new positions
    if entered_long(df):
        open_campaigns.add(
            Campaign(
                calc_long_open(df.High.shift(1)),
                calc_position_size(df),
                calc_long_isl(df)
            )
        )
    return sum(campaign.position for campaign in open_campaigns)

def entered_long(df):
    return buy_pattern(df) & (df.High > df.High.shift(1))

df["Position"] = df.apply(lambda row: calc_position(df), axis=1)
However, this returns the following error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', u'occurred at index 1997-07-16 08:00:00')
Rolling window functions would seem to be the natural fit, but as I understand it they only act on a single time series or column, so they wouldn't work either, as I need to access the values of multiple columns at multiple timepoints.
How should I in fact be doing this?
This problem has its roots in NumPy.
def entered_long(df):
    return buy_pattern(df) & (df.High > df.High.shift(1))
entered_long is returning an array-like object. NumPy refuses to guess if an array is True or False:
In [48]: x = np.array([ True, True, True], dtype=bool)
In [49]: bool(x)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
To fix this, use any or all to specify what you mean for an array to be True:
def calc_position(df):
    # sum of current positions + any new positions
    if entered_long(df).any():  # or .all()
The any() method will return True if any of the items in entered_long(df) are True.
The all() method will return True if all the items in entered_long(df) are True.
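A minimal self-contained demonstration of the difference (hypothetical values, just to make the behavior concrete):

import numpy as np

x = np.array([True, False, True])

print(x.any())  # True: at least one element is True
print(x.all())  # False: not every element is True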