Assign Variables Based on Dataframe Values - python

I have a dataframe in which I'm trying to define variables from the values of particular cells, so that I can populate a (currently) empty final column based on the relationship between each company's price targets and its current price. The dataframe I'm working with looks like this, with the companies as the index:
Company      Current Price   High      Median    Low       Suggest
Company 1    $296.12         $410.00   $398.00   $365.43
Company 2    $143.18         $212.05   $200.34   $155.12
Company 3    $184.23         $214.09   $192.88   $123.63
How would I assign variables (for example: target_high(company) = value in the “ticker, target_high” cell)? I don't think I can hard-code it because the list of companies will be constantly changing. So far I’ve tried the following, but it doesn’t seem to work:
for ticker in Company_List:
    target_high(company) = str(Target_Frame.loc[ticker, "High"])
    target_mid(company) = str(Target_Frame.loc[ticker, "Median"])
    target_low(company) = str(Target_Frame.loc[ticker, "Low"])
    current_price = str(Target_Frame.loc[ticker, "Price"])
    if current_price(ticker) > target_high(ticker):
        Target_Frame.loc[[ticker], ['Suggest']] = "Sell"
    elif current_price(ticker) < target_low(ticker):
        Target_Frame.loc[[ticker], ['Suggest']] = "Buy"
    elif target_mid(ticker) < current_price(ticker) < target_high(ticker):
        Target_Frame.loc[[ticker], ['Suggest']] = "Hold"
    elif target_low(ticker) < current_price(ticker) < target_mid(ticker):
        Target_Frame.loc[[ticker], ['Suggest']] = "Consider"
Thank you!

Alternatively, you could use np.where or .map (see this question) with the conditions inline, rather than creating separate variables. So maybe something like this:
Target_Frame["Suggest"] = np.where(
Target_Frame["Price"] > Target_Frame["High"], "Sell", # if above high then sell
np.where(Target_Frame["Price"] < Target_Frame["Low"], "Buy", # if below low then buy
np.where(Target_Frame["Price"].between(
Target_Frame["Median"], Target_Frame["High"]), "Hold", # if between median and high then hold
"Consider"))) # else consider


Efficient way to loop through GroupBy DataFrame

Since my last post lacked information:
Example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the record was created.
deviceID   mileage   position_timestamp_measure
54672      10        1600696079
43423      20        1600696079
42342       3        1600701501
54672       3        1600702102
43423       2        1600702701
My goal is to validate the mileage by calculating the vehicle's speed from the timestamps and the mileage and comparing it to the vehicle's maximum speed (which is 80 km/h). The result should then be written to the original dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0

for group_name, group in df:
    #sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()

    #since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1

    #iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        #calculate speed
        if (group.mileage.values[i]/timeHours) < maxSpeedKMH:
            df_ori.loc[dataset.index == group.dataIndex.values[i], 'validPosition'] = 1

dataset.validPosition.value_counts()
It definitely works the way I want it to, but performance is poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
#create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])

# Subtract the preceding timestamp from the current one, per device
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].diff()

# The operation above produces NaN for the first row of each group;
# fill 'valid' with 1 there, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1

df_ori['timeHours'] = df_ori['timeGoneSec'] / 3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1

# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops, especially row-by-row iteration, which can be insanely slow.
Since I don't have the full context of your code, please double-check the logic and make sure it works as desired.
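For reference, here is a minimal sketch of what the grouped diff produces, run on the toy rows from the question (column names taken from the code above, and maxSpeedKMH assumed to be 80):

import pandas as pd

maxSpeedKMH = 80  # assumed from the question (80 km/h max speed)

toy = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079, 1600701501, 1600702102, 1600702701],
})
toy = toy.sort_values('position_timestamp_measure')

# seconds since the previous message of the *same* device (NaN for its first message)
toy['timeGoneSec'] = toy.groupby('device_id')['position_timestamp_measure'].diff()
toy['valid'] = ((toy['mileage'] / (toy['timeGoneSec'] / 3600)) <= maxSpeedKMH) | toy['timeGoneSec'].isna()
print(toy)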

Faster loop in Pandas looking for ID and older date

So, I have a DataFrame that represents purchases with 4 columns:
date (date of purchase in format %Y-%m-%d)
customer_ID (string column)
claim (1-0 column: 1 means the customer complained about the purchase, 0 means they didn't)
claim_value (for claim = 1, how much the claim cost the company; for claim = 0 it's NaN)
I need to build 3 new columns:
past_purchases (how many purchases the specific customer had before this purchase)
past_claims (how many claims the specific customer had before this purchase)
past_claims_value (how much did the customer's past claims cost)
This has been my approach until now:
past_purchases = []
past_claims = []
past_claims_value = []

for i in range(0, len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]
    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())

df['past_purchases'] = pd.DataFrame(past_purchases)
df['past_claims'] = pd.DataFrame(past_claims)
df['past_claims_value'] = pd.DataFrame(past_claims_value)
The code works fine, but it's too slow. Can anyone make it work faster? Thanks!
PS: It's important to check that the date is strictly older; if the customer had 2 purchases on the same date, they shouldn't count for each other.
PPS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but I've never used them for something like this before.
Expected outcome:
Maybe you can try using a cumsum over customers, if the dates are sorted in ascending order:
df.sort_values('date', inplace=True)
new_temp_columns = ['claim_s', 'claim_value_s']
# shift within each customer so the current purchase doesn't count itself
df[new_temp_columns] = df.groupby('customer_ID')[['claim', 'claim_value']].shift()
df['past_claims'] = df.groupby('customer_ID')['claim_s'].transform(pd.Series.cumsum)
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].transform(pd.Series.cumsum)
# take the min value per (customer, date) so purchases on the same date don't count for each other
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform('min')
# Remove temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]
Again, this will only work if the dates are sorted.
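The question also asks for past_purchases; under the same assumption (data sorted by date), a minimal sketch would be a grouped cumulative count, again collapsed to the per-(customer, date) minimum so same-day purchases don't count for each other:

# cumcount is 0-based, so each row gets the number of earlier rows for the same
# customer; the per-(customer, date) minimum then excludes same-day purchases
df['past_purchases'] = df.groupby('customer_ID').cumcount()
df['past_purchases'] = df.groupby(['customer_ID', 'date'])['past_purchases'].transform('min')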

Conditional operations in dataframe (if else)

I have a dataframe with a column called Install_Date. I want to assign values to another column, age, under two conditions: if the value in Install_Date is null, then age = current year - plant construct year; if the value is not null, then age = current year - INPUT_Asset["Install_Date"].
This is the code I have. The first condition works fine, but the second condition still gives 0 as the value:
Plant_Construct_Year = 1975
this_year = 2020

for i in INPUT_Asset["Install_Date"]:
    if i != 0.0:
        INPUT_Asset["Asset_Age"] = this_year - INPUT_Asset["Install_Date"]
    else:
        INPUT_Asset["Asset_Age"] = this_year - Plant_Construct_Year

INPUT_Asset["Install_Date"] = pd.to_numeric(INPUT_Asset["Install_Date"], errors='coerce').fillna(0)
INPUT_Asset["Asset_Age"] = np.where(INPUT_Asset["Install_Date"] == 0.0, this_year - Plant_Construct_Year, INPUT_Asset["Asset_Age"])
INPUT_Asset["Asset_Age"] = np.where(INPUT_Asset["Install_Date"] != 0.0, this_year - INPUT_Asset["Install_Date"], INPUT_Asset["Asset_Age"])
print(INPUT_Asset["Asset_Age"])

Similar functions in Python don't produce same result

I'm having an issue with two functions I have defined in Python. Both functions perform similar operations in the first few lines of the function body, yet one runs and the other produces a KeyError. I will explain more below, but here are the two functions first.
#define function that looks at the number of claims that have a decider id that was dealer
#normalize by business amount
def decider(df):
    #subset dataframe by date
    df_sub = df[(df['vehicle_repair_date'] >= Q1_sd) & (df['vehicle_repair_date'] <= Q1_ed)]
    #get the dealer id
    did = df_sub['dealer_id'].unique()
    #subset data further by selecting only records where 'dealer_decide' equals 1
    df_dealer_decide = df_sub[df_sub['dealer_decide'] == 1]
    #count the number of unique warranty claims
    dealer_decide_count = df_dealer_decide['warranty_claim_number'].nunique()
    #get the total sales amount for that dealer
    total_sales = float(df_sub['amount'].max())
    #get the number of warranty claims decided by dealer per $100k in dealer sales
    decider_count_phk = dealer_decide_count * (100000/total_sales)
    #create a dictionary to store results
    output_dict = dict()
    output_dict['decider_phk'] = decider_count_phk
    output_dict['dealer_id'] = did
    output_dict['total_claims_dealer_dec_Q1_2019'] = dealer_decide_count
    output_dict['total_sales2019'] = total_sales
    #convert resultant dictionary to dataframe
    sum_df = pd.DataFrame.from_dict(output_dict)
    #return the summarized dataframe
    return sum_df

#apply the 'decider' function to each dealer in dataframe 'data'
decider_count = data.groupby('dealer_id').apply(decider)
#define a function that looks at the percentage change between 2018Q4 and 2019Q1
#in terms of the number of claims processed
def turnover(df):
    #subset dealer records for Q1
    df_subQ1 = df[(df['vehicle_repair_date'] >= Q1_sd) & (df['vehicle_repair_date'] <= Q1_ed)]
    #subset dealer records for Q4
    df_subQ4 = df[(df['vehicle_repair_date'] >= Q4_sd) & (df['vehicle_repair_date'] <= Q4_ed)]
    #get the dealer id
    did = df_subQ1['dealer_id'].unique()
    #get the unique number of claims for Q1
    unique_Q1 = df_subQ1['warranty_claim_number'].nunique()
    #get the unique number of claims for Q4
    unique_Q4 = df_subQ4['warranty_claim_number'].nunique()
    #determine percent change from Q4 to Q1
    percent_change = round((1 - (unique_Q1/unique_Q4))*100, ndigits=1)
    #create a dictionary to store results
    output_dict = dict()
    output_dict['nclaims_Q1_2019'] = unique_Q1
    output_dict['nclaims_Q4_2018'] = unique_Q4
    output_dict['dealer_id'] = did
    output_dict['quarterly_pct_change'] = percent_change
    #convert resultant dictionary to dataframe and return it
    sum_df = pd.DataFrame.from_dict(output_dict)
    return sum_df

#apply the 'turnover' function to each dealer in 'data' dataframe
dealer_turnover = data.groupby('dealer_id').apply(turnover)
Each function is applied to the exact same dataset, and I am obtaining the dealer id (the variable did in the function body) in the same way. I am also using the same groupby-then-apply code, but when I run the two functions, decider runs as expected while turnover gives the following error:
KeyError: 'dealer_id'.
At first I thought it might be a scoping issue, but that doesn't really make sense so if anyone can shed some light on what might be happening I would greatly appreciate it.
Thanks,
Curtis
IIUC, you are applying the turnover function after the decider function. You are getting the KeyError because dealer_id is present as the index and not as a column.
Try replacing
decider_count = data.groupby('dealer_id').apply(decider)
with
decider_count = data.groupby('dealer_id', as_index=False).apply(decider)
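If the grouping key is all the function needs, another option is to read it from the group's .name attribute, which pandas sets to the group key during groupby(...).apply(). A minimal sketch on a toy frame (the column names here are made up for illustration):

import pandas as pd

toy = pd.DataFrame({'dealer_id': ['A', 'A', 'B'], 'claims': [1, 2, 3]})

def summarise(group):
    # group.name holds the dealer_id this group was built from, so the function
    # does not depend on a 'dealer_id' column being present in the group
    return pd.Series({'dealer_id': group.name, 'n_rows': len(group)})

print(toy.groupby('dealer_id').apply(summarise))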

turning a for loop into a dataframe.apply problem

This is my first ever question on here, so please forgive me if I don't explain it clearly, or overexplain. The task is to turn a for loop that contains 2 if statements into dataframe.apply instead of the loop. I thought the way to do it was to turn the if statements inside the for loop into a defined function and then call that function in the .apply line, but I can only get so far. I'm not even sure I'm tackling this the right way. I can provide the original for-loop code if necessary. Thanks in advance.
The goal is to import a csv of stock prices, compare the prices in one column to a moving average (which needs to be created), and if the price > MA, buy; if < MA, sell. Keep track of all buys/sells and determine overall wealth/return at the end. It worked as a for loop: for each x in prices, use the 2 ifs and append prices to a list to determine ending wealth. I think I get to the point where I call the defined function in the .apply line, and it errors out. In my code below there may still be some unnecessary code lingering from the for-loop version, but it shouldn't interfere with the .apply attempt; it just makes for messy code until I figure it out.
df2 = pd.read_csv("MSFT.csv", index_col=0, parse_dates=True).sort_index(axis=0 ,ascending=True) #could get yahoo to work but not quandl, so imported the csv file from class
buyPrice = 0
sellPrice = 0
maWealth = 1.0
cash = 1
stock = 0
sma = 200
ma = np.round(df2['AdjClose'].rolling(window=sma, center=False).mean(), 2) #to create the moving average to compare to
n_days = len(df2['AdjClose'])
closePrices = df2['AdjClose'] #to only work with one column from original csv import
buy_data = []
sell_data = []
trade_price = []
wealth = []
def myiffunc(adjclose):
if closePrices > ma and cash == 1: # Buy if stock price > MA & if not bought yet
buyPrice = closePrices[0+ 1]
buy_data.append(buyPrice)
trade_price.append(buyPrice)
cash = 0
stock = 1
if closePrices < ma and stock == 1: # Sell if stock price < MA and if you have a stock to sell
sellPrice = closePrices[0+ 1]
sell_data.append(sellPrice)
trade_price.append(sellPrice)
cash = 1
stock = 0
wealth.append(1*(sellPrice / buyPrice))
closePrices.apply(myiffunc)
Checking the docs for apply, it seems like you need to use the axis=1 version to process one row at a time, and work with two columns: the moving average and the closing price.
Something like this:
df2 = ...
df2['MovingAverage'] = ...

have_shares = False

def my_func(row):
    global have_shares
    if not have_shares and row['AdjClose'] > row['MovingAverage']:
        # buy shares
        have_shares = True
    elif have_shares and row['AdjClose'] < row['MovingAverage']:
        # sell shares
        have_shares = False
However, it's worth pointing out that you can do the comparisons using numpy/pandas as well, just storing the results in another column:
df2['BuySignal'] = (df2.AdjClose > df2.MovingAverage)
df2['SellSignal'] = (df2.AdjClose < df2.MovingAverage)
Then you could .apply() a function that made use of the Buy/Sell signal columns.
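For example, a rough sketch of that follow-up step (the function and list names here are hypothetical; it assumes df2 already has the AdjClose, MovingAverage, BuySignal and SellSignal columns from above):

trades = []          # hypothetical list collecting (action, price) pairs
have_shares = False

def act_on_signals(row):
    global have_shares
    if row['BuySignal'] and not have_shares:
        trades.append(('buy', row['AdjClose']))
        have_shares = True
    elif row['SellSignal'] and have_shares:
        trades.append(('sell', row['AdjClose']))
        have_shares = False

df2.apply(act_on_signals, axis=1)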
